# Using Ansible + Terraform for Repeatable Infrastructure Experiments

When running experiments that collect benchmarking data to objectively compare different solutions under the same general circumstances, automating the workflow and defining as much of the setup "as code" as possible helps tremendously. This blog post explains how I have used Ansible, Terraform, kube-burner, and the kube-prometheus stack to compare the performance of Kubernetes cluster autoscalers across different metrics.

## Setup Overview

The goal is that every scenario follows the same infrastructure baseline, is repeatable, and, once triggered, runs independently from start to finish without any manual intervention. This consistency is critical for obtaining reliable and comparable performance metrics when evaluating different autoscaling technologies such as Cluster Autoscaler (CAS) and Karpenter.

## Experiment Structure

The experiment workflow follows a systematic approach that can be broken down into several key phases:

1. **Infrastructure Provisioning**: Deploy Kubernetes clusters with Terraform
   - Create identical IKS (IBM Kubernetes Service) or EKS (Amazon EKS) clusters for testing
   - Configure appropriate worker node settings and networking
2. **Monitoring Setup**: Deploy the monitoring stack
   - Install Prometheus and Grafana for metrics collection
   - Configure custom metrics exporters for cloud resource usage
3. **Autoscaler Deployment**: Deploy either CAS or Karpenter
   - Configure each autoscaler with comparable settings
   - Ensure proper RBAC permissions and resource configurations
4. **Resource Monitoring**: Track infrastructure usage
   - Run a custom log exporter to monitor cloud resources
   - Collect baseline metrics before tests begin
5. **Load Testing**: Generate controlled workload
   - Use kube-burner to create predictable, repeatable load
   - Apply identical workload patterns to both clusters
6. **Data Collection**: Gather performance metrics
   - Export metrics from Prometheus
   - Collect cloud billing data for cost analysis
   - Store all data in a structured format for analysis
7. **Cleanup**: Decommission clusters
   - Terminate all resources to prevent ongoing costs
   - Archive experiment data for later analysis

This structured approach ensures that each experiment follows the same pattern, making results comparable across different scenarios and autoscaling technologies.

## Ansible Playbook Structure

The project leverages Ansible as the orchestration tool to coordinate the entire experiment workflow. The Ansible playbook is structured with distinct roles that handle specific aspects of the experiment:

### Project Organization

```
experiment-automation/
├── ansible/
│   ├── playbook.yml            # Main playbook that orchestrates the entire process
│   ├── files/
│   │   ├── kubeconfig-*/       # Generated Kubernetes config files
│   │   └── scripts/            # Helper scripts for data collection
│   └── roles/
│       ├── cluster_bootstrap/  # Provisions clusters using Terraform
│       ├── cluster_setup/      # Configures clusters with monitoring and autoscalers
│       ├── benchmarking/       # Runs the load tests and collects metrics
│       └── cluster_teardown/   # Cleans up resources after tests
├── terraform-eks/              # Terraform files for AWS EKS clusters
└── terraform-iks/              # Terraform files for IBM IKS clusters
    ├── cas.tf                  # Defines the CAS cluster configuration
    ├── karpenter.tf            # Defines the Karpenter cluster configuration
    └── variables.tf            # Common variables for both cluster types
```

### Integration Points

The system integrates several technologies:

1. **Terraform and Ansible Integration**: Ansible uses the `community.general.terraform` module to execute Terraform configurations, passing variables like API keys, node counts, and unique tags for resource tracking.
2. **Kubernetes API Integration**: The `kubernetes.core.k8s` Ansible module is used to deploy workloads and configure resources directly in the clusters.
3. **Cloud Provider APIs**: Custom Python scripts fetch billing and usage data from the IBM Cloud and AWS APIs, using tags to identify resources related to specific experiments.
4. **Metrics Collection**: Prometheus queries are executed via the monitoring stack to gather performance data during and after the tests.

### Custom Configuration

The experiment framework supports customization through:

1. **Environment Variables**: Control which cloud provider to test (AWS, IBM Cloud, or both)

   ```
   LOAD_TEST_TARGET=IBM_CLOUD,AWS   # Run tests on both platforms
   WORKLOAD_SCENARIO=heterogeneous  # Select workload pattern
   ```

2. **Workload Scenarios**: Define different test patterns
   - `homogeneous`: Deploy identical CPU-intensive workloads
   - `heterogeneous`: Deploy a mix of CPU- and memory-intensive workloads to test autoscaler efficiency with diverse requirements
3. **Unique Experiment Identifiers**: Each experiment run generates a UUID that tags all cloud resources, making them easily identifiable for billing analysis and cleanup.

## Running Different Scenarios

To run an experiment, you simply execute the Ansible playbook with the appropriate tags to control which phases of the experiment to run:

```bash
# Run a complete experiment on IBM Cloud with homogeneous workloads
ansible-playbook playbook.yml -e "LOAD_TEST_TARGET=IBM_CLOUD WORKLOAD_SCENARIO=homogeneous" --tags "setup,config,benchmark,teardown"

# Run only the benchmark phase on AWS
ansible-playbook playbook.yml -e "LOAD_TEST_TARGET=AWS WORKLOAD_SCENARIO=heterogeneous" --tags "benchmark"

# Run a complete experiment on both cloud providers
ansible-playbook playbook.yml -e "LOAD_TEST_TARGET=IBM_CLOUD,AWS WORKLOAD_SCENARIO=homogeneous" --tags "all"
```

The playbook includes error handling with a rescue block, ensuring that clusters are properly decommissioned even if an experiment fails, preventing unexpected cloud costs.
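The error-handling pattern described above can be sketched as an Ansible `block`/`rescue`/`always` structure. This is a minimal illustration, not the project's actual playbook: the role names follow the project layout shown earlier, while the task breakdown is an assumption.

```yaml
# Hypothetical sketch of playbook.yml's error handling; role names are taken
# from the project layout above, the exact task structure is assumed.
- hosts: localhost
  tasks:
    - block:
        - ansible.builtin.include_role:
            name: cluster_bootstrap
        - ansible.builtin.include_role:
            name: cluster_setup
        - ansible.builtin.include_role:
            name: benchmarking
      rescue:
        - ansible.builtin.debug:
            msg: "Experiment failed -- proceeding to teardown to avoid ongoing costs"
      always:
        # Teardown runs whether the experiment succeeded or failed
        - ansible.builtin.include_role:
            name: cluster_teardown
```

The `always` section is what guarantees cleanup: even when a task inside `block` fails and control passes through `rescue`, the teardown role still executes.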
## Key Implementation Details

### Terraform Configuration

The Terraform files define the infrastructure with specific configurations for each autoscaler. For example, the CAS cluster configuration includes:

```terraform
resource "ibm_container_vpc_cluster" "cas_cluster" {
  name              = "cas-iks"
  vpc_id            = var.vpc_id
  flavor            = var.flavour  # e.g., "bx2.2x8" with 2 vCPUs and 8GB RAM
  kube_version      = var.kube_version
  worker_count      = 1            # Start with a minimal node count
  resource_group_id = var.resource_group

  dynamic "zones" {
    for_each = var.zones
    content {
      subnet_id = zones.value["subnet_id"]
      name      = zones.value["name"]
    }
  }

  tags = ["cas-${var.tag_uuid}"]  # Tag for resource tracking
}
```

### Workload Deployment

The benchmarking role deploys standardized workloads to test scaling behavior:

```yaml
- name: Deploy homogeneous workload to clusters
  kubernetes.core.k8s:
    definition:
      apiVersion: apps/v1
      kind: Deployment
      metadata:
        name: "homogeneous-workload-{{ item | lower }}"
        namespace: "{{ homogeneous_workload.namespace }}"
      spec:
        replicas: 40  # High replica count to trigger scaling
        selector:
          matchLabels:
            app: "homogeneous-workload-{{ item | lower }}"
        template:
          metadata:
            labels:
              app: "homogeneous-workload-{{ item | lower }}"
          spec:
            containers:
              - name: workload-container
                image: "{{ homogeneous_workload.container.image }}"
                resources:
                  requests:
                    cpu: "{{ homogeneous_workload.container.resources.cpu }}"
                    memory: "{{ homogeneous_workload.container.resources.memory_request }}"
                  limits:
                    cpu: "{{ homogeneous_workload.container.resources.cpu }}"
                    memory: "{{ homogeneous_workload.container.resources.memory_limit }}"
                command: ["stress"]
                args: "{{ homogeneous_workload.container.args }}"
```

## Data Analysis Approach

After running the experiments, the collected data is analyzed to answer the key research questions:

1. **Scaling Speed Comparison**: How quickly does each autoscaler respond to increased workload demands?
2. **Resource Efficiency**: How effectively does each autoscaler manage CPU and memory resources?
3. **Cost Analysis**: What are the operational cost differences between the autoscaling approaches?

The data analysis is performed using Jupyter notebooks that process the JSON metrics files and create visualizations to highlight the performance differences.

## Challenges and Lessons Learned

Implementing this experiment framework revealed several challenges:

1. **Cloud Provider Differences**: IBM Cloud and AWS have different APIs, resource models, and provisioning speeds, requiring careful normalization of results.
2. **Consistent Timing**: Ensuring that experiments run for consistent durations and that metrics are collected at precisely the right times required careful orchestration.
3. **Error Handling**: Cloud resources sometimes fail to provision or configure correctly, requiring robust error handling and retry mechanisms.
4. **Cost Control**: Accidentally leaving resources running can quickly lead to unexpected costs, making the automatic cleanup functionality crucial.

## Next Steps

Future work on this project includes:

1. **Enhanced Data Analysis**: Using RHOAI (Red Hat OpenShift AI) and Jupyter notebooks to perform deeper statistical analysis of the collected metrics.
2. **Additional Cloud Providers**: Extending the framework to support other Kubernetes platforms such as Google GKE or Azure AKS.
3. **More Complex Workloads**: Testing with real-world application patterns beyond the synthetic CPU and memory loads.
4. **Autoscaling Policy Optimization**: Developing and testing more sophisticated autoscaling policies based on the findings from the experiments.

By using this infrastructure-as-code approach to experimentation, we've been able to gather objective, repeatable measurements of autoscaler performance across different cloud platforms, providing valuable insights for organizations looking to optimize their Kubernetes infrastructure costs and performance.