AWS Dynamic Infrastructure Complete Guide: Auto Scaling, Spot Instances, and IaC for 90% Cost Savings

1. Why Dynamic Infrastructure? (Static vs Dynamic)

The Fundamental Problem with Fixed Servers

Many companies still provision servers based on peak traffic. If traffic increases 10x during Black Friday, they run 10x the servers year-round. This means wasting money more than 80% of the time.

Let us look at real-world examples:

  • E-commerce Company A: Traffic spikes 5x only during peak hours (8-10 PM), so servers sized for the spike sit mostly idle the other 22 hours
  • Media Company B: Weekend traffic is 3x weekdays, so weekday servers sized for weekend load run ~66% idle
  • SaaS Company C: Batch processing surges at month-end, so capacity is over-provisioned for the other 27 days of the month
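
The waste in these patterns is simple arithmetic: when capacity is sized for peak, utilization is average load divided by peak load. A small sketch with illustrative numbers taken from Company A above (`idle_waste` is a hypothetical helper, not an AWS API):

```python
def idle_waste(peak_multiple: float, peak_hours: float, total_hours: float) -> float:
    """Fraction of provisioned capacity wasted when sized for peak.

    Assumes load is peak_multiple x baseline during peak hours and
    1x baseline the rest of the time (a simplification of the examples above).
    """
    provisioned = peak_multiple * total_hours                       # capacity-hours paid for
    used = peak_multiple * peak_hours + (total_hours - peak_hours)  # capacity-hours actually used
    return 1 - used / provisioned

# E-commerce Company A: 5x spike for 2 of 24 hours
print(f"{idle_waste(5, 2, 24):.0%}")  # -> 73%
```

Roughly three-quarters of what Company A pays for sits idle; that gap is exactly what dynamic infrastructure closes.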

Static infrastructure in these patterns causes the following problems:

| Problem | Impact | Waste Rate |
| --- | --- | --- |
| Peak-based provisioning | Excess resources most of the time | 60-80% |
| Manual scaling | Delayed response to traffic spikes | Downtime risk |
| Difficult infrastructure changes | Delayed feature deployment | Opportunity cost |
| Single points of failure | Service outage on server failure | Revenue loss |

Core Values of Dynamic Infrastructure

Dynamic infrastructure means an architecture where resources automatically scale up and down based on workload.

Three core principles:

  1. Elasticity: Scale out automatically as traffic rises, scale in as it falls
  2. Cost Optimization: Pay only for what you use
  3. High Availability: Automatic recovery on failure

The Three Pillars of AWS Cost Optimization

Cost Optimization = Right-sizing + Auto Scaling + Spot Instances
  • Right-sizing: Choosing the appropriate instance type for your workload (t3.medium might be sufficient instead of m5.xlarge)
  • Auto Scaling: Automatically adjusting instance count based on traffic
  • Spot Instances: Using spare capacity at up to 90% discount compared to On-Demand

Combining these three can reduce monthly cloud costs by 60-90%.
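
Because each lever cuts the cost left over by the previous one, the savings compound multiplicatively rather than add. A hedged sketch with illustrative percentages (the function name is mine, not an AWS API):

```python
def combined_savings(rightsizing: float, autoscaling: float, spot: float) -> float:
    """Total savings when each lever reduces the cost remaining
    after the previous one (all fractions in [0, 1])."""
    remaining = (1 - rightsizing) * (1 - autoscaling) * (1 - spot)
    return 1 - remaining

# Illustrative: 30% from right-sizing, 40% from scaling in off-peak,
# a 70% Spot discount on what remains
print(f"{combined_savings(0.30, 0.40, 0.70):.0%}")  # -> 87%
```

Even with conservative inputs the combined figure lands in the 60-90% range quoted above.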


2. Mastering EC2 Auto Scaling

Core Components of ASG

An Auto Scaling Group (ASG) consists of three core components:

  1. Launch Template: Defines what instances to create (AMI, instance type, security groups, key pair)
  2. Scaling Policy: Rules defining when and how to scale
  3. Health Check: Monitors instance health and replaces unhealthy instances

Writing a Launch Template

Here is an example of creating a Launch Template with AWS CLI:

{
  "LaunchTemplateName": "web-server-template",
  "LaunchTemplateData": {
    "ImageId": "ami-0abcdef1234567890",
    "InstanceType": "t3.medium",
    "KeyName": "my-key-pair",
    "SecurityGroupIds": ["sg-0123456789abcdef0"],
    "UserData": "IyEvYmluL2Jhc2gKeXVtIHVwZGF0ZSAteQp5dW0gaW5zdGFsbCAteSBodHRwZA==",
    "TagSpecifications": [
      {
        "ResourceType": "instance",
        "Tags": [
          {
            "Key": "Environment",
            "Value": "production"
          },
          {
            "Key": "Project",
            "Value": "web-app"
          }
        ]
      }
    ],
    "BlockDeviceMappings": [
      {
        "DeviceName": "/dev/xvda",
        "Ebs": {
          "VolumeSize": 30,
          "VolumeType": "gp3",
          "Encrypted": true
        }
      }
    ]
  }
}

The same Launch Template defined in Terraform:

resource "aws_launch_template" "web_server" {
  name_prefix   = "web-server-"
  image_id      = data.aws_ami.amazon_linux_2.id
  instance_type = "t3.medium"
  key_name      = "my-key-pair"

  vpc_security_group_ids = [aws_security_group.web.id]

  user_data = base64encode(<<-EOF
    #!/bin/bash
    yum update -y
    yum install -y httpd
    systemctl start httpd
    systemctl enable httpd
    echo "Hello from $(hostname)" > /var/www/html/index.html
  EOF
  )

  block_device_mappings {
    device_name = "/dev/xvda"
    ebs {
      volume_size = 30
      volume_type = "gp3"
      encrypted   = true
    }
  }

  tag_specifications {
    resource_type = "instance"
    tags = {
      Environment = "production"
      Project     = "web-app"
    }
  }

  lifecycle {
    create_before_destroy = true
  }
}

Four Scaling Policy Strategies

2-1. Simple Scaling

The most basic approach that links a single CloudWatch alarm to a single adjustment action.

resource "aws_autoscaling_policy" "scale_up" {
  name                   = "scale-up"
  autoscaling_group_name = aws_autoscaling_group.web.name
  adjustment_type        = "ChangeInCapacity"
  scaling_adjustment     = 2
  cooldown               = 300
}

resource "aws_cloudwatch_metric_alarm" "high_cpu" {
  alarm_name          = "high-cpu-alarm"
  comparison_operator = "GreaterThanThreshold"
  evaluation_periods  = 2
  metric_name         = "CPUUtilization"
  namespace           = "AWS/EC2"
  period              = 120
  statistic           = "Average"
  threshold           = 80

  dimensions = {
    AutoScalingGroupName = aws_autoscaling_group.web.name
  }

  alarm_actions = [aws_autoscaling_policy.scale_up.arn]
}

Limitation: No additional scaling during the cooldown period after a scaling action, making it slow to respond to sudden traffic surges.

2-2. Step Scaling

Performs different-sized adjustments based on alarm threshold ranges.

resource "aws_autoscaling_policy" "step_scaling" {
  name                   = "step-scaling-policy"
  autoscaling_group_name = aws_autoscaling_group.web.name
  policy_type            = "StepScaling"
  adjustment_type        = "ChangeInCapacity"

  step_adjustment {
    scaling_adjustment          = 1
    metric_interval_lower_bound = 0
    metric_interval_upper_bound = 20
  }

  step_adjustment {
    scaling_adjustment          = 3
    metric_interval_lower_bound = 20
    metric_interval_upper_bound = 40
  }

  step_adjustment {
    scaling_adjustment          = 5
    metric_interval_lower_bound = 40
  }
}

This policy adds 1, 3, or 5 instances depending on how far CPU exceeds the threshold.
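
The selection logic the policy encodes can be sketched in a few lines of Python (bounds are offsets above the alarm threshold, e.g. CPU 80%; this mirrors the Terraform above, it is not how AWS evaluates it internally):

```python
def step_adjustment(breach: float) -> int:
    """Instances to add for a given metric breach above the threshold,
    mirroring the three step_adjustment blocks above."""
    steps = [(0, 20, 1), (20, 40, 3), (40, float("inf"), 5)]
    for lower, upper, adjustment in steps:
        if lower <= breach < upper:
            return adjustment
    return 0  # at or below threshold: no scale-out

# CPU at 95% against an 80% threshold is a breach of 15
print(step_adjustment(15))  # -> 1
print(step_adjustment(45))  # -> 5
```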

2-3. Target Tracking Scaling

The recommended default for most workloads: the ASG automatically adds or removes instances to hold a chosen metric at its target value.

resource "aws_autoscaling_policy" "target_tracking" {
  name                   = "target-tracking-cpu"
  autoscaling_group_name = aws_autoscaling_group.web.name
  policy_type            = "TargetTrackingScaling"

  target_tracking_configuration {
    predefined_metric_specification {
      predefined_metric_type = "ASGAverageCPUUtilization"
    }
    target_value = 70.0
  }
}

# ALB request count based Target Tracking
resource "aws_autoscaling_policy" "target_tracking_alb" {
  name                   = "target-tracking-alb"
  autoscaling_group_name = aws_autoscaling_group.web.name
  policy_type            = "TargetTrackingScaling"

  target_tracking_configuration {
    predefined_metric_specification {
      predefined_metric_type = "ALBRequestCountPerTarget"
      resource_label         = "${aws_lb.web.arn_suffix}/${aws_lb_target_group.web.arn_suffix}"
    }
    target_value = 1000.0
  }
}

Why Target Tracking is recommended:

  • No need to manually manage alarms and policies
  • Automatically balances scale-in and scale-out
  • Multiple Target Tracking policies can be applied simultaneously

2-4. Predictive Scaling

An ML model analyzes the past 14 days of traffic patterns to predict future traffic and pre-provision instances.

resource "aws_autoscaling_policy" "predictive" {
  name                   = "predictive-scaling"
  autoscaling_group_name = aws_autoscaling_group.web.name
  policy_type            = "PredictiveScaling"

  predictive_scaling_configuration {
    metric_specification {
      target_value = 70

      predefined_scaling_metric_specification {
        predefined_metric_type = "ASGAverageCPUUtilization"
        resource_label         = ""
      }

      predefined_load_metric_specification {
        predefined_metric_type = "ASGTotalCPUUtilization"
        resource_label         = ""
      }
    }

    mode                          = "ForecastAndScale"
    scheduling_buffer_time        = 300
    max_capacity_breach_behavior  = "HonorMaxCapacity"
  }
}

Predictive Scaling use cases:

  • Daily traffic spikes at the same times (commute hours, lunch time)
  • Services with clear weekly/monthly repeating patterns
  • Combining with Target Tracking, which corrects reactively when the forecast misses

Cooldown Period and Warm-up Configuration

resource "aws_autoscaling_group" "web" {
  name                = "web-asg"
  desired_capacity    = 2
  max_size            = 20
  min_size            = 1
  vpc_zone_identifier = [aws_subnet.private_a.id, aws_subnet.private_c.id]

  launch_template {
    id      = aws_launch_template.web_server.id
    version = "$Latest"
  }

  # Default cooldown: wait time after a scaling action
  default_cooldown = 300

  # Instance warm-up: time for new instances to become fully ready
  default_instance_warmup = 120

  health_check_type         = "ELB"
  health_check_grace_period = 300

  tag {
    key                 = "Name"
    value               = "web-server"
    propagate_at_launch = true
  }
}

Difference between cooldown and warm-up:

  • Cooldown: Wait time between scaling actions (prevents excessive scaling)
  • Warm-up: Time for newly launched instances to become ready to serve traffic (excluded from ASG metrics)

Mixed Instances Policy (On-Demand + Spot Combined)

A key strategy for dramatically reducing costs while maintaining stability:

resource "aws_autoscaling_group" "web_mixed" {
  name                = "web-mixed-asg"
  desired_capacity    = 6
  max_size            = 30
  min_size            = 2
  vpc_zone_identifier = [
    aws_subnet.private_a.id,
    aws_subnet.private_b.id,
    aws_subnet.private_c.id
  ]

  mixed_instances_policy {
    launch_template {
      launch_template_specification {
        launch_template_id = aws_launch_template.web_server.id
        version            = "$Latest"
      }

      override {
        instance_type = "t3.medium"
      }
      override {
        instance_type = "t3a.medium"
      }
      override {
        instance_type = "m5.large"
      }
      override {
        instance_type = "m5a.large"
      }
    }

    instances_distribution {
      on_demand_base_capacity                  = 2
      on_demand_percentage_above_base_capacity = 20
      spot_allocation_strategy                 = "capacity-optimized"
      spot_max_price                           = ""  # Allow up to On-Demand price
    }
  }
}

What this configuration means:

  • Base 2 instances: Always On-Demand (stability guarantee)
  • 80% of additional instances: Spot Instances (cost savings)
  • 20% of additional instances: On-Demand (stability supplement)
  • Multiple instance types: Automatic fallback to alternative types when Spot capacity is unavailable
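
The On-Demand/Spot arithmetic behind `instances_distribution` can be approximated like this (the rounding is an assumption on my part; AWS's exact rounding at small counts may differ):

```python
import math

def capacity_split(desired: int, od_base: int, od_pct_above_base: int) -> tuple:
    """Approximate On-Demand vs Spot instance counts for a mixed-instances ASG."""
    above_base = max(desired - od_base, 0)
    on_demand = od_base + math.ceil(above_base * od_pct_above_base / 100)
    return on_demand, desired - on_demand

# desired_capacity = 6, on_demand_base_capacity = 2, 20% On-Demand above base
print(capacity_split(6, 2, 20))  # -> (3, 3)
```

At the configuration above, 6 desired instances split into 3 On-Demand (2 base + 1) and 3 Spot; as the group scales toward 30, the Spot share approaches the configured 80%.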

Lifecycle Hooks

You can perform custom actions when instances launch or terminate:

resource "aws_autoscaling_lifecycle_hook" "launch_hook" {
  name                   = "launch-setup-hook"
  autoscaling_group_name = aws_autoscaling_group.web.name
  lifecycle_transition   = "autoscaling:EC2_INSTANCE_LAUNCHING"
  heartbeat_timeout      = 600
  default_result         = "CONTINUE"

  notification_target_arn = aws_sns_topic.asg_notifications.arn
  role_arn                = aws_iam_role.asg_hook_role.arn
}

resource "aws_autoscaling_lifecycle_hook" "terminate_hook" {
  name                   = "terminate-cleanup-hook"
  autoscaling_group_name = aws_autoscaling_group.web.name
  lifecycle_transition   = "autoscaling:EC2_INSTANCE_TERMINATING"
  heartbeat_timeout      = 300
  default_result         = "CONTINUE"

  notification_target_arn = aws_sns_topic.asg_notifications.arn
  role_arn                = aws_iam_role.asg_hook_role.arn
}

Lifecycle Hook use cases:

  • Launch Hook: Verify initial configuration completion with configuration management tools (Ansible, Chef)
  • Terminate Hook: Log backup, connection draining, service discovery deregistration before termination

3. Spot Instance Master Class

What Are Spot Instances?

Spot Instances provide unused EC2 capacity at up to 90% discount compared to On-Demand pricing.

Price comparison (m5.xlarge, us-east-1):

| Purchase Option | Hourly Price | Monthly Cost (730 hrs) | Discount |
| --- | --- | --- | --- |
| On-Demand | $0.192 | $140.16 | - |
| Reserved (1yr, all upfront) | $0.120 | $87.60 | 37% |
| Savings Plan (1yr) | $0.125 | $91.25 | 35% |
| Spot (average) | $0.058 | $42.34 | 70% |
| Spot (lowest) | $0.019 | $13.87 | 90% |
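
The monthly figures follow directly from the hourly rates (730 hours/month is AWS's standard approximation); a quick check:

```python
HOURS_PER_MONTH = 730
ON_DEMAND_HOURLY = 0.192  # m5.xlarge, us-east-1, from the table above

def monthly_cost(hourly: float) -> float:
    return hourly * HOURS_PER_MONTH

def discount(hourly: float) -> float:
    return 1 - hourly / ON_DEMAND_HOURLY

# Spot average row
print(f"${monthly_cost(0.058):.2f}/month, {discount(0.058):.0%} off")  # -> $42.34/month, 70% off
```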

Spot Price History Analysis

You can query Spot price history using the AWS CLI:

aws ec2 describe-spot-price-history \
  --instance-types m5.xlarge m5a.xlarge m5d.xlarge \
  --product-descriptions "Linux/UNIX" \
  --start-time "2026-03-16T00:00:00" \
  --end-time "2026-03-23T00:00:00" \
  --query 'SpotPriceHistory[*].[InstanceType,AvailabilityZone,SpotPrice,Timestamp]' \
  --output table

Strategies to increase Spot price stability:

  • Specify multiple instance types (m5.xlarge, m5a.xlarge, m5d.xlarge, m5n.xlarge)
  • Use multiple Availability Zones (AZs)
  • Use the capacity-optimized allocation strategy (allocates from pools with most spare capacity)

Spot Interruption Handling

Spot Instances can be reclaimed by AWS with a 2-minute warning when AWS needs the capacity back.

Detecting Interruptions via Metadata Polling

#!/bin/bash
# spot-interruption-handler.sh
# Requires the AWS CLI, plus TARGET_GROUP_ARN exported in the environment
# (used below when deregistering from the ALB).

# IMDSv2 session token (TTL 6 hours; re-fetch if the script outlives it)
METADATA_TOKEN=$(curl -s -X PUT \
  "http://169.254.169.254/latest/api/token" \
  -H "X-aws-ec2-metadata-token-ttl-seconds: 21600")

while true; do
  INTERRUPTION=$(curl -s -H "X-aws-ec2-metadata-token: $METADATA_TOKEN" \
    http://169.254.169.254/latest/meta-data/spot/instance-action 2>/dev/null)

  if [ "$INTERRUPTION" != "" ] && echo "$INTERRUPTION" | grep -q "action"; then
    echo "Spot interruption detected! Starting graceful shutdown..."

    # 1. Deregister from ALB (stop receiving new requests)
    INSTANCE_ID=$(curl -s -H "X-aws-ec2-metadata-token: $METADATA_TOKEN" \
      http://169.254.169.254/latest/meta-data/instance-id)

    aws elbv2 deregister-targets \
      --target-group-arn "$TARGET_GROUP_ARN" \
      --targets "Id=$INSTANCE_ID"

    # 2. Wait for in-flight requests to complete (Connection Draining)
    sleep 30

    # 3. Backup logs
    aws s3 sync /var/log/app/ "s3://my-logs-bucket/spot-terminated/$INSTANCE_ID/"

    # 4. Graceful application shutdown
    systemctl stop my-app

    echo "Graceful shutdown completed."
    break
  fi

  sleep 5
done

Lambda-based Interruption Handler

Process Spot interruption events via EventBridge rules with Lambda:

import json
import boto3

ec2 = boto3.client('ec2')
elbv2 = boto3.client('elbv2')
sns = boto3.client('sns')
asg = boto3.client('autoscaling')

def lambda_handler(event, context):
    """
    Receives and processes Spot Interruption Warning events from EventBridge
    """
    detail = event.get('detail', {})
    instance_id = detail.get('instance-id')
    action = detail.get('instance-action')

    print(f"Spot interruption: instance={instance_id}, action={action}")

    # 1. Mark instance as unhealthy in ASG (triggers immediate replacement)
    try:
        asg.set_instance_health(
            InstanceId=instance_id,
            HealthStatus='Unhealthy',
            ShouldRespectGracePeriod=False
        )
        print(f"Marked {instance_id} as unhealthy in ASG")
    except Exception as e:
        print(f"ASG health update failed: {e}")

    # 2. Send SNS notification
    sns.publish(
        TopicArn='arn:aws:sns:us-east-1:123456789012:spot-alerts',
        Subject=f'Spot Interruption: {instance_id}',
        Message=json.dumps({
            'instance_id': instance_id,
            'action': action,
            'region': event.get('region'),
            'time': event.get('time')
        }, indent=2)
    )

    return {
        'statusCode': 200,
        'body': f'Handled interruption for {instance_id}'
    }

EventBridge rule configuration (Terraform):

resource "aws_cloudwatch_event_rule" "spot_interruption" {
  name        = "spot-interruption-rule"
  description = "Capture EC2 Spot Instance Interruption Warning"

  event_pattern = jsonencode({
    source      = ["aws.ec2"]
    detail-type = ["EC2 Spot Instance Interruption Warning"]
  })
}

resource "aws_cloudwatch_event_target" "spot_handler_lambda" {
  rule      = aws_cloudwatch_event_rule.spot_interruption.name
  target_id = "spot-interruption-handler"
  arn       = aws_lambda_function.spot_handler.arn
}

Spot Fleet: Multiple Instance Types + Multiple AZs

resource "aws_spot_fleet_request" "batch_processing" {
  iam_fleet_role                      = aws_iam_role.spot_fleet_role.arn
  target_capacity                     = 10
  terminate_instances_with_expiration = true
  allocation_strategy                 = "capacityOptimized"
  fleet_type                          = "maintain"

  launch_template_config {
    launch_template_specification {
      id      = aws_launch_template.batch.id
      version = "$Latest"
    }

    overrides {
      instance_type     = "c5.xlarge"
      availability_zone = "us-east-1a"
    }
    overrides {
      instance_type     = "c5a.xlarge"
      availability_zone = "us-east-1a"
    }
    overrides {
      instance_type     = "c5.xlarge"
      availability_zone = "us-east-1b"
    }
    overrides {
      instance_type     = "c5a.xlarge"
      availability_zone = "us-east-1b"
    }
    overrides {
      instance_type     = "c6i.xlarge"
      availability_zone = "us-east-1a"
    }
    overrides {
      instance_type     = "c6i.xlarge"
      availability_zone = "us-east-1b"
    }
  }
}

Suitable and Unsuitable Workloads for Spot

Suitable workloads:

  • CI/CD pipelines (builds, test runners)
  • Batch data processing (ETL, log analysis)
  • ML model training (with checkpoint support)
  • Load testing / performance testing
  • Big data processing (EMR, Spark)
  • Image/video encoding
  • Web servers (when using ASG Mixed Instances Policy)

Unsuitable workloads:

  • Single-instance databases (use RDS Multi-AZ instead)
  • Real-time payment systems (cannot tolerate interruptions)
  • Workloads requiring long-running stateful processes
  • Mission-critical services requiring 99.99%+ SLA (use On-Demand or Reserved)

4. Serverless Dynamic Infrastructure

Lambda: Event-Driven Auto Scaling

AWS Lambda is the extreme form of dynamic infrastructure. Zero cost when there are no requests, and millisecond-level resource allocation when requests arrive.

import json
import time

def lambda_handler(event, context):
    """
    Lambda function invoked from API Gateway
    Concurrent execution: auto-scales from 0 to thousands
    """
    start = time.time()

    # Business logic
    body = event.get('body', '{}')
    data = json.loads(body) if body else {}

    result = process_request(data)

    duration = (time.time() - start) * 1000
    print(f"Processing took {duration:.2f}ms")

    return {
        'statusCode': 200,
        'headers': {
            'Content-Type': 'application/json',
            'X-Processing-Time': f'{duration:.2f}ms'
        },
        'body': json.dumps(result)
    }

def process_request(data):
    # Actual business logic
    return {'status': 'success', 'data': data}

Lambda Concurrency Management

# Reserved Concurrency: dedicates concurrent execution capacity to this
# function. It is set on the function resource itself, not via a separate
# resource type.
resource "aws_lambda_function" "api" {
  # ... function_name, runtime, handler, role, etc. ...
  reserved_concurrent_executions = 100
}

# Provisioned Concurrency: keeps instances warm (eliminates Cold Start)
resource "aws_lambda_provisioned_concurrency_config" "api" {
  function_name                     = aws_lambda_function.api.function_name
  provisioned_concurrent_executions = 50
  qualifier                         = aws_lambda_alias.live.name
}

Concurrency type comparison:

  • Reserved Concurrency: Guarantees capacity for this function, preventing other functions from using it. No additional cost.
  • Provisioned Concurrency: Keeps execution environments warm. Eliminates Cold Start. Incurs additional cost.
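
To size Provisioned Concurrency, Little's law gives a reasonable first estimate: required concurrency ≈ request rate × average duration. A sketch (the 20% headroom is an assumed buffer, not an AWS recommendation):

```python
import math

def required_concurrency(requests_per_second: float, avg_duration_ms: float,
                         headroom: float = 1.2) -> int:
    """Little's law estimate of concurrent executions needed, rounded up."""
    in_flight = requests_per_second * avg_duration_ms / 1000
    return math.ceil(in_flight * headroom)

# 200 req/s at 250 ms average: ~50 in flight, 60 with buffer
print(required_concurrency(200, 250))  # -> 60
```

Setting `provisioned_concurrent_executions` near this estimate keeps steady-state traffic on warm environments while bursts spill over to on-demand (cold) capacity.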

Cold Start Minimization Strategies

  1. Use Provisioned Concurrency (most reliable method)
  2. Minimize package size (optimize dependencies, use Layers)
  3. Runtime selection: Python/Node.js have faster Cold Starts than Java
  4. Optimize initialization code: Initialize DB connections outside the handler
  5. Use SnapStart (Java runtime; reduces Cold Start by up to ~90%)
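
Strategy 4 in practice: anything created at module scope runs once per execution environment and is reused by warm invocations. A minimal sketch (`create_pool` and `DB_URL` are hypothetical stand-ins for your real driver and configuration):

```python
import os

def create_pool(url: str) -> dict:
    # Placeholder for a real connection pool (e.g. a psycopg pool)
    return {"url": url, "connected": True}

# Module scope: executed once per cold start, reused on warm invocations
DB_POOL = create_pool(os.environ.get("DB_URL", "postgres://localhost/dev"))

def lambda_handler(event, context):
    # The handler reuses the warm pool instead of reconnecting per request
    return {"statusCode": 200, "pool": DB_POOL["url"]}
```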

Fargate: Serverless Containers

resource "aws_ecs_service" "api" {
  name            = "api-service"
  cluster         = aws_ecs_cluster.main.id
  task_definition = aws_ecs_task_definition.api.arn
  desired_count   = 2
  launch_type     = "FARGATE"

  network_configuration {
    subnets          = [aws_subnet.private_a.id, aws_subnet.private_c.id]
    security_groups  = [aws_security_group.ecs.id]
    assign_public_ip = false
  }

  load_balancer {
    target_group_arn = aws_lb_target_group.api.arn
    container_name   = "api"
    container_port   = 8080
  }

  capacity_provider_strategy {
    capacity_provider = "FARGATE"
    weight            = 1
    base              = 2  # Minimum 2 on Fargate On-Demand
  }

  capacity_provider_strategy {
    capacity_provider = "FARGATE_SPOT"
    weight            = 4  # 80% of additional capacity on Fargate Spot
  }
}

# Fargate Auto Scaling
resource "aws_appautoscaling_target" "ecs" {
  max_capacity       = 20
  min_capacity       = 2
  resource_id        = "service/${aws_ecs_cluster.main.name}/${aws_ecs_service.api.name}"
  scalable_dimension = "ecs:service:DesiredCount"
  service_namespace  = "ecs"
}

resource "aws_appautoscaling_policy" "ecs_cpu" {
  name               = "ecs-cpu-scaling"
  policy_type        = "TargetTrackingScaling"
  resource_id        = aws_appautoscaling_target.ecs.resource_id
  scalable_dimension = aws_appautoscaling_target.ecs.scalable_dimension
  service_namespace  = aws_appautoscaling_target.ecs.service_namespace

  target_tracking_scaling_policy_configuration {
    predefined_metric_specification {
      predefined_metric_type = "ECSServiceAverageCPUUtilization"
    }
    target_value       = 70.0
    scale_in_cooldown  = 300
    scale_out_cooldown = 60
  }
}

Lambda vs Fargate vs EC2 Comparison

| Aspect | Lambda | Fargate | EC2 (ASG) |
| --- | --- | --- | --- |
| Scaling speed | Milliseconds | 30s-2min | 2-5min |
| Max execution time | 15 minutes | Unlimited | Unlimited |
| Memory | 128MB-10GB | 512MB-120GB | Depends on instance type |
| vCPU | Up to 6 | Up to 16 | Depends on instance type |
| Cost model | Request count + execution time | vCPU + memory hours | Instance hours |
| Cold Start | Yes | Yes (longer) | No (already running) |
| Management overhead | Minimal | Medium | High |
| Container support | Image deployment available | Native | Manage Docker directly |
| Spot support | N/A | Fargate Spot (70%) | Spot Instance (90%) |
| Best for | Event processing, APIs | Microservices, web apps | High-performance, stateful |
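
The table distills into a rough first-pass decision helper (the thresholds mirror the rows above; real decisions involve more dimensions such as cost model, memory needs, and team maturity):

```python
def pick_compute(max_runtime_min: float, stateful_or_special_hw: bool,
                 cold_start_ok: bool) -> str:
    """Rough first-pass choice based on the comparison table above."""
    if stateful_or_special_hw:
        return "EC2 (ASG)"          # long-lived state, GPUs, custom kernels
    if max_runtime_min <= 15 and cold_start_ok:
        return "Lambda"             # short, event-driven, spiky traffic
    return "Fargate"                # containers without instance management

print(pick_compute(5, False, True))   # -> Lambda
print(pick_compute(60, False, True))  # -> Fargate
```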

5. Building Dynamic Infrastructure with Terraform

Complete ASG + ALB Infrastructure Example

# provider.tf
terraform {
  required_version = ">= 1.5.0"

  required_providers {
    aws = {
      source  = "hashicorp/aws"
      version = "~> 5.0"
    }
  }

  backend "s3" {
    bucket         = "my-terraform-state-bucket"
    key            = "web-app/terraform.tfstate"
    region         = "us-east-1"
    dynamodb_table = "terraform-lock"
    encrypt        = true
  }
}

provider "aws" {
  region = var.aws_region

  default_tags {
    tags = {
      Environment = var.environment
      ManagedBy   = "terraform"
      Project     = var.project_name
    }
  }
}

# variables.tf
variable "aws_region" {
  default = "us-east-1"
}

variable "environment" {
  default = "production"
}

variable "project_name" {
  default = "web-app"
}

variable "vpc_cidr" {
  default = "10.0.0.0/16"
}
# vpc.tf
resource "aws_vpc" "main" {
  cidr_block           = var.vpc_cidr
  enable_dns_hostnames = true
  enable_dns_support   = true

  tags = {
    Name = "${var.project_name}-vpc"
  }
}

resource "aws_subnet" "public_a" {
  vpc_id                  = aws_vpc.main.id
  cidr_block              = "10.0.1.0/24"
  availability_zone       = "${var.aws_region}a"
  map_public_ip_on_launch = true

  tags = {
    Name = "${var.project_name}-public-a"
    Tier = "public"
  }
}

resource "aws_subnet" "public_b" {
  vpc_id                  = aws_vpc.main.id
  cidr_block              = "10.0.2.0/24"
  availability_zone       = "${var.aws_region}b"
  map_public_ip_on_launch = true

  tags = {
    Name = "${var.project_name}-public-b"
    Tier = "public"
  }
}

resource "aws_subnet" "private_a" {
  vpc_id            = aws_vpc.main.id
  cidr_block        = "10.0.10.0/24"
  availability_zone = "${var.aws_region}a"

  tags = {
    Name = "${var.project_name}-private-a"
    Tier = "private"
  }
}

resource "aws_subnet" "private_b" {
  vpc_id            = aws_vpc.main.id
  cidr_block        = "10.0.11.0/24"
  availability_zone = "${var.aws_region}b"

  tags = {
    Name = "${var.project_name}-private-b"
    Tier = "private"
  }
}
# alb.tf
resource "aws_lb" "web" {
  name               = "${var.project_name}-alb"
  internal           = false
  load_balancer_type = "application"
  security_groups    = [aws_security_group.alb.id]
  subnets            = [aws_subnet.public_a.id, aws_subnet.public_b.id]

  enable_deletion_protection = true

  access_logs {
    bucket  = aws_s3_bucket.alb_logs.bucket
    prefix  = "alb-logs"
    enabled = true
  }
}

resource "aws_lb_target_group" "web" {
  name     = "${var.project_name}-tg"
  port     = 80
  protocol = "HTTP"
  vpc_id   = aws_vpc.main.id

  health_check {
    enabled             = true
    healthy_threshold   = 2
    interval            = 15
    matcher             = "200"
    path                = "/health"
    port                = "traffic-port"
    protocol            = "HTTP"
    timeout             = 5
    unhealthy_threshold = 3
  }

  deregistration_delay = 30

  stickiness {
    type    = "lb_cookie"
    enabled = false
  }
}

resource "aws_lb_listener" "https" {
  load_balancer_arn = aws_lb.web.arn
  port              = "443"
  protocol          = "HTTPS"
  ssl_policy        = "ELBSecurityPolicy-TLS13-1-2-2021-06"
  certificate_arn   = aws_acm_certificate.main.arn

  default_action {
    type             = "forward"
    target_group_arn = aws_lb_target_group.web.arn
  }
}
# asg.tf - Complete Mixed Instances ASG
resource "aws_autoscaling_group" "web" {
  name                = "${var.project_name}-asg"
  desired_capacity    = 4
  max_size            = 30
  min_size            = 2
  vpc_zone_identifier = [aws_subnet.private_a.id, aws_subnet.private_b.id]

  target_group_arns         = [aws_lb_target_group.web.arn]
  health_check_type         = "ELB"
  health_check_grace_period = 300
  default_instance_warmup   = 120

  mixed_instances_policy {
    launch_template {
      launch_template_specification {
        launch_template_id = aws_launch_template.web_server.id
        version            = "$Latest"
      }

      override {
        instance_type     = "t3.medium"
        weighted_capacity = "1"
      }
      override {
        instance_type     = "t3a.medium"
        weighted_capacity = "1"
      }
      override {
        instance_type     = "m5.large"
        weighted_capacity = "2"
      }
      override {
        instance_type     = "m5a.large"
        weighted_capacity = "2"
      }
    }

    instances_distribution {
      on_demand_base_capacity                  = 2
      on_demand_percentage_above_base_capacity = 20
      spot_allocation_strategy                 = "capacity-optimized"
    }
  }

  instance_refresh {
    strategy = "Rolling"
    preferences {
      min_healthy_percentage = 80
      instance_warmup        = 120
    }
  }

  tag {
    key                 = "Name"
    value               = "${var.project_name}-web"
    propagate_at_launch = true
  }
}

# Target Tracking Scaling
resource "aws_autoscaling_policy" "cpu_target" {
  name                   = "cpu-target-tracking"
  autoscaling_group_name = aws_autoscaling_group.web.name
  policy_type            = "TargetTrackingScaling"

  target_tracking_configuration {
    predefined_metric_specification {
      predefined_metric_type = "ASGAverageCPUUtilization"
    }
    target_value = 70.0
  }
}

# Predictive Scaling
resource "aws_autoscaling_policy" "predictive" {
  name                   = "predictive-scaling"
  autoscaling_group_name = aws_autoscaling_group.web.name
  policy_type            = "PredictiveScaling"

  predictive_scaling_configuration {
    metric_specification {
      target_value = 70

      predefined_scaling_metric_specification {
        predefined_metric_type = "ASGAverageCPUUtilization"
        resource_label         = ""
      }

      predefined_load_metric_specification {
        predefined_metric_type = "ASGTotalCPUUtilization"
        resource_label         = ""
      }
    }

    mode                   = "ForecastAndScale"
    scheduling_buffer_time = 300
  }
}

Terraform Modules for Reusable Infrastructure

# main.tf (root module) - calls the reusable module defined in modules/asg/
module "web_asg" {
  source = "./modules/asg"

  project_name       = "my-web-app"
  environment        = "production"
  vpc_id             = module.vpc.vpc_id
  private_subnet_ids = module.vpc.private_subnet_ids
  alb_target_group   = module.alb.target_group_arn

  instance_types   = ["t3.medium", "t3a.medium", "m5.large"]
  min_size         = 2
  max_size         = 30
  desired_capacity = 4

  spot_percentage = 80
  on_demand_base  = 2
  target_cpu      = 70

  tags = local.common_tags
}

State Management (S3 + DynamoDB)

# state-backend/main.tf (apply this locally first)
resource "aws_s3_bucket" "terraform_state" {
  bucket = "my-company-terraform-state"

  lifecycle {
    prevent_destroy = true
  }
}

resource "aws_s3_bucket_versioning" "terraform_state" {
  bucket = aws_s3_bucket.terraform_state.id

  versioning_configuration {
    status = "Enabled"
  }
}

resource "aws_s3_bucket_server_side_encryption_configuration" "terraform_state" {
  bucket = aws_s3_bucket.terraform_state.id

  rule {
    apply_server_side_encryption_by_default {
      sse_algorithm = "aws:kms"
    }
  }
}

resource "aws_dynamodb_table" "terraform_lock" {
  name         = "terraform-lock"
  billing_mode = "PAY_PER_REQUEST"
  hash_key     = "LockID"

  attribute {
    name = "LockID"
    type = "S"
  }
}

Terraform Workflow

# 1. Initialize
terraform init

# 2. Validate code
terraform validate

# 3. Review change plan
terraform plan -out=tfplan

# 4. Apply changes
terraform apply tfplan

# 5. Check state
terraform state list
terraform state show aws_autoscaling_group.web

# 6. Destroy infrastructure (dev environments)
terraform destroy

6. Building Dynamic Infrastructure with AWS CDK

CDK TypeScript Example: ASG + ALB + RDS

import * as cdk from 'aws-cdk-lib'
import * as ec2 from 'aws-cdk-lib/aws-ec2'
import * as elbv2 from 'aws-cdk-lib/aws-elasticloadbalancingv2'
import * as autoscaling from 'aws-cdk-lib/aws-autoscaling'
import * as rds from 'aws-cdk-lib/aws-rds'
import { Construct } from 'constructs'

export class WebAppStack extends cdk.Stack {
  constructor(scope: Construct, id: string, props?: cdk.StackProps) {
    super(scope, id, props)

    // VPC
    const vpc = new ec2.Vpc(this, 'WebVpc', {
      maxAzs: 3,
      natGateways: 2,
      subnetConfiguration: [
        {
          cidrMask: 24,
          name: 'Public',
          subnetType: ec2.SubnetType.PUBLIC,
        },
        {
          cidrMask: 24,
          name: 'Private',
          subnetType: ec2.SubnetType.PRIVATE_WITH_EGRESS,
        },
        {
          cidrMask: 24,
          name: 'Isolated',
          subnetType: ec2.SubnetType.PRIVATE_ISOLATED,
        },
      ],
    })

    // ALB
    const alb = new elbv2.ApplicationLoadBalancer(this, 'WebAlb', {
      vpc,
      internetFacing: true,
      vpcSubnets: { subnetType: ec2.SubnetType.PUBLIC },
    })

    // ASG with Mixed Instances
    const asg = new autoscaling.AutoScalingGroup(this, 'WebAsg', {
      vpc,
      vpcSubnets: { subnetType: ec2.SubnetType.PRIVATE_WITH_EGRESS },
      mixedInstancesPolicy: {
        instancesDistribution: {
          onDemandBaseCapacity: 2,
          onDemandPercentageAboveBaseCapacity: 20,
          spotAllocationStrategy: autoscaling.SpotAllocationStrategy.CAPACITY_OPTIMIZED,
        },
        launchTemplate: new ec2.LaunchTemplate(this, 'LaunchTemplate', {
          instanceType: ec2.InstanceType.of(ec2.InstanceClass.T3, ec2.InstanceSize.MEDIUM),
          machineImage: ec2.MachineImage.latestAmazonLinux2023(),
          // Build user data from joined lines so the shebang stays on the first line
          // (an indented template literal would prepend whitespace and break the script)
          userData: ec2.UserData.custom(
            [
              '#!/bin/bash',
              'yum update -y',
              'yum install -y docker',
              'systemctl start docker',
              'docker run -d -p 80:8080 my-web-app:latest',
            ].join('\n')
          ),
        }),
        launchTemplateOverrides: [
          { instanceType: new ec2.InstanceType('t3.medium') },
          { instanceType: new ec2.InstanceType('t3a.medium') },
          { instanceType: new ec2.InstanceType('m5.large') },
          { instanceType: new ec2.InstanceType('m5a.large') },
        ],
      },
      minCapacity: 2,
      maxCapacity: 30,
      healthCheck: autoscaling.HealthCheck.elb({
        grace: cdk.Duration.seconds(300),
      }),
    })

    // Target Tracking Scaling
    asg.scaleOnCpuUtilization('CpuScaling', {
      targetUtilizationPercent: 70,
      cooldown: cdk.Duration.seconds(300),
    })

    // ALB Listener
    const listener = alb.addListener('HttpsListener', {
      port: 443,
      certificates: [
        elbv2.ListenerCertificate.fromArn('arn:aws:acm:us-east-1:123456789012:certificate/abc-123'),
      ],
    })

    listener.addTargets('WebTarget', {
      port: 80,
      targets: [asg],
      healthCheck: {
        path: '/health',
        interval: cdk.Duration.seconds(15),
        healthyThresholdCount: 2,
        unhealthyThresholdCount: 3,
      },
      deregistrationDelay: cdk.Duration.seconds(30),
    })

    // Request-count scaling must be configured after the ASG is attached to the ALB
    asg.scaleOnRequestCount('RequestScaling', {
      targetRequestsPerMinute: 1000,
    })

    // Aurora PostgreSQL cluster (writer + reader spread across AZs for HA)
    const database = new rds.DatabaseCluster(this, 'Database', {
      engine: rds.DatabaseClusterEngine.auroraPostgres({
        version: rds.AuroraPostgresEngineVersion.VER_15_4,
      }),
      writer: rds.ClusterInstance.provisioned('Writer', {
        instanceType: ec2.InstanceType.of(ec2.InstanceClass.R6G, ec2.InstanceSize.LARGE),
      }),
      readers: [
        rds.ClusterInstance.provisioned('Reader1', {
          instanceType: ec2.InstanceType.of(ec2.InstanceClass.R6G, ec2.InstanceSize.LARGE),
        }),
      ],
      vpc,
      vpcSubnets: { subnetType: ec2.SubnetType.PRIVATE_ISOLATED },
      storageEncrypted: true,
      deletionProtection: true,
    })

    // Allow ASG to access RDS
    database.connections.allowDefaultPortFrom(asg)

    // Outputs
    new cdk.CfnOutput(this, 'AlbDnsName', {
      value: alb.loadBalancerDnsName,
      description: 'ALB DNS Name',
    })
  }
}

CDK Constructs: L1 vs L2 vs L3

| Level | Name | Description | Example |
| --- | --- | --- | --- |
| L1 | Cfn Resources | 1:1 mapping to CloudFormation resources | CfnInstance, CfnVPC |
| L2 | Curated | Reasonable defaults + helper methods | ec2.Vpc, lambda.Function |
| L3 | Patterns | Architecture patterns combining multiple resources | ecs_patterns.ApplicationLoadBalancedFargateService |

Using L3 patterns lets you define complex infrastructure in just a few lines:

import * as ecs_patterns from 'aws-cdk-lib/aws-ecs-patterns'

// L3 Pattern: ALB + Fargate service in one go
const fargateService = new ecs_patterns.ApplicationLoadBalancedFargateService(
  this,
  'FargateService',
  {
    cluster,
    desiredCount: 2,
    taskImageOptions: {
      image: ecs.ContainerImage.fromRegistry('my-app:latest'),
      containerPort: 8080,
    },
    publicLoadBalancer: true,
    capacityProviderStrategies: [
      { capacityProvider: 'FARGATE', weight: 1, base: 2 },
      { capacityProvider: 'FARGATE_SPOT', weight: 4 },
    ],
  }
)

fargateService.targetGroup.configureHealthCheck({
  path: '/health',
})

CDK vs Terraform vs CloudFormation Comparison

| Aspect | CDK | Terraform | CloudFormation |
| --- | --- | --- | --- |
| Language | TypeScript, Python, Java, Go | HCL | YAML/JSON |
| Learning curve | Medium (leverages programming languages) | Medium (learn HCL) | Low (declarative YAML) |
| Abstraction level | High (L3 patterns) | Medium (modules) | Low (resource-level) |
| Multi-cloud | AWS only | Multi-cloud | AWS only |
| State management | CloudFormation stacks | S3 + DynamoDB | Auto-managed |
| Testing | Unit tests possible | Terratest | Limited |
| Drift detection | Via CloudFormation | terraform plan | Supported |
| Ecosystem | Construct Hub | Terraform Registry | Limited |

CDK Pipeline: Infrastructure Deployment via CI/CD

import { CodePipeline, CodePipelineSource, ManualApprovalStep, ShellStep } from 'aws-cdk-lib/pipelines'

const pipeline = new CodePipeline(this, 'Pipeline', {
  pipelineName: 'WebAppPipeline',
  synth: new ShellStep('Synth', {
    input: CodePipelineSource.gitHub('my-org/my-repo', 'main'),
    commands: ['npm ci', 'npm run build', 'npx cdk synth'],
  }),
})

// Deploy to staging
pipeline.addStage(
  new WebAppStage(this, 'Staging', {
    env: { account: '123456789012', region: 'us-east-1' },
  })
)

// Deploy to production (with manual approval)
pipeline.addStage(
  new WebAppStage(this, 'Production', {
    env: { account: '987654321098', region: 'us-east-1' },
  }),
  {
    pre: [new ManualApprovalStep('PromoteToProduction')],
  }
)

7. Practical Cost Optimization Strategies

Reserved Instances vs Savings Plans vs Spot Comparison

| Aspect | Reserved Instances | Savings Plans | Spot Instances |
| --- | --- | --- | --- |
| Discount | Up to 72% | Up to 72% | Up to 90% |
| Commitment | 1 or 3 years | 1 or 3 years | None |
| Flexibility | Fixed instance type/region | Flexible compute type | Interruptible |
| Upfront options | All/partial/none | All/partial/none | None |
| Best for | Predictable baseline workloads | Diverse compute usage | Interruptible workloads |
| Lambda support | No | Compute SP applicable | N/A |
| Fargate support | No | Compute SP applicable | Fargate Spot |
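To get a feel for how these discounts combine in a mixed fleet, here is a rough sketch of the blended monthly cost for a fleet split across purchase options. The rates and discount levels are illustrative placeholders, not AWS list prices:

```python
def blended_monthly_cost(on_demand_rate, hours, base_od, sp_discount,
                         spot_count, spot_discount):
    """Estimate monthly cost for base_od instances covered by a Savings
    Plan plus spot_count instances on Spot. All rates are illustrative."""
    sp_cost = base_od * on_demand_rate * (1 - sp_discount) * hours
    spot_cost = spot_count * on_demand_rate * (1 - spot_discount) * hours
    return round(sp_cost + spot_cost, 2)

# Illustrative: a t3.medium-like $0.0416/hr rate, 730 hrs/month,
# 2 instances under a 40% Savings Plan + 8 Spot at a 70% discount
cost = blended_monthly_cost(0.0416, 730, base_od=2, sp_discount=0.40,
                            spot_count=8, spot_discount=0.70)
```

Plugging in your actual On-Demand rate and observed Spot discount for the region gives a quick upper/lower bound before committing to a plan.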

Schedule-based Scaling

Apply to dev/staging environments or services with low off-hours traffic:

# Business hours (Mon-Fri 09:00-18:00): 4 instances
resource "aws_autoscaling_schedule" "scale_up_business_hours" {
  scheduled_action_name  = "scale-up-business"
  autoscaling_group_name = aws_autoscaling_group.web.name
  min_size               = 4
  max_size               = 30
  desired_capacity       = 4
  recurrence             = "0 9 * * 1-5"
  time_zone              = "America/New_York"
}

# Night (Mon-Fri after 18:00): 2 instances
resource "aws_autoscaling_schedule" "scale_down_night" {
  scheduled_action_name  = "scale-down-night"
  autoscaling_group_name = aws_autoscaling_group.web.name
  min_size               = 2
  max_size               = 10
  desired_capacity       = 2
  recurrence             = "0 18 * * 1-5"
  time_zone              = "America/New_York"
}

# Weekend: 1 instance
resource "aws_autoscaling_schedule" "scale_down_weekend" {
  scheduled_action_name  = "scale-down-weekend"
  autoscaling_group_name = aws_autoscaling_group.web.name
  min_size               = 1
  max_size               = 5
  desired_capacity       = 1
  recurrence             = "0 0 * * 6"
  time_zone              = "America/New_York"
}

Tag-based Cost Tracking

# Apply cost tracking tags to all resources
locals {
  cost_tags = {
    CostCenter  = "engineering"
    Team        = "platform"
    Project     = "web-app"
    Environment = var.environment
    ManagedBy   = "terraform"
  }
}

# Enable tag-based cost analysis in AWS Cost Explorer
resource "aws_ce_cost_category" "team_costs" {
  name         = "TeamCosts"
  rule_version = "CostCategoryExpression.v1"

  rule {
    value = "Platform"
    rule {
      tags {
        key           = "Team"
        values        = ["platform"]
        match_options = ["EQUALS"]
      }
    }
  }

  rule {
    value = "Backend"
    rule {
      tags {
        key           = "Team"
        values        = ["backend"]
        match_options = ["EQUALS"]
      }
    }
  }
}

Real-world Case Study: From $10,000 to $2,500 per Month

Before (Static Infrastructure):

  • EC2 m5.xlarge x 10 (On-Demand, 24/7) = $1,401/month
  • EC2 m5.2xlarge x 5 (Batch servers, 24/7) = $1,401/month
  • RDS db.r5.xlarge Multi-AZ = $1,020/month
  • NAT Gateway x 2 = $130/month
  • ALB = $50/month
  • Other (EBS, S3, CloudWatch) = $500/month
  • Total monthly cost: approximately $4,502

After (Dynamic Infrastructure):

| Change | Before | After | Savings |
| --- | --- | --- | --- |
| Web servers: 10 On-Demand to ASG (2 OD + Spot) | $1,401 | $420 | 70% |
| Batch servers to Spot Fleet (Spot only) | $1,401 | $140 | 90% |
| RDS with Savings Plan | $1,020 | $663 | 35% |
| NAT Gateway optimization | $130 | $65 | 50% |
| Schedule Scaling (night/weekend reduction) | - | Additional -30% | - |
| Total | $4,502 | Approx. $1,288 | 71% |
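The figures in the table above can be sanity-checked with a few lines of arithmetic (numbers copied straight from the table; the After total covers the four optimized line items):

```python
# (before, after) monthly cost in USD per optimized line item, from the table
rows = {
    "web":   (1401, 420),
    "batch": (1401, 140),
    "rds":   (1020, 663),
    "nat":   (130, 65),
}

# Per-row savings percentage
savings = {name: round((1 - after / before) * 100)
           for name, (before, after) in rows.items()}
# savings == {'web': 70, 'batch': 90, 'rds': 35, 'nat': 50}

after_total = sum(after for _, after in rows.values())  # 1288
overall = round((1 - after_total / 4502) * 100)         # 71, vs the $4,502 Before total
```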

8. High Availability Architecture

Multi-AZ Deployment Architecture

                    Route 53 (DNS Failover)
                           |
                    CloudFront (CDN)
                           |
                    ALB (Multi-AZ)
                    /              \
            AZ-a (us-east-1a)        AZ-b (us-east-1b)
            +------------------+    +------------------+
            | EC2 (ASG)        |    | EC2 (ASG)        |
            | - Web Server x2  |    | - Web Server x2  |
            |                  |    |                  |
            | RDS (Primary)    |    | RDS (Standby)    |
            | ElastiCache      |    | ElastiCache      |
            +------------------+    +------------------+

Route 53 Health Check + Failover

resource "aws_route53_health_check" "primary" {
  fqdn               = "app.example.com"
  port               = 443
  type               = "HTTPS"
  resource_path      = "/health"
  failure_threshold  = 3
  request_interval   = 10

  regions = ["us-east-1", "eu-west-1", "ap-southeast-1"]

  tags = {
    Name = "primary-health-check"
  }
}

resource "aws_route53_record" "primary" {
  zone_id = aws_route53_zone.main.zone_id
  name    = "app.example.com"
  type    = "A"

  alias {
    name                   = aws_lb.web.dns_name
    zone_id                = aws_lb.web.zone_id
    evaluate_target_health = true
  }

  failover_routing_policy {
    type = "PRIMARY"
  }

  health_check_id = aws_route53_health_check.primary.id
  set_identifier  = "primary"
}

Chaos Engineering: AWS Fault Injection Simulator

resource "aws_fis_experiment_template" "spot_interruption" {
  description = "Simulate Spot Instance interruptions"
  role_arn    = aws_iam_role.fis.arn

  stop_condition {
    source = "aws:cloudwatch:alarm"
    value  = aws_cloudwatch_metric_alarm.error_rate.arn
  }

  action {
    name      = "interrupt-spot-instances"
    action_id = "aws:ec2:send-spot-instance-interruptions"

    parameter {
      key   = "durationBeforeInterruption"
      value = "PT2M"
    }

    target {
      key   = "SpotInstances"
      value = "spot-instances-target"
    }
  }

  target {
    name           = "spot-instances-target"
    resource_type  = "aws:ec2:spot-instance"
    selection_mode = "COUNT(2)"

    resource_tag {
      key   = "Environment"
      value = "staging"
    }
  }
}

9. Three Patterns for Creating EC2 Instances On-Demand

9-1. EventBridge + Lambda to EC2 (Event-Driven)

A pattern that starts an EC2 instance when a file is uploaded to S3, processes it, and automatically terminates:

import boto3
import json

ec2 = boto3.client('ec2')

def lambda_handler(event, context):
    """
    S3 upload event -> Create EC2 instance -> Process -> Auto-terminate
    """
    # Extract file info from S3 event
    bucket = event['detail']['bucket']['name']
    key = event['detail']['object']['key']

    print(f"Processing file: s3://{bucket}/{key}")

    # Start EC2 instance
    user_data = f"""#!/bin/bash
set -e

# Perform work
aws s3 cp s3://{bucket}/{key} /tmp/input
python3 /opt/process.py /tmp/input /tmp/output
aws s3 cp /tmp/output s3://{bucket}-processed/{key}

# Self-terminate after completion (IMDSv2 requires a session token)
TOKEN=$(curl -s -X PUT http://169.254.169.254/latest/api/token -H "X-aws-ec2-metadata-token-ttl-seconds: 300")
INSTANCE_ID=$(curl -s -H "X-aws-ec2-metadata-token: $TOKEN" http://169.254.169.254/latest/meta-data/instance-id)
aws ec2 terminate-instances --instance-ids $INSTANCE_ID
"""

    response = ec2.run_instances(
        ImageId='ami-0abcdef1234567890',
        InstanceType='c5.xlarge',
        MinCount=1,
        MaxCount=1,
        IamInstanceProfile={
            'Name': 'ec2-processing-role'
        },
        UserData=user_data,
        TagSpecifications=[
            {
                'ResourceType': 'instance',
                'Tags': [
                    {'Key': 'Name', 'Value': f'processor-{key[:20]}'},
                    {'Key': 'Purpose', 'Value': 'batch-processing'},
                    {'Key': 'AutoTerminate', 'Value': 'true'}
                ]
            }
        ],
        InstanceMarketOptions={
            'MarketType': 'spot',
            'SpotOptions': {
                'SpotInstanceType': 'one-time',
                'InstanceInterruptionBehavior': 'terminate'
            }
        }
    )

    instance_id = response['Instances'][0]['InstanceId']
    print(f"Started processing instance: {instance_id}")

    return {
        'statusCode': 200,
        'body': json.dumps({
            'instance_id': instance_id,
            'file': f's3://{bucket}/{key}'
        })
    }

9-2. Step Functions Orchestration

Manage complex workflows with Step Functions:

{
  "Comment": "EC2-based batch processing workflow",
  "StartAt": "CreateInstance",
  "States": {
    "CreateInstance": {
      "Type": "Task",
      "Resource": "arn:aws:states:::aws-sdk:ec2:runInstances",
      "Parameters": {
        "ImageId": "ami-0abcdef1234567890",
        "InstanceType": "c5.2xlarge",
        "MinCount": 1,
        "MaxCount": 1,
        "IamInstanceProfile": {
          "Name": "batch-processing-role"
        },
        "TagSpecifications": [
          {
            "ResourceType": "instance",
            "Tags": [
              {
                "Key": "Purpose",
                "Value": "step-function-batch"
              }
            ]
          }
        ]
      },
      "ResultPath": "$.instanceInfo",
      "Next": "WaitForInstance"
    },
    "WaitForInstance": {
      "Type": "Wait",
      "Seconds": 60,
      "Next": "CheckInstanceStatus"
    },
    "CheckInstanceStatus": {
      "Type": "Task",
      "Resource": "arn:aws:states:::aws-sdk:ec2:describeInstanceStatus",
      "Parameters": {
        "InstanceIds.$": "States.Array($.instanceInfo.Instances[0].InstanceId)",
        "IncludeAllInstances": true
      },
      "ResultPath": "$.status",
      "Next": "IsInstanceReady"
    },
    "IsInstanceReady": {
      "Type": "Choice",
      "Choices": [
        {
          "Variable": "$.status.InstanceStatuses[0].InstanceState.Name",
          "StringEquals": "running",
          "Next": "RunProcessing"
        }
      ],
      "Default": "WaitForInstance"
    },
    "RunProcessing": {
      "Type": "Task",
      "Resource": "arn:aws:states:::aws-sdk:ssm:sendCommand",
      "Parameters": {
        "DocumentName": "AWS-RunShellScript",
        "InstanceIds.$": "States.Array($.instanceInfo.Instances[0].InstanceId)",
        "Parameters": {
          "commands": ["cd /opt/app && python3 process.py"]
        }
      },
      "ResultPath": "$.command",
      "Next": "WaitForCommand",
      "Catch": [
        {
          "ErrorEquals": ["States.ALL"],
          "Next": "TerminateInstance",
          "ResultPath": "$.error"
        }
      ]
    },
    "WaitForCommand": {
      "Type": "Wait",
      "Seconds": 30,
      "Next": "CheckCommandStatus"
    },
    "CheckCommandStatus": {
      "Type": "Task",
      "Resource": "arn:aws:states:::aws-sdk:ssm:getCommandInvocation",
      "Parameters": {
        "CommandId.$": "$.command.Command.CommandId",
        "InstanceId.$": "$.instanceInfo.Instances[0].InstanceId"
      },
      "ResultPath": "$.invocation",
      "Next": "IsCommandDone"
    },
    "IsCommandDone": {
      "Type": "Choice",
      "Choices": [
        {
          "Or": [
            { "Variable": "$.invocation.Status", "StringEquals": "Success" },
            { "Variable": "$.invocation.Status", "StringEquals": "Failed" },
            { "Variable": "$.invocation.Status", "StringEquals": "TimedOut" },
            { "Variable": "$.invocation.Status", "StringEquals": "Cancelled" }
          ],
          "Next": "TerminateInstance"
        }
      ],
      "Default": "WaitForCommand"
    },
    "TerminateInstance": {
      "Type": "Task",
      "Resource": "arn:aws:states:::aws-sdk:ec2:terminateInstances",
      "Parameters": {
        "InstanceIds.$": "States.Array($.instanceInfo.Instances[0].InstanceId)"
      },
      "End": true
    }
  }
}

9-3. Kubernetes Jobs + Karpenter

Karpenter is an AWS-optimized Kubernetes node autoscaler:

# karpenter-nodepool.yaml
apiVersion: karpenter.sh/v1
kind: NodePool
metadata:
  name: batch-processing
spec:
  template:
    spec:
      requirements:
        - key: karpenter.sh/capacity-type
          operator: In
          values: ['spot', 'on-demand']
        - key: node.kubernetes.io/instance-type
          operator: In
          values:
            - c5.xlarge
            - c5a.xlarge
            - c5.2xlarge
            - c6i.xlarge
            - c6i.2xlarge
            - m5.xlarge
            - m5a.xlarge
        - key: topology.kubernetes.io/zone
          operator: In
          values:
            - us-east-1a
            - us-east-1b
      nodeClassRef:
        group: karpenter.k8s.aws
        kind: EC2NodeClass
        name: default
  limits:
    cpu: '100'
    memory: 400Gi
  disruption:
    consolidationPolicy: WhenEmptyOrUnderutilized
    consolidateAfter: 30s
---
apiVersion: karpenter.k8s.aws/v1
kind: EC2NodeClass
metadata:
  name: default
spec:
  amiSelectorTerms:
    - alias: al2023@latest
  subnetSelectorTerms:
    - tags:
        Tier: private
  securityGroupSelectorTerms:
    - tags:
        kubernetes.io/cluster/my-cluster: owned
  instanceProfile: KarpenterNodeInstanceProfile
---
# batch-job.yaml
apiVersion: batch/v1
kind: Job
metadata:
  name: data-processing-job
spec:
  parallelism: 10
  completions: 100
  backoffLimit: 3
  template:
    metadata:
      labels:
        app: data-processor
    spec:
      containers:
        - name: processor
          image: my-registry/data-processor:v1.2
          resources:
            requests:
              cpu: '2'
              memory: '4Gi'
            limits:
              cpu: '4'
              memory: '8Gi'
          env:
            - name: BATCH_SIZE
              value: '1000'
      restartPolicy: OnFailure
      nodeSelector:
        karpenter.sh/capacity-type: spot
      tolerations:
        - key: 'karpenter.sh/disruption'
          operator: 'Exists'

Karpenter vs Cluster Autoscaler comparison:

| Aspect | Karpenter | Cluster Autoscaler |
| --- | --- | --- |
| Node provisioning speed | Seconds (direct EC2 API calls) | Minutes (via ASG) |
| Instance type selection | Automatic based on workload | Only types defined in ASG |
| Bin packing | Automatic optimization | Limited |
| Spot integration | Native support | ASG Mixed Instances |
| Scale down | Immediate (removes idle nodes after 30s) | Default 10-min wait |
| AWS dependency | AWS only | Multi-cloud |

10. Monitoring and Alarms

CloudWatch Alarms + SNS

# ASG-related alarms
resource "aws_cloudwatch_metric_alarm" "asg_high_cpu" {
  alarm_name          = "asg-high-cpu"
  comparison_operator = "GreaterThanThreshold"
  evaluation_periods  = 3
  metric_name         = "CPUUtilization"
  namespace           = "AWS/EC2"
  period              = 60
  statistic           = "Average"
  threshold           = 85
  alarm_description   = "ASG CPU utilization exceeded 85%"

  dimensions = {
    AutoScalingGroupName = aws_autoscaling_group.web.name
  }

  alarm_actions = [aws_sns_topic.alerts.arn]
  ok_actions    = [aws_sns_topic.alerts.arn]
}

# Spot Interruption count alarm
resource "aws_cloudwatch_metric_alarm" "spot_interruptions" {
  alarm_name          = "spot-interruption-count"
  comparison_operator = "GreaterThanThreshold"
  evaluation_periods  = 1
  metric_name         = "SpotInterruptionCount"
  namespace           = "Custom/SpotMetrics"
  period              = 300
  statistic           = "Sum"
  threshold           = 3
  alarm_description   = "More than 3 Spot interruptions in 5 minutes"

  alarm_actions = [aws_sns_topic.critical_alerts.arn]
}

# ALB 5xx error rate alarm
resource "aws_cloudwatch_metric_alarm" "alb_5xx" {
  alarm_name          = "alb-5xx-error-rate"
  comparison_operator = "GreaterThanThreshold"
  evaluation_periods  = 2
  threshold           = 5

  metric_query {
    id          = "error_rate"
    expression  = "(errors / requests) * 100"
    label       = "5xx Error Rate"
    return_data = true
  }

  metric_query {
    id = "errors"
    metric {
      metric_name = "HTTPCode_Target_5XX_Count"
      namespace   = "AWS/ApplicationELB"
      period      = 60
      stat        = "Sum"
      dimensions = {
        LoadBalancer = aws_lb.web.arn_suffix
      }
    }
  }

  metric_query {
    id = "requests"
    metric {
      metric_name = "RequestCount"
      namespace   = "AWS/ApplicationELB"
      period      = 60
      stat        = "Sum"
      dimensions = {
        LoadBalancer = aws_lb.web.arn_suffix
      }
    }
  }

  alarm_actions = [aws_sns_topic.critical_alerts.arn]
}
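The metric math in the alarm above is simply (errors / requests) * 100. As a quick sketch of the same computation, with a guard for the zero-traffic case where CloudWatch emits no datapoint at all:

```python
def error_rate_pct(errors_5xx, requests):
    """Mirror of the CloudWatch expression "(errors / requests) * 100".
    Returns None when there is no traffic, mimicking CloudWatch
    producing no datapoint rather than a divide-by-zero."""
    if requests == 0:
        return None
    return (errors_5xx / requests) * 100

# With the alarm threshold of 5 (%), 60 errors out of 1,000 requests fires it
rate = error_rate_pct(60, 1000)                # 6.0
alarm_fires = rate is not None and rate > 5    # True
```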

Grafana + Prometheus on EKS

Monitoring stack using Prometheus and Grafana in an EKS environment:

# prometheus-values.yaml (Helm)
prometheus:
  prometheusSpec:
    retention: 15d
    storageSpec:
      volumeClaimTemplate:
        spec:
          storageClassName: gp3
          accessModes: ['ReadWriteOnce']
          resources:
            requests:
              storage: 50Gi

    additionalScrapeConfigs:
      - job_name: karpenter
        kubernetes_sd_configs:
          - role: endpoints
            namespaces:
              names:
                - karpenter
        relabel_configs:
          - source_labels: [__meta_kubernetes_endpoint_port_name]
            regex: http-metrics
            action: keep

grafana:
  adminPassword: 'secure-password' # placeholder: inject via a Kubernetes Secret in practice
  persistence:
    enabled: true
    size: 10Gi
  dashboardProviders:
    dashboardproviders.yaml:
      apiVersion: 1
      providers:
        - name: default
          orgId: 1
          folder: ''
          type: file
          options:
            path: /var/lib/grafana/dashboards/default

11. Quiz

Q1: What is the main reason Target Tracking Scaling is recommended over Simple Scaling in Auto Scaling Groups?

Answer: Target Tracking automatically maintains a target metric value, balances scale-in and scale-out, and eliminates the need to manually manage separate CloudWatch alarms.

Simple Scaling cannot perform additional scaling during the cooldown period and requires manually defining scaling amounts. Target Tracking lets AWS automatically perform optimal scaling, reducing operational overhead and enabling more accurate scaling.
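For reference, creating a Target Tracking policy is a single API call. A hedged boto3 sketch of the request payload (the ASG name and policy name are placeholders):

```python
def target_tracking_policy(asg_name, target_cpu=70.0):
    """Build put_scaling_policy parameters for a CPU target-tracking
    policy; AWS then creates and manages the CloudWatch alarms itself."""
    return {
        "AutoScalingGroupName": asg_name,
        "PolicyName": "cpu-target-tracking",
        "PolicyType": "TargetTrackingScaling",
        "TargetTrackingConfiguration": {
            "PredefinedMetricSpecification": {
                "PredefinedMetricType": "ASGAverageCPUUtilization",
            },
            "TargetValue": target_cpu,
        },
    }

params = target_tracking_policy("web-asg")
# Requires AWS credentials; left commented in this sketch:
# boto3.client("autoscaling").put_scaling_policy(**params)
```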

Q2: In a Mixed Instances Policy with on_demand_base_capacity set to 2 and on_demand_percentage_above_base_capacity set to 20, if the ASG needs 12 total instances, what is the On-Demand to Spot ratio?

Answer: 4 On-Demand, 8 Spot

Calculation:

  • Base On-Demand: 2
  • Additional needed: 12 - 2 = 10
  • Additional On-Demand (20%): 10 x 0.2 = 2
  • Additional Spot (80%): 10 x 0.8 = 8
  • Total On-Demand: 2 + 2 = 4, Total Spot: 8
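The calculation above generalizes to any fleet size; a small helper captures the rule (here the fractional part is rounded up toward On-Demand, which is an assumption about ASG rounding worth verifying for your case):

```python
import math

def on_demand_spot_split(total, base_capacity, od_pct_above_base):
    """Split desired capacity per the Mixed Instances Policy rules:
    base_capacity is always On-Demand; above that, od_pct_above_base
    percent is On-Demand and the remainder is Spot."""
    above_base = max(total - base_capacity, 0)
    extra_od = math.ceil(above_base * od_pct_above_base / 100)
    on_demand = min(base_capacity, total) + extra_od
    return on_demand, total - on_demand

on_demand_spot_split(12, 2, 20)   # (4, 8), matching the answer above
```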

Q3: How many minutes before interruption does AWS send a warning for Spot Instances? What are two methods to detect this warning?

Answer: AWS sends a warning 2 minutes before interruption.

Two detection methods:

  1. EC2 Metadata Polling: Periodically check the http://169.254.169.254/latest/meta-data/spot/instance-action endpoint from within the instance
  2. EventBridge Rule: Receive EC2 Spot Instance Interruption Warning events through EventBridge and process them with Lambda or other targets
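Method 1 reduces to polling a single metadata URL and acting on the response. A sketch of just the decision logic (the draining step itself is application-specific, and the helper name is made up):

```python
import json

INSTANCE_ACTION_URL = "http://169.254.169.254/latest/meta-data/spot/instance-action"

def parse_instance_action(status_code, body):
    """The endpoint returns 404 while the instance is safe; once an
    interruption is scheduled it returns JSON like
    {"action": "terminate", "time": "2025-01-01T00:00:00Z"}."""
    if status_code == 404:
        return None  # no interruption scheduled
    return json.loads(body)

# In a real poller you would fetch INSTANCE_ACTION_URL (with an IMDSv2
# token) every few seconds and start draining when this returns a dict:
action = parse_instance_action(
    200, '{"action": "terminate", "time": "2025-01-01T00:00:00Z"}'
)
```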

Q4: What is the fundamental reason Karpenter provisions nodes faster than Cluster Autoscaler?

Answer: Karpenter calls the EC2 API directly to create nodes, while Cluster Autoscaler goes through ASG (Auto Scaling Group) to create nodes.

Cluster Autoscaler detects Pending Pods, modifies the ASG desired count, and then ASG creates instances according to the Launch Template, which is an indirect path. Karpenter analyzes workload requirements, selects the optimal instance type, and creates nodes directly via the EC2 Fleet API, so nodes are ready within seconds.

Q5: When managing Terraform State with S3 + DynamoDB, what is the role of DynamoDB?

Answer: DynamoDB handles State Locking.

When multiple team members run terraform apply simultaneously, the State file can conflict. DynamoDB creates lock records in the table to ensure only one person can modify the State at a time. This prevents race conditions and ensures infrastructure consistency.
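The lock acquisition is essentially a conditional write: create the lock item only if it does not already exist (DynamoDB enforces this with a condition such as attribute_not_exists(LockID)). A toy in-memory sketch of that compare-and-set rule:

```python
class LockTable:
    """In-memory stand-in for the DynamoDB lock table."""
    def __init__(self):
        self.items = {}

    def acquire(self, lock_id, owner):
        # Mirrors attribute_not_exists(LockID): fail if the item exists
        if lock_id in self.items:
            return False
        self.items[lock_id] = owner
        return True

    def release(self, lock_id, owner):
        # Only the holder may delete its own lock record
        if self.items.get(lock_id) == owner:
            del self.items[lock_id]
            return True
        return False

table = LockTable()
table.acquire("my-app/terraform.tfstate", "alice")   # True
table.acquire("my-app/terraform.tfstate", "bob")     # False: alice holds it
```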


12. References

  1. AWS Auto Scaling Official Documentation
  2. EC2 Spot Instances Best Practices
  3. AWS Well-Architected Framework - Cost Optimization Pillar
  4. Terraform AWS Provider Documentation
  5. AWS CDK Developer Guide
  6. Karpenter Documentation
  7. AWS Compute Optimizer User Guide
  8. AWS Savings Plans User Guide
  9. AWS Fault Injection Simulator User Guide
  10. AWS Step Functions Developer Guide
  11. Amazon ECS on Fargate Best Practices
  12. AWS Lambda Pricing
  13. Spot Instance Advisor
  14. AWS CloudWatch User Guide
  15. Terraform Best Practices
  16. AWS re:Invent - Cost Optimization at Scale
  17. Karpenter vs Cluster Autoscaler Deep Dive