AWS Dynamic Infrastructure Complete Guide: Auto Scaling, Spot Instances, and IaC for 90% Cost Savings

1. Why Dynamic Infrastructure? (Static vs Dynamic)

The Fundamental Problem with Fixed Servers

Many companies still provision servers based on peak traffic. If traffic increases 10x during Black Friday, they run 10x the servers year-round. This means wasting money more than 80% of the time.

Let us look at real-world examples:

  • E-commerce Company A: Traffic spikes 5x only during peak hours (8-10 PM), so servers sized for the spike sit mostly idle the other 22 hours
  • Media Company B: Weekend traffic is 3x weekdays, so weekday servers sized for weekend load run ~66% idle
  • SaaS Company C: Batch processing surges at month-end, so capacity is over-provisioned for the other 27 days of the month
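
The waste in these patterns is simple arithmetic: when capacity is sized for peak, utilization is average load divided by peak load. A small sketch with illustrative numbers taken from Company A above (`idle_waste` is a hypothetical helper, not an AWS API):

```python
def idle_waste(peak_multiple: float, peak_hours: float, total_hours: float) -> float:
    """Fraction of provisioned capacity wasted when sized for peak.

    Assumes load is peak_multiple x baseline during peak hours and
    1x baseline the rest of the time (a simplification of the examples above).
    """
    provisioned = peak_multiple * total_hours                       # capacity-hours paid for
    used = peak_multiple * peak_hours + (total_hours - peak_hours)  # capacity-hours actually used
    return 1 - used / provisioned

# E-commerce Company A: 5x spike for 2 of 24 hours
print(f"{idle_waste(5, 2, 24):.0%}")  # -> 73%
```

Roughly three-quarters of what Company A pays for sits idle; that gap is exactly what dynamic infrastructure closes.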

Static infrastructure in these patterns causes the following problems:

| Problem | Impact | Waste Rate |
| --- | --- | --- |
| Peak-based provisioning | Excess resources most of the time | 60-80% |
| Manual scaling | Delayed response to traffic spikes | Downtime risk |
| Difficult infrastructure changes | Delayed feature deployment | Opportunity cost |
| Single points of failure | Service outage on server failure | Revenue loss |

Core Values of Dynamic Infrastructure

Dynamic infrastructure means an architecture where resources automatically scale up and down based on workload.

Three core principles:

  1. Elasticity: Scale out automatically as traffic rises, scale in as it falls
  2. Cost Optimization: Pay only for what you use
  3. High Availability: Automatic recovery on failure

The Three Pillars of AWS Cost Optimization

Cost Optimization = Right-sizing + Auto Scaling + Spot Instances
  • Right-sizing: Choosing the appropriate instance type for your workload (t3.medium might be sufficient instead of m5.xlarge)
  • Auto Scaling: Automatically adjusting instance count based on traffic
  • Spot Instances: Using spare capacity at up to 90% discount compared to On-Demand

Combining these three can reduce monthly cloud costs by 60-90%.
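
Because each lever cuts the cost left over by the previous one, the savings compound multiplicatively rather than add. A hedged sketch with illustrative percentages (the function name is mine, not an AWS API):

```python
def combined_savings(rightsizing: float, autoscaling: float, spot: float) -> float:
    """Total savings when each lever reduces the cost remaining
    after the previous one (all fractions in [0, 1])."""
    remaining = (1 - rightsizing) * (1 - autoscaling) * (1 - spot)
    return 1 - remaining

# Illustrative: 30% from right-sizing, 40% from scaling in off-peak,
# a 70% Spot discount on what remains
print(f"{combined_savings(0.30, 0.40, 0.70):.0%}")  # -> 87%
```

Even with conservative inputs the combined figure lands in the 60-90% range quoted above.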


2. Mastering EC2 Auto Scaling

Core Components of ASG

An Auto Scaling Group (ASG) consists of three core components:

  1. Launch Template: Defines what instances to create (AMI, instance type, security groups, key pair)
  2. Scaling Policy: Rules defining when and how to scale
  3. Health Check: Monitors instance health and replaces unhealthy instances

Writing a Launch Template

Here is an example of creating a Launch Template with AWS CLI:

{
  "LaunchTemplateName": "web-server-template",
  "LaunchTemplateData": {
    "ImageId": "ami-0abcdef1234567890",
    "InstanceType": "t3.medium",
    "KeyName": "my-key-pair",
    "SecurityGroupIds": ["sg-0123456789abcdef0"],
    "UserData": "IyEvYmluL2Jhc2gKeXVtIHVwZGF0ZSAteQp5dW0gaW5zdGFsbCAteSBodHRwZA==",
    "TagSpecifications": [
      {
        "ResourceType": "instance",
        "Tags": [
          {
            "Key": "Environment",
            "Value": "production"
          },
          {
            "Key": "Project",
            "Value": "web-app"
          }
        ]
      }
    ],
    "BlockDeviceMappings": [
      {
        "DeviceName": "/dev/xvda",
        "Ebs": {
          "VolumeSize": 30,
          "VolumeType": "gp3",
          "Encrypted": true
        }
      }
    ]
  }
}

The same Launch Template defined in Terraform:

resource "aws_launch_template" "web_server" {
  name_prefix   = "web-server-"
  image_id      = data.aws_ami.amazon_linux_2.id
  instance_type = "t3.medium"
  key_name      = "my-key-pair"

  vpc_security_group_ids = [aws_security_group.web.id]

  user_data = base64encode(<<-EOF
    #!/bin/bash
    yum update -y
    yum install -y httpd
    systemctl start httpd
    systemctl enable httpd
    echo "Hello from $(hostname)" > /var/www/html/index.html
  EOF
  )

  block_device_mappings {
    device_name = "/dev/xvda"
    ebs {
      volume_size = 30
      volume_type = "gp3"
      encrypted   = true
    }
  }

  tag_specifications {
    resource_type = "instance"
    tags = {
      Environment = "production"
      Project     = "web-app"
    }
  }

  lifecycle {
    create_before_destroy = true
  }
}

Four Scaling Policy Strategies

2-1. Simple Scaling

The most basic approach that links a single CloudWatch alarm to a single adjustment action.

resource "aws_autoscaling_policy" "scale_up" {
  name                   = "scale-up"
  autoscaling_group_name = aws_autoscaling_group.web.name
  adjustment_type        = "ChangeInCapacity"
  scaling_adjustment     = 2
  cooldown               = 300
}

resource "aws_cloudwatch_metric_alarm" "high_cpu" {
  alarm_name          = "high-cpu-alarm"
  comparison_operator = "GreaterThanThreshold"
  evaluation_periods  = 2
  metric_name         = "CPUUtilization"
  namespace           = "AWS/EC2"
  period              = 120
  statistic           = "Average"
  threshold           = 80

  dimensions = {
    AutoScalingGroupName = aws_autoscaling_group.web.name
  }

  alarm_actions = [aws_autoscaling_policy.scale_up.arn]
}

Limitation: No additional scaling during the cooldown period after a scaling action, making it slow to respond to sudden traffic surges.

2-2. Step Scaling

Performs different-sized adjustments based on alarm threshold ranges.

resource "aws_autoscaling_policy" "step_scaling" {
  name                   = "step-scaling-policy"
  autoscaling_group_name = aws_autoscaling_group.web.name
  policy_type            = "StepScaling"
  adjustment_type        = "ChangeInCapacity"

  step_adjustment {
    scaling_adjustment          = 1
    metric_interval_lower_bound = 0
    metric_interval_upper_bound = 20
  }

  step_adjustment {
    scaling_adjustment          = 3
    metric_interval_lower_bound = 20
    metric_interval_upper_bound = 40
  }

  step_adjustment {
    scaling_adjustment          = 5
    metric_interval_lower_bound = 40
  }
}

This policy adds 1, 3, or 5 instances depending on how far CPU exceeds the threshold.
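
The selection logic the policy encodes can be sketched in a few lines of Python (bounds are offsets above the alarm threshold, e.g. CPU 80%; this mirrors the Terraform above, it is not how AWS evaluates it internally):

```python
def step_adjustment(breach: float) -> int:
    """Instances to add for a given metric breach above the threshold,
    mirroring the three step_adjustment blocks above."""
    steps = [(0, 20, 1), (20, 40, 3), (40, float("inf"), 5)]
    for lower, upper, adjustment in steps:
        if lower <= breach < upper:
            return adjustment
    return 0  # at or below threshold: no scale-out

# CPU at 95% against an 80% threshold is a breach of 15
print(step_adjustment(15))  # -> 1
print(step_adjustment(45))  # -> 5
```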

2-3. Target Tracking Scaling

The recommended default for most workloads: the ASG automatically adds or removes instances to hold a chosen metric at its target value.

resource "aws_autoscaling_policy" "target_tracking" {
  name                   = "target-tracking-cpu"
  autoscaling_group_name = aws_autoscaling_group.web.name
  policy_type            = "TargetTrackingScaling"

  target_tracking_configuration {
    predefined_metric_specification {
      predefined_metric_type = "ASGAverageCPUUtilization"
    }
    target_value = 70.0
  }
}

# ALB request count based Target Tracking
resource "aws_autoscaling_policy" "target_tracking_alb" {
  name                   = "target-tracking-alb"
  autoscaling_group_name = aws_autoscaling_group.web.name
  policy_type            = "TargetTrackingScaling"

  target_tracking_configuration {
    predefined_metric_specification {
      predefined_metric_type = "ALBRequestCountPerTarget"
      resource_label         = "${aws_lb.web.arn_suffix}/${aws_lb_target_group.web.arn_suffix}"
    }
    target_value = 1000.0
  }
}

Why Target Tracking is recommended:

  • No need to manually manage alarms and policies
  • Automatically balances scale-in and scale-out
  • Multiple Target Tracking policies can be applied simultaneously

2-4. Predictive Scaling

An ML model analyzes the past 14 days of traffic patterns to predict future traffic and pre-provision instances.

resource "aws_autoscaling_policy" "predictive" {
  name                   = "predictive-scaling"
  autoscaling_group_name = aws_autoscaling_group.web.name
  policy_type            = "PredictiveScaling"

  predictive_scaling_configuration {
    metric_specification {
      target_value = 70

      predefined_scaling_metric_specification {
        predefined_metric_type = "ASGAverageCPUUtilization"
        resource_label         = ""
      }

      predefined_load_metric_specification {
        predefined_metric_type = "ASGTotalCPUUtilization"
        resource_label         = ""
      }
    }

    mode                          = "ForecastAndScale"
    scheduling_buffer_time        = 300
    max_capacity_breach_behavior  = "HonorMaxCapacity"
  }
}

Predictive Scaling use cases:

  • Daily traffic spikes at the same times (commute hours, lunch time)
  • Services with clear weekly/monthly repeating patterns
  • Combining with Target Tracking, which corrects reactively when the forecast misses

Cooldown Period and Warm-up Configuration

resource "aws_autoscaling_group" "web" {
  name                = "web-asg"
  desired_capacity    = 2
  max_size            = 20
  min_size            = 1
  vpc_zone_identifier = [aws_subnet.private_a.id, aws_subnet.private_c.id]

  launch_template {
    id      = aws_launch_template.web_server.id
    version = "$Latest"
  }

  # Default cooldown: wait time after a scaling action
  default_cooldown = 300

  # Instance warm-up: time for new instances to become fully ready
  default_instance_warmup = 120

  health_check_type         = "ELB"
  health_check_grace_period = 300

  tag {
    key                 = "Name"
    value               = "web-server"
    propagate_at_launch = true
  }
}

Difference between cooldown and warm-up:

  • Cooldown: Wait time between scaling actions (prevents excessive scaling)
  • Warm-up: Time for newly launched instances to become ready to serve traffic (excluded from ASG metrics)

Mixed Instances Policy (On-Demand + Spot Combined)

A key strategy for dramatically reducing costs while maintaining stability:

resource "aws_autoscaling_group" "web_mixed" {
  name                = "web-mixed-asg"
  desired_capacity    = 6
  max_size            = 30
  min_size            = 2
  vpc_zone_identifier = [
    aws_subnet.private_a.id,
    aws_subnet.private_b.id,
    aws_subnet.private_c.id
  ]

  mixed_instances_policy {
    launch_template {
      launch_template_specification {
        launch_template_id = aws_launch_template.web_server.id
        version            = "$Latest"
      }

      override {
        instance_type = "t3.medium"
      }
      override {
        instance_type = "t3a.medium"
      }
      override {
        instance_type = "m5.large"
      }
      override {
        instance_type = "m5a.large"
      }
    }

    instances_distribution {
      on_demand_base_capacity                  = 2
      on_demand_percentage_above_base_capacity = 20
      spot_allocation_strategy                 = "capacity-optimized"
      spot_max_price                           = ""  # Allow up to On-Demand price
    }
  }
}

What this configuration means:

  • Base 2 instances: Always On-Demand (stability guarantee)
  • 80% of additional instances: Spot Instances (cost savings)
  • 20% of additional instances: On-Demand (stability supplement)
  • Multiple instance types: Automatic fallback to alternative types when Spot capacity is unavailable
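
The On-Demand/Spot arithmetic behind `instances_distribution` can be approximated like this (the rounding is an assumption on my part; AWS's exact rounding at small counts may differ):

```python
import math

def capacity_split(desired: int, od_base: int, od_pct_above_base: int) -> tuple:
    """Approximate On-Demand vs Spot instance counts for a mixed-instances ASG."""
    above_base = max(desired - od_base, 0)
    on_demand = od_base + math.ceil(above_base * od_pct_above_base / 100)
    return on_demand, desired - on_demand

# desired_capacity = 6, on_demand_base_capacity = 2, 20% On-Demand above base
print(capacity_split(6, 2, 20))  # -> (3, 3)
```

At the configuration above, 6 desired instances split into 3 On-Demand (2 base + 1) and 3 Spot; as the group scales toward 30, the Spot share approaches the configured 80%.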

Lifecycle Hooks

You can perform custom actions when instances launch or terminate:

resource "aws_autoscaling_lifecycle_hook" "launch_hook" {
  name                   = "launch-setup-hook"
  autoscaling_group_name = aws_autoscaling_group.web.name
  lifecycle_transition   = "autoscaling:EC2_INSTANCE_LAUNCHING"
  heartbeat_timeout      = 600
  default_result         = "CONTINUE"

  notification_target_arn = aws_sns_topic.asg_notifications.arn
  role_arn                = aws_iam_role.asg_hook_role.arn
}

resource "aws_autoscaling_lifecycle_hook" "terminate_hook" {
  name                   = "terminate-cleanup-hook"
  autoscaling_group_name = aws_autoscaling_group.web.name
  lifecycle_transition   = "autoscaling:EC2_INSTANCE_TERMINATING"
  heartbeat_timeout      = 300
  default_result         = "CONTINUE"

  notification_target_arn = aws_sns_topic.asg_notifications.arn
  role_arn                = aws_iam_role.asg_hook_role.arn
}

Lifecycle Hook use cases:

  • Launch Hook: Verify initial configuration completion with configuration management tools (Ansible, Chef)
  • Terminate Hook: Log backup, connection draining, service discovery deregistration before termination

3. Spot Instance Master Class

What Are Spot Instances?

Spot Instances provide unused EC2 capacity at up to 90% discount compared to On-Demand pricing.

Price comparison (m5.xlarge, us-east-1):

| Purchase Option | Hourly Price | Monthly Cost (730 hrs) | Discount |
| --- | --- | --- | --- |
| On-Demand | $0.192 | $140.16 | - |
| Reserved (1yr, all upfront) | $0.120 | $87.60 | 37% |
| Savings Plan (1yr) | $0.125 | $91.25 | 35% |
| Spot (average) | $0.058 | $42.34 | 70% |
| Spot (lowest) | $0.019 | $13.87 | 90% |
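
The monthly figures follow directly from the hourly rates (730 hours/month is AWS's standard approximation); a quick check:

```python
HOURS_PER_MONTH = 730
ON_DEMAND_HOURLY = 0.192  # m5.xlarge, us-east-1, from the table above

def monthly_cost(hourly: float) -> float:
    return hourly * HOURS_PER_MONTH

def discount(hourly: float) -> float:
    return 1 - hourly / ON_DEMAND_HOURLY

# Spot average row
print(f"${monthly_cost(0.058):.2f}/month, {discount(0.058):.0%} off")  # -> $42.34/month, 70% off
```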

Spot Price History Analysis

You can query Spot price history using the AWS CLI:

aws ec2 describe-spot-price-history \
  --instance-types m5.xlarge m5a.xlarge m5d.xlarge \
  --product-descriptions "Linux/UNIX" \
  --start-time "2026-03-16T00:00:00" \
  --end-time "2026-03-23T00:00:00" \
  --query 'SpotPriceHistory[*].[InstanceType,AvailabilityZone,SpotPrice,Timestamp]' \
  --output table

Strategies to increase Spot price stability:

  • Specify multiple instance types (m5.xlarge, m5a.xlarge, m5d.xlarge, m5n.xlarge)
  • Use multiple Availability Zones (AZs)
  • Use the capacity-optimized allocation strategy (allocates from pools with most spare capacity)

Spot Interruption Handling

Spot Instances can be reclaimed by AWS with a 2-minute warning when AWS needs the capacity back.

Detecting Interruptions via Metadata Polling

#!/bin/bash
# spot-interruption-handler.sh
# Requires the AWS CLI, plus TARGET_GROUP_ARN exported in the environment
# (used below when deregistering from the ALB).

# IMDSv2 session token (TTL 6 hours; re-fetch if the script outlives it)
METADATA_TOKEN=$(curl -s -X PUT \
  "http://169.254.169.254/latest/api/token" \
  -H "X-aws-ec2-metadata-token-ttl-seconds: 21600")

while true; do
  INTERRUPTION=$(curl -s -H "X-aws-ec2-metadata-token: $METADATA_TOKEN" \
    http://169.254.169.254/latest/meta-data/spot/instance-action 2>/dev/null)

  if [ "$INTERRUPTION" != "" ] && echo "$INTERRUPTION" | grep -q "action"; then
    echo "Spot interruption detected! Starting graceful shutdown..."

    # 1. Deregister from ALB (stop receiving new requests)
    INSTANCE_ID=$(curl -s -H "X-aws-ec2-metadata-token: $METADATA_TOKEN" \
      http://169.254.169.254/latest/meta-data/instance-id)

    aws elbv2 deregister-targets \
      --target-group-arn "$TARGET_GROUP_ARN" \
      --targets "Id=$INSTANCE_ID"

    # 2. Wait for in-flight requests to complete (Connection Draining)
    sleep 30

    # 3. Backup logs
    aws s3 sync /var/log/app/ "s3://my-logs-bucket/spot-terminated/$INSTANCE_ID/"

    # 4. Graceful application shutdown
    systemctl stop my-app

    echo "Graceful shutdown completed."
    break
  fi

  sleep 5
done

Lambda-based Interruption Handler

Process Spot interruption events via EventBridge rules with Lambda:

import json
import boto3

ec2 = boto3.client('ec2')
elbv2 = boto3.client('elbv2')
sns = boto3.client('sns')
asg = boto3.client('autoscaling')

def lambda_handler(event, context):
    """
    Receives and processes Spot Interruption Warning events from EventBridge
    """
    detail = event.get('detail', {})
    instance_id = detail.get('instance-id')
    action = detail.get('instance-action')

    print(f"Spot interruption: instance={instance_id}, action={action}")

    # 1. Mark instance as unhealthy in ASG (triggers immediate replacement)
    try:
        asg.set_instance_health(
            InstanceId=instance_id,
            HealthStatus='Unhealthy',
            ShouldRespectGracePeriod=False
        )
        print(f"Marked {instance_id} as unhealthy in ASG")
    except Exception as e:
        print(f"ASG health update failed: {e}")

    # 2. Send SNS notification
    sns.publish(
        TopicArn='arn:aws:sns:us-east-1:123456789012:spot-alerts',
        Subject=f'Spot Interruption: {instance_id}',
        Message=json.dumps({
            'instance_id': instance_id,
            'action': action,
            'region': event.get('region'),
            'time': event.get('time')
        }, indent=2)
    )

    return {
        'statusCode': 200,
        'body': f'Handled interruption for {instance_id}'
    }

EventBridge rule configuration (Terraform):

resource "aws_cloudwatch_event_rule" "spot_interruption" {
  name        = "spot-interruption-rule"
  description = "Capture EC2 Spot Instance Interruption Warning"

  event_pattern = jsonencode({
    source      = ["aws.ec2"]
    detail-type = ["EC2 Spot Instance Interruption Warning"]
  })
}

resource "aws_cloudwatch_event_target" "spot_handler_lambda" {
  rule      = aws_cloudwatch_event_rule.spot_interruption.name
  target_id = "spot-interruption-handler"
  arn       = aws_lambda_function.spot_handler.arn
}

Spot Fleet: Multiple Instance Types + Multiple AZs

resource "aws_spot_fleet_request" "batch_processing" {
  iam_fleet_role                      = aws_iam_role.spot_fleet_role.arn
  target_capacity                     = 10
  terminate_instances_with_expiration = true
  allocation_strategy                 = "capacityOptimized"
  fleet_type                          = "maintain"

  launch_template_config {
    launch_template_specification {
      id      = aws_launch_template.batch.id
      version = "$Latest"
    }

    overrides {
      instance_type     = "c5.xlarge"
      availability_zone = "us-east-1a"
    }
    overrides {
      instance_type     = "c5a.xlarge"
      availability_zone = "us-east-1a"
    }
    overrides {
      instance_type     = "c5.xlarge"
      availability_zone = "us-east-1b"
    }
    overrides {
      instance_type     = "c5a.xlarge"
      availability_zone = "us-east-1b"
    }
    overrides {
      instance_type     = "c6i.xlarge"
      availability_zone = "us-east-1a"
    }
    overrides {
      instance_type     = "c6i.xlarge"
      availability_zone = "us-east-1b"
    }
  }
}

Suitable and Unsuitable Workloads for Spot

Suitable workloads:

  • CI/CD pipelines (builds, test runners)
  • Batch data processing (ETL, log analysis)
  • ML model training (with checkpoint support)
  • Load testing / performance testing
  • Big data processing (EMR, Spark)
  • Image/video encoding
  • Web servers (when using ASG Mixed Instances Policy)

Unsuitable workloads:

  • Single-instance databases (use RDS Multi-AZ instead)
  • Real-time payment systems (cannot tolerate interruptions)
  • Workloads requiring long-running stateful processes
  • Mission-critical services requiring 99.99%+ SLA (use On-Demand or Reserved)

4. Serverless Dynamic Infrastructure

Lambda: Event-Driven Auto Scaling

AWS Lambda is the extreme form of dynamic infrastructure. Zero cost when there are no requests, and millisecond-level resource allocation when requests arrive.

import json
import time

def lambda_handler(event, context):
    """
    Lambda function invoked from API Gateway
    Concurrent execution: auto-scales from 0 to thousands
    """
    start = time.time()

    # Business logic
    body = event.get('body', '{}')
    data = json.loads(body) if body else {}

    result = process_request(data)

    duration = (time.time() - start) * 1000
    print(f"Processing took {duration:.2f}ms")

    return {
        'statusCode': 200,
        'headers': {
            'Content-Type': 'application/json',
            'X-Processing-Time': f'{duration:.2f}ms'
        },
        'body': json.dumps(result)
    }

def process_request(data):
    # Actual business logic
    return {'status': 'success', 'data': data}

Lambda Concurrency Management

# Reserved Concurrency: dedicates concurrent execution capacity to this
# function. It is set on the function resource itself, not via a separate
# resource type.
resource "aws_lambda_function" "api" {
  # ... function_name, runtime, handler, role, etc. ...
  reserved_concurrent_executions = 100
}

# Provisioned Concurrency: keeps instances warm (eliminates Cold Start)
resource "aws_lambda_provisioned_concurrency_config" "api" {
  function_name                     = aws_lambda_function.api.function_name
  provisioned_concurrent_executions = 50
  qualifier                         = aws_lambda_alias.live.name
}

Concurrency type comparison:

  • Reserved Concurrency: Guarantees capacity for this function, preventing other functions from using it. No additional cost.
  • Provisioned Concurrency: Keeps execution environments warm. Eliminates Cold Start. Incurs additional cost.
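
To size Provisioned Concurrency, Little's law gives a reasonable first estimate: required concurrency ≈ request rate × average duration. A sketch (the 20% headroom is an assumed buffer, not an AWS recommendation):

```python
import math

def required_concurrency(requests_per_second: float, avg_duration_ms: float,
                         headroom: float = 1.2) -> int:
    """Little's law estimate of concurrent executions needed, rounded up."""
    in_flight = requests_per_second * avg_duration_ms / 1000
    return math.ceil(in_flight * headroom)

# 200 req/s at 250 ms average: ~50 in flight, 60 with buffer
print(required_concurrency(200, 250))  # -> 60
```

Setting `provisioned_concurrent_executions` near this estimate keeps steady-state traffic on warm environments while bursts spill over to on-demand (cold) capacity.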

Cold Start Minimization Strategies

  1. Use Provisioned Concurrency (most reliable method)
  2. Minimize package size (optimize dependencies, use Layers)
  3. Runtime selection: Python/Node.js have faster Cold Starts than Java
  4. Optimize initialization code: Initialize DB connections outside the handler
  5. Use SnapStart (Java runtime; reduces Cold Start by up to ~90%)
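
Strategy 4 in practice: anything created at module scope runs once per execution environment and is reused by warm invocations. A minimal sketch (`create_pool` and `DB_URL` are hypothetical stand-ins for your real driver and configuration):

```python
import os

def create_pool(url: str) -> dict:
    # Placeholder for a real connection pool (e.g. a psycopg pool)
    return {"url": url, "connected": True}

# Module scope: executed once per cold start, reused on warm invocations
DB_POOL = create_pool(os.environ.get("DB_URL", "postgres://localhost/dev"))

def lambda_handler(event, context):
    # The handler reuses the warm pool instead of reconnecting per request
    return {"statusCode": 200, "pool": DB_POOL["url"]}
```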

Fargate: Serverless Containers

resource "aws_ecs_service" "api" {
  name            = "api-service"
  cluster         = aws_ecs_cluster.main.id
  task_definition = aws_ecs_task_definition.api.arn
  desired_count   = 2
  launch_type     = "FARGATE"

  network_configuration {
    subnets          = [aws_subnet.private_a.id, aws_subnet.private_c.id]
    security_groups  = [aws_security_group.ecs.id]
    assign_public_ip = false
  }

  load_balancer {
    target_group_arn = aws_lb_target_group.api.arn
    container_name   = "api"
    container_port   = 8080
  }

  capacity_provider_strategy {
    capacity_provider = "FARGATE"
    weight            = 1
    base              = 2  # Minimum 2 on Fargate On-Demand
  }

  capacity_provider_strategy {
    capacity_provider = "FARGATE_SPOT"
    weight            = 4  # 80% of additional capacity on Fargate Spot
  }
}

# Fargate Auto Scaling
resource "aws_appautoscaling_target" "ecs" {
  max_capacity       = 20
  min_capacity       = 2
  resource_id        = "service/${aws_ecs_cluster.main.name}/${aws_ecs_service.api.name}"
  scalable_dimension = "ecs:service:DesiredCount"
  service_namespace  = "ecs"
}

resource "aws_appautoscaling_policy" "ecs_cpu" {
  name               = "ecs-cpu-scaling"
  policy_type        = "TargetTrackingScaling"
  resource_id        = aws_appautoscaling_target.ecs.resource_id
  scalable_dimension = aws_appautoscaling_target.ecs.scalable_dimension
  service_namespace  = aws_appautoscaling_target.ecs.service_namespace

  target_tracking_scaling_policy_configuration {
    predefined_metric_specification {
      predefined_metric_type = "ECSServiceAverageCPUUtilization"
    }
    target_value       = 70.0
    scale_in_cooldown  = 300
    scale_out_cooldown = 60
  }
}

Lambda vs Fargate vs EC2 Comparison

| Aspect | Lambda | Fargate | EC2 (ASG) |
| --- | --- | --- | --- |
| Scaling speed | Milliseconds | 30s-2min | 2-5min |
| Max execution time | 15 minutes | Unlimited | Unlimited |
| Memory | 128MB-10GB | 512MB-120GB | Depends on instance type |
| vCPU | Up to 6 | Up to 16 | Depends on instance type |
| Cost model | Request count + execution time | vCPU + memory hours | Instance hours |
| Cold Start | Yes | Yes (longer) | No (already running) |
| Management overhead | Minimal | Medium | High |
| Container support | Image deployment available | Native | Manage Docker directly |
| Spot support | N/A | Fargate Spot (70%) | Spot Instance (90%) |
| Best for | Event processing, APIs | Microservices, web apps | High-performance, stateful |
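
The table distills into a rough first-pass decision helper (the thresholds mirror the rows above; real decisions involve more dimensions such as cost model, memory needs, and team maturity):

```python
def pick_compute(max_runtime_min: float, stateful_or_special_hw: bool,
                 cold_start_ok: bool) -> str:
    """Rough first-pass choice based on the comparison table above."""
    if stateful_or_special_hw:
        return "EC2 (ASG)"          # long-lived state, GPUs, custom kernels
    if max_runtime_min <= 15 and cold_start_ok:
        return "Lambda"             # short, event-driven, spiky traffic
    return "Fargate"                # containers without instance management

print(pick_compute(5, False, True))   # -> Lambda
print(pick_compute(60, False, True))  # -> Fargate
```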

5. Building Dynamic Infrastructure with Terraform

Complete ASG + ALB Infrastructure Example

# provider.tf
terraform {
  required_version = ">= 1.5.0"

  required_providers {
    aws = {
      source  = "hashicorp/aws"
      version = "~> 5.0"
    }
  }

  backend "s3" {
    bucket         = "my-terraform-state-bucket"
    key            = "web-app/terraform.tfstate"
    region         = "us-east-1"
    dynamodb_table = "terraform-lock"
    encrypt        = true
  }
}

provider "aws" {
  region = var.aws_region

  default_tags {
    tags = {
      Environment = var.environment
      ManagedBy   = "terraform"
      Project     = var.project_name
    }
  }
}

# variables.tf
variable "aws_region" {
  default = "us-east-1"
}

variable "environment" {
  default = "production"
}

variable "project_name" {
  default = "web-app"
}

variable "vpc_cidr" {
  default = "10.0.0.0/16"
}
# vpc.tf
resource "aws_vpc" "main" {
  cidr_block           = var.vpc_cidr
  enable_dns_hostnames = true
  enable_dns_support   = true

  tags = {
    Name = "${var.project_name}-vpc"
  }
}

resource "aws_subnet" "public_a" {
  vpc_id                  = aws_vpc.main.id
  cidr_block              = "10.0.1.0/24"
  availability_zone       = "${var.aws_region}a"
  map_public_ip_on_launch = true

  tags = {
    Name = "${var.project_name}-public-a"
    Tier = "public"
  }
}

resource "aws_subnet" "public_b" {
  vpc_id                  = aws_vpc.main.id
  cidr_block              = "10.0.2.0/24"
  availability_zone       = "${var.aws_region}b"
  map_public_ip_on_launch = true

  tags = {
    Name = "${var.project_name}-public-b"
    Tier = "public"
  }
}

resource "aws_subnet" "private_a" {
  vpc_id            = aws_vpc.main.id
  cidr_block        = "10.0.10.0/24"
  availability_zone = "${var.aws_region}a"

  tags = {
    Name = "${var.project_name}-private-a"
    Tier = "private"
  }
}

resource "aws_subnet" "private_b" {
  vpc_id            = aws_vpc.main.id
  cidr_block        = "10.0.11.0/24"
  availability_zone = "${var.aws_region}b"

  tags = {
    Name = "${var.project_name}-private-b"
    Tier = "private"
  }
}
# alb.tf
resource "aws_lb" "web" {
  name               = "${var.project_name}-alb"
  internal           = false
  load_balancer_type = "application"
  security_groups    = [aws_security_group.alb.id]
  subnets            = [aws_subnet.public_a.id, aws_subnet.public_b.id]

  enable_deletion_protection = true

  access_logs {
    bucket  = aws_s3_bucket.alb_logs.bucket
    prefix  = "alb-logs"
    enabled = true
  }
}

resource "aws_lb_target_group" "web" {
  name     = "${var.project_name}-tg"
  port     = 80
  protocol = "HTTP"
  vpc_id   = aws_vpc.main.id

  health_check {
    enabled             = true
    healthy_threshold   = 2
    interval            = 15
    matcher             = "200"
    path                = "/health"
    port                = "traffic-port"
    protocol            = "HTTP"
    timeout             = 5
    unhealthy_threshold = 3
  }

  deregistration_delay = 30

  stickiness {
    type    = "lb_cookie"
    enabled = false
  }
}

resource "aws_lb_listener" "https" {
  load_balancer_arn = aws_lb.web.arn
  port              = "443"
  protocol          = "HTTPS"
  ssl_policy        = "ELBSecurityPolicy-TLS13-1-2-2021-06"
  certificate_arn   = aws_acm_certificate.main.arn

  default_action {
    type             = "forward"
    target_group_arn = aws_lb_target_group.web.arn
  }
}
# asg.tf - Complete Mixed Instances ASG
resource "aws_autoscaling_group" "web" {
  name                = "${var.project_name}-asg"
  desired_capacity    = 4
  max_size            = 30
  min_size            = 2
  vpc_zone_identifier = [aws_subnet.private_a.id, aws_subnet.private_b.id]

  target_group_arns         = [aws_lb_target_group.web.arn]
  health_check_type         = "ELB"
  health_check_grace_period = 300
  default_instance_warmup   = 120

  mixed_instances_policy {
    launch_template {
      launch_template_specification {
        launch_template_id = aws_launch_template.web_server.id
        version            = "$Latest"
      }

      override {
        instance_type     = "t3.medium"
        weighted_capacity = "1"
      }
      override {
        instance_type     = "t3a.medium"
        weighted_capacity = "1"
      }
      override {
        instance_type     = "m5.large"
        weighted_capacity = "2"
      }
      override {
        instance_type     = "m5a.large"
        weighted_capacity = "2"
      }
    }

    instances_distribution {
      on_demand_base_capacity                  = 2
      on_demand_percentage_above_base_capacity = 20
      spot_allocation_strategy                 = "capacity-optimized"
    }
  }

  instance_refresh {
    strategy = "Rolling"
    preferences {
      min_healthy_percentage = 80
      instance_warmup        = 120
    }
  }

  tag {
    key                 = "Name"
    value               = "${var.project_name}-web"
    propagate_at_launch = true
  }
}

# Target Tracking Scaling
resource "aws_autoscaling_policy" "cpu_target" {
  name                   = "cpu-target-tracking"
  autoscaling_group_name = aws_autoscaling_group.web.name
  policy_type            = "TargetTrackingScaling"

  target_tracking_configuration {
    predefined_metric_specification {
      predefined_metric_type = "ASGAverageCPUUtilization"
    }
    target_value = 70.0
  }
}

# Predictive Scaling
resource "aws_autoscaling_policy" "predictive" {
  name                   = "predictive-scaling"
  autoscaling_group_name = aws_autoscaling_group.web.name
  policy_type            = "PredictiveScaling"

  predictive_scaling_configuration {
    metric_specification {
      target_value = 70

      predefined_scaling_metric_specification {
        predefined_metric_type = "ASGAverageCPUUtilization"
        resource_label         = ""
      }

      predefined_load_metric_specification {
        predefined_metric_type = "ASGTotalCPUUtilization"
        resource_label         = ""
      }
    }

    mode                   = "ForecastAndScale"
    scheduling_buffer_time = 300
  }
}

Terraform Modules for Reusable Infrastructure

# main.tf (root module) - calls the reusable module defined in modules/asg/
module "web_asg" {
  source = "./modules/asg"

  project_name       = "my-web-app"
  environment        = "production"
  vpc_id             = module.vpc.vpc_id
  private_subnet_ids = module.vpc.private_subnet_ids
  alb_target_group   = module.alb.target_group_arn

  instance_types   = ["t3.medium", "t3a.medium", "m5.large"]
  min_size         = 2
  max_size         = 30
  desired_capacity = 4

  spot_percentage = 80
  on_demand_base  = 2
  target_cpu      = 70

  tags = local.common_tags
}

State Management (S3 + DynamoDB)

# state-backend/main.tf (apply this locally first)
resource "aws_s3_bucket" "terraform_state" {
  bucket = "my-company-terraform-state"

  lifecycle {
    prevent_destroy = true
  }
}

resource "aws_s3_bucket_versioning" "terraform_state" {
  bucket = aws_s3_bucket.terraform_state.id

  versioning_configuration {
    status = "Enabled"
  }
}

resource "aws_s3_bucket_server_side_encryption_configuration" "terraform_state" {
  bucket = aws_s3_bucket.terraform_state.id

  rule {
    apply_server_side_encryption_by_default {
      sse_algorithm = "aws:kms"
    }
  }
}

resource "aws_dynamodb_table" "terraform_lock" {
  name         = "terraform-lock"
  billing_mode = "PAY_PER_REQUEST"
  hash_key     = "LockID"

  attribute {
    name = "LockID"
    type = "S"
  }
}

Terraform Workflow

# 1. Initialize
terraform init

# 2. Validate code
terraform validate

# 3. Review change plan
terraform plan -out=tfplan

# 4. Apply changes
terraform apply tfplan

# 5. Check state
terraform state list
terraform state show aws_autoscaling_group.web

# 6. Destroy infrastructure (dev environments)
terraform destroy

6. Building Dynamic Infrastructure with AWS CDK

CDK TypeScript Example: ASG + ALB + RDS

import * as cdk from 'aws-cdk-lib'
import * as ec2 from 'aws-cdk-lib/aws-ec2'
import * as elbv2 from 'aws-cdk-lib/aws-elasticloadbalancingv2'
import * as autoscaling from 'aws-cdk-lib/aws-autoscaling'
import * as rds from 'aws-cdk-lib/aws-rds'
import { Construct } from 'constructs'

export class WebAppStack extends cdk.Stack {
  constructor(scope: Construct, id: string, props?: cdk.StackProps) {
    super(scope, id, props)

    // VPC
    const vpc = new ec2.Vpc(this, 'WebVpc', {
      maxAzs: 3,
      natGateways: 2,
      subnetConfiguration: [
        {
          cidrMask: 24,
          name: 'Public',
          subnetType: ec2.SubnetType.PUBLIC,
        },
        {
          cidrMask: 24,
          name: 'Private',
          subnetType: ec2.SubnetType.PRIVATE_WITH_EGRESS,
        },
        {
          cidrMask: 24,
          name: 'Isolated',
          subnetType: ec2.SubnetType.PRIVATE_ISOLATED,
        },
      ],
    })

    // ALB
    const alb = new elbv2.ApplicationLoadBalancer(this, 'WebAlb', {
      vpc,
      internetFacing: true,
      vpcSubnets: { subnetType: ec2.SubnetType.PUBLIC },
    })

    // ASG with Mixed Instances
    const asg = new autoscaling.AutoScalingGroup(this, 'WebAsg', {
      vpc,
      vpcSubnets: { subnetType: ec2.SubnetType.PRIVATE_WITH_EGRESS },
      mixedInstancesPolicy: {
        instancesDistribution: {
          onDemandBaseCapacity: 2,
          onDemandPercentageAboveBaseCapacity: 20,
          spotAllocationStrategy: autoscaling.SpotAllocationStrategy.CAPACITY_OPTIMIZED,
        },
        launchTemplate: new ec2.LaunchTemplate(this, 'LaunchTemplate', {
          instanceType: ec2.InstanceType.of(ec2.InstanceClass.T3, ec2.InstanceSize.MEDIUM),
          machineImage: ec2.MachineImage.latestAmazonLinux2023(),
          // Build user data from joined lines so the shebang stays on the first line
          // (an indented template literal would prepend whitespace and break the script)
          userData: ec2.UserData.custom(
            [
              '#!/bin/bash',
              'yum update -y',
              'yum install -y docker',
              'systemctl start docker',
              'docker run -d -p 80:8080 my-web-app:latest',
            ].join('\n')
          ),
        }),
        launchTemplateOverrides: [
          { instanceType: new ec2.InstanceType('t3.medium') },
          { instanceType: new ec2.InstanceType('t3a.medium') },
          { instanceType: new ec2.InstanceType('m5.large') },
          { instanceType: new ec2.InstanceType('m5a.large') },
        ],
      },
      minCapacity: 2,
      maxCapacity: 30,
      healthCheck: autoscaling.HealthCheck.elb({
        grace: cdk.Duration.seconds(300),
      }),
    })

    // Target Tracking Scaling
    asg.scaleOnCpuUtilization('CpuScaling', {
      targetUtilizationPercent: 70,
      cooldown: cdk.Duration.seconds(300),
    })

    // ALB Listener
    const listener = alb.addListener('HttpsListener', {
      port: 443,
      certificates: [
        elbv2.ListenerCertificate.fromArn('arn:aws:acm:us-east-1:123456789012:certificate/abc-123'),
      ],
    })

    listener.addTargets('WebTarget', {
      port: 80,
      targets: [asg],
      healthCheck: {
        path: '/health',
        interval: cdk.Duration.seconds(15),
        healthyThresholdCount: 2,
        unhealthyThresholdCount: 3,
      },
      deregistrationDelay: cdk.Duration.seconds(30),
    })

    // Request-count scaling must be configured after the ASG is attached to the ALB
    asg.scaleOnRequestCount('RequestScaling', {
      targetRequestsPerMinute: 1000,
    })

    // Aurora PostgreSQL cluster (writer + reader spread across AZs for HA)
    const database = new rds.DatabaseCluster(this, 'Database', {
      engine: rds.DatabaseClusterEngine.auroraPostgres({
        version: rds.AuroraPostgresEngineVersion.VER_15_4,
      }),
      writer: rds.ClusterInstance.provisioned('Writer', {
        instanceType: ec2.InstanceType.of(ec2.InstanceClass.R6G, ec2.InstanceSize.LARGE),
      }),
      readers: [
        rds.ClusterInstance.provisioned('Reader1', {
          instanceType: ec2.InstanceType.of(ec2.InstanceClass.R6G, ec2.InstanceSize.LARGE),
        }),
      ],
      vpc,
      vpcSubnets: { subnetType: ec2.SubnetType.PRIVATE_ISOLATED },
      storageEncrypted: true,
      deletionProtection: true,
    })

    // Allow ASG to access RDS
    database.connections.allowDefaultPortFrom(asg)

    // Outputs
    new cdk.CfnOutput(this, 'AlbDnsName', {
      value: alb.loadBalancerDnsName,
      description: 'ALB DNS Name',
    })
  }
}

CDK Constructs: L1 vs L2 vs L3

| Level | Name | Description | Example |
| --- | --- | --- | --- |
| L1 | Cfn Resources | 1:1 mapping to CloudFormation resources | CfnInstance, CfnVPC |
| L2 | Curated | Reasonable defaults + helper methods | ec2.Vpc, lambda.Function |
| L3 | Patterns | Architecture patterns combining multiple resources | ecs_patterns.ApplicationLoadBalancedFargateService |

Using L3 patterns lets you define complex infrastructure in just a few lines:

import * as ecs_patterns from 'aws-cdk-lib/aws-ecs-patterns'

// L3 Pattern: ALB + Fargate service in one go
const fargateService = new ecs_patterns.ApplicationLoadBalancedFargateService(
  this,
  'FargateService',
  {
    cluster,
    desiredCount: 2,
    taskImageOptions: {
      image: ecs.ContainerImage.fromRegistry('my-app:latest'),
      containerPort: 8080,
    },
    publicLoadBalancer: true,
    capacityProviderStrategies: [
      { capacityProvider: 'FARGATE', weight: 1, base: 2 },
      { capacityProvider: 'FARGATE_SPOT', weight: 4 },
    ],
  }
)

fargateService.targetGroup.configureHealthCheck({
  path: '/health',
})

CDK vs Terraform vs CloudFormation Comparison

| Aspect | CDK | Terraform | CloudFormation |
| --- | --- | --- | --- |
| Language | TypeScript, Python, Java, Go | HCL | YAML/JSON |
| Learning curve | Medium (leverages programming languages) | Medium (learn HCL) | Low (declarative YAML) |
| Abstraction level | High (L3 patterns) | Medium (modules) | Low (resource-level) |
| Multi-cloud | AWS only | Multi-cloud | AWS only |
| State management | CloudFormation stacks | S3 + DynamoDB | Auto-managed |
| Testing | Unit tests possible | Terratest | Limited |
| Drift detection | Via CloudFormation | terraform plan | Supported |
| Ecosystem | Construct Hub | Terraform Registry | Limited |

CDK Pipeline: Infrastructure Deployment via CI/CD

import { CodePipeline, CodePipelineSource, ManualApprovalStep, ShellStep } from 'aws-cdk-lib/pipelines'

const pipeline = new CodePipeline(this, 'Pipeline', {
  pipelineName: 'WebAppPipeline',
  synth: new ShellStep('Synth', {
    input: CodePipelineSource.gitHub('my-org/my-repo', 'main'),
    commands: ['npm ci', 'npm run build', 'npx cdk synth'],
  }),
})

// Deploy to staging
pipeline.addStage(
  new WebAppStage(this, 'Staging', {
    env: { account: '123456789012', region: 'us-east-1' },
  })
)

// Deploy to production (with manual approval)
pipeline.addStage(
  new WebAppStage(this, 'Production', {
    env: { account: '987654321098', region: 'us-east-1' },
  }),
  {
    pre: [new ManualApprovalStep('PromoteToProduction')],
  }
)

7. Practical Cost Optimization Strategies

Reserved Instances vs Savings Plans vs Spot Comparison

| Aspect | Reserved Instances | Savings Plans | Spot Instances |
| --- | --- | --- | --- |
| Discount | Up to 72% | Up to 72% | Up to 90% |
| Commitment | 1 or 3 years | 1 or 3 years | None |
| Flexibility | Fixed instance type/region | Flexible compute type | Interruptible |
| Upfront options | All/partial/none | All/partial/none | None |
| Best for | Predictable baseline workloads | Diverse compute usage | Interruptible workloads |
| Lambda support | No | Compute SP applicable | N/A |
| Fargate support | No | Compute SP applicable | Fargate Spot |
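To get a feel for how these discounts combine in a mixed fleet, here is a rough sketch of the blended monthly cost for a fleet split across purchase options. The rates and discount levels are illustrative placeholders, not AWS list prices:

```python
def blended_monthly_cost(on_demand_rate, hours, base_od, sp_discount,
                         spot_count, spot_discount):
    """Estimate monthly cost for base_od instances covered by a Savings
    Plan plus spot_count instances on Spot. All rates are illustrative."""
    sp_cost = base_od * on_demand_rate * (1 - sp_discount) * hours
    spot_cost = spot_count * on_demand_rate * (1 - spot_discount) * hours
    return round(sp_cost + spot_cost, 2)

# Illustrative: a t3.medium-like $0.0416/hr rate, 730 hrs/month,
# 2 instances under a 40% Savings Plan + 8 Spot at a 70% discount
cost = blended_monthly_cost(0.0416, 730, base_od=2, sp_discount=0.40,
                            spot_count=8, spot_discount=0.70)
```

Plugging in your actual On-Demand rate and observed Spot discount for the region gives a quick upper/lower bound before committing to a plan.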

Schedule-based Scaling

Apply to dev/staging environments or services with low off-hours traffic:

# Business hours (Mon-Fri 09:00-18:00): 4 instances
resource "aws_autoscaling_schedule" "scale_up_business_hours" {
  scheduled_action_name  = "scale-up-business"
  autoscaling_group_name = aws_autoscaling_group.web.name
  min_size               = 4
  max_size               = 30
  desired_capacity       = 4
  recurrence             = "0 9 * * 1-5"
  time_zone              = "America/New_York"
}

# Night (Mon-Fri after 18:00): 2 instances
resource "aws_autoscaling_schedule" "scale_down_night" {
  scheduled_action_name  = "scale-down-night"
  autoscaling_group_name = aws_autoscaling_group.web.name
  min_size               = 2
  max_size               = 10
  desired_capacity       = 2
  recurrence             = "0 18 * * 1-5"
  time_zone              = "America/New_York"
}

# Weekend: 1 instance
resource "aws_autoscaling_schedule" "scale_down_weekend" {
  scheduled_action_name  = "scale-down-weekend"
  autoscaling_group_name = aws_autoscaling_group.web.name
  min_size               = 1
  max_size               = 5
  desired_capacity       = 1
  recurrence             = "0 0 * * 6"
  time_zone              = "America/New_York"
}

Tag-based Cost Tracking

# Apply cost tracking tags to all resources
locals {
  cost_tags = {
    CostCenter  = "engineering"
    Team        = "platform"
    Project     = "web-app"
    Environment = var.environment
    ManagedBy   = "terraform"
  }
}

# Enable tag-based cost analysis in AWS Cost Explorer
resource "aws_ce_cost_category" "team_costs" {
  name         = "TeamCosts"
  rule_version = "CostCategoryExpression.v1"

  rule {
    value = "Platform"
    rule {
      tags {
        key           = "Team"
        values        = ["platform"]
        match_options = ["EQUALS"]
      }
    }
  }

  rule {
    value = "Backend"
    rule {
      tags {
        key           = "Team"
        values        = ["backend"]
        match_options = ["EQUALS"]
      }
    }
  }
}

Real-world Case Study: From $10,000 to $2,500 per Month

Before (Static Infrastructure):

  • EC2 m5.xlarge x 10 (On-Demand, 24/7) = $1,401/month
  • EC2 m5.2xlarge x 5 (Batch servers, 24/7) = $1,401/month
  • RDS db.r5.xlarge Multi-AZ = $1,020/month
  • NAT Gateway x 2 = $130/month
  • ALB = $50/month
  • Other (EBS, S3, CloudWatch) = $500/month
  • Total monthly cost: approximately $4,502

After (Dynamic Infrastructure):

| Change | Before | After | Savings |
| --- | --- | --- | --- |
| Web servers: 10 On-Demand to ASG (2 OD + Spot) | $1,401 | $420 | 70% |
| Batch servers to Spot Fleet (Spot only) | $1,401 | $140 | 90% |
| RDS with Savings Plan | $1,020 | $663 | 35% |
| NAT Gateway optimization | $130 | $65 | 50% |
| Schedule Scaling (night/weekend reduction) | - | Additional -30% | - |
| Total | $4,502 | Approx. $1,288 | 71% |
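The figures in the table above can be sanity-checked with a few lines of arithmetic (numbers copied straight from the table; the After total covers the four optimized line items):

```python
# (before, after) monthly cost in USD per optimized line item, from the table
rows = {
    "web":   (1401, 420),
    "batch": (1401, 140),
    "rds":   (1020, 663),
    "nat":   (130, 65),
}

# Per-row savings percentage
savings = {name: round((1 - after / before) * 100)
           for name, (before, after) in rows.items()}
# savings == {'web': 70, 'batch': 90, 'rds': 35, 'nat': 50}

after_total = sum(after for _, after in rows.values())  # 1288
overall = round((1 - after_total / 4502) * 100)         # 71, vs the $4,502 Before total
```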

8. High Availability Architecture

Multi-AZ Deployment Architecture

                    Route 53 (DNS Failover)
                           |
                    CloudFront (CDN)
                           |
                    ALB (Multi-AZ)
                    /              \
            AZ-a (us-east-1a)        AZ-b (us-east-1b)
            +------------------+    +------------------+
            | EC2 (ASG)        |    | EC2 (ASG)        |
            | - Web Server x2  |    | - Web Server x2  |
            |                  |    |                  |
            | RDS (Primary)    |    | RDS (Standby)    |
            | ElastiCache      |    | ElastiCache      |
            +------------------+    +------------------+

Route 53 Health Check + Failover

resource "aws_route53_health_check" "primary" {
  fqdn               = "app.example.com"
  port               = 443
  type               = "HTTPS"
  resource_path      = "/health"
  failure_threshold  = 3
  request_interval   = 10

  regions = ["us-east-1", "eu-west-1", "ap-southeast-1"]

  tags = {
    Name = "primary-health-check"
  }
}

resource "aws_route53_record" "primary" {
  zone_id = aws_route53_zone.main.zone_id
  name    = "app.example.com"
  type    = "A"

  alias {
    name                   = aws_lb.web.dns_name
    zone_id                = aws_lb.web.zone_id
    evaluate_target_health = true
  }

  failover_routing_policy {
    type = "PRIMARY"
  }

  health_check_id = aws_route53_health_check.primary.id
  set_identifier  = "primary"
}

Chaos Engineering: AWS Fault Injection Simulator

resource "aws_fis_experiment_template" "spot_interruption" {
  description = "Simulate Spot Instance interruptions"
  role_arn    = aws_iam_role.fis.arn

  stop_condition {
    source = "aws:cloudwatch:alarm"
    value  = aws_cloudwatch_metric_alarm.error_rate.arn
  }

  action {
    name      = "interrupt-spot-instances"
    action_id = "aws:ec2:send-spot-instance-interruptions"

    parameter {
      key   = "durationBeforeInterruption"
      value = "PT2M"
    }

    target {
      key   = "SpotInstances"
      value = "spot-instances-target"
    }
  }

  target {
    name           = "spot-instances-target"
    resource_type  = "aws:ec2:spot-instance"
    selection_mode = "COUNT(2)"

    resource_tag {
      key   = "Environment"
      value = "staging"
    }
  }
}

9. Three Patterns for Creating EC2 Instances On-Demand

9-1. EventBridge + Lambda to EC2 (Event-Driven)

A pattern that starts an EC2 instance when a file is uploaded to S3, processes it, and automatically terminates:

import boto3
import json

ec2 = boto3.client('ec2')

def lambda_handler(event, context):
    """
    S3 upload event -> Create EC2 instance -> Process -> Auto-terminate
    """
    # Extract file info from S3 event
    bucket = event['detail']['bucket']['name']
    key = event['detail']['object']['key']

    print(f"Processing file: s3://{bucket}/{key}")

    # Start EC2 instance
    user_data = f"""#!/bin/bash
set -e

# Perform work
aws s3 cp s3://{bucket}/{key} /tmp/input
python3 /opt/process.py /tmp/input /tmp/output
aws s3 cp /tmp/output s3://{bucket}-processed/{key}

# Self-terminate after completion (IMDSv2 requires a session token)
TOKEN=$(curl -s -X PUT http://169.254.169.254/latest/api/token -H "X-aws-ec2-metadata-token-ttl-seconds: 300")
INSTANCE_ID=$(curl -s -H "X-aws-ec2-metadata-token: $TOKEN" http://169.254.169.254/latest/meta-data/instance-id)
aws ec2 terminate-instances --instance-ids $INSTANCE_ID
"""

    response = ec2.run_instances(
        ImageId='ami-0abcdef1234567890',
        InstanceType='c5.xlarge',
        MinCount=1,
        MaxCount=1,
        IamInstanceProfile={
            'Name': 'ec2-processing-role'
        },
        UserData=user_data,
        TagSpecifications=[
            {
                'ResourceType': 'instance',
                'Tags': [
                    {'Key': 'Name', 'Value': f'processor-{key[:20]}'},
                    {'Key': 'Purpose', 'Value': 'batch-processing'},
                    {'Key': 'AutoTerminate', 'Value': 'true'}
                ]
            }
        ],
        InstanceMarketOptions={
            'MarketType': 'spot',
            'SpotOptions': {
                'SpotInstanceType': 'one-time',
                'InstanceInterruptionBehavior': 'terminate'
            }
        }
    )

    instance_id = response['Instances'][0]['InstanceId']
    print(f"Started processing instance: {instance_id}")

    return {
        'statusCode': 200,
        'body': json.dumps({
            'instance_id': instance_id,
            'file': f's3://{bucket}/{key}'
        })
    }

9-2. Step Functions Orchestration

Manage complex workflows with Step Functions:

{
  "Comment": "EC2-based batch processing workflow",
  "StartAt": "CreateInstance",
  "States": {
    "CreateInstance": {
      "Type": "Task",
      "Resource": "arn:aws:states:::aws-sdk:ec2:runInstances",
      "Parameters": {
        "ImageId": "ami-0abcdef1234567890",
        "InstanceType": "c5.2xlarge",
        "MinCount": 1,
        "MaxCount": 1,
        "IamInstanceProfile": {
          "Name": "batch-processing-role"
        },
        "TagSpecifications": [
          {
            "ResourceType": "instance",
            "Tags": [
              {
                "Key": "Purpose",
                "Value": "step-function-batch"
              }
            ]
          }
        ]
      },
      "ResultPath": "$.instanceInfo",
      "Next": "WaitForInstance"
    },
    "WaitForInstance": {
      "Type": "Wait",
      "Seconds": 60,
      "Next": "CheckInstanceStatus"
    },
    "CheckInstanceStatus": {
      "Type": "Task",
      "Resource": "arn:aws:states:::aws-sdk:ec2:describeInstanceStatus",
      "Parameters": {
        "InstanceIds.$": "States.Array($.instanceInfo.Instances[0].InstanceId)",
        "IncludeAllInstances": true
      },
      "ResultPath": "$.status",
      "Next": "IsInstanceReady"
    },
    "IsInstanceReady": {
      "Type": "Choice",
      "Choices": [
        {
          "Variable": "$.status.InstanceStatuses[0].InstanceState.Name",
          "StringEquals": "running",
          "Next": "RunProcessing"
        }
      ],
      "Default": "WaitForInstance"
    },
    "RunProcessing": {
      "Type": "Task",
      "Resource": "arn:aws:states:::aws-sdk:ssm:sendCommand",
      "Parameters": {
        "DocumentName": "AWS-RunShellScript",
        "InstanceIds.$": "States.Array($.instanceInfo.Instances[0].InstanceId)",
        "Parameters": {
          "commands": ["cd /opt/app && python3 process.py"]
        }
      },
      "ResultPath": "$.command",
      "Next": "WaitForCommand",
      "Catch": [
        {
          "ErrorEquals": ["States.ALL"],
          "Next": "TerminateInstance",
          "ResultPath": "$.error"
        }
      ]
    },
    "WaitForCommand": {
      "Type": "Wait",
      "Seconds": 30,
      "Next": "CheckCommandStatus"
    },
    "CheckCommandStatus": {
      "Type": "Task",
      "Resource": "arn:aws:states:::aws-sdk:ssm:getCommandInvocation",
      "Parameters": {
        "CommandId.$": "$.command.Command.CommandId",
        "InstanceId.$": "$.instanceInfo.Instances[0].InstanceId"
      },
      "ResultPath": "$.invocation",
      "Next": "IsCommandDone"
    },
    "IsCommandDone": {
      "Type": "Choice",
      "Choices": [
        {
          "Or": [
            { "Variable": "$.invocation.Status", "StringEquals": "Success" },
            { "Variable": "$.invocation.Status", "StringEquals": "Failed" },
            { "Variable": "$.invocation.Status", "StringEquals": "TimedOut" },
            { "Variable": "$.invocation.Status", "StringEquals": "Cancelled" }
          ],
          "Next": "TerminateInstance"
        }
      ],
      "Default": "WaitForCommand"
    },
    "TerminateInstance": {
      "Type": "Task",
      "Resource": "arn:aws:states:::aws-sdk:ec2:terminateInstances",
      "Parameters": {
        "InstanceIds.$": "States.Array($.instanceInfo.Instances[0].InstanceId)"
      },
      "End": true
    }
  }
}

9-3. Kubernetes Jobs + Karpenter

Karpenter is an AWS-optimized Kubernetes node autoscaler:

# karpenter-nodepool.yaml
apiVersion: karpenter.sh/v1
kind: NodePool
metadata:
  name: batch-processing
spec:
  template:
    spec:
      requirements:
        - key: karpenter.sh/capacity-type
          operator: In
          values: ['spot', 'on-demand']
        - key: node.kubernetes.io/instance-type
          operator: In
          values:
            - c5.xlarge
            - c5a.xlarge
            - c5.2xlarge
            - c6i.xlarge
            - c6i.2xlarge
            - m5.xlarge
            - m5a.xlarge
        - key: topology.kubernetes.io/zone
          operator: In
          values:
            - us-east-1a
            - us-east-1b
      nodeClassRef:
        group: karpenter.k8s.aws
        kind: EC2NodeClass
        name: default
  limits:
    cpu: '100'
    memory: 400Gi
  disruption:
    consolidationPolicy: WhenEmptyOrUnderutilized
    consolidateAfter: 30s
---
apiVersion: karpenter.k8s.aws/v1
kind: EC2NodeClass
metadata:
  name: default
spec:
  amiSelectorTerms:
    - alias: al2023@latest
  subnetSelectorTerms:
    - tags:
        Tier: private
  securityGroupSelectorTerms:
    - tags:
        kubernetes.io/cluster/my-cluster: owned
  instanceProfile: KarpenterNodeInstanceProfile
---
# batch-job.yaml
apiVersion: batch/v1
kind: Job
metadata:
  name: data-processing-job
spec:
  parallelism: 10
  completions: 100
  backoffLimit: 3
  template:
    metadata:
      labels:
        app: data-processor
    spec:
      containers:
        - name: processor
          image: my-registry/data-processor:v1.2
          resources:
            requests:
              cpu: '2'
              memory: '4Gi'
            limits:
              cpu: '4'
              memory: '8Gi'
          env:
            - name: BATCH_SIZE
              value: '1000'
      restartPolicy: OnFailure
      nodeSelector:
        karpenter.sh/capacity-type: spot
      tolerations:
        - key: 'karpenter.sh/disruption'
          operator: 'Exists'

Karpenter vs Cluster Autoscaler comparison:

| Aspect | Karpenter | Cluster Autoscaler |
| --- | --- | --- |
| Node provisioning speed | Seconds (direct EC2 API calls) | Minutes (via ASG) |
| Instance type selection | Automatic based on workload | Only types defined in ASG |
| Bin packing | Automatic optimization | Limited |
| Spot integration | Native support | ASG Mixed Instances |
| Scale down | Immediate (removes idle nodes after 30s) | Default 10-min wait |
| AWS dependency | AWS only | Multi-cloud |

10. Monitoring and Alarms

CloudWatch Alarms + SNS

# ASG-related alarms
resource "aws_cloudwatch_metric_alarm" "asg_high_cpu" {
  alarm_name          = "asg-high-cpu"
  comparison_operator = "GreaterThanThreshold"
  evaluation_periods  = 3
  metric_name         = "CPUUtilization"
  namespace           = "AWS/EC2"
  period              = 60
  statistic           = "Average"
  threshold           = 85
  alarm_description   = "ASG CPU utilization exceeded 85%"

  dimensions = {
    AutoScalingGroupName = aws_autoscaling_group.web.name
  }

  alarm_actions = [aws_sns_topic.alerts.arn]
  ok_actions    = [aws_sns_topic.alerts.arn]
}

# Spot Interruption count alarm
resource "aws_cloudwatch_metric_alarm" "spot_interruptions" {
  alarm_name          = "spot-interruption-count"
  comparison_operator = "GreaterThanThreshold"
  evaluation_periods  = 1
  metric_name         = "SpotInterruptionCount"
  namespace           = "Custom/SpotMetrics"
  period              = 300
  statistic           = "Sum"
  threshold           = 3
  alarm_description   = "More than 3 Spot interruptions in 5 minutes"

  alarm_actions = [aws_sns_topic.critical_alerts.arn]
}

# ALB 5xx error rate alarm
resource "aws_cloudwatch_metric_alarm" "alb_5xx" {
  alarm_name          = "alb-5xx-error-rate"
  comparison_operator = "GreaterThanThreshold"
  evaluation_periods  = 2
  threshold           = 5

  metric_query {
    id          = "error_rate"
    expression  = "(errors / requests) * 100"
    label       = "5xx Error Rate"
    return_data = true
  }

  metric_query {
    id = "errors"
    metric {
      metric_name = "HTTPCode_Target_5XX_Count"
      namespace   = "AWS/ApplicationELB"
      period      = 60
      stat        = "Sum"
      dimensions = {
        LoadBalancer = aws_lb.web.arn_suffix
      }
    }
  }

  metric_query {
    id = "requests"
    metric {
      metric_name = "RequestCount"
      namespace   = "AWS/ApplicationELB"
      period      = 60
      stat        = "Sum"
      dimensions = {
        LoadBalancer = aws_lb.web.arn_suffix
      }
    }
  }

  alarm_actions = [aws_sns_topic.critical_alerts.arn]
}
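The metric math in the alarm above is simply (errors / requests) * 100. As a quick sketch of the same computation, with a guard for the zero-traffic case where CloudWatch emits no datapoint at all:

```python
def error_rate_pct(errors_5xx, requests):
    """Mirror of the CloudWatch expression "(errors / requests) * 100".
    Returns None when there is no traffic, mimicking CloudWatch
    producing no datapoint rather than a divide-by-zero."""
    if requests == 0:
        return None
    return (errors_5xx / requests) * 100

# With the alarm threshold of 5 (%), 60 errors out of 1,000 requests fires it
rate = error_rate_pct(60, 1000)                # 6.0
alarm_fires = rate is not None and rate > 5    # True
```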

Grafana + Prometheus on EKS

Monitoring stack using Prometheus and Grafana in an EKS environment:

# prometheus-values.yaml (Helm)
prometheus:
  prometheusSpec:
    retention: 15d
    storageSpec:
      volumeClaimTemplate:
        spec:
          storageClassName: gp3
          accessModes: ['ReadWriteOnce']
          resources:
            requests:
              storage: 50Gi

    additionalScrapeConfigs:
      - job_name: karpenter
        kubernetes_sd_configs:
          - role: endpoints
            namespaces:
              names:
                - karpenter
        relabel_configs:
          - source_labels: [__meta_kubernetes_endpoint_port_name]
            regex: http-metrics
            action: keep

grafana:
  adminPassword: 'secure-password' # placeholder: inject via a Kubernetes Secret in practice
  persistence:
    enabled: true
    size: 10Gi
  dashboardProviders:
    dashboardproviders.yaml:
      apiVersion: 1
      providers:
        - name: default
          orgId: 1
          folder: ''
          type: file
          options:
            path: /var/lib/grafana/dashboards/default

11. Quiz

Q1: What is the main reason Target Tracking Scaling is recommended over Simple Scaling in Auto Scaling Groups?

Answer: Target Tracking automatically maintains a target metric value, balances scale-in and scale-out, and eliminates the need to manually manage separate CloudWatch alarms.

Simple Scaling cannot perform additional scaling during the cooldown period and requires manually defining scaling amounts. Target Tracking lets AWS automatically perform optimal scaling, reducing operational overhead and enabling more accurate scaling.
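For reference, creating a Target Tracking policy is a single API call. A hedged boto3 sketch of the request payload (the ASG name and policy name are placeholders):

```python
def target_tracking_policy(asg_name, target_cpu=70.0):
    """Build put_scaling_policy parameters for a CPU target-tracking
    policy; AWS then creates and manages the CloudWatch alarms itself."""
    return {
        "AutoScalingGroupName": asg_name,
        "PolicyName": "cpu-target-tracking",
        "PolicyType": "TargetTrackingScaling",
        "TargetTrackingConfiguration": {
            "PredefinedMetricSpecification": {
                "PredefinedMetricType": "ASGAverageCPUUtilization",
            },
            "TargetValue": target_cpu,
        },
    }

params = target_tracking_policy("web-asg")
# Requires AWS credentials; left commented in this sketch:
# boto3.client("autoscaling").put_scaling_policy(**params)
```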

Q2: In a Mixed Instances Policy with on_demand_base_capacity set to 2 and on_demand_percentage_above_base_capacity set to 20, if the ASG needs 12 total instances, what is the On-Demand to Spot ratio?

Answer: 4 On-Demand, 8 Spot

Calculation:

  • Base On-Demand: 2
  • Additional needed: 12 - 2 = 10
  • Additional On-Demand (20%): 10 x 0.2 = 2
  • Additional Spot (80%): 10 x 0.8 = 8
  • Total On-Demand: 2 + 2 = 4, Total Spot: 8
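The calculation above generalizes to any fleet size; a small helper captures the rule (here the fractional part is rounded up toward On-Demand, which is an assumption about ASG rounding worth verifying for your case):

```python
import math

def on_demand_spot_split(total, base_capacity, od_pct_above_base):
    """Split desired capacity per the Mixed Instances Policy rules:
    base_capacity is always On-Demand; above that, od_pct_above_base
    percent is On-Demand and the remainder is Spot."""
    above_base = max(total - base_capacity, 0)
    extra_od = math.ceil(above_base * od_pct_above_base / 100)
    on_demand = min(base_capacity, total) + extra_od
    return on_demand, total - on_demand

on_demand_spot_split(12, 2, 20)   # (4, 8), matching the answer above
```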

Q3: How many minutes before interruption does AWS send a warning for Spot Instances? What are two methods to detect this warning?

Answer: AWS sends a warning 2 minutes before interruption.

Two detection methods:

  1. EC2 Metadata Polling: Periodically check the http://169.254.169.254/latest/meta-data/spot/instance-action endpoint from within the instance
  2. EventBridge Rule: Receive EC2 Spot Instance Interruption Warning events through EventBridge and process them with Lambda or other targets
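Method 1 reduces to polling a single metadata URL and acting on the response. A sketch of just the decision logic (the draining step itself is application-specific, and the helper name is made up):

```python
import json

INSTANCE_ACTION_URL = "http://169.254.169.254/latest/meta-data/spot/instance-action"

def parse_instance_action(status_code, body):
    """The endpoint returns 404 while the instance is safe; once an
    interruption is scheduled it returns JSON like
    {"action": "terminate", "time": "2025-01-01T00:00:00Z"}."""
    if status_code == 404:
        return None  # no interruption scheduled
    return json.loads(body)

# In a real poller you would fetch INSTANCE_ACTION_URL (with an IMDSv2
# token) every few seconds and start draining when this returns a dict:
action = parse_instance_action(
    200, '{"action": "terminate", "time": "2025-01-01T00:00:00Z"}'
)
```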

Q4: What is the fundamental reason Karpenter provisions nodes faster than Cluster Autoscaler?

Answer: Karpenter calls the EC2 API directly to create nodes, while Cluster Autoscaler goes through ASG (Auto Scaling Group) to create nodes.

Cluster Autoscaler detects Pending Pods, modifies the ASG desired count, and then ASG creates instances according to the Launch Template, which is an indirect path. Karpenter analyzes workload requirements, selects the optimal instance type, and creates nodes directly via the EC2 Fleet API, so nodes are ready within seconds.

Q5: When managing Terraform State with S3 + DynamoDB, what is the role of DynamoDB?

Answer: DynamoDB handles State Locking.

When multiple team members run terraform apply simultaneously, the State file can conflict. DynamoDB creates lock records in the table to ensure only one person can modify the State at a time. This prevents race conditions and ensures infrastructure consistency.
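The lock acquisition is essentially a conditional write: create the lock item only if it does not already exist (DynamoDB enforces this with a condition such as attribute_not_exists(LockID)). A toy in-memory sketch of that compare-and-set rule:

```python
class LockTable:
    """In-memory stand-in for the DynamoDB lock table."""
    def __init__(self):
        self.items = {}

    def acquire(self, lock_id, owner):
        # Mirrors attribute_not_exists(LockID): fail if the item exists
        if lock_id in self.items:
            return False
        self.items[lock_id] = owner
        return True

    def release(self, lock_id, owner):
        # Only the holder may delete its own lock record
        if self.items.get(lock_id) == owner:
            del self.items[lock_id]
            return True
        return False

table = LockTable()
table.acquire("my-app/terraform.tfstate", "alice")   # True
table.acquire("my-app/terraform.tfstate", "bob")     # False: alice holds it
```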


12. References

  1. AWS Auto Scaling Official Documentation
  2. EC2 Spot Instances Best Practices
  3. AWS Well-Architected Framework - Cost Optimization Pillar
  4. Terraform AWS Provider Documentation
  5. AWS CDK Developer Guide
  6. Karpenter Documentation
  7. AWS Compute Optimizer User Guide
  8. AWS Savings Plans User Guide
  9. AWS Fault Injection Simulator User Guide
  10. AWS Step Functions Developer Guide
  11. Amazon ECS on Fargate Best Practices
  12. AWS Lambda Pricing
  13. Spot Instance Advisor
  14. AWS CloudWatch User Guide
  15. Terraform Best Practices
  16. AWS re:Invent - Cost Optimization at Scale
  17. Karpenter vs Cluster Autoscaler Deep Dive