AWS Dynamic Infrastructure Complete Guide: Auto Scaling, Spot Instances, and IaC for 90% Cost Savings
Author: Youngju Kim (@fjvbn20031)
- 1. Why Dynamic Infrastructure? (Static vs Dynamic)
- 2. Mastering EC2 Auto Scaling
- 3. Spot Instance Master Class
- 4. Serverless Dynamic Infrastructure
- 5. Building Dynamic Infrastructure with Terraform
- 6. Building Dynamic Infrastructure with AWS CDK
- 7. Cost Optimization Practical Strategies
- 8. High Availability Architecture
- 9. Three Patterns for Creating EC2 Instances On-Demand
- 10. Monitoring and Alarms
- 11. Quiz
- 12. References
1. Why Dynamic Infrastructure? (Static vs Dynamic)
The Fundamental Problem with Fixed Servers
Many companies still provision servers for peak traffic. If traffic spikes 10x during Black Friday, they run 10x the servers year-round, paying for capacity that sits idle more than 80% of the time.
Let us look at some real-world patterns:
- E-commerce Company A: Traffic spikes 5x only during peak hours (8-10 PM), so servers sit mostly idle the other 22 hours
- Media Company B: Weekend traffic is 3x weekday traffic, so weekday servers are about 66% idle
- SaaS Company C: Batch processing surges at month-end, so the fleet is over-provisioned for the other 27 days of the month
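The waste in these patterns is easy to quantify. A minimal sketch (illustrative numbers, not from any real bill): when a fleet is sized for peak, average utilization is the time-weighted mix of peak and off-peak load.

```python
def average_utilization(peak_hours: float, total_hours: float,
                        spike_factor: float) -> float:
    """Average utilization of a fleet sized for peak traffic.

    During peak_hours the fleet runs at 100%; the rest of the time
    traffic is 1/spike_factor of peak, so most capacity sits idle.
    """
    off_hours = total_hours - peak_hours
    used = peak_hours * 1.0 + off_hours * (1.0 / spike_factor)
    return used / total_hours

# Company A: 5x spike for 2 hours a day
util = average_utilization(peak_hours=2, total_hours=24, spike_factor=5)
print(f"average utilization: {util:.0%}")  # roughly 27%, i.e. ~73% waste
```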
Static infrastructure in these patterns causes the following problems:
| Problem | Impact | Waste Rate |
|---|---|---|
| Peak-based provisioning | Excess resources most of the time | 60-80% |
| Manual scaling | Delayed response to traffic spikes | Downtime risk |
| Difficult infrastructure changes | Delayed feature deployment | Opportunity cost |
| Single points of failure | Service outage on server failure | Revenue loss |
Core Values of Dynamic Infrastructure
Dynamic infrastructure means an architecture where resources automatically scale up and down based on workload.
Three core principles:
- Elasticity: Traffic increases lead to auto scale-out, traffic decreases lead to auto scale-in
- Cost Optimization: Pay only for what you use
- High Availability: Automatic recovery on failure
The Three Pillars of AWS Cost Optimization
Cost Optimization = Right-sizing + Auto Scaling + Spot Instances
- Right-sizing: Choosing the appropriate instance type for your workload (t3.medium might be sufficient instead of m5.xlarge)
- Auto Scaling: Automatically adjusting instance count based on traffic
- Spot Instances: Using spare capacity at up to 90% discount compared to On-Demand
Combining these three can reduce monthly cloud costs by 60-90%.
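As a back-of-the-envelope model (the factors below are illustrative assumptions, not AWS figures), the three levers multiply:

```python
def estimated_savings(rightsize_factor: float = 0.5,
                      avg_capacity_ratio: float = 0.5,
                      spot_share: float = 0.7,
                      spot_discount: float = 0.7) -> float:
    """Fraction of the monthly bill saved when the three levers combine.

    rightsize_factor: cost after right-sizing vs before (0.5 = halved)
    avg_capacity_ratio: average fleet size with Auto Scaling vs a fixed peak fleet
    spot_share: fraction of the fleet running on Spot
    spot_discount: Spot discount vs On-Demand
    """
    blended_rate = spot_share * (1 - spot_discount) + (1 - spot_share)
    remaining = rightsize_factor * avg_capacity_ratio * blended_rate
    return 1 - remaining

print(f"estimated savings: {estimated_savings():.0%}")  # ~87% with these assumptions
```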
2. Mastering EC2 Auto Scaling
Core Components of ASG
An Auto Scaling Group (ASG) consists of three core components:
- Launch Template: Defines what instances to create (AMI, instance type, security groups, key pair)
- Scaling Policy: Rules defining when and how to scale
- Health Check: Monitors instance health and replaces unhealthy instances
Writing a Launch Template
Here is an example of creating a Launch Template with AWS CLI:
{
"LaunchTemplateName": "web-server-template",
"LaunchTemplateData": {
"ImageId": "ami-0abcdef1234567890",
"InstanceType": "t3.medium",
"KeyName": "my-key-pair",
"SecurityGroupIds": ["sg-0123456789abcdef0"],
"UserData": "IyEvYmluL2Jhc2gKeXVtIHVwZGF0ZSAteQp5dW0gaW5zdGFsbCAteSBodHRwZA==",
"TagSpecifications": [
{
"ResourceType": "instance",
"Tags": [
{
"Key": "Environment",
"Value": "production"
},
{
"Key": "Project",
"Value": "web-app"
}
]
}
],
"BlockDeviceMappings": [
{
"DeviceName": "/dev/xvda",
"Ebs": {
"VolumeSize": 30,
"VolumeType": "gp3",
"Encrypted": true
}
}
]
}
}
The same Launch Template defined in Terraform:
resource "aws_launch_template" "web_server" {
name_prefix = "web-server-"
image_id = data.aws_ami.amazon_linux_2.id
instance_type = "t3.medium"
key_name = "my-key-pair"
vpc_security_group_ids = [aws_security_group.web.id]
user_data = base64encode(<<-EOF
#!/bin/bash
yum update -y
yum install -y httpd
systemctl start httpd
systemctl enable httpd
echo "Hello from $(hostname)" > /var/www/html/index.html
EOF
)
block_device_mappings {
device_name = "/dev/xvda"
ebs {
volume_size = 30
volume_type = "gp3"
encrypted = true
}
}
tag_specifications {
resource_type = "instance"
tags = {
Environment = "production"
Project = "web-app"
}
}
lifecycle {
create_before_destroy = true
}
}
Four Scaling Policy Strategies
2-1. Simple Scaling
The most basic approach that links a single CloudWatch alarm to a single adjustment action.
resource "aws_autoscaling_policy" "scale_up" {
name = "scale-up"
autoscaling_group_name = aws_autoscaling_group.web.name
adjustment_type = "ChangeInCapacity"
scaling_adjustment = 2
cooldown = 300
}
resource "aws_cloudwatch_metric_alarm" "high_cpu" {
alarm_name = "high-cpu-alarm"
comparison_operator = "GreaterThanThreshold"
evaluation_periods = 2
metric_name = "CPUUtilization"
namespace = "AWS/EC2"
period = 120
statistic = "Average"
threshold = 80
dimensions = {
AutoScalingGroupName = aws_autoscaling_group.web.name
}
alarm_actions = [aws_autoscaling_policy.scale_up.arn]
}
Limitation: No additional scaling during the cooldown period after a scaling action, making it slow to respond to sudden traffic surges.
2-2. Step Scaling
Performs different-sized adjustments based on alarm threshold ranges.
resource "aws_autoscaling_policy" "step_scaling" {
name = "step-scaling-policy"
autoscaling_group_name = aws_autoscaling_group.web.name
policy_type = "StepScaling"
adjustment_type = "ChangeInCapacity"
step_adjustment {
scaling_adjustment = 1
metric_interval_lower_bound = 0
metric_interval_upper_bound = 20
}
step_adjustment {
scaling_adjustment = 3
metric_interval_lower_bound = 20
metric_interval_upper_bound = 40
}
step_adjustment {
scaling_adjustment = 5
metric_interval_lower_bound = 40
}
}
This policy adds 1, 3, or 5 instances depending on how far CPU exceeds the threshold.
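In code form (assuming the CloudWatch alarm that drives this policy fires at 70% CPU, a value not shown above), the step bounds are offsets from the alarm threshold, not absolute CPU percentages:

```python
def step_scaling_adjustment(cpu: float, alarm_threshold: float = 70.0) -> int:
    """Instances to add, mirroring the step_adjustment blocks above.

    The metric_interval bounds are measured relative to the alarm
    threshold: a breach of 0-20 adds 1, 20-40 adds 3, 40+ adds 5.
    """
    breach = cpu - alarm_threshold
    if breach < 0:
        return 0  # alarm not breached, no scaling
    if breach < 20:
        return 1
    if breach < 40:
        return 3
    return 5

print(step_scaling_adjustment(95.0))  # breach of 25 falls in the 20-40 step: 3
```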
2-3. Target Tracking Scaling
The most recommended approach. The ASG automatically adjusts instances to keep a specific metric at its target value.
resource "aws_autoscaling_policy" "target_tracking" {
name = "target-tracking-cpu"
autoscaling_group_name = aws_autoscaling_group.web.name
policy_type = "TargetTrackingScaling"
target_tracking_configuration {
predefined_metric_specification {
predefined_metric_type = "ASGAverageCPUUtilization"
}
target_value = 70.0
}
}
# ALB request count based Target Tracking
resource "aws_autoscaling_policy" "target_tracking_alb" {
name = "target-tracking-alb"
autoscaling_group_name = aws_autoscaling_group.web.name
policy_type = "TargetTrackingScaling"
target_tracking_configuration {
predefined_metric_specification {
predefined_metric_type = "ALBRequestCountPerTarget"
resource_label = "${aws_lb.web.arn_suffix}/${aws_lb_target_group.web.arn_suffix}"
}
target_value = 1000.0
}
}
Why Target Tracking is recommended:
- No need to manually manage alarms and policies
- Automatically balances scale-in and scale-out
- Multiple Target Tracking policies can be applied simultaneously
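Conceptually, Target Tracking behaves like a proportional controller. A simplified sketch (the real implementation also applies instance warm-up and scale-in damping):

```python
import math

def target_tracking_desired(current_capacity: int, metric_value: float,
                            target_value: float,
                            min_size: int = 1, max_size: int = 20) -> int:
    """New desired capacity so the per-instance metric returns to target.

    E.g. 4 instances averaging 90% CPU with a 70% target need
    ceil(4 * 90 / 70) = 6 instances.
    """
    desired = math.ceil(current_capacity * metric_value / target_value)
    return max(min_size, min(max_size, desired))

print(target_tracking_desired(4, 90.0, 70.0))  # scales out to 6
print(target_tracking_desired(6, 35.0, 70.0))  # scales in to 3
```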
2-4. Predictive Scaling
An ML model analyzes the past 14 days of traffic patterns to predict future traffic and pre-provision instances.
resource "aws_autoscaling_policy" "predictive" {
name = "predictive-scaling"
autoscaling_group_name = aws_autoscaling_group.web.name
policy_type = "PredictiveScaling"
predictive_scaling_configuration {
metric_specification {
target_value = 70
predefined_scaling_metric_specification {
predefined_metric_type = "ASGAverageCPUUtilization"
resource_label = ""
}
predefined_load_metric_specification {
predefined_metric_type = "ASGTotalCPUUtilization"
resource_label = ""
}
}
mode = "ForecastAndScale"
scheduling_buffer_time = 300
max_capacity_breach_behavior = "HonorMaxCapacity"
}
}
Predictive Scaling use cases:
- Daily traffic spikes at the same times (commute hours, lunch time)
- Services with clear weekly/monthly repeating patterns
- Synergy effect when combined with Target Tracking
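A toy version of the forecasting idea (Predictive Scaling actually uses ML models trained on up to 14 days of CloudWatch history; this naive same-hour average is only to show the shape of the approach):

```python
from statistics import mean

def forecast_next_day(hourly_load: list[float], window_days: int = 14) -> list[float]:
    """Predict tomorrow's 24 hourly loads as the per-hour mean over the
    last `window_days` full days in `hourly_load` (one value per hour)."""
    days = len(hourly_load) // 24
    use = min(days, window_days)
    recent = hourly_load[(days - use) * 24 : days * 24]
    return [mean(recent[d * 24 + h] for d in range(use)) for h in range(24)]

# Two days of history: a quiet day then a busy day -> forecast splits the difference
history = [10.0] * 24 + [20.0] * 24
print(forecast_next_day(history)[9])  # 15.0 for the 09:00 hour
```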
Cooldown Period and Warm-up Configuration
resource "aws_autoscaling_group" "web" {
name = "web-asg"
desired_capacity = 2
max_size = 20
min_size = 1
vpc_zone_identifier = [aws_subnet.private_a.id, aws_subnet.private_c.id]
launch_template {
id = aws_launch_template.web_server.id
version = "$Latest"
}
# Default cooldown: wait time after a scaling action
default_cooldown = 300
# Instance warm-up: time for new instances to become fully ready
default_instance_warmup = 120
health_check_type = "ELB"
health_check_grace_period = 300
tag {
key = "Name"
value = "web-server"
propagate_at_launch = true
}
}
Difference between cooldown and warm-up:
- Cooldown: Wait time between scaling actions (prevents excessive scaling)
- Warm-up: Time for newly launched instances to become ready to serve traffic (excluded from ASG metrics)
Mixed Instances Policy (On-Demand + Spot Combined)
A key strategy for dramatically reducing costs while maintaining stability:
resource "aws_autoscaling_group" "web_mixed" {
name = "web-mixed-asg"
desired_capacity = 6
max_size = 30
min_size = 2
vpc_zone_identifier = [
aws_subnet.private_a.id,
aws_subnet.private_b.id,
aws_subnet.private_c.id
]
mixed_instances_policy {
launch_template {
launch_template_specification {
launch_template_id = aws_launch_template.web_server.id
version = "$Latest"
}
override {
instance_type = "t3.medium"
}
override {
instance_type = "t3a.medium"
}
override {
instance_type = "m5.large"
}
override {
instance_type = "m5a.large"
}
}
instances_distribution {
on_demand_base_capacity = 2
on_demand_percentage_above_base_capacity = 20
spot_allocation_strategy = "capacity-optimized"
spot_max_price = "" # Allow up to On-Demand price
}
}
}
What this configuration means:
- Base 2 instances: Always On-Demand (stability guarantee)
- 80% of additional instances: Spot Instances (cost savings)
- 20% of additional instances: On-Demand (stability supplement)
- Multiple instance types: Automatic fallback to alternative types when Spot capacity is unavailable
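The split for a given desired capacity can be computed directly (rounding favors On-Demand, as the ASG does):

```python
import math

def on_demand_spot_split(desired: int, base: int = 2,
                         pct_above_base: int = 20) -> tuple[int, int]:
    """(on_demand, spot) counts for the instances_distribution above.

    The first `base` instances are always On-Demand; of the rest,
    pct_above_base percent are On-Demand (rounded up) and the
    remainder run on Spot.
    """
    above = max(0, desired - base)
    od_above = math.ceil(above * pct_above_base / 100)
    return base + od_above, above - od_above

print(on_demand_spot_split(6))   # (3, 3): 2 base + 1 of the 4 above base
print(on_demand_spot_split(30))  # the split at max_size
```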
Lifecycle Hooks
You can perform custom actions when instances launch or terminate:
resource "aws_autoscaling_lifecycle_hook" "launch_hook" {
name = "launch-setup-hook"
autoscaling_group_name = aws_autoscaling_group.web.name
lifecycle_transition = "autoscaling:EC2_INSTANCE_LAUNCHING"
heartbeat_timeout = 600
default_result = "CONTINUE"
notification_target_arn = aws_sns_topic.asg_notifications.arn
role_arn = aws_iam_role.asg_hook_role.arn
}
resource "aws_autoscaling_lifecycle_hook" "terminate_hook" {
name = "terminate-cleanup-hook"
autoscaling_group_name = aws_autoscaling_group.web.name
lifecycle_transition = "autoscaling:EC2_INSTANCE_TERMINATING"
heartbeat_timeout = 300
default_result = "CONTINUE"
notification_target_arn = aws_sns_topic.asg_notifications.arn
role_arn = aws_iam_role.asg_hook_role.arn
}
Lifecycle Hook use cases:
- Launch Hook: Verify initial configuration completion with configuration management tools (Ansible, Chef)
- Terminate Hook: Log backup, connection draining, service discovery deregistration before termination
3. Spot Instance Master Class
What Are Spot Instances?
Spot Instances provide unused EC2 capacity at up to 90% discount compared to On-Demand pricing.
Price comparison (m5.xlarge, us-east-1):
| Purchase Option | Hourly Price | Monthly Cost (730 hrs) | Discount |
|---|---|---|---|
| On-Demand | $0.192 | $140.16 | - |
| Reserved (1yr, all upfront) | $0.120 | $87.60 | 37% |
| Savings Plan (1yr) | $0.125 | $91.25 | 35% |
| Spot (average) | $0.058 | $42.34 | 70% |
| Spot (lowest) | $0.019 | $13.87 | 90% |
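The table's numbers can be reproduced directly from the hourly rates (Spot prices vary by region and over time; these are the illustrative figures above):

```python
ON_DEMAND = 0.192  # m5.xlarge, us-east-1, from the table above

def monthly_cost(hourly_rate: float, hours: int = 730) -> float:
    """Monthly cost at a given hourly rate (730 hours per month)."""
    return round(hourly_rate * hours, 2)

def discount_pct(hourly_rate: float, on_demand_rate: float = ON_DEMAND) -> int:
    """Discount versus On-Demand, as a rounded percentage."""
    return round((1 - hourly_rate / on_demand_rate) * 100)

print(monthly_cost(0.058), discount_pct(0.058))  # 42.34 70
print(monthly_cost(0.019), discount_pct(0.019))  # 13.87 90
```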
Spot Price History Analysis
You can query Spot price history using the AWS CLI:
aws ec2 describe-spot-price-history \
--instance-types m5.xlarge m5a.xlarge m5d.xlarge \
--product-descriptions "Linux/UNIX" \
--start-time "2026-03-16T00:00:00" \
--end-time "2026-03-23T00:00:00" \
--query 'SpotPriceHistory[*].[InstanceType,AvailabilityZone,SpotPrice,Timestamp]' \
--output table
Strategies to increase Spot price stability:
- Specify multiple instance types (m5.xlarge, m5a.xlarge, m5d.xlarge, m5n.xlarge)
- Use multiple Availability Zones (AZs)
- Use the capacity-optimized allocation strategy (allocates from pools with most spare capacity)
Spot Interruption Handling
AWS can reclaim Spot Instances when it needs the capacity back, giving only a 2-minute warning.
Detecting Interruptions via Metadata Polling
#!/bin/bash
# spot-interruption-handler.sh
METADATA_TOKEN=$(curl -s -X PUT \
"http://169.254.169.254/latest/api/token" \
-H "X-aws-ec2-metadata-token-ttl-seconds: 21600")
while true; do
INTERRUPTION=$(curl -s -H "X-aws-ec2-metadata-token: $METADATA_TOKEN" \
http://169.254.169.254/latest/meta-data/spot/instance-action 2>/dev/null)
if [ "$INTERRUPTION" != "" ] && echo "$INTERRUPTION" | grep -q "action"; then
echo "Spot interruption detected! Starting graceful shutdown..."
# 1. Deregister from ALB (stop receiving new requests)
# TARGET_GROUP_ARN is assumed to be exported in the instance's environment
INSTANCE_ID=$(curl -s -H "X-aws-ec2-metadata-token: $METADATA_TOKEN" \
http://169.254.169.254/latest/meta-data/instance-id)
aws elbv2 deregister-targets \
--target-group-arn "$TARGET_GROUP_ARN" \
--targets "Id=$INSTANCE_ID"
# 2. Wait for in-flight requests to complete (Connection Draining)
sleep 30
# 3. Backup logs
aws s3 sync /var/log/app/ "s3://my-logs-bucket/spot-terminated/$INSTANCE_ID/"
# 4. Graceful application shutdown
systemctl stop my-app
echo "Graceful shutdown completed."
break
fi
sleep 5
done
Lambda-based Interruption Handler
Process Spot interruption events via EventBridge rules with Lambda:
import json
import boto3
sns = boto3.client('sns')
asg = boto3.client('autoscaling')
def lambda_handler(event, context):
"""
Receives and processes Spot Interruption Warning events from EventBridge
"""
detail = event.get('detail', {})
instance_id = detail.get('instance-id')
action = detail.get('instance-action')
print(f"Spot interruption: instance={instance_id}, action={action}")
# 1. Mark instance as unhealthy in ASG (triggers immediate replacement)
try:
asg.set_instance_health(
InstanceId=instance_id,
HealthStatus='Unhealthy',
ShouldRespectGracePeriod=False
)
print(f"Marked {instance_id} as unhealthy in ASG")
except Exception as e:
print(f"ASG health update failed: {e}")
# 2. Send SNS notification
sns.publish(
TopicArn='arn:aws:sns:us-east-1:123456789012:spot-alerts',
Subject=f'Spot Interruption: {instance_id}',
Message=json.dumps({
'instance_id': instance_id,
'action': action,
'region': event.get('region'),
'time': event.get('time')
}, indent=2)
)
return {
'statusCode': 200,
'body': f'Handled interruption for {instance_id}'
}
EventBridge rule configuration (Terraform):
resource "aws_cloudwatch_event_rule" "spot_interruption" {
name = "spot-interruption-rule"
description = "Capture EC2 Spot Instance Interruption Warning"
event_pattern = jsonencode({
source = ["aws.ec2"]
detail-type = ["EC2 Spot Instance Interruption Warning"]
})
}
resource "aws_cloudwatch_event_target" "spot_handler_lambda" {
rule = aws_cloudwatch_event_rule.spot_interruption.name
target_id = "spot-interruption-handler"
arn = aws_lambda_function.spot_handler.arn
}
Spot Fleet: Multiple Instance Types + Multiple AZs
resource "aws_spot_fleet_request" "batch_processing" {
iam_fleet_role = aws_iam_role.spot_fleet_role.arn
target_capacity = 10
terminate_instances_with_expiration = true
allocation_strategy = "capacityOptimized"
fleet_type = "maintain"
launch_template_config {
launch_template_specification {
id = aws_launch_template.batch.id
version = "$Latest"
}
overrides {
instance_type = "c5.xlarge"
availability_zone = "us-east-1a"
}
overrides {
instance_type = "c5a.xlarge"
availability_zone = "us-east-1a"
}
overrides {
instance_type = "c5.xlarge"
availability_zone = "us-east-1b"
}
overrides {
instance_type = "c5a.xlarge"
availability_zone = "us-east-1b"
}
overrides {
instance_type = "c6i.xlarge"
availability_zone = "us-east-1a"
}
overrides {
instance_type = "c6i.xlarge"
availability_zone = "us-east-1b"
}
}
}
Suitable and Unsuitable Workloads for Spot
Suitable workloads:
- CI/CD pipelines (builds, test runners)
- Batch data processing (ETL, log analysis)
- ML model training (with checkpoint support)
- Load testing / performance testing
- Big data processing (EMR, Spark)
- Image/video encoding
- Web servers (when using ASG Mixed Instances Policy)
Unsuitable workloads:
- Single-instance databases (use RDS Multi-AZ instead)
- Real-time payment systems (cannot tolerate interruptions)
- Workloads requiring long-running stateful processes
- Mission-critical services requiring 99.99%+ SLA (use On-Demand or Reserved)
4. Serverless Dynamic Infrastructure
Lambda: Event-Driven Auto Scaling
AWS Lambda is the extreme form of dynamic infrastructure. Zero cost when there are no requests, and millisecond-level resource allocation when requests arrive.
import json
import time
def lambda_handler(event, context):
"""
Lambda function invoked from API Gateway
Concurrent execution: auto-scales from 0 to thousands
"""
start = time.time()
# Business logic
body = event.get('body', '{}')
data = json.loads(body) if body else {}
result = process_request(data)
duration = (time.time() - start) * 1000
print(f"Processing took {duration:.2f}ms")
return {
'statusCode': 200,
'headers': {
'Content-Type': 'application/json',
'X-Processing-Time': f'{duration:.2f}ms'
},
'body': json.dumps(result)
}
def process_request(data):
# Actual business logic
return {'status': 'success', 'data': data}
Lambda Concurrency Management
# Reserved Concurrency: set on the function itself via reserved_concurrent_executions
resource "aws_lambda_function" "api" {
# ... other function configuration ...
reserved_concurrent_executions = 100
}
# Async invocation settings: cap event age and retries for asynchronous invokes
resource "aws_lambda_function_event_invoke_config" "example" {
function_name = aws_lambda_function.api.function_name
maximum_event_age_in_seconds = 60
maximum_retry_attempts = 0
}
# Provisioned Concurrency: keeps instances warm (eliminates Cold Start)
resource "aws_lambda_provisioned_concurrency_config" "api" {
function_name = aws_lambda_function.api.function_name
provisioned_concurrent_executions = 50
qualifier = aws_lambda_alias.live.name
}
Concurrency type comparison:
- Reserved Concurrency: Guarantees capacity for this function, preventing other functions from using it. No additional cost.
- Provisioned Concurrency: Keeps execution environments warm. Eliminates Cold Start. Incurs additional cost.
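Concurrency is the quantity both settings manage, and it follows directly from arrival rate and execution duration. A simplified model (Lambda's actual scaling also involves per-region burst limits):

```python
def peak_concurrency(arrival_times: list[float], duration_s: float) -> int:
    """Maximum number of overlapping executions for requests arriving at
    the given times (seconds), each running for duration_s seconds.

    Reserved Concurrency caps this number (excess requests are throttled);
    Provisioned Concurrency pre-warms environments up to it.
    """
    return max(
        sum(1 for a in arrival_times if a <= t < a + duration_s)
        for t in arrival_times
    )

# 3 requests land within 200ms and each runs 1s -> 3 environments needed
print(peak_concurrency([0.0, 0.1, 0.2, 1.5], 1.0))  # 3
```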
Cold Start Minimization Strategies
- Use Provisioned Concurrency (most reliable method)
- Minimize package size (optimize dependencies, use Layers)
- Runtime selection: Python/Node.js have faster Cold Starts than Java
- Optimize initialization code: Initialize DB connections outside the handler
- Use SnapStart (originally Java-only, now also available for Python and .NET; can cut cold starts by up to 90%)
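Point 4 above in code: anything at module scope runs once per execution environment (on the cold start) and is then reused across warm invocations. A minimal sketch with a counter standing in for a real DB client:

```python
# Module scope: executed once per execution environment (cold start only)
INIT_COUNT = 0

def _create_connection() -> dict:
    """Stand-in for expensive setup (DB connection, SDK client creation)."""
    global INIT_COUNT
    INIT_COUNT += 1
    return {"conn": "db-handle"}

CONNECTION = _create_connection()  # initialized outside the handler

def lambda_handler(event, context):
    # Warm invocations reuse CONNECTION instead of reconnecting
    return {"statusCode": 200, "cold_starts": INIT_COUNT}

# Simulate three warm invocations in the same environment
for _ in range(3):
    result = lambda_handler({}, None)
print(result["cold_starts"])  # 1: setup ran only once
```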
Fargate: Serverless Containers
resource "aws_ecs_service" "api" {
name = "api-service"
cluster = aws_ecs_cluster.main.id
task_definition = aws_ecs_task_definition.api.arn
desired_count = 2
launch_type = "FARGATE"
network_configuration {
subnets = [aws_subnet.private_a.id, aws_subnet.private_c.id]
security_groups = [aws_security_group.ecs.id]
assign_public_ip = false
}
load_balancer {
target_group_arn = aws_lb_target_group.api.arn
container_name = "api"
container_port = 8080
}
capacity_provider_strategy {
capacity_provider = "FARGATE"
weight = 1
base = 2 # Minimum 2 on Fargate On-Demand
}
capacity_provider_strategy {
capacity_provider = "FARGATE_SPOT"
weight = 4 # 80% of additional capacity on Fargate Spot
}
}
# Fargate Auto Scaling
resource "aws_appautoscaling_target" "ecs" {
max_capacity = 20
min_capacity = 2
resource_id = "service/${aws_ecs_cluster.main.name}/${aws_ecs_service.api.name}"
scalable_dimension = "ecs:service:DesiredCount"
service_namespace = "ecs"
}
resource "aws_appautoscaling_policy" "ecs_cpu" {
name = "ecs-cpu-scaling"
policy_type = "TargetTrackingScaling"
resource_id = aws_appautoscaling_target.ecs.resource_id
scalable_dimension = aws_appautoscaling_target.ecs.scalable_dimension
service_namespace = aws_appautoscaling_target.ecs.service_namespace
target_tracking_scaling_policy_configuration {
predefined_metric_specification {
predefined_metric_type = "ECSServiceAverageCPUUtilization"
}
target_value = 70.0
scale_in_cooldown = 300
scale_out_cooldown = 60
}
}
Lambda vs Fargate vs EC2 Comparison
| Aspect | Lambda | Fargate | EC2 (ASG) |
|---|---|---|---|
| Scaling speed | Milliseconds | 30s-2min | 2-5min |
| Max execution time | 15 minutes | Unlimited | Unlimited |
| Memory | 128MB-10GB | 512MB-120GB | Depends on instance type |
| vCPU | Up to 6 | Up to 16 | Depends on instance type |
| Cost model | Request count + execution time | vCPU + memory hours | Instance hours |
| Cold Start | Yes | Yes (longer) | No (already running) |
| Management overhead | Minimal | Medium | High |
| Container support | Image deployment available | Native | Manage Docker directly |
| Spot support | N/A | Fargate Spot (70%) | Spot Instance (90%) |
| Best for | Event processing, APIs | Microservices, web apps | High-performance, stateful |
5. Building Dynamic Infrastructure with Terraform
Complete ASG + ALB Infrastructure Example
# provider.tf
terraform {
required_version = ">= 1.5.0"
required_providers {
aws = {
source = "hashicorp/aws"
version = "~> 5.0"
}
}
backend "s3" {
bucket = "my-terraform-state-bucket"
key = "web-app/terraform.tfstate"
region = "us-east-1"
dynamodb_table = "terraform-lock"
encrypt = true
}
}
provider "aws" {
region = var.aws_region
default_tags {
tags = {
Environment = var.environment
ManagedBy = "terraform"
Project = var.project_name
}
}
}
# variables.tf
variable "aws_region" {
default = "us-east-1"
}
variable "environment" {
default = "production"
}
variable "project_name" {
default = "web-app"
}
variable "vpc_cidr" {
default = "10.0.0.0/16"
}
# vpc.tf
resource "aws_vpc" "main" {
cidr_block = var.vpc_cidr
enable_dns_hostnames = true
enable_dns_support = true
tags = {
Name = "${var.project_name}-vpc"
}
}
resource "aws_subnet" "public_a" {
vpc_id = aws_vpc.main.id
cidr_block = "10.0.1.0/24"
availability_zone = "${var.aws_region}a"
map_public_ip_on_launch = true
tags = {
Name = "${var.project_name}-public-a"
Tier = "public"
}
}
resource "aws_subnet" "public_b" {
vpc_id = aws_vpc.main.id
cidr_block = "10.0.2.0/24"
availability_zone = "${var.aws_region}b"
map_public_ip_on_launch = true
tags = {
Name = "${var.project_name}-public-b"
Tier = "public"
}
}
resource "aws_subnet" "private_a" {
vpc_id = aws_vpc.main.id
cidr_block = "10.0.10.0/24"
availability_zone = "${var.aws_region}a"
tags = {
Name = "${var.project_name}-private-a"
Tier = "private"
}
}
resource "aws_subnet" "private_b" {
vpc_id = aws_vpc.main.id
cidr_block = "10.0.11.0/24"
availability_zone = "${var.aws_region}b"
tags = {
Name = "${var.project_name}-private-b"
Tier = "private"
}
}
# alb.tf
resource "aws_lb" "web" {
name = "${var.project_name}-alb"
internal = false
load_balancer_type = "application"
security_groups = [aws_security_group.alb.id]
subnets = [aws_subnet.public_a.id, aws_subnet.public_b.id]
enable_deletion_protection = true
access_logs {
bucket = aws_s3_bucket.alb_logs.bucket
prefix = "alb-logs"
enabled = true
}
}
resource "aws_lb_target_group" "web" {
name = "${var.project_name}-tg"
port = 80
protocol = "HTTP"
vpc_id = aws_vpc.main.id
health_check {
enabled = true
healthy_threshold = 2
interval = 15
matcher = "200"
path = "/health"
port = "traffic-port"
protocol = "HTTP"
timeout = 5
unhealthy_threshold = 3
}
deregistration_delay = 30
stickiness {
type = "lb_cookie"
enabled = false
}
}
resource "aws_lb_listener" "https" {
load_balancer_arn = aws_lb.web.arn
port = "443"
protocol = "HTTPS"
ssl_policy = "ELBSecurityPolicy-TLS13-1-2-2021-06"
certificate_arn = aws_acm_certificate.main.arn
default_action {
type = "forward"
target_group_arn = aws_lb_target_group.web.arn
}
}
# asg.tf - Complete Mixed Instances ASG
resource "aws_autoscaling_group" "web" {
name = "${var.project_name}-asg"
desired_capacity = 4
max_size = 30
min_size = 2
vpc_zone_identifier = [aws_subnet.private_a.id, aws_subnet.private_b.id]
target_group_arns = [aws_lb_target_group.web.arn]
health_check_type = "ELB"
health_check_grace_period = 300
default_instance_warmup = 120
mixed_instances_policy {
launch_template {
launch_template_specification {
launch_template_id = aws_launch_template.web_server.id
version = "$Latest"
}
override {
instance_type = "t3.medium"
weighted_capacity = "1"
}
override {
instance_type = "t3a.medium"
weighted_capacity = "1"
}
override {
instance_type = "m5.large"
weighted_capacity = "2"
}
override {
instance_type = "m5a.large"
weighted_capacity = "2"
}
}
instances_distribution {
on_demand_base_capacity = 2
on_demand_percentage_above_base_capacity = 20
spot_allocation_strategy = "capacity-optimized"
}
}
instance_refresh {
strategy = "Rolling"
preferences {
min_healthy_percentage = 80
instance_warmup = 120
}
}
tag {
key = "Name"
value = "${var.project_name}-web"
propagate_at_launch = true
}
}
# Target Tracking Scaling
resource "aws_autoscaling_policy" "cpu_target" {
name = "cpu-target-tracking"
autoscaling_group_name = aws_autoscaling_group.web.name
policy_type = "TargetTrackingScaling"
target_tracking_configuration {
predefined_metric_specification {
predefined_metric_type = "ASGAverageCPUUtilization"
}
target_value = 70.0
}
}
# Predictive Scaling
resource "aws_autoscaling_policy" "predictive" {
name = "predictive-scaling"
autoscaling_group_name = aws_autoscaling_group.web.name
policy_type = "PredictiveScaling"
predictive_scaling_configuration {
metric_specification {
target_value = 70
predefined_scaling_metric_specification {
predefined_metric_type = "ASGAverageCPUUtilization"
resource_label = ""
}
predefined_load_metric_specification {
predefined_metric_type = "ASGTotalCPUUtilization"
resource_label = ""
}
}
mode = "ForecastAndScale"
scheduling_buffer_time = 300
}
}
Terraform Modules for Reusable Infrastructure
# main.tf (root module) - calling the reusable ASG module in ./modules/asg
module "web_asg" {
source = "./modules/asg"
project_name = "my-web-app"
environment = "production"
vpc_id = module.vpc.vpc_id
private_subnet_ids = module.vpc.private_subnet_ids
alb_target_group = module.alb.target_group_arn
instance_types = ["t3.medium", "t3a.medium", "m5.large"]
min_size = 2
max_size = 30
desired_capacity = 4
spot_percentage = 80
on_demand_base = 2
target_cpu = 70
tags = local.common_tags
}
State Management (S3 + DynamoDB)
# state-backend/main.tf (apply this locally first)
resource "aws_s3_bucket" "terraform_state" {
bucket = "my-company-terraform-state"
lifecycle {
prevent_destroy = true
}
}
resource "aws_s3_bucket_versioning" "terraform_state" {
bucket = aws_s3_bucket.terraform_state.id
versioning_configuration {
status = "Enabled"
}
}
resource "aws_s3_bucket_server_side_encryption_configuration" "terraform_state" {
bucket = aws_s3_bucket.terraform_state.id
rule {
apply_server_side_encryption_by_default {
sse_algorithm = "aws:kms"
}
}
}
resource "aws_dynamodb_table" "terraform_lock" {
name = "terraform-lock"
billing_mode = "PAY_PER_REQUEST"
hash_key = "LockID"
attribute {
name = "LockID"
type = "S"
}
}
Terraform Workflow
# 1. Initialize
terraform init
# 2. Validate code
terraform validate
# 3. Review change plan
terraform plan -out=tfplan
# 4. Apply changes
terraform apply tfplan
# 5. Check state
terraform state list
terraform state show aws_autoscaling_group.web
# 6. Destroy infrastructure (dev environments)
terraform destroy
6. Building Dynamic Infrastructure with AWS CDK
CDK TypeScript Example: ASG + ALB + RDS
import * as cdk from 'aws-cdk-lib'
import * as ec2 from 'aws-cdk-lib/aws-ec2'
import * as elbv2 from 'aws-cdk-lib/aws-elasticloadbalancingv2'
import * as autoscaling from 'aws-cdk-lib/aws-autoscaling'
import * as rds from 'aws-cdk-lib/aws-rds'
import { Construct } from 'constructs'
export class WebAppStack extends cdk.Stack {
constructor(scope: Construct, id: string, props?: cdk.StackProps) {
super(scope, id, props)
// VPC
const vpc = new ec2.Vpc(this, 'WebVpc', {
maxAzs: 3,
natGateways: 2,
subnetConfiguration: [
{
cidrMask: 24,
name: 'Public',
subnetType: ec2.SubnetType.PUBLIC,
},
{
cidrMask: 24,
name: 'Private',
subnetType: ec2.SubnetType.PRIVATE_WITH_EGRESS,
},
{
cidrMask: 24,
name: 'Isolated',
subnetType: ec2.SubnetType.PRIVATE_ISOLATED,
},
],
})
// ALB
const alb = new elbv2.ApplicationLoadBalancer(this, 'WebAlb', {
vpc,
internetFacing: true,
vpcSubnets: { subnetType: ec2.SubnetType.PUBLIC },
})
// ASG with Mixed Instances
const asg = new autoscaling.AutoScalingGroup(this, 'WebAsg', {
vpc,
vpcSubnets: { subnetType: ec2.SubnetType.PRIVATE_WITH_EGRESS },
mixedInstancesPolicy: {
instancesDistribution: {
onDemandBaseCapacity: 2,
onDemandPercentageAboveBaseCapacity: 20,
spotAllocationStrategy: autoscaling.SpotAllocationStrategy.CAPACITY_OPTIMIZED,
},
launchTemplate: new ec2.LaunchTemplate(this, 'LaunchTemplate', {
instanceType: ec2.InstanceType.of(ec2.InstanceClass.T3, ec2.InstanceSize.MEDIUM),
machineImage: ec2.MachineImage.latestAmazonLinux2023(),
userData: ec2.UserData.custom(`
#!/bin/bash
yum update -y
yum install -y docker
systemctl start docker
docker run -d -p 80:8080 my-web-app:latest
`),
}),
launchTemplateOverrides: [
{ instanceType: new ec2.InstanceType('t3.medium') },
{ instanceType: new ec2.InstanceType('t3a.medium') },
{ instanceType: new ec2.InstanceType('m5.large') },
{ instanceType: new ec2.InstanceType('m5a.large') },
],
},
minCapacity: 2,
maxCapacity: 30,
healthCheck: autoscaling.HealthCheck.elb({
grace: cdk.Duration.seconds(300),
}),
})
// Target Tracking Scaling
asg.scaleOnCpuUtilization('CpuScaling', {
targetUtilizationPercent: 70,
cooldown: cdk.Duration.seconds(300),
})
asg.scaleOnRequestCount('RequestScaling', {
targetRequestsPerMinute: 1000,
})
// ALB Listener
const listener = alb.addListener('HttpsListener', {
port: 443,
certificates: [
elbv2.ListenerCertificate.fromArn('arn:aws:acm:us-east-1:123456789012:certificate/abc-123'),
],
})
listener.addTargets('WebTarget', {
port: 80,
targets: [asg],
healthCheck: {
path: '/health',
interval: cdk.Duration.seconds(15),
healthyThresholdCount: 2,
unhealthyThresholdCount: 3,
},
deregistrationDelay: cdk.Duration.seconds(30),
})
// RDS Multi-AZ
const database = new rds.DatabaseCluster(this, 'Database', {
engine: rds.DatabaseClusterEngine.auroraPostgres({
version: rds.AuroraPostgresEngineVersion.VER_15_4,
}),
writer: rds.ClusterInstance.provisioned('Writer', {
instanceType: ec2.InstanceType.of(ec2.InstanceClass.R6G, ec2.InstanceSize.LARGE),
}),
readers: [
rds.ClusterInstance.provisioned('Reader1', {
instanceType: ec2.InstanceType.of(ec2.InstanceClass.R6G, ec2.InstanceSize.LARGE),
}),
],
vpc,
vpcSubnets: { subnetType: ec2.SubnetType.PRIVATE_ISOLATED },
storageEncrypted: true,
deletionProtection: true,
})
// Allow ASG to access RDS
database.connections.allowDefaultPortFrom(asg)
// Outputs
new cdk.CfnOutput(this, 'AlbDnsName', {
value: alb.loadBalancerDnsName,
description: 'ALB DNS Name',
})
}
}
CDK Constructs: L1 vs L2 vs L3
| Level | Name | Description | Example |
|---|---|---|---|
| L1 | Cfn Resources | 1:1 mapping to CloudFormation resources | CfnInstance, CfnVPC |
| L2 | Curated | Reasonable defaults + helper methods | ec2.Vpc, lambda.Function |
| L3 | Patterns | Architecture patterns combining multiple resources | ecs_patterns.ApplicationLoadBalancedFargateService |
Using L3 patterns lets you define complex infrastructure in just a few lines:
import * as ecs from 'aws-cdk-lib/aws-ecs'
import * as ecs_patterns from 'aws-cdk-lib/aws-ecs-patterns'
// L3 Pattern: ALB + Fargate service in one go
const fargateService = new ecs_patterns.ApplicationLoadBalancedFargateService(
this,
'FargateService',
{
cluster,
desiredCount: 2,
taskImageOptions: {
image: ecs.ContainerImage.fromRegistry('my-app:latest'),
containerPort: 8080,
},
publicLoadBalancer: true,
capacityProviderStrategies: [
{ capacityProvider: 'FARGATE', weight: 1, base: 2 },
{ capacityProvider: 'FARGATE_SPOT', weight: 4 },
],
}
)
fargateService.targetGroup.configureHealthCheck({
path: '/health',
})
CDK vs Terraform vs CloudFormation Comparison
| Aspect | CDK | Terraform | CloudFormation |
|---|---|---|---|
| Language | TypeScript, Python, Java, Go | HCL | YAML/JSON |
| Learning curve | Medium (leverages programming languages) | Medium (learn HCL) | Low (declarative YAML) |
| Abstraction level | High (L3 patterns) | Medium (modules) | Low (resource-level) |
| Multi-cloud | AWS only | Multi-cloud | AWS only |
| State management | CloudFormation stacks | S3 + DynamoDB | Auto-managed |
| Testing | Unit tests possible | Terratest | Limited |
| Drift detection | Via CloudFormation | terraform plan | Supported |
| Ecosystem | Construct Hub | Terraform Registry | Limited |
CDK Pipeline: Infrastructure Deployment via CI/CD
import * as pipelines from 'aws-cdk-lib/pipelines'
import { CodePipeline, CodePipelineSource, ShellStep } from 'aws-cdk-lib/pipelines'
const pipeline = new CodePipeline(this, 'Pipeline', {
pipelineName: 'WebAppPipeline',
synth: new ShellStep('Synth', {
input: CodePipelineSource.gitHub('my-org/my-repo', 'main'),
commands: ['npm ci', 'npm run build', 'npx cdk synth'],
}),
})
// Deploy to staging
pipeline.addStage(
new WebAppStage(this, 'Staging', {
env: { account: '123456789012', region: 'us-east-1' },
})
)
// Deploy to production (with manual approval)
pipeline.addStage(
new WebAppStage(this, 'Production', {
env: { account: '987654321098', region: 'us-east-1' },
}),
{
pre: [new pipelines.ManualApprovalStep('PromoteToProduction')],
}
)
7. Cost Optimization Practical Strategies
Reserved Instances vs Savings Plans vs Spot Comparison
| Aspect | Reserved Instances | Savings Plans | Spot Instances |
|---|---|---|---|
| Discount | Up to 72% | Up to 72% | Up to 90% |
| Commitment | 1 or 3 years | 1 or 3 years | None |
| Flexibility | Fixed instance type/region | Flexible compute type | Interruptible |
| Upfront options | All/partial/none | All/partial/none | None |
| Best for | Predictable baseline workloads | Diverse compute usage | Interruptible workloads |
| Lambda support | No | Compute SP applicable | N/A |
| Fargate support | No | Compute SP applicable | Fargate Spot |
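To see how these discount tiers translate into monthly dollars, here is a rough comparison. The hourly rate and the specific discount percentages are illustrative assumptions, not current AWS pricing; check the pricing pages and your own usage before committing.

```python
# Rough monthly cost comparison for one m5.xlarge-class instance under
# different purchasing models. Rate and discounts are assumed, not real quotes.
HOURS_PER_MONTH = 730

def monthly_cost(on_demand_hourly: float, discount_pct: float) -> float:
    """Apply a percentage discount to the on-demand monthly cost."""
    return on_demand_hourly * HOURS_PER_MONTH * (1 - discount_pct / 100)

on_demand_rate = 0.192  # assumed on-demand $/hour

costs = {
    "on_demand": monthly_cost(on_demand_rate, 0),
    "reserved_3yr": monthly_cost(on_demand_rate, 62),  # assumed RI discount
    "savings_plan": monthly_cost(on_demand_rate, 54),  # assumed Compute SP discount
    "spot": monthly_cost(on_demand_rate, 90),          # best-case Spot discount
}

for model, cost in costs.items():
    print(f"{model:>12}: ${cost:,.2f}/month")
```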
Schedule-based Scaling
Apply to dev/staging environments or services with low off-hours traffic:
# Business hours (Mon-Fri 09:00-18:00): 4 instances
resource "aws_autoscaling_schedule" "scale_up_business_hours" {
scheduled_action_name = "scale-up-business"
autoscaling_group_name = aws_autoscaling_group.web.name
min_size = 4
max_size = 30
desired_capacity = 4
recurrence = "0 9 * * 1-5"
time_zone = "America/New_York"
}
# Night (Mon-Fri after 18:00): 2 instances
resource "aws_autoscaling_schedule" "scale_down_night" {
scheduled_action_name = "scale-down-night"
autoscaling_group_name = aws_autoscaling_group.web.name
min_size = 2
max_size = 10
desired_capacity = 2
recurrence = "0 18 * * 1-5"
time_zone = "America/New_York"
}
# Weekend: 1 instance
resource "aws_autoscaling_schedule" "scale_down_weekend" {
scheduled_action_name = "scale-down-weekend"
autoscaling_group_name = aws_autoscaling_group.web.name
min_size = 1
max_size = 5
desired_capacity = 1
recurrence = "0 0 * * 6"
time_zone = "America/New_York"
}
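The three schedules above can be sanity-checked with a small desired-capacity function. This is a deliberate simplification: times are naive datetimes assumed to be in the schedule's time zone, and it ignores that between scheduled actions an ASG keeps the last value set (so early Monday morning would in practice still be at weekend capacity until the 09:00 action fires).

```python
# Mirror of the three aws_autoscaling_schedule rules: business hours -> 4,
# weekday nights -> 2, weekends -> 1.
from datetime import datetime

def desired_capacity(now: datetime) -> int:
    if now.weekday() >= 5:      # Saturday/Sunday -> weekend schedule
        return 1
    if 9 <= now.hour < 18:      # Mon-Fri 09:00-18:00 -> business hours
        return 4
    return 2                    # Mon-Fri night

assert desired_capacity(datetime(2024, 6, 3, 10, 0)) == 4  # Monday 10:00
assert desired_capacity(datetime(2024, 6, 3, 20, 0)) == 2  # Monday 20:00
assert desired_capacity(datetime(2024, 6, 8, 12, 0)) == 1  # Saturday noon
```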
Tag-based Cost Tracking
# Apply cost tracking tags to all resources
locals {
cost_tags = {
CostCenter = "engineering"
Team = "platform"
Project = "web-app"
Environment = var.environment
ManagedBy = "terraform"
}
}
# Enable tag-based cost analysis in AWS Cost Explorer
resource "aws_ce_cost_category" "team_costs" {
name = "TeamCosts"
rule_version = "CostCategoryExpression.v1"
rule {
value = "Platform"
rule {
tags {
key = "Team"
values = ["platform"]
match_options = ["EQUALS"]
}
}
}
rule {
value = "Backend"
rule {
tags {
key = "Team"
values = ["backend"]
match_options = ["EQUALS"]
}
}
}
}
Real-world Case Study: From $4,502 to $1,288 per Month
Before (Static Infrastructure):
- EC2 m5.xlarge x 10 (On-Demand, 24/7) = $1,401/month
- EC2 m5.2xlarge x 5 (Batch servers, 24/7) = $1,401/month
- RDS db.r5.xlarge Multi-AZ = $1,020/month
- NAT Gateway x 2 = $130/month
- ALB = $50/month
- Other (EBS, S3, CloudWatch) = $500/month
- Total monthly cost: approximately $4,502
After (Dynamic Infrastructure):
| Change | Before | After | Savings |
|---|---|---|---|
| Web servers 10 to ASG (2 OD + Spot) | $1,401 | $420 | 70% |
| Batch servers to Spot Fleet (run only while jobs execute) | $1,401 | $140 | 90% |
| RDS with Savings Plan | $1,020 | $663 | 35% |
| NAT Gateway optimization | $130 | $65 | 50% |
| Schedule Scaling (night/weekend reduction) | - | Additional -30% | - |
| Total | $4,502 | Approx. $1,288 | 71% |
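The line items in the table above can be verified with a few lines of arithmetic (figures taken directly from the table):

```python
# Verify the per-line savings and the new monthly total (USD).
line_items = {
    # name: (before, after)
    "web_servers_asg_spot": (1401, 420),
    "batch_spot_fleet":     (1401, 140),
    "rds_savings_plan":     (1020, 663),
    "nat_gateway":          (130, 65),
}

for name, (before, after) in line_items.items():
    pct = round((1 - after / before) * 100)
    print(f"{name}: {pct}% saved")

total_after = sum(after for _, after in line_items.values())  # 1288
```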
8. High Availability Architecture
Multi-AZ Deployment Architecture
Route 53 (DNS Failover)
|
CloudFront (CDN)
|
ALB (Multi-AZ)
/ \
AZ-a (us-east-1a) AZ-b (us-east-1b)
+------------------+ +------------------+
| EC2 (ASG) | | EC2 (ASG) |
| - Web Server x2 | | - Web Server x2 |
| | | |
| RDS (Primary) | | RDS (Standby) |
| ElastiCache | | ElastiCache |
+------------------+ +------------------+
Route 53 Health Check + Failover
resource "aws_route53_health_check" "primary" {
fqdn = "app.example.com"
port = 443
type = "HTTPS"
resource_path = "/health"
failure_threshold = 3
request_interval = 10
regions = ["us-east-1", "eu-west-1", "ap-southeast-1"]
tags = {
Name = "primary-health-check"
}
}
resource "aws_route53_record" "primary" {
zone_id = aws_route53_zone.main.zone_id
name = "app.example.com"
type = "A"
alias {
name = aws_lb.web.dns_name
zone_id = aws_lb.web.zone_id
evaluate_target_health = true
}
failover_routing_policy {
type = "PRIMARY"
}
health_check_id = aws_route53_health_check.primary.id
set_identifier = "primary"
}
Chaos Engineering: AWS Fault Injection Simulator
resource "aws_fis_experiment_template" "spot_interruption" {
description = "Simulate Spot Instance interruptions"
role_arn = aws_iam_role.fis.arn
stop_condition {
source = "aws:cloudwatch:alarm"
value = aws_cloudwatch_metric_alarm.error_rate.arn
}
action {
name = "interrupt-spot-instances"
action_id = "aws:ec2:send-spot-instance-interruptions"
parameter {
key = "durationBeforeInterruption"
value = "PT2M"
}
target {
key = "SpotInstances"
value = "spot-instances-target"
}
}
target {
name = "spot-instances-target"
resource_type = "aws:ec2:spot-instance"
selection_mode = "COUNT(2)"
resource_tag {
key = "Environment"
value = "staging"
}
}
}
9. Three Patterns for Creating EC2 Instances On-Demand
9-1. EventBridge + Lambda to EC2 (Event-Driven)
A pattern that starts an EC2 instance when a file is uploaded to S3, processes it, and automatically terminates:
import boto3
import json

ec2 = boto3.client('ec2')
def lambda_handler(event, context):
"""
S3 upload event -> Create EC2 instance -> Process -> Auto-terminate
"""
# Extract file info from S3 event
bucket = event['detail']['bucket']['name']
key = event['detail']['object']['key']
print(f"Processing file: s3://{bucket}/{key}")
# Start EC2 instance
user_data = f"""#!/bin/bash
set -e
# Perform work
aws s3 cp s3://{bucket}/{key} /tmp/input
python3 /opt/process.py /tmp/input /tmp/output
aws s3 cp /tmp/output s3://{bucket}-processed/{key}
# Self-terminate after completion (the CLI needs an explicit region on a fresh instance)
INSTANCE_ID=$(curl -s http://169.254.169.254/latest/meta-data/instance-id)
REGION=$(curl -s http://169.254.169.254/latest/meta-data/placement/region)
aws ec2 terminate-instances --instance-ids $INSTANCE_ID --region $REGION
"""
response = ec2.run_instances(
ImageId='ami-0abcdef1234567890',
InstanceType='c5.xlarge',
MinCount=1,
MaxCount=1,
IamInstanceProfile={
'Name': 'ec2-processing-role'
},
UserData=user_data,
TagSpecifications=[
{
'ResourceType': 'instance',
'Tags': [
{'Key': 'Name', 'Value': f'processor-{key[:20]}'},
{'Key': 'Purpose', 'Value': 'batch-processing'},
{'Key': 'AutoTerminate', 'Value': 'true'}
]
}
],
InstanceMarketOptions={
'MarketType': 'spot',
'SpotOptions': {
'SpotInstanceType': 'one-time',
'InstanceInterruptionBehavior': 'terminate'
}
}
)
instance_id = response['Instances'][0]['InstanceId']
print(f"Started processing instance: {instance_id}")
return {
'statusCode': 200,
'body': json.dumps({
'instance_id': instance_id,
'file': f's3://{bucket}/{key}'
})
}
9-2. Step Functions Orchestration
Manage complex workflows with Step Functions:
{
"Comment": "EC2-based batch processing workflow",
"StartAt": "CreateInstance",
"States": {
"CreateInstance": {
"Type": "Task",
"Resource": "arn:aws:states:::aws-sdk:ec2:runInstances",
"Parameters": {
"ImageId": "ami-0abcdef1234567890",
"InstanceType": "c5.2xlarge",
"MinCount": 1,
"MaxCount": 1,
"IamInstanceProfile": {
"Name": "batch-processing-role"
},
"TagSpecifications": [
{
"ResourceType": "instance",
"Tags": [
{
"Key": "Purpose",
"Value": "step-function-batch"
}
]
}
]
},
"ResultPath": "$.instanceInfo",
"Next": "WaitForInstance"
},
"WaitForInstance": {
"Type": "Wait",
"Seconds": 60,
"Next": "CheckInstanceStatus"
},
"CheckInstanceStatus": {
"Type": "Task",
"Resource": "arn:aws:states:::aws-sdk:ec2:describeInstanceStatus",
"Parameters": {
"InstanceIds.$": "States.Array($.instanceInfo.Instances[0].InstanceId)"
},
"ResultPath": "$.status",
"Next": "IsInstanceReady"
},
"IsInstanceReady": {
"Type": "Choice",
"Choices": [
{
"Variable": "$.status.InstanceStatuses[0].InstanceState.Name",
"StringEquals": "running",
"Next": "RunProcessing"
}
],
"Default": "WaitForInstance"
},
"RunProcessing": {
"Type": "Task",
"Resource": "arn:aws:states:::ssm:sendCommand.sync",
"Parameters": {
"DocumentName": "AWS-RunShellScript",
"InstanceIds.$": "States.Array($.instanceInfo.Instances[0].InstanceId)",
"Parameters": {
"commands": ["cd /opt/app && python3 process.py"]
}
},
"ResultPath": "$.processingResult",
"Next": "TerminateInstance",
"Catch": [
{
"ErrorEquals": ["States.ALL"],
"Next": "TerminateInstance",
"ResultPath": "$.error"
}
]
},
"TerminateInstance": {
"Type": "Task",
"Resource": "arn:aws:states:::aws-sdk:ec2:terminateInstances",
"Parameters": {
"InstanceIds.$": "States.Array($.instanceInfo.Instances[0].InstanceId)"
},
"End": true
}
}
}
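The WaitForInstance → CheckInstanceStatus → IsInstanceReady cycle above is just a poll-until-ready loop. The sketch below expresses that logic in plain Python; `describe_state` is a hypothetical stand-in for the DescribeInstanceStatus call, injected so the loop can be tested without AWS access.

```python
# Poll-until-running loop mirroring the state machine's wait/check/choice cycle.
from typing import Callable

def wait_until_running(describe_state: Callable[[], str], max_polls: int = 10) -> bool:
    """Return True once the instance reports 'running', False if we give up."""
    for _ in range(max_polls):
        if describe_state() == "running":
            return True
        # In the real state machine, this is the 60-second Wait state.
    return False

# Stubbed status sequence: pending twice, then running.
states = iter(["pending", "pending", "running"])
assert wait_until_running(lambda: next(states)) is True
```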
9-3. Kubernetes Jobs + Karpenter
Karpenter is an AWS-optimized Kubernetes node autoscaler:
# karpenter-nodepool.yaml
apiVersion: karpenter.sh/v1
kind: NodePool
metadata:
name: batch-processing
spec:
template:
spec:
requirements:
- key: karpenter.sh/capacity-type
operator: In
values: ['spot', 'on-demand']
- key: node.kubernetes.io/instance-type
operator: In
values:
- c5.xlarge
- c5a.xlarge
- c5.2xlarge
- c6i.xlarge
- c6i.2xlarge
- m5.xlarge
- m5a.xlarge
- key: topology.kubernetes.io/zone
operator: In
values:
- us-east-1a
- us-east-1b
nodeClassRef:
group: karpenter.k8s.aws
kind: EC2NodeClass
name: default
limits:
cpu: '100'
memory: 400Gi
disruption:
consolidationPolicy: WhenEmptyOrUnderutilized
consolidateAfter: 30s
---
apiVersion: karpenter.k8s.aws/v1
kind: EC2NodeClass
metadata:
name: default
spec:
amiSelectorTerms:
- alias: al2023@latest
subnetSelectorTerms:
- tags:
Tier: private
securityGroupSelectorTerms:
- tags:
kubernetes.io/cluster/my-cluster: owned
instanceProfile: KarpenterNodeInstanceProfile
# batch-job.yaml
apiVersion: batch/v1
kind: Job
metadata:
name: data-processing-job
spec:
parallelism: 10
completions: 100
backoffLimit: 3
template:
metadata:
labels:
app: data-processor
spec:
containers:
- name: processor
image: my-registry/data-processor:v1.2
resources:
requests:
cpu: '2'
memory: '4Gi'
limits:
cpu: '4'
memory: '8Gi'
env:
- name: BATCH_SIZE
value: '1000'
restartPolicy: OnFailure
nodeSelector:
karpenter.sh/capacity-type: spot
tolerations:
- key: 'karpenter.sh/disruption'
operator: 'Exists'
Karpenter vs Cluster Autoscaler comparison:
| Aspect | Karpenter | Cluster Autoscaler |
|---|---|---|
| Node provisioning speed | Seconds (direct EC2 API calls) | Minutes (via ASG) |
| Instance type selection | Automatic based on workload | Only types defined in ASG |
| Bin packing | Automatic optimization | Limited |
| Spot integration | Native support | ASG Mixed Instances |
| Scale down | Immediate (removes idle nodes after 30s) | Default 10-min wait |
| AWS dependency | AWS only | Multi-cloud |
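Karpenter's "automatic instance type selection based on workload" row boils down to: among the types that fit the pending pods, pick the cheapest. Here is a toy version of that decision; the type list and spot prices are illustrative assumptions, and real Karpenter also weighs capacity pools, zones, and consolidation.

```python
# Toy version of Karpenter's cheapest-fitting-type selection.
instance_types = [
    # (name, vcpu, memory_gib, assumed_spot_price_per_hour)
    ("c5.xlarge",  4, 8,  0.065),
    ("c5.2xlarge", 8, 16, 0.130),
    ("m5.xlarge",  4, 16, 0.070),
]

def pick_instance(cpu_request: float, mem_request_gib: float) -> str:
    candidates = [
        (price, name)
        for name, vcpu, mem, price in instance_types
        if vcpu >= cpu_request and mem >= mem_request_gib
    ]
    if not candidates:
        raise ValueError("no instance type fits the request")
    return min(candidates)[1]  # cheapest type that fits

assert pick_instance(2, 4) == "c5.xlarge"   # everything fits, c5.xlarge is cheapest
assert pick_instance(2, 12) == "m5.xlarge"  # needs more memory, m5 wins on price
```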
10. Monitoring and Alarms
CloudWatch Alarms + SNS
# ASG-related alarms
resource "aws_cloudwatch_metric_alarm" "asg_high_cpu" {
alarm_name = "asg-high-cpu"
comparison_operator = "GreaterThanThreshold"
evaluation_periods = 3
metric_name = "CPUUtilization"
namespace = "AWS/EC2"
period = 60
statistic = "Average"
threshold = 85
alarm_description = "ASG CPU utilization exceeded 85%"
dimensions = {
AutoScalingGroupName = aws_autoscaling_group.web.name
}
alarm_actions = [aws_sns_topic.alerts.arn]
ok_actions = [aws_sns_topic.alerts.arn]
}
# Spot Interruption count alarm
resource "aws_cloudwatch_metric_alarm" "spot_interruptions" {
alarm_name = "spot-interruption-count"
comparison_operator = "GreaterThanThreshold"
evaluation_periods = 1
metric_name = "SpotInterruptionCount"
namespace = "Custom/SpotMetrics"
period = 300
statistic = "Sum"
threshold = 3
alarm_description = "More than 3 Spot interruptions in 5 minutes"
alarm_actions = [aws_sns_topic.critical_alerts.arn]
}
# ALB 5xx error rate alarm
resource "aws_cloudwatch_metric_alarm" "alb_5xx" {
alarm_name = "alb-5xx-error-rate"
comparison_operator = "GreaterThanThreshold"
evaluation_periods = 2
threshold = 5
metric_query {
id = "error_rate"
expression = "(errors / requests) * 100"
label = "5xx Error Rate"
return_data = true
}
metric_query {
id = "errors"
metric {
metric_name = "HTTPCode_Target_5XX_Count"
namespace = "AWS/ApplicationELB"
period = 60
stat = "Sum"
dimensions = {
LoadBalancer = aws_lb.web.arn_suffix
}
}
}
metric_query {
id = "requests"
metric {
metric_name = "RequestCount"
namespace = "AWS/ApplicationELB"
period = 60
stat = "Sum"
dimensions = {
LoadBalancer = aws_lb.web.arn_suffix
}
}
}
alarm_actions = [aws_sns_topic.critical_alerts.arn]
}
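The metric math in the alarm above is `(errors / requests) * 100`, evaluated per 60-second period, firing only after 2 consecutive breaching datapoints. A minimal sketch of that evaluation logic:

```python
# Error-rate metric math plus the consecutive-breach rule CloudWatch applies.
def error_rate(errors: int, requests: int) -> float:
    return (errors / requests) * 100 if requests else 0.0

def alarm_fires(rates, threshold=5.0, evaluation_periods=2):
    """True if `evaluation_periods` consecutive datapoints breach the threshold."""
    breach_run = 0
    for rate in rates:
        breach_run = breach_run + 1 if rate > threshold else 0
        if breach_run >= evaluation_periods:
            return True
    return False

assert error_rate(30, 1000) == 3.0
assert alarm_fires([3.0, 6.0, 7.0]) is True    # two consecutive breaches
assert alarm_fires([6.0, 3.0, 6.0]) is False   # never two in a row
```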
Grafana + Prometheus on EKS
Monitoring stack using Prometheus and Grafana in an EKS environment:
# prometheus-values.yaml (Helm)
prometheus:
prometheusSpec:
retention: 15d
storageSpec:
volumeClaimTemplate:
spec:
storageClassName: gp3
accessModes: ['ReadWriteOnce']
resources:
requests:
storage: 50Gi
additionalScrapeConfigs:
- job_name: karpenter
kubernetes_sd_configs:
- role: endpoints
namespaces:
names:
- karpenter
relabel_configs:
- source_labels: [__meta_kubernetes_endpoint_port_name]
regex: http-metrics
action: keep
grafana:
adminPassword: 'secure-password' # placeholder -- source from a secret store in real deployments
persistence:
enabled: true
size: 10Gi
dashboardProviders:
dashboardproviders.yaml:
apiVersion: 1
providers:
- name: default
orgId: 1
folder: ''
type: file
options:
path: /var/lib/grafana/dashboards/default
11. Quiz
Q1: What is the main reason Target Tracking Scaling is recommended over Simple Scaling in Auto Scaling Groups?
Answer: Target Tracking automatically maintains a target metric value, balances scale-in and scale-out, and eliminates the need to manually manage separate CloudWatch alarms.
Simple Scaling cannot perform additional scaling during the cooldown period and requires manually defining scaling amounts. Target Tracking lets AWS automatically perform optimal scaling, reducing operational overhead and enabling more accurate scaling.
Q2: In a Mixed Instances Policy with on_demand_base_capacity set to 2 and on_demand_percentage_above_base_capacity set to 20, if the ASG needs 12 total instances, what is the On-Demand to Spot ratio?
Answer: 4 On-Demand, 8 Spot
Calculation:
- Base On-Demand: 2
- Additional needed: 12 - 2 = 10
- Additional On-Demand (20%): 10 x 0.2 = 2
- Additional Spot (80%): 10 x 0.8 = 8
- Total On-Demand: 2 + 2 = 4, Total Spot: 8
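The Q2 calculation generalizes to a small function. Note the rounding here is a simplification; when the percentage does not divide evenly, the ASG rounds the On-Demand share in AWS's favor.

```python
# Mixed Instances Policy split: base On-Demand capacity, then a percentage
# of everything above the base goes On-Demand and the rest goes Spot.
def od_spot_split(total: int, od_base: int, od_pct_above_base: int):
    above_base = max(total - od_base, 0)
    extra_od = round(above_base * od_pct_above_base / 100)
    return od_base + extra_od, above_base - extra_od  # (on_demand, spot)

assert od_spot_split(12, 2, 20) == (4, 8)  # the Q2 scenario
```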
Q3: How many minutes before interruption does AWS send a warning for Spot Instances? What are 2 methods to detect this warning?
Answer: AWS sends a warning 2 minutes before interruption.
Two detection methods:
- EC2 Metadata Polling: Periodically check the http://169.254.169.254/latest/meta-data/spot/instance-action endpoint from within the instance
- EventBridge Rule: Receive EC2 Spot Instance Interruption Warning events through EventBridge and process them with Lambda or other targets
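A minimal sketch of the first method. `fetch` is a hypothetical stand-in for the HTTP GET against the instance-action endpoint (which returns nothing until AWS schedules an interruption), injected so the loop can be exercised without a real instance.

```python
# Metadata-polling loop for Spot interruption notices.
SPOT_ACTION_URL = "http://169.254.169.254/latest/meta-data/spot/instance-action"

def poll_for_interruption(fetch, max_polls: int):
    """Return the interruption notice as soon as one appears, else None."""
    for _ in range(max_polls):
        notice = fetch(SPOT_ACTION_URL)
        if notice is not None:
            return notice  # e.g. {"action": "terminate", "time": "..."}
        # A real poller would sleep a few seconds here -- well under the
        # 2-minute warning window.
    return None

# Stub: no notice for two polls, then a terminate notice.
responses = iter([None, None, {"action": "terminate"}])
assert poll_for_interruption(lambda url: next(responses), max_polls=5) == {"action": "terminate"}
```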
Q4: What is the fundamental reason Karpenter provisions nodes faster than Cluster Autoscaler?
Answer: Karpenter calls the EC2 API directly to create nodes, while Cluster Autoscaler goes through ASG (Auto Scaling Group) to create nodes.
Cluster Autoscaler detects Pending Pods, modifies the ASG desired count, and then ASG creates instances according to the Launch Template, which is an indirect path. Karpenter analyzes workload requirements, selects the optimal instance type, and creates nodes directly via the EC2 Fleet API, so nodes are ready within seconds.
Q5: When managing Terraform State with S3 + DynamoDB, what is the role of DynamoDB?
Answer: DynamoDB handles State Locking.
When multiple team members run terraform apply simultaneously, the State file can conflict. DynamoDB creates lock records in the table to ensure only one person can modify the State at a time. This prevents race conditions and ensures infrastructure consistency.
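The locking mechanism boils down to a conditional write: Terraform's DynamoDB put only succeeds if no lock item exists yet (an `attribute_not_exists(LockID)` condition). A plain dict stands in for the table in this sketch, so the behavior can be shown without AWS access.

```python
# Simulate DynamoDB-style state locking with a conditional put.
lock_table = {}

def acquire_lock(lock_id: str, owner: str) -> bool:
    """Succeeds only if no lock exists -- mirrors attribute_not_exists(LockID)."""
    if lock_id in lock_table:
        return False
    lock_table[lock_id] = owner
    return True

def release_lock(lock_id: str, owner: str) -> bool:
    """Only the lock holder may release it."""
    if lock_table.get(lock_id) == owner:
        del lock_table[lock_id]
        return True
    return False

assert acquire_lock("my-app/terraform.tfstate", "alice") is True
assert acquire_lock("my-app/terraform.tfstate", "bob") is False   # blocked
assert release_lock("my-app/terraform.tfstate", "alice") is True
assert acquire_lock("my-app/terraform.tfstate", "bob") is True    # free again
```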
12. References
- AWS Auto Scaling Official Documentation
- EC2 Spot Instances Best Practices
- AWS Well-Architected Framework - Cost Optimization Pillar
- Terraform AWS Provider Documentation
- AWS CDK Developer Guide
- Karpenter Documentation
- AWS Compute Optimizer User Guide
- AWS Savings Plans User Guide
- AWS Fault Injection Simulator User Guide
- AWS Step Functions Developer Guide
- Amazon ECS on Fargate Best Practices
- AWS Lambda Pricing
- Spot Instance Advisor
- AWS CloudWatch User Guide
- Terraform Best Practices
- AWS re:Invent - Cost Optimization at Scale
- Karpenter vs Cluster Autoscaler Deep Dive