The Complete Guide to Dynamic Infrastructure on AWS: Practical Strategies for Cutting Costs by up to 90% with Auto Scaling, Spot Instances, and IaC

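1. Why Dynamic Infrastructure? (Static vs. Dynamic)
2. Mastering EC2 Auto Scaling
3. Spot Instances Masterclass
4. Serverless Dynamic Infrastructure
5. Building Dynamic Infrastructure with Terraform
6. Building Dynamic Infrastructure with AWS CDK
7. Practical Cost Optimization Strategies
8. High-Availability Architecture
9. Three Patterns for Creating EC2 Instances on Demand
10. Monitoring and Alarms
- CloudWatch Alarms + SNS
- Grafana + Prometheus on EKS
11. Quiz
12. References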

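1. Why Dynamic Infrastructure? (Static vs. Dynamic)

The Fundamental Problem with Fixed Servers

Many companies still provision servers for peak traffic. If Black Friday brings 10x the usual traffic, they run 10x the servers all year round. In effect, they are paying for idle capacity more than 80% of the time.

Consider a few typical cases:

E-commerce company A: traffic rises 5x only during peak hours (8-10 p.m.) → servers sit largely idle for the remaining 22 hours
Media company B: weekend traffic is 3x weekday traffic → about 66% of server capacity is idle on weekdays
SaaS company C: batch jobs pile up at month-end settlement → over-provisioned for roughly 27 days of each month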
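The idle share in cases like these can be estimated with simple arithmetic. The sketch below is illustrative only: it assumes a baseline load of 1 unit, a flat peak, and capacity sized for the peak throughout the period. The function name `idle_fraction` is hypothetical.

```python
def idle_fraction(peak_hours: float, peak_multiplier: float,
                  period_hours: float = 24.0) -> float:
    """Fraction of peak-sized capacity that sits idle over one period.

    Assumes a baseline load of 1 unit and a flat peak of `peak_multiplier`
    units lasting `peak_hours` out of `period_hours`.
    """
    provisioned = peak_multiplier * period_hours
    used = peak_multiplier * peak_hours + 1.0 * (period_hours - peak_hours)
    return 1.0 - used / provisioned


# Company A: 5x traffic for 2 of 24 hours
print(f"Company A idle share per day: {idle_fraction(2, 5):.0%}")
# Company B: 3x traffic on 2 of 7 days (period measured in days here)
print(f"Company B idle share per week: {idle_fraction(2, 3, period_hours=7):.0%}")
```

Under these assumptions, company A wastes roughly three quarters of its capacity. Company B's weekly average is lower than its 66% weekday figure because the weekend peak uses the fleet fully.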

With patterns like these, static infrastructure causes the following problems:

Problem	Impact	Waste
Peak-based provisioning	Excess resources during normal load	60-80%
Manual scaling	Slow response to traffic spikes	Downtime
Infrastructure is hard to change	Delayed feature deployments	Opportunity cost
Single point of failure	Service outage when a server fails	Lost revenue

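The Core Value of Dynamic Infrastructure

Dynamic infrastructure is an architecture in which resources scale out and scale in automatically in response to the workload.

Three core principles:

Elasticity: traffic increases → capacity expands automatically; traffic decreases → capacity shrinks automatically
Cost Optimization: pay only for what you use
High Availability: automatic recovery when failures occur

The Three Pillars of AWS Cost Optimization

Cost optimization = Right-sizing + Auto Scaling + Spot Instances
                    (right size)   (auto-scaling)  (discounted instances)

Right-sizing: choose an instance type that fits the workload (a t3.medium may be enough where an m5.xlarge is in use)
Auto Scaling: adjust the number of instances automatically to match traffic
Spot Instances: use spare capacity at up to 90% off the On-Demand price

Combined, these three can reduce monthly cloud costs by 60-90%, depending on the workload.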

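2. Mastering EC2 Auto Scaling

Core Components of an ASG

An Auto Scaling Group (ASG) is built from three core components:

Launch Template: defines which instances to create (AMI, instance type, security groups, key pair)
Scaling Policy: rules for when and how to scale
Health Check: monitors instance status and replaces unhealthy instances

Creating a Launch Template

The following JSON is the input for creating a Launch Template with the AWS CLI, for example via `aws ec2 create-launch-template --cli-input-json file://template.json` (the file name is illustrative). Note that `UserData` must be base64-encoded: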

{
  "LaunchTemplateName": "web-server-template",
  "LaunchTemplateData": {
    "ImageId": "ami-0abcdef1234567890",
    "InstanceType": "t3.medium",
    "KeyName": "my-key-pair",
    "SecurityGroupIds": ["sg-0123456789abcdef0"],
    "UserData": "IyEvYmluL2Jhc2gKeXVtIHVwZGF0ZSAteQp5dW0gaW5zdGFsbCAteSBodHRwZA==",
    "TagSpecifications": [
      {
        "ResourceType": "instance",
        "Tags": [
          {
            "Key": "Environment",
            "Value": "production"
          },
          {
            "Key": "Project",
            "Value": "web-app"
          }
        ]
      }
    ],
    "BlockDeviceMappings": [
      {
        "DeviceName": "/dev/xvda",
        "Ebs": {
          "VolumeSize": 30,
          "VolumeType": "gp3",
          "Encrypted": true
        }
      }
    ]
  }
}

The same Launch Template defined in Terraform:

resource "aws_launch_template" "web_server" {
  name_prefix   = "web-server-"
  image_id      = data.aws_ami.amazon_linux_2.id
  instance_type = "t3.medium"
  key_name      = "my-key-pair"

  vpc_security_group_ids = [aws_security_group.web.id]

  user_data = base64encode(<<-EOF
    #!/bin/bash
    yum update -y
    yum install -y httpd
    systemctl start httpd
    systemctl enable httpd
    echo "Hello from $(hostname)" > /var/www/html/index.html
  EOF
  )

  block_device_mappings {
    device_name = "/dev/xvda"
    ebs {
      volume_size = 30
      volume_type = "gp3"
      encrypted   = true
    }
  }

  tag_specifications {
    resource_type = "instance"
    tags = {
      Environment = "production"
      Project     = "web-app"
    }
  }

  lifecycle {
    create_before_destroy = true
  }
}

Four Scaling Policy Strategies

2-1. Simple Scaling

The most basic approach: one CloudWatch alarm is tied to one scaling adjustment.

resource "aws_autoscaling_policy" "scale_up" {
  name                   = "scale-up"
  autoscaling_group_name = aws_autoscaling_group.web.name
  adjustment_type        = "ChangeInCapacity"
  scaling_adjustment     = 2
  cooldown               = 300
}

resource "aws_cloudwatch_metric_alarm" "high_cpu" {
  alarm_name          = "high-cpu-alarm"
  comparison_operator = "GreaterThanThreshold"
  evaluation_periods  = 2
  metric_name         = "CPUUtilization"
  namespace           = "AWS/EC2"
  period              = 120
  statistic           = "Average"
  threshold           = 80

  dimensions = {
    AutoScalingGroupName = aws_autoscaling_group.web.name
  }

  alarm_actions = [aws_autoscaling_policy.scale_up.arn]
}

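Limitation: no further scaling can happen during the cooldown period that follows a scaling action, so the response to sudden traffic spikes is slow.

2-2. Step Scaling

Applies adjustments of different sizes depending on how far the metric has breached the alarm threshold.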

resource "aws_autoscaling_policy" "step_scaling" {
  name                   = "step-scaling-policy"
  autoscaling_group_name = aws_autoscaling_group.web.name
  policy_type            = "StepScaling"
  adjustment_type        = "ChangeInCapacity"

  step_adjustment {
    scaling_adjustment          = 1
    metric_interval_lower_bound = 0
    metric_interval_upper_bound = 20
  }

  step_adjustment {
    scaling_adjustment          = 3
    metric_interval_lower_bound = 20
    metric_interval_upper_bound = 40
  }

  step_adjustment {
    scaling_adjustment          = 5
    metric_interval_lower_bound = 40
  }
}

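This policy adds 1, 3, or 5 instances depending on how far CPU utilization exceeds the alarm threshold. The `metric_interval_*` bounds are offsets from that threshold, not absolute metric values.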
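To make the offset semantics concrete, here is a minimal sketch of how a breach maps to an adjustment. It assumes an alarm threshold of 50% CPU, which is an illustrative value; the policy itself does not state one. `STEPS` and `step_adjustment` are hypothetical names, not part of any AWS API.

```python
# (lower_offset, upper_offset, instances_to_add); offsets are relative to the
# alarm threshold. None means "no upper bound".
STEPS = [(0, 20, 1), (20, 40, 3), (40, None, 5)]


def step_adjustment(metric: float, threshold: float) -> int:
    """Return how many instances the step policy would add for a metric value."""
    breach = metric - threshold
    if breach < 0:
        return 0  # below the threshold: the alarm is not firing
    for lower, upper, adjustment in STEPS:
        # Lower bound is inclusive, upper bound is exclusive
        if breach >= lower and (upper is None or breach < upper):
            return adjustment
    return 0


for cpu in (45, 65, 75, 95):
    print(f"CPU {cpu}% -> add {step_adjustment(cpu, threshold=50)} instance(s)")
```

With a threshold of 80% instead, CPU utilization could never breach by more than 20 points, so the larger steps would never apply. Choosing the threshold and the step bounds together matters.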

2-3. Target Tracking Scaling

This is the recommended approach for most workloads. The ASG adjusts the number of instances automatically so that a chosen metric stays at its target value.

resource "aws_autoscaling_policy" "target_tracking" {
  name                   = "target-tracking-cpu"
  autoscaling_group_name = aws_autoscaling_group.web.name
  policy_type            = "TargetTrackingScaling"

  target_tracking_configuration {
    predefined_metric_specification {
      predefined_metric_type = "ASGAverageCPUUtilization"
    }
    target_value = 70.0
  }
}

# Target Tracking based on ALB request count per target
resource "aws_autoscaling_policy" "target_tracking_alb" {
  name                   = "target-tracking-alb"
  autoscaling_group_name = aws_autoscaling_group.web.name
  policy_type            = "TargetTrackingScaling"

  target_tracking_configuration {
    predefined_metric_specification {
      predefined_metric_type = "ALBRequestCountPerTarget"
      # Format: app/<alb-name>/<alb-id>/targetgroup/<tg-name>/<tg-id> (load balancer first)
      resource_label         = "${aws_lb.web.arn_suffix}/${aws_lb_target_group.web.arn_suffix}"
    }
    target_value = 1000.0
  }
}

Why Target Tracking is recommended:

No need to manage alarms and policies by hand
Scale-out and scale-in are balanced automatically
Multiple Target Tracking policies can be applied at the same time
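The intuition behind Target Tracking can be sketched as a proportional calculation. This is a simplification for illustration, not AWS's actual algorithm, which also accounts for warm-up, cooldowns, and more conservative scale-in. The function name is hypothetical.

```python
import math


def estimate_desired_capacity(current: int, metric: float, target: float,
                              min_size: int, max_size: int) -> int:
    """Rough estimate: scale capacity in proportion to metric / target.

    If 4 instances average 90% CPU and the target is 70%, about
    4 * 90 / 70 = 5.14 instances are needed, rounded up to 6.
    """
    raw = math.ceil(current * metric / target)
    return max(min_size, min(max_size, raw))


print(estimate_desired_capacity(4, metric=90, target=70, min_size=1, max_size=20))
print(estimate_desired_capacity(4, metric=35, target=70, min_size=1, max_size=20))
```

The clamp to `min_size` and `max_size` mirrors the way an ASG never scales outside its configured bounds.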

2-4. Predictive Scaling

A machine-learning model analyzes up to the past 14 days of traffic patterns, forecasts upcoming load, and provisions instances ahead of time.

resource "aws_autoscaling_policy" "predictive" {
  name                   = "predictive-scaling"
  autoscaling_group_name = aws_autoscaling_group.web.name
  policy_type            = "PredictiveScaling"

  predictive_scaling_configuration {
    metric_specification {
      target_value = 70

      predefined_scaling_metric_specification {
        predefined_metric_type = "ASGAverageCPUUtilization"
        resource_label         = ""
      }

      predefined_load_metric_specification {
        predefined_metric_type = "ASGTotalCPUUtilization"
        resource_label         = ""
      }
    }

    mode                          = "ForecastAndScale"
    scheduling_buffer_time        = 300
    max_capacity_breach_behavior  = "HonorMaxCapacity"
  }
}

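Use cases for Predictive Scaling:

Traffic spikes at the same time every day (commute hours, lunch break)
Services with clear weekly or monthly recurring patterns
Works well alongside Target Tracking, which covers whatever the forecast misses

Configuring Cooldown Period and Warm-up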

resource "aws_autoscaling_group" "web" {
  name                = "web-asg"
  desired_capacity    = 2
  max_size            = 20
  min_size            = 1
  vpc_zone_identifier = [aws_subnet.private_a.id, aws_subnet.private_c.id]

  launch_template {
    id      = aws_launch_template.web_server.id
    version = "$Latest"
  }

  # Default cooldown: wait time (seconds) after a scaling action
  default_cooldown = 300

  # Instance warm-up: time (seconds) until a new instance is fully ready
  default_instance_warmup = 120

  health_check_type         = "ELB"
  health_check_grace_period = 300

  tag {
    key                 = "Name"
    value               = "web-server"
    propagate_at_launch = true
  }
}

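Difference between cooldown and warm-up:

Cooldown: how long to wait after a scaling action before the next one (prevents over-scaling)
Warm-up: how long a newly launched instance needs before it is ready for traffic (until then it is excluded from the ASG's aggregated metrics)

Mixed Instances Policy (On-Demand + Spot)

A key strategy for cutting costs dramatically while maintaining stability: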

resource "aws_autoscaling_group" "web_mixed" {
  name                = "web-mixed-asg"
  desired_capacity    = 6
  max_size            = 30
  min_size            = 2
  vpc_zone_identifier = [
    aws_subnet.private_a.id,
    aws_subnet.private_b.id,
    aws_subnet.private_c.id
  ]

  mixed_instances_policy {
    launch_template {
      launch_template_specification {
        launch_template_id = aws_launch_template.web_server.id
        version            = "$Latest"
      }

      override {
        instance_type = "t3.medium"
      }
      override {
        instance_type = "t3a.medium"
      }
      override {
        instance_type = "m5.large"
      }
      override {
        instance_type = "m5a.large"
      }
    }

    instances_distribution {
      on_demand_base_capacity                  = 2
      on_demand_percentage_above_base_capacity = 20
      spot_allocation_strategy                 = "capacity-optimized"
      spot_max_price                           = ""  # empty string: pay up to the On-Demand price
    }
  }
}

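What this configuration means:

Base of 2 instances: always On-Demand (guaranteed stability)
80% of additional instances: Spot Instances (cost savings)
20% of additional instances: On-Demand (extra stability)
Multiple instance types: if Spot capacity for one type runs short, another type is selected automatically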
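The resulting On-Demand/Spot split for a given desired capacity can be estimated as follows. This is a sketch under one stated assumption: fractional On-Demand counts are rounded up. Auto Scaling's exact rounding is not confirmed here, and `split_capacity` is a hypothetical helper.

```python
def split_capacity(desired: int, on_demand_base: int,
                   on_demand_pct_above_base: int) -> tuple[int, int]:
    """Return (on_demand, spot) instance counts for a desired capacity.

    ASSUMPTION: the On-Demand share above the base is rounded up.
    """
    base = min(desired, on_demand_base)
    above = desired - base
    # Integer ceiling of above * pct / 100, avoiding float rounding issues
    on_demand_above = -(-above * on_demand_pct_above_base // 100)
    on_demand = base + on_demand_above
    return on_demand, desired - on_demand


for desired in (2, 12, 30):
    od, spot = split_capacity(desired, on_demand_base=2, on_demand_pct_above_base=20)
    print(f"desired={desired}: on_demand={od}, spot={spot}")
```

At a desired capacity of 12, for example, 2 base instances plus 20% of the remaining 10 gives 4 On-Demand and 8 Spot instances.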

Lifecycle Hooks

Lifecycle hooks let you run custom actions when an instance launches or terminates:

resource "aws_autoscaling_lifecycle_hook" "launch_hook" {
  name                   = "launch-setup-hook"
  autoscaling_group_name = aws_autoscaling_group.web.name
  lifecycle_transition   = "autoscaling:EC2_INSTANCE_LAUNCHING"
  heartbeat_timeout      = 600
  default_result         = "CONTINUE"

  notification_target_arn = aws_sns_topic.asg_notifications.arn
  role_arn                = aws_iam_role.asg_hook_role.arn
}

resource "aws_autoscaling_lifecycle_hook" "terminate_hook" {
  name                   = "terminate-cleanup-hook"
  autoscaling_group_name = aws_autoscaling_group.web.name
  lifecycle_transition   = "autoscaling:EC2_INSTANCE_TERMINATING"
  heartbeat_timeout      = 300
  default_result         = "CONTINUE"

  notification_target_arn = aws_sns_topic.asg_notifications.arn
  role_arn                = aws_iam_role.asg_hook_role.arn
}

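Use cases for Lifecycle Hooks:

Launch Hook: wait until a configuration management tool (Ansible, Chef) has finished initial setup
Terminate Hook: back up logs, drain connections, and deregister from service discovery before the instance terminates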
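Whatever runs during a hook should tell Auto Scaling when it is done, rather than waiting for `heartbeat_timeout` to expire. The sketch below builds the arguments for the `CompleteLifecycleAction` API from a lifecycle notification. The helper names are hypothetical, and the notification is assumed to carry the standard `LifecycleHookName`, `AutoScalingGroupName`, `LifecycleActionToken`, and `EC2InstanceId` fields.

```python
def build_completion(detail: dict, succeeded: bool) -> dict:
    """Build keyword arguments for autoscaling.complete_lifecycle_action."""
    return {
        "LifecycleHookName": detail["LifecycleHookName"],
        "AutoScalingGroupName": detail["AutoScalingGroupName"],
        "LifecycleActionToken": detail["LifecycleActionToken"],
        "InstanceId": detail["EC2InstanceId"],
        # CONTINUE lets the transition proceed; ABANDON aborts it
        "LifecycleActionResult": "CONTINUE" if succeeded else "ABANDON",
    }


def complete_hook(detail: dict, succeeded: bool) -> None:
    """Call the API; boto3 is imported lazily so the builder is testable offline."""
    import boto3

    boto3.client("autoscaling").complete_lifecycle_action(
        **build_completion(detail, succeeded)
    )


# Illustrative notification payload (all values are made up)
example = {
    "LifecycleHookName": "launch-setup-hook",
    "AutoScalingGroupName": "web-asg",
    "LifecycleActionToken": "00000000-0000-0000-0000-000000000000",
    "EC2InstanceId": "i-0123456789abcdef0",
}
print(build_completion(example, succeeded=True))
```

For a launch hook, sending `ABANDON` when setup fails causes the instance to be terminated and replaced rather than put into service half-configured.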

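3. Spot Instances Masterclass

What Are Spot Instances?

Spot Instances offer unused EC2 capacity at discounts of up to 90% off the On-Demand price.

Price comparison (m5.xlarge, us-east-1; Spot prices fluctuate, so treat these figures as illustrative):

Purchase option	Hourly price	Monthly cost (730 hours)	Discount
On-Demand	$0.192	$140.16	-
Reserved (1 year, all upfront)	$0.120	$87.60	37%
Savings Plan (1 year)	$0.125	$91.25	35%
Spot (average)	$0.058	$42.34	70%
Spot (lowest)	$0.019	$13.87	90%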
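The monthly cost and discount columns follow directly from the hourly prices. A small sketch of the arithmetic, using only the figures from the table (the function names are hypothetical):

```python
HOURS_PER_MONTH = 730
ON_DEMAND_HOURLY = 0.192  # m5.xlarge On-Demand price from the table


def monthly_cost(hourly: float) -> float:
    """Monthly cost in dollars for an instance running 730 hours."""
    return round(hourly * HOURS_PER_MONTH, 2)


def discount_pct(hourly: float, baseline: float = ON_DEMAND_HOURLY) -> float:
    """Discount relative to the On-Demand price, in percent."""
    return round((1 - hourly / baseline) * 100, 1)


for label, hourly in [("Reserved", 0.120), ("Savings Plan", 0.125),
                      ("Spot (average)", 0.058), ("Spot (lowest)", 0.019)]:
    print(f"{label}: ${monthly_cost(hourly)}/month, {discount_pct(hourly)}% off")
```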

Analyzing Spot Price History

You can query Spot price history with the AWS CLI:

aws ec2 describe-spot-price-history \
  --instance-types m5.xlarge m5a.xlarge m5d.xlarge \
  --product-descriptions "Linux/UNIX" \
  --start-time "2026-03-16T00:00:00" \
  --end-time "2026-03-23T00:00:00" \
  --query 'SpotPriceHistory[*].[InstanceType,AvailabilityZone,SpotPrice,Timestamp]' \
  --output table

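Strategies for more stable Spot pricing and capacity:

Specify multiple instance types (m5.xlarge, m5a.xlarge, m5d.xlarge, m5n.xlarge)
Use multiple Availability Zones (AZs)
Use the capacity-optimized allocation strategy (allocates from the pools with the most spare capacity)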
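When comparing pools (an instance type in one AZ), it can help to summarize the price history per pool. The sketch below works on records shaped like the `SpotPriceHistory` entries that the CLI returns with `--output json`. The sample records are made up for illustration, and `summarize_pools` is a hypothetical helper.

```python
from collections import defaultdict
from statistics import mean


def summarize_pools(history: list[dict]) -> dict:
    """Group price records by (instance type, AZ); return {pool: (mean, min, max)}."""
    prices = defaultdict(list)
    for record in history:
        pool = (record["InstanceType"], record["AvailabilityZone"])
        prices[pool].append(float(record["SpotPrice"]))  # prices arrive as strings
    return {pool: (mean(v), min(v), max(v)) for pool, v in prices.items()}


# Made-up sample records, not real price data
sample = [
    {"InstanceType": "m5.xlarge", "AvailabilityZone": "us-east-1a", "SpotPrice": "0.0600"},
    {"InstanceType": "m5.xlarge", "AvailabilityZone": "us-east-1a", "SpotPrice": "0.0640"},
    {"InstanceType": "m5a.xlarge", "AvailabilityZone": "us-east-1b", "SpotPrice": "0.0500"},
]

for pool, (avg, low, high) in sorted(summarize_pools(sample).items()):
    print(pool, f"avg={avg:.4f} min={low:.4f} max={high:.4f}")
```

A narrow min-max range suggests a steadier pool, although price history alone does not predict interruption rates.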

Handling Spot Interruptions

When AWS needs the capacity back, a Spot Instance can be reclaimed with a two-minute warning.

Detecting Interruptions by Polling Instance Metadata

#!/bin/bash
# spot-interruption-handler.sh
# Polls instance metadata (IMDSv2) for a Spot interruption notice, then shuts down gracefully.
# TARGET_GROUP_ARN is expected to be set in the environment.

IMDS="http://169.254.169.254/latest"

while true; do
  # Request a fresh token on every pass so a long-running loop never uses an expired one
  METADATA_TOKEN=$(curl -s -X PUT "$IMDS/api/token" \
    -H "X-aws-ec2-metadata-token-ttl-seconds: 60")

  # The endpoint returns 404 until an interruption is scheduled; -f suppresses that error body
  INTERRUPTION=$(curl -sf -H "X-aws-ec2-metadata-token: $METADATA_TOKEN" \
    "$IMDS/meta-data/spot/instance-action" 2>/dev/null)

  if [ -n "$INTERRUPTION" ] && echo "$INTERRUPTION" | grep -q "action"; then
    echo "Spot interruption detected! Starting graceful shutdown..."

    # 1. Remove the instance from the ALB (stop receiving new requests)
    INSTANCE_ID=$(curl -s -H "X-aws-ec2-metadata-token: $METADATA_TOKEN" \
      "$IMDS/meta-data/instance-id")

    aws elbv2 deregister-targets \
      --target-group-arn "$TARGET_GROUP_ARN" \
      --targets "Id=$INSTANCE_ID"

    # 2. Wait for in-flight requests to finish (connection draining)
    sleep 30

    # 3. Back up logs
    aws s3 sync /var/log/app/ "s3://my-logs-bucket/spot-terminated/$INSTANCE_ID/"

    # 4. Stop the application cleanly
    systemctl stop my-app

    echo "Graceful shutdown completed."
    break
  fi

  sleep 5
done

Lambda-Based Interruption Handler

An EventBridge rule routes Spot interruption events to a Lambda function:

import json
import boto3

# Only the clients this handler actually uses are created
sns = boto3.client('sns')
asg = boto3.client('autoscaling')

def lambda_handler(event, context):
    """
    Receive and handle a Spot Interruption Warning event from EventBridge.
    """
    detail = event.get('detail', {})
    instance_id = detail.get('instance-id')
    action = detail.get('instance-action')

    print(f"Spot interruption: instance={instance_id}, action={action}")

    # 1. Mark the instance as unhealthy in the ASG (so a replacement launches right away)
    try:
        asg.set_instance_health(
            InstanceId=instance_id,
            HealthStatus='Unhealthy',
            ShouldRespectGracePeriod=False
        )
        print(f"Marked {instance_id} as unhealthy in ASG")
    except Exception as e:
        print(f"ASG health update failed: {e}")

    # 2. Send an SNS notification
    sns.publish(
        TopicArn='arn:aws:sns:ap-northeast-1:123456789012:spot-alerts',
        Subject=f'Spot Interruption: {instance_id}',
        Message=json.dumps({
            'instance_id': instance_id,
            'action': action,
            'region': event.get('region'),
            'time': event.get('time')
        }, indent=2)
    )

    return {
        'statusCode': 200,
        'body': f'Handled interruption for {instance_id}'
    }

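EventBridge rule configuration (Terraform). The Lambda function also needs an `aws_lambda_permission` that allows `events.amazonaws.com` to invoke it; that resource is omitted here: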

resource "aws_cloudwatch_event_rule" "spot_interruption" {
  name        = "spot-interruption-rule"
  description = "Capture EC2 Spot Instance Interruption Warning"

  event_pattern = jsonencode({
    source      = ["aws.ec2"]
    detail-type = ["EC2 Spot Instance Interruption Warning"]
  })
}

resource "aws_cloudwatch_event_target" "spot_handler_lambda" {
  rule      = aws_cloudwatch_event_rule.spot_interruption.name
  target_id = "spot-interruption-handler"
  arn       = aws_lambda_function.spot_handler.arn
}
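For local testing, it helps to know the shape of the event that this rule delivers. The sketch below uses an illustrative event (account ID, instance ID, and timestamp are made up) and a hypothetical `parse_interruption` helper that mirrors the rule's pattern and the fields the Lambda handler reads.

```python
SAMPLE_EVENT = {
    "version": "0",
    "detail-type": "EC2 Spot Instance Interruption Warning",
    "source": "aws.ec2",
    "account": "123456789012",
    "time": "2026-03-20T12:00:00Z",
    "region": "ap-northeast-1",
    "resources": [
        "arn:aws:ec2:ap-northeast-1a:instance/i-0123456789abcdef0"
    ],
    "detail": {
        "instance-id": "i-0123456789abcdef0",
        "instance-action": "terminate",
    },
}


def parse_interruption(event: dict):
    """Return (instance_id, action), or None if the event is not a Spot warning."""
    if event.get("source") != "aws.ec2":
        return None
    if event.get("detail-type") != "EC2 Spot Instance Interruption Warning":
        return None
    detail = event.get("detail", {})
    return detail.get("instance-id"), detail.get("instance-action")


print(parse_interruption(SAMPLE_EVENT))
```

Passing `SAMPLE_EVENT` to the handler in a unit test, with the AWS clients mocked, exercises the same code path without waiting for a real interruption.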

Spot Fleet: Multiple Instance Types + Multiple AZs

resource "aws_spot_fleet_request" "batch_processing" {
  iam_fleet_role                      = aws_iam_role.spot_fleet_role.arn
  target_capacity                     = 10
  terminate_instances_with_expiration = true
  allocation_strategy                 = "capacityOptimized"
  fleet_type                          = "maintain"

  launch_template_config {
    launch_template_specification {
      id      = aws_launch_template.batch.id
      version = "$Latest"
    }

    overrides {
      instance_type     = "c5.xlarge"
      availability_zone = "ap-northeast-1a"
    }
    overrides {
      instance_type     = "c5a.xlarge"
      availability_zone = "ap-northeast-1a"
    }
    overrides {
      instance_type     = "c5.xlarge"
      availability_zone = "ap-northeast-1c"
    }
    overrides {
      instance_type     = "c5a.xlarge"
      availability_zone = "ap-northeast-1c"
    }
    overrides {
      instance_type     = "c6i.xlarge"
      availability_zone = "ap-northeast-1a"
    }
    overrides {
      instance_type     = "c6i.xlarge"
      availability_zone = "ap-northeast-1c"
    }
  }
}

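Workloads Suited and Unsuited to Spot

Well suited:

CI/CD pipelines (builds, test runners)
Batch data processing (ETL, log analysis)
ML model training (with checkpointing)
Load testing / performance testing
Big data processing (EMR, Spark)
Image/video encoding
Web servers (when using an ASG Mixed Instances Policy)

Poorly suited:

Single-instance databases (use RDS Multi-AZ instead)
Real-time payment systems (cannot tolerate interruption)
Workloads that must hold state for long periods
Critical services that require an SLA of 99.99% or higher (use On-Demand or Reserved)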
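What makes batch and training jobs Spot-friendly is checkpointing: progress is saved durably so a replacement instance can resume instead of restarting. A minimal sketch of the pattern follows, using a local JSON file as the checkpoint. In practice the checkpoint would live on durable storage such as S3 or EFS. All names are hypothetical, and `stop_after` only simulates an interruption.

```python
import json
import os
import tempfile


def load_checkpoint(path: str) -> dict:
    """Load saved progress, or start fresh if no checkpoint exists."""
    if os.path.exists(path):
        with open(path) as f:
            return json.load(f)
    return {"next_item": 0, "total": 0}


def save_checkpoint(path: str, state: dict) -> None:
    """Write atomically so an interruption never leaves a half-written file."""
    tmp = path + ".tmp"
    with open(tmp, "w") as f:
        json.dump(state, f)
    os.replace(tmp, path)


def process(items: list, path: str, stop_after=None) -> dict:
    """Sum items, checkpointing after each one; resume from the last checkpoint."""
    state = load_checkpoint(path)
    processed = 0
    for i in range(state["next_item"], len(items)):
        if stop_after is not None and processed >= stop_after:
            return state  # simulated Spot interruption
        state["total"] += items[i]
        state["next_item"] = i + 1
        save_checkpoint(path, state)
        processed += 1
    return state


checkpoint_path = os.path.join(tempfile.mkdtemp(), "checkpoint.json")
work = list(range(10))
first = dict(process(work, checkpoint_path, stop_after=4))  # interrupted after 4 items
second = process(work, checkpoint_path)                     # resumes at item 4
print(first, second)
```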

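4. Serverless Dynamic Infrastructure

Lambda: Event-Driven Automatic Scaling

AWS Lambda is dynamic infrastructure taken to its logical extreme. With no requests you pay nothing for compute, and when requests arrive, capacity is allocated on demand, within milliseconds when a warm execution environment is available.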

import json
import time

def lambda_handler(event, context):
    """
    Lambda function invoked from API Gateway.
    Concurrency scales automatically from zero to thousands.
    """
    start = time.time()

    # Business logic
    body = event.get('body', '{}')
    data = json.loads(body) if body else {}

    result = process_request(data)

    duration = (time.time() - start) * 1000
    print(f"Processing took {duration:.2f}ms")

    return {
        'statusCode': 200,
        'headers': {
            'Content-Type': 'application/json',
            'X-Processing-Time': f'{duration:.2f}ms'
        },
        'body': json.dumps(result)
    }

def process_request(data):
    # Actual business logic goes here
    return {'status': 'success', 'data': data}

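Managing Lambda Concurrency

# Asynchronous invocation settings: maximum event age and retry count.
# Note: reserved concurrency is configured separately, via the
# reserved_concurrent_executions argument on aws_lambda_function.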
resource "aws_lambda_function_event_invoke_config" "example" {
  function_name = aws_lambda_function.api.function_name

  maximum_event_age_in_seconds = 60
  maximum_retry_attempts       = 0
}

# Provisioned concurrency: keeps execution environments initialized in advance (avoids cold starts)
resource "aws_lambda_provisioned_concurrency_config" "api" {
  function_name                     = aws_lambda_function.api.function_name
  provisioned_concurrent_executions = 50
  qualifier                         = aws_lambda_alias.live.name
}

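Comparison of concurrency types:

Reserved Concurrency: sets aside capacity that other functions cannot use, and also caps this function's concurrency. No additional cost
Provisioned Concurrency: keeps execution environments warm in advance, avoiding cold starts for requests within the provisioned amount. Incurs additional cost

Strategies for Minimizing Cold Starts

Use provisioned concurrency (the most reliable method)
Minimize package size (trim dependencies, use Layers)
Choose the runtime carefully: Python and Node.js typically cold-start faster than Java
Optimize initialization code: create DB connections and similar resources outside the handler
Use SnapStart (introduced for Java runtimes; cold-start reductions of up to about 90%)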
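The "initialize outside the handler" advice relies on Lambda reusing the execution environment between invocations: module-level code runs once per cold start, not once per request. A minimal sketch, where `_create_connection` is a hypothetical stand-in for real setup such as opening a database connection:

```python
import time

_INIT_COUNT = 0


def _create_connection() -> dict:
    """Stand-in for expensive setup (DB connection, SDK clients, config load)."""
    global _INIT_COUNT
    _INIT_COUNT += 1
    time.sleep(0.05)  # simulate slow initialization
    return {"connected_at": time.time()}


# Runs once per execution environment (at cold start), not on every invocation
CONNECTION = _create_connection()


def lambda_handler(event, context):
    # Warm invocations reuse CONNECTION and skip the setup cost entirely
    return {
        "statusCode": 200,
        "init_count": _INIT_COUNT,
        "connected_at": CONNECTION["connected_at"],
    }


# Two invocations in the same environment share one initialization
print(lambda_handler({}, None)["init_count"], lambda_handler({}, None)["init_count"])
```

The trade-off is that the setup cost moves into the cold start itself, so it pairs well with provisioned concurrency when latency matters.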

Fargate: Serverless Containers

resource "aws_ecs_service" "api" {
  name            = "api-service"
  cluster         = aws_ecs_cluster.main.id
  task_definition = aws_ecs_task_definition.api.arn
  desired_count   = 2
  # launch_type is omitted: it conflicts with capacity_provider_strategy below

  network_configuration {
    subnets          = [aws_subnet.private_a.id, aws_subnet.private_c.id]
    security_groups  = [aws_security_group.ecs.id]
    assign_public_ip = false
  }

  load_balancer {
    target_group_arn = aws_lb_target_group.api.arn
    container_name   = "api"
    container_port   = 8080
  }

  capacity_provider_strategy {
    capacity_provider = "FARGATE"
    weight            = 1
    base              = 2  # at least 2 tasks always run on regular (on-demand) Fargate
  }

  capacity_provider_strategy {
    capacity_provider = "FARGATE_SPOT"
    weight            = 4  # 80% of tasks beyond the base go to Fargate Spot
  }
}

# Fargate Auto Scaling
resource "aws_appautoscaling_target" "ecs" {
  max_capacity       = 20
  min_capacity       = 2
  resource_id        = "service/${aws_ecs_cluster.main.name}/${aws_ecs_service.api.name}"
  scalable_dimension = "ecs:service:DesiredCount"
  service_namespace  = "ecs"
}

resource "aws_appautoscaling_policy" "ecs_cpu" {
  name               = "ecs-cpu-scaling"
  policy_type        = "TargetTrackingScaling"
  resource_id        = aws_appautoscaling_target.ecs.resource_id
  scalable_dimension = aws_appautoscaling_target.ecs.scalable_dimension
  service_namespace  = aws_appautoscaling_target.ecs.service_namespace

  target_tracking_scaling_policy_configuration {
    predefined_metric_specification {
      predefined_metric_type = "ECSServiceAverageCPUUtilization"
    }
    target_value       = 70.0
    scale_in_cooldown  = 300
    scale_out_cooldown = 60
  }
}

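Lambda vs Fargate vs EC2 Comparison

Item	Lambda	Fargate	EC2 (ASG)
Scaling speed	Milliseconds	30 seconds to 2 minutes	2-5 minutes
Maximum execution time	15 minutes	Unlimited	Unlimited
Memory	128 MB-10 GB	512 MB-120 GB	Depends on instance type
vCPU	Up to 6	Up to 16	Depends on instance type
Cost model	Requests + execution time	vCPU + memory time	Instance hours
Cold start	Yes	Yes (somewhat longer)	None (while running)
Operational overhead	Minimal	Moderate	High
Container support	Container image deployment	Native	You manage Docker yourself
Spot support	N/A	Fargate Spot (up to 70% off)	Spot Instances (up to 90% off)
Suitable workloads	Event processing, APIs	Microservices, web apps	High-performance, stateful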

5. Building Dynamic Infrastructure with Terraform

A Complete ASG + ALB Infrastructure Example

# provider.tf
terraform {
  required_version = ">= 1.5.0"

  required_providers {
    aws = {
      source  = "hashicorp/aws"
      version = "~> 5.0"
    }
  }

  backend "s3" {
    bucket         = "my-terraform-state-bucket"
    key            = "web-app/terraform.tfstate"
    region         = "ap-northeast-1"
    dynamodb_table = "terraform-lock"
    encrypt        = true
  }
}

provider "aws" {
  region = var.aws_region

  default_tags {
    tags = {
      Environment = var.environment
      ManagedBy   = "terraform"
      Project     = var.project_name
    }
  }
}

# variables.tf
variable "aws_region" {
  default = "ap-northeast-1"
}

variable "environment" {
  default = "production"
}

variable "project_name" {
  default = "web-app"
}

variable "vpc_cidr" {
  default = "10.0.0.0/16"
}

# asg.tf - complete Mixed Instances ASG
resource "aws_autoscaling_group" "web" {
  name                = "${var.project_name}-asg"
  desired_capacity    = 4
  max_size            = 30
  min_size            = 2
  vpc_zone_identifier = [aws_subnet.private_a.id, aws_subnet.private_c.id]

  target_group_arns         = [aws_lb_target_group.web.arn]
  health_check_type         = "ELB"
  health_check_grace_period = 300
  default_instance_warmup   = 120

  mixed_instances_policy {
    launch_template {
      launch_template_specification {
        launch_template_id = aws_launch_template.web_server.id
        version            = "$Latest"
      }

      override {
        instance_type     = "t3.medium"
        weighted_capacity = "1"
      }
      override {
        instance_type     = "t3a.medium"
        weighted_capacity = "1"
      }
      override {
        instance_type     = "m5.large"
        weighted_capacity = "2"
      }
      override {
        instance_type     = "m5a.large"
        weighted_capacity = "2"
      }
    }

    instances_distribution {
      on_demand_base_capacity                  = 2
      on_demand_percentage_above_base_capacity = 20
      spot_allocation_strategy                 = "capacity-optimized"
    }
  }

  instance_refresh {
    strategy = "Rolling"
    preferences {
      min_healthy_percentage = 80
      instance_warmup        = 120
    }
  }

  tag {
    key                 = "Name"
    value               = "${var.project_name}-web"
    propagate_at_launch = true
  }
}

# Target Tracking scaling
resource "aws_autoscaling_policy" "cpu_target" {
  name                   = "cpu-target-tracking"
  autoscaling_group_name = aws_autoscaling_group.web.name
  policy_type            = "TargetTrackingScaling"

  target_tracking_configuration {
    predefined_metric_specification {
      predefined_metric_type = "ASGAverageCPUUtilization"
    }
    target_value = 70.0
  }
}

# Add Predictive Scaling
resource "aws_autoscaling_policy" "predictive" {
  name                   = "predictive-scaling"
  autoscaling_group_name = aws_autoscaling_group.web.name
  policy_type            = "PredictiveScaling"

  predictive_scaling_configuration {
    metric_specification {
      target_value = 70

      predefined_scaling_metric_specification {
        predefined_metric_type = "ASGAverageCPUUtilization"
        resource_label         = ""
      }

      predefined_load_metric_specification {
        predefined_metric_type = "ASGTotalCPUUtilization"
        resource_label         = ""
      }
    }

    mode                   = "ForecastAndScale"
    scheduling_buffer_time = 300
  }
}

Reusable Infrastructure with Terraform Modules

# main.tf (root module): calls the reusable module in ./modules/asg
module "web_asg" {
  source = "./modules/asg"

  project_name       = "my-web-app"
  environment        = "production"
  vpc_id             = module.vpc.vpc_id
  private_subnet_ids = module.vpc.private_subnet_ids
  alb_target_group   = module.alb.target_group_arn

  instance_types   = ["t3.medium", "t3a.medium", "m5.large"]
  min_size         = 2
  max_size         = 30
  desired_capacity = 4

  spot_percentage    = 80
  on_demand_base     = 2
  target_cpu         = 70

  tags = local.common_tags
}

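State Management (S3 + DynamoDB)

# state-backend/main.tf (apply this first, using local state)
# The bucket name here must match the `bucket` value in the backend "s3" block.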
resource "aws_s3_bucket" "terraform_state" {
  bucket = "my-company-terraform-state"

  lifecycle {
    prevent_destroy = true
  }
}

resource "aws_s3_bucket_versioning" "terraform_state" {
  bucket = aws_s3_bucket.terraform_state.id

  versioning_configuration {
    status = "Enabled"
  }
}

resource "aws_dynamodb_table" "terraform_lock" {
  name         = "terraform-lock"
  billing_mode = "PAY_PER_REQUEST"
  hash_key     = "LockID"

  attribute {
    name = "LockID"
    type = "S"
  }
}

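Terraform Workflow

# 1. Initialize
terraform init

# 2. Validate the configuration
terraform validate

# 3. Review the planned changes
terraform plan -out=tfplan

# 4. Apply the changes
terraform apply tfplan

# 5. Inspect the state
terraform state list
terraform state show aws_autoscaling_group.web

# 6. Tear down the infrastructure (development environments only)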
terraform destroy

6. Building Dynamic Infrastructure with AWS CDK

CDK TypeScript Example: ASG + ALB + RDS

import * as cdk from 'aws-cdk-lib'
import * as ec2 from 'aws-cdk-lib/aws-ec2'
import * as elbv2 from 'aws-cdk-lib/aws-elasticloadbalancingv2'
import * as autoscaling from 'aws-cdk-lib/aws-autoscaling'
import * as rds from 'aws-cdk-lib/aws-rds'
import { Construct } from 'constructs'

export class WebAppStack extends cdk.Stack {
  constructor(scope: Construct, id: string, props?: cdk.StackProps) {
    super(scope, id, props)

    // VPC
    const vpc = new ec2.Vpc(this, 'WebVpc', {
      maxAzs: 3,
      natGateways: 2,
      subnetConfiguration: [
        {
          cidrMask: 24,
          name: 'Public',
          subnetType: ec2.SubnetType.PUBLIC,
        },
        {
          cidrMask: 24,
          name: 'Private',
          subnetType: ec2.SubnetType.PRIVATE_WITH_EGRESS,
        },
        {
          cidrMask: 24,
          name: 'Isolated',
          subnetType: ec2.SubnetType.PRIVATE_ISOLATED,
        },
      ],
    })

    // ALB
    const alb = new elbv2.ApplicationLoadBalancer(this, 'WebAlb', {
      vpc,
      internetFacing: true,
      vpcSubnets: { subnetType: ec2.SubnetType.PUBLIC },
    })

    // ASG with a Mixed Instances Policy
    const asg = new autoscaling.AutoScalingGroup(this, 'WebAsg', {
      vpc,
      vpcSubnets: { subnetType: ec2.SubnetType.PRIVATE_WITH_EGRESS },
      mixedInstancesPolicy: {
        instancesDistribution: {
          onDemandBaseCapacity: 2,
          onDemandPercentageAboveBaseCapacity: 20,
          spotAllocationStrategy: autoscaling.SpotAllocationStrategy.CAPACITY_OPTIMIZED,
        },
        launchTemplate: new ec2.LaunchTemplate(this, 'LaunchTemplate', {
          instanceType: ec2.InstanceType.of(ec2.InstanceClass.T3, ec2.InstanceSize.MEDIUM),
          machineImage: ec2.MachineImage.latestAmazonLinux2023(),
        }),
        launchTemplateOverrides: [
          { instanceType: new ec2.InstanceType('t3.medium') },
          { instanceType: new ec2.InstanceType('t3a.medium') },
          { instanceType: new ec2.InstanceType('m5.large') },
          { instanceType: new ec2.InstanceType('m5a.large') },
        ],
      },
      minCapacity: 2,
      maxCapacity: 30,
      healthCheck: autoscaling.HealthCheck.elb({
        grace: cdk.Duration.seconds(300),
      }),
    })

    // Target Tracking scaling
    asg.scaleOnCpuUtilization('CpuScaling', {
      targetUtilizationPercent: 70,
      cooldown: cdk.Duration.seconds(300),
    })

    // ALB listener
    const listener = alb.addListener('HttpsListener', {
      port: 443,
      certificates: [
        elbv2.ListenerCertificate.fromArn(
          'arn:aws:acm:ap-northeast-1:123456789012:certificate/abc-123'
        ),
      ],
    })

    listener.addTargets('WebTarget', {
      port: 80,
      targets: [asg],
      healthCheck: {
        path: '/health',
        interval: cdk.Duration.seconds(15),
        healthyThresholdCount: 2,
        unhealthyThresholdCount: 3,
      },
      deregistrationDelay: cdk.Duration.seconds(30),
    })

    // RDS (Aurora PostgreSQL) cluster with one writer and one reader for Multi-AZ availability
    const database = new rds.DatabaseCluster(this, 'Database', {
      engine: rds.DatabaseClusterEngine.auroraPostgres({
        version: rds.AuroraPostgresEngineVersion.VER_15_4,
      }),
      writer: rds.ClusterInstance.provisioned('Writer', {
        instanceType: ec2.InstanceType.of(ec2.InstanceClass.R6G, ec2.InstanceSize.LARGE),
      }),
      readers: [
        rds.ClusterInstance.provisioned('Reader1', {
          instanceType: ec2.InstanceType.of(ec2.InstanceClass.R6G, ec2.InstanceSize.LARGE),
        }),
      ],
      vpc,
      vpcSubnets: { subnetType: ec2.SubnetType.PRIVATE_ISOLATED },
      storageEncrypted: true,
      deletionProtection: true,
    })

    // Allow access from the ASG to the database
    database.connections.allowDefaultPortFrom(asg)

    // Outputs
    new cdk.CfnOutput(this, 'AlbDnsName', {
      value: alb.loadBalancerDnsName,
      description: 'ALB DNS Name',
    })
  }
}

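CDK Constructs: L1 vs L2 vs L3

Level	Name	Description	Examples
L1	Cfn resources	1:1 mapping to CloudFormation resources	`CfnInstance`, `CfnVPC`
L2	Curated constructs	Sensible defaults + helper methods	`ec2.Vpc`, `lambda.Function`
L3	Patterns	Architecture patterns that combine multiple resources	`ecs_patterns.ApplicationLoadBalancedFargateService`

With L3 patterns, complex infrastructure can be defined in a few lines:

import * as ecs from 'aws-cdk-lib/aws-ecs'
import * as ecs_patterns from 'aws-cdk-lib/aws-ecs-patterns'

// L3 pattern: an ALB plus a Fargate service in one construct
// (`cluster` is assumed to be an ecs.Cluster defined elsewhere in the stack)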
const fargateService = new ecs_patterns.ApplicationLoadBalancedFargateService(
  this,
  'FargateService',
  {
    cluster,
    desiredCount: 2,
    taskImageOptions: {
      image: ecs.ContainerImage.fromRegistry('my-app:latest'),
      containerPort: 8080,
    },
    publicLoadBalancer: true,
    capacityProviderStrategies: [
      { capacityProvider: 'FARGATE', weight: 1, base: 2 },
      { capacityProvider: 'FARGATE_SPOT', weight: 4 },
    ],
  }
)

fargateService.targetGroup.configureHealthCheck({
  path: '/health',
})

CDK vs Terraform vs CloudFormation

Item	CDK	Terraform	CloudFormation
Language	TypeScript, Python, Java, Go	HCL	YAML/JSON
Learning curve	Moderate (uses a programming language)	Moderate (learn HCL)	Low (declarative YAML)
Abstraction level	High (L3 patterns)	Medium (modules)	Low (per resource)
Multi-cloud	AWS only	Multi-cloud	AWS only
State management	CloudFormation stacks	S3 + DynamoDB	Managed automatically
Testing	Unit tests possible	Terratest	Limited
Drift detection	Via CloudFormation	terraform plan	Supported
Ecosystem	Construct Hub	Terraform Registry	Limited

CDK Pipelines: Deploying Infrastructure Through CI/CD

import { CodePipeline, CodePipelineSource, ManualApprovalStep, ShellStep } from 'aws-cdk-lib/pipelines'

// WebAppStage is assumed to be a cdk.Stage subclass that instantiates WebAppStack
const pipeline = new CodePipeline(this, 'Pipeline', {
  pipelineName: 'WebAppPipeline',
  synth: new ShellStep('Synth', {
    input: CodePipelineSource.gitHub('my-org/my-repo', 'main'),
    commands: ['npm ci', 'npm run build', 'npx cdk synth'],
  }),
})

// Deploy to the staging environment
pipeline.addStage(
  new WebAppStage(this, 'Staging', {
    env: { account: '123456789012', region: 'ap-northeast-1' },
  })
)

// Deploy to the production environment (with manual approval)
pipeline.addStage(
  new WebAppStage(this, 'Production', {
    env: { account: '987654321098', region: 'ap-northeast-1' },
  }),
  {
    pre: [new ManualApprovalStep('PromoteToProduction')],
  }
)

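7. Practical Cost Optimization Strategies

Reserved Instances vs Savings Plans vs Spot

Item	Reserved Instances	Savings Plans	Spot Instances
Discount	Up to 72%	Up to 72%	Up to 90%
Commitment term	1 or 3 years	1 or 3 years	None
Flexibility	Tied to instance type/region	Flexible across compute types	Can be interrupted
Upfront payment options	All / partial / none	All / partial / none	None
Best for	Predictable baseline workloads	Varied compute usage	Interruptible workloads
Applies to Lambda	No	Yes (Compute Savings Plans)	N/A
Applies to Fargate	No	Yes (Compute Savings Plans)	Fargate Spot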

Schedule-basedスケーリング

開発/ステージング環境やオフピーク時間にトラフィックが少ないサービスに適用します：

# Business hours (Mon-Fri 09:00-18:00 JST): 4 instances
resource "aws_autoscaling_schedule" "scale_up_business_hours" {
  scheduled_action_name  = "scale-up-business"
  autoscaling_group_name = aws_autoscaling_group.web.name
  min_size               = 4
  max_size               = 30
  desired_capacity       = 4
  recurrence             = "0 9 * * 1-5" # 09:00 Mon-Fri, evaluated in time_zone (JST)
  time_zone              = "Asia/Tokyo"
}

# Night (Mon-Fri from 18:00 JST): 2 instances
resource "aws_autoscaling_schedule" "scale_down_night" {
  scheduled_action_name  = "scale-down-night"
  autoscaling_group_name = aws_autoscaling_group.web.name
  min_size               = 2
  max_size               = 10
  desired_capacity       = 2
  recurrence             = "0 18 * * 1-5" # 18:00 Mon-Fri, evaluated in time_zone (JST)
  time_zone              = "Asia/Tokyo"
}

# Weekend: 1 instance
resource "aws_autoscaling_schedule" "scale_down_weekend" {
  scheduled_action_name  = "scale-down-weekend"
  autoscaling_group_name = aws_autoscaling_group.web.name
  min_size               = 1
  max_size               = 5
  desired_capacity       = 1
  recurrence             = "0 0 * * 6" # Saturday 00:00 JST
  time_zone              = "Asia/Tokyo"
}
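
As a rough check on what this schedule saves, the sketch below counts instance-hours over one week, assuming the ASG sits exactly at each scheduled desired capacity (dynamic scaling between actions is ignored) and comparing against four instances running around the clock. The function names are illustrative.

```python
def scheduled_capacity(weekday: int, hour: int) -> int:
    """Desired capacity at a given JST hour (weekday: 0=Mon ... 6=Sun)."""
    if weekday >= 5:               # Saturday 00:00 action: 1 instance all weekend
        return 1
    if weekday == 0 and hour < 9:  # weekend setting holds until Monday 09:00
        return 1
    if 9 <= hour < 18:             # business-hours action
        return 4
    return 2                       # night action, until the next scheduled action


def weekly_instance_hours() -> int:
    """Sum the scheduled capacity over every hour of the week."""
    return sum(scheduled_capacity(d, h) for d in range(7) for h in range(24))


if __name__ == "__main__":
    scheduled = weekly_instance_hours()
    always_on = 4 * 24 * 7
    print(f"scheduled: {scheduled} instance-hours, always-on: {always_on}")
    print(f"reduction: {1 - scheduled / always_on:.0%}")
```

Under these assumptions the schedule uses 369 instance-hours per week instead of 672, a reduction of about 45%.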

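Case study: cutting a monthly bill from about $4,500 to about $1,800

Before (static infrastructure):

EC2 m5.xlarge x 10 (On-Demand, 24/7) = $1,401/month
EC2 m5.2xlarge x 5 (batch servers, 24/7) = $1,401/month
RDS db.r5.xlarge Multi-AZ = $1,020/month
NAT Gateway x 2 = $130/month
ALB = $50/month
Other (EBS, S3, CloudWatch) = $500/month
Total monthly cost: about $4,502

After (dynamic infrastructure):

Change	Before	After	Reduction
10 web servers → ASG (2 On-Demand + Spot)	$1,401	$420	70%
Batch servers → Spot Fleet (only when needed)	$1,401	$140	90%
RDS → Savings Plan applied	$1,020	$663	35%
NAT Gateway optimization	$130	$65	50%
ALB (unchanged)	$50	$50	0%
Other (unchanged)	$500	$500	0%
Schedule Scaling (scale-in at night and on weekends)	-	additional -30% (not reflected in the dollar figures)	-
Total	$4,502	about $1,838	59%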
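
The arithmetic behind the case study can be re-checked with a few lines. The figures are copied from the breakdown above. The four optimized items alone sum to $1,288 after the changes, and adding the unchanged ALB and "other" costs gives the full monthly bill.

```python
# (item, before, after) in USD per month, taken from the breakdown above.
OPTIMIZED = [
    ("Web servers", 1401, 420),
    ("Batch servers", 1401, 140),
    ("RDS", 1020, 663),
    ("NAT Gateway", 130, 65),
]
# Items that appear in the "before" list but are not optimized.
UNCHANGED = [("ALB", 50), ("Other (EBS, S3, CloudWatch)", 500)]


def reduction(before: float, after: float) -> float:
    """Fractional cost reduction, e.g. 0.7 for a 70% cut."""
    return (before - after) / before


def totals() -> tuple[int, int]:
    """Return (before, after) totals including the unchanged items."""
    unchanged = sum(cost for _, cost in UNCHANGED)
    before = sum(b for _, b, _ in OPTIMIZED) + unchanged
    after = sum(a for _, _, a in OPTIMIZED) + unchanged
    return before, after


if __name__ == "__main__":
    for name, before, after in OPTIMIZED:
        print(f"{name:15s} {reduction(before, after):.0%}")
    before, after = totals()
    print(f"Total: ${before:,} -> ${after:,} ({reduction(before, after):.0%})")
```

With the unchanged items included, the bill goes from $4,502 to $1,838, a reduction of roughly 59%, before any additional effect from schedule-based scaling.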

8. High Availability Architecture

Multi-AZ Deployment Architecture

                    Route 53（DNS Failover）
                           |
                    CloudFront（CDN）
                           |
                    ALB（Multi-AZ）
                    /              \
            AZ-a（ap-northeast-1a）  AZ-c（ap-northeast-1c）
            +------------------+    +------------------+
            | EC2（ASG）        |    | EC2（ASG）        |
            | - Web Server x2  |    | - Web Server x2  |
            |                  |    |                  |
            | RDS（Primary）    |    | RDS（Standby）    |
            | ElastiCache      |    | ElastiCache      |
            +------------------+    +------------------+

Route 53 Health Check + Failover

resource "aws_route53_health_check" "primary" {
  fqdn              = "app.example.com"
  port               = 443
  type               = "HTTPS"
  resource_path      = "/health"
  failure_threshold  = 3
  request_interval   = 10

  regions = ["us-east-1", "eu-west-1", "ap-southeast-1"]

  tags = {
    Name = "primary-health-check"
  }
}

resource "aws_route53_record" "primary" {
  zone_id = aws_route53_zone.main.zone_id
  name    = "app.example.com"
  type    = "A"

  alias {
    name                   = aws_lb.web.dns_name
    zone_id                = aws_lb.web.zone_id
    evaluate_target_health = true
  }

  failover_routing_policy {
    type = "PRIMARY"
  }

  health_check_id = aws_route53_health_check.primary.id
  set_identifier  = "primary"
}
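
The health check above polls /health, so the application has to expose that path. The following is a minimal standard-library sketch of such an endpoint; the CHECKS dictionary and port 8080 are hypothetical, and TLS is assumed to terminate at the ALB or a proxy in front of the process. Route 53 treats 2xx and 3xx responses as healthy, so the handler answers 503 when any check fails.

```python
import json
from http.server import BaseHTTPRequestHandler, HTTPServer
from typing import Callable, Dict, Tuple

# Hypothetical dependency checks; replace with real probes (database ping, cache ping, ...).
CHECKS: Dict[str, Callable[[], bool]] = {
    "app": lambda: True,
}


def health_response(checks: Dict[str, Callable[[], bool]]) -> Tuple[int, bytes]:
    """Return (status_code, JSON body): 200 if every check passes, otherwise 503."""
    results = {}
    for name, check in checks.items():
        try:
            results[name] = bool(check())
        except Exception:
            results[name] = False  # a crashing check counts as unhealthy
    status = 200 if all(results.values()) else 503
    body = json.dumps({"healthy": status == 200, "checks": results}).encode()
    return status, body


class HealthHandler(BaseHTTPRequestHandler):
    def do_GET(self):
        if self.path != "/health":
            self.send_error(404)
            return
        status, body = health_response(CHECKS)
        self.send_response(status)
        self.send_header("Content-Type", "application/json")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)


if __name__ == "__main__":
    # Plain HTTP for local testing only.
    HTTPServer(("0.0.0.0", 8080), HealthHandler).serve_forever()
```

Keeping the checks cheap matters here: with request_interval = 10 and several checker regions, the endpoint is hit many times per minute.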

Chaos Engineering: AWS Fault Injection Simulator

resource "aws_fis_experiment_template" "spot_interruption" {
  description = "Simulate Spot Instance interruptions"
  role_arn    = aws_iam_role.fis.arn

  stop_condition {
    source = "aws:cloudwatch:alarm"
    value  = aws_cloudwatch_metric_alarm.error_rate.arn
  }

  action {
    name      = "interrupt-spot-instances"
    action_id = "aws:ec2:send-spot-instance-interruptions"

    parameter {
      key   = "durationBeforeInterruption"
      value = "PT2M"
    }

    target {
      key   = "SpotInstances"
      value = "spot-instances-target"
    }
  }

  target {
    name           = "spot-instances-target"
    resource_type  = "aws:ec2:spot-instance"
    selection_mode = "COUNT(2)"

    resource_tag {
      key   = "Environment"
      value = "staging"
    }
  }
}

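9. Three Patterns for Creating EC2 Instances On Demand

9-1. Creating EC2 Instances with EventBridge + Lambda (Event-Driven)

In this pattern, a file uploaded to S3 launches an EC2 instance that processes the file and then terminates itself: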

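import json
import shlex

import boto3

ec2 = boto3.client('ec2')

def lambda_handler(event, context):
    """
    S3 upload event -> launch EC2 instance -> process -> self-terminate
    """
    # Extract the file location from the S3 event delivered via EventBridge
    bucket = event['detail']['bucket']['name']
    key = event['detail']['object']['key']

    print(f"Processing file: s3://{bucket}/{key}")

    # Shell-quote the S3 URIs so unusual characters in the key cannot break the script
    src = shlex.quote(f"s3://{bucket}/{key}")
    dst = shlex.quote(f"s3://{bucket}-processed/{key}")

    # User data script run by the instance at boot
    user_data = f"""#!/bin/bash
set -e

# Terminate this instance on exit, whether the job succeeded or failed (uses IMDSv2)
terminate() {{
  TOKEN=$(curl -s -X PUT http://169.254.169.254/latest/api/token -H "X-aws-ec2-metadata-token-ttl-seconds: 60")
  INSTANCE_ID=$(curl -s -H "X-aws-ec2-metadata-token: $TOKEN" http://169.254.169.254/latest/meta-data/instance-id)
  REGION=$(curl -s -H "X-aws-ec2-metadata-token: $TOKEN" http://169.254.169.254/latest/meta-data/placement/region)
  aws ec2 terminate-instances --region "$REGION" --instance-ids "$INSTANCE_ID"
}}
trap terminate EXIT

# Run the job
aws s3 cp {src} /tmp/input
python3 /opt/process.py /tmp/input /tmp/output
aws s3 cp /tmp/output {dst}
"""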

    response = ec2.run_instances(
        ImageId='ami-0abcdef1234567890',
        InstanceType='c5.xlarge',
        MinCount=1,
        MaxCount=1,
        IamInstanceProfile={
            'Name': 'ec2-processing-role'
        },
        UserData=user_data,
        TagSpecifications=[
            {
                'ResourceType': 'instance',
                'Tags': [
                    {'Key': 'Name', 'Value': f'processor-{key[:20]}'},
                    {'Key': 'Purpose', 'Value': 'batch-processing'},
                    {'Key': 'AutoTerminate', 'Value': 'true'}
                ]
            }
        ],
        InstanceMarketOptions={
            'MarketType': 'spot',
            'SpotOptions': {
                'SpotInstanceType': 'one-time',
                'InstanceInterruptionBehavior': 'terminate'
            }
        }
    )

    instance_id = response['Instances'][0]['InstanceId']
    print(f"Started processing instance: {instance_id}")

    return {
        'statusCode': 200,
        'body': json.dumps({
            'instance_id': instance_id,
            'file': f's3://{bucket}/{key}'
        })
    }
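
Before wiring the handler to a real bucket, it helps to exercise the event parsing locally. The sketch below uses a hand-written event in the shape EventBridge delivers for S3 "Object Created" notifications (the bucket must have EventBridge notifications enabled). The bucket and key values and both helper names are made up for illustration.

```python
# Sample "Object Created" event; only the fields the handler reads are included.
SAMPLE_EVENT = {
    "version": "0",
    "detail-type": "Object Created",
    "source": "aws.s3",
    "detail": {
        "bucket": {"name": "my-upload-bucket"},
        "object": {"key": "incoming/report.csv", "size": 1024},
    },
}


def extract_s3_object(event: dict) -> tuple[str, str]:
    """Return (bucket, key), raising ValueError for events of another shape."""
    try:
        detail = event["detail"]
        return detail["bucket"]["name"], detail["object"]["key"]
    except (KeyError, TypeError) as exc:
        raise ValueError(f"not an S3 EventBridge event: {exc!r}") from exc


def output_uri(bucket: str, key: str) -> str:
    """Destination written by the user data script: the '<bucket>-processed' bucket."""
    return f"s3://{bucket}-processed/{key}"


if __name__ == "__main__":
    bucket, key = extract_s3_object(SAMPLE_EVENT)
    print(output_uri(bucket, key))
```

Rejecting malformed events early keeps a misrouted EventBridge rule from launching instances that have nothing to process.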

9-2. Step Functions Orchestration

Use Step Functions to manage more complex workflows:

{
  "Comment": "EC2-based batch processing workflow",
  "StartAt": "CreateInstance",
  "States": {
    "CreateInstance": {
      "Type": "Task",
      "Resource": "arn:aws:states:::aws-sdk:ec2:runInstances",
      "Parameters": {
        "ImageId": "ami-0abcdef1234567890",
        "InstanceType": "c5.2xlarge",
        "MinCount": 1,
        "MaxCount": 1,
        "IamInstanceProfile": {
          "Name": "batch-processing-role"
        },
        "TagSpecifications": [
          {
            "ResourceType": "instance",
            "Tags": [
              {
                "Key": "Purpose",
                "Value": "step-function-batch"
              }
            ]
          }
        ]
      },
      "ResultPath": "$.instanceInfo",
      "Next": "WaitForInstance"
    },
    "WaitForInstance": {
      "Type": "Wait",
      "Seconds": 60,
      "Next": "CheckInstanceStatus"
    },
    "CheckInstanceStatus": {
      "Type": "Task",
      "Resource": "arn:aws:states:::aws-sdk:ec2:describeInstanceStatus",
      "Parameters": {
        "InstanceIds.$": "States.Array($.instanceInfo.Instances[0].InstanceId)",
        "IncludeAllInstances": true
      },
      "ResultPath": "$.status",
      "Next": "IsInstanceReady"
    },
    "IsInstanceReady": {
      "Type": "Choice",
      "Choices": [
        {
          "Variable": "$.status.InstanceStatuses[0].InstanceState.Name",
          "StringEquals": "running",
          "Next": "RunProcessing"
        }
      ],
      "Default": "WaitForInstance"
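    },
    "RunProcessing": {
      "Type": "Task",
      "Resource": "arn:aws:states:::aws-sdk:ssm:sendCommand",
      "Parameters": {
        "DocumentName": "AWS-RunShellScript",
        "InstanceIds.$": "States.Array($.instanceInfo.Instances[0].InstanceId)",
        "Parameters": {
          "commands": ["cd /opt/app && python3 process.py"]
        }
      },
      "ResultPath": "$.command",
      "Next": "WaitForProcessing",
      "Catch": [
        {
          "ErrorEquals": ["States.ALL"],
          "Next": "TerminateInstance",
          "ResultPath": "$.error"
        }
      ]
    },
    "WaitForProcessing": {
      "Type": "Wait",
      "Seconds": 30,
      "Next": "CheckProcessing"
    },
    "CheckProcessing": {
      "Type": "Task",
      "Resource": "arn:aws:states:::aws-sdk:ssm:getCommandInvocation",
      "Parameters": {
        "CommandId.$": "$.command.Command.CommandId",
        "InstanceId.$": "$.instanceInfo.Instances[0].InstanceId"
      },
      "ResultPath": "$.processingResult",
      "Next": "IsProcessingDone",
      "Catch": [
        {
          "ErrorEquals": ["States.ALL"],
          "Next": "TerminateInstance",
          "ResultPath": "$.error"
        }
      ]
    },
    "IsProcessingDone": {
      "Type": "Choice",
      "Choices": [
        {
          "Or": [
            { "Variable": "$.processingResult.Status", "StringEquals": "Pending" },
            { "Variable": "$.processingResult.Status", "StringEquals": "InProgress" },
            { "Variable": "$.processingResult.Status", "StringEquals": "Delayed" }
          ],
          "Next": "WaitForProcessing"
        }
      ],
      "Default": "TerminateInstance"
    },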
    "TerminateInstance": {
      "Type": "Task",
      "Resource": "arn:aws:states:::aws-sdk:ec2:terminateInstances",
      "Parameters": {
        "InstanceIds.$": "States.Array($.instanceInfo.Instances[0].InstanceId)"
      },
      "End": true
    }
  }
}
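
The wait/check/choice loop in this definition has no upper bound, so an instance that never reaches the running state would keep the execution looping. The sketch below expresses the same polling pattern in Python with a cap on attempts; max_attempts and the injected get_state and sleep callables are assumptions for illustration, not part of the state machine above.

```python
import time
from typing import Callable


def wait_until_running(
    get_state: Callable[[], str],
    interval_seconds: float = 60,
    max_attempts: int = 10,
    sleep: Callable[[float], None] = time.sleep,
) -> bool:
    """Poll get_state() until it returns 'running'; give up after max_attempts."""
    for _ in range(max_attempts):
        sleep(interval_seconds)       # corresponds to the Wait state
        if get_state() == "running":  # corresponds to the Task + Choice states
            return True
    return False                      # caller should clean up (terminate the instance)


if __name__ == "__main__":
    # Simulated instance that becomes ready on the third poll.
    states = iter(["pending", "pending", "running"])
    ready = wait_until_running(lambda: next(states), interval_seconds=0)
    print("ready" if ready else "timed out")
```

In a state machine, the equivalent guard is an attempt counter carried in the state input or a TimeoutSeconds value on the execution.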

9-3. Kubernetes Jobs + Karpenter

Karpenter is a Kubernetes node autoscaler optimized for AWS:

# karpenter-nodepool.yaml
apiVersion: karpenter.sh/v1
kind: NodePool
metadata:
  name: batch-processing
spec:
  template:
    spec:
      requirements:
        - key: karpenter.sh/capacity-type
          operator: In
          values: ['spot', 'on-demand']
        - key: node.kubernetes.io/instance-type
          operator: In
          values:
            - c5.xlarge
            - c5a.xlarge
            - c5.2xlarge
            - c6i.xlarge
            - c6i.2xlarge
            - m5.xlarge
            - m5a.xlarge
        - key: topology.kubernetes.io/zone
          operator: In
          values:
            - ap-northeast-1a
            - ap-northeast-1c
      nodeClassRef:
        group: karpenter.k8s.aws
        kind: EC2NodeClass
        name: default
  limits:
    cpu: '100'
    memory: 400Gi
  disruption:
    consolidationPolicy: WhenEmptyOrUnderutilized
    consolidateAfter: 30s

# batch-job.yaml
apiVersion: batch/v1
kind: Job
metadata:
  name: data-processing-job
spec:
  parallelism: 10
  completions: 100
  backoffLimit: 3
  template:
    metadata:
      labels:
        app: data-processor
    spec:
      containers:
        - name: processor
          image: my-registry/data-processor:v1.2
          resources:
            requests:
              cpu: '2'
              memory: '4Gi'
            limits:
              cpu: '4'
              memory: '8Gi'
          env:
            - name: BATCH_SIZE
              value: '1000'
      restartPolicy: OnFailure
      nodeSelector:
        karpenter.sh/capacity-type: spot
      tolerations:
        - key: 'karpenter.sh/disrupted'
          operator: 'Exists'
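
A quick sanity check on these two manifests is whether the Job's peak requests fit inside the NodePool limits. The sketch below copies the values from the manifests above and computes the peak CPU and memory requests plus the minimum number of pod waves. It is only a lower bound: NodePool limits apply to the capacity of provisioned nodes, which is larger than the sum of pod requests.

```python
import math

# Values copied from the NodePool and Job manifests above.
POOL_CPU_LIMIT = 100        # spec.limits.cpu
POOL_MEMORY_LIMIT_GI = 400  # spec.limits.memory
PARALLELISM = 10            # Job spec.parallelism
COMPLETIONS = 100           # Job spec.completions
CPU_REQUEST = 2             # per-pod resources.requests.cpu
MEMORY_REQUEST_GI = 4       # per-pod resources.requests.memory


def peak_requests() -> tuple[int, int]:
    """Total CPU and memory (Gi) requested when all parallel pods are running."""
    return PARALLELISM * CPU_REQUEST, PARALLELISM * MEMORY_REQUEST_GI


def fits_within_limits() -> bool:
    """True if the peak pod requests stay under the NodePool limits."""
    cpu, memory = peak_requests()
    return cpu <= POOL_CPU_LIMIT and memory <= POOL_MEMORY_LIMIT_GI


def waves() -> int:
    """Minimum number of pod 'waves' needed to reach the completion count."""
    return math.ceil(COMPLETIONS / PARALLELISM)


if __name__ == "__main__":
    cpu, memory = peak_requests()
    print(f"peak requests: {cpu} vCPU, {memory} Gi, fits: {fits_within_limits()}")
    print(f"waves needed: {waves()}")
```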

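Karpenter vs Cluster Autoscaler:

Item	Karpenter	Cluster Autoscaler
Node provisioning speed	Fast, often under a minute (calls the EC2 API directly)	Minutes (goes through an ASG)
Instance type selection	Chosen automatically from workload requirements	Only the types defined in the ASG
Bin packing	Automatic optimization	Limited
Spot integration	Native support	ASG Mixed Instances
Scale-down	Fast (removes unused nodes after 30 seconds with the setting above)	Waits 10 minutes by default
Cloud support	Built for AWS; providers for other clouds (e.g. Azure) also exist	Multi-cloud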

10. Monitoring and Alarms

CloudWatch Alarms + SNS

# ASG alarms
resource "aws_cloudwatch_metric_alarm" "asg_high_cpu" {
  alarm_name          = "asg-high-cpu"
  comparison_operator = "GreaterThanThreshold"
  evaluation_periods  = 3
  metric_name         = "CPUUtilization"
  namespace           = "AWS/EC2"
  period              = 60
  statistic           = "Average"
  threshold           = 85
  alarm_description   = "ASG CPU utilization exceeded 85%"

  dimensions = {
    AutoScalingGroupName = aws_autoscaling_group.web.name
  }

  alarm_actions = [aws_sns_topic.alerts.arn]
  ok_actions    = [aws_sns_topic.alerts.arn]
}

# Spot interruption count alarm
resource "aws_cloudwatch_metric_alarm" "spot_interruptions" {
  alarm_name          = "spot-interruption-count"
  comparison_operator = "GreaterThanOrEqualToThreshold"
  evaluation_periods  = 1
  metric_name         = "SpotInterruptionCount"
  namespace           = "Custom/SpotMetrics"
  period              = 300
  statistic           = "Sum"
  threshold           = 3
  alarm_description   = "3 or more Spot interruptions occurred within 5 minutes"

  alarm_actions = [aws_sns_topic.critical_alerts.arn]
}

# ALB 5xx error rate alarm
resource "aws_cloudwatch_metric_alarm" "alb_5xx" {
  alarm_name          = "alb-5xx-error-rate"
  comparison_operator = "GreaterThanThreshold"
  evaluation_periods  = 2
  threshold           = 5

  metric_query {
    id          = "error_rate"
    expression  = "(errors / requests) * 100"
    label       = "5xx Error Rate"
    return_data = true
  }

  metric_query {
    id = "errors"
    metric {
      metric_name = "HTTPCode_Target_5XX_Count"
      namespace   = "AWS/ApplicationELB"
      period      = 60
      stat        = "Sum"
      dimensions = {
        LoadBalancer = aws_lb.web.arn_suffix
      }
    }
  }

  metric_query {
    id = "requests"
    metric {
      metric_name = "RequestCount"
      namespace   = "AWS/ApplicationELB"
      period      = 60
      stat        = "Sum"
      dimensions = {
        LoadBalancer = aws_lb.web.arn_suffix
      }
    }
  }

  alarm_actions = [aws_sns_topic.critical_alerts.arn]
}
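
The Spot interruption alarm above reads a custom metric, so something has to publish SpotInterruptionCount to the Custom/SpotMetrics namespace. One possible approach, sketched below, is a Lambda function triggered by an EventBridge rule for the EC2 Spot Instance Interruption Warning event. The handler layout and the optional cloudwatch parameter (used for testing) are assumptions; the metric is published without dimensions because the alarm defines none.

```python
def build_metric_data(event: dict) -> list:
    """Turn one Spot interruption warning event into a single CloudWatch datapoint."""
    instance_id = event.get("detail", {}).get("instance-id", "unknown")
    print(f"Spot interruption warning for {instance_id}")
    # No dimensions: the alarm matches the metric by namespace and name only.
    return [{"MetricName": "SpotInterruptionCount", "Value": 1, "Unit": "Count"}]


def lambda_handler(event, context, cloudwatch=None):
    """Publish one count per interruption warning received from EventBridge."""
    if cloudwatch is None:
        import boto3  # imported lazily so build_metric_data can be tested offline
        cloudwatch = boto3.client("cloudwatch")
    cloudwatch.put_metric_data(
        Namespace="Custom/SpotMetrics",
        MetricData=build_metric_data(event),
    )
    return {"published": 1}
```

With statistic = "Sum" and period = 300, each invocation adds 1 to the five-minute total that the alarm compares against its threshold.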

Grafana + Prometheus on EKS

A monitoring stack using Prometheus and Grafana on EKS:

# prometheus-values.yaml (Helm)
prometheus:
  prometheusSpec:
    retention: 15d
    storageSpec:
      volumeClaimTemplate:
        spec:
          storageClassName: gp3
          accessModes: ['ReadWriteOnce']
          resources:
            requests:
              storage: 50Gi

    additionalScrapeConfigs:
      - job_name: karpenter
        kubernetes_sd_configs:
          - role: endpoints
            namespaces:
              names:
                - karpenter
        relabel_configs:
          - source_labels: [__meta_kubernetes_endpoint_port_name]
            regex: http-metrics
            action: keep

grafana:
  adminPassword: 'secure-password' # placeholder; supply the real value from a secret, not from this file
  persistence:
    enabled: true
    size: 10Gi

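11. Quiz

Q1: What is the main reason Target Tracking Scaling is recommended over Simple Scaling in an Auto Scaling Group?

Answer: Target Tracking automatically keeps the metric at its target value, balances scale-in against scale-out, and removes the need to manage individual CloudWatch alarms.

With Simple Scaling, no further scaling can happen during the cooldown period, and the scaling amount has to be defined by hand. With Target Tracking, AWS computes the adjustment itself, which lowers the operational burden and keeps capacity closer to the target.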
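
To give a feel for what "maintaining a target" means, the sketch below uses a simplified proportional model: capacity is scaled by the ratio of the observed metric to the target and then clamped to the group's bounds. This is an illustration only. AWS does not document its exact algorithm, and real target tracking scales in more conservatively than it scales out.

```python
import math


def target_tracking_capacity(current: int, metric: float, target: float,
                             min_size: int, max_size: int) -> int:
    """Simplified proportional model: scale capacity by metric/target, then clamp."""
    desired = math.ceil(current * metric / target)
    return max(min_size, min(max_size, desired))


if __name__ == "__main__":
    # 4 instances at 75% average CPU with a 50% target -> more capacity needed.
    print(target_tracking_capacity(4, metric=75, target=50, min_size=2, max_size=30))
    # 6 instances at 20% average CPU with a 50% target -> capacity can shrink.
    print(target_tracking_capacity(6, metric=20, target=50, min_size=2, max_size=30))
```

With Simple Scaling, the equivalent behavior would need separate alarms and hand-tuned step sizes for each direction.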

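Q2: In a Mixed Instances Policy, on_demand_base_capacity is set to 2 and on_demand_percentage_above_base_capacity is set to 20. If the ASG needs 12 instances in total, how many are On-Demand and how many are Spot?

Answer: 4 On-Demand, 8 Spot

Calculation:

Base On-Demand: 2
Additional instances needed: 12 - 2 = 10
On-Demand share of the additional instances (20%): 10 x 0.2 = 2
Spot share of the additional instances (80%): 10 x 0.8 = 8
Total On-Demand: 2 + 2 = 4, total Spot: 8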
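
The same calculation as a small function, useful for trying other capacities. The function name is illustrative, and rounding a fractional On-Demand share upward is an assumption of this sketch; the quiz example divides evenly, so rounding does not affect it.

```python
def split_capacity(total: int, on_demand_base: int,
                   on_demand_pct_above_base: int) -> tuple[int, int]:
    """Return (on_demand, spot) instance counts for a desired capacity of `total`."""
    base = min(total, on_demand_base)
    above = total - base
    # Assumption: a fractional On-Demand share rounds up (integer ceiling division).
    extra_on_demand = (above * on_demand_pct_above_base + 99) // 100
    on_demand = base + extra_on_demand
    return on_demand, total - on_demand


if __name__ == "__main__":
    on_demand, spot = split_capacity(12, on_demand_base=2, on_demand_pct_above_base=20)
    print(f"On-Demand: {on_demand}, Spot: {spot}")
```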

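Q3: How many minutes before a Spot Instance is interrupted does AWS send a warning, and what are two ways to detect that warning?

Answer: The warning is sent 2 minutes before the interruption.

Detection methods:

EC2 metadata polling: from inside the instance, periodically check the http://169.254.169.254/latest/meta-data/spot/instance-action endpoint
EventBridge rule: receive the EC2 Spot Instance Interruption Warning event in EventBridge and handle it with Lambda or a similar target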
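
For the metadata polling approach, the endpoint returns 404 until an interruption is scheduled and then a small JSON document with an action and a time. The sketch below covers only the handling of that response body; fetching it (including the IMDSv2 token request) is left out, and the function names are illustrative.

```python
import json
from datetime import datetime, timezone
from typing import Optional


def parse_instance_action(body: Optional[str]) -> Optional[dict]:
    """Parse the spot/instance-action document; None means nothing is scheduled."""
    if not body:
        return None
    doc = json.loads(body)
    when = datetime.strptime(doc["time"], "%Y-%m-%dT%H:%M:%SZ").replace(tzinfo=timezone.utc)
    return {"action": doc["action"], "time": when}


def seconds_remaining(notice: dict, now: datetime) -> float:
    """Seconds left before the action time (negative if it has already passed)."""
    return (notice["time"] - now).total_seconds()


if __name__ == "__main__":
    sample = '{"action": "terminate", "time": "2017-09-18T08:22:00Z"}'
    notice = parse_instance_action(sample)
    now = datetime(2017, 9, 18, 8, 20, 0, tzinfo=timezone.utc)
    print(notice["action"], seconds_remaining(notice, now))
```

A polling interval of a few seconds leaves most of the two-minute window for draining connections and checkpointing work.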

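Q4: What is the fundamental reason Karpenter provisions nodes faster than Cluster Autoscaler?

Answer: Karpenter creates nodes by calling the EC2 API directly, whereas Cluster Autoscaler creates them through an ASG (Auto Scaling Group).

Cluster Autoscaler takes an indirect path: it detects Pending Pods and changes the ASG's desired count, and the ASG then launches instances according to its Launch Template. Karpenter analyzes the workload requirements, selects suitable instance types, and requests capacity directly through the EC2 Fleet API, so the launch request goes out within seconds and nodes become ready sooner than they would through an ASG.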

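Q5: When Terraform state is managed with S3 + DynamoDB, what is the role of DynamoDB?

Answer: DynamoDB handles state locking.

If several team members run terraform apply at the same time, their changes to the state file can conflict. Terraform creates a lock record in the DynamoDB table so that only one person can modify the state at a time. This prevents race conditions and keeps the infrastructure state consistent.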
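
The locking idea can be illustrated without AWS: a lock is a record that can only be created if it does not already exist, which is what a DynamoDB conditional write provides. The class below is an in-memory stand-in written for illustration, not Terraform's implementation. The names LockTable, acquire, and release are hypothetical.

```python
import threading
from typing import Dict, Optional


class LockTable:
    """In-memory stand-in for a DynamoDB lock table keyed by LockID."""

    def __init__(self) -> None:
        self._items: Dict[str, str] = {}
        self._mutex = threading.Lock()

    def acquire(self, lock_id: str, owner: str) -> bool:
        """Conditional put: succeeds only if no item with this LockID exists."""
        with self._mutex:
            if lock_id in self._items:
                return False
            self._items[lock_id] = owner
            return True

    def release(self, lock_id: str, owner: str) -> bool:
        """Delete the lock item, but only for the owner that created it."""
        with self._mutex:
            if self._items.get(lock_id) != owner:
                return False
            del self._items[lock_id]
            return True

    def holder(self, lock_id: str) -> Optional[str]:
        """Return the current lock owner, or None if the state is unlocked."""
        return self._items.get(lock_id)


if __name__ == "__main__":
    table = LockTable()
    state = "my-bucket/prod/terraform.tfstate"  # hypothetical state path
    print(table.acquire(state, "alice"))  # first apply takes the lock
    print(table.acquire(state, "bob"))    # concurrent apply is refused
    table.release(state, "alice")
    print(table.acquire(state, "bob"))    # succeeds once the lock is released
```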