- Introduction
- Concept and types of infrastructure drift
- Common causes of drift
- Native drift detection method
- CI/CD pipeline integration
- Automated Remediation Strategy
- Tool Comparison: Spacelift vs Terramate vs Native
- Precautions during operation
- Failure cases and recovery procedures
- Operational Checklist
- Conclusion
- References

Introduction
The core promise of Infrastructure as Code (IaC) is that "the state declared in code is the actual state of the infrastructure." In a real production environment, however, this promise is easily broken. Emergency fixes in the console, interference from other automation tools, dynamic changes by AWS Auto Scaling, and even differences in API defaults between versions all open a gap between code and actual infrastructure. This gap is called infrastructure drift.
As of 2026, with the release of Terraform 1.11 and OpenTofu 1.9, drift detection and management capabilities have been greatly enhanced. Even so, it is difficult to build an organization-level drift management system with the terraform plan command alone. This article analyzes the root causes of drift and covers everything needed for real-world operation: building an automated detection pipeline, CI/CD integration, recovery strategies, and using dedicated tools such as Spacelift and Terramate.
Concept and types of infrastructure drift
What is drift?
Infrastructure drift refers to the discrepancy between the desired state declared in IaC code and the current state that actually exists at the cloud provider. Terraform's state file stores the last known state of your infrastructure; whenever these three (code, state, and actual infrastructure) disagree, you have drift.
┌─────────────────┐      ┌──────────────────┐      ┌─────────────────┐
│    HCL Code     │      │ Terraform State  │      │  Actual Infra   │
│   (Desired)     │      │  (Last Known)    │      │    (Actual)     │
│                 │      │                  │      │                 │
│ instance_type   │      │ instance_type    │      │ instance_type   │
│ = "t3.medium"   │      │ = "t3.medium"    │      │ = "t3.xlarge"   │
│                 │      │                  │      │ ← changed in    │
│                 │      │                  │      │   the console   │
└─────────────────┘      └──────────────────┘      └─────────────────┘
         ▲                        ▲                        ▲
         │                        │                        │
         └──────── match ─────────┘──── mismatch (drift) ──┘
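The three-way comparison in the diagram can be sketched in a few lines of Python. This is an illustrative model of the concept only, with plain dicts standing in for code, state, and cloud responses; it is not how Terraform is implemented:

```python
def diff(a: dict, b: dict) -> dict:
    """Attributes whose values differ between two snapshots."""
    return {k: (a.get(k), b.get(k))
            for k in set(a) | set(b) if a.get(k) != b.get(k)}

code   = {"instance_type": "t3.medium"}  # HCL (desired)
state  = {"instance_type": "t3.medium"}  # state (last known)
actual = {"instance_type": "t3.xlarge"}  # cloud (changed in the console)

pending = diff(code, state)    # code vs state: unapplied code changes
drift   = diff(state, actual)  # state vs actual: infrastructure drift

print(pending)  # {}
print(drift)    # {'instance_type': ('t3.medium', 't3.xlarge')}
```

A diff between code and state means a pending (unapplied) change; a diff between state and actual infrastructure is the drift that the rest of this article is about.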
Three types of drift
1. Configuration Drift
A resource attribute differs from the value declared in code. This applies, for example, when a Security Group rule is added in the console or an EC2 instance type is changed.
2. Existence Drift
A resource declared in code has actually been deleted, or a resource exists that is not declared in code. This occurs when someone deletes an S3 bucket from the console or creates an RDS instance outside of code.
3. Dependency Drift
A reference relationship between resources is broken, for example when a subnet is deleted while an EC2 instance that referenced it has been moved to another subnet.
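These categories also surface in terraform plan JSON output as different action combinations under resource_changes[].change.actions. The mapping below is a rough heuristic sketch, not an official classification; dependency drift in particular usually appears indirectly as a cascade of updates or replaces and cannot be read off a single action list:

```python
def drift_category(actions: list[str]) -> str:
    """Roughly map a plan action list to a drift category.

    Heuristic: resources appearing or disappearing are existence drift;
    in-place updates and replacements are configuration drift.
    """
    if actions in (["create"], ["delete"]):
        return "existence"
    if "update" in actions or ("delete" in actions and "create" in actions):
        return "configuration"
    return "none"

print(drift_category(["update"]))            # configuration
print(drift_category(["create"]))            # existence
print(drift_category(["delete", "create"]))  # configuration (replace)
print(drift_category(["no-op"]))             # none
```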
Common causes of drift
Human Factors
The most frequent cause of drift is direct human intervention. A typical example is when the infrastructure is urgently changed through the console or CLI in a failure situation.
# Emergency incident response -- a common drift scenario
# 1. Open a debugging port on a Security Group
aws ec2 authorize-security-group-ingress \
  --group-id sg-0abc123def456 \
  --protocol tcp \
  --port 5432 \
  --cidr 0.0.0.0/0   # DANGER: exposes the PostgreSQL port to the world
# 2. Emergency scale-up of an RDS instance class
aws rds modify-db-instance \
  --db-instance-identifier prod-main-db \
  --db-instance-class db.r6g.2xlarge \
  --apply-immediately
# 3. Manually change the desired count of an Auto Scaling Group
aws autoscaling set-desired-capacity \
  --auto-scaling-group-name prod-web-asg \
  --desired-capacity 10
These changes may be unavoidable in an emergency, but if they are not reflected back into the code, the next terraform apply will roll them back unintentionally.
System factors
- Auto Scaling: AWS ASGs, EKS node groups, etc. automatically change instance counts
- Provider default changes: the AWS API automatically adds default tags or settings when a resource is created
- Service integrations: CloudTrail, Config Rules, etc. automatically modify settings
- Terraform provider upgrades: attributes added in a new provider version are populated with default values
Process factors
- State file corruption: the state file in the S3 backend is accidentally overwritten or deleted
- Parallel apply: multiple pipelines run apply simultaneously against the same state
- Partial apply: terraform apply fails midway, changing only some resources
Native drift detection method
Basic detection using terraform plan
terraform plan is the most basic drift detection tool. The -detailed-exitcode flag lets you determine programmatically whether drift exists.
#!/bin/bash
# drift-detect.sh -- Terraform drift detection script
set -euo pipefail

WORKSPACE_DIR="${1:-.}"
SLACK_WEBHOOK_URL="${SLACK_WEBHOOK_URL:-}"
LOG_FILE="/var/log/terraform/drift-$(date +%Y%m%d-%H%M%S).log"

mkdir -p "$(dirname "$LOG_FILE")"
cd "$WORKSPACE_DIR"
echo "[$(date)] Drift detection started: $WORKSPACE_DIR" | tee -a "$LOG_FILE"

# terraform init (avoid backend re-initialization prompts)
terraform init -input=false -no-color >> "$LOG_FILE" 2>&1

# -detailed-exitcode exit codes:
#   0 = no changes (no drift)
#   1 = error
#   2 = changes present (drift detected)
# Temporarily disable -e so exit code 2 does not abort the script
set +e
terraform plan -detailed-exitcode -input=false -no-color -out=drift.plan >> "$LOG_FILE" 2>&1
EXIT_CODE=$?
set -e

case $EXIT_CODE in
  0)
    echo "[$(date)] No drift" | tee -a "$LOG_FILE"
    ;;
  1)
    echo "[$(date)] Error -- manual investigation required" | tee -a "$LOG_FILE"
    if [ -n "$SLACK_WEBHOOK_URL" ]; then
      curl -s -X POST "$SLACK_WEBHOOK_URL" \
        -H 'Content-Type: application/json' \
        -d "{\"text\": \":rotating_light: Terraform drift detection error\\nWorkspace: $WORKSPACE_DIR\\nLog: $LOG_FILE\"}"
    fi
    exit 1
    ;;
  2)
    echo "[$(date)] Drift detected!" | tee -a "$LOG_FILE"
    # Extract a change summary from the plan output
    DRIFT_SUMMARY=$(terraform show -no-color drift.plan | grep -E "^  #|Plan:|~ |\+ |- " | head -30 || true)
    if [ -n "$SLACK_WEBHOOK_URL" ]; then
      curl -s -X POST "$SLACK_WEBHOOK_URL" \
        -H 'Content-Type: application/json' \
        -d "{\"text\": \":warning: Infrastructure drift detected\\nWorkspace: $WORKSPACE_DIR\\n\`\`\`$DRIFT_SUMMARY\`\`\`\"}"
    fi
    exit 2
    ;;
esac
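If the detection logic lives inside a larger tool rather than a shell script, the same exit-code contract can be consumed from Python. A minimal sketch, assuming terraform is on PATH when detect_drift is actually invoked:

```python
import subprocess

def interpret_exit_code(code: int) -> str:
    """Map terraform plan -detailed-exitcode results to a status string."""
    return {0: "no-drift", 1: "error", 2: "drift"}.get(code, "unknown")

def detect_drift(workspace_dir: str) -> str:
    """Run terraform plan in a workspace and classify the outcome."""
    result = subprocess.run(
        ["terraform", "plan", "-detailed-exitcode", "-input=false", "-no-color"],
        cwd=workspace_dir,
        capture_output=True,
        text=True,
    )
    return interpret_exit_code(result.returncode)

print(interpret_exit_code(2))  # drift
```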
Improved drift detection in OpenTofu
In OpenTofu 1.8 and later, tofu plan supports additional drift-related options. In particular, combined with state encryption, drift detection can be run more securely.
# backend.tf -- OpenTofu state encryption configuration
terraform {
  encryption {
    # key_provider must be declared inside the encryption block
    key_provider "aws_kms" "state_key" {
      kms_key_id = "alias/tofu-state-encryption"
      region     = "ap-northeast-2"
      key_spec   = "AES_256"
    }

    method "aes_gcm" "state_enc" {
      keys = key_provider.aws_kms.state_key
    }

    state {
      method   = method.aes_gcm.state_enc
      enforced = true
    }

    plan {
      method   = method.aes_gcm.state_enc
      enforced = true
    }
  }

  backend "s3" {
    bucket         = "myorg-tofu-state"
    key            = "prod/infrastructure.tfstate"
    region         = "ap-northeast-2"
    dynamodb_table = "tofu-state-lock"
    encrypt        = true
  }
}
terraform plan JSON output parsing
The plan output in JSON format allows detailed programmatic analysis of drift.
#!/usr/bin/env python3
"""
drift_analyzer.py -- Terraform plan JSON drift analyzer
Analyzes a file produced with: terraform show -json drift.plan > plan.json
"""
import json
import sys
from dataclasses import dataclass, field
from enum import Enum


class DriftSeverity(Enum):
    CRITICAL = "critical"
    HIGH = "high"
    MEDIUM = "medium"
    LOW = "low"


class DriftAction(Enum):
    CREATE = "create"
    UPDATE = "update"
    DELETE = "delete"
    REPLACE = "replace"
    READ = "read"


@dataclass
class DriftItem:
    address: str
    resource_type: str
    action: DriftAction
    severity: DriftSeverity
    changed_attributes: list = field(default_factory=list)
    description: str = ""


# Severity classification rules
SEVERITY_RULES = {
    # Default severity by resource type
    "aws_security_group_rule": DriftSeverity.CRITICAL,
    "aws_iam_role_policy": DriftSeverity.CRITICAL,
    "aws_iam_policy": DriftSeverity.CRITICAL,
    "aws_s3_bucket_policy": DriftSeverity.HIGH,
    "aws_db_instance": DriftSeverity.HIGH,
    "aws_instance": DriftSeverity.MEDIUM,
    "aws_autoscaling_group": DriftSeverity.LOW,
}

# Deletion/replacement is always at least HIGH
ACTION_SEVERITY_FLOOR = {
    DriftAction.DELETE: DriftSeverity.HIGH,
    DriftAction.REPLACE: DriftSeverity.HIGH,
}

# Numeric rank -- lower means more severe. Comparing the enums' string
# values would sort "low" before "medium" alphabetically, which is wrong.
SEVERITY_RANK = {
    DriftSeverity.CRITICAL: 0,
    DriftSeverity.HIGH: 1,
    DriftSeverity.MEDIUM: 2,
    DriftSeverity.LOW: 3,
}


def classify_severity(resource_type: str, action: DriftAction) -> DriftSeverity:
    """Classify drift severity based on resource type and action."""
    base = SEVERITY_RULES.get(resource_type, DriftSeverity.MEDIUM)
    floor = ACTION_SEVERITY_FLOOR.get(action)
    if floor and SEVERITY_RANK[floor] < SEVERITY_RANK[base]:
        return floor
    return base


def parse_plan(plan_path: str) -> list[DriftItem]:
    """Parse Terraform plan JSON and return a list of drift items."""
    with open(plan_path) as f:
        plan = json.load(f)

    drifts = []
    for change in plan.get("resource_changes", []):
        actions = change.get("change", {}).get("actions", [])

        # Skip no-ops and reads
        if actions == ["no-op"] or actions == ["read"]:
            continue

        if "delete" in actions and "create" in actions:
            action = DriftAction.REPLACE
        elif "delete" in actions:
            action = DriftAction.DELETE
        elif "create" in actions:
            action = DriftAction.CREATE
        elif "update" in actions:
            action = DriftAction.UPDATE
        else:
            continue

        resource_type = change.get("type", "unknown")
        address = change.get("address", "unknown")

        # Extract the changed attributes
        before = change.get("change", {}).get("before") or {}
        after = change.get("change", {}).get("after") or {}
        changed_attrs = []
        if isinstance(before, dict) and isinstance(after, dict):
            for key in set(before) | set(after):
                if before.get(key) != after.get(key):
                    changed_attrs.append(key)

        drifts.append(DriftItem(
            address=address,
            resource_type=resource_type,
            action=action,
            severity=classify_severity(resource_type, action),
            changed_attributes=changed_attrs,
        ))
    return drifts


def generate_report(drifts: list[DriftItem]) -> dict:
    """Generate a drift analysis report."""
    report = {
        "total_drifts": len(drifts),
        "by_severity": {},
        "by_action": {},
        "critical_items": [],
        "details": [],
    }
    for d in drifts:
        report["by_severity"][d.severity.value] = \
            report["by_severity"].get(d.severity.value, 0) + 1
        report["by_action"][d.action.value] = \
            report["by_action"].get(d.action.value, 0) + 1
        if d.severity == DriftSeverity.CRITICAL:
            report["critical_items"].append({
                "address": d.address,
                "action": d.action.value,
                "changed_attributes": d.changed_attributes,
            })
        report["details"].append({
            "address": d.address,
            "type": d.resource_type,
            "action": d.action.value,
            "severity": d.severity.value,
            "changed_attributes": d.changed_attributes,
        })
    return report


if __name__ == "__main__":
    if len(sys.argv) < 2:
        print("Usage: python drift_analyzer.py <plan.json>")
        sys.exit(1)
    drifts = parse_plan(sys.argv[1])
    report = generate_report(drifts)
    print(json.dumps(report, indent=2, ensure_ascii=False))
    # Exit with code 1 if any CRITICAL items exist
    if report["by_severity"].get("critical", 0) > 0:
        sys.exit(1)
CI/CD pipeline integration
Scheduled drift detection with GitHub Actions
In practice, drift detection should run automatically on a schedule. The following is a drift detection pipeline built with GitHub Actions.
# .github/workflows/drift-detection.yml
name: Infrastructure Drift Detection

on:
  schedule:
    # Runs daily at 09:00 and 15:00 KST (00:00 and 06:00 UTC)
    - cron: '0 0,6 * * *'
  workflow_dispatch:
    inputs:
      workspace:
        description: 'Workspace to check (leave empty for all)'
        required: false
        type: string

permissions:
  id-token: write
  contents: read
  issues: write

env:
  TF_VERSION: '1.11.0'
  TOFU_VERSION: '1.9.0'

jobs:
  detect-drift:
    name: Detect Drift - ${{ matrix.workspace }}
    runs-on: ubuntu-latest
    strategy:
      fail-fast: false
      matrix:
        workspace:
          - environments/prod/networking
          - environments/prod/compute
          - environments/prod/database
          - environments/prod/security
          - environments/staging/networking
          - environments/staging/compute
    steps:
      - name: Checkout
        uses: actions/checkout@v4

      - name: Configure AWS Credentials (OIDC)
        uses: aws-actions/configure-aws-credentials@v4
        with:
          role-to-assume: ${{ secrets.AWS_DRIFT_DETECTION_ROLE_ARN }}
          aws-region: ap-northeast-2
          role-session-name: drift-detection-${{ github.run_id }}

      - name: Setup Terraform
        uses: hashicorp/setup-terraform@v3
        with:
          terraform_version: ${{ env.TF_VERSION }}
          terraform_wrapper: false

      - name: Terraform Init
        working-directory: ${{ matrix.workspace }}
        run: terraform init -input=false -no-color

      - name: Detect Drift
        id: drift
        working-directory: ${{ matrix.workspace }}
        run: |
          set +e
          terraform plan -detailed-exitcode -input=false -no-color \
            -out=drift.plan 2>&1 | tee plan-output.txt
          # $? would capture tee's exit code; PIPESTATUS holds terraform's
          echo "exit_code=${PIPESTATUS[0]}" >> "$GITHUB_OUTPUT"

      - name: Analyze Drift
        if: steps.drift.outputs.exit_code == '2'
        working-directory: ${{ matrix.workspace }}
        run: |
          terraform show -json drift.plan > plan.json
          # The analyzer exits non-zero on critical drift; don't fail the step
          python3 ${{ github.workspace }}/scripts/drift_analyzer.py plan.json \
            > drift-report.json || true
          echo "## Drift detected: ${{ matrix.workspace }}" >> "$GITHUB_STEP_SUMMARY"
          echo '```json' >> "$GITHUB_STEP_SUMMARY"
          cat drift-report.json >> "$GITHUB_STEP_SUMMARY"
          echo '```' >> "$GITHUB_STEP_SUMMARY"

      - name: Create Issue for Critical Drift
        if: steps.drift.outputs.exit_code == '2'
        uses: actions/github-script@v7
        with:
          script: |
            const fs = require('fs');
            const report = JSON.parse(
              fs.readFileSync('${{ matrix.workspace }}/drift-report.json', 'utf8')
            );
            if (report.by_severity?.critical > 0) {
              await github.rest.issues.create({
                owner: context.repo.owner,
                repo: context.repo.repo,
                title: `[CRITICAL DRIFT] ${{ matrix.workspace }}`,
                body: `## Critical infrastructure drift detected\n\n` +
                  `**Workspace**: ${{ matrix.workspace }}\n` +
                  `**Detected at**: ${new Date().toISOString()}\n\n` +
                  `### Critical items\n` +
                  '```json\n' +
                  JSON.stringify(report.critical_items, null, 2) +
                  '\n```\n\n' +
                  `### Summary\n` +
                  `- Total drifts: ${report.total_drifts}\n` +
                  `- Critical: ${report.by_severity.critical || 0}\n` +
                  `- High: ${report.by_severity.high || 0}\n`,
                labels: ['drift', 'critical', 'infrastructure'],
              });
            }

      - name: Notify Slack
        if: steps.drift.outputs.exit_code == '2'
        uses: slackapi/slack-github-action@v2
        with:
          webhook: ${{ secrets.SLACK_DRIFT_WEBHOOK }}
          webhook-type: incoming-webhook
          payload: |
            {
              "text": "Infrastructure drift detected: ${{ matrix.workspace }}",
              "blocks": [
                {
                  "type": "section",
                  "text": {
                    "type": "mrkdwn",
                    "text": ":warning: *Drift detected*\nWorkspace: `${{ matrix.workspace }}`\n<${{ github.server_url }}/${{ github.repository }}/actions/runs/${{ github.run_id }}|View details>"
                  }
                }
              ]
            }
GitLab CI-based drift detection
Here is an example configuration for organizations using GitLab CI.
# .gitlab-ci.yml -- drift detection pipeline
stages:
  - drift-detect
  - drift-report

.drift_template: &drift_template
  image:
    name: hashicorp/terraform:1.11
    entrypoint: [""]  # override the image's terraform entrypoint
  before_script:
    - export AWS_WEB_IDENTITY_TOKEN_FILE=/tmp/web-identity-token
    - echo "${CI_JOB_JWT_V2}" > $AWS_WEB_IDENTITY_TOKEN_FILE
  rules:
    - if: $CI_PIPELINE_SOURCE == "schedule"
    - if: $CI_PIPELINE_SOURCE == "web"

drift:prod-networking:
  <<: *drift_template
  stage: drift-detect
  variables:
    TF_WORKSPACE_DIR: environments/prod/networking
  script:
    - cd $TF_WORKSPACE_DIR
    - terraform init -input=false -no-color
    - |
      set +e
      # Avoid piping through tee so $? reflects terraform's exit code
      terraform plan -detailed-exitcode -input=false -no-color > plan.txt 2>&1
      EXIT_CODE=$?
      set -e
      cat plan.txt
      if [ $EXIT_CODE -eq 2 ]; then
        echo "DRIFT_DETECTED=true" >> drift.env
        echo "DRIFT_WORKSPACE=$TF_WORKSPACE_DIR" >> drift.env
      elif [ $EXIT_CODE -eq 1 ]; then
        echo "DRIFT_ERROR=true" >> drift.env
        exit 1
      fi
  artifacts:
    reports:
      dotenv: $TF_WORKSPACE_DIR/drift.env
    paths:
      - $TF_WORKSPACE_DIR/plan.txt
    when: always
Automated Remediation Strategy
Spectrum of recovery strategies
There are several ways to detect drift and then recover. An appropriate strategy should be selected based on the organization's maturity and risk tolerance.
| Strategy | Description | Risk | Automation level | Suitable environment |
|---|---|---|---|---|
| Notify only | Send a notification when drift is detected | Low | Low | Early adoption, regulated environments |
| Code sync | Reflect the actual state back into code (import/refresh) | Medium | Medium | Environments with many intentional changes |
| Auto apply | Automatically restore infrastructure to the coded state | High | High | Environments where code is the single source of truth |
| Selective remediation | Split automatic/manual recovery by severity | Medium | Medium to high | Most production environments |
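The decision rule behind severity-based selective remediation can be expressed compactly. A sketch, operating on the by_severity counts that a drift report such as the analyzer output earlier in this article produces (the function name is illustrative):

```python
def choose_action(by_severity: dict) -> str:
    """Pick a remediation action from per-severity drift counts.

    CRITICAL/HIGH -> block and require manual review;
    MEDIUM/LOW    -> eligible for automatic remediation;
    nothing       -> no action needed.
    """
    if by_severity.get("critical", 0) or by_severity.get("high", 0):
        return "manual-review"
    if by_severity.get("medium", 0) or by_severity.get("low", 0):
        return "auto-remediate"
    return "none"

print(choose_action({"critical": 1, "low": 3}))  # manual-review
print(choose_action({"low": 2}))                 # auto-remediate
print(choose_action({}))                         # none
```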
Implementing selective automatic remediation
The pattern most commonly used in practice is severity-based selective remediation: CRITICAL drift triggers an immediate alert and requires manual review, while LOW severity drift is remediated automatically.
#!/bin/bash
# selective-remediation.sh -- severity-based selective drift remediation
set -euo pipefail

WORKSPACE_DIR="$1"
DRIFT_REPORT="$2"        # JSON output from drift_analyzer.py
AUTO_APPLY="${3:-false}" # when true, auto-remediate LOW/MEDIUM drift

cd "$WORKSPACE_DIR"

# Extract per-severity counts from the drift report
CRITICAL_COUNT=$(jq -r '.by_severity.critical // 0' "$DRIFT_REPORT")
HIGH_COUNT=$(jq -r '.by_severity.high // 0' "$DRIFT_REPORT")
MEDIUM_COUNT=$(jq -r '.by_severity.medium // 0' "$DRIFT_REPORT")
LOW_COUNT=$(jq -r '.by_severity.low // 0' "$DRIFT_REPORT")

echo "=== Drift analysis result ==="
echo "  CRITICAL: $CRITICAL_COUNT"
echo "  HIGH:     $HIGH_COUNT"
echo "  MEDIUM:   $MEDIUM_COUNT"
echo "  LOW:      $LOW_COUNT"
echo "============================="

# Abort automatic remediation if any CRITICAL or HIGH drift exists
if [ "$CRITICAL_COUNT" -gt 0 ] || [ "$HIGH_COUNT" -gt 0 ]; then
  echo "[BLOCKED] CRITICAL/HIGH drift detected -- manual review required"
  echo "Review the following resources manually:"
  jq -r '.details[] | select(.severity == "critical" or .severity == "high") |
    "  - [\(.severity | ascii_upcase)] \(.address) (\(.action)): \(.changed_attributes | join(", "))"' \
    "$DRIFT_REPORT"
  exit 2
fi

# Only LOW/MEDIUM drift remains and auto-remediation is enabled
if [ "$AUTO_APPLY" = "true" ]; then
  echo "[AUTO-REMEDIATE] Starting automatic remediation of LOW/MEDIUM drift"
  # Apply only the targeted resources
  TARGETS=$(jq -r '.details[] | select(.severity == "low" or .severity == "medium") |
    "-target=\(.address)"' "$DRIFT_REPORT")
  if [ -n "$TARGETS" ]; then
    echo "Remediation targets: $TARGETS"
    # shellcheck disable=SC2086
    terraform apply -auto-approve -input=false -no-color $TARGETS
    echo "[DONE] Automatic remediation complete"
  fi
else
  echo "[INFO] Auto-remediation disabled -- notification only"
  exit 0
fi
Reverse synchronization using terraform import
Reverse synchronization, importing resources created in the console into code, is another important recovery strategy. In Terraform 1.5 and later, import blocks let you perform imports declaratively at the code level.
# imports.tf -- declarative import blocks (Terraform 1.5+, OpenTofu 1.6+)

# Bring a Security Group created in the console under code management
import {
  to = aws_security_group.emergency_debug_sg
  id = "sg-0abc123def456"
}

# Bring an S3 bucket created in the console under code management
import {
  to = aws_s3_bucket.manual_backup
  id = "prod-manual-backup-20260307"
}

# Code declarations for the imported resources
resource "aws_security_group" "emergency_debug_sg" {
  name        = "emergency-debug-sg"
  description = "Emergency debugging SG - created during incident INC-2026-0307"
  vpc_id      = module.networking.vpc_id

  # After importing, run plan and reconcile the code with the real settings
  ingress {
    from_port   = 5432
    to_port     = 5432
    protocol    = "tcp"
    cidr_blocks = ["10.0.0.0/8"] # restricted to the internal network
    description = "PostgreSQL access from internal network"
  }

  tags = {
    Name       = "emergency-debug-sg"
    ManagedBy  = "terraform"
    CreatedBy  = "manual-import"
    Incident   = "INC-2026-0307"
    ReviewDate = "2026-03-14" # review one week later
  }

  lifecycle {
    # Prevent an unintended deletion on the first apply after import
    prevent_destroy = true
  }
}

resource "aws_s3_bucket" "manual_backup" {
  bucket = "prod-manual-backup-20260307"
  tags = {
    Name      = "prod-manual-backup"
    ManagedBy = "terraform"
    CreatedBy = "manual-import"
  }
}
Tool Comparison: Spacelift vs Terramate vs Native
Feature comparison table
A dedicated drift management tool can significantly reduce operational burden compared to native methods. The following table compares the major tools.
| Feature | Terraform Native | OpenTofu Native | Spacelift | Terramate | Env0 |
|---|---|---|---|---|---|
| Automatic drift detection | plan + cron (manual setup) | plan + cron (manual setup) | Built-in (scheduled) | CLI + CI integration | Built-in (scheduled) |
| Drift visualization | plan text output | plan text output | Web UI dashboard | CLI report | Web UI dashboard |
| Automatic remediation | Custom scripts | Custom scripts | Policy-based auto apply | Orchestration integration | Policy-based auto apply |
| Policy engine | Sentinel (paid) | OPA integration | Built-in OPA + Rego | None (external integration) | OPA integration |
| State encryption | Backend-dependent (e.g. S3 SSE) | Native support | Managed state | Backend-dependent | Managed state |
| Multi-stack management | Workspaces | Workspaces | Per-stack management | Stack orchestration | Per-environment |
| Cost | Free (OSS) | Free (OSS) | Paid (free tier available) | Free (OSS) + paid cloud | Paid (free tier available) |
| VCS integration | None (CI/CD dependent) | None (CI/CD dependent) | Deep GitHub/GitLab integration | GitHub/GitLab integration | GitHub/GitLab integration |
| Notification channels | Custom | Custom | Slack, Teams, webhooks | Slack, webhooks | Slack, Teams, webhooks |
Spacelift drift detection settings
Spacelift is one of the most mature Terraform/OpenTofu management platforms. Drift detection can be set on a per-stack basis.
# spacelift.tf -- enable drift detection on a Spacelift stack
resource "spacelift_stack" "prod_networking" {
  name         = "prod-networking"
  repository   = "infrastructure"
  branch       = "main"
  project_root = "environments/prod/networking"

  # Choose Terraform or OpenTofu
  terraform_version = "1.11.0"
  # opentofu_version = "1.9.0" # when using OpenTofu

  autodeploy = false # auto-apply on PR merge?
  autoretry  = true  # retry on transient errors

  labels = ["prod", "networking", "drift-enabled"]
}

# Drift detection schedule
resource "spacelift_drift_detection" "prod_networking" {
  stack_id     = spacelift_stack.prod_networking.id
  reconcile    = false # when true, drift is remediated automatically (careful!)
  schedule     = ["0 0 * * *", "0 6 * * *"] # twice daily
  timezone     = "Asia/Seoul"
  ignore_state = false
}

# Notification policy for detected drift
resource "spacelift_policy" "drift_notification" {
  name = "drift-notification-policy"
  type = "NOTIFICATION"
  body = <<-EOT
    package spacelift

    # Send a Slack notification when drift is detected
    webhook[{"endpoint_id": endpoint_id, "payload": payload}] {
      input.run_type == "DRIFT_DETECTION"
      input.run_state == "FINISHED"
      input.drift_detection.drifted == true
      endpoint_id := "${spacelift_webhook.slack_drift.id}"
      payload := {
        "text": sprintf(":warning: Drift detected: %s\nDrifted resources: %d",
          [input.stack.name, input.drift_detection.resources_drifted])
      }
    }
  EOT
}

resource "spacelift_policy_attachment" "drift_notification" {
  policy_id = spacelift_policy.drift_notification.id
  stack_id  = spacelift_stack.prod_networking.id
}
Multi-stack drift orchestration with Terramate
Terramate is an open source tool for orchestrating multiple Terraform/OpenTofu stacks. It has strengths in change detection and execution order management.
# terramate.tm.hcl -- Terramate stack configuration
terramate {
  config {
    run {
      env {
        TF_PLUGIN_CACHE_DIR = "/tmp/terraform-plugin-cache"
      }
    }
    cloud {
      organization = "myorg"
    }
  }
}

# Stack definition
stack {
  name        = "prod-networking"
  description = "Production VPC and networking resources"
  id          = "prod-networking-001"
  tags        = ["prod", "networking", "drift-check"]

  # Dependency ordering -- networking must be checked first
  after  = []
  before = [
    "/stacks/prod/compute",
    "/stacks/prod/database",
  ]
}

# Multi-stack drift detection with the Terramate CLI

# Detect drift across all prod stacks (respecting dependency order)
terramate run \
  --tags prod:drift-check \
  --parallel 4 \
  -- terraform plan -detailed-exitcode -input=false -no-color

# Check only the stacks that changed (based on Git diff)
terramate run \
  --changed \
  -- terraform plan -detailed-exitcode -input=false

# Sync drift results to Terramate Cloud
terramate run \
  --tags prod \
  --sync-drift \
  -- terraform plan -detailed-exitcode -out=drift.plan
Precautions during operation
Side effects of drift detection
Drift detection itself can cause operational problems. Pay particular attention to the following:
1. API rate limiting
terraform plan calls the provider API for every managed resource. Frequent detection runs on large infrastructure can hit AWS API rate limits.
2. State lock contention
Drift detection takes a lock on the state, so it can conflict with a terraform apply running at the same time.
3. Sensitive data exposure
Plan output may contain sensitive information such as database passwords and API keys. Filter it before saving logs.
# Prevent sensitive data from leaking into plan output
# Uses ephemeral variables (Terraform 1.10+)
variable "database_password" {
  type      = string
  sensitive = true
  ephemeral = true # never persisted to state or plan (Terraform 1.10+)
}

resource "aws_db_instance" "main" {
  identifier     = "prod-main-db"
  engine         = "postgres"
  engine_version = "16.4"
  instance_class = "db.r6g.xlarge"

  # Ephemeral values may only be assigned to write-only arguments;
  # password_wo is never stored in state, and bumping
  # password_wo_version signals a rotation
  password_wo         = var.database_password
  password_wo_version = 1

  # Exclude attributes expected to change externally via ignore_changes
  lifecycle {
    ignore_changes = [
      # Attributes managed by an ASG are excluded from drift detection
      # (otherwise they cause false positives)
    ]
  }
}
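Beyond ephemeral variables, plan JSON can also be scrubbed before it is archived or posted to chat. A minimal sketch that masks values whose attribute names match common secret patterns; the key list is illustrative, not exhaustive:

```python
import json

SENSITIVE_KEYS = {"password", "secret", "token", "private_key", "api_key"}

def redact(obj, mask="[REDACTED]"):
    """Recursively mask values whose key names look sensitive."""
    if isinstance(obj, dict):
        return {
            k: mask if any(s in k.lower() for s in SENSITIVE_KEYS) else redact(v, mask)
            for k, v in obj.items()
        }
    if isinstance(obj, list):
        return [redact(v, mask) for v in obj]
    return obj

change = {"before": {"password": "hunter2", "engine": "postgres"},
          "after": {"password": "hunter3", "engine": "postgres"}}
print(json.dumps(redact(change)))
```

Name-based masking is a safety net, not a guarantee; secrets stored under unremarkable key names will slip through, so marking them sensitive in code remains the primary defense.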
Strategic use of ignore_changes
Trying to detect every drift produces a flood of false positives. Use ignore_changes strategically to allow for intentional drift.
# Auto Scaling Group -- desired_capacity is managed by the ASG
resource "aws_autoscaling_group" "web" {
  name                = "prod-web-asg"
  min_size            = 2
  max_size            = 20
  desired_capacity    = 4
  vpc_zone_identifier = module.networking.private_subnet_ids

  launch_template {
    id      = aws_launch_template.web.id
    version = "$Latest"
  }

  lifecycle {
    ignore_changes = [
      desired_capacity,  # managed by ASG scaling policies
      target_group_arns, # changed dynamically by ALB integration
    ]
  }
}

# EKS node group -- exclude attributes managed by the cluster autoscaler
resource "aws_eks_node_group" "workers" {
  cluster_name    = aws_eks_cluster.main.name
  node_group_name = "workers"
  node_role_arn   = aws_iam_role.node_group.arn
  subnet_ids      = module.networking.private_subnet_ids

  scaling_config {
    desired_size = 3
    max_size     = 10
    min_size     = 2
  }

  lifecycle {
    ignore_changes = [
      scaling_config[0].desired_size, # managed by the Cluster Autoscaler
    ]
  }
}
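Which attributes deserve an ignore_changes entry is an empirical question. One pragmatic approach is to aggregate changed_attributes across a series of drift reports and flag attributes that drift repeatedly on the same resource. A sketch, assuming reports shaped like the analyzer output earlier in this article (the threshold is arbitrary):

```python
from collections import Counter

def suggest_ignore_candidates(reports: list[dict], threshold: int = 3) -> list[str]:
    """Flag (address, attribute) pairs that drift repeatedly.

    `reports` is a list of drift reports, each with a `details` list of
    {"address": ..., "changed_attributes": [...]} entries.
    """
    counts = Counter()
    for report in reports:
        for item in report.get("details", []):
            for attr in item.get("changed_attributes", []):
                counts[(item["address"], attr)] += 1
    return [f"{addr}: {attr}"
            for (addr, attr), n in counts.items() if n >= threshold]

reports = [{"details": [{"address": "aws_autoscaling_group.web",
                         "changed_attributes": ["desired_capacity"]}]}] * 3
print(suggest_ignore_candidates(reports))
# ['aws_autoscaling_group.web: desired_capacity']
```

An attribute that drifts on every scheduled run is usually being managed by something else (an autoscaler, a service integration) and is a strong ignore_changes candidate; review each suggestion rather than applying it blindly.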
Failure cases and recovery procedures
Case 1: Total Drift Due to State File Corruption
Situation: the state file in the S3 backend is corrupted, and every resource is reported as drifted.
# Recovery step 1: restore a previous state from S3 versioning
aws s3api list-object-versions \
  --bucket myorg-terraform-state \
  --prefix prod/infrastructure.tfstate \
  --query 'Versions[*].[VersionId,LastModified,Size]' \
  --output table

# After identifying the VersionId of a healthy previous version, restore it
aws s3api get-object \
  --bucket myorg-terraform-state \
  --key prod/infrastructure.tfstate \
  --version-id "abc123def456" \
  restored-state.tfstate

# Verify the state file's integrity
terraform show restored-state.tfstate | head -20

# Replace the current state (after backing up the corrupted one)
aws s3 cp \
  s3://myorg-terraform-state/prod/infrastructure.tfstate \
  s3://myorg-terraform-state/prod/infrastructure.tfstate.corrupted-backup
aws s3 cp \
  restored-state.tfstate \
  s3://myorg-terraform-state/prod/infrastructure.tfstate

# Release the state lock if needed
terraform force-unlock <LOCK_ID>

# Confirm the drift status with plan
terraform plan -detailed-exitcode
Case 2: Code synchronization fails after emergency changes
Situation: a Security Group rule was added in the console during incident response, and the next terraform apply deleted that rule.
# Recovery: refresh the state with reality, then fix the code
# 1. Reflect the actual infrastructure state into the state file
#    (terraform refresh is deprecated; use the -refresh-only flow)
terraform apply -refresh-only

# 2. Check the drift
terraform plan

# 3. Reflect the emergency change in the code
# (after editing the code)
terraform plan  # confirm "No changes"
terraform apply
Case 3: Massive drift after provider upgrade
Situation: after upgrading the AWS provider from 5.x to 6.x, drift is reported on hundreds of resources because of changed defaults.
In this case, carefully review the terraform plan output and determine whether it implies real infrastructure changes or merely new attributes being recorded in state. In most cases, running terraform apply only updates the state and does not modify the actual infrastructure.
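A rough way to triage such a plan is to separate attribute diffs where the old value was simply untracked (before is null) from diffs where a real value changes. The former are usually state-only updates after a provider upgrade; the latter deserve review. A heuristic sketch over the before/after maps of a plan JSON resource change:

```python
def triage_attribute_diffs(before: dict, after: dict) -> dict:
    """Split attribute diffs into likely state-only vs. real changes.

    Heuristic: an attribute going from None/absent to a concrete value
    is probably a newly tracked default (state-only update after a
    provider upgrade); value-to-different-value is a real change.
    """
    state_only, real = [], []
    for key in set(before) | set(after):
        b, a = before.get(key), after.get(key)
        if b == a:
            continue
        (state_only if b is None else real).append(key)
    return {"state_only": sorted(state_only), "real": sorted(real)}

print(triage_attribute_diffs(
    {"instance_type": "t3.medium", "metadata_options": None},
    {"instance_type": "t3.large", "metadata_options": {"http_tokens": "required"}},
))
# {'state_only': ['metadata_options'], 'real': ['instance_type']}
```

Anything in the "real" bucket should be inspected by hand before apply; the heuristic cannot tell an intentional provider default change from genuine drift.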
Operational Checklist
When building an IaC drift management system in a production environment, refer to the following checklist.
Base building phase
- Are versioning and locking enabled on the state backend?
- Are state files backed up regularly?
- Is there a workflow where every infrastructure change is reviewed through a PR?
- Are audit logs (CloudTrail, etc.) enabled for direct console/CLI access?
Drift Detection Step
- Is a regular drift detection schedule set (at least once a day)?
- Is a notification channel configured for drift detection results?
- Is a severity-based classification system defined?
- Is an ignore_changes policy in place to reduce false positives?
- Is the detection schedule set with API rate limits in mind?
Recovery Process Steps
- Are recovery procedures (runbooks) documented for each drift type?
- Is there a defined code synchronization process after emergency changes?
- Have recovery procedures been tested in case of state file corruption?
- Are the scope and conditions of automatic recovery clearly defined?
Governance Stage
- Are there regular review meetings on drift detection results?
- Are there metrics/dashboards to track drift trends?
- Is root cause analysis (RCA) performed for repetitive drift?
- Are IaC change rules (required code review, no console changes, etc.) agreed upon between teams?
Conclusion
Infrastructure drift is an inevitable reality that every organization that adopts IaC faces. The important thing is not to completely eliminate drift, but to have the ability to detect it quickly and manage it systematically.
Basic drift detection is possible with the native features of Terraform and OpenTofu, but as your organization grows you will likely need a dedicated platform such as Spacelift or Terramate. Whichever tool you choose, the core principle is the same: keep code as the single source of truth, while running a process that systematically absorbs real-world exceptions.
It is unrealistic to completely ban console access during emergency incident response. A more pragmatic approach is to build a culture of synchronizing code within 24 hours of an emergency change, and to deploy automated drift detection to verify that it happens. Drift is as much a process and culture problem as a technical one. Only with both the tools and the processes in place can the promise of IaC be kept.