- Introduction
- Concept and types of infrastructure drift
- Common causes of drift
- Native drift detection method
- CI/CD pipeline integration
- Automated Remediation Strategy
- Tool Comparison: Spacelift vs Terramate vs Native
- Precautions during operation
- Failure cases and recovery procedures
- Operational Checklist
- Conclusion
- References

Introduction
The core promise of Infrastructure as Code (IaC) is that "the state declared in code is the actual state of the infrastructure." In a real production environment, however, this promise is easily broken. Emergency fixes in the console, interference from other automation tools, dynamic changes by AWS Auto Scaling, and even differences in API defaults between versions all open a gap between code and actual infrastructure. This gap is called infrastructure drift.
As of 2026, with the release of Terraform 1.11 and OpenTofu 1.9, drift detection and management capabilities have been greatly enhanced. Even so, it is difficult to build an organization-level drift management system with the terraform plan command alone. This article analyzes the root causes of drift and covers everything needed for real-world operation: building an automated detection pipeline, CI/CD integration, recovery strategies, and using dedicated tools such as Spacelift and Terramate.
Concept and types of infrastructure drift
What is drift?
Infrastructure drift refers to the discrepancy between the desired state declared in IaC code and the current state that actually exists at the cloud provider. Terraform's state file stores the last known state of your infrastructure; whenever these three (code, state, and actual infrastructure) disagree, you have drift.
┌─────────────────┐      ┌──────────────────┐      ┌─────────────────┐
│    HCL Code     │      │ Terraform State  │      │  Actual Infra   │
│   (Desired)     │      │  (Last Known)    │      │    (Actual)     │
│                 │      │                  │      │                 │
│ instance_type   │      │ instance_type    │      │ instance_type   │
│ = "t3.medium"   │      │ = "t3.medium"    │      │ = "t3.xlarge"   │
│                 │      │                  │      │ ← changed in    │
│                 │      │                  │      │   the console   │
└─────────────────┘      └──────────────────┘      └─────────────────┘
         ▲                        ▲                        ▲
         │                        │                        │
         └──────── match ─────────┘──── mismatch (drift) ──┘
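The three-way comparison in the diagram can be sketched in a few lines of Python. This is an illustrative model of the concept only, with plain dicts standing in for code, state, and cloud responses; it is not how Terraform is implemented:

```python
def diff(a: dict, b: dict) -> dict:
    """Attributes whose values differ between two snapshots."""
    return {k: (a.get(k), b.get(k))
            for k in set(a) | set(b) if a.get(k) != b.get(k)}

code   = {"instance_type": "t3.medium"}  # HCL (desired)
state  = {"instance_type": "t3.medium"}  # state (last known)
actual = {"instance_type": "t3.xlarge"}  # cloud (changed in the console)

pending = diff(code, state)    # code vs state: unapplied code changes
drift   = diff(state, actual)  # state vs actual: infrastructure drift

print(pending)  # {}
print(drift)    # {'instance_type': ('t3.medium', 't3.xlarge')}
```

A diff between code and state means a pending (unapplied) change; a diff between state and actual infrastructure is the drift that the rest of this article is about.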
Three types of drift
1. Configuration Drift
A resource attribute differs from the value declared in code. This applies, for example, when a Security Group rule is added in the console or an EC2 instance type is changed.
2. Existence Drift
A resource declared in code has actually been deleted, or a resource exists that is not declared in code. This occurs when someone deletes an S3 bucket from the console or creates an RDS instance outside of code.
3. Dependency Drift
A reference relationship between resources is broken, for example when a subnet is deleted while an EC2 instance that referenced it has been moved to another subnet.
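These categories also surface in terraform plan JSON output as different action combinations under resource_changes[].change.actions. The mapping below is a rough heuristic sketch, not an official classification; dependency drift in particular usually appears indirectly as a cascade of updates or replaces and cannot be read off a single action list:

```python
def drift_category(actions: list[str]) -> str:
    """Roughly map a plan action list to a drift category.

    Heuristic: resources appearing or disappearing are existence drift;
    in-place updates and replacements are configuration drift.
    """
    if actions in (["create"], ["delete"]):
        return "existence"
    if "update" in actions or ("delete" in actions and "create" in actions):
        return "configuration"
    return "none"

print(drift_category(["update"]))            # configuration
print(drift_category(["create"]))            # existence
print(drift_category(["delete", "create"]))  # configuration (replace)
print(drift_category(["no-op"]))             # none
```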
Common causes of drift
Human Factors
The most frequent cause of drift is direct human intervention. A typical example is when the infrastructure is urgently changed through the console or CLI in a failure situation.
# Emergency incident response -- a common drift scenario
# 1. Open a debugging port on a Security Group
aws ec2 authorize-security-group-ingress \
  --group-id sg-0abc123def456 \
  --protocol tcp \
  --port 5432 \
  --cidr 0.0.0.0/0   # DANGER: exposes the PostgreSQL port to the world
# 2. Emergency scale-up of an RDS instance class
aws rds modify-db-instance \
  --db-instance-identifier prod-main-db \
  --db-instance-class db.r6g.2xlarge \
  --apply-immediately
# 3. Manually change the desired count of an Auto Scaling Group
aws autoscaling set-desired-capacity \
  --auto-scaling-group-name prod-web-asg \
  --desired-capacity 10
These changes may be unavoidable in an emergency, but if they are not reflected back into the code, the next terraform apply will roll them back unintentionally.
System factors
- Auto Scaling: AWS ASGs, EKS node groups, etc. automatically change instance counts
- Provider default changes: the AWS API automatically adds default tags or settings when a resource is created
- Service integrations: CloudTrail, Config Rules, etc. automatically modify settings
- Terraform provider upgrades: attributes added in a new provider version are populated with default values
Process factors
- State file corruption: the state file in the S3 backend is accidentally overwritten or deleted
- Parallel apply: multiple pipelines run apply simultaneously against the same state
- Partial apply: terraform apply fails midway, changing only some resources
Native drift detection method
Basic detection using terraform plan
terraform plan is the most basic drift detection tool. The -detailed-exitcode flag lets you determine programmatically whether drift exists.
#!/bin/bash
# drift-detect.sh -- Terraform drift detection script
set -euo pipefail

WORKSPACE_DIR="${1:-.}"
SLACK_WEBHOOK_URL="${SLACK_WEBHOOK_URL:-}"
LOG_FILE="/var/log/terraform/drift-$(date +%Y%m%d-%H%M%S).log"

mkdir -p "$(dirname "$LOG_FILE")"
cd "$WORKSPACE_DIR"
echo "[$(date)] Drift detection started: $WORKSPACE_DIR" | tee -a "$LOG_FILE"

# terraform init (avoid backend re-initialization prompts)
terraform init -input=false -no-color >> "$LOG_FILE" 2>&1

# -detailed-exitcode exit codes:
#   0 = no changes (no drift)
#   1 = error
#   2 = changes present (drift detected)
# Temporarily disable -e so exit code 2 does not abort the script
set +e
terraform plan -detailed-exitcode -input=false -no-color -out=drift.plan >> "$LOG_FILE" 2>&1
EXIT_CODE=$?
set -e

case $EXIT_CODE in
  0)
    echo "[$(date)] No drift" | tee -a "$LOG_FILE"
    ;;
  1)
    echo "[$(date)] Error -- manual investigation required" | tee -a "$LOG_FILE"
    if [ -n "$SLACK_WEBHOOK_URL" ]; then
      curl -s -X POST "$SLACK_WEBHOOK_URL" \
        -H 'Content-Type: application/json' \
        -d "{\"text\": \":rotating_light: Terraform drift detection error\\nWorkspace: $WORKSPACE_DIR\\nLog: $LOG_FILE\"}"
    fi
    exit 1
    ;;
  2)
    echo "[$(date)] Drift detected!" | tee -a "$LOG_FILE"
    # Extract a change summary from the plan output
    DRIFT_SUMMARY=$(terraform show -no-color drift.plan | grep -E "^  #|Plan:|~ |\+ |- " | head -30 || true)
    if [ -n "$SLACK_WEBHOOK_URL" ]; then
      curl -s -X POST "$SLACK_WEBHOOK_URL" \
        -H 'Content-Type: application/json' \
        -d "{\"text\": \":warning: Infrastructure drift detected\\nWorkspace: $WORKSPACE_DIR\\n\`\`\`$DRIFT_SUMMARY\`\`\`\"}"
    fi
    exit 2
    ;;
esac
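If the detection logic lives inside a larger tool rather than a shell script, the same exit-code contract can be consumed from Python. A minimal sketch, assuming terraform is on PATH when detect_drift is actually invoked:

```python
import subprocess

def interpret_exit_code(code: int) -> str:
    """Map terraform plan -detailed-exitcode results to a status string."""
    return {0: "no-drift", 1: "error", 2: "drift"}.get(code, "unknown")

def detect_drift(workspace_dir: str) -> str:
    """Run terraform plan in a workspace and classify the outcome."""
    result = subprocess.run(
        ["terraform", "plan", "-detailed-exitcode", "-input=false", "-no-color"],
        cwd=workspace_dir,
        capture_output=True,
        text=True,
    )
    return interpret_exit_code(result.returncode)

print(interpret_exit_code(2))  # drift
```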
Improved drift detection in OpenTofu
In OpenTofu 1.8 and later, tofu plan supports additional drift-related options. In particular, combined with state encryption, drift detection can be run more securely.
# backend.tf -- OpenTofu state encryption configuration
terraform {
  encryption {
    # key_provider must be declared inside the encryption block
    key_provider "aws_kms" "state_key" {
      kms_key_id = "alias/tofu-state-encryption"
      region     = "ap-northeast-2"
      key_spec   = "AES_256"
    }

    method "aes_gcm" "state_enc" {
      keys = key_provider.aws_kms.state_key
    }

    state {
      method   = method.aes_gcm.state_enc
      enforced = true
    }

    plan {
      method   = method.aes_gcm.state_enc
      enforced = true
    }
  }

  backend "s3" {
    bucket         = "myorg-tofu-state"
    key            = "prod/infrastructure.tfstate"
    region         = "ap-northeast-2"
    dynamodb_table = "tofu-state-lock"
    encrypt        = true
  }
}
terraform plan JSON output parsing
The plan output in JSON format allows detailed programmatic analysis of drift.
#!/usr/bin/env python3
"""
drift_analyzer.py -- Terraform plan JSON drift analyzer
Analyzes a file produced with: terraform show -json drift.plan > plan.json
"""
import json
import sys
from dataclasses import dataclass, field
from enum import Enum


class DriftSeverity(Enum):
    CRITICAL = "critical"
    HIGH = "high"
    MEDIUM = "medium"
    LOW = "low"


class DriftAction(Enum):
    CREATE = "create"
    UPDATE = "update"
    DELETE = "delete"
    REPLACE = "replace"
    READ = "read"


@dataclass
class DriftItem:
    address: str
    resource_type: str
    action: DriftAction
    severity: DriftSeverity
    changed_attributes: list = field(default_factory=list)
    description: str = ""


# Severity classification rules
SEVERITY_RULES = {
    # Default severity by resource type
    "aws_security_group_rule": DriftSeverity.CRITICAL,
    "aws_iam_role_policy": DriftSeverity.CRITICAL,
    "aws_iam_policy": DriftSeverity.CRITICAL,
    "aws_s3_bucket_policy": DriftSeverity.HIGH,
    "aws_db_instance": DriftSeverity.HIGH,
    "aws_instance": DriftSeverity.MEDIUM,
    "aws_autoscaling_group": DriftSeverity.LOW,
}

# Deletion/replacement is always at least HIGH
ACTION_SEVERITY_FLOOR = {
    DriftAction.DELETE: DriftSeverity.HIGH,
    DriftAction.REPLACE: DriftSeverity.HIGH,
}

# Numeric rank -- lower means more severe. Comparing the enums' string
# values would sort "low" before "medium" alphabetically, which is wrong.
SEVERITY_RANK = {
    DriftSeverity.CRITICAL: 0,
    DriftSeverity.HIGH: 1,
    DriftSeverity.MEDIUM: 2,
    DriftSeverity.LOW: 3,
}


def classify_severity(resource_type: str, action: DriftAction) -> DriftSeverity:
    """Classify drift severity based on resource type and action."""
    base = SEVERITY_RULES.get(resource_type, DriftSeverity.MEDIUM)
    floor = ACTION_SEVERITY_FLOOR.get(action)
    if floor and SEVERITY_RANK[floor] < SEVERITY_RANK[base]:
        return floor
    return base


def parse_plan(plan_path: str) -> list[DriftItem]:
    """Parse Terraform plan JSON and return a list of drift items."""
    with open(plan_path) as f:
        plan = json.load(f)

    drifts = []
    for change in plan.get("resource_changes", []):
        actions = change.get("change", {}).get("actions", [])

        # Skip no-ops and reads
        if actions == ["no-op"] or actions == ["read"]:
            continue

        if "delete" in actions and "create" in actions:
            action = DriftAction.REPLACE
        elif "delete" in actions:
            action = DriftAction.DELETE
        elif "create" in actions:
            action = DriftAction.CREATE
        elif "update" in actions:
            action = DriftAction.UPDATE
        else:
            continue

        resource_type = change.get("type", "unknown")
        address = change.get("address", "unknown")

        # Extract the changed attributes
        before = change.get("change", {}).get("before") or {}
        after = change.get("change", {}).get("after") or {}
        changed_attrs = []
        if isinstance(before, dict) and isinstance(after, dict):
            for key in set(before) | set(after):
                if before.get(key) != after.get(key):
                    changed_attrs.append(key)

        drifts.append(DriftItem(
            address=address,
            resource_type=resource_type,
            action=action,
            severity=classify_severity(resource_type, action),
            changed_attributes=changed_attrs,
        ))
    return drifts


def generate_report(drifts: list[DriftItem]) -> dict:
    """Generate a drift analysis report."""
    report = {
        "total_drifts": len(drifts),
        "by_severity": {},
        "by_action": {},
        "critical_items": [],
        "details": [],
    }
    for d in drifts:
        report["by_severity"][d.severity.value] = \
            report["by_severity"].get(d.severity.value, 0) + 1
        report["by_action"][d.action.value] = \
            report["by_action"].get(d.action.value, 0) + 1
        if d.severity == DriftSeverity.CRITICAL:
            report["critical_items"].append({
                "address": d.address,
                "action": d.action.value,
                "changed_attributes": d.changed_attributes,
            })
        report["details"].append({
            "address": d.address,
            "type": d.resource_type,
            "action": d.action.value,
            "severity": d.severity.value,
            "changed_attributes": d.changed_attributes,
        })
    return report


if __name__ == "__main__":
    if len(sys.argv) < 2:
        print("Usage: python drift_analyzer.py <plan.json>")
        sys.exit(1)
    drifts = parse_plan(sys.argv[1])
    report = generate_report(drifts)
    print(json.dumps(report, indent=2, ensure_ascii=False))
    # Exit with code 1 if any CRITICAL items exist
    if report["by_severity"].get("critical", 0) > 0:
        sys.exit(1)
CI/CD pipeline integration
Scheduled drift detection with GitHub Actions
In practice, drift detection should run automatically on a schedule. The following is a drift detection pipeline built with GitHub Actions.
# .github/workflows/drift-detection.yml
name: Infrastructure Drift Detection

on:
  schedule:
    # Runs daily at 09:00 and 15:00 KST (00:00 and 06:00 UTC)
    - cron: '0 0,6 * * *'
  workflow_dispatch:
    inputs:
      workspace:
        description: 'Workspace to check (leave empty for all)'
        required: false
        type: string

permissions:
  id-token: write
  contents: read
  issues: write

env:
  TF_VERSION: '1.11.0'
  TOFU_VERSION: '1.9.0'

jobs:
  detect-drift:
    name: Detect Drift - ${{ matrix.workspace }}
    runs-on: ubuntu-latest
    strategy:
      fail-fast: false
      matrix:
        workspace:
          - environments/prod/networking
          - environments/prod/compute
          - environments/prod/database
          - environments/prod/security
          - environments/staging/networking
          - environments/staging/compute
    steps:
      - name: Checkout
        uses: actions/checkout@v4

      - name: Configure AWS Credentials (OIDC)
        uses: aws-actions/configure-aws-credentials@v4
        with:
          role-to-assume: ${{ secrets.AWS_DRIFT_DETECTION_ROLE_ARN }}
          aws-region: ap-northeast-2
          role-session-name: drift-detection-${{ github.run_id }}

      - name: Setup Terraform
        uses: hashicorp/setup-terraform@v3
        with:
          terraform_version: ${{ env.TF_VERSION }}
          terraform_wrapper: false

      - name: Terraform Init
        working-directory: ${{ matrix.workspace }}
        run: terraform init -input=false -no-color

      - name: Detect Drift
        id: drift
        working-directory: ${{ matrix.workspace }}
        run: |
          set +e
          terraform plan -detailed-exitcode -input=false -no-color \
            -out=drift.plan 2>&1 | tee plan-output.txt
          # $? would capture tee's exit code; PIPESTATUS holds terraform's
          echo "exit_code=${PIPESTATUS[0]}" >> "$GITHUB_OUTPUT"

      - name: Analyze Drift
        if: steps.drift.outputs.exit_code == '2'
        working-directory: ${{ matrix.workspace }}
        run: |
          terraform show -json drift.plan > plan.json
          # The analyzer exits non-zero on critical drift; don't fail the step
          python3 ${{ github.workspace }}/scripts/drift_analyzer.py plan.json \
            > drift-report.json || true
          echo "## Drift detected: ${{ matrix.workspace }}" >> "$GITHUB_STEP_SUMMARY"
          echo '```json' >> "$GITHUB_STEP_SUMMARY"
          cat drift-report.json >> "$GITHUB_STEP_SUMMARY"
          echo '```' >> "$GITHUB_STEP_SUMMARY"

      - name: Create Issue for Critical Drift
        if: steps.drift.outputs.exit_code == '2'
        uses: actions/github-script@v7
        with:
          script: |
            const fs = require('fs');
            const report = JSON.parse(
              fs.readFileSync('${{ matrix.workspace }}/drift-report.json', 'utf8')
            );
            if (report.by_severity?.critical > 0) {
              await github.rest.issues.create({
                owner: context.repo.owner,
                repo: context.repo.repo,
                title: `[CRITICAL DRIFT] ${{ matrix.workspace }}`,
                body: `## Critical infrastructure drift detected\n\n` +
                  `**Workspace**: ${{ matrix.workspace }}\n` +
                  `**Detected at**: ${new Date().toISOString()}\n\n` +
                  `### Critical items\n` +
                  '```json\n' +
                  JSON.stringify(report.critical_items, null, 2) +
                  '\n```\n\n' +
                  `### Summary\n` +
                  `- Total drifts: ${report.total_drifts}\n` +
                  `- Critical: ${report.by_severity.critical || 0}\n` +
                  `- High: ${report.by_severity.high || 0}\n`,
                labels: ['drift', 'critical', 'infrastructure'],
              });
            }

      - name: Notify Slack
        if: steps.drift.outputs.exit_code == '2'
        uses: slackapi/slack-github-action@v2
        with:
          webhook: ${{ secrets.SLACK_DRIFT_WEBHOOK }}
          webhook-type: incoming-webhook
          payload: |
            {
              "text": "Infrastructure drift detected: ${{ matrix.workspace }}",
              "blocks": [
                {
                  "type": "section",
                  "text": {
                    "type": "mrkdwn",
                    "text": ":warning: *Drift detected*\nWorkspace: `${{ matrix.workspace }}`\n<${{ github.server_url }}/${{ github.repository }}/actions/runs/${{ github.run_id }}|View details>"
                  }
                }
              ]
            }
GitLab CI-based drift detection
Here is an example configuration for organizations using GitLab CI.
# .gitlab-ci.yml -- drift detection pipeline
stages:
  - drift-detect
  - drift-report

.drift_template: &drift_template
  image:
    name: hashicorp/terraform:1.11
    entrypoint: [""]  # override the image's terraform entrypoint
  before_script:
    - export AWS_WEB_IDENTITY_TOKEN_FILE=/tmp/web-identity-token
    - echo "${CI_JOB_JWT_V2}" > $AWS_WEB_IDENTITY_TOKEN_FILE
  rules:
    - if: $CI_PIPELINE_SOURCE == "schedule"
    - if: $CI_PIPELINE_SOURCE == "web"

drift:prod-networking:
  <<: *drift_template
  stage: drift-detect
  variables:
    TF_WORKSPACE_DIR: environments/prod/networking
  script:
    - cd $TF_WORKSPACE_DIR
    - terraform init -input=false -no-color
    - |
      set +e
      # Avoid piping through tee so $? reflects terraform's exit code
      terraform plan -detailed-exitcode -input=false -no-color > plan.txt 2>&1
      EXIT_CODE=$?
      set -e
      cat plan.txt
      if [ $EXIT_CODE -eq 2 ]; then
        echo "DRIFT_DETECTED=true" >> drift.env
        echo "DRIFT_WORKSPACE=$TF_WORKSPACE_DIR" >> drift.env
      elif [ $EXIT_CODE -eq 1 ]; then
        echo "DRIFT_ERROR=true" >> drift.env
        exit 1
      fi
  artifacts:
    reports:
      dotenv: $TF_WORKSPACE_DIR/drift.env
    paths:
      - $TF_WORKSPACE_DIR/plan.txt
    when: always
Automated Remediation Strategy
Spectrum of recovery strategies
There are several ways to detect drift and then recover. An appropriate strategy should be selected based on the organization's maturity and risk tolerance.
| Strategy | Description | Risk | Automation level | Suitable environment |
|---|---|---|---|---|
| Notify only | Send a notification when drift is detected | Low | Low | Early adoption, regulated environments |
| Code sync | Reflect the actual state back into code (import/refresh) | Medium | Medium | Environments with many intentional changes |
| Auto apply | Automatically restore infrastructure to the coded state | High | High | Environments where code is the single source of truth |
| Selective remediation | Split automatic/manual recovery by severity | Medium | Medium to high | Most production environments |
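The decision rule behind severity-based selective remediation can be expressed compactly. A sketch, operating on the by_severity counts that a drift report such as the analyzer output earlier in this article produces (the function name is illustrative):

```python
def choose_action(by_severity: dict) -> str:
    """Pick a remediation action from per-severity drift counts.

    CRITICAL/HIGH -> block and require manual review;
    MEDIUM/LOW    -> eligible for automatic remediation;
    nothing       -> no action needed.
    """
    if by_severity.get("critical", 0) or by_severity.get("high", 0):
        return "manual-review"
    if by_severity.get("medium", 0) or by_severity.get("low", 0):
        return "auto-remediate"
    return "none"

print(choose_action({"critical": 1, "low": 3}))  # manual-review
print(choose_action({"low": 2}))                 # auto-remediate
print(choose_action({}))                         # none
```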
Implementing selective automatic remediation
The pattern most commonly used in practice is severity-based selective remediation: CRITICAL drift triggers an immediate alert and requires manual review, while LOW severity drift is remediated automatically.
#!/bin/bash
# selective-remediation.sh -- severity-based selective drift remediation
set -euo pipefail

WORKSPACE_DIR="$1"
DRIFT_REPORT="$2"        # JSON output from drift_analyzer.py
AUTO_APPLY="${3:-false}" # when true, auto-remediate LOW/MEDIUM drift

cd "$WORKSPACE_DIR"

# Extract per-severity counts from the drift report
CRITICAL_COUNT=$(jq -r '.by_severity.critical // 0' "$DRIFT_REPORT")
HIGH_COUNT=$(jq -r '.by_severity.high // 0' "$DRIFT_REPORT")
MEDIUM_COUNT=$(jq -r '.by_severity.medium // 0' "$DRIFT_REPORT")
LOW_COUNT=$(jq -r '.by_severity.low // 0' "$DRIFT_REPORT")

echo "=== Drift analysis result ==="
echo "  CRITICAL: $CRITICAL_COUNT"
echo "  HIGH:     $HIGH_COUNT"
echo "  MEDIUM:   $MEDIUM_COUNT"
echo "  LOW:      $LOW_COUNT"
echo "============================="

# Abort automatic remediation if any CRITICAL or HIGH drift exists
if [ "$CRITICAL_COUNT" -gt 0 ] || [ "$HIGH_COUNT" -gt 0 ]; then
  echo "[BLOCKED] CRITICAL/HIGH drift detected -- manual review required"
  echo "Review the following resources manually:"
  jq -r '.details[] | select(.severity == "critical" or .severity == "high") |
    "  - [\(.severity | ascii_upcase)] \(.address) (\(.action)): \(.changed_attributes | join(", "))"' \
    "$DRIFT_REPORT"
  exit 2
fi

# Only LOW/MEDIUM drift remains and auto-remediation is enabled
if [ "$AUTO_APPLY" = "true" ]; then
  echo "[AUTO-REMEDIATE] Starting automatic remediation of LOW/MEDIUM drift"
  # Apply only the targeted resources
  TARGETS=$(jq -r '.details[] | select(.severity == "low" or .severity == "medium") |
    "-target=\(.address)"' "$DRIFT_REPORT")
  if [ -n "$TARGETS" ]; then
    echo "Remediation targets: $TARGETS"
    # shellcheck disable=SC2086
    terraform apply -auto-approve -input=false -no-color $TARGETS
    echo "[DONE] Automatic remediation complete"
  fi
else
  echo "[INFO] Auto-remediation disabled -- notification only"
  exit 0
fi
Reverse synchronization using terraform import
Reverse synchronization, importing resources created in the console into code, is another important recovery strategy. In Terraform 1.5 and later, import blocks let you perform imports declaratively at the code level.
# imports.tf -- declarative import blocks (Terraform 1.5+, OpenTofu 1.6+)

# Bring a Security Group created in the console under code management
import {
  to = aws_security_group.emergency_debug_sg
  id = "sg-0abc123def456"
}

# Bring an S3 bucket created in the console under code management
import {
  to = aws_s3_bucket.manual_backup
  id = "prod-manual-backup-20260307"
}

# Code declarations for the imported resources
resource "aws_security_group" "emergency_debug_sg" {
  name        = "emergency-debug-sg"
  description = "Emergency debugging SG - created during incident INC-2026-0307"
  vpc_id      = module.networking.vpc_id

  # After importing, run plan and reconcile the code with the real settings
  ingress {
    from_port   = 5432
    to_port     = 5432
    protocol    = "tcp"
    cidr_blocks = ["10.0.0.0/8"] # restricted to the internal network
    description = "PostgreSQL access from internal network"
  }

  tags = {
    Name       = "emergency-debug-sg"
    ManagedBy  = "terraform"
    CreatedBy  = "manual-import"
    Incident   = "INC-2026-0307"
    ReviewDate = "2026-03-14" # review one week later
  }

  lifecycle {
    # Prevent an unintended deletion on the first apply after import
    prevent_destroy = true
  }
}

resource "aws_s3_bucket" "manual_backup" {
  bucket = "prod-manual-backup-20260307"
  tags = {
    Name      = "prod-manual-backup"
    ManagedBy = "terraform"
    CreatedBy = "manual-import"
  }
}
Tool Comparison: Spacelift vs Terramate vs Native
Feature comparison table
A dedicated drift management tool can significantly reduce operational burden compared to native methods. The following table compares the major tools.
| Feature | Terraform Native | OpenTofu Native | Spacelift | Terramate | Env0 |
|---|---|---|---|---|---|
| Automatic drift detection | plan + cron (manual setup) | plan + cron (manual setup) | Built-in (scheduled) | CLI + CI integration | Built-in (scheduled) |
| Drift visualization | plan text output | plan text output | Web UI dashboard | CLI report | Web UI dashboard |
| Automatic remediation | Custom scripts | Custom scripts | Policy-based auto apply | Orchestration integration | Policy-based auto apply |
| Policy engine | Sentinel (paid) | OPA integration | Built-in OPA + Rego | None (external integration) | OPA integration |
| State encryption | Backend-dependent (e.g. S3 SSE) | Native support | Managed state | Backend-dependent | Managed state |
| Multi-stack management | Workspaces | Workspaces | Per-stack management | Stack orchestration | Per-environment |
| Cost | Free (OSS) | Free (OSS) | Paid (free tier available) | Free (OSS) + paid cloud | Paid (free tier available) |
| VCS integration | None (CI/CD dependent) | None (CI/CD dependent) | Deep GitHub/GitLab integration | GitHub/GitLab integration | GitHub/GitLab integration |
| Notification channels | Custom | Custom | Slack, Teams, webhooks | Slack, webhooks | Slack, Teams, webhooks |
Spacelift drift detection settings
Spacelift is one of the most mature Terraform/OpenTofu management platforms. Drift detection can be set on a per-stack basis.
# spacelift.tf -- enable drift detection on a Spacelift stack
resource "spacelift_stack" "prod_networking" {
  name         = "prod-networking"
  repository   = "infrastructure"
  branch       = "main"
  project_root = "environments/prod/networking"

  # Choose Terraform or OpenTofu
  terraform_version = "1.11.0"
  # opentofu_version = "1.9.0" # when using OpenTofu

  autodeploy = false # auto-apply on PR merge?
  autoretry  = true  # retry on transient errors

  labels = ["prod", "networking", "drift-enabled"]
}

# Drift detection schedule
resource "spacelift_drift_detection" "prod_networking" {
  stack_id     = spacelift_stack.prod_networking.id
  reconcile    = false # when true, drift is remediated automatically (careful!)
  schedule     = ["0 0 * * *", "0 6 * * *"] # twice daily
  timezone     = "Asia/Seoul"
  ignore_state = false
}

# Notification policy for detected drift
resource "spacelift_policy" "drift_notification" {
  name = "drift-notification-policy"
  type = "NOTIFICATION"
  body = <<-EOT
    package spacelift

    # Send a Slack notification when drift is detected
    webhook[{"endpoint_id": endpoint_id, "payload": payload}] {
      input.run_type == "DRIFT_DETECTION"
      input.run_state == "FINISHED"
      input.drift_detection.drifted == true
      endpoint_id := "${spacelift_webhook.slack_drift.id}"
      payload := {
        "text": sprintf(":warning: Drift detected: %s\nDrifted resources: %d",
          [input.stack.name, input.drift_detection.resources_drifted])
      }
    }
  EOT
}

resource "spacelift_policy_attachment" "drift_notification" {
  policy_id = spacelift_policy.drift_notification.id
  stack_id  = spacelift_stack.prod_networking.id
}
Multi-stack drift orchestration with Terramate
Terramate is an open source tool for orchestrating multiple Terraform/OpenTofu stacks. It has strengths in change detection and execution order management.
# terramate.tm.hcl -- Terramate stack configuration
terramate {
  config {
    run {
      env {
        TF_PLUGIN_CACHE_DIR = "/tmp/terraform-plugin-cache"
      }
    }
    cloud {
      organization = "myorg"
    }
  }
}

# Stack definition
stack {
  name        = "prod-networking"
  description = "Production VPC and networking resources"
  id          = "prod-networking-001"
  tags        = ["prod", "networking", "drift-check"]

  # Dependency ordering -- networking must be checked first
  after  = []
  before = [
    "/stacks/prod/compute",
    "/stacks/prod/database",
  ]
}

# Multi-stack drift detection with the Terramate CLI

# Detect drift across all prod stacks (respecting dependency order)
terramate run \
  --tags prod:drift-check \
  --parallel 4 \
  -- terraform plan -detailed-exitcode -input=false -no-color

# Check only the stacks that changed (based on Git diff)
terramate run \
  --changed \
  -- terraform plan -detailed-exitcode -input=false

# Sync drift results to Terramate Cloud
terramate run \
  --tags prod \
  --sync-drift \
  -- terraform plan -detailed-exitcode -out=drift.plan
Precautions during operation
Side effects of drift detection
Drift detection itself can cause operational problems. Pay particular attention to the following:
1. API rate limiting
terraform plan calls the provider API for every managed resource. Frequent detection runs on large infrastructure can hit AWS API rate limits.
2. State lock contention
Drift detection takes a lock on the state, so it can conflict with a terraform apply running at the same time.
3. Sensitive data exposure
Plan output may contain sensitive information such as database passwords and API keys. Filter it before saving logs.
# Prevent sensitive data from leaking into plan output
# Uses ephemeral variables (Terraform 1.10+)
variable "database_password" {
  type      = string
  sensitive = true
  ephemeral = true # never persisted to state or plan (Terraform 1.10+)
}

resource "aws_db_instance" "main" {
  identifier     = "prod-main-db"
  engine         = "postgres"
  engine_version = "16.4"
  instance_class = "db.r6g.xlarge"

  # Ephemeral values may only be assigned to write-only arguments;
  # password_wo is never stored in state, and bumping
  # password_wo_version signals a rotation
  password_wo         = var.database_password
  password_wo_version = 1

  # Exclude attributes expected to change externally via ignore_changes
  lifecycle {
    ignore_changes = [
      # Attributes managed by an ASG are excluded from drift detection
      # (otherwise they cause false positives)
    ]
  }
}
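Beyond ephemeral variables, plan JSON can also be scrubbed before it is archived or posted to chat. A minimal sketch that masks values whose attribute names match common secret patterns; the key list is illustrative, not exhaustive:

```python
import json

SENSITIVE_KEYS = {"password", "secret", "token", "private_key", "api_key"}

def redact(obj, mask="[REDACTED]"):
    """Recursively mask values whose key names look sensitive."""
    if isinstance(obj, dict):
        return {
            k: mask if any(s in k.lower() for s in SENSITIVE_KEYS) else redact(v, mask)
            for k, v in obj.items()
        }
    if isinstance(obj, list):
        return [redact(v, mask) for v in obj]
    return obj

change = {"before": {"password": "hunter2", "engine": "postgres"},
          "after": {"password": "hunter3", "engine": "postgres"}}
print(json.dumps(redact(change)))
```

Name-based masking is a safety net, not a guarantee; secrets stored under unremarkable key names will slip through, so marking them sensitive in code remains the primary defense.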
Strategic use of ignore_changes
Trying to detect every drift produces a flood of false positives. Use ignore_changes strategically to allow for intentional drift.
# Auto Scaling Group -- desired_capacity is managed by the ASG
resource "aws_autoscaling_group" "web" {
  name                = "prod-web-asg"
  min_size            = 2
  max_size            = 20
  desired_capacity    = 4
  vpc_zone_identifier = module.networking.private_subnet_ids

  launch_template {
    id      = aws_launch_template.web.id
    version = "$Latest"
  }

  lifecycle {
    ignore_changes = [
      desired_capacity,  # managed by ASG scaling policies
      target_group_arns, # changed dynamically by ALB integration
    ]
  }
}

# EKS node group -- exclude attributes managed by the cluster autoscaler
resource "aws_eks_node_group" "workers" {
  cluster_name    = aws_eks_cluster.main.name
  node_group_name = "workers"
  node_role_arn   = aws_iam_role.node_group.arn
  subnet_ids      = module.networking.private_subnet_ids

  scaling_config {
    desired_size = 3
    max_size     = 10
    min_size     = 2
  }

  lifecycle {
    ignore_changes = [
      scaling_config[0].desired_size, # managed by the Cluster Autoscaler
    ]
  }
}
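Which attributes deserve an ignore_changes entry is an empirical question. One pragmatic approach is to aggregate changed_attributes across a series of drift reports and flag attributes that drift repeatedly on the same resource. A sketch, assuming reports shaped like the analyzer output earlier in this article (the threshold is arbitrary):

```python
from collections import Counter

def suggest_ignore_candidates(reports: list[dict], threshold: int = 3) -> list[str]:
    """Flag (address, attribute) pairs that drift repeatedly.

    `reports` is a list of drift reports, each with a `details` list of
    {"address": ..., "changed_attributes": [...]} entries.
    """
    counts = Counter()
    for report in reports:
        for item in report.get("details", []):
            for attr in item.get("changed_attributes", []):
                counts[(item["address"], attr)] += 1
    return [f"{addr}: {attr}"
            for (addr, attr), n in counts.items() if n >= threshold]

reports = [{"details": [{"address": "aws_autoscaling_group.web",
                         "changed_attributes": ["desired_capacity"]}]}] * 3
print(suggest_ignore_candidates(reports))
# ['aws_autoscaling_group.web: desired_capacity']
```

An attribute that drifts on every scheduled run is usually being managed by something else (an autoscaler, a service integration) and is a strong ignore_changes candidate; review each suggestion rather than applying it blindly.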
Failure cases and recovery procedures
Case 1: Total Drift Due to State File Corruption
Situation: the state file in the S3 backend is corrupted, and every resource is reported as drifted.
# Recovery step 1: restore a previous state from S3 versioning
aws s3api list-object-versions \
  --bucket myorg-terraform-state \
  --prefix prod/infrastructure.tfstate \
  --query 'Versions[*].[VersionId,LastModified,Size]' \
  --output table

# After identifying the VersionId of a healthy previous version, restore it
aws s3api get-object \
  --bucket myorg-terraform-state \
  --key prod/infrastructure.tfstate \
  --version-id "abc123def456" \
  restored-state.tfstate

# Verify the state file's integrity
terraform show restored-state.tfstate | head -20

# Replace the current state (after backing up the corrupted one)
aws s3 cp \
  s3://myorg-terraform-state/prod/infrastructure.tfstate \
  s3://myorg-terraform-state/prod/infrastructure.tfstate.corrupted-backup
aws s3 cp \
  restored-state.tfstate \
  s3://myorg-terraform-state/prod/infrastructure.tfstate

# Release the state lock if needed
terraform force-unlock <LOCK_ID>

# Confirm the drift status with plan
terraform plan -detailed-exitcode
Case 2: Code synchronization fails after emergency changes
Situation: a Security Group rule was added in the console during incident response, and the next terraform apply deleted that rule.
# Recovery: refresh the state with reality, then fix the code
# 1. Reflect the actual infrastructure state into the state file
#    (terraform refresh is deprecated; use the -refresh-only flow)
terraform apply -refresh-only

# 2. Check the drift
terraform plan

# 3. Reflect the emergency change in the code
# (after editing the code)
terraform plan  # confirm "No changes"
terraform apply
Case 3: Massive drift after provider upgrade
Situation: after upgrading the AWS provider from 5.x to 6.x, drift is reported on hundreds of resources because of changed defaults.
In this case, carefully review the terraform plan output and determine whether it implies real infrastructure changes or merely new attributes being recorded in state. In most cases, running terraform apply only updates the state and does not modify the actual infrastructure.
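A rough way to triage such a plan is to separate attribute diffs where the old value was simply untracked (before is null) from diffs where a real value changes. The former are usually state-only updates after a provider upgrade; the latter deserve review. A heuristic sketch over the before/after maps of a plan JSON resource change:

```python
def triage_attribute_diffs(before: dict, after: dict) -> dict:
    """Split attribute diffs into likely state-only vs. real changes.

    Heuristic: an attribute going from None/absent to a concrete value
    is probably a newly tracked default (state-only update after a
    provider upgrade); value-to-different-value is a real change.
    """
    state_only, real = [], []
    for key in set(before) | set(after):
        b, a = before.get(key), after.get(key)
        if b == a:
            continue
        (state_only if b is None else real).append(key)
    return {"state_only": sorted(state_only), "real": sorted(real)}

print(triage_attribute_diffs(
    {"instance_type": "t3.medium", "metadata_options": None},
    {"instance_type": "t3.large", "metadata_options": {"http_tokens": "required"}},
))
# {'state_only': ['metadata_options'], 'real': ['instance_type']}
```

Anything in the "real" bucket should be inspected by hand before apply; the heuristic cannot tell an intentional provider default change from genuine drift.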
Operational Checklist
When building an IaC drift management system in a production environment, refer to the following checklist.
Base building phase
- Are versioning and locking enabled on the state backend?
- Are state files backed up regularly?
- Is there a workflow where every infrastructure change is reviewed through a PR?
- Are audit logs (CloudTrail, etc.) enabled for direct console/CLI access?
Drift Detection Step
- Is a regular drift detection schedule set (at least once a day)?
- Is a notification channel configured for drift detection results?
- Is a severity-based classification system defined?
- Is an ignore_changes policy in place to reduce false positives?
- Is the detection schedule set with API rate limits in mind?
Recovery Process Steps
- Are recovery procedures (runbooks) documented for each drift type?
- Is there a defined code synchronization process after emergency changes?
- Have recovery procedures been tested in case of state file corruption?
- Are the scope and conditions of automatic recovery clearly defined?
Governance Stage
- Are there regular review meetings on drift detection results?
- Are there metrics/dashboards to track drift trends?
- Is root cause analysis (RCA) performed for repetitive drift?
- Are IaC change rules (required code review, no console changes, etc.) agreed upon between teams?
Conclusion
Infrastructure drift is an inevitable reality that every organization that adopts IaC faces. The important thing is not to completely eliminate drift, but to have the ability to detect it quickly and manage it systematically.
Basic drift detection is possible with the native features of Terraform and OpenTofu, but as your organization grows you will likely need a dedicated platform such as Spacelift or Terramate. Whichever tool you choose, the core principle is the same: keep code as the single source of truth, while running a process that systematically absorbs real-world exceptions.
It is unrealistic to completely ban console access during emergency incident response. A more pragmatic approach is to build a culture of synchronizing code within 24 hours of an emergency change, and to deploy automated drift detection to verify that it happens. Drift is as much a process and culture problem as a technical one. Only with both the tools and the processes in place can the promise of IaC be kept.