Chaos and Order

Introduction: The Real Power of Terraform

The Complexity Behind a One-Line Command

$ terraform apply

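With this single command, Terraform:

Parses the HCL files (*.tf).
Converts every resource definition into a DAG (Directed Acyclic Graph).
Loads the current state from the state file.
Queries the real state in the cloud through providers (refresh).
Shows the difference between desired and actual state as a plan.
Applies changes concurrently, in dependency order.
Updates the state file.

All of this happens within minutes. AWS, GCP, Azure, Kubernetes, Datadog, GitHub: more than 2,000 providers follow the same model.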

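What Terraform Changed

Terraform, released by HashiCorp in 2014, changed the paradigm of infrastructure management:

Before Terraform:

Manual clicking in the web console.
Scripts calling aws ec2 run-instances.
No way to track state.
No reproducibility.
Drift was hard to detect.

After Terraform:

Desired state described in declarative HCL.
Infrastructure committed to Git.
Changes previewed with plan.
Accurate tracking through state.
A standard interface through providers.

The Shock of 2023: The BSL License

In August 2023, HashiCorp changed the Terraform license to the BSL (Business Source License), which came as a major shock to the open source community. Soon afterward, OpenTofu launched as a fork under the Linux Foundation.

This article applies to both Terraform and OpenTofu (their internals are nearly identical).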

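What This Article Covers

HCL: Terraform's DSL.
DAG: the dependency graph.
State: the source of truth.
Provider: the interface to the cloud.
Plan: previewing changes.
Apply: executing the changes.
Modules: the unit of reuse.
Workspaces: environment separation.
Backend: where state is stored.
Drift Detection: detecting out-of-band changes.

1. HCL: Terraform's Language

What Is HCL?

HCL (HashiCorp Configuration Language) sits somewhere between JSON/YAML and a general-purpose language such as JavaScript. It is declarative yet programmable.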

# Variable
variable "region" {
  type    = string
  default = "us-east-1"
}

# Resource
resource "aws_instance" "web" {
  ami           = "ami-0c55b159cbfafe1f0"
  instance_type = "t3.micro"
  
  tags = {
    Name = "WebServer"
  }
}

# Output
output "instance_ip" {
  value = aws_instance.web.public_ip
}

HCL vs JSON vs YAML

JSON:

{
  "resource": {
    "aws_instance": {
      "web": {
        "ami": "ami-0c55b159cbfafe1f0",
        "instance_type": "t3.micro"
      }
    }
  }
}

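Machine-friendly.
No comments.
Strict syntax.

YAML:

Human-friendly.
Indentation-based (error-prone).
Many special symbols.

HCL:

Human-friendly and machine-friendly.
Supports comments.
Supports functions, conditionals, and iteration.
Has a type system.

Key Structures in HCL

Block: type "label1" "label2" { ... }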

resource "aws_instance" "web" {
  # ...
}

provider "aws" {
  # ...
}

variable "instance_count" {
  # "count" is a reserved name and cannot be used for a variable
}

Expressions: compute values.

count = 3
name = "server-${count.index}"
ips  = [for i in range(3) : cidrhost("10.0.0.0/24", i)]

Functions: built-in functions.

length(var.list)
format("hello-%s", var.name)
jsonencode({foo = "bar"})

Conditionals and Iteration

Conditional:

resource "aws_instance" "web" {
  instance_type = var.env == "prod" ? "t3.large" : "t3.micro"
}

Iteration (count):

resource "aws_instance" "web" {
  count = 3
  ami   = "ami-..."
}
# web[0], web[1], web[2]

Iteration (for_each):

resource "aws_instance" "web" {
  for_each = {
    web1 = "t3.micro"
    web2 = "t3.small"
    web3 = "t3.medium"
  }
  
  ami           = "ami-..."
  instance_type = each.value
  
  tags = {
    Name = each.key
  }
}
# web["web1"], web["web2"], ...

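for_each vs count: for_each is keyed by a map or set, so instance addresses stay stable. count is index-based, so removing an element from the middle shifts every later index and can cause unintended changes.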
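To make the difference concrete, here is a minimal Python sketch (not Terraform code; the helper names are hypothetical) that models instance addresses under each scheme and lists which addresses a plan would touch when the first element is removed.

```python
def count_addresses(names):
    """count-style: instances are addressed by their position in the list."""
    return {f"web[{i}]": name for i, name in enumerate(names)}


def for_each_addresses(names):
    """for_each-style: instances are addressed by their key."""
    return {f'web["{name}"]': name for name in names}


def changed(before, after):
    """Addresses whose content differs or disappears between two runs."""
    return sorted(
        addr
        for addr in before.keys() | after.keys()
        if before.get(addr) != after.get(addr)
    )


servers = ["a", "b", "c"]
without_a = ["b", "c"]  # the first element is removed

# count: every index now maps to a different server
print(changed(count_addresses(servers), count_addresses(without_a)))
# for_each: only the removed key is affected
print(changed(for_each_addresses(servers), for_each_addresses(without_a)))
```

With count, all three addresses are affected because the remaining servers slide down one index; with for_each, only the removed key is.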

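Expression Evaluation

Terraform evaluates HCL expressions in dependency order:

An expression is evaluated once everything it references is known.
Circular references are an error.
Dependencies are tracked automatically.

Example: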

resource "aws_instance" "web" {
  ami = "ami-..."
}

resource "aws_eip" "web_ip" {
  instance = aws_instance.web.id  # this reference creates the dependency
}

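Terraform analyzes this and concludes:

web_ip depends on web.
web is created first, then web_ip.

These dependencies form the DAG.

2. DAG: The Dependency Graph

Everything Is a Graph

The core of Terraform is the DAG (Directed Acyclic Graph). Every resource, module, and variable is a node; every reference is an edge.

Example: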

resource "aws_vpc" "main" {
  cidr_block = "10.0.0.0/16"
}

resource "aws_subnet" "public" {
  vpc_id     = aws_vpc.main.id
  cidr_block = "10.0.1.0/24"
}

resource "aws_internet_gateway" "igw" {
  vpc_id = aws_vpc.main.id
}

resource "aws_route_table" "public" {
  vpc_id = aws_vpc.main.id
  
  route {
    cidr_block = "0.0.0.0/0"
    gateway_id = aws_internet_gateway.igw.id
  }
}

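DAG (arrows point from a dependency to the resource that needs it):

aws_vpc.main
├──→ aws_subnet.public
├──→ aws_internet_gateway.igw ──→ aws_route_table.public
└──→ aws_route_table.public

aws_vpc.main is the parent of every other resource, so the VPC must be created first. The route table additionally depends on the internet gateway.

The Role of the DAG

1. Concurrency:

Independent resources are created in parallel: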

resource "aws_instance" "web1" {
  ami = "ami-..."
}

resource "aws_instance" "web2" {
  ami = "ami-..."
}

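web1 and web2 have no dependency on each other, so they are created at the same time.

Default concurrency: -parallelism=10 (up to 10 operations at once). It is adjustable.

2. Ordering:

Resources with dependencies are processed in order:

Create VPC → create subnet → create instance

3. Cycle detection:

Circular references are an error: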

resource "a" "x" {
  b = b.y.id
}

resource "b" "y" {
  a = a.x.id  # cycle!
}

The error: Error: Cycle: a.x, b.y
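Cycle detection itself is a standard graph operation. The sketch below uses Python's standard-library graphlib to show the idea; it is an illustration only, since Terraform's own implementation is written in Go and differs in detail. The function name find_cycle is hypothetical.

```python
from graphlib import CycleError, TopologicalSorter


def find_cycle(deps):
    """Return the nodes on a cycle, or None if the graph is acyclic.

    deps maps each node to the set of nodes it depends on.
    """
    try:
        TopologicalSorter(deps).prepare()
    except CycleError as err:
        # graphlib reports the cycle as a list whose first and last node match
        return err.args[1]
    return None


# a.x references b.y, and b.y references a.x
print(find_cycle({"a.x": {"b.y"}, "b.y": {"a.x"}}))
# an acyclic graph returns None
print(find_cycle({"aws_subnet.public": {"aws_vpc.main"}, "aws_vpc.main": set()}))
```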

Visualizing the Graph

terraform graph | dot -Tpng > graph.png

dot is a Graphviz tool; it renders the DAG as a PNG image.

Types of Dependencies

1. Implicit reference:

resource "aws_subnet" "public" {
  vpc_id = aws_vpc.main.id  # detected automatically
}

2. Explicit dependency:

resource "aws_instance" "web" {
  # ...
  depends_on = [aws_security_group.web]
}

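Rarely needed. A reference is usually enough.

Graph Walker

The logic Terraform uses to traverse the DAG:

Identify the starting nodes: nodes that depend on nothing else.
Concurrent execution: run every node that is currently ready.
Wait for completion.
Move on to dependents: the nodes that have just become ready.
Repeat.

This is a variant of topological sort.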
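The walker loop above can be sketched with Python's standard-library graphlib. This is a simplified model, not Terraform's Go implementation: it processes nodes in whole batches, whereas a real walker starts a new node as soon as a slot frees up. The walk function and its parallelism parameter are hypothetical names for illustration.

```python
from graphlib import TopologicalSorter


def walk(deps, parallelism=10):
    """Return the batches of nodes that can run concurrently, in order.

    deps maps each node to the set of nodes it depends on.
    """
    sorter = TopologicalSorter(deps)
    sorter.prepare()
    batches = []
    while sorter.is_active():
        ready = sorted(sorter.get_ready())  # every node whose dependencies are done
        # cap each batch at the parallelism limit
        for start in range(0, len(ready), parallelism):
            batches.append(ready[start:start + parallelism])
        sorter.done(*ready)  # mark them complete so dependents become ready
    return batches


deps = {
    "aws_vpc.main": set(),
    "aws_subnet.public": {"aws_vpc.main"},
    "aws_internet_gateway.igw": {"aws_vpc.main"},
    "aws_route_table.public": {"aws_vpc.main", "aws_internet_gateway.igw"},
}
for batch in walk(deps):
    print(batch)
```

For the VPC example this yields three batches: the VPC alone, then the subnet and internet gateway together, then the route table.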

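3. State: The Source of Truth

What Is State?

Terraform state is a record of the current state of the resources it manages, stored as a JSON file.

Its roles:

Resource → cloud object mapping: aws_instance.web → i-0123456789.
Dependency tracking: which resource depends on which.
Drift detection: comparing the last known state with the actual state.
Performance: not everything has to be queried from scratch every time.
Metadata: may include sensitive information.

Example State File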

{
  "version": 4,
  "terraform_version": "1.6.0",
  "serial": 42,
  "lineage": "abc-123-xyz",
  "outputs": {
    "instance_ip": {
      "value": "54.123.45.67",
      "type": "string"
    }
  },
  "resources": [
    {
      "mode": "managed",
      "type": "aws_instance",
      "name": "web",
      "provider": "provider[\"registry.terraform.io/hashicorp/aws\"]",
      "instances": [
        {
          "schema_version": 1,
          "attributes": {
            "id": "i-0123456789abcdef0",
            "ami": "ami-0c55b159cbfafe1f0",
            "instance_type": "t3.micro",
            "tags": {
              "Name": "WebServer"
            }
          },
          "dependencies": ["aws_vpc.main"]
        }
      ]
    }
  ]
}

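Key fields:

serial: a version number, incremented on every change.
lineage: a UUID for the state, which prevents mixing up different states.
resources[].instances[].attributes: the full set of attributes for the resource.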
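As a rough illustration of the resource → cloud object mapping, the following Python sketch reads a state document shaped like the example above and prints each resource address with its ID. It assumes the version 4 layout shown here, handles only managed resources in the root module, and treats index_key (the per-instance key used with count and for_each) as an assumption about the file format. The function name is hypothetical.

```python
import json


def resource_ids(state_text):
    """Map resource addresses to the cloud IDs recorded in a state document."""
    state = json.loads(state_text)
    mapping = {}
    for res in state.get("resources", []):
        if res.get("mode", "managed") != "managed":
            continue  # skip data sources in this sketch
        base = f'{res["type"]}.{res["name"]}'
        for inst in res.get("instances", []):
            key = inst.get("index_key")  # assumed field for count / for_each instances
            addr = base if key is None else f"{base}[{json.dumps(key)}]"
            mapping[addr] = inst.get("attributes", {}).get("id")
    return mapping


sample = json.dumps({
    "version": 4,
    "serial": 42,
    "resources": [{
        "mode": "managed",
        "type": "aws_instance",
        "name": "web",
        "instances": [{"attributes": {"id": "i-0123456789abcdef0"}}],
    }],
})
print(resource_ids(sample))
```

In day-to-day work, terraform state list and terraform state show are the safer way to inspect this information; the sketch only shows what the file encodes.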

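Local vs Remote State

Local state (terraform.tfstate):

The default.
A local file.
Never commit it to Git; it contains sensitive information.

Remote state:

Stored in a backend.
Enables team collaboration.
Locking: prevents concurrent modification.
Versioning: allows restoring a previous state.

Backend

Backend: where the state is stored.

S3 + DynamoDB (the AWS standard):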

terraform {
  backend "s3" {
    bucket         = "my-terraform-state"
    key            = "prod/terraform.tfstate"
    region         = "us-east-1"
    encrypt        = true
    dynamodb_table = "terraform-locks"
  }
}

S3: stores the state file, with versioning and encryption.
DynamoDB: manages the lock and prevents concurrent applies.

GCS:

terraform {
  backend "gcs" {
    bucket = "my-terraform-state"
    prefix = "prod"
  }
}

Azure:

terraform {
  backend "azurerm" {
    resource_group_name  = "tfstate"
    storage_account_name = "tfstate"
    container_name       = "tfstate"
    key                  = "prod.terraform.tfstate"
  }
}

Terraform Cloud / HCP Terraform:

terraform {
  cloud {
    organization = "my-org"
    workspaces {
      name = "prod"
    }
  }
}

State Locking

Problem: two people run terraform apply at the same time → race condition → corrupted state.

Solution: a lock. It is acquired when apply starts and released when it finishes.

DynamoDB lock:

{
  "LockID": "my-terraform-state/prod/terraform.tfstate-md5",
  "Info": "...",
  "Operation": "OperationTypeApply",
  "Who": "alice@laptop.local",
  "Version": "1.6.0",
  "Created": "2025-04-15T10:00:00Z"
}

If someone else tries:

Error: Error acquiring the state lock
Lock Info:
  ID:        abc-123
  Path:      my-terraform-state/prod/terraform.tfstate
  Operation: OperationTypeApply
  Who:       alice@laptop.local

Force unlock (emergencies only):

terraform force-unlock abc-123

Caution: if an apply is actually still running, this risks corrupting the state.
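The locking behavior boils down to "create the lock record only if it does not already exist". The Python sketch below models that with an in-memory dictionary; it is a toy stand-in for a real lock store, and the class and method names are hypothetical.

```python
class LockError(Exception):
    """Raised when the lock is already held by someone else."""


class LockTable:
    """In-memory stand-in for a lock store keyed by LockID."""

    def __init__(self):
        self._items = {}

    def acquire(self, lock_id, who):
        # Equivalent in spirit to a conditional write: fail if the item exists
        if lock_id in self._items:
            raise LockError(f"lock held by {self._items[lock_id]}")
        self._items[lock_id] = who

    def release(self, lock_id, who):
        # Only the holder may release; a forced unlock would skip this check
        if self._items.get(lock_id) == who:
            del self._items[lock_id]


table = LockTable()
lock_id = "my-terraform-state/prod/terraform.tfstate"
table.acquire(lock_id, "alice@laptop.local")
try:
    table.acquire(lock_id, "bob@laptop.local")
except LockError as err:
    print("bob:", err)
table.release(lock_id, "alice@laptop.local")
table.acquire(lock_id, "bob@laptop.local")  # succeeds once alice is done
```

A forced unlock corresponds to deleting the record without the holder check, which is why it is dangerous while an apply is still running.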

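Sensitivity of State

A state file can contain sensitive information:

Database passwords.
API keys.
Certificate contents.
Sensitive variables.

Never commit it to Git. Use a remote backend, and encrypt it.

State security:

# Show output values, including sensitive ones
terraform output -json

Sensitive output: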

output "db_password" {
  value     = aws_db_instance.main.password
  sensitive = true
}

The value is hidden in CLI output but still stored in the state file.

State Manipulation Commands

terraform state list: lists resources.

terraform state list
# aws_vpc.main
# aws_subnet.public
# aws_instance.web

terraform state show: shows resource details.

terraform state show aws_instance.web

terraform state mv: moves a resource (rename, or move into a module).

terraform state mv aws_instance.web aws_instance.web_server

terraform state rm: removes from state (the real resource remains).

terraform state rm aws_instance.legacy

terraform import: adds an existing resource to state.

terraform import aws_instance.web i-0123456789abcdef0

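4. Provider: The Bridge to the Cloud

What Is a Provider?

A provider is a plugin that sits between Terraform and an external system (AWS, GCP, and so on).

Its roles:

Defines resource types (aws_instance, aws_vpc, and so on).
Implements the API calls.
Provides the schema.
Maps CRUD operations.

The Provider Ecosystem

Terraform Registry: 2,000+ providers.

Official: AWS, Azure, GCP, Kubernetes, and others.
Partner: Datadog, GitHub, MongoDB Atlas, and others.
Community: a large number of community providers.

Using a Provider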

terraform {
  required_providers {
    aws = {
      source  = "hashicorp/aws"
      version = "~> 5.0"
    }
  }
}

provider "aws" {
  region = "us-east-1"
  
  default_tags {
    tags = {
      Environment = "prod"
    }
  }
}

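terraform init downloads the provider into .terraform/providers.

Provider Protocol

Terraform and providers communicate over gRPC.

Terraform Core ←→ gRPC ←→ Provider Plugin (separate process)

Why a separate process:

Language independence: a provider can be written in a language other than Go.
Isolation: a provider crash does not take down core.
Versioning: multiple providers can run side by side.

Provider Responsibilities

For each resource type:

1. Schema: defines the attributes.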

schema.Resource{
  Schema: map[string]*schema.Schema{
    "ami": {
      Type:     schema.TypeString,
      Required: true,
    },
    "instance_type": {
      Type:     schema.TypeString,
      Required: true,
    },
    // ...
  },
}

2. CRUD functions:

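Create: creates a new resource.
Read: reads the current state.
Update: applies changes.
Delete: deletes the resource.

3. Diff: computes the difference between desired and actual.

Terraform Plugin Framework

Terraform Plugin Framework (2022+): the newer generation of the Go SDK.

A stricter type system.
Better error handling.
Nested attributes: for complex structures.

Predecessor: SDK v2 (still widely used).

Writing a Custom Provider

You can write your own provider: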

package main

import (
  "github.com/hashicorp/terraform-plugin-framework/providerserver"
  "context"
)

func main() {
  providerserver.Serve(context.Background(), NewProvider, ...)
}

Use cases:

Internal APIs.
Custom infrastructure.
A Terraform wrapper for a public service.

Provider Versioning

terraform {
  required_providers {
    aws = {
      source  = "hashicorp/aws"
      version = "~> 5.0"  # 5.x, but not 6.0
    }
  }
}

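Version operators:

= 5.23.0: exactly this version.
>= 5.0: this version or newer.
~> 5.0: any 5.x (minor and patch updates, but not 6.0).
~> 5.20: 5.20 or newer, below 6.0. Use ~> 5.20.0 to allow only 5.20.x.

Lock file (.terraform.lock.hcl): records exact versions and checksums. Commit it.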
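The ~> operator is easy to misread, so here is a small Python sketch of its rule: only the rightmost component given in the constraint may increase. It is an illustration of the rule rather than Terraform's parser; it assumes plain numeric versions with at least two components in the constraint, and the function names are hypothetical.

```python
def parse(version):
    """Turn '5.20.1' into (5, 20, 1)."""
    return tuple(int(part) for part in version.split("."))


def pessimistic_ok(constraint, version):
    """Check a version against a '~>' constraint.

    '~> 5.20'   allows >= 5.20   and < 6.0
    '~> 5.20.0' allows >= 5.20.0 and < 5.21.0
    """
    lower = parse(constraint)
    # Upper bound: drop the last component and bump the one before it
    upper = lower[:-2] + (lower[-2] + 1,)
    candidate = parse(version)
    return lower <= candidate and candidate[:len(upper)] < upper


print(pessimistic_ok("5.0", "5.23.0"))     # within 5.x
print(pessimistic_ok("5.0", "6.0.0"))      # next major is excluded
print(pessimistic_ok("5.20", "5.31.0"))    # later minor is still allowed
print(pessimistic_ok("5.20.0", "5.21.0"))  # only 5.20.x is allowed
```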

5. Plan: Previewing Changes

The Purpose of Plan

Plan shows what apply will actually do:

Terraform will perform the following actions:

  # aws_instance.web will be created
  + resource "aws_instance" "web" {
      + ami           = "ami-0c55b159cbfafe1f0"
      + instance_type = "t3.micro"
      + id            = (known after apply)
      + public_ip     = (known after apply)
    }

  # aws_vpc.main will be updated in-place
  ~ resource "aws_vpc" "main" {
        id                   = "vpc-12345"
      ~ enable_dns_hostnames = false -> true
    }

  # aws_eip.old will be destroyed
  - resource "aws_eip" "old" {
      - id = "eipalloc-12345"
    }

Plan: 1 to add, 1 to change, 1 to destroy.

Symbols:

+: Create.
~: Update in-place.
-: Destroy.
-/+: Destroy and recreate.

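Stages of Plan

Computing a plan:

HCL parsing: all .tf files.
Module resolution: module blocks are expanded.
State load: from the backend.
Refresh: the current state is read through the provider (API calls).
DAG construction: dependency analysis.
Diff calculation: desired vs refreshed state.
Output: a human-readable format.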
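The diff calculation step can be sketched as a comparison of two dictionaries. The Python below is a deliberately simplified model, not how Terraform or any provider actually computes a diff: resources are flat attribute maps, and force_new is a hypothetical set of attribute names that cannot be changed in place.

```python
def plan(desired, state, force_new=frozenset()):
    """Return a plan action per resource address: '+', '~', '-/+' or '-'."""
    actions = {}
    for addr, want in desired.items():
        have = state.get(addr)
        if have is None:
            actions[addr] = "+"  # in config but not in state: create
            continue
        changed = {key for key in want if want[key] != have.get(key)}
        if not changed:
            continue  # no-op, nothing to show
        # any change to a force-new attribute means destroy and recreate
        actions[addr] = "-/+" if changed & set(force_new) else "~"
    for addr in state:
        if addr not in desired:
            actions[addr] = "-"  # in state but no longer in config: destroy
    return actions


desired = {
    "aws_instance.web": {"instance_type": "t3.micro", "availability_zone": "us-east-1b"},
    "aws_vpc.main": {"enable_dns_hostnames": True},
    "aws_instance.new": {"instance_type": "t3.micro"},
}
state = {
    "aws_instance.web": {"instance_type": "t3.micro", "availability_zone": "us-east-1a"},
    "aws_vpc.main": {"enable_dns_hostnames": False},
    "aws_eip.old": {"id": "eipalloc-12345"},
}
print(plan(desired, state, force_new={"availability_zone"}))
```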

Refresh Stage

Refresh: calls provider.Read() for each resource.

For each resource in state:
  current = provider.Read(resource.id)
  if current != state:
    update state

Problem: many resources → many API calls → slow.

Optimizations:

Parallel execution: -parallelism=10.
Targeted refresh: -target.
Skipping refresh: -refresh=false (risky).

Plan File

A plan can be saved to a file:

terraform plan -out=plan.tfplan

Then:

terraform apply plan.tfplan

Benefits:

Plan and apply are decoupled.
Approval workflows.
CI/CD integration.

Known after apply

Some values cannot be known until apply:

+ public_ip = (known after apply)
+ id        = (known after apply)

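This is normal. These are values that only exist once the resource has actually been created.

Effect: if another resource depends on such a value, that attribute also shows as "known after apply".

In-place vs Replace

In-place update: the resource is modified.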

~ tags = {
    ~ "Name" = "Old" -> "New"
  }

Replace (destroy + create): some attributes cannot be changed in place, so the resource must be recreated.

-/+ resource "aws_instance" "web" {
    ~ availability_zone = "us-east-1a" -> "us-east-1b" # forces replacement
  }

Caution: a replace can cause downtime. create_before_destroy mitigates this:

resource "aws_instance" "web" {
  # ...
  lifecycle {
    create_before_destroy = true
  }
}

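The new resource is created and confirmed, then the old one is deleted. Downtime is (almost) eliminated.

6. Apply: Making the Change

The Apply Flow

Re-check the plan (or use a plan file).
User confirmation (yes).
Acquire the state lock.
Walk the DAG:
- Start every resource that can run in parallel.
- Move to the next step as each completes.
For each operation:
- Send the API request through the provider.
- Update state with the result.
Save the state.
Release the lock.

Concurrency

Default -parallelism=10:

Up to 10 resources processed at once.
Watch out for rate limits and API throttling.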
Rate limit, API throttling 조심.

Increase:

terraform apply -parallelism=20

Decrease (API limits):

terraform apply -parallelism=1

Partial Apply

A failure scenario:

aws_vpc.main: Creating...
aws_vpc.main: Creation complete
aws_subnet.public: Creating...
aws_subnet.private: Creating...
aws_subnet.public: Creation complete
aws_subnet.private: Error: AccessDenied

aws_vpc.main, aws_subnet.public: created and saved to state.
aws_subnet.private: failed, not in state.

Fixing it:

Fix the cause of the error (permissions, for example).
Run terraform apply again.
Terraform leaves the resources that already exist alone and retries only what failed.

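There Is No Rollback

Terraform has no rollback feature. When something fails:

Fix the error.
Apply again.

Why:

Infrastructure is not a file system.
"Restoring the previous state" is complicated.
Explicit management is safer.

Revert to the previous code in Git, then apply. That is the "rollback".

Things to Watch During Apply

1. Do not apply without reviewing a plan:

# Bad
terraform apply -auto-approve

# Good
terraform plan -out=plan.tfplan
terraform apply plan.tfplan

2. Sync after an emergency manual change:

terraform apply -refresh-only  # replaces the deprecated terraform refresh

3. Drift detection:

terraform plan
# any change you did not expect is drift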

7. Modules: The Key to Reuse

Why Modules?

Deploying the same structure to many places:

Dev, Staging, Prod.
Multiple regions.
Multiple teams.

Do not copy and paste. Reuse through modules.

Module Basics

Defining a module:

# modules/vpc/main.tf
variable "cidr_block" {
  type = string
}

resource "aws_vpc" "this" {
  cidr_block = var.cidr_block
}

output "vpc_id" {
  value = aws_vpc.this.id
}

Using a module:

module "vpc" {
  source = "./modules/vpc"
  
  cidr_block = "10.0.0.0/16"
}

# Reference the output
output "my_vpc_id" {
  value = module.vpc.vpc_id
}

Module Sources

Local:

module "vpc" {
  source = "./modules/vpc"
}

Git:

module "vpc" {
  source = "git::https://github.com/myorg/modules.git//vpc?ref=v1.0"
}

Terraform Registry:

module "vpc" {
  source  = "terraform-aws-modules/vpc/aws"
  version = "5.0.0"
}

S3, HTTP: various other sources are also supported.

The Power of the Registry

Public Terraform Registry (registry.terraform.io):

terraform-aws-modules: the best known. VPC, EKS, RDS, and more.
Azure-verified: verified by Microsoft.
Google Cloud: GCP modules.

Example:

module "vpc" {
  source  = "terraform-aws-modules/vpc/aws"
  version = "5.1.0"
  
  name = "my-vpc"
  cidr = "10.0.0.0/16"
  
  azs             = ["us-east-1a", "us-east-1b"]
  private_subnets = ["10.0.1.0/24", "10.0.2.0/24"]
  public_subnets  = ["10.0.101.0/24", "10.0.102.0/24"]
  
  enable_nat_gateway = true
}

A single block configures more than 20 resources.

Module Versioning

Semantic versioning:

module "vpc" {
  source  = "terraform-aws-modules/vpc/aws"
  version = "~> 5.1.0"  # 5.1.x
}

Git tag:

source = "git::...repo.git?ref=v1.2.3"

Production requires pinning exact versions.

Module Design Principles

1. Single Responsibility:

Bad:

module "everything" {
  source = "./modules/full-stack"
  # VPC, EKS, RDS, ALB, CloudFront, ...
}

Good:

module "vpc" { ... }
module "eks" { ... }
module "rds" { ... }

2. Composable:

Pass outputs to other modules:

module "vpc" {
  source = "./vpc"
  # ...
}

module "eks" {
  source     = "./eks"
  vpc_id     = module.vpc.vpc_id
  subnet_ids = module.vpc.private_subnet_ids
}

3. Minimal Variables:

Give users choices, but with sensible defaults:

variable "instance_type" {
  type    = string
  default = "t3.micro"
}

variable "tags" {
  type    = map(string)
  default = {}
}

4. Clear Outputs:

Expose the values other modules need:

output "vpc_id" {}
output "subnet_ids" {}
output "security_group_id" {}

Testing Modules

Terratest (Go-based):

func TestVPC(t *testing.T) {
    opts := &terraform.Options{
        TerraformDir: "./modules/vpc",
    }
    defer terraform.Destroy(t, opts)
    terraform.InitAndApply(t, opts)
    
    vpcID := terraform.Output(t, opts, "vpc_id")
    assert.NotEmpty(t, vpcID)
}

Terraform Test (Terraform 1.6+):

# tests/basic.tftest.hcl
run "create_vpc" {
  command = plan
  
  assert {
    condition     = aws_vpc.this.cidr_block == "10.0.0.0/16"
    error_message = "VPC CIDR mismatch"
  }
}

8. Workspaces: Environment Separation

What Is a Workspace?

Managing multiple environments with the same code:

Dev, Staging, Prod.
Each environment has its own state.

terraform workspace list
# * default
#   dev
#   staging
#   prod

terraform workspace new prod
terraform workspace select prod

Usage

resource "aws_instance" "web" {
  instance_type = terraform.workspace == "prod" ? "t3.large" : "t3.micro"
}

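Drawbacks:

Every environment has to share the same code.
Per-environment configuration differences are hard to manage.
Risk of applying in the wrong workspace by mistake.

Alternatives to Workspaces

Directory per environment: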

environments/
├── dev/
│   ├── main.tf
│   └── terraform.tfvars
├── staging/
│   ├── main.tf
│   └── terraform.tfvars
└── prod/
    ├── main.tf
    └── terraform.tfvars

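Each directory is independent, which prevents mistakes.

Terragrunt: a third-party tool (not from HashiCorp) for keeping Terraform DRY.

9. Drift Detection

What Is Drift?

Drift: Terraform state ≠ actual cloud state.

Causes:

Manual changes: someone edits in the console.
External automation: auto-scaling, Lambda, and so on.
Other tools: used alongside CloudFormation, for example.

Detecting Drift

terraform plan

Resources in state are refreshed through the provider → any difference shows up in the plan.

Example: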

~ resource "aws_instance" "web" {
    id            = "i-0123456789abcdef0"
  ~ instance_type = "t3.small" -> "t3.micro"  # someone changed it to t3.small in the console!
  }

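Responding

Option 1: adopt the change in Terraform:

Update the code to match reality:

instance_type = "t3.small"

Option 2: revert:

Apply to restore the original state:

terraform apply

Terraform changes t3.small back to t3.micro.

Refresh-only

Update only the state, with no infrastructure changes:

terraform apply -refresh-only

This accepts the change made in the cloud.

Continuous Drift Detection

Tools such as Atlantis and Terraform Cloud automate this:

Run plan periodically.
Alert when drift is detected.
Fix automatically or confirm manually.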
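A scheduled job can approximate this with the -detailed-exitcode flag of terraform plan, which exits with 0 when there are no changes, 1 on error, and 2 when changes are present. The Python sketch below wraps that; the function names and the alerting step are hypothetical, and note that exit code 2 also fires for code changes that have not been applied yet, not only for drift.

```python
import subprocess


def classify(exit_code):
    """Interpret the exit code of `terraform plan -detailed-exitcode`."""
    return {0: "in-sync", 1: "error", 2: "changes"}.get(exit_code, "unknown")


def check_drift(workdir):
    """Run a plan in workdir and return (status, plan output)."""
    result = subprocess.run(
        ["terraform", "plan", "-detailed-exitcode", "-input=false", "-no-color"],
        cwd=workdir,
        capture_output=True,
        text=True,
    )
    return classify(result.returncode), result.stdout


def report(workdir):
    """Example wiring: print an alert when the plan is not empty."""
    status, output = check_drift(workdir)
    if status == "changes":
        print(f"Drift or unapplied changes in {workdir}:\n{output}")
    return status
```

Calling report("./terraform") from a cron job or a scheduled CI workflow would give a basic periodic check; sending the alert somewhere useful is left to the surrounding tooling.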

10. In Practice: CI/CD with Terraform

GitHub Actions Example

name: Terraform
on:
  pull_request:
  push:
    branches: [main]

jobs:
  terraform:
    runs-on: ubuntu-latest
    defaults:
      run:
        working-directory: ./terraform
    
    steps:
      - uses: actions/checkout@v4
      
      - uses: hashicorp/setup-terraform@v3
        with:
          terraform_version: 1.6.0
      
      - run: terraform init
      
      - run: terraform fmt -check
      
      - run: terraform validate
      
      - name: Terraform Plan
        run: terraform plan -out=tfplan
        env:
          AWS_ACCESS_KEY_ID: ${{ secrets.AWS_ACCESS_KEY }}
          AWS_SECRET_ACCESS_KEY: ${{ secrets.AWS_SECRET_KEY }}
      
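      - name: Terraform Apply
        if: github.ref == 'refs/heads/main'
        run: terraform apply -auto-approve tfplan
        env:  # apply needs the same credentials as plan
          AWS_ACCESS_KEY_ID: ${{ secrets.AWS_ACCESS_KEY }}
          AWS_SECRET_ACCESS_KEY: ${{ secrets.AWS_SECRET_KEY }}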

Best Practices

1. Plan per PR:

Run plan on every pull request.
Post the result as a PR comment.
Review before merge.

2. State security:

Strict IAM permissions.
Encryption at rest.
Access logging.

3. Secret management:

sensitive = true.
Use an external secret manager (Vault, AWS Secrets Manager).
Environment variables.

4. Lock Timeout:

terraform apply -lock-timeout=10m

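Useful when another long-running apply is holding the lock: Terraform waits up to the timeout instead of failing immediately.

5. Targeted operations (with care):

terraform apply -target=aws_instance.web

For emergencies only. Normally, apply everything.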

11. OpenTofu

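Why It Was Forked

August 10, 2023: HashiCorp moved Terraform to the BSL (Business Source License).

BSL restrictions:

Competitors (services that compete with Terraform Cloud) may not use it in production; each release converts to an open source license only after four years.
Concern in the open source community.

Response:

OpenTF Manifesto: the community announced a fork.
Named OpenTofu.
Under the Linux Foundation.
1.6 released in 2024.

OpenTofu vs Terraform

Technically: nearly identical. HCL-compatible. Most providers work.

License: OpenTofu is MPL 2.0 (fully open source).

Features:

OpenTofu has been shipping new features faster.
State encryption, for_each in provider, and others landed in OpenTofu first.
Terraform is catching up.

Migration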

# Terraform → OpenTofu
tofu init
tofu plan
tofu apply

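Most things work as-is. A few advanced features differ.

Which One to Use

Terraform:

Integration with the HashiCorp ecosystem.
Terraform Cloud/Enterprise.
The largest number of providers.

OpenTofu:

When you want pure open source.
When the BSL is a concern.
Community-driven innovation.

Many organizations are still waiting to see. The picture should settle over the next year or two.

12. Operations and Troubleshooting in Practice

Common Mistakes

1. Applying in the wrong workspace:

Fix: one directory per environment.

2. Committing the state file to Git:

Fix: add *.tfstate* to .gitignore.

3. Hardcoded credentials: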

# Bad
provider "aws" {
  access_key = "AKIAIOSFODNN7EXAMPLE"
  secret_key = "wJalrXUtnFEMI/K7MDENG/..."
}

# Good: environment variables
# AWS_ACCESS_KEY_ID, AWS_SECRET_ACCESS_KEY

4. Terraform version mismatch:

terraform {
  required_version = "~> 1.6.0"
}

5. Editing state by hand:

Dangerous. Use the terraform state commands instead.

Troubleshooting

State lock stuck:

terraform force-unlock <lock_id>

Drift:

terraform plan
# Review changes
terraform apply -refresh-only

Provider errors:

# Debug
TF_LOG=DEBUG terraform plan

Dependency cycles:

terraform graph | grep -i cycle

Performance Optimization

1. Tune parallelism:

terraform apply -parallelism=20

2. Targeted plan:

terraform plan -target=module.vpc

3. Skip refresh:

terraform plan -refresh=false

Caution: if the state is stale, the plan will be wrong.

4. Provider cache:

export TF_PLUGIN_CACHE_DIR="$HOME/.terraform.d/plugin-cache"

Reuses downloaded providers across projects.

Modularization Strategy

Small and reusable:

modules/
├── vpc/              # VPC + subnets
├── eks/              # EKS cluster
├── rds/              # Database
├── alb/              # Load balancer
└── monitoring/       # CloudWatch, alarms

Composition:

module "vpc" {
  source = "./modules/vpc"
}

module "eks" {
  source = "./modules/eks"
  vpc_id = module.vpc.vpc_id
}

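Review Quiz

Q1. How does Terraform's DAG make infrastructure management easier?

The DAG (Directed Acyclic Graph) is Terraform's most important internal structure. Every resource, module, and variable is represented as a node, and every reference as an edge.

What the DAG makes possible:

1. Automatic dependency management:

The user does not have to specify the order explicitly: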

resource "aws_vpc" "main" {
  cidr_block = "10.0.0.0/16"
}

resource "aws_subnet" "public" {
  vpc_id = aws_vpc.main.id  # this reference creates the dependency
}

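Terraform automatically:

Infers that the VPC must be created before the subnet.
Reverses the order on deletion (subnet first, VPC last).

Why: aws_subnet.public references aws_vpc.main.id, which creates an edge in the DAG.

2. Parallel execution:

Independent resources are processed at the same time: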

resource "aws_instance" "web1" {}
resource "aws_instance" "web2" {}
resource "aws_instance" "web3" {}

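The three have no dependencies on each other → created in parallel.

Performance impact:

Sequential: 3 minutes (1 minute × 3).
Parallel: 1 minute.

For large infrastructure (100+ resources) the speedup can approach 10× at the default -parallelism=10, and more if the limit is raised.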
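A back-of-the-envelope estimate shows where the speedup comes from. The Python sketch below assumes fully independent resources that each take the same time, which real infrastructure rarely satisfies; the function name and the one-minute figure are illustrative assumptions.

```python
import math


def wall_clock_minutes(n_resources, minutes_each=1.0, parallelism=10):
    """Estimated time to create n independent resources under a concurrency cap."""
    rounds = math.ceil(n_resources / parallelism)
    return rounds * minutes_each


print(wall_clock_minutes(3))                    # 3 resources fit in one round
print(wall_clock_minutes(100))                  # 10 rounds at the default cap
print(wall_clock_minutes(100, parallelism=1))   # fully sequential
print(wall_clock_minutes(100, parallelism=20))  # raising the cap halves the rounds
```

Dependencies between resources lengthen the critical path, so actual speedups are usually lower than this upper bound.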

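3. Cycle detection:

Circular references are not allowed:

resource "aws_security_group" "web" {
  ingress {
    security_groups = [aws_security_group.db.id]  # references db
  }
}

resource "aws_security_group" "db" {
  ingress {
    security_groups = [aws_security_group.web.id]  # references web → cycle!
  }
}

Terraform detects it immediately:

Error: Cycle: aws_security_group.web, aws_security_group.db

In practice this is solvable: create the security groups first, then define the rule as a separate resource:

resource "aws_security_group" "web" {}
resource "aws_security_group" "db" {}

resource "aws_security_group_rule" "web_to_db" {
  type                     = "ingress"
  from_port                = 5432   # example port (assumption)
  to_port                  = 5432
  protocol                 = "tcp"
  security_group_id        = aws_security_group.db.id   # the rule is attached to db
  source_security_group_id = aws_security_group.web.id  # traffic comes from web
}

The rule is a separate node, so there is no cycle.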

4. Graph visualization:

terraform graph | dot -Tpng > graph.png

Shows the dependencies of the whole infrastructure visually. Useful for understanding complex systems.

5. Targeted operations:

terraform apply -target=aws_instance.web

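DAG analysis processes only web and its dependencies. Everything else is left untouched.

6. Destroy order:

The reverse of the creation order, that is, a reverse topological sort:

Create: VPC → Subnet → Instance
Destroy: Instance → Subnet → VPC

The DAG handles this automatically.

The math behind the DAG:

Topological sort: ordering the nodes of a DAG by dependency.

The algorithm:

Find the nodes with in-degree 0 (nodes with no unmet dependencies).
Process them.
Remove their outgoing edges.
Process the nodes whose in-degree has just dropped to 0.
Repeat until every node has been processed.

This process makes parallel execution natural: each step yields a set of nodes that can be processed at the same time.

Example graph structure: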

resource "aws_vpc" "main" {}
resource "aws_subnet" "public" { vpc_id = aws_vpc.main.id }
resource "aws_subnet" "private" { vpc_id = aws_vpc.main.id }
resource "aws_instance" "web" { subnet_id = aws_subnet.public.id }
resource "aws_instance" "db" { subnet_id = aws_subnet.private.id }

DAG:

         aws_vpc.main
         /          \
        ↓            ↓
aws_subnet.public   aws_subnet.private
        ↓                   ↓
aws_instance.web    aws_instance.db

Execution order:

aws_vpc.main (alone).
aws_subnet.public, aws_subnet.private (in parallel).
aws_instance.web, aws_instance.db (in parallel).

Three steps, maximum parallelism.
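The in-degree procedure described above (Kahn's algorithm) can be written out by hand in a few lines. The Python sketch below groups nodes into levels for the five-resource example and derives the destroy order by reversing them. It illustrates the algorithm only and is not Terraform's implementation; the function name levels is hypothetical, and every node is assumed to appear as a key in the input.

```python
def levels(deps):
    """Group nodes into levels; each level can be processed in parallel.

    deps maps every node to the set of nodes it depends on.
    """
    remaining = {node: set(needs) for node, needs in deps.items()}
    result = []
    while remaining:
        # nodes with no unmet dependencies (in-degree 0)
        ready = sorted(node for node, needs in remaining.items() if not needs)
        if not ready:
            raise ValueError("cycle detected")
        result.append(ready)
        for node in ready:
            del remaining[node]
        for needs in remaining.values():
            needs.difference_update(ready)  # remove the edges just satisfied
    return result


deps = {
    "aws_vpc.main": set(),
    "aws_subnet.public": {"aws_vpc.main"},
    "aws_subnet.private": {"aws_vpc.main"},
    "aws_instance.web": {"aws_subnet.public"},
    "aws_instance.db": {"aws_subnet.private"},
}
create_order = levels(deps)
destroy_order = list(reversed(create_order))
print(create_order)
print(destroy_order)
```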

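DAG vs Imperative Scripts:

Imperative (bash, Python):

aws ec2 create-vpc --cidr-block 10.0.0.0/16
aws ec2 create-subnet --vpc-id vpc-xxx --cidr-block 10.0.1.0/24
aws ec2 run-instances --subnet-id subnet-xxx

Problems:

Order is managed by hand.
Parallelism is implemented by hand.
Recovery after failure is complicated.
No dependency tracking.
Deletion order is manual too.

DAG (Terraform):

Ordering is automatic.
Parallelism is automatic.
Recovery: apply again.
Dependencies are explicit.
Deletion happens automatically in reverse order.

Code size: Terraform is often shorter, thanks to being declarative and automated.

Practical benefits:

1. Managing large infrastructure:

Deploy 1,000+ resources within 30 minutes.
No ordering problems.
On error, only the affected part is retried.

2. Team collaboration:

Someone adds a new resource.
It lands in the right place automatically, without touching existing code.
The DAG works out the order.

3. Refactoring:

Dependencies are preserved even when a resource is moved into a module.
Terraform keeps track.

Limits of the DAG:

1. Hidden dependencies:

An IAM role must exist before a Lambda can be created: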

resource "aws_iam_role" "lambda" {}
resource "aws_lambda_function" "app" {
  role = aws_iam_role.lambda.arn  # detected by the DAG
}

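However, the DAG knows nothing about the IAM role's permission propagation delay (roughly 10 seconds). Using the role immediately after creation can fail.

Solutions:

A sleep provisioner (hacky).
Retries in the provider.
depends_on plus a time_sleep resource.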
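The retry approach usually means retrying with exponential backoff until the dependency becomes usable. The Python sketch below shows that pattern in general terms; it is not code from any provider, and the function name, defaults, and injectable sleep parameter are assumptions made for illustration.

```python
import time


def retry(operation, attempts=5, base_delay=1.0, sleep=time.sleep):
    """Call operation until it succeeds, doubling the wait after each failure."""
    for attempt in range(attempts):
        try:
            return operation()
        except Exception:
            if attempt == attempts - 1:
                raise  # out of attempts: surface the last error
            sleep(base_delay * 2 ** attempt)  # 1s, 2s, 4s, ...


# Simulated call that fails until the role has "propagated"
calls = []


def create_function():
    calls.append(1)
    if len(calls) < 3:
        raise RuntimeError("role not yet propagated")
    return "created"


# Pass a no-op sleep so the demo runs instantly
print(retry(create_function, sleep=lambda seconds: None))
```

A production version would retry only on the specific "not ready yet" error rather than on every exception.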

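2. External side effects:

Things that happen outside Terraform (DNS propagation, cache invalidation, and so on) cannot be modeled in the DAG.

3. Dynamic dependencies:

Dependencies that are only known at runtime are hard to express. count and for_each solve part of this.

Takeaway:

The DAG is the foundation of declarative infrastructure management. Every serious IaC tool (Terraform, Pulumi, CloudFormation) uses the concept. The differences are in language and ecosystem.

The DAG is what lets infrastructure become code:

Version control with Git.
Review through PRs.
Automated deployment through CI/CD.
Dependency analysis.

This was Terraform's innovation in 2014. CloudFormation existed before it, but it was AWS-only. Terraform changed the industry by combining multi-cloud, declarative configuration, and the DAG.

When you type terraform apply, all of this happens behind the scenes: dependency analysis, parallel scheduling, failure recovery, state tracking. Abstracting that complexity into a single command is Terraform's real value.

Q2. Why is Terraform state so important, and how should it be managed?

State is Terraform's "memory". Without it, Terraform can do nothing.

What state does:

1. Resource → cloud object mapping: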

resource "aws_instance" "web" {
  ami = "ami-..."
}

Stored in state:

{
  "type": "aws_instance",
  "name": "web",
  "instances": [{
    "attributes": {
      "id": "i-0123456789abcdef0",  # the actual AWS instance ID
      ...
    }
  }]
}

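On the next terraform apply, Terraform:

Knows that web in state corresponds to i-0123456789abcdef0.
Refreshes that instance.
Compares it with the desired state.

Without state: Terraform would create everything anew on every run, and the existing resources would be orphaned.

2. Dependency tracking:

State also holds each resource's dependency list: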

{
  "dependencies": ["aws_vpc.main", "aws_subnet.public"]
}

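It is used to process resources in reverse order on destroy.

3. Metadata storage:

Values that are not visible through the cloud API:

sensitive fields.
Internal IDs.
Computed attributes.

4. Performance:

Instead of querying every resource each time, Terraform uses the information in state. Refresh can be skipped when needed.

Risks of state management:

1. Loss:

If you lose the state, every resource has to be imported again.

terraform import aws_instance.web i-0123456789
terraform import aws_vpc.main vpc-abc123
# ... dozens or hundreds more

A nightmare.

2. Corruption:

Corrupt JSON → Terraform cannot read it, and fixing it by hand is hard.

3. Drift:

State and reality disagree, which leads to unexpected changes on the next apply.

4. Sensitive data:

Passwords, API keys, and similar values are stored in state. A leak is a serious problem.

Principles of state management:

1. Use a remote backend:

Never run production on local state alone.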

terraform {
  backend "s3" {
    bucket         = "my-terraform-state"
    key            = "prod/terraform.tfstate"
    region         = "us-east-1"
    encrypt        = true
    dynamodb_table = "terraform-locks"
  }
}

Benefits:

Sharing: team members access the same state.
Backup: previous states through S3 versioning.
Encryption: encrypted at rest.
Lock: prevents concurrent applies.

S3 bucket setup:

resource "aws_s3_bucket" "tfstate" {
  bucket = "my-terraform-state"
}

resource "aws_s3_bucket_versioning" "tfstate" {
  bucket = aws_s3_bucket.tfstate.id
  versioning_configuration {
    status = "Enabled"
  }
}

resource "aws_s3_bucket_server_side_encryption_configuration" "tfstate" {
  bucket = aws_s3_bucket.tfstate.id
  rule {
    apply_server_side_encryption_by_default {
      sse_algorithm = "AES256"
    }
  }
}

resource "aws_s3_bucket_public_access_block" "tfstate" {
  bucket                  = aws_s3_bucket.tfstate.id
  block_public_acls       = true
  block_public_policy     = true
  ignore_public_acls      = true
  restrict_public_buckets = true
}

2. State Locking:

DynamoDB (AWS):

resource "aws_dynamodb_table" "tf_locks" {
  name         = "terraform-locks"
  billing_mode = "PAY_PER_REQUEST"
  hash_key     = "LockID"
  
  attribute {
    name = "LockID"
    type = "S"
  }
}

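Other backends:

GCS: built-in locking.
Azure: blob leasing.
HTTP: a custom lock API.

Why locking matters:

Two people apply at the same time → race condition → corrupted state. The lock prevents this.

alice: terraform apply
  → acquires the lock
  → starts work

bob: terraform apply  (while alice is working)
  → tries to acquire the lock
  → fails
  → error: "Lock held by alice"
  → waits or aborts

3. Managing state file size:

When state grows to tens of MB:

Plan/apply become slow.
Memory usage increases.
Some backends have size limits.

Solutions:

Split the state: break a large project into several smaller states.
Per environment: instead of one giant state.
Per service: VPC, EKS, RDS each on their own.

Example layout: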

terraform/
├── network/     # VPC, subnets (separate state)
│   └── terraform.tfstate
├── compute/     # EC2, ASG
│   └── terraform.tfstate
├── database/    # RDS
│   └── terraform.tfstate
└── monitoring/  # CloudWatch, alarms
    └── terraform.tfstate

4. Referencing across states:

Sharing values after splitting state:

Data source:

data "terraform_remote_state" "network" {
  backend = "s3"
  config = {
    bucket = "my-terraform-state"
    key    = "network/terraform.tfstate"
    region = "us-east-1"
  }
}

resource "aws_instance" "web" {
  subnet_id = data.terraform_remote_state.network.outputs.public_subnet_id
}

Benefit: the network team and the app team can manage their parts independently.

5. Managing sensitive data:

Sensitive information stored in state:

DB passwords.
API keys.
Certificates.

Protection:

Backend encryption.
Access control: strict IAM.
Audit logging.

Using Vault / Secrets Manager:

data "aws_secretsmanager_secret_version" "db_password" {
  secret_id = "prod/db/password"
}

resource "aws_db_instance" "main" {
  password = data.aws_secretsmanager_secret_version.db_password.secret_string
}

The password is still stored in state, but Secrets Manager remains the source of truth.

6. Backup:

S3 versioning: previous versions are kept automatically.

Manual backup:

aws s3 cp s3://my-terraform-state/prod/terraform.tfstate ./backup/prod-$(date +%F).tfstate

Restore:

aws s3 cp ./backup/prod-2025-04-10.tfstate s3://my-terraform-state/prod/terraform.tfstate

7. State migration:

Changing the backend:

# Before
terraform {
  backend "local" {}
}

# After
terraform {
  backend "s3" { ... }
}

terraform init -migrate-state

Terraform moves the state to the new backend automatically.

State manipulation commands:

terraform state list: lists resources.

terraform state show <resource>: shows resource details.

terraform state mv: changes a resource's name or location.

# Move into a module
terraform state mv aws_instance.web module.compute.aws_instance.web

# Rename
terraform state mv aws_instance.old aws_instance.new

terraform state rm: removes from state (the real resource remains).

# Stop Terraform from managing this resource
terraform state rm aws_instance.legacy

terraform import: adds an existing resource to state.

terraform import aws_instance.existing i-0123456789

Use these commands with care; they can corrupt the state.

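Best practices summary:

Remote backend is mandatory (S3 + DynamoDB is the standard).
State encryption at rest.
State locking enabled.
Strict access control (IAM).
Regular backups.
Split the state (no monoliths).
*.tfstate* in .gitignore.
Sensitive values in an external secret manager.
Audit logging.
Train the team on the terraform state commands.

Incidents from practice:

Case 1: state committed to Git:

A developer commits terraform.tfstate to Git by mistake.
The repository is public.
The DB password is exposed.
Fix: secret rotation, cleaning up the git history, education.

Case 2: state loss:

A developer works locally.
The laptop is lost.
No backend → the state is gone for good.
Recovery: manually importing 100+ resources. Several days.

Case 3: concurrent apply:

A backend configured without locking (for example, S3 without a DynamoDB lock table).
Two teams apply at the same time.
State corruption.
Recovery: manual state repair, with outside expert help.

Takeaway:

State is Terraform's most important asset. The code lives in Git. The cloud lives behind the provider. State is what connects the two.

If state is not managed properly:

Confusion.
Outages.
Security incidents.
Unrecoverable situations.

If it is managed well:

Smooth team collaboration.
Infrastructure you can trust.
Easier troubleshooting.
Long-term maintainability.

"If you are unsure whether to use Terraform, the question is simple: do you understand how to manage state?" If you do, use it. If you do not, learn that first.

This article covers the fundamentals of state management. In practice you have to adapt them to your organization's needs. A small team and a large enterprise need different approaches, but the principles are the same:

Protect the state. Back it up. Lock it. Encrypt it. Audit it.

Following these five rules avoids most problems and lets you safely enjoy Terraform's real strength: declarative infrastructure management.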

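Closing: The Victory of Declarative Infrastructure

Key Takeaways

HCL: a declarative DSL with functions and conditionals.
DAG: the dependency graph, the basis of parallel execution.
State: the source of truth. A remote backend is mandatory.
Provider: the cloud interface, built on gRPC.
Plan/Apply: preview, then execute.
Modules: the unit of reuse.
Workspaces vs directories: environment separation.
OpenTofu: the open source fork.

Practical Checklist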

New Terraform project:

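What Terraform Taught Us

Terraform changed the paradigm of infrastructure management:

Before: web consoles, manual scripts, static documents.
After: code, version control, automation, reproducibility.

This is not just a change of tools. It is a change of culture:

Infrastructure team → engineering team.
Manual changes → PR review.
Documents → code.
Manual checks → automated tests.

This is the foundation of the DevOps revolution.

A Final Lesson

To use Terraform well, you have to understand its internals:

DAG: why this order?
State: why this value?
Provider: why this error?
Plan: why this change?

The answers to these questions are in this article.

The next time you type terraform apply, take a moment to think:

HCL is parsed.
The DAG is built.
State is refreshed.
The plan is computed.
Providers call APIs.
Many resources change in parallel.
State is updated.

This complex orchestration runs from a one-line command. Terraform hides the complexity and exposes only a declarative interface.

That is the definition of a good tool: it hides complexity, yet lets you look inside when you need to.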