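Complete Guide to Terraform Module Design Patterns: State Management, Workspaces, and Atlantis Automation

Introduction
Terraform Module Structure and Design Principles
- Module Directory Structure
- Core Design Principles
Module Design Patterns
Variable Design and Output Strategy
- Variable Design Guidelines
- Output Design
Remote State Management
Workspace Strategy vs Directory Separation
GitOps Automation with Atlantis
Module Versioning and Registry
- Semantic Versioning
- Private Module Registry
Comparison Tables
- State Backend Comparison
- IaC Tool Comparison
Failure Cases and Recovery Procedures
Operational Checklists
Conclusion

Introduction

In the Infrastructure as Code (IaC) ecosystem, Terraform has established itself as the de facto standard tool for managing multi-cloud environments. However, as Terraform projects grow, module design, state management, and team collaboration workflows become much harder to keep under control.

The early approach of listing hundreds of resources in a single main.tf file quickly degenerates into unmaintainable "spaghetti infrastructure." Even modularized code suffers when everything lives in a single state file: terraform plan can take over 10 minutes, and state conflicts between team members become frequent.

This guide covers three core Terraform module design patterns (Composition, Facade, Factory) with HCL code examples, remote state management (S3+DynamoDB, GCS, Terraform Cloud), workspace strategies, and GitOps automation with Atlantis. It also includes failure cases encountered in production operations, such as state lock conflicts, drift, and circular dependencies, along with recovery procedures.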

Terraform Module Structure and Design Principles

Module Directory Structure

A well-designed Terraform module follows a clear file structure. Based on HashiCorp official guidelines and Google Cloud Best Practices, the standard structure is:

modules/
  networking/
    main.tf          # Core resource definitions
    variables.tf     # Input variable declarations
    outputs.tf       # Output value definitions
    versions.tf      # Provider/terraform version constraints
    README.md        # Module usage documentation
    examples/
      simple/
        main.tf      # Simple usage example
      complete/
        main.tf      # Full-featured usage example
    tests/
      networking_test.go  # Terratest tests

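Core Design Principles

1. Single Responsibility Principle

Each module should handle exactly one logical function. As HashiCorp puts it, "If a module's function or purpose is hard to explain, the module is probably too complex."

2. Loose Coupling

Minimize direct dependencies between modules. If running terraform plan reveals that a change in one module unexpectedly alters the state of several others, that is a signal of excessive coupling.

3. No Provider Configuration in Shared Modules

Shared modules should not configure provider or backend blocks directly. Provider configuration should always be done in root modules.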

# Bad example - provider configured inside module
# modules/vpc/main.tf
provider "aws" {
  region = "ap-northeast-2"  # Hardcoded region
}

resource "aws_vpc" "main" {
  cidr_block = var.cidr_block
}

# Good example - provider configured in root module
# environments/prod/main.tf
provider "aws" {
  region = "ap-northeast-2"
}

module "vpc" {
  source     = "../../modules/vpc"
  cidr_block = "10.0.0.0/16"
}

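4. Mandatory Output Values

Define at least one output for every resource created by a module. Without outputs, callers have no way to reference what the module created, so modules cannot be composed and Terraform has no references from which to infer dependencies between them.

Module Design Patterns

1. Composition Pattern

The Composition pattern combines small, focused modules to build complex infrastructure. It applies the software engineering principle of "Composition over Inheritance" to infrastructure code and is the most recommended pattern.

# environments/prod/main.tf - Composition Pattern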
module "vpc" {
  source     = "../../modules/networking/vpc"
  cidr_block = "10.0.0.0/16"
  azs        = ["ap-northeast-2a", "ap-northeast-2b", "ap-northeast-2c"]
}

module "security_group" {
  source = "../../modules/networking/security-group"
  vpc_id = module.vpc.vpc_id

  ingress_rules = [
    {
      port        = 443
      protocol    = "tcp"
      cidr_blocks = ["0.0.0.0/0"]
    }
  ]
}

module "eks" {
  source            = "../../modules/compute/eks"
  vpc_id            = module.vpc.vpc_id
  subnet_ids        = module.vpc.private_subnet_ids
  security_group_id = module.security_group.sg_id
  cluster_version   = "1.31"
}

module "rds" {
  source            = "../../modules/database/rds"
  vpc_id            = module.vpc.vpc_id
  subnet_ids        = module.vpc.database_subnet_ids
  security_group_id = module.security_group.sg_id
  engine            = "postgres"
  engine_version    = "16.4"
}

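Each module can be independently tested, versioned, and reused. Data flows between modules through output values.

2. Facade Pattern

The Facade pattern hides complex internal implementation and provides consumers with a simple interface. Like a TV remote control, a single button (variable) controls complex internal operations (multiple resource creation).

# modules/platform/main.tf - Facade Pattern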
variable "environment" {
  type = string
}

variable "app_name" {
  type = string
}

variable "instance_type" {
  type    = string
  default = "t3.medium"
}

# Internally composes multiple sub-modules
module "networking" {
  source      = "../networking/vpc"
  cidr_block  = var.environment == "prod" ? "10.0.0.0/16" : "10.1.0.0/16"
  environment = var.environment
}

module "compute" {
  source        = "../compute/eks"
  vpc_id        = module.networking.vpc_id
  subnet_ids    = module.networking.private_subnet_ids
  instance_type = var.instance_type
  cluster_name  = "cluster-name-placeholder"
}

module "monitoring" {
  source     = "../observability/cloudwatch"
  cluster_id = module.compute.cluster_id
  alarm_sns  = module.compute.alarm_topic_arn
}

# Consumer uses it simply
# environments/prod/main.tf
module "platform" {
  source       = "../../modules/platform"
  environment  = "prod"
  app_name     = "my-service"
  instance_type = "m5.xlarge"
}

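3. Factory Pattern

The Factory pattern uses for_each to create identical resource structures in bulk from data-driven configuration. One caveat: a module called with for_each cannot contain its own provider blocks, so every instance uses the provider configurations passed in by the caller, and the region value below is only an input variable unless the child module's resources are wired to a matching per-region provider.

# modules/multi-region/main.tf - Factory Pattern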
variable "regions" {
  type = map(object({
    cidr_block    = string
    instance_type = string
    replicas      = number
  }))
}

module "regional_stack" {
  source   = "../regional-stack"
  for_each = var.regions

  region        = each.key
  cidr_block    = each.value.cidr_block
  instance_type = each.value.instance_type
  replicas      = each.value.replicas
}

# Usage example
module "global_infra" {
  source = "../../modules/multi-region"

  regions = {
    "ap-northeast-2" = {
      cidr_block    = "10.0.0.0/16"
      instance_type = "m5.xlarge"
      replicas      = 3
    }
    "us-east-1" = {
      cidr_block    = "10.1.0.0/16"
      instance_type = "m5.large"
      replicas      = 2
    }
  }
}

Variable Design and Output Strategy

Variable Design Guidelines

Effective variable design determines the reusability and stability of your modules.

# modules/vpc/variables.tf
variable "cidr_block" {
  type        = string
  description = "VPC CIDR block (e.g., 10.0.0.0/16)"

  validation {
    condition     = can(cidrnetmask(var.cidr_block))
    error_message = "Must be a valid CIDR block."
  }
}

variable "environment" {
  type        = string
  description = "Environment name (dev, staging, prod)"

  validation {
    condition     = contains(["dev", "staging", "prod"], var.environment)
    error_message = "Environment must be dev, staging, or prod."
  }
}

variable "enable_nat_gateway" {
  type        = bool
  default     = true
  description = "Whether to create NAT Gateways for private subnets"
}

variable "tags" {
  type        = map(string)
  default     = {}
  description = "Additional tags to apply to all resources"
}

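Key Principle: Expose only values that should vary across environments (CIDR ranges, instance sizes, names) as variables. Encapsulate internal implementation details (IAM policy structures, logging configurations, tagging schemes) inside the module.

Output Design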

# modules/vpc/outputs.tf
output "vpc_id" {
  value       = aws_vpc.main.id
  description = "The ID of the VPC"
}

output "private_subnet_ids" {
  value       = aws_subnet.private[*].id
  description = "List of private subnet IDs"
}

output "database_subnet_ids" {
  value       = aws_subnet.database[*].id
  description = "List of database subnet IDs"
}

output "nat_gateway_ips" {
  value       = aws_eip.nat[*].public_ip
  description = "Elastic IPs of NAT Gateways"
}

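Remote State Management

S3 + DynamoDB Backend (AWS)

The most widely used remote state configuration in AWS environments. S3 stores the state file and DynamoDB provides state locking. Note that Terraform's S3 backend is moving from DynamoDB-based locking to S3-native locking: use_lockfile = true was introduced in Terraform 1.10, and dynamodb_table is deprecated as of Terraform 1.11, so check which option your version supports.

# backend.tf - S3 + DynamoDB remote state configuration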
terraform {
  backend "s3" {
    bucket         = "my-company-terraform-state"
    key            = "prod/networking/terraform.tfstate"
    region         = "ap-northeast-2"
    encrypt        = true
    dynamodb_table = "terraform-state-lock"
    # use_lockfile = true  # S3 native locking (Terraform 1.10+)
  }
}

State bucket bootstrap script:

#!/bin/bash
# bootstrap-backend.sh - Create state storage infrastructure

# Stop at the first failing command so the success message is only printed
# when every step worked
set -euo pipefail

BUCKET_NAME="my-company-terraform-state"
DYNAMODB_TABLE="terraform-state-lock"
REGION="ap-northeast-2"

# Create S3 bucket (LocationConstraint is required outside us-east-1)
aws s3api create-bucket \
  --bucket "$BUCKET_NAME" \
  --region "$REGION" \
  --create-bucket-configuration LocationConstraint="$REGION"

# Enable versioning
aws s3api put-bucket-versioning \
  --bucket "$BUCKET_NAME" \
  --versioning-configuration Status=Enabled

# Block public access
aws s3api put-public-access-block \
  --bucket "$BUCKET_NAME" \
  --public-access-block-configuration \
    BlockPublicAcls=true,IgnorePublicAcls=true,BlockPublicPolicy=true,RestrictPublicBuckets=true

# Configure KMS encryption
aws s3api put-bucket-encryption \
  --bucket "$BUCKET_NAME" \
  --server-side-encryption-configuration '{
    "Rules": [{"ApplyServerSideEncryptionByDefault": {"SSEAlgorithm": "aws:kms"}}]
  }'

# Create DynamoDB table for state locking
aws dynamodb create-table \
  --table-name "$DYNAMODB_TABLE" \
  --attribute-definitions AttributeName=LockID,AttributeType=S \
  --key-schema AttributeName=LockID,KeyType=HASH \
  --billing-mode PAY_PER_REQUEST \
  --region "$REGION"

echo "Backend infrastructure created successfully"

GCS Backend (Google Cloud)

terraform {
  backend "gcs" {
    bucket = "my-company-tf-state"
    prefix = "prod/networking"
  }
}

Terraform Cloud / HCP Terraform

terraform {
  cloud {
    organization = "my-company"

    workspaces {
      name = "prod-networking"
    }
  }
}

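Remote State Data Source (Cross-Stack References)

To reference outputs from one stack in another, use the terraform_remote_state data source. Only outputs declared in the other stack's root module are visible this way.

# Referencing networking stack state from compute stack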
data "terraform_remote_state" "networking" {
  backend = "s3"

  config = {
    bucket = "my-company-terraform-state"
    key    = "prod/networking/terraform.tfstate"
    region = "ap-northeast-2"
  }
}

resource "aws_instance" "app" {
  ami           = "ami-0c55b159cbfafe1f0"
  instance_type = "t3.medium"
  subnet_id     = data.terraform_remote_state.networking.outputs.private_subnet_ids[0]
}

Workspace Strategy vs Directory Separation

Workspace Approach

Terraform workspaces share the same .tf files while maintaining independent state files per environment.

# Create and switch workspaces
terraform workspace new dev
terraform workspace new staging
terraform workspace new prod
terraform workspace select prod

# Check current workspace
terraform workspace show

Referencing workspaces in HCL:

resource "aws_instance" "app" {
  instance_type = terraform.workspace == "prod" ? "m5.xlarge" : "t3.medium"

  tags = {
    Environment = terraform.workspace
  }
}

Directory Separation Approach

infrastructure/
  modules/
    vpc/
    eks/
    rds/
  environments/
    dev/
      main.tf
      terraform.tfvars
      backend.tf
    staging/
      main.tf
      terraform.tfvars
      backend.tf
    prod/
      main.tf
      terraform.tfvars
      backend.tf

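Workspaces vs Directories Comparison

Criteria	Workspaces	Directory Separation
Code Duplication	None (shared code)	Some duplication
Environment Isolation	Weak (same backend)	Strong (separate backends possible)
IAM Permission Separation	Difficult	Per-environment configuration
Blast Radius	Wide (shared code)	Narrow (independent)
Operational Complexity	Low	Medium
Best Suited For	Ephemeral environments, testing	Production environments

Recommendation: Use directory separation for production environments and workspaces for short-lived test environments. Many successful teams combine both approaches.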

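GitOps Automation with Atlantis

What is Atlantis?

Atlantis is a GitOps tool that automates Terraform plan and apply through pull request workflows. When a developer opens an infrastructure change PR, Atlantis automatically runs terraform plan and posts the results as a PR comment. Once reviewers approve, the changes can be applied with an atlantis apply comment.

Key Benefits

Consistent execution environment: All Terraform operations run on a dedicated server, avoiding "works on my machine" problems
Automatic project locking: After a plan runs, Atlantis holds its own lock on the project (directory and workspace) until the PR is merged or closed or the lock is discarded, so a second PR cannot plan the same project in the meantime
Code review integration: Plan results are visible directly in the PR, making infrastructure changes reviewable
Audit logging: All changes are recorded in PR history

atlantis.yaml Configuration

# atlantis.yaml - located at repository root
# Note: setting apply_requirements here only takes effect if the server-side
# repo config lists it under allowed_overrides.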
version: 3
automerge: false
parallel_plan: true
parallel_apply: false

projects:
  - name: prod-networking
    dir: environments/prod/networking
    workspace: default
    terraform_version: v1.9.0
    autoplan:
      when_modified:
        - '*.tf'
        - '*.tfvars'
        - '../../../modules/networking/**/*.tf'
      enabled: true
    apply_requirements:
      - approved
      - mergeable

  - name: prod-compute
    dir: environments/prod/compute
    workspace: default
    terraform_version: v1.9.0
    autoplan:
      when_modified:
        - '*.tf'
        - '*.tfvars'
        - '../../../modules/compute/**/*.tf'
      enabled: true
    apply_requirements:
      - approved
      - mergeable

  - name: dev-networking
    dir: environments/dev/networking
    workspace: default
    terraform_version: v1.9.0
    autoplan:
      when_modified:
        - '*.tf'
        - '*.tfvars'
      enabled: true

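Custom Atlantis Workflows

# atlantis.yaml - custom workflow
# Projects opt in with `workflow: custom`. Defining workflows in the repo-level
# file requires allow_custom_workflows: true in the server-side repo config.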
workflows:
  custom:
    plan:
      steps:
        - run: terraform fmt -check -recursive
        - run: tflint --init
        - run: tflint
        - init
        - plan
    apply:
      steps:
        - apply

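Module Versioning and Registry

Semantic Versioning

Terraform modules should follow Semantic Versioning (SemVer):

Major version bump: Breaking changes such as adding required input variables or removing outputs
Minor version bump: Adding optional input variables, new outputs
Patch version bump: Bug fixes, documentation updates

# Specifying version constraints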
module "vpc" {
  source  = "terraform-aws-modules/vpc/aws"
  version = "~> 5.0"  # Latest within 5.x range
}

module "eks" {
  source  = "git::https://github.com/my-org/terraform-aws-eks.git?ref=v3.2.1"
}

Private Module Registry

Use Terraform Cloud or a self-hosted registry to manage internal modules.

# Using Terraform Cloud Private Registry
module "vpc" {
  source  = "app.terraform.io/my-org/vpc/aws"
  version = "2.1.0"
}

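Comparison Tables

State Backend Comparison

Feature	S3 + DynamoDB	GCS	Terraform Cloud	Azure Blob
State Locking	DynamoDB / S3 Native	Built-in	Built-in	Blob Lease
Encryption	SSE-S3 or SSE-KMS	Google-managed keys or Cloud KMS	Included	Microsoft-managed keys or Azure Key Vault
Versioning	S3 Versioning	Object Versioning	Included	Blob Snapshots
Access Control	IAM Policy	IAM	Teams/RBAC	Azure RBAC
Cost	S3 + DynamoDB billing	GCS billing	Free tier limited	Blob billing
Setup Difficulty	Medium	Low	Low	Medium

IaC Tool Comparison

Feature	Terraform/OpenTofu	Pulumi	Crossplane	CloudFormation
Language	HCL	TypeScript/Python/Go	YAML/CRD	JSON/YAML
State Management	Local by default; remote backend needed for teams	Self-managed/external	Kubernetes etcd	AWS managed
Multi-Cloud	Excellent	Excellent	Excellent	AWS only
Learning Curve	Medium	Low (existing languages)	High	Low (AWS users)
Community	Very large	Growing	Growing	AWS ecosystem
Drift Detection	Manual via plan	Manual via preview	Automatic (reconciliation)	Drift Detection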

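Failure Cases and Recovery Procedures

Case 1: State Lock Conflict

Symptom: "Error acquiring the state lock" error when running terraform plan or apply

Cause: A previous Terraform operation terminated abnormally (network disconnection, CI runner timeout, Ctrl+C forced termination) and the lock was not released

Recovery procedure:

# 1. Check lock status - verify no other users are running operations
# Find the Lock ID from the error message

# 2. After confirming no other operations are running, force release
terraform force-unlock LOCK_ID

# 3. Force release without confirmation (caution: verify no operations running)
terraform force-unlock -force LOCK_ID

Prevention measures:

Set appropriate timeouts in CI/CD pipelines
Implement concurrency controls
Use Atlantis for PR-based automatic locking to prevent conflicts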

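Case 2: State Drift

Symptom: terraform plan shows unexpected changes. Resources manually modified in the console are inconsistent with Terraform state

Recovery procedure:

# 1. Review drift without proposing configuration changes
#    (terraform refresh is deprecated in favor of -refresh-only)
terraform plan -refresh-only

# 2. If the manual change should be kept, accept it into state
#    and update the code to match
terraform apply -refresh-only

# 3. Otherwise revert the manual change by re-applying the code
terraform plan
terraform apply  # Restore infrastructure to match code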

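Case 3: Circular Dependencies

Symptom: "Cycle" error during terraform plan

Cause: A value in Module A depends on an output of Module B that in turn depends on that same value in Module A. Terraform builds its dependency graph from individual values, so two modules may reference each other's outputs as long as no chain of references loops back on itself.

Solutions:

Extract common dependencies into a separate module
Move one of the mutual references into a standalone resource; depends_on only adds edges to the dependency graph, so it cannot remove a cycle
Use data sources to switch to indirect references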
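
A common instance of this cycle is two security groups that name each other in inline rules. The sketch below shows one way to break it by moving the rules into standalone resources. The resource names, ports, and the vpc_id variable are illustrative assumptions, and it assumes an AWS provider version that ships the aws_vpc_security_group_*_rule resources.

```hcl
variable "vpc_id" {
  type        = string
  description = "VPC in which both groups are created (hypothetical input)"
}

# Both groups are created without inline rules, so neither depends on the other.
resource "aws_security_group" "app" {
  name   = "app"
  vpc_id = var.vpc_id
}

resource "aws_security_group" "db" {
  name   = "db"
  vpc_id = var.vpc_id
}

# The mutual references live in separate rule resources that depend on
# both groups, which keeps the dependency graph acyclic.
resource "aws_vpc_security_group_ingress_rule" "db_from_app" {
  security_group_id            = aws_security_group.db.id
  referenced_security_group_id = aws_security_group.app.id
  ip_protocol                  = "tcp"
  from_port                    = 5432
  to_port                      = 5432
}

resource "aws_vpc_security_group_egress_rule" "app_to_db" {
  security_group_id            = aws_security_group.app.id
  referenced_security_group_id = aws_security_group.db.id
  ip_protocol                  = "tcp"
  from_port                    = 5432
  to_port                      = 5432
}
```

The same idea applies across modules: create the objects first, then attach the relationships in a third place that depends on both.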

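Case 4: Performance Degradation with Large State Files

Symptom: terraform plan takes more than 10 minutes and API rate limiting occurs

Solutions:

# Stopgap: plan/apply only a specific module (-target is meant for exceptional use)
terraform plan -target=module.eks
terraform apply -target=module.eks

# Split the state: move a module into a separate state file.
# -state and -state-out are legacy flags that operate on local state files,
# so pull a copy first and push each file to its backend only after review.
terraform state pull > source.tfstate
terraform state mv -state=source.tfstate -state-out=monitoring.tfstate module.monitoring module.monitoring

Root fix: Split state by component to reduce the size of each state file. Manage networking, compute, database, and monitoring as separate state files and cross-reference them with the terraform_remote_state data source.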

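Operational Checklists

Module Design Checklist

Does the module follow the single responsibility principle?
Are outputs defined for all resources?
Do variables include type, description, and validation?
Are provider and backend configured only in root modules?
Are README.md and an examples directory included?
Are releases tagged with semantic versions?

State Management Checklist

Is a remote backend configured (no local state files)?
Is state locking enabled?
Is the state file encrypted?
Is versioning enabled on the S3 bucket?
Is public access blocked?
Is state split per component?

Atlantis / CI-CD Checklist

Is atlantis.yaml present at the repository root?
Are the approved + mergeable requirements enforced before apply?
Are dependent projects automatically planned when a module changes?
Is the webhook secret managed securely?
Are credentials rotated periodically?

Conclusion

Terraform module design and state management matter more and more as infrastructure code grows. Composing small modules with the Composition pattern, splitting remote state per component, and automating the GitOps workflow with Atlantis is, as of 2026, among the most mature operating models available.

The key principle is "start small, and split when you need to." Rather than trying to design a perfect module structure up front, the most realistic path is incremental: start with a single module and refactor when duplication appears, split the state file when it grows, and introduce Atlantis when the team grows.

Infrastructure code deserves the same engineering discipline as application code (code review, testing, version control, CI/CD), and hopefully the patterns and tools introduced here are of practical help along the way.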

Complete Guide to Terraform Module Design Patterns: State Management, Workspaces, and Atlantis Automation

Introduction
Terraform Module Structure and Design Principles
- Module Directory Structure
- Core Design Principles
Module Design Patterns
Variable Design and Output Strategy
- Variable Design Guidelines
- Output Design
Remote State Management
Workspace Strategy vs Directory Separation
GitOps Automation with Atlantis
Module Versioning and Registry
- Semantic Versioning
- Private Module Registry
Comparison Tables
- State Backend Comparison
- IaC Tool Comparison
Failure Cases and Recovery Procedures
Operational Checklists
Conclusion

Introduction

In the Infrastructure as Code (IaC) ecosystem, Terraform has established itself as the de facto standard tool for managing multi-cloud environments. However, as Terraform projects grow, module design, state management, and team collaboration workflows become much harder to keep under control.

The early approach of listing hundreds of resources in a single main.tf file quickly degenerates into unmaintainable "spaghetti infrastructure." Even modularized code suffers when everything lives in a single state file -- terraform plan can take over 10 minutes, and state conflicts between team members become frequent.

This guide covers three core Terraform module design patterns (Composition, Facade, Factory) with real HCL code examples, remote state management (S3+DynamoDB, GCS, Terraform Cloud), workspace strategies, and GitOps automation with Atlantis. It also includes failure cases encountered in production operations -- state lock conflicts, drift detection, and circular dependencies -- along with recovery procedures.

Terraform Module Structure and Design Principles

Module Directory Structure

A well-designed Terraform module follows a clear file structure. Based on HashiCorp official guidelines and Google Cloud Best Practices, the standard structure is:

modules/
  networking/
    main.tf          # Core resource definitions
    variables.tf     # Input variable declarations
    outputs.tf       # Output value definitions
    versions.tf      # Provider/terraform version constraints
    README.md        # Module usage documentation
    examples/
      simple/
        main.tf      # Simple usage example
      complete/
        main.tf      # Full-featured usage example
    tests/
      networking_test.go  # Terratest tests

Core Design Principles

1. Single Responsibility Principle

Each module should handle exactly one logical function. As HashiCorp states, "If a module's function or purpose is hard to explain, the module is probably too complex."

2. Loose Coupling

Minimize direct dependencies between modules. If running terraform plan reveals that a change in one module unexpectedly alters the state of several others, that is a signal of excessive coupling.

3. No Provider Configuration in Shared Modules

Shared modules must never configure provider or backend blocks directly. Provider configuration should always be done in root modules.

# Bad example - provider configured inside module
# modules/vpc/main.tf
provider "aws" {
  region = "us-east-1"  # Hardcoded region
}

resource "aws_vpc" "main" {
  cidr_block = var.cidr_block
}

# Good example - provider configured in root module
# environments/prod/main.tf
provider "aws" {
  region = "us-east-1"
}

module "vpc" {
  source     = "../../modules/vpc"
  cidr_block = "10.0.0.0/16"
}
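
A shared module still has to declare which providers it expects, even though it does not configure them. A minimal sketch, assuming a hypothetical modules/vpc/versions.tf with illustrative version bounds (neither the file contents nor the bounds come from the original):

```hcl
# modules/vpc/versions.tf (hypothetical contents)
terraform {
  # Illustrative lower bound; pick the oldest version the module is tested with.
  required_version = ">= 1.5.0"

  required_providers {
    aws = {
      source  = "hashicorp/aws"
      version = ">= 5.0" # declares a requirement, does not configure the provider
    }
  }
}
```

Region, credentials, and default tags stay in the root module's provider block, so the same module can be reused in any account or region.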

4. Mandatory Output Values

Define at least one output for every resource created by a module. Without outputs, callers have no way to reference what the module created, so modules cannot be composed and Terraform has no references from which to infer dependencies between them.

Module Design Patterns

1. Composition Pattern

The Composition pattern combines small, focused modules to build complex infrastructure. It applies the software engineering principle of "Composition over Inheritance" to infrastructure code and is the most recommended pattern.

# environments/prod/main.tf - Composition Pattern
module "vpc" {
  source     = "../../modules/networking/vpc"
  cidr_block = "10.0.0.0/16"
  azs        = ["us-east-1a", "us-east-1b", "us-east-1c"]
}

module "security_group" {
  source = "../../modules/networking/security-group"
  vpc_id = module.vpc.vpc_id

  ingress_rules = [
    {
      port        = 443
      protocol    = "tcp"
      cidr_blocks = ["0.0.0.0/0"]
    }
  ]
}

module "eks" {
  source            = "../../modules/compute/eks"
  vpc_id            = module.vpc.vpc_id
  subnet_ids        = module.vpc.private_subnet_ids
  security_group_id = module.security_group.sg_id
  cluster_version   = "1.31"
}

module "rds" {
  source            = "../../modules/database/rds"
  vpc_id            = module.vpc.vpc_id
  subnet_ids        = module.vpc.database_subnet_ids
  security_group_id = module.security_group.sg_id
  engine            = "postgres"
  engine_version    = "16.4"
}

Each module can be independently tested, versioned, and reused. Data flows between modules through output values.

2. Facade Pattern

The Facade pattern hides complex internal implementation and provides consumers with a simple interface. Like a TV remote control, a single button (variable) controls complex internal operations (multiple resource creation).

# modules/platform/main.tf - Facade Pattern
variable "environment" {
  type = string
}

variable "app_name" {
  type = string
}

variable "instance_type" {
  type    = string
  default = "t3.medium"
}

# Internally composes multiple sub-modules
module "networking" {
  source      = "../networking/vpc"
  cidr_block  = var.environment == "prod" ? "10.0.0.0/16" : "10.1.0.0/16"
  environment = var.environment
}

module "compute" {
  source        = "../compute/eks"
  vpc_id        = module.networking.vpc_id
  subnet_ids    = module.networking.private_subnet_ids
  instance_type = var.instance_type
  cluster_name  = "cluster-name-placeholder"
}

module "monitoring" {
  source     = "../observability/cloudwatch"
  cluster_id = module.compute.cluster_id
  alarm_sns  = module.compute.alarm_topic_arn
}

# Consumer uses it simply
# environments/prod/main.tf
module "platform" {
  source        = "../../modules/platform"
  environment   = "prod"
  app_name      = "my-service"
  instance_type = "m5.xlarge"
}
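
A facade usually needs to re-export a few values from its sub-modules, otherwise consumers cannot build on top of it. A short sketch, assuming a hypothetical modules/platform/outputs.tf; it only reuses sub-module outputs that the code above already references:

```hcl
# modules/platform/outputs.tf (hypothetical file)
output "vpc_id" {
  value       = module.networking.vpc_id
  description = "ID of the VPC created by the networking sub-module"
}

output "cluster_id" {
  value       = module.compute.cluster_id
  description = "ID of the cluster created by the compute sub-module"
}
```

Keeping this list short is part of the pattern: every re-exported value becomes part of the facade's public interface.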

3. Factory Pattern

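The Factory pattern uses for_each to create identical resource structures in bulk from data-driven configuration. One caveat: a module called with for_each cannot contain its own provider blocks, so every instance uses the provider configurations passed in by the caller, and the region value below is only an input variable unless the child module's resources are wired to a matching per-region provider.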

# modules/multi-region/main.tf - Factory Pattern
variable "regions" {
  type = map(object({
    cidr_block    = string
    instance_type = string
    replicas      = number
  }))
}

module "regional_stack" {
  source   = "../regional-stack"
  for_each = var.regions

  region        = each.key
  cidr_block    = each.value.cidr_block
  instance_type = each.value.instance_type
  replicas      = each.value.replicas
}

# Usage example
module "global_infra" {
  source = "../../modules/multi-region"

  regions = {
    "us-east-1" = {
      cidr_block    = "10.0.0.0/16"
      instance_type = "m5.xlarge"
      replicas      = 3
    }
    "eu-west-1" = {
      cidr_block    = "10.1.0.0/16"
      instance_type = "m5.large"
      replicas      = 2
    }
  }
}
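
Because a module with for_each becomes a map of instances, its outputs are typically collected with a for expression. A sketch under the assumption that the regional-stack module exposes a vpc_id output (that output is not shown in the original):

```hcl
# modules/multi-region/outputs.tf (hypothetical file)
output "vpc_ids" {
  description = "Map of region name to the VPC ID created for that region"
  # module.regional_stack is keyed by the same keys as var.regions
  value = { for region, stack in module.regional_stack : region => stack.vpc_id }
}
```

A caller could then read a single entry with module.global_infra.vpc_ids["us-east-1"].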

Variable Design and Output Strategy

Variable Design Guidelines

Effective variable design determines the reusability and stability of your modules.

# modules/vpc/variables.tf
variable "cidr_block" {
  type        = string
  description = "VPC CIDR block (e.g., 10.0.0.0/16)"

  validation {
    condition     = can(cidrnetmask(var.cidr_block))
    error_message = "Must be a valid CIDR block."
  }
}

variable "environment" {
  type        = string
  description = "Environment name (dev, staging, prod)"

  validation {
    condition     = contains(["dev", "staging", "prod"], var.environment)
    error_message = "Environment must be dev, staging, or prod."
  }
}

variable "enable_nat_gateway" {
  type        = bool
  default     = true
  description = "Whether to create NAT Gateways for private subnets"
}

variable "tags" {
  type        = map(string)
  default     = {}
  description = "Additional tags to apply to all resources"
}

Key Principle: Expose only values that should vary across environments (CIDR ranges, instance sizes, names, timeouts) as variables. Encapsulate internal implementation details (IAM policy structures, logging configurations, tagging schemes) inside the module.
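
Related settings are often easier to expose as one object than as several loose variables. The following is a sketch only: the node_pool variable and its attributes are hypothetical, and optional attributes with defaults assume Terraform 1.3 or later.

```hcl
variable "node_pool" {
  description = "Sizing for the worker node pool (hypothetical example)"

  type = object({
    instance_type = string
    min_size      = optional(number, 1) # default applied when omitted
    max_size      = optional(number, 3)
  })

  validation {
    condition     = var.node_pool.min_size <= var.node_pool.max_size
    error_message = "min_size must not exceed max_size."
  }
}
```

Callers then only set what differs per environment, for example node_pool = { instance_type = "m5.large", max_size = 6 }.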

Output Design

# modules/vpc/outputs.tf
output "vpc_id" {
  value       = aws_vpc.main.id
  description = "The ID of the VPC"
}

output "private_subnet_ids" {
  value       = aws_subnet.private[*].id
  description = "List of private subnet IDs"
}

output "database_subnet_ids" {
  value       = aws_subnet.database[*].id
  description = "List of database subnet IDs"
}

output "nat_gateway_ips" {
  value       = aws_eip.nat[*].public_ip
  description = "Elastic IPs of NAT Gateways"
}

Remote State Management

S3 + DynamoDB Backend (AWS)

The most widely used remote state configuration in AWS environments. S3 stores the state file and DynamoDB provides state locking. Note that Terraform's S3 backend is moving from DynamoDB-based locking to S3-native locking: use_lockfile = true was introduced in Terraform 1.10, and dynamodb_table is deprecated as of Terraform 1.11, so check which option your version supports.

# backend.tf - S3 + DynamoDB remote state configuration
terraform {
  backend "s3" {
    bucket         = "my-company-terraform-state"
    key            = "prod/networking/terraform.tfstate"
    region         = "us-east-1"
    encrypt        = true
    dynamodb_table = "terraform-state-lock"
    # use_lockfile = true  # S3 native locking (Terraform 1.10+)
  }
}
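
For teams already on a recent Terraform release, the same backend can be written without the DynamoDB table. A sketch assuming Terraform 1.10 or later and the bucket name used above:

```hcl
# backend.tf - S3 backend with S3-native locking (no DynamoDB table)
terraform {
  required_version = ">= 1.10.0"

  backend "s3" {
    bucket       = "my-company-terraform-state"
    key          = "prod/networking/terraform.tfstate"
    region       = "us-east-1"
    encrypt      = true
    use_lockfile = true # lock is stored as an object next to the state file
  }
}
```

Changing backend settings requires re-running terraform init, and it is safest to make the switch while no other run is in progress.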

State bucket bootstrap script:

#!/bin/bash
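# bootstrap-backend.sh - Create state storage infrastructure

# Stop at the first failing command so the success message is only printed
# when every step worked
set -euo pipefail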

BUCKET_NAME="my-company-terraform-state"
DYNAMODB_TABLE="terraform-state-lock"
REGION="us-east-1"

# Create S3 bucket
aws s3api create-bucket \
  --bucket "$BUCKET_NAME" \
  --region "$REGION"

# Enable versioning
aws s3api put-bucket-versioning \
  --bucket "$BUCKET_NAME" \
  --versioning-configuration Status=Enabled

# Block public access
aws s3api put-public-access-block \
  --bucket "$BUCKET_NAME" \
  --public-access-block-configuration \
    BlockPublicAcls=true,IgnorePublicAcls=true,BlockPublicPolicy=true,RestrictPublicBuckets=true

# Configure KMS encryption
aws s3api put-bucket-encryption \
  --bucket "$BUCKET_NAME" \
  --server-side-encryption-configuration '{
    "Rules": [{"ApplyServerSideEncryptionByDefault": {"SSEAlgorithm": "aws:kms"}}]
  }'

# Create DynamoDB table for state locking
aws dynamodb create-table \
  --table-name "$DYNAMODB_TABLE" \
  --attribute-definitions AttributeName=LockID,AttributeType=S \
  --key-schema AttributeName=LockID,KeyType=HASH \
  --billing-mode PAY_PER_REQUEST \
  --region "$REGION"

echo "Backend infrastructure created successfully"

GCS Backend (Google Cloud)

terraform {
  backend "gcs" {
    bucket = "my-company-tf-state"
    prefix = "prod/networking"
  }
}

Terraform Cloud / HCP Terraform

terraform {
  cloud {
    organization = "my-company"

    workspaces {
      name = "prod-networking"
    }
  }
}

Remote State Data Source (Cross-Stack References)

To reference outputs from one stack in another, use the terraform_remote_state data source.

# Referencing networking stack state from compute stack
data "terraform_remote_state" "networking" {
  backend = "s3"

  config = {
    bucket = "my-company-terraform-state"
    key    = "prod/networking/terraform.tfstate"
    region = "us-east-1"
  }
}

resource "aws_instance" "app" {
  ami           = "ami-0c55b159cbfafe1f0"
  instance_type = "t3.medium"
  subnet_id     = data.terraform_remote_state.networking.outputs.private_subnet_ids[0]
}
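
terraform_remote_state only exposes outputs declared in the other stack's root module, so the networking stack has to publish the value explicitly. A sketch, assuming a hypothetical outputs.tf in the networking stack whose root module calls a module named vpc:

```hcl
# environments/prod/networking/outputs.tf (hypothetical path)
output "private_subnet_ids" {
  # Re-export the child module's output at root level; nested module
  # outputs are not visible through terraform_remote_state.
  value       = module.vpc.private_subnet_ids
  description = "Private subnet IDs consumed by the compute stack"
}
```

Treat these root outputs as a contract between stacks: renaming or removing one breaks every consumer at its next plan.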

Workspace Strategy vs Directory Separation

Workspace Approach

Terraform workspaces share the same .tf files while maintaining independent state files per environment.

# Create and switch workspaces
terraform workspace new dev
terraform workspace new staging
terraform workspace new prod
terraform workspace select prod

# Check current workspace
terraform workspace show

Referencing workspaces in HCL:

resource "aws_instance" "app" {
  instance_type = terraform.workspace == "prod" ? "m5.xlarge" : "t3.medium"

  tags = {
    Environment = terraform.workspace
  }
}
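
Ternaries on terraform.workspace become hard to read once there are more than two environments. One alternative, sketched here with a hypothetical lookup table and a hypothetical ami_id variable, is to keep per-workspace settings in a local map:

```hcl
variable "ami_id" {
  type        = string
  description = "AMI to launch (hypothetical input)"
}

locals {
  # Per-workspace settings in one place instead of scattered ternaries.
  instance_type_by_workspace = {
    dev     = "t3.medium"
    staging = "t3.medium"
    prod    = "m5.xlarge"
  }

  # Unknown workspaces (for example short-lived feature environments)
  # fall back to the smallest size.
  instance_type = lookup(local.instance_type_by_workspace, terraform.workspace, "t3.medium")
}

resource "aws_instance" "app" {
  ami           = var.ami_id
  instance_type = local.instance_type

  tags = {
    Environment = terraform.workspace
  }
}
```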

Directory Separation Approach

infrastructure/
  modules/
    vpc/
    eks/
    rds/
  environments/
    dev/
      main.tf
      terraform.tfvars
      backend.tf
    staging/
      main.tf
      terraform.tfvars
      backend.tf
    prod/
      main.tf
      terraform.tfvars
      backend.tf
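
To make the layout concrete, the files in one environment directory might look roughly like this. It is a sketch only: the per-environment bucket name, the state key, and the vpc_cidr_block variable are assumptions for illustration.

```hcl
# environments/dev/backend.tf
terraform {
  backend "s3" {
    bucket  = "my-company-terraform-state-dev" # hypothetical per-environment bucket
    key     = "dev/terraform.tfstate"
    region  = "us-east-1"
    encrypt = true
  }
}

# environments/dev/main.tf
provider "aws" {
  region = "us-east-1"
}

variable "vpc_cidr_block" {
  type        = string
  description = "CIDR block for this environment's VPC"
}

module "vpc" {
  source     = "../../modules/vpc"
  cidr_block = var.vpc_cidr_block
}

# environments/dev/terraform.tfvars would then contain:
# vpc_cidr_block = "10.1.0.0/16"
```

Because each environment has its own backend block, access to the prod bucket can be restricted independently of dev and staging.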

Workspaces vs Directories Comparison

Criteria	Workspaces	Directory Separation
Code Duplication	None (shared code)	Some duplication
Environment Isolation	Weak (same backend)	Strong (separate backends)
IAM Permission Separation	Difficult	Per-environment configuration
Blast Radius	Wide (shared code)	Narrow (independent)
Operational Complexity	Low	Medium
Best Suited For	Ephemeral environments, testing	Production environments

Recommendation: Use directory separation for production environments and workspaces for short-lived test environments. Many successful teams combine both approaches.

GitOps Automation with Atlantis

What is Atlantis?

Atlantis is a GitOps tool that automates Terraform plan and apply through pull request workflows. When a developer opens an infrastructure change PR, Atlantis automatically runs terraform plan and posts the results as a PR comment. Once reviewers approve, the changes can be applied with an atlantis apply comment.

Key Benefits

Consistent execution environment: All Terraform operations run on a dedicated server, eliminating "works on my machine" problems
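Automatic project locking: After a plan runs, Atlantis holds its own lock on the project (directory and workspace) until the PR is merged or closed or the lock is discarded, so a second PR cannot plan the same project in the meantime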
Code review integration: Plan results are visible directly in the PR, ensuring visibility of infrastructure changes
Audit logging: All changes are recorded in PR history

atlantis.yaml Configuration

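# atlantis.yaml - located at repository root
# Note: setting apply_requirements here only takes effect if the server-side
# repo config lists it under allowed_overrides.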
version: 3
automerge: false
parallel_plan: true
parallel_apply: false

projects:
  - name: prod-networking
    dir: environments/prod/networking
    workspace: default
    terraform_version: v1.9.0
    autoplan:
      when_modified:
        - '*.tf'
        - '*.tfvars'
        - '../../../modules/networking/**/*.tf'
      enabled: true
    apply_requirements:
      - approved
      - mergeable

  - name: prod-compute
    dir: environments/prod/compute
    workspace: default
    terraform_version: v1.9.0
    autoplan:
      when_modified:
        - '*.tf'
        - '*.tfvars'
        - '../../../modules/compute/**/*.tf'
      enabled: true
    apply_requirements:
      - approved
      - mergeable

  - name: dev-networking
    dir: environments/dev/networking
    workspace: default
    terraform_version: v1.9.0
    autoplan:
      when_modified:
        - '*.tf'
        - '*.tfvars'
      enabled: true

Custom Atlantis Workflows

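# atlantis.yaml - custom workflow
# Projects opt in with `workflow: custom`. Defining workflows in the repo-level
# file requires allow_custom_workflows: true in the server-side repo config.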
workflows:
  custom:
    plan:
      steps:
        - run: terraform fmt -check -recursive
        - run: tflint --init
        - run: tflint
        - init
        - plan
    apply:
      steps:
        - apply

Module Versioning and Registry

Semantic Versioning

Terraform modules should follow Semantic Versioning (SemVer):

Major version bump: Adding required input variables, removing outputs -- breaking changes
Minor version bump: Adding optional input variables, new outputs
Patch version bump: Bug fixes, documentation updates

# Specifying version constraints
module "vpc" {
  source  = "terraform-aws-modules/vpc/aws"
  version = "~> 5.0"  # Latest within 5.x range
}

module "eks" {
  source  = "git::https://github.com/my-org/terraform-aws-eks.git?ref=v3.2.1"
}

Private Module Registry

Use Terraform Cloud or a self-hosted registry to manage internal modules.

# Using Terraform Cloud Private Registry
module "vpc" {
  source  = "app.terraform.io/my-org/vpc/aws"
  version = "2.1.0"
}

Comparison Tables

State Backend Comparison

Feature	S3 + DynamoDB	GCS	Terraform Cloud	Azure Blob
State Locking	DynamoDB / S3 Native	Built-in	Built-in	Blob Lease
Encryption	SSE-S3 or SSE-KMS	Google-managed keys or Cloud KMS	Included	Microsoft-managed keys or Azure Key Vault
Versioning	S3 Versioning	Object Versioning	Included	Blob Snapshots
Access Control	IAM Policy	IAM	Teams/RBAC	Azure RBAC
Cost	S3 + DynamoDB billing	GCS billing	Free tier limited	Blob billing
Setup Difficulty	Medium	Low	Low	Medium

IaC Tool Comparison

Feature	Terraform/OpenTofu	Pulumi	Crossplane	CloudFormation
Language	HCL	TypeScript/Python/Go	YAML/CRD	JSON/YAML
State Management	External backend required	Self-managed/external	Kubernetes etcd	AWS managed
Multi-Cloud	Excellent	Excellent	Excellent	AWS only
Learning Curve	Medium	Low (existing languages)	High	Low (AWS users)
Community	Very large	Growing	Growing	AWS ecosystem
Drift Detection	Manual via plan	Manual via preview	Automatic (reconciliation)	Built-in (run on demand)

Failure Cases and Recovery Procedures

Case 1: State Lock Conflict

Symptom: terraform plan or apply fails with "Error acquiring the state lock"

Cause: A previous Terraform operation terminated abnormally (network disconnection, CI runner timeout, Ctrl+C forced termination) and the lock was not released

Recovery procedure:

# 1. Check lock status - verify no other users are running operations
# Find the Lock ID from the error message

# 2. After confirming no other operations are running, force release
terraform force-unlock LOCK_ID

# 3. Force release without confirmation (caution: verify no operations running)
terraform force-unlock -force LOCK_ID

Prevention measures:

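Set CI/CD job timeouts longer than the slowest expected apply, so runners are not killed mid-operation while holding a lock
Serialize runs against the same state with pipeline concurrency controls, and pass -lock-timeout so a run waits for a lock instead of failing immediately
Use Atlantis for PR-based automatic locking to prevent conflicts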

Case 2: State Drift

Symptom: terraform plan shows unexpected changes. Resources manually modified in the console are inconsistent with Terraform state

Recovery procedure:

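# 1. Review drift between state and real infrastructure without proposing config changes
terraform plan -refresh-only

# 2. Accept the detected drift into state (replaces the deprecated `terraform refresh`)
terraform apply -refresh-only

# 3. Either update code to match manual changes (importing unmanaged resources) or revert
terraform apply  # Restore infrastructure to match code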
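
When the drift is a resource that was created by hand and should now be managed by Terraform, an import block (available since Terraform 1.5) is one way to adopt it. The sketch below is only an illustration: the resource address, the security group ID, and the vpc_id variable are placeholders assumed for the example.

```hcl
# Adopt a manually created security group into state.
# The address and ID below are placeholders, not values from this guide.
variable "vpc_id" {
  type        = string
  description = "VPC that contains the manually created security group"
}

import {
  to = aws_security_group.manual_sg
  id = "sg-0123456789abcdef0"
}

resource "aws_security_group" "manual_sg" {
  name   = "manual-sg"
  vpc_id = var.vpc_id

  # Optional: tolerate out-of-band tag edits instead of reporting them as drift
  lifecycle {
    ignore_changes = [tags]
  }
}
```

Review the resulting plan before applying: if the arguments in the resource block differ from the real resource, Terraform proposes changes (or a replacement) in the same run as the import.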

Case 3: Circular Dependencies

Symptom: "Cycle" error during terraform plan

Cause: Module A references outputs from Module B, and Module B references outputs from Module A

Solutions:

Extract common dependencies into a separate module
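Audit depends_on declarations: explicit dependencies add edges to the graph and can themselves create a cycle, so keep only the ones that are strictly needed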
Switch to indirect references using data sources
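
A common resource-level version of this problem is two security groups that must allow traffic from each other. The sketch below shows one way to break such a cycle by moving the cross-references into standalone rule resources. The group names, ports, and the vpc_id variable are assumptions for illustration.

```hcl
# Two security groups that must allow traffic from each other.
# Inline ingress rules would make each group depend on the other;
# standalone rule resources keep the groups themselves independent.

variable "vpc_id" {
  type        = string
  description = "VPC in which both groups are created"
}

resource "aws_security_group" "app" {
  name   = "app"
  vpc_id = var.vpc_id
}

resource "aws_security_group" "db" {
  name   = "db"
  vpc_id = var.vpc_id
}

# Each rule depends on both groups; neither group depends on the other
resource "aws_vpc_security_group_ingress_rule" "db_from_app" {
  security_group_id            = aws_security_group.db.id
  referenced_security_group_id = aws_security_group.app.id
  ip_protocol                  = "tcp"
  from_port                    = 5432
  to_port                      = 5432
}

resource "aws_vpc_security_group_ingress_rule" "app_from_db" {
  security_group_id            = aws_security_group.app.id
  referenced_security_group_id = aws_security_group.db.id
  ip_protocol                  = "tcp"
  from_port                    = 8080
  to_port                      = 8080
}
```

The same idea applies at the module level: the shared pieces (here, the two groups) live in one place, and whatever ties them together is declared separately so that the dependency graph only points one way.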

Case 4: Large State File Performance Degradation

Symptom: terraform plan takes over 10 minutes, API rate limiting occurs

Solutions:

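# Target specific modules for plan/apply (a stopgap, not a routine workflow)
terraform plan -target=module.eks
terraform apply -target=module.eks

# Split state file: move module.monitoring into its own state file.
# -state and -state-out operate on local files, so pull the remote state first,
# then push each resulting file to its backend with terraform state push
# and confirm with terraform plan that both sides show no changes.
terraform state pull > source.tfstate
terraform state mv -state=source.tfstate -state-out=monitoring.tfstate module.monitoring module.monitoring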

Root cause fix: Split state files by component to reduce individual state file size. Manage networking, compute, database, and monitoring as separate state files, using terraform_remote_state data sources for cross-references.
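
Once state is split, a component can read another component's outputs through a terraform_remote_state data source. The following is a minimal sketch under stated assumptions: the bucket name, state key, the private_subnet_ids output, and the ami_id variable are all hypothetical.

```hcl
# environments/prod/compute/main.tf (hypothetical path and names)

variable "ami_id" {
  type        = string
  description = "AMI used for the example instance"
}

data "terraform_remote_state" "networking" {
  backend = "s3"

  config = {
    bucket = "my-org-terraform-state"            # placeholder bucket name
    key    = "prod/networking/terraform.tfstate" # placeholder state key
    region = "ap-northeast-2"
  }
}

resource "aws_instance" "app" {
  ami           = var.ami_id
  instance_type = "t3.micro"

  # Assumes the networking root module declares an output named private_subnet_ids
  subnet_id = data.terraform_remote_state.networking.outputs.private_subnet_ids[0]
}
```

Only root-module outputs of the networking state are visible this way, so the networking component has to export every value that other components need.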

Operational Checklists

Module Design Checklist

Does each module follow the single responsibility principle?
Are output values defined for all resources?
Do variables include type, description, and validation?
Are provider and backend configured only in root modules?
Does the module include README.md and an examples directory?
Are semantic version tags maintained?

State Management Checklist

Is a remote backend configured (no local state files)?
Is state locking enabled?
Are state files encrypted?
Is S3 bucket versioning enabled?
Is public access blocked?
Are state files separated by component?

Atlantis / CI-CD Checklist

Is atlantis.yaml configured at the repository root?
Are approved + mergeable requirements set before apply?
Do dependent projects auto-plan when modules change?
Are webhook secrets securely managed?
Is credential rotation performed periodically?

Conclusion

Terraform module design and state management matter more with every resource and team that the infrastructure code takes on. Composing small modules with the Composition pattern, separating remote state by component, and automating GitOps workflows with Atlantis together form a mature operational model as of 2026.

The key principle is "start small and split when needed." Rather than attempting to design a perfect module structure from the beginning, start with a single module and refactor when duplication arises, split state files when they grow too large, and introduce Atlantis when the team expands. This incremental approach is the most practical path forward.

Infrastructure code should be subject to the same engineering disciplines as application code: code review, testing, version control, and CI/CD. The patterns and tools presented in this guide are meant to provide practical help on that journey.