Complete Guide to Terraform Module Design Patterns: State Management, Workspaces, and Atlantis Automation

Introduction

In the Infrastructure as Code (IaC) ecosystem, Terraform has established itself as the de facto standard tool for managing multi-cloud environments. However, as Terraform projects grow in scale, the complexity of module design, state management, and team collaboration workflows increases exponentially.

The early approach of listing hundreds of resources in a single main.tf file quickly degenerates into unmaintainable "spaghetti infrastructure." Even modularized code suffers when everything shares a single state file -- terraform plan can take over 10 minutes, and state lock conflicts between team members become frequent.

This guide covers three core Terraform module design patterns (Composition, Facade, Factory) with real HCL code examples, remote state management (S3+DynamoDB, GCS, Terraform Cloud), workspace strategies, and GitOps automation with Atlantis. It also includes failure cases encountered in production operations -- state lock conflicts, drift detection, and circular dependencies -- along with recovery procedures.

Terraform Module Structure and Design Principles

Module Directory Structure

A well-designed Terraform module follows a clear file structure. Based on HashiCorp official guidelines and Google Cloud Best Practices, the standard structure is:

modules/
  networking/
    main.tf          # Core resource definitions
    variables.tf     # Input variable declarations
    outputs.tf       # Output value definitions
    versions.tf      # Provider/terraform version constraints
    README.md        # Module usage documentation
    examples/
      simple/
        main.tf      # Simple usage example
      complete/
        main.tf      # Full-featured usage example
    tests/
      networking_test.go  # Terratest tests

Core Design Principles

1. Single Responsibility Principle

Each module should handle exactly one logical function. As HashiCorp states, "If a module's function or purpose is hard to explain, the module is probably too complex."

2. Loose Coupling

Minimize direct dependencies between modules. If running terraform plan reveals that a change in one module unexpectedly alters the state of several others, that is a signal of excessive coupling.
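
As a concrete sketch of the difference (module and resource names here are illustrative): a loosely coupled module accepts identifiers as explicit inputs, while a tightly coupled one reaches into resources it does not own.

```hcl
# Loose coupling: the caller passes the VPC ID in explicitly
variable "vpc_id" {
  type        = string
  description = "VPC to deploy into, supplied by the root module"
}

# Tight coupling (avoid): the module silently looks up resources it does not own,
# so a rename elsewhere breaks this module without any interface change
# data "aws_vpc" "other" {
#   tags = { Name = "some-other-team-vpc" }
# }
```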

3. No Provider Configuration in Shared Modules

Shared modules must never configure provider or backend blocks directly. Provider configuration should always be done in root modules.

# Bad example - provider configured inside module
# modules/vpc/main.tf
provider "aws" {
  region = "us-east-1"  # Hardcoded region
}

resource "aws_vpc" "main" {
  cidr_block = var.cidr_block
}

# Good example - provider configured in root module
# environments/prod/main.tf
provider "aws" {
  region = "us-east-1"
}

module "vpc" {
  source     = "../../modules/vpc"
  cidr_block = "10.0.0.0/16"
}

4. Mandatory Output Values

Define at least one output for every meaningful resource a module creates. Without outputs, other modules cannot reference the resources your module manages, and Terraform cannot infer cross-module dependencies from those references.

Module Design Patterns

1. Composition Pattern

The Composition pattern combines small, focused modules to build complex infrastructure. It applies the software engineering principle of "Composition over Inheritance" to infrastructure code and is the most recommended pattern.

# environments/prod/main.tf - Composition Pattern
module "vpc" {
  source     = "../../modules/networking/vpc"
  cidr_block = "10.0.0.0/16"
  azs        = ["us-east-1a", "us-east-1b", "us-east-1c"]
}

module "security_group" {
  source = "../../modules/networking/security-group"
  vpc_id = module.vpc.vpc_id

  ingress_rules = [
    {
      port        = 443
      protocol    = "tcp"
      cidr_blocks = ["0.0.0.0/0"]
    }
  ]
}

module "eks" {
  source            = "../../modules/compute/eks"
  vpc_id            = module.vpc.vpc_id
  subnet_ids        = module.vpc.private_subnet_ids
  security_group_id = module.security_group.sg_id
  cluster_version   = "1.31"
}

module "rds" {
  source            = "../../modules/database/rds"
  vpc_id            = module.vpc.vpc_id
  subnet_ids        = module.vpc.database_subnet_ids
  security_group_id = module.security_group.sg_id
  engine            = "postgres"
  engine_version    = "16.4"
}

Each module can be independently tested, versioned, and reused. Data flows between modules through output values.

2. Facade Pattern

The Facade pattern hides complex internal implementation and provides consumers with a simple interface. Like a TV remote control, a single button (variable) controls complex internal operations (multiple resource creation).

# modules/platform/main.tf - Facade Pattern
variable "environment" {
  type = string
}

variable "app_name" {
  type = string
}

variable "instance_type" {
  type    = string
  default = "t3.medium"
}

# Internally composes multiple sub-modules
module "networking" {
  source      = "../networking/vpc"
  cidr_block  = var.environment == "prod" ? "10.0.0.0/16" : "10.1.0.0/16"
  environment = var.environment
}

module "compute" {
  source        = "../compute/eks"
  vpc_id        = module.networking.vpc_id
  subnet_ids    = module.networking.private_subnet_ids
  instance_type = var.instance_type
  cluster_name  = "${var.app_name}-${var.environment}"
}

module "monitoring" {
  source     = "../observability/cloudwatch"
  cluster_id = module.compute.cluster_id
  alarm_sns  = module.compute.alarm_topic_arn
}

# Consumer uses it simply
# environments/prod/main.tf
module "platform" {
  source        = "../../modules/platform"
  environment   = "prod"
  app_name      = "my-service"
  instance_type = "m5.xlarge"
}

3. Factory Pattern

The Factory pattern uses for_each to create identical resource structures in bulk based on data-driven configuration.

# modules/multi-region/main.tf - Factory Pattern
variable "regions" {
  type = map(object({
    cidr_block    = string
    instance_type = string
    replicas      = number
  }))
}

module "regional_stack" {
  source   = "../regional-stack"
  for_each = var.regions

  region        = each.key
  cidr_block    = each.value.cidr_block
  instance_type = each.value.instance_type
  replicas      = each.value.replicas
}

# Usage example
module "global_infra" {
  source = "../../modules/multi-region"

  regions = {
    "us-east-1" = {
      cidr_block    = "10.0.0.0/16"
      instance_type = "m5.xlarge"
      replicas      = 3
    }
    "eu-west-1" = {
      cidr_block    = "10.1.0.0/16"
      instance_type = "m5.large"
      replicas      = 2
    }
  }
}

Variable Design and Output Strategy

Variable Design Guidelines

Effective variable design determines the reusability and stability of your modules.

# modules/vpc/variables.tf
variable "cidr_block" {
  type        = string
  description = "VPC CIDR block (e.g., 10.0.0.0/16)"

  validation {
    condition     = can(cidrnetmask(var.cidr_block))
    error_message = "Must be a valid CIDR block."
  }
}

variable "environment" {
  type        = string
  description = "Environment name (dev, staging, prod)"

  validation {
    condition     = contains(["dev", "staging", "prod"], var.environment)
    error_message = "Environment must be dev, staging, or prod."
  }
}

variable "enable_nat_gateway" {
  type        = bool
  default     = true
  description = "Whether to create NAT Gateways for private subnets"
}

variable "tags" {
  type        = map(string)
  default     = {}
  description = "Additional tags to apply to all resources"
}

Key Principle: Expose only values that should vary across environments (CIDR ranges, instance sizes, names, timeouts) as variables. Encapsulate internal implementation details (IAM policy structures, logging configurations, tagging schemes) inside the module.
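
A minimal sketch of this split (variable and tag names are illustrative): the caller can extend the tags, but the tagging scheme itself stays inside the module.

```hcl
# modules/vpc/main.tf (sketch)
locals {
  # Internal tagging scheme: an implementation detail, not a variable
  default_tags = {
    ManagedBy   = "terraform"
    Environment = var.environment
  }

  # Callers can only extend, not restructure, the tags
  tags = merge(local.default_tags, var.tags)
}

resource "aws_vpc" "main" {
  cidr_block = var.cidr_block
  tags       = local.tags
}
```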

Output Design

# modules/vpc/outputs.tf
output "vpc_id" {
  value       = aws_vpc.main.id
  description = "The ID of the VPC"
}

output "private_subnet_ids" {
  value       = aws_subnet.private[*].id
  description = "List of private subnet IDs"
}

output "database_subnet_ids" {
  value       = aws_subnet.database[*].id
  description = "List of database subnet IDs"
}

output "nat_gateway_ips" {
  value       = aws_eip.nat[*].public_ip
  description = "Elastic IPs of NAT Gateways"
}

Remote State Management

S3 + DynamoDB Backend (AWS)

The most widely used remote state configuration in AWS environments: S3 stores the state file while DynamoDB provides state locking. Note that Terraform is transitioning from DynamoDB-based locking to S3-native locking -- from Terraform 1.10 onward, check the use_lockfile = true backend option (the dynamodb_table argument is deprecated in later releases).

# backend.tf - S3 + DynamoDB remote state configuration
terraform {
  backend "s3" {
    bucket         = "my-company-terraform-state"
    key            = "prod/networking/terraform.tfstate"
    region         = "us-east-1"
    encrypt        = true
    dynamodb_table = "terraform-state-lock"
    # use_lockfile = true  # S3 native locking (newer versions)
  }
}

State bucket bootstrap script:

#!/bin/bash
# bootstrap-backend.sh - Create state storage infrastructure

BUCKET_NAME="my-company-terraform-state"
DYNAMODB_TABLE="terraform-state-lock"
REGION="us-east-1"

# Create S3 bucket
aws s3api create-bucket \
  --bucket "$BUCKET_NAME" \
  --region "$REGION"

# Enable versioning
aws s3api put-bucket-versioning \
  --bucket "$BUCKET_NAME" \
  --versioning-configuration Status=Enabled

# Block public access
aws s3api put-public-access-block \
  --bucket "$BUCKET_NAME" \
  --public-access-block-configuration \
    BlockPublicAcls=true,IgnorePublicAcls=true,BlockPublicPolicy=true,RestrictPublicBuckets=true

# Configure KMS encryption
aws s3api put-bucket-encryption \
  --bucket "$BUCKET_NAME" \
  --server-side-encryption-configuration '{
    "Rules": [{"ApplyServerSideEncryptionByDefault": {"SSEAlgorithm": "aws:kms"}}]
  }'

# Create DynamoDB table for state locking
aws dynamodb create-table \
  --table-name "$DYNAMODB_TABLE" \
  --attribute-definitions AttributeName=LockID,AttributeType=S \
  --key-schema AttributeName=LockID,KeyType=HASH \
  --billing-mode PAY_PER_REQUEST \
  --region "$REGION"

echo "Backend infrastructure created successfully"

GCS Backend (Google Cloud)

terraform {
  backend "gcs" {
    bucket = "my-company-tf-state"
    prefix = "prod/networking"
  }
}

Terraform Cloud / HCP Terraform

terraform {
  cloud {
    organization = "my-company"

    workspaces {
      name = "prod-networking"
    }
  }
}

Remote State Data Source (Cross-Stack References)

To reference outputs from one stack in another, use the terraform_remote_state data source.

# Referencing networking stack state from compute stack
data "terraform_remote_state" "networking" {
  backend = "s3"

  config = {
    bucket = "my-company-terraform-state"
    key    = "prod/networking/terraform.tfstate"
    region = "us-east-1"
  }
}

resource "aws_instance" "app" {
  ami           = "ami-0c55b159cbfafe1f0"
  instance_type = "t3.medium"
  subnet_id     = data.terraform_remote_state.networking.outputs.private_subnet_ids[0]
}

Workspace Strategy vs Directory Separation

Workspace Approach

Terraform workspaces share the same .tf files while maintaining independent state files per environment.

# Create and switch workspaces
terraform workspace new dev
terraform workspace new staging
terraform workspace new prod
terraform workspace select prod

# Check current workspace
terraform workspace show

Referencing workspaces in HCL:

resource "aws_instance" "app" {
  instance_type = terraform.workspace == "prod" ? "m5.xlarge" : "t3.medium"

  tags = {
    Environment = terraform.workspace
  }
}
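
Ternary expressions on terraform.workspace get unwieldy once several settings differ per environment. One common alternative (a sketch; the settings map is hypothetical) is a lookup table in locals:

```hcl
locals {
  # Per-workspace settings in one place instead of scattered ternaries
  env_config = {
    dev     = { instance_type = "t3.medium", replicas = 1 }
    staging = { instance_type = "t3.large", replicas = 2 }
    prod    = { instance_type = "m5.xlarge", replicas = 3 }
  }
  config = local.env_config[terraform.workspace]
}

resource "aws_instance" "app" {
  instance_type = local.config.instance_type

  tags = {
    Environment = terraform.workspace
  }
}
```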

Directory Separation Approach

infrastructure/
  modules/
    vpc/
    eks/
    rds/
  environments/
    dev/
      main.tf
      terraform.tfvars
      backend.tf
    staging/
      main.tf
      terraform.tfvars
      backend.tf
    prod/
      main.tf
      terraform.tfvars
      backend.tf

Workspaces vs Directories Comparison

| Criteria | Workspaces | Directory Separation |
| --- | --- | --- |
| Code duplication | None (shared code) | Some duplication |
| Environment isolation | Weak (same backend) | Strong (separate backends) |
| IAM permission separation | Difficult | Per-environment configuration |
| Blast radius | Wide (shared code) | Narrow (independent) |
| Operational complexity | Low | Medium |
| Best suited for | Ephemeral environments, testing | Production environments |

Recommendation: Use directory separation for production environments and workspaces for short-lived test environments. Many successful teams combine both approaches.

GitOps Automation with Atlantis

What is Atlantis?

Atlantis is a GitOps tool that automates Terraform plan and apply through pull request workflows. When a developer opens an infrastructure change PR, Atlantis automatically runs terraform plan and posts the results as a PR comment. Once reviewers approve, the changes can be applied with an atlantis apply comment.

Key Benefits

  • Consistent execution environment: All Terraform operations run on a dedicated server, eliminating "works on my machine" problems
  • Automatic state locking: While a PR is open, Atlantis locks the corresponding project state file to prevent concurrent modifications
  • Code review integration: Plan results are visible directly in the PR, ensuring visibility of infrastructure changes
  • Audit logging: All changes are recorded in PR history
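
In practice, the workflow is driven by pull request comments. The common commands look like this (the project name follows the atlantis.yaml configuration shown below in this guide):

```
# Posted as pull request comments:
atlantis plan                      # plan all affected projects
atlantis plan -p prod-networking   # plan a single project
atlantis apply -p prod-networking  # apply after approval
atlantis unlock                    # discard plans and release this PR's locks
```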

atlantis.yaml Configuration

# atlantis.yaml - located at repository root
version: 3
automerge: false
parallel_plan: true
parallel_apply: false

projects:
  - name: prod-networking
    dir: environments/prod/networking
    workspace: default
    terraform_version: v1.9.0
    autoplan:
      when_modified:
        - '*.tf'
        - '*.tfvars'
        - '../../../modules/networking/**/*.tf'
      enabled: true
    apply_requirements:
      - approved
      - mergeable

  - name: prod-compute
    dir: environments/prod/compute
    workspace: default
    terraform_version: v1.9.0
    autoplan:
      when_modified:
        - '*.tf'
        - '*.tfvars'
        - '../../../modules/compute/**/*.tf'
      enabled: true
    apply_requirements:
      - approved
      - mergeable

  - name: dev-networking
    dir: environments/dev/networking
    workspace: default
    terraform_version: v1.9.0
    autoplan:
      when_modified:
        - '*.tf'
        - '*.tfvars'
      enabled: true

Custom Atlantis Workflows

# atlantis.yaml - custom workflow
workflows:
  custom:
    plan:
      steps:
        - run: terraform fmt -check -recursive
        - run: tflint --init
        - run: tflint
        - init
        - plan
    apply:
      steps:
        - apply

# Projects opt in to the workflow explicitly:
# projects:
#   - name: prod-networking
#     workflow: custom

Note that repo-defined workflows only take effect if the server-side repo configuration permits them (allow_custom_workflows: true).

Module Versioning and Registry

Semantic Versioning

Terraform modules should follow Semantic Versioning (SemVer):

  • Major version bump: Adding required input variables, removing outputs -- breaking changes
  • Minor version bump: Adding optional input variables, new outputs
  • Patch version bump: Bug fixes, documentation updates

# Specifying version constraints
module "vpc" {
  source  = "terraform-aws-modules/vpc/aws"
  version = "~> 5.0"  # Latest within 5.x range
}

module "eks" {
  source  = "git::https://github.com/my-org/terraform-aws-eks.git?ref=v3.2.1"
}

Private Module Registry

Use Terraform Cloud or a self-hosted registry to manage internal modules.

# Using Terraform Cloud Private Registry
module "vpc" {
  source  = "app.terraform.io/my-org/vpc/aws"
  version = "2.1.0"
}

Comparison Tables

State Backend Comparison

| Feature | S3 + DynamoDB | GCS | Terraform Cloud | Azure Blob |
| --- | --- | --- | --- | --- |
| State locking | DynamoDB / S3 native | Built-in | Built-in | Blob lease |
| Encryption | KMS | Google KMS | Included | Azure Key Vault |
| Versioning | S3 versioning | Object versioning | Included | Blob snapshots |
| Access control | IAM policy | IAM | Teams/RBAC | Azure RBAC |
| Cost | S3 + DynamoDB billing | GCS billing | Free tier limited | Blob billing |
| Setup difficulty | Medium | Low | Low | Medium |

IaC Tool Comparison

| Feature | Terraform/OpenTofu | Pulumi | Crossplane | CloudFormation |
| --- | --- | --- | --- | --- |
| Language | HCL | TypeScript/Python/Go | YAML/CRD | JSON/YAML |
| State management | External backend required | Self-managed/external | Kubernetes etcd | AWS managed |
| Multi-cloud | Excellent | Excellent | Excellent | AWS only |
| Learning curve | Medium | Low (existing languages) | High | Low (AWS users) |
| Community | Very large | Growing | Growing | AWS ecosystem |
| Drift detection | Manual via plan | Manual via preview | Automatic (reconciliation) | Drift Detection |

Failure Cases and Recovery Procedures

Case 1: State Lock Conflict

Symptom: "Error acquiring the state lock" error when running terraform plan or apply

Cause: A previous Terraform operation terminated abnormally (network disconnection, CI runner timeout, Ctrl+C forced termination) and the lock was not released

Recovery procedure:

# 1. Verify that no other user or pipeline is actually mid-operation.
#    The Lock ID appears in the error message; with the DynamoDB backend you
#    can also inspect the lock table directly:
aws dynamodb scan --table-name terraform-state-lock

# 2. After confirming nothing is running, force-release the lock
terraform force-unlock LOCK_ID

# 3. Skip the interactive confirmation (extra caution: re-verify step 1 first)
terraform force-unlock -force LOCK_ID

Prevention measures:

  • Set appropriate timeouts in CI/CD pipelines (e.g., terraform plan -lock-timeout=5m waits for the lock instead of failing immediately)
  • Implement concurrency controls so at most one pipeline touches a given state file
  • Use Atlantis for PR-based automatic locking to prevent conflicts

Case 2: State Drift

Symptom: terraform plan shows unexpected changes. Resources manually modified in the console are inconsistent with Terraform state

Recovery procedure:

# 1. Inspect drift without changing anything (terraform refresh is deprecated
#    in favor of refresh-only plans)
terraform plan -refresh-only

# 2. If the manual changes should be kept, accept them into the state file
terraform apply -refresh-only
# Resources created entirely outside Terraform can be adopted with terraform import

# 3. Otherwise, revert the infrastructure to match the code
terraform apply

Case 3: Circular Dependencies

Symptom: "Cycle" error during terraform plan

Cause: Module A references outputs from Module B, and Module B references outputs from Module A

Solutions:

  • Extract common dependencies into a separate module that both sides consume
  • Restructure so data flows in one direction -- pass values down from the root module instead of cross-referencing (depends_on only adds edges to the graph; it cannot break a cycle)
  • Switch to indirect references using data sources
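
A sketch of the first approach (module names are illustrative): instead of modules a and b referencing each other, the contested value moves into a third module that both consume, so the dependency graph becomes a tree.

```hcl
# Before (cycle): module.a needs module.b.id and module.b needs module.a.id

# After: the shared resource lives in its own module
module "shared" {
  source = "../modules/shared-config"
}

module "a" {
  source    = "../modules/a"
  common_id = module.shared.id
}

module "b" {
  source    = "../modules/b"
  common_id = module.shared.id
}
```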

Case 4: Large State File Performance Degradation

Symptom: terraform plan takes over 10 minutes, API rate limiting occurs

Solutions:

# Target specific modules for plan/apply (a temporary workaround, not a fix)
terraform plan -target=module.eks
terraform apply -target=module.eks

# Move resources out into a separate state file
terraform state mv \
  -state-out=../monitoring/terraform.tfstate \
  module.monitoring module.monitoring

Root cause fix: Split state files by component to reduce individual state file size. Manage networking, compute, database, and monitoring as separate state files, using terraform_remote_state data sources for cross-references.

Operational Checklists

Module Design Checklist

  • Does each module follow the single responsibility principle?
  • Are output values defined for all resources?
  • Do variables include type, description, and validation?
  • Are provider and backend configured only in root modules?
  • Does the module include README.md and an examples directory?
  • Are semantic version tags maintained?

State Management Checklist

  • Is a remote backend configured (no local state files)?
  • Is state locking enabled?
  • Are state files encrypted?
  • Is S3 bucket versioning enabled?
  • Is public access blocked?
  • Are state files separated by component?

Atlantis / CI-CD Checklist

  • Is atlantis.yaml configured at the repository root?
  • Are approved + mergeable requirements set before apply?
  • Do dependent projects auto-plan when modules change?
  • Are webhook secrets securely managed?
  • Is credential rotation performed periodically?

Conclusion

Terraform module design and state management become exponentially more important as infrastructure code scales. Composing small modules with the Composition pattern, separating remote state by component, and automating GitOps workflows with Atlantis represents the most mature operational model as of 2026.

The key principle is "start small and split when needed." Rather than attempting to design a perfect module structure from the beginning, start with a single module and refactor when duplication arises, split state files when they grow too large, and introduce Atlantis when the team expands. This incremental approach is the most practical path forward.

Infrastructure code should be subject to the same engineering disciplines as application code -- code review, testing, version control, and CI/CD. The patterns and tools presented in this guide aim to provide practical assistance on that journey.