Complete Guide to Terraform Module Design Patterns: State Management, Workspaces, and Atlantis Automation
- Introduction
- Terraform Module Structure and Design Principles
- Module Design Patterns
- Variable Design and Output Strategy
- Remote State Management
- Workspace Strategy vs Directory Separation
- GitOps Automation with Atlantis
- Module Versioning and Registry
- Comparison Tables
- Failure Cases and Recovery Procedures
- Operational Checklists
- Conclusion

Introduction
In the Infrastructure as Code (IaC) ecosystem, Terraform has established itself as the de facto standard tool for managing multi-cloud environments. However, as Terraform projects grow, the complexity of module design, state management, and team collaboration workflows grows quickly.
The early approach of listing hundreds of resources in a single main.tf file quickly degenerates into unmaintainable "spaghetti infrastructure." Even modularized code suffers when all resources share a single state file -- terraform plan can take over 10 minutes, and state conflicts between team members become frequent.
This guide covers three core Terraform module design patterns (Composition, Facade, Factory) with real HCL code examples, remote state management (S3+DynamoDB, GCS, Terraform Cloud), workspace strategies, and GitOps automation with Atlantis. It also includes failure cases encountered in production operations -- state lock conflicts, drift detection, and circular dependencies -- along with recovery procedures.
Terraform Module Structure and Design Principles
Module Directory Structure
A well-designed Terraform module follows a clear file structure. Based on HashiCorp's official guidelines and Google Cloud's Terraform best practices, the standard structure is:
modules/
  networking/
    main.tf             # Core resource definitions
    variables.tf        # Input variable declarations
    outputs.tf          # Output value definitions
    versions.tf         # Provider/terraform version constraints
    README.md           # Module usage documentation
    examples/
      simple/
        main.tf         # Simple usage example
      complete/
        main.tf         # Full-featured usage example
    tests/
      networking_test.go  # Terratest tests
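Besides Terratest, Terraform 1.6 and later include a native test framework that reads *.tftest.hcl files from the module's tests/ directory. The sketch below assumes the vpc module from the variable design section, whose only required inputs are cidr_block and environment and which creates aws_vpc.main; the file name is hypothetical. mock_provider requires Terraform 1.7 or later and lets the plan run without AWS credentials.

```hcl
# tests/vpc.tftest.hcl (hypothetical file name)

# Replace the real AWS provider with generated mock values (Terraform 1.7+)
mock_provider "aws" {}

# Variables shared by every run block in this file
variables {
  environment = "dev"
}

run "accepts_valid_cidr" {
  command = plan

  variables {
    cidr_block = "10.0.0.0/16"
  }

  assert {
    condition     = aws_vpc.main.cidr_block == "10.0.0.0/16"
    error_message = "VPC CIDR block did not match the input variable."
  }
}

run "rejects_invalid_cidr" {
  command = plan

  variables {
    cidr_block = "not-a-cidr"
  }

  # The validation block on var.cidr_block is expected to fail this run
  expect_failures = [var.cidr_block]
}
```

Run the file with terraform test from the module directory.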
Core Design Principles
1. Single Responsibility Principle
Each module should handle exactly one logical function. As HashiCorp states, "If a module's function or purpose is hard to explain, the module is probably too complex."
2. Loose Coupling
Minimize direct dependencies between modules. If running terraform plan reveals that a change in one module unexpectedly alters the state of several others, that is a signal of excessive coupling.
3. No Provider Configuration in Shared Modules
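Shared modules must never configure provider or backend blocks directly. Provider configuration belongs in root modules: a module that contains its own provider blocks cannot be called with count, for_each, or depends_on, and backend blocks outside the root module are ignored.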
# Bad example - provider configured inside module
# modules/vpc/main.tf
provider "aws" {
  region = "us-east-1" # Hardcoded region
}
resource "aws_vpc" "main" {
  cidr_block = var.cidr_block
}
# Good example - provider configured in root module
# environments/prod/main.tf
provider "aws" {
  region = "us-east-1"
}
module "vpc" {
  source     = "../../modules/vpc"
  cidr_block = "10.0.0.0/16"
}
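Instead of configuring a provider, a shared module declares which providers it needs in versions.tf and lets the root module pass configurations in. A sketch, assuming a module that needs a second, aliased AWS provider for replica resources; the alias name replica, the regions, and the version constraints are illustrative.

```hcl
# modules/vpc/versions.tf - declare requirements, never configure
terraform {
  required_version = ">= 1.5"

  required_providers {
    aws = {
      source  = "hashicorp/aws"
      version = ">= 5.0"
      # The caller must supply a configuration for aws.replica
      configuration_aliases = [aws.replica]
    }
  }
}

# environments/prod/main.tf - the root module owns provider configuration
provider "aws" {
  region = "us-east-1"
}

provider "aws" {
  alias  = "replica"
  region = "us-west-2"
}

module "vpc" {
  source     = "../../modules/vpc"
  cidr_block = "10.0.0.0/16"

  # Map the module's provider names to root-level configurations
  providers = {
    aws         = aws
    aws.replica = aws.replica
  }
}
```

Inside the module, resources that belong in the replica region select it with provider = aws.replica.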
4. Mandatory Output Values
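For every resource a module creates, define at least one output that references it. Outputs are the only way other modules can reference your module's resources and build implicit dependencies on them; without any, callers cannot order their own resources after yours except with a coarse, module-wide depends_on.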
Module Design Patterns
1. Composition Pattern
The Composition pattern combines small, focused modules to build complex infrastructure. It applies the software engineering principle of "Composition over Inheritance" to infrastructure code and is the most widely recommended pattern.
# environments/prod/main.tf - Composition Pattern
module "vpc" {
  source     = "../../modules/networking/vpc"
  cidr_block = "10.0.0.0/16"
  azs        = ["us-east-1a", "us-east-1b", "us-east-1c"]
}
module "security_group" {
  source = "../../modules/networking/security-group"
  vpc_id = module.vpc.vpc_id
  ingress_rules = [
    {
      port        = 443
      protocol    = "tcp"
      cidr_blocks = ["0.0.0.0/0"]
    }
  ]
}
module "eks" {
  source            = "../../modules/compute/eks"
  vpc_id            = module.vpc.vpc_id
  subnet_ids        = module.vpc.private_subnet_ids
  security_group_id = module.security_group.sg_id
  cluster_version   = "1.31"
}
# For brevity this reuses one security group; in production, give the
# database its own group that only allows its port from the cluster
module "rds" {
  source            = "../../modules/database/rds"
  vpc_id            = module.vpc.vpc_id
  subnet_ids        = module.vpc.database_subnet_ids
  security_group_id = module.security_group.sg_id
  engine            = "postgres"
  engine_version    = "16.4"
}
Each module can be independently tested, versioned, and reused. Data flows between modules through output values.
2. Facade Pattern
The Facade pattern hides complex internal implementation and provides consumers with a simple interface. Like a TV remote control, a single button (variable) controls complex internal operations (multiple resource creation).
# modules/platform/main.tf - Facade Pattern
variable "environment" {
  type = string
}
variable "app_name" {
  type = string
}
variable "instance_type" {
  type    = string
  default = "t3.medium"
}
# Internally composes multiple sub-modules
module "networking" {
  source      = "../networking/vpc"
  cidr_block  = var.environment == "prod" ? "10.0.0.0/16" : "10.1.0.0/16"
  environment = var.environment
}
module "compute" {
  source        = "../compute/eks"
  vpc_id        = module.networking.vpc_id
  subnet_ids    = module.networking.private_subnet_ids
  instance_type = var.instance_type
  cluster_name  = "${var.app_name}-${var.environment}" # e.g. my-service-prod
}
module "monitoring" {
  source     = "../observability/cloudwatch"
  cluster_id = module.compute.cluster_id
  alarm_sns  = module.compute.alarm_topic_arn
}
# Consumer uses it simply
# environments/prod/main.tf
module "platform" {
  source        = "../../modules/platform"
  environment   = "prod"
  app_name      = "my-service"
  instance_type = "m5.xlarge"
}
3. Factory Pattern
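The Factory pattern uses for_each (supported on module blocks since Terraform 0.13) to create identical resource structures in bulk from data-driven configuration. One caveat: every instance of a for_each module uses the same provider configuration, and the providers argument cannot vary per instance, so passing region as a variable does not by itself deploy each instance to a different region. The child module must either set the region on each resource (the AWS provider accepts a per-resource region argument from v6.0) or be replaced by one module block per region with an aliased provider.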
# modules/multi-region/main.tf - Factory Pattern
variable "regions" {
  type = map(object({
    cidr_block    = string
    instance_type = string
    replicas      = number
  }))
}
module "regional_stack" {
  source        = "../regional-stack"
  for_each      = var.regions
  region        = each.key
  cidr_block    = each.value.cidr_block
  instance_type = each.value.instance_type
  replicas      = each.value.replicas
}
# Usage example (in a root module)
module "global_infra" {
  source = "../../modules/multi-region"
  regions = {
    "us-east-1" = {
      cidr_block    = "10.0.0.0/16"
      instance_type = "m5.xlarge"
      replicas      = 3
    }
    "eu-west-1" = {
      cidr_block    = "10.1.0.0/16"
      instance_type = "m5.large"
      replicas      = 2
    }
  }
}
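Because all instances of a for_each module share one provider configuration, a common alternative for true multi-region deployments is one module block per region, each receiving its own aliased provider. A sketch, assuming the same regional-stack module and its region, cidr_block, instance_type, and replicas variables; the root module path and alias names are hypothetical.

```hcl
# environments/global/main.tf (hypothetical path)
provider "aws" {
  alias  = "us_east_1"
  region = "us-east-1"
}

provider "aws" {
  alias  = "eu_west_1"
  region = "eu-west-1"
}

module "stack_us_east_1" {
  source        = "../../modules/regional-stack"
  providers     = { aws = aws.us_east_1 } # resources land in us-east-1
  region        = "us-east-1"
  cidr_block    = "10.0.0.0/16"
  instance_type = "m5.xlarge"
  replicas      = 3
}

module "stack_eu_west_1" {
  source        = "../../modules/regional-stack"
  providers     = { aws = aws.eu_west_1 } # resources land in eu-west-1
  region        = "eu-west-1"
  cidr_block    = "10.1.0.0/16"
  instance_type = "m5.large"
  replicas      = 2
}
```

The trade-off is some repetition: adding a region means adding a provider and a module block rather than a map entry, but each region's resources are guaranteed to go through that region's provider.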
Variable Design and Output Strategy
Variable Design Guidelines
Effective variable design determines the reusability and stability of your modules.
# modules/vpc/variables.tf
variable "cidr_block" {
  type        = string
  description = "VPC CIDR block (e.g., 10.0.0.0/16)"
  validation {
    # cidrnetmask only accepts IPv4 CIDR notation
    condition     = can(cidrnetmask(var.cidr_block))
    error_message = "Must be a valid CIDR block."
  }
}
variable "environment" {
  type        = string
  description = "Environment name (dev, staging, prod)"
  validation {
    condition     = contains(["dev", "staging", "prod"], var.environment)
    error_message = "Environment must be dev, staging, or prod."
  }
}
variable "enable_nat_gateway" {
  type        = bool
  default     = true
  description = "Whether to create NAT Gateways for private subnets"
}
variable "tags" {
  type        = map(string)
  default     = {}
  description = "Additional tags to apply to all resources"
}
Key Principle: Expose only values that should vary across environments (CIDR ranges, instance sizes, names, timeouts) as variables. Encapsulate internal implementation details (IAM policy structures, logging configurations, tagging schemes) inside the module.
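One way to keep the interface small while still exposing per-environment knobs is an object variable whose attributes use optional() with defaults (Terraform 1.3+), so callers override only what differs. A sketch with a hypothetical nat_config variable as an alternative to separate boolean flags, assuming the module also takes the azs list passed in the Composition example:

```hcl
# modules/vpc/variables.tf - hypothetical grouped setting
variable "azs" {
  type        = list(string)
  description = "Availability Zones for the VPC subnets"
}

variable "nat_config" {
  type = object({
    enabled        = optional(bool, true)
    single_gateway = optional(bool, false) # one shared NAT instead of one per AZ
  })
  default     = {}
  description = "NAT Gateway settings; omitted attributes use the defaults"
}

# modules/vpc/main.tf - the gateway count stays an internal detail
locals {
  nat_gateway_count = (
    !var.nat_config.enabled ? 0 :
    var.nat_config.single_gateway ? 1 : length(var.azs)
  )
}
```

A dev environment can then pass nat_config = { single_gateway = true } and inherit enabled = true from the default.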
Output Design
# modules/vpc/outputs.tf
output "vpc_id" {
  value       = aws_vpc.main.id
  description = "The ID of the VPC"
}
output "private_subnet_ids" {
  value       = aws_subnet.private[*].id
  description = "List of private subnet IDs"
}
output "database_subnet_ids" {
  value       = aws_subnet.database[*].id
  description = "List of database subnet IDs"
}
output "nat_gateway_ips" {
  value       = aws_eip.nat[*].public_ip
  description = "Elastic IPs of NAT Gateways"
}
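List outputs are order-dependent, so callers sometimes prefer maps. A sketch of an additional output keyed by Availability Zone, assuming aws_subnet.private is created with count (as the splat expressions above imply) and that there is exactly one private subnet per AZ, since duplicate keys would be an error; the output name is hypothetical.

```hcl
# modules/vpc/outputs.tf - hypothetical additional output
output "private_subnet_ids_by_az" {
  value = {
    for subnet in aws_subnet.private : subnet.availability_zone => subnet.id
  }
  description = "Map of Availability Zone name to private subnet ID"
}
```

A caller can then write module.vpc.private_subnet_ids_by_az["us-east-1a"] without depending on list order.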
Remote State Management
S3 + DynamoDB Backend (AWS)
This is the most widely used remote state configuration in AWS environments: S3 stores the state file and DynamoDB provides state locking. Terraform 1.10 added native S3 locking through the use_lockfile = true option, and later releases deprecate DynamoDB-based locking, so prefer use_lockfile on current versions. Both options can be enabled together while migrating.
# backend.tf - S3 + DynamoDB remote state configuration
terraform {
  backend "s3" {
    bucket         = "my-company-terraform-state"
    key            = "prod/networking/terraform.tfstate"
    region         = "us-east-1"
    encrypt        = true
    dynamodb_table = "terraform-state-lock"
    # use_lockfile = true # S3 native locking (Terraform 1.10+)
  }
}
State bucket bootstrap script:
#!/bin/bash
# bootstrap-backend.sh - Create state storage infrastructure
set -euo pipefail
BUCKET_NAME="my-company-terraform-state"
DYNAMODB_TABLE="terraform-state-lock"
REGION="us-east-1"
# Create S3 bucket (regions other than us-east-1 require a LocationConstraint)
if [ "$REGION" = "us-east-1" ]; then
  aws s3api create-bucket --bucket "$BUCKET_NAME" --region "$REGION"
else
  aws s3api create-bucket --bucket "$BUCKET_NAME" --region "$REGION" \
    --create-bucket-configuration LocationConstraint="$REGION"
fi
# Enable versioning so earlier state versions can be restored
aws s3api put-bucket-versioning \
  --bucket "$BUCKET_NAME" \
  --versioning-configuration Status=Enabled
# Block public access
aws s3api put-public-access-block \
  --bucket "$BUCKET_NAME" \
  --public-access-block-configuration \
  BlockPublicAcls=true,IgnorePublicAcls=true,BlockPublicPolicy=true,RestrictPublicBuckets=true
# Configure default KMS encryption (AWS managed aws/s3 key, since no key ID is given)
aws s3api put-bucket-encryption \
  --bucket "$BUCKET_NAME" \
  --server-side-encryption-configuration '{
    "Rules": [{"ApplyServerSideEncryptionByDefault": {"SSEAlgorithm": "aws:kms"}}]
  }'
# Create DynamoDB table for state locking (skip if you rely on use_lockfile only)
aws dynamodb create-table \
  --table-name "$DYNAMODB_TABLE" \
  --attribute-definitions AttributeName=LockID,AttributeType=S \
  --key-schema AttributeName=LockID,KeyType=HASH \
  --billing-mode PAY_PER_REQUEST \
  --region "$REGION"
echo "Backend infrastructure created successfully"
GCS Backend (Google Cloud)
terraform {
  backend "gcs" {
    bucket = "my-company-tf-state"
    prefix = "prod/networking"
  }
}
Terraform Cloud / HCP Terraform
terraform {
  cloud {
    organization = "my-company"
    workspaces {
      name = "prod-networking"
    }
  }
}
Remote State Data Source (Cross-Stack References)
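To reference outputs from one stack in another, use the terraform_remote_state data source. Only the producing stack's root-module outputs are visible, so a value exposed by a child module (such as private_subnet_ids from the vpc module) must be re-exported with an output block in that stack's root module.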
# Referencing networking stack state from compute stack
data "terraform_remote_state" "networking" {
  backend = "s3"
  config = {
    bucket = "my-company-terraform-state"
    key    = "prod/networking/terraform.tfstate"
    region = "us-east-1"
  }
}
resource "aws_instance" "app" {
  ami           = "ami-0c55b159cbfafe1f0" # Placeholder; AMI IDs are region-specific
  instance_type = "t3.medium"
  subnet_id     = data.terraform_remote_state.networking.outputs.private_subnet_ids[0]
}
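terraform_remote_state gives the consuming stack read access to the producing stack's entire state file. An alternative that couples stacks only through the cloud API is to look resources up with provider data sources. A sketch, assuming the networking stack tags its VPC with Name = prod-vpc and its private subnets with Tier = private (both tag conventions are hypothetical), plus a hypothetical ami_id variable:

```hcl
# compute stack - discover networking resources by tag instead of reading state
variable "ami_id" {
  type        = string
  description = "AMI ID valid in the target region"
}

data "aws_vpc" "main" {
  tags = {
    Name = "prod-vpc"
  }
}

data "aws_subnets" "private" {
  filter {
    name   = "vpc-id"
    values = [data.aws_vpc.main.id]
  }

  tags = {
    Tier = "private"
  }
}

resource "aws_instance" "app" {
  ami           = var.ami_id
  instance_type = "t3.medium"
  # The data source does not guarantee ordering, so sort for a stable choice
  subnet_id = sort(data.aws_subnets.private.ids)[0]
}
```

The design choice is a tagging contract between stacks in place of a state-file contract; it requires discipline in tagging but no cross-stack state permissions.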
Workspace Strategy vs Directory Separation
Workspace Approach
Terraform workspaces share the same .tf files while maintaining independent state files per environment.
# Create and switch workspaces
terraform workspace new dev
terraform workspace new staging
terraform workspace new prod
terraform workspace select prod
# Check current workspace
terraform workspace show
Referencing workspaces in HCL:
resource "aws_instance" "app" {
  ami           = var.ami_id # Hypothetical variable; ami is a required argument
  instance_type = terraform.workspace == "prod" ? "m5.xlarge" : "t3.medium"
  tags = {
    Environment = terraform.workspace
  }
}
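As the number of workspaces grows, chained ternaries become hard to read. A common alternative is a map keyed by workspace name with a fallback through lookup(). A sketch; the map contents and the ami_id variable are illustrative.

```hcl
variable "ami_id" {
  type = string
}

locals {
  # Per-workspace settings; any workspace not listed falls back to the default
  instance_types = {
    prod    = "m5.xlarge"
    staging = "m5.large"
  }
  instance_type = lookup(local.instance_types, terraform.workspace, "t3.medium")
}

resource "aws_instance" "app" {
  ami           = var.ami_id
  instance_type = local.instance_type

  tags = {
    Environment = terraform.workspace
  }
}
```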
Directory Separation Approach
infrastructure/
  modules/
    vpc/
    eks/
    rds/
  environments/
    dev/
      main.tf
      terraform.tfvars
      backend.tf
    staging/
      main.tf
      terraform.tfvars
      backend.tf
    prod/
      main.tf
      terraform.tfvars
      backend.tf
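With directory separation, each environment's files stay thin: backend.tf differs only in its state key, and terraform.tfvars holds the values that vary. A sketch of the prod directory, assuming the shared vpc module and its cidr_block and environment variables; the bucket, key, and values are illustrative, and use_lockfile assumes Terraform 1.10 or later.

```hcl
# environments/prod/backend.tf - only the key changes per environment
terraform {
  backend "s3" {
    bucket       = "my-company-terraform-state"
    key          = "prod/terraform.tfstate"
    region       = "us-east-1"
    encrypt      = true
    use_lockfile = true # S3 native locking
  }
}

# environments/prod/main.tf
provider "aws" {
  region = "us-east-1"
}

variable "cidr_block" {
  type = string
}

module "vpc" {
  source      = "../../modules/vpc"
  cidr_block  = var.cidr_block
  environment = "prod"
}

# environments/prod/terraform.tfvars would contain, for example:
#   cidr_block = "10.0.0.0/16"
```

Because each directory has its own backend key, a mistaken apply in dev cannot touch prod state, which is the isolation the comparison table below refers to.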
Workspaces vs Directories Comparison
| Criteria | Workspaces | Directory Separation |
|---|---|---|
| Code Duplication | None (shared code) | Some duplication |
| Environment Isolation | Weak (same backend) | Strong (separate backends) |
| IAM Permission Separation | Difficult | Per-environment configuration |
| Blast Radius | Wide (shared code) | Narrow (independent) |
| Operational Complexity | Low | Medium |
| Best Suited For | Ephemeral environments, testing | Production environments |
Recommendation: Use directory separation for production environments and workspaces for short-lived test environments. Many successful teams combine both approaches.
GitOps Automation with Atlantis
What is Atlantis?
Atlantis is a GitOps tool that automates Terraform plan and apply through pull request workflows. When a developer opens an infrastructure change PR, Atlantis automatically runs terraform plan and posts the results as a PR comment. Once reviewers approve, the changes can be applied with an atlantis apply comment.
Key Benefits
- Consistent execution environment: All Terraform operations run on a dedicated server, eliminating "works on my machine" problems
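- Automatic project locking: Once a PR plans a project (directory and workspace), Atlantis locks it so other PRs cannot plan or apply that project until the PR is merged or closed or the lock is released. This works on top of Terraform's own backend state lock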
- Code review integration: Plan results are visible directly in the PR, ensuring visibility of infrastructure changes
- Audit logging: All changes are recorded in PR history
atlantis.yaml Configuration
# atlantis.yaml - located at repository root
version: 3
automerge: false
parallel_plan: true
parallel_apply: false
projects:
  - name: prod-networking
    dir: environments/prod/networking
    workspace: default
    terraform_version: v1.9.0
    autoplan:
      # when_modified paths are relative to the project dir
      when_modified:
        - '*.tf'
        - '*.tfvars'
        - '../../../modules/networking/**/*.tf'
      enabled: true
    # Honored only if the server-side repo config lists apply_requirements in allowed_overrides
    apply_requirements:
      - approved
      - mergeable
  - name: prod-compute
    dir: environments/prod/compute
    workspace: default
    terraform_version: v1.9.0
    autoplan:
      when_modified:
        - '*.tf'
        - '*.tfvars'
        - '../../../modules/compute/**/*.tf'
      enabled: true
    apply_requirements:
      - approved
      - mergeable
  - name: dev-networking
    dir: environments/dev/networking
    workspace: default
    terraform_version: v1.9.0
    autoplan:
      when_modified:
        - '*.tf'
        - '*.tfvars'
        - '../../../modules/networking/**/*.tf'
      enabled: true
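Custom Atlantis Workflows
Workflows defined in the repository's atlantis.yaml are honored only if the server-side repo config sets allow_custom_workflows: true, and a project runs one only when it sets workflow: custom, which in turn requires workflow to be listed in allowed_overrides.
# atlantis.yaml - custom workflow
workflows:
  custom:
    plan:
      steps:
        - run: terraform fmt -check -recursive
        - run: tflint --init
        - run: tflint
        - init
        - plan
    apply:
      steps:
        - apply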
Module Versioning and Registry
Semantic Versioning
Terraform modules should follow Semantic Versioning (SemVer):
- Major version bump: Adding required input variables, removing outputs -- breaking changes
- Minor version bump: Adding optional input variables, new outputs
- Patch version bump: Bug fixes, documentation updates
# Specifying version constraints
module "vpc" {
  source  = "terraform-aws-modules/vpc/aws"
  version = "~> 5.0" # Latest within 5.x range
}
module "eks" {
  source = "git::https://github.com/my-org/terraform-aws-eks.git?ref=v3.2.1"
}
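Renaming a resource inside a module changes its address, which would otherwise make every caller's next plan destroy and recreate it, a breaking change that forces a major version bump. Since Terraform 1.1, a moved block inside the module records the rename so existing state is migrated during the next plan, which can keep such a refactor within a minor release. A sketch with hypothetical old and new resource names:

```hcl
# modules/vpc/main.tf - resource renamed from "this" to "main" (hypothetical)
resource "aws_vpc" "main" {
  cidr_block = var.cidr_block
}

# Existing state at the old address is treated as belonging to the new one
moved {
  from = aws_vpc.this
  to   = aws_vpc.main
}
```

Keep the moved block in the module for at least one release cycle so callers upgrading from older versions still get the migration.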
Private Module Registry
Use Terraform Cloud or a self-hosted registry to manage internal modules.
# Using Terraform Cloud Private Registry
module "vpc" {
  source  = "app.terraform.io/my-org/vpc/aws"
  version = "2.1.0"
}
Comparison Tables
State Backend Comparison
| Feature | S3 + DynamoDB | GCS | Terraform Cloud | Azure Blob |
|---|---|---|---|---|
| State Locking | DynamoDB / S3 Native | Built-in | Built-in | Blob Lease |
| Encryption | KMS | Google KMS | Included | Azure KeyVault |
| Versioning | S3 Versioning | Object Versioning | Included | Blob Snapshots |
| Access Control | IAM Policy | IAM | Teams/RBAC | Azure RBAC |
| Cost | S3 + DynamoDB billing | GCS billing | Free tier limited | Blob billing |
| Setup Difficulty | Medium | Low | Low | Medium |
IaC Tool Comparison
| Feature | Terraform/OpenTofu | Pulumi | Crossplane | CloudFormation |
|---|---|---|---|---|
| Language | HCL | TypeScript/Python/Go | YAML/CRD | JSON/YAML |
| State Management | External backend required | Self-managed/external | Kubernetes etcd | AWS managed |
| Multi-Cloud | Excellent | Excellent | Excellent | AWS only |
| Learning Curve | Medium | Low (existing languages) | High | Low (AWS users) |
| Community | Very large | Growing | Growing | AWS ecosystem |
| Drift Detection | Manual via plan | Manual via preview | Automatic (reconciliation) | Built-in, run on demand |
Failure Cases and Recovery Procedures
Case 1: State Lock Conflict
Symptom: "Error acquiring the state lock" error when running terraform plan or apply
Cause: A previous Terraform operation terminated abnormally (network disconnection, CI runner timeout, Ctrl+C forced termination) and the lock was not released
Recovery procedure:
# 1. Check lock status - verify no other users are running operations
# Find the Lock ID from the error message
# 2. After confirming no other operations are running, force release
terraform force-unlock LOCK_ID
# 3. Force release without confirmation (caution: verify no operations running)
terraform force-unlock -force LOCK_ID
Prevention measures:
- Set appropriate timeouts in CI/CD pipelines
- Implement concurrency controls
- Use Atlantis for PR-based automatic locking to prevent conflicts
Case 2: State Drift
Symptom: terraform plan shows unexpected changes. Resources manually modified in the console are inconsistent with Terraform state
Recovery procedure:
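# 1. Review drift without proposing infrastructure changes
#    (terraform refresh is deprecated: it rewrites state with no review step)
terraform plan -refresh-only
# 2a. Keep the manual changes: accept them into state, then update the code to match
terraform apply -refresh-only
# 2b. Or revert the manual changes: apply the code as written
terraform apply
# 3. Resources created entirely outside Terraform can be adopted with import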
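For resources created by hand in the console that Terraform should now manage, the import block (Terraform 1.5+, root module only) adopts them through the normal plan and apply review instead of the imperative terraform import command. A sketch; the resource name and security group ID are placeholders.

```hcl
# imports.tf - adopt a console-created security group (placeholder ID)
import {
  to = aws_security_group.legacy_web
  id = "sg-0123456789abcdef0"
}
```

Running terraform plan -generate-config-out=generated.tf then drafts the matching aws_security_group.legacy_web block from the live object; review and commit it before applying.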
Case 3: Circular Dependencies
Symptom: "Cycle" error during terraform plan
Cause: Module A references outputs from Module B, and Module B references outputs from Module A
Solutions:
- Extract common dependencies into a separate module
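- Avoid broad module-level depends_on: it only adds dependency edges, so it cannot break a cycle and often creates one; reserve it for hidden dependencies Terraform cannot infer
- Switch to indirect references using data sources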
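A frequent source of Cycle errors is two security groups whose inline rules reference each other. Moving the rules into standalone rule resources breaks the cycle: both groups are created first and the rules are attached afterwards. A sketch, assuming AWS provider v5 or later (for the aws_vpc_security_group_*_rule resources) and hypothetical group names and ports:

```hcl
variable "vpc_id" {
  type = string
}

# Groups without inline rules no longer depend on each other
resource "aws_security_group" "app" {
  name   = "app"
  vpc_id = var.vpc_id
}

resource "aws_security_group" "db" {
  name   = "db"
  vpc_id = var.vpc_id
}

# Each rule depends on both groups, but nothing depends on the rules
resource "aws_vpc_security_group_egress_rule" "app_to_db" {
  security_group_id            = aws_security_group.app.id
  referenced_security_group_id = aws_security_group.db.id
  ip_protocol                  = "tcp"
  from_port                    = 5432
  to_port                      = 5432
}

resource "aws_vpc_security_group_ingress_rule" "db_from_app" {
  security_group_id            = aws_security_group.db.id
  referenced_security_group_id = aws_security_group.app.id
  ip_protocol                  = "tcp"
  from_port                    = 5432
  to_port                      = 5432
}
```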
Case 4: Large State File Performance Degradation
Symptom: terraform plan takes over 10 minutes, API rate limiting occurs
Solutions:
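# Target specific modules for plan/apply
# (a temporary measure; -target is not intended for routine use)
terraform plan -target=module.eks
terraform apply -target=module.eks
# Split state file: export a copy, then move the module into a new local file
# (the legacy -state/-state-out options operate on local state files only)
terraform state pull > full.tfstate
terraform state mv -state=full.tfstate -state-out=monitoring.tfstate \
  module.monitoring module.monitoring
# Review both files, then load each into its stack with terraform state push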
Root cause fix: Split state files by component to reduce individual state file size. Manage networking, compute, database, and monitoring as separate state files, using terraform_remote_state data sources for cross-references.
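On Terraform 1.7 and later, the split can also be done declaratively so every step appears in a reviewable plan. A sketch, assuming module.monitoring contains a single hypothetical aws_cloudwatch_dashboard.main resource and the new stack declares the same module "monitoring" block: the old stack forgets the resources without destroying them, and the new stack adopts them with import blocks.

```hcl
# Old stack (e.g. environments/prod/main.tf): delete the module block, then add
removed {
  from = module.monitoring

  lifecycle {
    destroy = false # remove from state only; the real resources are kept
  }
}

# New stack (e.g. environments/prod/monitoring/imports.tf): adopt the resources
import {
  to = module.monitoring.aws_cloudwatch_dashboard.main
  id = "prod-dashboard" # placeholder; import IDs are resource-specific
}
```

Neither apply destroys infrastructure, so the two stacks can be applied in separate, individually reviewed PRs.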
Operational Checklists
Module Design Checklist
- Does each module follow the single responsibility principle?
- Are output values defined for all resources?
- Do variables include type, description, and validation?
- Are provider and backend configured only in root modules?
- Does the module include README.md and an examples directory?
- Are semantic version tags maintained?
State Management Checklist
- Is a remote backend configured (no local state files)?
- Is state locking enabled?
- Are state files encrypted?
- Is S3 bucket versioning enabled?
- Is public access blocked?
- Are state files separated by component?
Atlantis / CI-CD Checklist
- Is atlantis.yaml configured at the repository root?
- Are approved + mergeable requirements set before apply?
- Do dependent projects auto-plan when modules change?
- Are webhook secrets securely managed?
- Is credential rotation performed periodically?
Conclusion
Terraform module design and state management matter more and more as infrastructure code scales. Composing small modules with the Composition pattern, separating remote state by component, and automating GitOps workflows with Atlantis is a mature, widely adopted operational model as of 2026.
The key principle is "start small and split when needed." Rather than attempting to design a perfect module structure from the beginning, start with a single module and refactor when duplication arises, split state files when they grow too large, and introduce Atlantis when the team expands. This incremental approach is the most practical path forward.
Infrastructure code should be subject to the same engineering disciplines as application code -- code review, testing, version control, and CI/CD. The patterns and tools presented in this guide aim to provide practical assistance on that journey.