Terraform Module Design Patterns and State Management Operations Playbook 2026


Overview

As of 2026, Terraform has reached version 1.11 with significantly enhanced security-focused features like Ephemeral Values and Write-Only Attributes, while OpenTofu independently supports State encryption and is approaching 10 million cumulative downloads. This article covers how to design and compose Terraform modules in practice, along with a playbook for safely operating State at the team level. Going beyond simple Remote Backend configuration, it encompasses the entire IaC operations lifecycle from module composition architecture to Terratest-based verification and State migration recovery procedures for each scenario.

The scope of this article is as follows:

  • Module single responsibility principle and interface design
  • Hierarchical structure of Root Module, Child Module, and Composition Module
  • Remote backend comparison and selection criteria for S3, GCS, Azure Blob, etc.
  • Workspace vs Directory-based environment isolation strategies
  • State migration using moved, import, and removed blocks
  • Testing strategies combining Terratest and terraform test
  • Troubleshooting and recovery procedures based on real incident cases

Module Design Principles

To design Terraform modules correctly, you need to apply core software engineering principles to HCL code. The most important principle is the Single Responsibility Principle (SRP). A single module should handle only one domain: networking, compute, or storage.

Module Directory Structure

Well-designed Terraform modules follow a standard structure. Here is a layout verified in practice, based on the structure recommended by HashiCorp's official guidelines.

# Standard module directory structure
modules/
  networking/
    main.tf          # Core resources: VPC, Subnet, Route Table, etc.
    variables.tf     # Input variable definitions (CIDR, AZ, tags, etc.)
    outputs.tf       # Output values: VPC ID, Subnet ID, etc.
    versions.tf      # required_providers, required_version
    README.md        # Module usage documentation
    tests/           # terraform test or Terratest code
      networking_test.go
  compute/
    main.tf
    variables.tf
    outputs.tf
    versions.tf
    iam.tf           # Compute-specific IAM Role/Policy
    userdata.tf      # Launch Template, User Data
  database/
    main.tf
    variables.tf
    outputs.tf
    versions.tf
    security_group.tf  # DB-specific Security Group

Variable Validation

Terraform has supported validation blocks on variables since 0.13, and from Terraform 1.9 onward a validation condition can also reference other variables, enabling more complex rules. Invalid input values are blocked early at the plan stage, greatly improving operational stability.

# variables.tf - Variable validation rule examples
variable "environment" {
  type        = string
  description = "Deployment environment (dev, staging, prod)"

  validation {
    condition     = contains(["dev", "staging", "prod"], var.environment)
    error_message = "environment must be one of dev, staging, or prod."
  }
}

variable "vpc_cidr" {
  type        = string
  description = "VPC CIDR block"

  validation {
    condition     = can(cidrhost(var.vpc_cidr, 0))
    error_message = "Must be a valid CIDR format (e.g., 10.0.0.0/16)."
  }

  validation {
    condition     = tonumber(split("/", var.vpc_cidr)[1]) >= 16 && tonumber(split("/", var.vpc_cidr)[1]) <= 24
    error_message = "VPC CIDR prefix length must be between /16 and /24."
  }
}

variable "instance_type" {
  type        = string
  default     = "t3.medium"
  description = "EC2 instance type"

  validation {
    condition     = can(regex("^(t3|t3a|m5|m6i|c5|c6i)\\.", var.instance_type))
    error_message = "Only approved instance families are allowed: t3, t3a, m5, m6i, c5, c6i."
  }
}

Module Interface Design Principles

Module inputs and outputs are like API contracts. Following these principles maximizes reusability and maintainability.

| Principle | Description | Example |
| --- | --- | --- |
| Minimal exposure | Export only necessary outputs | Export VPC ID but hide internal Route Table IDs |
| Explicit dependencies | Inject via variables instead of implicit dependencies | vpc_id = module.networking.vpc_id |
| Provide defaults | Set reasonable defaults for optional variables | default = "t3.medium" |
| Type constraints | Use specific types like object, map, list | type = map(object(...)) |
| Description required | Write descriptions for all variables and outputs | Reduces onboarding time for new team members |
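As a sketch of the type-constraint and default principles combined, a module input can bundle related settings into one typed object with optional attributes (Terraform 1.3+). The autoscaling names here are illustrative, not part of the modules above.

```hcl
variable "autoscaling" {
  type = object({
    min_size         = number
    max_size         = number
    desired_capacity = optional(number)        # null when omitted
    health_check     = optional(string, "EC2") # literal default when omitted
  })
  description = "Auto Scaling sizing for the compute module"

  default = {
    min_size = 1
    max_size = 3
  }
}
```

Grouping related inputs this way keeps the module's surface small while still letting callers override individual fields.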

Module Composition Patterns

In practice, infrastructure is never composed of a single module. The key is the Composition pattern that combines multiple modules to create a complete environment.

Module Hierarchy

Terraform modules can be broadly divided into three layers.

Leaf Module (Basic Module): Manages a single resource or a closely related group of resources. For example, a VPC module includes VPC, Subnet, Internet Gateway, and NAT Gateway. This module should be independently testable without depending on other modules.

Composition Module: Combines multiple Leaf Modules to form a single service stack. A web service stack would combine networking + compute + database + monitoring modules. The Composition Module itself does not create resources directly; it only connects outputs from child modules to inputs of other modules.

Root Module: The top-level entry point where terraform apply is actually executed. It is located in per-environment directories (dev, staging, prod) and calls Composition Modules while injecting environment-specific variables.

# environments/prod/main.tf - Root Module example
module "web_service" {
  source = "../../compositions/web-service"

  environment    = "prod"
  vpc_cidr       = "10.1.0.0/16"
  instance_type  = "m6i.xlarge"
  min_size       = 3
  max_size       = 10
  db_instance_class = "db.r6g.xlarge"
  multi_az       = true

  tags = {
    Team        = "platform"
    CostCenter  = "engineering"
    ManagedBy   = "terraform"
  }
}

# compositions/web-service/main.tf - Composition Module example
module "networking" {
  source = "../../modules/networking"

  vpc_cidr    = var.vpc_cidr
  environment = var.environment
  tags        = var.tags
}

module "compute" {
  source = "../../modules/compute"

  vpc_id         = module.networking.vpc_id
  subnet_ids     = module.networking.private_subnet_ids
  instance_type  = var.instance_type
  min_size       = var.min_size
  max_size       = var.max_size
  environment    = var.environment
  tags           = var.tags
}

module "database" {
  source = "../../modules/database"

  vpc_id            = module.networking.vpc_id
  subnet_ids        = module.networking.database_subnet_ids
  instance_class    = var.db_instance_class
  multi_az          = var.multi_az
  security_group_id = module.compute.app_security_group_id
  environment       = var.environment
  tags              = var.tags
}

Leveraging Terraform 1.10/1.11 Ephemeral Values

Ephemeral Values introduced in Terraform 1.10 and Write-Only Attributes in 1.11 fundamentally changed how sensitive information is passed between modules. Previously, secrets like database passwords were stored in plaintext in State files, but now they can be passed without being recorded in State.

# Terraform 1.10+ Ephemeral Resource example
ephemeral "aws_secretsmanager_secret_version" "db_password" {
  secret_id = "prod/db/master-password"
}

resource "aws_db_instance" "main" {
  identifier     = "prod-primary"
  engine         = "postgres"
  engine_version = "16.2"
  instance_class = "db.r6g.xlarge"

  # write_only attribute - Not stored in State (Terraform 1.11+)
  password_wo         = ephemeral.aws_secretsmanager_secret_version.db_password.secret_string
  password_wo_version = 1
}

The key point of this approach is that the password_wo value is never recorded in plan files or state files. When the password needs to be changed, incrementing the password_wo_version value triggers Terraform to perform an update.

State Management Strategy

State management is the most critical area of Terraform operations. Without a proper strategy, team collaboration is impossible, and poor management can put the entire infrastructure at risk.

Workspace vs Directory Environment Isolation

There are two main strategies for isolating environments (dev, staging, prod). You need to clearly understand the pros and cons of each before choosing.

| Comparison Item | Workspace Approach | Directory Approach |
| --- | --- | --- |
| State file location | Same backend, different keys | Completely separated backends |
| Code duplication | None (single codebase) | Exists (per-environment dirs) |
| Expressing env diffs | Conditionals, tfvars files | Direct configuration per dir |
| Accidental blast radius | Can modify dev in prod workspace | Physically separated, safe |
| Team size suitability | Small teams (5 or fewer) | Medium to large teams |
| CI/CD pipeline | Single pipeline + workspace variable | Independent pipeline per env |
| Recommended scenario | Personal projects, small services | Production ops, enterprise |

In practice, the Directory approach is overwhelmingly recommended. With the Workspace approach, accidentally running terraform workspace select prod and then applying dev changes can impact production. The Directory approach physically prevents such mistakes through separation.

# Directory-based environment isolation structure
infrastructure/
  modules/          # Reusable modules
  environments/
    dev/
      main.tf       # module source = "../../modules/..."
      backend.tf    # S3 key = "dev/terraform.tfstate"
      terraform.tfvars
    staging/
      main.tf
      backend.tf    # S3 key = "staging/terraform.tfstate"
      terraform.tfvars
    prod/
      main.tf
      backend.tf    # S3 key = "prod/terraform.tfstate"
      terraform.tfvars

Remote Backend Configuration

Cloud-Specific Remote Backend Comparison

| Comparison Item | AWS S3 + DynamoDB | GCS (Google Cloud Storage) | Azure Blob Storage |
| --- | --- | --- | --- |
| State storage | S3 Bucket | GCS Bucket | Blob Container |
| State Locking | DynamoDB Table | Built-in GCS locking | Azure Blob Lease |
| Encryption (at rest) | SSE-S3, SSE-KMS | Google-managed, CMEK | Microsoft-managed, CMEK |
| Encryption (transit) | TLS 1.2+ | TLS 1.2+ | TLS 1.2+ |
| Versioning | S3 Versioning | Object Versioning | Blob Versioning |
| Additional cost | DynamoDB billed separately | No additional cost | No additional cost |
| Config complexity | High (2 services) | Low (1 service) | Medium |
| IAM integration | AWS IAM | Google IAM | Azure RBAC |

This is the complete bootstrap configuration for the S3 backend, the most widely used in production environments. Note that Terraform 1.10 added native S3 locking (use_lockfile = true), which can replace the DynamoDB table; the bootstrap below shows the long-standing DynamoDB-based setup that most existing deployments still use.

# bootstrap/main.tf - State backend infrastructure bootstrap
terraform {
  required_version = ">= 1.9.0"
}

provider "aws" {
  region = "ap-northeast-2"
}

resource "aws_s3_bucket" "terraform_state" {
  bucket = "mycompany-terraform-state-prod"

  lifecycle {
    prevent_destroy = true
  }

  tags = {
    Name      = "Terraform State"
    ManagedBy = "bootstrap"
  }
}

resource "aws_s3_bucket_versioning" "terraform_state" {
  bucket = aws_s3_bucket.terraform_state.id

  versioning_configuration {
    status = "Enabled"
  }
}

resource "aws_s3_bucket_server_side_encryption_configuration" "terraform_state" {
  bucket = aws_s3_bucket.terraform_state.id

  rule {
    apply_server_side_encryption_by_default {
      sse_algorithm     = "aws:kms"
      kms_master_key_id = aws_kms_key.terraform_state.arn
    }
    bucket_key_enabled = true
  }
}

resource "aws_s3_bucket_public_access_block" "terraform_state" {
  bucket = aws_s3_bucket.terraform_state.id

  block_public_acls       = true
  block_public_policy     = true
  ignore_public_acls      = true
  restrict_public_buckets = true
}

resource "aws_kms_key" "terraform_state" {
  description             = "KMS key for Terraform state encryption"
  deletion_window_in_days = 30
  enable_key_rotation     = true
}

resource "aws_dynamodb_table" "terraform_lock" {
  name         = "terraform-state-lock"
  billing_mode = "PAY_PER_REQUEST"
  hash_key     = "LockID"

  attribute {
    name = "LockID"
    type = "S"
  }

  tags = {
    Name      = "Terraform State Lock"
    ManagedBy = "bootstrap"
  }
}

# Consumer-side backend.tf
# terraform {
#   backend "s3" {
#     bucket         = "mycompany-terraform-state-prod"
#     key            = "prod/networking/terraform.tfstate"
#     region         = "ap-northeast-2"
#     encrypt        = true
#     kms_key_id     = "arn:aws:kms:ap-northeast-2:123456789:key/xxx"
#     dynamodb_table = "terraform-state-lock"
#   }
# }

State Locking and Concurrency

State Locking is a critical mechanism that prevents State file corruption when multiple users run terraform apply simultaneously. Without locking, if two engineers run apply at the same time, one's changes can be overwritten by the other.

How Locking Works

Terraform acquires a lock on all operations that require State modification (plan, apply, destroy, state subcommands, etc.). If a lock already exists, it waits or returns an error.

# Message displayed during lock conflict
Error: Error acquiring the state lock
Lock Info:
  ID:        a1b2c3d4-e5f6-7890-abcd-ef1234567890
  Path:      s3://mycompany-terraform-state-prod/prod/terraform.tfstate
  Operation: OperationTypeApply
  Who:       engineer@workstation
  Version:   1.11.0
  Created:   2026-03-04 09:15:23.456789 +0000 UTC

# Force unlock (use only in emergencies)
terraform force-unlock a1b2c3d4-e5f6-7890-abcd-ef1234567890

# Items to verify before force unlocking
# 1. Confirm that the operation for that Lock ID has truly ended
# 2. Check with teammates via Slack if anyone is currently running apply
# 3. Directly check the Lock entry in the DynamoDB table
aws dynamodb get-item \
  --table-name terraform-state-lock \
  --key '{"LockID": {"S": "mycompany-terraform-state-prod/prod/terraform.tfstate"}}'

Concurrency Control Best Practices

State locking alone does not provide complete concurrency control. Additional measures are needed in CI/CD pipelines.

  • Ensure serial execution. In GitHub Actions, use the concurrency key to prevent simultaneous deployments to the same environment; setting concurrency: group: terraform-prod ensures only one workflow runs at a time.
  • Separate Plan and Apply. Run only plan on PRs, and apply after merge.
  • Minimize State access permissions. Restrict IAM roles with write access to production State so they can only be assumed during the apply stage of the CI/CD pipeline.
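The serial-execution guard can be sketched as a GitHub Actions workflow fragment. Workflow, job, and directory names here are illustrative; the concurrency block is the part that matters.

```yaml
# Hypothetical prod pipeline; only one run per concurrency group executes at a time.
name: terraform-prod
on:
  push:
    branches: [main]

concurrency:
  group: terraform-prod      # serialize all runs touching the prod state
  cancel-in-progress: false  # queue new runs instead of cancelling a live apply

jobs:
  apply:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: hashicorp/setup-terraform@v3
      - run: terraform -chdir=environments/prod init -input=false
      - run: terraform -chdir=environments/prod apply -input=false -auto-approve
```

With cancel-in-progress left false, a second push waits in the queue rather than interrupting an apply that holds the state lock.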

State Migration

State migration inevitably occurs during resource renaming, module restructuring, backend transitions, and similar situations. The moved block introduced in Terraform 1.1 and the import block introduced in Terraform 1.5 have greatly simplified migration work.

Refactoring with moved Blocks

The moved block automatically updates State when a resource's address changes. Unlike the traditional terraform state mv command, it declaratively expresses migration intent in code, ensuring the entire team follows the same migration path.

# Resource rename: aws_instance.web -> aws_instance.web_server
moved {
  from = aws_instance.web
  to   = aws_instance.web_server
}

resource "aws_instance" "web_server" {
  ami           = "ami-0c55b159cbfafe1f0"
  instance_type = "t3.medium"
}

# Move to module: aws_vpc.main -> module.networking.aws_vpc.main
moved {
  from = aws_vpc.main
  to   = module.networking.aws_vpc.main
}

# Module rename
moved {
  from = module.old_networking
  to   = module.networking
}

# for_each key change
moved {
  from = aws_subnet.private["az-a"]
  to   = aws_subnet.private["ap-northeast-2a"]
}

HashiCorp recommends keeping moved blocks permanently in the code for shared modules. This ensures consumers using various module versions can safely upgrade.

Importing Existing Resources with import Blocks

# import block (Terraform 1.5+) - Declarative import
import {
  to = aws_s3_bucket.existing_logs
  id = "my-existing-log-bucket"
}

resource "aws_s3_bucket" "existing_logs" {
  bucket = "my-existing-log-bucket"
}

# Preview differences with terraform plan
# Apply to State with terraform apply

Backend Migration Procedure

Follow this procedure when transitioning from a local backend to an S3 remote backend.

# 1. Back up the current State
cp terraform.tfstate terraform.tfstate.backup.$(date +%Y%m%d_%H%M%S)

# 2. Create/modify backend.tf
cat > backend.tf << 'HEREDOC'
terraform {
  backend "s3" {
    bucket         = "mycompany-terraform-state"
    key            = "services/web/terraform.tfstate"
    region         = "ap-northeast-2"
    encrypt        = true
    dynamodb_table = "terraform-state-lock"
  }
}
HEREDOC

# 3. Execute migration with terraform init
terraform init -migrate-state

# 4. Confirm migration
terraform state list

# 5. Verify remote State
terraform plan  # No changes expected

# 6. Clean up local State files (after verification)
rm terraform.tfstate terraform.tfstate.backup

Terraform vs OpenTofu Comparison

After HashiCorp changed the Terraform license to BSL (Business Source License) in 2023, the OpenTofu fork managed by the Linux Foundation has been growing rapidly. As of early 2026, OpenTofu has recorded approximately 9.8 million cumulative downloads from GitHub releases alone, and production adoption is expanding.

| Comparison Item | Terraform (HashiCorp) | OpenTofu (Linux Foundation) |
| --- | --- | --- |
| License | BSL 1.1 (commercial restrictions) | MPL 2.0 (fully open source) |
| Governance | HashiCorp sole | Community vote-based |
| State encryption | Not supported (external tools needed) | Native support |
| Ephemeral Values | 1.10+ supported | Separate implementation |
| Write-Only Attributes | 1.11+ supported | 1.11 equivalent (July 2025) |
| Provider compatibility | Full compatibility | Over 99% compatible |
| Registry | registry.terraform.io | registry.opentofu.org + compat |
| Commercial support | HCP Terraform, Terraform Enterprise | Spacelift, env0, Scalr, etc. |
| Performance | Same architecture | Same architecture (minimal diff) |

To summarize the selection criteria: If there are no license constraints and you are using the existing HashiCorp ecosystem (Vault, Consul, etc.), Terraform is the safe choice. On the other hand, if an open-source license is mandatory, native State encryption is needed, or you prefer community-driven development, OpenTofu is a good fit. The HCL syntax and CLI workflows of both tools are nearly identical, so the transition cost is low.
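For reference, OpenTofu's native State encryption is configured roughly as follows (OpenTofu 1.7+). This block is not valid in Terraform, and the key provider labels and passphrase variable are illustrative.

```hcl
terraform {
  encryption {
    key_provider "pbkdf2" "passphrase" {
      passphrase = var.state_passphrase  # supply via environment, never commit
    }
    method "aes_gcm" "secure" {
      keys = key_provider.pbkdf2.passphrase
    }
    state {
      method   = method.aes_gcm.secure
      enforced = true  # refuse to write unencrypted state
    }
  }
}
```

With enforced = true, OpenTofu fails rather than silently writing plaintext state if the encryption configuration is removed.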

Terratest Testing

Infrastructure code, like application code, requires automated testing. Terratest is a Go-based testing library developed by Gruntwork that supports end-to-end testing by provisioning actual cloud resources, verifying them, and then cleaning up.

Terratest vs terraform test Comparison

The built-in terraform test command from Terraform 1.6 and Terratest cover different testing domains. In practice, using both tools together is most effective.

  • terraform test: Written in HCL, suitable for plan-level unit testing. No separate language learning required and fast execution.
  • Terratest: Written in Go, deploys actual resources and performs integration tests including HTTP requests, SSH connections, etc. Wider test coverage but longer execution time and costs involved.

Terratest Code Example

// test/networking_test.go
package test

import (
	"testing"

	"github.com/gruntwork-io/terratest/modules/aws"
	"github.com/gruntwork-io/terratest/modules/terraform"
	"github.com/stretchr/testify/assert"
)

func TestNetworkingModule(t *testing.T) {
	t.Parallel()

	awsRegion := "ap-northeast-2"

	terraformOptions := &terraform.Options{
		TerraformDir: "../modules/networking",
		Vars: map[string]interface{}{
			"environment": "test",
			"vpc_cidr":    "10.99.0.0/16",
		},
		EnvVars: map[string]string{
			"AWS_DEFAULT_REGION": awsRegion,
		},
	}

	// Clean up resources when test ends
	defer terraform.Destroy(t, terraformOptions)

	// Deploy infrastructure
	terraform.InitAndApply(t, terraformOptions)

	// Verify output values
	vpcID := terraform.Output(t, terraformOptions, "vpc_id")
	assert.NotEmpty(t, vpcID)

	// Verify actual resources via AWS API
	vpc := aws.GetVpcById(t, vpcID, awsRegion)
	assert.Equal(t, "10.99.0.0/16", vpc.CidrBlock)

	// Verify Subnet count
	privateSubnetIDs := terraform.OutputList(t, terraformOptions, "private_subnet_ids")
	assert.Equal(t, 3, len(privateSubnetIDs))

	// Verify tags
	actualTags := aws.GetTagsForVpc(t, vpcID, awsRegion)
	assert.Equal(t, "test", actualTags["Environment"])
}

terraform test Example

# tests/networking.tftest.hcl
provider "aws" {
  region = "ap-northeast-2"
}

variables {
  environment = "test"
  vpc_cidr    = "10.99.0.0/16"
}

run "vpc_creation" {
  command = plan

  assert {
    condition     = aws_vpc.main.cidr_block == "10.99.0.0/16"
    error_message = "VPC CIDR block does not match the expected value."
  }

  assert {
    condition     = aws_vpc.main.tags["Environment"] == "test"
    error_message = "Environment tag is incorrect."
  }
}

run "subnet_count" {
  command = plan

  assert {
    condition     = length(aws_subnet.private) == 3
    error_message = "There must be 3 Private Subnets."
  }
}

Troubleshooting

Here is a summary of the most common problems encountered during Terraform operations and their solutions.

When State Lock Is Not Released

If a CI/CD pipeline terminates midway, or a network disconnection occurs during terraform apply, the State Lock may remain.

# 1. Check current Lock status
aws dynamodb scan \
  --table-name terraform-state-lock \
  --filter-expression "attribute_exists(LockID)"

# 2. Safely release after confirming Lock owner
terraform force-unlock <LOCK_ID>

# Warning: Only use force-unlock after confirming no other apply is in progress

State and Actual Infrastructure Mismatch

Manually changing resources in the AWS Console causes drift between State and the actual infrastructure.

# Detect drift
terraform plan -detailed-exitcode
# Exit code 0: No changes
# Exit code 1: Error
# Exit code 2: Changes present (drift detected)

# Refresh specific resource State (update State based on actual infrastructure)
terraform apply -refresh-only -target=aws_instance.web_server

# Remove resource from State (actual infrastructure is preserved)
terraform state rm aws_instance.legacy_server

State File Corruption

Though extremely rare, State files can become corrupted. If you have S3 versioning enabled, recovery from a previous version is possible.

# List previous versions of State file in S3
aws s3api list-object-versions \
  --bucket mycompany-terraform-state-prod \
  --prefix prod/terraform.tfstate

# Recover a specific version
aws s3api get-object \
  --bucket mycompany-terraform-state-prod \
  --key prod/terraform.tfstate \
  --version-id "abc123def456" \
  terraform.tfstate.recovered

# Verify recovered State
terraform show terraform.tfstate.recovered

# Replace State file (after confirming DynamoDB Lock)
aws s3 cp terraform.tfstate.recovered \
  s3://mycompany-terraform-state-prod/prod/terraform.tfstate

Operations Checklist

Here is a summary of items that must be verified at each stage of Terraform IaC operations.

Pre-Module Release Checklist

  • Are description and type defined for all variables
  • Are input values validated with validation blocks
  • Does outputs.tf expose only necessary output values
  • Are required_version and required_providers specified in versions.tf
  • Does the README.md contain usage examples and input/output descriptions
  • Are changes recorded in CHANGELOG.md
  • Do Terratest or terraform test pass
  • Do terraform fmt and terraform validate succeed

State Management Checklist

  • Is a remote backend configured (no local State usage)
  • Is State Locking enabled
  • Is versioning enabled on the State storage
  • Is the State storage encrypted (KMS/CMEK)
  • Is public access blocked
  • Is State fully isolated per environment (Directory approach recommended)
  • Does the State access IAM policy follow the principle of least privilege
  • Is concurrent execution prevention configured in CI/CD

Pre-Change Application Checklist

  • Have you reviewed the terraform plan output
  • Are the resources targeted for destroy intentional
  • Is sensitive information not hardcoded in the code
  • When using moved blocks, are from and to accurate
  • Has the production change gone through a separate approval process

Failure Cases and Recovery Procedures

Case 1: Accidental terraform destroy

An engineer was working in the dev environment but ran terraform destroy while the prod workspace was selected. This case demonstrates the critical weakness of the Workspace approach.

Recovery Procedure:

  1. Immediately press Ctrl+C to stop destroy (if in progress)
  2. Check previous State versions via S3 versioning
  3. Roll back to the previous State
  4. Identify deleted resources with terraform plan
  5. Recreate deleted resources with terraform apply
  6. Root cause fix: Migrate from Workspace approach to Directory approach

This case is the biggest reason the Directory approach is recommended. Working in the prod directory requires explicitly running cd environments/prod, and CI/CD pipelines are also separated per environment.

Case 2: State Lock Deadlock

A situation where the DynamoDB Lock is not released, preventing all team members from running apply.

Recovery Procedure:

  1. Check the Info column of the Lock entry in DynamoDB (who, when, what operation)
  2. Confirm the operation status with the engineer (via Slack, messenger, etc.)
  3. If confirmed that the operation has already ended, run terraform force-unlock
  4. Verify State consistency with terraform plan
  5. Root cause fix: Set timeouts in CI/CD pipelines, add graceful shutdown handlers
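The timeout fix in step 5 can be as small as one line in the workflow. Whether Terraform manages to release the lock on termination depends on how the runner signals the process, so treat this as mitigation, not a guarantee; the job name is illustrative.

```yaml
# Hypothetical job-level timeout; a hung apply is killed after 30 minutes
jobs:
  apply:
    runs-on: ubuntu-latest
    timeout-minutes: 30  # bounds how long a stale lock can linger
```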

Case 3: Sensitive Information Exposed in State File

A case where an RDS password was found stored in plaintext in the State file during an audit.

Recovery Procedure:

  1. Check S3 bucket access logs to determine if unauthorized access occurred
  2. Immediately change the exposed password
  3. Migrate to Terraform 1.10+ Ephemeral Resources or 1.11+ Write-Only Attributes
  4. Previous State file versions also contain sensitive information, so set S3 Lifecycle Policy to delete or expire old versions
  5. Root cause fix: Consider adopting OpenTofu's State encryption feature, or fully apply Terraform 1.11 Write-Only Attributes
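Step 4 above can be expressed in the bootstrap configuration itself. The 90-day retention period is an assumption, balancing rollback needs against how long exposed secrets linger in old versions.

```hcl
# Expire noncurrent state versions so rotated secrets eventually age out of S3.
resource "aws_s3_bucket_lifecycle_configuration" "terraform_state" {
  bucket = aws_s3_bucket.terraform_state.id

  rule {
    id     = "expire-old-state-versions"
    status = "Enabled"

    filter {}  # apply to every object in the bucket

    noncurrent_version_expiration {
      noncurrent_days = 90  # assumed retention window for rollback
    }
  }
}
```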
