
- Overview
- Module Design Principles
- Module Composition Patterns
- State Management Strategy
- Remote Backend Configuration
- State Locking and Concurrency
- State Migration
- Terraform vs OpenTofu Comparison
- Terratest Testing
- Troubleshooting
- Operations Checklist
- Failure Cases and Recovery Procedures
- References
Overview
As of 2026, Terraform has reached version 1.11 with significantly enhanced security-focused features like Ephemeral Values and Write-Only Attributes, while OpenTofu independently supports State encryption and is approaching 10 million cumulative downloads. This article covers how to design and compose Terraform modules in practice, along with a playbook for safely operating State at the team level. Going beyond simple Remote Backend configuration, it encompasses the entire IaC operations lifecycle from module composition architecture to Terratest-based verification and State migration recovery procedures for each scenario.
The scope of this article is as follows:
- Module single responsibility principle and interface design
- Hierarchical structure of Root Module, Child Module, and Composition Module
- Remote backend comparison and selection criteria for S3, GCS, Azure Blob, etc.
- Workspace vs Directory-based environment isolation strategies
- State migration using moved, import, and removed blocks
- Testing strategies combining Terratest and terraform test
- Troubleshooting and recovery procedures based on real incident cases
Module Design Principles
To design Terraform modules correctly, you need to apply core software engineering principles to HCL code. The most important principle is the Single Responsibility Principle (SRP). A single module should handle only one domain: networking, compute, or storage.
Module Directory Structure
Well-designed Terraform modules follow a standard structure. Here is a layout verified in practice, based on the structure recommended by HashiCorp's official guidelines.
# Standard module directory structure
modules/
  networking/
    main.tf               # Core resources: VPC, Subnet, Route Table, etc.
    variables.tf          # Input variable definitions (CIDR, AZ, tags, etc.)
    outputs.tf            # Output values: VPC ID, Subnet ID, etc.
    versions.tf           # required_providers, required_version
    README.md             # Module usage documentation
    tests/                # terraform test or Terratest code
      networking_test.go
  compute/
    main.tf
    variables.tf
    outputs.tf
    versions.tf
    iam.tf                # Compute-specific IAM Role/Policy
    userdata.tf           # Launch Template, User Data
  database/
    main.tf
    variables.tf
    outputs.tf
    versions.tf
    security_group.tf     # DB-specific Security Group
Variable Validation
Terraform has supported validation blocks on variables since 0.13, and from 1.9 onward a validation condition can also reference other objects, such as other input variables, enabling more complex rules. Invalid input values are blocked early at the plan stage, greatly improving operational stability.
# variables.tf - Variable validation rule examples
variable "environment" {
  type        = string
  description = "Deployment environment (dev, staging, prod)"

  validation {
    condition     = contains(["dev", "staging", "prod"], var.environment)
    error_message = "environment must be one of dev, staging, or prod."
  }
}

variable "vpc_cidr" {
  type        = string
  description = "VPC CIDR block"

  validation {
    condition     = can(cidrhost(var.vpc_cidr, 0))
    error_message = "Must be a valid CIDR format (e.g., 10.0.0.0/16)."
  }

  validation {
    condition     = tonumber(split("/", var.vpc_cidr)[1]) >= 16 && tonumber(split("/", var.vpc_cidr)[1]) <= 24
    error_message = "VPC CIDR prefix length must be between /16 and /24."
  }
}

variable "instance_type" {
  type        = string
  default     = "t3.medium"
  description = "EC2 instance type"

  validation {
    condition     = can(regex("^(t3|t3a|m5|m6i|c5|c6i)\\.", var.instance_type))
    error_message = "Only approved instance families are allowed: t3, t3a, m5, m6i, c5, c6i."
  }
}
Module Interface Design Principles
Module inputs and outputs are like API contracts. Following these principles maximizes reusability and maintainability.
| Principle | Description | Example |
|---|---|---|
| Minimal exposure | Export only necessary outputs | Export VPC ID but hide internal Route Table IDs |
| Explicit dependencies | Inject via variables instead of implicit dependencies | vpc_id = module.networking.vpc_id |
| Provide defaults | Set reasonable defaults for optional variables | default = "t3.medium" |
| Type constraints | Use specific types like object, map, list | type = map(object(...)) |
| Description required | Write descriptions for all variables and outputs | Reduces onboarding time for new team members |
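As a hedged sketch of this interface contract, a compute module's variables and outputs might look like the following (all names and defaults here are illustrative, not from a real module):

```hcl
# variables.tf - explicit dependency injection with typed, documented inputs
variable "vpc_id" {
  type        = string
  description = "ID of the VPC to deploy into (injected from the networking module)"
}

variable "scaling" {
  type = object({
    min_size = number
    max_size = number
  })
  default     = { min_size = 1, max_size = 3 }
  description = "Auto Scaling group size bounds"
}

# outputs.tf - expose only what downstream modules actually need
output "app_security_group_id" {
  value       = aws_security_group.app.id
  description = "Security Group ID consumed by the database module"
}
```

Internal details such as launch template IDs stay unexported, keeping the module free to change its implementation without breaking consumers.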
Module Composition Patterns
In practice, infrastructure is never composed of a single module. The key is the Composition pattern that combines multiple modules to create a complete environment.
Module Hierarchy
Terraform modules can be broadly divided into three layers.
Leaf Module (Basic Module): Manages a single resource or a closely related group of resources. For example, a VPC module includes VPC, Subnet, Internet Gateway, and NAT Gateway. This module should be independently testable without depending on other modules.
Composition Module: Combines multiple Leaf Modules to form a single service stack. A web service stack would combine networking + compute + database + monitoring modules. The Composition Module itself does not create resources directly; it only connects outputs from child modules to inputs of other modules.
Root Module: The top-level entry point where terraform apply is actually executed. It is located in per-environment directories (dev, staging, prod) and calls Composition Modules while injecting environment-specific variables.
# environments/prod/main.tf - Root Module example
module "web_service" {
  source = "../../compositions/web-service"

  environment       = "prod"
  vpc_cidr          = "10.1.0.0/16"
  instance_type     = "m6i.xlarge"
  min_size          = 3
  max_size          = 10
  db_instance_class = "db.r6g.xlarge"
  multi_az          = true

  tags = {
    Team       = "platform"
    CostCenter = "engineering"
    ManagedBy  = "terraform"
  }
}

# compositions/web-service/main.tf - Composition Module example
module "networking" {
  source = "../../modules/networking"

  vpc_cidr    = var.vpc_cidr
  environment = var.environment
  tags        = var.tags
}

module "compute" {
  source = "../../modules/compute"

  vpc_id        = module.networking.vpc_id
  subnet_ids    = module.networking.private_subnet_ids
  instance_type = var.instance_type
  min_size      = var.min_size
  max_size      = var.max_size
  environment   = var.environment
  tags          = var.tags
}

module "database" {
  source = "../../modules/database"

  vpc_id            = module.networking.vpc_id
  subnet_ids        = module.networking.database_subnet_ids
  instance_class    = var.db_instance_class
  multi_az          = var.multi_az
  security_group_id = module.compute.app_security_group_id
  environment       = var.environment
  tags              = var.tags
}
Leveraging Terraform 1.10/1.11 Ephemeral Values
Ephemeral Values introduced in Terraform 1.10 and Write-Only Attributes in 1.11 fundamentally changed how sensitive information is passed between modules. Previously, secrets like database passwords were stored in plaintext in State files, but now they can be passed without being recorded in State.
# Terraform 1.10+ Ephemeral Resource example
ephemeral "aws_secretsmanager_secret_version" "db_password" {
  secret_id = "prod/db/master-password"
}

resource "aws_db_instance" "main" {
  identifier     = "prod-primary"
  engine         = "postgres"
  engine_version = "16.2"
  instance_class = "db.r6g.xlarge"

  # write_only attribute - not stored in State (Terraform 1.11+)
  password_wo         = ephemeral.aws_secretsmanager_secret_version.db_password.secret_string
  password_wo_version = 1
}
The key point of this approach is that the password_wo value is never recorded in plan files or state files. When the password needs to be changed, incrementing the password_wo_version value triggers Terraform to perform an update.
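Ephemeral values can also cross module boundaries. As a sketch assuming Terraform 1.10+, a module input variable can be declared ephemeral so the secret is usable during apply but never persisted to plan or state files (the variable names here are illustrative):

```hcl
# Hypothetical module input - the value is never written to plan or state
variable "db_master_password" {
  type      = string
  ephemeral = true
  sensitive = true
}

variable "db_password_version" {
  type        = number
  description = "Increment to trigger a password rotation"
  default     = 1
}

# Inside the module, the ephemeral input feeds a write-only attribute (1.11+):
# resource "aws_db_instance" "this" {
#   password_wo         = var.db_master_password
#   password_wo_version = var.db_password_version
# }
```

The caller then sources the value from an ephemeral resource (as in the example above) and passes it through, so no layer of the module tree records the secret.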
State Management Strategy
State management is the most critical area of Terraform operations. Without a proper strategy, team collaboration is impossible, and poor management can put the entire infrastructure at risk.
Workspace vs Directory Environment Isolation
There are two main strategies for isolating environments (dev, staging, prod). You need to clearly understand the pros and cons of each before choosing.
| Comparison Item | Workspace Approach | Directory Approach |
|---|---|---|
| State file location | Same backend, different keys | Completely separated backends |
| Code duplication | None (single codebase) | Exists (per-environment dirs) |
| Expressing env diffs | Conditionals, tfvars files | Direct configuration per dir |
| Accidental blast radius | Can modify dev in prod workspace | Physically separated, safe |
| Team size suitability | Small teams (5 or fewer) | Medium to large teams |
| CI/CD pipeline | Single pipeline + workspace variable | Independent pipeline per env |
| Recommended scenario | Personal projects, small services | Production ops, enterprise |
In practice, the Directory approach is overwhelmingly recommended. With the Workspace approach, accidentally running terraform workspace select prod and then applying dev changes can impact production. The Directory approach physically prevents such mistakes through separation.
# Directory-based environment isolation structure
infrastructure/
  modules/                  # Reusable modules
  environments/
    dev/
      main.tf               # module source = "../../modules/..."
      backend.tf            # S3 key = "dev/terraform.tfstate"
      terraform.tfvars
    staging/
      main.tf
      backend.tf            # S3 key = "staging/terraform.tfstate"
      terraform.tfvars
    prod/
      main.tf
      backend.tf            # S3 key = "prod/terraform.tfstate"
      terraform.tfvars
Remote Backend Configuration
Cloud-Specific Remote Backend Comparison
| Comparison Item | AWS S3 + DynamoDB | GCS (Google Cloud Storage) | Azure Blob Storage |
|---|---|---|---|
| State storage | S3 Bucket | GCS Bucket | Blob Container |
| State Locking | DynamoDB Table | Built-in GCS locking | Azure Blob Lease |
| Encryption (at rest) | SSE-S3, SSE-KMS | Google-managed, CMEK | Microsoft-managed, CMEK |
| Encryption (transit) | TLS 1.2+ | TLS 1.2+ | TLS 1.2+ |
| Versioning | S3 Versioning | Object Versioning | Blob Versioning |
| Additional cost | DynamoDB billed separately | No additional cost | No additional cost |
| Config complexity | High (2 services) | Low (1 service) | Medium |
| IAM integration | AWS IAM | Google IAM | Azure RBAC |
S3 + DynamoDB Backend Configuration (Production Recommended)
This is the complete bootstrap configuration for the S3 backend, the most widely used in production environments.
# bootstrap/main.tf - State backend infrastructure bootstrap
terraform {
  required_version = ">= 1.9.0"
}

provider "aws" {
  region = "ap-northeast-2"
}

resource "aws_s3_bucket" "terraform_state" {
  bucket = "mycompany-terraform-state-prod"

  lifecycle {
    prevent_destroy = true
  }

  tags = {
    Name      = "Terraform State"
    ManagedBy = "bootstrap"
  }
}

resource "aws_s3_bucket_versioning" "terraform_state" {
  bucket = aws_s3_bucket.terraform_state.id

  versioning_configuration {
    status = "Enabled"
  }
}

resource "aws_s3_bucket_server_side_encryption_configuration" "terraform_state" {
  bucket = aws_s3_bucket.terraform_state.id

  rule {
    apply_server_side_encryption_by_default {
      sse_algorithm     = "aws:kms"
      kms_master_key_id = aws_kms_key.terraform_state.arn
    }
    bucket_key_enabled = true
  }
}

resource "aws_s3_bucket_public_access_block" "terraform_state" {
  bucket = aws_s3_bucket.terraform_state.id

  block_public_acls       = true
  block_public_policy     = true
  ignore_public_acls      = true
  restrict_public_buckets = true
}

resource "aws_kms_key" "terraform_state" {
  description             = "KMS key for Terraform state encryption"
  deletion_window_in_days = 30
  enable_key_rotation     = true
}

resource "aws_dynamodb_table" "terraform_lock" {
  name         = "terraform-state-lock"
  billing_mode = "PAY_PER_REQUEST"
  hash_key     = "LockID"

  attribute {
    name = "LockID"
    type = "S"
  }

  tags = {
    Name      = "Terraform State Lock"
    ManagedBy = "bootstrap"
  }
}

# Consumer-side backend.tf
# terraform {
#   backend "s3" {
#     bucket         = "mycompany-terraform-state-prod"
#     key            = "prod/networking/terraform.tfstate"
#     region         = "ap-northeast-2"
#     encrypt        = true
#     kms_key_id     = "arn:aws:kms:ap-northeast-2:123456789:key/xxx"
#     dynamodb_table = "terraform-state-lock"
#   }
# }
State Locking and Concurrency
State Locking is a critical mechanism that prevents State file corruption when multiple users run terraform apply simultaneously. Without locking, if two engineers run apply at the same time, one's changes can be overwritten by the other.
How Locking Works
Terraform acquires a lock on all operations that require State modification (plan, apply, destroy, state subcommands, etc.). If a lock already exists, it waits or returns an error.
# Message displayed during lock conflict
Error: Error acquiring the state lock

Lock Info:
  ID:        a1b2c3d4-e5f6-7890-abcd-ef1234567890
  Path:      s3://mycompany-terraform-state-prod/prod/terraform.tfstate
  Operation: OperationTypeApply
  Who:       engineer@workstation
  Version:   1.11.0
  Created:   2026-03-04 09:15:23.456789 +0000 UTC

# Force unlock (use only in emergencies)
terraform force-unlock a1b2c3d4-e5f6-7890-abcd-ef1234567890

# Items to verify before force unlocking
# 1. Confirm that the operation for that Lock ID has truly ended
# 2. Check with teammates via Slack if anyone is currently running apply
# 3. Directly check the Lock entry in the DynamoDB table
aws dynamodb get-item \
  --table-name terraform-state-lock \
  --key '{"LockID": {"S": "mycompany-terraform-state-prod/prod/terraform.tfstate"}}'
Concurrency Control Best Practices
State locking alone does not provide complete concurrency control. Additional measures are needed in CI/CD pipelines.
- Ensure serial execution. In GitHub Actions, use the concurrency key to prevent simultaneous deployments to the same environment; setting a group such as terraform-prod ensures only one workflow runs at a time.
- Separate Plan and Apply. Run only plan on PRs, and apply after merge.
- Minimize State access permissions. Restrict IAM roles with write access to production State so they can only be assumed during the apply stage of the CI/CD pipeline.
State Migration
State migration inevitably occurs during resource renaming, module restructuring, backend transitions, and similar situations. The moved block introduced in Terraform 1.1 and the import block introduced in Terraform 1.5 have greatly simplified migration work.
Refactoring with moved Blocks
The moved block automatically updates State when a resource's address changes. Unlike the traditional terraform state mv command, it declaratively expresses migration intent in code, ensuring the entire team follows the same migration path.
# Resource rename: aws_instance.web -> aws_instance.web_server
moved {
  from = aws_instance.web
  to   = aws_instance.web_server
}

resource "aws_instance" "web_server" {
  ami           = "ami-0c55b159cbfafe1f0"
  instance_type = "t3.medium"
}

# Move to module: aws_vpc.main -> module.networking.aws_vpc.main
moved {
  from = aws_vpc.main
  to   = module.networking.aws_vpc.main
}

# Module rename
moved {
  from = module.old_networking
  to   = module.networking
}

# for_each key change
moved {
  from = aws_subnet.private["az-a"]
  to   = aws_subnet.private["ap-northeast-2a"]
}
HashiCorp recommends keeping moved blocks permanently in the code for shared modules. This ensures consumers using various module versions can safely upgrade.
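On the consumer side, this pairs naturally with version pinning: a consumer upgrades to a release that still carries the moved blocks and receives the state migration automatically, with no manual terraform state mv. A sketch (the registry source below is hypothetical):

```hcl
module "networking" {
  # Hypothetical private registry path - replace with your module source
  source  = "app.terraform.io/mycompany/networking/aws"
  version = "~> 2.1" # allows 2.x patch/minor releases that retain the moved blocks

  vpc_cidr    = var.vpc_cidr
  environment = var.environment
}
```

When the consumer runs terraform plan after bumping the version, the plan output shows the affected resources as moved rather than destroyed and recreated.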
Importing Existing Resources with import Blocks
# import block (Terraform 1.5+) - Declarative import
import {
  to = aws_s3_bucket.existing_logs
  id = "my-existing-log-bucket"
}

resource "aws_s3_bucket" "existing_logs" {
  bucket = "my-existing-log-bucket"
}

# Preview differences with terraform plan
# Apply to State with terraform apply
Backend Migration Procedure
Follow this procedure when transitioning from a local backend to an S3 remote backend.
# 1. Back up the current State
cp terraform.tfstate terraform.tfstate.backup.$(date +%Y%m%d_%H%M%S)

# 2. Create/modify backend.tf
cat > backend.tf << 'HEREDOC'
terraform {
  backend "s3" {
    bucket         = "mycompany-terraform-state"
    key            = "services/web/terraform.tfstate"
    region         = "ap-northeast-2"
    encrypt        = true
    dynamodb_table = "terraform-state-lock"
  }
}
HEREDOC

# 3. Execute migration with terraform init
terraform init -migrate-state

# 4. Confirm migration
terraform state list

# 5. Verify remote State
terraform plan   # No changes expected

# 6. Clean up local State files (after verification)
rm terraform.tfstate terraform.tfstate.backup
Terraform vs OpenTofu Comparison
After HashiCorp changed the Terraform license to the BSL (Business Source License) in August 2023, the OpenTofu fork managed by the Linux Foundation has been growing rapidly. As of early 2026, OpenTofu has recorded approximately 9.8 million cumulative downloads from GitHub releases alone, and production adoption is expanding.
| Comparison Item | Terraform (HashiCorp) | OpenTofu (Linux Foundation) |
|---|---|---|
| License | BSL 1.1 (commercial restrictions) | MPL 2.0 (fully open source) |
| Governance | HashiCorp sole | Community vote-based |
| State encryption | Not supported (external tools needed) | Native support |
| Ephemeral Values | 1.10+ supported | Separate implementation |
| Write-Only Attributes | 1.11+ supported | 1.11 equivalent (July 2025) |
| Provider compatibility | Full compatibility | Over 99% compatible |
| Registry | registry.terraform.io | registry.opentofu.org + compat |
| Commercial support | HCP Terraform, Terraform Enterprise | Spacelift, env0, Scalr, etc. |
| Performance | Same architecture | Same architecture (minimal diff) |
To summarize the selection criteria: If there are no license constraints and you are using the existing HashiCorp ecosystem (Vault, Consul, etc.), Terraform is the safe choice. On the other hand, if an open-source license is mandatory, native State encryption is needed, or you prefer community-driven development, OpenTofu is a good fit. The HCL syntax and CLI workflows of both tools are nearly identical, so the transition cost is low.
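For teams choosing OpenTofu specifically for native State encryption, the feature (available since OpenTofu 1.7) is configured inside the terraform block. A minimal passphrase-based sketch follows; in production a cloud KMS key provider would be preferable to a passphrase:

```hcl
terraform {
  encryption {
    # Derives an encryption key from a passphrase (sketch only - prefer a
    # cloud KMS key provider in production)
    key_provider "pbkdf2" "passphrase" {
      passphrase = var.state_passphrase # supply via TF_VAR_..., never hardcode
    }

    method "aes_gcm" "default" {
      keys = key_provider.pbkdf2.passphrase
    }

    # Encrypt the state file itself with the AES-GCM method above
    state {
      method = method.aes_gcm.default
    }
  }
}
```

With this in place, the state stored in the backend is ciphertext, so even read access to the bucket no longer exposes resource attributes.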
Terratest Testing
Infrastructure code, like application code, requires automated testing. Terratest is a Go-based testing library developed by Gruntwork that supports end-to-end testing by provisioning actual cloud resources, verifying them, and then cleaning up.
Terratest vs terraform test Comparison
The built-in terraform test command from Terraform 1.6 and Terratest cover different testing domains. In practice, using both tools together is most effective.
- terraform test: Written in HCL, suitable for plan-level unit testing. No separate language learning required and fast execution.
- Terratest: Written in Go, deploys actual resources and performs integration tests including HTTP requests, SSH connections, etc. Wider test coverage but longer execution time and costs involved.
Terratest Code Example
// test/networking_test.go
package test

import (
	"testing"

	"github.com/gruntwork-io/terratest/modules/aws"
	"github.com/gruntwork-io/terratest/modules/terraform"
	"github.com/stretchr/testify/assert"
)

func TestNetworkingModule(t *testing.T) {
	t.Parallel()

	awsRegion := "ap-northeast-2"

	terraformOptions := &terraform.Options{
		TerraformDir: "../modules/networking",
		Vars: map[string]interface{}{
			"environment": "test",
			"vpc_cidr":    "10.99.0.0/16",
		},
		EnvVars: map[string]string{
			"AWS_DEFAULT_REGION": awsRegion,
		},
	}

	// Clean up resources when test ends
	defer terraform.Destroy(t, terraformOptions)

	// Deploy infrastructure
	terraform.InitAndApply(t, terraformOptions)

	// Verify output values
	vpcID := terraform.Output(t, terraformOptions, "vpc_id")
	assert.NotEmpty(t, vpcID)

	// Verify actual resources via AWS API
	vpc := aws.GetVpcById(t, vpcID, awsRegion)
	assert.Equal(t, "10.99.0.0/16", vpc.CidrBlock)

	// Verify Subnet count
	privateSubnetIDs := terraform.OutputList(t, terraformOptions, "private_subnet_ids")
	assert.Equal(t, 3, len(privateSubnetIDs))

	// Verify tags
	actualTags := aws.GetTagsForVpc(t, vpcID, awsRegion)
	assert.Equal(t, "test", actualTags["Environment"])
}
terraform test Example
# tests/networking.tftest.hcl
provider "aws" {
  region = "ap-northeast-2"
}

variables {
  environment = "test"
  vpc_cidr    = "10.99.0.0/16"
}

run "vpc_creation" {
  command = plan

  assert {
    condition     = aws_vpc.main.cidr_block == "10.99.0.0/16"
    error_message = "VPC CIDR block does not match the expected value."
  }

  assert {
    condition     = aws_vpc.main.tags["Environment"] == "test"
    error_message = "Environment tag is incorrect."
  }
}

run "subnet_count" {
  command = plan

  assert {
    condition     = length(aws_subnet.private) == 3
    error_message = "There must be 3 Private Subnets."
  }
}
Troubleshooting
Here is a summary of the most common problems encountered during Terraform operations and their solutions.
When State Lock Is Not Released
If a CI/CD pipeline terminates midway, or a network disconnection occurs during terraform apply, the State Lock may remain.
# 1. Check current Lock status
aws dynamodb scan \
  --table-name terraform-state-lock \
  --filter-expression "attribute_exists(LockID)"

# 2. Safely release after confirming Lock owner
terraform force-unlock <LOCK_ID>

# Warning: Only use force-unlock after confirming no other apply is in progress
State and Actual Infrastructure Mismatch
Manually changing resources in the AWS Console causes drift between State and the actual infrastructure.
# Detect drift
terraform plan -detailed-exitcode
# Exit code 0: No changes
# Exit code 1: Error
# Exit code 2: Changes present (drift detected)

# Refresh specific resource State (update State based on actual infrastructure)
terraform apply -refresh-only -target=aws_instance.web_server

# Remove resource from State (actual infrastructure is preserved)
terraform state rm aws_instance.legacy_server
State File Corruption
Though extremely rare, State files can become corrupted. If you have S3 versioning enabled, recovery from a previous version is possible.
# List previous versions of State file in S3
aws s3api list-object-versions \
  --bucket mycompany-terraform-state-prod \
  --prefix prod/terraform.tfstate

# Recover a specific version
aws s3api get-object \
  --bucket mycompany-terraform-state-prod \
  --key prod/terraform.tfstate \
  --version-id "abc123def456" \
  terraform.tfstate.recovered

# Verify recovered State
terraform show terraform.tfstate.recovered

# Replace State file (after confirming DynamoDB Lock)
aws s3 cp terraform.tfstate.recovered \
  s3://mycompany-terraform-state-prod/prod/terraform.tfstate
Operations Checklist
Here is a summary of items that must be verified at each stage of Terraform IaC operations.
Pre-Module Release Checklist
- Are description and type defined for all variables
- Are input values validated with validation blocks
- Does outputs.tf expose only necessary output values
- Are required_version and required_providers specified in versions.tf
- Does the README.md contain usage examples and input/output descriptions
- Are changes recorded in CHANGELOG.md
- Do Terratest or terraform test pass
- Do terraform fmt and terraform validate succeed
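As a reference point for the versions.tf item above, a minimal sketch might look like this (the version bounds are illustrative; pinning the provider's major version prevents surprise upgrades):

```hcl
# versions.tf
terraform {
  required_version = ">= 1.9.0, < 2.0.0"

  required_providers {
    aws = {
      source  = "hashicorp/aws"
      version = "~> 5.0" # allow 5.x minor/patch updates, block 6.x
    }
  }
}
```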
State Management Checklist
- Is a remote backend configured (no local State usage)
- Is State Locking enabled
- Is versioning enabled on the State storage
- Is the State storage encrypted (KMS/CMEK)
- Is public access blocked
- Is State fully isolated per environment (Directory approach recommended)
- Does the State access IAM policy follow the principle of least privilege
- Is concurrent execution prevention configured in CI/CD
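The least-privilege item above can be sketched as an IAM policy document granting only the state keys and lock-table actions a prod pipeline actually needs (the bucket name, key prefix, and ARNs below are illustrative):

```hcl
data "aws_iam_policy_document" "state_access" {
  # Read/write only the prod state objects, not the whole bucket
  statement {
    sid       = "StateObjectAccess"
    actions   = ["s3:GetObject", "s3:PutObject"]
    resources = ["arn:aws:s3:::mycompany-terraform-state-prod/prod/*"]
  }

  statement {
    sid       = "StateBucketList"
    actions   = ["s3:ListBucket"]
    resources = ["arn:aws:s3:::mycompany-terraform-state-prod"]
  }

  # Lock acquisition and release on the DynamoDB lock table
  statement {
    sid       = "StateLock"
    actions   = ["dynamodb:GetItem", "dynamodb:PutItem", "dynamodb:DeleteItem"]
    resources = ["arn:aws:dynamodb:ap-northeast-2:123456789012:table/terraform-state-lock"]
  }
}
```

Attaching this policy to a role that is assumable only from the CI/CD apply stage implements the last two checklist items together.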
Pre-Change Application Checklist
- Have you reviewed the terraform plan output
- Are the resources targeted for destroy intentional
- Is sensitive information not hardcoded in the code
- When using moved blocks, are from and to accurate
- Has the production change gone through a separate approval process
Failure Cases and Recovery Procedures
Case 1: Accidental terraform destroy
An engineer was working in the dev environment but ran terraform destroy while the prod workspace was selected. This case demonstrates the critical weakness of the Workspace approach.
Recovery Procedure:
- Immediately press Ctrl+C to stop destroy (if in progress)
- Check previous State versions via S3 versioning
- Roll back to the previous State
- Identify deleted resources with terraform plan
- Recreate deleted resources with terraform apply
- Root cause fix: Migrate from the Workspace approach to the Directory approach
This case is the biggest reason the Directory approach is recommended. Working in the prod directory requires explicitly running cd environments/prod, and CI/CD pipelines are also separated per environment.
Case 2: State Lock Deadlock
A situation where the DynamoDB Lock is not released, preventing all team members from running apply.
Recovery Procedure:
- Check the Info column of the Lock entry in DynamoDB (who, when, what operation)
- Confirm the operation status with the engineer (via Slack, messenger, etc.)
- If confirmed that the operation has already ended, run terraform force-unlock
- Verify State consistency with terraform plan
- Root cause fix: Set timeouts in CI/CD pipelines, add graceful shutdown handlers
Case 3: Sensitive Information Exposed in State File
A case where an RDS password was found stored in plaintext in the State file during an audit.
Recovery Procedure:
- Check S3 bucket access logs to determine if unauthorized access occurred
- Immediately change the exposed password
- Migrate to Terraform 1.10+ Ephemeral Resources or 1.11+ Write-Only Attributes
- Previous State file versions also contain sensitive information, so set S3 Lifecycle Policy to delete or expire old versions
- Root cause fix: Consider adopting OpenTofu's State encryption feature, or fully apply Terraform 1.11 Write-Only Attributes
References
- Terraform 1.10 - Improved Secret Management in State with Ephemeral Values
- Terraform 1.11 - Ephemeral Values for Managed Resources with Write-Only Arguments
- HashiCorp - Terraform Module Creation Recommended Patterns
- Terraform moved Block for Refactoring
- Terratest - Infrastructure Code Automated Testing Library
- Spacelift - OpenTofu vs Terraform Comparison
- AWS - Terraform Backend Best Practices
- Google Cloud - Terraform Style and Structure Best Practices
- Terraform State Refactoring Official Documentation