AWS Well-Architected Framework Complete Guide 2025: Six Pillars, Practical Adoption, Cost/Security/Performance
- Author: Youngju Kim (@fjvbn20031)
TL;DR
- Six Pillars: Operational Excellence, Security, Reliability, Performance Efficiency, Cost Optimization, Sustainability.
- Trade-offs: No system can satisfy every pillar 100%. Decide based on business priorities.
- Well-Architected Tool: Free self-assessment inside the AWS Console.
- Sustainability (added 2021): Considers environmental impact — efficient code equals less carbon.
- Applicable to any cloud: It is an AWS framework, but the principles hold on GCP/Azure too.
1. What Is the Well-Architected Framework?
1.1 Background
After reviewing tens of thousands of customer architectures, AWS distilled its findings into a best-practice guide. It started with five pillars; Sustainability was added in 2021, making six.
1.2 The Six Pillars (2025)
| Pillar | Core Question |
|---|---|
| Operational Excellence | How do we run the workload efficiently? |
| Security | How do we protect data and systems? |
| Reliability | How do we recover from failure? |
| Performance Efficiency | How do we use resources efficiently? |
| Cost Optimization | How do we minimize cost per unit of value? |
| Sustainability | How do we minimize environmental impact? |
1.3 General Principles
Five principles that apply to every pillar:
- Measure, don't guess — data-driven decisions.
- Test at production scale — staging is not production.
- Increase experimentation with automation — manual work equals human error.
- Allow evolutionary architectures — nothing built once lasts forever.
- Improve operations through game days — chaos engineering.
2. Pillar 1: Operational Excellence
2.1 Core
"The ability to run and monitor systems to deliver business value and to continually improve supporting processes and procedures."
2.2 Design Principles
- Operations as code — codify infra, policy, and procedure.
- Small, frequent changes — big deployments are risky.
- Continuously refine procedures — retros, runbook updates.
- Anticipate failure — map failure modes.
- Learn from every operational failure — postmortems.
2.3 Practical Checklist
Plan:
- Business objectives are explicit?
- Success measured by metrics?
- Operational priorities defined?
Prepare:
- All infra managed as IaC (CloudFormation/Terraform)?
- CI/CD pipeline in place?
- Monitoring and alerting configured?
Operate:
- Runbooks current?
- Clear on-call procedures?
- Incident response times measured?
Evolve:
- Regular retrospectives?
- Chaos engineering drills?
- Metrics-driven improvement?
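The "incident response times measured?" item can start with a very small script. The following is a minimal sketch, assuming each incident is stored as a (detected, resolved) pair of timestamps; the record format and the sample incidents are hypothetical, not something the framework prescribes.

```python
from datetime import datetime, timedelta
from statistics import mean


def mean_time_to_recover(incidents):
    """Average time between detection and resolution, as a timedelta."""
    durations = [
        (resolved - detected).total_seconds()
        for detected, resolved in incidents
    ]
    if not durations:
        raise ValueError("at least one incident is required")
    return timedelta(seconds=mean(durations))


# Hypothetical incident log: (detected_at, resolved_at)
incidents = [
    (datetime(2025, 1, 3, 10, 0), datetime(2025, 1, 3, 10, 30)),     # 30 min
    (datetime(2025, 2, 14, 22, 15), datetime(2025, 2, 14, 23, 45)),  # 90 min
]

mttr = mean_time_to_recover(incidents)
print(f"MTTR: {mttr}")
```

Tracking this number across retrospectives gives the Evolve stage something concrete to improve.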
2.4 AWS Tools
- CloudFormation / CDK — IaC.
- Systems Manager — automation (runbooks, patching).
- CloudWatch — monitoring, alerting.
- X-Ray — distributed tracing.
- Config — resource change tracking.
3. Pillar 2: Security
3.1 Core
"Protect data, systems, and assets while delivering business value through risk assessments and mitigation strategies."
3.2 Design Principles
- Implement a strong identity foundation — IAM, MFA, least privilege.
- Apply security at all layers — defense in depth.
- Encrypt data at rest and in transit — encryption everywhere.
- Keep people away from data — prefer automation.
- Prepare for security events — incident response plan.
- Understand the shared responsibility model — AWS secures the cloud, you secure what is in it.
3.3 IAM Best Practices
BAD: use the root account
BAD: access keys inside code
BAD: wildcard (*) permissions
BAD: shared users
GOOD: MFA for every user
GOOD: role-based access (IAM Roles)
GOOD: least-privilege principle
GOOD: Access Analyzer
GOOD: temporary credentials via STS
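The wildcard rule above can be checked mechanically before a policy is ever attached. The sketch below scans an IAM policy document (the standard JSON shape with `Statement`, `Effect`, `Action`, `Resource`) for `Allow` statements that use `*`. The function name and both sample policies are illustrative assumptions; a real review would rely on IAM Access Analyzer rather than this simplified check.

```python
def find_wildcards(policy):
    """Return (statement_index, field, value) for each wildcard in Allow statements."""
    findings = []
    statements = policy.get("Statement", [])
    if isinstance(statements, dict):  # a single statement may appear without a list
        statements = [statements]
    for index, statement in enumerate(statements):
        if statement.get("Effect") != "Allow":
            continue
        for field in ("Action", "Resource"):
            values = statement.get(field, [])
            if isinstance(values, str):
                values = [values]
            for value in values:
                # "*" alone, or "service:*" for actions, grants more than least privilege
                if value == "*" or (field == "Action" and value.endswith(":*")):
                    findings.append((index, field, value))
    return findings


# Hypothetical policies for illustration
risky_policy = {
    "Version": "2012-10-17",
    "Statement": [{"Effect": "Allow", "Action": "s3:*", "Resource": "*"}],
}
scoped_policy = {
    "Version": "2012-10-17",
    "Statement": [{
        "Effect": "Allow",
        "Action": ["s3:GetObject"],
        "Resource": "arn:aws:s3:::example-bucket/*",
    }],
}

print(find_wildcards(risky_policy))
print(find_wildcards(scoped_policy))
```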
3.4 Data Protection
At rest:
- S3: default SSE-S3, or SSE-KMS.
- EBS: enable default encryption.
- RDS: KMS-managed keys.
- DynamoDB: always encrypted (default).
In transit:
- TLS 1.2+ mandatory.
- Encrypt intra-VPC traffic (service mesh, mTLS).
- VPN / Direct Connect.
3.5 Network Security
[Public Subnet] <- Internet Gateway
|
[Private Subnet] <- NAT Gateway (egress only)
|
[Isolated Subnet] <- DB only, no internet
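To make the three-tier layout concrete, here is a minimal sketch that carves one VPC CIDR block into public, private, and isolated subnets across several AZs using Python's standard `ipaddress` module. The CIDR range, the /24 subnet size, and the AZ suffixes are assumptions chosen for illustration.

```python
import ipaddress


def plan_subnets(vpc_cidr, azs, tiers=("public", "private", "isolated"), new_prefix=24):
    """Assign one subnet per (tier, AZ) pair, in order, from the VPC range."""
    blocks = ipaddress.ip_network(vpc_cidr).subnets(new_prefix=new_prefix)
    plan = {}
    for tier in tiers:
        for az in azs:
            plan[(tier, az)] = next(blocks)  # raises StopIteration if the VPC is too small
    return plan


# Hypothetical VPC range and AZ suffixes
plan = plan_subnets("10.0.0.0/16", ["a", "b", "c"])
for (tier, az), subnet in plan.items():
    print(f"{tier:<9} az-{az}  {subnet}")
```

Only the public tier would get a route to the Internet Gateway; the private tier routes egress through the NAT Gateway, and the isolated tier has no internet route at all.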
Security Group vs. NACL:
- Security Group: instance-level, stateful.
- NACL: subnet-level, stateless.
3.6 AWS Security Tools
| Tool | Purpose |
|---|---|
| IAM | access control |
| GuardDuty | threat detection (ML-based) |
| Security Hub | findings aggregation |
| Inspector | vulnerability scanning |
| Macie | sensitive data discovery in S3 |
| WAF | web application firewall |
| Shield | DDoS protection |
| Secrets Manager | secret storage |
| KMS | key management |
4. Pillar 3: Reliability
4.1 Core
"The ability of a workload to perform its intended function correctly and consistently."
4.2 Design Principles
- Automatically recover from failure — auto-healing.
- Test recovery procedures — chaos engineering.
- Scale horizontally — no single giant instance.
- Stop guessing capacity — Auto Scaling.
- Manage change through automation — IaC.
4.3 AWS Availability Model
- Region: geographic area (us-east-1, ap-northeast-2).
- Availability Zone (AZ): one or more isolated datacenters inside a region (most regions have three or more AZs).
- Edge Location: CloudFront PoP.
- Multi-AZ: spread across AZs to survive single-AZ failure.
- Multi-Region: spread across regions to survive natural disasters or regional outages.
4.4 Availability Targets
| Availability | Downtime/year | Downtime/month |
|---|---|---|
| 99% | 87.6 h | 7.3 h |
| 99.9% (three nines) | 8.76 h | 43.8 min |
| 99.95% | 4.38 h | 21.9 min |
| 99.99% (four nines) | 52.6 min | 4.38 min |
| 99.999% (five nines) | 5.26 min | 26.3 s |
Reality:
- 99.9% is hard on a single instance.
- 99.99% requires Multi-AZ + Auto Scaling.
- 99.999% requires Multi-Region with sophisticated failover.
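The downtime budget behind each availability target is simple arithmetic: allowed downtime = (1 - availability) x period. A minimal sketch, assuming a 365-day year (8,760 hours) and an average month of 730 hours:

```python
HOURS_PER_YEAR = 8760   # 365 days
HOURS_PER_MONTH = 730   # 8760 / 12


def downtime_seconds(availability_pct, period_hours):
    """Allowed downtime in seconds for a given availability percentage."""
    return (1 - availability_pct / 100) * period_hours * 3600


for pct in (99, 99.9, 99.95, 99.99, 99.999):
    per_year_min = downtime_seconds(pct, HOURS_PER_YEAR) / 60
    per_month_min = downtime_seconds(pct, HOURS_PER_MONTH) / 60
    print(f"{pct}%: {per_year_min:.1f} min/year, {per_month_min:.2f} min/month")
```

Running the numbers this way makes it clear how quickly the budget shrinks: each extra nine divides the allowed downtime by ten.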
4.5 Practical Pattern
ELB + Auto Scaling Group + Multi-AZ:
[Route 53]
|
[ALB (multi-AZ)]
/ | \
[EC2 az-a] [EC2 az-b] [EC2 az-c]
| | |
[RDS Multi-AZ Standby]
- EC2 dies: ASG launches a replacement.
- AZ dies: ALB routes traffic to the remaining AZs.
- Primary RDS dies: automatic failover to standby.
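Why redundancy helps can be shown with the usual availability algebra: components in series multiply their availabilities, while redundant components fail only if all of them fail. The sketch below assumes failures are independent, which real AZs only approximate, and the per-component figures are hypothetical.

```python
from math import prod


def serial(*availabilities):
    """All components must be up: availabilities multiply."""
    return prod(availabilities)


def parallel(*availabilities):
    """Any one component suffices: the system fails only if all fail."""
    return 1 - prod(1 - a for a in availabilities)


# Hypothetical figures: each AZ's EC2 tier at 99%, load balancer and database at 99.99%
compute_tier = parallel(0.99, 0.99, 0.99)
whole_stack = serial(0.9999, compute_tier, 0.9999)

print(f"compute tier across 3 AZs: {compute_tier:.6f}")
print(f"load balancer + compute + database: {whole_stack:.6f}")
```

Note that the serial chain is never better than its weakest link, so adding AZs to the compute tier does not help if the database remains a single point of failure.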
4.6 RTO and RPO
- RTO (Recovery Time Objective): acceptable time until recovery.
- RPO (Recovery Point Objective): acceptable data loss.
| Strategy | RTO | RPO | Cost |
|---|---|---|---|
| Backup & Restore | hours | hours | low |
| Pilot Light | minutes | minutes | medium |
| Warm Standby | minutes | seconds | high |
| Multi-Site Active-Active | near zero | near zero | very high |
Pick the strategy that matches the RTO/RPO the business can absorb.
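That selection rule can be sketched as "take the cheapest strategy whose RTO and RPO both fit the business limit." The numeric values below are illustrative assumptions that stand in for the table's "hours / minutes / seconds" and should be replaced with figures measured in your own recovery drills.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Strategy:
    name: str
    rto_minutes: float
    rpo_minutes: float


# Ordered from cheapest to most expensive; the numbers are hypothetical.
STRATEGIES = [
    Strategy("Backup & Restore", rto_minutes=480, rpo_minutes=240),
    Strategy("Pilot Light", rto_minutes=60, rpo_minutes=15),
    Strategy("Warm Standby", rto_minutes=10, rpo_minutes=1),
    Strategy("Multi-Site Active-Active", rto_minutes=1, rpo_minutes=0),
]


def cheapest_strategy(max_rto_minutes, max_rpo_minutes) -> Optional[Strategy]:
    """Return the first (cheapest) strategy that meets both limits, or None."""
    for strategy in STRATEGIES:
        if (strategy.rto_minutes <= max_rto_minutes
                and strategy.rpo_minutes <= max_rpo_minutes):
            return strategy
    return None


choice = cheapest_strategy(max_rto_minutes=30, max_rpo_minutes=5)
print(choice.name if choice else "no strategy meets these targets")
```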
5. Pillar 4: Performance Efficiency
5.1 Core
"The ability to use computing resources efficiently to meet requirements, and to maintain that efficiency as demand and technology evolve."
5.2 Design Principles
- Democratize advanced technologies — AI/ML and big data for everyone.
- Go global — get close to users.
- Prefer serverless — less operational burden.
- Experiment more often — comparison tests.
- Have mechanical sympathy — pick the right tool.
5.3 Compute Choices
| Option | Use case |
|---|---|
| Lambda | short, event-driven tasks |
| Fargate | containers without server management |
| EC2 | long-running or custom environments |
| ECS/EKS | container orchestration |
| Batch | large batch jobs |
| Lightsail | simple web hosting |
5.4 Storage Choices
| Option | Use |
|---|---|
| S3 | objects, backups, static assets |
| EBS | EC2 block storage |
| EFS | shared filesystem (NFS) |
| FSx | Windows / Lustre filesystem |
| Glacier | long-term archive |
5.5 Database Choices
| Option | Use case |
|---|---|
| RDS | classic relational |
| Aurora | high-performance, MySQL/PostgreSQL-compatible |
| DynamoDB | NoSQL, single-digit-millisecond latency |
| DocumentDB | MongoDB-compatible |
| ElastiCache | Redis/Memcached cache |
| Neptune | graph DB |
| Timestream | time series |
| OpenSearch | search, log analytics |
5.6 Caching
Layered caching:
- CloudFront (CDN) — global edge.
- API Gateway cache — API responses.
- ElastiCache (Redis) — application cache.
- DynamoDB DAX — DynamoDB cache.
- RDS Read Replica — offload read traffic.
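Result: with a high cache hit ratio, response times can fall by 90% or more and DB load drops dramatically; the actual gain depends on the hit ratio and on how slow the origin is.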
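How much a cache layer helps follows from a simple expected-value model: a hit costs only the cache lookup, while a miss costs the lookup plus the trip to the origin. A minimal sketch with hypothetical latencies (2 ms for the cache, 50 ms for the database):

```python
def effective_latency_ms(hit_ratio, cache_ms, origin_ms):
    """Expected latency when misses pay for the cache lookup and the origin call."""
    if not 0 <= hit_ratio <= 1:
        raise ValueError("hit_ratio must be between 0 and 1")
    return hit_ratio * cache_ms + (1 - hit_ratio) * (cache_ms + origin_ms)


def origin_load_fraction(hit_ratio):
    """Share of requests that still reach the origin."""
    return 1 - hit_ratio


for hit_ratio in (0.0, 0.5, 0.9, 0.99):
    latency = effective_latency_ms(hit_ratio, cache_ms=2, origin_ms=50)
    load = origin_load_fraction(hit_ratio)
    print(f"hit ratio {hit_ratio:.2f}: {latency:.1f} ms, origin load {load:.0%}")
```

Under these assumed numbers a 90% hit ratio brings the average from about 50 ms down to about 7 ms, which is why measuring the hit ratio matters more than simply adding another cache tier.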
6. Pillar 5: Cost Optimization
6.1 Core
"The ability to run systems that deliver business value at the lowest price point."
6.2 Design Principles
- Implement cloud financial management — FinOps.
- Adopt a consumption model — pay for what you use.
- Measure overall efficiency — cost per business metric.
- Stop spending on undifferentiated heavy lifting — use managed services.
- Analyze and attribute expenditure — tagging and allocation.
6.3 Pricing Models
Compute (EC2):
- On-Demand: most expensive, most flexible.
- Reserved Instance (1- or 3-year term): up to 72% savings.
- Savings Plans: more flexible commitments.
- Spot: up to 90% savings, interruptible.
Storage (S3):
- Standard: frequent access.
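- Standard-IA: infrequent access, roughly 45% cheaper per GB than Standard at us-east-1 list prices, plus per-GB retrieval fees.
- Glacier (Flexible Retrieval): long-term retention, roughly 85% cheaper per GB.
- Glacier Deep Archive: cheapest, 95%+ cheaper per GB; retrievals take hours.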
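To see how a mix of compute pricing models affects the bill, the sketch below computes a blended hourly rate. Every number is an assumption for illustration: a $0.10/hour On-Demand rate, a 30% commitment discount, and a 70% Spot discount. Real discounts vary by instance family, region, and term.

```python
def blended_hourly_rate(on_demand_rate, mix, discounts):
    """Weighted hourly rate for a fleet split across pricing models.

    mix: share of capacity per model (must sum to 1).
    discounts: fractional discount per model relative to On-Demand.
    """
    if abs(sum(mix.values()) - 1.0) > 1e-9:
        raise ValueError("mix shares must sum to 1")
    return sum(
        share * on_demand_rate * (1 - discounts[model])
        for model, share in mix.items()
    )


# Hypothetical discounts and capacity split
DISCOUNTS = {"on_demand": 0.0, "savings_plan": 0.30, "spot": 0.70}
MIX = {"on_demand": 0.2, "savings_plan": 0.5, "spot": 0.3}

rate = blended_hourly_rate(0.10, MIX, DISCOUNTS)
print(f"blended rate: ${rate:.3f}/hour vs $0.100/hour all On-Demand")
```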
6.4 Savings Strategies
1. Right-sizing:
aws compute-optimizer get-ec2-instance-recommendations
For example, an EC2 instance that averages 50% utilization and has low peaks is a candidate for a smaller instance size.
2. Auto Scaling: contract when traffic is low, expand when it spikes. Cost scales with usage.
3. Spot Instances: ideal for batch and CI/CD workloads, up to 90% off; design for interruption.
4. Reserved Instances / Savings Plans: for steady workloads — roughly 30% savings for a 1-year commit and 60% for 3 years.
5. S3 Lifecycle Policies:
{
  "Rules": [{
    "ID": "MoveToIA",
    "Filter": { "Prefix": "" },
    "Status": "Enabled",
    "Transitions": [
      { "Days": 30, "StorageClass": "STANDARD_IA" },
      { "Days": 90, "StorageClass": "GLACIER" },
      { "Days": 365, "StorageClass": "DEEP_ARCHIVE" }
    ]
  }]
}
6. Cut data transfer cost: same-region transfers are free or cheap; prefer CloudFront over direct S3 egress; use VPC Endpoints to skip NAT Gateway charges.
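The lifecycle rule in item 5 can be read as a function from object age to storage class. The sketch below mirrors the 30/90/365-day transitions from that rule; it is a simplified model for reasoning about the policy, not a reproduction of how S3 schedules transitions (actual moves can lag the configured day).

```python
# (age in days, target storage class), taken from the lifecycle rule in item 5
TRANSITIONS = [
    (30, "STANDARD_IA"),
    (90, "GLACIER"),
    (365, "DEEP_ARCHIVE"),
]


def storage_class_at(age_days, transitions=TRANSITIONS):
    """Storage class an object should be in once it is age_days old."""
    current = "STANDARD"
    for threshold_days, storage_class in sorted(transitions):
        if age_days >= threshold_days:
            current = storage_class
    return current


for age in (0, 30, 100, 400):
    print(f"day {age:>3}: {storage_class_at(age)}")
```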
6.5 Cost Monitoring
| Tool | Use |
|---|---|
| Cost Explorer | cost visualization |
| Budgets | budget alerts |
| Cost and Usage Reports | detailed data |
| Trusted Advisor | cost recommendations |
| Compute Optimizer | EC2 sizing suggestions |
| Savings Plans Recommendations | savings suggestions |
6.6 Tagging Strategy
Environment: production
Project: ecommerce
Owner: team-checkout
CostCenter: 1234
Then break down cost by tag in Cost Explorer.
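The same breakdown can be sketched offline against exported billing rows. The record shape below (`cost` plus a `tags` dictionary) is a hypothetical simplification, not the Cost and Usage Report schema; the point is that untagged spend shows up as its own bucket.

```python
from collections import defaultdict


def cost_by_tag(records, tag_key):
    """Sum cost per value of tag_key; resources missing the tag go to 'untagged'."""
    totals = defaultdict(float)
    for record in records:
        totals[record["tags"].get(tag_key, "untagged")] += record["cost"]
    return dict(totals)


# Hypothetical billing rows
records = [
    {"cost": 120.0, "tags": {"Project": "ecommerce", "Owner": "team-checkout"}},
    {"cost": 80.0, "tags": {"Project": "ecommerce", "Owner": "team-search"}},
    {"cost": 40.5, "tags": {}},  # forgotten resource with no tags
]

print(cost_by_tag(records, "Project"))
print(cost_by_tag(records, "Owner"))
```

A large "untagged" bucket is usually the first thing a tagging policy needs to fix.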
7. Pillar 6: Sustainability (added 2021)
7.1 Core
"Addresses the environmental impact, particularly energy consumption and efficiency."
7.2 Design Principles
- Understand your impact — where does the carbon come from?
- Set sustainability goals — measurable targets.
- Maximize utilization — no idle resources.
- Adopt more efficient new technology — Graviton, ARM.
- Use managed services — ride AWS's efficiency.
- Reduce downstream impact — lighter clients too.
7.3 Sustainability Metrics
- Carbon footprint.
- Energy per user.
- Idle-resource ratio.
- Data-transfer optimization.
7.4 Practical Actions
- Efficient instance types: AWS states that Graviton (ARM) instances use up to 60% less energy than comparable EC2 instances for the same performance.
- Auto Scaling: shut down what you are not using; auto-stop dev environments overnight.
- Object lifecycle: move cold data to cold storage and delete the rest.
- CDN: serve closer to users — fewer network hops, less energy.
- Region selection: favor regions with high renewable-energy ratios; the AWS Customer Carbon Footprint Tool reports your estimated emissions so you can track the effect.
7.5 AWS Commitments
- 100% renewable energy by 2025 (Amazon reports having matched 100% of its electricity use with renewable energy in 2023).
- Net-zero carbon by 2040.
- Efficient datacenter design.
8. Handling Trade-offs
8.1 Pillars in Tension
- Security vs. Performance: strong crypto has CPU cost; decide by data sensitivity.
- Cost vs. Reliability: Multi-Region can roughly double infrastructure cost; decide by SLA.
- Performance vs. Cost: bigger instance, faster and pricier; decide by latency targets.
- Ops Excellence vs. Cost: automation front-loads cost; decide by long-term ROI.
8.2 Decision Framework
For each decision, ask:
- Business goal — what matters most?
- Per-pillar impact — which pillars are affected?
- Trade-off — what are you giving up?
- Risk — what if it goes wrong?
- Reversibility — how expensive is changing later?
8.3 Example
Scenario: an e-commerce site serving 1M daily users.
| Pillar | Priority | Decision |
|---|---|---|
| Reliability | High | Multi-AZ, Auto Scaling |
| Performance | High | CloudFront + ElastiCache |
| Security | High | WAF, KMS, MFA |
| Cost | Medium | Reserved + Spot mix |
| Operational | Medium | IaC, CI/CD |
| Sustainability | Low | Graviton |
9. Using the Well-Architected Tool
9.1 Free Self-Assessment
In the AWS Console:
- Define the workload (name, environment, region).
- Answer ~50 questions across the six pillars.
- Identify improvement areas.
- Prioritize.
- Re-assess after changes.
9.2 Lenses
A Lens is additional guidance specialized for a particular workload or technology.
| Lens | Target |
|---|---|
| Serverless Lens | Lambda, API Gateway |
| SaaS Lens | multi-tenant SaaS |
| Machine Learning Lens | ML workloads |
| Foundational Technical Review | AWS Partners |
| IoT Lens | IoT systems |
| Streaming Media Lens | media |
9.3 Well-Architected Review
Run with an AWS Solutions Architect (free or via a partner):
- Architecture walk-through.
- Six-pillar assessment.
- Prioritized recommendations.
- Improvement roadmap.
10. Anti-Pattern Catalog
10.1 Operations
- BAD: manual deploys → human error. GOOD: CI/CD pipeline.
- BAD: only local testing. GOOD: production-like environments.
- BAD: no postmortems. GOOD: blameless postmortems.
10.2 Security
- BAD: using the root account. GOOD: IAM users and roles.
- BAD: wildcard permissions. GOOD: least privilege.
- BAD: hard-coded credentials. GOOD: Secrets Manager / Parameter Store.
- BAD: HTTP only. GOOD: enforced HTTPS, HSTS.
10.3 Reliability
- BAD: single-AZ deployment. GOOD: Multi-AZ minimum.
- BAD: no backups or untested restores. GOOD: regular backups with restore drills.
- BAD: single points of failure. GOOD: redundancy across every layer.
10.4 Performance
- BAD: oversized instances for everything. GOOD: workload-appropriate sizing.
- BAD: no caching. GOOD: CloudFront + ElastiCache.
- BAD: everything synchronous. GOOD: async where it fits.
10.5 Cost
- BAD: everything On-Demand. GOOD: mix Reserved / Savings / Spot.
- BAD: no tagging. GOOD: consistent tagging strategy.
- BAD: abandoned resources. GOOD: regular cleanup, Trusted Advisor.
Quiz
1. What are the six pillars of the Well-Architected Framework?
Answer: Operational Excellence (efficient operations), Security (protect data and systems), Reliability (recover from failure), Performance Efficiency (use resources efficiently), Cost Optimization (value per dollar), and Sustainability (environmental impact, added in 2021). It began with five pillars; sustainability was added as environmental awareness grew. Because no workload can satisfy every pillar 100%, trade-offs follow business priorities.
2. Difference between RTO and RPO?
Answer: RTO (Recovery Time Objective) is the acceptable time until recovery: "must be back online within 30 minutes." RPO (Recovery Point Objective) is the acceptable data loss: "at most 5 minutes of data loss is OK." Both come from the business. Tightening RTO/RPO raises cost sharply: synchronous replication can bring RPO close to 0 but is expensive. The right targets vary by workload; many non-critical systems settle for something like RTO 30 min and RPO 1 hour.
3. How do you use Spot Instances safely?
Answer: Spot instances can be terminated at any time (typically with a 2-minute notice). Safe use: (1) stateless workloads — no data loss on termination; (2) checkpointing — persist progress; (3) Spot Fleet — mix of instance types; (4) Auto Scaling Group with mixed instances — balance On-Demand and Spot; (5) graceful shutdown handlers. Ideal for CI/CD, batch, and analytics. Up to 90% savings.
4. Key sustainability practices?
Answer: (1) Graviton (ARM): AWS states up to 60% less energy than comparable EC2 instances for the same performance; (2) Auto Scaling: turn off unused capacity; (3) S3 lifecycle: move old data to cold storage; (4) CDN: fewer network hops; (5) choose renewable-heavy regions and track your emissions with the AWS Customer Carbon Footprint Tool; (6) delete idle resources such as unused EBS volumes or idle EC2. Efficient code means less carbon and lower cost, so cost savings and sustainability align naturally.
5. Multi-AZ vs. Multi-Region?
Answer: Multi-AZ spreads across Availability Zones within one region; automatic failover on single-AZ failure; supported by most AWS services (RDS, ELB, ASG); modest extra cost (a standby RDS instance and inter-AZ data transfer are billed). Multi-Region spreads across regions to survive natural disasters, large-scale outages, or compliance demands; it can roughly double the cost and adds complex data replication (DynamoDB Global Tables, Aurora Global Database). 99.99% is achievable with Multi-AZ; 99.999% typically requires Multi-Region.