AWS Well-Architected Framework 완전 가이드 2025: 6개 기둥, 실전 적용, 비용/보안/성능

TL;DR

6개 기둥: Operational Excellence, Security, Reliability, Performance Efficiency, Cost Optimization, Sustainability
트레이드오프: 모든 기둥을 100% 만족시킬 수 없음. 비즈니스 우선순위에 따라 결정
Well-Architected Tool: AWS 콘솔에서 무료 자체 평가
Sustainability (2021 추가): 환경 영향 고려. 효율적 코드 = 적은 탄소
모든 클라우드에 적용 가능: AWS의 프레임워크지만 GCP/Azure에서도 유효

1. Well-Architected Framework란?

1.1 배경

AWS distilled this best-practice guide from reviewing tens of thousands of customer architectures. It had five pillars until Sustainability was added in 2021, making six.

1.2 6개 기둥 (2025)

기둥	핵심 질문
Operational Excellence	어떻게 효율적으로 운영하나?
Security	어떻게 데이터와 시스템을 보호하나?
Reliability	장애에서 어떻게 회복하나?
Performance Efficiency	자원을 어떻게 효율적으로 사용하나?
Cost Optimization	어떻게 가치 대비 비용을 최소화하나?
Sustainability	환경 영향을 어떻게 최소화하나?

1.3 일반 원칙

5가지 일반 원칙 (모든 기둥에 적용):

추측 대신 측정 — 데이터 기반 결정
프로덕션 규모로 테스트 — staging은 작은 시스템, 프로덕션처럼 테스트
자동화로 실험 빈도 증가 — 수동 작업 = 인적 실수
진화하는 아키텍처 허용 — 한 번 만든 것이 영원하지 않음
게임 데이로 운영 개선 — Chaos engineering

2. 기둥 1: Operational Excellence

2.1 핵심

"비즈니스 가치를 전달하기 위해 시스템을 효과적으로 운영하고 모니터링하며, 지속적으로 개선하는 능력."

2.2 설계 원칙

운영을 코드로 (Operations as Code) — 인프라, 정책, 절차를 모두 코드화
작고 빈번한 변경 — 큰 배포는 위험. 작은 단위로 자주
Refine operations procedures frequently: retrospectives, runbook updates
장애 예측 — 가능한 실패 모드 파악
모든 운영 실패에서 배움 — 포스트모템

2.3 실전 체크리스트

계획:

비즈니스 목표가 명확한가?
메트릭으로 성공을 측정할 수 있는가?
운영 우선순위가 정의되어 있는가?

준비:

모든 인프라가 IaC (CloudFormation/Terraform)로?
CI/CD 파이프라인이 있는가?
모니터링과 알림이 설정되어 있는가?

운영:

Are the runbooks up to date?
On-call 절차가 명확한가?
장애 대응 시간이 측정되는가?

진화:

정기적인 retrospective?
카오스 엔지니어링 실시?
메트릭 기반 개선?

2.4 AWS 도구

CloudFormation / CDK — IaC
Systems Manager — 자동화 (런북, 패치)
CloudWatch — 모니터링, 알림
X-Ray — 분산 트레이싱
Config — 리소스 변경 추적

3. 기둥 2: Security

3.1 핵심

"데이터, 시스템, 자산을 보호하면서 클라우드 기술로 비즈니스 가치를 전달."

3.2 설계 원칙

강력한 신원 기반 구현 — IAM, MFA, 최소 권한
모든 레이어에서 보안 적용 — Defense in depth
저장 중 + 전송 중 데이터 암호화 — Encryption everywhere
데이터에 사람 접근 제한 — 자동화 우선
보안 이벤트 대비 — IR (Incident Response) 계획
공유 책임 모델 이해 — AWS는 클라우드의 보안, 고객은 클라우드 안의 보안

3.3 IAM 모범 사례

❌ root 계정 사용
❌ 액세스 키를 코드에
❌ * 권한 부여
❌ 공유 사용자

✅ MFA 모든 사용자
✅ Role 기반 권한 (IAM Roles)
✅ 최소 권한 원칙
✅ Access Analyzer 사용
✅ 임시 자격증명 (STS)

3.4 데이터 보호

저장 중 (At rest):

S3: 기본 SSE-S3, 또는 SSE-KMS
EBS: 기본 암호화 활성화
RDS: KMS 키로 암호화
DynamoDB: 항상 암호화 (기본)

전송 중 (In transit):

TLS 1.2+ 필수
VPC 내부도 암호화 (Service mesh, mTLS)
VPN/Direct Connect

3.5 네트워크 보안

[Public Subnet]    ← Internet Gateway
    ↓
[Private Subnet]   ← NAT Gateway (egress only)
    ↓
[Isolated Subnet]  ← DB만, 인터넷 없음

Security Group + NACL:

Security Group: 인스턴스 레벨, stateful
NACL: 서브넷 레벨, stateless

3.6 AWS 보안 도구

도구	용도
IAM	접근 제어
GuardDuty	위협 탐지 (ML 기반)
Security Hub	보안 발견 사항 통합
Inspector	취약점 스캔
Macie	S3 민감 데이터 탐지
WAF	웹 애플리케이션 방화벽
Shield	DDoS 보호
Secrets Manager	비밀 관리
KMS	키 관리

4. 기둥 3: Reliability

4.1 핵심

"워크로드가 의도된 대로 정확하고 일관되게 기능을 수행하는 능력."

4.2 설계 원칙

장애에서 자동 복구 — Auto-healing
복구 절차 테스트 — Chaos engineering
수평 확장으로 가용성 증가 — 단일 큰 인스턴스 X
용량 추측 중단 — Auto Scaling
자동화로 변경 관리 — IaC

4.3 AWS 가용성 모델

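Region: a geographic area (us-east-1, ap-northeast-2).
Availability Zone (AZ): one or more discrete data centers with independent power and networking inside a region (most regions have three or more).
Edge Location: a CloudFront point of presence (PoP).

Multi-AZ: spread across AZs to survive a single-AZ failure.
Multi-Region: spread across regions to survive natural disasters or region-wide outages.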

4.4 가용성 목표

Availability	Downtime/year	Downtime/month
99%	87.6 h	7.3 h
99.9% (three nines)	8.76 h	43.8 min
99.95%	4.38 h	21.9 min
99.99% (four nines)	52.6 min	4.38 min
99.999% (five nines)	5.26 min	26.3 s

현실:

99.9% = 단일 인스턴스로 달성 어려움
99.99% = Multi-AZ + Auto Scaling
99.999% = Multi-Region + 정교한 failover

4.5 실전 패턴

ELB + Auto Scaling Group + Multi-AZ:

        [Route 53]
            ↓
       [ALB (multi-AZ)]
       /        |        \
   [EC2 az-a] [EC2 az-b] [EC2 az-c]
       ↓        ↓        ↓
       [RDS Multi-AZ Standby]

If an EC2 instance dies: the ASG launches a replacement automatically.
If an AZ dies: the ALB routes traffic to the remaining AZs.
If the primary RDS instance dies: automatic failover to the standby.

4.6 RTO와 RPO

RTO (Recovery Time Objective): acceptable time until recovery.
RPO (Recovery Point Objective): acceptable data loss.

Strategy	RTO	RPO	Cost
Backup & Restore	hours	hours	low
Pilot Light	minutes	minutes	medium
Warm Standby	minutes	seconds	high
Multi-Site Active-Active	near zero	near zero	very high

비즈니스가 견딜 수 있는 RTO/RPO를 정한 후 그에 맞는 전략 선택.

5. 기둥 4: Performance Efficiency

5.1 핵심

"시스템 요구사항을 충족하기 위해 컴퓨팅 리소스를 효율적으로 사용하고, 수요 변화와 기술 진화에 따라 그 효율성을 유지하는 능력."

5.2 설계 원칙

고급 기술의 민주화 — AI/ML, 빅데이터를 모두 사용 가능
글로벌 배포 — 사용자에 가까이
서버리스 우선 — 관리 부담 감소
더 자주 실험 — 비교 테스트
기술 친화 — 워크로드에 맞는 도구 선택

5.3 컴퓨트 선택

옵션	사용 사례
Lambda	짧은 작업, 이벤트 드리븐
Fargate	컨테이너, 서버 관리 X
EC2	긴 작업, 커스텀 환경
ECS/EKS	컨테이너 오케스트레이션
Batch	대량 배치 작업
Lightsail	단순 웹 호스팅

5.4 스토리지 선택

옵션	용도
S3	객체 저장, 백업, 정적 파일
EBS	EC2 블록 스토리지
EFS	공유 파일시스템 (NFS)
FSx	Windows/Lustre 파일시스템
Glacier	장기 아카이브

5.5 데이터베이스 선택

옵션	사용 사례
RDS	전통적 관계형
Aurora	고성능 RDS 호환
DynamoDB	NoSQL, 단일 ms latency
DocumentDB	MongoDB 호환
ElastiCache	Redis/Memcached 캐시
Neptune	그래프 DB
Timestream	시계열
OpenSearch	검색, 로그 분석

5.6 캐싱

계층적 캐싱:

CloudFront (CDN) — 전 세계 edge
API Gateway 캐시 — API 응답
ElastiCache (Redis) — 애플리케이션 캐시
DynamoDB DAX — DynamoDB 캐시
RDS Read Replica — 읽기 부하 분산

효과: 응답 시간 90%+ 감소, DB 부하 극단 감소.

6. 기둥 5: Cost Optimization

6.1 핵심

"최소 비용으로 비즈니스 가치를 전달하는 시스템."

6.2 설계 원칙

클라우드 재정 관리 구현 — FinOps
소비 모델 채택 — 사용한 만큼만
전체 효율 측정 — Cost per business metric
데이터센터 운영의 무거운 짐 해소 — 관리형 서비스 사용
비용 분석과 책임 — 태깅, 분배

6.3 가격 모델

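Compute (EC2):

On-Demand: most expensive, most flexible.
Reserved Instance (1- or 3-year term): up to 72% savings.
Savings Plans: similar discounts with more flexible commitments.
Spot: up to 90% savings, interruptible.

Storage (S3):

Standard: frequent access.
Standard-IA: infrequent access, roughly 40-45% cheaper per GB stored than Standard, with retrieval fees and a 30-day minimum.
Glacier (Flexible Retrieval): long-term retention, roughly 80% cheaper per GB stored, with slower and billed retrieval.
Glacier Deep Archive: cheapest, 95%+ cheaper per GB stored.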

6.4 비용 절감 전략

1. 우측 사이징 (Right-sizing):

aws compute-optimizer get-ec2-instance-recommendations

50% 사용률 EC2를 → 더 작은 인스턴스로.

2. Auto Scaling:

트래픽 적을 때 자동 축소
트래픽 많을 때 자동 확장
비용 = 사용량에 비례

3. Spot Instances:

배치 작업, CI/CD에 적합
90% 절감 가능
중단에 대비한 설계 필요

4. Reserved Instances / Savings Plans:

안정적 워크로드에 적합
1년 약정: 30% 절감
3년 약정: 60% 절감

5. S3 Lifecycle Policies:

{
  "Rules": [{
    "ID": "MoveToIA",
    "Status": "Enabled",
    "Filter": { "Prefix": "" },
    "Transitions": [
      { "Days": 30, "StorageClass": "STANDARD_IA" },
      { "Days": 90, "StorageClass": "GLACIER" },
      { "Days": 365, "StorageClass": "DEEP_ARCHIVE" }
    ]
  }]
}

6. 데이터 전송 비용 줄이기:

같은 region 내 transfer는 무료/저렴
CloudFront 사용 (S3 직접보다 저렴)
VPC Endpoints (NAT Gateway 회피)

6.5 비용 모니터링

도구	용도
Cost Explorer	비용 분석 시각화
Budgets	예산 알림
Cost and Usage Reports	상세 데이터
Trusted Advisor	비용 최적화 권장
Compute Optimizer	EC2 사이즈 추천
Savings Plans Recommendations	절감 추천

6.6 태깅 전략

Environment: production
Project: ecommerce
Owner: team-checkout
CostCenter: 1234

→ Cost Explorer에서 태그별 비용 분석 가능.

7. 기둥 6: Sustainability (2021 추가)

7.1 핵심

"환경 영향, 특히 에너지 소비와 효율성을 다룸."

7.2 설계 원칙

영향 이해 — 어디서 탄소를 배출하나?
지속가능성 목표 설정 — 측정 가능한 목표
사용량 극대화 — 유휴 자원 최소화
새 효율적 기술 채택 — Graviton, ARM
관리형 서비스 사용 — AWS의 효율성 활용
다운스트림 영향 줄이기 — 사용자에게 효율적

7.3 지속가능성 메트릭

탄소 발자국 (Carbon Footprint)
사용자당 에너지 (가장 효율적)
유휴 자원 비율
데이터 전송 최적화

7.4 실전 적용

1. 효율적인 instance type 선택:

Graviton (Arm): per AWS, up to 60% less energy than comparable EC2 instances for the same performance

2. 자동 스케일링:

사용 안 할 때 종료
야간 자동 종료 (개발 환경)

3. Object Lifecycle:

오래된 데이터를 cold storage로
사용 안 하는 데이터 삭제

4. CDN 사용:

사용자에 가까이 → 적은 네트워크 = 적은 에너지

5. Region 선택:

재생 에너지 비율이 높은 region 선택
AWS Customer Carbon Footprint Tool

7.5 AWS의 약속

2025년까지 100% 재생 에너지
2040년까지 net-zero carbon
효율적 데이터센터 설계

8. 트레이드오프 다루기

8.1 기둥 간 충돌

보안 vs 성능:

강한 암호화 → 약간의 CPU 비용
결정: 데이터 민감도에 따라

비용 vs 신뢰성:

Multi-region = 비용 2배+
결정: SLA에 따라

성능 vs 비용:

큰 인스턴스 = 빠름 + 비쌈
결정: 응답 시간 요구사항

운영 우수성 vs 비용:

모든 자동화 = 초기 비용
결정: 장기 ROI

8.2 의사결정 프레임워크

각 결정에 대해:

비즈니스 목표: 무엇이 중요한가?
각 기둥에 대한 영향: 어떤 기둥이 영향 받나?
트레이드오프: 무엇을 포기하나?
위험: 잘못되면?
반복 가능성: 결정 변경 비용?

8.3 예시

시나리오: e-commerce 사이트, 일일 100만 사용자

기둥	우선순위	결정
Reliability	높음	Multi-AZ, Auto Scaling
Performance	높음	CloudFront + ElastiCache
Security	높음	WAF, KMS, MFA
Cost	중간	Reserved + Spot 혼합
Operational	중간	IaC, CI/CD
Sustainability	낮음	Graviton 사용

9. Well-Architected Tool 사용

9.1 무료 자체 평가

AWS 콘솔에서 Well-Architected Tool 사용:

워크로드 정의 (이름, 환경, region)
6개 기둥에 대한 질문 답변 (~50개)
개선 영역 식별
우선순위 결정
개선 후 재평가

9.2 Lens

Lens = 특정 워크로드/기술에 특화된 추가 가이드.

Lens	대상
Serverless Lens	Lambda, API Gateway 기반
SaaS Lens	멀티테넌트 SaaS
Machine Learning Lens	ML 워크로드
Foundational Technical Review	AWS Partner
IoT Lens	IoT 시스템
Streaming Media Lens	미디어

9.3 Well-Architected Review

AWS Solutions Architect와 함께 (무료 또는 파트너):

아키텍처 검토
6개 기둥 평가
우선순위 권장사항
개선 로드맵

10. 안티패턴 카탈로그

10.1 운영 안티패턴

❌ 수동 배포 — 인적 실수의 원인 ✅ CI/CD 파이프라인

❌ 로컬에서만 테스트 ✅ 프로덕션 유사 환경에서 테스트

❌ 장애 후 회고 없음 ✅ Blameless postmortem

10.2 보안 안티패턴

❌ 루트 계정 사용 ✅ IAM 사용자 + Role

❌ Wildcard (*:*) permissions for every user ✅ Least-privilege principle

❌ 하드코딩된 자격증명 ✅ Secrets Manager / Parameter Store

❌ HTTP만 사용 ✅ HTTPS 강제, HSTS

10.3 신뢰성 안티패턴

❌ 단일 AZ 배포 ✅ Multi-AZ 최소

❌ 백업 없음 또는 복원 테스트 없음 ✅ 정기 백업 + 복원 훈련

❌ 단일 장애점 (single point of failure) ✅ 모든 컴포넌트 다중화

10.4 성능 안티패턴

❌ 대형 인스턴스로 모든 것 ✅ 워크로드별 적절한 인스턴스

❌ 캐싱 없음 ✅ CloudFront + ElastiCache

❌ 모든 것이 동기 ✅ 적절한 비동기 처리

10.5 비용 안티패턴

❌ 모든 것을 On-Demand ✅ Reserved/Savings/Spot 혼합

❌ 태그 없음 ✅ 일관된 태깅 전략

❌ 사용 안 하는 자원 방치 ✅ 정기적 정리, AWS Trusted Advisor

퀴즈

1. Well-Architected Framework의 6개 기둥은?

답: (1) Operational Excellence — 효율적 운영, (2) Security — 데이터/시스템 보호, (3) Reliability — 장애 회복, (4) Performance Efficiency — 자원 효율, (5) Cost Optimization — 가치 대비 비용, (6) Sustainability — 환경 영향 (2021 추가). 5개에서 시작했지만 환경 의식 증가로 Sustainability 추가. 모든 기둥이 동시에 100% 만족 불가 → 비즈니스 우선순위에 따라 트레이드오프.

2. RTO와 RPO의 차이는?

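Answer: RTO (Recovery Time Objective) is the acceptable time until recovery after a failure, for example "must be back within 30 minutes." RPO (Recovery Point Objective) is the acceptable data loss, for example "losing up to 5 minutes of data is OK." Both are derived from business requirements. Cost rises steeply as RTO/RPO shrink: real-time replication drives RPO toward zero but is expensive. Targets vary widely by workload, so derive them from the cost of downtime rather than from a rule of thumb.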

3. Spot Instance를 어떻게 안전하게 사용하나요?

답: Spot은 언제든 종료될 수 있음 (보통 2분 알림). 안전한 사용: (1) 상태 비저장 워크로드 — 종료 시 데이터 손실 없음, (2) 체크포인트 — 작업 진행 상태 저장, (3) Spot Fleet — 여러 instance type 혼합, (4) Auto Scaling Group with Mixed Instances — On-Demand + Spot 자동 균형, (5) Graceful shutdown 핸들러 — 종료 알림 받으면 정리. CI/CD, 배치 작업, 분석 작업에 이상적. 90% 비용 절감 가능.

4. Sustainability 기둥의 핵심 실천은?

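Answer: (1) Graviton (Arm): per AWS, up to 60% less energy than comparable EC2 instances for the same performance; (2) Auto Scaling: scale in when capacity is unused; (3) S3 Lifecycle: move old data to cold storage; (4) CDN: serve close to users, so less network traffic; (5) favor regions with a high share of renewable energy, and track your estimated emissions with the AWS Customer Carbon Footprint Tool; (6) clean up idle resources such as unused EBS volumes and idle EC2 instances. Efficient code means less carbon and lower cost, so sustainability and cost savings align naturally.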

5. Multi-AZ와 Multi-Region의 차이는?

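Answer: Multi-AZ deploys across several Availability Zones within one region, with automatic failover when a single AZ fails. Most AWS services support it (RDS, ELB, ASG). The extra cost is modest for stateless tiers, but an RDS Multi-AZ standby roughly doubles the database instance cost and cross-AZ traffic is billed. Multi-Region deploys across regions to survive natural disasters, region-wide outages, or to meet compliance demands. It costs at least double and makes data replication complex (DynamoDB Global Tables, Aurora Global Database). 99.99% availability is achievable with Multi-AZ; 99.999% typically requires Multi-Region.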


AWS Well-Architected Framework Complete Guide 2025: Six Pillars, Practical Adoption, Cost/Security/Performance

TL;DR

Six Pillars: Operational Excellence, Security, Reliability, Performance Efficiency, Cost Optimization, Sustainability.
Trade-offs: No system can satisfy every pillar 100%. Decide based on business priorities.
Well-Architected Tool: Free self-assessment inside the AWS Console.
Sustainability (added 2021): Considers environmental impact — efficient code equals less carbon.
Applicable to any cloud: It is an AWS framework, but the principles hold on GCP/Azure too.

1. What Is the Well-Architected Framework?

1.1 Background

After reviewing tens of thousands of customer architectures, AWS distilled its findings into a best-practice guide. It had five pillars until Sustainability was added in 2021, making six.

1.2 The Six Pillars (2025)

Pillar	Core Question
Operational Excellence	How do we run the workload efficiently?
Security	How do we protect data and systems?
Reliability	How do we recover from failure?
Performance Efficiency	How do we use resources efficiently?
Cost Optimization	How do we minimize cost per unit of value?
Sustainability	How do we minimize environmental impact?

1.3 General Principles

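General design principles that apply to every pillar (AWS currently lists six):

Stop guessing capacity needs: scale up and down with demand instead of sizing up front.
Test at production scale: create a production-sized environment on demand, test, then tear it down.
Automate with experimentation in mind: automation makes changes repeatable and cheap to revert.
Allow evolutionary architectures: nothing built once lasts forever.
Drive architecture with data: measure, don't guess.
Improve through game days: rehearse failures, in the spirit of chaos engineering.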

2. Pillar 1: Operational Excellence

2.1 Core

"The ability to run and monitor systems to deliver business value and to continually improve supporting processes and procedures."

2.2 Design Principles

Operations as code — codify infra, policy, and procedure.
Small, frequent changes — big deployments are risky.
Continuously refine procedures — retros, runbook updates.
Anticipate failure — map failure modes.
Learn from every operational failure — postmortems.

2.3 Practical Checklist

Plan:

Business objectives are explicit?
Success measured by metrics?
Operational priorities defined?

Prepare:

All infra managed as IaC (CloudFormation/Terraform)?
CI/CD pipeline in place?
Monitoring and alerting configured?

Operate:

Runbooks current?
Clear on-call procedures?
Incident response times measured?

Evolve:

Regular retrospectives?
Chaos engineering drills?
Metrics-driven improvement?

2.4 AWS Tools

CloudFormation / CDK — IaC.
Systems Manager — automation (runbooks, patching).
CloudWatch — monitoring, alerting.
X-Ray — distributed tracing.
Config — resource change tracking.

3. Pillar 2: Security

3.1 Core

"Protect data, systems, and assets while delivering business value through risk assessments and mitigation strategies."

3.2 Design Principles

Implement a strong identity foundation — IAM, MFA, least privilege.
Apply security at all layers — defense in depth.
Encrypt data at rest and in transit — encryption everywhere.
Keep people away from data — prefer automation.
Prepare for security events — incident response plan.
Understand the shared responsibility model — AWS secures the cloud, you secure what is in it.

3.3 IAM Best Practices

BAD: use the root account
BAD: access keys inside code
BAD: wildcard (*) permissions
BAD: shared users

GOOD: MFA for every user
GOOD: role-based access (IAM Roles)
GOOD: least-privilege principle
GOOD: Access Analyzer
GOOD: temporary credentials via STS
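
The wildcard rule above can be checked mechanically. The following is a minimal sketch, not an AWS tool: `find_wildcards` is a hypothetical helper, the bucket name is made up, and a real review would rely on IAM Access Analyzer rather than a string match.

```python
def find_wildcards(policy: dict) -> list:
    """Return a finding for every overly broad Allow entry in an IAM policy document."""
    findings = []
    statements = policy.get("Statement", [])
    if isinstance(statements, dict):          # a single statement may be a bare object
        statements = [statements]
    for index, statement in enumerate(statements):
        if statement.get("Effect") != "Allow":
            continue
        for key in ("Action", "Resource"):
            values = statement.get(key, [])
            if isinstance(values, str):
                values = [values]
            for value in values:
                # "*" matches everything; "service:*" grants every action of a service
                if value == "*" or (key == "Action" and value.endswith(":*")):
                    findings.append(f"Statement {index}: {key} '{value}' is too broad")
    return findings


too_broad = {
    "Version": "2012-10-17",
    "Statement": [{"Effect": "Allow", "Action": "s3:*", "Resource": "*"}],
}
least_privilege = {
    "Version": "2012-10-17",
    "Statement": [{
        "Effect": "Allow",
        "Action": ["s3:GetObject"],
        "Resource": "arn:aws:s3:::example-bucket/reports/*",  # hypothetical bucket
    }],
}

print(find_wildcards(too_broad))
print(find_wildcards(least_privilege))
```

A check like this fits in a CI step so that broad policies are caught before they are deployed.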

3.4 Data Protection

At rest:

S3: default SSE-S3, or SSE-KMS.
EBS: enable default encryption.
RDS: KMS-managed keys.
DynamoDB: always encrypted (default).

In transit:

TLS 1.2+ mandatory.
Encrypt intra-VPC traffic (service mesh, mTLS).
VPN / Direct Connect.

3.5 Network Security

[Public Subnet]    <- Internet Gateway
    |
[Private Subnet]   <- NAT Gateway (egress only)
    |
[Isolated Subnet]  <- DB only, no internet
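
One way to lay out the three tiers across AZs is to carve the VPC range into equal blocks. This sketch uses only Python's standard `ipaddress` module; the `10.0.0.0/16` range, the `/20` subnet size, and the AZ suffixes are assumptions chosen for illustration.

```python
import ipaddress

VPC_CIDR = ipaddress.ip_network("10.0.0.0/16")   # hypothetical VPC range
TIERS = ("public", "private", "isolated")
AZS = ("a", "b", "c")                             # hypothetical AZ suffixes


def plan_subnets(vpc, tiers=TIERS, azs=AZS, prefix=20):
    """Assign one non-overlapping subnet to every (tier, AZ) pair."""
    blocks = vpc.subnets(new_prefix=prefix)       # iterator over consecutive /20 blocks
    return {(tier, az): next(blocks) for tier in tiers for az in azs}


plan = plan_subnets(VPC_CIDR)
for (tier, az), cidr in plan.items():
    print(f"{tier:9s} az-{az}: {cidr}")
```

Nine of the sixteen available /20 blocks are used, which leaves room for later tiers without renumbering.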

Security Group vs. NACL:

Security Group: instance-level, stateful.
NACL: subnet-level, stateless.

3.6 AWS Security Tools

Tool	Purpose
IAM	access control
GuardDuty	threat detection (ML-based)
Security Hub	findings aggregation
Inspector	vulnerability scanning
Macie	sensitive data discovery in S3
WAF	web application firewall
Shield	DDoS protection
Secrets Manager	secret storage
KMS	key management

4. Pillar 3: Reliability

4.1 Core

"The ability of a workload to perform its intended function correctly and consistently."

4.2 Design Principles

Automatically recover from failure — auto-healing.
Test recovery procedures — chaos engineering.
Scale horizontally — no single giant instance.
Stop guessing capacity — Auto Scaling.
Manage change through automation — IaC.

4.3 AWS Availability Model

Region: geographic area (us-east-1, ap-northeast-2).
Availability Zone (AZ): one or more discrete data centers with independent power and networking inside a region (most regions have three or more).
Edge Location: CloudFront PoP.

Multi-AZ: spread across AZs to survive single-AZ failure. Multi-Region: spread across regions to survive natural disasters or regional outages.

4.4 Availability Targets

Availability	Downtime/year	Downtime/month
99%	87.6 h	7.3 h
99.9% (three nines)	8.76 h	43.8 min
99.95%	4.38 h	21.9 min
99.99% (four nines)	52.6 min	4.38 min
99.999% (five nines)	5.26 min	26.3 s
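
The downtime figures follow directly from the availability percentage. A small sketch of the arithmetic, assuming a 365-day year and an average month of 730 hours:

```python
HOURS_PER_YEAR = 365 * 24                 # 8760
HOURS_PER_MONTH = HOURS_PER_YEAR / 12     # 730, an average month


def downtime_hours(availability_pct: float, period_hours: float) -> float:
    """Allowed downtime in hours for a given availability percentage."""
    return period_hours * (1 - availability_pct / 100)


for pct in (99.0, 99.9, 99.95, 99.99, 99.999):
    per_year = downtime_hours(pct, HOURS_PER_YEAR)
    per_month = downtime_hours(pct, HOURS_PER_MONTH)
    print(f"{pct}%: {per_year:.2f} h/year, {per_month * 60:.2f} min/month")
```

Each additional nine cuts the budget by a factor of ten, which is why the step from four to five nines usually needs a different architecture rather than more tuning.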

Reality:

99.9% is hard on a single instance.
99.99% requires Multi-AZ + Auto Scaling.
99.999% requires Multi-Region with sophisticated failover.

4.5 Practical Pattern

ELB + Auto Scaling Group + Multi-AZ:

        [Route 53]
            |
       [ALB (multi-AZ)]
       /        |        \
   [EC2 az-a] [EC2 az-b] [EC2 az-c]
       |        |        |
       [RDS Multi-AZ Standby]

EC2 dies: ASG launches a replacement.
AZ dies: ALB routes traffic to the remaining AZs.
Primary RDS dies: automatic failover to standby.
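
Why redundancy helps can be shown with the standard availability formulas: components in series multiply, redundant components fail only if all of them fail. The sketch assumes independent failures, which real AZs only approximate, and the per-component figures are made up.

```python
from math import prod


def serial(*availabilities: float) -> float:
    """All components must be up: availabilities multiply."""
    return prod(availabilities)


def parallel(*availabilities: float) -> float:
    """At least one redundant component must be up."""
    return 1 - prod(1 - a for a in availabilities)


# Hypothetical figures: three EC2 instances at 99% each, a primary/standby DB pair at 99.9% each
web_tier = parallel(0.99, 0.99, 0.99)
database = parallel(0.999, 0.999)
load_balancer = 0.9999

print(f"web tier: {web_tier:.6f}")
print(f"end to end: {serial(load_balancer, web_tier, database):.6f}")
```

The end-to-end number is capped by the weakest non-redundant component, here the load balancer figure.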

4.6 RTO and RPO

RTO (Recovery Time Objective): acceptable time until recovery.
RPO (Recovery Point Objective): acceptable data loss.

Strategy	RTO	RPO	Cost
Backup & Restore	hours	hours	low
Pilot Light	minutes	minutes	medium
Warm Standby	minutes	seconds	high
Multi-Site Active-Active	near zero	near zero	very high

Pick the strategy that matches the RTO/RPO the business can absorb.
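
The selection rule can be sketched as "take the cheapest strategy that still meets both targets." The numeric RTO/RPO values below are illustrative assumptions that stand in for the table's "hours / minutes / seconds"; they are not AWS figures.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Strategy:
    name: str
    rto_minutes: float   # assumed typical recovery time
    rpo_minutes: float   # assumed typical data-loss window


# Ordered from cheapest to most expensive, as in the table above
STRATEGIES = [
    Strategy("Backup & Restore", 240, 240),
    Strategy("Pilot Light", 30, 10),
    Strategy("Warm Standby", 5, 1),
    Strategy("Multi-Site Active-Active", 1, 0.1),
]


def cheapest_strategy(target_rto_minutes: float, target_rpo_minutes: float):
    """Return the first (cheapest) strategy meeting both targets, or None."""
    for strategy in STRATEGIES:
        if (strategy.rto_minutes <= target_rto_minutes
                and strategy.rpo_minutes <= target_rpo_minutes):
            return strategy.name
    return None


print(cheapest_strategy(60, 15))
print(cheapest_strategy(10, 2))
```

A `None` result means the targets are tighter than any listed strategy, which is a prompt to revisit the requirement with the business.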

5. Pillar 4: Performance Efficiency

5.1 Core

"The ability to use computing resources efficiently to meet requirements, and to maintain that efficiency as demand and technology evolve."

5.2 Design Principles

Democratize advanced technologies — AI/ML and big data for everyone.
Go global — get close to users.
Prefer serverless — less operational burden.
Experiment more often — comparison tests.
Have mechanical sympathy — pick the right tool.

5.3 Compute Choices

Option	Use case
Lambda	short, event-driven tasks
Fargate	containers without server management
EC2	long-running or custom environments
ECS/EKS	container orchestration
Batch	large batch jobs
Lightsail	simple web hosting

5.4 Storage Choices

Option	Use
S3	objects, backups, static assets
EBS	EC2 block storage
EFS	shared filesystem (NFS)
FSx	Windows / Lustre filesystem
Glacier	long-term archive

5.5 Database Choices

Option	Use case
RDS	classic relational
Aurora	high-performance, MySQL- and PostgreSQL-compatible (part of RDS)
DynamoDB	NoSQL, single-digit millisecond latency
DocumentDB	MongoDB-compatible
ElastiCache	Redis/Memcached cache
Neptune	graph DB
Timestream	time series
OpenSearch	search, log analytics

5.6 Caching

Layered caching:

CloudFront (CDN) — global edge.
API Gateway cache — API responses.
ElastiCache (Redis) — application cache.
DynamoDB DAX — DynamoDB cache.
RDS Read Replica — offload read traffic.

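Result: cache hits skip the origin or the database entirely, so read-heavy workloads with a high hit rate can see response times drop by an order of magnitude and database load fall sharply. Measure the hit rate rather than assuming it.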
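
The application-cache layer typically follows the cache-aside pattern: read the cache first, fall back to the database on a miss, then populate the cache. The sketch below uses an in-memory dictionary as a stand-in for ElastiCache; `TTLCache`, `get_product`, and `load_from_db` are hypothetical names.

```python
import time


class TTLCache:
    """Tiny in-memory cache with per-entry expiry (a stand-in for Redis)."""

    def __init__(self, ttl_seconds, clock=time.monotonic):
        self._ttl = ttl_seconds
        self._clock = clock
        self._store = {}

    def get(self, key):
        entry = self._store.get(key)
        if entry is None:
            return None
        value, expires_at = entry
        if self._clock() >= expires_at:   # expired: drop it and report a miss
            del self._store[key]
            return None
        return value

    def set(self, key, value):
        self._store[key] = (value, self._clock() + self._ttl)


def get_product(product_id, cache, load_from_db):
    """Cache-aside read: only touch the database on a miss."""
    cached = cache.get(product_id)
    if cached is not None:
        return cached
    value = load_from_db(product_id)
    cache.set(product_id, value)
    return value
```

The TTL bounds how stale a cached value can get; choosing it is a trade-off between freshness and database load.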

6. Pillar 5: Cost Optimization

6.1 Core

"The ability to run systems that deliver business value at the lowest price point."

6.2 Design Principles

Implement cloud financial management — FinOps.
Adopt a consumption model — pay for what you use.
Measure overall efficiency — cost per business metric.
Stop spending on undifferentiated heavy lifting — use managed services.
Analyze and attribute expenditure — tagging and allocation.

6.3 Pricing Models

Compute (EC2):

On-Demand: most expensive, most flexible.
Reserved Instance (1- or 3-year term): up to 72% savings.
Savings Plans: similar discounts with more flexible commitments.
Spot: up to 90% savings, interruptible.

Storage (S3):

Standard: frequent access.
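Standard-IA: infrequent access, roughly 40-45% cheaper per GB stored than Standard, with retrieval fees and a 30-day minimum.
Glacier (Flexible Retrieval): long-term retention, roughly 80% cheaper per GB stored, with slower and billed retrieval.
Glacier Deep Archive: cheapest, 95%+ cheaper per GB stored.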

6.4 Savings Strategies

1. Right-sizing:

aws compute-optimizer get-ec2-instance-recommendations

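For example, an instance whose CPU averages 50% and rarely peaks higher is a candidate for the next size down. Check memory and network as well before resizing.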
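
A rough version of the right-sizing decision, looking at peak as well as average CPU. The 40% and 80% thresholds are assumptions for illustration; Compute Optimizer applies its own, richer analysis.

```python
from statistics import mean


def rightsizing_hint(cpu_samples, low=40.0, high=80.0):
    """Suggest a direction from CPU utilization samples (percent)."""
    average = mean(cpu_samples)
    peak = max(cpu_samples)
    if peak < low:          # never busy, even at peak
        return "downsize"
    if average > high:      # busy most of the time
        return "upsize"
    return "keep"


print(rightsizing_hint([10, 20, 35]))
print(rightsizing_hint([30, 50, 70]))
```

Using the peak for the downsize check avoids shrinking an instance that is idle on average but saturated during bursts.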

2. Auto Scaling: contract when traffic is low, expand when it spikes. Cost scales with usage.

3. Spot Instances: ideal for batch and CI/CD workloads, up to 90% off; design for interruption.
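
Designing for interruption usually means checkpointing so a replacement instance can resume. The sketch below simulates this locally: `run_batch` is a hypothetical function and `interrupted` is any callable; on a real Spot instance it would poll the instance metadata path `spot/instance-action` for the two-minute notice.

```python
import json
import os
import tempfile


def run_batch(items, checkpoint_path, interrupted):
    """Process items in order, resuming from the last saved checkpoint."""
    start = 0
    if os.path.exists(checkpoint_path):
        with open(checkpoint_path) as handle:
            start = json.load(handle)["next_index"]
    processed = []
    for index in range(start, len(items)):
        if interrupted():                      # stop cleanly, progress is already saved
            return processed, False
        processed.append(items[index] * 2)     # stand-in for real work
        with open(checkpoint_path, "w") as handle:
            json.dump({"next_index": index + 1}, handle)
    return processed, True


checkpoint = os.path.join(tempfile.mkdtemp(), "checkpoint.json")
signals = iter([False, False, True])           # the third check reports an interruption
print(run_batch([1, 2, 3, 4], checkpoint, lambda: next(signals)))
print(run_batch([1, 2, 3, 4], checkpoint, lambda: False))
```

In practice the checkpoint would live in durable storage such as S3 rather than on the instance's local disk.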

4. Reserved Instances / Savings Plans: for steady workloads — roughly 30% savings for a 1-year commit and 60% for 3 years.
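
Whether a commitment pays off depends on how many hours the capacity is actually used. A sketch of the break-even arithmetic, with a made-up hourly rate and the rough 30% one-year discount mentioned above:

```python
def commitment_break_even(discount: float) -> float:
    """Fraction of hours an instance must run for a commitment to match On-Demand."""
    return 1 - discount


def cheaper_option(on_demand_rate, discount, utilization, hours_in_month=730):
    """Compare paying On-Demand for used hours with paying a committed rate for all hours."""
    on_demand_cost = on_demand_rate * hours_in_month * utilization
    committed_cost = on_demand_rate * (1 - discount) * hours_in_month  # billed even when idle
    return "commitment" if committed_cost < on_demand_cost else "on-demand"


print(round(commitment_break_even(0.30), 2))      # share of hours needed to break even
print(cheaper_option(0.10, 0.30, utilization=0.9))
print(cheaper_option(0.10, 0.30, utilization=0.5))
```

With a 30% discount, capacity that runs less than about 70% of the time is cheaper On-Demand.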

5. S3 Lifecycle Policies:

{
  "Rules": [{
    "ID": "MoveToIA",
    "Status": "Enabled",
    "Filter": { "Prefix": "" },
    "Transitions": [
      { "Days": 30, "StorageClass": "STANDARD_IA" },
      { "Days": 90, "StorageClass": "GLACIER" },
      { "Days": 365, "StorageClass": "DEEP_ARCHIVE" }
    ]
  }]
}
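
To see what the rule does to a single object over time, the transitions can be replayed in a few lines. This is a local model only: `storage_class_at` is a hypothetical helper, and S3 applies transitions asynchronously, so real timing is approximate.

```python
# (age in days, target storage class), mirroring the lifecycle rule above
TRANSITIONS = [(30, "STANDARD_IA"), (90, "GLACIER"), (365, "DEEP_ARCHIVE")]


def storage_class_at(age_days, transitions=TRANSITIONS):
    """Return the storage class an object of the given age would have reached."""
    current = "STANDARD"
    for days, storage_class in sorted(transitions):
        if age_days >= days:
            current = storage_class
    return current


for age in (0, 30, 90, 400):
    print(age, storage_class_at(age))
```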

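6. Cut data transfer cost: traffic within the same AZ over private IPs is free, cross-AZ traffic is cheap but billed, and internet egress costs the most; serve S3 content through CloudFront instead of direct S3 egress; use VPC endpoints (gateway endpoints for S3 and DynamoDB carry no hourly charge) to keep traffic off the NAT Gateway.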

6.5 Cost Monitoring

Tool	Use
Cost Explorer	cost visualization
Budgets	budget alerts
Cost and Usage Reports	detailed data
Trusted Advisor	cost recommendations
Compute Optimizer	EC2 sizing suggestions
Savings Plans Recommendations	savings suggestions

6.6 Tagging Strategy

Environment: production
Project: ecommerce
Owner: team-checkout
CostCenter: 1234

Then break down cost by tag in Cost Explorer.
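
A tagging strategy only works if it is enforced. The sketch below checks the four tags above and groups cost by one of them; the resource records and their `monthly_cost` field are made-up sample data, not an AWS API response.

```python
from collections import defaultdict

REQUIRED_TAGS = {"Environment", "Project", "Owner", "CostCenter"}


def missing_tags(resource):
    """Return the required tag keys a resource does not carry."""
    return REQUIRED_TAGS - set(resource.get("tags", {}))


def cost_by_tag(resources, tag_key):
    """Sum monthly cost per tag value; untagged resources are grouped together."""
    totals = defaultdict(float)
    for resource in resources:
        value = resource.get("tags", {}).get(tag_key, "untagged")
        totals[value] += resource["monthly_cost"]
    return dict(totals)


resources = [  # hypothetical inventory
    {"id": "i-001", "monthly_cost": 120.0,
     "tags": {"Environment": "production", "Project": "ecommerce",
              "Owner": "team-checkout", "CostCenter": "1234"}},
    {"id": "i-002", "monthly_cost": 80.0,
     "tags": {"Environment": "production", "Project": "ecommerce"}},
    {"id": "vol-003", "monthly_cost": 15.0, "tags": {}},
]

print(cost_by_tag(resources, "Project"))
print({r["id"]: sorted(missing_tags(r)) for r in resources if missing_tags(r)})
```

The "untagged" bucket is the number to drive toward zero, since that spend cannot be attributed to any owner.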

7. Pillar 6: Sustainability (added 2021)

7.1 Core

"Addresses the environmental impact, particularly energy consumption and efficiency."

7.2 Design Principles

Understand your impact — where does the carbon come from?
Set sustainability goals — measurable targets.
Maximize utilization — no idle resources.
Adopt more efficient new technology — Graviton, ARM.
Use managed services — ride AWS's efficiency.
Reduce downstream impact — lighter clients too.

7.3 Sustainability Metrics

Carbon footprint.
Energy per user.
Idle-resource ratio.
Data-transfer optimization.

7.4 Practical Actions

Efficient instance types: per AWS, Graviton (Arm) instances use up to 60% less energy than comparable EC2 instances for the same performance.
Auto Scaling: shut down what you are not using; auto-stop dev environments overnight.
Object lifecycle: move cold data to cold storage and delete the rest.
CDN: serve closer to users — fewer network hops, less energy.
Region selection: favor regions with a high share of renewable energy; the AWS Customer Carbon Footprint Tool reports your account's estimated emissions so you can track the effect.

7.5 AWS Commitments

100% renewable energy by 2025.
Net-zero carbon by 2040.
Efficient datacenter design.

8. Handling Trade-offs

8.1 Pillars in Tension

Security vs. Performance: strong crypto has CPU cost; decide by data sensitivity.
Cost vs. Reliability: Multi-Region doubles cost; decide by SLA.
Performance vs. Cost: bigger instance, faster and pricier; decide by latency targets.
Ops Excellence vs. Cost: automation front-loads cost; decide by long-term ROI.

8.2 Decision Framework

For each decision, ask:

Business goal — what matters most?
Per-pillar impact — which pillars are affected?
Trade-off — what are you giving up?
Risk — what if it goes wrong?
Reversibility — how expensive is changing later?

8.3 Example

Scenario: an e-commerce site serving 1M daily users.

Pillar	Priority	Decision
Reliability	High	Multi-AZ, Auto Scaling
Performance	High	CloudFront + ElastiCache
Security	High	WAF, KMS, MFA
Cost	Medium	Reserved + Spot mix
Operational	Medium	IaC, CI/CD
Sustainability	Low	Graviton
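
The priorities in the table can be turned into a simple weighted score for comparing options. This is one possible sketch, not part of the framework: the weights (3/2/1) and the impact ratings from -2 to +2 are assumptions.

```python
PRIORITY_WEIGHT = {"High": 3, "Medium": 2, "Low": 1}

# Priorities from the e-commerce scenario above
PILLAR_PRIORITY = {
    "Reliability": "High",
    "Performance": "High",
    "Security": "High",
    "Cost": "Medium",
    "Operational": "Medium",
    "Sustainability": "Low",
}


def score(option_impacts):
    """Weighted sum of per-pillar impacts, each rated from -2 (hurts) to +2 (helps)."""
    return sum(PRIORITY_WEIGHT[PILLAR_PRIORITY[pillar]] * impact
               for pillar, impact in option_impacts.items())


# Hypothetical ratings for two reliability options
multi_region = {"Reliability": 2, "Cost": -2, "Operational": -1}
multi_az = {"Reliability": 1, "Cost": -1}

print("multi-region:", score(multi_region))
print("multi-az:", score(multi_az))
```

The score does not make the decision, but writing down the ratings forces the trade-off to be explicit and reviewable.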

9. Using the Well-Architected Tool

9.1 Free Self-Assessment

In the AWS Console:

Define the workload (name, environment, region).
Answer ~50 questions across the six pillars.
Identify improvement areas.
Prioritize.
Re-assess after changes.

9.2 Lenses

A Lens is additional guidance specialized for a particular workload or technology.

Lens	Target
Serverless Lens	Lambda, API Gateway
SaaS Lens	multi-tenant SaaS
Machine Learning Lens	ML workloads
Foundational Technical Review	AWS Partners
IoT Lens	IoT systems
Streaming Media Lens	media

9.3 Well-Architected Review

Run with an AWS Solutions Architect (free or via a partner):

Architecture walk-through.
Six-pillar assessment.
Prioritized recommendations.
Improvement roadmap.

10. Anti-Pattern Catalog

10.1 Operations

BAD: manual deploys → human error. GOOD: CI/CD pipeline.
BAD: only local testing. GOOD: production-like environments.
BAD: no postmortems. GOOD: blameless postmortems.

10.2 Security

BAD: using the root account for daily work. GOOD: IAM roles with temporary credentials, with the root account locked behind MFA.
BAD: wildcard permissions. GOOD: least privilege.
BAD: hard-coded credentials. GOOD: Secrets Manager / Parameter Store.
BAD: HTTP only. GOOD: enforced HTTPS, HSTS.

10.3 Reliability

BAD: single-AZ deployment. GOOD: Multi-AZ minimum.
BAD: no backups or untested restores. GOOD: regular backups with restore drills.
BAD: single points of failure. GOOD: redundancy across every layer.

10.4 Performance

BAD: oversized instances for everything. GOOD: workload-appropriate sizing.
BAD: no caching. GOOD: CloudFront + ElastiCache.
BAD: everything synchronous. GOOD: async where it fits.

10.5 Cost

BAD: everything On-Demand. GOOD: mix Reserved / Savings / Spot.
BAD: no tagging. GOOD: consistent tagging strategy.
BAD: abandoned resources. GOOD: regular cleanup, Trusted Advisor.

Quiz

1. What are the six pillars of the Well-Architected Framework?

Answer: Operational Excellence (efficient operations), Security (protect data and systems), Reliability (recover from failure), Performance Efficiency (use resources efficiently), Cost Optimization (value per dollar), and Sustainability (environmental impact, added in 2021). The framework had five pillars until Sustainability was added as environmental awareness grew. Because no workload can satisfy every pillar 100%, trade-offs follow business priorities.

2. Difference between RTO and RPO?

Answer: RTO (Recovery Time Objective) is the acceptable time until recovery, for example "must be back online within 30 minutes." RPO (Recovery Point Objective) is the acceptable data loss, for example "at most 5 minutes of data loss is OK." Both come from the business. Cost rises steeply as RTO/RPO shrink: real-time replication drives RPO toward zero but is expensive. Targets vary widely by workload, so derive them from the cost of downtime rather than from a rule of thumb.

3. How do you use Spot Instances safely?

Answer: Spot instances can be terminated at any time (typically with a 2-minute notice). Safe use: (1) stateless workloads — no data loss on termination; (2) checkpointing — persist progress; (3) Spot Fleet — mix of instance types; (4) Auto Scaling Group with mixed instances — balance On-Demand and Spot; (5) graceful shutdown handlers. Ideal for CI/CD, batch, and analytics. Up to 90% savings.

4. Key sustainability practices?

Answer: (1) Graviton (Arm): per AWS, up to 60% less energy than comparable EC2 instances for the same performance; (2) Auto Scaling: turn off unused capacity; (3) S3 lifecycle: move old data to cold storage; (4) CDN: fewer network hops; (5) favor regions with a high share of renewable energy, and track your estimated emissions with the AWS Customer Carbon Footprint Tool; (6) delete idle resources such as unused EBS volumes or idle EC2 instances. Efficient code means less carbon and lower cost, so cost savings and sustainability align naturally.

5. Multi-AZ vs. Multi-Region?

Answer: Multi-AZ spreads a workload across Availability Zones within one region, with automatic failover on a single-AZ failure; most AWS services support it (RDS, ELB, ASG). The extra cost is modest for stateless tiers, but an RDS Multi-AZ standby roughly doubles the database instance cost and cross-AZ traffic is billed. Multi-Region spreads across regions to survive natural disasters, large-scale outages, or to meet compliance demands; it costs at least double and adds complex data replication (DynamoDB Global Tables, Aurora Global Database). 99.99% is achievable with Multi-AZ; 99.999% typically requires Multi-Region.