FinOps & 클라우드 비용 최적화 완전 가이드 2025: AWS/GCP/Azure 비용 절감 전략
1. FinOps란 무엇인가?
1.1 FinOps의 정의와 배경
클라우드 비용이 폭발적으로 증가하면서, 기업들은 새로운 비용 관리 프레임워크가 필요해졌습니다. FinOps는 Finance와 DevOps를 결합한 합성어로, 엔지니어링 팀이 클라우드 비용에 대한 소유권을 갖고 비즈니스 가치와 비용 사이의 균형을 잡는 문화적 실천입니다.
FinOps Foundation에 따르면 FinOps는 다음과 같은 원칙을 기반으로 합니다:
- 팀 간 협업: 엔지니어링, 재무, 비즈니스 팀이 함께 비용을 관리
- 비용에 대한 소유권: 각 팀이 자신의 클라우드 지출을 책임짐
- 적시 의사결정: 실시간 데이터를 기반으로 한 비용 최적화
- 비즈니스 가치 중심: 단순 비용 절감이 아닌 비즈니스 가치 극대화
1.2 FinOps 라이프사이클: Inform - Optimize - Operate
FinOps 프레임워크는 세 가지 반복적인 단계로 구성됩니다.
┌─────────────────────────────────────────────────────┐
│ FinOps Lifecycle │
│ │
│ ┌──────────┐ ┌──────────┐ ┌──────────┐ │
│ │ Inform │───>│ Optimize │───>│ Operate │ │
│ │ │ │ │ │ │ │
│ │ 가시성 │ │ 최적화 │ │ 운영 │ │
│ │ 확보 │ │ 실행 │ │ 지속 │ │
│ └──────────┘ └──────────┘ └──────────┘ │
│ ^ | │
│ └────────────────────────────────┘ │
│ 반복 (Iterate) │
└─────────────────────────────────────────────────────┘
Inform (알리기): 비용 가시성을 확보하고, 누가 무엇에 얼마를 쓰고 있는지 파악합니다.
Optimize (최적화): 리소스 라이트사이징, 예약 할인, 미사용 리소스 제거 등을 실행합니다.
Operate (운영): 최적화를 지속하기 위한 프로세스와 거버넌스를 구축합니다.
1.3 FinOps 성숙도 모델
Level 1: Crawl (기초)
├── 기본적인 비용 리포팅
├── 태깅 전략 수립 시작
└── 주요 비용 동인 파악
Level 2: Walk (중급)
├── 부서별 비용 할당
├── 예약 인스턴스 도입
├── 자동화된 리포팅
└── 비용 이상 감지
Level 3: Run (고급)
├── 실시간 비용 최적화
├── 자동 스케일링 최적화
├── Unit Economics 기반 추적
├── 비용 예측 및 예산 관리
└── FinOps 문화 완전 정착
2. 클라우드 비용 모델 이해
2.1 주요 과금 모델 비교
클라우드 비용을 최적화하려면 먼저 과금 모델을 정확히 이해해야 합니다.
# 클라우드 과금 모델 비교
On-Demand:
설명: "사용한 만큼 지불. 약정 없음"
할인율: "0% (기준 가격)"
장점: "유연성, 즉시 시작/중지"
단점: "가장 비싼 옵션"
적합한 워크로드: "개발/테스트, 단기 프로젝트, 불규칙 워크로드"
Reserved Instances (RI):
설명: "1년 또는 3년 약정으로 할인"
할인율: "최대 72% (AWS), 최대 57% (Azure)"
장점: "큰 할인, 용량 예약"
단점: "장기 약정, 유연성 부족"
적합한 워크로드: "안정적인 프로덕션 워크로드"
Savings Plans:
설명: "시간당 사용량 약정으로 할인 (AWS)"
할인율: "최대 72%"
장점: "RI보다 유연, 인스턴스 패밀리/리전 변경 가능"
단점: "약정 필요"
적합한 워크로드: "안정적이지만 변동 가능한 워크로드"
Spot/Preemptible:
설명: "여유 용량을 대폭 할인"
할인율: "최대 90% (AWS), 최대 91% (GCP)"
장점: "가장 저렴"
단점: "언제든 중단될 수 있음"
적합한 워크로드: "배치 처리, CI/CD, 내결함성 워크로드"
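위 과금 모델의 할인 효과는 간단한 계산으로 비교해 볼 수 있습니다. 아래는 최소한의 스케치로, 시간당 $0.10과 각 할인율은 설명을 위한 가정값입니다(실제 할인율은 인스턴스 타입·리전·약정 조건에 따라 다릅니다).

```python
def effective_monthly_cost(on_demand_hourly, discount_pct, hours=730):
    """할인율을 적용한 월간 실효 비용 계산 (730시간/월 기준)"""
    return round(on_demand_hourly * (1 - discount_pct / 100) * hours, 2)

# 예시: On-Demand 시간당 $0.10 인스턴스를 모델별 가정 할인율로 비교
example_discounts = {
    'on_demand': 0,       # 기준 가격
    'reserved_3yr': 60,   # 가정값
    'savings_plan': 55,   # 가정값
    'spot': 85,           # 가정값
}
costs = {
    name: effective_monthly_cost(0.10, pct)
    for name, pct in example_discounts.items()
}
# on_demand $73.00 대비 spot은 $10.95 수준으로 떨어짐
```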
2.2 멀티클라우드 과금 용어 매핑
AWS GCP Azure
─────────────────────────────────────────────────────────
On-Demand On-Demand Pay-as-you-go
Reserved Instances Committed Use Reserved VM Instances
Savings Plans Flex CUDs Azure Savings Plan
Spot Instances Preemptible/Spot VMs Spot VMs
Cost Explorer Billing Reports Cost Management
AWS Budgets GCP Budgets Azure Budgets
Compute Optimizer Recommender Azure Advisor
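위 매핑은 멀티클라우드 비용 리포트를 정규화할 때 자료구조로 그대로 옮길 수 있습니다. 아래는 용어 역매핑 스케치이며, 카테고리 키와 함수명은 예시로 정한 것입니다.

```python
# 멀티클라우드 과금 용어 -> 공통 카테고리 매핑 (위 표의 내용을 옮긴 것)
PRICING_TERM_MAP = {
    'on_demand': {'aws': 'On-Demand', 'gcp': 'On-Demand',
                  'azure': 'Pay-as-you-go'},
    'committed': {'aws': 'Reserved Instances', 'gcp': 'Committed Use',
                  'azure': 'Reserved VM Instances'},
    'flexible_commit': {'aws': 'Savings Plans', 'gcp': 'Flex CUDs',
                        'azure': 'Azure Savings Plan'},
    'preemptible': {'aws': 'Spot Instances', 'gcp': 'Preemptible/Spot VMs',
                    'azure': 'Spot VMs'},
}

def normalize_term(cloud, vendor_term):
    """벤더별 용어를 공통 카테고리로 역매핑 (없으면 None)"""
    for category, vendors in PRICING_TERM_MAP.items():
        if vendors.get(cloud) == vendor_term:
            return category
    return None
```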
3. 비용 가시성 확보 (Inform 단계)
3.1 태깅 전략
태깅은 FinOps의 기초입니다. 리소스에 적절한 태그가 없으면 비용을 누가 쓰는지 알 수 없습니다.
# 필수 태그 체계 예시
mandatory_tags:
- key: "Environment"
values: ["production", "staging", "development", "sandbox"]
description: "리소스가 속한 환경"
- key: "Team"
values: ["platform", "backend", "data", "ml", "frontend"]
description: "리소스를 소유한 팀"
- key: "Service"
values: ["user-api", "payment", "notification", "analytics"]
description: "리소스가 지원하는 서비스"
- key: "CostCenter"
values: ["CC-1001", "CC-1002", "CC-2001"]
description: "비용 센터 코드"
- key: "ManagedBy"
values: ["terraform", "cloudformation", "manual", "pulumi"]
description: "리소스 관리 도구"
optional_tags:
- key: "Project"
description: "프로젝트명"
- key: "ExpiryDate"
description: "리소스 만료일 (임시 리소스용)"
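위 태그 체계는 코드로 검증할 수 있습니다. 아래는 허용 값 목록을 위 YAML에서 옮긴 최소 검증 스케치로, CostCenter는 존재 여부만 검사합니다(코드 형식 검증은 생략).

```python
# 필수 태그별 허용 값 (None이면 존재 여부만 검사)
REQUIRED_TAG_VALUES = {
    'Environment': {'production', 'staging', 'development', 'sandbox'},
    'Team': {'platform', 'backend', 'data', 'ml', 'frontend'},
    'CostCenter': None,
}

def validate_tags(resource_tags):
    """필수 태그 존재 및 허용 값 검사. 위반 사항 목록을 반환"""
    violations = []
    for key, allowed in REQUIRED_TAG_VALUES.items():
        value = resource_tags.get(key)
        if value is None:
            violations.append(f"missing tag: {key}")
        elif allowed is not None and value not in allowed:
            violations.append(f"invalid value for {key}: {value}")
    return violations
```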
3.2 태그 강제화 자동화
# AWS Config 규칙으로 태그 미준수 리소스 감지
# config_rule_required_tags.py
import json
import boto3
REQUIRED_TAGS = ['Environment', 'Team', 'Service', 'CostCenter']
def lambda_handler(event, context):
"""AWS Config 규칙: 필수 태그 확인"""
config = boto3.client('config')
configuration_item = json.loads(
event['invokingEvent']
)['configurationItem']
resource_tags = configuration_item.get('tags', {})
missing_tags = [
tag for tag in REQUIRED_TAGS
if tag not in resource_tags
]
compliance_type = (
'NON_COMPLIANT' if missing_tags else 'COMPLIANT'
)
annotation = (
f"Missing tags: {', '.join(missing_tags)}"
if missing_tags else "All required tags present"
)
config.put_evaluations(
Evaluations=[{
'ComplianceResourceType': configuration_item['resourceType'],
'ComplianceResourceId': configuration_item['resourceId'],
'ComplianceType': compliance_type,
'Annotation': annotation,
'OrderingTimestamp': configuration_item['configurationItemCaptureTime']
}],
ResultToken=event['resultToken']
)
return {
'compliance_type': compliance_type,
'annotation': annotation
}
3.3 AWS Cost Explorer 활용
# AWS Cost Explorer API로 비용 분석
# cost_analysis.py
import boto3
from datetime import datetime, timedelta
def get_cost_by_service(days=30):
"""서비스별 비용 분석"""
ce = boto3.client('ce')
end_date = datetime.now().strftime('%Y-%m-%d')
start_date = (
datetime.now() - timedelta(days=days)
).strftime('%Y-%m-%d')
response = ce.get_cost_and_usage(
TimePeriod={
'Start': start_date,
'End': end_date
},
Granularity='MONTHLY',
Metrics=['UnblendedCost'],
GroupBy=[{
'Type': 'DIMENSION',
'Key': 'SERVICE'
}]
)
costs = []
for time_period in response['ResultsByTime']:
for group in time_period['Groups']:
service = group['Keys'][0]
amount = float(
group['Metrics']['UnblendedCost']['Amount']
)
if amount > 0:
costs.append({
'service': service,
'cost': round(amount, 2)
})
# 비용 순으로 정렬
costs.sort(key=lambda x: x['cost'], reverse=True)
return costs
def get_cost_by_tag(tag_key='Team', days=30):
"""태그별 비용 분석"""
ce = boto3.client('ce')
end_date = datetime.now().strftime('%Y-%m-%d')
start_date = (
datetime.now() - timedelta(days=days)
).strftime('%Y-%m-%d')
response = ce.get_cost_and_usage(
TimePeriod={
'Start': start_date,
'End': end_date
},
Granularity='MONTHLY',
Metrics=['UnblendedCost'],
GroupBy=[{
'Type': 'TAG',
'Key': tag_key
}]
)
return response
def detect_cost_anomalies(threshold_pct=20):
"""비용 이상 감지: 전월 대비 threshold% 이상 증가 시 알림"""
ce = boto3.client('ce')
current_month_start = datetime.now().replace(day=1).strftime('%Y-%m-%d')
current_date = datetime.now().strftime('%Y-%m-%d')
last_month_start = (
datetime.now().replace(day=1) - timedelta(days=1)
).replace(day=1).strftime('%Y-%m-%d')
last_month_end = (
datetime.now().replace(day=1) - timedelta(days=1)
).strftime('%Y-%m-%d')
# 현재 월 비용
current = ce.get_cost_and_usage(
TimePeriod={
'Start': current_month_start,
'End': current_date
},
Granularity='MONTHLY',
Metrics=['UnblendedCost'],
GroupBy=[{'Type': 'DIMENSION', 'Key': 'SERVICE'}]
)
# 전월 비용
previous = ce.get_cost_and_usage(
TimePeriod={
'Start': last_month_start,
'End': last_month_end
},
Granularity='MONTHLY',
Metrics=['UnblendedCost'],
GroupBy=[{'Type': 'DIMENSION', 'Key': 'SERVICE'}]
)
    # 일평균 기준으로 전월 대비 증가율 비교
    anomalies = []
    days_elapsed = max(datetime.now().day - 1, 1)
    prev_month_days = (
        datetime.now().replace(day=1) - timedelta(days=1)
    ).day
    prev_costs = {
        g['Keys'][0]: float(g['Metrics']['UnblendedCost']['Amount'])
        for g in previous['ResultsByTime'][0]['Groups']
    }
    for g in current['ResultsByTime'][0]['Groups']:
        service = g['Keys'][0]
        curr_daily = (
            float(g['Metrics']['UnblendedCost']['Amount']) / days_elapsed
        )
        prev_daily = prev_costs.get(service, 0) / prev_month_days
        if prev_daily <= 0:
            continue  # 전월 데이터가 없는 서비스는 비교 불가
        increase_pct = (curr_daily - prev_daily) / prev_daily * 100
        if increase_pct >= threshold_pct:
            anomalies.append({
                'service': service,
                'increase_pct': round(increase_pct, 1),
                'current_daily_avg': round(curr_daily, 2),
                'previous_daily_avg': round(prev_daily, 2)
            })
    return anomalies
3.4 GCP Billing Export와 BigQuery 분석
-- GCP Billing Export를 BigQuery에서 분석
-- 서비스별 월간 비용 트렌드
SELECT
invoice.month AS billing_month,
service.description AS service_name,
SUM(cost) + SUM(IFNULL(
(SELECT SUM(c.amount) FROM UNNEST(credits) c), 0
)) AS net_cost
FROM
`project.dataset.gcp_billing_export_v1_XXXXXX`
WHERE
invoice.month >= '202501'
GROUP BY
billing_month, service_name
HAVING
net_cost > 10
ORDER BY
billing_month DESC, net_cost DESC;
-- 프로젝트별 일간 비용 추적
SELECT
FORMAT_DATE('%Y-%m-%d', usage_start_time) AS usage_date,
project.name AS project_name,
SUM(cost) AS daily_cost
FROM
`project.dataset.gcp_billing_export_v1_XXXXXX`
WHERE
usage_start_time >= TIMESTAMP_SUB(CURRENT_TIMESTAMP(), INTERVAL 30 DAY)
GROUP BY
usage_date, project_name
ORDER BY
usage_date DESC, daily_cost DESC;
-- 미사용 리소스 탐지 (Committed Use 대비 실사용)
SELECT
sku.description,
SUM(usage.amount) AS total_usage,
SUM(cost) AS total_cost,
SUM(usage.amount) / COUNT(DISTINCT FORMAT_DATE('%Y-%m-%d', usage_start_time)) AS avg_daily_usage
FROM
`project.dataset.gcp_billing_export_v1_XXXXXX`
WHERE
usage_start_time >= TIMESTAMP_SUB(CURRENT_TIMESTAMP(), INTERVAL 7 DAY)
GROUP BY
sku.description
HAVING
total_cost > 100
ORDER BY
total_cost DESC;
4. 라이트사이징 (Right-sizing)
4.1 라이트사이징이란?
라이트사이징은 워크로드에 맞는 최적의 인스턴스 타입과 크기를 선택하는 과정입니다. 대부분의 클라우드 워크로드는 과다 프로비저닝되어 있습니다.
일반적인 EC2 인스턴스 활용률:
CPU 활용률 메모리 활용률
┌──────────────┐ ┌──────────────┐
│████░░░░░░░░░░│ │██████░░░░░░░░│
│ 20-30% │ │ 40-50% │
└──────────────┘ └──────────────┘
=> 70-80%의 CPU가 낭비됨!
=> 50-60%의 메모리가 낭비됨!
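CloudWatch 등에서 수집한 평균 활용률만으로도 다운사이징 후보를 1차 선별할 수 있습니다. 아래는 최소 스케치이며, 임계치(40%/50%)와 목표 활용률(60%)은 설명용 가정값입니다.

```python
def is_over_provisioned(avg_cpu_pct, avg_mem_pct,
                        cpu_threshold=40, mem_threshold=50):
    """평균 활용률이 임계치 아래면 다운사이징 후보로 판단 (임계치는 가정값)"""
    return avg_cpu_pct < cpu_threshold and avg_mem_pct < mem_threshold

def suggest_downsize_factor(avg_cpu_pct, target_cpu_pct=60):
    """목표 활용률 대비 필요한 용량 비율 (예: CPU 25% -> 현재의 0.42배면 충분)"""
    return round(avg_cpu_pct / target_cpu_pct, 2)
```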
4.2 AWS Compute Optimizer 활용
# AWS Compute Optimizer로 라이트사이징 권장사항 조회
# rightsizing_analyzer.py
import boto3
def get_ec2_recommendations():
"""EC2 인스턴스 라이트사이징 권장사항 조회"""
co = boto3.client('compute-optimizer')
response = co.get_ec2_instance_recommendations(
filters=[{
'name': 'Finding',
'values': ['OVER_PROVISIONED']
}]
)
recommendations = []
for rec in response['instanceRecommendations']:
current = rec['currentInstanceType']
finding = rec['finding']
        options = []
        for option in rec['recommendationOptions']:
            # 옵션별 예상 절감액은 savingsOpportunity 필드에서 제공됨
            savings = option.get('savingsOpportunity', {})
            options.append({
                'instance_type': option['instanceType'],
                'migration_effort': option.get('migrationEffort', 'Unknown'),
                'performance_risk': option.get('performanceRisk', 0),
                'estimated_monthly_savings': (
                    savings.get('estimatedMonthlySavings', {}).get('value', 0)
                )
            })
recommendations.append({
'instance_id': rec['instanceArn'].split('/')[-1],
'current_type': current,
'finding': finding,
'options': options,
'utilization': rec.get('utilizationMetrics', [])
})
return recommendations
def get_ebs_recommendations():
"""EBS 볼륨 라이트사이징 권장사항"""
co = boto3.client('compute-optimizer')
response = co.get_ebs_volume_recommendations(
filters=[{
'name': 'Finding',
'values': ['NotOptimized']
}]
)
recommendations = []
for rec in response['volumeRecommendations']:
current_config = rec['currentConfiguration']
recommendations.append({
'volume_id': rec['volumeArn'].split('/')[-1],
'current_type': current_config['volumeType'],
'current_size': current_config['volumeSize'],
'current_iops': current_config.get(
'volumeBaselineIOPS', 'N/A'
),
'finding': rec['finding'],
'options': rec['volumeRecommendationOptions']
})
return recommendations
def generate_rightsizing_report():
"""라이트사이징 보고서 생성"""
ec2_recs = get_ec2_recommendations()
ebs_recs = get_ebs_recommendations()
    # 각 인스턴스의 첫 번째 권장 옵션 기준 예상 절감액 합계
    total_monthly_savings = round(sum(
        rec['options'][0].get('estimated_monthly_savings', 0)
        for rec in ec2_recs if rec.get('options')
    ), 2)
report = {
'ec2_recommendations': ec2_recs,
'ebs_recommendations': ebs_recs,
'total_over_provisioned_ec2': len(ec2_recs),
'total_not_optimized_ebs': len(ebs_recs),
'estimated_monthly_savings': total_monthly_savings
}
return report
4.3 GCP Recommender API
# GCP Recommender로 VM 라이트사이징 권장사항 조회
# gcp_rightsizing.py
from google.cloud import recommender_v1
def get_vm_rightsizing_recommendations(project_id, zone):
"""GCP VM 라이트사이징 권장사항"""
client = recommender_v1.RecommenderClient()
parent = (
f"projects/{project_id}/locations/{zone}/"
f"recommenders/google.compute.instance.MachineTypeRecommender"
)
recommendations = client.list_recommendations(
request={"parent": parent}
)
results = []
for rec in recommendations:
results.append({
'name': rec.name,
'description': rec.description,
'priority': rec.priority.name,
'state': rec.state_info.state.name,
'impact': {
'category': rec.primary_impact.category.name,
'cost_projection': str(
rec.primary_impact.cost_projection
) if rec.primary_impact.cost_projection else None
},
'operations': [
{
'action': op.action,
'resource_type': op.resource_type,
'resource': op.resource,
'path': op.path,
'value': op.value
}
for group in rec.content.operation_groups
for op in group.operations
]
})
return results
5. Reserved Instances vs Savings Plans
5.1 비교 분석
┌──────────────────────────────────────────────────────────────────┐
│ Reserved Instances vs Savings Plans │
├──────────────────┬──────────────────┬────────────────────────────┤
│ 항목 │ Reserved Instances│ Savings Plans │
├──────────────────┼──────────────────┼────────────────────────────┤
│ 할인율 │ 최대 72% │ 최대 72% │
│ 약정 기간 │ 1년 또는 3년 │ 1년 또는 3년 │
│ 인스턴스 유연성 │ 제한적(Standard) │ Compute SP: 매우 유연 │
│ │ 보통(Convertible)│ EC2 SP: 패밀리 내 유연 │
│ 리전 유연성 │ Zonal/Regional │ Compute SP: 모든 리전 │
│ 서비스 범위 │ EC2만 │ EC2 + Fargate + Lambda │
│ 결제 옵션 │ 전체/부분/없음 │ 전체/부분/없음 │
│ 권장 시나리오 │ 안정적 워크로드 │ 변동성 있는 워크로드 │
└──────────────────┴──────────────────┴────────────────────────────┘
5.2 손익 분기점 분석
# Reserved Instance 손익 분기점 분석
# break_even_analysis.py
def calculate_break_even(
on_demand_hourly,
ri_hourly,
ri_upfront,
term_months=12
):
"""RI 손익 분기점 계산"""
hours_per_month = 730 # 24 * 365 / 12
# 월간 절감액
monthly_savings = (on_demand_hourly - ri_hourly) * hours_per_month
if monthly_savings == 0:
return None
# 손익 분기점 (월)
break_even_months = ri_upfront / monthly_savings if ri_upfront > 0 else 0
# 총 절감액 (약정 기간 동안)
total_on_demand = on_demand_hourly * hours_per_month * term_months
total_ri = ri_hourly * hours_per_month * term_months + ri_upfront
total_savings = total_on_demand - total_ri
savings_pct = (total_savings / total_on_demand) * 100
return {
'break_even_months': round(break_even_months, 1),
'monthly_savings': round(monthly_savings, 2),
'total_savings': round(total_savings, 2),
'savings_percentage': round(savings_pct, 1),
'total_on_demand_cost': round(total_on_demand, 2),
'total_ri_cost': round(total_ri, 2)
}
# 예시: m5.xlarge (us-east-1)
result = calculate_break_even(
on_demand_hourly=0.192,
ri_hourly=0.0684, # 3년 부분선결제
ri_upfront=1248.0,
term_months=36
)
# 결과:
# break_even_months: 13.8
# monthly_savings: 약 $90
# total_savings: 약 $2,000 (On-Demand 총비용 대비 약 39.6% 절감)
# => 14개월 후부터 순이익 발생
def recommend_commitment_strategy(
usage_history,
confidence_threshold=0.7
):
"""사용 패턴 기반 약정 전략 추천"""
# 최소 사용량 (P10): 안전하게 RI로 커버
# 평균 사용량 (P50): Savings Plans으로 커버
# 피크 사용량 (P90): On-Demand로 유지
import numpy as np
p10 = np.percentile(usage_history, 10)
p50 = np.percentile(usage_history, 50)
p90 = np.percentile(usage_history, 90)
return {
'reserved_instances': {
'coverage': 'P10 baseline',
'amount': p10,
'reason': '항상 사용하는 최소 용량'
},
'savings_plans': {
'coverage': 'P10 to P50',
'amount': p50 - p10,
'reason': '유연한 할인으로 변동 구간 커버'
},
'on_demand': {
'coverage': 'Above P50',
'amount': p90 - p50,
'reason': '피크 시 탄력적으로 대응'
}
}
6. Spot/Preemptible 인스턴스 전략
6.1 Spot 인스턴스 아키텍처
Spot 인스턴스는 최대 90% 할인을 제공하지만, 2분 경고 후 언제든 회수될 수 있습니다. 적절한 아키텍처가 핵심입니다.
# Spot 인스턴스 적합성 판단 기준
spot_suitable:
- 배치 처리 작업
- CI/CD 파이프라인
- 데이터 분석 / ETL
- 상태 없는(stateless) 웹 서버
- 컨테이너화된 마이크로서비스
- 머신러닝 학습 작업
- 테스트 환경
spot_not_suitable:
- 단일 인스턴스 데이터베이스
- 장시간 실행되는 상태 유지(stateful) 작업
- 중단 허용 불가 프로덕션 워크로드
6.2 AWS Spot Fleet 설정
{
"SpotFleetRequestConfig": {
"IamFleetRole": "arn:aws:iam::123456789012:role/spot-fleet-role",
"TargetCapacity": 10,
"SpotPrice": "0.10",
"TerminateInstancesWithExpiration": true,
"AllocationStrategy": "capacityOptimized",
"LaunchSpecifications": [
{
"InstanceType": "m5.xlarge",
"SubnetId": "subnet-abc123",
"WeightedCapacity": 1
},
{
"InstanceType": "m5a.xlarge",
"SubnetId": "subnet-abc123",
"WeightedCapacity": 1
},
{
"InstanceType": "m5d.xlarge",
"SubnetId": "subnet-def456",
"WeightedCapacity": 1
},
{
"InstanceType": "m4.xlarge",
"SubnetId": "subnet-def456",
"WeightedCapacity": 1
}
],
"OnDemandTargetCapacity": 2,
"OnDemandAllocationStrategy": "lowestPrice"
}
}
6.3 Spot 중단 핸들링
# Spot 인스턴스 중단 감지 및 처리
# spot_interruption_handler.py
import requests
import signal
import sys
import time
import logging
logging.basicConfig(level=logging.INFO)
logger = logging.getLogger(__name__)
METADATA_URL = "http://169.254.169.254/latest/meta-data"
SPOT_ACTION_URL = f"{METADATA_URL}/spot/instance-action"
def check_spot_interruption():
"""Spot 중단 알림 확인 (2분 경고)"""
try:
response = requests.get(
SPOT_ACTION_URL, timeout=2
)
if response.status_code == 200:
data = response.json()
return {
'action': data.get('action'),
'time': data.get('time'),
'interrupted': True
}
except requests.exceptions.RequestException:
pass
return {'interrupted': False}
def graceful_shutdown(checkpoint_func=None):
"""우아한 종료 수행"""
logger.info("Spot interruption detected! Starting graceful shutdown...")
# 1. 새로운 작업 수신 중지
logger.info("Stopping new task acceptance...")
# 2. 현재 작업 체크포인트 저장
if checkpoint_func:
logger.info("Saving checkpoint...")
checkpoint_func()
# 3. 로드밸런서에서 등록 해제
logger.info("Deregistering from load balancer...")
deregister_from_alb()
# 4. 진행 중인 요청 완료 대기 (최대 90초)
logger.info("Waiting for in-flight requests...")
time.sleep(10)
logger.info("Graceful shutdown complete")
def deregister_from_alb():
"""ALB에서 인스턴스 등록 해제"""
import boto3
instance_id = requests.get(
f"{METADATA_URL}/instance-id"
).text
elbv2 = boto3.client('elbv2')
# 실제 구현에서는 타겟 그룹 ARN을 설정에서 가져옴
# elbv2.deregister_targets(...)
def spot_monitor_loop(checkpoint_func=None, interval=5):
"""Spot 중단 모니터링 루프"""
logger.info("Starting Spot interruption monitor...")
while True:
status = check_spot_interruption()
if status['interrupted']:
graceful_shutdown(checkpoint_func)
sys.exit(0)
time.sleep(interval)
7. 자동 스케일링 최적화
7.1 스케일링 전략 비교
# Auto Scaling 전략 비교
target_tracking:
description: "목표 메트릭 값 유지"
example: "CPU 평균 60% 유지"
장점: "설정이 간단, 자동 조절"
단점: "단일 메트릭에 의존"
적합: "일반적인 웹 애플리케이션"
step_scaling:
description: "메트릭 범위별 스케일링 단계 정의"
example: "CPU 60-70%: +1, 70-80%: +2, 80%+: +4"
장점: "세밀한 제어"
단점: "설정 복잡"
적합: "복잡한 스케일링 패턴"
predictive_scaling:
description: "ML 기반 트래픽 예측 사전 스케일링"
example: "과거 14일 패턴 학습 후 미리 스케일 아웃"
장점: "사전 대응, 콜드 스타트 방지"
단점: "불규칙 트래픽에 부적합"
적합: "주기적 패턴이 있는 워크로드"
schedule_based:
description: "시간 기반 스케줄링"
example: "평일 9시-18시: 10대, 야간: 2대"
장점: "예측 가능한 비용"
단점: "갑작스러운 트래픽에 대응 부족"
적합: "업무 시간 패턴 워크로드"
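Target Tracking의 동작 원리는 비례식 하나로 요약됩니다: 원하는 용량 = 현재 용량 × (현재 메트릭 / 목표 메트릭). 아래는 이 계산을 보여주는 개념 스케치이며, 실제 서비스는 여기에 쿨다운·워밍업 등의 로직을 추가로 적용합니다.

```python
import math

def target_tracking_desired_capacity(
    current_capacity, current_metric, target_value,
    min_size=1, max_size=100
):
    """메트릭이 목표값에 수렴하도록 비례식으로 원하는 용량을 계산"""
    desired = math.ceil(current_capacity * current_metric / target_value)
    # min/max 범위로 클램핑
    return max(min_size, min(max_size, desired))

# 예: 10대에서 평균 CPU 84%, 목표 60% -> 14대로 스케일 아웃
```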
7.2 Karpenter를 이용한 K8s 노드 자동 스케일링
# Karpenter NodePool 설정
# karpenter-nodepool.yaml
apiVersion: karpenter.sh/v1beta1
kind: NodePool
metadata:
name: default
spec:
template:
spec:
requirements:
- key: "karpenter.sh/capacity-type"
operator: In
values: ["spot", "on-demand"]
- key: "kubernetes.io/arch"
operator: In
values: ["amd64"]
- key: "karpenter.k8s.aws/instance-category"
operator: In
values: ["c", "m", "r"]
- key: "karpenter.k8s.aws/instance-generation"
operator: Gt
values: ["4"]
nodeClassRef:
name: default
limits:
cpu: "1000"
memory: "1000Gi"
disruption:
consolidationPolicy: WhenUnderutilized
expireAfter: 720h # 30일
weight: 50
---
apiVersion: karpenter.k8s.aws/v1beta1
kind: EC2NodeClass
metadata:
name: default
spec:
amiFamily: AL2
subnetSelectorTerms:
- tags:
karpenter.sh/discovery: "my-cluster"
securityGroupSelectorTerms:
- tags:
karpenter.sh/discovery: "my-cluster"
instanceProfile: "KarpenterNodeInstanceProfile-my-cluster"
8. 스토리지 비용 최적화
8.1 S3 Lifecycle 정책
{
"Rules": [
{
"ID": "MoveToIntelligentTiering",
"Status": "Enabled",
"Filter": {
"Prefix": "data/"
},
"Transitions": [
{
"Days": 0,
"StorageClass": "INTELLIGENT_TIERING"
}
]
},
{
"ID": "ArchiveOldLogs",
"Status": "Enabled",
"Filter": {
"Prefix": "logs/"
},
"Transitions": [
{
"Days": 30,
"StorageClass": "STANDARD_IA"
},
{
"Days": 90,
"StorageClass": "GLACIER"
},
{
"Days": 365,
"StorageClass": "DEEP_ARCHIVE"
}
],
"Expiration": {
"Days": 730
}
},
{
"ID": "CleanupIncompleteUploads",
"Status": "Enabled",
"Filter": {},
"AbortIncompleteMultipartUpload": {
"DaysAfterInitiation": 7
}
}
]
}
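계층화의 효과를 가늠하려면 클래스별 보관 단가를 비교해 보면 됩니다. 아래 GB당 단가는 us-east-1 기준의 대략적인 가정값이며, 검색·전환·최소 보관 기간 요금은 제외한 스케치입니다.

```python
# S3 스토리지 클래스별 GB-월 단가 (us-east-1 기준 대략값, 가정)
STORAGE_PRICE_PER_GB = {
    'STANDARD': 0.023,
    'STANDARD_IA': 0.0125,
    'GLACIER': 0.0036,       # Glacier Flexible Retrieval
    'DEEP_ARCHIVE': 0.00099,
}

def monthly_storage_cost(size_gb, storage_class):
    """보관 비용만 비교 (검색/전환 요금 제외)"""
    return round(size_gb * STORAGE_PRICE_PER_GB[storage_class], 2)

# 예: 1TB 로그를 Standard에 두면 $23.00, Deep Archive면 $0.99
```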
8.2 EBS 최적화: gp2에서 gp3로 마이그레이션
# EBS gp2 -> gp3 마이그레이션 스크립트
# ebs_migration.py
import boto3
def find_gp2_volumes():
"""gp2 볼륨 찾기 및 절감 효과 계산"""
ec2 = boto3.client('ec2')
response = ec2.describe_volumes(
Filters=[{
'Name': 'volume-type',
'Values': ['gp2']
}]
)
volumes = []
total_monthly_savings = 0
for vol in response['Volumes']:
size_gb = vol['Size']
# gp2 가격: $0.10/GB-month
gp2_cost = size_gb * 0.10
# gp3 가격: $0.08/GB-month + IOPS/throughput 추가분
# gp3 기본: 3000 IOPS, 125 MB/s
gp3_cost = size_gb * 0.08
# gp2는 크기 * 3 = 기본 IOPS (최소 100)
gp2_baseline_iops = max(size_gb * 3, 100)
# gp3 추가 IOPS 필요시 (3000 초과분)
if gp2_baseline_iops > 3000:
extra_iops = gp2_baseline_iops - 3000
gp3_cost += extra_iops * 0.005 # $0.005/IOPS
savings = gp2_cost - gp3_cost
total_monthly_savings += max(savings, 0)
volumes.append({
'volume_id': vol['VolumeId'],
'size_gb': size_gb,
'state': vol['State'],
'gp2_monthly_cost': round(gp2_cost, 2),
'gp3_monthly_cost': round(gp3_cost, 2),
'monthly_savings': round(max(savings, 0), 2)
})
return {
'volumes': volumes,
'total_gp2_volumes': len(volumes),
'total_monthly_savings': round(total_monthly_savings, 2)
}
def migrate_to_gp3(volume_id, target_iops=3000, target_throughput=125):
"""gp2 볼륨을 gp3로 마이그레이션"""
ec2 = boto3.client('ec2')
response = ec2.modify_volume(
VolumeId=volume_id,
VolumeType='gp3',
Iops=target_iops,
Throughput=target_throughput
)
return {
'volume_id': volume_id,
'modification_state': response['VolumeModification']['ModificationState'],
'target_type': 'gp3',
'target_iops': target_iops,
'target_throughput': target_throughput
}
8.3 미사용 리소스 자동 정리
# 미사용 리소스 탐지 및 정리 자동화
# unused_resource_cleaner.py
import boto3
from datetime import datetime, timedelta
def find_unused_ebs_volumes():
"""연결되지 않은 EBS 볼륨 찾기"""
ec2 = boto3.client('ec2')
response = ec2.describe_volumes(
Filters=[{
'Name': 'status',
'Values': ['available']
}]
)
unused = []
for vol in response['Volumes']:
create_time = vol['CreateTime'].replace(tzinfo=None)
age_days = (datetime.utcnow() - create_time).days
if age_days > 7: # 7일 이상 미연결
unused.append({
'volume_id': vol['VolumeId'],
'size_gb': vol['Size'],
'type': vol['VolumeType'],
'age_days': age_days,
'monthly_cost': vol['Size'] * 0.10 # gp2 기준
})
return unused
def find_unused_elastic_ips():
"""미연결 Elastic IP 찾기"""
ec2 = boto3.client('ec2')
response = ec2.describe_addresses()
unused = [
{
'allocation_id': addr['AllocationId'],
'public_ip': addr['PublicIp'],
'monthly_cost': 3.65 # 미연결 EIP 비용
}
for addr in response['Addresses']
if 'InstanceId' not in addr
and 'NetworkInterfaceId' not in addr
]
return unused
def find_old_snapshots(days=90):
"""오래된 EBS 스냅샷 찾기"""
ec2 = boto3.client('ec2')
response = ec2.describe_snapshots(
OwnerIds=['self']
)
cutoff = datetime.utcnow() - timedelta(days=days)
old_snapshots = []
for snap in response['Snapshots']:
start_time = snap['StartTime'].replace(tzinfo=None)
if start_time < cutoff:
old_snapshots.append({
'snapshot_id': snap['SnapshotId'],
'volume_size': snap['VolumeSize'],
'start_time': str(start_time),
'age_days': (datetime.utcnow() - start_time).days
})
return old_snapshots
def find_unused_load_balancers():
"""트래픽 없는 로드밸런서 찾기"""
elbv2 = boto3.client('elbv2')
cw = boto3.client('cloudwatch')
response = elbv2.describe_load_balancers()
unused = []
    for lb in response['LoadBalancers']:
        # RequestCount는 AWS/ApplicationELB 네임스페이스의 ALB 전용 메트릭이므로 ALB만 검사
        if lb['Type'] != 'application':
            continue
        # 지난 7일간 요청 수 확인
metrics = cw.get_metric_statistics(
Namespace='AWS/ApplicationELB',
MetricName='RequestCount',
            Dimensions=[{
                'Name': 'LoadBalancer',
                # ARN 뒤쪽 "app/<name>/<id>" 부분이 CloudWatch 차원 값
                'Value': '/'.join(
                    lb['LoadBalancerArn'].split('/')[-3:]
                )
            }],
StartTime=datetime.utcnow() - timedelta(days=7),
EndTime=datetime.utcnow(),
Period=86400 * 7,
Statistics=['Sum']
)
total_requests = sum(
dp['Sum'] for dp in metrics.get('Datapoints', [])
)
if total_requests == 0:
unused.append({
'lb_name': lb['LoadBalancerName'],
'lb_arn': lb['LoadBalancerArn'],
'type': lb['Type'],
'monthly_cost': 22.0 # ALB 기본 비용 약 $22/month
})
return unused
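위 탐지 함수들의 결과를 하나의 리포트로 합칠 때는 다음과 같은 집계 스케치를 쓸 수 있습니다. monthly_cost 항목이 없는 결과(예: 스냅샷)는 0으로 처리합니다.

```python
def summarize_cleanup_findings(*finding_lists):
    """탐지 결과 목록들을 합쳐 건수와 예상 월간 절감액을 집계"""
    items = [item for lst in finding_lists for item in lst]
    total = sum(item.get('monthly_cost', 0) for item in items)
    return {
        'total_findings': len(items),
        'estimated_monthly_savings': round(total, 2),
    }

# 사용 예 (실제로는 위 find_* 함수들의 반환값을 전달):
# summarize_cleanup_findings(
#     find_unused_ebs_volumes(), find_unused_elastic_ips(),
#     find_old_snapshots(), find_unused_load_balancers()
# )
```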
9. Kubernetes 비용 관리
9.1 Kubecost 설치 및 설정
# Kubecost 설치 (Helm)
helm repo add kubecost https://kubecost.github.io/cost-analyzer/
helm repo update
helm install kubecost kubecost/cost-analyzer \
--namespace kubecost \
--create-namespace \
--set kubecostToken="YOUR_TOKEN" \
--set prometheus.server.retention="15d" \
--set persistentVolume.enabled=true \
--set persistentVolume.size="32Gi"
9.2 네임스페이스별 비용 할당
# Kubecost 비용 할당 설정
# kubecost-allocation.yaml
apiVersion: v1
kind: ConfigMap
metadata:
name: allocation-config
namespace: kubecost
data:
allocation.json: |
{
"sharedNamespaces": [
"kube-system",
"kubecost",
"monitoring",
"istio-system"
],
"sharedOverhead": {
"cpu": "2",
"memory": "4Gi"
},
"labelConfig": {
"enabled": true,
"team_label": "team",
"department_label": "department",
"product_label": "product",
"environment_label": "environment"
}
}
9.3 리소스 쿼터와 리밋레인지
# 네임스페이스별 리소스 쿼터
# resource-quota.yaml
apiVersion: v1
kind: ResourceQuota
metadata:
name: team-backend-quota
namespace: backend
spec:
hard:
requests.cpu: "20"
requests.memory: "40Gi"
limits.cpu: "40"
limits.memory: "80Gi"
pods: "50"
services.loadbalancers: "2"
persistentvolumeclaims: "10"
---
# 기본 리소스 제한
apiVersion: v1
kind: LimitRange
metadata:
name: default-limits
namespace: backend
spec:
limits:
- default:
cpu: "500m"
memory: "512Mi"
defaultRequest:
cpu: "100m"
memory: "128Mi"
max:
cpu: "4"
memory: "8Gi"
min:
cpu: "50m"
memory: "64Mi"
type: Container
9.4 VPA (Vertical Pod Autoscaler) 설정
# VPA로 Pod 리소스 자동 최적화
# vpa-recommendation.yaml
apiVersion: autoscaling.k8s.io/v1
kind: VerticalPodAutoscaler
metadata:
name: user-api-vpa
namespace: backend
spec:
targetRef:
apiVersion: "apps/v1"
kind: Deployment
name: user-api
updatePolicy:
updateMode: "Auto" # Auto, Recreate, Initial, Off
resourcePolicy:
containerPolicies:
- containerName: user-api
minAllowed:
cpu: "50m"
memory: "64Mi"
maxAllowed:
cpu: "2"
memory: "4Gi"
controlledResources: ["cpu", "memory"]
controlledValues: RequestsAndLimits
10. AI/ML 비용 관리
10.1 GPU 인스턴스 비용 최적화
# GPU Spot 인스턴스를 활용한 ML 학습 비용 최적화
# ml_cost_optimizer.py
import boto3
import json
def get_gpu_spot_pricing():
"""GPU Spot 인스턴스 가격 조회"""
ec2 = boto3.client('ec2')
gpu_instances = [
'p3.2xlarge', 'p3.8xlarge', 'p3.16xlarge',
'p4d.24xlarge',
'g4dn.xlarge', 'g4dn.2xlarge', 'g4dn.4xlarge',
'g5.xlarge', 'g5.2xlarge', 'g5.4xlarge'
]
prices = []
for instance_type in gpu_instances:
response = ec2.describe_spot_price_history(
InstanceTypes=[instance_type],
ProductDescriptions=['Linux/UNIX'],
MaxResults=1
)
if response['SpotPriceHistory']:
spot_price = float(
response['SpotPriceHistory'][0]['SpotPrice']
)
# On-Demand 가격은 별도 API 또는 가격표에서 조회
prices.append({
'instance_type': instance_type,
'spot_price': spot_price,
'availability_zone': (
response['SpotPriceHistory'][0]
['AvailabilityZone']
)
})
return sorted(prices, key=lambda x: x['spot_price'])
def estimate_training_cost(
instance_type,
num_instances,
training_hours,
use_spot=True,
spot_discount=0.7
):
"""ML 학습 비용 추정"""
# 대략적인 On-Demand 시간당 가격
on_demand_prices = {
'p3.2xlarge': 3.06,
'p3.8xlarge': 12.24,
'p4d.24xlarge': 32.77,
'g4dn.xlarge': 0.526,
'g5.xlarge': 1.006,
'g5.2xlarge': 1.212,
}
hourly_rate = on_demand_prices.get(instance_type, 0)
if use_spot:
hourly_rate *= (1 - spot_discount)
total_cost = hourly_rate * num_instances * training_hours
return {
'instance_type': instance_type,
'num_instances': num_instances,
'training_hours': training_hours,
'use_spot': use_spot,
'hourly_rate_per_instance': round(hourly_rate, 3),
'total_estimated_cost': round(total_cost, 2)
}
10.2 SageMaker 비용 최적화
# SageMaker 비용 최적화 전략
# sagemaker_optimizer.py
import boto3
def setup_managed_spot_training(
training_job_name,
role_arn,
image_uri,
instance_type='ml.p3.2xlarge',
instance_count=1,
max_wait_seconds=7200,
max_runtime_seconds=3600
):
"""SageMaker Managed Spot Training 설정"""
sm = boto3.client('sagemaker')
response = sm.create_training_job(
TrainingJobName=training_job_name,
RoleArn=role_arn,
AlgorithmSpecification={
'TrainingImage': image_uri,
'TrainingInputMode': 'File'
},
ResourceConfig={
'InstanceType': instance_type,
'InstanceCount': instance_count,
'VolumeSizeInGB': 50
},
# Spot Training 핵심 설정
EnableManagedSpotTraining=True,
StoppingCondition={
'MaxRuntimeInSeconds': max_runtime_seconds,
'MaxWaitTimeInSeconds': max_wait_seconds
},
# 체크포인트 설정 (Spot 중단 시 복구용)
CheckpointConfig={
'S3Uri': f's3://my-bucket/checkpoints/{training_job_name}'
},
InputDataConfig=[{
'ChannelName': 'training',
'DataSource': {
'S3DataSource': {
'S3DataType': 'S3Prefix',
'S3Uri': 's3://my-bucket/training-data/'
}
}
}],
OutputDataConfig={
'S3OutputPath': 's3://my-bucket/output/'
}
)
return response
def setup_inference_autoscaling(
endpoint_name,
variant_name='AllTraffic',
min_capacity=1,
max_capacity=10,
target_value=70.0
):
"""SageMaker 추론 엔드포인트 오토스케일링"""
aas = boto3.client('application-autoscaling')
# 스케일링 타겟 등록
aas.register_scalable_target(
ServiceNamespace='sagemaker',
ResourceId=f'endpoint/{endpoint_name}/variant/{variant_name}',
ScalableDimension=(
'sagemaker:variant:DesiredInstanceCount'
),
MinCapacity=min_capacity,
MaxCapacity=max_capacity
)
# Target Tracking 정책
aas.put_scaling_policy(
PolicyName=f'{endpoint_name}-scaling-policy',
ServiceNamespace='sagemaker',
ResourceId=f'endpoint/{endpoint_name}/variant/{variant_name}',
ScalableDimension=(
'sagemaker:variant:DesiredInstanceCount'
),
PolicyType='TargetTrackingScaling',
TargetTrackingScalingPolicyConfiguration={
'TargetValue': target_value,
'PredefinedMetricSpecification': {
'PredefinedMetricType': (
'SageMakerVariantInvocationsPerInstance'
)
},
'ScaleInCooldown': 300,
'ScaleOutCooldown': 60
}
)
11. 네트워킹 비용 최적화
11.1 데이터 전송 비용 이해
AWS 데이터 전송 비용 구조:
인터넷 -> AWS: 무료
AWS -> 인터넷: $0.09/GB (첫 10TB)
같은 AZ 내 (Private IP): 무료
같은 리전, 다른 AZ: $0.01/GB (각 방향)
다른 리전: $0.02/GB
AWS -> CloudFront: 무료
NAT Gateway 처리: $0.045/GB + $0.045/시간
주요 비용 트랩:
1. NAT Gateway를 통한 S3/DynamoDB 트래픽 -> VPC Endpoint로 절감
2. 크로스 AZ 트래픽 -> 서비스 메시 또는 AZ 인지 라우팅
3. 불필요한 리전 간 복제 -> 필요한 데이터만 선별 복제
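위 비용 트랩의 절감 효과는 단순 계산으로 추정할 수 있습니다. 아래는 스케치이며, 단가($0.045/GB, $0.01/GB)는 본문의 값을 그대로 사용했습니다.

```python
def nat_data_processing_savings(monthly_gb, nat_per_gb=0.045):
    """S3/DynamoDB 트래픽을 Gateway Endpoint로 우회할 때 절감되는
    NAT Gateway 데이터 처리 요금 (Gateway Endpoint 자체는 무료)"""
    return round(monthly_gb * nat_per_gb, 2)

def cross_az_traffic_cost(monthly_gb, per_gb_each_way=0.01):
    """같은 리전 내 AZ 간 트래픽 비용 (양방향 각각 과금)"""
    return round(monthly_gb * per_gb_each_way * 2, 2)

# 예: 월 10TB의 S3 트래픽이 NAT를 경유한다면 약 $450/월 절감 가능
```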
11.2 VPC Endpoint로 NAT Gateway 비용 절감
# VPC Endpoint 설정 (Terraform)
# vpc_endpoints.tf
# S3 Gateway Endpoint (무료)
resource "aws_vpc_endpoint" "s3" {
vpc_id = aws_vpc.main.id
service_name = "com.amazonaws.ap-northeast-2.s3"
route_table_ids = [
aws_route_table.private_a.id,
aws_route_table.private_b.id,
]
tags = {
Name = "s3-vpc-endpoint"
}
}
# DynamoDB Gateway Endpoint (무료)
resource "aws_vpc_endpoint" "dynamodb" {
vpc_id = aws_vpc.main.id
service_name = "com.amazonaws.ap-northeast-2.dynamodb"
route_table_ids = [
aws_route_table.private_a.id,
aws_route_table.private_b.id,
]
tags = {
Name = "dynamodb-vpc-endpoint"
}
}
# ECR Interface Endpoint
resource "aws_vpc_endpoint" "ecr_api" {
vpc_id = aws_vpc.main.id
service_name = "com.amazonaws.ap-northeast-2.ecr.api"
vpc_endpoint_type = "Interface"
private_dns_enabled = true
subnet_ids = [
aws_subnet.private_a.id,
aws_subnet.private_b.id,
]
security_group_ids = [
aws_security_group.vpc_endpoints.id
]
tags = {
Name = "ecr-api-vpc-endpoint"
}
}
# CloudWatch Logs Interface Endpoint
resource "aws_vpc_endpoint" "logs" {
vpc_id = aws_vpc.main.id
service_name = "com.amazonaws.ap-northeast-2.logs"
vpc_endpoint_type = "Interface"
private_dns_enabled = true
subnet_ids = [
aws_subnet.private_a.id,
aws_subnet.private_b.id,
]
security_group_ids = [
aws_security_group.vpc_endpoints.id
]
tags = {
Name = "logs-vpc-endpoint"
}
}
12. FinOps 도구 생태계
12.1 오픈소스 도구
# FinOps 오픈소스 도구 비교
infracost:
설명: "Terraform PR에 비용 영향 추정 코멘트"
통합: "GitHub Actions, GitLab CI, Atlantis"
장점: "코드 변경 전 비용 인지"
URL: "https://www.infracost.io/"
opencost:
설명: "K8s 비용 모니터링 (CNCF 프로젝트)"
통합: "Prometheus, Grafana, K8s"
장점: "실시간 K8s 비용 모니터링, 완전 오픈소스"
URL: "https://www.opencost.io/"
komiser:
설명: "멀티클라우드 비용 대시보드"
통합: "AWS, GCP, Azure, DigitalOcean"
장점: "자체 호스팅, 멀티클라우드"
URL: "https://www.komiser.io/"
vantage:
설명: "클라우드 비용 투명성 플랫폼"
통합: "AWS, GCP, Azure, Datadog, Snowflake"
장점: "Unit Cost 추적, 비용 리포트"
URL: "https://www.vantage.sh/"
12.2 Infracost CI/CD 통합
# GitHub Actions에서 Infracost 사용
# .github/workflows/infracost.yml
name: Infracost
on:
pull_request:
paths:
- '**.tf'
- '**.tfvars'
jobs:
infracost:
runs-on: ubuntu-latest
permissions:
contents: read
pull-requests: write
steps:
- name: Checkout
uses: actions/checkout@v4
- name: Setup Infracost
uses: infracost/actions/setup@v3
with:
api-key: "${{ secrets.INFRACOST_API_KEY }}"
- name: Generate Infracost diff
run: |
infracost diff \
--path=. \
--format=json \
--out-file=/tmp/infracost.json
- name: Post Infracost comment
uses: infracost/actions/comment@v1
with:
path: /tmp/infracost.json
behavior: update
13. FinOps 문화 구축
13.1 Showback vs Chargeback
Showback (쇼백):
├── 각 팀에게 비용을 "보여줌"
├── 실제 예산에서 차감하지 않음
├── 인식 제고 목적
├── FinOps 시작 단계에 적합
└── 위험: 행동 변화 동기 부족
Chargeback (차지백):
├── 각 팀의 예산에서 실제 차감
├── 강한 비용 인식 동기
├── 정확한 비용 할당 필요
├── FinOps 성숙 단계에 적합
└── 위험: 공유 리소스 할당 분쟁
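Chargeback의 공유 리소스 분쟁을 줄이는 일반적인 방법은 직접 비용 비율에 따른 안분입니다. 아래는 이를 보여주는 최소 스케치이며, 균등 분배 폴백은 예시로 정한 규칙입니다.

```python
def allocate_shared_cost(direct_costs, shared_cost):
    """공유 비용을 각 팀의 직접 비용 비율로 안분 (비율 기반 chargeback 예시)"""
    total_direct = sum(direct_costs.values())
    if total_direct == 0:
        # 직접 비용이 전혀 없으면 균등 분배 (폴백 규칙, 가정)
        share = shared_cost / len(direct_costs)
        return {team: round(share, 2) for team in direct_costs}
    return {
        team: round(shared_cost * cost / total_direct, 2)
        for team, cost in direct_costs.items()
    }

# 예: backend $6,000 / data $4,000 직접 비용, 공유 비용 $1,000
#     -> backend $600, data $400 할당
```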
13.2 Unit Economics 기반 추적
# Unit Economics 대시보드 데이터 생성
# unit_economics.py
def calculate_unit_economics(
total_cloud_cost,
total_revenue,
active_users,
total_transactions,
total_api_calls
):
"""비즈니스 단위당 클라우드 비용 계산"""
cloud_cost_ratio = (total_cloud_cost / total_revenue) * 100
return {
'cost_per_user': round(
total_cloud_cost / active_users, 4
),
'cost_per_transaction': round(
total_cloud_cost / total_transactions, 6
),
'cost_per_1k_api_calls': round(
(total_cloud_cost / total_api_calls) * 1000, 4
),
'cloud_cost_revenue_ratio': round(cloud_cost_ratio, 2),
'gross_margin_impact': round(100 - cloud_cost_ratio, 2)
}
# 예시
metrics = calculate_unit_economics(
total_cloud_cost=50000, # 월 클라우드 비용 $50,000
total_revenue=500000, # 월 매출 $500,000
active_users=100000, # 월간 활성 사용자 100,000명
total_transactions=2000000, # 월간 트랜잭션 200만건
total_api_calls=500000000 # 월간 API 호출 5억건
)
# 결과:
# cost_per_user: $0.50
# cost_per_transaction: $0.025
# cost_per_1k_api_calls: $0.10
# cloud_cost_revenue_ratio: 10.0%
13.3 FinOps 팀 구성과 역할
# FinOps 팀 구성 모범 사례
finops_team:
finops_lead:
역할: "FinOps 프로그램 총괄"
책임:
- "비용 최적화 전략 수립"
- "경영진 리포팅"
- "팀 간 조율"
배경: "재무 또는 클라우드 아키텍처"
cloud_analyst:
역할: "비용 데이터 분석"
책임:
- "비용 리포트 작성"
- "이상 감지 및 조사"
- "예산 vs 실제 비용 추적"
배경: "데이터 분석"
engineering_champion:
역할: "각 엔지니어링 팀의 FinOps 챔피언"
책임:
- "팀 내 비용 인식 전파"
- "리소스 최적화 실행"
- "태깅 준수 확인"
배경: "소프트웨어 엔지니어링"
governance:
weekly_review: "주간 비용 리뷰 미팅"
monthly_report: "월간 FinOps 리포트"
quarterly_optimization: "분기별 대규모 최적화 실행"
annual_planning: "연간 클라우드 예산 계획"
14. 실전 비용 최적화 체크리스트
# FinOps 비용 최적화 체크리스트
immediate_wins: # 즉시 실행 가능
- "미사용 EBS 볼륨 삭제"
- "미연결 Elastic IP 해제"
- "gp2 -> gp3 마이그레이션"
- "이전 세대 인스턴스 업그레이드 (m4 -> m6i)"
- "개발 환경 야간/주말 중지"
- "오래된 스냅샷 정리"
- "S3 Intelligent-Tiering 활성화"
short_term: # 1-3개월
- "Reserved Instances / Savings Plans 구매"
- "Spot 인스턴스 도입 (적합한 워크로드)"
- "Auto Scaling 최적화"
- "VPC Endpoint 설정"
- "태깅 전략 수립 및 적용"
- "비용 알림 설정"
long_term: # 3-12개월
- "Karpenter 도입 (K8s)"
- "Kubecost 기반 비용 할당"
- "Unit Economics 추적 체계"
- "FinOps 문화 정착"
- "멀티클라우드 비용 최적화"
- "Infracost CI/CD 통합"
- "자동화된 비용 거버넌스"
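예를 들어 '개발 환경 야간/주말 중지' 항목의 절감 효과는 다음과 같이 추정할 수 있습니다. 평일 야간 12시간과 주말 전체를 중지한다고 가정한 스케치이며, 함수명과 단가는 설명용 가정입니다.

```python
def dev_shutdown_savings(hourly_cost, instance_count):
    """평일 야간(12h) + 주말(48h) 중지 시 월 절감액 추정 (월 730시간 가정)"""
    hours_per_week = 7 * 24             # 168시간
    off_hours = 5 * 12 + 2 * 24         # 평일 야간 60h + 주말 48h = 108h
    off_ratio = off_hours / hours_per_week
    return round(hourly_cost * instance_count * 730 * off_ratio, 2)

# 예시: m5.large($0.096/h로 가정) 개발 인스턴스 20대
print(dev_shutdown_savings(0.096, 20))  # → 901.03 (월 약 64% 절감)
```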
15. 퀴즈
Q1. FinOps 라이프사이클의 세 단계를 올바른 순서로 나열하고, 각 단계의 핵심 활동을 설명하세요.
정답:
- Inform (알리기): 비용 가시성 확보. 태깅, Cost Explorer 분석, 부서/팀별 비용 할당, 비용 이상 감지. "누가 무엇에 얼마를 쓰는지" 파악하는 단계.
- Optimize (최적화): 비용 절감 실행. 라이트사이징, Reserved Instances/Savings Plans 구매, Spot 인스턴스 도입, 미사용 리소스 정리, 스토리지 계층화.
- Operate (운영): 지속적인 최적화 거버넌스. 자동화된 정책 적용, 비용 알림, 정기적인 리뷰, Unit Economics 추적, FinOps 문화 정착.
이 세 단계는 반복적(iterative)으로 수행됩니다.
Q2. Savings Plans과 Reserved Instances의 주요 차이점 3가지를 설명하세요.
정답:
- 유연성: Savings Plans(특히 Compute SP)은 인스턴스 패밀리, 리전, OS를 자유롭게 변경 가능. RI(Standard)는 특정 인스턴스 타입/리전에 고정.
- 서비스 범위: Savings Plans은 EC2, Fargate, Lambda를 모두 커버. RI는 EC2 전용(RDS, ElastiCache 등은 별도 RI).
- 약정 방식: Savings Plans은 시간당 사용 금액(dollar/hour)으로 약정. RI는 특정 인스턴스 수량으로 약정.
두 옵션 모두 최대 72% 할인, 1년/3년 약정 기간은 동일합니다.
Q3. Spot 인스턴스를 안정적으로 사용하기 위한 아키텍처 패턴 3가지를 설명하세요.
정답:
- 다양한 인스턴스 타입 풀: 단일 인스턴스 타입에 의존하지 않고, 여러 인스턴스 패밀리/크기를 지정하여 가용성 확보. capacityOptimized 전략 사용.
- 체크포인트와 재시도: 작업 진행 상태를 주기적으로 S3 등에 체크포인트로 저장. 중단 시 마지막 체크포인트부터 재개. SageMaker Managed Spot Training이 이 패턴을 자동화.
- On-Demand 혼합: Spot Fleet 또는 Auto Scaling Group에서 On-Demand 인스턴스를 기본 용량(baseline)으로 유지하고, Spot을 추가 용량으로 활용. 중단 시에도 최소 서비스 보장.
Q4. NAT Gateway 비용을 절감하기 위한 VPC Endpoint 전략을 설명하세요.
정답:
NAT Gateway는 시간당 비용($0.045/시간)과 데이터 처리 비용($0.045/GB)이 모두 발생합니다.
절감 전략:
- S3 Gateway Endpoint: 무료. S3 트래픽이 NAT Gateway를 우회하여 직접 S3에 접근. 대용량 데이터 처리 시 큰 절감.
- DynamoDB Gateway Endpoint: 무료. DynamoDB 트래픽도 NAT Gateway 우회.
- Interface Endpoint: ECR, CloudWatch, STS 등 자주 사용하는 AWS 서비스에 Interface Endpoint 설정. 시간당 비용이 있지만 대량 트래픽 시 NAT Gateway보다 저렴.
- NAT Gateway 공유 최적화: 가능하면 AZ당 1개의 NAT Gateway를 사용하되, 크로스 AZ 트래픽 비용과 비교하여 판단.
Q5. Unit Economics 기반 FinOps 추적이 왜 중요하며, 어떤 지표를 추적해야 하는지 설명하세요.
정답:
중요성: 절대 클라우드 비용만으로는 비즈니스 효율성을 판단할 수 없습니다. 매출이 2배 성장하면서 비용이 1.5배 증가했다면 오히려 효율이 개선된 것입니다. Unit Economics는 비용을 비즈니스 성과와 연결하여 의미 있는 최적화 방향을 제시합니다.
핵심 추적 지표:
- 사용자당 비용 (Cost per User): 클라우드 비용 / MAU
- 거래당 비용 (Cost per Transaction): 클라우드 비용 / 총 거래 수
- API 호출당 비용: 클라우드 비용 / API 호출 수
- 매출 대비 클라우드 비용 비율: 클라우드 비용 / 총 매출 (목표: 10-15% 이하)
- COGS 내 클라우드 비율: 매출원가 중 클라우드 비용 비중
이 지표들의 추세를 추적하여 효율이 개선되고 있는지 확인합니다.
16. 참고 자료
- FinOps Foundation - https://www.finops.org/
- AWS Cost Optimization Pillar - AWS Well-Architected Framework
- AWS Cost Explorer API - AWS Documentation
- GCP Cost Management - Google Cloud Documentation
- Azure Cost Management - Microsoft Learn
- Kubecost Documentation - https://docs.kubecost.com/
- OpenCost Project - https://www.opencost.io/
- Infracost Documentation - https://www.infracost.io/docs/
- Karpenter Documentation - https://karpenter.sh/
- AWS Compute Optimizer - AWS Documentation
- GCP Recommender - Google Cloud Documentation
- FinOps Certified Practitioner - FinOps Foundation Certification
- Cloud FinOps (O'Reilly) - J.R. Storment & Mike Fuller
- Spot.io (NetApp) - https://spot.io/
마무리
FinOps는 단순한 비용 절감 도구가 아닌, 클라우드 비용을 비즈니스 가치와 연결하는 문화적 전환입니다. 태깅과 가시성 확보(Inform)에서 시작하여, 라이트사이징과 예약 할인(Optimize)을 실행하고, 지속적인 거버넌스(Operate)를 구축하는 것이 핵심입니다.
가장 중요한 것은 시작하는 것입니다. 완벽한 FinOps 체계를 한 번에 구축할 필요는 없습니다. Crawl 단계에서 기본 태깅과 비용 리포팅부터 시작하고, Walk 단계에서 예약 할인과 자동화를 도입하며, Run 단계에서 Unit Economics 기반의 정교한 최적화를 실현하세요.
FinOps & Cloud Cost Optimization Complete Guide 2025: AWS/GCP/Azure Cost Reduction Strategies
Table of Contents
1. What is FinOps?
1.1 Definition and Background
As cloud costs have grown explosively, organizations need a new framework for managing expenses. FinOps (Financial Operations) combines Finance and DevOps: a cultural practice where engineering teams take ownership of cloud costs and balance business value against spending.
According to the FinOps Foundation, FinOps is based on these principles:
- Cross-team collaboration: Engineering, finance, and business teams manage costs together
- Cost ownership: Each team is responsible for its own cloud spending
- Timely decision-making: Cost optimization based on real-time data
- Business value focus: Maximizing business value, not just cutting costs
1.2 FinOps Lifecycle: Inform - Optimize - Operate
The FinOps framework consists of three iterative phases:
┌──────────────────────────────────────────────────────┐
│                   FinOps Lifecycle                   │
│                                                      │
│  ┌────────────┐    ┌────────────┐    ┌────────────┐  │
│  │   Inform   │───>│  Optimize  │───>│  Operate   │  │
│  │            │    │            │    │            │  │
│  │    Gain    │    │  Execute   │    │  Sustain   │  │
│  │ Visibility │    │  Savings   │    │ Governance │  │
│  └────────────┘    └────────────┘    └────────────┘  │
│        ^                                  |          │
│        └──────────────────────────────────┘          │
│                       Iterate                        │
└──────────────────────────────────────────────────────┘
Inform: Gain cost visibility. Understand who is spending what, where, and how much.
Optimize: Execute right-sizing, reservation discounts, unused resource cleanup, and more.
Operate: Build processes and governance to sustain optimization continuously.
1.3 FinOps Maturity Model
Level 1: Crawl (Foundation)
├── Basic cost reporting
├── Begin tagging strategy
└── Identify major cost drivers
Level 2: Walk (Intermediate)
├── Department-level cost allocation
├── Reserved Instance adoption
├── Automated reporting
└── Cost anomaly detection
Level 3: Run (Advanced)
├── Real-time cost optimization
├── Auto-scaling optimization
├── Unit Economics-based tracking
├── Cost forecasting and budgeting
└── FinOps culture fully embedded
2. Understanding Cloud Pricing Models
2.1 Pricing Model Comparison
To optimize cloud costs, you first need a solid understanding of pricing models.
# Cloud pricing model comparison
On-Demand:
description: "Pay for what you use. No commitment"
discount: "0% (baseline price)"
pros: "Flexibility, instant start/stop"
cons: "Most expensive option"
ideal_workloads: "Dev/test, short-term projects, irregular workloads"
Reserved_Instances:
description: "Discount for 1 or 3 year commitment"
discount: "Up to 72% (AWS), up to 57% (Azure)"
pros: "Large discounts, capacity reservation"
cons: "Long commitment, limited flexibility"
ideal_workloads: "Stable production workloads"
Savings_Plans:
description: "Discount for hourly spend commitment (AWS)"
discount: "Up to 72%"
pros: "More flexible than RI, can change instance family/region"
cons: "Commitment required"
ideal_workloads: "Stable but variable workloads"
Spot_Preemptible:
description: "Deep discounts on spare capacity"
discount: "Up to 90% (AWS), up to 91% (GCP)"
pros: "Cheapest option"
cons: "Can be interrupted at any time"
ideal_workloads: "Batch processing, CI/CD, fault-tolerant workloads"
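As a rough illustration of the gap between these models, here is the monthly cost of a single instance under each one. The hourly rate and discount percentages below are assumptions for illustration, not quoted prices:

```python
# Assumed m5.xlarge-class on-demand rate and illustrative discounts
ON_DEMAND_HOURLY = 0.192
HOURS_PER_MONTH = 730

hourly_rates = {
    'on_demand': ON_DEMAND_HOURLY,
    'reserved_3yr': ON_DEMAND_HOURLY * (1 - 0.60),  # assumed ~60% discount
    'savings_plan': ON_DEMAND_HOURLY * (1 - 0.50),  # assumed ~50% discount
    'spot': ON_DEMAND_HOURLY * (1 - 0.70),          # assumed ~70% discount
}
monthly = {
    name: round(rate * HOURS_PER_MONTH, 2)
    for name, rate in hourly_rates.items()
}
print(monthly)
# → {'on_demand': 140.16, 'reserved_3yr': 56.06, 'savings_plan': 70.08, 'spot': 42.05}
```

Even at these conservative assumed discounts, the same instance costs roughly 3x more on-demand than on Spot.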
2.2 Multi-Cloud Terminology Mapping
AWS                  GCP                        Azure
────────────────────────────────────────────────────────────────────
On-Demand            On-Demand                  Pay-as-you-go
Reserved Instances   Committed Use Discounts    Reserved VM Instances
Savings Plans        Flexible CUDs              Azure Savings Plan
Spot Instances       Preemptible/Spot VMs       Spot VMs
Cost Explorer        Billing Reports            Cost Management
AWS Budgets          GCP Budgets                Azure Budgets
Compute Optimizer    Recommender                Azure Advisor
3. Gaining Cost Visibility (Inform Phase)
3.1 Tagging Strategy
Tagging is the foundation of FinOps. Without proper tags on resources, you cannot determine who is spending what.
# Required tag schema example
mandatory_tags:
- key: "Environment"
values: ["production", "staging", "development", "sandbox"]
description: "Environment the resource belongs to"
- key: "Team"
values: ["platform", "backend", "data", "ml", "frontend"]
description: "Team that owns the resource"
- key: "Service"
values: ["user-api", "payment", "notification", "analytics"]
description: "Service the resource supports"
- key: "CostCenter"
values: ["CC-1001", "CC-1002", "CC-2001"]
description: "Cost center code"
- key: "ManagedBy"
values: ["terraform", "cloudformation", "manual", "pulumi"]
description: "Resource management tool"
optional_tags:
- key: "Project"
description: "Project name"
- key: "ExpiryDate"
description: "Resource expiry date (for temporary resources)"
3.2 Tag Enforcement Automation
# Detect non-compliant resources with AWS Config Rules
# config_rule_required_tags.py
import json
import boto3
REQUIRED_TAGS = ['Environment', 'Team', 'Service', 'CostCenter']
def lambda_handler(event, context):
"""AWS Config Rule: Check required tags"""
config = boto3.client('config')
configuration_item = json.loads(
event['invokingEvent']
)['configurationItem']
resource_tags = configuration_item.get('tags', {})
missing_tags = [
tag for tag in REQUIRED_TAGS
if tag not in resource_tags
]
compliance_type = (
'NON_COMPLIANT' if missing_tags else 'COMPLIANT'
)
annotation = (
f"Missing tags: {', '.join(missing_tags)}"
if missing_tags else "All required tags present"
)
config.put_evaluations(
Evaluations=[{
'ComplianceResourceType': configuration_item['resourceType'],
'ComplianceResourceId': configuration_item['resourceId'],
'ComplianceType': compliance_type,
'Annotation': annotation,
'OrderingTimestamp': configuration_item['configurationItemCaptureTime']
}],
ResultToken=event['resultToken']
)
return {
'compliance_type': compliance_type,
'annotation': annotation
}
3.3 AWS Cost Explorer Usage
# Cost analysis with AWS Cost Explorer API
# cost_analysis.py
import boto3
from datetime import datetime, timedelta
def get_cost_by_service(days=30):
"""Analyze cost by service"""
ce = boto3.client('ce')
end_date = datetime.now().strftime('%Y-%m-%d')
start_date = (
datetime.now() - timedelta(days=days)
).strftime('%Y-%m-%d')
response = ce.get_cost_and_usage(
TimePeriod={
'Start': start_date,
'End': end_date
},
Granularity='MONTHLY',
Metrics=['UnblendedCost'],
GroupBy=[{
'Type': 'DIMENSION',
'Key': 'SERVICE'
}]
)
costs = []
for time_period in response['ResultsByTime']:
for group in time_period['Groups']:
service = group['Keys'][0]
amount = float(
group['Metrics']['UnblendedCost']['Amount']
)
if amount > 0:
costs.append({
'service': service,
'cost': round(amount, 2)
})
costs.sort(key=lambda x: x['cost'], reverse=True)
return costs
def get_cost_by_tag(tag_key='Team', days=30):
"""Analyze cost by tag"""
ce = boto3.client('ce')
end_date = datetime.now().strftime('%Y-%m-%d')
start_date = (
datetime.now() - timedelta(days=days)
).strftime('%Y-%m-%d')
response = ce.get_cost_and_usage(
TimePeriod={
'Start': start_date,
'End': end_date
},
Granularity='MONTHLY',
Metrics=['UnblendedCost'],
GroupBy=[{
'Type': 'TAG',
'Key': tag_key
}]
)
return response
def detect_cost_anomalies(threshold_pct=20):
"""Cost anomaly detection: alert when increase exceeds threshold%"""
ce = boto3.client('ce')
current_month_start = datetime.now().replace(day=1).strftime('%Y-%m-%d')
current_date = datetime.now().strftime('%Y-%m-%d')
last_month_start = (
datetime.now().replace(day=1) - timedelta(days=1)
).replace(day=1).strftime('%Y-%m-%d')
last_month_end = (
datetime.now().replace(day=1) - timedelta(days=1)
).strftime('%Y-%m-%d')
# Current month cost
current = ce.get_cost_and_usage(
TimePeriod={
'Start': current_month_start,
'End': current_date
},
Granularity='MONTHLY',
Metrics=['UnblendedCost'],
GroupBy=[{'Type': 'DIMENSION', 'Key': 'SERVICE'}]
)
# Previous month cost
previous = ce.get_cost_and_usage(
TimePeriod={
'Start': last_month_start,
'End': last_month_end
},
Granularity='MONTHLY',
Metrics=['UnblendedCost'],
GroupBy=[{'Type': 'DIMENSION', 'Key': 'SERVICE'}]
)
    def daily_avg(result, days):
        """Per-service daily-average cost from a Cost Explorer response"""
        return {
            group['Keys'][0]: float(
                group['Metrics']['UnblendedCost']['Amount']
            ) / days
            for period in result['ResultsByTime']
            for group in period['Groups']
        }
    # End date is exclusive, so the current month covers days 1..yesterday
    days_current = max(datetime.now().day - 1, 1)
    days_previous = (
        datetime.now().replace(day=1) - timedelta(days=1)
    ).day
    current_daily = daily_avg(current, days_current)
    previous_daily = daily_avg(previous, days_previous)
    # Flag services whose daily-average cost grew beyond the threshold
    anomalies = []
    for service, cost in current_daily.items():
        prev = previous_daily.get(service, 0)
        if prev > 0 and cost > prev * (1 + threshold_pct / 100):
            anomalies.append({
                'service': service,
                'previous_daily_avg': round(prev, 2),
                'current_daily_avg': round(cost, 2),
                'increase_pct': round((cost / prev - 1) * 100, 1)
            })
    return anomalies
3.4 GCP Billing Export with BigQuery Analysis
-- Analyze GCP Billing Export in BigQuery
-- Monthly cost trend by service
SELECT
invoice.month AS billing_month,
service.description AS service_name,
SUM(cost) + SUM(IFNULL(
(SELECT SUM(c.amount) FROM UNNEST(credits) c), 0
)) AS net_cost
FROM
`project.dataset.gcp_billing_export_v1_XXXXXX`
WHERE
invoice.month >= '202501'
GROUP BY
billing_month, service_name
HAVING
net_cost > 10
ORDER BY
billing_month DESC, net_cost DESC;
-- Daily cost tracking by project
SELECT
FORMAT_DATE('%Y-%m-%d', usage_start_time) AS usage_date,
project.name AS project_name,
SUM(cost) AS daily_cost
FROM
`project.dataset.gcp_billing_export_v1_XXXXXX`
WHERE
usage_start_time >= TIMESTAMP_SUB(CURRENT_TIMESTAMP(), INTERVAL 30 DAY)
GROUP BY
usage_date, project_name
ORDER BY
usage_date DESC, daily_cost DESC;
4. Right-sizing
4.1 What is Right-sizing?
Right-sizing is the process of selecting the optimal instance type and size for your workloads. Most cloud workloads are over-provisioned.
Typical EC2 Instance Utilization:
CPU Utilization Memory Utilization
┌──────────────┐ ┌──────────────┐
│████░░░░░░░░░░│ │██████░░░░░░░░│
│ 20-30% │ │ 40-50% │
└──────────────┘ └──────────────┘
=> 70-80% of CPU is wasted!
=> 50-60% of memory is wasted!
4.2 AWS Compute Optimizer
# Query right-sizing recommendations from AWS Compute Optimizer
# rightsizing_analyzer.py
import boto3
def get_ec2_recommendations():
"""Get EC2 instance right-sizing recommendations"""
co = boto3.client('compute-optimizer')
response = co.get_ec2_instance_recommendations(
filters=[{
'name': 'Finding',
'values': ['OVER_PROVISIONED']
}]
)
recommendations = []
for rec in response['instanceRecommendations']:
current = rec['currentInstanceType']
finding = rec['finding']
options = []
for option in rec['recommendationOptions']:
options.append({
'instance_type': option['instanceType'],
'migration_effort': option.get('migrationEffort', 'Unknown'),
'performance_risk': option.get('performanceRisk', 0)
})
recommendations.append({
'instance_id': rec['instanceArn'].split('/')[-1],
'current_type': current,
'finding': finding,
'options': options,
'utilization': rec.get('utilizationMetrics', [])
})
return recommendations
def get_ebs_recommendations():
"""Get EBS volume right-sizing recommendations"""
co = boto3.client('compute-optimizer')
response = co.get_ebs_volume_recommendations(
filters=[{
'name': 'Finding',
'values': ['NotOptimized']
}]
)
recommendations = []
for rec in response['volumeRecommendations']:
current_config = rec['currentConfiguration']
recommendations.append({
'volume_id': rec['volumeArn'].split('/')[-1],
'current_type': current_config['volumeType'],
'current_size': current_config['volumeSize'],
'finding': rec['finding'],
'options': rec['volumeRecommendationOptions']
})
return recommendations
4.3 GCP Recommender API
# Query VM right-sizing recommendations from GCP Recommender
# gcp_rightsizing.py
from google.cloud import recommender_v1
def get_vm_rightsizing_recommendations(project_id, zone):
"""Get GCP VM right-sizing recommendations"""
client = recommender_v1.RecommenderClient()
parent = (
f"projects/{project_id}/locations/{zone}/"
f"recommenders/google.compute.instance.MachineTypeRecommender"
)
recommendations = client.list_recommendations(
request={"parent": parent}
)
results = []
for rec in recommendations:
results.append({
'name': rec.name,
'description': rec.description,
'priority': rec.priority.name,
'state': rec.state_info.state.name,
'impact': {
'category': rec.primary_impact.category.name,
'cost_projection': str(
rec.primary_impact.cost_projection
) if rec.primary_impact.cost_projection else None
}
})
return results
5. Reserved Instances vs Savings Plans
5.1 Comparison Analysis
┌────────────────────────────────────────────────────────────────────┐
│                Reserved Instances vs Savings Plans                 │
├──────────────────┬────────────────────┬────────────────────────────┤
│ Aspect           │ Reserved Instances │ Savings Plans              │
├──────────────────┼────────────────────┼────────────────────────────┤
│ Discount         │ Up to 72%          │ Up to 72%                  │
│ Term             │ 1 or 3 years       │ 1 or 3 years               │
│ Instance flex    │ Limited (Standard) │ Compute SP: Very flexible  │
│                  │ Moderate (Conv.)   │ EC2 SP: Within family      │
│ Region flex      │ Zonal/Regional     │ Compute SP: All regions    │
│ Service scope    │ EC2 only           │ EC2 + Fargate + Lambda     │
│ Payment options  │ All/Partial/None   │ All/Partial/None           │
│ Recommended for  │ Stable workloads   │ Variable workloads         │
└──────────────────┴────────────────────┴────────────────────────────┘
5.2 Break-even Analysis
# Reserved Instance break-even analysis
# break_even_analysis.py
def calculate_break_even(
on_demand_hourly,
ri_hourly,
ri_upfront,
term_months=12
):
"""Calculate RI break-even point"""
hours_per_month = 730 # 24 * 365 / 12
# Monthly savings
monthly_savings = (on_demand_hourly - ri_hourly) * hours_per_month
    if monthly_savings <= 0:
        return None
# Break-even point (months)
break_even_months = ri_upfront / monthly_savings if ri_upfront > 0 else 0
# Total savings over term
total_on_demand = on_demand_hourly * hours_per_month * term_months
total_ri = ri_hourly * hours_per_month * term_months + ri_upfront
total_savings = total_on_demand - total_ri
savings_pct = (total_savings / total_on_demand) * 100
return {
'break_even_months': round(break_even_months, 1),
'monthly_savings': round(monthly_savings, 2),
'total_savings': round(total_savings, 2),
'savings_percentage': round(savings_pct, 1),
'total_on_demand_cost': round(total_on_demand, 2),
'total_ri_cost': round(total_ri, 2)
}
# Example: m5.xlarge (us-east-1)
result = calculate_break_even(
on_demand_hourly=0.192,
ri_hourly=0.0684, # 3-year partial upfront
ri_upfront=1248.0,
term_months=36
)
# Result:
# break_even_months: 13.8
# total_savings: ~$2,000 (~40%)
# => Net positive after 14 months
def recommend_commitment_strategy(usage_history):
"""Recommend commitment strategy based on usage patterns"""
import numpy as np
# Minimum usage (P10): safely cover with RI
# Average usage (P50): cover with Savings Plans
# Peak usage (P90): keep On-Demand
p10 = np.percentile(usage_history, 10)
p50 = np.percentile(usage_history, 50)
p90 = np.percentile(usage_history, 90)
return {
'reserved_instances': {
'coverage': 'P10 baseline',
'amount': p10,
'reason': 'Always-on minimum capacity'
},
'savings_plans': {
'coverage': 'P10 to P50',
'amount': p50 - p10,
'reason': 'Flexible discount for variable range'
},
'on_demand': {
'coverage': 'Above P50',
'amount': p90 - p50,
'reason': 'Elastic response to peaks'
}
}
6. Spot/Preemptible Instance Strategies
6.1 Spot Instance Architecture
Spot Instances offer up to 90% discount but can be reclaimed with a 2-minute warning. Proper architecture is key.
# Spot Instance suitability criteria
spot_suitable:
- Batch processing jobs
- CI/CD pipelines
- Data analytics / ETL
- Stateless web servers
- Containerized microservices
- Machine learning training
- Test environments
spot_not_suitable:
- Single-instance databases
- Long-running stateful tasks
- Non-interruptible production workloads
6.2 AWS Spot Fleet Configuration
{
"SpotFleetRequestConfig": {
"IamFleetRole": "arn:aws:iam::123456789012:role/spot-fleet-role",
"TargetCapacity": 10,
"SpotPrice": "0.10",
"TerminateInstancesWithExpiration": true,
"AllocationStrategy": "capacityOptimized",
"LaunchSpecifications": [
{
"InstanceType": "m5.xlarge",
"SubnetId": "subnet-abc123",
"WeightedCapacity": 1
},
{
"InstanceType": "m5a.xlarge",
"SubnetId": "subnet-abc123",
"WeightedCapacity": 1
},
{
"InstanceType": "m5d.xlarge",
"SubnetId": "subnet-def456",
"WeightedCapacity": 1
},
{
"InstanceType": "m4.xlarge",
"SubnetId": "subnet-def456",
"WeightedCapacity": 1
}
],
"OnDemandTargetCapacity": 2,
"OnDemandAllocationStrategy": "lowestPrice"
}
}
6.3 Spot Interruption Handling
# Spot instance interruption detection and handling
# spot_interruption_handler.py
import requests
import sys
import time
import logging
logging.basicConfig(level=logging.INFO)
logger = logging.getLogger(__name__)
METADATA_URL = "http://169.254.169.254/latest/meta-data"
SPOT_ACTION_URL = f"{METADATA_URL}/spot/instance-action"
def check_spot_interruption():
"""Check for Spot interruption notice (2-minute warning)"""
try:
response = requests.get(SPOT_ACTION_URL, timeout=2)
if response.status_code == 200:
data = response.json()
return {
'action': data.get('action'),
'time': data.get('time'),
'interrupted': True
}
except requests.exceptions.RequestException:
pass
return {'interrupted': False}
def graceful_shutdown(checkpoint_func=None):
"""Perform graceful shutdown"""
logger.info("Spot interruption detected! Starting graceful shutdown...")
# 1. Stop accepting new work
logger.info("Stopping new task acceptance...")
# 2. Save checkpoint of current work
if checkpoint_func:
logger.info("Saving checkpoint...")
checkpoint_func()
# 3. Deregister from load balancer
logger.info("Deregistering from load balancer...")
# 4. Wait for in-flight requests (max 90 seconds)
logger.info("Waiting for in-flight requests...")
time.sleep(10)
logger.info("Graceful shutdown complete")
def spot_monitor_loop(checkpoint_func=None, interval=5):
"""Spot interruption monitoring loop"""
logger.info("Starting Spot interruption monitor...")
while True:
status = check_spot_interruption()
if status['interrupted']:
graceful_shutdown(checkpoint_func)
sys.exit(0)
time.sleep(interval)
7. Auto-Scaling Optimization
7.1 Scaling Strategy Comparison
# Auto Scaling strategy comparison
target_tracking:
description: "Maintain target metric value"
example: "Keep average CPU at 60%"
pros: "Simple configuration, automatic adjustment"
cons: "Depends on a single metric"
ideal_for: "General web applications"
step_scaling:
description: "Define scaling steps per metric range"
example: "CPU 60-70%: +1, 70-80%: +2, 80%+: +4"
pros: "Fine-grained control"
cons: "Complex configuration"
ideal_for: "Complex scaling patterns"
predictive_scaling:
description: "ML-based traffic prediction for pre-scaling"
example: "Learn from 14-day patterns, scale out proactively"
pros: "Proactive response, avoids cold starts"
cons: "Not suited for irregular traffic"
ideal_for: "Workloads with cyclical patterns"
schedule_based:
description: "Time-based scheduling"
example: "Weekdays 9-18h: 10 instances, nights: 2 instances"
pros: "Predictable costs"
cons: "Cannot respond to sudden traffic spikes"
ideal_for: "Business-hours workloads"
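For schedule-based scaling, the savings come from instance-hours avoided outside the busy window. A quick sketch (the weekday 9-18h busy window and the instance counts are illustrative assumptions):

```python
def weekly_instance_hours(day_count, night_count):
    """Instance-hours per week with separate business-hours and off-hours fleet sizes"""
    business_hours = 5 * 9                  # weekday 9-18h = 45h
    off_hours = 7 * 24 - business_hours     # nights + weekend = 123h
    return day_count * business_hours + night_count * off_hours

static = weekly_instance_hours(10, 10)     # fixed fleet of 10
scheduled = weekly_instance_hours(10, 2)   # 10 by day, 2 nights/weekends
print(static, scheduled, round(1 - scheduled / static, 2))
# → 1680 696 0.59
```

Under these assumptions the scheduled fleet uses ~59% fewer instance-hours, which translates directly into compute cost for on-demand capacity.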
7.2 Karpenter for K8s Node Auto-Scaling
# Karpenter NodePool configuration
# karpenter-nodepool.yaml
apiVersion: karpenter.sh/v1beta1
kind: NodePool
metadata:
name: default
spec:
template:
spec:
requirements:
- key: "karpenter.sh/capacity-type"
operator: In
values: ["spot", "on-demand"]
- key: "kubernetes.io/arch"
operator: In
values: ["amd64"]
- key: "karpenter.k8s.aws/instance-category"
operator: In
values: ["c", "m", "r"]
- key: "karpenter.k8s.aws/instance-generation"
operator: Gt
values: ["4"]
nodeClassRef:
name: default
limits:
cpu: "1000"
memory: "1000Gi"
disruption:
consolidationPolicy: WhenUnderutilized
expireAfter: 720h # 30 days
weight: 50
8. Storage Cost Optimization
8.1 S3 Lifecycle Policies
{
"Rules": [
{
"ID": "MoveToIntelligentTiering",
"Status": "Enabled",
"Filter": {
"Prefix": "data/"
},
"Transitions": [
{
"Days": 0,
"StorageClass": "INTELLIGENT_TIERING"
}
]
},
{
"ID": "ArchiveOldLogs",
"Status": "Enabled",
"Filter": {
"Prefix": "logs/"
},
"Transitions": [
{
"Days": 30,
"StorageClass": "STANDARD_IA"
},
{
"Days": 90,
"StorageClass": "GLACIER"
},
{
"Days": 365,
"StorageClass": "DEEP_ARCHIVE"
}
],
"Expiration": {
"Days": 730
}
},
{
"ID": "CleanupIncompleteUploads",
"Status": "Enabled",
"Filter": {},
"AbortIncompleteMultipartUpload": {
"DaysAfterInitiation": 7
}
}
]
}
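The payoff of these transitions can be estimated from per-class storage prices. A minimal sketch, assuming approximate us-east-1 per-GB-month prices and a hypothetical post-lifecycle distribution (verify current pricing; retrieval and request charges are ignored here):

```python
# Assumed per-GB-month prices (us-east-1, approximate)
PRICES = {
    'STANDARD': 0.023,
    'STANDARD_IA': 0.0125,
    'GLACIER': 0.004,
    'DEEP_ARCHIVE': 0.00099,
}

def tiered_monthly_cost(size_gb, distribution):
    """Monthly storage cost for data spread across storage classes (shares sum to 1)"""
    return round(sum(
        size_gb * share * PRICES[tier]
        for tier, share in distribution.items()
    ), 2)

all_standard = tiered_monthly_cost(100_000, {'STANDARD': 1.0})
lifecycled = tiered_monthly_cost(100_000, {
    'STANDARD': 0.2, 'STANDARD_IA': 0.3, 'GLACIER': 0.3, 'DEEP_ARCHIVE': 0.2
})
print(all_standard, lifecycled)  # → 2300.0 974.8
```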
8.2 EBS Optimization: gp2 to gp3 Migration
# EBS gp2 -> gp3 migration script
# ebs_migration.py
import boto3
def find_gp2_volumes():
"""Find gp2 volumes and calculate savings potential"""
ec2 = boto3.client('ec2')
response = ec2.describe_volumes(
Filters=[{
'Name': 'volume-type',
'Values': ['gp2']
}]
)
volumes = []
total_monthly_savings = 0
for vol in response['Volumes']:
size_gb = vol['Size']
# gp2 price: $0.10/GB-month
gp2_cost = size_gb * 0.10
# gp3 price: $0.08/GB-month + IOPS/throughput add-ons
# gp3 baseline: 3000 IOPS, 125 MB/s
gp3_cost = size_gb * 0.08
# gp2 baseline IOPS = size * 3 (min 100)
gp2_baseline_iops = max(size_gb * 3, 100)
# Additional IOPS cost if gp2 exceeds gp3 baseline
if gp2_baseline_iops > 3000:
extra_iops = gp2_baseline_iops - 3000
gp3_cost += extra_iops * 0.005 # $0.005/IOPS
savings = gp2_cost - gp3_cost
total_monthly_savings += max(savings, 0)
volumes.append({
'volume_id': vol['VolumeId'],
'size_gb': size_gb,
'gp2_monthly_cost': round(gp2_cost, 2),
'gp3_monthly_cost': round(gp3_cost, 2),
'monthly_savings': round(max(savings, 0), 2)
})
return {
'volumes': volumes,
'total_gp2_volumes': len(volumes),
'total_monthly_savings': round(total_monthly_savings, 2)
}
def migrate_to_gp3(volume_id, target_iops=3000, target_throughput=125):
"""Migrate a gp2 volume to gp3"""
ec2 = boto3.client('ec2')
response = ec2.modify_volume(
VolumeId=volume_id,
VolumeType='gp3',
Iops=target_iops,
Throughput=target_throughput
)
return {
'volume_id': volume_id,
'modification_state': response['VolumeModification']['ModificationState'],
'target_type': 'gp3'
}
8.3 Unused Resource Cleanup Automation
# Unused resource detection and cleanup automation
# unused_resource_cleaner.py
import boto3
from datetime import datetime, timedelta
def find_unused_ebs_volumes():
"""Find detached EBS volumes"""
ec2 = boto3.client('ec2')
response = ec2.describe_volumes(
Filters=[{
'Name': 'status',
'Values': ['available']
}]
)
unused = []
for vol in response['Volumes']:
create_time = vol['CreateTime'].replace(tzinfo=None)
age_days = (datetime.utcnow() - create_time).days
if age_days > 7: # Detached for over 7 days
unused.append({
'volume_id': vol['VolumeId'],
'size_gb': vol['Size'],
'type': vol['VolumeType'],
'age_days': age_days,
'monthly_cost': vol['Size'] * 0.10
})
return unused
def find_unused_elastic_ips():
"""Find unassociated Elastic IPs"""
ec2 = boto3.client('ec2')
response = ec2.describe_addresses()
unused = [
{
'allocation_id': addr['AllocationId'],
'public_ip': addr['PublicIp'],
'monthly_cost': 3.65 # Unassociated EIP cost
}
for addr in response['Addresses']
if 'InstanceId' not in addr
and 'NetworkInterfaceId' not in addr
]
return unused
def find_old_snapshots(days=90):
"""Find old EBS snapshots"""
ec2 = boto3.client('ec2')
response = ec2.describe_snapshots(OwnerIds=['self'])
cutoff = datetime.utcnow() - timedelta(days=days)
old_snapshots = []
for snap in response['Snapshots']:
start_time = snap['StartTime'].replace(tzinfo=None)
if start_time < cutoff:
old_snapshots.append({
'snapshot_id': snap['SnapshotId'],
'volume_size': snap['VolumeSize'],
'age_days': (datetime.utcnow() - start_time).days
})
return old_snapshots
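The three finders above can be rolled into one monthly-waste report. A sketch with a hypothetical `summarize_waste` helper (the $0.05/GB-month snapshot price is an assumption; the sample inputs mimic the finders' output shape):

```python
def summarize_waste(volumes, eips, snapshots, snapshot_gb_price=0.05):
    """Aggregate finder results into a single monthly-waste estimate"""
    vol_cost = sum(v['monthly_cost'] for v in volumes)
    eip_cost = sum(e['monthly_cost'] for e in eips)
    snap_cost = sum(s['volume_size'] * snapshot_gb_price for s in snapshots)
    return {
        'unused_volumes': round(vol_cost, 2),
        'unused_eips': round(eip_cost, 2),
        'old_snapshots': round(snap_cost, 2),
        'total_monthly_waste': round(vol_cost + eip_cost + snap_cost, 2),
    }

print(summarize_waste(
    volumes=[{'monthly_cost': 10.0}, {'monthly_cost': 50.0}],
    eips=[{'monthly_cost': 3.65}],
    snapshots=[{'volume_size': 100}],
))
# → {'unused_volumes': 60.0, 'unused_eips': 3.65, 'old_snapshots': 5.0, 'total_monthly_waste': 68.65}
```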
9. Kubernetes Cost Management
9.1 Kubecost Installation
# Install Kubecost via Helm
helm repo add kubecost https://kubecost.github.io/cost-analyzer/
helm repo update
helm install kubecost kubecost/cost-analyzer \
--namespace kubecost \
--create-namespace \
--set kubecostToken="YOUR_TOKEN" \
--set prometheus.server.retention="15d" \
--set persistentVolume.enabled=true \
--set persistentVolume.size="32Gi"
9.2 Namespace Cost Allocation
# Kubecost allocation configuration
# kubecost-allocation.yaml
apiVersion: v1
kind: ConfigMap
metadata:
name: allocation-config
namespace: kubecost
data:
allocation.json: |
{
"sharedNamespaces": [
"kube-system",
"kubecost",
"monitoring",
"istio-system"
],
"labelConfig": {
"enabled": true,
"team_label": "team",
"department_label": "department",
"product_label": "product",
"environment_label": "environment"
}
}
9.3 Resource Quotas and Limit Ranges
# Namespace resource quota
# resource-quota.yaml
apiVersion: v1
kind: ResourceQuota
metadata:
name: team-backend-quota
namespace: backend
spec:
hard:
requests.cpu: "20"
requests.memory: "40Gi"
limits.cpu: "40"
limits.memory: "80Gi"
pods: "50"
services.loadbalancers: "2"
persistentvolumeclaims: "10"
---
# Default resource limits
apiVersion: v1
kind: LimitRange
metadata:
name: default-limits
namespace: backend
spec:
limits:
- default:
cpu: "500m"
memory: "512Mi"
defaultRequest:
cpu: "100m"
memory: "128Mi"
max:
cpu: "4"
memory: "8Gi"
min:
cpu: "50m"
memory: "64Mi"
type: Container
9.4 VPA (Vertical Pod Autoscaler)
# VPA for automatic Pod resource optimization
# vpa-recommendation.yaml
apiVersion: autoscaling.k8s.io/v1
kind: VerticalPodAutoscaler
metadata:
name: user-api-vpa
namespace: backend
spec:
targetRef:
apiVersion: "apps/v1"
kind: Deployment
name: user-api
updatePolicy:
updateMode: "Auto"
resourcePolicy:
containerPolicies:
- containerName: user-api
minAllowed:
cpu: "50m"
memory: "64Mi"
maxAllowed:
cpu: "2"
memory: "4Gi"
controlledResources: ["cpu", "memory"]
controlledValues: RequestsAndLimits
10. AI/ML Cost Management
10.1 GPU Instance Cost Optimization
# ML training cost optimization with GPU Spot instances
# ml_cost_optimizer.py
import boto3
def get_gpu_spot_pricing():
"""Query GPU Spot instance pricing"""
ec2 = boto3.client('ec2')
gpu_instances = [
'p3.2xlarge', 'p3.8xlarge', 'p3.16xlarge',
'g4dn.xlarge', 'g4dn.2xlarge',
'g5.xlarge', 'g5.2xlarge', 'g5.4xlarge'
]
prices = []
for instance_type in gpu_instances:
response = ec2.describe_spot_price_history(
InstanceTypes=[instance_type],
ProductDescriptions=['Linux/UNIX'],
MaxResults=1
)
if response['SpotPriceHistory']:
spot_price = float(
response['SpotPriceHistory'][0]['SpotPrice']
)
prices.append({
'instance_type': instance_type,
'spot_price': spot_price,
'az': response['SpotPriceHistory'][0]['AvailabilityZone']
})
return sorted(prices, key=lambda x: x['spot_price'])
def estimate_training_cost(
instance_type,
num_instances,
training_hours,
use_spot=True,
spot_discount=0.7
):
"""Estimate ML training cost"""
on_demand_prices = {
'p3.2xlarge': 3.06,
'p3.8xlarge': 12.24,
'p4d.24xlarge': 32.77,
'g4dn.xlarge': 0.526,
'g5.xlarge': 1.006,
}
hourly_rate = on_demand_prices.get(instance_type, 0)
if use_spot:
hourly_rate *= (1 - spot_discount)
total_cost = hourly_rate * num_instances * training_hours
return {
'instance_type': instance_type,
'num_instances': num_instances,
'training_hours': training_hours,
'use_spot': use_spot,
'hourly_rate_per_instance': round(hourly_rate, 3),
'total_estimated_cost': round(total_cost, 2)
}
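To make the Spot impact concrete, here is the arithmetic for a hypothetical 24-hour run on four p3.2xlarge instances, using the same assumed rate and discount as the function above:

```python
ON_DEMAND_HOURLY = 3.06   # assumed p3.2xlarge us-east-1 rate
SPOT_DISCOUNT = 0.7       # assumed average Spot discount
INSTANCES, HOURS = 4, 24

on_demand_cost = ON_DEMAND_HOURLY * INSTANCES * HOURS
spot_cost = on_demand_cost * (1 - SPOT_DISCOUNT)
print(round(on_demand_cost, 2), round(spot_cost, 2))
# → 293.76 88.13
```

The gap compounds quickly across hyperparameter sweeps or repeated retraining runs.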
10.2 SageMaker Cost Optimization
# SageMaker cost optimization strategies
# sagemaker_optimizer.py
import boto3
def setup_managed_spot_training(
training_job_name,
role_arn,
image_uri,
instance_type='ml.p3.2xlarge',
instance_count=1,
max_wait_seconds=7200,
max_runtime_seconds=3600
):
"""Configure SageMaker Managed Spot Training"""
sm = boto3.client('sagemaker')
response = sm.create_training_job(
TrainingJobName=training_job_name,
RoleArn=role_arn,
AlgorithmSpecification={
'TrainingImage': image_uri,
'TrainingInputMode': 'File'
},
ResourceConfig={
'InstanceType': instance_type,
'InstanceCount': instance_count,
'VolumeSizeInGB': 50
},
# Spot Training key settings
EnableManagedSpotTraining=True,
StoppingCondition={
'MaxRuntimeInSeconds': max_runtime_seconds,
'MaxWaitTimeInSeconds': max_wait_seconds
},
# Checkpoint config for Spot recovery
CheckpointConfig={
'S3Uri': f's3://my-bucket/checkpoints/{training_job_name}'
},
InputDataConfig=[{
'ChannelName': 'training',
'DataSource': {
'S3DataSource': {
'S3DataType': 'S3Prefix',
'S3Uri': 's3://my-bucket/training-data/'
}
}
}],
OutputDataConfig={
'S3OutputPath': 's3://my-bucket/output/'
}
)
return response
11. Networking Cost Optimization
11.1 Understanding Data Transfer Costs
AWS Data Transfer Cost Structure:
Internet -> AWS: Free
AWS -> Internet: $0.09/GB (first 10TB)
Same AZ (Private IP): Free
Same Region, cross-AZ: $0.01/GB (each direction)
Cross-Region: $0.02/GB (typical; varies by region pair)
AWS -> CloudFront: Free
NAT Gateway processing: $0.045/GB + $0.045/hour
Common Cost Traps:
1. S3/DynamoDB traffic via NAT Gateway -> Use VPC Endpoints
2. Cross-AZ traffic -> Service mesh or AZ-aware routing
3. Unnecessary cross-region replication -> Selective replication
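To see why trap #1 matters, a back-of-the-envelope estimate using the us-east-1 list prices from the table above (a free S3 Gateway Endpoint moves this traffic off the NAT Gateway entirely):

```python
def nat_gateway_monthly_cost(gb_processed, hours=730,
                             per_gb=0.045, per_hour=0.045):
    """Monthly NAT Gateway cost: data processing plus hourly charge.

    us-east-1 list prices assumed; 730 hours is roughly one month.
    """
    return round(gb_processed * per_gb + hours * per_hour, 2)

# 10 TB/month of S3 traffic mistakenly routed through a NAT Gateway:
print(nat_gateway_monthly_cost(10 * 1024))  # 493.65
```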
11.2 VPC Endpoints to Reduce NAT Gateway Costs
# VPC Endpoint configuration (Terraform)
# vpc_endpoints.tf
# S3 Gateway Endpoint (free)
resource "aws_vpc_endpoint" "s3" {
vpc_id = aws_vpc.main.id
service_name = "com.amazonaws.us-east-1.s3"
route_table_ids = [
aws_route_table.private_a.id,
aws_route_table.private_b.id,
]
tags = {
Name = "s3-vpc-endpoint"
}
}
# DynamoDB Gateway Endpoint (free)
resource "aws_vpc_endpoint" "dynamodb" {
vpc_id = aws_vpc.main.id
service_name = "com.amazonaws.us-east-1.dynamodb"
route_table_ids = [
aws_route_table.private_a.id,
aws_route_table.private_b.id,
]
tags = {
Name = "dynamodb-vpc-endpoint"
}
}
# ECR Interface Endpoint
resource "aws_vpc_endpoint" "ecr_api" {
vpc_id = aws_vpc.main.id
service_name = "com.amazonaws.us-east-1.ecr.api"
vpc_endpoint_type = "Interface"
private_dns_enabled = true
subnet_ids = [
aws_subnet.private_a.id,
aws_subnet.private_b.id,
]
security_group_ids = [
aws_security_group.vpc_endpoints.id
]
}
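Gateway Endpoints (S3, DynamoDB) are free, but Interface Endpoints carry their own hourly and per-GB charges, so they only pay off above a certain traffic volume. A rough break-even sketch, assuming typical us-east-1 list prices of about $0.01 per AZ-hour and $0.01/GB for the endpoint versus $0.045/GB NAT data processing (check current pricing for your region):

```python
def interface_endpoint_break_even_gb(az_count, hours=730,
                                     ep_hourly=0.01, ep_per_gb=0.01,
                                     nat_per_gb=0.045):
    """GB/month above which an Interface Endpoint beats routing the same
    traffic through a NAT Gateway.

    Assumes typical us-east-1 list prices; the NAT Gateway's own hourly
    charge is excluded, since the gateway usually remains in place for
    other traffic anyway.
    """
    fixed_monthly = az_count * hours * ep_hourly
    return round(fixed_monthly / (nat_per_gb - ep_per_gb), 1)

# A 2-AZ Interface Endpoint pays for itself above roughly:
print(interface_endpoint_break_even_gb(az_count=2))  # 417.1
```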
12. FinOps Tools Ecosystem
12.1 Open Source and Free Tools
# FinOps open source tools comparison
infracost:
description: "Cost estimation comments on Terraform PRs"
integration: "GitHub Actions, GitLab CI, Atlantis"
advantage: "Know cost impact before code changes"
opencost:
description: "K8s cost monitoring (CNCF project)"
integration: "Prometheus, Grafana, K8s"
advantage: "Real-time K8s cost monitoring, fully open source"
komiser:
description: "Multi-cloud cost dashboard"
integration: "AWS, GCP, Azure, DigitalOcean"
advantage: "Self-hosted, multi-cloud support"
vantage:
description: "Cloud cost transparency platform"
integration: "AWS, GCP, Azure, Datadog, Snowflake"
advantage: "Unit Cost tracking, cost reports"
12.2 Infracost CI/CD Integration
# Using Infracost with GitHub Actions
# .github/workflows/infracost.yml
name: Infracost
on:
pull_request:
paths:
- '**.tf'
- '**.tfvars'
jobs:
infracost:
runs-on: ubuntu-latest
permissions:
contents: read
pull-requests: write
    steps:
      - name: Checkout base branch
        uses: actions/checkout@v4
        with:
          ref: ${{ github.event.pull_request.base.ref }}
      - name: Setup Infracost
        uses: infracost/actions/setup@v3
        with:
          api-key: ${{ secrets.INFRACOST_API_KEY }}
      # infracost diff needs a baseline, so first generate one
      # from the PR's base branch
      - name: Generate cost baseline
        run: |
          infracost breakdown --path=. \
            --format=json \
            --out-file=/tmp/infracost-base.json
      - name: Checkout PR branch
        uses: actions/checkout@v4
      - name: Generate Infracost diff
        run: |
          infracost diff --path=. \
            --compare-to=/tmp/infracost-base.json \
            --format=json \
            --out-file=/tmp/infracost.json
      - name: Post Infracost comment
        run: |
          infracost comment github --path=/tmp/infracost.json \
            --repo=$GITHUB_REPOSITORY \
            --github-token=${{ github.token }} \
            --pull-request=${{ github.event.pull_request.number }} \
            --behavior=update
13. Building a FinOps Culture
13.1 Showback vs Chargeback
Showback:
├── Show each team their costs
├── No actual budget deduction
├── Awareness-building purpose
├── Suitable for FinOps early stages
└── Risk: Insufficient motivation for behavior change
Chargeback:
├── Actually deduct from each team's budget
├── Strong cost awareness motivation
├── Requires accurate cost allocation
├── Suitable for mature FinOps stages
└── Risk: Disputes over shared resource allocation
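In practice, a showback report is just tagged billing data grouped by team. A minimal sketch (the data shape and tag values are illustrative; in reality the line items would come from a Cost and Usage Report or the Cost Explorer API):

```python
from collections import defaultdict

def showback_report(line_items):
    """Group cost line items by team tag into a showback summary.

    line_items: iterable of (team_tag, cost) pairs. Untagged spend is
    kept visible under 'unallocated' rather than silently dropped, so
    tagging gaps show up in the report.
    """
    totals = defaultdict(float)
    for team, cost in line_items:
        totals[team if team else "unallocated"] += cost
    return dict(totals)

# Illustrative line items with one untagged entry:
items = [("platform", 1200.0), ("data", 800.0),
         ("platform", 300.0), (None, 150.0)]
print(showback_report(items))
# {'platform': 1500.0, 'data': 800.0, 'unallocated': 150.0}
```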
13.2 Unit Economics Tracking
# Unit Economics dashboard data generation
# unit_economics.py
def calculate_unit_economics(
total_cloud_cost,
total_revenue,
active_users,
total_transactions,
total_api_calls
):
"""Calculate cloud cost per business unit"""
cloud_cost_ratio = (total_cloud_cost / total_revenue) * 100
return {
'cost_per_user': round(
total_cloud_cost / active_users, 4
),
'cost_per_transaction': round(
total_cloud_cost / total_transactions, 6
),
'cost_per_1k_api_calls': round(
(total_cloud_cost / total_api_calls) * 1000, 4
),
'cloud_cost_revenue_ratio': round(cloud_cost_ratio, 2),
'gross_margin_impact': round(100 - cloud_cost_ratio, 2)
}
# Example
metrics = calculate_unit_economics(
total_cloud_cost=50000,
total_revenue=500000,
active_users=100000,
total_transactions=2000000,
total_api_calls=500000000
)
# Result:
# cost_per_user: $0.50
# cost_per_transaction: $0.025
# cost_per_1k_api_calls: $0.10
# cloud_cost_revenue_ratio: 10.0%
13.3 FinOps Team Structure
# FinOps team structure best practices
finops_team:
finops_lead:
role: "FinOps program owner"
responsibilities:
- "Cost optimization strategy"
- "Executive reporting"
- "Cross-team coordination"
cloud_analyst:
role: "Cost data analysis"
responsibilities:
- "Cost report creation"
- "Anomaly detection and investigation"
- "Budget vs actual tracking"
engineering_champion:
role: "FinOps champion in each engineering team"
responsibilities:
- "Spread cost awareness within team"
- "Execute resource optimization"
- "Ensure tagging compliance"
governance:
weekly_review: "Weekly cost review meeting"
monthly_report: "Monthly FinOps report"
quarterly_optimization: "Quarterly major optimization"
annual_planning: "Annual cloud budget planning"
14. Cost Optimization Checklist
# FinOps cost optimization checklist
immediate_wins:
- "Delete unused EBS volumes"
- "Release unassociated Elastic IPs"
- "Migrate gp2 to gp3"
- "Upgrade previous-gen instances (m4 -> m6i)"
- "Stop dev environments on nights/weekends"
- "Clean up old snapshots"
- "Enable S3 Intelligent-Tiering"
short_term: # 1-3 months
- "Purchase Reserved Instances / Savings Plans"
- "Adopt Spot instances (suitable workloads)"
- "Optimize Auto Scaling"
- "Set up VPC Endpoints"
- "Establish and apply tagging strategy"
- "Configure cost alerts"
long_term: # 3-12 months
- "Adopt Karpenter (K8s)"
- "Kubecost-based cost allocation"
- "Unit Economics tracking system"
- "Embed FinOps culture"
- "Multi-cloud cost optimization"
- "Infracost CI/CD integration"
- "Automated cost governance"
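The "migrate gp2 to gp3" quick win is easy to size: gp3 lists about 20% cheaper per GiB-month than gp2 (us-east-1 prices assumed below; check your region's rates):

```python
def gp2_to_gp3_monthly_savings(total_gib, gp2_price=0.10, gp3_price=0.08):
    """Monthly savings from migrating gp2 volumes to gp3.

    Assumes us-east-1 list prices ($0.10 vs $0.08 per GiB-month);
    gp3 also includes a 3,000 IOPS / 125 MB/s baseline at no extra cost,
    so most gp2 volumes lose nothing in the move.
    """
    return round(total_gib * (gp2_price - gp3_price), 2)

# 5 TiB of gp2 volumes across an account:
print(gp2_to_gp3_monthly_savings(5 * 1024))  # 102.4
```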
15. Quiz
Q1. List the three phases of the FinOps lifecycle in order and describe the key activities of each phase.
Answer:
- Inform: Gain cost visibility. Tagging, Cost Explorer analysis, department/team cost allocation, cost anomaly detection. Understanding "who is spending what, where."
- Optimize: Execute cost reduction. Right-sizing, Reserved Instances/Savings Plans purchases, Spot instance adoption, unused resource cleanup, storage tiering.
- Operate: Sustained optimization governance. Automated policy enforcement, cost alerts, regular reviews, Unit Economics tracking, FinOps culture embedding.
These three phases are performed iteratively.
Q2. What are three key differences between Savings Plans and Reserved Instances?
Answer:
- Flexibility: Savings Plans (especially Compute SP) allow free changes to instance family, region, and OS. Standard RIs are locked to a specific instance type and region.
- Service scope: Savings Plans cover EC2, Fargate, and Lambda. RIs are EC2-specific (RDS, ElastiCache, etc. have separate RIs).
- Commitment type: Savings Plans commit to an hourly spend amount (dollars/hour). RIs commit to specific instance quantities.
Both options offer up to 72% discount with 1-year/3-year terms.
Q3. Describe three architecture patterns for using Spot Instances reliably.
Answer:
- Diversified instance pool: Instead of depending on a single instance type, specify multiple instance families/sizes to ensure availability. Use the capacityOptimized allocation strategy.
- Checkpointing and retries: Periodically save work progress as checkpoints to S3 or similar storage. Resume from the last checkpoint upon interruption. SageMaker Managed Spot Training automates this pattern.
- On-Demand mixing: Maintain On-Demand instances as baseline capacity in a Spot Fleet or Auto Scaling Group, using Spot for additional capacity. This ensures minimum service even during interruptions.
Q4. Explain the VPC Endpoint strategy for reducing NAT Gateway costs.
Answer:
A NAT Gateway incurs both hourly charges ($0.045/hour) and data processing charges ($0.045/GB).
Reduction strategies:
- S3 Gateway Endpoint: Free. S3 traffic bypasses the NAT Gateway and goes directly to S3. Significant savings with large data volumes.
- DynamoDB Gateway Endpoint: Free. DynamoDB traffic also bypasses the NAT Gateway.
- Interface Endpoints: Set up Interface Endpoints for frequently used AWS services like ECR, CloudWatch, and STS. They have their own hourly costs but are cheaper than a NAT Gateway at high traffic volumes.
- NAT Gateway sharing optimization: Use one NAT Gateway per AZ where possible, weighing the extra hourly charges against cross-AZ traffic costs.
Q5. Why is Unit Economics-based FinOps tracking important, and what metrics should you track?
Answer:
Importance: Absolute cloud cost alone cannot measure business efficiency. If revenue grows 2x while costs grow 1.5x, efficiency has actually improved. Unit Economics connects costs to business outcomes, providing meaningful optimization direction.
Key metrics to track:
- Cost per User: Cloud cost / MAU
- Cost per Transaction: Cloud cost / total transactions
- Cost per API call: Cloud cost / API calls
- Cloud cost to revenue ratio: Cloud cost / total revenue (target: under 10-15%)
- Cloud share of COGS: cloud cost as a proportion of cost of goods sold
Track the trends of these metrics to verify that efficiency is improving.
16. References
- FinOps Foundation - https://www.finops.org/
- AWS Cost Optimization Pillar - AWS Well-Architected Framework
- AWS Cost Explorer API - AWS Documentation
- GCP Cost Management - Google Cloud Documentation
- Azure Cost Management - Microsoft Learn
- Kubecost Documentation - https://docs.kubecost.com/
- OpenCost Project - https://www.opencost.io/
- Infracost Documentation - https://www.infracost.io/docs/
- Karpenter Documentation - https://karpenter.sh/
- AWS Compute Optimizer - AWS Documentation
- GCP Recommender - Google Cloud Documentation
- FinOps Certified Practitioner - FinOps Foundation Certification
- Cloud FinOps (O'Reilly) - J.R. Storment and Mike Fuller
- Spot.io (NetApp) - https://spot.io/
Conclusion
FinOps is not just a cost-cutting tool; it is a cultural transformation that connects cloud costs to business value. The key is starting with tagging and visibility (Inform), executing right-sizing and reservation discounts (Optimize), and building sustained governance (Operate).
The most important thing is to just start. You do not need to build a perfect FinOps system all at once. Begin at the Crawl stage with basic tagging and cost reporting, move to Walk with reservation discounts and automation, and reach the Run stage with sophisticated Unit Economics-based optimization.