FinOps & Cloud Cost Optimization Complete Guide 2025: AWS/GCP/Azure Cost Reduction Strategies
Author: Youngju Kim (@fjvbn20031)
Table of Contents
1. What is FinOps?
1.1 Definition and Background
As cloud costs have grown explosively, organizations need a new framework for managing expenses. FinOps (Financial Operations) combines Finance and DevOps: a cultural practice where engineering teams take ownership of cloud costs and balance business value against spending.
According to the FinOps Foundation, FinOps is based on these principles:
- Cross-team collaboration: Engineering, finance, and business teams manage costs together
- Cost ownership: Each team is responsible for its own cloud spending
- Timely decision-making: Cost optimization based on real-time data
- Business value focus: Maximizing business value, not just cutting costs
1.2 FinOps Lifecycle: Inform - Optimize - Operate
The FinOps framework consists of three iterative phases:
┌─────────────────────────────────────────────────────┐
│ FinOps Lifecycle │
│ │
│ ┌──────────┐ ┌──────────┐ ┌──────────┐ │
│ │ Inform │───>│ Optimize │───>│ Operate │ │
│ │ │ │ │ │ │ │
│ │ Gain │ │ Execute │ │ Sustain │ │
│ │ Visibility│ │ Savings │ │ Governance│ │
│ └──────────┘ └──────────┘ └──────────┘ │
│ ^ | │
│ └────────────────────────────────┘ │
│ Iterate │
└─────────────────────────────────────────────────────┘
Inform: Gain cost visibility. Understand who is spending what, where, and how much.
Optimize: Execute right-sizing, reservation discounts, unused resource cleanup, and more.
Operate: Build processes and governance to sustain optimization continuously.
1.3 FinOps Maturity Model
Level 1: Crawl (Foundation)
├── Basic cost reporting
├── Begin tagging strategy
└── Identify major cost drivers
Level 2: Walk (Intermediate)
├── Department-level cost allocation
├── Reserved Instance adoption
├── Automated reporting
└── Cost anomaly detection
Level 3: Run (Advanced)
├── Real-time cost optimization
├── Auto-scaling optimization
├── Unit Economics-based tracking
├── Cost forecasting and budgeting
└── FinOps culture fully embedded
2. Understanding Cloud Pricing Models
2.1 Pricing Model Comparison
To optimize cloud costs, you first need a solid understanding of pricing models.
# Cloud pricing model comparison
On-Demand:
description: "Pay for what you use. No commitment"
discount: "0% (baseline price)"
pros: "Flexibility, instant start/stop"
cons: "Most expensive option"
ideal_workloads: "Dev/test, short-term projects, irregular workloads"
Reserved_Instances:
description: "Discount for 1 or 3 year commitment"
discount: "Up to 72% (AWS), up to 57% (Azure)"
pros: "Large discounts, capacity reservation"
cons: "Long commitment, limited flexibility"
ideal_workloads: "Stable production workloads"
Savings_Plans:
description: "Discount for hourly spend commitment (AWS)"
discount: "Up to 72%"
pros: "More flexible than RI, can change instance family/region"
cons: "Commitment required"
ideal_workloads: "Stable but variable workloads"
Spot_Preemptible:
description: "Deep discounts on spare capacity"
discount: "Up to 90% (AWS), up to 91% (GCP)"
pros: "Cheapest option"
cons: "Can be interrupted at any time"
ideal_workloads: "Batch processing, CI/CD, fault-tolerant workloads"
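The trade-offs above become concrete with a quick effective-rate calculation. A minimal sketch using hypothetical round-number prices (not current quotes); the Spot interruption overhead factor is an illustrative assumption, not a published figure:

```python
# Illustrative effective hourly cost per pricing model.
# ON_DEMAND_HOURLY and the discount figures are hypothetical.
ON_DEMAND_HOURLY = 0.10

def effective_hourly(discount_pct, interruption_overhead_pct=0.0):
    """Effective hourly rate after discount, optionally inflated by
    rework caused by interruptions (relevant for Spot)."""
    rate = ON_DEMAND_HOURLY * (1 - discount_pct / 100)
    return rate * (1 + interruption_overhead_pct / 100)

models = {
    'on_demand': effective_hourly(0),
    'savings_plan': effective_hourly(50),
    'reserved_3yr': effective_hourly(60),
    'spot': effective_hourly(90, interruption_overhead_pct=10),
}
for name, rate in sorted(models.items(), key=lambda kv: kv[1]):
    print(f"{name:>14}: ${rate:.4f}/hr")
```

Even with a 10% rework penalty, Spot stays far cheaper than any commitment model, which is why fault-tolerant workloads should try Spot first.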
2.2 Multi-Cloud Terminology Mapping
AWS GCP Azure
─────────────────────────────────────────────────────────
On-Demand On-Demand Pay-as-you-go
Reserved Instances Committed Use Reserved VM Instances
Savings Plans Flex CUDs Azure Savings Plan
Spot Instances Preemptible/Spot VMs Spot VMs
Cost Explorer Billing Reports Cost Management
AWS Budgets GCP Budgets Azure Budgets
Compute Optimizer Recommender Azure Advisor
3. Gaining Cost Visibility (Inform Phase)
3.1 Tagging Strategy
Tagging is the foundation of FinOps. Without proper tags on resources, you cannot determine who is spending what.
# Required tag schema example
mandatory_tags:
- key: "Environment"
values: ["production", "staging", "development", "sandbox"]
description: "Environment the resource belongs to"
- key: "Team"
values: ["platform", "backend", "data", "ml", "frontend"]
description: "Team that owns the resource"
- key: "Service"
values: ["user-api", "payment", "notification", "analytics"]
description: "Service the resource supports"
- key: "CostCenter"
values: ["CC-1001", "CC-1002", "CC-2001"]
description: "Cost center code"
- key: "ManagedBy"
values: ["terraform", "cloudformation", "manual", "pulumi"]
description: "Resource management tool"
optional_tags:
- key: "Project"
description: "Project name"
- key: "ExpiryDate"
description: "Resource expiry date (for temporary resources)"
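A schema is only useful if it is checked. A minimal validator sketch for the schema above; the `MANDATORY_TAGS` dict mirrors the required tags, and for brevity `Service` and `CostCenter` accept any non-empty value here:

```python
# Minimal tag validator mirroring the mandatory tag schema above.
MANDATORY_TAGS = {
    'Environment': {'production', 'staging', 'development', 'sandbox'},
    'Team': {'platform', 'backend', 'data', 'ml', 'frontend'},
    'Service': None,      # any non-empty value accepted
    'CostCenter': None,   # any non-empty value accepted
    'ManagedBy': {'terraform', 'cloudformation', 'manual', 'pulumi'},
}

def validate_tags(resource_tags):
    """Return a list of violations for a resource's tag dict."""
    violations = []
    for key, allowed in MANDATORY_TAGS.items():
        value = resource_tags.get(key)
        if not value:
            violations.append(f"missing tag: {key}")
        elif allowed is not None and value not in allowed:
            violations.append(f"invalid value for {key}: {value}")
    return violations
```

The same check can run in CI against Terraform plans, so untagged resources never reach the account in the first place.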
3.2 Tag Enforcement Automation
# Detect non-compliant resources with AWS Config Rules
# config_rule_required_tags.py
import json
import boto3
REQUIRED_TAGS = ['Environment', 'Team', 'Service', 'CostCenter']
def lambda_handler(event, context):
"""AWS Config Rule: Check required tags"""
config = boto3.client('config')
configuration_item = json.loads(
event['invokingEvent']
)['configurationItem']
resource_tags = configuration_item.get('tags', {})
missing_tags = [
tag for tag in REQUIRED_TAGS
if tag not in resource_tags
]
compliance_type = (
'NON_COMPLIANT' if missing_tags else 'COMPLIANT'
)
annotation = (
f"Missing tags: {', '.join(missing_tags)}"
if missing_tags else "All required tags present"
)
config.put_evaluations(
Evaluations=[{
'ComplianceResourceType': configuration_item['resourceType'],
'ComplianceResourceId': configuration_item['resourceId'],
'ComplianceType': compliance_type,
'Annotation': annotation,
'OrderingTimestamp': configuration_item['configurationItemCaptureTime']
}],
ResultToken=event['resultToken']
)
return {
'compliance_type': compliance_type,
'annotation': annotation
}
3.3 AWS Cost Explorer Usage
# Cost analysis with AWS Cost Explorer API
# cost_analysis.py
import boto3
from datetime import datetime, timedelta
def get_cost_by_service(days=30):
"""Analyze cost by service"""
ce = boto3.client('ce')
end_date = datetime.now().strftime('%Y-%m-%d')
start_date = (
datetime.now() - timedelta(days=days)
).strftime('%Y-%m-%d')
response = ce.get_cost_and_usage(
TimePeriod={
'Start': start_date,
'End': end_date
},
Granularity='MONTHLY',
Metrics=['UnblendedCost'],
GroupBy=[{
'Type': 'DIMENSION',
'Key': 'SERVICE'
}]
)
costs = []
for time_period in response['ResultsByTime']:
for group in time_period['Groups']:
service = group['Keys'][0]
amount = float(
group['Metrics']['UnblendedCost']['Amount']
)
if amount > 0:
costs.append({
'service': service,
'cost': round(amount, 2)
})
costs.sort(key=lambda x: x['cost'], reverse=True)
return costs
def get_cost_by_tag(tag_key='Team', days=30):
"""Analyze cost by tag"""
ce = boto3.client('ce')
end_date = datetime.now().strftime('%Y-%m-%d')
start_date = (
datetime.now() - timedelta(days=days)
).strftime('%Y-%m-%d')
response = ce.get_cost_and_usage(
TimePeriod={
'Start': start_date,
'End': end_date
},
Granularity='MONTHLY',
Metrics=['UnblendedCost'],
GroupBy=[{
'Type': 'TAG',
'Key': tag_key
}]
)
return response
def detect_cost_anomalies(threshold_pct=20):
    """Flag services whose current-month daily average cost exceeds
    last month's daily average by more than threshold_pct."""
    ce = boto3.client('ce')
    now = datetime.now()
    current_month_start = now.replace(day=1).strftime('%Y-%m-%d')
    current_date = now.strftime('%Y-%m-%d')
    prev_month_end_dt = now.replace(day=1) - timedelta(days=1)
    last_month_start = prev_month_end_dt.replace(day=1).strftime('%Y-%m-%d')
    last_month_end = prev_month_end_dt.strftime('%Y-%m-%d')

    def monthly_costs(start, end):
        response = ce.get_cost_and_usage(
            TimePeriod={'Start': start, 'End': end},
            Granularity='MONTHLY',
            Metrics=['UnblendedCost'],
            GroupBy=[{'Type': 'DIMENSION', 'Key': 'SERVICE'}]
        )
        return {
            g['Keys'][0]: float(g['Metrics']['UnblendedCost']['Amount'])
            for g in response['ResultsByTime'][0]['Groups']
        }

    current = monthly_costs(current_month_start, current_date)
    previous = monthly_costs(last_month_start, last_month_end)

    # Compare on a daily-average basis so a partial current month is
    # not penalized against a full previous month
    # (Cost Explorer's End date is exclusive, hence day - 1)
    days_current = max(now.day - 1, 1)
    days_previous = prev_month_end_dt.day

    anomalies = []
    for service, cost in current.items():
        prev_daily = previous.get(service, 0) / days_previous
        curr_daily = cost / days_current
        if prev_daily > 0:
            increase_pct = (curr_daily - prev_daily) / prev_daily * 100
            if increase_pct > threshold_pct:
                anomalies.append({
                    'service': service,
                    'previous_daily_avg': round(prev_daily, 2),
                    'current_daily_avg': round(curr_daily, 2),
                    'increase_pct': round(increase_pct, 1)
                })
    return anomalies
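One wrinkle when grouping by TAG: Cost Explorer returns each group key in the form `<TagKey>$<value>`, with an empty value for untagged resources. A small parser for the raw response that `get_cost_by_tag` above returns (the sample payload in the test follows the API's `ResultsByTime` shape):

```python
# Flatten a Cost Explorer GroupBy=TAG response into {tag_value: cost}.
# Group keys arrive as "<TagKey>$<value>"; an empty value means the
# resource carried no such tag.
def summarize_tag_costs(response):
    totals = {}
    for period in response['ResultsByTime']:
        for group in period['Groups']:
            raw_key = group['Keys'][0]               # e.g. "Team$backend"
            value = raw_key.split('$', 1)[1] or 'untagged'
            amount = float(group['Metrics']['UnblendedCost']['Amount'])
            totals[value] = totals.get(value, 0.0) + amount
    return totals
```

The size of the `untagged` bucket is itself a useful FinOps metric: it measures how much spend your tagging strategy still cannot allocate.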
3.4 GCP Billing Export with BigQuery Analysis
-- Analyze GCP Billing Export in BigQuery
-- Monthly cost trend by service
SELECT
invoice.month AS billing_month,
service.description AS service_name,
SUM(cost) + SUM(IFNULL(
(SELECT SUM(c.amount) FROM UNNEST(credits) c), 0
)) AS net_cost
FROM
`project.dataset.gcp_billing_export_v1_XXXXXX`
WHERE
invoice.month >= '202501'
GROUP BY
billing_month, service_name
HAVING
net_cost > 10
ORDER BY
billing_month DESC, net_cost DESC;
-- Daily cost tracking by project
SELECT
DATE(usage_start_time) AS usage_date,
project.name AS project_name,
SUM(cost) AS daily_cost
FROM
`project.dataset.gcp_billing_export_v1_XXXXXX`
WHERE
usage_start_time >= TIMESTAMP_SUB(CURRENT_TIMESTAMP(), INTERVAL 30 DAY)
GROUP BY
usage_date, project_name
ORDER BY
usage_date DESC, daily_cost DESC;
4. Right-sizing
4.1 What is Right-sizing?
Right-sizing is the process of selecting the optimal instance type and size for your workloads. Most cloud workloads are over-provisioned.
Typical EC2 Instance Utilization:
CPU Utilization Memory Utilization
┌──────────────┐ ┌──────────────┐
│████░░░░░░░░░░│ │██████░░░░░░░░│
│ 20-30% │ │ 40-50% │
└──────────────┘ └──────────────┘
=> 70-80% of CPU is wasted!
=> 50-60% of memory is wasted!
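The right-sizing decision itself can be sketched as picking the cheapest instance that keeps peak utilization under a target. The catalog and prices below are a hypothetical three-entry subset for illustration, not live quotes:

```python
# Toy right-sizing picker: cheapest catalog entry whose vCPU capacity
# keeps observed peak CPU below a target utilization.
CATALOG = [
    # (name, vcpus, hourly_usd) -- illustrative figures
    ('m5.large',   2, 0.096),
    ('m5.xlarge',  4, 0.192),
    ('m5.2xlarge', 8, 0.384),
]

def rightsize(current_vcpus, peak_cpu_pct, target_pct=70):
    """Return the cheapest entry keeping peak CPU under target_pct,
    or None if no catalog entry is large enough."""
    needed_vcpus = current_vcpus * (peak_cpu_pct / target_pct)
    candidates = [c for c in CATALOG if c[1] >= needed_vcpus]
    return min(candidates, key=lambda c: c[2]) if candidates else None
```

An m5.2xlarge peaking at 25% CPU needs only about 2.9 vCPUs at a 70% target, so half the size (and half the cost) suffices. Real tools such as Compute Optimizer apply the same idea across CPU, memory, network, and disk simultaneously.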
4.2 AWS Compute Optimizer
# Query right-sizing recommendations from AWS Compute Optimizer
# rightsizing_analyzer.py
import boto3
def get_ec2_recommendations():
"""Get EC2 instance right-sizing recommendations"""
co = boto3.client('compute-optimizer')
response = co.get_ec2_instance_recommendations(
filters=[{
'name': 'Finding',
'values': ['OVER_PROVISIONED']
}]
)
recommendations = []
for rec in response['instanceRecommendations']:
current = rec['currentInstanceType']
finding = rec['finding']
options = []
for option in rec['recommendationOptions']:
options.append({
'instance_type': option['instanceType'],
'migration_effort': option.get('migrationEffort', 'Unknown'),
'performance_risk': option.get('performanceRisk', 0)
})
recommendations.append({
'instance_id': rec['instanceArn'].split('/')[-1],
'current_type': current,
'finding': finding,
'options': options,
'utilization': rec.get('utilizationMetrics', [])
})
return recommendations
def get_ebs_recommendations():
"""Get EBS volume right-sizing recommendations"""
co = boto3.client('compute-optimizer')
response = co.get_ebs_volume_recommendations(
filters=[{
'name': 'Finding',
'values': ['NotOptimized']
}]
)
recommendations = []
for rec in response['volumeRecommendations']:
current_config = rec['currentConfiguration']
recommendations.append({
'volume_id': rec['volumeArn'].split('/')[-1],
'current_type': current_config['volumeType'],
'current_size': current_config['volumeSize'],
'finding': rec['finding'],
'options': rec['volumeRecommendationOptions']
})
return recommendations
4.3 GCP Recommender API
# Query VM right-sizing recommendations from GCP Recommender
# gcp_rightsizing.py
from google.cloud import recommender_v1
def get_vm_rightsizing_recommendations(project_id, zone):
"""Get GCP VM right-sizing recommendations"""
client = recommender_v1.RecommenderClient()
parent = (
f"projects/{project_id}/locations/{zone}/"
f"recommenders/google.compute.instance.MachineTypeRecommender"
)
recommendations = client.list_recommendations(
request={"parent": parent}
)
results = []
for rec in recommendations:
results.append({
'name': rec.name,
'description': rec.description,
'priority': rec.priority.name,
'state': rec.state_info.state.name,
'impact': {
'category': rec.primary_impact.category.name,
'cost_projection': str(
rec.primary_impact.cost_projection
) if rec.primary_impact.cost_projection else None
}
})
return results
5. Reserved Instances vs Savings Plans
5.1 Comparison Analysis
┌──────────────────────────────────────────────────────────────────┐
│ Reserved Instances vs Savings Plans │
├──────────────────┬──────────────────┬────────────────────────────┤
│ Aspect │ Reserved Instances│ Savings Plans │
├──────────────────┼──────────────────┼────────────────────────────┤
│ Discount │ Up to 72% │ Up to 72% │
│ Term │ 1 or 3 years │ 1 or 3 years │
│ Instance flex │ Limited(Standard)│ Compute SP: Very flexible │
│ │ Moderate(Convert)│ EC2 SP: Within family │
│ Region flex │ Zonal/Regional │ Compute SP: All regions │
│ Service scope │ EC2 only │ EC2 + Fargate + Lambda │
│ Payment options │ All/Partial/None │ All/Partial/None │
│ Recommended for │ Stable workloads │ Variable workloads │
└──────────────────┴──────────────────┴────────────────────────────┘
5.2 Break-even Analysis
# Reserved Instance break-even analysis
# break_even_analysis.py
def calculate_break_even(
on_demand_hourly,
ri_hourly,
ri_upfront,
term_months=12
):
"""Calculate RI break-even point"""
hours_per_month = 730 # 24 * 365 / 12
# Monthly savings
monthly_savings = (on_demand_hourly - ri_hourly) * hours_per_month
if monthly_savings <= 0:
return None
# Break-even point (months)
break_even_months = ri_upfront / monthly_savings if ri_upfront > 0 else 0
# Total savings over term
total_on_demand = on_demand_hourly * hours_per_month * term_months
total_ri = ri_hourly * hours_per_month * term_months + ri_upfront
total_savings = total_on_demand - total_ri
savings_pct = (total_savings / total_on_demand) * 100
return {
'break_even_months': round(break_even_months, 1),
'monthly_savings': round(monthly_savings, 2),
'total_savings': round(total_savings, 2),
'savings_percentage': round(savings_pct, 1),
'total_on_demand_cost': round(total_on_demand, 2),
'total_ri_cost': round(total_ri, 2)
}
# Example: m5.xlarge (us-east-1)
result = calculate_break_even(
on_demand_hourly=0.192,
ri_hourly=0.0684, # 3-year partial upfront
ri_upfront=1248.0,
term_months=36
)
# Result:
# break_even_months: 13.8
# savings_percentage: ~40% over the 3-year term
# => Net positive after 14 months
def recommend_commitment_strategy(usage_history):
"""Recommend commitment strategy based on usage patterns"""
import numpy as np
# Minimum usage (P10): safely cover with RI
# Average usage (P50): cover with Savings Plans
# Peak usage (P90): keep On-Demand
p10 = np.percentile(usage_history, 10)
p50 = np.percentile(usage_history, 50)
p90 = np.percentile(usage_history, 90)
return {
'reserved_instances': {
'coverage': 'P10 baseline',
'amount': p10,
'reason': 'Always-on minimum capacity'
},
'savings_plans': {
'coverage': 'P10 to P50',
'amount': p50 - p10,
'reason': 'Flexible discount for variable range'
},
'on_demand': {
'coverage': 'Above P50',
'amount': p90 - p50,
'reason': 'Elastic response to peaks'
}
}
6. Spot/Preemptible Instance Strategies
6.1 Spot Instance Architecture
Spot Instances offer up to 90% discount but can be reclaimed with a 2-minute warning. Proper architecture is key.
# Spot Instance suitability criteria
spot_suitable:
- Batch processing jobs
- CI/CD pipelines
- Data analytics / ETL
- Stateless web servers
- Containerized microservices
- Machine learning training
- Test environments
spot_not_suitable:
- Single-instance databases
- Long-running stateful tasks
- Non-interruptible production workloads
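A common pattern for suitable workloads is to keep a small On-Demand baseline and fill the rest of the fleet with Spot. A minimal cost sketch; the $0.192/hr rate and 70% discount are illustrative assumptions:

```python
# Blended hourly cost of a fleet with an On-Demand baseline plus
# Spot remainder. Prices and discount are illustrative.
def blended_fleet_cost(total_capacity, on_demand_capacity,
                       on_demand_hourly, spot_discount=0.70):
    """Return (blended hourly cost, % savings vs all On-Demand)."""
    spot_capacity = total_capacity - on_demand_capacity
    spot_hourly = on_demand_hourly * (1 - spot_discount)
    cost = (on_demand_capacity * on_demand_hourly
            + spot_capacity * spot_hourly)
    savings_pct = (1 - cost / (total_capacity * on_demand_hourly)) * 100
    return round(cost, 4), round(savings_pct, 1)

# 10 instances, 2 On-Demand at $0.192/hr, 70% Spot discount:
# 2*0.192 + 8*0.0576 = $0.8448/hr, 56% below all On-Demand
```

The On-Demand slice guarantees minimum capacity survives a broad Spot reclamation, while the Spot slice captures most of the savings.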
6.2 AWS Spot Fleet Configuration
{
"SpotFleetRequestConfig": {
"IamFleetRole": "arn:aws:iam::123456789012:role/spot-fleet-role",
"TargetCapacity": 10,
"SpotPrice": "0.10",
"TerminateInstancesWithExpiration": true,
"AllocationStrategy": "capacityOptimized",
"LaunchSpecifications": [
{
"InstanceType": "m5.xlarge",
"SubnetId": "subnet-abc123",
"WeightedCapacity": 1
},
{
"InstanceType": "m5a.xlarge",
"SubnetId": "subnet-abc123",
"WeightedCapacity": 1
},
{
"InstanceType": "m5d.xlarge",
"SubnetId": "subnet-def456",
"WeightedCapacity": 1
},
{
"InstanceType": "m4.xlarge",
"SubnetId": "subnet-def456",
"WeightedCapacity": 1
}
],
"OnDemandTargetCapacity": 2,
"OnDemandAllocationStrategy": "lowestPrice"
}
}
6.3 Spot Interruption Handling
# Spot instance interruption detection and handling
# spot_interruption_handler.py
import requests
import sys
import time
import logging
logging.basicConfig(level=logging.INFO)
logger = logging.getLogger(__name__)
# EC2 instance metadata endpoint; note that IMDSv2-only instances
# additionally require a session token (PUT /latest/api/token)
METADATA_URL = "http://169.254.169.254/latest/meta-data"
SPOT_ACTION_URL = f"{METADATA_URL}/spot/instance-action"
def check_spot_interruption():
"""Check for Spot interruption notice (2-minute warning)"""
try:
response = requests.get(SPOT_ACTION_URL, timeout=2)
if response.status_code == 200:
data = response.json()
return {
'action': data.get('action'),
'time': data.get('time'),
'interrupted': True
}
except requests.exceptions.RequestException:
pass
return {'interrupted': False}
def graceful_shutdown(checkpoint_func=None):
"""Perform graceful shutdown"""
logger.info("Spot interruption detected! Starting graceful shutdown...")
# 1. Stop accepting new work
logger.info("Stopping new task acceptance...")
# 2. Save checkpoint of current work
if checkpoint_func:
logger.info("Saving checkpoint...")
checkpoint_func()
# 3. Deregister from load balancer
logger.info("Deregistering from load balancer...")
# 4. Wait for in-flight requests (max 90 seconds)
logger.info("Waiting for in-flight requests...")
time.sleep(10)
logger.info("Graceful shutdown complete")
def spot_monitor_loop(checkpoint_func=None, interval=5):
"""Spot interruption monitoring loop"""
logger.info("Starting Spot interruption monitor...")
while True:
status = check_spot_interruption()
if status['interrupted']:
graceful_shutdown(checkpoint_func)
sys.exit(0)
time.sleep(interval)
7. Auto-Scaling Optimization
7.1 Scaling Strategy Comparison
# Auto Scaling strategy comparison
target_tracking:
description: "Maintain target metric value"
example: "Keep average CPU at 60%"
pros: "Simple configuration, automatic adjustment"
cons: "Depends on a single metric"
ideal_for: "General web applications"
step_scaling:
description: "Define scaling steps per metric range"
example: "CPU 60-70%: +1, 70-80%: +2, 80%+: +4"
pros: "Fine-grained control"
cons: "Complex configuration"
ideal_for: "Complex scaling patterns"
predictive_scaling:
description: "ML-based traffic prediction for pre-scaling"
example: "Learn from 14-day patterns, scale out proactively"
pros: "Proactive response, avoids cold starts"
cons: "Not suited for irregular traffic"
ideal_for: "Workloads with cyclical patterns"
schedule_based:
description: "Time-based scheduling"
example: "Weekdays 9-18h: 10 instances, nights: 2 instances"
pros: "Predictable costs"
cons: "Cannot respond to sudden traffic spikes"
ideal_for: "Business-hours workloads"
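The schedule-based strategy above reduces to a simple capacity function of weekday and hour. A minimal sketch matching the "weekdays 9-18h: 10 instances, nights: 2" example; the capacities are the illustrative figures from that example:

```python
# Schedule-based capacity: full fleet during weekday business hours,
# a skeleton crew otherwise. Capacities are illustrative.
def desired_capacity(weekday, hour,
                     peak=10, off_peak=2,
                     business_hours=range(9, 18)):
    """weekday: 0=Mon .. 6=Sun; returns target instance count."""
    if weekday < 5 and hour in business_hours:
        return peak
    return off_peak
```

In practice this logic lives in scheduled scaling actions rather than application code, but the function makes the cost model explicit: capacity (and therefore spend) is defined purely by the calendar, which is what makes costs predictable.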
7.2 Karpenter for K8s Node Auto-Scaling
# Karpenter NodePool configuration
# karpenter-nodepool.yaml
apiVersion: karpenter.sh/v1beta1
kind: NodePool
metadata:
name: default
spec:
template:
spec:
requirements:
- key: "karpenter.sh/capacity-type"
operator: In
values: ["spot", "on-demand"]
- key: "kubernetes.io/arch"
operator: In
values: ["amd64"]
- key: "karpenter.k8s.aws/instance-category"
operator: In
values: ["c", "m", "r"]
- key: "karpenter.k8s.aws/instance-generation"
operator: Gt
values: ["4"]
nodeClassRef:
name: default
limits:
cpu: "1000"
memory: "1000Gi"
disruption:
consolidationPolicy: WhenUnderutilized
expireAfter: 720h # 30 days
weight: 50
8. Storage Cost Optimization
8.1 S3 Lifecycle Policies
{
"Rules": [
{
"ID": "MoveToIntelligentTiering",
"Status": "Enabled",
"Filter": {
"Prefix": "data/"
},
"Transitions": [
{
"Days": 0,
"StorageClass": "INTELLIGENT_TIERING"
}
]
},
{
"ID": "ArchiveOldLogs",
"Status": "Enabled",
"Filter": {
"Prefix": "logs/"
},
"Transitions": [
{
"Days": 30,
"StorageClass": "STANDARD_IA"
},
{
"Days": 90,
"StorageClass": "GLACIER"
},
{
"Days": 365,
"StorageClass": "DEEP_ARCHIVE"
}
],
"Expiration": {
"Days": 730
}
},
{
"ID": "CleanupIncompleteUploads",
"Status": "Enabled",
"Filter": {},
"AbortIncompleteMultipartUpload": {
"DaysAfterInitiation": 7
}
}
]
}
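The savings from the log-archiving schedule above can be estimated from per-GB list prices. The figures below are approximate us-east-1 prices at the time of writing and may drift:

```python
# Approximate monthly storage cost for a log object under the
# lifecycle schedule above (us-east-1 list prices, approximate).
TIER_PRICES = {
    'STANDARD': 0.023,       # $/GB-month
    'STANDARD_IA': 0.0125,
    'GLACIER': 0.0036,       # Glacier Flexible Retrieval
    'DEEP_ARCHIVE': 0.00099,
}

def monthly_cost(size_gb, age_days):
    """Cost of an object in the storage class it occupies at age_days."""
    if age_days >= 365:
        tier = 'DEEP_ARCHIVE'
    elif age_days >= 90:
        tier = 'GLACIER'
    elif age_days >= 30:
        tier = 'STANDARD_IA'
    else:
        tier = 'STANDARD'
    return round(size_gb * TIER_PRICES[tier], 4)
```

By the time a log object reaches Deep Archive its storage cost has dropped roughly 95% versus Standard; the trade-off is retrieval latency measured in hours, which is acceptable for compliance archives.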
8.2 EBS Optimization: gp2 to gp3 Migration
# EBS gp2 -> gp3 migration script
# ebs_migration.py
import boto3
def find_gp2_volumes():
"""Find gp2 volumes and calculate savings potential"""
ec2 = boto3.client('ec2')
response = ec2.describe_volumes(
Filters=[{
'Name': 'volume-type',
'Values': ['gp2']
}]
)
volumes = []
total_monthly_savings = 0
for vol in response['Volumes']:
size_gb = vol['Size']
# gp2 price: $0.10/GB-month
gp2_cost = size_gb * 0.10
# gp3 price: $0.08/GB-month + IOPS/throughput add-ons
# gp3 baseline: 3000 IOPS, 125 MB/s
gp3_cost = size_gb * 0.08
# gp2 baseline IOPS = size * 3 (min 100, capped at 16,000)
gp2_baseline_iops = min(max(size_gb * 3, 100), 16000)
# Additional IOPS cost if gp2 exceeds gp3 baseline
if gp2_baseline_iops > 3000:
extra_iops = gp2_baseline_iops - 3000
gp3_cost += extra_iops * 0.005 # $0.005/IOPS
savings = gp2_cost - gp3_cost
total_monthly_savings += max(savings, 0)
volumes.append({
'volume_id': vol['VolumeId'],
'size_gb': size_gb,
'gp2_monthly_cost': round(gp2_cost, 2),
'gp3_monthly_cost': round(gp3_cost, 2),
'monthly_savings': round(max(savings, 0), 2)
})
return {
'volumes': volumes,
'total_gp2_volumes': len(volumes),
'total_monthly_savings': round(total_monthly_savings, 2)
}
def migrate_to_gp3(volume_id, target_iops=3000, target_throughput=125):
"""Migrate a gp2 volume to gp3"""
ec2 = boto3.client('ec2')
response = ec2.modify_volume(
VolumeId=volume_id,
VolumeType='gp3',
Iops=target_iops,
Throughput=target_throughput
)
return {
'volume_id': volume_id,
'modification_state': response['VolumeModification']['ModificationState'],
'target_type': 'gp3'
}
8.3 Unused Resource Cleanup Automation
# Unused resource detection and cleanup automation
# unused_resource_cleaner.py
import boto3
from datetime import datetime, timedelta
def find_unused_ebs_volumes():
"""Find detached EBS volumes"""
ec2 = boto3.client('ec2')
response = ec2.describe_volumes(
Filters=[{
'Name': 'status',
'Values': ['available']
}]
)
unused = []
for vol in response['Volumes']:
create_time = vol['CreateTime'].replace(tzinfo=None)
age_days = (datetime.utcnow() - create_time).days
if age_days > 7: # Detached for over 7 days
unused.append({
'volume_id': vol['VolumeId'],
'size_gb': vol['Size'],
'type': vol['VolumeType'],
'age_days': age_days,
'monthly_cost': vol['Size'] * 0.10
})
return unused
def find_unused_elastic_ips():
"""Find unassociated Elastic IPs"""
ec2 = boto3.client('ec2')
response = ec2.describe_addresses()
unused = [
{
'allocation_id': addr['AllocationId'],
'public_ip': addr['PublicIp'],
'monthly_cost': 3.65 # Unassociated EIP cost
}
for addr in response['Addresses']
if 'InstanceId' not in addr
and 'NetworkInterfaceId' not in addr
]
return unused
def find_old_snapshots(days=90):
"""Find old EBS snapshots"""
ec2 = boto3.client('ec2')
response = ec2.describe_snapshots(OwnerIds=['self'])
cutoff = datetime.utcnow() - timedelta(days=days)
old_snapshots = []
for snap in response['Snapshots']:
start_time = snap['StartTime'].replace(tzinfo=None)
if start_time < cutoff:
old_snapshots.append({
'snapshot_id': snap['SnapshotId'],
'volume_size': snap['VolumeSize'],
'age_days': (datetime.utcnow() - start_time).days
})
return old_snapshots
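The three finders above combine naturally into one savings report. This aggregator is pure Python over their pre-fetched results; the $0.05/GB-month snapshot price is an approximate list price used for illustration:

```python
# Aggregate the outputs of find_unused_ebs_volumes,
# find_unused_elastic_ips, and find_old_snapshots above.
def build_cleanup_report(unused_volumes, unused_eips, old_snapshots,
                         snapshot_gb_month=0.05):
    """Summarize counts and estimated monthly savings.
    snapshot_gb_month is an approximate list price."""
    volume_savings = sum(v['monthly_cost'] for v in unused_volumes)
    eip_savings = sum(e['monthly_cost'] for e in unused_eips)
    snapshot_savings = sum(
        s['volume_size'] * snapshot_gb_month for s in old_snapshots
    )
    return {
        'unused_volumes': len(unused_volumes),
        'unused_eips': len(unused_eips),
        'old_snapshots': len(old_snapshots),
        'est_monthly_savings': round(
            volume_savings + eip_savings + snapshot_savings, 2
        ),
    }
```

Running the report on a schedule and posting it to a team channel turns cleanup from a one-off sweep into an ongoing habit, which is the Operate phase in miniature.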
9. Kubernetes Cost Management
9.1 Kubecost Installation
# Install Kubecost via Helm
helm repo add kubecost https://kubecost.github.io/cost-analyzer/
helm repo update
helm install kubecost kubecost/cost-analyzer \
--namespace kubecost \
--create-namespace \
--set kubecostToken="YOUR_TOKEN" \
--set prometheus.server.retention="15d" \
--set persistentVolume.enabled=true \
--set persistentVolume.size="32Gi"
9.2 Namespace Cost Allocation
# Kubecost allocation configuration
# kubecost-allocation.yaml
apiVersion: v1
kind: ConfigMap
metadata:
name: allocation-config
namespace: kubecost
data:
allocation.json: |
{
"sharedNamespaces": [
"kube-system",
"kubecost",
"monitoring",
"istio-system"
],
"labelConfig": {
"enabled": true,
"team_label": "team",
"department_label": "department",
"product_label": "product",
"environment_label": "environment"
}
}
9.3 Resource Quotas and Limit Ranges
# Namespace resource quota
# resource-quota.yaml
apiVersion: v1
kind: ResourceQuota
metadata:
name: team-backend-quota
namespace: backend
spec:
hard:
requests.cpu: "20"
requests.memory: "40Gi"
limits.cpu: "40"
limits.memory: "80Gi"
pods: "50"
services.loadbalancers: "2"
persistentvolumeclaims: "10"
---
# Default resource limits
apiVersion: v1
kind: LimitRange
metadata:
name: default-limits
namespace: backend
spec:
limits:
- default:
cpu: "500m"
memory: "512Mi"
defaultRequest:
cpu: "100m"
memory: "128Mi"
max:
cpu: "4"
memory: "8Gi"
min:
cpu: "50m"
memory: "64Mi"
type: Container
9.4 VPA (Vertical Pod Autoscaler)
# VPA for automatic Pod resource optimization
# vpa-recommendation.yaml
apiVersion: autoscaling.k8s.io/v1
kind: VerticalPodAutoscaler
metadata:
name: user-api-vpa
namespace: backend
spec:
targetRef:
apiVersion: "apps/v1"
kind: Deployment
name: user-api
updatePolicy:
updateMode: "Auto"
resourcePolicy:
containerPolicies:
- containerName: user-api
minAllowed:
cpu: "50m"
memory: "64Mi"
maxAllowed:
cpu: "2"
memory: "4Gi"
controlledResources: ["cpu", "memory"]
controlledValues: RequestsAndLimits
10. AI/ML Cost Management
10.1 GPU Instance Cost Optimization
# ML training cost optimization with GPU Spot instances
# ml_cost_optimizer.py
import boto3
def get_gpu_spot_pricing():
"""Query GPU Spot instance pricing"""
ec2 = boto3.client('ec2')
gpu_instances = [
'p3.2xlarge', 'p3.8xlarge', 'p3.16xlarge',
'g4dn.xlarge', 'g4dn.2xlarge',
'g5.xlarge', 'g5.2xlarge', 'g5.4xlarge'
]
prices = []
for instance_type in gpu_instances:
response = ec2.describe_spot_price_history(
InstanceTypes=[instance_type],
ProductDescriptions=['Linux/UNIX'],
MaxResults=1
)
if response['SpotPriceHistory']:
spot_price = float(
response['SpotPriceHistory'][0]['SpotPrice']
)
prices.append({
'instance_type': instance_type,
'spot_price': spot_price,
'az': response['SpotPriceHistory'][0]['AvailabilityZone']
})
return sorted(prices, key=lambda x: x['spot_price'])
def estimate_training_cost(
instance_type,
num_instances,
training_hours,
use_spot=True,
spot_discount=0.7
):
"""Estimate ML training cost"""
on_demand_prices = {
'p3.2xlarge': 3.06,
'p3.8xlarge': 12.24,
'p4d.24xlarge': 32.77,
'g4dn.xlarge': 0.526,
'g5.xlarge': 1.006,
}
hourly_rate = on_demand_prices.get(instance_type, 0)
if use_spot:
hourly_rate *= (1 - spot_discount)
total_cost = hourly_rate * num_instances * training_hours
return {
'instance_type': instance_type,
'num_instances': num_instances,
'training_hours': training_hours,
'use_spot': use_spot,
'hourly_rate_per_instance': round(hourly_rate, 3),
'total_estimated_cost': round(total_cost, 2)
}
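A worked example in the spirit of `estimate_training_cost` above, extended with a rework factor for hours lost to Spot interruptions. The $3.06/hr rate matches the table above; the 70% discount and 15% rework figure are assumptions for illustration:

```python
# Training cost with an interruption-rework factor: rework_factor > 1
# models compute hours repeated after Spot reclamations.
def training_cost(hourly_rate, instances, hours,
                  spot_discount=None, rework_factor=1.0):
    rate = hourly_rate * (1 - spot_discount) if spot_discount else hourly_rate
    return round(rate * instances * hours * rework_factor, 2)

# 4x p3.2xlarge for 10 hours:
on_demand = training_cost(3.06, 4, 10)                       # $122.40
spot = training_cost(3.06, 4, 10, spot_discount=0.7)         # $36.72
spot_with_rework = training_cost(3.06, 4, 10,
                                 spot_discount=0.7,
                                 rework_factor=1.15)         # ~$42.23
```

Even with 15% of compute repeated after interruptions, Spot training here costs about a third of On-Demand, which is why checkpointing (so interruptions cost minutes, not epochs) is the key enabler.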
10.2 SageMaker Cost Optimization
# SageMaker cost optimization strategies
# sagemaker_optimizer.py
import boto3
def setup_managed_spot_training(
training_job_name,
role_arn,
image_uri,
instance_type='ml.p3.2xlarge',
instance_count=1,
max_wait_seconds=7200,
max_runtime_seconds=3600
):
"""Configure SageMaker Managed Spot Training"""
sm = boto3.client('sagemaker')
response = sm.create_training_job(
TrainingJobName=training_job_name,
RoleArn=role_arn,
AlgorithmSpecification={
'TrainingImage': image_uri,
'TrainingInputMode': 'File'
},
ResourceConfig={
'InstanceType': instance_type,
'InstanceCount': instance_count,
'VolumeSizeInGB': 50
},
# Spot Training key settings
EnableManagedSpotTraining=True,
StoppingCondition={
'MaxRuntimeInSeconds': max_runtime_seconds,
'MaxWaitTimeInSeconds': max_wait_seconds
},
# Checkpoint config for Spot recovery
CheckpointConfig={
'S3Uri': f's3://my-bucket/checkpoints/{training_job_name}'
},
InputDataConfig=[{
'ChannelName': 'training',
'DataSource': {
'S3DataSource': {
'S3DataType': 'S3Prefix',
'S3Uri': 's3://my-bucket/training-data/'
}
}
}],
OutputDataConfig={
'S3OutputPath': 's3://my-bucket/output/'
}
)
return response
11. Networking Cost Optimization
11.1 Understanding Data Transfer Costs
AWS Data Transfer Cost Structure:
Internet -> AWS: Free
AWS -> Internet: $0.09/GB (first 10TB)
Same AZ (Private IP): Free
Same Region, cross-AZ: $0.01/GB (each direction)
Cross-Region: $0.02/GB
AWS -> CloudFront: Free
NAT Gateway processing: $0.045/GB + $0.045/hour
Common Cost Traps:
1. S3/DynamoDB traffic via NAT Gateway -> Use VPC Endpoints
2. Cross-AZ traffic -> Service mesh or AZ-aware routing
3. Unnecessary cross-region replication -> Selective replication
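Cost trap 1 is easy to quantify with the rates quoted above. A minimal sketch; it assumes the NAT Gateway exists solely for this traffic (in practice the hourly charge is shared across all workloads behind it):

```python
# NAT Gateway vs Gateway VPC Endpoint for S3/DynamoDB traffic,
# using the rates quoted above ($0.045/GB + $0.045/hr for NAT;
# gateway endpoints are free).
def nat_vs_endpoint(gb_per_month, hours=730,
                    nat_gb_rate=0.045, nat_hourly=0.045):
    nat_cost = gb_per_month * nat_gb_rate + hours * nat_hourly
    return {
        'nat_monthly': round(nat_cost, 2),
        'endpoint_monthly': 0.0,  # S3/DynamoDB gateway endpoints cost nothing
        'monthly_savings': round(nat_cost, 2),
    }

# 10 TB/month via NAT: 10240*0.045 + 730*0.045 = ~$493.65/month saved
```

For S3-heavy workloads (data lakes, log shipping, container image pulls via S3-backed layers) this single change is often the largest networking win available.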
11.2 VPC Endpoints to Reduce NAT Gateway Costs
# VPC Endpoint configuration (Terraform)
# vpc_endpoints.tf
# S3 Gateway Endpoint (free)
resource "aws_vpc_endpoint" "s3" {
vpc_id = aws_vpc.main.id
service_name = "com.amazonaws.us-east-1.s3"
route_table_ids = [
aws_route_table.private_a.id,
aws_route_table.private_b.id,
]
tags = {
Name = "s3-vpc-endpoint"
}
}
# DynamoDB Gateway Endpoint (free)
resource "aws_vpc_endpoint" "dynamodb" {
vpc_id = aws_vpc.main.id
service_name = "com.amazonaws.us-east-1.dynamodb"
route_table_ids = [
aws_route_table.private_a.id,
aws_route_table.private_b.id,
]
tags = {
Name = "dynamodb-vpc-endpoint"
}
}
# ECR Interface Endpoint
resource "aws_vpc_endpoint" "ecr_api" {
vpc_id = aws_vpc.main.id
service_name = "com.amazonaws.us-east-1.ecr.api"
vpc_endpoint_type = "Interface"
private_dns_enabled = true
subnet_ids = [
aws_subnet.private_a.id,
aws_subnet.private_b.id,
]
security_group_ids = [
aws_security_group.vpc_endpoints.id
]
}
12. FinOps Tools Ecosystem
12.1 Open Source Tools
# FinOps open source tools comparison
infracost:
description: "Cost estimation comments on Terraform PRs"
integration: "GitHub Actions, GitLab CI, Atlantis"
advantage: "Know cost impact before code changes"
opencost:
description: "K8s cost monitoring (CNCF project)"
integration: "Prometheus, Grafana, K8s"
advantage: "Real-time K8s cost monitoring, fully open source"
komiser:
description: "Multi-cloud cost dashboard"
integration: "AWS, GCP, Azure, DigitalOcean"
advantage: "Self-hosted, multi-cloud support"
vantage:
description: "Cloud cost transparency platform"
integration: "AWS, GCP, Azure, Datadog, Snowflake"
advantage: "Unit Cost tracking, cost reports"
12.2 Infracost CI/CD Integration
# Using Infracost with GitHub Actions
# .github/workflows/infracost.yml
name: Infracost
on:
pull_request:
paths:
- '**.tf'
- '**.tfvars'
jobs:
infracost:
runs-on: ubuntu-latest
permissions:
contents: read
pull-requests: write
steps:
- name: Checkout
uses: actions/checkout@v4
- name: Setup Infracost
uses: infracost/actions/setup@v3
with:
api-key: "${{ secrets.INFRACOST_API_KEY }}"
- name: Generate Infracost diff
run: |
infracost diff \
--path=. \
--format=json \
--out-file=/tmp/infracost.json
- name: Post Infracost comment
uses: infracost/actions/comment@v1
with:
path: /tmp/infracost.json
behavior: update
13. Building a FinOps Culture
13.1 Showback vs Chargeback
Showback:
├── Show each team their costs
├── No actual budget deduction
├── Awareness-building purpose
├── Suitable for FinOps early stages
└── Risk: Insufficient motivation for behavior change
Chargeback:
├── Actually deduct from each team's budget
├── Strong cost awareness motivation
├── Requires accurate cost allocation
├── Suitable for mature FinOps stages
└── Risk: Disputes over shared resource allocation
13.2 Unit Economics Tracking
# Unit Economics dashboard data generation
# unit_economics.py
def calculate_unit_economics(
total_cloud_cost,
total_revenue,
active_users,
total_transactions,
total_api_calls
):
"""Calculate cloud cost per business unit"""
cloud_cost_ratio = (total_cloud_cost / total_revenue) * 100
return {
'cost_per_user': round(
total_cloud_cost / active_users, 4
),
'cost_per_transaction': round(
total_cloud_cost / total_transactions, 6
),
'cost_per_1k_api_calls': round(
(total_cloud_cost / total_api_calls) * 1000, 4
),
'cloud_cost_revenue_ratio': round(cloud_cost_ratio, 2),
'gross_margin_impact': round(100 - cloud_cost_ratio, 2)
}
# Example
metrics = calculate_unit_economics(
total_cloud_cost=50000,
total_revenue=500000,
active_users=100000,
total_transactions=2000000,
total_api_calls=500000000
)
# Result:
# cost_per_user: $0.50
# cost_per_transaction: $0.025
# cost_per_1k_api_calls: $0.10
# cloud_cost_revenue_ratio: 10.0%
13.3 FinOps Team Structure
# FinOps team structure best practices
finops_team:
  finops_lead:
    role: "FinOps program owner"
    responsibilities:
      - "Cost optimization strategy"
      - "Executive reporting"
      - "Cross-team coordination"
  cloud_analyst:
    role: "Cost data analysis"
    responsibilities:
      - "Cost report creation"
      - "Anomaly detection and investigation"
      - "Budget vs actual tracking"
  engineering_champion:
    role: "FinOps champion in each engineering team"
    responsibilities:
      - "Spread cost awareness within team"
      - "Execute resource optimization"
      - "Ensure tagging compliance"

governance:
  weekly_review: "Weekly cost review meeting"
  monthly_report: "Monthly FinOps report"
  quarterly_optimization: "Quarterly major optimization"
  annual_planning: "Annual cloud budget planning"
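The cloud analyst's anomaly-detection duty does not require a sophisticated tool to get started: a trailing-baseline z-score over daily spend catches the most damaging spikes. A minimal sketch with hypothetical daily cost figures; the 3-sigma threshold and baseline window are tuning assumptions, not fixed rules:

```python
from statistics import mean, stdev

def detect_cost_anomaly(daily_costs, threshold=3.0):
    """Flag the most recent day's cost if it deviates more than
    `threshold` standard deviations from the trailing baseline."""
    *baseline, today = daily_costs
    mu, sigma = mean(baseline), stdev(baseline)
    if sigma == 0:
        return today != mu
    return abs(today - mu) / sigma > threshold

# 13 stable days around $1,000/day, then a $2,500 spike
# (e.g. a forgotten GPU instance left running).
history = [980, 1010, 1005, 990, 1020, 1000, 995,
           1015, 985, 1008, 992, 1003, 997, 2500]
detect_cost_anomaly(history)  # True
```

In practice you would feed this from Cost Explorer daily-granularity data and alert per service or per tag, since a spike in one team's spend can hide inside a flat total.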
14. Cost Optimization Checklist
# FinOps cost optimization checklist
immediate_wins:
  - "Delete unused EBS volumes"
  - "Release unassociated Elastic IPs"
  - "Migrate gp2 to gp3"
  - "Upgrade previous-gen instances (m4 -> m6i)"
  - "Stop dev environments on nights/weekends"
  - "Clean up old snapshots"
  - "Enable S3 Intelligent-Tiering"
short_term: # 1-3 months
  - "Purchase Reserved Instances / Savings Plans"
  - "Adopt Spot instances (suitable workloads)"
  - "Optimize Auto Scaling"
  - "Set up VPC Endpoints"
  - "Establish and apply tagging strategy"
  - "Configure cost alerts"
long_term: # 3-12 months
  - "Adopt Karpenter (K8s)"
  - "Kubecost-based cost allocation"
  - "Unit Economics tracking system"
  - "Embed FinOps culture"
  - "Multi-cloud cost optimization"
  - "Infracost CI/CD integration"
  - "Automated cost governance"
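The first immediate win, finding unused EBS volumes, reduces to filtering for volumes in the "available" (unattached) state past a grace period. A minimal sketch over data shaped like the `Volumes` list returned by boto3's `ec2.describe_volumes()`; the sample volume IDs, dates, and 14-day grace period are hypothetical:

```python
from datetime import datetime, timedelta, timezone

def find_unattached_volumes(volumes, min_age_days=14, today=None):
    """Return IDs of 'available' (unattached) EBS volumes older
    than min_age_days. Dict shape mirrors boto3's
    ec2.describe_volumes() 'Volumes' entries."""
    today = today or datetime.now(timezone.utc)
    cutoff = today - timedelta(days=min_age_days)
    return [
        v["VolumeId"] for v in volumes
        if v["State"] == "available" and v["CreateTime"] < cutoff
    ]

# Hypothetical API response: one long-unattached volume, one in use.
volumes = [
    {"VolumeId": "vol-0aaa", "State": "available",
     "CreateTime": datetime(2025, 1, 1, tzinfo=timezone.utc)},
    {"VolumeId": "vol-0bbb", "State": "in-use",
     "CreateTime": datetime(2024, 1, 1, tzinfo=timezone.utc)},
]
candidates = find_unattached_volumes(
    volumes, today=datetime(2025, 6, 1, tzinfo=timezone.utc)
)
# candidates == ["vol-0aaa"]
```

A safe rollout is snapshot-then-delete: snapshot each candidate, wait a billing cycle, then delete volumes nobody reclaimed.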
15. Quiz
Q1. List the three phases of the FinOps lifecycle in order and describe the key activities of each phase.
Answer:
- Inform: Gain cost visibility. Tagging, Cost Explorer analysis, department/team cost allocation, cost anomaly detection. Understanding "who is spending what, where."
- Optimize: Execute cost reduction. Right-sizing, Reserved Instance/Savings Plans purchases, Spot instance adoption, unused resource cleanup, storage tiering.
- Operate: Sustained optimization governance. Automated policy enforcement, cost alerts, regular reviews, Unit Economics tracking, FinOps culture embedding.
These three phases are performed iteratively.
Q2. What are three key differences between Savings Plans and Reserved Instances?
Answer:
- Flexibility: Savings Plans (especially Compute Savings Plans) apply freely across instance families, regions, and operating systems. Standard RIs are locked to a specific instance type and region.
- Service scope: Savings Plans cover EC2, Fargate, and Lambda. RIs are EC2-specific (RDS, ElastiCache, and other services have their own separate RI programs).
- Commitment type: Savings Plans commit to an hourly spend amount (dollars/hour). RIs commit to specific instance quantities.
Both come in 1-year or 3-year terms, with discounts of up to 72% (Standard RIs and EC2 Instance Savings Plans; the more flexible Compute Savings Plans top out around 66%).
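Because a Savings Plan commits to an hourly dollar amount, sizing the commitment is a trade-off: commit to your baseline and let spiky usage run On-Demand. A simplified sketch of the resulting blended savings; the semantics here (committed amount measured at On-Demand rates, a flat discount) and all figures are illustrative assumptions, not AWS's exact billing model:

```python
def blended_savings(on_demand_hourly, committed_hourly, discount):
    """Effective savings rate when `committed_hourly` dollars/hour
    of usage (at On-Demand rates) is covered by a plan at
    `discount`, and the remainder runs On-Demand. Simplified model
    with hypothetical rates."""
    covered = min(on_demand_hourly, committed_hourly)
    effective = covered * (1 - discount) + (on_demand_hourly - covered)
    return round(1 - effective / on_demand_hourly, 4)

# $10/hour of steady usage, $8/hour covered at a 30% discount:
blended_savings(10.0, 8.0, 0.30)  # 0.24 -> 24% overall savings
```

Under-committing leaves savings on the table; over-committing means paying for unused commitment, which is why most teams start near the trough of their usage curve.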
Q3. Describe three architecture patterns for using Spot Instances reliably.
Answer:
- Diversified instance pool: Instead of depending on a single instance type, specify multiple instance families/sizes to ensure availability. Use the capacity-optimized (or the newer price-capacity-optimized) allocation strategy.
- Checkpointing and retries: Periodically save work progress as checkpoints to S3 or similar storage, and resume from the last checkpoint after an interruption. SageMaker Managed Spot Training automates this pattern.
- On-Demand mixing: Keep On-Demand instances as baseline capacity in a Spot Fleet or Auto Scaling Group and use Spot for the additional capacity. This preserves minimum service levels even during interruptions.
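The checkpointing pattern is simple enough to show end to end: persist progress after each unit of work, and on startup resume from the last checkpoint instead of zero. A minimal local sketch; in production the checkpoint would live in S3 rather than a temp file, and the work step is a placeholder:

```python
import json
import os
import tempfile

CKPT = os.path.join(tempfile.gettempdir(), "job_checkpoint.json")

def process_items(items):
    """Resume-safe batch job: checkpoint progress after every item
    so a Spot interruption loses at most the item in flight."""
    done = 0
    if os.path.exists(CKPT):
        with open(CKPT) as f:
            done = json.load(f)["done"]
    for i in range(done, len(items)):
        _ = items[i] * 2  # placeholder for the real unit of work
        with open(CKPT, "w") as f:
            json.dump({"done": i + 1}, f)
    return done  # items skipped because they were already complete

if os.path.exists(CKPT):
    os.remove(CKPT)                            # fresh start
first_run = process_items([1, 2, 3, 4])   # 0: nothing skipped
second_run = process_items([1, 2, 3, 4])  # 4: a "restart" skips all
```

The two-minute Spot interruption notice is usually enough to flush one final checkpoint before the instance is reclaimed.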
Q4. Explain the VPC Endpoint strategy for reducing NAT Gateway costs.
Answer:
NAT Gateway incurs both an hourly charge (about $0.045/hour) and a data processing charge (about $0.045/GB), so routing high-volume traffic through it gets expensive quickly.
Reduction strategies:
- S3 Gateway Endpoint: Free. S3 traffic bypasses the NAT Gateway and goes directly to S3. Significant savings with large data volumes.
- DynamoDB Gateway Endpoint: Free. DynamoDB traffic also bypasses the NAT Gateway.
- Interface Endpoints: Set up Interface Endpoints for frequently used AWS services such as ECR, CloudWatch, and STS. They carry an hourly cost but are cheaper than a NAT Gateway at high traffic volumes.
- NAT Gateway placement: Use one NAT Gateway per AZ when possible, weighing the cost of extra gateways against cross-AZ traffic charges.
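The arithmetic makes the Gateway Endpoint case concrete. A quick back-of-the-envelope calculation using approximate us-east-1 list prices; the 10 TB/month traffic volume is a hypothetical example:

```python
def monthly_nat_cost(gb_per_month, hours=730,
                     hourly=0.045, per_gb=0.045):
    """Rough monthly NAT Gateway cost: hourly charge for a
    730-hour month plus per-GB data processing (approximate
    us-east-1 list prices)."""
    return round(hours * hourly + gb_per_month * per_gb, 2)

cost = monthly_nat_cost(10_000)  # 10 TB/month of S3 traffic via NAT
# cost == 482.85; the same traffic through a free S3 Gateway
# Endpoint costs nothing, so nearly all of it is avoidable.
```

The data processing charge dominates at scale: the fixed hourly cost is about $33/month, while every additional terabyte through the gateway adds roughly $45.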
Q5. Why is Unit Economics-based FinOps tracking important, and what metrics should you track?
Answer:
Importance: Absolute cloud cost alone cannot measure business efficiency. If revenue grows 2x while costs grow 1.5x, efficiency has actually improved. Unit Economics connects costs to business outcomes, providing meaningful optimization direction.
Key metrics to track:
- Cost per User: Cloud cost / MAU
- Cost per Transaction: Cloud cost / total transactions
- Cost per API call: Cloud cost / API calls
- Cloud cost to revenue ratio: Cloud cost / total revenue (target: under 10-15%)
- Cloud share in COGS: Cloud cost as a proportion of cost of goods sold
Track the trends of these metrics to verify that efficiency is improving.
16. References
- FinOps Foundation - https://www.finops.org/
- AWS Cost Optimization Pillar - AWS Well-Architected Framework
- AWS Cost Explorer API - AWS Documentation
- GCP Cost Management - Google Cloud Documentation
- Azure Cost Management - Microsoft Learn
- Kubecost Documentation - https://docs.kubecost.com/
- OpenCost Project - https://www.opencost.io/
- Infracost Documentation - https://www.infracost.io/docs/
- Karpenter Documentation - https://karpenter.sh/
- AWS Compute Optimizer - AWS Documentation
- GCP Recommender - Google Cloud Documentation
- FinOps Certified Practitioner - FinOps Foundation Certification
- Cloud FinOps (O'Reilly) - J.R. Storment and Mike Fuller
- Spot.io (NetApp) - https://spot.io/
Conclusion
FinOps is not just a cost-cutting tool -- it is a cultural transformation that connects cloud costs to business value. The key is starting with tagging and visibility (Inform), executing right-sizing and reservation discounts (Optimize), and building sustained governance (Operate).
The most important thing is to just start. You do not need to build a perfect FinOps system all at once. Begin at the Crawl stage with basic tagging and cost reporting, move to Walk with reservation discounts and automation, and reach the Run stage with sophisticated Unit Economics-based optimization.