Author: Youngju Kim (@fjvbn20031)

Table of Contents
- Introduction
- 1. DevOps Fundamentals: CI/CD Pipelines
- 2. GitOps and Infrastructure as Code
- 3. Kubernetes Mastery
- 4. Monitoring and Observability
- 5. SRE Principles
- 6. AI/ML Workflow Automation
- 7. Security: RBAC and Secrets Management
- 8. Incident Management
- Conclusion
- Quiz
Introduction
In modern software engineering, DevOps and SRE (Site Reliability Engineering) are no longer optional — they are foundational. Netflix deploys thousands of times per day. Google guarantees 99.99% availability to billions of users. Behind these feats lie rigorously automated pipelines and a data-driven operational philosophy.
This guide takes you through DevOps/SRE fundamentals, Kubernetes operations, and AI/ML workflow automation with practical, production-ready code.
1. DevOps Fundamentals: CI/CD Pipelines
What is CI/CD?
CI (Continuous Integration) is the practice of merging developer code changes frequently, with every merge validated by automated builds and tests. CD extends this: Continuous Delivery keeps every verified build ready for release behind a manual production gate, while Continuous Deployment pushes verified changes to production automatically.
| Stage | Purpose | Automation Scope |
|---|---|---|
| CI | Code integration validation | Build, test, lint |
| CD (Delivery) | Release preparation | Automated to staging |
| CD (Deployment) | Automatic release | Automated to production |
CI/CD with GitHub Actions
```yaml
# .github/workflows/ci-cd.yml
name: CI/CD Pipeline

on:
  push:
    branches: [main, develop]
  pull_request:
    branches: [main]

env:
  REGISTRY: ghcr.io
  IMAGE_NAME: ${{ github.repository }}

jobs:
  test:
    name: Test & Lint
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - name: Set up Python
        uses: actions/setup-python@v5
        with:
          python-version: '3.11'
          cache: 'pip'
      - name: Install dependencies
        run: |
          pip install -r requirements.txt
          pip install pytest pytest-cov flake8
      - name: Lint with flake8
        run: flake8 src/ --max-line-length=88
      - name: Run tests
        run: pytest tests/ --cov=src --cov-report=xml
      - name: Upload coverage
        uses: codecov/codecov-action@v4
        with:
          file: coverage.xml

  build:
    name: Build & Push Docker Image
    runs-on: ubuntu-latest
    needs: test
    if: github.ref == 'refs/heads/main'
    permissions:
      contents: read
      packages: write
    steps:
      - uses: actions/checkout@v4
      - name: Log in to Container Registry
        uses: docker/login-action@v3
        with:
          registry: ${{ env.REGISTRY }}
          username: ${{ github.actor }}
          password: ${{ secrets.GITHUB_TOKEN }}
      - name: Extract metadata
        id: meta
        uses: docker/metadata-action@v5
        with:
          images: ${{ env.REGISTRY }}/${{ env.IMAGE_NAME }}
          tags: |
            type=sha,prefix=sha-
            type=ref,event=branch
            type=semver,pattern={{version}}
      - name: Build and push
        uses: docker/build-push-action@v5
        with:
          context: .
          push: true
          tags: ${{ steps.meta.outputs.tags }}
          cache-from: type=gha
          cache-to: type=gha,mode=max

  deploy:
    name: Deploy to Kubernetes
    runs-on: ubuntu-latest
    needs: build
    environment: production
    steps:
      - uses: actions/checkout@v4
      - name: Configure kubectl
        uses: azure/k8s-set-context@v3
        with:
          kubeconfig: ${{ secrets.KUBECONFIG }}
      - name: Deploy with Helm
        run: |
          helm upgrade --install my-app ./helm/my-app \
            --namespace production \
            --set image.tag=${{ github.sha }} \
            --wait --timeout=5m
```
Deployment Strategy Comparison
Blue-Green Deployment: Maintain two identical production environments (Blue/Green). Deploy the new version to Green, then switch all traffic at once. Rollback is instantaneous but requires double the resources.
Canary Deployment: Route a fraction of traffic (e.g., 5%) to the new version and gradually increase. Validates with real users while minimizing risk.
```yaml
# Argo Rollouts Canary Strategy
apiVersion: argoproj.io/v1alpha1
kind: Rollout
metadata:
  name: my-app
spec:
  replicas: 10
  strategy:
    canary:
      steps:
        - setWeight: 10
        - pause: { duration: 5m }
        - setWeight: 30
        - pause: { duration: 10m }
        - setWeight: 60
        - pause: { duration: 10m }
        - setWeight: 100
      canaryService: my-app-canary
      stableService: my-app-stable
```
2. GitOps and Infrastructure as Code
GitOps Principles
GitOps uses Git as the Single Source of Truth for all operational state.
- Declarative: System state is described as code
- Versioned: All changes tracked in Git history
- Automated: Git changes trigger automatic reconciliation
- Auditable: PR-based workflow records who changed what and why
The leading tools are ArgoCD and Flux.
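To make the reconciliation loop concrete, here is a minimal ArgoCD Application manifest; the repository URL, paths, and names are illustrative placeholders, not taken from this guide:

```yaml
# argocd-application.yaml (illustrative; repo URL and paths are placeholders)
apiVersion: argoproj.io/v1alpha1
kind: Application
metadata:
  name: my-app
  namespace: argocd
spec:
  project: default
  source:
    repoURL: https://github.com/myorg/deploy-configs.git
    targetRevision: main
    path: apps/my-app/overlays/production
  destination:
    server: https://kubernetes.default.svc
    namespace: production
  syncPolicy:
    automated:
      prune: true      # delete resources that were removed from Git
      selfHeal: true   # revert manual changes made directly in the cluster
```

Once applied, ArgoCD continuously compares the manifests at that Git path against the live cluster state and syncs any drift — Git is the source of truth, not `kubectl`.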
Infrastructure as Code with Terraform
```hcl
# main.tf - Provision AWS EKS Cluster
terraform {
  required_providers {
    aws = {
      source  = "hashicorp/aws"
      version = "~> 5.0"
    }
  }

  backend "s3" {
    bucket = "my-terraform-state"
    key    = "prod/eks/terraform.tfstate"
    region = "ap-northeast-2"
  }
}

module "eks" {
  source  = "terraform-aws-modules/eks/aws"
  version = "~> 20.0"

  cluster_name    = "prod-cluster"
  cluster_version = "1.29"

  vpc_id     = module.vpc.vpc_id
  subnet_ids = module.vpc.private_subnets

  eks_managed_node_groups = {
    general = {
      instance_types = ["m5.xlarge"]
      min_size       = 2
      max_size       = 10
      desired_size   = 3
    }

    gpu = {
      instance_types = ["g4dn.xlarge"]
      min_size       = 0
      max_size       = 5
      desired_size   = 1

      taints = [{
        key    = "nvidia.com/gpu"
        value  = "true"
        effect = "NO_SCHEDULE"
      }]
    }
  }
}
```
3. Kubernetes Mastery
Core Resources Overview
| Resource | Role |
|---|---|
| Pod | Execution unit; one or more containers |
| Deployment | Manages Pod replicas and rolling updates |
| Service | Stable network endpoint for a set of Pods |
| HPA | Horizontal auto-scaling based on CPU/memory |
| ConfigMap | Externalized configuration and environment variables |
| Secret | Sensitive data (passwords, tokens) |
| Ingress | External HTTP/S traffic routing |
Production Deployment + HPA Manifest
```yaml
# deployment.yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: ml-inference-api
  namespace: production
  labels:
    app: ml-inference-api
    version: v1
spec:
  replicas: 3
  selector:
    matchLabels:
      app: ml-inference-api
  strategy:
    type: RollingUpdate
    rollingUpdate:
      maxSurge: 1
      maxUnavailable: 0
  template:
    metadata:
      labels:
        app: ml-inference-api
        version: v1
      annotations:
        prometheus.io/scrape: 'true'
        prometheus.io/port: '8080'
        prometheus.io/path: '/metrics'
    spec:
      containers:
        - name: api
          image: ghcr.io/myorg/ml-inference-api:sha-abc123
          ports:
            - containerPort: 8080
          env:
            - name: MODEL_NAME
              valueFrom:
                configMapKeyRef:
                  name: ml-config
                  key: model_name
            - name: DB_PASSWORD
              valueFrom:
                secretKeyRef:
                  name: db-secret
                  key: password
          resources:
            requests:
              cpu: '500m'
              memory: '512Mi'
            limits:
              cpu: '2000m'
              memory: '2Gi'
          readinessProbe:
            httpGet:
              path: /health
              port: 8080
            initialDelaySeconds: 10
            periodSeconds: 5
          livenessProbe:
            httpGet:
              path: /health
              port: 8080
            initialDelaySeconds: 30
            periodSeconds: 10
---
# hpa.yaml
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: ml-inference-api-hpa
  namespace: production
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: ml-inference-api
  minReplicas: 3
  maxReplicas: 20
  metrics:
    - type: Resource
      resource:
        name: cpu
        target:
          type: Utilization
          averageUtilization: 70
    - type: Resource
      resource:
        name: memory
        target:
          type: Utilization
          averageUtilization: 80
    - type: Pods
      pods:
        metric:
          name: http_requests_per_second
        target:
          type: AverageValue
          averageValue: '100'
  behavior:
    scaleUp:
      stabilizationWindowSeconds: 60
      policies:
        - type: Pods
          value: 4
          periodSeconds: 60
    scaleDown:
      stabilizationWindowSeconds: 300
```
Helm for Package Management
Helm is the package manager for Kubernetes. It templates complex application deployments for reuse and versioning.
```yaml
# helm/my-app/values.yaml
replicaCount: 3

image:
  repository: ghcr.io/myorg/my-app
  pullPolicy: IfNotPresent
  tag: 'latest'

service:
  type: ClusterIP
  port: 80
  targetPort: 8080

ingress:
  enabled: true
  className: nginx
  annotations:
    cert-manager.io/cluster-issuer: letsencrypt-prod
  hosts:
    - host: api.example.com
      paths:
        - path: /
          pathType: Prefix
  tls:
    - secretName: api-tls
      hosts:
        - api.example.com

resources:
  requests:
    cpu: 500m
    memory: 512Mi
  limits:
    cpu: 2000m
    memory: 2Gi

autoscaling:
  enabled: true
  minReplicas: 3
  maxReplicas: 20
  targetCPUUtilizationPercentage: 70

postgresql:
  enabled: true
  auth:
    database: myapp
    existingSecret: db-credentials

redis:
  enabled: true
  architecture: replication
```
4. Monitoring and Observability
The Three Pillars of Observability
| Pillar | Tools | Use Case |
|---|---|---|
| Metrics | Prometheus, Grafana | Numeric data, dashboards |
| Logs | Loki, Elasticsearch | Event records, debugging |
| Traces | Jaeger, Tempo | Distributed request tracking |
Prometheus Alerting Rules
```yaml
# prometheus-rules.yaml
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: ml-service-alerts
  namespace: monitoring
spec:
  groups:
    - name: ml-service.rules
      interval: 30s
      rules:
        - alert: HighErrorRate
          expr: |
            sum(rate(http_requests_total{status=~"5.."}[5m]))
            /
            sum(rate(http_requests_total[5m])) > 0.05
          for: 2m
          labels:
            severity: critical
          annotations:
            summary: 'High error rate detected'
            description: 'Error rate is {{ $value | humanizePercentage }} over the last 5 minutes'
        - alert: SlowResponseTime
          expr: |
            histogram_quantile(0.99,
              sum(rate(http_request_duration_seconds_bucket[5m])) by (le, service)
            ) > 1.0
          for: 5m
          labels:
            severity: warning
          annotations:
            summary: 'Slow p99 latency'
            description: 'p99 latency is {{ $value }}s for {{ $labels.service }}'
        - alert: PodCrashLooping
          expr: |
            increase(kube_pod_container_status_restarts_total[15m]) > 3
          for: 0m
          labels:
            severity: critical
          annotations:
            summary: 'Pod is crash looping'
            description: 'Pod {{ $labels.namespace }}/{{ $labels.pod }} is crash looping'
        - alert: HighMemoryUsage
          expr: |
            container_memory_usage_bytes
            /
            container_spec_memory_limit_bytes > 0.85
          for: 5m
          labels:
            severity: warning
          annotations:
            summary: 'High memory usage detected'
```
OpenTelemetry Instrumentation in Python
```python
# instrumentation.py
from opentelemetry import trace, metrics
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import BatchSpanProcessor
from opentelemetry.exporter.otlp.proto.grpc.trace_exporter import OTLPSpanExporter
from opentelemetry.sdk.metrics import MeterProvider
from opentelemetry.instrumentation.fastapi import FastAPIInstrumentor
from opentelemetry.instrumentation.requests import RequestsInstrumentor


def setup_telemetry(service_name: str, otlp_endpoint: str):
    """Configure OpenTelemetry tracing and metrics for the service."""
    # Tracer setup: batch spans and export them over OTLP/gRPC
    tracer_provider = TracerProvider()
    otlp_exporter = OTLPSpanExporter(endpoint=otlp_endpoint)
    tracer_provider.add_span_processor(BatchSpanProcessor(otlp_exporter))
    trace.set_tracer_provider(tracer_provider)

    # Metrics setup
    meter_provider = MeterProvider()
    metrics.set_meter_provider(meter_provider)

    return trace.get_tracer(service_name)


# Apply to FastAPI app
from fastapi import FastAPI

app = FastAPI()
tracer = setup_telemetry("ml-inference-api", "http://otel-collector:4317")

# Auto-instrumentation for incoming HTTP requests and outgoing calls
FastAPIInstrumentor.instrument_app(app)
RequestsInstrumentor().instrument()


@app.post("/predict")
async def predict(payload: dict):
    with tracer.start_as_current_span("model-inference") as span:
        span.set_attribute("model.name", "bert-base")
        span.set_attribute("input.length", len(str(payload)))
        result = run_inference(payload)  # application-specific inference call
        span.set_attribute("prediction.confidence", result["confidence"])
        return result
```
5. SRE Principles
SLI / SLO / SLA Hierarchy
- SLI (Service Level Indicator): The actual measured metric (e.g., request success rate, latency)
- SLO (Service Level Objective): The target value for an SLI (e.g., 99.9% availability)
- SLA (Service Level Agreement): The contractual commitment to customers (typically looser than the SLO)
Error Budget Calculation
With a 99.9% SLO, the allowable downtime per month (30 days) is:
```
Error budget            = 100% - 99.9% = 0.1%
Monthly downtime budget = 30d x 24h x 60m x 0.1% = 43.2 minutes
```
When the error budget is exhausted, feature deployments halt and reliability work takes priority.
| SLO Level | Monthly Downtime Budget (30-day month) |
|---|---|
| 99% | 7 hours 12 minutes |
| 99.9% | 43 minutes 12 seconds |
| 99.99% | 4 minutes 19 seconds |
| 99.999% | 26 seconds |
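The arithmetic above generalizes to any SLO level; a minimal Python sketch:

```python
def downtime_budget_minutes(slo_percent: float, days: int = 30) -> float:
    """Minutes of allowed downtime per period for a given availability SLO."""
    total_minutes = days * 24 * 60          # minutes in the period
    error_budget = 1 - slo_percent / 100    # e.g. 99.9% -> 0.001
    return total_minutes * error_budget

# 99.9% over a 30-day month -> ~43.2 minutes
print(f"{downtime_budget_minutes(99.9):.1f}")
```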
Toil Reduction
Toil is operational work that is manual, repetitive, automatable, and carries no enduring value. Google SRE recommends keeping toil below 50% of each engineer's time.
Strategies for toil reduction:
- Script repetitive tasks
- Convert runbooks into automated code
- Improve alert quality to reduce noise
- Build self-healing systems
6. AI/ML Workflow Automation
Experiment Tracking with MLflow
```python
# train.py
import mlflow
import mlflow.sklearn
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score, f1_score

mlflow.set_tracking_uri("http://mlflow-server:5000")
mlflow.set_experiment("fraud-detection-v2")

with mlflow.start_run(run_name="rf-baseline"):
    params = {"n_estimators": 100, "max_depth": 10, "random_state": 42}
    mlflow.log_params(params)

    # X_train, y_train, X_test, y_test come from an earlier data-prep step
    model = RandomForestClassifier(**params)
    model.fit(X_train, y_train)

    y_pred = model.predict(X_test)
    mlflow.log_metric("accuracy", accuracy_score(y_test, y_pred))
    mlflow.log_metric("f1_score", f1_score(y_test, y_pred))

    mlflow.sklearn.log_model(model, "model",
                             registered_model_name="fraud-detector")
```
ML Pipeline with Argo Workflows
```yaml
# ml-pipeline.yaml
apiVersion: argoproj.io/v1alpha1
kind: Workflow
metadata:
  name: ml-training-pipeline
spec:
  entrypoint: ml-pipeline
  templates:
    - name: ml-pipeline
      dag:
        tasks:
          - name: data-prep
            template: prepare-data
          - name: train
            template: train-model
            dependencies: [data-prep]
          - name: evaluate
            template: evaluate-model
            dependencies: [train]
          - name: deploy
            template: deploy-model
            dependencies: [evaluate]

    - name: prepare-data
      container:
        image: ghcr.io/myorg/data-prep:latest
        command: [python, prepare_data.py]
        resources:
          requests:
            memory: 4Gi
            cpu: '2'

    - name: train-model
      container:
        image: ghcr.io/myorg/ml-trainer:latest
        command: [python, train.py]
        resources:
          requests:
            memory: 16Gi
            cpu: '8'
            nvidia.com/gpu: '1'

    - name: evaluate-model
      container:
        image: ghcr.io/myorg/ml-evaluator:latest
        command: [python, evaluate.py]

    - name: deploy-model
      container:
        image: ghcr.io/myorg/model-deployer:latest
        command: [python, deploy.py]
```
Dockerfile for a Python ML Service
```dockerfile
# Dockerfile
FROM python:3.11-slim AS builder

WORKDIR /app

# Separate dependency layer for better cache utilization
COPY requirements.txt .
RUN pip install --no-cache-dir --user -r requirements.txt

FROM python:3.11-slim AS runtime

# Security: run as non-root user
RUN groupadd -r appuser && useradd -r -g appuser appuser

WORKDIR /app

# Copy packages installed in the builder stage
COPY --from=builder /root/.local /home/appuser/.local
COPY src/ ./src/
COPY models/ ./models/

USER appuser
ENV PATH=/home/appuser/.local/bin:$PATH
ENV PYTHONUNBUFFERED=1
ENV PYTHONDONTWRITEBYTECODE=1

EXPOSE 8080

HEALTHCHECK --interval=30s --timeout=5s \
  CMD python -c "import requests; requests.get('http://localhost:8080/health').raise_for_status()"

CMD ["uvicorn", "src.main:app", "--host", "0.0.0.0", "--port", "8080", "--workers", "4"]
```
7. Security: RBAC and Secrets Management
Kubernetes RBAC
```yaml
# rbac.yaml
apiVersion: v1
kind: ServiceAccount
metadata:
  name: ml-service-account
  namespace: production
---
# Least-privilege Role
apiVersion: rbac.authorization.k8s.io/v1
kind: Role
metadata:
  name: ml-service-role
  namespace: production
rules:
  - apiGroups: ['']
    resources: ['pods', 'services']
    verbs: ['get', 'list', 'watch']
  - apiGroups: ['']
    resources: ['secrets']
    resourceNames: ['ml-model-secrets']
    verbs: ['get']
---
apiVersion: rbac.authorization.k8s.io/v1
kind: RoleBinding
metadata:
  name: ml-service-rolebinding
  namespace: production
subjects:
  - kind: ServiceAccount
    name: ml-service-account
    namespace: production
roleRef:
  kind: Role
  apiGroup: rbac.authorization.k8s.io
  name: ml-service-role
```
Secrets Management with HashiCorp Vault
```python
# vault_client.py
import os

import hvac


def get_secret(secret_path: str) -> dict:
    """Securely retrieve a secret from Vault."""
    client = hvac.Client(
        url=os.environ["VAULT_ADDR"],
        token=os.environ["VAULT_TOKEN"],
    )
    if not client.is_authenticated():
        raise RuntimeError("Vault authentication failed")

    secret = client.secrets.kv.v2.read_secret_version(
        path=secret_path,
        mount_point="secret",
    )
    return secret["data"]["data"]


# In Kubernetes, use the Vault Agent Injector for
# automatic secret injection via pod annotations
```
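For reference, the injector is driven by annotations on the pod template; a minimal sketch, where the role name and secret path are placeholders, not values from this guide:

```yaml
# Pod template annotations for the Vault Agent Injector (illustrative values)
annotations:
  vault.hashicorp.com/agent-inject: "true"
  vault.hashicorp.com/role: "ml-service"
  vault.hashicorp.com/agent-inject-secret-db-creds: "secret/data/production/db"
```

With these annotations, a Vault Agent sidecar is injected into the pod and writes the secret to a shared volume, so the application never handles Vault tokens directly.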
Network Policies
```yaml
# network-policy.yaml
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: ml-service-netpol
  namespace: production
spec:
  podSelector:
    matchLabels:
      app: ml-inference-api
  policyTypes:
    - Ingress
    - Egress
  ingress:
    - from:
        - namespaceSelector:
            matchLabels:
              name: ingress-nginx
      ports:
        - protocol: TCP
          port: 8080
  egress:
    - to:
        - namespaceSelector:
            matchLabels:
              name: database
      ports:
        - protocol: TCP
          port: 5432
    - to:
        - namespaceSelector:
            matchLabels:
              name: monitoring
      ports:
        - protocol: TCP
          port: 4317  # OTLP gRPC
```
8. Incident Management
Incident Response Process
- Detection: Prometheus alert or user report
- Triage: Assess severity (P1/P2/P3)
- Communication: Update status page, notify stakeholders
- Mitigation: Traffic failover, rollback, scale out
- Resolution: Fix root cause
- Post-Mortem: Write a blameless post-mortem
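Severity thresholds differ between organizations; as a toy sketch of the triage step only (the error-rate and user-impact cutoffs below are illustrative assumptions, not from this guide):

```python
def triage(error_rate: float, users_affected_pct: float) -> str:
    """Map incident impact to a severity level (illustrative thresholds)."""
    if error_rate > 0.05 or users_affected_pct > 50:
        return "P1"  # page on-call immediately, all hands
    if error_rate > 0.01 or users_affected_pct > 10:
        return "P2"  # page on-call, normal escalation
    return "P3"      # file a ticket, handle during business hours

print(triage(0.08, 5))   # P1
print(triage(0.02, 5))   # P2
print(triage(0.001, 1))  # P3
```

In practice this decision is encoded in alerting rules and on-call runbooks rather than application code, but the point stands: triage criteria should be explicit and agreed on before the incident.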
Writing Effective Post-Mortems
A good post-mortem focuses on system improvement, not individual blame.
Key elements:
- Detailed timeline of events
- Distinguish root cause from triggering factor
- 5 Whys analysis
- Concrete action items with owners and due dates
Conclusion
DevOps and SRE are not just toolsets — they are a culture and philosophy. Automation reduces human error, data-driven decisions improve reliability, and continuous improvement builds trust.
Core principles in summary:
- Automate first: All repetitive work should be code
- Measure everything: SLI/SLO make goals concrete
- Fail fast and safely: Canary deployments minimize blast radius
- Blameless culture: Improve systems, not blame people
Quiz
Q1. What is the difference between Blue-Green and Canary deployments in CI/CD?
Answer: Blue-Green maintains two identical production environments and switches all traffic at once, while Canary gradually routes a fraction of traffic to the new version.
Explanation: Blue-Green provides instant rollback (just flip the traffic switch) with zero downtime, but requires double the resources. Canary validates the new version with real user traffic while minimizing risk, but requires sophisticated monitoring and traffic management. Tools like Argo Rollouts support both strategies natively.
Q2. What metrics does Kubernetes HPA use to trigger scaling?
Answer: By default, CPU and memory utilization. With the Custom Metrics API, HPA can also scale on application-level metrics like RPS (requests per second) or queue depth.
Explanation: HPA v2 supports four metric types: resource, pods, object, and external. Setting a 70% CPU target causes HPA to add pods when average CPU across existing pods exceeds that threshold. Installing the Prometheus Adapter lets you expose Prometheus metrics to HPA for more meaningful scaling decisions.
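The scaling decision itself follows the formula documented by Kubernetes, desiredReplicas = ceil(currentReplicas × currentMetric / targetMetric); a minimal sketch:

```python
from math import ceil

def desired_replicas(current_replicas: int, current_metric: float,
                     target_metric: float) -> int:
    """Core HPA scaling formula (before min/max replica clamping)."""
    return ceil(current_replicas * (current_metric / target_metric))

# 3 pods averaging 90% CPU against a 70% target -> scale to 4
print(desired_replicas(3, 90, 70))
```

The real controller additionally clamps the result to `minReplicas`/`maxReplicas` and applies the `behavior` stabilization windows shown earlier.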
Q3. What is the role of Error Budget in SRE?
Answer: The Error Budget represents the allowable failure margin derived from the SLO. It serves as the mechanism balancing reliability investment against feature delivery velocity.
Explanation: If the SLO is 99.9%, the error budget is 0.1% (roughly 43 minutes of downtime per month). When the budget is ample, teams can deploy freely and experiment. When it is exhausted, new deployments freeze until reliability work replenishes it. This creates a data-driven negotiation between development speed and stability that replaces subjective arguments.
Q4. What are the advantages of Prometheus's pull-based metric collection?
Answer: Pull-based collection lets Prometheus centrally manage scrape targets, immediately detects when a target goes down, and requires no inbound firewall rules from the monitored services.
Explanation: Push-based systems (like StatsD or InfluxDB line protocol) require each service to know the address of the metrics server and risk data loss on network issues. Pull-based Prometheus manages targets via configuration files or service discovery, making it easy to add and remove services dynamically. For short-lived batch jobs that exit before Prometheus can scrape them, the Pushgateway provides a bridge.
Q5. What is the difference between GitOps and traditional CI/CD?
Answer: GitOps uses Git as the single source of truth and continuously reconciles declared state, while traditional CI/CD pipelines imperatively execute deployment commands directly.
Explanation: Traditional CI/CD (Jenkins, GitHub Actions) runs kubectl apply or helm upgrade directly from the pipeline. GitOps tools (ArgoCD, Flux) continuously compare the Git-declared desired state against the live cluster state and auto-sync any drift. GitOps enables drift detection, automatic rollback, and full audit trails without granting the CI/CD system direct cluster credentials — a significant security improvement.