DevOps/SRE Complete Guide: From CI/CD to Kubernetes and MLOps

Introduction

In modern software engineering, DevOps and SRE (Site Reliability Engineering) are no longer optional: they are foundational. Netflix deploys thousands of times per day, and Google targets 99.99% availability for services used by billions. Behind these feats lie rigorously automated pipelines and a data-driven operational philosophy.

This guide takes you through DevOps/SRE fundamentals, Kubernetes operations, and AI/ML workflow automation with practical, production-ready code.


1. DevOps Fundamentals: CI/CD Pipelines

What is CI/CD?

CI (Continuous Integration) is the practice of frequently merging developer code changes, each validated by automated builds and tests. CD covers both Continuous Delivery, which keeps every verified change ready to release (often behind a manual approval gate), and Continuous Deployment, which releases every verified change to production automatically.

Stage           | Purpose                      | Automation Scope
CI              | Code integration validation  | Build, test, lint
CD (Delivery)   | Release preparation          | Automated to staging
CD (Deployment) | Automatic release            | Automated to production

CI/CD with GitHub Actions

# .github/workflows/ci-cd.yml
name: CI/CD Pipeline

on:
  push:
    branches: [main, develop]
  pull_request:
    branches: [main]

env:
  REGISTRY: ghcr.io
  IMAGE_NAME: ${{ github.repository }}

jobs:
  test:
    name: Test & Lint
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4

      - name: Set up Python
        uses: actions/setup-python@v5
        with:
          python-version: '3.11'
          cache: 'pip'

      - name: Install dependencies
        run: |
          pip install -r requirements.txt
          pip install pytest pytest-cov flake8

      - name: Lint with flake8
        run: flake8 src/ --max-line-length=88

      - name: Run tests
        run: pytest tests/ --cov=src --cov-report=xml

      - name: Upload coverage
        uses: codecov/codecov-action@v4
        with:
          file: coverage.xml

  build:
    name: Build & Push Docker Image
    runs-on: ubuntu-latest
    needs: test
    if: github.ref == 'refs/heads/main'
    permissions:
      contents: read
      packages: write

    steps:
      - uses: actions/checkout@v4

      - name: Log in to Container Registry
        uses: docker/login-action@v3
        with:
          registry: ${{ env.REGISTRY }}
          username: ${{ github.actor }}
          password: ${{ secrets.GITHUB_TOKEN }}

      - name: Extract metadata
        id: meta
        uses: docker/metadata-action@v5
        with:
          images: ${{ env.REGISTRY }}/${{ env.IMAGE_NAME }}
          tags: |
            type=sha,prefix=sha-
            type=ref,event=branch
            type=semver,pattern={{version}}

      - name: Build and push
        uses: docker/build-push-action@v5
        with:
          context: .
          push: true
          tags: ${{ steps.meta.outputs.tags }}
          cache-from: type=gha
          cache-to: type=gha,mode=max

  deploy:
    name: Deploy to Kubernetes
    runs-on: ubuntu-latest
    needs: build
    environment: production

    steps:
      - uses: actions/checkout@v4

      - name: Configure kubectl
        uses: azure/k8s-set-context@v3
        with:
          kubeconfig: ${{ secrets.KUBECONFIG }}

      - name: Deploy with Helm
        run: |
          helm upgrade --install my-app ./helm/my-app \
            --namespace production \
            --set image.tag=${{ github.sha }} \
            --wait --timeout=5m

Deployment Strategy Comparison

Blue-Green Deployment: Maintain two identical production environments (Blue/Green). Deploy the new version to Green, then switch all traffic at once. Rollback is instantaneous but requires double the resources.

Canary Deployment: Route a fraction of traffic (e.g., 5%) to the new version and gradually increase. Validates with real users while minimizing risk.

# Argo Rollouts Canary Strategy
apiVersion: argoproj.io/v1alpha1
kind: Rollout
metadata:
  name: my-app
spec:
  replicas: 10
  strategy:
    canary:
      steps:
        - setWeight: 10
        - pause: { duration: 5m }
        - setWeight: 30
        - pause: { duration: 10m }
        - setWeight: 60
        - pause: { duration: 10m }
        - setWeight: 100
      canaryService: my-app-canary
      stableService: my-app-stable

2. GitOps and Infrastructure as Code

GitOps Principles

GitOps uses Git as the Single Source of Truth for all operational state.

  • Declarative: System state is described as code
  • Versioned: All changes tracked in Git history
  • Automated: Git changes trigger automatic reconciliation
  • Auditable: PR-based workflow records who changed what and why

The leading tools are ArgoCD and Flux.
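At the core of these tools is a reconciliation loop that continuously diffs the Git-declared desired state against the live cluster state. A toy sketch in Python (the dict-based state model here is purely illustrative, not ArgoCD's or Flux's actual API):

```python
from typing import Dict, List, Tuple

def reconcile(desired: Dict[str, dict], live: Dict[str, dict]) -> List[Tuple]:
    """Compute the actions needed to drive `live` toward `desired`."""
    actions = []
    for name, spec in desired.items():
        if name not in live:
            actions.append(("create", name, spec))   # in Git, not in cluster
        elif live[name] != spec:
            actions.append(("update", name, spec))   # drift detected
    for name in live:
        if name not in desired:
            actions.append(("delete", name, None))   # pruned from Git
    return actions

# A real controller runs this on a loop: fetch manifests from Git,
# read live objects from the API server, then apply each action.
```

When `desired` and `live` already match, the loop emits no actions, which is exactly the steady state a GitOps controller converges to.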

Infrastructure as Code with Terraform

# main.tf - Provision AWS EKS Cluster
terraform {
  required_providers {
    aws = {
      source  = "hashicorp/aws"
      version = "~> 5.0"
    }
  }
  backend "s3" {
    bucket = "my-terraform-state"
    key    = "prod/eks/terraform.tfstate"
    region = "ap-northeast-2"
  }
}

module "eks" {
  source  = "terraform-aws-modules/eks/aws"
  version = "~> 20.0"

  cluster_name    = "prod-cluster"
  cluster_version = "1.29"

  vpc_id     = module.vpc.vpc_id          # assumes a "vpc" module defined elsewhere
  subnet_ids = module.vpc.private_subnets

  eks_managed_node_groups = {
    general = {
      instance_types = ["m5.xlarge"]
      min_size       = 2
      max_size       = 10
      desired_size   = 3
    }
    gpu = {
      instance_types = ["g4dn.xlarge"]
      min_size       = 0
      max_size       = 5
      desired_size   = 1
      taints = [{
        key    = "nvidia.com/gpu"
        value  = "true"
        effect = "NO_SCHEDULE"
      }]
    }
  }
}

3. Kubernetes Mastery

Core Resources Overview

Resource   | Role
Pod        | Smallest execution unit; one or more containers
Deployment | Manages Pod replicas and rolling updates
Service    | Stable network endpoint for a set of Pods
HPA        | Horizontal auto-scaling based on CPU/memory
ConfigMap  | Externalized configuration and environment variables
Secret     | Sensitive data (passwords, tokens)
Ingress    | External HTTP/S traffic routing

Production Deployment + HPA Manifest

# deployment.yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: ml-inference-api
  namespace: production
  labels:
    app: ml-inference-api
    version: v1
spec:
  replicas: 3
  selector:
    matchLabels:
      app: ml-inference-api
  strategy:
    type: RollingUpdate
    rollingUpdate:
      maxSurge: 1
      maxUnavailable: 0
  template:
    metadata:
      labels:
        app: ml-inference-api
        version: v1
      annotations:
        prometheus.io/scrape: 'true'
        prometheus.io/port: '8080'
        prometheus.io/path: '/metrics'
    spec:
      containers:
        - name: api
          image: ghcr.io/myorg/ml-inference-api:sha-abc123
          ports:
            - containerPort: 8080
          env:
            - name: MODEL_NAME
              valueFrom:
                configMapKeyRef:
                  name: ml-config
                  key: model_name
            - name: DB_PASSWORD
              valueFrom:
                secretKeyRef:
                  name: db-secret
                  key: password
          resources:
            requests:
              cpu: '500m'
              memory: '512Mi'
            limits:
              cpu: '2000m'
              memory: '2Gi'
          readinessProbe:
            httpGet:
              path: /health
              port: 8080
            initialDelaySeconds: 10
            periodSeconds: 5
          livenessProbe:
            httpGet:
              path: /health
              port: 8080
            initialDelaySeconds: 30
            periodSeconds: 10
---
# hpa.yaml
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: ml-inference-api-hpa
  namespace: production
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: ml-inference-api
  minReplicas: 3
  maxReplicas: 20
  metrics:
    - type: Resource
      resource:
        name: cpu
        target:
          type: Utilization
          averageUtilization: 70
    - type: Resource
      resource:
        name: memory
        target:
          type: Utilization
          averageUtilization: 80
    - type: Pods
      pods:
        metric:
          name: http_requests_per_second
        target:
          type: AverageValue
          averageValue: '100'
  behavior:
    scaleUp:
      stabilizationWindowSeconds: 60
      policies:
        - type: Pods
          value: 4
          periodSeconds: 60
    scaleDown:
      stabilizationWindowSeconds: 300

Helm for Package Management

Helm is the package manager for Kubernetes. It templates complex application deployments for reuse and versioning.

# helm/my-app/values.yaml
replicaCount: 3

image:
  repository: ghcr.io/myorg/my-app
  pullPolicy: IfNotPresent
  tag: 'latest'

service:
  type: ClusterIP
  port: 80
  targetPort: 8080

ingress:
  enabled: true
  className: nginx
  annotations:
    cert-manager.io/cluster-issuer: letsencrypt-prod
  hosts:
    - host: api.example.com
      paths:
        - path: /
          pathType: Prefix
  tls:
    - secretName: api-tls
      hosts:
        - api.example.com

resources:
  requests:
    cpu: 500m
    memory: 512Mi
  limits:
    cpu: 2000m
    memory: 2Gi

autoscaling:
  enabled: true
  minReplicas: 3
  maxReplicas: 20
  targetCPUUtilizationPercentage: 70

postgresql:
  enabled: true
  auth:
    database: myapp
    existingSecret: db-credentials

redis:
  enabled: true
  architecture: replication

4. Monitoring and Observability

The Three Pillars of Observability

Pillar  | Tools                | Use Case
Metrics | Prometheus, Grafana  | Numeric data, dashboards
Logs    | Loki, Elasticsearch  | Event records, debugging
Traces  | Jaeger, Tempo        | Distributed request tracking

Prometheus Alerting Rules

# prometheus-rules.yaml
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: ml-service-alerts
  namespace: monitoring
spec:
  groups:
    - name: ml-service.rules
      interval: 30s
      rules:
        - alert: HighErrorRate
          expr: |
            sum(rate(http_requests_total{status=~"5.."}[5m]))
            /
            sum(rate(http_requests_total[5m])) > 0.05
          for: 2m
          labels:
            severity: critical
          annotations:
            summary: 'High error rate detected'
            description: 'Error rate is {{ $value | humanizePercentage }} over the last 5 minutes'

        - alert: SlowResponseTime
          expr: |
            histogram_quantile(0.99,
              sum(rate(http_request_duration_seconds_bucket[5m])) by (le, service)
            ) > 1.0
          for: 5m
          labels:
            severity: warning
          annotations:
            summary: 'Slow p99 latency'
            description: 'p99 latency is {{ $value }}s for {{ $labels.service }}'

        - alert: PodCrashLooping
          expr: |
            increase(kube_pod_container_status_restarts_total[15m]) > 3
          for: 0m
          labels:
            severity: critical
          annotations:
            summary: 'Pod is crash looping'
            description: 'Pod {{ $labels.namespace }}/{{ $labels.pod }} is crash looping'

        - alert: HighMemoryUsage
          expr: |
            container_memory_usage_bytes
            /
            (container_spec_memory_limit_bytes != 0) > 0.85
          for: 5m
          labels:
            severity: warning
          annotations:
            summary: 'High memory usage detected'

OpenTelemetry Instrumentation in Python

# instrumentation.py
from opentelemetry import trace, metrics
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import BatchSpanProcessor
from opentelemetry.exporter.otlp.proto.grpc.trace_exporter import OTLPSpanExporter
from opentelemetry.sdk.metrics import MeterProvider
from opentelemetry.instrumentation.fastapi import FastAPIInstrumentor
from opentelemetry.instrumentation.requests import RequestsInstrumentor

def setup_telemetry(service_name: str, otlp_endpoint: str):
    """Configure OpenTelemetry for the service."""
    # Tracer setup
    tracer_provider = TracerProvider()
    otlp_exporter = OTLPSpanExporter(endpoint=otlp_endpoint)
    tracer_provider.add_span_processor(BatchSpanProcessor(otlp_exporter))
    trace.set_tracer_provider(tracer_provider)

    # Metrics setup (pass a PeriodicExportingMetricReader with an OTLP
    # metric exporter here if you also want metrics shipped to the collector)
    meter_provider = MeterProvider()
    metrics.set_meter_provider(meter_provider)

    return trace.get_tracer(service_name)

# Apply to FastAPI app
from fastapi import FastAPI

app = FastAPI()
tracer = setup_telemetry("ml-inference-api", "http://otel-collector:4317")

# Auto-instrumentation
FastAPIInstrumentor.instrument_app(app)
RequestsInstrumentor().instrument()

@app.post("/predict")
async def predict(payload: dict):
    with tracer.start_as_current_span("model-inference") as span:
        span.set_attribute("model.name", "bert-base")
        span.set_attribute("input.length", len(str(payload)))

        result = run_inference(payload)  # your application's own inference function

        span.set_attribute("prediction.confidence", result["confidence"])
        return result

5. SRE Principles

SLI / SLO / SLA Hierarchy

  • SLI (Service Level Indicator): The actual measured metric (e.g., request success rate, latency)
  • SLO (Service Level Objective): The target value for an SLI (e.g., 99.9% availability)
  • SLA (Service Level Agreement): The contractual commitment to customers (typically looser than the SLO)

Error Budget Calculation

With a 99.9% SLO, the allowable downtime per month (30 days) is:

Error Budget = 100% − SLO = 0.1%
Monthly downtime budget = 30 days × 24 h × 60 min × 0.1% ≈ 43.2 minutes

When the error budget is exhausted, feature deployments halt and reliability work takes priority.

SLO Level | Monthly Downtime Budget (30 days)
99%       | 7 hours 12 minutes
99.9%     | 43 minutes 12 seconds
99.99%    | 4 minutes 19 seconds
99.999%   | 26 seconds
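The budget arithmetic is simple enough to encode as a small helper (a sketch; the 30-day window matches the calculation above):

```python
def error_budget_minutes(slo: float, days: int = 30) -> float:
    """Allowed downtime per window, in minutes, for an SLO given as a fraction."""
    if not 0 < slo < 1:
        raise ValueError("SLO must be a fraction, e.g. 0.999")
    return days * 24 * 60 * (1 - slo)

# error_budget_minutes(0.999) -> ~43.2 minutes per 30 days
```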

Toil Reduction

Toil is manual, repetitive, automatable operational work. Google SRE recommends keeping toil below 50% of total work time.

Strategies for toil reduction:

  1. Script repetitive tasks
  2. Convert runbooks into automated code
  3. Improve alert quality to reduce noise
  4. Build self-healing systems
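Strategy 4 can start very small. A sketch of a self-healing guard, with the check and remediation callables injected (both are hypothetical; in practice they might wrap an HTTP health probe and a `kubectl rollout restart`):

```python
from typing import Callable

def self_heal(check: Callable[[], bool],
              remediate: Callable[[], None],
              attempts: int = 3) -> bool:
    """Probe up to `attempts` times; trigger remediation only if all fail.

    Returns True if the service was healthy, False if remediation ran.
    """
    for _ in range(attempts):
        if check():
            return True
    remediate()  # e.g. restart the deployment, page only if this fails too
    return False
```

Retrying before remediating avoids acting on a single transient failure, the same reason Kubernetes probes have `failureThreshold`.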

6. AI/ML Workflow Automation

Experiment Tracking with MLflow

# train.py
import mlflow
import mlflow.sklearn
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score, f1_score
from sklearn.model_selection import train_test_split

# Example data; replace with your real feature pipeline
X, y = make_classification(n_samples=1000, n_features=20, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42)

mlflow.set_tracking_uri("http://mlflow-server:5000")
mlflow.set_experiment("fraud-detection-v2")

with mlflow.start_run(run_name="rf-baseline"):
    params = {"n_estimators": 100, "max_depth": 10, "random_state": 42}
    mlflow.log_params(params)

    model = RandomForestClassifier(**params)
    model.fit(X_train, y_train)

    y_pred = model.predict(X_test)
    mlflow.log_metric("accuracy", accuracy_score(y_test, y_pred))
    mlflow.log_metric("f1_score", f1_score(y_test, y_pred))

    mlflow.sklearn.log_model(model, "model",
        registered_model_name="fraud-detector")

ML Pipeline with Argo Workflows

# ml-pipeline.yaml
apiVersion: argoproj.io/v1alpha1
kind: Workflow
metadata:
  name: ml-training-pipeline
spec:
  entrypoint: ml-pipeline
  templates:
    - name: ml-pipeline
      dag:
        tasks:
          - name: data-prep
            template: prepare-data
          - name: train
            template: train-model
            dependencies: [data-prep]
          - name: evaluate
            template: evaluate-model
            dependencies: [train]
          - name: deploy
            template: deploy-model
            dependencies: [evaluate]

    - name: prepare-data
      container:
        image: ghcr.io/myorg/data-prep:latest
        command: [python, prepare_data.py]
        resources:
          requests:
            memory: 4Gi
            cpu: '2'

    - name: train-model
      container:
        image: ghcr.io/myorg/ml-trainer:latest
        command: [python, train.py]
        resources:
          requests:
            memory: 16Gi
            cpu: '8'
            nvidia.com/gpu: '1'

    - name: evaluate-model
      container:
        image: ghcr.io/myorg/ml-evaluator:latest
        command: [python, evaluate.py]

    - name: deploy-model
      container:
        image: ghcr.io/myorg/model-deployer:latest
        command: [python, deploy.py]

Dockerfile for a Python ML Service

# Dockerfile
FROM python:3.11-slim AS builder

WORKDIR /app

# Separate dependency layer for better cache utilization
COPY requirements.txt .
RUN pip install --no-cache-dir --user -r requirements.txt

FROM python:3.11-slim AS runtime

# Security: run as non-root user
RUN groupadd -r appuser && useradd -r -g appuser appuser

WORKDIR /app

# Copy packages from builder
COPY --from=builder /root/.local /home/appuser/.local
COPY --chown=appuser:appuser src/ ./src/
COPY --chown=appuser:appuser models/ ./models/

USER appuser

ENV PATH=/home/appuser/.local/bin:$PATH
ENV PYTHONUNBUFFERED=1
ENV PYTHONDONTWRITEBYTECODE=1

EXPOSE 8080

# NOTE: assumes the 'requests' package is installed via requirements.txt
HEALTHCHECK --interval=30s --timeout=10s --start-period=5s --retries=3 \
  CMD python -c "import requests; requests.get('http://localhost:8080/health').raise_for_status()"

CMD ["uvicorn", "src.main:app", "--host", "0.0.0.0", "--port", "8080", "--workers", "4"]

7. Security: RBAC and Secrets Management

Kubernetes RBAC

# rbac.yaml
apiVersion: v1
kind: ServiceAccount
metadata:
  name: ml-service-account
  namespace: production
---
# Least-privilege Role
apiVersion: rbac.authorization.k8s.io/v1
kind: Role
metadata:
  name: ml-service-role
  namespace: production
rules:
  - apiGroups: ['']
    resources: ['pods', 'services']
    verbs: ['get', 'list', 'watch']
  - apiGroups: ['']
    resources: ['secrets']
    resourceNames: ['ml-model-secrets']
    verbs: ['get']
---
apiVersion: rbac.authorization.k8s.io/v1
kind: RoleBinding
metadata:
  name: ml-service-rolebinding
  namespace: production
subjects:
  - kind: ServiceAccount
    name: ml-service-account
    namespace: production
roleRef:
  kind: Role
  apiGroup: rbac.authorization.k8s.io
  name: ml-service-role

Secrets Management with HashiCorp Vault

# vault_client.py
import hvac
import os

def get_secret(secret_path: str) -> dict:
    """Securely retrieve a secret from Vault."""
    client = hvac.Client(
        url=os.environ["VAULT_ADDR"],
        token=os.environ["VAULT_TOKEN"]
    )

    if not client.is_authenticated():
        raise RuntimeError("Vault authentication failed")

    secret = client.secrets.kv.v2.read_secret_version(
        path=secret_path,
        mount_point="secret"
    )
    return secret["data"]["data"]

# In Kubernetes, use the Vault Agent Injector for
# automatic secret injection via pod annotations

Network Policies

# network-policy.yaml
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: ml-service-netpol
  namespace: production
spec:
  podSelector:
    matchLabels:
      app: ml-inference-api
  policyTypes:
    - Ingress
    - Egress
  ingress:
    - from:
        - namespaceSelector:
            matchLabels:
              name: ingress-nginx
      ports:
        - protocol: TCP
          port: 8080
  egress:
    - to:
        - namespaceSelector:
            matchLabels:
              name: database
      ports:
        - protocol: TCP
          port: 5432
    - to:
        - namespaceSelector:
            matchLabels:
              name: monitoring
      ports:
        - protocol: TCP
          port: 4317 # OTLP gRPC

8. Incident Management

Incident Response Process

  1. Detection: Prometheus alert or user report
  2. Triage: Assess severity (P1/P2/P3)
  3. Communication: Update status page, notify stakeholders
  4. Mitigation: Traffic failover, rollback, scale out
  5. Resolution: Fix root cause
  6. Post-Mortem: Write a blameless post-mortem

Writing Effective Post-Mortems

A good post-mortem focuses on system improvement, not individual blame.

Key elements:

  • Detailed timeline of events
  • Distinguish root cause from triggering factor
  • 5 Whys analysis
  • Concrete action items with owners and due dates

Conclusion

DevOps and SRE are not just toolsets — they are a culture and philosophy. Automation reduces human error, data-driven decisions improve reliability, and continuous improvement builds trust.

Core principles in summary:

  • Automate first: All repetitive work should be code
  • Measure everything: SLI/SLO make goals concrete
  • Fail fast and safely: Canary deployments minimize blast radius
  • Blameless culture: Improve systems, not blame people

Quiz

Q1. What is the difference between Blue-Green and Canary deployments in CI/CD?

Answer: Blue-Green maintains two identical production environments and switches all traffic at once, while Canary gradually routes a fraction of traffic to the new version.

Explanation: Blue-Green provides instant rollback (just flip the traffic switch) with zero downtime, but requires double the resources. Canary validates the new version with real user traffic while minimizing risk, but requires sophisticated monitoring and traffic management. Tools like Argo Rollouts support both strategies natively.

Q2. What metrics does Kubernetes HPA use to trigger scaling?

Answer: By default, CPU and memory utilization. With the Custom Metrics API, HPA can also scale on application-level metrics like RPS (requests per second) or queue depth.

Explanation: HPA v2 supports four metric types: resource, pods, object, and external. Setting a 70% CPU target causes HPA to add pods when average CPU across existing pods exceeds that threshold. Installing the Prometheus Adapter lets you expose Prometheus metrics to HPA for more meaningful scaling decisions.
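The scaling rule HPA applies is documented in the Kubernetes docs as desiredReplicas = ceil(currentReplicas × currentMetricValue / desiredMetricValue). A sketch of that formula with min/max clamping (parameter names here are illustrative):

```python
import math

def desired_replicas(current: int, current_util: float, target_util: float,
                     min_replicas: int = 1, max_replicas: int = 20) -> int:
    """Kubernetes HPA scaling formula with min/max clamping."""
    desired = math.ceil(current * current_util / target_util)
    return max(min_replicas, min(max_replicas, desired))

# 5 pods averaging 90% CPU against a 70% target:
# desired_replicas(5, 90, 70) -> 7
```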

Q3. What is the role of Error Budget in SRE?

Answer: The Error Budget represents the allowable failure margin derived from the SLO. It serves as the mechanism balancing reliability investment against feature delivery velocity.

Explanation: If the SLO is 99.9%, the error budget is 0.1% (roughly 43 minutes of downtime per month). When the budget is ample, teams can deploy freely and experiment. When it is exhausted, new deployments freeze until reliability work replenishes it. This creates a data-driven negotiation between development speed and stability that replaces subjective arguments.

Q4. What are the advantages of Prometheus's pull-based metric collection?

Answer: Pull-based collection lets Prometheus centrally manage scrape targets, immediately detects when a target goes down, and requires no inbound firewall rules from the monitored services.

Explanation: Push-based systems (like StatsD or InfluxDB line protocol) require each service to know the address of the metrics server and risk data loss on network issues. Pull-based Prometheus manages targets via configuration files or service discovery, making it easy to add and remove services dynamically. For short-lived batch jobs that exit before Prometheus can scrape them, the Pushgateway provides a bridge.
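The pull model only requires the service to expose a plain-text /metrics endpoint. A dependency-free sketch using only the standard library (a real service would use the official prometheus_client package instead):

```python
import threading
from http.server import BaseHTTPRequestHandler, HTTPServer

REQUESTS_TOTAL = {"count": 0}  # toy stand-in for a Counter

class MetricsHandler(BaseHTTPRequestHandler):
    def do_GET(self):
        if self.path == "/metrics":
            # Prometheus text exposition format, fetched on Prometheus's schedule
            body = (
                "# HELP http_requests_total Total HTTP requests served.\n"
                "# TYPE http_requests_total counter\n"
                f"http_requests_total {REQUESTS_TOTAL['count']}\n"
            ).encode()
            self.send_response(200)
            self.send_header("Content-Type", "text/plain; version=0.0.4")
            self.send_header("Content-Length", str(len(body)))
            self.end_headers()
            self.wfile.write(body)
        else:  # every other request just increments the counter
            REQUESTS_TOTAL["count"] += 1
            self.send_response(200)
            self.send_header("Content-Length", "0")
            self.end_headers()

    def log_message(self, *args):  # keep the demo quiet
        pass

def serve(port: int = 0) -> HTTPServer:
    """Start the server on a background thread; port 0 picks a free port."""
    server = HTTPServer(("127.0.0.1", port), MetricsHandler)
    threading.Thread(target=server.serve_forever, daemon=True).start()
    return server
```

Prometheus simply issues GET /metrics on its own schedule; the service never needs to know where the monitoring system lives.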

Q5. What is the difference between GitOps and traditional CI/CD?

Answer: GitOps uses Git as the single source of truth and continuously reconciles declared state, while traditional CI/CD pipelines imperatively execute deployment commands directly.

Explanation: Traditional CI/CD (Jenkins, GitHub Actions) runs kubectl apply or helm upgrade directly from the pipeline. GitOps tools (ArgoCD, Flux) continuously compare the Git-declared desired state against the live cluster state and auto-sync any drift. GitOps enables drift detection, automatic rollback, and full audit trails without granting the CI/CD system direct cluster credentials — a significant security improvement.