Introduction

In modern software engineering, **DevOps** and **SRE (Site Reliability Engineering)** are no longer optional; they are foundational. Netflix is widely reported to deploy thousands of times per day, and Google runs services for billions of users against availability targets on the order of 99.99%. Behind these feats lie rigorously automated pipelines and a data-driven operational philosophy.

This guide takes you through DevOps/SRE fundamentals, Kubernetes operations, and AI/ML workflow automation with **practical, production-ready code**.

1. DevOps Fundamentals: CI/CD Pipelines

What is CI/CD?

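**CI (Continuous Integration)** is the practice of merging developer code changes frequently, with every merge validated by an automated build and test run. **CD** covers two related practices: **Continuous Delivery** keeps every verified change ready to release (typically automated up to staging, with an approval step before production), while **Continuous Deployment** releases every verified change to production automatically.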

| Stage           | Purpose                     | Automation Scope        |
| --------------- | --------------------------- | ----------------------- |
| CI              | Code integration validation | Build, test, lint       |
| CD (Delivery)   | Release preparation         | Automated to staging    |
| CD (Deployment) | Automatic release           | Automated to production |

CI/CD with GitHub Actions

    # .github/workflows/ci-cd.yml
    on:
      push:
        branches: [main, develop]
      pull_request:
        branches: [main]

    env:
      REGISTRY: ghcr.io
      IMAGE_NAME: ${{ github.repository }}

    jobs:
      test:
        runs-on: ubuntu-latest
        steps:
          - uses: actions/checkout@v4
          - name: Set up Python
            uses: actions/setup-python@v5
            with:
              python-version: '3.11'
              cache: 'pip'
          - name: Install dependencies
            run: |
              pip install -r requirements.txt
              pip install pytest pytest-cov flake8
          - name: Lint with flake8
            run: flake8 src/ --max-line-length=88
          - name: Run tests
            run: pytest tests/ --cov=src --cov-report=xml
          - name: Upload coverage
            uses: codecov/codecov-action@v4
            with:
              file: coverage.xml

      build:
        runs-on: ubuntu-latest
        needs: test
        # Only pushes to main build and publish an image
        if: github.ref == 'refs/heads/main'
        permissions:
          contents: read
          packages: write
        steps:
          - uses: actions/checkout@v4
          - name: Log in to Container Registry
            uses: docker/login-action@v3
            with:
              registry: ${{ env.REGISTRY }}
              username: ${{ github.actor }}
              password: ${{ secrets.GITHUB_TOKEN }}
          - name: Extract metadata
            id: meta
            uses: docker/metadata-action@v5
            with:
              images: ${{ env.REGISTRY }}/${{ env.IMAGE_NAME }}
              # format=long emits the full commit SHA so the tag matches
              # the image.tag value set in the deploy job below
              tags: |
                type=sha,prefix=sha-,format=long
                type=ref,event=branch
                type=semver,pattern={{version}}
          - name: Build and push
            uses: docker/build-push-action@v5
            with:
              context: .
              push: true
              tags: ${{ steps.meta.outputs.tags }}
              cache-from: type=gha
              cache-to: type=gha,mode=max

      deploy:
        runs-on: ubuntu-latest
        needs: build
        environment: production
        steps:
          - uses: actions/checkout@v4
          - name: Configure kubectl
            uses: azure/k8s-set-context@v3
            with:
              kubeconfig: ${{ secrets.KUBECONFIG }}
          - name: Deploy with Helm
            run: |
              helm upgrade --install my-app ./helm/my-app \
                --namespace production \
                --set image.tag=sha-${{ github.sha }} \
                --wait --timeout=5m

Deployment Strategy Comparison

**Blue-Green Deployment**: Maintain two identical production environments (Blue/Green). Deploy the new version to Green, then switch all traffic at once. Rollback is instantaneous but requires double the resources.

**Canary Deployment**: Route a fraction of traffic (e.g., 5%) to the new version and gradually increase. Validates with real users while minimizing risk.

    # Argo Rollouts Canary Strategy
    apiVersion: argoproj.io/v1alpha1
    kind: Rollout
    metadata:
      name: my-app  # assumed name, inferred from the service names below
    spec:
      replicas: 10
      strategy:
        canary:
          steps:
            - setWeight: 10
            - pause: { duration: 5m }
            - setWeight: 30
            - pause: { duration: 10m }
            - setWeight: 60
            - pause: { duration: 10m }
            - setWeight: 100
          canaryService: my-app-canary
          stableService: my-app-stable
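The stepwise rollout above can be read as a loop: raise the canary weight, observe a health signal, and abort if it degrades. The following is a minimal sketch of that idea in plain Python, not Argo Rollouts code. The function name `run_canary`, the `error_rate_at` callback, and the 5% abort threshold are illustrative assumptions.

```python
from typing import Callable, List, Tuple

# Canary weights taken from the Rollout steps above (percent of traffic).
CANARY_WEIGHTS: List[int] = [10, 30, 60, 100]


def run_canary(
    weights: List[int],
    error_rate_at: Callable[[int], float],
    max_error_rate: float = 0.05,  # assumed abort threshold, not from the source
) -> Tuple[str, int]:
    """Step through the weights and abort at the first unhealthy step.

    Returns ("promoted", final_weight) when every step stays healthy, or
    ("aborted", failing_weight) for the first step whose error rate is too high.
    """
    reached = 0
    for weight in weights:
        if error_rate_at(weight) > max_error_rate:
            # A real controller would now send all traffic back to the stable version.
            return "aborted", weight
        reached = weight
    return "promoted", reached
```

With a callback that reports a 10% error rate once the canary receives 60% of traffic, the sketch aborts at the 60% step, so at most that share of users saw the faulty version.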

2. GitOps and Infrastructure as Code

GitOps Principles

GitOps uses Git as the **Single Source of Truth** for all operational state.

- **Declarative**: System state is described as code

- **Versioned**: All changes tracked in Git history

- **Automated**: Git changes trigger automatic reconciliation

- **Auditable**: PR-based workflow records who changed what and why

The leading tools are **ArgoCD** and **Flux**.
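To make "automatic reconciliation" concrete, here is a minimal sketch of the comparison a GitOps controller performs between the state declared in Git and the live cluster state. It is a simplified model, not how ArgoCD or Flux are implemented; the `plan_reconcile` name and the dictionary-based state representation are assumptions for illustration.

```python
from typing import Dict, List, Tuple

State = Dict[str, dict]  # resource name -> declared spec


def plan_reconcile(desired: State, live: State) -> List[Tuple[str, str]]:
    """Return the (action, resource) pairs needed to make `live` match `desired`."""
    actions: List[Tuple[str, str]] = []
    for name in sorted(desired):
        if name not in live:
            actions.append(("create", name))
        elif live[name] != desired[name]:
            # Drift: the live object differs from what Git declares.
            actions.append(("update", name))
    for name in sorted(live):
        if name not in desired:
            # Resource was removed from Git, so it is pruned from the cluster.
            actions.append(("delete", name))
    return actions
```

Running this comparison on every Git change and on a timer is what turns Git into the source of truth: a manual edit in the cluster shows up as an `update` action on the next pass.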

Infrastructure as Code with Terraform

    # main.tf - Provision AWS EKS Cluster
    terraform {
      required_providers {
        aws = {
          source  = "hashicorp/aws"
          version = "~> 5.0"
        }
      }

      backend "s3" {
        bucket = "my-terraform-state"
        key    = "prod/eks/terraform.tfstate"
        region = "ap-northeast-2"
      }
    }

    module "eks" {
      source  = "terraform-aws-modules/eks/aws"
      version = "~> 20.0"

      cluster_name    = "prod-cluster"
      cluster_version = "1.29"

      # module.vpc is defined elsewhere in the configuration
      vpc_id     = module.vpc.vpc_id
      subnet_ids = module.vpc.private_subnets

      eks_managed_node_groups = {
        general = {
          instance_types = ["m5.xlarge"]
          min_size       = 2
          max_size       = 10
          desired_size   = 3
        }

        gpu = {
          instance_types = ["g4dn.xlarge"]
          min_size       = 0
          max_size       = 5
          desired_size   = 1

          # Keep non-GPU workloads off the GPU nodes
          taints = [{
            key    = "nvidia.com/gpu"
            value  = "true"
            effect = "NO_SCHEDULE"
          }]
        }
      }
    }

3. Kubernetes Mastery

Core Resources Overview

| Resource   | Role                                                 |
| ---------- | ---------------------------------------------------- |
| Pod        | Execution unit; one or more containers               |
| Deployment | Manages Pod replicas and rolling updates             |
| Service    | Stable network endpoint for a set of Pods            |
| HPA        | Horizontal auto-scaling based on CPU/memory          |
| ConfigMap  | Externalized configuration and environment variables |
| Secret     | Sensitive data (passwords, tokens)                   |
| Ingress    | External HTTP/S traffic routing                      |

Production Deployment + HPA Manifest

    # deployment.yaml
    apiVersion: apps/v1
    kind: Deployment
    metadata:
      name: ml-inference-api  # assumed name, inferred from the app label
      namespace: production
      labels:
        app: ml-inference-api
        version: v1
    spec:
      replicas: 3
      selector:
        matchLabels:
          app: ml-inference-api
      strategy:
        type: RollingUpdate
        rollingUpdate:
          maxSurge: 1
          maxUnavailable: 0
      template:
        metadata:
          labels:
            app: ml-inference-api
            version: v1
          annotations:
            prometheus.io/scrape: 'true'
            prometheus.io/port: '8080'
            prometheus.io/path: '/metrics'
        spec:
          containers:
            - name: api
              image: ghcr.io/myorg/ml-inference-api:sha-abc123
              ports:
                - containerPort: 8080
              env:
                - name: MODEL_NAME
                  valueFrom:
                    configMapKeyRef:
                      name: ml-inference-config  # assumed ConfigMap name
                      key: model_name
                - name: DB_PASSWORD
                  valueFrom:
                    secretKeyRef:
                      name: db-credentials  # assumed Secret name
                      key: password
              resources:
                requests:
                  cpu: '500m'
                  memory: '512Mi'
                limits:
                  cpu: '2000m'
                  memory: '2Gi'
              # Gate traffic until the app reports healthy
              readinessProbe:
                httpGet:
                  path: /health
                  port: 8080
                initialDelaySeconds: 10
                periodSeconds: 5
              # Restart the container if it stops responding
              livenessProbe:
                httpGet:
                  path: /health
                  port: 8080
                initialDelaySeconds: 30
                periodSeconds: 10

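    # hpa.yaml
    apiVersion: autoscaling/v2
    kind: HorizontalPodAutoscaler
    metadata:
      name: ml-inference-api  # assumed name
      namespace: production
    spec:
      scaleTargetRef:
        apiVersion: apps/v1
        kind: Deployment
        name: ml-inference-api  # assumed: the Deployment defined above
      minReplicas: 3
      maxReplicas: 20
      metrics:
        - type: Resource
          resource:
            name: cpu  # assumed: 70% matches the CPU target used in values.yaml
            target:
              type: Utilization
              averageUtilization: 70
        - type: Resource
          resource:
            name: memory  # assumed
            target:
              type: Utilization
              averageUtilization: 80
        - type: Pods
          pods:
            metric:
              name: http_requests_per_second  # assumed custom metric name
            target:
              type: AverageValue
              averageValue: '100'
      behavior:
        scaleUp:
          stabilizationWindowSeconds: 60
          policies:
            # Add at most 4 pods per minute
            - type: Pods
              value: 4
              periodSeconds: 60
        scaleDown:
          stabilizationWindowSeconds: 300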
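The scaling decision behind the manifest above follows the formula documented for the Horizontal Pod Autoscaler: `desiredReplicas = ceil(currentReplicas × currentMetricValue / targetMetricValue)`, clamped to the configured minimum and maximum. The sketch below reproduces only that core calculation, assuming the default 10% tolerance; it ignores the `behavior` policies, stabilization windows, and pods that are not yet ready. The function name `desired_replicas` is illustrative.

```python
import math


def desired_replicas(
    current_replicas: int,
    current_value: float,
    target_value: float,
    min_replicas: int = 3,
    max_replicas: int = 20,
    tolerance: float = 0.1,  # HPA skips scaling when the ratio is within 10% of 1.0
) -> int:
    """Core HPA calculation: scale proportionally to how far the metric is from target."""
    ratio = current_value / target_value
    if abs(ratio - 1.0) <= tolerance:
        return current_replicas
    desired = math.ceil(current_replicas * ratio)
    # Clamp to the minReplicas/maxReplicas bounds from the manifest.
    return max(min_replicas, min(max_replicas, desired))
```

For example, 3 replicas averaging 90% CPU against a 70% target give `ceil(3 × 90 / 70) = 4` replicas, while an average of 72% falls within the tolerance and leaves the replica count unchanged.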

Helm for Package Management

Helm is the package manager for Kubernetes. It templates complex application deployments for reuse and versioning.

    # helm/my-app/values.yaml
    replicaCount: 3

    image:
      repository: ghcr.io/myorg/my-app
      pullPolicy: IfNotPresent
      tag: 'latest'

    service:
      type: ClusterIP
      port: 80
      targetPort: 8080

    ingress:
      enabled: true
      className: nginx
      annotations:
        cert-manager.io/cluster-issuer: letsencrypt-prod
      hosts:
        - host: api.example.com
          paths:
            - path: /
              pathType: Prefix
      tls:
        - secretName: api-tls
          hosts:
            - api.example.com

    resources:
      requests:
        cpu: 500m
        memory: 512Mi
      limits:
        cpu: 2000m
        memory: 2Gi

    autoscaling:
      enabled: true
      minReplicas: 3
      maxReplicas: 20
      targetCPUUtilizationPercentage: 70

    postgresql:
      enabled: true
      auth:
        database: myapp
        existingSecret: db-credentials

    redis:
      enabled: true
      architecture: replication

4. Monitoring and Observability

The Three Pillars of Observability

| Pillar  | Tools               | Use Case                     |
| ------- | ------------------- | ---------------------------- |
| Metrics | Prometheus, Grafana | Numeric data, dashboards     |
| Logs    | Loki, Elasticsearch | Event records, debugging     |
| Traces  | Jaeger, Tempo       | Distributed request tracking |

Prometheus Alerting Rules

    # prometheus-rules.yaml
    apiVersion: monitoring.coreos.com/v1
    kind: PrometheusRule
    metadata:
      name: ml-service-rules  # assumed name
      namespace: monitoring
    spec:
      groups:
        - name: ml-service.rules
          interval: 30s
          rules:
            # Share of requests answered with a 5xx status over 5 minutes
            - alert: HighErrorRate
              expr: |
                sum(rate(http_requests_total{status=~"5.."}[5m]))
                /
                sum(rate(http_requests_total[5m])) > 0.05
              for: 2m
              labels:
                severity: critical
              annotations:
                summary: 'High error rate detected'
                description: 'Error rate is {{ $value | humanizePercentage }} over the last 5 minutes'

            - alert: SlowResponseTime
              expr: |
                histogram_quantile(0.99,
                  sum(rate(http_request_duration_seconds_bucket[5m])) by (le, service)
                ) > 1.0
              for: 5m
              labels:
                severity: warning
              annotations:
                summary: 'Slow p99 latency'
                description: 'p99 latency is {{ $value }}s for {{ $labels.service }}'

            - alert: PodCrashLooping
              expr: |
                increase(kube_pod_container_status_restarts_total[15m]) > 3
              for: 0m
              labels:
                severity: critical
              annotations:
                summary: 'Pod is crash looping'
                description: 'Pod {{ $labels.namespace }}/{{ $labels.pod }} is crash looping'

            # The "> 0" filter skips containers without a memory limit,
            # which would otherwise divide by zero
            - alert: HighMemoryUsage
              expr: |
                container_memory_usage_bytes
                /
                (container_spec_memory_limit_bytes > 0) > 0.85
              for: 5m
              labels:
                severity: warning
              annotations:
                summary: 'High memory usage detected'
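Two ideas in these rules are easy to miss: an error-rate alert is a ratio of counter increases, and the `for:` clause only lets an alert fire once the condition has held continuously for that long. The sketch below models both in plain Python as a rough illustration. It is not how Prometheus evaluates rules internally, and the names `error_ratio` and `is_firing` are assumptions.

```python
from typing import List, Tuple

CounterSample = Tuple[float, float]  # (total_requests, total_5xx_responses)
Evaluation = Tuple[float, float]     # (timestamp_seconds, error_ratio)


def error_ratio(older: CounterSample, newer: CounterSample) -> float:
    """Share of requests that failed between two readings of monotonic counters."""
    delta_total = newer[0] - older[0]
    delta_errors = newer[1] - older[1]
    return delta_errors / delta_total if delta_total > 0 else 0.0


def is_firing(
    evaluations: List[Evaluation],
    threshold: float = 0.05,
    hold_seconds: float = 120.0,  # mirrors "for: 2m" in HighErrorRate
) -> bool:
    """True when the ratio has stayed above the threshold for at least hold_seconds."""
    breach_started = None
    for timestamp, ratio in evaluations:  # oldest first
        if ratio > threshold:
            if breach_started is None:
                breach_started = timestamp
        else:
            breach_started = None  # any healthy evaluation resets the timer
    if breach_started is None:
        return False
    return evaluations[-1][0] - breach_started >= hold_seconds
```

A single healthy evaluation in the middle of an incident resets the timer, which is why a longer `for:` value reduces noise from short spikes at the cost of slower detection.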

OpenTelemetry Instrumentation in Python

    # instrumentation.py
    from opentelemetry import trace, metrics
    from opentelemetry.sdk.trace import TracerProvider
    from opentelemetry.sdk.trace.export import BatchSpanProcessor
    from opentelemetry.exporter.otlp.proto.grpc.trace_exporter import OTLPSpanExporter
    from opentelemetry.sdk.metrics import MeterProvider
    from opentelemetry.instrumentation.fastapi import FastAPIInstrumentor
    from opentelemetry.instrumentation.requests import RequestsInstrumentor


    def setup_telemetry(service_name: str, otlp_endpoint: str):
        """Configure OpenTelemetry for the service."""
        # Tracer setup
        tracer_provider = TracerProvider()
        otlp_exporter = OTLPSpanExporter(endpoint=otlp_endpoint)
        tracer_provider.add_span_processor(BatchSpanProcessor(otlp_exporter))
        trace.set_tracer_provider(tracer_provider)

        # Metrics setup
        meter_provider = MeterProvider()
        metrics.set_meter_provider(meter_provider)

        return trace.get_tracer(service_name)


    # Apply to FastAPI app
    from fastapi import FastAPI

    app = FastAPI()
    tracer = setup_telemetry("ml-inference-api", "http://otel-collector:4317")

    # Auto-instrumentation
    FastAPIInstrumentor.instrument_app(app)
    RequestsInstrumentor().instrument()


    def run_inference(payload: dict) -> dict:
        """Hypothetical placeholder for the model call, which the original does not show."""
        raise NotImplementedError("plug in the real model here")


    @app.post("/predict")
    async def predict(payload: dict):
        # Wrap the model call in a custom span with searchable attributes
        with tracer.start_as_current_span("model-inference") as span:
            span.set_attribute("model.name", "bert-base")
            span.set_attribute("input.length", len(str(payload)))
            result = run_inference(payload)
            span.set_attribute("prediction.confidence", result["confidence"])
            return result

5. SRE Principles

SLI / SLO / SLA Hierarchy

- **SLI (Service Level Indicator)**: The actual measured metric (e.g., request success rate, latency)

- **SLO (Service Level Objective)**: The target value for an SLI (e.g., 99.9% availability)

- **SLA (Service Level Agreement)**: The contractual commitment to customers (typically looser than the SLO)
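The relationship between the first two terms can be sketched in a few lines: the SLI is what you measure, and the SLO is the threshold you compare it against. This is a minimal illustration assuming a request-based availability SLI; the function names and the 99.9% default are not from any particular library.

```python
def availability_sli(successful_requests: int, total_requests: int) -> float:
    """Measured SLI: the fraction of requests that succeeded."""
    if total_requests == 0:
        return 1.0  # no traffic means nothing failed
    return successful_requests / total_requests


def meets_slo(sli: float, slo: float = 0.999) -> bool:
    """Compare the measured SLI against the target (SLO)."""
    return sli >= slo
```

A service that answered 999,500 of 1,000,000 requests successfully has an SLI of 99.95%, which meets a 99.9% SLO but would miss a 99.99% one.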

Error Budget Calculation

With a 99.9% SLO, the allowable downtime per month (30 days) is:

    Error Budget = 100% - SLO = 0.1%
    Monthly downtime budget = 30 d × 24 h × 60 min × 0.1% = 43.2 minutes

When the error budget is exhausted, feature deployments halt and reliability work takes priority.

| SLO Level | Monthly Downtime Budget (30-day month) |
| --------- | -------------------------------------- |
| 99%       | 7 hours 12 minutes                     |
| 99.9%     | 43 minutes 12 seconds                  |
| 99.99%    | 4 minutes 19 seconds                   |
| 99.999%   | 26 seconds                             |
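The 30-day formula above generalizes to any SLO, and the "halt deployments when the budget is exhausted" policy becomes a simple check. The sketch below shows one way to express both; the function names and the time-based (rather than request-based) budget are assumptions for illustration.

```python
def downtime_budget_minutes(slo_percent: float, days: int = 30) -> float:
    """Allowed downtime for the period, in minutes, for a given SLO."""
    total_minutes = days * 24 * 60
    return total_minutes * (100.0 - slo_percent) / 100.0


def deployments_allowed(slo_percent: float, downtime_minutes_so_far: float, days: int = 30) -> bool:
    """Feature deployments continue only while some error budget remains."""
    remaining = downtime_budget_minutes(slo_percent, days) - downtime_minutes_so_far
    return remaining > 0
```

With a 99.9% SLO, `downtime_budget_minutes(99.9)` evaluates to about 43.2 minutes, so a month that has already seen 50 minutes of downtime would put feature deployments on hold.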

Toil Reduction

**Toil** is manual, repetitive, automatable operational work. Google SRE recommends keeping toil below 50% of total work time.

Strategies for toil reduction:

1. Script repetitive tasks

2. Convert runbooks into automated code

3. Improve alert quality to reduce noise

4. Build self-healing systems

6. AI/ML Workflow Automation

Experiment Tracking with MLflow

    # train.py
    import mlflow
    import mlflow.sklearn
    from sklearn.ensemble import RandomForestClassifier
    from sklearn.metrics import accuracy_score, f1_score

    # X_train, X_test, y_train, y_test are assumed to be prepared earlier in the script

    mlflow.set_tracking_uri("http://mlflow-server:5000")
    mlflow.set_experiment("fraud-detection-v2")

    with mlflow.start_run(run_name="rf-baseline"):
        # Log hyperparameters so the run can be reproduced
        params = {"n_estimators": 100, "max_depth": 10, "random_state": 42}
        mlflow.log_params(params)

        model = RandomForestClassifier(**params)
        model.fit(X_train, y_train)

        y_pred = model.predict(X_test)
        mlflow.log_metric("accuracy", accuracy_score(y_test, y_pred))
        mlflow.log_metric("f1_score", f1_score(y_test, y_pred))

        # Store the model artifact and register it in the Model Registry
        mlflow.sklearn.log_model(model, "model",
                                 registered_model_name="fraud-detector")

ML Pipeline with Argo Workflows

    # ml-pipeline.yaml
    apiVersion: argoproj.io/v1alpha1
    kind: Workflow
    metadata:
      generateName: ml-pipeline-  # assumed name prefix
    spec:
      entrypoint: ml-pipeline
      templates:
        - name: ml-pipeline
          dag:
            tasks:
              # Each task runs the template of the matching step defined below
              - name: data-prep
                template: prepare-data
              - name: train
                template: train-model
                dependencies: [data-prep]
              - name: evaluate
                template: evaluate-model
                dependencies: [train]
              - name: deploy
                template: deploy-model
                dependencies: [evaluate]

        - name: prepare-data
          container:
            image: ghcr.io/myorg/data-prep:latest
            command: [python, prepare_data.py]
            resources:
              requests:
                memory: 4Gi
                cpu: '2'

        - name: train-model
          container:
            image: ghcr.io/myorg/ml-trainer:latest
            command: [python, train.py]
            resources:
              requests:
                memory: 16Gi
                cpu: '8'
                nvidia.com/gpu: '1'
              # Extended resources such as GPUs must be set in limits
              limits:
                nvidia.com/gpu: '1'

        - name: evaluate-model
          container:
            image: ghcr.io/myorg/ml-evaluator:latest
            command: [python, evaluate.py]

        - name: deploy-model
          container:
            image: ghcr.io/myorg/model-deployer:latest
            command: [python, deploy.py]

Dockerfile for a Python ML Service

    # Dockerfile
    FROM python:3.11-slim AS builder

    WORKDIR /app

    # Separate dependency layer for better cache utilization
    COPY requirements.txt .
    RUN pip install --no-cache-dir --user -r requirements.txt

    FROM python:3.11-slim AS runtime

    # Security: run as non-root user
    RUN groupadd -r appuser && useradd -r -g appuser appuser

    WORKDIR /app

    # Copy packages from builder
    COPY --from=builder /root/.local /home/appuser/.local
    COPY --chown=appuser:appuser src/ ./src/
    COPY --chown=appuser:appuser models/ ./models/

    USER appuser

    ENV PATH=/home/appuser/.local/bin:$PATH
    ENV PYTHONUNBUFFERED=1
    ENV PYTHONDONTWRITEBYTECODE=1

    EXPOSE 8080

    # urllib ships with Python, so the check does not depend on requests being installed;
    # urlopen raises on a 4xx/5xx response, which makes the check fail
    HEALTHCHECK --interval=30s --timeout=10s --start-period=5s --retries=3 \
      CMD python -c "import urllib.request; urllib.request.urlopen('http://localhost:8080/health')"

    CMD ["uvicorn", "src.main:app", "--host", "0.0.0.0", "--port", "8080", "--workers", "4"]

7. Security: RBAC and Secrets Management

Kubernetes RBAC

    # rbac.yaml
    apiVersion: v1
    kind: ServiceAccount
    metadata:
      name: ml-inference-sa  # assumed name
      namespace: production
    ---
    # Least-privilege Role
    apiVersion: rbac.authorization.k8s.io/v1
    kind: Role
    metadata:
      name: ml-inference-role  # assumed name
      namespace: production
    rules:
      - apiGroups: ['']
        resources: ['pods', 'services']
        verbs: ['get', 'list', 'watch']
      # Read access to one named Secret only
      - apiGroups: ['']
        resources: ['secrets']
        resourceNames: ['ml-model-secrets']
        verbs: ['get']
    ---
    apiVersion: rbac.authorization.k8s.io/v1
    kind: RoleBinding
    metadata:
      name: ml-inference-rolebinding  # assumed name
      namespace: production
    subjects:
      - kind: ServiceAccount
        name: ml-inference-sa  # assumed: the ServiceAccount above
        namespace: production
    roleRef:
      kind: Role
      name: ml-inference-role  # assumed: the Role above
      apiGroup: rbac.authorization.k8s.io

Secrets Management with HashiCorp Vault

    # vault_client.py
    import os

    import hvac


    def get_secret(secret_path: str) -> dict:
        """Securely retrieve a secret from Vault."""
        # Address and token come from the environment, never from source code
        client = hvac.Client(
            url=os.environ["VAULT_ADDR"],
            token=os.environ["VAULT_TOKEN"]
        )

        if not client.is_authenticated():
            raise RuntimeError("Vault authentication failed")

        # KV v2 wraps the payload in data.data
        secret = client.secrets.kv.v2.read_secret_version(
            path=secret_path,
            mount_point="secret"
        )
        return secret["data"]["data"]

    # In Kubernetes, use the Vault Agent Injector for
    # automatic secret injection via pod annotations

Network Policies

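    # network-policy.yaml
    apiVersion: networking.k8s.io/v1
    kind: NetworkPolicy
    metadata:
      name: ml-inference-api-netpol  # assumed name
      namespace: production
    spec:
      podSelector:
        matchLabels:
          app: ml-inference-api
      policyTypes:
        - Ingress
        - Egress
      ingress:
        # The namespace labels below are assumptions; the source omits them
        - from:
            - namespaceSelector:
                matchLabels:
                  kubernetes.io/metadata.name: ingress-nginx
          ports:
            - protocol: TCP
              port: 8080
      egress:
        - to:
            - namespaceSelector:
                matchLabels:
                  kubernetes.io/metadata.name: database
          ports:
            - protocol: TCP
              port: 5432
        - to:
            - namespaceSelector:
                matchLabels:
                  kubernetes.io/metadata.name: monitoring
          ports:
            - protocol: TCP
              port: 4317 # OTLP gRPC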

8. Incident Management

Incident Response Process

1. **Detection**: Prometheus alert or user report

2. **Triage**: Assess severity (P1/P2/P3)

3. **Communication**: Update status page, notify stakeholders

4. **Mitigation**: Traffic failover, rollback, scale out

5. **Resolution**: Fix root cause

6. **Post-Mortem**: Write a blameless post-mortem

Writing Effective Post-Mortems

A good post-mortem focuses on **system improvement**, not individual blame.

Key elements:

- Detailed timeline of events

- Distinguish root cause from triggering factor

- 5 Whys analysis

- Concrete action items with owners and due dates

Conclusion

DevOps and SRE are not just toolsets — they are a **culture and philosophy**. Automation reduces human error, data-driven decisions improve reliability, and continuous improvement builds trust.

Core principles in summary:

- **Automate first**: All repetitive work should be code

- **Measure everything**: SLI/SLO make goals concrete

- **Fail fast and safely**: Canary deployments minimize blast radius

- **Blameless culture**: Improve systems instead of blaming people

Quiz

**Answer**: Blue-Green maintains two identical production environments and switches all traffic at once, while Canary gradually routes a fraction of traffic to the new version.

**Explanation**: Blue-Green provides instant rollback (just flip the traffic switch) with zero downtime, but requires double the resources. Canary validates the new version with real user traffic while minimizing risk, but requires sophisticated monitoring and traffic management. Tools like Argo Rollouts support both strategies natively.

**Answer**: By default, CPU and memory utilization. With the Custom Metrics API, HPA can also scale on application-level metrics like RPS (requests per second) or queue depth.

**Explanation**: The `autoscaling/v2` API defines the metric source types `Resource`, `ContainerResource`, `Pods`, `Object`, and `External`. With a 70% CPU target, HPA adds pods when average CPU utilization across the existing pods, measured relative to their CPU requests, exceeds that threshold. Installing the Prometheus Adapter lets you expose Prometheus metrics to HPA for more meaningful scaling decisions.

**Answer**: The Error Budget represents the allowable failure margin derived from the SLO. It serves as the mechanism balancing reliability investment against feature delivery velocity.

**Explanation**: If the SLO is 99.9%, the error budget is 0.1% (roughly 43 minutes of downtime per month). When the budget is ample, teams can deploy freely and experiment. When it is exhausted, new deployments freeze until reliability work replenishes it. This creates a data-driven negotiation between development speed and stability that replaces subjective arguments.

**Answer**: Pull-based collection lets Prometheus centrally manage scrape targets, reveals a down target as soon as a scrape fails, and requires no inbound connections from the monitored services to the monitoring server.

**Explanation**: Push-based systems (like StatsD or InfluxDB line protocol) require each service to know the address of the metrics server and risk data loss on network issues. Pull-based Prometheus manages targets via configuration files or service discovery, making it easy to add and remove services dynamically. For short-lived batch jobs that exit before Prometheus can scrape them, the Pushgateway provides a bridge.
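In the pull model, each service only has to expose its current counters as text on an HTTP endpoint such as `/metrics`, and Prometheus fetches them on its own schedule. The sketch below renders a counter in the Prometheus text exposition format using only the standard library. It is a simplified illustration (no label-value escaping, counters only), and `render_counter` is a hypothetical helper; in practice the official `prometheus_client` library handles this.

```python
from typing import Dict, Tuple

Labels = Tuple[Tuple[str, str], ...]  # e.g. (("status", "200"),)


def render_counter(name: str, help_text: str, samples: Dict[Labels, float]) -> str:
    """Render one counter metric in the Prometheus text exposition format."""
    lines = [f"# HELP {name} {help_text}", f"# TYPE {name} counter"]
    for labels, value in sorted(samples.items()):
        label_str = ",".join(f'{key}="{val}"' for key, val in labels)
        selector = f"{{{label_str}}}" if label_str else ""
        lines.append(f"{name}{selector} {value}")
    # The format requires a trailing newline after the last sample.
    return "\n".join(lines) + "\n"
```

Serving this string from a `/metrics` route is all a service needs to become a scrape target; it never has to know where Prometheus runs.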

**Answer**: GitOps uses Git as the single source of truth and continuously reconciles declared state, while traditional CI/CD pipelines imperatively execute deployment commands directly.

**Explanation**: Traditional CI/CD (Jenkins, GitHub Actions) runs `kubectl apply` or `helm upgrade` directly from the pipeline. GitOps tools (ArgoCD, Flux) continuously compare the Git-declared desired state against the live cluster state and auto-sync any drift. GitOps enables drift detection, automatic rollback, and full audit trails without granting the CI/CD system direct cluster credentials — a significant security improvement.