
DevOps/SRE Complete Guide: From CI/CD to Kubernetes and MLOps

Introduction

In modern software engineering, DevOps and SRE (Site Reliability Engineering) are no longer optional; they are foundational. Netflix deploys thousands of times per day, and Google delivers 99.99% availability to billions of users. Behind these feats lie rigorously automated pipelines and a data-driven operational philosophy.

This guide walks through DevOps/SRE fundamentals, hands-on Kubernetes operations, and AI/ML workflow automation, with practical code throughout.


1. DevOps Fundamentals: CI/CD Pipelines

What Is CI/CD?

CI (Continuous Integration) is the practice of merging developer changes frequently, with automated builds and tests validating every integration. CD (Continuous Delivery/Deployment) automatically releases verified code toward production.

| Stage | Purpose | Automation Scope |
|---|---|---|
| CI | Validate code integration | Build, test, lint |
| CD (Delivery) | Prepare releases | Automated through staging |
| CD (Deployment) | Release automatically | Automated through production |

CI/CD with GitHub Actions

# .github/workflows/ci-cd.yml
name: CI/CD Pipeline

on:
  push:
    branches: [main, develop]
  pull_request:
    branches: [main]

env:
  REGISTRY: ghcr.io
  IMAGE_NAME: ${{ github.repository }}

jobs:
  test:
    name: Test & Lint
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4

      - name: Set up Python
        uses: actions/setup-python@v5
        with:
          python-version: '3.11'
          cache: 'pip'

      - name: Install dependencies
        run: |
          pip install -r requirements.txt
          pip install pytest pytest-cov flake8

      - name: Lint with flake8
        run: flake8 src/ --max-line-length=88

      - name: Run tests
        run: pytest tests/ --cov=src --cov-report=xml

      - name: Upload coverage
        uses: codecov/codecov-action@v4
        with:
          file: coverage.xml

  build:
    name: Build & Push Docker Image
    runs-on: ubuntu-latest
    needs: test
    if: github.ref == 'refs/heads/main'
    permissions:
      contents: read
      packages: write

    steps:
      - uses: actions/checkout@v4

      - name: Log in to Container Registry
        uses: docker/login-action@v3
        with:
          registry: ${{ env.REGISTRY }}
          username: ${{ github.actor }}
          password: ${{ secrets.GITHUB_TOKEN }}

      - name: Extract metadata
        id: meta
        uses: docker/metadata-action@v5
        with:
          images: ${{ env.REGISTRY }}/${{ env.IMAGE_NAME }}
          tags: |
            type=sha,prefix=sha-
            type=ref,event=branch
            type=semver,pattern={{version}}

      - name: Build and push
        uses: docker/build-push-action@v5
        with:
          context: .
          push: true
          tags: ${{ steps.meta.outputs.tags }}
          cache-from: type=gha
          cache-to: type=gha,mode=max

  deploy:
    name: Deploy to Kubernetes
    runs-on: ubuntu-latest
    needs: build
    environment: production

    steps:
      - uses: actions/checkout@v4

      - name: Configure kubectl
        uses: azure/k8s-set-context@v3
        with:
          kubeconfig: ${{ secrets.KUBECONFIG }}

      - name: Deploy with Helm
        run: |
          helm upgrade --install my-app ./helm/my-app \
            --namespace production \
            --set image.tag=${{ github.sha }} \
            --wait --timeout=5m

Deployment Strategy Comparison

Blue-Green deployment: Maintain two identical production environments (Blue and Green). Deploy the new version to Green, then switch all traffic at once. Rollback is instantaneous, but resource requirements double.

Canary deployment: Route a small fraction of traffic (e.g., 5%) to the new version and expand gradually. This validates the release with real users while minimizing risk.

# Argo Rollouts canary deployment example
apiVersion: argoproj.io/v1alpha1
kind: Rollout
metadata:
  name: my-app
spec:
  replicas: 10
  strategy:
    canary:
      steps:
        - setWeight: 10
        - pause: { duration: 5m }
        - setWeight: 30
        - pause: { duration: 10m }
        - setWeight: 60
        - pause: { duration: 10m }
        - setWeight: 100
      canaryService: my-app-canary
      stableService: my-app-stable
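Conceptually, each setWeight step above turns per-request routing into a weighted coin flip. A minimal Python sketch of that idea (the function names and the injectable `rng` parameter are illustrative, not part of Argo Rollouts):

```python
import random

def route_request(canary_weight: int, rng=random.random) -> str:
    """Route one request to 'canary' or 'stable' given the canary
    weight as a percentage of traffic (0-100)."""
    return "canary" if rng() * 100 < canary_weight else "stable"

def split_traffic(requests: int, canary_weight: int, rng=random.random) -> dict:
    """Simulate routing `requests` requests at a fixed canary weight."""
    counts = {"canary": 0, "stable": 0}
    for _ in range(requests):
        counts[route_request(canary_weight, rng)] += 1
    return counts
```

In production this split is performed by the service mesh or ingress controller, not application code; the sketch only illustrates what a weight of 10 means for individual requests.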

2. GitOps and Infrastructure as Code

GitOps Principles

GitOps is an operating model that uses Git as the Single Source of Truth.

  • Declarative: system state is declared as code
  • Versioned: every change is tracked in Git history
  • Automated: a Git change triggers automatic synchronization
  • Auditable: PR-based changes record who changed what, and why

The leading tools are ArgoCD and Flux.
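The reconciliation loop at the heart of ArgoCD and Flux can be sketched in a few lines of Python: compare the desired state in Git with the live cluster state, then create, update, or prune. This is a simplified in-memory model (the dict-based state and function names are hypothetical), not either tool's actual implementation:

```python
def diff_state(desired: dict, live: dict) -> dict:
    """Compare desired (Git) state with live (cluster) state.
    Returns the actions a reconciler would take."""
    actions = {"create": [], "update": [], "delete": []}
    for name, spec in desired.items():
        if name not in live:
            actions["create"].append(name)
        elif live[name] != spec:
            actions["update"].append(name)   # drift detected
    for name in live:
        if name not in desired:
            actions["delete"].append(name)   # pruning
    return actions

def reconcile(desired: dict, live: dict) -> dict:
    """Apply the diff so live state converges to desired state."""
    actions = diff_state(desired, live)
    for name in actions["create"] + actions["update"]:
        live[name] = desired[name]
    for name in actions["delete"]:
        del live[name]
    return live
```

Running this loop continuously is what gives GitOps its drift detection: any manual change to the cluster shows up as an "update" on the next pass and is reverted.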

Infrastructure as Code with Terraform

# main.tf - provision an AWS EKS cluster
terraform {
  required_providers {
    aws = {
      source  = "hashicorp/aws"
      version = "~> 5.0"
    }
  }
  backend "s3" {
    bucket = "my-terraform-state"
    key    = "prod/eks/terraform.tfstate"
    region = "ap-northeast-2"
  }
}

module "eks" {
  source  = "terraform-aws-modules/eks/aws"
  version = "~> 20.0"

  cluster_name    = "prod-cluster"
  cluster_version = "1.29"

  vpc_id     = module.vpc.vpc_id
  subnet_ids = module.vpc.private_subnets

  eks_managed_node_groups = {
    general = {
      instance_types = ["m5.xlarge"]
      min_size       = 2
      max_size       = 10
      desired_size   = 3
    }
    gpu = {
      instance_types = ["g4dn.xlarge"]
      min_size       = 0
      max_size       = 5
      desired_size   = 1
      taints = [{
        key    = "nvidia.com/gpu"
        value  = "true"
        effect = "NO_SCHEDULE"
      }]
    }
  }
}

3. Kubernetes Mastery

Understanding the Core Resources

The essential Kubernetes objects at a glance:

| Resource | Role |
|---|---|
| Pod | Unit of execution; one or more containers |
| Deployment | Manages Pod replicas and rolling updates |
| Service | Stable network endpoint for a set of Pods |
| HPA | Automatic horizontal scaling based on CPU/memory |
| ConfigMap | Externalized configuration and environment variables |
| Secret | Sensitive data (passwords, tokens) |
| Ingress | External HTTP traffic routing |

Production Deployment + HPA Manifests

# deployment.yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: ml-inference-api
  namespace: production
  labels:
    app: ml-inference-api
    version: v1
spec:
  replicas: 3
  selector:
    matchLabels:
      app: ml-inference-api
  strategy:
    type: RollingUpdate
    rollingUpdate:
      maxSurge: 1
      maxUnavailable: 0
  template:
    metadata:
      labels:
        app: ml-inference-api
        version: v1
      annotations:
        prometheus.io/scrape: 'true'
        prometheus.io/port: '8080'
        prometheus.io/path: '/metrics'
    spec:
      containers:
        - name: api
          image: ghcr.io/myorg/ml-inference-api:sha-abc123
          ports:
            - containerPort: 8080
          env:
            - name: MODEL_NAME
              valueFrom:
                configMapKeyRef:
                  name: ml-config
                  key: model_name
            - name: DB_PASSWORD
              valueFrom:
                secretKeyRef:
                  name: db-secret
                  key: password
          resources:
            requests:
              cpu: '500m'
              memory: '512Mi'
            limits:
              cpu: '2000m'
              memory: '2Gi'
          readinessProbe:
            httpGet:
              path: /health
              port: 8080
            initialDelaySeconds: 10
            periodSeconds: 5
          livenessProbe:
            httpGet:
              path: /health
              port: 8080
            initialDelaySeconds: 30
            periodSeconds: 10
---
# hpa.yaml
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: ml-inference-api-hpa
  namespace: production
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: ml-inference-api
  minReplicas: 3
  maxReplicas: 20
  metrics:
    - type: Resource
      resource:
        name: cpu
        target:
          type: Utilization
          averageUtilization: 70
    - type: Resource
      resource:
        name: memory
        target:
          type: Utilization
          averageUtilization: 80
    - type: Pods
      pods:
        metric:
          name: http_requests_per_second
        target:
          type: AverageValue
          averageValue: '100'
  behavior:
    scaleUp:
      stabilizationWindowSeconds: 60
      policies:
        - type: Pods
          value: 4
          periodSeconds: 60
    scaleDown:
      stabilizationWindowSeconds: 300

Package Management with Helm Charts

Helm is the package manager for Kubernetes. It manages complex application deployments as reusable, versioned templates.

# helm/my-app/values.yaml
replicaCount: 3

image:
  repository: ghcr.io/myorg/my-app
  pullPolicy: IfNotPresent
  tag: 'latest'

service:
  type: ClusterIP
  port: 80
  targetPort: 8080

ingress:
  enabled: true
  className: nginx
  annotations:
    cert-manager.io/cluster-issuer: letsencrypt-prod
  hosts:
    - host: api.example.com
      paths:
        - path: /
          pathType: Prefix
  tls:
    - secretName: api-tls
      hosts:
        - api.example.com

resources:
  requests:
    cpu: 500m
    memory: 512Mi
  limits:
    cpu: 2000m
    memory: 2Gi

autoscaling:
  enabled: true
  minReplicas: 3
  maxReplicas: 20
  targetCPUUtilizationPercentage: 70

postgresql:
  enabled: true
  auth:
    database: myapp
    existingSecret: db-credentials

redis:
  enabled: true
  architecture: replication

4. Monitoring & Observability

The Three Pillars of Observability

| Pillar | Tools | Use Case |
|---|---|---|
| Metrics | Prometheus, Grafana | Numeric data, dashboards |
| Logs | Loki, Elasticsearch | Event records, debugging |
| Traces | Jaeger, Tempo | Distributed request tracing |

Prometheus Alerting Rules

# prometheus-rules.yaml
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: ml-service-alerts
  namespace: monitoring
spec:
  groups:
    - name: ml-service.rules
      interval: 30s
      rules:
        - alert: HighErrorRate
          expr: |
            sum(rate(http_requests_total{status=~"5.."}[5m]))
            /
            sum(rate(http_requests_total[5m])) > 0.05
          for: 2m
          labels:
            severity: critical
          annotations:
            summary: 'High error rate detected'
            description: 'Error rate is {{ $value | humanizePercentage }} for the last 5 minutes'

        - alert: SlowResponseTime
          expr: |
            histogram_quantile(0.99,
              sum(rate(http_request_duration_seconds_bucket[5m])) by (le, service)
            ) > 1.0
          for: 5m
          labels:
            severity: warning
          annotations:
            summary: 'Slow p99 latency'
            description: 'p99 latency is {{ $value }}s for {{ $labels.service }}'

        - alert: PodCrashLooping
          expr: |
            increase(kube_pod_container_status_restarts_total[15m]) > 3
          for: 0m
          labels:
            severity: critical
          annotations:
            summary: 'Pod crash looping'
            description: 'Pod {{ $labels.namespace }}/{{ $labels.pod }} is crash looping'

        - alert: HighMemoryUsage
          expr: |
            container_memory_usage_bytes
            /
            container_spec_memory_limit_bytes > 0.85
          for: 5m
          labels:
            severity: warning
          annotations:
            summary: 'High memory usage'

OpenTelemetry Instrumentation

# instrumentation.py
from opentelemetry import trace, metrics
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import BatchSpanProcessor
from opentelemetry.exporter.otlp.proto.grpc.trace_exporter import OTLPSpanExporter
from opentelemetry.sdk.metrics import MeterProvider
from opentelemetry.instrumentation.fastapi import FastAPIInstrumentor
from opentelemetry.instrumentation.requests import RequestsInstrumentor

def setup_telemetry(service_name: str, otlp_endpoint: str):
    """OpenTelemetry 설정"""
    # 트레이서 설정
    tracer_provider = TracerProvider()
    otlp_exporter = OTLPSpanExporter(endpoint=otlp_endpoint)
    tracer_provider.add_span_processor(BatchSpanProcessor(otlp_exporter))
    trace.set_tracer_provider(tracer_provider)

    # Metrics setup
    meter_provider = MeterProvider()
    metrics.set_meter_provider(meter_provider)

    return trace.get_tracer(service_name)

# Apply to a FastAPI app
from fastapi import FastAPI

app = FastAPI()
tracer = setup_telemetry("ml-inference-api", "http://otel-collector:4317")

# Auto-instrumentation
FastAPIInstrumentor.instrument_app(app)
RequestsInstrumentor().instrument()

@app.post("/predict")
async def predict(payload: dict):
    with tracer.start_as_current_span("model-inference") as span:
        span.set_attribute("model.name", "bert-base")
        span.set_attribute("input.length", len(str(payload)))

        # Model inference logic (run_inference defined elsewhere)
        result = run_inference(payload)

        span.set_attribute("prediction.confidence", result["confidence"])
        return result

5. SRE Principles

The SLI / SLO / SLA Hierarchy

  • SLI (Service Level Indicator): the metric actually measured for a service (e.g., request success rate, latency)
  • SLO (Service Level Objective): the internal target for an SLI (e.g., 99.9% availability)
  • SLA (Service Level Agreement): the level contractually agreed with customers (usually looser than the SLO)
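As a concrete example, a request-based availability SLI is simply the ratio of good events to all events; a small sketch (the function names are illustrative):

```python
def availability_sli(good_events: int, total_events: int) -> float:
    """Availability SLI: good events / all events, as a percentage."""
    if total_events == 0:
        return 100.0  # no traffic: conventionally treated as meeting the objective
    return good_events / total_events * 100

def meets_slo(sli_percent: float, slo_percent: float) -> bool:
    """True while the measured SLI is at or above the objective."""
    return sli_percent >= slo_percent
```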

Calculating the Error Budget

With a 99.9% SLO, the allowed downtime over a 30-day month is:

Error Budget = 100% - SLO = 0.1%
Monthly downtime budget = 30 days x 24 hours x 60 minutes x 0.1% ≈ 43.2 minutes

When the error budget is exhausted, new feature deployments stop and the team focuses on reliability work. This is the core mechanism of SRE.

| SLO Level | Monthly Downtime Budget |
|---|---|
| 99% | 7 hours 18 minutes |
| 99.9% | 43 minutes 48 seconds |
| 99.99% | 4 minutes 22 seconds |
| 99.999% | 26 seconds |

(The table assumes an average month of 365/12 ≈ 30.44 days, which is why 99.9% yields 43.8 minutes here rather than the 43.2 minutes of the 30-day example above.)
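The downtime budgets above follow directly from the SLO; a small helper reproduces the arithmetic (a 30-day month gives 43.2 minutes at 99.9%, while an average 365/12-day month gives 43 minutes 48 seconds):

```python
def monthly_downtime_budget(slo_percent: float, days: float = 30) -> float:
    """Allowed downtime in minutes over a `days`-day window for a given SLO.

    E.g. a 99.9% SLO leaves an error budget of 0.1%, i.e. 43.2 minutes
    in a 30-day month."""
    error_budget = (100.0 - slo_percent) / 100.0
    return days * 24 * 60 * error_budget
```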

Reducing Toil

Toil is manual, repetitive, automatable operational work. Google SRE recommends keeping toil under 50% of an engineer's total work time.

Ways to reduce toil:

  1. Script and automate repetitive tasks
  2. Convert runbooks into automation code
  3. Improve alert quality to cut noise
  4. Build self-healing systems
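The last item, self-healing, often boils down to a check-then-remediate loop. A minimal sketch with injected callables (e.g., an HTTP health probe and a pod-restart action; all names here are illustrative):

```python
def self_heal(check, remediate, max_attempts: int = 3) -> bool:
    """Run a health check; on failure invoke the remediation action and
    re-check, up to `max_attempts` times. Returns the final health status.
    `check` and `remediate` are injected callables, e.g. an HTTP probe
    and a pod restart."""
    for _ in range(max_attempts):
        if check():
            return True
        remediate()
    return check()
```

Kubernetes liveness probes implement exactly this pattern for containers; application-level variants handle failures the kubelet cannot see, such as a stuck downstream connection pool.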

6. AI/ML Workflow Automation

Experiment Tracking with MLflow

# train.py
import mlflow
import mlflow.sklearn
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score, f1_score
from sklearn.model_selection import train_test_split

mlflow.set_tracking_uri("http://mlflow-server:5000")
mlflow.set_experiment("fraud-detection-v2")

# Example data; substitute your real feature pipeline here
X, y = make_classification(n_samples=1000, n_features=20, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

with mlflow.start_run(run_name="rf-baseline"):
    # Log hyperparameters
    params = {"n_estimators": 100, "max_depth": 10, "random_state": 42}
    mlflow.log_params(params)

    # Train the model
    model = RandomForestClassifier(**params)
    model.fit(X_train, y_train)

    # Log metrics
    y_pred = model.predict(X_test)
    mlflow.log_metric("accuracy", accuracy_score(y_test, y_pred))
    mlflow.log_metric("f1_score", f1_score(y_test, y_pred))

    # Register the model
    mlflow.sklearn.log_model(model, "model",
        registered_model_name="fraud-detector")

ML Pipelines with Argo Workflows

# ml-pipeline.yaml
apiVersion: argoproj.io/v1alpha1
kind: Workflow
metadata:
  name: ml-training-pipeline
spec:
  entrypoint: ml-pipeline
  templates:
    - name: ml-pipeline
      dag:
        tasks:
          - name: data-prep
            template: prepare-data
          - name: train
            template: train-model
            dependencies: [data-prep]
          - name: evaluate
            template: evaluate-model
            dependencies: [train]
          - name: deploy
            template: deploy-model
            dependencies: [evaluate]

    - name: prepare-data
      container:
        image: ghcr.io/myorg/data-prep:latest
        command: [python, prepare_data.py]
        resources:
          requests:
            memory: 4Gi
            cpu: '2'

    - name: train-model
      container:
        image: ghcr.io/myorg/ml-trainer:latest
        command: [python, train.py]
        resources:
          requests:
            memory: 16Gi
            cpu: '8'
            nvidia.com/gpu: '1'

    - name: evaluate-model
      container:
        image: ghcr.io/myorg/ml-evaluator:latest
        command: [python, evaluate.py]

    - name: deploy-model
      container:
        image: ghcr.io/myorg/model-deployer:latest
        command: [python, deploy.py]

Dockerfile for a Python ML Service

# Dockerfile
FROM python:3.11-slim AS builder

WORKDIR /app

# Separate dependency-install layer (leverages build cache)
COPY requirements.txt .
RUN pip install --no-cache-dir --user -r requirements.txt

FROM python:3.11-slim AS runtime

# Security: run without root privileges
RUN groupadd -r appuser && useradd -r -g appuser appuser

WORKDIR /app

# Copy packages from the builder stage
COPY --from=builder /root/.local /home/appuser/.local
COPY --chown=appuser:appuser src/ ./src/
COPY --chown=appuser:appuser models/ ./models/

USER appuser

ENV PATH=/home/appuser/.local/bin:$PATH
ENV PYTHONUNBUFFERED=1
ENV PYTHONDONTWRITEBYTECODE=1

EXPOSE 8080

HEALTHCHECK --interval=30s --timeout=10s --start-period=5s --retries=3 \
  CMD python -c "import requests; requests.get('http://localhost:8080/health').raise_for_status()"

CMD ["uvicorn", "src.main:app", "--host", "0.0.0.0", "--port", "8080", "--workers", "4"]

7. Security: RBAC and Secrets Management

Kubernetes RBAC

# rbac.yaml
# Create the ServiceAccount
apiVersion: v1
kind: ServiceAccount
metadata:
  name: ml-service-account
  namespace: production
---
# Least-privilege Role
apiVersion: rbac.authorization.k8s.io/v1
kind: Role
metadata:
  name: ml-service-role
  namespace: production
rules:
  - apiGroups: ['']
    resources: ['pods', 'services']
    verbs: ['get', 'list', 'watch']
  - apiGroups: ['']
    resources: ['secrets']
    resourceNames: ['ml-model-secrets']
    verbs: ['get']
---
apiVersion: rbac.authorization.k8s.io/v1
kind: RoleBinding
metadata:
  name: ml-service-rolebinding
  namespace: production
subjects:
  - kind: ServiceAccount
    name: ml-service-account
    namespace: production
roleRef:
  kind: Role
  apiGroup: rbac.authorization.k8s.io
  name: ml-service-role

Secret Management with HashiCorp Vault

# vault_client.py
import hvac
import os

def get_secret(secret_path: str) -> dict:
    """Vault에서 시크릿을 안전하게 조회합니다."""
    client = hvac.Client(
        url=os.environ["VAULT_ADDR"],
        token=os.environ["VAULT_TOKEN"]
    )

    if not client.is_authenticated():
        raise RuntimeError("Vault 인증 실패")

    secret = client.secrets.kv.v2.read_secret_version(
        path=secret_path,
        mount_point="secret"
    )
    return secret["data"]["data"]

# In Kubernetes, prefer the Vault Agent Injector:
# secrets are injected automatically via Pod annotations.

Traffic Control with Network Policies

# network-policy.yaml
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: ml-service-netpol
  namespace: production
spec:
  podSelector:
    matchLabels:
      app: ml-inference-api
  policyTypes:
    - Ingress
    - Egress
  ingress:
    - from:
        - namespaceSelector:
            matchLabels:
              name: ingress-nginx
      ports:
        - protocol: TCP
          port: 8080
  egress:
    - to:
        - namespaceSelector:
            matchLabels:
              name: database
      ports:
        - protocol: TCP
          port: 5432
    - to:
        - namespaceSelector:
            matchLabels:
              name: monitoring
      ports:
        - protocol: TCP
          port: 4317 # OTLP gRPC

8. Incident Management

The Incident Response Process

  1. Detection: Prometheus alert or user report
  2. Triage: assess severity (P1/P2/P3)
  3. Communication: update the status page, notify stakeholders
  4. Mitigation: shift traffic, roll back, scale out
  5. Resolution: fix the root cause
  6. Post-Mortem: write a blameless post-mortem

Writing an Effective Post-Mortem

A good post-mortem focuses on improving the system, not assigning individual blame.

  • Record a detailed timeline
  • Distinguish the root cause from the trigger
  • Apply 5 Whys analysis
  • Define concrete action items (owner + deadline)

Wrapping Up

DevOps/SRE is not merely a toolbox; it is a culture and a philosophy. Automate to reduce human error, decide with data, and improve continuously to raise system reliability.

The core principles in summary:

  • Automation first: every repetitive task becomes code
  • Measurability: make goals explicit with SLIs and SLOs
  • Fail fast: minimize risk with canary deployments
  • Blameless culture: improve the system, don't blame people

Quiz

Q1. In a CI/CD pipeline, what is the difference between Blue-Green and Canary deployments?

Answer: Blue-Green maintains two identical production environments and switches all traffic at once, while Canary shifts a fraction of traffic to the new version and expands it gradually.

Explanation: Blue-Green offers instantaneous rollback (just a traffic switch) and zero downtime, but requires double the resources. Canary validates the new version against real user traffic while minimizing risk, at the cost of more complex monitoring. Tools like Argo Rollouts support both strategies.

Q2. Which metrics does the Kubernetes HPA scale on?

Answer: By default, CPU and memory utilization; via the Custom Metrics API it can also use custom metrics such as requests per second (RPS) or queue depth.

Explanation: The HPA (HorizontalPodAutoscaler) v2 API supports four metric types: resource, pods, object, and external. With a 70% CPU target, the HPA increases the Pod count when average CPU exceeds that target. Installing the Prometheus Adapter lets you drive the HPA from Prometheus metrics.
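The scaling decision itself follows the documented HPA formula, desiredReplicas = ceil(currentReplicas × currentMetricValue / targetMetricValue), clamped to the configured bounds:

```python
import math

def hpa_desired_replicas(current_replicas: int,
                         current_metric: float,
                         target_metric: float,
                         min_replicas: int = 1,
                         max_replicas: int = 10) -> int:
    """Kubernetes HPA scaling formula:
    desired = ceil(current * currentMetric / targetMetric),
    clamped to [minReplicas, maxReplicas]."""
    desired = math.ceil(current_replicas * current_metric / target_metric)
    return max(min_replicas, min(max_replicas, desired))
```

For example, 3 replicas at 90% average CPU against a 70% target yields ceil(3 × 90 / 70) = 4 replicas. (The real controller also applies tolerances and the stabilization windows shown in the hpa.yaml earlier.)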

Q3. What is the role of the Error Budget in SRE?

Answer: The Error Budget is the amount of failure the SLO permits; it is the mechanism that balances reliability against feature velocity.

Explanation: With a 99.9% SLO, the Error Budget is 0.1%. While budget remains, teams can ship new features and run experiments. Once it is exhausted, new deployments freeze and reliability work takes priority. This lets development and operations negotiate release pace with data.

Q4. What are the advantages of Prometheus's pull-based metric collection?

Answer: With the pull model, Prometheus manages scrape targets centrally, so configuration stays simple; a down target is detected immediately; and, from a security standpoint, services need no inbound firewall rules toward a metrics server.

Explanation: In a push model (StatsD, InfluxDB, etc.), every service must know the metrics server's address, and data can be lost during network issues. With pull, Prometheus manages targets via configuration files or service discovery, making it easy to add and remove services. For very short-lived batch jobs, use the Pushgateway instead.
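The down-target detection advantage is visible even in a toy pull loop: the scraper itself notices an unreachable target and records up = 0, mirroring Prometheus's synthetic up metric (the `fetch` callable here is an illustrative stand-in for an HTTP scrape):

```python
def scrape_targets(targets, fetch) -> dict:
    """One Prometheus-style pull cycle: poll every registered target and
    record its metrics, marking unreachable targets as down ('up' == 0).
    `fetch` is an injected callable returning a metrics dict or raising."""
    results = {}
    for target in targets:
        try:
            metrics = fetch(target)
            results[target] = {"up": 1, **metrics}
        except Exception:
            results[target] = {"up": 0}   # down target detected immediately
    return results
```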

Q5. How does GitOps differ from traditional CI/CD?

Answer: GitOps uses Git as the single source of truth and continuously reconciles declarative state, whereas traditional CI/CD pipelines execute deployments imperatively.

Explanation: In traditional CI/CD, the pipeline (Jenkins, GitHub Actions) runs kubectl apply or helm upgrade directly. GitOps (ArgoCD, Flux) continuously compares the declarative state in Git with the cluster's actual state and synchronizes automatically. GitOps enables drift detection, automatic rollback, and a complete audit trail, and it improves security because the CI/CD system no longer needs cluster credentials.

DevOps/SRE Complete Guide: From CI/CD to Kubernetes and MLOps

Introduction

In modern software engineering, DevOps and SRE (Site Reliability Engineering) are no longer optional — they are foundational. Netflix deploys thousands of times per day. Google guarantees 99.99% availability to billions of users. Behind these feats lies rigorously automated pipelines and a data-driven operational philosophy.

This guide takes you through DevOps/SRE fundamentals, Kubernetes operations, and AI/ML workflow automation with practical, production-ready code.


1. DevOps Fundamentals: CI/CD Pipelines

What is CI/CD?

CI (Continuous Integration) is the practice of frequently merging developer code changes with automated build and test validation. CD (Continuous Delivery/Deployment) automatically delivers verified code to production.

StagePurposeAutomation Scope
CICode integration validationBuild, test, lint
CD (Delivery)Release preparationAutomated to staging
CD (Deployment)Automatic releaseAutomated to production

CI/CD with GitHub Actions

# .github/workflows/ci-cd.yml
name: CI/CD Pipeline

on:
  push:
    branches: [main, develop]
  pull_request:
    branches: [main]

env:
  REGISTRY: ghcr.io
  IMAGE_NAME: ${{ github.repository }}

jobs:
  test:
    name: Test & Lint
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4

      - name: Set up Python
        uses: actions/setup-python@v5
        with:
          python-version: '3.11'
          cache: 'pip'

      - name: Install dependencies
        run: |
          pip install -r requirements.txt
          pip install pytest pytest-cov flake8

      - name: Lint with flake8
        run: flake8 src/ --max-line-length=88

      - name: Run tests
        run: pytest tests/ --cov=src --cov-report=xml

      - name: Upload coverage
        uses: codecov/codecov-action@v4
        with:
          file: coverage.xml

  build:
    name: Build & Push Docker Image
    runs-on: ubuntu-latest
    needs: test
    if: github.ref == 'refs/heads/main'
    permissions:
      contents: read
      packages: write

    steps:
      - uses: actions/checkout@v4

      - name: Log in to Container Registry
        uses: docker/login-action@v3
        with:
          registry: ${{ env.REGISTRY }}
          username: ${{ github.actor }}
          password: ${{ secrets.GITHUB_TOKEN }}

      - name: Extract metadata
        id: meta
        uses: docker/metadata-action@v5
        with:
          images: ${{ env.REGISTRY }}/${{ env.IMAGE_NAME }}
          tags: |
            type=sha,prefix=sha-
            type=ref,event=branch
            type=semver,pattern={{version}}

      - name: Build and push
        uses: docker/build-push-action@v5
        with:
          context: .
          push: true
          tags: ${{ steps.meta.outputs.tags }}
          cache-from: type=gha
          cache-to: type=gha,mode=max

  deploy:
    name: Deploy to Kubernetes
    runs-on: ubuntu-latest
    needs: build
    environment: production

    steps:
      - uses: actions/checkout@v4

      - name: Configure kubectl
        uses: azure/k8s-set-context@v3
        with:
          kubeconfig: ${{ secrets.KUBECONFIG }}

      - name: Deploy with Helm
        run: |
          helm upgrade --install my-app ./helm/my-app \
            --namespace production \
            --set image.tag=${{ github.sha }} \
            --wait --timeout=5m

Deployment Strategy Comparison

Blue-Green Deployment: Maintain two identical production environments (Blue/Green). Deploy the new version to Green, then switch all traffic at once. Rollback is instantaneous but requires double the resources.

Canary Deployment: Route a fraction of traffic (e.g., 5%) to the new version and gradually increase. Validates with real users while minimizing risk.

# Argo Rollouts Canary Strategy
apiVersion: argoproj.io/v1alpha1
kind: Rollout
metadata:
  name: my-app
spec:
  replicas: 10
  strategy:
    canary:
      steps:
        - setWeight: 10
        - pause: { duration: 5m }
        - setWeight: 30
        - pause: { duration: 10m }
        - setWeight: 60
        - pause: { duration: 10m }
        - setWeight: 100
      canaryService: my-app-canary
      stableService: my-app-stable

2. GitOps and Infrastructure as Code

GitOps Principles

GitOps uses Git as the Single Source of Truth for all operational state.

  • Declarative: System state is described as code
  • Versioned: All changes tracked in Git history
  • Automated: Git changes trigger automatic reconciliation
  • Auditable: PR-based workflow records who changed what and why

The leading tools are ArgoCD and Flux.

Infrastructure as Code with Terraform

# main.tf - Provision AWS EKS Cluster
terraform {
  required_providers {
    aws = {
      source  = "hashicorp/aws"
      version = "~> 5.0"
    }
  }
  backend "s3" {
    bucket = "my-terraform-state"
    key    = "prod/eks/terraform.tfstate"
    region = "ap-northeast-2"
  }
}

module "eks" {
  source  = "terraform-aws-modules/eks/aws"
  version = "~> 20.0"

  cluster_name    = "prod-cluster"
  cluster_version = "1.29"

  vpc_id     = module.vpc.vpc_id
  subnet_ids = module.vpc.private_subnets

  eks_managed_node_groups = {
    general = {
      instance_types = ["m5.xlarge"]
      min_size       = 2
      max_size       = 10
      desired_size   = 3
    }
    gpu = {
      instance_types = ["g4dn.xlarge"]
      min_size       = 0
      max_size       = 5
      desired_size   = 1
      taints = [{
        key    = "nvidia.com/gpu"
        value  = "true"
        effect = "NO_SCHEDULE"
      }]
    }
  }
}

3. Kubernetes Mastery

Core Resources Overview

ResourceRole
PodExecution unit; one or more containers
DeploymentManages Pod replicas and rolling updates
ServiceStable network endpoint for a set of Pods
HPAHorizontal auto-scaling based on CPU/memory
ConfigMapExternalized configuration and environment variables
SecretSensitive data (passwords, tokens)
IngressExternal HTTP/S traffic routing

Production Deployment + HPA Manifest

# deployment.yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: ml-inference-api
  namespace: production
  labels:
    app: ml-inference-api
    version: v1
spec:
  replicas: 3
  selector:
    matchLabels:
      app: ml-inference-api
  strategy:
    type: RollingUpdate
    rollingUpdate:
      maxSurge: 1
      maxUnavailable: 0
  template:
    metadata:
      labels:
        app: ml-inference-api
        version: v1
      annotations:
        prometheus.io/scrape: 'true'
        prometheus.io/port: '8080'
        prometheus.io/path: '/metrics'
    spec:
      containers:
        - name: api
          image: ghcr.io/myorg/ml-inference-api:sha-abc123
          ports:
            - containerPort: 8080
          env:
            - name: MODEL_NAME
              valueFrom:
                configMapKeyRef:
                  name: ml-config
                  key: model_name
            - name: DB_PASSWORD
              valueFrom:
                secretKeyRef:
                  name: db-secret
                  key: password
          resources:
            requests:
              cpu: '500m'
              memory: '512Mi'
            limits:
              cpu: '2000m'
              memory: '2Gi'
          readinessProbe:
            httpGet:
              path: /health
              port: 8080
            initialDelaySeconds: 10
            periodSeconds: 5
          livenessProbe:
            httpGet:
              path: /health
              port: 8080
            initialDelaySeconds: 30
            periodSeconds: 10
---
# hpa.yaml
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: ml-inference-api-hpa
  namespace: production
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: ml-inference-api
  minReplicas: 3
  maxReplicas: 20
  metrics:
    - type: Resource
      resource:
        name: cpu
        target:
          type: Utilization
          averageUtilization: 70
    - type: Resource
      resource:
        name: memory
        target:
          type: Utilization
          averageUtilization: 80
    - type: Pods
      pods:
        metric:
          name: http_requests_per_second
        target:
          type: AverageValue
          averageValue: '100'
  behavior:
    scaleUp:
      stabilizationWindowSeconds: 60
      policies:
        - type: Pods
          value: 4
          periodSeconds: 60
    scaleDown:
      stabilizationWindowSeconds: 300

Helm for Package Management

Helm is the package manager for Kubernetes. It templates complex application deployments for reuse and versioning.

# helm/my-app/values.yaml
replicaCount: 3

image:
  repository: ghcr.io/myorg/my-app
  pullPolicy: IfNotPresent
  tag: 'latest'

service:
  type: ClusterIP
  port: 80
  targetPort: 8080

ingress:
  enabled: true
  className: nginx
  annotations:
    cert-manager.io/cluster-issuer: letsencrypt-prod
  hosts:
    - host: api.example.com
      paths:
        - path: /
          pathType: Prefix
  tls:
    - secretName: api-tls
      hosts:
        - api.example.com

resources:
  requests:
    cpu: 500m
    memory: 512Mi
  limits:
    cpu: 2000m
    memory: 2Gi

autoscaling:
  enabled: true
  minReplicas: 3
  maxReplicas: 20
  targetCPUUtilizationPercentage: 70

postgresql:
  enabled: true
  auth:
    database: myapp
    existingSecret: db-credentials

redis:
  enabled: true
  architecture: replication

4. Monitoring and Observability

The Three Pillars of Observability

PillarToolsUse Case
MetricsPrometheus, GrafanaNumeric data, dashboards
LogsLoki, ElasticsearchEvent records, debugging
TracesJaeger, TempoDistributed request tracking

Prometheus Alerting Rules

# prometheus-rules.yaml
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: ml-service-alerts
  namespace: monitoring
spec:
  groups:
    - name: ml-service.rules
      interval: 30s
      rules:
        - alert: HighErrorRate
          expr: |
            sum(rate(http_requests_total{status=~"5.."}[5m]))
            /
            sum(rate(http_requests_total[5m])) > 0.05
          for: 2m
          labels:
            severity: critical
          annotations:
            summary: 'High error rate detected'
            description: 'Error rate is {{ $value | humanizePercentage }} over the last 5 minutes'

        - alert: SlowResponseTime
          expr: |
            histogram_quantile(0.99,
              sum(rate(http_request_duration_seconds_bucket[5m])) by (le, service)
            ) > 1.0
          for: 5m
          labels:
            severity: warning
          annotations:
            summary: 'Slow p99 latency'
            description: 'p99 latency is {{ $value }}s for {{ $labels.service }}'

        - alert: PodCrashLooping
          expr: |
            increase(kube_pod_container_status_restarts_total[15m]) > 3
          for: 0m
          labels:
            severity: critical
          annotations:
            summary: 'Pod is crash looping'
            description: 'Pod {{ $labels.namespace }}/{{ $labels.pod }} is crash looping'

        - alert: HighMemoryUsage
          expr: |
            container_memory_usage_bytes
            /
            container_spec_memory_limit_bytes > 0.85
          for: 5m
          labels:
            severity: warning
          annotations:
            summary: 'High memory usage detected'

OpenTelemetry Instrumentation in Python

# instrumentation.py
from opentelemetry import trace, metrics
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import BatchSpanProcessor
from opentelemetry.exporter.otlp.proto.grpc.trace_exporter import OTLPSpanExporter
from opentelemetry.exporter.otlp.proto.grpc.metric_exporter import OTLPMetricExporter
from opentelemetry.sdk.metrics import MeterProvider
from opentelemetry.sdk.metrics.export import PeriodicExportingMetricReader
from opentelemetry.instrumentation.fastapi import FastAPIInstrumentor
from opentelemetry.instrumentation.requests import RequestsInstrumentor

def setup_telemetry(service_name: str, otlp_endpoint: str):
    """Configure OpenTelemetry tracing and metrics for the service."""
    # Tracer setup: batch spans and ship them to the collector over OTLP gRPC
    tracer_provider = TracerProvider()
    otlp_exporter = OTLPSpanExporter(endpoint=otlp_endpoint)
    tracer_provider.add_span_processor(BatchSpanProcessor(otlp_exporter))
    trace.set_tracer_provider(tracer_provider)

    # Metrics setup: without a reader/exporter, recorded metrics are never sent
    metric_reader = PeriodicExportingMetricReader(
        OTLPMetricExporter(endpoint=otlp_endpoint)
    )
    meter_provider = MeterProvider(metric_readers=[metric_reader])
    metrics.set_meter_provider(meter_provider)

    return trace.get_tracer(service_name)

# Apply to FastAPI app
from fastapi import FastAPI

app = FastAPI()
tracer = setup_telemetry("ml-inference-api", "http://otel-collector:4317")

# Auto-instrumentation
FastAPIInstrumentor.instrument_app(app)
RequestsInstrumentor().instrument()

@app.post("/predict")
async def predict(payload: dict):
    with tracer.start_as_current_span("model-inference") as span:
        span.set_attribute("model.name", "bert-base")
        span.set_attribute("input.length", len(str(payload)))

        # run_inference: the service's model-serving function, defined elsewhere
        result = run_inference(payload)

        span.set_attribute("prediction.confidence", result["confidence"])
        return result

5. SRE Principles

SLI / SLO / SLA Hierarchy

  • SLI (Service Level Indicator): The actual measured metric (e.g., request success rate, latency)
  • SLO (Service Level Objective): The target value for an SLI (e.g., 99.9% availability)
  • SLA (Service Level Agreement): The contractual commitment to customers (typically looser than the SLO)

Error Budget Calculation

With a 99.9% SLO, the allowable downtime per month (30 days) is:

Error Budget = 100% - SLO = 0.1%
Monthly downtime budget = 30d x 24h x 60m x 0.1% = ~43.2 minutes

When the error budget is exhausted, feature deployments halt and reliability work takes priority.

| SLO Level | Monthly Downtime Budget |
| --- | --- |
| 99% | 7 hours 18 minutes |
| 99.9% | 43 minutes 48 seconds |
| 99.99% | 4 minutes 22 seconds |
| 99.999% | 26 seconds |
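These budgets follow from one formula. A quick sketch, assuming an average month of 365.25/12 ≈ 30.44 days (which is why the figures differ slightly from the 30-day calculation above):

```python
def downtime_budget_minutes(slo: float, days: float = 365.25 / 12) -> float:
    """Monthly downtime allowed by an availability SLO, in minutes."""
    return days * 24 * 60 * (1 - slo)

for slo in (0.99, 0.999, 0.9999, 0.99999):
    minutes = downtime_budget_minutes(slo)
    print(f"{slo:.5%} -> {minutes:.2f} minutes/month")
```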

Toil Reduction

Toil is manual, repetitive, automatable operational work. Google SRE recommends keeping toil below 50% of total work time.

Strategies for toil reduction:

  1. Script repetitive tasks
  2. Convert runbooks into automated code
  3. Improve alert quality to reduce noise
  4. Build self-healing systems

6. AI/ML Workflow Automation

Experiment Tracking with MLflow

# train.py
import mlflow
import mlflow.sklearn
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score, f1_score

mlflow.set_tracking_uri("http://mlflow-server:5000")
mlflow.set_experiment("fraud-detection-v2")

# X_train, X_test, y_train, y_test are assumed to be prepared upstream
with mlflow.start_run(run_name="rf-baseline"):
    params = {"n_estimators": 100, "max_depth": 10, "random_state": 42}
    mlflow.log_params(params)

    model = RandomForestClassifier(**params)
    model.fit(X_train, y_train)

    y_pred = model.predict(X_test)
    mlflow.log_metric("accuracy", accuracy_score(y_test, y_pred))
    mlflow.log_metric("f1_score", f1_score(y_test, y_pred))

    mlflow.sklearn.log_model(model, "model",
        registered_model_name="fraud-detector")

ML Pipeline with Argo Workflows

# ml-pipeline.yaml
apiVersion: argoproj.io/v1alpha1
kind: Workflow
metadata:
  name: ml-training-pipeline
spec:
  entrypoint: ml-pipeline
  templates:
    - name: ml-pipeline
      dag:
        tasks:
          - name: data-prep
            template: prepare-data
          - name: train
            template: train-model
            dependencies: [data-prep]
          - name: evaluate
            template: evaluate-model
            dependencies: [train]
          - name: deploy
            template: deploy-model
            dependencies: [evaluate]

    - name: prepare-data
      container:
        image: ghcr.io/myorg/data-prep:latest
        command: [python, prepare_data.py]
        resources:
          requests:
            memory: 4Gi
            cpu: '2'

    - name: train-model
      container:
        image: ghcr.io/myorg/ml-trainer:latest
        command: [python, train.py]
        resources:
          requests:
            memory: 16Gi
            cpu: '8'
            nvidia.com/gpu: '1'

    - name: evaluate-model
      container:
        image: ghcr.io/myorg/ml-evaluator:latest
        command: [python, evaluate.py]

    - name: deploy-model
      container:
        image: ghcr.io/myorg/model-deployer:latest
        command: [python, deploy.py]

Dockerfile for a Python ML Service

# Dockerfile
FROM python:3.11-slim AS builder

WORKDIR /app

# Separate dependency layer for better cache utilization
COPY requirements.txt .
RUN pip install --no-cache-dir --user -r requirements.txt

FROM python:3.11-slim AS runtime

# Security: run as non-root user
RUN groupadd -r appuser && useradd -r -g appuser appuser

WORKDIR /app

# Copy installed packages from the builder stage, owned by the non-root user
COPY --from=builder --chown=appuser:appuser /root/.local /home/appuser/.local
COPY --chown=appuser:appuser src/ ./src/
COPY --chown=appuser:appuser models/ ./models/

USER appuser

ENV PATH=/home/appuser/.local/bin:$PATH
ENV PYTHONUNBUFFERED=1
ENV PYTHONDONTWRITEBYTECODE=1

EXPOSE 8080

HEALTHCHECK --interval=30s --timeout=10s --start-period=5s --retries=3 \
  CMD python -c "import requests; requests.get('http://localhost:8080/health').raise_for_status()"

CMD ["uvicorn", "src.main:app", "--host", "0.0.0.0", "--port", "8080", "--workers", "4"]

7. Security: RBAC and Secrets Management

Kubernetes RBAC

# rbac.yaml
apiVersion: v1
kind: ServiceAccount
metadata:
  name: ml-service-account
  namespace: production
---
# Least-privilege Role
apiVersion: rbac.authorization.k8s.io/v1
kind: Role
metadata:
  name: ml-service-role
  namespace: production
rules:
  - apiGroups: ['']
    resources: ['pods', 'services']
    verbs: ['get', 'list', 'watch']
  - apiGroups: ['']
    resources: ['secrets']
    resourceNames: ['ml-model-secrets']
    verbs: ['get']
---
apiVersion: rbac.authorization.k8s.io/v1
kind: RoleBinding
metadata:
  name: ml-service-rolebinding
  namespace: production
subjects:
  - kind: ServiceAccount
    name: ml-service-account
    namespace: production
roleRef:
  kind: Role
  apiGroup: rbac.authorization.k8s.io
  name: ml-service-role

Secrets Management with HashiCorp Vault

# vault_client.py
import hvac
import os

def get_secret(secret_path: str) -> dict:
    """Securely retrieve a secret from Vault."""
    client = hvac.Client(
        url=os.environ["VAULT_ADDR"],
        token=os.environ["VAULT_TOKEN"]
    )

    if not client.is_authenticated():
        raise RuntimeError("Vault authentication failed")

    secret = client.secrets.kv.v2.read_secret_version(
        path=secret_path,
        mount_point="secret"
    )
    return secret["data"]["data"]

# In Kubernetes, use the Vault Agent Injector for
# automatic secret injection via pod annotations
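A common refinement is caching secrets for a short TTL so hot code paths do not hit Vault on every request. A minimal sketch, independent of the hvac client (the fetch function is injected, so any backend works; the TTL also bounds how stale a rotated secret can be):

```python
import time

class SecretCache:
    """Cache secret lookups for a short TTL to reduce load on Vault."""

    def __init__(self, fetch, ttl_seconds: float = 60.0):
        self._fetch = fetch              # e.g. the get_secret function above
        self._ttl = ttl_seconds
        self._cache: dict[str, tuple[float, dict]] = {}

    def get(self, path: str) -> dict:
        entry = self._cache.get(path)
        if entry and time.monotonic() - entry[0] < self._ttl:
            return entry[1]              # still fresh, skip the round-trip
        value = self._fetch(path)
        self._cache[path] = (time.monotonic(), value)
        return value

# Demonstration with a fake fetcher that counts calls
calls = []
def fake_fetch(path):
    calls.append(path)
    return {"username": "ml-user"}

cache = SecretCache(fake_fetch, ttl_seconds=60)
cache.get("ml-service/db")
cache.get("ml-service/db")               # served from cache
print(len(calls))                        # 1
```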

Network Policies

# network-policy.yaml
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: ml-service-netpol
  namespace: production
spec:
  podSelector:
    matchLabels:
      app: ml-inference-api
  policyTypes:
    - Ingress
    - Egress
  ingress:
    - from:
        - namespaceSelector:
            matchLabels:
              name: ingress-nginx
      ports:
        - protocol: TCP
          port: 8080
  egress:
    - to:
        - namespaceSelector:
            matchLabels:
              name: database
      ports:
        - protocol: TCP
          port: 5432
    - to:
        - namespaceSelector:
            matchLabels:
              name: monitoring
      ports:
        - protocol: TCP
          port: 4317 # OTLP gRPC

8. Incident Management

Incident Response Process

  1. Detection: Prometheus alert or user report
  2. Triage: Assess severity (P1/P2/P3)
  3. Communication: Update status page, notify stakeholders
  4. Mitigation: Traffic failover, rollback, scale out
  5. Resolution: Fix root cause
  6. Post-Mortem: Write a blameless post-mortem
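Triage (step 2) can be tied directly to the error budget: the burn rate measures how many times faster than "sustainable" the budget is being consumed, and burn-rate thresholds (popularized by the Google SRE Workbook) map cleanly to severities. A sketch with illustrative thresholds:

```python
def burn_rate(error_rate: float, slo: float) -> float:
    """Budget consumption speed. 1.0 exhausts the budget exactly at the
    end of the SLO window; 14.4 exhausts a 30-day budget in ~2 days."""
    return error_rate / (1 - slo)

def triage(error_rate: float, slo: float = 0.999) -> str:
    """Map burn rate to a severity; the cutoffs here are illustrative."""
    rate = burn_rate(error_rate, slo)
    if rate >= 14.4:
        return "P1"   # page immediately
    if rate >= 6:
        return "P2"   # page during business hours
    return "P3"       # file a ticket

print(triage(0.02))    # burn rate ~20 -> "P1"
print(triage(0.0005))  # burn rate ~0.5 -> "P3"
```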

Writing Effective Post-Mortems

A good post-mortem focuses on system improvement, not individual blame.

Key elements:

  • Detailed timeline of events
  • Distinguish root cause from triggering factor
  • 5 Whys analysis
  • Concrete action items with owners and due dates

Conclusion

DevOps and SRE are not just toolsets — they are a culture and philosophy. Automation reduces human error, data-driven decisions improve reliability, and continuous improvement builds trust.

Core principles in summary:

  • Automate first: All repetitive work should be code
  • Measure everything: SLI/SLO make goals concrete
  • Fail fast and safely: Canary deployments minimize blast radius
  • Blameless culture: Improve systems, not blame people

Quiz

Q1. What is the difference between Blue-Green and Canary deployments in CI/CD?

Answer: Blue-Green maintains two identical production environments and switches all traffic at once, while Canary gradually routes a fraction of traffic to the new version.

Explanation: Blue-Green provides instant rollback (just flip the traffic switch) with zero downtime, but requires double the resources. Canary validates the new version with real user traffic while minimizing risk, but requires sophisticated monitoring and traffic management. Tools like Argo Rollouts support both strategies natively.

Q2. What metrics does Kubernetes HPA use to trigger scaling?

Answer: Out of the box, resource metrics: CPU and memory utilization from the Metrics Server. With the Custom Metrics API, HPA can also scale on application-level metrics like RPS (requests per second) or queue depth.

Explanation: HPA v2 supports four metric types: resource, pods, object, and external. Setting a 70% CPU target causes HPA to add pods when average CPU across existing pods exceeds that threshold. Installing the Prometheus Adapter lets you expose Prometheus metrics to HPA for more meaningful scaling decisions.

Q3. What is the role of Error Budget in SRE?

Answer: The Error Budget represents the allowable failure margin derived from the SLO. It serves as the mechanism balancing reliability investment against feature delivery velocity.

Explanation: If the SLO is 99.9%, the error budget is 0.1% (roughly 43 minutes of downtime per month). When the budget is ample, teams can deploy freely and experiment. When it is exhausted, new deployments freeze until reliability work replenishes it. This creates a data-driven negotiation between development speed and stability that replaces subjective arguments.

Q4. What are the advantages of Prometheus's pull-based metric collection?

Answer: Pull-based collection lets Prometheus centrally manage scrape targets, immediately detects when a target goes down, and requires no inbound firewall rules from the monitored services.

Explanation: Push-based systems (like StatsD or InfluxDB line protocol) require each service to know the address of the metrics server and risk data loss on network issues. Pull-based Prometheus manages targets via configuration files or service discovery, making it easy to add and remove services dynamically. For short-lived batch jobs that exit before Prometheus can scrape them, the Pushgateway provides a bridge.

Q5. What is the difference between GitOps and traditional CI/CD?

Answer: GitOps uses Git as the single source of truth and continuously reconciles declared state, while traditional CI/CD pipelines imperatively execute deployment commands directly.

Explanation: Traditional CI/CD (Jenkins, GitHub Actions) runs kubectl apply or helm upgrade directly from the pipeline. GitOps tools (ArgoCD, Flux) continuously compare the Git-declared desired state against the live cluster state and auto-sync any drift. GitOps enables drift detection, automatic rollback, and full audit trails without granting the CI/CD system direct cluster credentials — a significant security improvement.