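Introduction

In modern software development, **DevOps** and **SRE (Site Reliability Engineering)** are no longer optional. Netflix is known for deploying thousands of times a day, and Google serves billions of users while targeting availability on the order of 99.99%. Behind both are thoroughly automated pipelines and a data-driven approach to operations.

This guide covers DevOps/SRE from core concepts through hands-on Kubernetes operations to AI/ML workflow automation, **with practical code throughout**.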

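1. DevOps Fundamentals: CI/CD Pipelines

What Is CI/CD

**CI (Continuous Integration)** is the practice of merging code frequently and building and testing it automatically. **CD** stands for either **Continuous Delivery**, which automatically prepares verified code for release (typically up to staging), or **Continuous Deployment**, which automatically ships it all the way to production.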

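| Stage | Purpose | Scope of automation |
| --------------- | --------------------------- | ------------------------- |
| CI | Verify code integration | Build, test, lint |
| CD (Delivery) | Prepare for release | Automated up to staging |
| CD (Deployment) | Deploy automatically | Automated up to production |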

Building CI/CD with GitHub Actions

```yaml
# .github/workflows/ci-cd.yml
on:
  push:
    branches: [main, develop]
  pull_request:
    branches: [main]

env:
  REGISTRY: ghcr.io
  IMAGE_NAME: ${{ github.repository }}

jobs:
  test:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4

      - name: Set up Python
        uses: actions/setup-python@v5
        with:
          python-version: '3.11'
          cache: 'pip'

      - name: Install dependencies
        run: |
          pip install -r requirements.txt
          pip install pytest pytest-cov flake8

      - name: Lint with flake8
        run: flake8 src/ --max-line-length=88

      - name: Run tests
        run: pytest tests/ --cov=src --cov-report=xml

      - name: Upload coverage
        uses: codecov/codecov-action@v4
        with:
          file: coverage.xml

  build:
    runs-on: ubuntu-latest
    needs: test
    # Build and push images only for commits on main
    if: github.ref == 'refs/heads/main'
    permissions:
      contents: read
      packages: write
    steps:
      - uses: actions/checkout@v4

      - name: Log in to Container Registry
        uses: docker/login-action@v3
        with:
          registry: ${{ env.REGISTRY }}
          username: ${{ github.actor }}
          password: ${{ secrets.GITHUB_TOKEN }}

      - name: Extract metadata
        id: meta
        uses: docker/metadata-action@v5
        with:
          images: ${{ env.REGISTRY }}/${{ env.IMAGE_NAME }}
          # format=long tags the image with the full commit SHA so the
          # deploy job below can reference it as sha-<github.sha>
          tags: |
            type=sha,prefix=sha-,format=long
            type=ref,event=branch
            type=semver,pattern={{version}}

      - name: Build and push
        uses: docker/build-push-action@v5
        with:
          context: .
          push: true
          tags: ${{ steps.meta.outputs.tags }}
          cache-from: type=gha
          cache-to: type=gha,mode=max

  deploy:
    runs-on: ubuntu-latest
    needs: build
    environment: production
    steps:
      - uses: actions/checkout@v4

      - name: Configure kubectl
        uses: azure/k8s-set-context@v3
        with:
          kubeconfig: ${{ secrets.KUBECONFIG }}

      - name: Deploy with Helm
        # The tag must match the one pushed by the build job (sha-<full SHA>)
        run: |
          helm upgrade --install my-app ./helm/my-app \
            --namespace production \
            --set image.tag=sha-${{ github.sha }} \
            --wait --timeout=5m
```

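Comparing Deployment Strategies

**Blue-Green deployment**: Two identical production environments (Blue and Green) are kept running. The new version is deployed to Green, then all traffic is switched over at once. Rollback is immediate, but it requires twice the resources.

**Canary deployment**: Only a fraction of traffic (for example, 5%) is routed to the new version, and that share is increased gradually. The new version is validated against real users while risk stays small.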
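The traffic split behind a canary can be illustrated with a small sketch. This is not how any particular load balancer or service mesh implements it; the function name `routes_to_canary` and the hash-bucket approach are assumptions chosen for illustration. It shows one common idea: hash a stable user ID into 100 buckets so that each user consistently sees the same version, and raising the percentage only ever adds users to the canary.

```python
import hashlib


def routes_to_canary(user_id: str, canary_percent: int) -> bool:
    """Return True if this user should be served by the canary version."""
    if not 0 <= canary_percent <= 100:
        raise ValueError("canary_percent must be between 0 and 100")
    # Hash the ID so the same user always lands in the same bucket (0-99).
    digest = hashlib.sha256(user_id.encode("utf-8")).hexdigest()
    bucket = int(digest, 16) % 100
    # Buckets below the threshold go to the canary; the rest stay on stable.
    return bucket < canary_percent


if __name__ == "__main__":
    users = [f"user-{i}" for i in range(10_000)]
    share = sum(routes_to_canary(u, 5) for u in users) / len(users)
    print(f"share of users on the canary: {share:.1%}")
```

Because the bucket depends only on the user ID, moving from 5% to 30% keeps every existing canary user on the canary, which makes user-reported issues easier to correlate with the new version.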

```yaml
# Argo Rollouts canary deployment example
apiVersion: argoproj.io/v1alpha1
kind: Rollout
metadata:
  name: my-app  # assumed; the name is missing in the source
spec:
  replicas: 10
  # selector and pod template are not shown here; a real Rollout needs both
  strategy:
    canary:
      steps:
        - setWeight: 10
        - pause: { duration: 5m }
        - setWeight: 30
        - pause: { duration: 10m }
        - setWeight: 60
        - pause: { duration: 10m }
        - setWeight: 100
      canaryService: my-app-canary
      stableService: my-app-stable
```

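2. GitOps and Infrastructure as Code

GitOps Principles

GitOps is an operating model that uses Git as the **single source of truth**.

- **Declarative**: the desired system state is declared as code
- **Versioned**: every change is tracked in Git history
- **Automated**: a change in Git triggers automatic synchronization
- **Auditable**: PR-based changes record who changed what, and why

The best-known tools are **ArgoCD** and **Flux**.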
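The "declared state in Git, automatically synchronized" idea can be sketched as a reconcile step. This is a toy model, not ArgoCD's or Flux's actual algorithm: the names `plan_sync` and `reconcile` are hypothetical, and both states are plain dictionaries keyed by resource name. It shows how comparing desired and actual state yields the create, update, and delete actions a GitOps controller would perform.

```python
import copy
from typing import Dict, List, Tuple

State = Dict[str, dict]  # resource name -> spec


def plan_sync(desired: State, actual: State) -> List[Tuple[str, str]]:
    """List the actions needed to make the actual state match the desired one."""
    actions = []
    for name, spec in sorted(desired.items()):
        if name not in actual:
            actions.append(("create", name))
        elif actual[name] != spec:
            actions.append(("update", name))  # drift or a new commit
    for name in sorted(actual):
        if name not in desired:
            actions.append(("delete", name))  # removed from Git
    return actions


def reconcile(desired: State, actual: State) -> State:
    """Apply the planned actions to a copy of the actual state."""
    result = copy.deepcopy(actual)
    for action, name in plan_sync(desired, actual):
        if action == "delete":
            del result[name]
        else:
            result[name] = copy.deepcopy(desired[name])
    return result


if __name__ == "__main__":
    git_state = {"web": {"replicas": 3}, "api": {"replicas": 2}}
    cluster_state = {"web": {"replicas": 1}, "old-job": {"replicas": 1}}
    print(plan_sync(git_state, cluster_state))
```

Running the reconcile step repeatedly is what turns a one-off deployment into continuous synchronization: once the states match, the plan is empty and nothing happens.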

Implementing IaC with Terraform

```hcl
# main.tf - provision an AWS EKS cluster
terraform {
  required_providers {
    aws = {
      source  = "hashicorp/aws"
      version = "~> 5.0"
    }
  }

  backend "s3" {
    bucket = "my-terraform-state"
    key    = "prod/eks/terraform.tfstate"
    region = "ap-northeast-2"
  }
}

module "eks" {
  source  = "terraform-aws-modules/eks/aws"
  version = "~> 20.0"

  cluster_name    = "prod-cluster"
  cluster_version = "1.29"

  # module.vpc is not shown in the source; it is assumed to be defined elsewhere
  vpc_id     = module.vpc.vpc_id
  subnet_ids = module.vpc.private_subnets

  eks_managed_node_groups = {
    # General-purpose workloads
    general = {
      instance_types = ["m5.xlarge"]
      min_size       = 2
      max_size       = 10
      desired_size   = 3
    }

    # GPU nodes, tainted so only GPU workloads are scheduled on them
    gpu = {
      instance_types = ["g4dn.xlarge"]
      min_size       = 0
      max_size       = 5
      desired_size   = 1

      taints = [{
        key    = "nvidia.com/gpu"
        value  = "true"
        effect = "NO_SCHEDULE"
      }]
    }
  }
}
```

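3. Mastering Kubernetes

Understanding the Core Resources

The core Kubernetes objects are summarized below.

| Resource | Role |
| ---------- | ----------------------------------------------------- |
| Pod | Unit of execution; a group of one or more containers |
| Deployment | Manages Pod replicas and rolling updates |
| Service | Network endpoint for a set of Pods |
| HPA | Horizontal autoscaling based on CPU/memory |
| ConfigMap | Keeps environment variables and config files separate |
| Secret | Manages sensitive data (passwords, tokens) |
| Ingress | Routes external HTTP traffic |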

A Practical Deployment + HPA Manifest

```yaml
# deployment.yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: ml-inference-api  # assumed; the name is missing in the source
  namespace: production
  labels:
    app: ml-inference-api
    version: v1
spec:
  replicas: 3
  selector:
    matchLabels:
      app: ml-inference-api
  strategy:
    type: RollingUpdate
    rollingUpdate:
      maxSurge: 1
      maxUnavailable: 0
  template:
    metadata:
      labels:
        app: ml-inference-api
        version: v1
      annotations:
        prometheus.io/scrape: 'true'
        prometheus.io/port: '8080'
        prometheus.io/path: '/metrics'
    spec:
      containers:
        - name: api
          image: ghcr.io/myorg/ml-inference-api:sha-abc123
          ports:
            - containerPort: 8080
          env:
            - name: MODEL_NAME
              valueFrom:
                configMapKeyRef:
                  name: ml-inference-config  # assumed ConfigMap name
                  key: model_name
            - name: DB_PASSWORD
              valueFrom:
                secretKeyRef:
                  name: db-credentials  # assumed Secret name
                  key: password
          resources:
            requests:
              cpu: '500m'
              memory: '512Mi'
            limits:
              cpu: '2000m'
              memory: '2Gi'
          # Readiness gates traffic; liveness restarts a stuck container
          readinessProbe:
            httpGet:
              path: /health
              port: 8080
            initialDelaySeconds: 10
            periodSeconds: 5
          livenessProbe:
            httpGet:
              path: /health
              port: 8080
            initialDelaySeconds: 30
            periodSeconds: 10
```
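The `maxSurge: 1` and `maxUnavailable: 0` settings in the Deployment above bound how many Pods exist during a rolling update. The sketch below works out those bounds; the helper names are hypothetical, and it assumes the documented Kubernetes rounding rules (percentages round up for `maxSurge` and down for `maxUnavailable`).

```python
import math
from typing import Tuple, Union

IntOrPercent = Union[int, str]


def _resolve(value: IntOrPercent, replicas: int, round_up: bool) -> int:
    """Turn an absolute number or a string such as '25%' into a Pod count."""
    if isinstance(value, str) and value.endswith("%"):
        raw = replicas * int(value[:-1]) / 100
        return math.ceil(raw) if round_up else math.floor(raw)
    return int(value)


def rolling_update_bounds(
    replicas: int, max_surge: IntOrPercent, max_unavailable: IntOrPercent
) -> Tuple[int, int]:
    """Return (minimum available Pods, maximum total Pods) during a rollout."""
    surge = _resolve(max_surge, replicas, round_up=True)
    unavailable = _resolve(max_unavailable, replicas, round_up=False)
    if surge == 0 and unavailable == 0:
        # With both at zero the rollout could never make progress.
        raise ValueError("maxSurge and maxUnavailable cannot both be zero")
    return replicas - unavailable, replicas + surge


if __name__ == "__main__":
    # Values from the manifest: 3 replicas, maxSurge 1, maxUnavailable 0
    print(rolling_update_bounds(3, 1, 0))
```

With the manifest's values the rollout never drops below 3 available Pods and never runs more than 4, so capacity is preserved at the cost of one extra Pod's worth of resources.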

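```yaml
# hpa.yaml
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: ml-inference-api-hpa  # assumed; the name is missing in the source
  namespace: production
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: ml-inference-api  # assumed; must match the Deployment's metadata.name
  minReplicas: 3
  maxReplicas: 20
  metrics:
    - type: Resource
      resource:
        name: cpu
        target:
          type: Utilization
          averageUtilization: 70
    - type: Resource
      resource:
        name: memory
        target:
          type: Utilization
          averageUtilization: 80
    - type: Pods
      pods:
        metric:
          name: http_requests_per_second  # assumed custom metric name
        target:
          type: AverageValue
          averageValue: '100'
  behavior:
    scaleUp:
      stabilizationWindowSeconds: 60
      policies:
        # Add at most 4 Pods per minute
        - type: Pods
          value: 4
          periodSeconds: 60
    scaleDown:
      # Wait 5 minutes before scaling down to avoid flapping
      stabilizationWindowSeconds: 300
```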
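To see how the HPA turns a utilization target into a replica count, here is a minimal sketch of the scaling formula from the Kubernetes documentation: `desired = ceil(current_replicas * current_value / target_value)`, skipped when the ratio is within a tolerance (0.1 by default). The function name and signature are illustrative, and the sketch ignores stabilization windows, scaling policies, and not-yet-ready Pods.

```python
import math


def desired_replicas(
    current_replicas: int,
    current_value: float,
    target_value: float,
    min_replicas: int,
    max_replicas: int,
    tolerance: float = 0.1,
) -> int:
    """Compute the replica count the HPA would aim for, for a single metric."""
    if target_value <= 0:
        raise ValueError("target_value must be positive")
    ratio = current_value / target_value
    if abs(ratio - 1.0) <= tolerance:
        # Close enough to the target: do not scale.
        desired = current_replicas
    else:
        desired = math.ceil(current_replicas * current_value / target_value)
    # Clamp to the configured bounds (minReplicas / maxReplicas).
    return max(min_replicas, min(max_replicas, desired))


if __name__ == "__main__":
    # 3 replicas averaging 90% CPU against the 70% target from hpa.yaml
    print(desired_replicas(3, 90, 70, min_replicas=3, max_replicas=20))
```

When several metrics are configured, as in the manifest above, the HPA evaluates each one and uses the largest resulting replica count.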

Package Management with Helm Charts

Helm is the package manager for Kubernetes. It manages complex application deployments through templates.

```yaml
# helm/my-app/values.yaml
replicaCount: 3

image:
  repository: ghcr.io/myorg/my-app
  pullPolicy: IfNotPresent
  tag: 'latest'

service:
  type: ClusterIP
  port: 80
  targetPort: 8080

ingress:
  enabled: true
  className: nginx
  annotations:
    cert-manager.io/cluster-issuer: letsencrypt-prod
  hosts:
    - host: api.example.com
      paths:
        - path: /
          pathType: Prefix
  tls:
    - secretName: api-tls
      hosts:
        - api.example.com

resources:
  requests:
    cpu: 500m
    memory: 512Mi
  limits:
    cpu: 2000m
    memory: 2Gi

autoscaling:
  enabled: true
  minReplicas: 3
  maxReplicas: 20
  targetCPUUtilizationPercentage: 70

postgresql:
  enabled: true
  auth:
    database: myapp
    existingSecret: db-credentials

redis:
  enabled: true
  architecture: replication
```

4. Monitoring & Observability

The Three Pillars of Observability

| Pillar | Tools | Use |
| ------- | ------------------- | ----------------------------- |
| Metrics | Prometheus, Grafana | Numerical data, dashboards |
| Logs | Loki, Elasticsearch | Event records, debugging |
| Traces | Jaeger, Tempo | Distributed request tracing |

Prometheus Alerting Rules

```yaml
# prometheus-rules.yaml
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: ml-service-rules  # assumed; the name is missing in the source
  namespace: monitoring
spec:
  groups:
    - name: ml-service.rules
      interval: 30s
      rules:
        # Ratio of 5xx responses to all responses over 5 minutes
        - alert: HighErrorRate
          expr: |
            sum(rate(http_requests_total{status=~"5.."}[5m]))
            /
            sum(rate(http_requests_total[5m])) > 0.05
          for: 2m
          labels:
            severity: critical
          annotations:
            summary: 'High error rate detected'
            description: 'Error rate is {{ $value | humanizePercentage }} for the last 5 minutes'

        - alert: SlowResponseTime
          expr: |
            histogram_quantile(0.99,
              sum(rate(http_request_duration_seconds_bucket[5m])) by (le, service)
            ) > 1.0
          for: 5m
          labels:
            severity: warning
          annotations:
            summary: 'Slow p99 latency'
            description: 'p99 latency is {{ $value }}s for {{ $labels.service }}'

        - alert: PodCrashLooping
          expr: |
            increase(kube_pod_container_status_restarts_total[15m]) > 3
          for: 0m
          labels:
            severity: critical
          annotations:
            summary: 'Pod crash looping'
            description: 'Pod {{ $labels.namespace }}/{{ $labels.pod }} is crash looping'

        # Containers without a memory limit report 0 and are filtered out
        - alert: HighMemoryUsage
          expr: |
            container_memory_usage_bytes
            /
            (container_spec_memory_limit_bytes > 0) > 0.85
          for: 5m
          labels:
            severity: warning
          annotations:
            summary: 'High memory usage'
```
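The `SlowResponseTime` rule relies on `histogram_quantile`, which estimates a percentile from cumulative bucket counts rather than from raw samples. The sketch below is a simplified re-implementation for intuition only, not Prometheus's source code: it assumes the classic bucket layout (cumulative counts keyed by upper bound `le`, ending in `+Inf`), a lower bound of 0 for the first bucket, and linear interpolation inside the bucket that contains the target rank.

```python
import math
from typing import List, Tuple


def histogram_quantile(q: float, buckets: List[Tuple[float, float]]) -> float:
    """Estimate quantile q from cumulative (upper_bound, count) buckets."""
    if not 0 <= q <= 1:
        raise ValueError("q must be between 0 and 1")
    buckets = sorted(buckets)
    if len(buckets) < 2 or not math.isinf(buckets[-1][0]):
        raise ValueError("need at least two buckets, the last one being +Inf")
    total = buckets[-1][1]
    if total == 0:
        return math.nan
    rank = q * total  # how many observations lie at or below the quantile
    prev_bound, prev_count = 0.0, 0.0
    for bound, count in buckets:
        if count >= rank:
            if math.isinf(bound):
                # The quantile falls in the +Inf bucket: report the highest
                # finite bound, since nothing more precise is known.
                return prev_bound
            if count == prev_count:
                return bound
            # Assume observations are spread evenly inside the bucket.
            fraction = (rank - prev_count) / (count - prev_count)
            return prev_bound + (bound - prev_bound) * fraction
        prev_bound, prev_count = bound, count
    return prev_bound


if __name__ == "__main__":
    latency_buckets = [(0.1, 50), (0.5, 90), (1.0, 99), (math.inf, 100)]
    print(histogram_quantile(0.7, latency_buckets))
```

The interpolation explains why bucket boundaries matter: the estimate can never be more precise than the bucket it falls in, so boundaries should bracket the latency thresholds used in alerts.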

OpenTelemetry Instrumentation

```python
# instrumentation.py
from fastapi import FastAPI
from opentelemetry import trace, metrics
from opentelemetry.exporter.otlp.proto.grpc.trace_exporter import OTLPSpanExporter
from opentelemetry.instrumentation.fastapi import FastAPIInstrumentor
from opentelemetry.instrumentation.requests import RequestsInstrumentor
from opentelemetry.sdk.metrics import MeterProvider
from opentelemetry.sdk.resources import Resource
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import BatchSpanProcessor


def setup_telemetry(service_name: str, otlp_endpoint: str):
    """Configure OpenTelemetry tracing and metrics."""
    # Attach the service name so exported spans are attributed to this service
    resource = Resource.create({"service.name": service_name})

    # Tracer setup
    tracer_provider = TracerProvider(resource=resource)
    otlp_exporter = OTLPSpanExporter(endpoint=otlp_endpoint)
    tracer_provider.add_span_processor(BatchSpanProcessor(otlp_exporter))
    trace.set_tracer_provider(tracer_provider)

    # Metrics setup (no metric reader/exporter is configured in this example)
    meter_provider = MeterProvider(resource=resource)
    metrics.set_meter_provider(meter_provider)

    return trace.get_tracer(service_name)


def run_inference(payload: dict) -> dict:
    """Placeholder for the real model call, which the source does not show."""
    return {"label": "unknown", "confidence": 0.0}


# Apply to the FastAPI app
app = FastAPI()
tracer = setup_telemetry("ml-inference-api", "http://otel-collector:4317")

# Automatic instrumentation
FastAPIInstrumentor.instrument_app(app)
RequestsInstrumentor().instrument()


@app.post("/predict")
async def predict(payload: dict):
    with tracer.start_as_current_span("model-inference") as span:
        span.set_attribute("model.name", "bert-base")
        span.set_attribute("input.length", len(str(payload)))

        # Model inference logic
        result = run_inference(payload)

        span.set_attribute("prediction.confidence", result["confidence"])
        return result
```

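5. SRE Principles

The SLI / SLO / SLA Hierarchy

- **SLI (Service Level Indicator)**: a service performance metric that is actually measured (for example, request success rate or latency)
- **SLO (Service Level Objective)**: the target value for an SLI (for example, 99.9% availability)
- **SLA (Service Level Agreement)**: the contractual level agreed with customers (set looser than the SLO)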
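The relationship between an SLI and an SLO can be sketched in a few lines. The function names are hypothetical, and the sketch assumes a request-based availability SLI defined as good events divided by total events, which is one common definition among several.

```python
def availability_sli(good_events: int, total_events: int) -> float:
    """Measured SLI: the fraction of events that were good (0.0 to 1.0)."""
    if total_events <= 0:
        raise ValueError("total_events must be positive")
    if not 0 <= good_events <= total_events:
        raise ValueError("good_events must be between 0 and total_events")
    return good_events / total_events


def meets_slo(sli: float, slo_percent: float) -> bool:
    """Compare the measured SLI with the SLO target, given in percent."""
    return sli * 100 >= slo_percent


if __name__ == "__main__":
    sli = availability_sli(good_events=99_950, total_events=100_000)
    print(f"SLI = {sli:.4%}, meets 99.9% SLO: {meets_slo(sli, 99.9)}")
```

The SLI is what the monitoring system measures, the SLO is the internal bar it is compared against, and the SLA would be a separate, looser threshold with contractual consequences.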

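Calculating the Error Budget

With an SLO of 99.9%, the downtime allowed over one month (30 days) works out as follows.

Error Budget = 100% - SLO = 0.1%

Allowed downtime per month = 30 days × 24 hours × 60 minutes × 0.1% = 43.2 minutes

When the error budget is exhausted, new feature releases stop and the team focuses on reliability work. This is the core mechanism of SRE.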

| SLO level | Allowed downtime per 30-day month |
| --------- | --------------------------------- |
| 99% | 7 h 12 min |
| 99.9% | 43 min 12 s |
| 99.99% | 4 min 19 s |
| 99.999% | 26 s |
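The error budget arithmetic above can be reproduced with a short sketch. The function names are hypothetical, and a 30-day month is assumed to match the worked example; a calendar-average month (about 30.44 days) gives slightly larger figures, such as roughly 43.8 minutes at 99.9%.

```python
def allowed_downtime_minutes(slo_percent: float, days: int = 30) -> float:
    """Downtime budget in minutes for the given SLO over the given window."""
    if not 0 < slo_percent <= 100:
        raise ValueError("slo_percent must be in (0, 100]")
    budget_fraction = (100 - slo_percent) / 100
    return days * 24 * 60 * budget_fraction


def budget_remaining(slo_percent: float, downtime_minutes: float, days: int = 30) -> float:
    """Fraction of the error budget still unspent (negative when overspent)."""
    budget = allowed_downtime_minutes(slo_percent, days)
    if budget == 0:
        raise ValueError("a 100% SLO leaves no error budget")
    return 1 - downtime_minutes / budget


def releases_allowed(slo_percent: float, downtime_minutes: float, days: int = 30) -> bool:
    """Apply the policy from the text: freeze releases once the budget is spent."""
    return budget_remaining(slo_percent, downtime_minutes, days) > 0


if __name__ == "__main__":
    for slo in (99.0, 99.9, 99.99, 99.999):
        print(f"{slo}% -> {allowed_downtime_minutes(slo):.2f} minutes")
```

Tracking the remaining fraction rather than a yes/no flag lets a team slow down gradually, for example by requiring extra review once less than a quarter of the budget is left.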

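Strategies for Reducing Toil

**Toil** is operational work that is manual, repetitive, and automatable. Google SRE recommends keeping toil below 50% of total working time.

Ways to reduce toil:

1. Script or automate repetitive tasks
2. Turn runbooks into automation code
3. Improve alert quality to cut noise
4. Build self-healing systems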

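6. AI/ML Workflow Automation

Tracking Experiments with MLflow

```python
# train.py
import mlflow
import mlflow.sklearn
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score, f1_score
from sklearn.model_selection import train_test_split

# Placeholder data so the script runs end to end. The source does not show
# how X_train / X_test are loaded; replace this with the real dataset.
X, y = make_classification(n_samples=1000, n_features=20, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

mlflow.set_tracking_uri("http://mlflow-server:5000")
mlflow.set_experiment("fraud-detection-v2")

with mlflow.start_run(run_name="rf-baseline"):
    # Log hyperparameters
    params = {"n_estimators": 100, "max_depth": 10, "random_state": 42}
    mlflow.log_params(params)

    # Train the model
    model = RandomForestClassifier(**params)
    model.fit(X_train, y_train)

    # Log metrics
    y_pred = model.predict(X_test)
    mlflow.log_metric("accuracy", accuracy_score(y_test, y_pred))
    mlflow.log_metric("f1_score", f1_score(y_test, y_pred))

    # Save the model and register it in the model registry
    mlflow.sklearn.log_model(model, "model",
                             registered_model_name="fraud-detector")
```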

ML Pipelines with Argo Workflows

```yaml
# ml-pipeline.yaml
apiVersion: argoproj.io/v1alpha1
kind: Workflow
metadata:
  generateName: ml-pipeline-  # assumed; the name is missing in the source
spec:
  entrypoint: ml-pipeline
  templates:
    # DAG: data-prep -> train -> evaluate -> deploy
    - name: ml-pipeline
      dag:
        tasks:
          - name: data-prep
            template: prepare-data
          - name: train
            template: train-model
            dependencies: [data-prep]
          - name: evaluate
            template: evaluate-model
            dependencies: [train]
          - name: deploy
            template: deploy-model
            dependencies: [evaluate]

    - name: prepare-data
      container:
        image: ghcr.io/myorg/data-prep:latest
        command: [python, prepare_data.py]
        resources:
          requests:
            memory: 4Gi
            cpu: '2'

    - name: train-model
      container:
        image: ghcr.io/myorg/ml-trainer:latest
        command: [python, train.py]
        resources:
          requests:
            memory: 16Gi
            cpu: '8'
          limits:
            # GPUs must be declared under limits; requests alone are rejected
            nvidia.com/gpu: 1

    - name: evaluate-model
      container:
        image: ghcr.io/myorg/ml-evaluator:latest
        command: [python, evaluate.py]

    - name: deploy-model
      container:
        image: ghcr.io/myorg/model-deployer:latest
        command: [python, deploy.py]
```

Dockerfile for a Python ML Service

```dockerfile
# Dockerfile
FROM python:3.11-slim AS builder

WORKDIR /app

# Install dependencies in their own layer to take advantage of the build cache
COPY requirements.txt .
RUN pip install --no-cache-dir --user -r requirements.txt

FROM python:3.11-slim AS runtime

# Security: run without root privileges
RUN groupadd -r appuser && useradd -r -g appuser appuser

WORKDIR /app

# Copy installed packages from the builder stage
COPY --from=builder --chown=appuser:appuser /root/.local /home/appuser/.local
COPY --chown=appuser:appuser src/ ./src/
COPY --chown=appuser:appuser models/ ./models/

USER appuser

ENV PATH=/home/appuser/.local/bin:$PATH
ENV PYTHONUNBUFFERED=1
ENV PYTHONDONTWRITEBYTECODE=1

EXPOSE 8080

# Uses only the standard library, so the check works even when the
# requests package is not listed in requirements.txt
HEALTHCHECK --interval=30s --timeout=10s --start-period=5s --retries=3 \
    CMD python -c "import urllib.request; urllib.request.urlopen('http://localhost:8080/health', timeout=5)"

CMD ["uvicorn", "src.main:app", "--host", "0.0.0.0", "--port", "8080", "--workers", "4"]
```

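7. Security: RBAC and Secrets Management

Kubernetes RBAC

```yaml
# rbac.yaml
# Create the ServiceAccount
apiVersion: v1
kind: ServiceAccount
metadata:
  name: ml-inference-api  # assumed; the name is missing in the source
  namespace: production
---
# Least-privilege Role
apiVersion: rbac.authorization.k8s.io/v1
kind: Role
metadata:
  name: ml-inference-role  # assumed
  namespace: production
rules:
  - apiGroups: ['']
    resources: ['pods', 'services']
    verbs: ['get', 'list', 'watch']
  # Read access to a single named Secret only
  - apiGroups: ['']
    resources: ['secrets']
    resourceNames: ['ml-model-secrets']
    verbs: ['get']
---
apiVersion: rbac.authorization.k8s.io/v1
kind: RoleBinding
metadata:
  name: ml-inference-rolebinding  # assumed
  namespace: production
subjects:
  - kind: ServiceAccount
    name: ml-inference-api  # assumed; must match the ServiceAccount above
    namespace: production
roleRef:
  kind: Role
  name: ml-inference-role  # assumed; must match the Role above
  apiGroup: rbac.authorization.k8s.io
```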
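To make the effect of the Role's rules concrete, the sketch below evaluates requests against them. It is a deliberately simplified model of RBAC matching, not the Kubernetes authorizer: the function `is_allowed` is hypothetical, wildcards (`*`) and subresources are ignored, and a rule with `resourceNames` is treated as matching only requests that name one of those objects.

```python
from typing import Dict, List, Optional

# The two rules from the Role in rbac.yaml
ROLE_RULES: List[Dict[str, List[str]]] = [
    {"apiGroups": [""], "resources": ["pods", "services"],
     "verbs": ["get", "list", "watch"]},
    {"apiGroups": [""], "resources": ["secrets"],
     "resourceNames": ["ml-model-secrets"], "verbs": ["get"]},
]


def is_allowed(
    rules: List[Dict[str, List[str]]],
    verb: str,
    resource: str,
    name: Optional[str] = None,
    api_group: str = "",
) -> bool:
    """Return True if any rule permits the request. RBAC is allow-only."""
    for rule in rules:
        if api_group not in rule.get("apiGroups", []):
            continue
        if resource not in rule.get("resources", []):
            continue
        if verb not in rule.get("verbs", []):
            continue
        allowed_names = rule.get("resourceNames")
        if allowed_names and name not in allowed_names:
            continue  # rule is restricted to specific object names
        return True
    return False  # nothing matched: denied by default


if __name__ == "__main__":
    print(is_allowed(ROLE_RULES, "get", "secrets", name="ml-model-secrets"))
    print(is_allowed(ROLE_RULES, "get", "secrets", name="db-credentials"))
```

Since there are no deny rules, least privilege comes entirely from how narrowly each rule is written, which is why the Secret rule names a single object and grants only `get`.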

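Managing Secrets with HashiCorp Vault

```python
# vault_client.py
import os

import hvac


def get_secret(secret_path: str) -> dict:
    """Read a secret from Vault's KV v2 secrets engine."""
    # Address and token come from the environment, never from source code
    client = hvac.Client(
        url=os.environ["VAULT_ADDR"],
        token=os.environ["VAULT_TOKEN"]
    )

    if not client.is_authenticated():
        raise RuntimeError("Vault authentication failed")

    secret = client.secrets.kv.v2.read_secret_version(
        path=secret_path,
        mount_point="secret"
    )
    # KV v2 wraps the payload in data.data
    return secret["data"]["data"]


# On Kubernetes, use the Vault Agent Injector instead:
# secrets are injected automatically through Pod annotations
```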

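Controlling Traffic with Network Policies

```yaml
# network-policy.yaml
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: ml-inference-api-policy  # assumed; the name is missing in the source
  namespace: production
spec:
  podSelector:
    matchLabels:
      app: ml-inference-api
  policyTypes:
    - Ingress
    - Egress
  # The namespace labels below are missing in the source; the values are
  # assumptions and should be replaced with the real namespaces.
  ingress:
    - from:
        - namespaceSelector:
            matchLabels:
              kubernetes.io/metadata.name: ingress-nginx  # assumed
      ports:
        - protocol: TCP
          port: 8080
  egress:
    - to:
        - namespaceSelector:
            matchLabels:
              kubernetes.io/metadata.name: database  # assumed
      ports:
        - protocol: TCP
          port: 5432
    - to:
        - namespaceSelector:
            matchLabels:
              kubernetes.io/metadata.name: monitoring  # assumed
      ports:
        - protocol: TCP
          port: 4317 # OTLP gRPC
  # Note: once egress is restricted, DNS (port 53) must also be allowed
  # for the Pod to resolve service names.
```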

8. Incident Management

Incident Response Process

1. **Detection**: a Prometheus alert or a user report
2. **Triage**: assess severity (P1/P2/P3)
3. **Communication**: update the status page and notify stakeholders
4. **Mitigation**: shift traffic, roll back, or scale out
5. **Resolution**: fix the root cause
6. **Post-mortem**: write a blameless post-mortem

Writing an Effective Post-Mortem

A good post-mortem focuses on **improving the system**, not on individual blame.

- Record a detailed timeline
- Distinguish the root cause from the trigger
- Run a 5 Whys analysis
- List concrete action items (owner + due date)

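Conclusion

DevOps/SRE is not just a collection of tools but a **culture and a philosophy**. It reduces human error through automation, makes decisions from data, and raises system reliability through continuous improvement.

The core principles, in short:

- **Automation first**: every repetitive task becomes code
- **Measurability**: make goals explicit with SLIs and SLOs
- **Fail fast**: limit risk with canary deployments
- **Blameless culture**: improve the system rather than blaming people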

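Quiz

**Answer**: Blue-Green keeps two identical production environments and switches traffic all at once, while Canary shifts only a portion of traffic to the new version and increases it gradually.

**Explanation**: Blue-Green deployment offers immediate rollback (only a traffic switch) and no downtime, but needs twice the resources. Canary deployment validates the new version against real user traffic while limiting risk, but monitoring is more complex. Tools such as Argo Rollouts support both strategies.

**Answer**: By default, CPU utilization and memory utilization. Through the Custom Metrics API, custom metrics such as RPS (requests per second) or queue depth can also be used.

**Explanation**: In the v2 API, the HPA (HorizontalPodAutoscaler) supports the `Resource`, `Pods`, `Object`, and `External` metric types, along with `ContainerResource` for per-container targets. With a 70% CPU target, the number of Pods increases when average CPU utilization rises above that target. Installing the Prometheus Adapter makes Prometheus metrics available to the HPA.

**Answer**: The error budget is the amount of failure the SLO allows, and it is the mechanism that balances reliability against the pace of feature development.

**Explanation**: With an SLO of 99.9%, the error budget is 0.1%. While enough budget remains, new feature releases and experimental changes are possible. Once the budget is exhausted, new releases are frozen and the focus moves to reliability improvements. This lets development and operations teams negotiate release velocity on the basis of data.

**Answer**: With the pull model, Prometheus manages scrape targets centrally, which keeps configuration simple; a target that stops responding shows up as a failed scrape at the next interval; and services do not need to know the address of, or open connections to, a metrics server.

**Explanation**: With the push model (StatsD, InfluxDB, and similar), every service has to know the metrics server's address, and data can be lost during network problems. With the pull model, Prometheus manages targets through its configuration file or service discovery, so adding and removing services is flexible. For very short-lived batch jobs, however, the Pushgateway is used.

**Answer**: GitOps uses Git as the single source of truth and continuously synchronizes the declared state, whereas in traditional CI/CD the pipeline executes deployments directly and imperatively.

**Explanation**: In traditional CI/CD, the pipeline (Jenkins, GitHub Actions) runs `kubectl apply` or `helm upgrade` itself. GitOps tools (ArgoCD, Flux) continuously compare the declared state in Git with the actual cluster state and synchronize them automatically. GitOps provides drift detection, rollback by reverting a commit, and a complete audit trail, and because the controller pulls changes from inside the cluster, the CI/CD system does not need cluster credentials, which improves security.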