Mastering DevOps/SRE: From CI/CD to Kubernetes and MLOps

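Introduction
1. DevOps Fundamentals: CI/CD Pipelines
2. GitOps and Infrastructure as Code
- GitOps Principles
- Implementing IaC with Terraform
3. Mastering Kubernetes
4. Monitoring & Observability
5. SRE Principles
6. Automating AI/ML Workflows
7. Security: RBAC and Secrets Management
8. Incident Management
- Incident Response Process
- Writing an Effective Post-Mortem
Conclusion
Quiz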

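Introduction

In modern software development, DevOps and **SRE (Site Reliability Engineering)** are no longer optional. Companies such as Netflix are known for deploying thousands of times a day, and Google operates services for billions of users against availability targets as high as 99.99%. Behind both are thoroughly automated pipelines and a data-driven operations philosophy.

This guide walks through the core concepts of DevOps/SRE, hands-on Kubernetes operations, and AI/ML workflow automation, with working code at each step.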

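1. DevOps Fundamentals: CI/CD Pipelines

What Is CI/CD?

CI (Continuous Integration) is the practice of merging code frequently and building and testing it automatically on every change. CD covers two related practices: Continuous Delivery keeps verified code automatically ready for release (typically deployed up to staging, with a manual approval before production), while Continuous Deployment releases it to production automatically.

Type	Purpose	Automation scope
CI	Verify code integration	Build, test, lint
CD (Delivery)	Prepare for release	Automated up to staging
CD (Deployment)	Automatic release	Automated through production

Building CI/CD with GitHub Actions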

# .github/workflows/ci-cd.yml
name: CI/CD Pipeline

on:
  push:
    branches: [main, develop]
  pull_request:
    branches: [main]

env:
  REGISTRY: ghcr.io
  IMAGE_NAME: ${{ github.repository }}

jobs:
  test:
    name: Test & Lint
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4

      - name: Set up Python
        uses: actions/setup-python@v5
        with:
          python-version: '3.11'
          cache: 'pip'

      - name: Install dependencies
        run: |
          pip install -r requirements.txt
          pip install pytest pytest-cov flake8

      - name: Lint with flake8
        run: flake8 src/ --max-line-length=88

      - name: Run tests
        run: pytest tests/ --cov=src --cov-report=xml

      - name: Upload coverage
        uses: codecov/codecov-action@v4
        with:
          file: coverage.xml

  build:
    name: Build & Push Docker Image
    runs-on: ubuntu-latest
    needs: test
    if: github.ref == 'refs/heads/main'
    permissions:
      contents: read
      packages: write

    steps:
      - uses: actions/checkout@v4

      - name: Log in to Container Registry
        uses: docker/login-action@v3
        with:
          registry: ${{ env.REGISTRY }}
          username: ${{ github.actor }}
          password: ${{ secrets.GITHUB_TOKEN }}

      - name: Extract metadata
        id: meta
        uses: docker/metadata-action@v5
        with:
          images: ${{ env.REGISTRY }}/${{ env.IMAGE_NAME }}
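          # The raw full-SHA tag matches the image.tag value used by the deploy job
          tags: |
            type=sha,prefix=sha-
            type=ref,event=branch
            type=semver,pattern={{version}}
            type=raw,value=${{ github.sha }}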

      - name: Build and push
        uses: docker/build-push-action@v5
        with:
          context: .
          push: true
          tags: ${{ steps.meta.outputs.tags }}
          cache-from: type=gha
          cache-to: type=gha,mode=max

  deploy:
    name: Deploy to Kubernetes
    runs-on: ubuntu-latest
    needs: build
    environment: production

    steps:
      - uses: actions/checkout@v4

      - name: Configure kubectl
        uses: azure/k8s-set-context@v3
        with:
          kubeconfig: ${{ secrets.KUBECONFIG }}

      - name: Deploy with Helm
        run: |
          helm upgrade --install my-app ./helm/my-app \
            --namespace production \
            --set image.tag=${{ github.sha }} \
            --wait --timeout=5m

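Comparing Deployment Strategies

Blue-Green deployment: Two identical production environments (Blue and Green) are kept running. The new version is deployed to Green, and traffic is then switched over all at once. Rollback is immediate, but the approach needs twice the resources.

Canary deployment: Only a fraction of traffic (for example, 5%) is routed to the new version, and that share is increased gradually. The new version is validated against real users while the blast radius stays small.

# Argo Rollouts canary example (selector and pod template omitted for brevity)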
apiVersion: argoproj.io/v1alpha1
kind: Rollout
metadata:
  name: my-app
spec:
  replicas: 10
  strategy:
    canary:
      steps:
        - setWeight: 10
        - pause: { duration: 5m }
        - setWeight: 30
        - pause: { duration: 10m }
        - setWeight: 60
        - pause: { duration: 10m }
        - setWeight: 100
      canaryService: my-app-canary
      stableService: my-app-stable
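
To get a feel for what the setWeight steps above mean when no traffic router is configured, the sketch below approximates each weight by a pod count. It assumes the canary count is the weight applied to the replica total, rounded up; the real controller also takes maxSurge and maxUnavailable into account, so the numbers are illustrative only. The function name canary_replicas is hypothetical.

```python
def canary_replicas(total_replicas: int, weight_percent: int) -> tuple[int, int]:
    """Return (canary, stable) pod counts that approximate a traffic weight."""
    if not 0 <= weight_percent <= 100:
        raise ValueError("weight_percent must be between 0 and 100")
    # Integer ceiling of total_replicas * weight_percent / 100
    canary = (total_replicas * weight_percent + 99) // 100
    return canary, total_replicas - canary


# Walk through the weights used in the Rollout steps with replicas: 10
for weight in (10, 30, 60, 100):
    canary, stable = canary_replicas(10, weight)
    print(f"setWeight {weight:>3}: canary={canary}, stable={stable}")
```

With only a handful of replicas the achievable weights are coarse, which is one reason larger setups pair canary steps with a traffic router rather than relying on pod counts.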

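2. GitOps and Infrastructure as Code

GitOps Principles

GitOps is an operating model that uses Git as the single source of truth.

- Declarative: the desired system state is declared as code
- Versioned: every change is tracked in Git history
- Automated: a change in Git triggers automatic synchronization
- Auditable: PR-based changes record who changed what, and why

ArgoCD and Flux are the best-known tools.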
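
The synchronization idea can be sketched as a diff between the state declared in Git and the state observed in the cluster. This is a toy model, not how ArgoCD or Flux are implemented; the plan_sync function and the resource dictionaries are hypothetical.

```python
def plan_sync(desired: dict[str, dict], live: dict[str, dict]) -> dict[str, list[str]]:
    """Compare desired (Git) state with live (cluster) state and list required actions."""
    return {
        # Declared in Git but missing from the cluster
        "create": sorted(desired.keys() - live.keys()),
        # Present in both, but the live spec has drifted from Git
        "update": sorted(k for k in desired.keys() & live.keys() if desired[k] != live[k]),
        # Running in the cluster but no longer declared in Git
        "delete": sorted(live.keys() - desired.keys()),
    }


desired = {
    "deployment/my-app": {"replicas": 3, "image": "my-app:v2"},
    "service/my-app": {"port": 80},
}
live = {
    "deployment/my-app": {"replicas": 3, "image": "my-app:v1"},
    "configmap/old": {"key": "value"},
}

plan = plan_sync(desired, live)
print(plan)
```

A GitOps agent runs a comparison like this continuously, so manual changes made directly in the cluster show up as drift and are reverted to what Git declares.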

Implementing IaC with Terraform

# main.tf - Provisioning an AWS EKS cluster
terraform {
  required_providers {
    aws = {
      source  = "hashicorp/aws"
      version = "~> 5.0"
    }
  }
  backend "s3" {
    bucket = "my-terraform-state"
    key    = "prod/eks/terraform.tfstate"
    region = "ap-northeast-2"
  }
}

module "eks" {
  source  = "terraform-aws-modules/eks/aws"
  version = "~> 20.0"

  cluster_name    = "prod-cluster"
  cluster_version = "1.29"

  # module.vpc is assumed to be defined elsewhere in this configuration
  vpc_id     = module.vpc.vpc_id
  subnet_ids = module.vpc.private_subnets

  eks_managed_node_groups = {
    general = {
      instance_types = ["m5.xlarge"]
      min_size       = 2
      max_size       = 10
      desired_size   = 3
    }
    gpu = {
      instance_types = ["g4dn.xlarge"]
      min_size       = 0
      max_size       = 5
      desired_size   = 1
      taints = [{
        key    = "nvidia.com/gpu"
        value  = "true"
        effect = "NO_SCHEDULE"
      }]
    }
  }
}

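3. Mastering Kubernetes

Understanding the Core Resources

The table below summarizes the core Kubernetes objects.

Resource	Role
Pod	Unit of execution; a group of one or more containers
Deployment	Manages Pod replicas and rolling updates
Service	Network endpoint for a set of Pods
HPA	Automatic horizontal scaling based on CPU/memory
ConfigMap	Keeps environment variables and config files out of the image
Secret	Holds sensitive data (passwords, tokens)
Ingress	Routes external HTTP traffic

A Practical Deployment + HPA Manifest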

# deployment.yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: ml-inference-api
  namespace: production
  labels:
    app: ml-inference-api
    version: v1
spec:
  replicas: 3
  selector:
    matchLabels:
      app: ml-inference-api
  strategy:
    type: RollingUpdate
    rollingUpdate:
      maxSurge: 1
      maxUnavailable: 0
  template:
    metadata:
      labels:
        app: ml-inference-api
        version: v1
      annotations:
        prometheus.io/scrape: 'true'
        prometheus.io/port: '8080'
        prometheus.io/path: '/metrics'
    spec:
      containers:
        - name: api
          image: ghcr.io/myorg/ml-inference-api:sha-abc123
          ports:
            - containerPort: 8080
          env:
            - name: MODEL_NAME
              valueFrom:
                configMapKeyRef:
                  name: ml-config
                  key: model_name
            - name: DB_PASSWORD
              valueFrom:
                secretKeyRef:
                  name: db-secret
                  key: password
          resources:
            requests:
              cpu: '500m'
              memory: '512Mi'
            limits:
              cpu: '2000m'
              memory: '2Gi'
          readinessProbe:
            httpGet:
              path: /health
              port: 8080
            initialDelaySeconds: 10
            periodSeconds: 5
          livenessProbe:
            httpGet:
              path: /health
              port: 8080
            initialDelaySeconds: 30
            periodSeconds: 10
---
# hpa.yaml
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: ml-inference-api-hpa
  namespace: production
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: ml-inference-api
  minReplicas: 3
  maxReplicas: 20
  metrics:
    - type: Resource
      resource:
        name: cpu
        target:
          type: Utilization
          averageUtilization: 70
    - type: Resource
      resource:
        name: memory
        target:
          type: Utilization
          averageUtilization: 80
    # Custom per-pod metric; requires a custom metrics adapter such as Prometheus Adapter
    - type: Pods
      pods:
        metric:
          name: http_requests_per_second
        target:
          type: AverageValue
          averageValue: '100'
  behavior:
    scaleUp:
      stabilizationWindowSeconds: 60
      policies:
        - type: Pods
          value: 4
          periodSeconds: 60
    scaleDown:
      stabilizationWindowSeconds: 300

Package Management with Helm Charts

Helm is the package manager for Kubernetes. It manages complex application deployments as templates.

# helm/my-app/values.yaml
replicaCount: 3

image:
  repository: ghcr.io/myorg/my-app
  pullPolicy: IfNotPresent
  tag: 'latest'

service:
  type: ClusterIP
  port: 80
  targetPort: 8080

ingress:
  enabled: true
  className: nginx
  annotations:
    cert-manager.io/cluster-issuer: letsencrypt-prod
  hosts:
    - host: api.example.com
      paths:
        - path: /
          pathType: Prefix
  tls:
    - secretName: api-tls
      hosts:
        - api.example.com

resources:
  requests:
    cpu: 500m
    memory: 512Mi
  limits:
    cpu: 2000m
    memory: 2Gi

autoscaling:
  enabled: true
  minReplicas: 3
  maxReplicas: 20
  targetCPUUtilizationPercentage: 70

postgresql:
  enabled: true
  auth:
    database: myapp
    existingSecret: db-credentials

redis:
  enabled: true
  architecture: replication

4. Monitoring & Observability

The Three Pillars of Observability

Pillar	Tools	Use
Metrics	Prometheus, Grafana	Numeric data, dashboards
Logs	Loki, Elasticsearch	Event records, debugging
Traces	Jaeger, Tempo	Distributed request tracing

Prometheus Alerting Rules

# prometheus-rules.yaml
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: ml-service-alerts
  namespace: monitoring
spec:
  groups:
    - name: ml-service.rules
      interval: 30s
      rules:
        - alert: HighErrorRate
          expr: |
            sum(rate(http_requests_total{status=~"5.."}[5m]))
            /
            sum(rate(http_requests_total[5m])) > 0.05
          for: 2m
          labels:
            severity: critical
          annotations:
            summary: 'High error rate detected'
            description: 'Error rate is {{ $value | humanizePercentage }} for the last 5 minutes'

        - alert: SlowResponseTime
          expr: |
            histogram_quantile(0.99,
              sum(rate(http_request_duration_seconds_bucket[5m])) by (le, service)
            ) > 1.0
          for: 5m
          labels:
            severity: warning
          annotations:
            summary: 'Slow p99 latency'
            description: 'p99 latency is {{ $value }}s for {{ $labels.service }}'

        - alert: PodCrashLooping
          expr: |
            increase(kube_pod_container_status_restarts_total[15m]) > 3
          for: 0m
          labels:
            severity: critical
          annotations:
            summary: 'Pod crash looping'
            description: 'Pod {{ $labels.namespace }}/{{ $labels.pod }} is crash looping'

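        - alert: HighMemoryUsage
          # Containers without a memory limit report a limit of 0; filter them out
          # so the division does not yield +Inf and fire a false alert
          expr: |
            container_memory_usage_bytes
            /
            (container_spec_memory_limit_bytes > 0)
            > 0.85
          for: 5m
          labels:
            severity: warning
          annotations:
            summary: 'High memory usage'
            description: 'Container in {{ $labels.namespace }}/{{ $labels.pod }} is above 85% of its memory limit'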
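
The for: field in these rules is easy to misread: an alert fires only after its expression has stayed true across consecutive evaluations for the whole duration. The sketch below models that pending/firing behaviour for the HighErrorRate rule (threshold 0.05, for: 2m, evaluated every 30s). It is a simplified model, not Prometheus code; the AlertRule class and the sample error ratios are made up for illustration.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class AlertRule:
    threshold: float                      # expression fires when value > threshold
    for_seconds: float                    # how long the condition must hold
    active_since: Optional[float] = None  # time the condition first became true

    def evaluate(self, now: float, value: float) -> str:
        """Return the alert state after one rule evaluation."""
        if value <= self.threshold:
            self.active_since = None      # condition cleared: reset the timer
            return "inactive"
        if self.active_since is None:
            self.active_since = now
        if now - self.active_since >= self.for_seconds:
            return "firing"
        return "pending"


rule = AlertRule(threshold=0.05, for_seconds=120)
# Hypothetical error ratios observed at 30-second evaluation intervals
ratios = [0.02, 0.08, 0.09, 0.07, 0.10, 0.11, 0.01]
states = [rule.evaluate(now=i * 30, value=r) for i, r in enumerate(ratios)]
print(states)
```

A single evaluation below the threshold resets the timer, which is why a short for: window filters out brief spikes at the cost of slower detection.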

OpenTelemetry Instrumentation

# instrumentation.py
from fastapi import FastAPI
from opentelemetry import trace, metrics
from opentelemetry.exporter.otlp.proto.grpc.trace_exporter import OTLPSpanExporter
from opentelemetry.instrumentation.fastapi import FastAPIInstrumentor
from opentelemetry.instrumentation.requests import RequestsInstrumentor
from opentelemetry.sdk.metrics import MeterProvider
from opentelemetry.sdk.resources import Resource
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import BatchSpanProcessor

def setup_telemetry(service_name: str, otlp_endpoint: str):
    """Configure OpenTelemetry tracing and metrics."""
    # Attach the service name so backends can group telemetry by service
    resource = Resource.create({"service.name": service_name})

    # Tracer setup: batch spans and export them over OTLP/gRPC
    tracer_provider = TracerProvider(resource=resource)
    otlp_exporter = OTLPSpanExporter(endpoint=otlp_endpoint)
    tracer_provider.add_span_processor(BatchSpanProcessor(otlp_exporter))
    trace.set_tracer_provider(tracer_provider)

    # Meter setup (no metric reader is attached here, so nothing is exported yet)
    meter_provider = MeterProvider(resource=resource)
    metrics.set_meter_provider(meter_provider)

    return trace.get_tracer(service_name)

def run_inference(payload: dict) -> dict:
    """Placeholder for the real model call; returns a dummy prediction."""
    return {"label": "unknown", "confidence": 0.0}

# Apply to the FastAPI app
app = FastAPI()
tracer = setup_telemetry("ml-inference-api", "http://otel-collector:4317")

# Automatic instrumentation
FastAPIInstrumentor.instrument_app(app)
RequestsInstrumentor().instrument()

@app.post("/predict")
async def predict(payload: dict):
    with tracer.start_as_current_span("model-inference") as span:
        span.set_attribute("model.name", "bert-base")
        span.set_attribute("input.length", len(str(payload)))

        # Model inference logic
        result = run_inference(payload)

        span.set_attribute("prediction.confidence", result["confidence"])
        return result

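5. SRE Principles

The SLI / SLO / SLA Hierarchy

- SLI (Service Level Indicator): a service performance indicator that is actually measured (for example, request success rate or latency)
- SLO (Service Level Objective): the target value for an SLI (for example, 99.9% availability)
- SLA (Service Level Agreement): the contractual level agreed with customers (set looser than the SLO)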
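
As a small illustration of how an SLI relates to an SLO, the sketch below computes a request-based availability SLI and checks it against a 99.9% objective. The request counts are made-up numbers, and the function names are hypothetical.

```python
def availability_sli(good_requests: int, total_requests: int) -> float:
    """Fraction of requests served successfully; 1.0 when there was no traffic."""
    if total_requests == 0:
        return 1.0
    return good_requests / total_requests


def meets_slo(sli: float, slo: float) -> bool:
    """True when the measured indicator is at or above the objective."""
    return sli >= slo


# Hypothetical counts for one measurement window
sli = availability_sli(good_requests=999_412, total_requests=1_000_000)
print(f"SLI = {sli:.4%}, meets 99.9% SLO: {meets_slo(sli, 0.999)}")
```

The SLA would then be set below the SLO (say, 99.5% against a 99.9% objective) so that an SLO miss acts as an early warning before any contractual breach.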

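Calculating the Error Budget

If the SLO is 99.9%, the downtime allowed over one 30-day month is:

Error Budget = 100% - SLO = 0.1%
Allowed downtime per month = 30 days × 24 hours × 60 minutes × 0.1% = 43.2 minutes

Once the error budget is exhausted, new feature releases are paused and the team focuses on reliability work. This is the core mechanism of SRE.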
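
The same arithmetic, together with the release rule it drives, can be written as a short sketch. It assumes a time-based budget over a 30-day window; the function names and the downtime figures in the usage lines are hypothetical.

```python
def error_budget_minutes(slo: float, days: int = 30) -> float:
    """Total downtime allowed in the window for a given SLO (e.g. 0.999)."""
    return days * 24 * 60 * (1 - slo)


def remaining_budget_minutes(slo: float, downtime_minutes: float, days: int = 30) -> float:
    """Budget left after subtracting downtime already incurred in the window."""
    return error_budget_minutes(slo, days) - downtime_minutes


def can_deploy(slo: float, downtime_minutes: float, days: int = 30) -> bool:
    """Feature releases continue only while some error budget remains."""
    return remaining_budget_minutes(slo, downtime_minutes, days) > 0


print(f"Budget at 99.9%: {error_budget_minutes(0.999):.1f} minutes")
print("Deploy after 10 min of downtime:", can_deploy(0.999, 10))
print("Deploy after 50 min of downtime:", can_deploy(0.999, 50))
```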

SLO level	Allowed downtime per 30-day month
99%	7 hours 12 minutes
99.9%	43 minutes 12 seconds
99.99%	4 minutes 19 seconds
99.999%	26 seconds

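Strategies for Reducing Toil

Toil is operational work that is manual, repetitive, and automatable. Google's SRE guidance recommends keeping toil below 50% of an engineer's working time.

Ways to reduce toil:

- Script or automate repetitive tasks
- Turn runbooks into automation code
- Improve alert quality to cut noise
- Build self-healing systems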

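6. Automating AI/ML Workflows

Experiment Tracking with MLflow

# train.py
import mlflow
import mlflow.sklearn
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score, f1_score
from sklearn.model_selection import train_test_split

mlflow.set_tracking_uri("http://mlflow-server:5000")
mlflow.set_experiment("fraud-detection-v2")

# Stand-in data so the script runs end to end; replace with the real fraud dataset
X, y = make_classification(n_samples=1000, n_features=20, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

with mlflow.start_run(run_name="rf-baseline"):
    # Log hyperparameters
    params = {"n_estimators": 100, "max_depth": 10, "random_state": 42}
    mlflow.log_params(params)

    # Train the model
    model = RandomForestClassifier(**params)
    model.fit(X_train, y_train)

    # Log metrics
    y_pred = model.predict(X_test)
    mlflow.log_metric("accuracy", accuracy_score(y_test, y_pred))
    mlflow.log_metric("f1_score", f1_score(y_test, y_pred))

    # Save the model and register it in the model registry
    mlflow.sklearn.log_model(model, "model",
        registered_model_name="fraud-detector")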

ML Pipelines with Argo Workflows

# ml-pipeline.yaml
apiVersion: argoproj.io/v1alpha1
kind: Workflow
metadata:
  name: ml-training-pipeline
spec:
  entrypoint: ml-pipeline
  templates:
    - name: ml-pipeline
      dag:
        tasks:
          - name: data-prep
            template: prepare-data
          - name: train
            template: train-model
            dependencies: [data-prep]
          - name: evaluate
            template: evaluate-model
            dependencies: [train]
          - name: deploy
            template: deploy-model
            dependencies: [evaluate]

    - name: prepare-data
      container:
        image: ghcr.io/myorg/data-prep:latest
        command: [python, prepare_data.py]
        resources:
          requests:
            memory: 4Gi
            cpu: '2'

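    - name: train-model
      # Tolerate the GPU node taint so the pod can be scheduled on GPU nodes
      tolerations:
        - key: nvidia.com/gpu
          operator: Exists
          effect: NoSchedule
      container:
        image: ghcr.io/myorg/ml-trainer:latest
        command: [python, train.py]
        resources:
          requests:
            memory: 16Gi
            cpu: '8'
          # Extended resources such as GPUs must be declared under limits
          limits:
            nvidia.com/gpu: '1'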

    - name: evaluate-model
      container:
        image: ghcr.io/myorg/ml-evaluator:latest
        command: [python, evaluate.py]

    - name: deploy-model
      container:
        image: ghcr.io/myorg/model-deployer:latest
        command: [python, deploy.py]
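
The dependencies fields in the dag section determine execution order: a task starts only after everything it depends on has finished. The sketch below resolves that order for the four tasks above using Python's standard graphlib module. It illustrates the ordering rule only and is not how Argo schedules pods; the execution_order helper is hypothetical.

```python
from graphlib import TopologicalSorter

# Each task maps to the set of tasks it depends on, mirroring the workflow DAG
dependencies = {
    "data-prep": set(),
    "train": {"data-prep"},
    "evaluate": {"train"},
    "deploy": {"evaluate"},
}


def execution_order(deps: dict[str, set[str]]) -> list[str]:
    """Return the tasks in an order that respects every dependency."""
    return list(TopologicalSorter(deps).static_order())


order = execution_order(dependencies)
print(" -> ".join(order))
```

Tasks with no dependency between them could run in parallel; this pipeline is a straight chain, so each step waits for the previous one.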

Dockerfile for a Python ML Service

# Dockerfile
FROM python:3.11-slim AS builder

WORKDIR /app

# Install dependencies in their own layer so Docker can cache it
COPY requirements.txt .
RUN pip install --no-cache-dir --user -r requirements.txt

FROM python:3.11-slim AS runtime

# Security: run as a non-root user
RUN groupadd -r appuser && useradd -r -g appuser appuser

WORKDIR /app

# Copy installed packages from the builder stage
COPY --from=builder --chown=appuser:appuser /root/.local /home/appuser/.local
COPY --chown=appuser:appuser src/ ./src/
COPY --chown=appuser:appuser models/ ./models/

USER appuser

ENV PATH=/home/appuser/.local/bin:$PATH
ENV PYTHONUNBUFFERED=1
ENV PYTHONDONTWRITEBYTECODE=1

EXPOSE 8080

# Uses only the standard library, so the check does not depend on the requests package
HEALTHCHECK --interval=30s --timeout=10s --start-period=5s --retries=3 \
  CMD python -c "import urllib.request; urllib.request.urlopen('http://localhost:8080/health', timeout=5)"

CMD ["uvicorn", "src.main:app", "--host", "0.0.0.0", "--port", "8080", "--workers", "4"]

7. Security: RBAC and Secrets Management

Kubernetes RBAC

# rbac.yaml
# Create the ServiceAccount
apiVersion: v1
kind: ServiceAccount
metadata:
  name: ml-service-account
  namespace: production
---
# Least-privilege Role
apiVersion: rbac.authorization.k8s.io/v1
kind: Role
metadata:
  name: ml-service-role
  namespace: production
rules:
  - apiGroups: ['']
    resources: ['pods', 'services']
    verbs: ['get', 'list', 'watch']
  - apiGroups: ['']
    resources: ['secrets']
    resourceNames: ['ml-model-secrets']
    verbs: ['get']
---
apiVersion: rbac.authorization.k8s.io/v1
kind: RoleBinding
metadata:
  name: ml-service-rolebinding
  namespace: production
subjects:
  - kind: ServiceAccount
    name: ml-service-account
    namespace: production
roleRef:
  kind: Role
  apiGroup: rbac.authorization.k8s.io
  name: ml-service-role
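
To see how the Role's rules are read, the sketch below evaluates a few requests against the same rule list. It is a deliberately simplified model of RBAC matching (no wildcards, subresources, or ClusterRoles), and the is_allowed function is hypothetical rather than part of any Kubernetes client.

```python
from typing import Optional

# The rules from ml-service-role, expressed as plain dictionaries
RULES = [
    {"apiGroups": [""], "resources": ["pods", "services"], "verbs": ["get", "list", "watch"]},
    {"apiGroups": [""], "resources": ["secrets"], "resourceNames": ["ml-model-secrets"], "verbs": ["get"]},
]


def is_allowed(rules: list[dict], verb: str, resource: str,
               name: Optional[str] = None, api_group: str = "") -> bool:
    """Return True if any rule grants the verb on the resource (RBAC is allow-only)."""
    for rule in rules:
        if api_group not in rule["apiGroups"]:
            continue
        if resource not in rule["resources"] or verb not in rule["verbs"]:
            continue
        # When resourceNames is set, the rule covers only those named objects
        names = rule.get("resourceNames")
        if names and name not in names:
            continue
        return True
    return False


print(is_allowed(RULES, "list", "pods"))
print(is_allowed(RULES, "get", "secrets", name="ml-model-secrets"))
print(is_allowed(RULES, "get", "secrets", name="db-secret"))
print(is_allowed(RULES, "delete", "pods"))
```

Because there are no deny rules, anything not matched by some rule is refused, which is what makes a short rule list like this one a least-privilege policy.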

Managing Secrets with HashiCorp Vault

# vault_client.py
import hvac
import os

def get_secret(secret_path: str) -> dict:
    """Read a secret from Vault's KV v2 engine."""
    client = hvac.Client(
        url=os.environ["VAULT_ADDR"],
        token=os.environ["VAULT_TOKEN"]
    )

    if not client.is_authenticated():
        raise RuntimeError("Vault authentication failed")

    secret = client.secrets.kv.v2.read_secret_version(
        path=secret_path,
        mount_point="secret"
    )
    # KV v2 nests the payload under data.data
    return secret["data"]["data"]

# On Kubernetes, use the Vault Agent Injector instead:
# secrets are injected automatically based on Pod annotations

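Controlling Traffic with Network Policy

# network-policy.yaml
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: ml-service-netpol
  namespace: production
spec:
  podSelector:
    matchLabels:
      app: ml-inference-api
  policyTypes:
    - Ingress
    - Egress
  ingress:
    - from:
        # kubernetes.io/metadata.name is set automatically on every namespace
        - namespaceSelector:
            matchLabels:
              kubernetes.io/metadata.name: ingress-nginx
        # Let Prometheus in the monitoring namespace scrape /metrics
        - namespaceSelector:
            matchLabels:
              kubernetes.io/metadata.name: monitoring
      ports:
        - protocol: TCP
          port: 8080
  egress:
    # Allow DNS lookups; without this rule the pod cannot resolve service names
    - to:
        - namespaceSelector:
            matchLabels:
              kubernetes.io/metadata.name: kube-system
      ports:
        - protocol: UDP
          port: 53
        - protocol: TCP
          port: 53
    - to:
        - namespaceSelector:
            matchLabels:
              kubernetes.io/metadata.name: database
      ports:
        - protocol: TCP
          port: 5432
    - to:
        - namespaceSelector:
            matchLabels:
              kubernetes.io/metadata.name: monitoring
      ports:
        - protocol: TCP
          port: 4317 # OTLP gRPC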

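8. Incident Management

Incident Response Process

- Detection: a Prometheus alert or a user report
- Triage: assess severity (P1/P2/P3)
- Communication: update the status page and notify stakeholders
- Mitigation: shift traffic, roll back, or scale out
- Resolution: fix the root cause
- Post-Mortem: write a blameless post-mortem

Writing an Effective Post-Mortem

A good post-mortem focuses on improving the system, not on assigning individual blame.

- A detailed timeline
- A clear distinction between root cause and trigger
- 5 Whys analysis
- Concrete action items (owner + due date)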

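Conclusion

DevOps/SRE is not just a collection of tools; it is a culture and a philosophy. It reduces human error through automation, bases decisions on data, and raises system reliability through continuous improvement.

The core principles, in summary:

- Automation first: every repetitive task becomes code
- Measurability: make goals explicit with SLIs and SLOs
- Fail fast: limit risk with canary deployments
- Blameless culture: improve the system rather than blame people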

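Quiz

Q1. In a CI/CD pipeline, what is the difference between Blue-Green and Canary deployment?

Answer: Blue-Green keeps two identical production environments and switches traffic over all at once, whereas Canary shifts only part of the traffic to the new version and increases it gradually.

Explanation: Blue-Green deployment offers immediate rollback (only the traffic switch is reversed) and no downtime, but it needs twice the resources. Canary deployment validates the new version with real user traffic while limiting risk, but the monitoring is more complex. Tools such as Argo Rollouts support both strategies.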

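Q2. Which metrics does the Kubernetes HPA scale on?

Answer: By default, CPU and memory utilization. Through the Custom Metrics API it can also use custom metrics such as RPS (requests per second) or queue depth.

Explanation: In the v2 API, the HPA (HorizontalPodAutoscaler) supports the Resource, Pods, Object, and External metric types (plus ContainerResource in newer versions). With a 70% CPU target, the HPA adds pods when average CPU utilization across the pods rises above that target. Installing Prometheus Adapter makes Prometheus metrics available to the HPA.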
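
The scaling decision itself follows the formula given in the Kubernetes documentation: desired replicas = ceil(current replicas × current metric / target metric), skipped when the ratio is within a small tolerance of 1. The sketch below applies it with the min/max bounds from the earlier manifest. It ignores the behavior policies, readiness handling, and multi-metric selection, and the 0.1 tolerance is the documented default rather than something set in this guide.

```python
import math


def desired_replicas(current_replicas: int, current_metric: float, target_metric: float,
                     min_replicas: int = 3, max_replicas: int = 20,
                     tolerance: float = 0.1) -> int:
    """Simplified HPA calculation for a single metric."""
    ratio = current_metric / target_metric
    # Close enough to the target: leave the replica count alone
    if abs(ratio - 1.0) <= tolerance:
        return current_replicas
    desired = math.ceil(current_replicas * ratio)
    # Clamp to the configured bounds
    return max(min_replicas, min(max_replicas, desired))


print(desired_replicas(3, current_metric=90, target_metric=70))    # scale up
print(desired_replicas(3, current_metric=72, target_metric=70))    # within tolerance
print(desired_replicas(10, current_metric=20, target_metric=70))   # scale down
```

When several metrics are configured, the HPA computes a proposal per metric and uses the largest one, which is why adding a metric can only make scaling more aggressive.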

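Q3. What role does the error budget play in SRE?

Answer: The error budget is the amount of failure the SLO permits. It is the mechanism that balances reliability against the pace of feature development.

Explanation: With a 99.9% SLO, the error budget is 0.1%. While enough budget remains, new feature releases and experimental changes can go ahead. Once it is exhausted, new deployments are frozen and the focus shifts to reliability improvements. This lets development and operations teams negotiate release velocity on the basis of data.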

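Q4. What are the advantages of Prometheus's pull-based metric collection?

Answer: With the pull model, Prometheus manages its scrape targets centrally, so configuration stays simple. A target that goes down shows up as a failed scrape at the next interval (its up metric drops to 0). Services also do not need to know the monitoring server's address or hold credentials for it.

Explanation: With push-based systems (StatsD, InfluxDB, and so on), each service has to know the metrics server's address, and data can be lost during network problems. With pull, Prometheus manages targets through its configuration file or service discovery, so adding and removing services is flexible. For very short-lived batch jobs, however, Pushgateway is used instead.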
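
The pull model can be demonstrated with nothing but the standard library: a service exposes its current values on /metrics, and the collector fetches them when it chooses. This is a toy sketch of the idea, not a Prometheus client; the handler, the scrape function, and the counter value are all hypothetical.

```python
import threading
import urllib.request
from http.server import BaseHTTPRequestHandler, HTTPServer

REQUEST_COUNT = 42  # hypothetical counter value held by the service


class MetricsHandler(BaseHTTPRequestHandler):
    def do_GET(self):
        if self.path != "/metrics":
            self.send_error(404)
            return
        # Prometheus text exposition format: a TYPE comment, then "name value"
        body = (
            "# TYPE http_requests_total counter\n"
            f"http_requests_total {REQUEST_COUNT}\n"
        ).encode()
        self.send_response(200)
        self.send_header("Content-Type", "text/plain; version=0.0.4")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)

    def log_message(self, *args):
        pass  # keep the example output quiet


def scrape(url: str) -> dict[str, float]:
    """Fetch a metrics endpoint and parse the unlabelled samples."""
    with urllib.request.urlopen(url, timeout=5) as resp:
        text = resp.read().decode()
    samples = {}
    for line in text.splitlines():
        if line and not line.startswith("#"):
            name, value = line.rsplit(" ", 1)
            samples[name] = float(value)
    return samples


# The service side: listen on a free local port
server = HTTPServer(("127.0.0.1", 0), MetricsHandler)
threading.Thread(target=server.serve_forever, daemon=True).start()

# The collector side: it decides when to pull
samples = scrape(f"http://127.0.0.1:{server.server_port}/metrics")
server.shutdown()
server.server_close()
print(samples)
```

Note that the service never needs the collector's address; a failed request in scrape is itself the signal that the target is unhealthy.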

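Q5. How does GitOps differ from traditional CI/CD?

Answer: GitOps uses Git as the single source of truth and continuously synchronizes the cluster to the declared state, whereas in traditional CI/CD the pipeline itself runs the deployment imperatively.

Explanation: In traditional CI/CD, the pipeline (Jenkins, GitHub Actions) runs kubectl apply or helm upgrade directly. GitOps tools (ArgoCD, Flux) continuously compare the declarative state in Git with the actual cluster state and synchronize them automatically. This enables drift detection, rollback by reverting a commit, and a complete audit trail. Because the agent runs inside the cluster and pulls changes, the CI/CD system does not need cluster credentials, which strengthens security.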