
관측성(Observability) 완전 가이드 2025: Prometheus, Grafana, OpenTelemetry로 시스템을 투명하게

들어가며: 관측성이 중요한 이유

"모니터링은 시스템이 잘 동작하는지 확인하는 것이고, 관측성은 시스템이 왜 잘못 동작하는지 이해하는 것이다."

현대의 분산 시스템에서는 단순히 CPU 사용률이나 메모리를 모니터링하는 것만으로는 부족합니다. 마이크로서비스 아키텍처, 컨테이너, 서버리스 환경에서는 하나의 요청이 수십 개의 서비스를 거치며, 문제의 원인을 파악하려면 시스템 내부를 들여다볼 수 있는 **관측성(Observability)**이 필요합니다.


1. 관측성의 3가지 축 (Three Pillars)

메트릭 (Metrics)

숫자로 표현되는 시계열 데이터입니다. 시스템 상태의 집계(aggregated) 뷰를 제공합니다.

  • Counter: 단조 증가 값 (예: 총 요청 수)
  • Gauge: 올라가고 내려가는 값 (예: 현재 메모리 사용량)
  • Histogram: 값의 분포 (예: 응답 시간 분포)
  • Summary: 클라이언트 측에서 계산된 분위수
# Counter 예시
http_requests_total{method="GET", path="/api/users", status="200"} 15234

# Gauge 예시
node_memory_usage_bytes{instance="web-01"} 1073741824

# Histogram 예시
http_request_duration_seconds_bucket{le="0.1"} 24054
http_request_duration_seconds_bucket{le="0.5"} 33444
http_request_duration_seconds_bucket{le="1.0"} 34055
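위 Histogram 예시의 버킷이 어떻게 채워지는지는 간단한 코드로 확인할 수 있습니다. 아래는 Prometheus 방식의 누적(cumulative) 버킷 동작을 흉내 낸 개념 설명용 스케치이며, 실제 계측에는 prom-client 같은 클라이언트 라이브러리를 사용합니다.

```typescript
// Prometheus 히스토그램의 누적 버킷 동작을 흉내 낸 개념용 스케치입니다.
// 실제 계측 코드가 아니라 le 버킷의 의미를 보여주기 위한 가정 코드입니다.
type Histogram = { buckets: Map<number, number>; count: number; sum: number };

function makeHistogram(bounds: number[]): Histogram {
  // le(less than or equal) 경계마다 카운트를 0으로 초기화
  return { buckets: new Map(bounds.map((le) => [le, 0])), count: 0, sum: 0 };
}

function observe(h: Histogram, value: number): void {
  h.count += 1;
  h.sum += value;
  // 버킷은 누적식: value <= le 인 모든 버킷이 함께 증가합니다
  for (const le of h.buckets.keys()) {
    if (value <= le) h.buckets.set(le, h.buckets.get(le)! + 1);
  }
}

const h = makeHistogram([0.1, 0.5, 1.0]);
for (const v of [0.05, 0.3, 0.7]) observe(h, v);

console.log(h.buckets.get(0.1)); // 1 (0.05만 포함)
console.log(h.buckets.get(0.5)); // 2 (0.05, 0.3)
console.log(h.buckets.get(1.0)); // 3 (전부 포함)
```

버킷이 누적식이기 때문에, 뒤에서 다룰 histogram_quantile()이 버킷 간 보간으로 분위수를 추정할 수 있습니다.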

로그 (Logs)

이벤트의 텍스트 기록입니다. 개별 이벤트에 대한 상세 정보를 제공합니다.

{
  "timestamp": "2025-03-15T10:30:45.123Z",
  "level": "ERROR",
  "service": "payment-service",
  "traceId": "abc123def456",
  "spanId": "span789",
  "message": "Payment processing failed",
  "userId": "user-42",
  "orderId": "order-1234",
  "error": "Timeout connecting to payment gateway",
  "duration_ms": 5000
}

**구조화 로깅(Structured Logging)**을 사용하면 검색과 분석이 훨씬 쉬워집니다.

트레이스 (Traces)

요청이 여러 서비스를 거치는 전체 경로를 추적합니다.

[Trace: abc123def456]
├── [Span: API Gateway] 2ms
│   ├── [Span: Auth Service] 5ms
│   │   └── [Span: Redis Cache Lookup] 1ms
│   ├── [Span: User Service] 15ms
│   │   └── [Span: PostgreSQL Query] 8ms
│   └── [Span: Payment Service] 5003ms  ← 병목!
│       └── [Span: External Payment API] 5000ms (TIMEOUT)
└── Total: 5025ms

세 가지 축이 결합되면 "무엇이(What) 잘못되었고, 왜(Why) 잘못되었으며, 어디서(Where) 잘못되었는지"를 모두 파악할 수 있습니다.


2. Prometheus

아키텍처

Prometheus는 Pull 기반 모니터링 시스템입니다.

┌─────────────┐     ┌──────────────┐     ┌───────────┐
│  Targets    │────>│  Prometheus  │────>│  Grafana  │
│ (exporters) │pull │  Server      │query│           │
└─────────────┘     │  - TSDB      │     └───────────┘
                    │  - Rules     │
                    │  - AlertMgr  │
                    └──────┬───────┘
                           │
                     ┌─────▼─────┐
                     │ AlertMgr  │
                     │ - Routing │
                     │ - Silence │
                     └───────────┘

Prometheus 설정

# prometheus.yml
global:
  scrape_interval: 15s
  evaluation_interval: 15s

rule_files:
  - "alert_rules.yml"
  - "recording_rules.yml"

alerting:
  alertmanagers:
    - static_configs:
        - targets: ['alertmanager:9093']

scrape_configs:
  - job_name: 'prometheus'
    static_configs:
      - targets: ['localhost:9090']

  - job_name: 'node-exporter'
    static_configs:
      - targets: ['node-exporter:9100']

  - job_name: 'app-service'
    kubernetes_sd_configs:
      - role: pod
    relabel_configs:
      - source_labels: [__meta_kubernetes_pod_annotation_prometheus_io_scrape]
        action: keep
        regex: true
      - source_labels: [__meta_kubernetes_pod_annotation_prometheus_io_port]
        action: replace
        target_label: __address__
        regex: (.+)

PromQL 핵심 쿼리

# 1. 현재 초당 요청 수 (rate)
rate(http_requests_total[5m])

# 2. 서비스별 에러율
sum(rate(http_requests_total{status=~"5.."}[5m])) by (service)
/
sum(rate(http_requests_total[5m])) by (service)

# 3. 95번째 백분위 응답 시간
histogram_quantile(0.95,
  sum(rate(http_request_duration_seconds_bucket[5m])) by (le, service)
)

# 4. 메모리 사용률 (%)
(node_memory_MemTotal_bytes - node_memory_MemAvailable_bytes)
/ node_memory_MemTotal_bytes * 100

# 5. CPU 사용률
100 - (avg by (instance) (rate(node_cpu_seconds_total{mode="idle"}[5m])) * 100)

# 6. 디스크 여유 공간이 10% 미만인 노드
node_filesystem_avail_bytes / node_filesystem_size_bytes * 100 < 10

# 7. Pod 재시작 횟수 (지난 1시간)
increase(kube_pod_container_status_restarts_total[1h]) > 3

# 8. 서비스 가용성 (지난 30일)
1 - (
  sum(increase(http_requests_total{status=~"5.."}[30d]))
  /
  sum(increase(http_requests_total[30d]))
)

Recording Rules (성능 최적화)

# recording_rules.yml
groups:
  - name: service_metrics
    interval: 30s
    rules:
      - record: service:http_requests:rate5m
        expr: sum(rate(http_requests_total[5m])) by (service)

      - record: service:http_errors:rate5m
        expr: sum(rate(http_requests_total{status=~"5.."}[5m])) by (service)

      - record: service:http_error_rate:ratio
        expr: service:http_errors:rate5m / service:http_requests:rate5m

      - record: service:http_latency:p95
        expr: histogram_quantile(0.95, sum(rate(http_request_duration_seconds_bucket[5m])) by (le, service))

Alert Rules

# alert_rules.yml
groups:
  - name: service_alerts
    rules:
      - alert: HighErrorRate
        expr: service:http_error_rate:ratio > 0.05
        for: 5m
        labels:
          severity: critical
        annotations:
          summary: "High error rate on service {{ $labels.service }}"
          description: "Error rate is {{ $value | humanizePercentage }} for 5+ minutes"

      - alert: HighLatency
        expr: service:http_latency:p95 > 2.0
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "High p95 latency on {{ $labels.service }}"
          description: "P95 latency is {{ $value }}s (threshold: 2s)"

      - alert: PodCrashLooping
        expr: increase(kube_pod_container_status_restarts_total[1h]) > 5
        for: 10m
        labels:
          severity: critical
        annotations:
          summary: "Pod {{ $labels.pod }} is crash looping"

3. Grafana

대시보드 설계 원칙

USE 방법론: Utilization(사용률), Saturation(포화도), Errors(에러)
RED 방법론: Rate(비율), Errors(에러), Duration(지속시간)

Grafana 대시보드 JSON 구조

{
  "dashboard": {
    "title": "Service Overview",
    "panels": [
      {
        "title": "Request Rate",
        "type": "timeseries",
        "datasource": "Prometheus",
        "targets": [
          {
            "expr": "sum(rate(http_requests_total[5m])) by (service)",
            "legendFormat": "{{ service }}"
          }
        ]
      },
      {
        "title": "Error Rate",
        "type": "stat",
        "datasource": "Prometheus",
        "targets": [
          {
            "expr": "sum(rate(http_requests_total{status=~\"5..\"}[5m])) / sum(rate(http_requests_total[5m])) * 100"
          }
        ],
        "fieldConfig": {
          "defaults": {
            "thresholds": {
              "steps": [
                { "value": 0, "color": "green" },
                { "value": 1, "color": "yellow" },
                { "value": 5, "color": "red" }
              ]
            },
            "unit": "percent"
          }
        }
      }
    ],
    "templating": {
      "list": [
        {
          "name": "service",
          "type": "query",
          "query": "label_values(http_requests_total, service)",
          "refresh": 2
        },
        {
          "name": "environment",
          "type": "custom",
          "options": ["production", "staging", "development"]
        }
      ]
    }
  }
}

Grafana Alerting

# Grafana Alert Rule (provisioning)
apiVersion: 1
groups:
  - orgId: 1
    name: service_alerts
    folder: Production
    interval: 1m
    rules:
      - uid: high-error-rate
        title: High Error Rate
        condition: C
        data:
          - refId: A
            datasourceUid: prometheus
            model:
              expr: sum(rate(http_requests_total{status=~"5.."}[5m])) by (service)
          - refId: B
            datasourceUid: prometheus
            model:
              expr: sum(rate(http_requests_total[5m])) by (service)
          - refId: C
            datasourceUid: __expr__
            model:
              type: math
              expression: "$A / $B > 0.05"
        for: 5m
        labels:
          severity: critical
        annotations:
          summary: "Error rate exceeds 5%"

4. OpenTelemetry

OpenTelemetry 개요

OpenTelemetry(OTel)는 메트릭, 로그, 트레이스를 수집하는 벤더 중립적인 표준입니다.

┌──────────────┐     ┌────────────────┐     ┌──────────────┐
│ Application  │────>│  OTel          │────>│  Backend     │
│ + OTel SDK   │     │  Collector     │     │  - Jaeger    │
│              │     │  - Receivers   │     │  - Tempo     │
│              │     │  - Processors  │     │  - Prometheus│
│              │     │  - Exporters   │     │  - Loki      │
└──────────────┘     └────────────────┘     └──────────────┘

SDK 계측 (Node.js)

// tracing.ts - 애플리케이션 시작 시 가장 먼저 import
import { NodeSDK } from '@opentelemetry/sdk-node';
import { getNodeAutoInstrumentations } from '@opentelemetry/auto-instrumentations-node';
import { OTLPTraceExporter } from '@opentelemetry/exporter-trace-otlp-grpc';
import { OTLPMetricExporter } from '@opentelemetry/exporter-metrics-otlp-grpc';
import { PeriodicExportingMetricReader } from '@opentelemetry/sdk-metrics';
import { Resource } from '@opentelemetry/resources';
import { ATTR_SERVICE_NAME, ATTR_SERVICE_VERSION } from '@opentelemetry/semantic-conventions';

const sdk = new NodeSDK({
  resource: new Resource({
    [ATTR_SERVICE_NAME]: 'payment-service',
    [ATTR_SERVICE_VERSION]: '1.2.0',
    environment: 'production',
  }),
  traceExporter: new OTLPTraceExporter({
    url: 'http://otel-collector:4317',
  }),
  metricReader: new PeriodicExportingMetricReader({
    exporter: new OTLPMetricExporter({
      url: 'http://otel-collector:4317',
    }),
    exportIntervalMillis: 30000,
  }),
  instrumentations: [
    getNodeAutoInstrumentations({
      '@opentelemetry/instrumentation-http': { enabled: true },
      '@opentelemetry/instrumentation-express': { enabled: true },
      '@opentelemetry/instrumentation-pg': { enabled: true },
      '@opentelemetry/instrumentation-redis': { enabled: true },
    }),
  ],
});

sdk.start();

수동 계측 (Custom Spans)

import { trace, SpanStatusCode, context } from '@opentelemetry/api';

const tracer = trace.getTracer('payment-service');

async function processPayment(orderId: string, amount: number) {
  return tracer.startActiveSpan('processPayment', async (span) => {
    try {
      span.setAttribute('order.id', orderId);
      span.setAttribute('payment.amount', amount);
      span.setAttribute('payment.currency', 'USD');

      // 하위 span 생성
      const validationResult = await tracer.startActiveSpan(
        'validatePayment',
        async (validationSpan) => {
          const result = await validatePaymentDetails(orderId);
          validationSpan.setAttribute('validation.result', result.valid);
          validationSpan.end();
          return result;
        }
      );

      if (!validationResult.valid) {
        span.setStatus({
          code: SpanStatusCode.ERROR,
          message: 'Payment validation failed',
        });
        throw new Error('Invalid payment');
      }

      const result = await chargePayment(orderId, amount);
      span.setAttribute('payment.transactionId', result.transactionId);
      span.setStatus({ code: SpanStatusCode.OK });

      return result;
    } catch (error) {
      span.recordException(error);
      span.setStatus({
        code: SpanStatusCode.ERROR,
        message: error.message,
      });
      throw error;
    } finally {
      span.end();
    }
  });
}

OTel Collector 설정

# otel-collector-config.yaml
receivers:
  otlp:
    protocols:
      grpc:
        endpoint: 0.0.0.0:4317
      http:
        endpoint: 0.0.0.0:4318

  prometheus:
    config:
      scrape_configs:
        - job_name: 'otel-collector'
          scrape_interval: 10s
          static_configs:
            - targets: ['0.0.0.0:8888']

processors:
  batch:
    timeout: 5s
    send_batch_size: 1000

  memory_limiter:
    check_interval: 1s
    limit_mib: 512

  attributes:
    actions:
      - key: environment
        value: production
        action: upsert

exporters:
  otlp/jaeger:
    endpoint: jaeger:4317
    tls:
      insecure: true

  prometheusremotewrite:
    endpoint: http://prometheus:9090/api/v1/write

  loki:
    endpoint: http://loki:3100/loki/api/v1/push

service:
  pipelines:
    traces:
      receivers: [otlp]
      processors: [memory_limiter, batch, attributes]
      exporters: [otlp/jaeger]

    metrics:
      receivers: [otlp, prometheus]
      processors: [memory_limiter, batch]
      exporters: [prometheusremotewrite]

    logs:
      receivers: [otlp]
      processors: [memory_limiter, batch]
      exporters: [loki]

5. 분산 추적 (Distributed Tracing)

Jaeger와 Grafana Tempo

Jaeger: 독립형 분산 추적 시스템. UI가 내장되어 있어 빠르게 시작할 수 있습니다.

Grafana Tempo: Grafana 생태계에 통합된 추적 백엔드. 인덱스가 없어 저장 비용이 낮습니다.

Docker Compose로 추적 스택 구성

# docker-compose.yaml
version: '3.8'
services:
  jaeger:
    image: jaegertracing/all-in-one:latest
    ports:
      - "16686:16686"  # Jaeger UI
      # OTLP 포트(4317/4318)는 호스트에 노출하지 않습니다.
      # 호스트에 노출하면 아래 otel-collector의 포트 매핑과 충돌하며,
      # otel-collector가 내부 네트워크에서 jaeger:4317로 전송합니다.
    environment:
      - COLLECTOR_OTLP_ENABLED=true

  otel-collector:
    image: otel/opentelemetry-collector-contrib:latest
    command: ["--config=/etc/otel-collector-config.yaml"]
    volumes:
      - ./otel-collector-config.yaml:/etc/otel-collector-config.yaml
    ports:
      - "4317:4317"
      - "4318:4318"
    depends_on:
      - jaeger

트레이스 분석 팁

  1. 느린 Span 찾기: 전체 트레이스에서 가장 긴 span을 식별합니다
  2. 에러 Span 필터: status=ERROR로 필터링하여 실패 지점을 파악합니다
  3. 서비스 맵: 서비스 간 의존성과 호출 패턴을 시각화합니다
  4. 비교 분석: 정상 트레이스와 문제 트레이스를 나란히 비교합니다
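위 팁 중 "느린 Span 찾기"는 각 span에서 자식 span의 시간을 뺀 자기 시간(self time)을 비교하는 것으로 정리할 수 있습니다. 아래는 설명을 위해 가정한 최소 자료구조로 이를 스케치한 코드이며, 실제 Jaeger/Tempo API가 아닙니다.

```typescript
// span 트리에서 self time(자식 span 시간을 뺀 순수 소요 시간)이
// 가장 긴 span을 찾는 스케치. Span 타입은 설명용으로 가정한 최소 구조입니다.
interface Span {
  name: string;
  durationMs: number;
  children: Span[];
}

function slowestBySelfTime(span: Span): { name: string; selfMs: number } {
  const childTotal = span.children.reduce((sum, c) => sum + c.durationMs, 0);
  let worst = { name: span.name, selfMs: span.durationMs - childTotal };
  for (const child of span.children) {
    const candidate = slowestBySelfTime(child);
    if (candidate.selfMs > worst.selfMs) worst = candidate;
  }
  return worst;
}

// 1장의 트레이스 예시: Payment Service 5003ms 중 5000ms는 외부 API에서 소요
const trace: Span = {
  name: 'API Gateway', durationMs: 5025,
  children: [
    { name: 'Auth Service', durationMs: 5, children: [] },
    { name: 'User Service', durationMs: 15, children: [] },
    {
      name: 'Payment Service', durationMs: 5003,
      children: [{ name: 'External Payment API', durationMs: 5000, children: [] }],
    },
  ],
};

console.log(slowestBySelfTime(trace)); // { name: 'External Payment API', selfMs: 5000 }
```

Payment Service 자체의 self time은 3ms뿐이고, 병목은 외부 API 호출임이 드러납니다. Jaeger UI의 duration 정렬도 같은 원리로 동작합니다.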

6. 로깅 (Logging)

ELK vs Loki vs CloudWatch

| 구분 | ELK Stack | Grafana Loki | CloudWatch Logs |
|------|-----------|--------------|-----------------|
| 인덱싱 | 전문(full-text) 인덱스 | 라벨 기반 | 로그 그룹 |
| 스토리지 비용 | 높음 | 낮음 | 중간 |
| 쿼리 언어 | KQL/Lucene | LogQL | Logs Insights |
| Grafana 통합 | 플러그인 | 네이티브 | 플러그인 |
| 적합한 규모 | 대규모 | 중소규모 | AWS 네이티브 |

Grafana Loki + LogQL

# 서비스별 에러 로그
{service="payment-service"} |= "ERROR"

# JSON 파싱 후 필터
{service="api-gateway"} | json | status >= 500

# 에러 발생 빈도 (1분당)
count_over_time({service="payment-service"} |= "ERROR" [1m])

# 느린 요청 필터 (1초 이상)
{service="api-gateway"} | json | duration > 1000

# 특정 트레이스 ID로 전체 로그 검색
{service=~".+"} |= "trace_id=abc123def456"

구조화 로깅 구현 (Node.js)

import pino from 'pino';

const logger = pino({
  level: process.env.LOG_LEVEL || 'info',
  formatters: {
    level(label) {
      return { level: label };
    },
  },
  timestamp: pino.stdTimeFunctions.isoTime,
  base: {
    service: 'payment-service',
    version: '1.2.0',
    environment: process.env.NODE_ENV,
  },
});

// 요청별 컨텍스트 포함
function createRequestLogger(req) {
  return logger.child({
    requestId: req.id,
    traceId: req.headers['x-trace-id'],
    userId: req.user?.id,
    method: req.method,
    path: req.url,
  });
}

// 사용 예시
app.use((req, res, next) => {
  req.log = createRequestLogger(req);
  req.log.info('Request received');

  res.on('finish', () => {
    req.log.info({
      statusCode: res.statusCode,
      duration: Date.now() - req.startTime,
    }, 'Request completed');
  });

  next();
});

7. SRE 핵심 개념

SLI (Service Level Indicator)

서비스 품질을 측정하는 구체적인 지표입니다.

# 가용성 SLI: 성공 요청 비율
sum(rate(http_requests_total{status!~"5.."}[30d]))
/
sum(rate(http_requests_total[30d]))

# 지연 SLI: P99 < 300ms인 요청 비율
sum(rate(http_request_duration_seconds_bucket{le="0.3"}[30d]))
/
sum(rate(http_request_duration_seconds_count[30d]))

SLO (Service Level Objective)

SLI의 목표값입니다.

  • 가용성 SLO: 99.9% (월간 다운타임 43분)
  • 지연 SLO: P99 응답 시간 300ms 미만
# SLO 정의 (Sloth 형식)
version: "prometheus/v1"
service: "payment-service"
labels:
  team: "platform"
slos:
  - name: "availability"
    objective: 99.9
    sli:
      events:
        error_query: sum(rate(http_requests_total{status=~"5..",service="payment"}[{{.window}}]))
        total_query: sum(rate(http_requests_total{service="payment"}[{{.window}}]))
    alerting:
      page_alert:
        labels:
          severity: critical
      ticket_alert:
        labels:
          severity: warning

Error Budget (에러 예산)

SLO 99.9%라면, 에러 예산은 0.1%입니다.

  • 30일 기준: 43.2분의 다운타임 허용
  • 에러 예산이 남으면: 새 기능 배포, 실험 가능
  • 에러 예산을 소진하면: 안정화에 집중, 배포 동결
# 남은 에러 예산 (%)
1 - (
  (1 - service:availability:ratio30d)
  /
  (1 - 0.999)
)
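위 수식의 의미는 간단한 계산으로 확인할 수 있습니다. 아래는 SLO에서 허용 다운타임(분)과 남은 에러 예산 비율을 구해 보는 스케치입니다.

```typescript
// SLO로부터 허용 다운타임(분)을 계산하는 스케치
function errorBudgetMinutes(slo: number, windowDays: number): number {
  return windowDays * 24 * 60 * (1 - slo); // 전체 시간 x 허용 실패 비율
}

// 위 PromQL과 같은 식: 1 - (실제 실패율 / 허용 실패율)
function remainingBudgetRatio(measuredAvailability: number, slo: number): number {
  return 1 - (1 - measuredAvailability) / (1 - slo);
}

console.log(errorBudgetMinutes(0.999, 30));       // ≈ 43.2분
console.log(remainingBudgetRatio(0.9995, 0.999)); // ≈ 0.5 (예산 절반 남음)
```

측정 가용성이 99.95%이고 SLO가 99.9%라면 에러 예산의 절반을 이미 사용한 상태이며, 남은 예산의 소진 속도(burn rate)를 기준으로 알림을 거는 것이 일반적입니다.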

SLA (Service Level Agreement)

고객과의 계약입니다. SLO보다 느슨하게 설정합니다.

SLA > SLO > SLI (측정)

예시:
- SLA: 99.9% (계약, 위반 시 환불)
- SLO: 99.95% (내부 목표, SLA보다 엄격)
- SLI: 99.97% (실제 측정값)

8. 알림 전략 (Alerting Strategy)

알림 피라미드

          /  P1: Page  \          → 즉시 대응 (PagerDuty)
         / (심각, 고객 영향) \
        /──────────────────\
       /   P2: Ticket      \     → 업무 시간 내 처리 (Jira)
      / (성능 저하, 잠재 위험)  \
     /──────────────────────\
    /    P3: Notification    \   → 인지만 필요 (Slack)
   /  (경고, 트렌드 변화)       \
  /──────────────────────────\
 /     P4: Dashboard only     \  → 대시보드 확인
/ (참고 지표, 자동 복구 가능)      \

AlertManager 라우팅

# alertmanager.yml
global:
  resolve_timeout: 5m

route:
  receiver: 'default-slack'
  group_by: ['alertname', 'service']
  group_wait: 30s
  group_interval: 5m
  repeat_interval: 4h

  routes:
    - match:
        severity: critical
      receiver: 'pagerduty-critical'
      group_wait: 10s
      repeat_interval: 1h

    - match:
        severity: warning
      receiver: 'slack-warnings'
      repeat_interval: 4h

    - match:
        severity: info
      receiver: 'slack-info'
      repeat_interval: 12h

receivers:
  - name: 'pagerduty-critical'
    pagerduty_configs:
      - service_key: 'your-pagerduty-key'
        severity: critical

  - name: 'slack-warnings'
    slack_configs:
      - api_url: 'https://hooks.slack.com/services/xxx'
        channel: '#alerts-warning'
        title: '[WARNING] {{ .GroupLabels.alertname }}'

  - name: 'slack-info'
    slack_configs:
      - api_url: 'https://hooks.slack.com/services/xxx'
        channel: '#alerts-info'

  - name: 'default-slack'
    slack_configs:
      - api_url: 'https://hooks.slack.com/services/xxx'
        channel: '#alerts'

inhibit_rules:
  - source_match:
      severity: critical
    target_match:
      severity: warning
    equal: ['alertname', 'service']

좋은 알림의 조건

  1. 실행 가능(Actionable): 알림을 받으면 할 수 있는 일이 있어야 합니다
  2. 긴급도 구분: 진짜 긴급한 것만 페이지(호출)합니다
  3. 컨텍스트 포함: Runbook 링크, 관련 대시보드 링크를 포함합니다
  4. 알림 피로 방지: 너무 많은 알림은 모든 알림을 무시하게 만듭니다
  5. 자동 복구 우선: 가능하면 자동 복구 후 알림을 줍니다

9. 온콜(On-Call) 문화

온콜 로테이션 설계

주간 로테이션 예시:
- Primary: 첫 번째 대응자 (5분 내 응답)
- Secondary: 백업 대응자 (Primary 미응답 시 10분 후 에스컬레이션)
- 관리자: 30분 이상 미해결 시 에스컬레이션

교대 주기: 1주일
핸드오프: 매주 월요일 오전 10시
보상: 온콜 수당, 대체 휴무

인시던트 대응 프로세스

1. 감지(Detect)
   └── 알림 수신, 영향 범위 초기 파악

2. 대응(Respond)
   └── 인시던트 채널 생성, 역할 할당
       - IC (Incident Commander): 조율
       - Tech Lead: 기술 조사
       - Comms: 고객/이해관계자 소통

3. 완화(Mitigate)
   └── 즉각적인 조치 (롤백, 스케일 아웃 등)

4. 해결(Resolve)
   └── 근본 원인 수정, 서비스 복구 확인

5. 사후 분석(Postmortem)
   └── 비난 없는 회고, 재발 방지 액션 아이템 도출

10. 프로덕션 모니터링 스택 아키텍처

권장 스택 구성

┌─────────────────────────────────────────┐
│      Grafana (대시보드, 알림, 탐색)       │
└───────┬─────────────┬──────────────┬────┘
        │             │              │
   ┌────▼─────┐  ┌────▼──────┐  ┌────▼────┐
   │Prometheus│  │   Loki    │  │  Tempo  │
   │(Metrics) │  │  (Logs)   │  │(Traces) │
   └────▲─────┘  └────▲──────┘  └────▲────┘
        │             │              │
   ┌────┴─────────────┴──────────────┴────┐
   │      OpenTelemetry Collector         │
   │        (수집, 처리, 라우팅)           │
   └────▲─────────────▲──────────────▲────┘
        │             │              │
   ┌────┴────┐  ┌─────┴─────┐  ┌────┴────┐
   │Service A│  │ Service B │  │Service C│
   │+OTel SDK│  │ +OTel SDK │  │+OTel SDK│
   └─────────┘  └───────────┘  └─────────┘

Kubernetes 환경 모니터링

# kube-prometheus-stack values.yaml (Helm)
prometheus:
  prometheusSpec:
    retention: 15d
    storageSpec:
      volumeClaimTemplate:
        spec:
          storageClassName: gp3
          resources:
            requests:
              storage: 100Gi

grafana:
  dashboardProviders:
    dashboardproviders.yaml:
      apiVersion: 1
      providers:
        - name: 'default'
          folder: ''
          type: file
          options:
            path: /var/lib/grafana/dashboards

alertmanager:
  config:
    route:
      receiver: 'slack'
      group_by: ['alertname', 'namespace']
    receivers:
      - name: 'slack'
        slack_configs:
          - api_url: 'https://hooks.slack.com/services/xxx'
            channel: '#k8s-alerts'

11. 실무 면접 질문 15선

기초 (1-5)

Q1. 관측성의 3가지 축을 설명하세요.

Metrics(메트릭), Logs(로그), Traces(트레이스)입니다. 메트릭은 숫자로 된 시계열 데이터로 시스템 상태의 집계 뷰를, 로그는 이벤트의 상세 텍스트 기록을, 트레이스는 요청이 여러 서비스를 거치는 경로를 보여줍니다.

Q2. Prometheus의 Pull 모델을 설명하세요.

Prometheus가 직접 타겟 서비스의 /metrics 엔드포인트를 주기적으로 스크래핑합니다. Push 모델과 달리 서버가 수집 대상을 제어하며, 서비스 디스커버리와 결합하여 동적 환경을 지원합니다.

Q3. Counter, Gauge, Histogram의 차이를 설명하세요.

Counter는 단조 증가하는 값(총 요청 수), Gauge는 오르내리는 현재값(메모리 사용량), Histogram은 값의 분포를 버킷으로 관측하는 타입(응답 시간 분포)입니다.

Q4. 구조화 로깅이 왜 중요한가요?

JSON 등 일관된 형식으로 로그를 남기면 자동 파싱, 필터링, 검색이 가능합니다. traceId를 포함하면 분산 시스템에서 로그와 트레이스를 연결하여 디버깅이 훨씬 빨라집니다.

Q5. SLI, SLO, SLA의 차이를 설명하세요.

SLI(Service Level Indicator)는 실제 측정 지표, SLO(Service Level Objective)는 내부 목표값, SLA(Service Level Agreement)는 고객과의 법적 계약입니다. SLA는 SLO보다 느슨하게 설정합니다.

중급 (6-10)

Q6. PromQL의 rate()와 increase()의 차이는?

rate()는 초당 평균 증가율을 반환하고, increase()는 주어진 시간 범위 동안의 총 증가량을 반환합니다. rate()는 그래프에, increase()는 총 카운트에 적합합니다.
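rate()와 increase()의 관계는 숫자로 보면 분명해집니다. 아래는 카운터 샘플 두 개로 두 값을 계산해 보는 개념 확인용 스케치입니다. 실제 PromQL은 카운터 리셋 처리와 구간 경계 외삽(extrapolation)까지 수행하므로 결과가 정확히 일치하지는 않습니다.

```typescript
// 카운터 샘플 두 개로 increase()와 rate()의 관계를 확인하는 스케치
// rate() ≈ increase() / 구간 길이(초)
interface Sample { tSec: number; value: number }

function increase(first: Sample, last: Sample): number {
  return last.value - first.value; // 카운터 리셋이 없다고 가정
}

function ratePerSec(first: Sample, last: Sample): number {
  return increase(first, last) / (last.tSec - first.tSec);
}

const first = { tSec: 0, value: 100 };
const last = { tSec: 300, value: 400 }; // 5분(300초) 구간

console.log(increase(first, last));   // 300 (총 증가량)
console.log(ratePerSec(first, last)); // 1 (초당 1 요청)
```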

Q7. OpenTelemetry Collector의 역할을 설명하세요.

텔레메트리 데이터(메트릭, 로그, 트레이스)를 수신(Receiver)하고, 처리(Processor, 배치, 필터링)한 뒤, 여러 백엔드(Exporter)로 전송합니다. 애플리케이션과 백엔드 사이의 중간 계층으로 벤더 종속을 방지합니다.

Q8. Error Budget의 개념과 활용법을 설명하세요.

SLO에서 허용하는 에러 비율입니다. 99.9% SLO라면 에러 예산은 0.1%(월 43분). 예산이 남으면 새 기능을 배포하고, 소진하면 안정화에 집중합니다. 개발 속도와 신뢰성 사이의 균형을 수치로 관리합니다.

Q9. Distributed Tracing에서 Span과 Trace의 관계는?

Trace는 하나의 요청이 시스템을 통과하는 전체 여정이고, Span은 그 여정 내 개별 작업 단위입니다. Span은 부모-자식 관계로 트리를 형성하며, 각 span에는 시작/종료 시간, 속성, 상태가 있습니다.

Q10. Grafana Loki와 ELK의 주요 차이점은?

ELK는 로그 텍스트를 전문 인덱싱하여 강력한 검색을 제공하지만 스토리지 비용이 높습니다. Loki는 라벨만 인덱싱하고 로그 텍스트는 압축 저장하여 비용이 낮지만, 라벨 기반 필터링 후 텍스트 검색이 필요합니다.

고급 (11-15)

Q11. 알림 피로(Alert Fatigue)를 어떻게 방지하나요?

실행 가능한 알림만 설정하고, 심각도를 명확히 구분합니다. inhibit rules로 중복 알림을 억제하고, grouping으로 유사 알림을 묶습니다. 정기적으로 알림을 리뷰하여 노이즈를 제거합니다.

Q12. Prometheus의 Recording Rules은 왜 필요한가요?

복잡한 PromQL 쿼리를 미리 계산하여 새로운 시계열로 저장합니다. 대시보드 로딩 시간을 줄이고, 같은 쿼리의 반복 실행을 방지합니다. 특히 SLO 대시보드처럼 장기 범위 쿼리에 효과적입니다.

Q13. OpenTelemetry에서 Context Propagation이란?

트레이스 컨텍스트(trace ID, span ID)를 서비스 간 전파하는 메커니즘입니다. HTTP 헤더(W3C Trace Context)나 메시지 큐의 메타데이터를 통해 전파하며, 이를 통해 분산 시스템에서 하나의 요청을 엔드투엔드로 추적합니다.
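W3C Trace Context가 정의하는 traceparent 헤더의 형식(version-traceid-spanid-flags)은 직접 파싱해 보면 이해가 쉽습니다. 아래는 해당 형식을 파싱하는 스케치로, 실무에서는 OTel SDK의 propagator가 이 과정을 자동으로 처리합니다.

```typescript
// W3C Trace Context의 traceparent 헤더를 파싱하는 스케치
// 형식: {version(2 hex)}-{trace-id(32 hex)}-{parent-id(16 hex)}-{trace-flags(2 hex)}
function parseTraceparent(header: string) {
  const parts = header.split('-');
  if (parts.length !== 4) throw new Error('invalid traceparent');
  const [version, traceId, parentId, flags] = parts;
  return {
    version,
    traceId,
    parentId,
    sampled: (parseInt(flags, 16) & 0x01) === 1, // 최하위 비트가 샘플링 여부
  };
}

const ctx = parseTraceparent(
  '00-0af7651916cd43dd8448eb211c80319c-b7ad6b7169203331-01'
);
console.log(ctx.traceId); // 0af7651916cd43dd8448eb211c80319c
console.log(ctx.sampled); // true
```

다운스트림 서비스는 이렇게 추출한 traceId를 그대로 이어받아 새 span을 만들기 때문에, 여러 서비스의 span이 하나의 트레이스로 연결됩니다.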

Q14. Golden Signals과 RED/USE 방법론을 비교하세요.

Google의 Golden Signals는 Latency, Traffic, Errors, Saturation입니다. RED(Rate, Errors, Duration)는 서비스 관점, USE(Utilization, Saturation, Errors)는 인프라 관점에 적합합니다. 서비스에는 RED, 인프라에는 USE를 적용하는 것이 일반적입니다.

Q15. 비난 없는(Blameless) Postmortem의 핵심 원칙은?

개인을 비난하지 않고 시스템의 실패에 초점을 맞춥니다. 타임라인을 재구성하고, 기여 요인을 분석하며, 구체적이고 측정 가능한 액션 아이템을 도출합니다. 목표는 같은 문제가 재발하지 않도록 시스템을 개선하는 것입니다.


12. 실전 퀴즈 5문제

Q1. Prometheus의 metric 타입 중, "현재 메모리 사용량"처럼 오르내리는 값을 표현하기에 가장 적합한 타입은?

정답: Gauge

Gauge는 올라갔다 내려갔다 하는 순간 값을 나타냅니다. Counter는 단조 증가만 하므로 메모리 사용량처럼 감소할 수 있는 값에는 적합하지 않습니다. Histogram은 분포를 측정하는 데 사용합니다.

Q2. SLO가 99.9%일 때, 30일 동안의 에러 예산(허용 다운타임)은 대략 몇 분인가요?

정답: 약 43분

30일 = 43,200분. 에러 예산 = 43,200 x 0.001 = 43.2분. 이 시간 내의 장애는 SLO를 위반하지 않습니다.

Q3. OpenTelemetry Collector의 세 가지 주요 구성 요소는?

정답: Receivers, Processors, Exporters

Receivers는 데이터를 수신하고(OTLP, Prometheus 등), Processors는 데이터를 처리하며(배치, 필터링, 속성 추가 등), Exporters는 데이터를 백엔드로 전송합니다(Jaeger, Prometheus, Loki 등).

Q4. 다음 PromQL 쿼리는 무엇을 계산하나요? histogram_quantile(0.95, sum(rate(http_request_duration_seconds_bucket[5m])) by (le))

정답: 최근 5분간 HTTP 요청 응답 시간의 95번째 백분위수(P95)

histogram_quantile은 히스토그램 버킷에서 분위수를 계산합니다. 0.95는 95%를 의미하며, le(less than or equal) 라벨별로 그룹화된 버킷 데이터에서 P95를 추출합니다.

Q5. Distributed Tracing에서 "Context Propagation"이 없으면 어떤 문제가 발생하나요?

정답: 서비스 간 요청을 하나의 트레이스로 연결할 수 없게 됩니다.

Context Propagation이 없으면 각 서비스가 독립적인 트레이스를 생성합니다. 하나의 사용자 요청이 여러 서비스를 거칠 때 전체 경로를 파악할 수 없어, 분산 시스템에서의 디버깅이 극히 어려워집니다.


참고 자료 (References)

  1. Prometheus 공식 문서
  2. Grafana 공식 문서
  3. OpenTelemetry 공식 문서
  4. Jaeger 공식 문서
  5. Grafana Loki 문서
  6. Grafana Tempo 문서
  7. Google SRE Book
  8. Google SRE Workbook
  9. PromQL 치트시트
  10. Sloth - SLO Generator
  11. OpenTelemetry Collector 설정
  12. Alertmanager 라우팅 트리
  13. LogQL 문서
  14. Pino Logger (Node.js)
  15. kube-prometheus-stack Helm Chart
  16. DORA Metrics 가이드

Observability Complete Guide 2025: Making Systems Transparent with Prometheus, Grafana, and OpenTelemetry

Introduction: Why Observability Matters

"Monitoring tells you whether a system is working. Observability lets you understand why it is not working."

In modern distributed systems, simply monitoring CPU usage or memory is insufficient. In microservice architectures, containers, and serverless environments, a single request traverses dozens of services. To identify the root cause of problems, you need observability — the ability to look inside your system.


1. Three Pillars of Observability

Metrics

Time-series data expressed as numbers. They provide an aggregated view of system state.

  • Counter: Monotonically increasing value (e.g., total request count)
  • Gauge: Value that goes up and down (e.g., current memory usage)
  • Histogram: Distribution of values (e.g., response time distribution)
  • Summary: Client-side calculated quantiles
# Counter example
http_requests_total{method="GET", path="/api/users", status="200"} 15234

# Gauge example
node_memory_usage_bytes{instance="web-01"} 1073741824

# Histogram example
http_request_duration_seconds_bucket{le="0.1"} 24054
http_request_duration_seconds_bucket{le="0.5"} 33444
http_request_duration_seconds_bucket{le="1.0"} 34055

Logs

Text records of events. They provide detailed information about individual events.

{
  "timestamp": "2025-03-15T10:30:45.123Z",
  "level": "ERROR",
  "service": "payment-service",
  "traceId": "abc123def456",
  "spanId": "span789",
  "message": "Payment processing failed",
  "userId": "user-42",
  "orderId": "order-1234",
  "error": "Timeout connecting to payment gateway",
  "duration_ms": 5000
}

Structured logging makes searching and analysis much easier.

Traces

Track the complete path of a request across multiple services.

[Trace: abc123def456]
|-- [Span: API Gateway] 2ms
|   |-- [Span: Auth Service] 5ms
|   |   +-- [Span: Redis Cache Lookup] 1ms
|   |-- [Span: User Service] 15ms
|   |   +-- [Span: PostgreSQL Query] 8ms
|   +-- [Span: Payment Service] 5003ms  <-- bottleneck!
|       +-- [Span: External Payment API] 5000ms (TIMEOUT)
+-- Total: 5025ms

When all three pillars are combined, you can understand "What went wrong, Why it went wrong, and Where it went wrong."


2. Prometheus

Architecture

Prometheus is a pull-based monitoring system.

+-------------+     +--------------+     +-----------+
|  Targets    |---->|  Prometheus  |---->|  Grafana  |
|  (exporters)|pull |  Server      |query|           |
+-------------+     |  - TSDB      |     +-----------+
                    |  - Rules     |
                    |  - AlertMgr  |
                    +--------------+
                          |
                    +-----v-----+
                    | AlertMgr  |
                    | - Routing |
                    | - Silence |
                    +-----------+

Prometheus Configuration

# prometheus.yml
global:
  scrape_interval: 15s
  evaluation_interval: 15s

rule_files:
  - "alert_rules.yml"
  - "recording_rules.yml"

alerting:
  alertmanagers:
    - static_configs:
        - targets: ['alertmanager:9093']

scrape_configs:
  - job_name: 'prometheus'
    static_configs:
      - targets: ['localhost:9090']

  - job_name: 'node-exporter'
    static_configs:
      - targets: ['node-exporter:9100']

  - job_name: 'app-service'
    kubernetes_sd_configs:
      - role: pod
    relabel_configs:
      - source_labels: [__meta_kubernetes_pod_annotation_prometheus_io_scrape]
        action: keep
        regex: true
      - source_labels: [__meta_kubernetes_pod_annotation_prometheus_io_port]
        action: replace
        target_label: __address__
        regex: (.+)

Essential PromQL Queries

# 1. Current requests per second (rate)
rate(http_requests_total[5m])

# 2. Error rate by service
sum(rate(http_requests_total{status=~"5.."}[5m])) by (service)
/
sum(rate(http_requests_total[5m])) by (service)

# 3. 95th percentile response time
histogram_quantile(0.95,
  sum(rate(http_request_duration_seconds_bucket[5m])) by (le, service)
)

# 4. Memory utilization (%)
(node_memory_MemTotal_bytes - node_memory_MemAvailable_bytes)
/ node_memory_MemTotal_bytes * 100

# 5. CPU utilization
100 - (avg by (instance) (rate(node_cpu_seconds_total{mode="idle"}[5m])) * 100)

# 6. Nodes with less than 10% free disk space
node_filesystem_avail_bytes / node_filesystem_size_bytes * 100 < 10

# 7. Pod restart count (last 1 hour)
increase(kube_pod_container_status_restarts_total[1h]) > 3

# 8. Service availability (last 30 days)
1 - (
  sum(increase(http_requests_total{status=~"5.."}[30d]))
  /
  sum(increase(http_requests_total[30d]))
)

Recording Rules (Performance Optimization)

# recording_rules.yml
groups:
  - name: service_metrics
    interval: 30s
    rules:
      - record: service:http_requests:rate5m
        expr: sum(rate(http_requests_total[5m])) by (service)

      - record: service:http_errors:rate5m
        expr: sum(rate(http_requests_total{status=~"5.."}[5m])) by (service)

      - record: service:http_error_rate:ratio
        expr: service:http_errors:rate5m / service:http_requests:rate5m

      - record: service:http_latency:p95
        expr: histogram_quantile(0.95, sum(rate(http_request_duration_seconds_bucket[5m])) by (le, service))

Alert Rules

# alert_rules.yml
groups:
  - name: service_alerts
    rules:
      - alert: HighErrorRate
        expr: service:http_error_rate:ratio > 0.05
        for: 5m
        labels:
          severity: critical
        annotations:
          summary: "High error rate on service {{ $labels.service }}"
          description: "Error rate is {{ $value | humanizePercentage }} for 5+ minutes"

      - alert: HighLatency
        expr: service:http_latency:p95 > 2.0
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "High p95 latency on {{ $labels.service }}"
          description: "P95 latency is {{ $value }}s (threshold: 2s)"

      - alert: PodCrashLooping
        expr: increase(kube_pod_container_status_restarts_total[1h]) > 5
        for: 10m
        labels:
          severity: critical
        annotations:
          summary: "Pod {{ $labels.pod }} is crash looping"

3. Grafana

Dashboard Design Principles

USE Method: Utilization, Saturation, Errors
RED Method: Rate, Errors, Duration

Grafana Dashboard JSON Structure

{
  "dashboard": {
    "title": "Service Overview",
    "panels": [
      {
        "title": "Request Rate",
        "type": "timeseries",
        "datasource": "Prometheus",
        "targets": [
          {
            "expr": "sum(rate(http_requests_total[5m])) by (service)",
            "legendFormat": "{{ service }}"
          }
        ]
      },
      {
        "title": "Error Rate",
        "type": "stat",
        "datasource": "Prometheus",
        "targets": [
          {
            "expr": "sum(rate(http_requests_total{status=~\"5..\"}[5m])) / sum(rate(http_requests_total[5m])) * 100"
          }
        ],
        "fieldConfig": {
          "defaults": {
            "thresholds": {
              "steps": [
                { "value": 0, "color": "green" },
                { "value": 1, "color": "yellow" },
                { "value": 5, "color": "red" }
              ]
            },
            "unit": "percent"
          }
        }
      }
    ],
    "templating": {
      "list": [
        {
          "name": "service",
          "type": "query",
          "query": "label_values(http_requests_total, service)",
          "refresh": 2
        },
        {
          "name": "environment",
          "type": "custom",
          "options": ["production", "staging", "development"]
        }
      ]
    }
  }
}

Grafana Alerting

# Grafana Alert Rule (provisioning)
apiVersion: 1
groups:
  - orgId: 1
    name: service_alerts
    folder: Production
    interval: 1m
    rules:
      - uid: high-error-rate
        title: High Error Rate
        condition: C
        data:
          - refId: A
            datasourceUid: prometheus
            model:
              expr: sum(rate(http_requests_total{status=~"5.."}[5m])) by (service)
          - refId: B
            datasourceUid: prometheus
            model:
              expr: sum(rate(http_requests_total[5m])) by (service)
          - refId: C
            datasourceUid: __expr__
            model:
              type: math
              expression: "$A / $B > 0.05"
        for: 5m
        labels:
          severity: critical
        annotations:
          summary: "Error rate exceeds 5%"

4. OpenTelemetry

OpenTelemetry Overview

OpenTelemetry (OTel) is a vendor-neutral standard for collecting metrics, logs, and traces.

+--------------+     +----------------+     +--------------+
| Application  |---->|  OTel          |---->|  Backend     |
| + OTel SDK   |     |  Collector     |     |  - Jaeger    |
|              |     |  - Receivers   |     |  - Tempo     |
|              |     |  - Processors  |     |  - Prometheus|
|              |     |  - Exporters   |     |  - Loki      |
+--------------+     +----------------+     +--------------+

SDK Instrumentation (Node.js)

// tracing.ts - import first when application starts
import { NodeSDK } from '@opentelemetry/sdk-node';
import { getNodeAutoInstrumentations } from '@opentelemetry/auto-instrumentations-node';
import { OTLPTraceExporter } from '@opentelemetry/exporter-trace-otlp-grpc';
import { OTLPMetricExporter } from '@opentelemetry/exporter-metrics-otlp-grpc';
import { PeriodicExportingMetricReader } from '@opentelemetry/sdk-metrics';
import { Resource } from '@opentelemetry/resources';
import { ATTR_SERVICE_NAME, ATTR_SERVICE_VERSION } from '@opentelemetry/semantic-conventions';

const sdk = new NodeSDK({
  resource: new Resource({
    [ATTR_SERVICE_NAME]: 'payment-service',
    [ATTR_SERVICE_VERSION]: '1.2.0',
    environment: 'production',
  }),
  traceExporter: new OTLPTraceExporter({
    url: 'http://otel-collector:4317',
  }),
  metricReader: new PeriodicExportingMetricReader({
    exporter: new OTLPMetricExporter({
      url: 'http://otel-collector:4317',
    }),
    exportIntervalMillis: 30000,
  }),
  instrumentations: [
    getNodeAutoInstrumentations({
      '@opentelemetry/instrumentation-http': { enabled: true },
      '@opentelemetry/instrumentation-express': { enabled: true },
      '@opentelemetry/instrumentation-pg': { enabled: true },
      '@opentelemetry/instrumentation-redis': { enabled: true },
    }),
  ],
});

sdk.start();

Manual Instrumentation (Custom Spans)

import { trace, SpanStatusCode } from '@opentelemetry/api';

const tracer = trace.getTracer('payment-service');

async function processPayment(orderId: string, amount: number) {
  return tracer.startActiveSpan('processPayment', async (span) => {
    try {
      span.setAttribute('order.id', orderId);
      span.setAttribute('payment.amount', amount);
      span.setAttribute('payment.currency', 'USD');

      // Create a child span; end it in finally so it closes even if validation throws
      const validationResult = await tracer.startActiveSpan(
        'validatePayment',
        async (validationSpan) => {
          try {
            const result = await validatePaymentDetails(orderId);
            validationSpan.setAttribute('validation.result', result.valid);
            return result;
          } finally {
            validationSpan.end();
          }
        }
      );

      if (!validationResult.valid) {
        span.setStatus({
          code: SpanStatusCode.ERROR,
          message: 'Payment validation failed',
        });
        throw new Error('Invalid payment');
      }

      const result = await chargePayment(orderId, amount);
      span.setAttribute('payment.transactionId', result.transactionId);
      span.setStatus({ code: SpanStatusCode.OK });

      return result;
    } catch (error) {
      const err = error instanceof Error ? error : new Error(String(error));
      span.recordException(err);
      span.setStatus({
        code: SpanStatusCode.ERROR,
        message: err.message,
      });
      throw error;
    } finally {
      span.end();
    }
  });
}

OTel Collector Configuration

# otel-collector-config.yaml
receivers:
  otlp:
    protocols:
      grpc:
        endpoint: 0.0.0.0:4317
      http:
        endpoint: 0.0.0.0:4318

  prometheus:
    config:
      scrape_configs:
        - job_name: 'otel-collector'
          scrape_interval: 10s
          static_configs:
            - targets: ['0.0.0.0:8888']

processors:
  batch:
    timeout: 5s
    send_batch_size: 1000

  memory_limiter:
    check_interval: 1s
    limit_mib: 512

  attributes:
    actions:
      - key: environment
        value: production
        action: upsert

exporters:
  otlp/jaeger:
    endpoint: jaeger:4317
    tls:
      insecure: true

  prometheusremotewrite:
    endpoint: http://prometheus:9090/api/v1/write

  loki:
    endpoint: http://loki:3100/loki/api/v1/push

service:
  pipelines:
    traces:
      receivers: [otlp]
      processors: [memory_limiter, batch, attributes]
      exporters: [otlp/jaeger]

    metrics:
      receivers: [otlp, prometheus]
      processors: [memory_limiter, batch]
      exporters: [prometheusremotewrite]

    logs:
      receivers: [otlp]
      processors: [memory_limiter, batch]
      exporters: [loki]
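
Conceptually, each pipeline is just receivers feeding a chain of processors feeding exporters. A minimal sketch of that flow (types and names here are illustrative, not the Collector's actual Go interfaces):

```typescript
// Miniature pipeline: data flows receiver -> processors (in order) -> exporter.
type Span = { name: string; attributes: Record<string, string> };
type Processor = (batch: Span[]) => Span[];

// Mirrors the `attributes` processor's upsert action in the config above.
const addEnvironment: Processor = (batch) =>
  batch.map((s) => ({
    ...s,
    attributes: { ...s.attributes, environment: 'production' },
  }));

function runPipeline(
  received: Span[],
  processors: Processor[],
  exporter: (batch: Span[]) => void
): void {
  // Each processor transforms the batch before it reaches the exporter.
  exporter(processors.reduce((batch, p) => p(batch), received));
}

const exported: Span[] = [];
runPipeline(
  [{ name: 'processPayment', attributes: { 'order.id': 'order-1234' } }],
  [addEnvironment],
  (batch) => exported.push(...batch)
);
console.log(exported[0].attributes.environment); // 'production'
```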

5. Distributed Tracing

Jaeger and Grafana Tempo

Jaeger: Standalone distributed tracing system with a built-in UI for quick start.

Grafana Tempo: Tracing backend integrated into the Grafana ecosystem. Lower storage costs due to indexless design.

Tracing Stack with Docker Compose

# docker-compose.yaml
services:
  jaeger:
    image: jaegertracing/all-in-one:latest
    ports:
      - "16686:16686"  # Jaeger UI
      # No OTLP ports published here: the collector reaches Jaeger over the
      # internal network, leaving host ports 4317/4318 free for the collector
    environment:
      - COLLECTOR_OTLP_ENABLED=true

  otel-collector:
    image: otel/opentelemetry-collector-contrib:latest
    command: ["--config=/etc/otel-collector-config.yaml"]
    volumes:
      - ./otel-collector-config.yaml:/etc/otel-collector-config.yaml
    ports:
      - "4317:4317"
      - "4318:4318"
    depends_on:
      - jaeger

Trace Analysis Tips

  1. Find slow spans: Identify the longest span in the overall trace
  2. Filter error spans: Filter by status=ERROR to pinpoint failure locations
  3. Service map: Visualize dependencies and call patterns between services
  4. Comparative analysis: Compare normal traces with problematic ones side by side

6. Logging

ELK vs Loki vs CloudWatch

| Aspect | ELK Stack | Grafana Loki | CloudWatch Logs |
|---|---|---|---|
| Indexing | Full-text index | Label-based | Log groups |
| Storage cost | High | Low | Medium |
| Query language | KQL/Lucene | LogQL | Insights |
| Grafana integration | Plugin | Native | Plugin |
| Best for scale | Large | Small to medium | AWS native |

Grafana Loki + LogQL

# Error logs by service
{service="payment-service"} |= "ERROR"

# JSON parsing then filter
{service="api-gateway"} | json | status >= 500

# Error frequency (per minute)
count_over_time({service="payment-service"} |= "ERROR" [1m])

# Slow request filter (over 1 second)
{service="api-gateway"} | json | duration > 1000

# Search all logs by specific trace ID
{service=~".+"} |= "trace_id=abc123def456"
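
Loki's query model, select streams by label first and only then scan those streams' text, can be sketched as follows (the stream/entry shapes are hypothetical, not Loki's storage format):

```typescript
// Label-first selection, then line filtering -- the shape of a LogQL query
// like {service="payment-service"} |= "ERROR".
type Stream = { labels: Record<string, string>; lines: string[] };

function queryLogs(
  streams: Stream[],
  labelMatch: Record<string, string>,
  lineContains: string
): string[] {
  return streams
    // Cheap step: only streams whose labels match are touched (label index).
    .filter((s) =>
      Object.entries(labelMatch).every(([k, v]) => s.labels[k] === v)
    )
    // Expensive step: scan log lines of the selected streams only.
    .flatMap((s) => s.lines.filter((line) => line.includes(lineContains)));
}

const hits = queryLogs(
  [
    { labels: { service: 'payment-service' }, lines: ['INFO ok', 'ERROR timeout'] },
    { labels: { service: 'api-gateway' }, lines: ['ERROR 502'] },
  ],
  { service: 'payment-service' },
  'ERROR'
);
console.log(hits); // ['ERROR timeout']
```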

Structured Logging Implementation (Node.js)

import pino from 'pino';

const logger = pino({
  level: process.env.LOG_LEVEL || 'info',
  formatters: {
    level(label) {
      return { level: label };
    },
  },
  timestamp: pino.stdTimeFunctions.isoTime,
  base: {
    service: 'payment-service',
    version: '1.2.0',
    environment: process.env.NODE_ENV,
  },
});

// Include per-request context
function createRequestLogger(req) {
  return logger.child({
    requestId: req.id,
    traceId: req.headers['x-trace-id'],
    userId: req.user?.id,
    method: req.method,
    path: req.url,
  });
}

// Usage example
app.use((req, res, next) => {
  req.startTime = Date.now(); // record start time so duration can be computed
  req.log = createRequestLogger(req);
  req.log.info('Request received');

  res.on('finish', () => {
    req.log.info({
      statusCode: res.statusCode,
      duration: Date.now() - req.startTime,
    }, 'Request completed');
  });

  next();
});

7. SRE Core Concepts

SLI (Service Level Indicator)

Specific metrics that measure service quality.

# Availability SLI: ratio of successful requests
sum(rate(http_requests_total{status!~"5.."}[30d]))
/
sum(rate(http_requests_total[30d]))

# Latency SLI: fraction of requests served in under 300ms
sum(rate(http_request_duration_seconds_bucket{le="0.3"}[30d]))
/
sum(rate(http_request_duration_seconds_count[30d]))

SLO (Service Level Objective)

The target value for an SLI.

  • Availability SLO: 99.9% (monthly downtime of 43 minutes)
  • Latency SLO: P99 response time under 300ms
# SLO definition (Sloth format)
version: "prometheus/v1"
service: "payment-service"
labels:
  team: "platform"
slos:
  - name: "availability"
    objective: 99.9
    sli:
      events:
        error_query: sum(rate(http_requests_total{status=~"5..",service="payment"}[{{.window}}]))
        total_query: sum(rate(http_requests_total{service="payment"}[{{.window}}]))
    alerting:
      page_alert:
        labels:
          severity: critical
      ticket_alert:
        labels:
          severity: warning

Error Budget

With a 99.9% SLO, the error budget is 0.1%.

  • Over 30 days: 43.2 minutes of downtime allowed
  • Budget remaining: Deploy new features, run experiments
  • Budget exhausted: Focus on stability, freeze deployments
# Remaining error budget (%)
1 - (
  (1 - service:availability:ratio30d)
  /
  (1 - 0.999)
)
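
The same arithmetic in a few lines of TypeScript, a toy calculation matching the remaining-budget PromQL above, not a production SLO tool:

```typescript
// Error-budget arithmetic for an SLO over a fixed window.
function errorBudget(
  slo: number,                  // e.g. 0.999
  windowDays: number,           // e.g. 30
  measuredAvailability: number  // e.g. 0.9995
) {
  const windowMinutes = windowDays * 24 * 60;      // 30d -> 43,200 min
  const budgetMinutes = windowMinutes * (1 - slo); // allowed downtime
  // Fraction of the budget still unspent:
  const remainingRatio = 1 - (1 - measuredAvailability) / (1 - slo);
  return { budgetMinutes, remainingRatio };
}

const b = errorBudget(0.999, 30, 0.9995);
console.log(b.budgetMinutes);  // about 43.2 minutes allowed per 30 days
console.log(b.remainingRatio); // about 0.5 -> half the budget left
```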

SLA (Service Level Agreement)

A contract with customers. Set more loosely than SLOs.

SLA (contract) < SLO (internal target) < SLI (measurement, when healthy)

Example:
- SLA: 99.9% (contract, refund on violation)
- SLO: 99.95% (internal target, stricter than SLA)
- SLI: 99.97% (actual measurement)

8. Alerting Strategy

Alert Pyramid

          /  P1: Page  \          -> Immediate response (PagerDuty)
         / (Critical,   \
        /  customer impact)\
       /-------------------\
      /   P2: Ticket        \    -> Handle during business hours (Jira)
     / (Degradation, risk)   \
    /------------------------\
   /    P3: Notification      \  -> Awareness only (Slack)
  /  (Warning, trend changes)  \
 /-----------------------------\
/     P4: Dashboard only        \ -> Check dashboards
/ (Reference metrics, auto-heal) \

AlertManager Routing

# alertmanager.yml
global:
  resolve_timeout: 5m

route:
  receiver: 'default-slack'
  group_by: ['alertname', 'service']
  group_wait: 30s
  group_interval: 5m
  repeat_interval: 4h

  routes:
    - match:
        severity: critical
      receiver: 'pagerduty-critical'
      group_wait: 10s
      repeat_interval: 1h

    - match:
        severity: warning
      receiver: 'slack-warnings'
      repeat_interval: 4h

    - match:
        severity: info
      receiver: 'slack-info'
      repeat_interval: 12h

receivers:
  - name: 'pagerduty-critical'
    pagerduty_configs:
      - service_key: 'your-pagerduty-key'
        severity: critical

  - name: 'slack-warnings'
    slack_configs:
      - api_url: 'https://hooks.slack.com/services/xxx'
        channel: '#alerts-warning'
        title: '[WARNING] {{ .GroupLabels.alertname }}'

  - name: 'slack-info'
    slack_configs:
      - api_url: 'https://hooks.slack.com/services/xxx'
        channel: '#alerts-info'

  - name: 'default-slack'
    slack_configs:
      - api_url: 'https://hooks.slack.com/services/xxx'
        channel: '#alerts'

inhibit_rules:
  - source_match:
      severity: critical
    target_match:
      severity: warning
    equal: ['alertname', 'service']
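
Alertmanager walks the routing tree and picks the first matching route, falling back to the root receiver. That selection logic looks roughly like this (simplified: a flat route list, exact label matching only, no nesting or grouping):

```typescript
// First-match routing over a flat list of child routes.
type Route = { match: Record<string, string>; receiver: string };

function selectReceiver(
  labels: Record<string, string>,
  routes: Route[],
  defaultReceiver: string
): string {
  for (const route of routes) {
    const matches = Object.entries(route.match).every(
      ([k, v]) => labels[k] === v
    );
    if (matches) return route.receiver; // first match wins
  }
  return defaultReceiver; // root receiver handles everything else
}

const routes: Route[] = [
  { match: { severity: 'critical' }, receiver: 'pagerduty-critical' },
  { match: { severity: 'warning' }, receiver: 'slack-warnings' },
  { match: { severity: 'info' }, receiver: 'slack-info' },
];

console.log(
  selectReceiver({ severity: 'critical', service: 'payment' }, routes, 'default-slack')
); // 'pagerduty-critical'
console.log(selectReceiver({ severity: 'debug' }, routes, 'default-slack'));
// 'default-slack'
```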

Qualities of Good Alerts

  1. Actionable: There should be something you can do when alerted
  2. Severity distinction: Only page for truly urgent matters
  3. Context included: Include runbook links and related dashboard links
  4. Prevent alert fatigue: Too many alerts cause all alerts to be ignored
  5. Auto-remediation first: Whenever possible, auto-recover then notify

9. On-Call Culture

On-Call Rotation Design

Weekly rotation example:
- Primary: First responder (respond within 5 minutes)
- Secondary: Backup responder (escalated if Primary does not acknowledge within 10 minutes)
- Manager: Escalated if the incident is still unresolved after 30 minutes

Rotation period: 1 week
Handoff: Every Monday at 10 AM
Compensation: On-call pay, compensatory time off

Incident Response Process

1. Detect
   +-- Receive alert, initial impact assessment

2. Respond
   +-- Create incident channel, assign roles
       - IC (Incident Commander): Coordination
       - Tech Lead: Technical investigation
       - Comms: Customer/stakeholder communication

3. Mitigate
   +-- Immediate action (rollback, scale out, etc.)

4. Resolve
   +-- Fix root cause, verify service recovery

5. Postmortem
   +-- Blameless retrospective, derive action items to prevent recurrence

10. Production Monitoring Stack Architecture

+-------------------------------------------+
|                  Grafana                  |
|     (Dashboards, Alerts, Exploration)     |
+-------+-------------+-------------+-------+
        |             |             |
   +----v-----+  +----v-----+  +----v-----+
   |Prometheus|  |   Loki   |  |  Tempo   |
   |(Metrics) |  |  (Logs)  |  | (Traces) |
   +----^-----+  +----^-----+  +----^-----+
        |             |             |
   +----+-------------+-------------+-----+
   |       OpenTelemetry Collector        |
   |  (Collection, Processing, Routing)   |
   +----^-------------^-------------^-----+
        |             |             |
   +----+-----+  +----+-----+  +----+-----+
   | Service A|  | Service B|  | Service C|
   | +OTel SDK|  | +OTel SDK|  | +OTel SDK|
   +----------+  +----------+  +----------+

Kubernetes Environment Monitoring

# kube-prometheus-stack values.yaml (Helm)
prometheus:
  prometheusSpec:
    retention: 15d
    storageSpec:
      volumeClaimTemplate:
        spec:
          storageClassName: gp3
          resources:
            requests:
              storage: 100Gi

grafana:
  dashboardProviders:
    dashboardproviders.yaml:
      apiVersion: 1
      providers:
        - name: 'default'
          folder: ''
          type: file
          options:
            path: /var/lib/grafana/dashboards

alertmanager:
  config:
    route:
      receiver: 'slack'
      group_by: ['alertname', 'namespace']
    receivers:
      - name: 'slack'
        slack_configs:
          - api_url: 'https://hooks.slack.com/services/xxx'
            channel: '#k8s-alerts'

11. Interview Questions: 15 Essential Topics

Basics (1-5)

Q1. Explain the three pillars of observability.

Metrics, Logs, and Traces. Metrics are numerical time-series data providing aggregated views of system state. Logs are detailed text records of events. Traces show the path of a request across multiple services.

Q2. Explain Prometheus's pull model.

Prometheus directly scrapes target service /metrics endpoints periodically. Unlike push models, the server controls collection targets, and combined with service discovery, supports dynamic environments.

Q3. Explain the differences between Counter, Gauge, and Histogram.

Counter is a monotonically increasing value (total requests), Gauge is a current value that goes up and down (memory usage), Histogram observes value distributions in buckets (response time distribution).
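
The cumulative `le` bucket behavior behind Histogram can be shown in a few lines (a toy sketch, not the Prometheus client library):

```typescript
// Prometheus-style histogram: each `le` bucket counts observations <= its
// upper bound, so bucket counts are cumulative; +Inf counts everything.
function observeAll(bounds: number[], observations: number[]): Map<string, number> {
  const counts = new Map<string, number>(bounds.map((b) => [String(b), 0]));
  counts.set('+Inf', 0);
  for (const value of observations) {
    for (const bound of bounds) {
      if (value <= bound) counts.set(String(bound), counts.get(String(bound))! + 1);
    }
    counts.set('+Inf', counts.get('+Inf')! + 1);
  }
  return counts;
}

// Response times in seconds against 0.1 / 0.5 / 1.0 buckets.
const buckets = observeAll([0.1, 0.5, 1.0], [0.05, 0.3, 0.3, 0.9, 2.5]);
console.log(buckets.get('0.1'));  // 1
console.log(buckets.get('0.5'));  // 3 (includes the 0.1 bucket's observation)
console.log(buckets.get('1'));    // 4
console.log(buckets.get('+Inf')); // 5 (every observation)
```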

Q4. Why is structured logging important?

Logging in a consistent format like JSON enables automated parsing, filtering, and searching. Including traceId connects logs with traces in distributed systems, making debugging much faster.

Q5. Explain the differences between SLI, SLO, and SLA.

SLI (Service Level Indicator) is an actual measured metric, SLO (Service Level Objective) is an internal target value, SLA (Service Level Agreement) is a legal contract with customers. SLAs are set more loosely than SLOs.

Intermediate (6-10)

Q6. What's the difference between PromQL's rate() and increase()?

rate() returns the average per-second increase rate, while increase() returns the total increase over a given time range. rate() is suited for graphs, increase() for total counts.
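
Both functions also handle counter resets (a counter dropping means the process restarted from zero). A simplified sketch of the semantics; real Prometheus additionally extrapolates to the window boundaries:

```typescript
// Counter samples: [timestampSeconds, value]. Counters only ever reset to
// zero, so a drop in value signals a reset.
type Sample = [number, number];

function increase(samples: Sample[]): number {
  let total = 0;
  for (let i = 1; i < samples.length; i++) {
    const delta = samples[i][1] - samples[i - 1][1];
    // On a reset the counter restarted from 0, so the new value IS the delta.
    total += delta >= 0 ? delta : samples[i][1];
  }
  return total;
}

function rate(samples: Sample[]): number {
  const seconds = samples[samples.length - 1][0] - samples[0][0];
  return increase(samples) / seconds; // average per-second increase
}

// 100 -> 160, then a reset to 0 followed by more requests.
const samples: Sample[] = [[0, 100], [30, 160], [60, 10], [90, 30]];
console.log(increase(samples)); // 90  (60 + 10 + 20)
console.log(rate(samples));     // 1   (90 over 90 seconds)
```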

Q7. Explain the role of the OpenTelemetry Collector.

It receives telemetry data (metrics, logs, traces) through Receivers, processes it through Processors (batching, filtering), and sends it to multiple backends through Exporters. It acts as an intermediate layer between applications and backends, preventing vendor lock-in.

Q8. Explain the concept and application of Error Budget.

The error ratio allowed by an SLO. With 99.9% SLO, the error budget is 0.1% (43 min/month). When budget remains, deploy new features. When exhausted, focus on stabilization. It manages the balance between development velocity and reliability with numbers.

Q9. What is the relationship between Span and Trace in Distributed Tracing?

A Trace is the complete journey of a request through the system. A Span is an individual unit of work within that journey. Spans form a tree with parent-child relationships, each containing start/end times, attributes, and status.
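
The tree can be reconstructed from flat span records via parent IDs, which is essentially what a tracing UI renders (the record shape here is illustrative):

```typescript
// Rebuild the parent/child tree from flat spans, as a tracing UI does.
type SpanRec = { id: string; parentId?: string; name: string; durationMs: number };

function childrenOf(spans: SpanRec[], parentId?: string): SpanRec[] {
  return spans.filter((s) => s.parentId === parentId);
}

// Depth-first render: indentation mirrors the parent-child nesting.
function renderTrace(spans: SpanRec[], parentId?: string, depth = 0): string[] {
  return childrenOf(spans, parentId).flatMap((s) => [
    `${'  '.repeat(depth)}${s.name} ${s.durationMs}ms`,
    ...renderTrace(spans, s.id, depth + 1),
  ]);
}

const spans: SpanRec[] = [
  { id: '1', name: 'API Gateway', durationMs: 5025 },
  { id: '2', parentId: '1', name: 'Payment Service', durationMs: 5003 },
  { id: '3', parentId: '2', name: 'External Payment API', durationMs: 5000 },
];
console.log(renderTrace(spans).join('\n'));
// API Gateway 5025ms
//   Payment Service 5003ms
//     External Payment API 5000ms
```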

Q10. What are the key differences between Grafana Loki and ELK?

ELK full-text indexes log content, which gives powerful search at a high storage cost. Loki indexes only labels and stores the log text compressed, which costs far less but requires narrowing by labels before text search.

Advanced (11-15)

Q11. How do you prevent alert fatigue?

Set only actionable alerts with clearly distinguished severity levels. Use inhibit rules to suppress duplicate alerts and grouping to consolidate similar ones. Regularly review alerts to eliminate noise.

Q12. Why are Prometheus Recording Rules needed?

They pre-compute complex PromQL queries and store results as new time series. This reduces dashboard loading times and prevents repeated execution of the same queries. Especially effective for long-range queries like SLO dashboards.

Q13. What is Context Propagation in OpenTelemetry?

The mechanism for propagating trace context (trace ID, span ID) between services. Propagated through HTTP headers (W3C Trace Context) or message queue metadata, enabling end-to-end tracking of a single request in distributed systems.
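
Concretely, the W3C Trace Context `traceparent` header carries that context, and parsing it is simple string work (format per the spec: version, trace-id, parent-id, flags, all lowercase hex):

```typescript
// Parse a W3C Trace Context `traceparent` header:
//   version(2 hex)-trace-id(32 hex)-parent-id(16 hex)-flags(2 hex)
function parseTraceparent(header: string) {
  const m = /^([0-9a-f]{2})-([0-9a-f]{32})-([0-9a-f]{16})-([0-9a-f]{2})$/.exec(header);
  if (!m) return null; // malformed: the receiver should start a new trace
  const [, version, traceId, parentSpanId, flags] = m;
  // Bit 0 of flags is the "sampled" flag.
  return { version, traceId, parentSpanId, sampled: (parseInt(flags, 16) & 1) === 1 };
}

const ctx = parseTraceparent(
  '00-0af7651916cd43dd8448eb211c80319c-b7ad6b7169203331-01'
);
console.log(ctx?.traceId); // '0af7651916cd43dd8448eb211c80319c'
console.log(ctx?.sampled); // true
```

In practice the OTel SDK's propagators do this (and the matching injection on outgoing requests) automatically.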

Q14. Compare Golden Signals with RED/USE methodologies.

Google's Golden Signals are Latency, Traffic, Errors, Saturation. RED (Rate, Errors, Duration) is service-oriented, USE (Utilization, Saturation, Errors) is infrastructure-oriented. Typically apply RED for services and USE for infrastructure.

Q15. What are the core principles of blameless postmortems?

Focus on system failures rather than blaming individuals. Reconstruct the timeline, analyze contributing factors, and derive specific, measurable action items. The goal is to improve systems so the same problem does not recur.


12. Practice Quiz: 5 Questions

Q1. Which Prometheus metric type is most suitable for representing fluctuating values like "current memory usage"?

Answer: Gauge

Gauge represents instantaneous values that go up and down. Counter only increases monotonically, making it unsuitable for values like memory usage that can decrease. Histogram is used for measuring distributions.

Q2. With a 99.9% SLO, approximately how many minutes of error budget (allowed downtime) do you have over 30 days?

Answer: Approximately 43 minutes

30 days = 43,200 minutes. Error budget = 43,200 x 0.001 = 43.2 minutes. Outages within this time do not violate the SLO.

Q3. What are the three main components of the OpenTelemetry Collector?

Answer: Receivers, Processors, Exporters

Receivers receive data (OTLP, Prometheus, etc.), Processors process data (batching, filtering, adding attributes, etc.), and Exporters send data to backends (Jaeger, Prometheus, Loki, etc.).

Q4. What does this PromQL query calculate? histogram_quantile(0.95, sum(rate(http_request_duration_seconds_bucket[5m])) by (le))

Answer: The 95th percentile (P95) of HTTP request response times over the last 5 minutes

histogram_quantile calculates quantiles from histogram buckets. 0.95 means 95%, extracting P95 from bucket data grouped by the le (less than or equal) label.
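
The interpolation histogram_quantile performs can be sketched as follows (simplified: assumes cumulative buckets sorted by le and ending at +Inf):

```typescript
// Estimate a quantile from cumulative le-buckets: find the bucket where the
// target rank falls, then interpolate linearly inside that bucket.
type Bucket = { le: number; count: number }; // cumulative count up to `le`

function histogramQuantile(q: number, buckets: Bucket[]): number {
  const total = buckets[buckets.length - 1].count; // +Inf holds everything
  const rank = q * total;
  for (let i = 0; i < buckets.length; i++) {
    if (buckets[i].count >= rank) {
      const lower = i === 0 ? 0 : buckets[i - 1].le;
      const prevCount = i === 0 ? 0 : buckets[i - 1].count;
      // Linear interpolation inside the bucket [lower, le].
      return (
        lower +
        (buckets[i].le - lower) * ((rank - prevCount) / (buckets[i].count - prevCount))
      );
    }
  }
  return buckets[buckets.length - 1].le;
}

const buckets: Bucket[] = [
  { le: 0.1, count: 50 },
  { le: 0.5, count: 90 },
  { le: 1.0, count: 100 },
  { le: Infinity, count: 100 },
];
console.log(histogramQuantile(0.95, buckets)); // about 0.75 (halfway into 0.5-1.0)
```

This is also why bucket boundaries matter: the estimate can never be more precise than the bucket the quantile lands in.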

Q5. What problem occurs without "Context Propagation" in Distributed Tracing?

Answer: Requests across services cannot be linked into a single trace.

Without Context Propagation, each service creates independent traces. When a single user request traverses multiple services, you cannot see the full path, making debugging in distributed systems extremely difficult.


References

  1. Prometheus Official Documentation
  2. Grafana Official Documentation
  3. OpenTelemetry Official Documentation
  4. Jaeger Official Documentation
  5. Grafana Loki Documentation
  6. Grafana Tempo Documentation
  7. Google SRE Book
  8. Google SRE Workbook
  9. PromQL Cheat Sheet
  10. Sloth - SLO Generator
  11. OpenTelemetry Collector Configuration
  12. Alertmanager Routing Tree
  13. LogQL Documentation
  14. Pino Logger (Node.js)
  15. kube-prometheus-stack Helm Chart
  16. DORA Metrics Guide