Observability & 모니터링 완전 가이드 2025: 로깅, 메트릭, 트레이싱, 알림 전략
1. 모니터링 vs Observability
1.1 모니터링의 한계
전통적인 모니터링은 **미리 예상할 수 있는 문제(known unknowns)**를 감지하는 데 초점을 맞춥니다. CPU 사용률이 90%를 넘으면 알림, 디스크 사용량이 80%를 넘으면 알림 -- 이런 임계값 기반 접근법입니다.
하지만 현대의 분산 시스템에서는 이것만으로 부족합니다:
- 마이크로서비스 간 복잡한 상호작용
- 일시적(transient) 오류의 빈번한 발생
- 예측하지 못한 문제(unknown unknowns)의 증가
- 단일 메트릭으로 설명할 수 없는 성능 저하
1.2 Observability란
Observability는 시스템의 **외부 출력(external outputs)**을 통해 **내부 상태(internal state)**를 이해할 수 있는 능력입니다.
핵심 차이:
- 모니터링: "무엇이 고장났는가?" (What is broken?)
- Observability: "왜 고장났는가?" (Why is it broken?)
Observability의 세 가지 축(Three Pillars):
- Logs - 이산 이벤트의 기록
- Metrics - 시간에 따른 수치 측정
- Traces - 분산 시스템에서의 요청 흐름
2. OpenTelemetry (OTel) 핵심 이해
2.1 OpenTelemetry란
OpenTelemetry는 CNCF(Cloud Native Computing Foundation)에서 관리하는 벤더 중립적 텔레메트리 수집 프레임워크입니다. 로그, 메트릭, 트레이스를 하나의 표준으로 통합합니다.
구성 요소:
- API: 계측(instrumentation)을 위한 인터페이스
- SDK: API의 구현체
- Collector: 텔레메트리 데이터 수집, 처리, 내보내기
- 자동 계측(Auto-instrumentation): 코드 수정 없이 텔레메트리 수집
2.2 OTel SDK 설정
// Node.js OpenTelemetry 설정
import { NodeSDK } from '@opentelemetry/sdk-node';
import { getNodeAutoInstrumentations } from '@opentelemetry/auto-instrumentations-node';
import { OTLPTraceExporter } from '@opentelemetry/exporter-trace-otlp-http';
import { OTLPMetricExporter } from '@opentelemetry/exporter-metrics-otlp-http';
import { PeriodicExportingMetricReader } from '@opentelemetry/sdk-metrics';
import { Resource } from '@opentelemetry/resources';
import {
ATTR_SERVICE_NAME,
ATTR_SERVICE_VERSION,
} from '@opentelemetry/semantic-conventions';
const sdk = new NodeSDK({
resource: new Resource({
[ATTR_SERVICE_NAME]: 'order-service',
[ATTR_SERVICE_VERSION]: '1.2.0',
'deployment.environment': process.env.NODE_ENV || 'production',
}),
traceExporter: new OTLPTraceExporter({
url: 'http://otel-collector:4318/v1/traces',
}),
metricReader: new PeriodicExportingMetricReader({
exporter: new OTLPMetricExporter({
url: 'http://otel-collector:4318/v1/metrics',
}),
exportIntervalMillis: 30000,
}),
instrumentations: [
getNodeAutoInstrumentations({
// HTTP, Express, pg, Redis 등 자동 계측
'@opentelemetry/instrumentation-http': {
ignoreIncomingRequestHook: (req) => {
// 헬스체크 경로 제외
return req.url === '/health';
},
},
'@opentelemetry/instrumentation-express': {
enabled: true,
},
}),
],
});
sdk.start();
// 정상 종료 시 텔레메트리 플러시
process.on('SIGTERM', () => {
sdk.shutdown().finally(() => process.exit(0));
});
2.3 수동 Span 생성
import { trace, SpanStatusCode, context } from '@opentelemetry/api';
const tracer = trace.getTracer('order-service');
async function processOrder(orderData) {
// 수동 Span 생성
return tracer.startActiveSpan('processOrder', async (span) => {
try {
// Span에 속성 추가
span.setAttribute('order.id', orderData.id);
span.setAttribute('order.total', orderData.total);
span.setAttribute('order.items_count', orderData.items.length);
// 하위 Span: 재고 확인
const inventory = await tracer.startActiveSpan(
  'checkInventory',
  async (childSpan) => {
    try {
      childSpan.setAttribute('inventory.warehouse', 'us-east-1');
      return await inventoryService.check(orderData.items);
    } finally {
      // 예외가 발생해도 Span이 항상 종료되도록 보장
      childSpan.end();
    }
  }
);
if (!inventory.available) {
span.setStatus({
code: SpanStatusCode.ERROR,
message: 'Insufficient inventory',
});
throw new Error('Insufficient inventory');
}
// 하위 Span: 결제 처리
const payment = await tracer.startActiveSpan(
  'processPayment',
  async (childSpan) => {
    try {
      childSpan.setAttribute('payment.method', orderData.paymentMethod);
      childSpan.setAttribute('payment.amount', orderData.total);
      return await paymentService.charge(orderData);
    } finally {
      // 예외가 발생해도 Span이 항상 종료되도록 보장
      childSpan.end();
    }
  }
);
// Span 이벤트 추가
span.addEvent('order_completed', {
'order.id': orderData.id,
'payment.transaction_id': payment.transactionId,
});
span.setStatus({ code: SpanStatusCode.OK });
return { orderId: orderData.id, transactionId: payment.transactionId };
} catch (error) {
span.setStatus({
code: SpanStatusCode.ERROR,
message: error.message,
});
span.recordException(error);
throw error;
} finally {
span.end();
}
});
}
2.4 OTel Collector 설정
# otel-collector-config.yaml
receivers:
otlp:
protocols:
grpc:
endpoint: 0.0.0.0:4317
http:
endpoint: 0.0.0.0:4318
processors:
batch:
send_batch_size: 1024
timeout: 5s
memory_limiter:
check_interval: 1s
limit_mib: 512
spike_limit_mib: 128
attributes:
actions:
- key: environment
value: production
action: upsert
# 테일 샘플링 (에러 트레이스 우선 수집)
tail_sampling:
decision_wait: 10s
policies:
- name: error-policy
type: status_code
status_code:
status_codes: [ERROR]
- name: slow-policy
type: latency
latency:
threshold_ms: 1000
- name: probabilistic-policy
type: probabilistic
probabilistic:
sampling_percentage: 10
exporters:
otlp/jaeger:
endpoint: jaeger:4317
tls:
insecure: true
prometheus:
endpoint: 0.0.0.0:8889
loki:
endpoint: http://loki:3100/loki/api/v1/push
service:
pipelines:
traces:
receivers: [otlp]
processors: [memory_limiter, batch, tail_sampling]
exporters: [otlp/jaeger]
metrics:
receivers: [otlp]
processors: [memory_limiter, batch]
exporters: [prometheus]
logs:
receivers: [otlp]
processors: [memory_limiter, batch]
exporters: [loki]
3. 로깅 (Logging)
3.1 구조화된 로깅 (Structured Logging)
// pino를 활용한 구조화된 로깅
import pino from 'pino';
const logger = pino({
level: process.env.LOG_LEVEL || 'info',
// JSON 포맷 (기본)
formatters: {
level(label) {
return { level: label };
},
bindings(bindings) {
return {
service: 'order-service',
version: '1.2.0',
host: bindings.hostname,
pid: bindings.pid,
};
},
},
// 타임스탬프 포맷
timestamp: pino.stdTimeFunctions.isoTime,
// 민감 정보 제거
redact: ['req.headers.authorization', 'req.headers.cookie', '*.password'],
});
// Correlation ID 미들웨어
function correlationMiddleware(req, res, next) {
const correlationId = req.headers['x-correlation-id'] || generateId();
req.correlationId = correlationId;
req.startTime = Date.now(); // 응답 로그의 duration 계산에 사용
res.setHeader('x-correlation-id', correlationId);
// 요청별 child logger
req.log = logger.child({
correlationId,
requestId: generateId(),
method: req.method,
path: req.path,
userAgent: req.headers['user-agent'],
});
next();
}
// 사용 예시
app.post('/api/orders', correlationMiddleware, async (req, res) => {
req.log.info({ body: req.body }, 'Order creation started');
try {
const order = await createOrder(req.body);
req.log.info(
{ orderId: order.id, duration: Date.now() - req.startTime },
'Order created successfully'
);
res.json(order);
} catch (error) {
req.log.error(
{ err: error, body: req.body },
'Order creation failed'
);
res.status(500).json({ error: 'Internal server error' });
}
});
3.2 로그 레벨 전략
| 레벨 | 용도 | 예시 |
|---|---|---|
| FATAL | 시스템 중단 | 데이터베이스 연결 완전 실패 |
| ERROR | 오류 발생 | 결제 처리 실패 |
| WARN | 잠재적 문제 | 재시도 성공, 캐시 미스 |
| INFO | 주요 비즈니스 이벤트 | 주문 생성, 사용자 로그인 |
| DEBUG | 디버깅 정보 | SQL 쿼리, API 요청/응답 |
| TRACE | 상세 추적 | 함수 진입/종료, 변수 값 |
// 환경별 로그 레벨 설정
const logLevels = {
production: 'info',
staging: 'debug',
development: 'trace',
};
3.3 ELK Stack (Elasticsearch + Logstash + Kibana)
# docker-compose.yml - ELK Stack
version: '3.8'
services:
elasticsearch:
image: docker.elastic.co/elasticsearch/elasticsearch:8.12.0
environment:
- discovery.type=single-node
- xpack.security.enabled=false
- "ES_JAVA_OPTS=-Xms512m -Xmx512m"
ports:
- "9200:9200"
volumes:
- es-data:/usr/share/elasticsearch/data
logstash:
image: docker.elastic.co/logstash/logstash:8.12.0
volumes:
- ./logstash.conf:/usr/share/logstash/pipeline/logstash.conf
depends_on:
- elasticsearch
kibana:
image: docker.elastic.co/kibana/kibana:8.12.0
ports:
- "5601:5601"
environment:
- ELASTICSEARCH_HOSTS=http://elasticsearch:9200
depends_on:
- elasticsearch
volumes:
es-data:
# logstash.conf
input {
tcp {
port => 5044
codec => json
}
}
filter {
# 타임스탬프 파싱
date {
match => [ "timestamp", "ISO8601" ]
target => "@timestamp"
}
# 에러 스택 트레이스 파싱
if [level] == "error" {
grok {
match => {
"stack" => "%{GREEDYDATA:error_class}: %{GREEDYDATA:error_message}"
}
}
}
# 지역 정보 추가
if [client_ip] {
geoip {
source => "client_ip"
}
}
}
output {
elasticsearch {
hosts => ["elasticsearch:9200"]
index => "logs-%{+YYYY.MM.dd}"
}
}
3.4 Grafana Loki (경량 로그 수집)
# Loki 설정
auth_enabled: false
server:
http_listen_port: 3100
common:
ring:
kvstore:
store: inmemory
replication_factor: 1
schema_config:
configs:
- from: 2024-01-01
store: tsdb
object_store: filesystem
schema: v13
index:
prefix: index_
period: 24h
storage_config:
filesystem:
directory: /loki/chunks
limits_config:
retention_period: 30d
// Loki에 직접 로그 전송 (winston-loki)
import winston from 'winston';
import LokiTransport from 'winston-loki';
const logger = winston.createLogger({
transports: [
new LokiTransport({
host: 'http://loki:3100',
labels: {
service: 'order-service',
environment: 'production',
},
json: true,
batching: true,
interval: 5,
}),
],
});
4. 메트릭 (Metrics)
4.1 Prometheus 기본
Prometheus는 Pull 기반의 시계열 데이터베이스입니다.
메트릭 타입:
- Counter: 단조 증가 값 (요청 수, 에러 수)
- Gauge: 증감 가능한 값 (현재 연결 수, 메모리 사용량)
- Histogram: 값의 분포 (응답 시간, 페이로드 크기)
- Summary: 클라이언트 측 분위수 계산
// Node.js Prometheus 클라이언트
import { Registry, Counter, Gauge, Histogram, Summary } from 'prom-client';
const register = new Registry();
// 기본 메트릭 수집 (CPU, 메모리, 이벤트 루프 등)
import { collectDefaultMetrics } from 'prom-client';
collectDefaultMetrics({ register });
// Counter: HTTP 요청 수
const httpRequestsTotal = new Counter({
name: 'http_requests_total',
help: 'Total number of HTTP requests',
labelNames: ['method', 'path', 'status'],
registers: [register],
});
// Gauge: 현재 활성 연결 수
const activeConnections = new Gauge({
name: 'active_connections',
help: 'Number of active connections',
registers: [register],
});
// Histogram: 요청 응답 시간
const httpRequestDuration = new Histogram({
name: 'http_request_duration_seconds',
help: 'HTTP request duration in seconds',
labelNames: ['method', 'path', 'status'],
buckets: [0.01, 0.05, 0.1, 0.25, 0.5, 1, 2.5, 5, 10],
registers: [register],
});
// Summary: DB 쿼리 시간
const dbQueryDuration = new Summary({
name: 'db_query_duration_seconds',
help: 'Database query duration',
labelNames: ['operation', 'table'],
percentiles: [0.5, 0.9, 0.95, 0.99],
registers: [register],
});
// Express 미들웨어
function metricsMiddleware(req, res, next) {
const start = process.hrtime.bigint();
activeConnections.inc();
res.on('finish', () => {
const duration = Number(process.hrtime.bigint() - start) / 1e9;
const labels = {
method: req.method,
path: req.route?.path || req.path,
status: res.statusCode.toString(),
};
httpRequestsTotal.inc(labels);
httpRequestDuration.observe(labels, duration);
activeConnections.dec();
});
next();
}
// 메트릭 엔드포인트
app.get('/metrics', async (req, res) => {
res.set('Content-Type', register.contentType);
res.end(await register.metrics());
});
4.2 PromQL 핵심 쿼리
# 초당 요청 수 (RPS)
rate(http_requests_total[5m])
# 서비스별 에러율
sum(rate(http_requests_total{status=~"5.."}[5m])) by (service)
/
sum(rate(http_requests_total[5m])) by (service)
# p99 응답 시간
histogram_quantile(0.99, sum(rate(http_request_duration_seconds_bucket[5m])) by (le))
# p95 응답 시간 (경로별)
histogram_quantile(0.95,
sum(rate(http_request_duration_seconds_bucket[5m])) by (le, path)
)
# 메모리 사용량 증가율
deriv(process_resident_memory_bytes[1h])
# 가동 시간
time() - process_start_time_seconds
# Apdex 스코어 (목표 응답시간 0.5초)
(
sum(rate(http_request_duration_seconds_bucket{le="0.5"}[5m]))
+
sum(rate(http_request_duration_seconds_bucket{le="2.0"}[5m]))
) / 2
/
sum(rate(http_request_duration_seconds_count[5m]))
4.3 Recording Rules
# prometheus-rules.yml
groups:
- name: request_rates
interval: 30s
rules:
# 사전 계산된 요청률
- record: service:http_requests:rate5m
expr: sum(rate(http_requests_total[5m])) by (service)
# 사전 계산된 에러율
- record: service:http_errors:rate5m
expr: |
sum(rate(http_requests_total{status=~"5.."}[5m])) by (service)
/
sum(rate(http_requests_total[5m])) by (service)
# 사전 계산된 지연시간 분위수
- record: service:http_latency:p99
expr: |
histogram_quantile(0.99,
sum(rate(http_request_duration_seconds_bucket[5m])) by (le, service)
)
5. 시각화 (Grafana)
5.1 RED Method 대시보드
RED Method는 서비스의 핵심 성능 지표를 모니터링하는 방법론입니다:
- Rate: 초당 요청 수
- Errors: 에러율
- Duration: 응답 시간
{
"panels": [
{
"title": "Request Rate (RPS)",
"type": "timeseries",
"targets": [
{
"expr": "sum(rate(http_requests_total[5m])) by (service)",
"legendFormat": "{{service}}"
}
]
},
{
"title": "Error Rate (%)",
"type": "timeseries",
"targets": [
{
"expr": "100 * sum(rate(http_requests_total{status=~\"5..\"}[5m])) by (service) / sum(rate(http_requests_total[5m])) by (service)",
"legendFormat": "{{service}}"
}
]
},
{
"title": "Response Time (p99)",
"type": "timeseries",
"targets": [
{
"expr": "histogram_quantile(0.99, sum(rate(http_request_duration_seconds_bucket[5m])) by (le, service))",
"legendFormat": "{{service}}"
}
]
}
]
}
5.2 USE Method (인프라 모니터링)
USE Method는 인프라 리소스를 모니터링하는 방법론입니다:
- Utilization: 리소스 사용률
- Saturation: 대기열 길이
- Errors: 에러 수
# CPU Utilization
100 - (avg by (instance) (rate(node_cpu_seconds_total{mode="idle"}[5m])) * 100)
# Memory Utilization
(1 - node_memory_MemAvailable_bytes / node_memory_MemTotal_bytes) * 100
# Disk I/O Saturation
rate(node_disk_io_time_weighted_seconds_total[5m])
# Network Errors
rate(node_network_receive_errs_total[5m])
+ rate(node_network_transmit_errs_total[5m])
6. 분산 트레이싱 (Distributed Tracing)
6.1 핵심 개념
- Trace: 하나의 요청이 시스템을 통과하는 전체 경로
- Span: Trace 내의 개별 작업 단위
- Context Propagation: 서비스 간 트레이스 컨텍스트 전달
- Trace ID: 전체 요청을 식별하는 고유 ID
- Span ID: 개별 작업을 식별하는 고유 ID
- Parent Span ID: 상위 작업의 Span ID
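These IDs travel between services inside the W3C `traceparent` header (see 6.2): version, trace ID, parent span ID, and flags, joined by dashes. A small parser sketch to make the format concrete (illustrative only -- the OTel SDK's `W3CTraceContextPropagator` handles this for you):

```javascript
// Parse a W3C Trace Context `traceparent` header:
//   version(2 hex)-traceId(32 hex)-parentSpanId(16 hex)-flags(2 hex)
function parseTraceparent(header) {
  const m = /^([0-9a-f]{2})-([0-9a-f]{32})-([0-9a-f]{16})-([0-9a-f]{2})$/.exec(header);
  if (!m) return null; // malformed header
  const [, version, traceId, parentId, flags] = m;
  return {
    version,
    traceId,
    parentId,
    // bit 0 of the flags byte means "sampled"
    sampled: (parseInt(flags, 16) & 0x1) === 1,
  };
}

const ctx = parseTraceparent(
  '00-0af7651916cd43dd8448eb211c80319c-b7ad6b7169203331-01'
);
// ctx.traceId → '0af7651916cd43dd8448eb211c80319c', ctx.sampled → true
```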
6.2 Context Propagation
// W3C Trace Context 전파
// 요청 헤더:
// traceparent: 00-0af7651916cd43dd8448eb211c80319c-b7ad6b7169203331-01
// tracestate: congo=t61rcWkgMzE
// Express 미들웨어에서 Context 추출/주입
import { propagation, context, trace } from '@opentelemetry/api';
// HTTP 클라이언트에서 Context 주입
async function makeRequest(url, data) {
const headers = {};
// 현재 컨텍스트에서 trace 정보를 헤더에 주입
propagation.inject(context.active(), headers);
return fetch(url, {
method: 'POST',
headers: {
'Content-Type': 'application/json',
...headers, // traceparent, tracestate 포함
},
body: JSON.stringify(data),
});
}
// gRPC에서 Context 전파
import { GrpcInstrumentation } from '@opentelemetry/instrumentation-grpc';
// 자동 계측으로 gRPC 메타데이터에 Context 주입/추출
6.3 Jaeger 설정
# docker-compose.yml - Jaeger
services:
jaeger:
image: jaegertracing/all-in-one:1.53
ports:
- "16686:16686" # UI
- "4317:4317" # OTLP gRPC
- "4318:4318" # OTLP HTTP
environment:
- COLLECTOR_OTLP_ENABLED=true
- SPAN_STORAGE_TYPE=elasticsearch
- ES_SERVER_URLS=http://elasticsearch:9200
6.4 샘플링 전략
// 샘플링 설정
import { ParentBasedSampler, TraceIdRatioBasedSampler } from '@opentelemetry/sdk-trace-base';
// 비율 기반 샘플링 (10%만 수집)
const sampler = new TraceIdRatioBasedSampler(0.1);
// 부모 기반 샘플링 (부모가 샘플링되었으면 자식도 샘플링)
const parentBasedSampler = new ParentBasedSampler({
root: new TraceIdRatioBasedSampler(0.1),
});
샘플링 전략 비교:
| 전략 | 설명 | 장점 | 단점 |
|---|---|---|---|
| Head Sampling | 트레이스 시작 시 결정 | 단순, 낮은 오버헤드 | 에러 트레이스 누락 가능 |
| Tail Sampling | 트레이스 완료 후 결정 | 에러/느린 트레이스 보존 | 높은 메모리 사용 |
| Rate Limiting | 초당 수집량 제한 | 예측 가능한 비용 | 트래픽 급증 시 누락 |
| Probabilistic | 확률 기반 수집 | 균등한 표본 | 희귀 이벤트 누락 |
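The Rate Limiting row above can be sketched as a counter that resets every second: at most N traces per window are kept, regardless of traffic. This is a toy illustration (in practice you would use the Collector's `rate_limiting` tail-sampling policy); the clock is injected so the behavior is easy to test:

```javascript
// Toy rate-limiting sampler: keep at most `maxPerSecond` trace decisions
// per one-second window, drop the rest. `now` is injectable for testing.
class RateLimitingSampler {
  constructor(maxPerSecond, now = () => Date.now()) {
    this.maxPerSecond = maxPerSecond;
    this.now = now;
    this.windowStart = now();
    this.count = 0;
  }
  shouldSample() {
    const t = this.now();
    if (t - this.windowStart >= 1000) {
      // New one-second window: reset the counter
      this.windowStart = t;
      this.count = 0;
    }
    return ++this.count <= this.maxPerSecond;
  }
}

// With a frozen clock, only the first 2 of 5 decisions pass:
let fakeTime = 0;
const sampler = new RateLimitingSampler(2, () => fakeTime);
const decisions = [1, 2, 3, 4, 5].map(() => sampler.shouldSample());
// decisions → [true, true, false, false, false]
```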
7. 알림 전략 (Alerting Strategy)
7.1 알림 피로 (Alert Fatigue) 방지
알림 피로는 너무 많은 알림으로 인해 중요한 알림을 무시하게 되는 현상입니다.
원칙:
- 실행 가능한(actionable) 알림만 보내기
- 심각도(severity) 레벨 구분
- 적절한 라우팅 (누가, 언제 받을 것인가)
- 알림 그룹화 (같은 문제의 알림 묶기)
- 자동 해소 (문제 해결 시 자동으로 알림 종료)
7.2 심각도 레벨
# Prometheus 알림 규칙
groups:
- name: service_alerts
rules:
# Critical: 즉시 대응 필요
- alert: HighErrorRate
expr: |
sum(rate(http_requests_total{status=~"5.."}[5m])) by (service)
/
sum(rate(http_requests_total[5m])) by (service)
> 0.05
for: 5m
labels:
severity: critical
annotations:
summary: "High error rate on service"
description: "Error rate is above 5% for 5 minutes"
runbook_url: "https://wiki.example.com/runbooks/high-error-rate"
# Warning: 주의 필요
- alert: HighLatency
expr: |
histogram_quantile(0.99,
sum(rate(http_request_duration_seconds_bucket[5m])) by (le, service)
) > 2
for: 10m
labels:
severity: warning
annotations:
summary: "High latency on service"
description: "p99 latency is above 2 seconds for 10 minutes"
# Info: 정보 제공
- alert: PodRestart
expr: |
increase(kube_pod_container_status_restarts_total[1h]) > 3
labels:
severity: info
annotations:
summary: "Pod restarting frequently"
7.3 Alertmanager 설정
# alertmanager.yml
global:
resolve_timeout: 5m
slack_api_url: 'https://hooks.slack.com/services/xxx'
route:
group_by: ['alertname', 'service']
group_wait: 30s
group_interval: 5m
repeat_interval: 4h
receiver: 'default-slack'
routes:
# Critical -> PagerDuty + Slack
- match:
severity: critical
receiver: 'pagerduty-critical'
repeat_interval: 1h
continue: true
- match:
severity: critical
receiver: 'slack-critical'
# Warning -> Slack
- match:
severity: warning
receiver: 'slack-warning'
repeat_interval: 4h
# Info -> Slack (업무 시간만)
- match:
severity: info
receiver: 'slack-info'
active_time_intervals:
- business-hours
receivers:
- name: 'default-slack'
slack_configs:
- channel: '#alerts-general'
- name: 'pagerduty-critical'
pagerduty_configs:
- service_key: 'YOUR_PAGERDUTY_KEY'
severity: critical
- name: 'slack-critical'
slack_configs:
- channel: '#alerts-critical'
color: 'danger'
- name: 'slack-warning'
slack_configs:
- channel: '#alerts-warning'
color: 'warning'
- name: 'slack-info'
slack_configs:
- channel: '#alerts-info'
time_intervals:
- name: business-hours
time_intervals:
- weekdays: ['monday:friday']
times:
- start_time: '09:00'
end_time: '18:00'
inhibit_rules:
# Critical이 발생하면 같은 서비스의 Warning 억제
- source_match:
severity: critical
target_match:
severity: warning
equal: ['service']
7.4 Runbook 작성
# Runbook: High Error Rate
## 알림 조건
- 에러율(5xx)이 5% 이상인 상태가 5분 이상 지속
## 즉시 확인 사항
1. 영향 범위 확인 (어떤 엔드포인트인지)
2. 최근 배포 이력 확인
3. 의존 서비스 상태 확인
## 대응 절차
1. Grafana 대시보드에서 에러 패턴 확인
2. 로그에서 에러 상세 내용 확인
3. 트레이스에서 실패 지점 식별
4. 최근 배포가 원인이면 롤백
5. 의존 서비스 문제이면 서킷 브레이커 확인
## 에스컬레이션
- 15분 내 해결 불가: 팀 리드에 보고
- 30분 내 해결 불가: 시니어 엔지니어 호출
8. SLO / SLI / SLA
8.1 용어 정의
- SLI (Service Level Indicator): 측정 가능한 서비스 품질 지표
- 예: 성공률, 응답 시간, 가용성
- SLO (Service Level Objective): SLI에 대한 목표값
- 예: 가용성 99.9%, p99 응답시간 200ms 이하
- SLA (Service Level Agreement): 고객과의 계약
- SLO 위반 시 보상 조건 포함
8.2 Error Budget (에러 예산)
SLO: 99.9% 가용성
= 한 달(30일)에 허용되는 다운타임: 43.2분
= Error Budget: 0.1%
Error Budget 소진율:
- 전체의 50% 소진: 주의
- 전체의 75% 소진: 기능 릴리스 동결
- 전체의 100% 소진: 안정성 작업에만 집중
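The arithmetic above fits in two helper functions: allowed downtime for an SLO over a window, and the burn rate implied by an observed error rate (a sketch; the 14.4x and 6x thresholds in the next subsection come from exactly this calculation):

```javascript
// Allowed downtime in minutes for a given SLO over a window of `days`.
function errorBudgetMinutes(slo, days = 30) {
  return days * 24 * 60 * (1 - slo);
}

// Burn rate: how many times faster than "exactly on budget" the
// error budget is consumed at the observed error rate.
function burnRate(errorRate, slo) {
  return errorRate / (1 - slo);
}

errorBudgetMinutes(0.999, 30); // ≈ 43.2 minutes per 30 days
burnRate(0.0144, 0.999);       // ≈ 14.4x -- budget gone in 720h / 14.4 = 50h
```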
8.3 Burn Rate Alert
# Burn Rate 기반 알림
groups:
- name: slo_alerts
rules:
# 빠른 소진 (1시간 내 2% 소진) - 즉시 대응
- alert: SLOBurnRateCritical
expr: |
(
sum(rate(http_requests_total{status=~"5.."}[1h]))
/
sum(rate(http_requests_total[1h]))
) > (14.4 * 0.001)
for: 2m
labels:
severity: critical
annotations:
summary: "SLO burn rate critical - error budget exhausting fast"
# 느린 소진 (6시간 내 5% 소진) - 주의
- alert: SLOBurnRateWarning
expr: |
(
sum(rate(http_requests_total{status=~"5.."}[6h]))
/
sum(rate(http_requests_total[6h]))
) > (6 * 0.001)
for: 15m
labels:
severity: warning
annotations:
summary: "SLO burn rate elevated"
9. APM 도구 비교
9.1 주요 APM 솔루션
| 기능 | Datadog | New Relic | Dynatrace | 오픈소스 스택 |
|---|---|---|---|---|
| 가격 모델 | 호스트/사용량 기반 | 사용량 기반 | 호스트 기반 | 인프라 비용만 |
| 자동 계측 | 우수 | 우수 | 최고 | 보통 |
| APM | 포함 | 포함 | 포함 | Jaeger/Tempo |
| 로그 관리 | 포함 | 포함 | 포함 | ELK/Loki |
| 메트릭 | 포함 | 포함 | 포함 | Prometheus |
| AI 기반 분석 | Watchdog | Applied Intelligence | Davis AI | 없음 |
| 설정 복잡도 | 낮음 | 낮음 | 중간 | 높음 |
| 벤더 종속성 | 높음 | 높음 | 높음 | 없음 |
9.2 오픈소스 스택 구성
+----------------+        +------------------+
|  Application   |------->|  OTel Collector  |
|  (OTel SDK)    |        |   (수집/처리)    |
+----------------+        +--------+---------+
                                   |
                   +---------------+---------------+
                   |               |               |
             +-----v------+   +----v----+   +------v-----+
             | Prometheus |   |  Tempo  |   |    Loki    |
             | (Metrics)  |   | (Trace) |   |   (Logs)   |
             +-----+------+   +----+----+   +------+-----+
                   |               |               |
                   +---------------+---------------+
                                   |
                              +----v----+
                              | Grafana |
                              +---------+
                              (통합 시각화)
10. 비용 최적화
10.1 데이터 볼륨 관리
# 로그 보존 정책
retention_policy:
hot_tier: 7d # 최근 7일: 빠른 검색
warm_tier: 30d # 최근 30일: 느린 검색
cold_tier: 90d # 최근 90일: 아카이브
delete_after: 365d # 1년 후 삭제
# 인덱스 수명 관리 (Elasticsearch ILM)
index_lifecycle:
phases:
hot:
actions:
rollover:
max_size: 50gb
max_age: 1d
warm:
min_age: 7d
actions:
shrink:
number_of_shards: 1
forcemerge:
max_num_segments: 1
cold:
min_age: 30d
actions:
searchable_snapshot:
snapshot_repository: s3_repo
delete:
min_age: 365d
10.2 샘플링으로 비용 절감
// 적응형 샘플링
class AdaptiveSampler {
constructor(targetRate = 100) {
this.targetRate = targetRate; // 초당 목표 수집량
this.currentRate = 0;
this.samplingProbability = 1.0;
// 10초마다 샘플링 확률 조정
setInterval(() => this.adjust(), 10000);
}
shouldSample() {
this.currentRate++;
return Math.random() < this.samplingProbability;
}
adjust() {
if (this.currentRate > this.targetRate * 10) {
// 트래픽이 목표의 10배 이상이면 샘플링 확률 줄이기
this.samplingProbability = this.targetRate / this.currentRate;
} else {
this.samplingProbability = Math.min(1.0, this.samplingProbability * 1.1);
}
this.currentRate = 0;
}
}
10.3 메트릭 집계
# Prometheus Recording Rules로 원본 데이터 집계
groups:
- name: aggregation
interval: 1m
rules:
# 서비스별 집계 (인스턴스 레이블 제거로 카디널리티 감소)
- record: service:requests:rate5m
expr: sum(rate(http_requests_total[5m])) by (service, method, status)
# 시간 해상도 줄이기 (5분 -> 1시간)
- record: service:requests:rate1h
expr: sum(rate(http_requests_total[1h])) by (service)
11. 프로덕션 체크리스트
11.1 배포 전 확인 사항
- 모든 서비스에 구조화된 로깅 적용
- Correlation ID 전파 확인
- Prometheus 메트릭 엔드포인트 노출
- OTel 자동 계측 활성화
- 헬스체크 엔드포인트 구현
- SLO 정의 및 에러 예산 설정
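The health-check item above is commonly split into liveness (the process is up) and readiness (dependencies are reachable). A framework-agnostic sketch -- the dependency checkers (`pingDb`, `pingRedis`) are hypothetical async functions you would supply:

```javascript
// Aggregate dependency checks into a readiness verdict.
// Each checker resolves if healthy and rejects/throws otherwise.
async function readiness(checks) {
  const results = {};
  let healthy = true;
  for (const [name, check] of Object.entries(checks)) {
    try {
      await check();
      results[name] = 'ok';
    } catch (err) {
      results[name] = `failed: ${err.message}`;
      healthy = false;
    }
  }
  return { status: healthy ? 'ok' : 'degraded', checks: results };
}

// Usage with an HTTP framework (e.g. Express):
//   app.get('/health', (req, res) => res.sendStatus(200));          // liveness
//   app.get('/ready', async (req, res) => {
//     const r = await readiness({ db: pingDb, cache: pingRedis });  // hypothetical pings
//     res.status(r.status === 'ok' ? 200 : 503).json(r);
//   });
```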
11.2 운영 확인 사항
- Grafana 대시보드 (RED/USE method)
- 알림 규칙 설정 (Critical/Warning/Info)
- Runbook 작성 완료
- 온콜 로테이션 설정
- 로그 보존 정책 설정
- 비용 모니터링
11.3 정기 검토 사항
- 월별 SLO 리뷰
- 알림 노이즈 분석
- 미사용 대시보드/알림 정리
- 비용 최적화 검토
- 인시던트 사후 분석 (Post-mortem)
12. 퀴즈
Q1: Observability의 Three Pillars(세 가지 축)는 무엇인가요?
정답: Logs, Metrics, Traces
- Logs: 개별 이벤트의 기록. 디버깅과 감사에 사용.
- Metrics: 시간에 따른 수치 측정. 시스템 성능 추이 파악.
- Traces: 분산 시스템에서 요청의 전체 경로를 추적. 병목 지점 식별.
이 세 가지를 결합하면 "무엇이 고장났는가"뿐만 아니라 "왜 고장났는가"를 이해할 수 있습니다.
Q2: Prometheus에서 p99 응답 시간을 조회하는 PromQL을 작성하세요.
정답:
histogram_quantile(0.99, sum(rate(http_request_duration_seconds_bucket[5m])) by (le))
histogram_quantile 함수는 Histogram 메트릭에서 분위수를 계산합니다. 0.99는 99번째 백분위수를 의미하고, rate()로 5분간의 변화율을 구한 후 le(less than or equal) 레이블로 그룹화합니다.
Q3: Head Sampling과 Tail Sampling의 차이점은 무엇인가요?
정답:
Head Sampling은 트레이스가 시작될 때 수집 여부를 결정합니다. 단순하고 오버헤드가 낮지만, 에러가 발생한 트레이스를 놓칠 수 있습니다.
Tail Sampling은 트레이스가 완료된 후 수집 여부를 결정합니다. 에러나 느린 응답의 트레이스를 확실하게 보존할 수 있지만, 모든 Span을 메모리에 보관해야 하므로 리소스 사용량이 높습니다.
프로덕션에서는 보통 Tail Sampling을 OTel Collector에서 수행하여 에러 트레이스를 우선적으로 수집합니다.
Q4: SLO가 99.9%일 때 한 달(30일) 동안 허용되는 다운타임은 얼마인가요?
정답: 약 43.2분
계산: 30일 * 24시간 * 60분 = 43,200분
Error Budget = 43,200분 * 0.001 = 43.2분
즉, 한 달에 43.2분까지의 다운타임은 SLO 범위 내입니다. Error Budget이 소진되면 새 기능 릴리스를 중단하고 안정성 개선에 집중해야 합니다.
Q5: RED Method와 USE Method의 차이점을 설명하세요.
정답:
RED Method는 서비스 모니터링에 사용됩니다:
- Rate: 초당 요청 수
- Errors: 에러율
- Duration: 응답 시간
USE Method는 인프라 리소스 모니터링에 사용됩니다:
- Utilization: 리소스 사용률 (CPU, 메모리 등)
- Saturation: 대기열 길이, 과부하 정도
- Errors: 하드웨어/시스템 에러
RED는 사용자 경험 관점에서 서비스를 모니터링하고, USE는 시스템 관점에서 인프라를 모니터링합니다. 둘을 함께 사용하면 문제의 근본 원인을 빠르게 파악할 수 있습니다.
13. 참고 자료
- OpenTelemetry Documentation - https://opentelemetry.io/docs/
- Prometheus Documentation - https://prometheus.io/docs/
- Grafana Documentation - https://grafana.com/docs/
- Jaeger Documentation - https://www.jaegertracing.io/docs/
- Grafana Loki - https://grafana.com/oss/loki/
- Grafana Tempo - https://grafana.com/oss/tempo/
- ELK Stack - https://www.elastic.co/elk-stack
- Google SRE Book - Monitoring - https://sre.google/sre-book/monitoring-distributed-systems/
- Google SRE Book - Service Level Objectives - https://sre.google/sre-book/service-level-objectives/
- pino Logger - https://getpino.io/
- Alertmanager - https://prometheus.io/docs/alerting/latest/alertmanager/
- PagerDuty Incident Response - https://response.pagerduty.com/
- The RED Method - https://grafana.com/blog/2018/08/02/the-red-method-how-to-instrument-your-services/
- The USE Method - https://www.brendangregg.com/usemethod.html
Observability는 단순히 도구를 설치하는 것이 아니라 문화입니다. 모든 팀원이 로그를 구조화하고, 메트릭을 정의하고, 트레이스를 활용하는 습관을 갖추어야 합니다. SLO를 중심으로 알림 전략을 세우고, Error Budget으로 릴리스와 안정성의 균형을 맞추세요. 비용 최적화를 위해 샘플링과 보존 정책을 적극 활용하는 것도 잊지 마세요.
Observability & Monitoring Complete Guide 2025: Logging, Metrics, Tracing, Alerting Strategy
1. Monitoring vs Observability
1.1 Limitations of Monitoring
Traditional monitoring focuses on detecting known unknowns. Alert when CPU usage exceeds 90%, alert when disk usage exceeds 80% -- this threshold-based approach has clear limits.
In modern distributed systems, this alone is insufficient:
- Complex interactions between microservices
- Frequent transient errors
- Increase in unpredictable problems (unknown unknowns)
- Performance degradation that cannot be explained by a single metric
1.2 What is Observability
Observability is the ability to understand a system's internal state through its external outputs.
Key difference:
- Monitoring: "What is broken?"
- Observability: "Why is it broken?"
The Three Pillars of Observability:
- Logs - Records of discrete events
- Metrics - Numeric measurements over time
- Traces - Request flow through distributed systems
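The pillars pay off most when they are correlated: if every log record carries the active trace ID, you can jump from a metric spike to the slow trace, and from a span to its logs. A minimal sketch of attaching trace context to a structured log record (the `withTraceContext` helper and field names are illustrative; a real app would read the context from the OTel API, e.g. `trace.getActiveSpan()`):

```javascript
// Illustrative helper: copy the active span's identifiers onto a log
// record so the log backend (Loki, Elasticsearch) can link it to the trace.
function withTraceContext(logRecord, spanContext) {
  if (!spanContext) return logRecord; // no active span: log as-is
  return {
    ...logRecord,
    trace_id: spanContext.traceId, // joins the record to the full trace
    span_id: spanContext.spanId,   // pinpoints the exact operation
  };
}

const record = withTraceContext(
  { level: 'error', msg: 'payment failed', orderId: 'o-123' },
  { traceId: '0af7651916cd43dd8448eb211c80319c', spanId: 'b7ad6b7169203331' }
);
// record now carries trace_id/span_id alongside the original log fields
```

With pino, the same effect is usually achieved via a `mixin` that reads the active span on every log call.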
2. OpenTelemetry (OTel) Core Concepts
2.1 What is OpenTelemetry
OpenTelemetry is a vendor-neutral telemetry collection framework managed by the CNCF (Cloud Native Computing Foundation). It unifies logs, metrics, and traces under a single standard.
Components:
- API: Interfaces for instrumentation
- SDK: Implementation of the API
- Collector: Collects, processes, and exports telemetry data
- Auto-instrumentation: Collects telemetry without code changes
2.2 OTel SDK Setup
// Node.js OpenTelemetry setup
import { NodeSDK } from '@opentelemetry/sdk-node';
import { getNodeAutoInstrumentations } from '@opentelemetry/auto-instrumentations-node';
import { OTLPTraceExporter } from '@opentelemetry/exporter-trace-otlp-http';
import { OTLPMetricExporter } from '@opentelemetry/exporter-metrics-otlp-http';
import { PeriodicExportingMetricReader } from '@opentelemetry/sdk-metrics';
import { Resource } from '@opentelemetry/resources';
import {
ATTR_SERVICE_NAME,
ATTR_SERVICE_VERSION,
} from '@opentelemetry/semantic-conventions';
const sdk = new NodeSDK({
resource: new Resource({
[ATTR_SERVICE_NAME]: 'order-service',
[ATTR_SERVICE_VERSION]: '1.2.0',
'deployment.environment': process.env.NODE_ENV || 'production',
}),
traceExporter: new OTLPTraceExporter({
url: 'http://otel-collector:4318/v1/traces',
}),
metricReader: new PeriodicExportingMetricReader({
exporter: new OTLPMetricExporter({
url: 'http://otel-collector:4318/v1/metrics',
}),
exportIntervalMillis: 30000,
}),
instrumentations: [
getNodeAutoInstrumentations({
// Auto-instrument HTTP, Express, pg, Redis, etc.
'@opentelemetry/instrumentation-http': {
ignoreIncomingRequestHook: (req) => {
// Exclude health check paths
return req.url === '/health';
},
},
'@opentelemetry/instrumentation-express': {
enabled: true,
},
}),
],
});
sdk.start();
// Flush telemetry on graceful shutdown
process.on('SIGTERM', () => {
sdk.shutdown().finally(() => process.exit(0));
});
2.3 Manual Span Creation
import { trace, SpanStatusCode, context } from '@opentelemetry/api';
const tracer = trace.getTracer('order-service');
async function processOrder(orderData) {
// Create a manual span
return tracer.startActiveSpan('processOrder', async (span) => {
try {
// Add attributes to span
span.setAttribute('order.id', orderData.id);
span.setAttribute('order.total', orderData.total);
span.setAttribute('order.items_count', orderData.items.length);
// Child span: inventory check
const inventory = await tracer.startActiveSpan(
  'checkInventory',
  async (childSpan) => {
    try {
      childSpan.setAttribute('inventory.warehouse', 'us-east-1');
      return await inventoryService.check(orderData.items);
    } finally {
      // Ensure the span always ends, even when the check throws
      childSpan.end();
    }
  }
);
if (!inventory.available) {
span.setStatus({
code: SpanStatusCode.ERROR,
message: 'Insufficient inventory',
});
throw new Error('Insufficient inventory');
}
// Child span: payment processing
const payment = await tracer.startActiveSpan(
  'processPayment',
  async (childSpan) => {
    try {
      childSpan.setAttribute('payment.method', orderData.paymentMethod);
      childSpan.setAttribute('payment.amount', orderData.total);
      return await paymentService.charge(orderData);
    } finally {
      // Ensure the span always ends, even when the charge throws
      childSpan.end();
    }
  }
);
// Add span event
span.addEvent('order_completed', {
'order.id': orderData.id,
'payment.transaction_id': payment.transactionId,
});
span.setStatus({ code: SpanStatusCode.OK });
return { orderId: orderData.id, transactionId: payment.transactionId };
} catch (error) {
span.setStatus({
code: SpanStatusCode.ERROR,
message: error.message,
});
span.recordException(error);
throw error;
} finally {
span.end();
}
});
}
2.4 OTel Collector Configuration
# otel-collector-config.yaml
receivers:
otlp:
protocols:
grpc:
endpoint: 0.0.0.0:4317
http:
endpoint: 0.0.0.0:4318
processors:
batch:
send_batch_size: 1024
timeout: 5s
memory_limiter:
check_interval: 1s
limit_mib: 512
spike_limit_mib: 128
attributes:
actions:
- key: environment
value: production
action: upsert
# Tail sampling (prioritize error traces)
tail_sampling:
decision_wait: 10s
policies:
- name: error-policy
type: status_code
status_code:
status_codes: [ERROR]
- name: slow-policy
type: latency
latency:
threshold_ms: 1000
- name: probabilistic-policy
type: probabilistic
probabilistic:
sampling_percentage: 10
exporters:
otlp/jaeger:
endpoint: jaeger:4317
tls:
insecure: true
prometheus:
endpoint: 0.0.0.0:8889
loki:
endpoint: http://loki:3100/loki/api/v1/push
service:
pipelines:
traces:
receivers: [otlp]
processors: [memory_limiter, batch, tail_sampling]
exporters: [otlp/jaeger]
metrics:
receivers: [otlp]
processors: [memory_limiter, batch]
exporters: [prometheus]
logs:
receivers: [otlp]
processors: [memory_limiter, batch]
exporters: [loki]
3. Logging
3.1 Structured Logging
// Structured logging with pino
import pino from 'pino';
const logger = pino({
level: process.env.LOG_LEVEL || 'info',
// JSON format (default)
formatters: {
level(label) {
return { level: label };
},
bindings(bindings) {
return {
service: 'order-service',
version: '1.2.0',
host: bindings.hostname,
pid: bindings.pid,
};
},
},
// Timestamp format
timestamp: pino.stdTimeFunctions.isoTime,
// Redact sensitive information
redact: ['req.headers.authorization', 'req.headers.cookie', '*.password'],
});
// Correlation ID middleware
function correlationMiddleware(req, res, next) {
const correlationId = req.headers['x-correlation-id'] || generateId();
req.correlationId = correlationId;
req.startTime = Date.now(); // used later to compute request duration
res.setHeader('x-correlation-id', correlationId);
// Per-request child logger
req.log = logger.child({
correlationId,
requestId: generateId(),
method: req.method,
path: req.path,
userAgent: req.headers['user-agent'],
});
next();
}
// Usage example
app.post('/api/orders', correlationMiddleware, async (req, res) => {
req.log.info({ body: req.body }, 'Order creation started');
try {
const order = await createOrder(req.body);
req.log.info(
{ orderId: order.id, duration: Date.now() - req.startTime },
'Order created successfully'
);
res.json(order);
} catch (error) {
req.log.error(
{ err: error, body: req.body },
'Order creation failed'
);
res.status(500).json({ error: 'Internal server error' });
}
});
3.2 Log Level Strategy
| Level | Purpose | Example |
|---|---|---|
| FATAL | System shutdown | Complete database connection failure |
| ERROR | Error occurred | Payment processing failed |
| WARN | Potential issues | Retry succeeded, cache miss |
| INFO | Key business events | Order created, user login |
| DEBUG | Debugging information | SQL queries, API request/response |
| TRACE | Detailed tracing | Function entry/exit, variable values |
// Log levels by environment
const logLevels = {
production: 'info',
staging: 'debug',
development: 'trace',
};
3.3 ELK Stack (Elasticsearch + Logstash + Kibana)
# docker-compose.yml - ELK Stack
version: '3.8'
services:
elasticsearch:
image: docker.elastic.co/elasticsearch/elasticsearch:8.12.0
environment:
- discovery.type=single-node
- xpack.security.enabled=false
- "ES_JAVA_OPTS=-Xms512m -Xmx512m"
ports:
- "9200:9200"
volumes:
- es-data:/usr/share/elasticsearch/data
logstash:
image: docker.elastic.co/logstash/logstash:8.12.0
volumes:
- ./logstash.conf:/usr/share/logstash/pipeline/logstash.conf
depends_on:
- elasticsearch
kibana:
image: docker.elastic.co/kibana/kibana:8.12.0
ports:
- "5601:5601"
environment:
- ELASTICSEARCH_HOSTS=http://elasticsearch:9200
depends_on:
- elasticsearch
volumes:
es-data:
# logstash.conf
input {
tcp {
port => 5044
codec => json
}
}
filter {
# Timestamp parsing
date {
match => [ "timestamp", "ISO8601" ]
target => "@timestamp"
}
# Parse error stack traces
if [level] == "error" {
grok {
match => {
"stack" => "%{GREEDYDATA:error_class}: %{GREEDYDATA:error_message}"
}
}
}
# Add geolocation data
if [client_ip] {
geoip {
source => "client_ip"
}
}
}
output {
elasticsearch {
hosts => ["elasticsearch:9200"]
index => "logs-%{+YYYY.MM.dd}"
}
}
3.4 Grafana Loki (Lightweight Log Aggregation)
# Loki configuration
auth_enabled: false
server:
http_listen_port: 3100
common:
ring:
kvstore:
store: inmemory
replication_factor: 1
schema_config:
configs:
- from: 2024-01-01
store: tsdb
object_store: filesystem
schema: v13
index:
prefix: index_
period: 24h
storage_config:
filesystem:
directory: /loki/chunks
limits_config:
retention_period: 30d
// Send logs directly to Loki (winston-loki)
import winston from 'winston';
import LokiTransport from 'winston-loki';
const logger = winston.createLogger({
transports: [
new LokiTransport({
host: 'http://loki:3100',
labels: {
service: 'order-service',
environment: 'production',
},
json: true,
batching: true,
interval: 5,
}),
],
});
4. Metrics
4.1 Prometheus Fundamentals
Prometheus is a pull-based monitoring system built on a time-series database: it scrapes metrics over HTTP from endpoints exposed by the monitored targets at a regular interval.
Metric types:
- Counter: Monotonically increasing value (request count, error count)
- Gauge: Value that can increase or decrease (active connections, memory usage)
- Histogram: Distribution of values (response times, payload sizes)
- Summary: Client-side percentile computation
// Node.js Prometheus client
import { Registry, Counter, Gauge, Histogram, Summary } from 'prom-client';
const register = new Registry();
// Collect default metrics (CPU, memory, event loop, etc.)
import { collectDefaultMetrics } from 'prom-client';
collectDefaultMetrics({ register });
// Counter: HTTP request count
const httpRequestsTotal = new Counter({
name: 'http_requests_total',
help: 'Total number of HTTP requests',
labelNames: ['method', 'path', 'status'],
registers: [register],
});
// Gauge: Current active connections
const activeConnections = new Gauge({
name: 'active_connections',
help: 'Number of active connections',
registers: [register],
});
// Histogram: Request response time
const httpRequestDuration = new Histogram({
name: 'http_request_duration_seconds',
help: 'HTTP request duration in seconds',
labelNames: ['method', 'path', 'status'],
buckets: [0.01, 0.05, 0.1, 0.25, 0.5, 1, 2.5, 5, 10],
registers: [register],
});
// Summary: DB query time
const dbQueryDuration = new Summary({
name: 'db_query_duration_seconds',
help: 'Database query duration',
labelNames: ['operation', 'table'],
percentiles: [0.5, 0.9, 0.95, 0.99],
registers: [register],
});
// Express middleware
function metricsMiddleware(req, res, next) {
const start = process.hrtime.bigint();
activeConnections.inc();
res.on('finish', () => {
const duration = Number(process.hrtime.bigint() - start) / 1e9;
const labels = {
method: req.method,
path: req.route?.path || req.path,
status: res.statusCode.toString(),
};
httpRequestsTotal.inc(labels);
httpRequestDuration.observe(labels, duration);
activeConnections.dec();
});
next();
}
// Metrics endpoint
app.get('/metrics', async (req, res) => {
res.set('Content-Type', register.contentType);
res.end(await register.metrics());
});
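For intuition, the cumulative-bucket mechanic behind the Histogram above can be sketched as follows. This is illustrative only, not prom-client's actual implementation: each observation increments every bucket whose upper bound (`le`) is greater than or equal to the observed value.

```javascript
// Minimal cumulative-histogram mechanic, for intuition only
// (prom-client's real implementation differs).
class MiniHistogram {
  constructor(buckets) {
    this.buckets = buckets; // upper bounds, ascending
    this.counts = new Array(buckets.length).fill(0);
    this.sum = 0;
    this.count = 0;
  }
  observe(value) {
    this.sum += value;
    this.count += 1;
    // Cumulative: every bucket with le >= value records the observation.
    for (let i = 0; i < this.buckets.length; i++) {
      if (value <= this.buckets[i]) this.counts[i] += 1;
    }
  }
}

const h = new MiniHistogram([0.1, 0.5, 1]);
h.observe(0.3);
h.observe(0.05);
console.log(h.counts); // [1, 2, 2]
```

This cumulative layout is exactly what makes `histogram_quantile()` possible on the server side: Prometheus can interpolate a percentile from the per-bucket counts without ever seeing the raw observations.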
4.2 Essential PromQL Queries
# Requests per second (RPS)
rate(http_requests_total[5m])
# Error rate by service
sum(rate(http_requests_total{status=~"5.."}[5m])) by (service)
/
sum(rate(http_requests_total[5m])) by (service)
# p99 response time
histogram_quantile(0.99, sum(rate(http_request_duration_seconds_bucket[5m])) by (le))
# p95 response time (by path)
histogram_quantile(0.95,
sum(rate(http_request_duration_seconds_bucket[5m])) by (le, path)
)
# Memory usage growth rate
deriv(process_resident_memory_bytes[1h])
# Uptime
time() - process_start_time_seconds
# Apdex score (target response time 0.5s)
(
sum(rate(http_request_duration_seconds_bucket{le="0.5"}[5m]))
+
sum(rate(http_request_duration_seconds_bucket{le="2.0"}[5m]))
) / 2
/
sum(rate(http_request_duration_seconds_count[5m]))
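The division by 2 in the Apdex query works because histogram buckets are cumulative: the `le="2.0"` count already includes the `le="0.5"` count, so `(satisfied + tolerating_cumulative) / 2` equals the classic `satisfied + tolerating/2`. A quick arithmetic check (the counts are made-up example numbers):

```javascript
// Worked check of the Apdex formula used in the PromQL above.
// bucketLe05 and bucketLe2 are cumulative bucket counts.
function apdex(bucketLe05, bucketLe2, total) {
  return (bucketLe05 + bucketLe2) / 2 / total;
}

// Classic Apdex definition: (satisfied + tolerating / 2) / total,
// where tolerating = bucketLe2 - bucketLe05. Both forms agree.
function apdexClassic(bucketLe05, bucketLe2, total) {
  return (bucketLe05 + (bucketLe2 - bucketLe05) / 2) / total;
}

console.log(apdex(800, 950, 1000));        // 0.875
console.log(apdexClassic(800, 950, 1000)); // 0.875
```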
4.3 Recording Rules
# prometheus-rules.yml
groups:
- name: request_rates
interval: 30s
rules:
# Pre-computed request rate
- record: service:http_requests:rate5m
expr: sum(rate(http_requests_total[5m])) by (service)
# Pre-computed error rate
- record: service:http_errors:rate5m
expr: |
sum(rate(http_requests_total{status=~"5.."}[5m])) by (service)
/
sum(rate(http_requests_total[5m])) by (service)
# Pre-computed latency percentiles
- record: service:http_latency:p99
expr: |
histogram_quantile(0.99,
sum(rate(http_request_duration_seconds_bucket[5m])) by (le, service)
)
5. Visualization (Grafana)
5.1 RED Method Dashboard
The RED Method is a methodology for monitoring core service performance indicators:
- Rate: Requests per second
- Errors: Error rate
- Duration: Response time
{
"panels": [
{
"title": "Request Rate (RPS)",
"type": "timeseries",
"targets": [
{
"expr": "sum(rate(http_requests_total[5m])) by (service)",
"legendFormat": "service"
}
]
},
{
"title": "Error Rate (%)",
"type": "timeseries",
"targets": [
{
"expr": "100 * sum(rate(http_requests_total{status=~\"5..\"}[5m])) by (service) / sum(rate(http_requests_total[5m])) by (service)",
"legendFormat": "service"
}
]
},
{
"title": "Response Time (p99)",
"type": "timeseries",
"targets": [
{
"expr": "histogram_quantile(0.99, sum(rate(http_request_duration_seconds_bucket[5m])) by (le, service))",
"legendFormat": "service"
}
]
}
]
}
5.2 USE Method (Infrastructure Monitoring)
The USE Method is a methodology for monitoring infrastructure resources:
- Utilization: Resource usage percentage
- Saturation: Queue length
- Errors: Error count
# CPU Utilization
100 - (avg by (instance) (rate(node_cpu_seconds_total{mode="idle"}[5m])) * 100)
# Memory Utilization
(1 - node_memory_MemAvailable_bytes / node_memory_MemTotal_bytes) * 100
# Disk I/O Saturation
rate(node_disk_io_time_weighted_seconds_total[5m])
# Network Errors
rate(node_network_receive_errs_total[5m])
+ rate(node_network_transmit_errs_total[5m])
6. Distributed Tracing
6.1 Core Concepts
- Trace: The complete path of a request through the system
- Span: An individual unit of work within a Trace
- Context Propagation: Passing trace context between services
- Trace ID: A unique ID identifying the entire request
- Span ID: A unique ID identifying an individual operation
- Parent Span ID: The Span ID of the parent operation
6.2 Context Propagation
// W3C Trace Context propagation
// Request headers:
// traceparent: 00-0af7651916cd43dd8448eb211c80319c-b7ad6b7169203331-01
// tracestate: congo=t61rcWkgMzE
// Extract/inject context in Express middleware
import { propagation, context, trace } from '@opentelemetry/api';
// Inject context in HTTP client
async function makeRequest(url, data) {
const headers = {};
// Inject trace info from current context into headers
propagation.inject(context.active(), headers);
return fetch(url, {
method: 'POST',
headers: {
'Content-Type': 'application/json',
...headers, // Includes traceparent, tracestate
},
body: JSON.stringify(data),
});
}
// gRPC context propagation
import { GrpcInstrumentation } from '@opentelemetry/instrumentation-grpc';
// Auto-instrumentation injects/extracts context in gRPC metadata
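The `traceparent` header shown above has a fixed layout: `version-traceId-spanId-flags`. In practice the OTel propagator parses it for you; a hand-rolled parser, purely for illustration of the format:

```javascript
// Parse a W3C traceparent header: version-traceId-spanId-flags.
// Illustrative only -- use the OTel propagator in real code.
function parseTraceparent(header) {
  const [version, traceId, spanId, flags] = header.split('-');
  if (traceId.length !== 32 || spanId.length !== 16) {
    throw new Error('Malformed traceparent');
  }
  // Bit 0 of the flags byte is the "sampled" flag.
  return { version, traceId, spanId, sampled: (parseInt(flags, 16) & 1) === 1 };
}

const ctx = parseTraceparent(
  '00-0af7651916cd43dd8448eb211c80319c-b7ad6b7169203331-01'
);
console.log(ctx.traceId); // 0af7651916cd43dd8448eb211c80319c
console.log(ctx.sampled); // true
```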
6.3 Jaeger Setup
# docker-compose.yml - Jaeger
services:
jaeger:
image: jaegertracing/all-in-one:1.53
ports:
- "16686:16686" # UI
- "4317:4317" # OTLP gRPC
- "4318:4318" # OTLP HTTP
environment:
- COLLECTOR_OTLP_ENABLED=true
- SPAN_STORAGE_TYPE=elasticsearch
- ES_SERVER_URLS=http://elasticsearch:9200
6.4 Sampling Strategies
// Sampling configuration
import { ParentBasedSampler, TraceIdRatioBasedSampler } from '@opentelemetry/sdk-trace-base';
// Ratio-based sampling (collect only 10%)
const sampler = new TraceIdRatioBasedSampler(0.1);
// Parent-based sampling (if parent is sampled, child is also sampled)
const parentBasedSampler = new ParentBasedSampler({
root: new TraceIdRatioBasedSampler(0.1),
});
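The reason trace-ID ratio sampling stays consistent across services is that the decision is a pure function of the trace ID: every service that sees the same ID makes the same call. A sketch of the idea (illustrative only, not the exact algorithm the OTel SDK uses):

```javascript
// Deterministic sampling decision from the trace ID.
// Same ID + same ratio => same decision, in every service.
function shouldSampleByTraceId(traceId, ratio) {
  // Map the first 8 hex chars of the trace ID to a number in [0, 1).
  const bucket = parseInt(traceId.slice(0, 8), 16) / 0x100000000;
  return bucket < ratio;
}

const id = '0af7651916cd43dd8448eb211c80319c';
console.log(shouldSampleByTraceId(id, 1.0)); // true  (always sampled)
console.log(shouldSampleByTraceId(id, 0.0)); // false (never sampled)
```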
Sampling strategy comparison:
| Strategy | Description | Pros | Cons |
|---|---|---|---|
| Head Sampling | Decision at trace start | Simple, low overhead | May miss error traces |
| Tail Sampling | Decision after trace completes | Preserves error/slow traces | High memory usage |
| Rate Limiting | Limit collections per second | Predictable cost | Misses during traffic spikes |
| Probabilistic | Probability-based collection | Uniform sample | Misses rare events |
7. Alerting Strategy
7.1 Preventing Alert Fatigue
Alert fatigue occurs when too many alerts cause important ones to be ignored.
Principles:
- Only send actionable alerts
- Distinguish severity levels
- Proper routing (who receives it and when)
- Alert grouping (bundle alerts for the same issue)
- Auto-resolve (automatically close alerts when issues are fixed)
7.2 Severity Levels
# Prometheus alerting rules
groups:
- name: service_alerts
rules:
# Critical: Immediate response required
- alert: HighErrorRate
expr: |
sum(rate(http_requests_total{status=~"5.."}[5m])) by (service)
/
sum(rate(http_requests_total[5m])) by (service)
> 0.05
for: 5m
labels:
severity: critical
annotations:
summary: "High error rate on service"
description: "Error rate is above 5% for 5 minutes"
runbook_url: "https://wiki.example.com/runbooks/high-error-rate"
# Warning: Attention needed
- alert: HighLatency
expr: |
histogram_quantile(0.99,
sum(rate(http_request_duration_seconds_bucket[5m])) by (le, service)
) > 2
for: 10m
labels:
severity: warning
annotations:
summary: "High latency on service"
description: "p99 latency is above 2 seconds for 10 minutes"
# Info: Informational
- alert: PodRestart
expr: |
increase(kube_pod_container_status_restarts_total[1h]) > 3
labels:
severity: info
annotations:
summary: "Pod restarting frequently"
7.3 Alertmanager Configuration
# alertmanager.yml
global:
resolve_timeout: 5m
slack_api_url: 'https://hooks.slack.com/services/xxx'
route:
group_by: ['alertname', 'service']
group_wait: 30s
group_interval: 5m
repeat_interval: 4h
receiver: 'default-slack'
routes:
# Critical -> PagerDuty + Slack
- match:
severity: critical
receiver: 'pagerduty-critical'
repeat_interval: 1h
continue: true
- match:
severity: critical
receiver: 'slack-critical'
# Warning -> Slack
- match:
severity: warning
receiver: 'slack-warning'
repeat_interval: 4h
# Info -> Slack (business hours only)
- match:
severity: info
receiver: 'slack-info'
active_time_intervals:
- business-hours
receivers:
- name: 'default-slack'
slack_configs:
- channel: '#alerts-general'
- name: 'pagerduty-critical'
pagerduty_configs:
- service_key: 'YOUR_PAGERDUTY_KEY'
severity: critical
- name: 'slack-critical'
slack_configs:
- channel: '#alerts-critical'
color: 'danger'
- name: 'slack-warning'
slack_configs:
- channel: '#alerts-warning'
color: 'warning'
- name: 'slack-info'
slack_configs:
- channel: '#alerts-info'
time_intervals:
- name: business-hours
time_intervals:
- weekdays: ['monday:friday']
times:
- start_time: '09:00'
end_time: '18:00'
inhibit_rules:
# When Critical fires, suppress Warning for the same service
- source_match:
severity: critical
target_match:
severity: warning
equal: ['service']
7.4 Writing Runbooks
# Runbook: High Error Rate
## Alert Condition
- Error rate (5xx) above 5% persisting for more than 5 minutes
## Immediate Checks
1. Identify affected endpoints
2. Check recent deployment history
3. Verify dependent service health
## Response Procedure
1. Check error patterns in Grafana dashboard
2. Review error details in logs
3. Identify failure point using traces
4. If caused by recent deployment, rollback
5. If dependent service issue, check circuit breaker
## Escalation
- 15 min unresolved: Report to team lead
- 30 min unresolved: Page senior engineer
8. SLO / SLI / SLA
8.1 Definitions
- SLI (Service Level Indicator): Measurable service quality metric
- e.g., Success rate, response time, availability
- SLO (Service Level Objective): Target value for an SLI
- e.g., 99.9% availability, p99 latency under 200ms
- SLA (Service Level Agreement): Contract with customers
- Includes compensation terms for SLO violations
8.2 Error Budget
SLO: 99.9% availability
= Allowed downtime per month (30 days): 43.2 minutes
= Error Budget: 0.1%
Error Budget consumption rate:
- 50% consumed: Caution
- 75% consumed: Feature release freeze
- 100% consumed: Focus exclusively on reliability work
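The 43.2-minute figure follows directly from the SLO window arithmetic:

```javascript
// Error-budget arithmetic: allowed downtime per window at a given SLO.
function errorBudgetMinutes(sloTarget, windowDays) {
  const totalMinutes = windowDays * 24 * 60; // 43,200 for 30 days
  return totalMinutes * (1 - sloTarget);
}

console.log(errorBudgetMinutes(0.999, 30)); // ≈ 43.2 minutes
console.log(errorBudgetMinutes(0.99, 30));  // ≈ 432 minutes (two nines)
```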
8.3 Burn Rate Alerts
# Burn Rate based alerting
groups:
- name: slo_alerts
rules:
# Fast burn (2% consumed in 1 hour) - Immediate response
- alert: SLOBurnRateCritical
expr: |
(
sum(rate(http_requests_total{status=~"5.."}[1h]))
/
sum(rate(http_requests_total[1h]))
) > (14.4 * 0.001)
for: 2m
labels:
severity: critical
annotations:
summary: "SLO burn rate critical - error budget exhausting fast"
# Slow burn (5% consumed in 6 hours) - Attention
- alert: SLOBurnRateWarning
expr: |
(
sum(rate(http_requests_total{status=~"5.."}[6h]))
/
sum(rate(http_requests_total[6h]))
) > (6 * 0.001)
for: 15m
labels:
severity: warning
annotations:
summary: "SLO burn rate elevated"
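The 14.4 and 6 multipliers in the rules above are not magic numbers: a burn rate is how fast you consume the error budget relative to the "just meets SLO" pace, so consuming a given fraction of a 30-day budget within a short window implies a fixed multiplier:

```javascript
// Burn-rate multiplier: consuming `budgetFraction` of the error budget
// within `windowHours` of a 30-day period implies this burn rate.
function burnRate(budgetFraction, windowHours, periodDays = 30) {
  return budgetFraction * (periodDays * 24) / windowHours;
}

console.log(burnRate(0.02, 1)); // ≈ 14.4 -> threshold 14.4 * 0.001 above
console.log(burnRate(0.05, 6)); // ≈ 6    -> threshold 6 * 0.001 above
```

Multiplying the burn rate by the SLO's error fraction (0.001 for 99.9%) yields the error-rate threshold used in the alert expressions.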
9. APM Tool Comparison
9.1 Major APM Solutions
| Feature | Datadog | New Relic | Dynatrace | Open Source Stack |
|---|---|---|---|---|
| Pricing Model | Host/usage based | Usage based | Host based | Infra costs only |
| Auto-instrumentation | Excellent | Excellent | Best | Good |
| APM | Included | Included | Included | Jaeger/Tempo |
| Log Management | Included | Included | Included | ELK/Loki |
| Metrics | Included | Included | Included | Prometheus |
| AI Analysis | Watchdog | Applied Intelligence | Davis AI | None |
| Setup Complexity | Low | Low | Medium | High |
| Vendor Lock-in | High | High | High | None |
9.2 Open Source Stack Composition
+-----------------+       +------------------+
|   Application   |------>|  OTel Collector  |
|   (OTel SDK)    |       | (collect/process)|
+-----------------+       +---------+--------+
                                    |
                   +----------------+----------------+
                   |                |                |
            +------v-----+     +----v----+    +------v-----+
            | Prometheus |     |  Tempo  |    |    Loki    |
            |  (Metrics) |     | (Traces)|    |   (Logs)   |
            +------+-----+     +----+----+    +------+-----+
                   |                |                |
                   +----------------+----------------+
                                    |
                             +------v------+
                             |   Grafana   |
                             | (unified UI)|
                             +-------------+
10. Cost Optimization
10.1 Data Volume Management
# Log retention policy
retention_policy:
hot_tier: 7d # Recent 7 days: Fast search
warm_tier: 30d # Recent 30 days: Slow search
cold_tier: 90d # Recent 90 days: Archive
delete_after: 365d # Delete after 1 year
# Index Lifecycle Management (Elasticsearch ILM)
index_lifecycle:
phases:
hot:
actions:
rollover:
max_size: 50gb
max_age: 1d
warm:
min_age: 7d
actions:
shrink:
number_of_shards: 1
forcemerge:
max_num_segments: 1
cold:
min_age: 30d
actions:
searchable_snapshot:
snapshot_repository: s3_repo
delete:
min_age: 365d
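To estimate what such a tiered policy costs, multiply the ingest rate by each tier's residency window. The numbers below are made-up assumptions, purely for illustration:

```javascript
// Illustrative storage-footprint arithmetic for tiered retention.
// dailyIngestGb and the tier windows are assumed example numbers.
function tierFootprintGb(dailyIngestGb, tierDays) {
  return dailyIngestGb * tierDays;
}

console.log(tierFootprintGb(100, 7));      // hot tier (7 days): 700 GB
console.log(tierFootprintGb(100, 30 - 7)); // warm tier (days 8-30): 2300 GB
```

Because the warm and cold tiers hold far more data than the hot tier, moving them to cheaper storage (shrunk indices, searchable snapshots on S3) is where most of the savings come from.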
10.2 Cost Reduction Through Sampling
// Adaptive sampling
class AdaptiveSampler {
constructor(targetRate = 100) {
this.targetRate = targetRate; // Target collections per second
this.currentRate = 0;
this.samplingProbability = 1.0;
// Adjust sampling probability every 10 seconds
setInterval(() => this.adjust(), 10000);
}
shouldSample() {
this.currentRate++;
return Math.random() < this.samplingProbability;
}
adjust() {
// currentRate accumulates over a 10-second window, so convert it
// to a per-second rate before comparing with the per-second target
const perSecondRate = this.currentRate / 10;
if (perSecondRate > this.targetRate) {
// Over target: scale the sampling probability down proportionally
this.samplingProbability = this.targetRate / perSecondRate;
} else {
// Under target: slowly recover toward full sampling
this.samplingProbability = Math.min(1.0, this.samplingProbability * 1.1);
}
this.currentRate = 0;
}
}
10.3 Metric Aggregation
# Aggregate raw data with Prometheus Recording Rules
groups:
- name: aggregation
interval: 1m
rules:
# Service-level aggregation (reduce cardinality by removing instance label)
- record: service:requests:rate5m
expr: sum(rate(http_requests_total[5m])) by (service, method, status)
# Reduce time resolution (5min -> 1hour)
- record: service:requests:rate1h
expr: sum(rate(http_requests_total[1h])) by (service)
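Why does dropping the `instance` label help? The number of time series is roughly the product of per-label cardinalities, so removing one label divides the series count by that label's cardinality. The cardinalities below are example numbers:

```javascript
// Series-count arithmetic behind the aggregation above:
// series ≈ product of per-label cardinalities.
function seriesCount(cardinalities) {
  return cardinalities.reduce((a, b) => a * b, 1);
}

// 50 instances x 10 services x 5 methods x 8 status codes:
console.log(seriesCount([50, 10, 5, 8])); // 20000
// After aggregating away the `instance` label:
console.log(seriesCount([10, 5, 8]));     // 400
```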
11. Production Checklist
11.1 Pre-deployment Verification
- Structured logging applied to all services
- Correlation ID propagation verified
- Prometheus metrics endpoint exposed
- OTel auto-instrumentation enabled
- Health check endpoint implemented
- SLO defined with error budget established
11.2 Operational Verification
- Grafana dashboards (RED/USE method)
- Alert rules configured (Critical/Warning/Info)
- Runbooks completed
- On-call rotation established
- Log retention policies configured
- Cost monitoring in place
11.3 Periodic Review Items
- Monthly SLO review
- Alert noise analysis
- Clean up unused dashboards/alerts
- Cost optimization review
- Incident post-mortems
12. Quiz
Q1: What are the Three Pillars of Observability?
Answer: Logs, Metrics, Traces
- Logs: Records of individual events. Used for debugging and auditing.
- Metrics: Numeric measurements over time. For understanding system performance trends.
- Traces: Tracking the complete path of requests through distributed systems. Identifying bottlenecks.
Combining these three enables understanding not just "what is broken" but "why it is broken."
Q2: Write the PromQL query to retrieve p99 response time in Prometheus.
Answer:
histogram_quantile(0.99, sum(rate(http_request_duration_seconds_bucket[5m])) by (le))
The histogram_quantile function calculates percentiles from Histogram metrics. 0.99 represents the 99th percentile, rate() computes the rate of change over 5 minutes, and results are grouped by the le (less than or equal) label.
Q3: What is the difference between Head Sampling and Tail Sampling?
Answer:
Head Sampling makes the collection decision at the start of a trace. It is simple with low overhead, but may miss traces where errors occur.
Tail Sampling makes the collection decision after the trace completes. It reliably preserves traces with errors or slow responses, but requires holding all spans in memory, resulting in higher resource usage.
In production, tail sampling is typically performed in the OTel Collector to prioritize collecting error traces.
Q4: With a 99.9% SLO, how much downtime is allowed in a 30-day month?
Answer: Approximately 43.2 minutes
Calculation: 30 days x 24 hours x 60 minutes = 43,200 minutes. Error Budget = 43,200 x 0.001 = 43.2 minutes.
This means up to 43.2 minutes of downtime per month is within SLO bounds. When the error budget is exhausted, new feature releases should be halted to focus on reliability improvements.
Q5: Explain the difference between the RED Method and the USE Method.
Answer:
The RED Method is used for service monitoring:
- Rate: Requests per second
- Errors: Error rate
- Duration: Response time
The USE Method is used for infrastructure resource monitoring:
- Utilization: Resource usage percentage (CPU, memory, etc.)
- Saturation: Queue length, degree of overload
- Errors: Hardware/system errors
RED monitors services from the user experience perspective, while USE monitors infrastructure from the system perspective. Using both together enables rapid identification of root causes.
13. References
- OpenTelemetry Documentation - https://opentelemetry.io/docs/
- Prometheus Documentation - https://prometheus.io/docs/
- Grafana Documentation - https://grafana.com/docs/
- Jaeger Documentation - https://www.jaegertracing.io/docs/
- Grafana Loki - https://grafana.com/oss/loki/
- Grafana Tempo - https://grafana.com/oss/tempo/
- ELK Stack - https://www.elastic.co/elk-stack
- Google SRE Book - Monitoring - https://sre.google/sre-book/monitoring-distributed-systems/
- Google SRE Book - Service Level Objectives - https://sre.google/sre-book/service-level-objectives/
- pino Logger - https://getpino.io/
- Alertmanager - https://prometheus.io/docs/alerting/latest/alertmanager/
- PagerDuty Incident Response - https://response.pagerduty.com/
- The RED Method - https://grafana.com/blog/2018/08/02/the-red-method-how-to-instrument-your-services/
- The USE Method - https://www.brendangregg.com/usemethod.html
Observability is not simply about installing tools -- it is a culture. Every team member must build the habit of structuring logs, defining metrics, and using traces. Anchor your alerting strategy to SLOs, balance releases against reliability through error budgets, and keep costs in check by applying sampling and retention policies aggressively.