WebSocket Real-Time Communication & Observability (Monitoring): A Complete Guide
- Part 1: Real-Time Communication
- Part 2: Observability
Part 1: Real-Time Communication
1. HTTP Polling vs Long Polling vs SSE vs WebSocket
Different real-time data delivery methods suit different requirements.
Comparison Table
| Feature | HTTP Polling | Long Polling | SSE | WebSocket |
|---|---|---|---|---|
| Direction | Unidirectional (client pull) | Unidirectional (held server response) | Unidirectional (server push) | Bidirectional (full-duplex) |
| Connection | New connection per request | Reconnect after each response | Single HTTP connection | Single TCP connection |
| Overhead | High (repeated requests) | Medium | Low | Very low |
| Latency | Depends on polling interval | Immediate on event | Immediate on event | Immediate on event |
| Protocol | HTTP/1.1 | HTTP/1.1 | HTTP/1.1 | ws:// or wss:// |
| Browser support | All | All | All except IE | All |
| Use cases | Simple status checks | Notifications | Stock tickers, news feeds | Chat, gaming, collaborative editing |
Recommended Technology by Use Case
- Dashboard refresh: SSE (server pushes data one way)
- Real-time chat: WebSocket (bidirectional message exchange)
- File upload progress: SSE (server pushes progress updates)
- Online gaming: WebSocket (ultra-low-latency bidirectional communication)
- IoT sensor monitoring: WebSocket or MQTT
2. WebSocket Protocol Deep Dive
Handshake Process
A WebSocket connection begins with an HTTP Upgrade request.
GET /chat HTTP/1.1
Host: example.com
Upgrade: websocket
Connection: Upgrade
Sec-WebSocket-Key: dGhlIHNhbXBsZSBub25jZQ==
Sec-WebSocket-Version: 13
Server response:
HTTP/1.1 101 Switching Protocols
Upgrade: websocket
Connection: Upgrade
Sec-WebSocket-Accept: s3pPLMBiTxaQ9kYGzzhZRbK+xOo=
Frame Structure
WebSocket frames have the following binary structure:
0 1 2 3
0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1
+-+-+-+-+-------+-+-------------+-------------------------------+
|F|R|R|R| opcode|M| Payload len | Extended payload length |
|I|S|S|S| (4) |A| (7) | (16/64) |
|N|V|V|V| |S| | (if payload len==126/127) |
| |1|2|3| |K| | |
+-+-+-+-+-------+-+-------------+-------------------------------+
Key opcode values:
- 0x1: Text frame
- 0x2: Binary frame
- 0x8: Connection close
- 0x9: Ping
- 0xA: Pong
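The first two bytes of every frame can be decoded with plain bit masks. The following sketch (helper name illustrative) extracts the fields shown in the diagram above:

```javascript
// Decode the fixed 2-byte header of a WebSocket frame (RFC 6455).
// For payload lengths 126/127, extended length bytes follow; omitted here.
function parseFrameHeader(buf) {
  return {
    fin: (buf[0] & 0x80) !== 0,      // FIN: final fragment of a message
    opcode: buf[0] & 0x0f,           // 0x1 text, 0x2 binary, 0x8 close, 0x9 ping, 0xA pong
    masked: (buf[1] & 0x80) !== 0,   // client-to-server frames must be masked
    payloadLen: buf[1] & 0x7f,       // 0-125 literal; 126/127 signal extended length
  };
}

// 0x81 = FIN set + text opcode; 0x05 = unmasked, 5-byte payload
console.log(parseFrameHeader(Buffer.from([0x81, 0x05])));
// { fin: true, opcode: 1, masked: false, payloadLen: 5 }
```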
Ping/Pong Heartbeat
// Server-side (Node.js)
const WebSocket = require('ws');
const wss = new WebSocket.Server({ port: 8080 });
wss.on('connection', (ws) => {
ws.isAlive = true;
ws.on('pong', () => {
ws.isAlive = true;
});
});
// Check liveness every 30 seconds and drop dead connections
const interval = setInterval(() => {
wss.clients.forEach((ws) => {
if (!ws.isAlive) return ws.terminate();
ws.isAlive = false;
ws.ping();
});
}, 30000);
// Avoid leaking the timer when the server shuts down
wss.on('close', () => clearInterval(interval));
Connection Management and Error Handling
// Client-side
class WebSocketClient {
constructor(url) {
this.url = url;
this.reconnectAttempts = 0;
this.maxReconnectAttempts = 5;
this.reconnectDelay = 1000;
this.connect();
}
connect() {
this.ws = new WebSocket(this.url);
this.ws.onopen = () => {
console.log('Connected');
this.reconnectAttempts = 0;
};
this.ws.onclose = (event) => {
if (event.code !== 1000) {
this.reconnect();
}
};
this.ws.onerror = (error) => {
console.error('WebSocket error:', error);
};
}
reconnect() {
if (this.reconnectAttempts >= this.maxReconnectAttempts) {
console.error('Max reconnect attempts reached');
return;
}
this.reconnectAttempts++;
const delay = this.reconnectDelay * Math.pow(2, this.reconnectAttempts);
setTimeout(() => this.connect(), delay);
}
}
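The `delay` calculation above is plain exponential backoff. Factored into a standalone helper with a cap and optional jitter (the helper name and defaults are illustrative, not part of the client above):

```javascript
// Exponential backoff for reconnect attempts, as in the client above:
// base * 2^attempt, capped, with optional random jitter to avoid
// thundering herds when many clients reconnect at once.
function backoffDelay(attempt, base = 1000, cap = 30000, jitter = 0) {
  const raw = Math.min(base * Math.pow(2, attempt), cap);
  return raw + Math.floor(Math.random() * jitter);
}

console.log([1, 2, 3, 4, 5].map((n) => backoffDelay(n)));
// [2000, 4000, 8000, 16000, 30000]
```

The cap matters: without it, attempt 10 would wait over 17 minutes.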
3. Working with Socket.IO
Socket.IO provides an abstraction layer on top of WebSocket that simplifies real-time communication development.
Namespaces and Rooms
const io = require('socket.io')(server);
// Split into namespaces
const chatNamespace = io.of('/chat');
const notificationNamespace = io.of('/notifications');
chatNamespace.on('connection', (socket) => {
// Join a room
socket.on('joinRoom', (roomId) => {
socket.join(roomId);
chatNamespace.to(roomId).emit('userJoined', {
userId: socket.id,
room: roomId
});
});
// Send a message to the room
socket.on('message', (data) => {
chatNamespace.to(data.roomId).emit('newMessage', {
userId: socket.id,
content: data.content,
timestamp: Date.now()
});
});
// Leave the room
socket.on('leaveRoom', (roomId) => {
socket.leave(roomId);
});
});
Auto-Reconnection and Fallback
// Client configuration
const socket = io('https://api.example.com', {
// Transport fallback (HTTP long polling when WebSocket fails)
transports: ['websocket', 'polling'],
// Reconnection settings
reconnection: true,
reconnectionAttempts: 10,
reconnectionDelay: 1000,
reconnectionDelayMax: 5000,
randomizationFactor: 0.5,
// Timeout
timeout: 20000,
// Authentication
auth: {
token: 'my-jwt-token'
}
});
socket.on('connect_error', (error) => {
if (error.message === 'unauthorized') {
// Token refresh logic
refreshToken().then((newToken) => {
socket.auth.token = newToken;
socket.connect();
});
}
});
4. gRPC Streaming
gRPC is built on HTTP/2 and supports four communication modes.
Four Streaming Modes
syntax = "proto3";
service ChatService {
// 1. Unary: one request, one response
rpc SendMessage (ChatMessage) returns (MessageResponse);
// 2. Server streaming: one request, a stream of responses
rpc SubscribeMessages (SubscriptionRequest) returns (stream ChatMessage);
// 3. Client streaming: a stream of requests, one response
rpc UploadLogs (stream LogEntry) returns (UploadSummary);
// 4. Bidirectional streaming: request stream and response stream
rpc Chat (stream ChatMessage) returns (stream ChatMessage);
}
message ChatMessage {
string user_id = 1;
string content = 2;
int64 timestamp = 3;
}
Bidirectional Streaming Implementation (Go)
func (s *chatServer) Chat(stream pb.ChatService_ChatServer) error {
for {
msg, err := stream.Recv()
if err == io.EOF {
return nil
}
if err != nil {
return err
}
// Process the message and broadcast it back
response := &pb.ChatMessage{
UserId: msg.UserId,
Content: msg.Content,
Timestamp: time.Now().UnixMilli(),
}
if err := stream.Send(response); err != nil {
return err
}
}
}
5. In Practice: Real-Time Chat Architecture
Redis Pub/Sub + WebSocket
Redis Pub/Sub synchronizes messages across multiple WebSocket server instances.
[Client A] <--ws--> [Server 1] <--pub/sub--> [Redis]
[Client B] <--ws--> [Server 2] <--pub/sub--> [Redis]
[Client C] <--ws--> [Server 1] <--pub/sub--> [Redis]
Node.js Implementation
const Redis = require('ioredis');
const WebSocket = require('ws');
const redisPub = new Redis();
const redisSub = new Redis();
const wss = new WebSocket.Server({ port: 8080 });
// Track clients per channel
const channels = new Map();
wss.on('connection', (ws) => {
ws.on('message', (raw) => {
const data = JSON.parse(raw);
switch (data.type) {
case 'subscribe':
// Subscribe to the channel
if (!channels.has(data.channel)) {
channels.set(data.channel, new Set());
redisSub.subscribe(data.channel);
}
channels.get(data.channel).add(ws);
break;
case 'message':
// Publish via Redis so every server instance receives it
redisPub.publish(data.channel, JSON.stringify({
user: data.user,
content: data.content,
timestamp: Date.now()
}));
break;
}
});
ws.on('close', () => {
// Remove the socket from all channels on disconnect
channels.forEach((clients, channel) => {
clients.delete(ws);
if (clients.size === 0) {
channels.delete(channel);
redisSub.unsubscribe(channel);
}
});
});
});
// On a Redis message, fan out to every local client in that channel
redisSub.on('message', (channel, message) => {
const clients = channels.get(channel);
if (clients) {
clients.forEach((ws) => {
if (ws.readyState === WebSocket.OPEN) {
ws.send(message);
}
});
}
});
Scalability Considerations
- Horizontal scaling: synchronize messages across servers with Redis Pub/Sub or Redis Streams
- Connection limits: cap connections per server (mind the OS file-descriptor limit)
- Load balancing: sticky sessions or IP-hash-based balancing
- Message persistence: keep message history in Redis Streams or Kafka
- Authentication: JWT-based authentication during the WebSocket handshake
Part 2: Observability
6. The Three Pillars of Observability
Observability is the degree to which a system's internal state can be inferred from its external outputs alone.
Metrics
Numeric time-series data.
- Request count (requests_total)
- Response time (response_duration_seconds)
- Error rate (error_rate_percent)
- CPU/memory utilization (cpu_usage_percent, memory_usage_bytes)
Logs
Text records of events at specific points in time.
{
"timestamp": "2026-04-12T16:00:00Z",
"level": "ERROR",
"service": "chat-server",
"trace_id": "abc123def456",
"message": "WebSocket connection failed",
"details": {
"user_id": "user-789",
"error_code": 1006,
"reason": "abnormal closure"
}
}
Traces
Traces follow a request's full path as it crosses multiple services in a distributed system.
[API Gateway] ---> [Auth Service] ---> [Chat Service] ---> [Redis]
Span 1 Span 2 Span 3 Span 4
|____________ Trace ID: abc123 __________________________|
How the Three Pillars Work Together
| Question | Tool |
|---|---|
| What went wrong? | Metrics (detect error-rate spikes) |
| Why did it go wrong? | Traces (identify the bottleneck) |
| What exactly happened? | Logs (read the detailed error message) |
7. OpenTelemetry
OpenTelemetry (OTel) is a CNCF project that collects metrics, logs, and traces in a vendor-neutral way.
Architecture
[애플리케이션 + OTel SDK]
|
| OTLP (gRPC/HTTP)
v
[OTel Collector]
| | |
v v v
[Jaeger] [Prom.] [Loki]
Auto-Instrumentation (Node.js)
// tracing.js - load this before the application starts
const { NodeSDK } = require('@opentelemetry/sdk-node');
const { OTLPTraceExporter } = require('@opentelemetry/exporter-trace-otlp-grpc');
const { OTLPMetricExporter } = require('@opentelemetry/exporter-metrics-otlp-grpc');
const { getNodeAutoInstrumentations } = require('@opentelemetry/auto-instrumentations-node');
const { PeriodicExportingMetricReader } = require('@opentelemetry/sdk-metrics');
const sdk = new NodeSDK({
serviceName: 'chat-service',
traceExporter: new OTLPTraceExporter({
url: 'http://otel-collector:4317',
}),
metricReader: new PeriodicExportingMetricReader({
exporter: new OTLPMetricExporter({
url: 'http://otel-collector:4317',
}),
exportIntervalMillis: 15000,
}),
instrumentations: [getNodeAutoInstrumentations()],
});
sdk.start();
OTel Collector Configuration
# otel-collector-config.yaml
receivers:
  otlp:
    protocols:
      grpc:
        endpoint: 0.0.0.0:4317
      http:
        endpoint: 0.0.0.0:4318
processors:
  batch:
    timeout: 5s
    send_batch_size: 1024
  memory_limiter:
    check_interval: 1s
    limit_mib: 512
exporters:
  prometheus:
    endpoint: 0.0.0.0:8889
  otlp/jaeger:
    endpoint: jaeger:4317
    tls:
      insecure: true
  loki:
    endpoint: http://loki:3100/loki/api/v1/push
service:
  pipelines:
    traces:
      receivers: [otlp]
      # memory_limiter should run first in the pipeline so it can
      # drop data before batching consumes more memory
      processors: [memory_limiter, batch]
      exporters: [otlp/jaeger]
    metrics:
      receivers: [otlp]
      processors: [batch]
      exporters: [prometheus]
    logs:
      receivers: [otlp]
      processors: [batch]
      exporters: [loki]
8. Prometheus + Grafana
Prometheus Metric Types
1. Counter - monotonically increasing value (e.g., total requests)
2. Gauge - value that can go up or down (e.g., current active connections)
3. Histogram - distribution of values in buckets (e.g., response-time distribution)
4. Summary - client-side percentiles (e.g., p99 response time)
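To make the Histogram type concrete, here is a tiny dependency-free sketch of how a Prometheus-style histogram records observations into cumulative buckets. prom-client does this (and more) internally; the class name here is purely illustrative:

```javascript
// Minimal Prometheus-style histogram: each bucket counts observations
// less than or equal to its upper bound, plus a +Inf bucket and a running sum.
class MiniHistogram {
  constructor(buckets) {
    this.bounds = [...buckets].sort((a, b) => a - b);
    this.counts = new Array(this.bounds.length + 1).fill(0); // last slot = +Inf
    this.sum = 0;
  }
  observe(value) {
    this.sum += value;
    // Cumulative semantics: increment every bucket whose bound covers the value
    this.bounds.forEach((bound, i) => {
      if (value <= bound) this.counts[i]++;
    });
    this.counts[this.counts.length - 1]++; // +Inf always increments
  }
}

const h = new MiniHistogram([0.01, 0.05, 0.1]);
[0.004, 0.02, 0.2].forEach((v) => h.observe(v));
console.log(h.counts); // [1, 2, 2, 3]
```

The cumulative bucket layout is what lets PromQL's histogram_quantile() estimate percentiles from the `_bucket` series.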
WebSocket Custom Metrics Example (Node.js)
const promClient = require('prom-client');
// Counter: total connections
const wsConnectionsTotal = new promClient.Counter({
name: 'websocket_connections_total',
help: 'Total WebSocket connections',
labelNames: ['status'],
});
// Gauge: current active connections
const wsActiveConnections = new promClient.Gauge({
name: 'websocket_active_connections',
help: 'Current active WebSocket connections',
labelNames: ['namespace'],
});
// Histogram: message processing time
const wsMessageDuration = new promClient.Histogram({
name: 'websocket_message_duration_seconds',
help: 'WebSocket message processing duration',
buckets: [0.001, 0.005, 0.01, 0.05, 0.1, 0.5, 1],
});
// Wire metrics into WebSocket events
wss.on('connection', (ws) => {
wsConnectionsTotal.inc({ status: 'connected' });
wsActiveConnections.inc({ namespace: 'default' });
ws.on('message', (data) => {
const end = wsMessageDuration.startTimer();
processMessage(data);
end();
});
ws.on('close', () => {
wsActiveConnections.dec({ namespace: 'default' });
});
});
Key PromQL Queries
# Per-second WebSocket connection rate (5-minute window)
rate(websocket_connections_total[5m])
# Current active WebSocket connections
websocket_active_connections
# p99 message processing time
histogram_quantile(0.99, rate(websocket_message_duration_seconds_bucket[5m]))
# Error rate (percent)
rate(websocket_errors_total[5m]) / rate(websocket_connections_total[5m]) * 100
# Instances with CPU usage above 80%
node_cpu_usage_percent > 80
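The rate() function used in most of these queries computes the per-second increase of a counter over the window. A simplified sketch of that arithmetic (real Prometheus also handles counter resets and extrapolates to the window boundaries, both ignored here):

```javascript
// Simplified version of PromQL's rate(): per-second increase of a counter
// between two samples. Timestamps are in milliseconds.
function simpleRate(earlier, later) {
  const seconds = (later.t - earlier.t) / 1000;
  return (later.value - earlier.value) / seconds;
}

// Counter went from 1200 to 1500 over 5 minutes -> 1 connection/second
console.log(simpleRate({ t: 0, value: 1200 }, { t: 300000, value: 1500 })); // 1
```

This is also why rate() only makes sense on Counters: a Gauge that goes down would produce meaningless negative rates.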
Grafana Dashboard Design
{
"dashboard": {
"title": "WebSocket Realtime Monitoring",
"panels": [
{
"title": "Active Connections",
"type": "stat",
"targets": [
{ "expr": "sum(websocket_active_connections)" }
]
},
{
"title": "Connection Rate",
"type": "timeseries",
"targets": [
{ "expr": "rate(websocket_connections_total[5m])" }
]
},
{
"title": "Message Latency (p50/p95/p99)",
"type": "timeseries",
"targets": [
{ "expr": "histogram_quantile(0.50, rate(websocket_message_duration_seconds_bucket[5m]))", "legendFormat": "p50" },
{ "expr": "histogram_quantile(0.95, rate(websocket_message_duration_seconds_bucket[5m]))", "legendFormat": "p95" },
{ "expr": "histogram_quantile(0.99, rate(websocket_message_duration_seconds_bucket[5m]))", "legendFormat": "p99" }
]
},
{
"title": "Error Rate",
"type": "gauge",
"targets": [
{ "expr": "rate(websocket_errors_total[5m]) / rate(websocket_connections_total[5m]) * 100" }
]
}
]
}
}
9. Distributed Tracing
Span and Trace Structure
Trace ID: 7f3a8b2c1d4e5f6a
Span A: API Gateway [============ ] 0-200ms
Span B: Auth Service [==== ] 10-50ms
Span C: Chat Service [================== ] 60-180ms
Span D: Redis Pub/Sub [==== ] 80-120ms
Span E: Database Write [======= ] 130-170ms
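Trace-level figures such as total duration or the slowest span fall out of simple arithmetic on span start/end times. A sketch using the timings from the diagram above (field and span names are illustrative):

```javascript
// Derive trace-level timings from a flat list of spans (times in ms).
function traceSummary(spans) {
  const start = Math.min(...spans.map((s) => s.start));
  const end = Math.max(...spans.map((s) => s.end));
  // Span with the largest self duration (end - start)
  const slowest = spans.reduce((a, b) =>
    (b.end - b.start > a.end - a.start ? b : a));
  return { durationMs: end - start, slowestSpan: slowest.name };
}

const spans = [
  { name: 'api-gateway', start: 0, end: 200 },
  { name: 'auth-service', start: 10, end: 50 },
  { name: 'chat-service', start: 60, end: 180 },
  { name: 'redis-pubsub', start: 80, end: 120 },
  { name: 'db-write', start: 130, end: 170 },
];
console.log(traceSummary(spans)); // { durationMs: 200, slowestSpan: 'api-gateway' }
```

Tracing backends like Jaeger compute exactly these aggregates (plus parent/child relationships) when rendering a trace timeline.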
Jaeger Setup (Docker Compose)
version: '3.8'
services:
  jaeger:
    image: jaegertracing/all-in-one:1.54
    ports:
      - "16686:16686"   # Jaeger UI
      - "4317:4317"     # OTLP gRPC
      - "4318:4318"     # OTLP HTTP
    environment:
      - COLLECTOR_OTLP_ENABLED=true
      - SPAN_STORAGE_TYPE=elasticsearch
      - ES_SERVER_URLS=http://elasticsearch:9200
Manual Span Creation (Python)
from opentelemetry import trace
from opentelemetry.trace import StatusCode

tracer = trace.get_tracer("chat-service")

def handle_websocket_message(message):
    with tracer.start_as_current_span("process_message") as span:
        span.set_attribute("message.type", message.get("type"))
        span.set_attribute("message.size", len(str(message)))
        try:
            # Message processing logic
            result = process(message)
            span.set_attribute("processing.result", "success")
            return result
        except Exception as e:
            span.set_status(StatusCode.ERROR, str(e))
            span.record_exception(e)
            raise
10. Log Aggregation
ELK Stack (Elasticsearch + Logstash + Kibana)
# logstash.conf
input {
  tcp {
    port => 5000
    codec => json
  }
}
filter {
  if [level] == "ERROR" {
    mutate {
      add_tag => ["alert"]
    }
  }
  date {
    match => ["timestamp", "ISO8601"]
  }
}
output {
  elasticsearch {
    hosts => ["http://elasticsearch:9200"]
    index => "websocket-logs-%{+YYYY.MM.dd}"
  }
}
Loki + Grafana (Lightweight Alternative)
# loki-config.yaml
auth_enabled: false
server:
  http_listen_port: 3100
ingester:
  lifecycler:
    ring:
      kvstore:
        store: inmemory
      replication_factor: 1
schema_config:
  configs:
    - from: 2026-01-01
      store: tsdb
      object_store: filesystem
      schema: v13
      index:
        prefix: index_
        period: 24h
storage_config:
  tsdb_shipper:
    active_index_directory: /loki/index
    cache_location: /loki/cache
  filesystem:
    directory: /loki/chunks
Structured Logging (Node.js + Winston)
const winston = require('winston');
const logger = winston.createLogger({
level: 'info',
format: winston.format.combine(
winston.format.timestamp(),
winston.format.json()
),
defaultMeta: { service: 'websocket-server' },
transports: [
new winston.transports.Console(),
new winston.transports.Http({
host: 'loki-gateway',
port: 3100,
path: '/loki/api/v1/push',
}),
],
});
// Usage
logger.info('WebSocket connection established', {
userId: 'user-123',
remoteAddress: '192.168.1.100',
protocol: 'wss',
});
logger.error('Message processing failed', {
userId: 'user-456',
errorCode: 'PARSE_ERROR',
rawMessage: '<sanitized>',
traceId: 'abc123',
});
11. Alert Design
Preventing Alert Fatigue
When there are too many alerts, even the important ones get ignored. Effective alert design follows these principles:
| Principle | Description |
|---|---|
| Actionable | Every alert should have an immediate response you can take |
| Symptom-based | Alert on user impact (symptoms), not on causes |
| Tiered | Separate into Critical / Warning / Info levels |
| Auto-resolution | Close alerts automatically when the problem clears |
| Grouping | Bundle related alerts into a single notification |
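The grouping principle maps directly onto Alertmanager's group_by: alerts sharing the same values for the grouping labels are bundled into one notification. A minimal sketch of that behavior (function name illustrative):

```javascript
// Alertmanager-style grouping: alerts with identical values for the
// group_by labels are collapsed into a single notification group.
function groupAlerts(alerts, groupBy) {
  const groups = new Map();
  for (const alert of alerts) {
    const key = groupBy.map((l) => alert.labels[l]).join('|');
    if (!groups.has(key)) groups.set(key, []);
    groups.get(key).push(alert);
  }
  return groups;
}

const alerts = [
  { labels: { alertname: 'HighLatency', severity: 'critical', instance: 'ws-1' } },
  { labels: { alertname: 'HighLatency', severity: 'critical', instance: 'ws-2' } },
  { labels: { alertname: 'HighDropRate', severity: 'warning', instance: 'ws-1' } },
];
const groups = groupAlerts(alerts, ['alertname', 'severity']);
console.log(groups.size); // 2 notifications instead of 3 separate alerts
```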
Prometheus Alerting Rules
# alerting-rules.yaml
groups:
  - name: websocket-alerts
    rules:
      - alert: HighConnectionDropRate
        expr: rate(websocket_disconnections_total[5m]) > 10
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "High WebSocket disconnection rate"
          description: "More than 10 disconnections per second over 5 minutes"
      - alert: WebSocketLatencyHigh
        expr: histogram_quantile(0.99, rate(websocket_message_duration_seconds_bucket[5m])) > 1
        for: 3m
        labels:
          severity: critical
        annotations:
          summary: "High WebSocket message latency"
          description: "p99 latency exceeds 1 second"
      - alert: ActiveConnectionsSpiking
        expr: websocket_active_connections > 10000
        for: 1m
        labels:
          severity: warning
        annotations:
          summary: "Active WebSocket connections spiking"
          description: "Active connections exceed 10,000"
Alertmanager Routing Configuration
# alertmanager.yaml
global:
  resolve_timeout: 5m
route:
  receiver: 'default-receiver'
  group_by: ['alertname', 'severity']
  group_wait: 30s
  group_interval: 5m
  repeat_interval: 4h
  routes:
    - match:
        severity: critical
      receiver: 'pagerduty-critical'
      repeat_interval: 1h
    - match:
        severity: warning
      receiver: 'slack-warnings'
      repeat_interval: 4h
receivers:
  - name: 'default-receiver'
    slack_configs:
      - api_url: 'https://hooks.slack.com/services/XXXX'
        channel: '#alerts-general'
  - name: 'pagerduty-critical'
    pagerduty_configs:
      - service_key: 'your-pagerduty-key'
        severity: critical
  - name: 'slack-warnings'
    slack_configs:
      - api_url: 'https://hooks.slack.com/services/XXXX'
        channel: '#alerts-warning'
        title: 'Warning Alert'
On-Call Integration: PagerDuty / OpsGenie
With on-call rotations and escalation policies configured, incidents are routed to the engineer on duty automatically.
[Prometheus fires an alert]
|
v
[Alertmanager receives and groups it]
|
v
[PagerDuty / OpsGenie creates an incident]
|
v
[On-call engineer notified by phone/SMS/Slack]
|
v
[No ACK? Escalate to the next responder (e.g., the manager)]
Wrap-Up Checklist
Items to verify when taking real-time communication and observability to production.
Real-Time Communication
- WebSocket handshake authentication in place
- Ping/Pong heartbeat interval configured
- Automatic reconnection with exponential backoff implemented
- Horizontal scaling prepared via Redis Pub/Sub
- Message persistence secured with a message queue (Kafka/Redis Streams)
Observability
- All three pillars collected: metrics, logs, and traces
- OpenTelemetry Collector deployed with pipelines configured
- Key indicators visualized on Grafana dashboards
- Symptom-based alerting rules defined
- On-call rotation and escalation policy established
- Thresholds tuned continuously to prevent alert fatigue
Quiz: Real-Time Communication & Observability Checkpoint
Q1. What is the biggest difference between WebSocket and SSE?
A1. WebSocket supports bidirectional (full-duplex) communication, while SSE only pushes data one way, from server to client. WebSocket fits cases like chat or gaming where both sides send data; SSE fits cases like stock tickers or notifications where the server pushes unilaterally.
Q2. What are the three pillars of observability, and what is each one for?
A2. Metrics are numeric time-series data used to detect problems; traces follow a request's path through a distributed system to identify bottlenecks; logs are detailed event records used for root-cause analysis.
Q3. What are the key principles for preventing alert fatigue?
A3. Define only actionable alerts, base them on symptoms rather than causes, tier them into Critical/Warning/Info, group related alerts, and detect resolution automatically.
WebSocket Real-Time Communication & Observability/Monitoring Complete Guide
- Part 1: Real-Time Communication
- Part 2: Observability
Part 1: Real-Time Communication
1. HTTP Polling vs Long Polling vs SSE vs WebSocket
Different real-time data delivery methods are suited to different requirements.
Comparison Table
| Feature | HTTP Polling | Long Polling | SSE | WebSocket |
|---|---|---|---|---|
| Direction | Unidirectional (client request) | Unidirectional (server response wait) | Unidirectional (server push) | Bidirectional (full-duplex) |
| Connection | New connection per request | Reconnect after response | Single HTTP connection | Single TCP connection |
| Overhead | High (repeated requests) | Medium | Low | Very low |
| Latency | Depends on polling interval | Immediate on event | Immediate on event | Immediate on event |
| Protocol | HTTP/1.1 | HTTP/1.1 | HTTP/1.1 | ws:// or wss:// |
| Browser Support | All | All | Except IE | All |
| Use Cases | Simple status checks | Notifications | Stock tickers, news feeds | Chat, gaming, collaborative editing |
Recommended Technology by Use Case
- Dashboard data refresh: SSE (server unilaterally pushes data)
- Real-time chat: WebSocket (bidirectional message exchange)
- File upload progress: SSE (server pushes progress updates)
- Online gaming: WebSocket (ultra-low-latency bidirectional communication)
- IoT sensor monitoring: WebSocket or MQTT
2. WebSocket Protocol Deep Dive
Handshake Process
A WebSocket connection begins with an HTTP Upgrade request.
GET /chat HTTP/1.1
Host: example.com
Upgrade: websocket
Connection: Upgrade
Sec-WebSocket-Key: dGhlIHNhbXBsZSBub25jZQ==
Sec-WebSocket-Version: 13
Server response:
HTTP/1.1 101 Switching Protocols
Upgrade: websocket
Connection: Upgrade
Sec-WebSocket-Accept: s3pPLMBiTxaQ9kYGzzhZRbK+xOo=
Frame Structure
WebSocket frames have the following binary structure:
0 1 2 3
0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1
+-+-+-+-+-------+-+-------------+-------------------------------+
|F|R|R|R| opcode|M| Payload len | Extended payload length |
|I|S|S|S| (4) |A| (7) | (16/64) |
|N|V|V|V| |S| | (if payload len==126/127) |
| |1|2|3| |K| | |
+-+-+-+-+-------+-+-------------+-------------------------------+
Key opcode values:
0x1: Text frame0x2: Binary frame0x8: Connection close0x9: Ping0xA: Pong
Ping/Pong Heartbeat
// Server-side (Node.js)
const WebSocket = require('ws');
const wss = new WebSocket.Server({ port: 8080 });
wss.on('connection', (ws) => {
ws.isAlive = true;
ws.on('pong', () => {
ws.isAlive = true;
});
});
// Check heartbeat every 30 seconds
const interval = setInterval(() => {
wss.clients.forEach((ws) => {
if (!ws.isAlive) return ws.terminate();
ws.isAlive = false;
ws.ping();
});
}, 30000);
Connection Management and Error Handling
// Client-side
class WebSocketClient {
constructor(url) {
this.url = url;
this.reconnectAttempts = 0;
this.maxReconnectAttempts = 5;
this.reconnectDelay = 1000;
this.connect();
}
connect() {
this.ws = new WebSocket(this.url);
this.ws.onopen = () => {
console.log('Connected');
this.reconnectAttempts = 0;
};
this.ws.onclose = (event) => {
if (event.code !== 1000) {
this.reconnect();
}
};
this.ws.onerror = (error) => {
console.error('WebSocket error:', error);
};
}
reconnect() {
if (this.reconnectAttempts >= this.maxReconnectAttempts) {
console.error('Max reconnect attempts reached');
return;
}
this.reconnectAttempts++;
const delay = this.reconnectDelay * Math.pow(2, this.reconnectAttempts);
setTimeout(() => this.connect(), delay);
}
}
3. Working with Socket.IO
Socket.IO provides an abstraction layer on top of WebSocket to simplify real-time communication development.
Namespaces and Rooms
const io = require('socket.io')(server);
// Separate namespaces
const chatNamespace = io.of('/chat');
const notificationNamespace = io.of('/notifications');
chatNamespace.on('connection', (socket) => {
// Join room
socket.on('joinRoom', (roomId) => {
socket.join(roomId);
chatNamespace.to(roomId).emit('userJoined', {
userId: socket.id,
room: roomId
});
});
// Send message within room
socket.on('message', (data) => {
chatNamespace.to(data.roomId).emit('newMessage', {
userId: socket.id,
content: data.content,
timestamp: Date.now()
});
});
// Leave room
socket.on('leaveRoom', (roomId) => {
socket.leave(roomId);
});
});
Auto-Reconnection and Fallback
// Client configuration
const socket = io('https://api.example.com', {
// Transport layer fallback (HTTP Long Polling if WebSocket fails)
transports: ['websocket', 'polling'],
// Reconnection settings
reconnection: true,
reconnectionAttempts: 10,
reconnectionDelay: 1000,
reconnectionDelayMax: 5000,
randomizationFactor: 0.5,
// Timeout
timeout: 20000,
// Authentication
auth: {
token: 'my-jwt-token'
}
});
socket.on('connect_error', (error) => {
if (error.message === 'unauthorized') {
// Token refresh logic
refreshToken().then((newToken) => {
socket.auth.token = newToken;
socket.connect();
});
}
});
4. gRPC Streaming
gRPC supports four communication modes based on HTTP/2.
Four Streaming Modes
syntax = "proto3";
service ChatService {
// 1. Unary: one request - one response
rpc SendMessage (ChatMessage) returns (MessageResponse);
// 2. Server Streaming: one request - response stream
rpc SubscribeMessages (SubscriptionRequest) returns (stream ChatMessage);
// 3. Client Streaming: request stream - one response
rpc UploadLogs (stream LogEntry) returns (UploadSummary);
// 4. Bidirectional Streaming: request stream - response stream
rpc Chat (stream ChatMessage) returns (stream ChatMessage);
}
message ChatMessage {
string user_id = 1;
string content = 2;
int64 timestamp = 3;
}
Bidirectional Streaming Implementation (Go)
func (s *chatServer) Chat(stream pb.ChatService_ChatServer) error {
for {
msg, err := stream.Recv()
if err == io.EOF {
return nil
}
if err != nil {
return err
}
// Process and broadcast message
response := &pb.ChatMessage{
UserId: msg.UserId,
Content: msg.Content,
Timestamp: time.Now().UnixMilli(),
}
if err := stream.Send(response); err != nil {
return err
}
}
}
5. Practical Architecture: Real-Time Chat
Redis Pub/Sub + WebSocket
Redis Pub/Sub is used to synchronize messages across multiple WebSocket server instances.
[Client A] <--ws--> [Server 1] <--pub/sub--> [Redis]
[Client B] <--ws--> [Server 2] <--pub/sub--> [Redis]
[Client C] <--ws--> [Server 1] <--pub/sub--> [Redis]
Node.js Implementation
const Redis = require('ioredis');
const WebSocket = require('ws');
const redisPub = new Redis();
const redisSub = new Redis();
const wss = new WebSocket.Server({ port: 8080 });
// Manage clients per channel
const channels = new Map();
wss.on('connection', (ws) => {
ws.on('message', (raw) => {
const data = JSON.parse(raw);
switch (data.type) {
case 'subscribe':
// Subscribe to channel
if (!channels.has(data.channel)) {
channels.set(data.channel, new Set());
redisSub.subscribe(data.channel);
}
channels.get(data.channel).add(ws);
break;
case 'message':
// Broadcast message to all servers via Redis
redisPub.publish(data.channel, JSON.stringify({
user: data.user,
content: data.content,
timestamp: Date.now()
}));
break;
}
});
ws.on('close', () => {
// Remove from all channels on disconnect
channels.forEach((clients, channel) => {
clients.delete(ws);
if (clients.size === 0) {
channels.delete(channel);
redisSub.unsubscribe(channel);
}
});
});
});
// Deliver Redis messages to all clients in the channel
redisSub.on('message', (channel, message) => {
const clients = channels.get(channel);
if (clients) {
clients.forEach((ws) => {
if (ws.readyState === WebSocket.OPEN) {
ws.send(message);
}
});
}
});
Scalability Considerations
- Horizontal scaling: Synchronize messages across servers with Redis Pub/Sub or Redis Streams
- Connection limits: Set maximum connections per server (consider OS file descriptor limits)
- Load balancing: Sticky session or IP hash-based load balancing
- Message persistence: Store message history with Redis Streams or Kafka
- Authentication: JWT-based WebSocket handshake authentication
Part 2: Observability
6. The Three Pillars of Observability
Observability refers to the degree to which a system's internal state can be understood from its external outputs.
Metrics
Numeric time-series data.
- Request count (requests_total)
- Response time (response_duration_seconds)
- Error rate (error_rate_percent)
- CPU/memory utilization (cpu_usage_percent, memory_usage_bytes)
Logs
Text records of events that occurred at specific points in time.
{
"timestamp": "2026-04-12T16:00:00Z",
"level": "ERROR",
"service": "chat-server",
"trace_id": "abc123def456",
"message": "WebSocket connection failed",
"details": {
"user_id": "user-789",
"error_code": 1006,
"reason": "abnormal closure"
}
}
Traces
Track the complete path of a request as it traverses multiple services in a distributed system.
[API Gateway] ---> [Auth Service] ---> [Chat Service] ---> [Redis]
Span 1 Span 2 Span 3 Span 4
|____________ Trace ID: abc123 __________________________|
How the Three Pillars Relate
| Question | Resolution Tool |
|---|---|
| What went wrong? | Metrics (detect error rate spikes) |
| Why did it go wrong? | Traces (identify bottlenecks) |
| What exactly happened? | Logs (examine detailed error messages) |
7. OpenTelemetry
OpenTelemetry (OTel) is a CNCF project that collects metrics, logs, and traces in a vendor-neutral manner.
Architecture
[Application + OTel SDK]
|
| OTLP (gRPC/HTTP)
v
[OTel Collector]
| | |
v v v
[Jaeger] [Prom.] [Loki]
Auto-Instrumentation (Node.js)
// tracing.js - load before application starts
const { NodeSDK } = require('@opentelemetry/sdk-node');
const { OTLPTraceExporter } = require('@opentelemetry/exporter-trace-otlp-grpc');
const { OTLPMetricExporter } = require('@opentelemetry/exporter-metrics-otlp-grpc');
const { getNodeAutoInstrumentations } = require('@opentelemetry/auto-instrumentations-node');
const { PeriodicExportingMetricReader } = require('@opentelemetry/sdk-metrics');
const sdk = new NodeSDK({
serviceName: 'chat-service',
traceExporter: new OTLPTraceExporter({
url: 'http://otel-collector:4317',
}),
metricReader: new PeriodicExportingMetricReader({
exporter: new OTLPMetricExporter({
url: 'http://otel-collector:4317',
}),
exportIntervalMillis: 15000,
}),
instrumentations: [getNodeAutoInstrumentations()],
});
sdk.start();
OTel Collector Configuration
# otel-collector-config.yaml
receivers:
otlp:
protocols:
grpc:
endpoint: 0.0.0.0:4317
http:
endpoint: 0.0.0.0:4318
processors:
batch:
timeout: 5s
send_batch_size: 1024
memory_limiter:
check_interval: 1s
limit_mib: 512
exporters:
prometheus:
endpoint: 0.0.0.0:8889
otlp/jaeger:
endpoint: jaeger:4317
tls:
insecure: true
loki:
endpoint: http://loki:3100/loki/api/v1/push
service:
pipelines:
traces:
receivers: [otlp]
processors: [batch, memory_limiter]
exporters: [otlp/jaeger]
metrics:
receivers: [otlp]
processors: [batch]
exporters: [prometheus]
logs:
receivers: [otlp]
processors: [batch]
exporters: [loki]
8. Prometheus + Grafana
Prometheus Metric Types
1. Counter - Monotonically increasing value (e.g., total request count)
2. Gauge - Value that can increase or decrease (e.g., current active connections)
3. Histogram - Value distribution (e.g., response time distribution)
4. Summary - Percentiles (e.g., p99 response time)
WebSocket Custom Metrics Example (Node.js)
const promClient = require('prom-client');
// Counter: total connections
const wsConnectionsTotal = new promClient.Counter({
name: 'websocket_connections_total',
help: 'Total WebSocket connections',
labelNames: ['status'],
});
// Gauge: current active connections
const wsActiveConnections = new promClient.Gauge({
name: 'websocket_active_connections',
help: 'Current active WebSocket connections',
labelNames: ['namespace'],
});
// Histogram: message processing duration
const wsMessageDuration = new promClient.Histogram({
name: 'websocket_message_duration_seconds',
help: 'WebSocket message processing duration',
buckets: [0.001, 0.005, 0.01, 0.05, 0.1, 0.5, 1],
});
// Connect metrics to WebSocket events
wss.on('connection', (ws) => {
wsConnectionsTotal.inc({ status: 'connected' });
wsActiveConnections.inc({ namespace: 'default' });
ws.on('message', (data) => {
const end = wsMessageDuration.startTimer();
processMessage(data);
end();
});
ws.on('close', () => {
wsActiveConnections.dec({ namespace: 'default' });
});
});
Key PromQL Queries
# WebSocket connection request rate per second (5-minute moving average)
rate(websocket_connections_total[5m])
# Current active WebSocket connections
websocket_active_connections
# Message processing time p99
histogram_quantile(0.99, rate(websocket_message_duration_seconds_bucket[5m]))
# Error rate (percentage)
rate(websocket_errors_total[5m]) / rate(websocket_connections_total[5m]) * 100
# Instances with CPU utilization above 80%
node_cpu_usage_percent > 80
Grafana Dashboard Design
{
"dashboard": {
"title": "WebSocket Realtime Monitoring",
"panels": [
{
"title": "Active Connections",
"type": "stat",
"targets": [
{ "expr": "sum(websocket_active_connections)" }
]
},
{
"title": "Connection Rate",
"type": "timeseries",
"targets": [
{ "expr": "rate(websocket_connections_total[5m])" }
]
},
{
"title": "Message Latency (p50/p95/p99)",
"type": "timeseries",
"targets": [
{ "expr": "histogram_quantile(0.50, rate(websocket_message_duration_seconds_bucket[5m]))", "legendFormat": "p50" },
{ "expr": "histogram_quantile(0.95, rate(websocket_message_duration_seconds_bucket[5m]))", "legendFormat": "p95" },
{ "expr": "histogram_quantile(0.99, rate(websocket_message_duration_seconds_bucket[5m]))", "legendFormat": "p99" }
]
},
{
"title": "Error Rate",
"type": "gauge",
"targets": [
{ "expr": "rate(websocket_errors_total[5m]) / rate(websocket_connections_total[5m]) * 100" }
]
}
]
}
}
9. Distributed Tracing
Span and Trace Structure
Trace ID: 7f3a8b2c1d4e5f6a
Span A: API Gateway [============ ] 0-200ms
Span B: Auth Service [==== ] 10-50ms
Span C: Chat Service [================== ] 60-180ms
Span D: Redis Pub/Sub [==== ] 80-120ms
Span E: Database Write [======= ] 130-170ms
Jaeger Setup (Docker Compose)
version: '3.8'
services:
jaeger:
image: jaegertracing/all-in-one:1.54
ports:
- "16686:16686" # Jaeger UI
- "4317:4317" # OTLP gRPC
- "4318:4318" # OTLP HTTP
environment:
- COLLECTOR_OTLP_ENABLED=true
- SPAN_STORAGE_TYPE=elasticsearch
- ES_SERVER_URLS=http://elasticsearch:9200
Manual Span Creation (Python)
from opentelemetry import trace
from opentelemetry.trace import StatusCode
tracer = trace.get_tracer("chat-service")
def handle_websocket_message(message):
with tracer.start_as_current_span("process_message") as span:
span.set_attribute("message.type", message.get("type"))
span.set_attribute("message.size", len(str(message)))
try:
# Message processing logic
result = process(message)
span.set_attribute("processing.result", "success")
return result
except Exception as e:
span.set_status(StatusCode.ERROR, str(e))
span.record_exception(e)
raise
10. Log Aggregation
ELK Stack (Elasticsearch + Logstash + Kibana)
# logstash.conf
input {
tcp {
port => 5000
codec => json
}
}
filter {
if [level] == "ERROR" {
mutate {
add_tag => ["alert"]
}
}
date {
match => ["timestamp", "ISO8601"]
}
}
output {
elasticsearch {
hosts => ["http://elasticsearch:9200"]
index => "websocket-logs-%{+YYYY.MM.dd}"
}
}
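The `tcp` input above accepts JSON events over a plain socket. A minimal sketch of a client shipping such events (the helper names are hypothetical, and newline-delimited framing assumes the `json_lines`-style behavior; in production a buffered shipper like Filebeat is more typical than a hand-rolled socket client):

```python
import json
import socket
from datetime import datetime, timezone

def build_log_event(level: str, message: str, **fields) -> str:
    """Serialize one log event as a single JSON line, including the
    ISO8601 timestamp the Logstash date filter above matches on."""
    event = {
        "timestamp": datetime.now(timezone.utc).isoformat(),
        "level": level,
        "message": message,
        **fields,
    }
    return json.dumps(event) + "\n"

def ship_log(line: str, host: str = "logstash", port: int = 5000) -> None:
    # One short-lived connection per event, for clarity; a real client
    # would keep the connection open and buffer/retry on failure.
    with socket.create_connection((host, port), timeout=5) as sock:
        sock.sendall(line.encode("utf-8"))

# Usage:
# ship_log(build_log_event("ERROR", "parse failed", userId="user-456"))
```

Events with `"level": "ERROR"` get tagged `alert` by the filter block and land in the daily `websocket-logs-*` index.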
Loki + Grafana (Lightweight Alternative)
# loki-config.yaml
auth_enabled: false
server:
http_listen_port: 3100
ingester:
lifecycler:
ring:
kvstore:
store: inmemory
replication_factor: 1
schema_config:
configs:
- from: 2026-01-01
store: tsdb
object_store: filesystem
schema: v13
index:
prefix: index_
period: 24h
storage_config:
tsdb_shipper:
active_index_directory: /loki/index
cache_location: /loki/cache
filesystem:
directory: /loki/chunks
Structured Logging (Node.js + Winston)
const winston = require('winston');
const logger = winston.createLogger({
level: 'info',
format: winston.format.combine(
winston.format.timestamp(),
winston.format.json()
),
defaultMeta: { service: 'websocket-server' },
  transports: [
    new winston.transports.Console(),
    // Note: Loki's push API expects its own payload shape (streams + values),
    // so in practice the community `winston-loki` transport is used instead
    // of the generic HTTP transport sketched here.
    new winston.transports.Http({
      host: 'loki-gateway',
      port: 3100,
      path: '/loki/api/v1/push',
    }),
  ],
});
// Usage example
logger.info('WebSocket connection established', {
userId: 'user-123',
remoteAddress: '192.168.1.100',
protocol: 'wss',
});
logger.error('Message processing failed', {
userId: 'user-456',
errorCode: 'PARSE_ERROR',
rawMessage: '<sanitized>',
traceId: 'abc123',
});
11. Alert Design
Preventing Alert Fatigue
When there are too many alerts, even critical ones get ignored. Effective alert design principles are as follows.
| Principle | Description |
|---|---|
| Actionability | Receiving an alert should enable immediate action |
| Symptom-based | Alerts based on user impact (symptoms), not causes |
| Tiered | Separate into Critical / Warning / Info levels |
| Auto-resolve | Automatically close alerts when the issue resolves |
| Grouping | Bundle related alerts and send only once |
Prometheus Alerting Rules
# alerting-rules.yaml
groups:
- name: websocket-alerts
rules:
- alert: HighConnectionDropRate
expr: rate(websocket_disconnections_total[5m]) > 10
for: 5m
labels:
severity: warning
annotations:
summary: "High WebSocket disconnection rate"
description: "More than 10 disconnections per second over 5 minutes"
- alert: WebSocketLatencyHigh
expr: histogram_quantile(0.99, rate(websocket_message_duration_seconds_bucket[5m])) > 1
for: 3m
labels:
severity: critical
annotations:
summary: "High WebSocket message latency"
description: "p99 latency exceeds 1 second"
    - alert: ActiveConnectionsHigh
      expr: websocket_active_connections > 10000
      for: 1m
      labels:
        severity: warning
      annotations:
        summary: "Active WebSocket connection count high"
        description: "Active connection count exceeds 10,000"
Alertmanager Routing Configuration
# alertmanager.yaml
global:
resolve_timeout: 5m
route:
receiver: 'default-receiver'
group_by: ['alertname', 'severity']
group_wait: 30s
group_interval: 5m
repeat_interval: 4h
routes:
- match:
severity: critical
receiver: 'pagerduty-critical'
repeat_interval: 1h
- match:
severity: warning
receiver: 'slack-warnings'
repeat_interval: 4h
receivers:
- name: 'default-receiver'
slack_configs:
- api_url: 'https://hooks.slack.com/services/XXXX'
channel: '#alerts-general'
- name: 'pagerduty-critical'
pagerduty_configs:
- service_key: 'your-pagerduty-key'
severity: critical
- name: 'slack-warnings'
slack_configs:
- api_url: 'https://hooks.slack.com/services/XXXX'
channel: '#alerts-warning'
title: 'Warning Alert'
On-Call Integration: PagerDuty / OpsGenie
Setting up on-call rotation and escalation policies ensures that alerts are automatically routed to the responsible person during incidents.
[Prometheus Alert Fires]
|
v
[Alertmanager Receives and Groups]
|
v
[PagerDuty / OpsGenie Creates Incident]
|
v
[On-call Engineer Notified via Phone/SMS/Slack]
|
v
[If Not ACKed -> Escalation (Notify Manager)]
Final Checklist
Items to verify when applying real-time communication and observability in production environments.
Real-Time Communication
- WebSocket handshake authentication implemented
- Ping/Pong heartbeat interval configured
- Auto-reconnection with exponential backoff implemented
- Horizontal scaling prepared with Redis Pub/Sub
- Message persistence ensured with Kafka or Redis Streams
Observability
- All three pillars collected: metrics, logs, traces
- OpenTelemetry Collector deployed and pipeline configured
- Key indicators visualized in Grafana dashboards
- Symptom-based alerting rules configured
- On-call rotation and escalation policies established
- Alert thresholds continuously tuned to prevent alert fatigue
Quiz: Real-Time Communication and Observability Key Review
Q1. What is the biggest difference between WebSocket and SSE?
A1. WebSocket supports bidirectional (full-duplex) communication, while SSE only supports unidirectional push from server to client. WebSocket is suitable for scenarios where both sides need to send data, such as chat or gaming. SSE is suitable for scenarios where the server unilaterally sends data, such as stock tickers or notifications.
Q2. What are the three pillars of observability and what is the role of each?
A2. Metrics are numeric time-series data for problem detection. Traces track request paths across distributed systems for bottleneck identification. Logs provide detailed event records for root cause analysis.
Q3. What are the key principles for preventing alert fatigue?
A3. Set only actionable alerts, base alerts on symptoms rather than causes, tier them into Critical/Warning/Info levels, group related alerts, and implement auto-resolve detection.