WebSocket Real-Time Communication & Observability (Monitoring): A Complete Guide
- Part 1: Real-Time Communication
- Part 2: Observability
Part 1: Real-Time Communication
1. HTTP Polling vs Long Polling vs SSE vs WebSocket
Different real-time data delivery methods suit different requirements.
Comparison Table
| Feature | HTTP Polling | Long Polling | SSE | WebSocket |
|---|---|---|---|---|
| Direction | Unidirectional (client pull) | Unidirectional (held server response) | Unidirectional (server push) | Bidirectional (full-duplex) |
| Connection | New connection per request | Reconnect after each response | Single HTTP connection | Single TCP connection |
| Overhead | High (repeated requests) | Medium | Low | Very low |
| Latency | Depends on polling interval | Immediate on event | Immediate on event | Immediate on event |
| Protocol | HTTP/1.1 | HTTP/1.1 | HTTP/1.1 | ws:// or wss:// |
| Browser support | All | All | All except IE | All |
| Use cases | Simple status checks | Notifications | Stock tickers, news feeds | Chat, gaming, collaborative editing |
Recommended Technology by Use Case
- Dashboard refresh: SSE (server pushes data one way)
- Real-time chat: WebSocket (bidirectional message exchange)
- File upload progress: SSE (server pushes progress updates)
- Online gaming: WebSocket (ultra-low-latency bidirectional communication)
- IoT sensor monitoring: WebSocket or MQTT
2. WebSocket Protocol Deep Dive
Handshake Process
A WebSocket connection begins with an HTTP Upgrade request.
GET /chat HTTP/1.1
Host: example.com
Upgrade: websocket
Connection: Upgrade
Sec-WebSocket-Key: dGhlIHNhbXBsZSBub25jZQ==
Sec-WebSocket-Version: 13
Server response:
HTTP/1.1 101 Switching Protocols
Upgrade: websocket
Connection: Upgrade
Sec-WebSocket-Accept: s3pPLMBiTxaQ9kYGzzhZRbK+xOo=
Frame Structure
WebSocket frames have the following binary structure:
0 1 2 3
0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1
+-+-+-+-+-------+-+-------------+-------------------------------+
|F|R|R|R| opcode|M| Payload len | Extended payload length |
|I|S|S|S| (4) |A| (7) | (16/64) |
|N|V|V|V| |S| | (if payload len==126/127) |
| |1|2|3| |K| | |
+-+-+-+-+-------+-+-------------+-------------------------------+
Key opcode values:
- 0x1: Text frame
- 0x2: Binary frame
- 0x8: Connection close
- 0x9: Ping
- 0xA: Pong
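The first two bytes of every frame can be decoded with plain bit masks. The following sketch (helper name illustrative) extracts the fields shown in the diagram above:

```javascript
// Decode the fixed 2-byte header of a WebSocket frame (RFC 6455).
// For payload lengths 126/127, extended length bytes follow; omitted here.
function parseFrameHeader(buf) {
  return {
    fin: (buf[0] & 0x80) !== 0,      // FIN: final fragment of a message
    opcode: buf[0] & 0x0f,           // 0x1 text, 0x2 binary, 0x8 close, 0x9 ping, 0xA pong
    masked: (buf[1] & 0x80) !== 0,   // client-to-server frames must be masked
    payloadLen: buf[1] & 0x7f,       // 0-125 literal; 126/127 signal extended length
  };
}

// 0x81 = FIN set + text opcode; 0x05 = unmasked, 5-byte payload
console.log(parseFrameHeader(Buffer.from([0x81, 0x05])));
// { fin: true, opcode: 1, masked: false, payloadLen: 5 }
```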
Ping/Pong Heartbeat
// Server-side (Node.js)
const WebSocket = require('ws');
const wss = new WebSocket.Server({ port: 8080 });
wss.on('connection', (ws) => {
ws.isAlive = true;
ws.on('pong', () => {
ws.isAlive = true;
});
});
// Check liveness every 30 seconds and drop dead connections
const interval = setInterval(() => {
wss.clients.forEach((ws) => {
if (!ws.isAlive) return ws.terminate();
ws.isAlive = false;
ws.ping();
});
}, 30000);
// Avoid leaking the timer when the server shuts down
wss.on('close', () => clearInterval(interval));
Connection Management and Error Handling
// Client-side
class WebSocketClient {
constructor(url) {
this.url = url;
this.reconnectAttempts = 0;
this.maxReconnectAttempts = 5;
this.reconnectDelay = 1000;
this.connect();
}
connect() {
this.ws = new WebSocket(this.url);
this.ws.onopen = () => {
console.log('Connected');
this.reconnectAttempts = 0;
};
this.ws.onclose = (event) => {
if (event.code !== 1000) {
this.reconnect();
}
};
this.ws.onerror = (error) => {
console.error('WebSocket error:', error);
};
}
reconnect() {
if (this.reconnectAttempts >= this.maxReconnectAttempts) {
console.error('Max reconnect attempts reached');
return;
}
this.reconnectAttempts++;
const delay = this.reconnectDelay * Math.pow(2, this.reconnectAttempts);
setTimeout(() => this.connect(), delay);
}
}
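The `delay` calculation above is plain exponential backoff. Factored into a standalone helper with a cap and optional jitter (the helper name and defaults are illustrative, not part of the client above):

```javascript
// Exponential backoff for reconnect attempts, as in the client above:
// base * 2^attempt, capped, with optional random jitter to avoid
// thundering herds when many clients reconnect at once.
function backoffDelay(attempt, base = 1000, cap = 30000, jitter = 0) {
  const raw = Math.min(base * Math.pow(2, attempt), cap);
  return raw + Math.floor(Math.random() * jitter);
}

console.log([1, 2, 3, 4, 5].map((n) => backoffDelay(n)));
// [2000, 4000, 8000, 16000, 30000]
```

The cap matters: without it, attempt 10 would wait over 17 minutes.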
3. Working with Socket.IO
Socket.IO provides an abstraction layer on top of WebSocket that simplifies real-time communication development.
Namespaces and Rooms
const io = require('socket.io')(server);
// Split into namespaces
const chatNamespace = io.of('/chat');
const notificationNamespace = io.of('/notifications');
chatNamespace.on('connection', (socket) => {
// Join a room
socket.on('joinRoom', (roomId) => {
socket.join(roomId);
chatNamespace.to(roomId).emit('userJoined', {
userId: socket.id,
room: roomId
});
});
// Send a message to the room
socket.on('message', (data) => {
chatNamespace.to(data.roomId).emit('newMessage', {
userId: socket.id,
content: data.content,
timestamp: Date.now()
});
});
// Leave the room
socket.on('leaveRoom', (roomId) => {
socket.leave(roomId);
});
});
Auto-Reconnection and Fallback
// Client configuration
const socket = io('https://api.example.com', {
// Transport fallback (HTTP long polling when WebSocket fails)
transports: ['websocket', 'polling'],
// Reconnection settings
reconnection: true,
reconnectionAttempts: 10,
reconnectionDelay: 1000,
reconnectionDelayMax: 5000,
randomizationFactor: 0.5,
// Timeout
timeout: 20000,
// Authentication
auth: {
token: 'my-jwt-token'
}
});
socket.on('connect_error', (error) => {
if (error.message === 'unauthorized') {
// Token refresh logic
refreshToken().then((newToken) => {
socket.auth.token = newToken;
socket.connect();
});
}
});
4. gRPC Streaming
gRPC is built on HTTP/2 and supports four communication modes.
Four Streaming Modes
syntax = "proto3";
service ChatService {
// 1. Unary: one request, one response
rpc SendMessage (ChatMessage) returns (MessageResponse);
// 2. Server streaming: one request, a stream of responses
rpc SubscribeMessages (SubscriptionRequest) returns (stream ChatMessage);
// 3. Client streaming: a stream of requests, one response
rpc UploadLogs (stream LogEntry) returns (UploadSummary);
// 4. Bidirectional streaming: request stream and response stream
rpc Chat (stream ChatMessage) returns (stream ChatMessage);
}
message ChatMessage {
string user_id = 1;
string content = 2;
int64 timestamp = 3;
}
Bidirectional Streaming Implementation (Go)
func (s *chatServer) Chat(stream pb.ChatService_ChatServer) error {
for {
msg, err := stream.Recv()
if err == io.EOF {
return nil
}
if err != nil {
return err
}
// Process the message and broadcast it back
response := &pb.ChatMessage{
UserId: msg.UserId,
Content: msg.Content,
Timestamp: time.Now().UnixMilli(),
}
if err := stream.Send(response); err != nil {
return err
}
}
}
5. In Practice: Real-Time Chat Architecture
Redis Pub/Sub + WebSocket
Redis Pub/Sub synchronizes messages across multiple WebSocket server instances.
[Client A] <--ws--> [Server 1] <--pub/sub--> [Redis]
[Client B] <--ws--> [Server 2] <--pub/sub--> [Redis]
[Client C] <--ws--> [Server 1] <--pub/sub--> [Redis]
Node.js Implementation
const Redis = require('ioredis');
const WebSocket = require('ws');
const redisPub = new Redis();
const redisSub = new Redis();
const wss = new WebSocket.Server({ port: 8080 });
// Track clients per channel
const channels = new Map();
wss.on('connection', (ws) => {
ws.on('message', (raw) => {
const data = JSON.parse(raw);
switch (data.type) {
case 'subscribe':
// Subscribe to the channel
if (!channels.has(data.channel)) {
channels.set(data.channel, new Set());
redisSub.subscribe(data.channel);
}
channels.get(data.channel).add(ws);
break;
case 'message':
// Publish via Redis so every server instance receives it
redisPub.publish(data.channel, JSON.stringify({
user: data.user,
content: data.content,
timestamp: Date.now()
}));
break;
}
});
ws.on('close', () => {
// Remove the socket from all channels on disconnect
channels.forEach((clients, channel) => {
clients.delete(ws);
if (clients.size === 0) {
channels.delete(channel);
redisSub.unsubscribe(channel);
}
});
});
});
// On a Redis message, fan out to every local client in that channel
redisSub.on('message', (channel, message) => {
const clients = channels.get(channel);
if (clients) {
clients.forEach((ws) => {
if (ws.readyState === WebSocket.OPEN) {
ws.send(message);
}
});
}
});
Scalability Considerations
- Horizontal scaling: synchronize messages across servers with Redis Pub/Sub or Redis Streams
- Connection limits: cap connections per server (mind the OS file-descriptor limit)
- Load balancing: sticky sessions or IP-hash-based balancing
- Message persistence: keep message history in Redis Streams or Kafka
- Authentication: JWT-based authentication during the WebSocket handshake
Part 2: Observability
6. The Three Pillars of Observability
Observability is the degree to which a system's internal state can be inferred from its external outputs alone.
Metrics
Numeric time-series data.
- Request count (requests_total)
- Response time (response_duration_seconds)
- Error rate (error_rate_percent)
- CPU/memory utilization (cpu_usage_percent, memory_usage_bytes)
Logs
Text records of events at specific points in time.
{
"timestamp": "2026-04-12T16:00:00Z",
"level": "ERROR",
"service": "chat-server",
"trace_id": "abc123def456",
"message": "WebSocket connection failed",
"details": {
"user_id": "user-789",
"error_code": 1006,
"reason": "abnormal closure"
}
}
Traces
Traces follow a request's full path as it crosses multiple services in a distributed system.
[API Gateway] ---> [Auth Service] ---> [Chat Service] ---> [Redis]
Span 1 Span 2 Span 3 Span 4
|____________ Trace ID: abc123 __________________________|
How the Three Pillars Work Together
| Question | Tool |
|---|---|
| What went wrong? | Metrics (detect error-rate spikes) |
| Why did it go wrong? | Traces (identify the bottleneck) |
| What exactly happened? | Logs (read the detailed error message) |
7. OpenTelemetry
OpenTelemetry (OTel) is a CNCF project that collects metrics, logs, and traces in a vendor-neutral way.
Architecture
[애플리케이션 + OTel SDK]
|
| OTLP (gRPC/HTTP)
v
[OTel Collector]
| | |
v v v
[Jaeger] [Prom.] [Loki]
Auto-Instrumentation (Node.js)
// tracing.js - load this before the application starts
const { NodeSDK } = require('@opentelemetry/sdk-node');
const { OTLPTraceExporter } = require('@opentelemetry/exporter-trace-otlp-grpc');
const { OTLPMetricExporter } = require('@opentelemetry/exporter-metrics-otlp-grpc');
const { getNodeAutoInstrumentations } = require('@opentelemetry/auto-instrumentations-node');
const { PeriodicExportingMetricReader } = require('@opentelemetry/sdk-metrics');
const sdk = new NodeSDK({
serviceName: 'chat-service',
traceExporter: new OTLPTraceExporter({
url: 'http://otel-collector:4317',
}),
metricReader: new PeriodicExportingMetricReader({
exporter: new OTLPMetricExporter({
url: 'http://otel-collector:4317',
}),
exportIntervalMillis: 15000,
}),
instrumentations: [getNodeAutoInstrumentations()],
});
sdk.start();
OTel Collector Configuration
# otel-collector-config.yaml
receivers:
  otlp:
    protocols:
      grpc:
        endpoint: 0.0.0.0:4317
      http:
        endpoint: 0.0.0.0:4318
processors:
  batch:
    timeout: 5s
    send_batch_size: 1024
  memory_limiter:
    check_interval: 1s
    limit_mib: 512
exporters:
  prometheus:
    endpoint: 0.0.0.0:8889
  otlp/jaeger:
    endpoint: jaeger:4317
    tls:
      insecure: true
  loki:
    endpoint: http://loki:3100/loki/api/v1/push
service:
  pipelines:
    traces:
      receivers: [otlp]
      # memory_limiter should run first in the pipeline so it can
      # drop data before batching consumes more memory
      processors: [memory_limiter, batch]
      exporters: [otlp/jaeger]
    metrics:
      receivers: [otlp]
      processors: [batch]
      exporters: [prometheus]
    logs:
      receivers: [otlp]
      processors: [batch]
      exporters: [loki]
8. Prometheus + Grafana
Prometheus Metric Types
1. Counter - monotonically increasing value (e.g., total requests)
2. Gauge - value that can go up or down (e.g., current active connections)
3. Histogram - distribution of values in buckets (e.g., response-time distribution)
4. Summary - client-side percentiles (e.g., p99 response time)
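To make the Histogram type concrete, here is a tiny dependency-free sketch of how a Prometheus-style histogram records observations into cumulative buckets. prom-client does this (and more) internally; the class name here is purely illustrative:

```javascript
// Minimal Prometheus-style histogram: each bucket counts observations
// less than or equal to its upper bound, plus a +Inf bucket and a running sum.
class MiniHistogram {
  constructor(buckets) {
    this.bounds = [...buckets].sort((a, b) => a - b);
    this.counts = new Array(this.bounds.length + 1).fill(0); // last slot = +Inf
    this.sum = 0;
  }
  observe(value) {
    this.sum += value;
    // Cumulative semantics: increment every bucket whose bound covers the value
    this.bounds.forEach((bound, i) => {
      if (value <= bound) this.counts[i]++;
    });
    this.counts[this.counts.length - 1]++; // +Inf always increments
  }
}

const h = new MiniHistogram([0.01, 0.05, 0.1]);
[0.004, 0.02, 0.2].forEach((v) => h.observe(v));
console.log(h.counts); // [1, 2, 2, 3]
```

The cumulative bucket layout is what lets PromQL's histogram_quantile() estimate percentiles from the `_bucket` series.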
WebSocket Custom Metrics Example (Node.js)
const promClient = require('prom-client');
// Counter: total connections
const wsConnectionsTotal = new promClient.Counter({
name: 'websocket_connections_total',
help: 'Total WebSocket connections',
labelNames: ['status'],
});
// Gauge: current active connections
const wsActiveConnections = new promClient.Gauge({
name: 'websocket_active_connections',
help: 'Current active WebSocket connections',
labelNames: ['namespace'],
});
// Histogram: message processing time
const wsMessageDuration = new promClient.Histogram({
name: 'websocket_message_duration_seconds',
help: 'WebSocket message processing duration',
buckets: [0.001, 0.005, 0.01, 0.05, 0.1, 0.5, 1],
});
// Wire metrics into WebSocket events
wss.on('connection', (ws) => {
wsConnectionsTotal.inc({ status: 'connected' });
wsActiveConnections.inc({ namespace: 'default' });
ws.on('message', (data) => {
const end = wsMessageDuration.startTimer();
processMessage(data);
end();
});
ws.on('close', () => {
wsActiveConnections.dec({ namespace: 'default' });
});
});
Key PromQL Queries
# Per-second WebSocket connection rate (5-minute window)
rate(websocket_connections_total[5m])
# Current active WebSocket connections
websocket_active_connections
# p99 message processing time
histogram_quantile(0.99, rate(websocket_message_duration_seconds_bucket[5m]))
# Error rate (percent)
rate(websocket_errors_total[5m]) / rate(websocket_connections_total[5m]) * 100
# Instances with CPU usage above 80%
node_cpu_usage_percent > 80
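The rate() function used in most of these queries computes the per-second increase of a counter over the window. A simplified sketch of that arithmetic (real Prometheus also handles counter resets and extrapolates to the window boundaries, both ignored here):

```javascript
// Simplified version of PromQL's rate(): per-second increase of a counter
// between two samples. Timestamps are in milliseconds.
function simpleRate(earlier, later) {
  const seconds = (later.t - earlier.t) / 1000;
  return (later.value - earlier.value) / seconds;
}

// Counter went from 1200 to 1500 over 5 minutes -> 1 connection/second
console.log(simpleRate({ t: 0, value: 1200 }, { t: 300000, value: 1500 })); // 1
```

This is also why rate() only makes sense on Counters: a Gauge that goes down would produce meaningless negative rates.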
Grafana Dashboard Design
{
"dashboard": {
"title": "WebSocket Realtime Monitoring",
"panels": [
{
"title": "Active Connections",
"type": "stat",
"targets": [
{ "expr": "sum(websocket_active_connections)" }
]
},
{
"title": "Connection Rate",
"type": "timeseries",
"targets": [
{ "expr": "rate(websocket_connections_total[5m])" }
]
},
{
"title": "Message Latency (p50/p95/p99)",
"type": "timeseries",
"targets": [
{ "expr": "histogram_quantile(0.50, rate(websocket_message_duration_seconds_bucket[5m]))", "legendFormat": "p50" },
{ "expr": "histogram_quantile(0.95, rate(websocket_message_duration_seconds_bucket[5m]))", "legendFormat": "p95" },
{ "expr": "histogram_quantile(0.99, rate(websocket_message_duration_seconds_bucket[5m]))", "legendFormat": "p99" }
]
},
{
"title": "Error Rate",
"type": "gauge",
"targets": [
{ "expr": "rate(websocket_errors_total[5m]) / rate(websocket_connections_total[5m]) * 100" }
]
}
]
}
}
9. Distributed Tracing
Span and Trace Structure
Trace ID: 7f3a8b2c1d4e5f6a
Span A: API Gateway [============ ] 0-200ms
Span B: Auth Service [==== ] 10-50ms
Span C: Chat Service [================== ] 60-180ms
Span D: Redis Pub/Sub [==== ] 80-120ms
Span E: Database Write [======= ] 130-170ms
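Trace-level figures such as total duration or the slowest span fall out of simple arithmetic on span start/end times. A sketch using the timings from the diagram above (field and span names are illustrative):

```javascript
// Derive trace-level timings from a flat list of spans (times in ms).
function traceSummary(spans) {
  const start = Math.min(...spans.map((s) => s.start));
  const end = Math.max(...spans.map((s) => s.end));
  // Span with the largest self duration (end - start)
  const slowest = spans.reduce((a, b) =>
    (b.end - b.start > a.end - a.start ? b : a));
  return { durationMs: end - start, slowestSpan: slowest.name };
}

const spans = [
  { name: 'api-gateway', start: 0, end: 200 },
  { name: 'auth-service', start: 10, end: 50 },
  { name: 'chat-service', start: 60, end: 180 },
  { name: 'redis-pubsub', start: 80, end: 120 },
  { name: 'db-write', start: 130, end: 170 },
];
console.log(traceSummary(spans)); // { durationMs: 200, slowestSpan: 'api-gateway' }
```

Tracing backends like Jaeger compute exactly these aggregates (plus parent/child relationships) when rendering a trace timeline.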
Jaeger Setup (Docker Compose)
version: '3.8'
services:
  jaeger:
    image: jaegertracing/all-in-one:1.54
    ports:
      - "16686:16686"   # Jaeger UI
      - "4317:4317"     # OTLP gRPC
      - "4318:4318"     # OTLP HTTP
    environment:
      - COLLECTOR_OTLP_ENABLED=true
      - SPAN_STORAGE_TYPE=elasticsearch
      - ES_SERVER_URLS=http://elasticsearch:9200
Manual Span Creation (Python)
from opentelemetry import trace
from opentelemetry.trace import StatusCode

tracer = trace.get_tracer("chat-service")

def handle_websocket_message(message):
    with tracer.start_as_current_span("process_message") as span:
        span.set_attribute("message.type", message.get("type"))
        span.set_attribute("message.size", len(str(message)))
        try:
            # Message processing logic
            result = process(message)
            span.set_attribute("processing.result", "success")
            return result
        except Exception as e:
            span.set_status(StatusCode.ERROR, str(e))
            span.record_exception(e)
            raise
10. Log Aggregation
ELK Stack (Elasticsearch + Logstash + Kibana)
# logstash.conf
input {
  tcp {
    port => 5000
    codec => json
  }
}
filter {
  if [level] == "ERROR" {
    mutate {
      add_tag => ["alert"]
    }
  }
  date {
    match => ["timestamp", "ISO8601"]
  }
}
output {
  elasticsearch {
    hosts => ["http://elasticsearch:9200"]
    index => "websocket-logs-%{+YYYY.MM.dd}"
  }
}
Loki + Grafana (Lightweight Alternative)
# loki-config.yaml
auth_enabled: false
server:
  http_listen_port: 3100
ingester:
  lifecycler:
    ring:
      kvstore:
        store: inmemory
      replication_factor: 1
schema_config:
  configs:
    - from: 2026-01-01
      store: tsdb
      object_store: filesystem
      schema: v13
      index:
        prefix: index_
        period: 24h
storage_config:
  tsdb_shipper:
    active_index_directory: /loki/index
    cache_location: /loki/cache
  filesystem:
    directory: /loki/chunks
Structured Logging (Node.js + Winston)
const winston = require('winston');
const logger = winston.createLogger({
level: 'info',
format: winston.format.combine(
winston.format.timestamp(),
winston.format.json()
),
defaultMeta: { service: 'websocket-server' },
transports: [
new winston.transports.Console(),
new winston.transports.Http({
host: 'loki-gateway',
port: 3100,
path: '/loki/api/v1/push',
}),
],
});
// Usage
logger.info('WebSocket connection established', {
userId: 'user-123',
remoteAddress: '192.168.1.100',
protocol: 'wss',
});
logger.error('Message processing failed', {
userId: 'user-456',
errorCode: 'PARSE_ERROR',
rawMessage: '<sanitized>',
traceId: 'abc123',
});
11. Alert Design
Preventing Alert Fatigue
When there are too many alerts, even the important ones get ignored. Effective alert design follows these principles:
| Principle | Description |
|---|---|
| Actionable | Every alert should have an immediate response you can take |
| Symptom-based | Alert on user impact (symptoms), not on causes |
| Tiered | Separate into Critical / Warning / Info levels |
| Auto-resolution | Close alerts automatically when the problem clears |
| Grouping | Bundle related alerts into a single notification |
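The grouping principle maps directly onto Alertmanager's group_by: alerts sharing the same values for the grouping labels are bundled into one notification. A minimal sketch of that behavior (function name illustrative):

```javascript
// Alertmanager-style grouping: alerts with identical values for the
// group_by labels are collapsed into a single notification group.
function groupAlerts(alerts, groupBy) {
  const groups = new Map();
  for (const alert of alerts) {
    const key = groupBy.map((l) => alert.labels[l]).join('|');
    if (!groups.has(key)) groups.set(key, []);
    groups.get(key).push(alert);
  }
  return groups;
}

const alerts = [
  { labels: { alertname: 'HighLatency', severity: 'critical', instance: 'ws-1' } },
  { labels: { alertname: 'HighLatency', severity: 'critical', instance: 'ws-2' } },
  { labels: { alertname: 'HighDropRate', severity: 'warning', instance: 'ws-1' } },
];
const groups = groupAlerts(alerts, ['alertname', 'severity']);
console.log(groups.size); // 2 notifications instead of 3 separate alerts
```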
Prometheus Alerting Rules
# alerting-rules.yaml
groups:
  - name: websocket-alerts
    rules:
      - alert: HighConnectionDropRate
        expr: rate(websocket_disconnections_total[5m]) > 10
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "High WebSocket disconnection rate"
          description: "More than 10 disconnections per second over 5 minutes"
      - alert: WebSocketLatencyHigh
        expr: histogram_quantile(0.99, rate(websocket_message_duration_seconds_bucket[5m])) > 1
        for: 3m
        labels:
          severity: critical
        annotations:
          summary: "High WebSocket message latency"
          description: "p99 latency exceeds 1 second"
      - alert: ActiveConnectionsSpiking
        expr: websocket_active_connections > 10000
        for: 1m
        labels:
          severity: warning
        annotations:
          summary: "Active WebSocket connections spiking"
          description: "Active connections exceed 10,000"
Alertmanager Routing Configuration
# alertmanager.yaml
global:
  resolve_timeout: 5m
route:
  receiver: 'default-receiver'
  group_by: ['alertname', 'severity']
  group_wait: 30s
  group_interval: 5m
  repeat_interval: 4h
  routes:
    - match:
        severity: critical
      receiver: 'pagerduty-critical'
      repeat_interval: 1h
    - match:
        severity: warning
      receiver: 'slack-warnings'
      repeat_interval: 4h
receivers:
  - name: 'default-receiver'
    slack_configs:
      - api_url: 'https://hooks.slack.com/services/XXXX'
        channel: '#alerts-general'
  - name: 'pagerduty-critical'
    pagerduty_configs:
      - service_key: 'your-pagerduty-key'
        severity: critical
  - name: 'slack-warnings'
    slack_configs:
      - api_url: 'https://hooks.slack.com/services/XXXX'
        channel: '#alerts-warning'
        title: 'Warning Alert'
On-Call Integration: PagerDuty / OpsGenie
With on-call rotations and escalation policies configured, incidents are routed to the engineer on duty automatically.
[Prometheus fires an alert]
|
v
[Alertmanager receives and groups it]
|
v
[PagerDuty / OpsGenie creates an incident]
|
v
[On-call engineer notified by phone/SMS/Slack]
|
v
[No ACK? Escalate to the next responder (e.g., the manager)]
Wrap-Up Checklist
Items to verify when taking real-time communication and observability to production.
Real-Time Communication
- WebSocket handshake authentication in place
- Ping/Pong heartbeat interval configured
- Automatic reconnection with exponential backoff implemented
- Horizontal scaling prepared via Redis Pub/Sub
- Message persistence secured with a message queue (Kafka/Redis Streams)
Observability
- All three pillars collected: metrics, logs, and traces
- OpenTelemetry Collector deployed with pipelines configured
- Key indicators visualized on Grafana dashboards
- Symptom-based alerting rules defined
- On-call rotation and escalation policy established
- Thresholds tuned continuously to prevent alert fatigue
Quiz: Real-Time Communication & Observability Checkpoint
Q1. What is the biggest difference between WebSocket and SSE?
A1. WebSocket supports bidirectional (full-duplex) communication, while SSE only pushes data one way, from server to client. WebSocket fits cases like chat or gaming where both sides send data; SSE fits cases like stock tickers or notifications where the server pushes unilaterally.
Q2. What are the three pillars of observability, and what is each one for?
A2. Metrics are numeric time-series data used to detect problems; traces follow a request's path through a distributed system to identify bottlenecks; logs are detailed event records used for root-cause analysis.
Q3. What are the key principles for preventing alert fatigue?
A3. Define only actionable alerts, base them on symptoms rather than causes, tier them into Critical/Warning/Info, group related alerts, and detect resolution automatically.
WebSocket Real-Time Communication & Observability/Monitoring Complete Guide
- Part 1: Real-Time Communication
- Part 2: Observability
Part 1: Real-Time Communication
1. HTTP Polling vs Long Polling vs SSE vs WebSocket
Different real-time data delivery methods are suited to different requirements.
Comparison Table
| Feature | HTTP Polling | Long Polling | SSE | WebSocket |
|---|---|---|---|---|
| Direction | Unidirectional (client request) | Unidirectional (server response wait) | Unidirectional (server push) | Bidirectional (full-duplex) |
| Connection | New connection per request | Reconnect after response | Single HTTP connection | Single TCP connection |
| Overhead | High (repeated requests) | Medium | Low | Very low |
| Latency | Depends on polling interval | Immediate on event | Immediate on event | Immediate on event |
| Protocol | HTTP/1.1 | HTTP/1.1 | HTTP/1.1 | ws:// or wss:// |
| Browser Support | All | All | Except IE | All |
| Use Cases | Simple status checks | Notifications | Stock tickers, news feeds | Chat, gaming, collaborative editing |
Recommended Technology by Use Case
- Dashboard data refresh: SSE (server unilaterally pushes data)
- Real-time chat: WebSocket (bidirectional message exchange)
- File upload progress: SSE (server pushes progress updates)
- Online gaming: WebSocket (ultra-low-latency bidirectional communication)
- IoT sensor monitoring: WebSocket or MQTT
2. WebSocket Protocol Deep Dive
Handshake Process
A WebSocket connection begins with an HTTP Upgrade request.
GET /chat HTTP/1.1
Host: example.com
Upgrade: websocket
Connection: Upgrade
Sec-WebSocket-Key: dGhlIHNhbXBsZSBub25jZQ==
Sec-WebSocket-Version: 13
Server response:
HTTP/1.1 101 Switching Protocols
Upgrade: websocket
Connection: Upgrade
Sec-WebSocket-Accept: s3pPLMBiTxaQ9kYGzzhZRbK+xOo=
Frame Structure
WebSocket frames have the following binary structure:
0 1 2 3
0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1
+-+-+-+-+-------+-+-------------+-------------------------------+
|F|R|R|R| opcode|M| Payload len | Extended payload length |
|I|S|S|S| (4) |A| (7) | (16/64) |
|N|V|V|V| |S| | (if payload len==126/127) |
| |1|2|3| |K| | |
+-+-+-+-+-------+-+-------------+-------------------------------+
Key opcode values:
0x1: Text frame0x2: Binary frame0x8: Connection close0x9: Ping0xA: Pong
Ping/Pong Heartbeat
// Server-side (Node.js)
const WebSocket = require('ws');
const wss = new WebSocket.Server({ port: 8080 });
wss.on('connection', (ws) => {
ws.isAlive = true;
ws.on('pong', () => {
ws.isAlive = true;
});
});
// Check heartbeat every 30 seconds
const interval = setInterval(() => {
wss.clients.forEach((ws) => {
if (!ws.isAlive) return ws.terminate();
ws.isAlive = false;
ws.ping();
});
}, 30000);
Connection Management and Error Handling
// Client-side
class WebSocketClient {
constructor(url) {
this.url = url;
this.reconnectAttempts = 0;
this.maxReconnectAttempts = 5;
this.reconnectDelay = 1000;
this.connect();
}
connect() {
this.ws = new WebSocket(this.url);
this.ws.onopen = () => {
console.log('Connected');
this.reconnectAttempts = 0;
};
this.ws.onclose = (event) => {
if (event.code !== 1000) {
this.reconnect();
}
};
this.ws.onerror = (error) => {
console.error('WebSocket error:', error);
};
}
reconnect() {
if (this.reconnectAttempts >= this.maxReconnectAttempts) {
console.error('Max reconnect attempts reached');
return;
}
this.reconnectAttempts++;
const delay = this.reconnectDelay * Math.pow(2, this.reconnectAttempts);
setTimeout(() => this.connect(), delay);
}
}
3. Working with Socket.IO
Socket.IO provides an abstraction layer on top of WebSocket to simplify real-time communication development.
Namespaces and Rooms
const io = require('socket.io')(server);
// Separate namespaces
const chatNamespace = io.of('/chat');
const notificationNamespace = io.of('/notifications');
chatNamespace.on('connection', (socket) => {
// Join room
socket.on('joinRoom', (roomId) => {
socket.join(roomId);
chatNamespace.to(roomId).emit('userJoined', {
userId: socket.id,
room: roomId
});
});
// Send message within room
socket.on('message', (data) => {
chatNamespace.to(data.roomId).emit('newMessage', {
userId: socket.id,
content: data.content,
timestamp: Date.now()
});
});
// Leave room
socket.on('leaveRoom', (roomId) => {
socket.leave(roomId);
});
});
Auto-Reconnection and Fallback
// Client configuration
const socket = io('https://api.example.com', {
// Transport layer fallback (HTTP Long Polling if WebSocket fails)
transports: ['websocket', 'polling'],
// Reconnection settings
reconnection: true,
reconnectionAttempts: 10,
reconnectionDelay: 1000,
reconnectionDelayMax: 5000,
randomizationFactor: 0.5,
// Timeout
timeout: 20000,
// Authentication
auth: {
token: 'my-jwt-token'
}
});
socket.on('connect_error', (error) => {
if (error.message === 'unauthorized') {
// Token refresh logic
refreshToken().then((newToken) => {
socket.auth.token = newToken;
socket.connect();
});
}
});
4. gRPC Streaming
gRPC supports four communication modes based on HTTP/2.
Four Streaming Modes
syntax = "proto3";
service ChatService {
// 1. Unary: one request - one response
rpc SendMessage (ChatMessage) returns (MessageResponse);
// 2. Server Streaming: one request - response stream
rpc SubscribeMessages (SubscriptionRequest) returns (stream ChatMessage);
// 3. Client Streaming: request stream - one response
rpc UploadLogs (stream LogEntry) returns (UploadSummary);
// 4. Bidirectional Streaming: request stream - response stream
rpc Chat (stream ChatMessage) returns (stream ChatMessage);
}
message ChatMessage {
string user_id = 1;
string content = 2;
int64 timestamp = 3;
}
Bidirectional Streaming Implementation (Go)
func (s *chatServer) Chat(stream pb.ChatService_ChatServer) error {
for {
msg, err := stream.Recv()
if err == io.EOF {
return nil
}
if err != nil {
return err
}
// Process and broadcast message
response := &pb.ChatMessage{
UserId: msg.UserId,
Content: msg.Content,
Timestamp: time.Now().UnixMilli(),
}
if err := stream.Send(response); err != nil {
return err
}
}
}
5. Practical Architecture: Real-Time Chat
Redis Pub/Sub + WebSocket
Redis Pub/Sub is used to synchronize messages across multiple WebSocket server instances.
[Client A] <--ws--> [Server 1] <--pub/sub--> [Redis]
[Client B] <--ws--> [Server 2] <--pub/sub--> [Redis]
[Client C] <--ws--> [Server 1] <--pub/sub--> [Redis]
Node.js Implementation
const Redis = require('ioredis');
const WebSocket = require('ws');
const redisPub = new Redis();
const redisSub = new Redis();
const wss = new WebSocket.Server({ port: 8080 });
// Manage clients per channel
const channels = new Map();
wss.on('connection', (ws) => {
ws.on('message', (raw) => {
const data = JSON.parse(raw);
switch (data.type) {
case 'subscribe':
// Subscribe to channel
if (!channels.has(data.channel)) {
channels.set(data.channel, new Set());
redisSub.subscribe(data.channel);
}
channels.get(data.channel).add(ws);
break;
case 'message':
// Broadcast message to all servers via Redis
redisPub.publish(data.channel, JSON.stringify({
user: data.user,
content: data.content,
timestamp: Date.now()
}));
break;
}
});
ws.on('close', () => {
// Remove from all channels on disconnect
channels.forEach((clients, channel) => {
clients.delete(ws);
if (clients.size === 0) {
channels.delete(channel);
redisSub.unsubscribe(channel);
}
});
});
});
// Deliver Redis messages to all clients in the channel
redisSub.on('message', (channel, message) => {
const clients = channels.get(channel);
if (clients) {
clients.forEach((ws) => {
if (ws.readyState === WebSocket.OPEN) {
ws.send(message);
}
});
}
});
Scalability Considerations
- Horizontal scaling: Synchronize messages across servers with Redis Pub/Sub or Redis Streams
- Connection limits: Set maximum connections per server (consider OS file descriptor limits)
- Load balancing: Sticky session or IP hash-based load balancing
- Message persistence: Store message history with Redis Streams or Kafka
- Authentication: JWT-based WebSocket handshake authentication
Part 2: Observability
6. The Three Pillars of Observability
Observability refers to the degree to which a system's internal state can be understood from its external outputs.
Metrics
Numeric time-series data.
- Request count (requests_total)
- Response time (response_duration_seconds)
- Error rate (error_rate_percent)
- CPU/memory utilization (cpu_usage_percent, memory_usage_bytes)
Logs
Text records of events that occurred at specific points in time.
{
"timestamp": "2026-04-12T16:00:00Z",
"level": "ERROR",
"service": "chat-server",
"trace_id": "abc123def456",
"message": "WebSocket connection failed",
"details": {
"user_id": "user-789",
"error_code": 1006,
"reason": "abnormal closure"
}
}
Traces
Track the complete path of a request as it traverses multiple services in a distributed system.
[API Gateway] ---> [Auth Service] ---> [Chat Service] ---> [Redis]
Span 1 Span 2 Span 3 Span 4
|____________ Trace ID: abc123 __________________________|
How the Three Pillars Relate
| Question | Resolution Tool |
|---|---|
| What went wrong? | Metrics (detect error rate spikes) |
| Why did it go wrong? | Traces (identify bottlenecks) |
| What exactly happened? | Logs (examine detailed error messages) |
7. OpenTelemetry
OpenTelemetry (OTel) is a CNCF project that collects metrics, logs, and traces in a vendor-neutral manner.
Architecture
[Application + OTel SDK]
|
| OTLP (gRPC/HTTP)
v
[OTel Collector]
| | |
v v v
[Jaeger] [Prom.] [Loki]
Auto-Instrumentation (Node.js)
// tracing.js - load before application starts
const { NodeSDK } = require('@opentelemetry/sdk-node');
const { OTLPTraceExporter } = require('@opentelemetry/exporter-trace-otlp-grpc');
const { OTLPMetricExporter } = require('@opentelemetry/exporter-metrics-otlp-grpc');
const { getNodeAutoInstrumentations } = require('@opentelemetry/auto-instrumentations-node');
const { PeriodicExportingMetricReader } = require('@opentelemetry/sdk-metrics');
const sdk = new NodeSDK({
serviceName: 'chat-service',
traceExporter: new OTLPTraceExporter({
url: 'http://otel-collector:4317',
}),
metricReader: new PeriodicExportingMetricReader({
exporter: new OTLPMetricExporter({
url: 'http://otel-collector:4317',
}),
exportIntervalMillis: 15000,
}),
instrumentations: [getNodeAutoInstrumentations()],
});
sdk.start();
OTel Collector Configuration
# otel-collector-config.yaml
receivers:
otlp:
protocols:
grpc:
endpoint: 0.0.0.0:4317
http:
endpoint: 0.0.0.0:4318
processors:
batch:
timeout: 5s
send_batch_size: 1024
memory_limiter:
check_interval: 1s
limit_mib: 512
exporters:
prometheus:
endpoint: 0.0.0.0:8889
otlp/jaeger:
endpoint: jaeger:4317
tls:
insecure: true
loki:
endpoint: http://loki:3100/loki/api/v1/push
service:
pipelines:
traces:
receivers: [otlp]
processors: [batch, memory_limiter]
exporters: [otlp/jaeger]
metrics:
receivers: [otlp]
processors: [batch]
exporters: [prometheus]
logs:
receivers: [otlp]
processors: [batch]
exporters: [loki]
8. Prometheus + Grafana
Prometheus Metric Types
1. Counter - Monotonically increasing value (e.g., total request count)
2. Gauge - Value that can increase or decrease (e.g., current active connections)
3. Histogram - Value distribution (e.g., response time distribution)
4. Summary - Percentiles (e.g., p99 response time)
WebSocket Custom Metrics Example (Node.js)
const promClient = require('prom-client');
// Counter: total connections
const wsConnectionsTotal = new promClient.Counter({
name: 'websocket_connections_total',
help: 'Total WebSocket connections',
labelNames: ['status'],
});
// Gauge: current active connections
const wsActiveConnections = new promClient.Gauge({
name: 'websocket_active_connections',
help: 'Current active WebSocket connections',
labelNames: ['namespace'],
});
// Histogram: message processing duration
const wsMessageDuration = new promClient.Histogram({
name: 'websocket_message_duration_seconds',
help: 'WebSocket message processing duration',
buckets: [0.001, 0.005, 0.01, 0.05, 0.1, 0.5, 1],
});
// Connect metrics to WebSocket events
wss.on('connection', (ws) => {
wsConnectionsTotal.inc({ status: 'connected' });
wsActiveConnections.inc({ namespace: 'default' });
ws.on('message', (data) => {
const end = wsMessageDuration.startTimer();
processMessage(data);
end();
});
ws.on('close', () => {
wsActiveConnections.dec({ namespace: 'default' });
});
});
Key PromQL Queries
# WebSocket connection request rate per second (5-minute moving average)
rate(websocket_connections_total[5m])
# Current active WebSocket connections
websocket_active_connections
# Message processing time p99
histogram_quantile(0.99, rate(websocket_message_duration_seconds_bucket[5m]))
# Error rate (percentage)
rate(websocket_errors_total[5m]) / rate(websocket_connections_total[5m]) * 100
# Instances with CPU utilization above 80%
node_cpu_usage_percent > 80
Grafana Dashboard Design
{
"dashboard": {
"title": "WebSocket Realtime Monitoring",
"panels": [
{
"title": "Active Connections",
"type": "stat",
"targets": [
{ "expr": "sum(websocket_active_connections)" }
]
},
{
"title": "Connection Rate",
"type": "timeseries",
"targets": [
{ "expr": "rate(websocket_connections_total[5m])" }
]
},
{
"title": "Message Latency (p50/p95/p99)",
"type": "timeseries",
"targets": [
{ "expr": "histogram_quantile(0.50, rate(websocket_message_duration_seconds_bucket[5m]))", "legendFormat": "p50" },
{ "expr": "histogram_quantile(0.95, rate(websocket_message_duration_seconds_bucket[5m]))", "legendFormat": "p95" },
{ "expr": "histogram_quantile(0.99, rate(websocket_message_duration_seconds_bucket[5m]))", "legendFormat": "p99" }
]
},
{
"title": "Error Rate",
"type": "gauge",
"targets": [
{ "expr": "rate(websocket_errors_total[5m]) / rate(websocket_connections_total[5m]) * 100" }
]
}
]
}
}
9. Distributed Tracing
Span and Trace Structure
Trace ID: 7f3a8b2c1d4e5f6a
Span A: API Gateway [============ ] 0-200ms
Span B: Auth Service [==== ] 10-50ms
Span C: Chat Service [================== ] 60-180ms
Span D: Redis Pub/Sub [==== ] 80-120ms
Span E: Database Write [======= ] 130-170ms
Jaeger Setup (Docker Compose)
version: '3.8'
services:
jaeger:
image: jaegertracing/all-in-one:1.54
ports:
- "16686:16686" # Jaeger UI
- "4317:4317" # OTLP gRPC
- "4318:4318" # OTLP HTTP
environment:
- COLLECTOR_OTLP_ENABLED=true
- SPAN_STORAGE_TYPE=elasticsearch
- ES_SERVER_URLS=http://elasticsearch:9200
Manual Span Creation (Python)
from opentelemetry import trace
from opentelemetry.trace import StatusCode
tracer = trace.get_tracer("chat-service")
def handle_websocket_message(message):
with tracer.start_as_current_span("process_message") as span:
span.set_attribute("message.type", message.get("type"))
span.set_attribute("message.size", len(str(message)))
try:
# Message processing logic
result = process(message)
span.set_attribute("processing.result", "success")
return result
except Exception as e:
span.set_status(StatusCode.ERROR, str(e))
span.record_exception(e)
raise
10. Log Aggregation
ELK Stack (Elasticsearch + Logstash + Kibana)
# logstash.conf
input {
tcp {
port => 5000
codec => json
}
}
filter {
if [level] == "ERROR" {
mutate {
add_tag => ["alert"]
}
}
date {
match => ["timestamp", "ISO8601"]
}
}
output {
elasticsearch {
hosts => ["http://elasticsearch:9200"]
index => "websocket-logs-%{+YYYY.MM.dd}"
}
}
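The `tcp` input above accepts JSON events over a plain socket. A minimal sketch of a client shipping such events (the helper names are hypothetical, and newline-delimited framing assumes the `json_lines`-style behavior; in production a buffered shipper like Filebeat is more typical than a hand-rolled socket client):

```python
import json
import socket
from datetime import datetime, timezone

def build_log_event(level: str, message: str, **fields) -> str:
    """Serialize one log event as a single JSON line, including the
    ISO8601 timestamp the Logstash date filter above matches on."""
    event = {
        "timestamp": datetime.now(timezone.utc).isoformat(),
        "level": level,
        "message": message,
        **fields,
    }
    return json.dumps(event) + "\n"

def ship_log(line: str, host: str = "logstash", port: int = 5000) -> None:
    # One short-lived connection per event, for clarity; a real client
    # would keep the connection open and buffer/retry on failure.
    with socket.create_connection((host, port), timeout=5) as sock:
        sock.sendall(line.encode("utf-8"))

# Usage:
# ship_log(build_log_event("ERROR", "parse failed", userId="user-456"))
```

Events with `"level": "ERROR"` get tagged `alert` by the filter block and land in the daily `websocket-logs-*` index.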
Loki + Grafana (Lightweight Alternative)
# loki-config.yaml
auth_enabled: false
server:
http_listen_port: 3100
ingester:
lifecycler:
ring:
kvstore:
store: inmemory
replication_factor: 1
schema_config:
configs:
- from: 2026-01-01
store: tsdb
object_store: filesystem
schema: v13
index:
prefix: index_
period: 24h
storage_config:
tsdb_shipper:
active_index_directory: /loki/index
cache_location: /loki/cache
filesystem:
directory: /loki/chunks
Structured Logging (Node.js + Winston)
const winston = require('winston');
const logger = winston.createLogger({
level: 'info',
format: winston.format.combine(
winston.format.timestamp(),
winston.format.json()
),
defaultMeta: { service: 'websocket-server' },
  transports: [
    new winston.transports.Console(),
    // Note: Loki's push API expects its own payload shape (streams + values),
    // so in practice the community `winston-loki` transport is used instead
    // of the generic HTTP transport sketched here.
    new winston.transports.Http({
      host: 'loki-gateway',
      port: 3100,
      path: '/loki/api/v1/push',
    }),
  ],
});
// Usage example
logger.info('WebSocket connection established', {
userId: 'user-123',
remoteAddress: '192.168.1.100',
protocol: 'wss',
});
logger.error('Message processing failed', {
userId: 'user-456',
errorCode: 'PARSE_ERROR',
rawMessage: '<sanitized>',
traceId: 'abc123',
});
11. Alert Design
Preventing Alert Fatigue
When there are too many alerts, even critical ones get ignored. Effective alert design principles are as follows.
| Principle | Description |
|---|---|
| Actionability | Receiving an alert should enable immediate action |
| Symptom-based | Alerts based on user impact (symptoms), not causes |
| Tiered | Separate into Critical / Warning / Info levels |
| Auto-resolve | Automatically close alerts when the issue resolves |
| Grouping | Bundle related alerts and send only once |
Prometheus Alerting Rules
# alerting-rules.yaml
groups:
- name: websocket-alerts
rules:
- alert: HighConnectionDropRate
expr: rate(websocket_disconnections_total[5m]) > 10
for: 5m
labels:
severity: warning
annotations:
summary: "High WebSocket disconnection rate"
description: "More than 10 disconnections per second over 5 minutes"
- alert: WebSocketLatencyHigh
expr: histogram_quantile(0.99, rate(websocket_message_duration_seconds_bucket[5m])) > 1
for: 3m
labels:
severity: critical
annotations:
summary: "High WebSocket message latency"
description: "p99 latency exceeds 1 second"
    - alert: ActiveConnectionsHigh
      expr: websocket_active_connections > 10000
      for: 1m
      labels:
        severity: warning
      annotations:
        summary: "Active WebSocket connection count high"
        description: "Active connection count exceeds 10,000"
Alertmanager Routing Configuration
# alertmanager.yaml
global:
resolve_timeout: 5m
route:
receiver: 'default-receiver'
group_by: ['alertname', 'severity']
group_wait: 30s
group_interval: 5m
repeat_interval: 4h
routes:
- match:
severity: critical
receiver: 'pagerduty-critical'
repeat_interval: 1h
- match:
severity: warning
receiver: 'slack-warnings'
repeat_interval: 4h
receivers:
- name: 'default-receiver'
slack_configs:
- api_url: 'https://hooks.slack.com/services/XXXX'
channel: '#alerts-general'
- name: 'pagerduty-critical'
pagerduty_configs:
- service_key: 'your-pagerduty-key'
severity: critical
- name: 'slack-warnings'
slack_configs:
- api_url: 'https://hooks.slack.com/services/XXXX'
channel: '#alerts-warning'
title: 'Warning Alert'
On-Call Integration: PagerDuty / OpsGenie
Setting up on-call rotation and escalation policies ensures that alerts are automatically routed to the responsible person during incidents.
[Prometheus Alert Fires]
|
v
[Alertmanager Receives and Groups]
|
v
[PagerDuty / OpsGenie Creates Incident]
|
v
[On-call Engineer Notified via Phone/SMS/Slack]
|
v
[If Not ACKed -> Escalation (Notify Manager)]
Final Checklist
Items to verify when applying real-time communication and observability in production environments.
Real-Time Communication
- WebSocket handshake authentication implemented
- Ping/Pong heartbeat interval configured
- Auto-reconnection with exponential backoff implemented
- Horizontal scaling prepared with Redis Pub/Sub
- Message persistence ensured with Kafka or Redis Streams
Observability
- All three pillars collected: metrics, logs, traces
- OpenTelemetry Collector deployed and pipeline configured
- Key indicators visualized in Grafana dashboards
- Symptom-based alerting rules configured
- On-call rotation and escalation policies established
- Alert thresholds continuously tuned to prevent alert fatigue
Quiz: Real-Time Communication and Observability Key Review
Q1. What is the biggest difference between WebSocket and SSE?
A1. WebSocket supports bidirectional (full-duplex) communication, while SSE only supports unidirectional push from server to client. WebSocket is suitable for scenarios where both sides need to send data, such as chat or gaming. SSE is suitable for scenarios where the server unilaterally sends data, such as stock tickers or notifications.
Q2. What are the three pillars of observability and what is the role of each?
A2. Metrics are numeric time-series data for problem detection. Traces track request paths across distributed systems for bottleneck identification. Logs provide detailed event records for root cause analysis.
Q3. What are the key principles for preventing alert fatigue?
A3. Set only actionable alerts, base alerts on symptoms rather than causes, tier them into Critical/Warning/Info levels, group related alerts, and implement auto-resolve detection.