WebSocket Real-Time Communication & Observability/Monitoring Complete Guide
- Author: Youngju Kim (@fjvbn20031)
- Part 1: Real-Time Communication
- Part 2: Observability
Part 1: Real-Time Communication
1. HTTP Polling vs Long Polling vs SSE vs WebSocket
Different real-time data delivery methods are suited to different requirements.
Comparison Table
| Feature | HTTP Polling | Long Polling | SSE | WebSocket |
|---|---|---|---|---|
| Direction | Unidirectional (client request) | Unidirectional (server response wait) | Unidirectional (server push) | Bidirectional (full-duplex) |
| Connection | New connection per request | Reconnect after response | Single HTTP connection | Single TCP connection |
| Overhead | High (repeated requests) | Medium | Low | Very low |
| Latency | Depends on polling interval | Immediate on event | Immediate on event | Immediate on event |
| Protocol | HTTP/1.1 | HTTP/1.1 | HTTP/1.1 | ws:// or wss:// |
| Browser Support | All | All | Except IE | All |
| Use Cases | Simple status checks | Notifications | Stock tickers, news feeds | Chat, gaming, collaborative editing |
Recommended Technology by Use Case
- Dashboard data refresh: SSE (server unilaterally pushes data)
- Real-time chat: WebSocket (bidirectional message exchange)
- File upload progress: SSE (server pushes progress updates)
- Online gaming: WebSocket (ultra-low-latency bidirectional communication)
- IoT sensor monitoring: WebSocket or MQTT
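The "Latency depends on polling interval" and "High overhead" rows of the table can be made concrete: with a polling interval of t seconds, an event waits on average t/2 seconds before the client notices it, and the client issues 3600/t requests per hour regardless of whether anything changed. A small sketch (function name illustrative):

```javascript
// Estimate the cost of fixed-interval HTTP polling.
// Average detection latency is half the interval, because an event
// lands at a uniformly random point between two polls; request count
// is simply how many polls fit in the observation period.
function pollingCost(intervalSeconds, periodSeconds = 3600) {
  return {
    requests: Math.ceil(periodSeconds / intervalSeconds),
    avgDetectionLatencySeconds: intervalSeconds / 2,
  };
}

// 5-second polling costs 720 requests per hour for ~2.5 s average
// latency -- the overhead that long polling, SSE, and WebSocket avoid.
```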
2. WebSocket Protocol Deep Dive
Handshake Process
A WebSocket connection begins with an HTTP Upgrade request.
GET /chat HTTP/1.1
Host: example.com
Upgrade: websocket
Connection: Upgrade
Sec-WebSocket-Key: dGhlIHNhbXBsZSBub25jZQ==
Sec-WebSocket-Version: 13
Server response:
HTTP/1.1 101 Switching Protocols
Upgrade: websocket
Connection: Upgrade
Sec-WebSocket-Accept: s3pPLMBiTxaQ9kYGzzhZRbK+xOo=
Frame Structure
WebSocket frames have the following binary structure:
 0                   1                   2                   3
 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1
+-+-+-+-+-------+-+-------------+-------------------------------+
|F|R|R|R| opcode|M| Payload len |    Extended payload length    |
|I|S|S|S|  (4)  |A|     (7)     |            (16/64)            |
|N|V|V|V|       |S|             |  (if payload len==126/127)    |
| |1|2|3|       |K|             |                               |
+-+-+-+-+-------+-+-------------+ - - - - - - - - - - - - - - - +
|     Extended payload length continued, if payload len == 127  |
+ - - - - - - - - - - - - - - - +-------------------------------+
|                               | Masking-key, if MASK set to 1 |
+-------------------------------+-------------------------------+
| Masking-key (continued)       |          Payload Data         |
+-------------------------------+ - - - - - - - - - - - - - - - +
:                     Payload Data continued ...                :
+ - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - +
Key opcode values:
- 0x1: Text frame
- 0x2: Binary frame
- 0x8: Connection close
- 0x9: Ping
- 0xA: Pong
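The header fields in the diagram can be decoded with a few bitwise operations. A minimal parser for small, complete frames (no fragmentation and no 64-bit lengths; names are illustrative, not the `ws` library's internals):

```javascript
// Parse one complete WebSocket frame from a Buffer.
// Handles 7-bit and 16-bit payload lengths and client-side masking.
function parseFrame(buf) {
  const fin = (buf[0] & 0x80) !== 0;    // FIN: final fragment flag
  const opcode = buf[0] & 0x0f;         // 0x1 text, 0x2 binary, ...
  const masked = (buf[1] & 0x80) !== 0; // MASK bit
  let len = buf[1] & 0x7f;
  let offset = 2;
  if (len === 126) {                    // 16-bit extended payload length
    len = buf.readUInt16BE(2);
    offset = 4;
  }
  let maskKey = null;
  if (masked) {                         // 4-byte masking key precedes payload
    maskKey = buf.subarray(offset, offset + 4);
    offset += 4;
  }
  const payload = Buffer.from(buf.subarray(offset, offset + len));
  if (masked) {
    for (let i = 0; i < payload.length; i++) {
      payload[i] ^= maskKey[i % 4];     // unmask: XOR with repeating key
    }
  }
  return { fin, opcode, payload };
}

// An unmasked text frame (opcode 0x1) carrying "Hello":
// parseFrame(Buffer.from([0x81, 0x05, 0x48, 0x65, 0x6c, 0x6c, 0x6f]))
```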
Ping/Pong Heartbeat
// Server-side (Node.js)
const WebSocket = require('ws');
const wss = new WebSocket.Server({ port: 8080 });
wss.on('connection', (ws) => {
ws.isAlive = true;
ws.on('pong', () => {
ws.isAlive = true;
});
});
// Check heartbeat every 30 seconds
const interval = setInterval(() => {
wss.clients.forEach((ws) => {
if (!ws.isAlive) return ws.terminate();
ws.isAlive = false;
ws.ping();
});
}, 30000);
// Stop the heartbeat timer when the server shuts down
wss.on('close', () => clearInterval(interval));
Connection Management and Error Handling
// Client-side
class WebSocketClient {
constructor(url) {
this.url = url;
this.reconnectAttempts = 0;
this.maxReconnectAttempts = 5;
this.reconnectDelay = 1000;
this.connect();
}
connect() {
this.ws = new WebSocket(this.url);
this.ws.onopen = () => {
console.log('Connected');
this.reconnectAttempts = 0;
};
this.ws.onclose = (event) => {
if (event.code !== 1000) {
this.reconnect();
}
};
this.ws.onerror = (error) => {
console.error('WebSocket error:', error);
};
}
reconnect() {
if (this.reconnectAttempts >= this.maxReconnectAttempts) {
console.error('Max reconnect attempts reached');
return;
}
this.reconnectAttempts++;
const delay = this.reconnectDelay * Math.pow(2, this.reconnectAttempts);
setTimeout(() => this.connect(), delay);
}
}
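The client above uses pure exponential backoff; in production a random jitter is usually added so that many clients dropped at the same moment (e.g. by a server restart) do not all reconnect in lockstep. A "full jitter" sketch with an injectable random source (names illustrative):

```javascript
// Exponential backoff with full jitter: the delay is drawn uniformly
// from [0, min(cap, base * 2^attempt)], spreading reconnects out
// instead of synchronizing them.
function backoffDelay(attempt, { base = 1000, cap = 30000, rng = Math.random } = {}) {
  const ceiling = Math.min(cap, base * Math.pow(2, attempt));
  return rng() * ceiling;
}

// attempt 0 -> up to 1 s, attempt 3 -> up to 8 s,
// attempt 10 -> capped at 30 s
```

Plugging this into the `reconnect()` method above would replace the fixed `reconnectDelay * Math.pow(2, ...)` computation.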
3. Working with Socket.IO
Socket.IO provides an abstraction layer on top of WebSocket to simplify real-time communication development.
Namespaces and Rooms
const io = require('socket.io')(server);
// Separate namespaces
const chatNamespace = io.of('/chat');
const notificationNamespace = io.of('/notifications');
chatNamespace.on('connection', (socket) => {
// Join room
socket.on('joinRoom', (roomId) => {
socket.join(roomId);
chatNamespace.to(roomId).emit('userJoined', {
userId: socket.id,
room: roomId
});
});
// Send message within room
socket.on('message', (data) => {
chatNamespace.to(data.roomId).emit('newMessage', {
userId: socket.id,
content: data.content,
timestamp: Date.now()
});
});
// Leave room
socket.on('leaveRoom', (roomId) => {
socket.leave(roomId);
});
});
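Conceptually, a room is just a named set of socket ids maintained on the server: `socket.join`/`socket.leave` update the set, and `to(room).emit` iterates it. A simplified in-memory model of that bookkeeping (not Socket.IO's actual internals, which also handle multi-node adapters):

```javascript
// Minimal room registry mirroring join/leave semantics.
class Rooms {
  constructor() {
    this.rooms = new Map(); // roomId -> Set<socketId>
  }
  join(roomId, socketId) {
    if (!this.rooms.has(roomId)) this.rooms.set(roomId, new Set());
    this.rooms.get(roomId).add(socketId);
  }
  leave(roomId, socketId) {
    const members = this.rooms.get(roomId);
    if (!members) return;
    members.delete(socketId);
    if (members.size === 0) this.rooms.delete(roomId); // drop empty rooms
  }
  members(roomId) {
    return [...(this.rooms.get(roomId) || [])];
  }
}
```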
Auto-Reconnection and Fallback
// Client configuration
const socket = io('https://api.example.com', {
// Transport layer fallback (HTTP Long Polling if WebSocket fails)
transports: ['websocket', 'polling'],
// Reconnection settings
reconnection: true,
reconnectionAttempts: 10,
reconnectionDelay: 1000,
reconnectionDelayMax: 5000,
randomizationFactor: 0.5,
// Timeout
timeout: 20000,
// Authentication
auth: {
token: 'my-jwt-token'
}
});
socket.on('connect_error', (error) => {
if (error.message === 'unauthorized') {
// Token refresh logic
refreshToken().then((newToken) => {
socket.auth.token = newToken;
socket.connect();
});
}
});
4. gRPC Streaming
gRPC supports four communication modes based on HTTP/2.
Four Streaming Modes
syntax = "proto3";
service ChatService {
// 1. Unary: one request - one response
rpc SendMessage (ChatMessage) returns (MessageResponse);
// 2. Server Streaming: one request - response stream
rpc SubscribeMessages (SubscriptionRequest) returns (stream ChatMessage);
// 3. Client Streaming: request stream - one response
rpc UploadLogs (stream LogEntry) returns (UploadSummary);
// 4. Bidirectional Streaming: request stream - response stream
rpc Chat (stream ChatMessage) returns (stream ChatMessage);
}
message ChatMessage {
string user_id = 1;
string content = 2;
int64 timestamp = 3;
}
Bidirectional Streaming Implementation (Go)
func (s *chatServer) Chat(stream pb.ChatService_ChatServer) error {
for {
msg, err := stream.Recv()
if err == io.EOF {
return nil
}
if err != nil {
return err
}
// Process and broadcast message
response := &pb.ChatMessage{
UserId: msg.UserId,
Content: msg.Content,
Timestamp: time.Now().UnixMilli(),
}
if err := stream.Send(response); err != nil {
return err
}
}
}
5. Practical Architecture: Real-Time Chat
Redis Pub/Sub + WebSocket
Redis Pub/Sub is used to synchronize messages across multiple WebSocket server instances.
[Client A] <--ws--> [Server 1] <--pub/sub--> [Redis]
[Client B] <--ws--> [Server 2] <--pub/sub--> [Redis]
[Client C] <--ws--> [Server 1] <--pub/sub--> [Redis]
Node.js Implementation
const Redis = require('ioredis');
const WebSocket = require('ws');
const redisPub = new Redis();
const redisSub = new Redis();
const wss = new WebSocket.Server({ port: 8080 });
// Manage clients per channel
const channels = new Map();
wss.on('connection', (ws) => {
ws.on('message', (raw) => {
const data = JSON.parse(raw);
switch (data.type) {
case 'subscribe':
// Subscribe to channel
if (!channels.has(data.channel)) {
channels.set(data.channel, new Set());
redisSub.subscribe(data.channel);
}
channels.get(data.channel).add(ws);
break;
case 'message':
// Broadcast message to all servers via Redis
redisPub.publish(data.channel, JSON.stringify({
user: data.user,
content: data.content,
timestamp: Date.now()
}));
break;
}
});
ws.on('close', () => {
// Remove from all channels on disconnect
channels.forEach((clients, channel) => {
clients.delete(ws);
if (clients.size === 0) {
channels.delete(channel);
redisSub.unsubscribe(channel);
}
});
});
});
// Deliver Redis messages to all clients in the channel
redisSub.on('message', (channel, message) => {
const clients = channels.get(channel);
if (clients) {
clients.forEach((ws) => {
if (ws.readyState === WebSocket.OPEN) {
ws.send(message);
}
});
}
});
Scalability Considerations
- Horizontal scaling: Synchronize messages across servers with Redis Pub/Sub or Redis Streams
- Connection limits: Set maximum connections per server (consider OS file descriptor limits)
- Load balancing: Sticky session or IP hash-based load balancing
- Message persistence: Store message history with Redis Streams or Kafka
- Authentication: JWT-based WebSocket handshake authentication
Part 2: Observability
6. The Three Pillars of Observability
Observability refers to the degree to which a system's internal state can be understood from its external outputs.
Metrics
Numeric time-series data.
- Request count (requests_total)
- Response time (response_duration_seconds)
- Error rate (error_rate_percent)
- CPU/memory utilization (cpu_usage_percent, memory_usage_bytes)
Logs
Text records of events that occurred at specific points in time.
{
"timestamp": "2026-04-12T16:00:00Z",
"level": "ERROR",
"service": "chat-server",
"trace_id": "abc123def456",
"message": "WebSocket connection failed",
"details": {
"user_id": "user-789",
"error_code": 1006,
"reason": "abnormal closure"
}
}
Traces
Track the complete path of a request as it traverses multiple services in a distributed system.
[API Gateway] ---> [Auth Service] ---> [Chat Service] ---> [Redis]
Span 1 Span 2 Span 3 Span 4
|____________ Trace ID: abc123 __________________________|
How the Three Pillars Relate
| Question | Resolution Tool |
|---|---|
| What went wrong? | Metrics (detect error rate spikes) |
| Why did it go wrong? | Traces (identify bottlenecks) |
| What exactly happened? | Logs (examine detailed error messages) |
7. OpenTelemetry
OpenTelemetry (OTel) is a CNCF project that collects metrics, logs, and traces in a vendor-neutral manner.
Architecture
[Application + OTel SDK]
|
| OTLP (gRPC/HTTP)
v
[OTel Collector]
| | |
v v v
[Jaeger] [Prom.] [Loki]
Auto-Instrumentation (Node.js)
// tracing.js - load before application starts
const { NodeSDK } = require('@opentelemetry/sdk-node');
const { OTLPTraceExporter } = require('@opentelemetry/exporter-trace-otlp-grpc');
const { OTLPMetricExporter } = require('@opentelemetry/exporter-metrics-otlp-grpc');
const { getNodeAutoInstrumentations } = require('@opentelemetry/auto-instrumentations-node');
const { PeriodicExportingMetricReader } = require('@opentelemetry/sdk-metrics');
const sdk = new NodeSDK({
serviceName: 'chat-service',
traceExporter: new OTLPTraceExporter({
url: 'http://otel-collector:4317',
}),
metricReader: new PeriodicExportingMetricReader({
exporter: new OTLPMetricExporter({
url: 'http://otel-collector:4317',
}),
exportIntervalMillis: 15000,
}),
instrumentations: [getNodeAutoInstrumentations()],
});
sdk.start();
OTel Collector Configuration
# otel-collector-config.yaml
receivers:
otlp:
protocols:
grpc:
endpoint: 0.0.0.0:4317
http:
endpoint: 0.0.0.0:4318
processors:
batch:
timeout: 5s
send_batch_size: 1024
memory_limiter:
check_interval: 1s
limit_mib: 512
exporters:
prometheus:
endpoint: 0.0.0.0:8889
otlp/jaeger:
endpoint: jaeger:4317
tls:
insecure: true
loki:
endpoint: http://loki:3100/loki/api/v1/push
service:
pipelines:
traces:
receivers: [otlp]
processors: [memory_limiter, batch] # memory_limiter should run first in the pipeline
exporters: [otlp/jaeger]
metrics:
receivers: [otlp]
processors: [batch]
exporters: [prometheus]
logs:
receivers: [otlp]
processors: [batch]
exporters: [loki]
8. Prometheus + Grafana
Prometheus Metric Types
1. Counter - Monotonically increasing value (e.g., total request count)
2. Gauge - Value that can increase or decrease (e.g., current active connections)
3. Histogram - Value distribution (e.g., response time distribution)
4. Summary - Client-side precomputed percentiles (e.g., p99 response time); unlike Histograms, Summary quantiles cannot be aggregated across instances
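Histograms store only cumulative per-bucket counts, and the server approximates quantiles from them by linear interpolation inside the bucket where the target rank falls. This is essentially what PromQL's `histogram_quantile()` does. A simplified sketch (real PromQL also handles rates and edge cases like `+Inf` buckets):

```javascript
// Approximate a quantile from cumulative histogram buckets.
// `buckets` is an array of { le, count } with cumulative counts,
// sorted by ascending upper bound `le`.
function histogramQuantile(q, buckets) {
  const total = buckets[buckets.length - 1].count;
  const rank = q * total;
  let prevLe = 0;
  let prevCount = 0;
  for (const { le, count } of buckets) {
    if (count >= rank) {
      // Interpolate linearly within this bucket.
      const fraction = (rank - prevCount) / (count - prevCount);
      return prevLe + (le - prevLe) * fraction;
    }
    prevLe = le;
    prevCount = count;
  }
  return prevLe;
}
```

The takeaway for instrumentation: quantile accuracy depends entirely on how well the configured bucket boundaries bracket the real latency distribution.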
WebSocket Custom Metrics Example (Node.js)
const promClient = require('prom-client');
// Counter: total connections
const wsConnectionsTotal = new promClient.Counter({
name: 'websocket_connections_total',
help: 'Total WebSocket connections',
labelNames: ['status'],
});
// Gauge: current active connections
const wsActiveConnections = new promClient.Gauge({
name: 'websocket_active_connections',
help: 'Current active WebSocket connections',
labelNames: ['namespace'],
});
// Histogram: message processing duration
const wsMessageDuration = new promClient.Histogram({
name: 'websocket_message_duration_seconds',
help: 'WebSocket message processing duration',
buckets: [0.001, 0.005, 0.01, 0.05, 0.1, 0.5, 1],
});
// Connect metrics to WebSocket events
wss.on('connection', (ws) => {
wsConnectionsTotal.inc({ status: 'connected' });
wsActiveConnections.inc({ namespace: 'default' });
ws.on('message', (data) => {
const end = wsMessageDuration.startTimer();
processMessage(data);
end();
});
ws.on('close', () => {
wsActiveConnections.dec({ namespace: 'default' });
});
});
Key PromQL Queries
# WebSocket connection request rate per second (5-minute moving average)
rate(websocket_connections_total[5m])
# Current active WebSocket connections
websocket_active_connections
# Message processing time p99
histogram_quantile(0.99, rate(websocket_message_duration_seconds_bucket[5m]))
# Error rate (percentage)
rate(websocket_errors_total[5m]) / rate(websocket_connections_total[5m]) * 100
# Instances with CPU utilization above 80%
node_cpu_usage_percent > 80
Grafana Dashboard Design
{
"dashboard": {
"title": "WebSocket Realtime Monitoring",
"panels": [
{
"title": "Active Connections",
"type": "stat",
"targets": [
{ "expr": "sum(websocket_active_connections)" }
]
},
{
"title": "Connection Rate",
"type": "timeseries",
"targets": [
{ "expr": "rate(websocket_connections_total[5m])" }
]
},
{
"title": "Message Latency (p50/p95/p99)",
"type": "timeseries",
"targets": [
{ "expr": "histogram_quantile(0.50, rate(websocket_message_duration_seconds_bucket[5m]))", "legendFormat": "p50" },
{ "expr": "histogram_quantile(0.95, rate(websocket_message_duration_seconds_bucket[5m]))", "legendFormat": "p95" },
{ "expr": "histogram_quantile(0.99, rate(websocket_message_duration_seconds_bucket[5m]))", "legendFormat": "p99" }
]
},
{
"title": "Error Rate",
"type": "gauge",
"targets": [
{ "expr": "rate(websocket_errors_total[5m]) / rate(websocket_connections_total[5m]) * 100" }
]
}
]
}
}
9. Distributed Tracing
Span and Trace Structure
Trace ID: 7f3a8b2c1d4e5f6a
Span A: API Gateway [============ ] 0-200ms
Span B: Auth Service [==== ] 10-50ms
Span C: Chat Service [================== ] 60-180ms
Span D: Redis Pub/Sub [==== ] 80-120ms
Span E: Database Write [======= ] 130-170ms
Jaeger Setup (Docker Compose)
version: '3.8'
services:
jaeger:
image: jaegertracing/all-in-one:1.54
ports:
- "16686:16686" # Jaeger UI
- "4317:4317" # OTLP gRPC
- "4318:4318" # OTLP HTTP
environment:
- COLLECTOR_OTLP_ENABLED=true
- SPAN_STORAGE_TYPE=elasticsearch
- ES_SERVER_URLS=http://elasticsearch:9200
Manual Span Creation (Python)
from opentelemetry import trace
from opentelemetry.trace import StatusCode
tracer = trace.get_tracer("chat-service")
def handle_websocket_message(message):
with tracer.start_as_current_span("process_message") as span:
span.set_attribute("message.type", message.get("type"))
span.set_attribute("message.size", len(str(message)))
try:
# Message processing logic
result = process(message)
span.set_attribute("processing.result", "success")
return result
except Exception as e:
span.set_status(StatusCode.ERROR, str(e))
span.record_exception(e)
raise
10. Log Aggregation
ELK Stack (Elasticsearch + Logstash + Kibana)
# logstash.conf
input {
tcp {
port => 5000
codec => json
}
}
filter {
if [level] == "ERROR" {
mutate {
add_tag => ["alert"]
}
}
date {
match => ["timestamp", "ISO8601"]
}
}
output {
elasticsearch {
hosts => ["http://elasticsearch:9200"]
index => "websocket-logs-%{+YYYY.MM.dd}"
}
}
Loki + Grafana (Lightweight Alternative)
# loki-config.yaml
auth_enabled: false
server:
http_listen_port: 3100
ingester:
lifecycler:
ring:
kvstore:
store: inmemory
replication_factor: 1
schema_config:
configs:
- from: 2026-01-01
store: tsdb
object_store: filesystem
schema: v13
index:
prefix: index_
period: 24h
storage_config:
tsdb_shipper:
active_index_directory: /loki/index
cache_location: /loki/cache
filesystem:
directory: /loki/chunks
Structured Logging (Node.js + Winston)
const winston = require('winston');
const logger = winston.createLogger({
level: 'info',
format: winston.format.combine(
winston.format.timestamp(),
winston.format.json()
),
defaultMeta: { service: 'websocket-server' },
transports: [
new winston.transports.Console(),
// Note: Loki's push API expects its own JSON stream format, which the
// generic Http transport does not produce. In practice, logs are
// shipped with a Loki-aware transport (e.g. the community winston-loki
// package) or by running Promtail against the service's stdout.
new winston.transports.Http({
host: 'loki-gateway',
port: 3100,
path: '/loki/api/v1/push',
}),
],
});
// Usage example
logger.info('WebSocket connection established', {
userId: 'user-123',
remoteAddress: '192.168.1.100',
protocol: 'wss',
});
logger.error('Message processing failed', {
userId: 'user-456',
errorCode: 'PARSE_ERROR',
rawMessage: '<sanitized>',
traceId: 'abc123',
});
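The `rawMessage: '<sanitized>'` field above hints at an important practice: user-supplied payloads and credentials should be masked before they reach the log pipeline. A minimal sketch (the deny-list keys here are illustrative; a real list depends on the application):

```javascript
// Replace sensitive fields in a log metadata object before emitting it.
const SENSITIVE_KEYS = new Set(['token', 'password', 'rawMessage']);

function sanitize(meta) {
  const out = {};
  for (const [key, value] of Object.entries(meta)) {
    out[key] = SENSITIVE_KEYS.has(key) ? '<sanitized>' : value;
  }
  return out;
}

// logger.error('Message processing failed', sanitize(meta));
```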
11. Alert Design
Preventing Alert Fatigue
When there are too many alerts, even critical ones get ignored. Effective alert design principles are as follows.
| Principle | Description |
|---|---|
| Actionability | Receiving an alert should enable immediate action |
| Symptom-based | Alerts based on user impact (symptoms), not causes |
| Tiered | Separate into Critical / Warning / Info levels |
| Auto-resolve | Automatically close alerts when the issue resolves |
| Grouping | Bundle related alerts and send only once |
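The "Grouping" principle in the table corresponds to Alertmanager's `group_by`: alerts that share the same values for the grouping labels are bundled into a single notification. A sketch of that bucketing logic (illustrative, not Alertmanager's implementation):

```javascript
// Group alerts by the values of selected labels, like Alertmanager's
// group_by. Returns a Map from group key to the alerts in that group.
function groupAlerts(alerts, groupBy) {
  const groups = new Map();
  for (const alert of alerts) {
    const key = groupBy.map((l) => `${l}=${alert.labels[l]}`).join(',');
    if (!groups.has(key)) groups.set(key, []);
    groups.get(key).push(alert);
  }
  return groups;
}
```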
Prometheus Alerting Rules
# alerting-rules.yaml
groups:
- name: websocket-alerts
rules:
- alert: HighConnectionDropRate
expr: rate(websocket_disconnections_total[5m]) > 10
for: 5m
labels:
severity: warning
annotations:
summary: "High WebSocket disconnection rate"
description: "More than 10 disconnections per second over 5 minutes"
- alert: WebSocketLatencyHigh
expr: histogram_quantile(0.99, rate(websocket_message_duration_seconds_bucket[5m])) > 1
for: 3m
labels:
severity: critical
annotations:
summary: "High WebSocket message latency"
description: "p99 latency exceeds 1 second"
- alert: ActiveConnectionsSpiking
expr: websocket_active_connections > 10000
for: 1m
labels:
severity: warning
annotations:
summary: "Active WebSocket connections spiking"
description: "Active connection count exceeds 10,000"
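The `for:` clause in the rules above means an alert only fires after its expression has been continuously true for the given duration; until then it sits in a "pending" state, which filters out short transient spikes. A toy model of that state machine over fixed evaluation ticks (illustrative, not Prometheus internals):

```javascript
// Model Prometheus's `for:` clause: given boolean evaluation results
// at fixed intervals, return the alert state at each tick.
// 'inactive' while false, 'pending' while true but not yet held for
// `forTicks` consecutive evaluations, 'firing' afterwards.
function alertStates(evaluations, forTicks) {
  let held = 0; // consecutive true evaluations so far
  return evaluations.map((isTrue) => {
    held = isTrue ? held + 1 : 0;
    if (held === 0) return 'inactive';
    return held >= forTicks ? 'firing' : 'pending';
  });
}
```

Note how a single false evaluation resets the counter: a condition that flaps never fires, which is exactly the debouncing behavior `for:` is meant to provide.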
Alertmanager Routing Configuration
# alertmanager.yaml
global:
resolve_timeout: 5m
route:
receiver: 'default-receiver'
group_by: ['alertname', 'severity']
group_wait: 30s
group_interval: 5m
repeat_interval: 4h
routes:
- match:
severity: critical
receiver: 'pagerduty-critical'
repeat_interval: 1h
- match:
severity: warning
receiver: 'slack-warnings'
repeat_interval: 4h
receivers:
- name: 'default-receiver'
slack_configs:
- api_url: 'https://hooks.slack.com/services/XXXX'
channel: '#alerts-general'
- name: 'pagerduty-critical'
pagerduty_configs:
- service_key: 'your-pagerduty-key'
severity: critical
- name: 'slack-warnings'
slack_configs:
- api_url: 'https://hooks.slack.com/services/XXXX'
channel: '#alerts-warning'
title: 'Warning Alert'
On-Call Integration: PagerDuty / OpsGenie
Setting up on-call rotation and escalation policies ensures that alerts are automatically routed to the responsible person during incidents.
[Prometheus Alert Fires]
|
v
[Alertmanager Receives and Groups]
|
v
[PagerDuty / OpsGenie Creates Incident]
|
v
[On-call Engineer Notified via Phone/SMS/Slack]
|
v
[If Not ACKed -> Escalation (Notify Manager)]
Final Checklist
Items to verify when applying real-time communication and observability in production environments.
Real-Time Communication
- WebSocket handshake authentication implemented
- Ping/Pong heartbeat interval configured
- Auto-reconnection with exponential backoff implemented
- Horizontal scaling prepared with Redis Pub/Sub
- Message persistence ensured with Kafka or Redis Streams
Observability
- All three pillars collected: metrics, logs, traces
- OpenTelemetry Collector deployed and pipeline configured
- Key indicators visualized in Grafana dashboards
- Symptom-based alerting rules configured
- On-call rotation and escalation policies established
- Alert thresholds continuously tuned to prevent alert fatigue
Quiz: Real-Time Communication and Observability Key Review
Q1. What is the biggest difference between WebSocket and SSE?
A1. WebSocket supports bidirectional (full-duplex) communication, while SSE only supports unidirectional push from server to client. WebSocket is suitable for scenarios where both sides need to send data, such as chat or gaming. SSE is suitable for scenarios where the server unilaterally sends data, such as stock tickers or notifications.
Q2. What are the three pillars of observability and what is the role of each?
A2. Metrics are numeric time-series data for problem detection. Traces track request paths across distributed systems for bottleneck identification. Logs provide detailed event records for root cause analysis.
Q3. What are the key principles for preventing alert fatigue?
A3. Set only actionable alerts, base alerts on symptoms rather than causes, tier them into Critical/Warning/Info levels, group related alerts, and implement auto-resolve detection.