WebSocket Real-Time Communication & Observability/Monitoring Complete Guide
- Author: Youngju Kim (@fjvbn20031)
- Part 1: Real-Time Communication
- Part 2: Observability
Part 1: Real-Time Communication
1. HTTP Polling vs Long Polling vs SSE vs WebSocket
Different real-time data delivery methods are suited to different requirements.
Comparison Table
| Feature | HTTP Polling | Long Polling | SSE | WebSocket |
|---|---|---|---|---|
| Direction | Unidirectional (client request) | Unidirectional (server response wait) | Unidirectional (server push) | Bidirectional (full-duplex) |
| Connection | New connection per request | Reconnect after response | Single HTTP connection | Single TCP connection |
| Overhead | High (repeated requests) | Medium | Low | Very low |
| Latency | Depends on polling interval | Immediate on event | Immediate on event | Immediate on event |
| Protocol | HTTP/1.1 | HTTP/1.1 | HTTP/1.1 | ws:// or wss:// |
| Browser Support | All | All | Except IE | All |
| Use Cases | Simple status checks | Notifications | Stock tickers, news feeds | Chat, gaming, collaborative editing |
Recommended Technology by Use Case
- Dashboard data refresh: SSE (server unilaterally pushes data)
- Real-time chat: WebSocket (bidirectional message exchange)
- File upload progress: SSE (server pushes progress updates)
- Online gaming: WebSocket (ultra-low-latency bidirectional communication)
- IoT sensor monitoring: WebSocket or MQTT
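The "Latency depends on polling interval" and "High overhead" rows of the table can be made concrete: with a polling interval of t seconds, an event waits on average t/2 seconds before the client notices it, and the client issues 3600/t requests per hour regardless of whether anything changed. A small sketch (function name illustrative):

```javascript
// Estimate the cost of fixed-interval HTTP polling.
// Average detection latency is half the interval, because an event
// lands at a uniformly random point between two polls; request count
// is simply how many polls fit in the observation period.
function pollingCost(intervalSeconds, periodSeconds = 3600) {
  return {
    requests: Math.ceil(periodSeconds / intervalSeconds),
    avgDetectionLatencySeconds: intervalSeconds / 2,
  };
}

// 5-second polling costs 720 requests per hour for ~2.5 s average
// latency -- the overhead that long polling, SSE, and WebSocket avoid.
```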
2. WebSocket Protocol Deep Dive
Handshake Process
A WebSocket connection begins with an HTTP Upgrade request.
GET /chat HTTP/1.1
Host: example.com
Upgrade: websocket
Connection: Upgrade
Sec-WebSocket-Key: dGhlIHNhbXBsZSBub25jZQ==
Sec-WebSocket-Version: 13
Server response:
HTTP/1.1 101 Switching Protocols
Upgrade: websocket
Connection: Upgrade
Sec-WebSocket-Accept: s3pPLMBiTxaQ9kYGzzhZRbK+xOo=
Frame Structure
WebSocket frames have the following binary structure:
 0                   1                   2                   3
 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1
+-+-+-+-+-------+-+-------------+-------------------------------+
|F|R|R|R| opcode|M| Payload len |    Extended payload length    |
|I|S|S|S|  (4)  |A|     (7)     |            (16/64)            |
|N|V|V|V|       |S|             |  (if payload len==126/127)    |
| |1|2|3|       |K|             |                               |
+-+-+-+-+-------+-+-------------+ - - - - - - - - - - - - - - - +
|     Extended payload length continued, if payload len == 127  |
+ - - - - - - - - - - - - - - - +-------------------------------+
|                               | Masking-key, if MASK set to 1 |
+-------------------------------+-------------------------------+
| Masking-key (continued)       |          Payload Data         |
+-------------------------------+ - - - - - - - - - - - - - - - +
:                     Payload Data continued ...                :
+ - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - +
Key opcode values:
- 0x1: Text frame
- 0x2: Binary frame
- 0x8: Connection close
- 0x9: Ping
- 0xA: Pong
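The header fields in the diagram can be decoded with a few bitwise operations. A minimal parser for small, complete frames (no fragmentation and no 64-bit lengths; names are illustrative, not the `ws` library's internals):

```javascript
// Parse one complete WebSocket frame from a Buffer.
// Handles 7-bit and 16-bit payload lengths and client-side masking.
function parseFrame(buf) {
  const fin = (buf[0] & 0x80) !== 0;    // FIN: final fragment flag
  const opcode = buf[0] & 0x0f;         // 0x1 text, 0x2 binary, ...
  const masked = (buf[1] & 0x80) !== 0; // MASK bit
  let len = buf[1] & 0x7f;
  let offset = 2;
  if (len === 126) {                    // 16-bit extended payload length
    len = buf.readUInt16BE(2);
    offset = 4;
  }
  let maskKey = null;
  if (masked) {                         // 4-byte masking key precedes payload
    maskKey = buf.subarray(offset, offset + 4);
    offset += 4;
  }
  const payload = Buffer.from(buf.subarray(offset, offset + len));
  if (masked) {
    for (let i = 0; i < payload.length; i++) {
      payload[i] ^= maskKey[i % 4];     // unmask: XOR with repeating key
    }
  }
  return { fin, opcode, payload };
}

// An unmasked text frame (opcode 0x1) carrying "Hello":
// parseFrame(Buffer.from([0x81, 0x05, 0x48, 0x65, 0x6c, 0x6c, 0x6f]))
```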
Ping/Pong Heartbeat
// Server-side (Node.js)
const WebSocket = require('ws');
const wss = new WebSocket.Server({ port: 8080 });
wss.on('connection', (ws) => {
ws.isAlive = true;
ws.on('pong', () => {
ws.isAlive = true;
});
});
// Check heartbeat every 30 seconds
const interval = setInterval(() => {
wss.clients.forEach((ws) => {
if (!ws.isAlive) return ws.terminate();
ws.isAlive = false;
ws.ping();
});
}, 30000);
// Stop the heartbeat timer when the server shuts down
wss.on('close', () => clearInterval(interval));
Connection Management and Error Handling
// Client-side
class WebSocketClient {
constructor(url) {
this.url = url;
this.reconnectAttempts = 0;
this.maxReconnectAttempts = 5;
this.reconnectDelay = 1000;
this.connect();
}
connect() {
this.ws = new WebSocket(this.url);
this.ws.onopen = () => {
console.log('Connected');
this.reconnectAttempts = 0;
};
this.ws.onclose = (event) => {
if (event.code !== 1000) {
this.reconnect();
}
};
this.ws.onerror = (error) => {
console.error('WebSocket error:', error);
};
}
reconnect() {
if (this.reconnectAttempts >= this.maxReconnectAttempts) {
console.error('Max reconnect attempts reached');
return;
}
this.reconnectAttempts++;
const delay = this.reconnectDelay * Math.pow(2, this.reconnectAttempts);
setTimeout(() => this.connect(), delay);
}
}
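The client above uses pure exponential backoff; in production a random jitter is usually added so that many clients dropped at the same moment (e.g. by a server restart) do not all reconnect in lockstep. A "full jitter" sketch with an injectable random source (names illustrative):

```javascript
// Exponential backoff with full jitter: the delay is drawn uniformly
// from [0, min(cap, base * 2^attempt)], spreading reconnects out
// instead of synchronizing them.
function backoffDelay(attempt, { base = 1000, cap = 30000, rng = Math.random } = {}) {
  const ceiling = Math.min(cap, base * Math.pow(2, attempt));
  return rng() * ceiling;
}

// attempt 0 -> up to 1 s, attempt 3 -> up to 8 s,
// attempt 10 -> capped at 30 s
```

Plugging this into the `reconnect()` method above would replace the fixed `reconnectDelay * Math.pow(2, ...)` computation.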
3. Working with Socket.IO
Socket.IO provides an abstraction layer on top of WebSocket to simplify real-time communication development.
Namespaces and Rooms
const io = require('socket.io')(server);
// Separate namespaces
const chatNamespace = io.of('/chat');
const notificationNamespace = io.of('/notifications');
chatNamespace.on('connection', (socket) => {
// Join room
socket.on('joinRoom', (roomId) => {
socket.join(roomId);
chatNamespace.to(roomId).emit('userJoined', {
userId: socket.id,
room: roomId
});
});
// Send message within room
socket.on('message', (data) => {
chatNamespace.to(data.roomId).emit('newMessage', {
userId: socket.id,
content: data.content,
timestamp: Date.now()
});
});
// Leave room
socket.on('leaveRoom', (roomId) => {
socket.leave(roomId);
});
});
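Conceptually, a room is just a named set of socket ids maintained on the server: `socket.join`/`socket.leave` update the set, and `to(room).emit` iterates it. A simplified in-memory model of that bookkeeping (not Socket.IO's actual internals, which also handle multi-node adapters):

```javascript
// Minimal room registry mirroring join/leave semantics.
class Rooms {
  constructor() {
    this.rooms = new Map(); // roomId -> Set<socketId>
  }
  join(roomId, socketId) {
    if (!this.rooms.has(roomId)) this.rooms.set(roomId, new Set());
    this.rooms.get(roomId).add(socketId);
  }
  leave(roomId, socketId) {
    const members = this.rooms.get(roomId);
    if (!members) return;
    members.delete(socketId);
    if (members.size === 0) this.rooms.delete(roomId); // drop empty rooms
  }
  members(roomId) {
    return [...(this.rooms.get(roomId) || [])];
  }
}
```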
Auto-Reconnection and Fallback
// Client configuration
const socket = io('https://api.example.com', {
// Transport layer fallback (HTTP Long Polling if WebSocket fails)
transports: ['websocket', 'polling'],
// Reconnection settings
reconnection: true,
reconnectionAttempts: 10,
reconnectionDelay: 1000,
reconnectionDelayMax: 5000,
randomizationFactor: 0.5,
// Timeout
timeout: 20000,
// Authentication
auth: {
token: 'my-jwt-token'
}
});
socket.on('connect_error', (error) => {
if (error.message === 'unauthorized') {
// Token refresh logic
refreshToken().then((newToken) => {
socket.auth.token = newToken;
socket.connect();
});
}
});
4. gRPC Streaming
gRPC supports four communication modes based on HTTP/2.
Four Streaming Modes
syntax = "proto3";
service ChatService {
// 1. Unary: one request - one response
rpc SendMessage (ChatMessage) returns (MessageResponse);
// 2. Server Streaming: one request - response stream
rpc SubscribeMessages (SubscriptionRequest) returns (stream ChatMessage);
// 3. Client Streaming: request stream - one response
rpc UploadLogs (stream LogEntry) returns (UploadSummary);
// 4. Bidirectional Streaming: request stream - response stream
rpc Chat (stream ChatMessage) returns (stream ChatMessage);
}
message ChatMessage {
string user_id = 1;
string content = 2;
int64 timestamp = 3;
}
Bidirectional Streaming Implementation (Go)
func (s *chatServer) Chat(stream pb.ChatService_ChatServer) error {
for {
msg, err := stream.Recv()
if err == io.EOF {
return nil
}
if err != nil {
return err
}
// Process and broadcast message
response := &pb.ChatMessage{
UserId: msg.UserId,
Content: msg.Content,
Timestamp: time.Now().UnixMilli(),
}
if err := stream.Send(response); err != nil {
return err
}
}
}
5. Practical Architecture: Real-Time Chat
Redis Pub/Sub + WebSocket
Redis Pub/Sub is used to synchronize messages across multiple WebSocket server instances.
[Client A] <--ws--> [Server 1] <--pub/sub--> [Redis]
[Client B] <--ws--> [Server 2] <--pub/sub--> [Redis]
[Client C] <--ws--> [Server 1] <--pub/sub--> [Redis]
Node.js Implementation
const Redis = require('ioredis');
const WebSocket = require('ws');
const redisPub = new Redis();
const redisSub = new Redis();
const wss = new WebSocket.Server({ port: 8080 });
// Manage clients per channel
const channels = new Map();
wss.on('connection', (ws) => {
ws.on('message', (raw) => {
const data = JSON.parse(raw);
switch (data.type) {
case 'subscribe':
// Subscribe to channel
if (!channels.has(data.channel)) {
channels.set(data.channel, new Set());
redisSub.subscribe(data.channel);
}
channels.get(data.channel).add(ws);
break;
case 'message':
// Broadcast message to all servers via Redis
redisPub.publish(data.channel, JSON.stringify({
user: data.user,
content: data.content,
timestamp: Date.now()
}));
break;
}
});
ws.on('close', () => {
// Remove from all channels on disconnect
channels.forEach((clients, channel) => {
clients.delete(ws);
if (clients.size === 0) {
channels.delete(channel);
redisSub.unsubscribe(channel);
}
});
});
});
// Deliver Redis messages to all clients in the channel
redisSub.on('message', (channel, message) => {
const clients = channels.get(channel);
if (clients) {
clients.forEach((ws) => {
if (ws.readyState === WebSocket.OPEN) {
ws.send(message);
}
});
}
});
Scalability Considerations
- Horizontal scaling: Synchronize messages across servers with Redis Pub/Sub or Redis Streams
- Connection limits: Set maximum connections per server (consider OS file descriptor limits)
- Load balancing: Sticky session or IP hash-based load balancing
- Message persistence: Store message history with Redis Streams or Kafka
- Authentication: JWT-based WebSocket handshake authentication
Part 2: Observability
6. The Three Pillars of Observability
Observability refers to the degree to which a system's internal state can be understood from its external outputs.
Metrics
Numeric time-series data.
- Request count (requests_total)
- Response time (response_duration_seconds)
- Error rate (error_rate_percent)
- CPU/memory utilization (cpu_usage_percent, memory_usage_bytes)
Logs
Text records of events that occurred at specific points in time.
{
"timestamp": "2026-04-12T16:00:00Z",
"level": "ERROR",
"service": "chat-server",
"trace_id": "abc123def456",
"message": "WebSocket connection failed",
"details": {
"user_id": "user-789",
"error_code": 1006,
"reason": "abnormal closure"
}
}
Traces
Track the complete path of a request as it traverses multiple services in a distributed system.
[API Gateway] ---> [Auth Service] ---> [Chat Service] ---> [Redis]
Span 1 Span 2 Span 3 Span 4
|____________ Trace ID: abc123 __________________________|
How the Three Pillars Relate
| Question | Resolution Tool |
|---|---|
| What went wrong? | Metrics (detect error rate spikes) |
| Why did it go wrong? | Traces (identify bottlenecks) |
| What exactly happened? | Logs (examine detailed error messages) |
7. OpenTelemetry
OpenTelemetry (OTel) is a CNCF project that collects metrics, logs, and traces in a vendor-neutral manner.
Architecture
[Application + OTel SDK]
|
| OTLP (gRPC/HTTP)
v
[OTel Collector]
| | |
v v v
[Jaeger] [Prom.] [Loki]
Auto-Instrumentation (Node.js)
// tracing.js - load before application starts
const { NodeSDK } = require('@opentelemetry/sdk-node');
const { OTLPTraceExporter } = require('@opentelemetry/exporter-trace-otlp-grpc');
const { OTLPMetricExporter } = require('@opentelemetry/exporter-metrics-otlp-grpc');
const { getNodeAutoInstrumentations } = require('@opentelemetry/auto-instrumentations-node');
const { PeriodicExportingMetricReader } = require('@opentelemetry/sdk-metrics');
const sdk = new NodeSDK({
serviceName: 'chat-service',
traceExporter: new OTLPTraceExporter({
url: 'http://otel-collector:4317',
}),
metricReader: new PeriodicExportingMetricReader({
exporter: new OTLPMetricExporter({
url: 'http://otel-collector:4317',
}),
exportIntervalMillis: 15000,
}),
instrumentations: [getNodeAutoInstrumentations()],
});
sdk.start();
OTel Collector Configuration
# otel-collector-config.yaml
receivers:
otlp:
protocols:
grpc:
endpoint: 0.0.0.0:4317
http:
endpoint: 0.0.0.0:4318
processors:
batch:
timeout: 5s
send_batch_size: 1024
memory_limiter:
check_interval: 1s
limit_mib: 512
exporters:
prometheus:
endpoint: 0.0.0.0:8889
otlp/jaeger:
endpoint: jaeger:4317
tls:
insecure: true
loki:
endpoint: http://loki:3100/loki/api/v1/push
service:
pipelines:
traces:
receivers: [otlp]
processors: [memory_limiter, batch] # memory_limiter should run first in the pipeline
exporters: [otlp/jaeger]
metrics:
receivers: [otlp]
processors: [batch]
exporters: [prometheus]
logs:
receivers: [otlp]
processors: [batch]
exporters: [loki]
8. Prometheus + Grafana
Prometheus Metric Types
1. Counter - Monotonically increasing value (e.g., total request count)
2. Gauge - Value that can increase or decrease (e.g., current active connections)
3. Histogram - Value distribution (e.g., response time distribution)
4. Summary - Client-side precomputed percentiles (e.g., p99 response time); unlike Histograms, Summary quantiles cannot be aggregated across instances
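Histograms store only cumulative per-bucket counts, and the server approximates quantiles from them by linear interpolation inside the bucket where the target rank falls. This is essentially what PromQL's `histogram_quantile()` does. A simplified sketch (real PromQL also handles rates and edge cases like `+Inf` buckets):

```javascript
// Approximate a quantile from cumulative histogram buckets.
// `buckets` is an array of { le, count } with cumulative counts,
// sorted by ascending upper bound `le`.
function histogramQuantile(q, buckets) {
  const total = buckets[buckets.length - 1].count;
  const rank = q * total;
  let prevLe = 0;
  let prevCount = 0;
  for (const { le, count } of buckets) {
    if (count >= rank) {
      // Interpolate linearly within this bucket.
      const fraction = (rank - prevCount) / (count - prevCount);
      return prevLe + (le - prevLe) * fraction;
    }
    prevLe = le;
    prevCount = count;
  }
  return prevLe;
}
```

The takeaway for instrumentation: quantile accuracy depends entirely on how well the configured bucket boundaries bracket the real latency distribution.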
WebSocket Custom Metrics Example (Node.js)
const promClient = require('prom-client');
// Counter: total connections
const wsConnectionsTotal = new promClient.Counter({
name: 'websocket_connections_total',
help: 'Total WebSocket connections',
labelNames: ['status'],
});
// Gauge: current active connections
const wsActiveConnections = new promClient.Gauge({
name: 'websocket_active_connections',
help: 'Current active WebSocket connections',
labelNames: ['namespace'],
});
// Histogram: message processing duration
const wsMessageDuration = new promClient.Histogram({
name: 'websocket_message_duration_seconds',
help: 'WebSocket message processing duration',
buckets: [0.001, 0.005, 0.01, 0.05, 0.1, 0.5, 1],
});
// Connect metrics to WebSocket events
wss.on('connection', (ws) => {
wsConnectionsTotal.inc({ status: 'connected' });
wsActiveConnections.inc({ namespace: 'default' });
ws.on('message', (data) => {
const end = wsMessageDuration.startTimer();
processMessage(data);
end();
});
ws.on('close', () => {
wsActiveConnections.dec({ namespace: 'default' });
});
});
Key PromQL Queries
# WebSocket connection request rate per second (5-minute moving average)
rate(websocket_connections_total[5m])
# Current active WebSocket connections
websocket_active_connections
# Message processing time p99
histogram_quantile(0.99, rate(websocket_message_duration_seconds_bucket[5m]))
# Error rate (percentage)
rate(websocket_errors_total[5m]) / rate(websocket_connections_total[5m]) * 100
# Instances with CPU utilization above 80%
node_cpu_usage_percent > 80
Grafana Dashboard Design
{
"dashboard": {
"title": "WebSocket Realtime Monitoring",
"panels": [
{
"title": "Active Connections",
"type": "stat",
"targets": [
{ "expr": "sum(websocket_active_connections)" }
]
},
{
"title": "Connection Rate",
"type": "timeseries",
"targets": [
{ "expr": "rate(websocket_connections_total[5m])" }
]
},
{
"title": "Message Latency (p50/p95/p99)",
"type": "timeseries",
"targets": [
{ "expr": "histogram_quantile(0.50, rate(websocket_message_duration_seconds_bucket[5m]))", "legendFormat": "p50" },
{ "expr": "histogram_quantile(0.95, rate(websocket_message_duration_seconds_bucket[5m]))", "legendFormat": "p95" },
{ "expr": "histogram_quantile(0.99, rate(websocket_message_duration_seconds_bucket[5m]))", "legendFormat": "p99" }
]
},
{
"title": "Error Rate",
"type": "gauge",
"targets": [
{ "expr": "rate(websocket_errors_total[5m]) / rate(websocket_connections_total[5m]) * 100" }
]
}
]
}
}
9. Distributed Tracing
Span and Trace Structure
Trace ID: 7f3a8b2c1d4e5f6a
Span A: API Gateway [============ ] 0-200ms
Span B: Auth Service [==== ] 10-50ms
Span C: Chat Service [================== ] 60-180ms
Span D: Redis Pub/Sub [==== ] 80-120ms
Span E: Database Write [======= ] 130-170ms
Jaeger Setup (Docker Compose)
version: '3.8'
services:
jaeger:
image: jaegertracing/all-in-one:1.54
ports:
- "16686:16686" # Jaeger UI
- "4317:4317" # OTLP gRPC
- "4318:4318" # OTLP HTTP
environment:
- COLLECTOR_OTLP_ENABLED=true
- SPAN_STORAGE_TYPE=elasticsearch
- ES_SERVER_URLS=http://elasticsearch:9200
Manual Span Creation (Python)
from opentelemetry import trace
from opentelemetry.trace import StatusCode
tracer = trace.get_tracer("chat-service")
def handle_websocket_message(message):
with tracer.start_as_current_span("process_message") as span:
span.set_attribute("message.type", message.get("type"))
span.set_attribute("message.size", len(str(message)))
try:
# Message processing logic
result = process(message)
span.set_attribute("processing.result", "success")
return result
except Exception as e:
span.set_status(StatusCode.ERROR, str(e))
span.record_exception(e)
raise
10. Log Aggregation
ELK Stack (Elasticsearch + Logstash + Kibana)
# logstash.conf
input {
tcp {
port => 5000
codec => json
}
}
filter {
if [level] == "ERROR" {
mutate {
add_tag => ["alert"]
}
}
date {
match => ["timestamp", "ISO8601"]
}
}
output {
elasticsearch {
hosts => ["http://elasticsearch:9200"]
index => "websocket-logs-%{+YYYY.MM.dd}"
}
}
Loki + Grafana (Lightweight Alternative)
# loki-config.yaml
auth_enabled: false
server:
http_listen_port: 3100
ingester:
lifecycler:
ring:
kvstore:
store: inmemory
replication_factor: 1
schema_config:
configs:
- from: 2026-01-01
store: tsdb
object_store: filesystem
schema: v13
index:
prefix: index_
period: 24h
storage_config:
tsdb_shipper:
active_index_directory: /loki/index
cache_location: /loki/cache
filesystem:
directory: /loki/chunks
Structured Logging (Node.js + Winston)
const winston = require('winston');
const logger = winston.createLogger({
level: 'info',
format: winston.format.combine(
winston.format.timestamp(),
winston.format.json()
),
defaultMeta: { service: 'websocket-server' },
transports: [
new winston.transports.Console(),
// Note: Loki's push API expects its own JSON stream format, which the
// generic Http transport does not produce. In practice, logs are
// shipped with a Loki-aware transport (e.g. the community winston-loki
// package) or by running Promtail against the service's stdout.
new winston.transports.Http({
host: 'loki-gateway',
port: 3100,
path: '/loki/api/v1/push',
}),
],
});
// Usage example
logger.info('WebSocket connection established', {
userId: 'user-123',
remoteAddress: '192.168.1.100',
protocol: 'wss',
});
logger.error('Message processing failed', {
userId: 'user-456',
errorCode: 'PARSE_ERROR',
rawMessage: '<sanitized>',
traceId: 'abc123',
});
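The `rawMessage: '<sanitized>'` field above hints at an important practice: user-supplied payloads and credentials should be masked before they reach the log pipeline. A minimal sketch (the deny-list keys here are illustrative; a real list depends on the application):

```javascript
// Replace sensitive fields in a log metadata object before emitting it.
const SENSITIVE_KEYS = new Set(['token', 'password', 'rawMessage']);

function sanitize(meta) {
  const out = {};
  for (const [key, value] of Object.entries(meta)) {
    out[key] = SENSITIVE_KEYS.has(key) ? '<sanitized>' : value;
  }
  return out;
}

// logger.error('Message processing failed', sanitize(meta));
```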
11. Alert Design
Preventing Alert Fatigue
When there are too many alerts, even critical ones get ignored. Effective alert design principles are as follows.
| Principle | Description |
|---|---|
| Actionability | Receiving an alert should enable immediate action |
| Symptom-based | Alerts based on user impact (symptoms), not causes |
| Tiered | Separate into Critical / Warning / Info levels |
| Auto-resolve | Automatically close alerts when the issue resolves |
| Grouping | Bundle related alerts and send only once |
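The "Grouping" principle in the table corresponds to Alertmanager's `group_by`: alerts that share the same values for the grouping labels are bundled into a single notification. A sketch of that bucketing logic (illustrative, not Alertmanager's implementation):

```javascript
// Group alerts by the values of selected labels, like Alertmanager's
// group_by. Returns a Map from group key to the alerts in that group.
function groupAlerts(alerts, groupBy) {
  const groups = new Map();
  for (const alert of alerts) {
    const key = groupBy.map((l) => `${l}=${alert.labels[l]}`).join(',');
    if (!groups.has(key)) groups.set(key, []);
    groups.get(key).push(alert);
  }
  return groups;
}
```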
Prometheus Alerting Rules
# alerting-rules.yaml
groups:
- name: websocket-alerts
rules:
- alert: HighConnectionDropRate
expr: rate(websocket_disconnections_total[5m]) > 10
for: 5m
labels:
severity: warning
annotations:
summary: "High WebSocket disconnection rate"
description: "More than 10 disconnections per second over 5 minutes"
- alert: WebSocketLatencyHigh
expr: histogram_quantile(0.99, rate(websocket_message_duration_seconds_bucket[5m])) > 1
for: 3m
labels:
severity: critical
annotations:
summary: "High WebSocket message latency"
description: "p99 latency exceeds 1 second"
- alert: ActiveConnectionsSpiking
expr: websocket_active_connections > 10000
for: 1m
labels:
severity: warning
annotations:
summary: "Active WebSocket connections spiking"
description: "Active connection count exceeds 10,000"
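The `for:` clause in the rules above means an alert only fires after its expression has been continuously true for the given duration; until then it sits in a "pending" state, which filters out short transient spikes. A toy model of that state machine over fixed evaluation ticks (illustrative, not Prometheus internals):

```javascript
// Model Prometheus's `for:` clause: given boolean evaluation results
// at fixed intervals, return the alert state at each tick.
// 'inactive' while false, 'pending' while true but not yet held for
// `forTicks` consecutive evaluations, 'firing' afterwards.
function alertStates(evaluations, forTicks) {
  let held = 0; // consecutive true evaluations so far
  return evaluations.map((isTrue) => {
    held = isTrue ? held + 1 : 0;
    if (held === 0) return 'inactive';
    return held >= forTicks ? 'firing' : 'pending';
  });
}
```

Note how a single false evaluation resets the counter: a condition that flaps never fires, which is exactly the debouncing behavior `for:` is meant to provide.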
Alertmanager Routing Configuration
# alertmanager.yaml
global:
resolve_timeout: 5m
route:
receiver: 'default-receiver'
group_by: ['alertname', 'severity']
group_wait: 30s
group_interval: 5m
repeat_interval: 4h
routes:
- match:
severity: critical
receiver: 'pagerduty-critical'
repeat_interval: 1h
- match:
severity: warning
receiver: 'slack-warnings'
repeat_interval: 4h
receivers:
- name: 'default-receiver'
slack_configs:
- api_url: 'https://hooks.slack.com/services/XXXX'
channel: '#alerts-general'
- name: 'pagerduty-critical'
pagerduty_configs:
- service_key: 'your-pagerduty-key'
severity: critical
- name: 'slack-warnings'
slack_configs:
- api_url: 'https://hooks.slack.com/services/XXXX'
channel: '#alerts-warning'
title: 'Warning Alert'
On-Call Integration: PagerDuty / OpsGenie
Setting up on-call rotation and escalation policies ensures that alerts are automatically routed to the responsible person during incidents.
[Prometheus Alert Fires]
|
v
[Alertmanager Receives and Groups]
|
v
[PagerDuty / OpsGenie Creates Incident]
|
v
[On-call Engineer Notified via Phone/SMS/Slack]
|
v
[If Not ACKed -> Escalation (Notify Manager)]
Final Checklist
Items to verify when applying real-time communication and observability in production environments.
Real-Time Communication
- WebSocket handshake authentication implemented
- Ping/Pong heartbeat interval configured
- Auto-reconnection with exponential backoff implemented
- Horizontal scaling prepared with Redis Pub/Sub
- Message persistence ensured with Kafka or Redis Streams
Observability
- All three pillars collected: metrics, logs, traces
- OpenTelemetry Collector deployed and pipeline configured
- Key indicators visualized in Grafana dashboards
- Symptom-based alerting rules configured
- On-call rotation and escalation policies established
- Alert thresholds continuously tuned to prevent alert fatigue
Quiz: Real-Time Communication and Observability Key Review
Q1. What is the biggest difference between WebSocket and SSE?
A1. WebSocket supports bidirectional (full-duplex) communication, while SSE only supports unidirectional push from server to client. WebSocket is suitable for scenarios where both sides need to send data, such as chat or gaming. SSE is suitable for scenarios where the server unilaterally sends data, such as stock tickers or notifications.
Q2. What are the three pillars of observability and what is the role of each?
A2. Metrics are numeric time-series data for problem detection. Traces track request paths across distributed systems for bottleneck identification. Logs provide detailed event records for root cause analysis.
Q3. What are the key principles for preventing alert fatigue?
A3. Set only actionable alerts, base alerts on symptoms rather than causes, tier them into Critical/Warning/Info levels, group related alerts, and implement auto-resolve detection.