Observability & Monitoring Complete Guide 2025: Logging, Metrics, Tracing, Alerting Strategy

1. Monitoring vs Observability

1.1 Limitations of Monitoring

Traditional monitoring focuses on detecting known unknowns: alert when CPU usage exceeds 90%, alert when disk usage exceeds 80%. This threshold-based approach has clear limits.

In modern distributed systems, this alone is insufficient:

  • Complex interactions between microservices
  • Frequent transient errors
  • Increase in unpredictable problems (unknown unknowns)
  • Performance degradation that cannot be explained by a single metric

1.2 What is Observability

Observability is the ability to understand a system's internal state through its external outputs.

Key difference:

  • Monitoring: "What is broken?"
  • Observability: "Why is it broken?"

The Three Pillars of Observability:

  1. Logs - Records of discrete events
  2. Metrics - Numeric measurements over time
  3. Traces - Request flow through distributed systems
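
The pillars are most useful when they can be correlated: a log line that carries the active trace ID lets you jump from an error log straight to the trace that produced it. Below is a minimal sketch of that correlation, assuming the OpenTelemetry SDK from section 2.2 is already initialized and pino (section 3.1) is the logger.

// Stamp the active trace/span IDs onto every log line (illustrative sketch)
import pino from 'pino';
import { trace } from '@opentelemetry/api';

const logger = pino({
  mixin() {
    const span = trace.getActiveSpan();
    if (!span) return {};
    const { traceId, spanId } = span.spanContext();
    // These fields let the log backend (Loki, Elasticsearch) link back to the trace
    return { trace_id: traceId, span_id: spanId };
  },
});

logger.info({ orderId: 'o-123' }, 'Order received'); // now carries trace_id / span_id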

2. OpenTelemetry (OTel) Core Concepts

2.1 What is OpenTelemetry

OpenTelemetry is a vendor-neutral telemetry collection framework managed by the CNCF (Cloud Native Computing Foundation). It unifies logs, metrics, and traces under a single standard.

Components:

  • API: Interfaces for instrumentation
  • SDK: Implementation of the API
  • Collector: Collects, processes, and exports telemetry data
  • Auto-instrumentation: Collects telemetry without code changes

2.2 OTel SDK Setup

// Node.js OpenTelemetry setup
import { NodeSDK } from '@opentelemetry/sdk-node';
import { getNodeAutoInstrumentations } from '@opentelemetry/auto-instrumentations-node';
import { OTLPTraceExporter } from '@opentelemetry/exporter-trace-otlp-http';
import { OTLPMetricExporter } from '@opentelemetry/exporter-metrics-otlp-http';
import { PeriodicExportingMetricReader } from '@opentelemetry/sdk-metrics';
import { Resource } from '@opentelemetry/resources';
import {
  ATTR_SERVICE_NAME,
  ATTR_SERVICE_VERSION,
} from '@opentelemetry/semantic-conventions';

const sdk = new NodeSDK({
  resource: new Resource({
    [ATTR_SERVICE_NAME]: 'order-service',
    [ATTR_SERVICE_VERSION]: '1.2.0',
    environment: process.env.NODE_ENV || 'production',
  }),
  traceExporter: new OTLPTraceExporter({
    url: 'http://otel-collector:4318/v1/traces',
  }),
  metricReader: new PeriodicExportingMetricReader({
    exporter: new OTLPMetricExporter({
      url: 'http://otel-collector:4318/v1/metrics',
    }),
    exportIntervalMillis: 30000,
  }),
  instrumentations: [
    getNodeAutoInstrumentations({
      // Auto-instrument HTTP, Express, pg, Redis, etc.
      '@opentelemetry/instrumentation-http': {
        ignoreIncomingRequestHook: (req) => {
          // Exclude health check paths
          return req.url === '/health';
        },
      },
      '@opentelemetry/instrumentation-express': {
        enabled: true,
      },
    }),
  ],
});

sdk.start();

// Flush telemetry on graceful shutdown
process.on('SIGTERM', () => {
  sdk.shutdown().then(() => process.exit(0));
});
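
For the auto-instrumentation above to patch modules such as http, Express, pg, and Redis, the SDK must be initialized before those modules are loaded -- typically by keeping this setup in its own file and preloading it with Node's --require (CommonJS) or --import (ESM) flag, or by importing it as the very first statement of the application entry point.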

2.3 Manual Span Creation

import { trace, SpanStatusCode, context } from '@opentelemetry/api';

const tracer = trace.getTracer('order-service');

async function processOrder(orderData) {
  // Create a manual span
  return tracer.startActiveSpan('processOrder', async (span) => {
    try {
      // Add attributes to span
      span.setAttribute('order.id', orderData.id);
      span.setAttribute('order.total', orderData.total);
      span.setAttribute('order.items_count', orderData.items.length);

      // Child span: inventory check
      const inventory = await tracer.startActiveSpan(
        'checkInventory',
        async (childSpan) => {
          try {
            childSpan.setAttribute('inventory.warehouse', 'us-east-1');
            return await inventoryService.check(orderData.items);
          } finally {
            // End the child span even if the inventory check throws
            childSpan.end();
          }
        }
      );

      if (!inventory.available) {
        span.setStatus({
          code: SpanStatusCode.ERROR,
          message: 'Insufficient inventory',
        });
        throw new Error('Insufficient inventory');
      }

      // Child span: payment processing
      const payment = await tracer.startActiveSpan(
        'processPayment',
        async (childSpan) => {
          try {
            childSpan.setAttribute('payment.method', orderData.paymentMethod);
            childSpan.setAttribute('payment.amount', orderData.total);
            return await paymentService.charge(orderData);
          } finally {
            // End the child span even if the charge throws
            childSpan.end();
          }
        }
      );

      // Add span event
      span.addEvent('order_completed', {
        'order.id': orderData.id,
        'payment.transaction_id': payment.transactionId,
      });

      span.setStatus({ code: SpanStatusCode.OK });
      return { orderId: orderData.id, transactionId: payment.transactionId };
    } catch (error) {
      span.setStatus({
        code: SpanStatusCode.ERROR,
        message: error.message,
      });
      span.recordException(error);
      throw error;
    } finally {
      span.end();
    }
  });
}

2.4 OTel Collector Configuration

# otel-collector-config.yaml
receivers:
  otlp:
    protocols:
      grpc:
        endpoint: 0.0.0.0:4317
      http:
        endpoint: 0.0.0.0:4318

processors:
  batch:
    send_batch_size: 1024
    timeout: 5s

  memory_limiter:
    check_interval: 1s
    limit_mib: 512
    spike_limit_mib: 128

  attributes:
    actions:
      - key: environment
        value: production
        action: upsert

  # Tail sampling (prioritize error traces)
  tail_sampling:
    decision_wait: 10s
    policies:
      - name: error-policy
        type: status_code
        status_code:
          status_codes: [ERROR]
      - name: slow-policy
        type: latency
        latency:
          threshold_ms: 1000
      - name: probabilistic-policy
        type: probabilistic
        probabilistic:
          sampling_percentage: 10

exporters:
  otlp/jaeger:
    endpoint: jaeger:4317
    tls:
      insecure: true

  prometheus:
    endpoint: 0.0.0.0:8889

  loki:
    endpoint: http://loki:3100/loki/api/v1/push

service:
  pipelines:
    traces:
      receivers: [otlp]
      processors: [memory_limiter, batch, tail_sampling]
      exporters: [otlp/jaeger]
    metrics:
      receivers: [otlp]
      processors: [memory_limiter, batch]
      exporters: [prometheus]
    logs:
      receivers: [otlp]
      processors: [memory_limiter, batch]
      exporters: [loki]

3. Logging

3.1 Structured Logging

// Structured logging with pino
import pino from 'pino';

const logger = pino({
  level: process.env.LOG_LEVEL || 'info',
  // JSON format (default)
  formatters: {
    level(label) {
      return { level: label };
    },
    bindings(bindings) {
      return {
        service: 'order-service',
        version: '1.2.0',
        host: bindings.hostname,
        pid: bindings.pid,
      };
    },
  },
  // Timestamp format
  timestamp: pino.stdTimeFunctions.isoTime,
  // Redact sensitive information
  redact: ['req.headers.authorization', 'req.headers.cookie', '*.password'],
});

// Correlation ID middleware
import { randomUUID } from 'node:crypto';

// Helper for correlation/request IDs
const generateId = () => randomUUID();

function correlationMiddleware(req, res, next) {
  const correlationId = req.headers['x-correlation-id'] || generateId();
  req.correlationId = correlationId;
  req.startTime = Date.now(); // used below to log request duration
  res.setHeader('x-correlation-id', correlationId);

  // Per-request child logger
  req.log = logger.child({
    correlationId,
    requestId: generateId(),
    method: req.method,
    path: req.path,
    userAgent: req.headers['user-agent'],
  });

  next();
}

// Usage example
app.post('/api/orders', correlationMiddleware, async (req, res) => {
  req.log.info({ body: req.body }, 'Order creation started');

  try {
    const order = await createOrder(req.body);
    req.log.info(
      { orderId: order.id, duration: Date.now() - req.startTime },
      'Order created successfully'
    );
    res.json(order);
  } catch (error) {
    req.log.error(
      { err: error, body: req.body },
      'Order creation failed'
    );
    res.status(500).json({ error: 'Internal server error' });
  }
});

3.2 Log Level Strategy

Level | Purpose                | Example
FATAL | System shutdown        | Complete database connection failure
ERROR | Error occurred         | Payment processing failed
WARN  | Potential issues       | Retry succeeded, cache miss
INFO  | Key business events    | Order created, user login
DEBUG | Debugging information  | SQL queries, API request/response
TRACE | Detailed tracing       | Function entry/exit, variable values

// Log levels by environment
const logLevels = {
  production: 'info',
  staging: 'debug',
  development: 'trace',
};
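
To apply the mapping, resolve the level from the runtime environment when the logger is created -- a minimal sketch reusing the pino import from section 3.1:

// Pick the log level from NODE_ENV, falling back to 'info'
const envLogger = pino({
  level: logLevels[process.env.NODE_ENV] || 'info',
});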

3.3 ELK Stack (Elasticsearch + Logstash + Kibana)

# docker-compose.yml - ELK Stack
version: '3.8'
services:
  elasticsearch:
    image: docker.elastic.co/elasticsearch/elasticsearch:8.12.0
    environment:
      - discovery.type=single-node
      - xpack.security.enabled=false
      - "ES_JAVA_OPTS=-Xms512m -Xmx512m"
    ports:
      - "9200:9200"
    volumes:
      - es-data:/usr/share/elasticsearch/data

  logstash:
    image: docker.elastic.co/logstash/logstash:8.12.0
    volumes:
      - ./logstash.conf:/usr/share/logstash/pipeline/logstash.conf
    depends_on:
      - elasticsearch

  kibana:
    image: docker.elastic.co/kibana/kibana:8.12.0
    ports:
      - "5601:5601"
    environment:
      - ELASTICSEARCH_HOSTS=http://elasticsearch:9200
    depends_on:
      - elasticsearch

volumes:
  es-data:

# logstash.conf
input {
  tcp {
    port => 5044
    codec => json
  }
}

filter {
  # Timestamp parsing
  date {
    match => [ "timestamp", "ISO8601" ]
    target => "@timestamp"
  }

  # Parse error stack traces
  if [level] == "error" {
    grok {
      match => {
        "stack" => "%{GREEDYDATA:error_class}: %{GREEDYDATA:error_message}"
      }
    }
  }

  # Add geolocation data
  if [client_ip] {
    geoip {
      source => "client_ip"
    }
  }
}

output {
  elasticsearch {
    hosts => ["elasticsearch:9200"]
    index => "logs-%{+YYYY.MM.dd}"
  }
}

3.4 Grafana Loki (Lightweight Log Aggregation)

# Loki configuration
auth_enabled: false

server:
  http_listen_port: 3100

common:
  ring:
    kvstore:
      store: inmemory
  replication_factor: 1

schema_config:
  configs:
    - from: 2024-01-01
      store: tsdb
      object_store: filesystem
      schema: v13
      index:
        prefix: index_
        period: 24h

storage_config:
  filesystem:
    directory: /loki/chunks

limits_config:
  retention_period: 30d

// Send logs directly to Loki (winston-loki)
import winston from 'winston';
import LokiTransport from 'winston-loki';

const logger = winston.createLogger({
  transports: [
    new LokiTransport({
      host: 'http://loki:3100',
      labels: {
        service: 'order-service',
        environment: 'production',
      },
      json: true,
      batching: true,
      interval: 5,
    }),
  ],
});

4. Metrics

4.1 Prometheus Fundamentals

Prometheus is a pull-based time-series database.

Metric types:

  • Counter: Monotonically increasing value (request count, error count)
  • Gauge: Value that can increase or decrease (active connections, memory usage)
  • Histogram: Distribution of values (response times, payload sizes)
  • Summary: Client-side percentile computation

// Node.js Prometheus client
import { Registry, Counter, Gauge, Histogram, Summary } from 'prom-client';

const register = new Registry();

// Collect default metrics (CPU, memory, event loop, etc.)
import { collectDefaultMetrics } from 'prom-client';
collectDefaultMetrics({ register });

// Counter: HTTP request count
const httpRequestsTotal = new Counter({
  name: 'http_requests_total',
  help: 'Total number of HTTP requests',
  labelNames: ['method', 'path', 'status'],
  registers: [register],
});

// Gauge: Current active connections
const activeConnections = new Gauge({
  name: 'active_connections',
  help: 'Number of active connections',
  registers: [register],
});

// Histogram: Request response time
const httpRequestDuration = new Histogram({
  name: 'http_request_duration_seconds',
  help: 'HTTP request duration in seconds',
  labelNames: ['method', 'path', 'status'],
  buckets: [0.01, 0.05, 0.1, 0.25, 0.5, 1, 2.5, 5, 10],
  registers: [register],
});

// Summary: DB query time
const dbQueryDuration = new Summary({
  name: 'db_query_duration_seconds',
  help: 'Database query duration',
  labelNames: ['operation', 'table'],
  percentiles: [0.5, 0.9, 0.95, 0.99],
  registers: [register],
});

// Express middleware
function metricsMiddleware(req, res, next) {
  const start = process.hrtime.bigint();
  activeConnections.inc();

  res.on('finish', () => {
    const duration = Number(process.hrtime.bigint() - start) / 1e9;
    const labels = {
      method: req.method,
      path: req.route?.path || req.path,
      status: res.statusCode.toString(),
    };

    httpRequestsTotal.inc(labels);
    httpRequestDuration.observe(labels, duration);
    activeConnections.dec();
  });

  next();
}

// Metrics endpoint
app.get('/metrics', async (req, res) => {
  res.set('Content-Type', register.contentType);
  res.end(await register.metrics());
});
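
The dbQueryDuration Summary declared above is not used in the middleware; it is meant for timing individual database calls. A minimal sketch, assuming a node-postgres style pool object (hypothetical here):

// Time a database query with the Summary defined above (illustrative sketch)
async function getOrder(orderId) {
  // startTimer() returns a function that records the elapsed seconds when called
  const end = dbQueryDuration.startTimer({ operation: 'select', table: 'orders' });
  try {
    const result = await pool.query('SELECT * FROM orders WHERE id = $1', [orderId]);
    return result.rows[0];
  } finally {
    end();
  }
}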

4.2 Essential PromQL Queries

# Requests per second (RPS)
rate(http_requests_total[5m])

# Error rate by service
sum(rate(http_requests_total{status=~"5.."}[5m])) by (service)
/
sum(rate(http_requests_total[5m])) by (service)

# p99 response time
histogram_quantile(0.99, sum(rate(http_request_duration_seconds_bucket[5m])) by (le))

# p95 response time (by path)
histogram_quantile(0.95,
  sum(rate(http_request_duration_seconds_bucket[5m])) by (le, path)
)

# Memory usage growth rate
deriv(process_resident_memory_bytes[1h])

# Uptime
time() - process_start_time_seconds

# Apdex score (target response time 0.5s)
(
  sum(rate(http_request_duration_seconds_bucket{le="0.5"}[5m]))
  +
  sum(rate(http_request_duration_seconds_bucket{le="2.0"}[5m]))
) / 2
/
sum(rate(http_request_duration_seconds_count[5m]))

4.3 Recording Rules

# prometheus-rules.yml
groups:
  - name: request_rates
    interval: 30s
    rules:
      # Pre-computed request rate
      - record: service:http_requests:rate5m
        expr: sum(rate(http_requests_total[5m])) by (service)

      # Pre-computed error rate
      - record: service:http_errors:rate5m
        expr: |
          sum(rate(http_requests_total{status=~"5.."}[5m])) by (service)
          /
          sum(rate(http_requests_total[5m])) by (service)

      # Pre-computed latency percentiles
      - record: service:http_latency:p99
        expr: |
          histogram_quantile(0.99,
            sum(rate(http_request_duration_seconds_bucket[5m])) by (le, service)
          )

5. Visualization (Grafana)

5.1 RED Method Dashboard

The RED Method is a methodology for monitoring core service performance indicators:

  • Rate: Requests per second
  • Errors: Error rate
  • Duration: Response time

{
  "panels": [
    {
      "title": "Request Rate (RPS)",
      "type": "timeseries",
      "targets": [
        {
          "expr": "sum(rate(http_requests_total[5m])) by (service)",
          "legendFormat": "service"
        }
      ]
    },
    {
      "title": "Error Rate (%)",
      "type": "timeseries",
      "targets": [
        {
          "expr": "100 * sum(rate(http_requests_total{status=~\"5..\"}[5m])) by (service) / sum(rate(http_requests_total[5m])) by (service)",
          "legendFormat": "service"
        }
      ]
    },
    {
      "title": "Response Time (p99)",
      "type": "timeseries",
      "targets": [
        {
          "expr": "histogram_quantile(0.99, sum(rate(http_request_duration_seconds_bucket[5m])) by (le, service))",
          "legendFormat": "service"
        }
      ]
    }
  ]
}

5.2 USE Method (Infrastructure Monitoring)

The USE Method is a methodology for monitoring infrastructure resources:

  • Utilization: Resource usage percentage
  • Saturation: Queue length
  • Errors: Error count

# CPU Utilization
100 - (avg by (instance) (rate(node_cpu_seconds_total{mode="idle"}[5m])) * 100)

# Memory Utilization
(1 - node_memory_MemAvailable_bytes / node_memory_MemTotal_bytes) * 100

# Disk I/O Saturation
rate(node_disk_io_time_weighted_seconds_total[5m])

# Network Errors
rate(node_network_receive_errs_total[5m])
+ rate(node_network_transmit_errs_total[5m])

6. Distributed Tracing

6.1 Core Concepts

  • Trace: The complete path of a request through the system
  • Span: An individual unit of work within a Trace
  • Context Propagation: Passing trace context between services
  • Trace ID: A unique ID identifying the entire request
  • Span ID: A unique ID identifying an individual operation
  • Parent Span ID: The Span ID of the parent operation

6.2 Context Propagation

// W3C Trace Context propagation
// Request headers:
// traceparent: 00-0af7651916cd43dd8448eb211c80319c-b7ad6b7169203331-01
// tracestate: congo=t61rcWkgMzE

// Extract/inject context in Express middleware
import { propagation, context, trace } from '@opentelemetry/api';

// Inject context in HTTP client
async function makeRequest(url, data) {
  const headers = {};

  // Inject trace info from current context into headers
  propagation.inject(context.active(), headers);

  return fetch(url, {
    method: 'POST',
    headers: {
      'Content-Type': 'application/json',
      ...headers, // Includes traceparent, tracestate
    },
    body: JSON.stringify(data),
  });
}
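
// On the receiving service the counterpart is propagation.extract(), which reads
// traceparent/tracestate from incoming headers into a context. HTTP auto-instrumentation
// normally does this automatically; a manual sketch (hypothetical Express middleware):
function extractContextMiddleware(req, res, next) {
  const extractedContext = propagation.extract(context.active(), req.headers);
  // Run the rest of the request handling within the extracted context
  context.with(extractedContext, () => next());
}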

// gRPC context propagation
import { GrpcInstrumentation } from '@opentelemetry/instrumentation-grpc';

// Auto-instrumentation injects/extracts context in gRPC metadata

6.3 Jaeger Setup

# docker-compose.yml - Jaeger
services:
  jaeger:
    image: jaegertracing/all-in-one:1.53
    ports:
      - "16686:16686"  # UI
      - "4317:4317"    # OTLP gRPC
      - "4318:4318"    # OTLP HTTP
    environment:
      - COLLECTOR_OTLP_ENABLED=true
      - SPAN_STORAGE_TYPE=elasticsearch
      - ES_SERVER_URLS=http://elasticsearch:9200

6.4 Sampling Strategies

// Sampling configuration
import { ParentBasedSampler, TraceIdRatioBasedSampler } from '@opentelemetry/sdk-trace-base';

// Ratio-based sampling (collect only 10%)
const sampler = new TraceIdRatioBasedSampler(0.1);

// Parent-based sampling (if parent is sampled, child is also sampled)
const parentBasedSampler = new ParentBasedSampler({
  root: new TraceIdRatioBasedSampler(0.1),
});
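
The sampler is passed to the SDK at startup; a short sketch extending the NodeSDK configuration from section 2.2 (other options omitted):

// Wire the parent-based sampler into the SDK (sketch)
const sdkWithSampling = new NodeSDK({
  sampler: parentBasedSampler,
  // ...traceExporter, metricReader, instrumentations as in section 2.2
});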

Sampling strategy comparison:

Strategy      | Description                    | Pros                        | Cons
Head Sampling | Decision at trace start        | Simple, low overhead        | May miss error traces
Tail Sampling | Decision after trace completes | Preserves error/slow traces | High memory usage
Rate Limiting | Limit collections per second   | Predictable cost            | Misses during traffic spikes
Probabilistic | Probability-based collection   | Uniform sample              | Misses rare events

7. Alerting Strategy

7.1 Preventing Alert Fatigue

Alert fatigue occurs when too many alerts cause important ones to be ignored.

Principles:

  1. Only send actionable alerts
  2. Distinguish severity levels
  3. Proper routing (who receives it and when)
  4. Alert grouping (bundle alerts for the same issue)
  5. Auto-resolve (automatically close alerts when issues are fixed)

7.2 Severity Levels

# Prometheus alerting rules
groups:
  - name: service_alerts
    rules:
      # Critical: Immediate response required
      - alert: HighErrorRate
        expr: |
          sum(rate(http_requests_total{status=~"5.."}[5m])) by (service)
          /
          sum(rate(http_requests_total[5m])) by (service)
          > 0.05
        for: 5m
        labels:
          severity: critical
        annotations:
          summary: "High error rate on service"
          description: "Error rate is above 5% for 5 minutes"
          runbook_url: "https://wiki.example.com/runbooks/high-error-rate"

      # Warning: Attention needed
      - alert: HighLatency
        expr: |
          histogram_quantile(0.99,
            sum(rate(http_request_duration_seconds_bucket[5m])) by (le, service)
          ) > 2
        for: 10m
        labels:
          severity: warning
        annotations:
          summary: "High latency on service"
          description: "p99 latency is above 2 seconds for 10 minutes"

      # Info: Informational
      - alert: PodRestart
        expr: |
          increase(kube_pod_container_status_restarts_total[1h]) > 3
        labels:
          severity: info
        annotations:
          summary: "Pod restarting frequently"

7.3 Alertmanager Configuration

# alertmanager.yml
global:
  resolve_timeout: 5m
  slack_api_url: 'https://hooks.slack.com/services/xxx'

route:
  group_by: ['alertname', 'service']
  group_wait: 30s
  group_interval: 5m
  repeat_interval: 4h
  receiver: 'default-slack'

  routes:
    # Critical -> PagerDuty + Slack
    - match:
        severity: critical
      receiver: 'pagerduty-critical'
      repeat_interval: 1h
      continue: true

    - match:
        severity: critical
      receiver: 'slack-critical'

    # Warning -> Slack
    - match:
        severity: warning
      receiver: 'slack-warning'
      repeat_interval: 4h

    # Info -> Slack (business hours only)
    - match:
        severity: info
      receiver: 'slack-info'
      active_time_intervals:
        - business-hours

receivers:
  - name: 'default-slack'
    slack_configs:
      - channel: '#alerts-general'

  - name: 'pagerduty-critical'
    pagerduty_configs:
      - service_key: 'YOUR_PAGERDUTY_KEY'
        severity: critical

  - name: 'slack-critical'
    slack_configs:
      - channel: '#alerts-critical'
        color: 'danger'

  - name: 'slack-warning'
    slack_configs:
      - channel: '#alerts-warning'
        color: 'warning'

  - name: 'slack-info'
    slack_configs:
      - channel: '#alerts-info'

time_intervals:
  - name: business-hours
    time_intervals:
      - weekdays: ['monday:friday']
        times:
          - start_time: '09:00'
            end_time: '18:00'

inhibit_rules:
  # When Critical fires, suppress Warning for the same service
  - source_match:
      severity: critical
    target_match:
      severity: warning
    equal: ['service']

7.4 Writing Runbooks

# Runbook: High Error Rate

## Alert Condition
- Error rate (5xx) above 5% persisting for more than 5 minutes

## Immediate Checks
1. Identify affected endpoints
2. Check recent deployment history
3. Verify dependent service health

## Response Procedure
1. Check error patterns in Grafana dashboard
2. Review error details in logs
3. Identify failure point using traces
4. If caused by recent deployment, rollback
5. If dependent service issue, check circuit breaker

## Escalation
- 15 min unresolved: Report to team lead
- 30 min unresolved: Page senior engineer

8. SLO / SLI / SLA

8.1 Definitions

  • SLI (Service Level Indicator): Measurable service quality metric
    • e.g., Success rate, response time, availability
  • SLO (Service Level Objective): Target value for an SLI
    • e.g., 99.9% availability, p99 latency under 200ms
  • SLA (Service Level Agreement): Contract with customers
    • Includes compensation terms for SLO violations

8.2 Error Budget

SLO: 99.9% availability
= Allowed downtime per month (30 days): 43.2 minutes
= Error Budget: 0.1%

Error Budget consumption rate:
- 50% consumed: Caution
- 75% consumed: Feature release freeze
- 100% consumed: Focus exclusively on reliability work
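
The arithmetic is simple enough to script; a small illustrative helper (a sketch, not part of any standard library) that converts an availability SLO into an allowed-downtime budget:

// Convert an availability SLO into an allowed-downtime budget (illustrative)
function errorBudget(sloTarget, windowDays = 30) {
  const windowMinutes = windowDays * 24 * 60; // e.g. 43,200 minutes for 30 days
  const budgetFraction = 1 - sloTarget;       // e.g. 0.001 for a 99.9% target
  return {
    budgetFraction,
    allowedDowntimeMinutes: windowMinutes * budgetFraction,
  };
}

console.log(errorBudget(0.999).allowedDowntimeMinutes); // ~= 43.2 minutes per 30 days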

8.3 Burn Rate Alerts

# Burn Rate based alerting
groups:
  - name: slo_alerts
    rules:
      # Fast burn (2% consumed in 1 hour) - Immediate response
      - alert: SLOBurnRateCritical
        expr: |
          (
            sum(rate(http_requests_total{status=~"5.."}[1h]))
            /
            sum(rate(http_requests_total[1h]))
          ) > (14.4 * 0.001)
        for: 2m
        labels:
          severity: critical
        annotations:
          summary: "SLO burn rate critical - error budget exhausting fast"

      # Slow burn (5% consumed in 6 hours) - Attention
      - alert: SLOBurnRateWarning
        expr: |
          (
            sum(rate(http_requests_total{status=~"5.."}[6h]))
            /
            sum(rate(http_requests_total[6h]))
          ) > (6 * 0.001)
        for: 15m
        labels:
          severity: warning
        annotations:
          summary: "SLO burn rate elevated"

9. APM Tool Comparison

9.1 Major APM Solutions

Feature              | Datadog          | New Relic            | Dynatrace  | Open Source Stack
Pricing Model        | Host/usage based | Usage based          | Host based | Infra costs only
Auto-instrumentation | Excellent        | Excellent            | Best       | Good
APM                  | Included         | Included             | Included   | Jaeger/Tempo
Log Management       | Included         | Included             | Included   | ELK/Loki
Metrics              | Included         | Included             | Included   | Prometheus
AI Analysis          | Watchdog         | Applied Intelligence | Davis AI   | None
Setup Complexity     | Low              | Low                  | Medium     | High
Vendor Lock-in       | High             | High                 | High       | None

9.2 Open Source Stack Composition

+-----------------+      +--------------------+
| Application     |----->| OTel Collector     |
| (OTel SDK)      |      | (collect/process)  |
+-----------------+      +---------+----------+
                                   |
                 +-----------------+-----------------+
                 |                 |                 |
          +------v-----+    +------v-----+    +------v-----+
          | Prometheus |    | Tempo      |    | Loki       |
          | (Metrics)  |    | (Traces)   |    | (Logs)     |
          +------+-----+    +------+-----+    +------+-----+
                 |                 |                 |
                 +-----------------+-----------------+
                                   |
                          +--------v---------+
                          | Grafana          |
                          | (unified view)   |
                          +------------------+

10. Cost Optimization

10.1 Data Volume Management

# Log retention policy
retention_policy:
  hot_tier: 7d        # Recent 7 days: Fast search
  warm_tier: 30d       # Recent 30 days: Slow search
  cold_tier: 90d       # Recent 90 days: Archive
  delete_after: 365d   # Delete after 1 year

# Index Lifecycle Management (Elasticsearch ILM)
index_lifecycle:
  phases:
    hot:
      actions:
        rollover:
          max_size: 50gb
          max_age: 1d
    warm:
      min_age: 7d
      actions:
        shrink:
          number_of_shards: 1
        forcemerge:
          max_num_segments: 1
    cold:
      min_age: 30d
      actions:
        searchable_snapshot:
          snapshot_repository: s3_repo
    delete:
      min_age: 365d

10.2 Cost Reduction Through Sampling

// Adaptive sampling
class AdaptiveSampler {
  constructor(targetRate = 100) {
    this.targetRate = targetRate; // Target collections per second
    this.currentRate = 0;
    this.samplingProbability = 1.0;

    // Adjust sampling probability every 10 seconds
    setInterval(() => this.adjust(), 10000);
  }

  shouldSample() {
    this.currentRate++;
    return Math.random() < this.samplingProbability;
  }

  adjust() {
    // Target count for the 10-second adjustment window
    const windowTarget = this.targetRate * 10;
    if (this.currentRate > windowTarget) {
      // Traffic exceeds the target: scale probability down proportionally
      this.samplingProbability = windowTarget / this.currentRate;
    } else {
      // Traffic is at or below target: gradually ramp sampling back up
      this.samplingProbability = Math.min(1.0, this.samplingProbability * 1.1);
    }
    this.currentRate = 0;
  }
}
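
A hypothetical usage sketch -- deciding per request whether to record detailed (and costly) telemetry:

// Hypothetical wiring into the Express app from earlier sections
const adaptiveSampler = new AdaptiveSampler(100); // target ~100 sampled requests/second

app.use((req, res, next) => {
  req.recordDetailedTelemetry = adaptiveSampler.shouldSample();
  next();
});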

10.3 Metric Aggregation

# Aggregate raw data with Prometheus Recording Rules
groups:
  - name: aggregation
    interval: 1m
    rules:
      # Service-level aggregation (reduce cardinality by removing instance label)
      - record: service:requests:rate5m
        expr: sum(rate(http_requests_total[5m])) by (service, method, status)

      # Reduce time resolution (5min -> 1hour)
      - record: service:requests:rate1h
        expr: sum(rate(http_requests_total[1h])) by (service)

11. Production Checklist

11.1 Pre-deployment Verification

  • Structured logging applied to all services
  • Correlation ID propagation verified
  • Prometheus metrics endpoint exposed
  • OTel auto-instrumentation enabled
  • Health check endpoint implemented
  • SLO defined with error budget established

11.2 Operational Verification

  • Grafana dashboards (RED/USE method)
  • Alert rules configured (Critical/Warning/Info)
  • Runbooks completed
  • On-call rotation established
  • Log retention policies configured
  • Cost monitoring in place

11.3 Periodic Review Items

  • Monthly SLO review
  • Alert noise analysis
  • Clean up unused dashboards/alerts
  • Cost optimization review
  • Incident post-mortems

12. Quiz

Q1: What are the Three Pillars of Observability?

Answer: Logs, Metrics, Traces

  • Logs: Records of individual events. Used for debugging and auditing.
  • Metrics: Numeric measurements over time. For understanding system performance trends.
  • Traces: Tracking the complete path of requests through distributed systems. Identifying bottlenecks.

Combining these three enables understanding not just "what is broken" but "why it is broken."

Q2: Write the PromQL query to retrieve p99 response time in Prometheus.

Answer:

histogram_quantile(0.99, sum(rate(http_request_duration_seconds_bucket[5m])) by (le))

The histogram_quantile function calculates percentiles from Histogram metrics. 0.99 represents the 99th percentile, rate() computes the per-second rate of increase over a 5-minute window, and results are grouped by the le (less-than-or-equal bucket boundary) label.

Q3: What is the difference between Head Sampling and Tail Sampling?

Answer:

Head Sampling makes the collection decision at the start of a trace. It is simple with low overhead, but may miss traces where errors occur.

Tail Sampling makes the collection decision after the trace completes. It reliably preserves traces with errors or slow responses, but requires holding all spans in memory, resulting in higher resource usage.

In production, tail sampling is typically performed in the OTel Collector to prioritize collecting error traces.

Q4: With a 99.9% SLO, how much downtime is allowed in a 30-day month?

Answer: Approximately 43.2 minutes

Calculation: 30 days x 24 hours x 60 minutes = 43,200 minutes. Error budget = 43,200 x 0.001 = 43.2 minutes.

This means up to 43.2 minutes of downtime per month is within SLO bounds. When the error budget is exhausted, new feature releases should be halted to focus on reliability improvements.

Q5: Explain the difference between the RED Method and the USE Method.

Answer:

The RED Method is used for service monitoring:

  • Rate: Requests per second
  • Errors: Error rate
  • Duration: Response time

The USE Method is used for infrastructure resource monitoring:

  • Utilization: Resource usage percentage (CPU, memory, etc.)
  • Saturation: Queue length, degree of overload
  • Errors: Hardware/system errors

RED monitors services from the user experience perspective, while USE monitors infrastructure from the system perspective. Using both together enables rapid identification of root causes.


13. References

  1. OpenTelemetry Documentation - https://opentelemetry.io/docs/
  2. Prometheus Documentation - https://prometheus.io/docs/
  3. Grafana Documentation - https://grafana.com/docs/
  4. Jaeger Documentation - https://www.jaegertracing.io/docs/
  5. Grafana Loki - https://grafana.com/oss/loki/
  6. Grafana Tempo - https://grafana.com/oss/tempo/
  7. ELK Stack - https://www.elastic.co/elk-stack
  8. Google SRE Book - Monitoring - https://sre.google/sre-book/monitoring-distributed-systems/
  9. Google SRE Book - Service Level Objectives - https://sre.google/sre-book/service-level-objectives/
  10. pino Logger - https://getpino.io/
  11. Alertmanager - https://prometheus.io/docs/alerting/latest/alertmanager/
  12. PagerDuty Incident Response - https://response.pagerduty.com/
  13. The RED Method - https://grafana.com/blog/2018/08/02/the-red-method-how-to-instrument-your-services/
  14. The USE Method - https://www.brendangregg.com/usemethod.html

Observability is not simply about installing tools -- it is a culture. Every team member must develop the habit of structuring logs, defining metrics, and utilizing traces. Build your alerting strategy around SLOs, and balance releases and reliability through error budgets. Do not forget to actively leverage sampling and retention policies for cost optimization.