
Observability Complete Guide 2025: Making Systems Transparent with Prometheus, Grafana, and OpenTelemetry


Introduction: Why Observability Matters

"Monitoring tells you whether a system is working. Observability lets you understand why it is not working."

In modern distributed systems, monitoring CPU and memory alone is not enough. In microservice, container, and serverless environments, a single request traverses dozens of services. To identify the root cause of problems, you need observability: the ability to understand what is happening inside a system from the signals it emits.


1. Three Pillars of Observability

Metrics

Time-series data expressed as numbers. They provide an aggregated view of system state.

  • Counter: Monotonically increasing value (e.g., total request count)
  • Gauge: Value that goes up and down (e.g., current memory usage)
  • Histogram: Distribution of values (e.g., response time distribution)
  • Summary: Client-side calculated quantiles
# Counter example
http_requests_total{method="GET", path="/api/users", status="200"} 15234

# Gauge example
node_memory_usage_bytes{instance="web-01"} 1073741824

# Histogram example
http_request_duration_seconds_bucket{le="0.1"} 24054
http_request_duration_seconds_bucket{le="0.5"} 33444
http_request_duration_seconds_bucket{le="1.0"} 34055

Logs

Text records of events. They provide detailed information about individual events.

{
  "timestamp": "2025-03-15T10:30:45.123Z",
  "level": "ERROR",
  "service": "payment-service",
  "traceId": "abc123def456",
  "spanId": "span789",
  "message": "Payment processing failed",
  "userId": "user-42",
  "orderId": "order-1234",
  "error": "Timeout connecting to payment gateway",
  "duration_ms": 5000
}

Structured logging makes searching and analysis much easier.

Traces

Track the complete path of a request across multiple services.

[Trace: abc123def456]
|-- [Span: API Gateway] 2ms
|   |-- [Span: Auth Service] 5ms
|   |   +-- [Span: Redis Cache Lookup] 1ms
|   |-- [Span: User Service] 15ms
|   |   +-- [Span: PostgreSQL Query] 8ms
|   +-- [Span: Payment Service] 5003ms  <-- bottleneck!
|       +-- [Span: External Payment API] 5000ms (TIMEOUT)
+-- Total: 5025ms

When all three pillars are combined, you can understand "What went wrong, Why it went wrong, and Where it went wrong."


2. Prometheus

Architecture

Prometheus is a pull-based monitoring system.

+-------------+     +--------------+     +-----------+
|  Targets    |---->|  Prometheus  |---->|  Grafana  |
|  (exporters)|pull |  Server      |query|           |
+-------------+     |  - TSDB      |     +-----------+
                    |  - Rules     |
                    |  - AlertMgr  |
                    +--------------+
                          |
                    +-----v-----+
                    | AlertMgr  |
                    | - Routing |
                    | - Silence |
                    +-----------+

Prometheus Configuration

# prometheus.yml
global:
  scrape_interval: 15s
  evaluation_interval: 15s

rule_files:
  - "alert_rules.yml"
  - "recording_rules.yml"

alerting:
  alertmanagers:
    - static_configs:
        - targets: ['alertmanager:9093']

scrape_configs:
  - job_name: 'prometheus'
    static_configs:
      - targets: ['localhost:9090']

  - job_name: 'node-exporter'
    static_configs:
      - targets: ['node-exporter:9100']

  - job_name: 'app-service'
    kubernetes_sd_configs:
      - role: pod
    relabel_configs:
      - source_labels: [__meta_kubernetes_pod_annotation_prometheus_io_scrape]
        action: keep
        regex: true
      # Rewrite the scrape address to use the port from the annotation,
      # keeping the pod's host part
      - source_labels: [__address__, __meta_kubernetes_pod_annotation_prometheus_io_port]
        action: replace
        regex: ([^:]+)(?::\d+)?;(\d+)
        replacement: $1:$2
        target_label: __address__

Essential PromQL Queries

# 1. Current requests per second (rate)
rate(http_requests_total[5m])

# 2. Error rate by service
sum(rate(http_requests_total{status=~"5.."}[5m])) by (service)
/
sum(rate(http_requests_total[5m])) by (service)

# 3. 95th percentile response time
histogram_quantile(0.95,
  sum(rate(http_request_duration_seconds_bucket[5m])) by (le, service)
)

# 4. Memory utilization (%)
(node_memory_MemTotal_bytes - node_memory_MemAvailable_bytes)
/ node_memory_MemTotal_bytes * 100

# 5. CPU utilization
100 - (avg by (instance) (rate(node_cpu_seconds_total{mode="idle"}[5m])) * 100)

# 6. Nodes with less than 10% free disk space
node_filesystem_avail_bytes / node_filesystem_size_bytes * 100 < 10

# 7. Pods restarted more than 3 times in the last hour
increase(kube_pod_container_status_restarts_total[1h]) > 3

# 8. Service availability (last 30 days)
1 - (
  sum(increase(http_requests_total{status=~"5.."}[30d]))
  /
  sum(increase(http_requests_total[30d]))
)

Recording Rules (Performance Optimization)

# recording_rules.yml
groups:
  - name: service_metrics
    interval: 30s
    rules:
      - record: service:http_requests:rate5m
        expr: sum(rate(http_requests_total[5m])) by (service)

      - record: service:http_errors:rate5m
        expr: sum(rate(http_requests_total{status=~"5.."}[5m])) by (service)

      - record: service:http_error_rate:ratio
        expr: service:http_errors:rate5m / service:http_requests:rate5m

      - record: service:http_latency:p95
        expr: histogram_quantile(0.95, sum(rate(http_request_duration_seconds_bucket[5m])) by (le, service))

Alert Rules

# alert_rules.yml
groups:
  - name: service_alerts
    rules:
      - alert: HighErrorRate
        expr: service:http_error_rate:ratio > 0.05
        for: 5m
        labels:
          severity: critical
        annotations:
          summary: "High error rate on service {{ $labels.service }}"
          description: "Error rate is {{ $value | humanizePercentage }} for 5+ minutes"

      - alert: HighLatency
        expr: service:http_latency:p95 > 2.0
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "High p95 latency on {{ $labels.service }}"
          description: "P95 latency is {{ $value }}s (threshold: 2s)"

      - alert: PodCrashLooping
        expr: increase(kube_pod_container_status_restarts_total[1h]) > 5
        for: 10m
        labels:
          severity: critical
        annotations:
          summary: "Pod {{ $labels.pod }} is crash looping"

3. Grafana

Dashboard Design Principles

  • USE Method (infrastructure-oriented): Utilization, Saturation, Errors
  • RED Method (service-oriented): Rate, Errors, Duration
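
A quick way to make the RED method concrete is to express its three signals in PromQL (a sketch reusing the http_* metric names from earlier in this guide):

```promql
# Rate: requests per second, per service
sum(rate(http_requests_total[5m])) by (service)

# Errors: fraction of requests failing
sum(rate(http_requests_total{status=~"5.."}[5m])) by (service)
  / sum(rate(http_requests_total[5m])) by (service)

# Duration: p95 latency, per service
histogram_quantile(0.95,
  sum(rate(http_request_duration_seconds_bucket[5m])) by (le, service))
```

These three panels, repeated per service, are often enough for a first-pass service dashboard.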

Grafana Dashboard JSON Structure

{
  "dashboard": {
    "title": "Service Overview",
    "panels": [
      {
        "title": "Request Rate",
        "type": "timeseries",
        "datasource": "Prometheus",
        "targets": [
          {
            "expr": "sum(rate(http_requests_total[5m])) by (service)",
            "legendFormat": "{{ service }}"
          }
        ]
      },
      {
        "title": "Error Rate",
        "type": "stat",
        "datasource": "Prometheus",
        "targets": [
          {
            "expr": "sum(rate(http_requests_total{status=~\"5..\"}[5m])) / sum(rate(http_requests_total[5m])) * 100"
          }
        ],
        "fieldConfig": {
          "defaults": {
            "thresholds": {
              "steps": [
                { "value": 0, "color": "green" },
                { "value": 1, "color": "yellow" },
                { "value": 5, "color": "red" }
              ]
            },
            "unit": "percent"
          }
        }
      }
    ],
    "templating": {
      "list": [
        {
          "name": "service",
          "type": "query",
          "query": "label_values(http_requests_total, service)",
          "refresh": 2
        },
        {
          "name": "environment",
          "type": "custom",
          "options": ["production", "staging", "development"]
        }
      ]
    }
  }
}

Grafana Alerting

# Grafana Alert Rule (provisioning)
apiVersion: 1
groups:
  - orgId: 1
    name: service_alerts
    folder: Production
    interval: 1m
    rules:
      - uid: high-error-rate
        title: High Error Rate
        condition: C
        data:
          - refId: A
            datasourceUid: prometheus
            model:
              expr: sum(rate(http_requests_total{status=~"5.."}[5m])) by (service)
          - refId: B
            datasourceUid: prometheus
            model:
              expr: sum(rate(http_requests_total[5m])) by (service)
          - refId: C
            datasourceUid: __expr__
            model:
              type: math
              expression: "$A / $B > 0.05"
        for: 5m
        labels:
          severity: critical
        annotations:
          summary: "Error rate exceeds 5%"

4. OpenTelemetry

OpenTelemetry Overview

OpenTelemetry (OTel) is a vendor-neutral standard for collecting metrics, logs, and traces.

+--------------+     +----------------+     +---------------+
| Application  |---->|  OTel          |---->|  Backend      |
| + OTel SDK   |     |  Collector     |     |  - Jaeger     |
|              |     |  - Receivers   |     |  - Tempo      |
|              |     |  - Processors  |     |  - Prometheus |
|              |     |  - Exporters   |     |  - Loki       |
+--------------+     +----------------+     +---------------+

SDK Instrumentation (Node.js)

// tracing.ts - import first when application starts
import { NodeSDK } from '@opentelemetry/sdk-node';
import { getNodeAutoInstrumentations } from '@opentelemetry/auto-instrumentations-node';
import { OTLPTraceExporter } from '@opentelemetry/exporter-trace-otlp-grpc';
import { OTLPMetricExporter } from '@opentelemetry/exporter-metrics-otlp-grpc';
import { PeriodicExportingMetricReader } from '@opentelemetry/sdk-metrics';
import { Resource } from '@opentelemetry/resources';
import { ATTR_SERVICE_NAME, ATTR_SERVICE_VERSION } from '@opentelemetry/semantic-conventions';

const sdk = new NodeSDK({
  resource: new Resource({
    [ATTR_SERVICE_NAME]: 'payment-service',
    [ATTR_SERVICE_VERSION]: '1.2.0',
    environment: 'production',
  }),
  traceExporter: new OTLPTraceExporter({
    url: 'http://otel-collector:4317',
  }),
  metricReader: new PeriodicExportingMetricReader({
    exporter: new OTLPMetricExporter({
      url: 'http://otel-collector:4317',
    }),
    exportIntervalMillis: 30000,
  }),
  instrumentations: [
    getNodeAutoInstrumentations({
      '@opentelemetry/instrumentation-http': { enabled: true },
      '@opentelemetry/instrumentation-express': { enabled: true },
      '@opentelemetry/instrumentation-pg': { enabled: true },
      '@opentelemetry/instrumentation-redis': { enabled: true },
    }),
  ],
});

sdk.start();

Manual Instrumentation (Custom Spans)

import { trace, SpanStatusCode } from '@opentelemetry/api';

const tracer = trace.getTracer('payment-service');

async function processPayment(orderId: string, amount: number) {
  return tracer.startActiveSpan('processPayment', async (span) => {
    try {
      span.setAttribute('order.id', orderId);
      span.setAttribute('payment.amount', amount);
      span.setAttribute('payment.currency', 'USD');

      // Create a child span; end it even if validation throws
      const validationResult = await tracer.startActiveSpan(
        'validatePayment',
        async (validationSpan) => {
          try {
            const result = await validatePaymentDetails(orderId);
            validationSpan.setAttribute('validation.result', result.valid);
            return result;
          } finally {
            validationSpan.end();
          }
        }
      );

      if (!validationResult.valid) {
        span.setStatus({
          code: SpanStatusCode.ERROR,
          message: 'Payment validation failed',
        });
        throw new Error('Invalid payment');
      }

      const result = await chargePayment(orderId, amount);
      span.setAttribute('payment.transactionId', result.transactionId);
      span.setStatus({ code: SpanStatusCode.OK });

      return result;
    } catch (error) {
      span.recordException(error);
      span.setStatus({
        code: SpanStatusCode.ERROR,
        message: error.message,
      });
      throw error;
    } finally {
      span.end();
    }
  });
}

OTel Collector Configuration

# otel-collector-config.yaml
receivers:
  otlp:
    protocols:
      grpc:
        endpoint: 0.0.0.0:4317
      http:
        endpoint: 0.0.0.0:4318

  prometheus:
    config:
      scrape_configs:
        - job_name: 'otel-collector'
          scrape_interval: 10s
          static_configs:
            - targets: ['0.0.0.0:8888']

processors:
  batch:
    timeout: 5s
    send_batch_size: 1000

  memory_limiter:
    check_interval: 1s
    limit_mib: 512

  attributes:
    actions:
      - key: environment
        value: production
        action: upsert

exporters:
  otlp/jaeger:
    endpoint: jaeger:4317
    tls:
      insecure: true

  prometheusremotewrite:
    endpoint: http://prometheus:9090/api/v1/write

  loki:
    endpoint: http://loki:3100/loki/api/v1/push

service:
  pipelines:
    traces:
      receivers: [otlp]
      processors: [memory_limiter, batch, attributes]
      exporters: [otlp/jaeger]

    metrics:
      receivers: [otlp, prometheus]
      processors: [memory_limiter, batch]
      exporters: [prometheusremotewrite]

    logs:
      receivers: [otlp]
      processors: [memory_limiter, batch]
      exporters: [loki]

5. Distributed Tracing

Jaeger and Grafana Tempo

Jaeger: Standalone distributed tracing system with a built-in UI for quick start.

Grafana Tempo: Tracing backend integrated into the Grafana ecosystem. Lower storage costs due to indexless design.

Tracing Stack with Docker Compose

# docker-compose.yaml
version: '3.8'
services:
  jaeger:
    image: jaegertracing/all-in-one:latest
    ports:
      - "16686:16686"  # Jaeger UI
    environment:
      - COLLECTOR_OTLP_ENABLED=true

  otel-collector:
    image: otel/opentelemetry-collector-contrib:latest
    command: ["--config=/etc/otel-collector-config.yaml"]
    volumes:
      - ./otel-collector-config.yaml:/etc/otel-collector-config.yaml
    ports:
      - "4317:4317"    # OTLP gRPC (apps send here; collector forwards to jaeger)
      - "4318:4318"    # OTLP HTTP
    depends_on:
      - jaeger

Trace Analysis Tips

  1. Find slow spans: Identify the longest span in the overall trace
  2. Filter error spans: Filter by status=ERROR to pinpoint failure locations
  3. Service map: Visualize dependencies and call patterns between services
  4. Comparative analysis: Compare normal traces with problematic ones side by side

6. Logging

ELK vs Loki vs CloudWatch

Aspect              | ELK Stack        | Grafana Loki    | CloudWatch Logs
--------------------|------------------|-----------------|----------------
Indexing            | Full-text index  | Label-based     | Log groups
Storage cost        | High             | Low             | Medium
Query language      | KQL/Lucene       | LogQL           | Insights
Grafana integration | Plugin           | Native          | Plugin
Best for scale      | Large            | Small to medium | AWS native

Grafana Loki + LogQL

# Error logs by service
{service="payment-service"} |= "ERROR"

# JSON parsing then filter
{service="api-gateway"} | json | status >= 500

# Error frequency (per minute)
count_over_time({service="payment-service"} |= "ERROR" [1m])

# Slow request filter (over 1 second)
{service="api-gateway"} | json | duration > 1000

# Search all logs by specific trace ID
{service=~".+"} |= "trace_id=abc123def456"

Structured Logging Implementation (Node.js)

import pino from 'pino';

const logger = pino({
  level: process.env.LOG_LEVEL || 'info',
  formatters: {
    level(label) {
      return { level: label };
    },
  },
  timestamp: pino.stdTimeFunctions.isoTime,
  base: {
    service: 'payment-service',
    version: '1.2.0',
    environment: process.env.NODE_ENV,
  },
});

// Include per-request context
function createRequestLogger(req) {
  return logger.child({
    requestId: req.id,
    traceId: req.headers['x-trace-id'],
    userId: req.user?.id,
    method: req.method,
    path: req.url,
  });
}

// Usage example
app.use((req, res, next) => {
  req.startTime = Date.now();
  req.log = createRequestLogger(req);
  req.log.info('Request received');

  res.on('finish', () => {
    req.log.info({
      statusCode: res.statusCode,
      duration: Date.now() - req.startTime,
    }, 'Request completed');
  });

  next();
});

7. SRE Core Concepts

SLI (Service Level Indicator)

Specific metrics that measure service quality.

# Availability SLI: ratio of successful requests
sum(rate(http_requests_total{status!~"5.."}[30d]))
/
sum(rate(http_requests_total[30d]))

# Latency SLI: ratio of requests served within 300ms
sum(rate(http_request_duration_seconds_bucket{le="0.3"}[30d]))
/
sum(rate(http_request_duration_seconds_count[30d]))

SLO (Service Level Objective)

The target value for an SLI.

  • Availability SLO: 99.9% (monthly downtime of 43 minutes)
  • Latency SLO: P99 response time under 300ms
# SLO definition (Sloth format)
version: "prometheus/v1"
service: "payment-service"
labels:
  team: "platform"
slos:
  - name: "availability"
    objective: 99.9
    sli:
      events:
        error_query: sum(rate(http_requests_total{status=~"5..",service="payment"}[{{.window}}]))
        total_query: sum(rate(http_requests_total{service="payment"}[{{.window}}]))
    alerting:
      page_alert:
        labels:
          severity: critical
      ticket_alert:
        labels:
          severity: warning

Error Budget

With a 99.9% SLO, the error budget is 0.1%.

  • Over 30 days: 43.2 minutes of downtime allowed
  • Budget remaining: Deploy new features, run experiments
  • Budget exhausted: Focus on stability, freeze deployments
# Remaining error budget (%)
1 - (
  (1 - service:availability:ratio30d)
  /
  (1 - 0.999)
)
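
The budget arithmetic above can be sketched in a few lines of TypeScript (errorBudgetMinutes is a hypothetical helper for illustration, not part of any SRE tooling):

```typescript
// Hypothetical helper: allowed downtime for a given SLO over a window.
function errorBudgetMinutes(slo: number, windowDays: number): number {
  const totalMinutes = windowDays * 24 * 60;
  return totalMinutes * (1 - slo);
}

// 99.9% over 30 days -> ~43.2 minutes of allowed downtime
console.log(errorBudgetMinutes(0.999, 30).toFixed(1)); // "43.2"
```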

SLA (Service Level Agreement)

A contract with customers that carries penalties on violation. SLAs are set more loosely than SLOs, so the internal target is breached (and acted on) before the contract is.

Strictness in a healthy service: SLI (measured) >= SLO (internal target) > SLA (contract)

Example:
- SLA: 99.9% (contract, refund on violation)
- SLO: 99.95% (internal target, stricter than the SLA)
- SLI: 99.97% (actual measurement)

8. Alerting Strategy

Alert Pyramid

          /  P1: Page  \          -> Immediate response (PagerDuty)
         / (Critical,   \
        /  customer impact)\
       /-------------------\
      /   P2: Ticket        \    -> Handle during business hours (Jira)
     / (Degradation, risk)   \
    /------------------------\
   /    P3: Notification      \  -> Awareness only (Slack)
  /  (Warning, trend changes)  \
 /-----------------------------\
/     P4: Dashboard only        \ -> Check dashboards
/ (Reference metrics, auto-heal) \

AlertManager Routing

# alertmanager.yml
global:
  resolve_timeout: 5m

route:
  receiver: 'default-slack'
  group_by: ['alertname', 'service']
  group_wait: 30s
  group_interval: 5m
  repeat_interval: 4h

  routes:
    - match:
        severity: critical
      receiver: 'pagerduty-critical'
      group_wait: 10s
      repeat_interval: 1h

    - match:
        severity: warning
      receiver: 'slack-warnings'
      repeat_interval: 4h

    - match:
        severity: info
      receiver: 'slack-info'
      repeat_interval: 12h

receivers:
  - name: 'pagerduty-critical'
    pagerduty_configs:
      - service_key: 'your-pagerduty-key'
        severity: critical

  - name: 'slack-warnings'
    slack_configs:
      - api_url: 'https://hooks.slack.com/services/xxx'
        channel: '#alerts-warning'
        title: '[WARNING] {{ .GroupLabels.alertname }}'

  - name: 'slack-info'
    slack_configs:
      - api_url: 'https://hooks.slack.com/services/xxx'
        channel: '#alerts-info'

  - name: 'default-slack'
    slack_configs:
      - api_url: 'https://hooks.slack.com/services/xxx'
        channel: '#alerts'

inhibit_rules:
  - source_match:
      severity: critical
    target_match:
      severity: warning
    equal: ['alertname', 'service']

Qualities of Good Alerts

  1. Actionable: There should be something you can do when alerted
  2. Severity distinction: Only page for truly urgent matters
  3. Context included: Include runbook links and related dashboard links
  4. Prevent alert fatigue: Too many alerts cause all alerts to be ignored
  5. Auto-remediation first: Whenever possible, auto-recover then notify

9. On-Call Culture

On-Call Rotation Design

Weekly rotation example:
- Primary: First responder (acknowledge within 5 minutes)
- Secondary: Backup responder (escalated if Primary does not respond within 10 minutes)
- Manager: Escalated if unresolved after 30 minutes

Rotation period: 1 week
Handoff: Every Monday at 10 AM
Compensation: On-call pay, compensatory time off

Incident Response Process

1. Detect
   +-- Receive alert, initial impact assessment

2. Respond
   +-- Create incident channel, assign roles
       - IC (Incident Commander): Coordination
       - Tech Lead: Technical investigation
       - Comms: Customer/stakeholder communication

3. Mitigate
   +-- Immediate action (rollback, scale out, etc.)

4. Resolve
   +-- Fix root cause, verify service recovery

5. Postmortem
   +-- Blameless retrospective, derive action items to prevent recurrence

10. Production Monitoring Stack Architecture

+---------------------------------------------+
|                  Grafana                     |
|  (Dashboards, Alerts, Exploration)          |
+-------+-------------+---------------+------+
        |             |               |
   +----v----+  +-----v-----+  +------v---+
   |Prometheus|  |   Loki    |  | Tempo    |
   |(Metrics) |  |  (Logs)   |  |(Traces)  |
   +----^----+  +-----^-----+  +-----^----+
        |             |              |
   +----+-------------+--------------+----+
   |        OpenTelemetry Collector        |
   |  (Collection, Processing, Routing)    |
   +----^-------------^--------------^----+
        |             |              |
   +----+----+  +-----+-----+  +----+----+
   |Service A|  |Service B  |  |Service C|
   |+OTel SDK|  |+OTel SDK  |  |+OTel SDK|
   +---------+  +-----------+  +---------+

Kubernetes Environment Monitoring

# kube-prometheus-stack values.yaml (Helm)
prometheus:
  prometheusSpec:
    retention: 15d
    storageSpec:
      volumeClaimTemplate:
        spec:
          storageClassName: gp3
          resources:
            requests:
              storage: 100Gi

grafana:
  dashboardProviders:
    dashboardproviders.yaml:
      apiVersion: 1
      providers:
        - name: 'default'
          folder: ''
          type: file
          options:
            path: /var/lib/grafana/dashboards

alertmanager:
  config:
    route:
      receiver: 'slack'
      group_by: ['alertname', 'namespace']
    receivers:
      - name: 'slack'
        slack_configs:
          - api_url: 'https://hooks.slack.com/services/xxx'
            channel: '#k8s-alerts'

11. Interview Questions: 15 Essential Topics

Basics (1-5)

Q1. Explain the three pillars of observability.

Metrics, Logs, and Traces. Metrics are numerical time-series data providing aggregated views of system state. Logs are detailed text records of events. Traces show the path of a request across multiple services.

Q2. Explain Prometheus's pull model.

Prometheus periodically scrapes each target service's /metrics endpoint. Unlike push models, the server decides what to collect and when, and combined with service discovery this works well in dynamic environments.
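
The pull model boils down to the application exposing a plain-text /metrics endpoint for Prometheus to scrape. A minimal hand-rolled sketch in Node.js (real services would use a client library such as prom-client; the metric name follows the examples above):

```typescript
import { createServer } from 'node:http';

let requestsTotal = 0;  // a simple Counter: only ever increments

// Render metrics in the Prometheus text exposition format
function renderMetrics(): string {
  return [
    '# HELP http_requests_total Total HTTP requests received.',
    '# TYPE http_requests_total counter',
    `http_requests_total ${requestsTotal}`,
  ].join('\n') + '\n';
}

const server = createServer((req, res) => {
  requestsTotal++;
  if (req.url === '/metrics') {
    res.setHeader('Content-Type', 'text/plain; version=0.0.4');
    res.end(renderMetrics());  // Prometheus scrapes this on its own schedule
  } else {
    res.end('ok');
  }
});

// server.listen(9464);  // then add host:9464 to scrape_configs
```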

Q3. Explain the differences between Counter, Gauge, and Histogram.

Counter is a monotonically increasing value (total requests), Gauge is a current value that goes up and down (memory usage), Histogram observes value distributions in buckets (response time distribution).

Q4. Why is structured logging important?

Logging in a consistent format like JSON enables automated parsing, filtering, and searching. Including traceId connects logs with traces in distributed systems, making debugging much faster.

Q5. Explain the differences between SLI, SLO, and SLA.

SLI (Service Level Indicator) is an actual measured metric, SLO (Service Level Objective) is an internal target value, SLA (Service Level Agreement) is a legal contract with customers. SLAs are set more loosely than SLOs.

Intermediate (6-10)

Q6. What's the difference between PromQL's rate() and increase()?

rate() returns the average per-second increase rate, while increase() returns the total increase over a given time range. rate() is suited for graphs, increase() for total counts.

Q7. Explain the role of the OpenTelemetry Collector.

It receives telemetry data (metrics, logs, traces) through Receivers, processes it through Processors (batching, filtering), and sends it to multiple backends through Exporters. It acts as an intermediate layer between applications and backends, preventing vendor lock-in.

Q8. Explain the concept and application of Error Budget.

The error ratio allowed by an SLO. With 99.9% SLO, the error budget is 0.1% (43 min/month). When budget remains, deploy new features. When exhausted, focus on stabilization. It manages the balance between development velocity and reliability with numbers.

Q9. What is the relationship between Span and Trace in Distributed Tracing?

A Trace is the complete journey of a request through the system. A Span is an individual unit of work within that journey. Spans form a tree with parent-child relationships, each containing start/end times, attributes, and status.

Q10. What are the key differences between Grafana Loki and ELK?

ELK full-text indexes log content providing powerful search but with high storage costs. Loki only indexes labels and stores log text compressed, resulting in lower costs but requiring label-based filtering before text search.

Advanced (11-15)

Q11. How do you prevent alert fatigue?

Set only actionable alerts with clearly distinguished severity levels. Use inhibit rules to suppress duplicate alerts and grouping to consolidate similar ones. Regularly review alerts to eliminate noise.

Q12. Why are Prometheus Recording Rules needed?

They pre-compute complex PromQL queries and store results as new time series. This reduces dashboard loading times and prevents repeated execution of the same queries. Especially effective for long-range queries like SLO dashboards.

Q13. What is Context Propagation in OpenTelemetry?

The mechanism for propagating trace context (trace ID, span ID) between services. Propagated through HTTP headers (W3C Trace Context) or message queue metadata, enabling end-to-end tracking of a single request in distributed systems.
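
The W3C traceparent header that carries this context is simple enough to parse by hand (a sketch; real services let the OTel SDK inject and extract it automatically, and the IDs below are illustrative):

```typescript
// W3C Trace Context: "version-traceId-spanId-traceFlags"
interface TraceContext {
  version: string;    // currently "00"
  traceId: string;    // 32 hex chars, shared by every span in the trace
  spanId: string;     // 16 hex chars, the caller's span (parent of the next span)
  traceFlags: string; // "01" = sampled
}

function parseTraceparent(header: string): TraceContext | null {
  const parts = header.split('-');
  if (parts.length !== 4) return null;
  const [version, traceId, spanId, traceFlags] = parts;
  if (traceId.length !== 32 || spanId.length !== 16) return null;
  return { version, traceId, spanId, traceFlags };
}

const ctx = parseTraceparent(
  '00-4bf92f3577b34da6a3ce929d0e0e4736-00f067aa0ba902b7-01'
);
// ctx?.traceId === '4bf92f3577b34da6a3ce929d0e0e4736'
```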

Q14. Compare Golden Signals with RED/USE methodologies.

Google's Golden Signals are Latency, Traffic, Errors, Saturation. RED (Rate, Errors, Duration) is service-oriented, USE (Utilization, Saturation, Errors) is infrastructure-oriented. Typically apply RED for services and USE for infrastructure.

Q15. What are the core principles of blameless postmortems?

Focus on system failures rather than blaming individuals. Reconstruct the timeline, analyze contributing factors, and derive specific, measurable action items. The goal is to improve systems so the same problem does not recur.


12. Practice Quiz: 5 Questions

Q1. Which Prometheus metric type is most suitable for representing fluctuating values like "current memory usage"?

Answer: Gauge

Gauge represents instantaneous values that go up and down. Counter only increases monotonically, making it unsuitable for values like memory usage that can decrease. Histogram is used for measuring distributions.

Q2. With a 99.9% SLO, approximately how many minutes of error budget (allowed downtime) do you have over 30 days?

Answer: Approximately 43 minutes

30 days = 43,200 minutes. Error budget = 43,200 x 0.001 = 43.2 minutes. Outages within this time do not violate the SLO.

Q3. What are the three main components of the OpenTelemetry Collector?

Answer: Receivers, Processors, Exporters

Receivers receive data (OTLP, Prometheus, etc.), Processors process data (batching, filtering, adding attributes, etc.), and Exporters send data to backends (Jaeger, Prometheus, Loki, etc.).

Q4. What does this PromQL query calculate? histogram_quantile(0.95, sum(rate(http_request_duration_seconds_bucket[5m])) by (le))

Answer: The 95th percentile (P95) of HTTP request response times over the last 5 minutes

histogram_quantile calculates quantiles from histogram buckets. 0.95 means 95%, extracting P95 from bucket data grouped by the le (less than or equal) label.
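
To demystify the bucket math, here is a simplified TypeScript sketch of the interpolation histogram_quantile performs (assumptions: cumulative bucket counts as in the Section 1 example, linear interpolation within a bucket, and Prometheus edge cases such as the +Inf bucket are ignored):

```typescript
interface Bucket { le: number; count: number } // cumulative count of observations <= le

function histogramQuantile(q: number, buckets: Bucket[]): number {
  const total = buckets[buckets.length - 1].count;
  const rank = q * total; // the observation we are looking for
  let prevLe = 0;
  let prevCount = 0;
  for (const b of buckets) {
    if (b.count >= rank) {
      // Linearly interpolate within the bucket containing the target rank
      return prevLe + (b.le - prevLe) * ((rank - prevCount) / (b.count - prevCount));
    }
    prevLe = b.le;
    prevCount = b.count;
  }
  return buckets[buckets.length - 1].le;
}

// Buckets from the Section 1 histogram example
const buckets = [
  { le: 0.1, count: 24054 },
  { le: 0.5, count: 33444 },
  { le: 1.0, count: 34055 },
];
console.log(histogramQuantile(0.95, buckets).toFixed(2)); // "0.45" -> p95 is ~0.45s
```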

Q5. What problem occurs without "Context Propagation" in Distributed Tracing?

Answer: Requests across services cannot be linked into a single trace.

Without Context Propagation, each service creates independent traces. When a single user request traverses multiple services, you cannot see the full path, making debugging in distributed systems extremely difficult.

