Observability Complete Guide 2025: Making Systems Transparent with Prometheus, Grafana, and OpenTelemetry
Author: Youngju Kim (@fjvbn20031)
Introduction: Why Observability Matters
"Monitoring tells you whether a system is working. Observability lets you understand why it is not working."
In modern distributed systems, simply monitoring CPU usage or memory is insufficient. In microservice architectures, containers, and serverless environments, a single request traverses dozens of services. To identify the root cause of problems, you need observability — the ability to look inside your system.
1. Three Pillars of Observability
Metrics
Time-series data expressed as numbers. They provide an aggregated view of system state.
- Counter: Monotonically increasing value (e.g., total request count)
- Gauge: Value that goes up and down (e.g., current memory usage)
- Histogram: Distribution of values (e.g., response time distribution)
- Summary: Client-side calculated quantiles
# Counter example
http_requests_total{method="GET", path="/api/users", status="200"} 15234
# Gauge example
node_memory_usage_bytes{instance="web-01"} 1073741824
# Histogram example
http_request_duration_seconds_bucket{le="0.1"} 24054
http_request_duration_seconds_bucket{le="0.5"} 33444
http_request_duration_seconds_bucket{le="1.0"} 34055
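Histogram buckets are cumulative: each `le` bucket counts all observations at or below that bound, and `histogram_quantile` estimates a quantile by linearly interpolating inside the bucket where the target rank falls. A minimal TypeScript sketch of that interpolation, using the bucket counts above (the `+Inf` count is an assumption here, since the example omits it):

```typescript
// Sketch of histogram_quantile-style interpolation over cumulative buckets.
type Bucket = { le: number; count: number }; // count = observations <= le

function histogramQuantile(q: number, buckets: Bucket[]): number {
  const sorted = [...buckets].sort((a, b) => a.le - b.le);
  const total = sorted[sorted.length - 1].count; // the +Inf bucket holds the total
  const rank = q * total;
  let prevLe = 0;
  let prevCount = 0;
  for (const b of sorted) {
    if (b.count >= rank) {
      // Prometheus caps the result at the highest finite bucket bound.
      if (!isFinite(b.le)) return prevLe;
      // Linear interpolation within the bucket containing the rank.
      const fraction = (rank - prevCount) / (b.count - prevCount);
      return prevLe + (b.le - prevLe) * fraction;
    }
    prevLe = b.le;
    prevCount = b.count;
  }
  return sorted[sorted.length - 1].le;
}

const buckets: Bucket[] = [
  { le: 0.1, count: 24054 },
  { le: 0.5, count: 33444 },
  { le: 1.0, count: 34055 },
  { le: Infinity, count: 34055 }, // assumed: no observations above 1.0s
];

console.log(histogramQuantile(0.95, buckets).toFixed(3)); // ~0.453 seconds
```

With this data the p95 lands in the (0.1, 0.5] bucket, so the estimate is roughly 0.45s even though no individual observation is stored.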
Logs
Text records of events. They provide detailed information about individual events.
{
  "timestamp": "2025-03-15T10:30:45.123Z",
  "level": "ERROR",
  "service": "payment-service",
  "traceId": "abc123def456",
  "spanId": "span789",
  "message": "Payment processing failed",
  "userId": "user-42",
  "orderId": "order-1234",
  "error": "Timeout connecting to payment gateway",
  "duration_ms": 5000
}
Structured logging makes searching and analysis much easier.
Traces
Track the complete path of a request across multiple services.
[Trace: abc123def456]
|-- [Span: API Gateway] 2ms
|   |-- [Span: Auth Service] 5ms
|   |   +-- [Span: Redis Cache Lookup] 1ms
|   |-- [Span: User Service] 15ms
|   |   +-- [Span: PostgreSQL Query] 8ms
|   +-- [Span: Payment Service] 5003ms  <-- bottleneck!
|       +-- [Span: External Payment API] 5000ms (TIMEOUT)
+-- Total: 5025ms
When all three pillars are combined, you can understand "What went wrong, Why it went wrong, and Where it went wrong."
2. Prometheus
Architecture
Prometheus is a pull-based monitoring system.
+-------------+      +--------------+       +-----------+
|  Targets    |----->|  Prometheus  |------>|  Grafana  |
| (exporters) | pull |  Server      | query |           |
+-------------+      |  - TSDB      |       +-----------+
                     |  - Rules     |
                     +------+-------+
                            |
                     +------v------+
                     |  AlertMgr   |
                     |  - Routing  |
                     |  - Silence  |
                     +-------------+
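The pull model's contract is simple: a target only has to render its current metric values in the Prometheus text exposition format at an HTTP endpoint (conventionally /metrics), and the server scrapes it. A hand-rolled sketch for illustration; in a real service you would use a client library such as prom-client rather than formatting the text yourself:

```typescript
// Minimal /metrics endpoint: counters kept in memory, rendered in the
// Prometheus text exposition format on demand.
import * as http from "http";

const counters = new Map<string, number>();

// Increment a labelled counter series.
function inc(name: string, labels: Record<string, string>): void {
  const key = `${name}{${Object.entries(labels)
    .map(([k, v]) => `${k}="${v}"`)
    .join(",")}}`;
  counters.set(key, (counters.get(key) ?? 0) + 1);
}

// Render all series in the text format Prometheus scrapes.
function renderMetrics(): string {
  let out = "# TYPE http_requests_total counter\n";
  for (const [series, value] of counters) out += `${series} ${value}\n`;
  return out;
}

const server = http.createServer((req, res) => {
  if (req.url === "/metrics") {
    res.writeHead(200, { "Content-Type": "text/plain; version=0.0.4" });
    res.end(renderMetrics());
    return;
  }
  inc("http_requests_total", { method: req.method ?? "GET", path: req.url ?? "/" });
  res.end("ok");
});
// server.listen(9100) would start serving scrapes.
```

Prometheus then polls this endpoint on every `scrape_interval`, which is why the server, not the application, controls collection cadence.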
Prometheus Configuration
# prometheus.yml
global:
  scrape_interval: 15s
  evaluation_interval: 15s

rule_files:
  - "alert_rules.yml"
  - "recording_rules.yml"

alerting:
  alertmanagers:
    - static_configs:
        - targets: ['alertmanager:9093']

scrape_configs:
  - job_name: 'prometheus'
    static_configs:
      - targets: ['localhost:9090']

  - job_name: 'node-exporter'
    static_configs:
      - targets: ['node-exporter:9100']

  - job_name: 'app-service'
    kubernetes_sd_configs:
      - role: pod
    relabel_configs:
      - source_labels: [__meta_kubernetes_pod_annotation_prometheus_io_scrape]
        action: keep
        regex: true
      # Rewrite the scrape address to <pod-ip>:<annotated port>
      - source_labels: [__address__, __meta_kubernetes_pod_annotation_prometheus_io_port]
        action: replace
        regex: ([^:]+)(?::\d+)?;(\d+)
        replacement: $1:$2
        target_label: __address__
Essential PromQL Queries
# 1. Current requests per second (rate)
rate(http_requests_total[5m])
# 2. Error rate by service
sum(rate(http_requests_total{status=~"5.."}[5m])) by (service)
/
sum(rate(http_requests_total[5m])) by (service)
# 3. 95th percentile response time
histogram_quantile(0.95,
  sum(rate(http_request_duration_seconds_bucket[5m])) by (le, service)
)
# 4. Memory utilization (%)
(node_memory_MemTotal_bytes - node_memory_MemAvailable_bytes)
/ node_memory_MemTotal_bytes * 100
# 5. CPU utilization
100 - (avg by (instance) (rate(node_cpu_seconds_total{mode="idle"}[5m])) * 100)
# 6. Nodes with less than 10% free disk space
node_filesystem_avail_bytes / node_filesystem_size_bytes * 100 < 10
# 7. Pod restart count (last 1 hour)
increase(kube_pod_container_status_restarts_total[1h]) > 3
# 8. Service availability (last 30 days)
1 - (
  sum(increase(http_requests_total{status=~"5.."}[30d]))
  /
  sum(increase(http_requests_total[30d]))
)
Recording Rules (Performance Optimization)
# recording_rules.yml
groups:
  - name: service_metrics
    interval: 30s
    rules:
      - record: service:http_requests:rate5m
        expr: sum(rate(http_requests_total[5m])) by (service)
      - record: service:http_errors:rate5m
        expr: sum(rate(http_requests_total{status=~"5.."}[5m])) by (service)
      - record: service:http_error_rate:ratio
        expr: service:http_errors:rate5m / service:http_requests:rate5m
      - record: service:http_latency:p95
        expr: histogram_quantile(0.95, sum(rate(http_request_duration_seconds_bucket[5m])) by (le, service))
Alert Rules
# alert_rules.yml
groups:
  - name: service_alerts
    rules:
      - alert: HighErrorRate
        expr: service:http_error_rate:ratio > 0.05
        for: 5m
        labels:
          severity: critical
        annotations:
          summary: "High error rate on service {{ $labels.service }}"
          description: "Error rate is {{ $value | humanizePercentage }} for 5+ minutes"

      - alert: HighLatency
        expr: service:http_latency:p95 > 2.0
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "High p95 latency on {{ $labels.service }}"
          description: "P95 latency is {{ $value }}s (threshold: 2s)"

      - alert: PodCrashLooping
        expr: increase(kube_pod_container_status_restarts_total[1h]) > 5
        for: 10m
        labels:
          severity: critical
        annotations:
          summary: "Pod {{ $labels.pod }} is crash looping"
3. Grafana
Dashboard Design Principles
- USE Method (infrastructure-oriented): Utilization, Saturation, Errors
- RED Method (service-oriented): Rate, Errors, Duration
Grafana Dashboard JSON Structure
{
  "dashboard": {
    "title": "Service Overview",
    "panels": [
      {
        "title": "Request Rate",
        "type": "timeseries",
        "datasource": "Prometheus",
        "targets": [
          {
            "expr": "sum(rate(http_requests_total[5m])) by (service)",
            "legendFormat": "{{ service }}"
          }
        ]
      },
      {
        "title": "Error Rate",
        "type": "stat",
        "datasource": "Prometheus",
        "targets": [
          {
            "expr": "sum(rate(http_requests_total{status=~\"5..\"}[5m])) / sum(rate(http_requests_total[5m])) * 100"
          }
        ],
        "fieldConfig": {
          "defaults": {
            "thresholds": {
              "steps": [
                { "value": 0, "color": "green" },
                { "value": 1, "color": "yellow" },
                { "value": 5, "color": "red" }
              ]
            },
            "unit": "percent"
          }
        }
      }
    ],
    "templating": {
      "list": [
        {
          "name": "service",
          "type": "query",
          "query": "label_values(http_requests_total, service)",
          "refresh": 2
        },
        {
          "name": "environment",
          "type": "custom",
          "options": ["production", "staging", "development"]
        }
      ]
    }
  }
}
Grafana Alerting
# Grafana alert rule (file provisioning)
apiVersion: 1
groups:
  - orgId: 1
    name: service_alerts
    folder: Production
    interval: 1m
    rules:
      - uid: high-error-rate
        title: High Error Rate
        condition: C
        data:
          - refId: A
            datasourceUid: prometheus
            model:
              expr: sum(rate(http_requests_total{status=~"5.."}[5m])) by (service)
          - refId: B
            datasourceUid: prometheus
            model:
              expr: sum(rate(http_requests_total[5m])) by (service)
          - refId: C
            datasourceUid: __expr__
            model:
              type: math
              expression: "$A / $B > 0.05"
        for: 5m
        labels:
          severity: critical
        annotations:
          summary: "Error rate exceeds 5%"
4. OpenTelemetry
OpenTelemetry Overview
OpenTelemetry (OTel) is a vendor-neutral standard for collecting metrics, logs, and traces.
+----------------+      +------------------+      +---------------+
|  Application   |----->|  OTel Collector  |----->|  Backend      |
|  + OTel SDK    |      |  - Receivers     |      |  - Jaeger     |
|                |      |  - Processors    |      |  - Tempo      |
|                |      |  - Exporters     |      |  - Prometheus |
|                |      |                  |      |  - Loki       |
+----------------+      +------------------+      +---------------+
SDK Instrumentation (Node.js)
// tracing.ts - load this module before any other so auto-instrumentation
// can patch libraries as they are imported
import { NodeSDK } from '@opentelemetry/sdk-node';
import { getNodeAutoInstrumentations } from '@opentelemetry/auto-instrumentations-node';
import { OTLPTraceExporter } from '@opentelemetry/exporter-trace-otlp-grpc';
import { OTLPMetricExporter } from '@opentelemetry/exporter-metrics-otlp-grpc';
import { PeriodicExportingMetricReader } from '@opentelemetry/sdk-metrics';
import { Resource } from '@opentelemetry/resources';
import { ATTR_SERVICE_NAME, ATTR_SERVICE_VERSION } from '@opentelemetry/semantic-conventions';

const sdk = new NodeSDK({
  resource: new Resource({
    [ATTR_SERVICE_NAME]: 'payment-service',
    [ATTR_SERVICE_VERSION]: '1.2.0',
    'deployment.environment': 'production',
  }),
  traceExporter: new OTLPTraceExporter({
    url: 'http://otel-collector:4317',
  }),
  metricReader: new PeriodicExportingMetricReader({
    exporter: new OTLPMetricExporter({
      url: 'http://otel-collector:4317',
    }),
    exportIntervalMillis: 30000,
  }),
  instrumentations: [
    getNodeAutoInstrumentations({
      '@opentelemetry/instrumentation-http': { enabled: true },
      '@opentelemetry/instrumentation-express': { enabled: true },
      '@opentelemetry/instrumentation-pg': { enabled: true },
      '@opentelemetry/instrumentation-redis': { enabled: true },
    }),
  ],
});

sdk.start();
Manual Instrumentation (Custom Spans)
import { trace, SpanStatusCode } from '@opentelemetry/api';

const tracer = trace.getTracer('payment-service');

async function processPayment(orderId: string, amount: number) {
  return tracer.startActiveSpan('processPayment', async (span) => {
    try {
      span.setAttribute('order.id', orderId);
      span.setAttribute('payment.amount', amount);
      span.setAttribute('payment.currency', 'USD');

      // Create a child span; end it in finally so it closes even if
      // validation throws
      const validationResult = await tracer.startActiveSpan(
        'validatePayment',
        async (validationSpan) => {
          try {
            const result = await validatePaymentDetails(orderId);
            validationSpan.setAttribute('validation.result', result.valid);
            return result;
          } finally {
            validationSpan.end();
          }
        }
      );

      if (!validationResult.valid) {
        span.setStatus({
          code: SpanStatusCode.ERROR,
          message: 'Payment validation failed',
        });
        throw new Error('Invalid payment');
      }

      const result = await chargePayment(orderId, amount);
      span.setAttribute('payment.transactionId', result.transactionId);
      span.setStatus({ code: SpanStatusCode.OK });
      return result;
    } catch (error) {
      span.recordException(error as Error);
      span.setStatus({
        code: SpanStatusCode.ERROR,
        message: (error as Error).message,
      });
      throw error;
    } finally {
      span.end();
    }
  });
}
OTel Collector Configuration
# otel-collector-config.yaml
receivers:
  otlp:
    protocols:
      grpc:
        endpoint: 0.0.0.0:4317
      http:
        endpoint: 0.0.0.0:4318
  prometheus:
    config:
      scrape_configs:
        - job_name: 'otel-collector'
          scrape_interval: 10s
          static_configs:
            - targets: ['0.0.0.0:8888']

processors:
  batch:
    timeout: 5s
    send_batch_size: 1000
  memory_limiter:
    check_interval: 1s
    limit_mib: 512
  attributes:
    actions:
      - key: environment
        value: production
        action: upsert

exporters:
  otlp/jaeger:
    endpoint: jaeger:4317
    tls:
      insecure: true
  # Prometheus must be started with --web.enable-remote-write-receiver
  prometheusremotewrite:
    endpoint: http://prometheus:9090/api/v1/write
  loki:
    endpoint: http://loki:3100/loki/api/v1/push

service:
  pipelines:
    traces:
      receivers: [otlp]
      processors: [memory_limiter, batch, attributes]
      exporters: [otlp/jaeger]
    metrics:
      receivers: [otlp, prometheus]
      processors: [memory_limiter, batch]
      exporters: [prometheusremotewrite]
    logs:
      receivers: [otlp]
      processors: [memory_limiter, batch]
      exporters: [loki]
5. Distributed Tracing
Jaeger and Grafana Tempo
Jaeger: Standalone distributed tracing system with a built-in UI for quick start.
Grafana Tempo: Tracing backend integrated into the Grafana ecosystem. Lower storage costs due to indexless design.
Tracing Stack with Docker Compose
# docker-compose.yaml
services:
  jaeger:
    image: jaegertracing/all-in-one:latest
    ports:
      - "16686:16686"   # Jaeger UI
    environment:
      - COLLECTOR_OTLP_ENABLED=true

  otel-collector:
    image: otel/opentelemetry-collector-contrib:latest
    command: ["--config=/etc/otel-collector-config.yaml"]
    volumes:
      - ./otel-collector-config.yaml:/etc/otel-collector-config.yaml
    ports:
      # Only the collector exposes OTLP on the host; it forwards traces
      # to jaeger over the compose network (publishing 4317/4318 on both
      # services would cause a host port conflict)
      - "4317:4317"   # OTLP gRPC
      - "4318:4318"   # OTLP HTTP
    depends_on:
      - jaeger
Trace Analysis Tips
- Find slow spans: Identify the longest span in the overall trace
- Filter error spans: Filter by status=ERROR to pinpoint failure locations
- Service map: Visualize dependencies and call patterns between services
- Comparative analysis: Compare normal traces with problematic ones side by side
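The "find slow spans" tip can be automated: given a flat list of exported spans, compute each span's self time (its duration minus the time spent in its children) and rank by it. A sketch in TypeScript, using a simplified version of the section 1 trace; the field names here are illustrative, not the exact OTLP schema:

```typescript
type Span = { id: string; parentId?: string; name: string; durationMs: number };

// Rank spans by self time: duration minus time attributed to children.
function slowestBySelfTime(spans: Span[]): Span & { selfMs: number } {
  const childTime = new Map<string, number>();
  for (const s of spans) {
    if (s.parentId)
      childTime.set(s.parentId, (childTime.get(s.parentId) ?? 0) + s.durationMs);
  }
  const withSelf = spans.map((s) => ({
    ...s,
    selfMs: s.durationMs - (childTime.get(s.id) ?? 0),
  }));
  return withSelf.sort((a, b) => b.selfMs - a.selfMs)[0];
}

// Simplified spans mirroring the trace shown in section 1.
const exampleTrace: Span[] = [
  { id: "gw", name: "API Gateway", durationMs: 5025 },
  { id: "auth", parentId: "gw", name: "Auth Service", durationMs: 5 },
  { id: "user", parentId: "gw", name: "User Service", durationMs: 15 },
  { id: "pay", parentId: "gw", name: "Payment Service", durationMs: 5003 },
  { id: "ext", parentId: "pay", name: "External Payment API", durationMs: 5000 },
];

console.log(slowestBySelfTime(exampleTrace).name); // "External Payment API"
```

Self time matters because a long parent span is often just waiting on a child; ranking by self time points at the external payment call, not the gateway.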
6. Logging
ELK vs Loki vs CloudWatch
| Aspect | ELK Stack | Grafana Loki | CloudWatch Logs |
|---|---|---|---|
| Indexing | Full-text index | Label-based | Log groups |
| Storage cost | High | Low | Medium |
| Query language | KQL/Lucene | LogQL | Insights |
| Grafana integration | Plugin | Native | Plugin |
| Best for scale | Large | Small to medium | AWS native |
Grafana Loki + LogQL
# Error logs by service
{service="payment-service"} |= "ERROR"
# JSON parsing then filter
{service="api-gateway"} | json | status >= 500
# Error frequency (per minute)
count_over_time({service="payment-service"} |= "ERROR" [1m])
# Slow request filter (over 1 second)
{service="api-gateway"} | json | duration > 1000
# Search all logs by specific trace ID
{service=~".+"} |= "trace_id=abc123def456"
Structured Logging Implementation (Node.js)
import pino from 'pino';

const logger = pino({
  level: process.env.LOG_LEVEL || 'info',
  formatters: {
    level(label) {
      return { level: label };
    },
  },
  timestamp: pino.stdTimeFunctions.isoTime,
  base: {
    service: 'payment-service',
    version: '1.2.0',
    environment: process.env.NODE_ENV,
  },
});

// Include per-request context
function createRequestLogger(req) {
  return logger.child({
    requestId: req.id,
    traceId: req.headers['x-trace-id'],
    userId: req.user?.id,
    method: req.method,
    path: req.url,
  });
}

// Usage example
app.use((req, res, next) => {
  req.startTime = Date.now(); // needed below to compute the request duration
  req.log = createRequestLogger(req);
  req.log.info('Request received');
  res.on('finish', () => {
    req.log.info({
      statusCode: res.statusCode,
      duration: Date.now() - req.startTime,
    }, 'Request completed');
  });
  next();
});
7. SRE Core Concepts
SLI (Service Level Indicator)
Specific metrics that measure service quality.
# Availability SLI: ratio of successful requests
sum(rate(http_requests_total{status!~"5.."}[30d]))
/
sum(rate(http_requests_total[30d]))
# Latency SLI: proportion of requests served within 300ms
sum(rate(http_request_duration_seconds_bucket{le="0.3"}[30d]))
/
sum(rate(http_request_duration_seconds_count[30d]))
SLO (Service Level Objective)
The target value for an SLI.
- Availability SLO: 99.9% (monthly downtime of 43 minutes)
- Latency SLO: P99 response time under 300ms
# SLO definition (Sloth format)
version: "prometheus/v1"
service: "payment-service"
labels:
  team: "platform"
slos:
  - name: "availability"
    objective: 99.9
    sli:
      events:
        error_query: sum(rate(http_requests_total{status=~"5..",service="payment"}[{{.window}}]))
        total_query: sum(rate(http_requests_total{service="payment"}[{{.window}}]))
    alerting:
      page_alert:
        labels:
          severity: critical
      ticket_alert:
        labels:
          severity: warning
Error Budget
With a 99.9% SLO, the error budget is 0.1%.
- Over 30 days: 43.2 minutes of downtime allowed
- Budget remaining: Deploy new features, run experiments
- Budget exhausted: Focus on stability, freeze deployments
# Remaining error budget (%)
1 - (
  (1 - service:availability:ratio30d)
  /
  (1 - 0.999)
)
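The budget arithmetic above can be checked with a few lines of TypeScript (the helper name is ours, for illustration): allowed downtime is simply (1 - SLO) times the window length.

```typescript
// Error budget in minutes for a given SLO over a window of N days.
function errorBudgetMinutes(slo: number, windowDays: number): number {
  return (1 - slo) * windowDays * 24 * 60;
}

console.log(errorBudgetMinutes(0.999, 30).toFixed(1));  // "43.2" minutes / 30 days
console.log(errorBudgetMinutes(0.9999, 30).toFixed(1)); // "4.3" - one more nine costs 10x
```

Each additional nine shrinks the budget tenfold, which is why 99.99% targets usually require automated failover rather than human response.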
SLA (Service Level Agreement)
A contract with customers. Set more loosely than SLOs.
SLA (external contract) < SLO (internal target) < SLI (actual measurement): each level is stricter than the one before.
Example:
- SLA: 99.9% (contract, refund on violation)
- SLO: 99.95% (internal target, stricter than SLA)
- SLI: 99.97% (actual measurement)
8. Alerting Strategy
Alert Pyramid
          /   P1: Page    \            -> Immediate response (PagerDuty)
         /   (Critical,    \
        /  customer impact) \
       /---------------------\
      /      P2: Ticket       \        -> Handle during business hours (Jira)
     /   (Degradation, risk)   \
    /---------------------------\
   /      P3: Notification       \     -> Awareness only (Slack)
  /   (Warning, trend changes)    \
 /---------------------------------\
/       P4: Dashboard only          \  -> Check dashboards as needed
|  (Reference metrics, auto-heal)   |
+-----------------------------------+
AlertManager Routing
# alertmanager.yml
global:
  resolve_timeout: 5m

route:
  receiver: 'default-slack'
  group_by: ['alertname', 'service']
  group_wait: 30s
  group_interval: 5m
  repeat_interval: 4h
  routes:
    - match:
        severity: critical
      receiver: 'pagerduty-critical'
      group_wait: 10s
      repeat_interval: 1h
    - match:
        severity: warning
      receiver: 'slack-warnings'
      repeat_interval: 4h
    - match:
        severity: info
      receiver: 'slack-info'
      repeat_interval: 12h

receivers:
  - name: 'pagerduty-critical'
    pagerduty_configs:
      - service_key: 'your-pagerduty-key'
        severity: critical
  - name: 'slack-warnings'
    slack_configs:
      - api_url: 'https://hooks.slack.com/services/xxx'
        channel: '#alerts-warning'
        title: '[WARNING] {{ .GroupLabels.alertname }}'
  - name: 'slack-info'
    slack_configs:
      - api_url: 'https://hooks.slack.com/services/xxx'
        channel: '#alerts-info'
  - name: 'default-slack'
    slack_configs:
      - api_url: 'https://hooks.slack.com/services/xxx'
        channel: '#alerts'

inhibit_rules:
  - source_match:
      severity: critical
    target_match:
      severity: warning
    equal: ['alertname', 'service']
Qualities of Good Alerts
- Actionable: There should be something you can do when alerted
- Severity distinction: Only page for truly urgent matters
- Context included: Include runbook links and related dashboard links
- Prevent alert fatigue: Too many alerts cause all alerts to be ignored
- Auto-remediation first: Whenever possible, auto-recover then notify
9. On-Call Culture
On-Call Rotation Design
Weekly rotation example:
- Primary: First responder (acknowledge within 5 minutes)
- Secondary: Backup responder (escalation after 10 minutes without a response from Primary)
- Manager: Escalation after 30+ minutes unresolved
Rotation period: 1 week
Handoff: Every Monday at 10 AM
Compensation: On-call pay, compensatory time off
Incident Response Process
1. Detect
   +-- Receive alert, initial impact assessment
2. Respond
   +-- Create incident channel, assign roles
       - IC (Incident Commander): Coordination
       - Tech Lead: Technical investigation
       - Comms: Customer/stakeholder communication
3. Mitigate
   +-- Immediate action (rollback, scale out, etc.)
4. Resolve
   +-- Fix root cause, verify service recovery
5. Postmortem
   +-- Blameless retrospective, derive action items to prevent recurrence
10. Production Monitoring Stack Architecture
Recommended Stack
+---------------------------------------------+
|                   Grafana                   |
|      (Dashboards, Alerts, Exploration)      |
+------+---------------+---------------+------+
       |               |               |
  +----v-----+   +-----v-----+   +-----v----+
  |Prometheus|   |   Loki    |   |  Tempo   |
  | (Metrics)|   |  (Logs)   |   | (Traces) |
  +----^-----+   +-----^-----+   +-----^----+
       |               |               |
  +----+---------------+---------------+----+
  |        OpenTelemetry Collector          |
  |    (Collection, Processing, Routing)    |
  +----^---------------^---------------^----+
       |               |               |
  +----+----+    +-----+-----+   +----+----+
  |Service A|    | Service B |   |Service C|
  |+OTel SDK|    | +OTel SDK |   |+OTel SDK|
  +---------+    +-----------+   +---------+
Kubernetes Environment Monitoring
# kube-prometheus-stack values.yaml (Helm)
prometheus:
  prometheusSpec:
    retention: 15d
    storageSpec:
      volumeClaimTemplate:
        spec:
          storageClassName: gp3
          resources:
            requests:
              storage: 100Gi

grafana:
  dashboardProviders:
    dashboardproviders.yaml:
      apiVersion: 1
      providers:
        - name: 'default'
          folder: ''
          type: file
          options:
            path: /var/lib/grafana/dashboards

alertmanager:
  config:
    route:
      receiver: 'slack'
      group_by: ['alertname', 'namespace']
    receivers:
      - name: 'slack'
        slack_configs:
          - api_url: 'https://hooks.slack.com/services/xxx'
            channel: '#k8s-alerts'
11. Interview Questions: 15 Essential Topics
Basics (1-5)
Q1. Explain the three pillars of observability.
Metrics, Logs, and Traces. Metrics are numerical time-series data providing aggregated views of system state. Logs are detailed text records of events. Traces show the path of a request across multiple services.
Q2. Explain Prometheus's pull model.
Prometheus directly scrapes target service /metrics endpoints periodically. Unlike push models, the server controls collection targets, and combined with service discovery, supports dynamic environments.
Q3. Explain the differences between Counter, Gauge, and Histogram.
Counter is a monotonically increasing value (total requests), Gauge is a current value that goes up and down (memory usage), Histogram observes value distributions in buckets (response time distribution).
Q4. Why is structured logging important?
Logging in a consistent format like JSON enables automated parsing, filtering, and searching. Including traceId connects logs with traces in distributed systems, making debugging much faster.
Q5. Explain the differences between SLI, SLO, and SLA.
SLI (Service Level Indicator) is an actual measured metric, SLO (Service Level Objective) is an internal target value, SLA (Service Level Agreement) is a legal contract with customers. SLAs are set more loosely than SLOs.
Intermediate (6-10)
Q6. What's the difference between PromQL's rate() and increase()?
rate() returns the average per-second increase rate, while increase() returns the total increase over a given time range. rate() is suited for graphs, increase() for total counts.
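The distinction can be sketched in a few lines of TypeScript. This is a deliberate simplification: it ignores counter resets and the boundary extrapolation real Prometheus performs, and all names are illustrative:

```typescript
// A counter sample: timestamp in seconds, cumulative value.
type Sample = { t: number; v: number };

// increase(): total growth over the range.
function increase(samples: Sample[]): number {
  return samples[samples.length - 1].v - samples[0].v;
}

// rate(): that growth divided by the range length, i.e. per-second.
function rate(samples: Sample[]): number {
  const dt = samples[samples.length - 1].t - samples[0].t;
  return increase(samples) / dt;
}

const samples5m: Sample[] = [
  { t: 0, v: 100 },
  { t: 150, v: 400 },
  { t: 300, v: 700 }, // 600 requests over 5 minutes
];

console.log(increase(samples5m), rate(samples5m)); // 600, 2 (req/s)
```

Same data, two views: graphs want the per-second rate, totals and SLO windows want the increase.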
Q7. Explain the role of the OpenTelemetry Collector.
It receives telemetry data (metrics, logs, traces) through Receivers, processes it through Processors (batching, filtering), and sends it to multiple backends through Exporters. It acts as an intermediate layer between applications and backends, preventing vendor lock-in.
Q8. Explain the concept and application of Error Budget.
The error ratio allowed by an SLO. With 99.9% SLO, the error budget is 0.1% (43 min/month). When budget remains, deploy new features. When exhausted, focus on stabilization. It manages the balance between development velocity and reliability with numbers.
Q9. What is the relationship between Span and Trace in Distributed Tracing?
A Trace is the complete journey of a request through the system. A Span is an individual unit of work within that journey. Spans form a tree with parent-child relationships, each containing start/end times, attributes, and status.
Q10. What are the key differences between Grafana Loki and ELK?
ELK full-text indexes log content providing powerful search but with high storage costs. Loki only indexes labels and stores log text compressed, resulting in lower costs but requiring label-based filtering before text search.
Advanced (11-15)
Q11. How do you prevent alert fatigue?
Set only actionable alerts with clearly distinguished severity levels. Use inhibit rules to suppress duplicate alerts and grouping to consolidate similar ones. Regularly review alerts to eliminate noise.
Q12. Why are Prometheus Recording Rules needed?
They pre-compute complex PromQL queries and store results as new time series. This reduces dashboard loading times and prevents repeated execution of the same queries. Especially effective for long-range queries like SLO dashboards.
Q13. What is Context Propagation in OpenTelemetry?
The mechanism for propagating trace context (trace ID, span ID) between services. Propagated through HTTP headers (W3C Trace Context) or message queue metadata, enabling end-to-end tracking of a single request in distributed systems.
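The W3C Trace Context header mentioned above is concrete enough to parse by hand: a `traceparent` value has the shape `version-traceId-spanId-flags`. A happy-path parsing sketch (real SDKs handle this via their propagator APIs):

```typescript
// Parse a W3C traceparent header: "version-traceId-spanId-flags".
function parseTraceparent(header: string) {
  const m = /^([0-9a-f]{2})-([0-9a-f]{32})-([0-9a-f]{16})-([0-9a-f]{2})$/.exec(header);
  if (!m) return null;
  return {
    version: m[1],
    traceId: m[2],
    spanId: m[3],
    sampled: (parseInt(m[4], 16) & 1) === 1, // lowest flag bit = sampled
  };
}

const ctx = parseTraceparent(
  "00-4bf92f3577b34da6a3ce929d0e0e4736-00f067aa0ba902b7-01"
);
console.log(ctx?.traceId); // "4bf92f3577b34da6a3ce929d0e0e4736"
```

Each downstream service reuses the incoming traceId and issues a new spanId with the caller's spanId as parent, which is exactly what stitches the per-service spans into one trace.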
Q14. Compare Golden Signals with RED/USE methodologies.
Google's Golden Signals are Latency, Traffic, Errors, Saturation. RED (Rate, Errors, Duration) is service-oriented, USE (Utilization, Saturation, Errors) is infrastructure-oriented. Typically apply RED for services and USE for infrastructure.
Q15. What are the core principles of blameless postmortems?
Focus on system failures rather than blaming individuals. Reconstruct the timeline, analyze contributing factors, and derive specific, measurable action items. The goal is to improve systems so the same problem does not recur.
12. Practice Quiz: 5 Questions
Q1. Which Prometheus metric type is most suitable for representing fluctuating values like "current memory usage"?
Answer: Gauge
Gauge represents instantaneous values that go up and down. Counter only increases monotonically, making it unsuitable for values like memory usage that can decrease. Histogram is used for measuring distributions.
Q2. With a 99.9% SLO, approximately how many minutes of error budget (allowed downtime) do you have over 30 days?
Answer: Approximately 43 minutes
30 days = 43,200 minutes. Error budget = 43,200 x 0.001 = 43.2 minutes. Outages within this time do not violate the SLO.
Q3. What are the three main components of the OpenTelemetry Collector?
Answer: Receivers, Processors, Exporters
Receivers receive data (OTLP, Prometheus, etc.), Processors process data (batching, filtering, adding attributes, etc.), and Exporters send data to backends (Jaeger, Prometheus, Loki, etc.).
Q4. What does this PromQL query calculate? histogram_quantile(0.95, sum(rate(http_request_duration_seconds_bucket[5m])) by (le))
Answer: The 95th percentile (P95) of HTTP request response times over the last 5 minutes
histogram_quantile calculates quantiles from histogram buckets. 0.95 means 95%, extracting P95 from bucket data grouped by the le (less than or equal) label.
Q5. What problem occurs without "Context Propagation" in Distributed Tracing?
Answer: Requests across services cannot be linked into a single trace.
Without Context Propagation, each service creates independent traces. When a single user request traverses multiple services, you cannot see the full path, making debugging in distributed systems extremely difficult.
References
- Prometheus Official Documentation
- Grafana Official Documentation
- OpenTelemetry Official Documentation
- Jaeger Official Documentation
- Grafana Loki Documentation
- Grafana Tempo Documentation
- Google SRE Book
- Google SRE Workbook
- PromQL Cheat Sheet
- Sloth - SLO Generator
- OpenTelemetry Collector Configuration
- Alertmanager Routing Tree
- LogQL Documentation
- Pino Logger (Node.js)
- kube-prometheus-stack Helm Chart
- DORA Metrics Guide