Prometheus Complete Guide — Metrics, PromQL, Alerting, Dashboards, and Best Practices
- Introduction
- 1. Core Concepts
- 2. Architecture
- 3. Installation and Configuration
- 4. PromQL in Practice
- 5. Alertmanager Alerting Setup
- 6. Grafana Dashboard Integration
- 7. Best Practices for Large-Scale Operations
- 8. Operational Checklist
- 9. Common Mistakes
- 10. Summary
- Quiz
Introduction
Modern infrastructure is distributed across microservices, containers, and serverless functions, with hundreds of components operating simultaneously. To answer the question "Is the system healthy right now?" in such an environment, systematic metric collection and monitoring are essential. Prometheus is a CNCF graduated project and the de facto standard monitoring system for the Kubernetes ecosystem, providing pull-based metric collection, a powerful query language called PromQL, and a flexible alerting system.
This article covers everything from Prometheus core concepts to architecture, installation and configuration, practical PromQL queries, Alertmanager alerting setup, Grafana dashboard integration, best practices for large-scale operations, an operational checklist, and common mistakes — all in a single comprehensive guide. This is a production-oriented, hands-on resource that you can apply immediately.
1. Core Concepts
Pull-Based Model
Unlike push-based monitoring systems where targets send metrics to a central server, Prometheus uses a pull-based model where the Prometheus server periodically scrapes each target's /metrics endpoint. The advantages of this approach include:
- Monitoring targets do not need to know about Prometheus
- It naturally adapts to dynamic environments when combined with service discovery
- When a target goes down, scrape failure is detected immediately
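To make the pull model concrete, here is a minimal sketch of a scrape target: a plain-stdlib HTTP server exposing a counter at /metrics in the Prometheus text exposition format. A real service would use an official client library (e.g., prometheus_client); the metric name and handler here are illustrative only.

```python
import threading
import urllib.request
from http.server import BaseHTTPRequestHandler, HTTPServer

REQUEST_COUNT = {"GET": 0}  # toy counter state keyed by label value

def render_metrics() -> str:
    # Render counter samples in the Prometheus text exposition format.
    lines = [
        "# HELP http_requests_total Total HTTP requests.",
        "# TYPE http_requests_total counter",
    ]
    for method, value in REQUEST_COUNT.items():
        lines.append(f'http_requests_total{{method="{method}"}} {value}')
    return "\n".join(lines) + "\n"

class MetricsHandler(BaseHTTPRequestHandler):
    def do_GET(self):
        if self.path == "/metrics":
            body = render_metrics().encode()
            self.send_response(200)
            self.send_header("Content-Type", "text/plain; version=0.0.4")
            self.send_header("Content-Length", str(len(body)))
            self.end_headers()
            self.wfile.write(body)
        else:
            self.send_error(404)

    def log_message(self, *args):  # keep demo output quiet
        pass

if __name__ == "__main__":
    # Serve on an ephemeral port and scrape ourselves once, like Prometheus would.
    server = HTTPServer(("127.0.0.1", 0), MetricsHandler)
    threading.Thread(target=server.serve_forever, daemon=True).start()
    REQUEST_COUNT["GET"] += 1
    url = f"http://127.0.0.1:{server.server_port}/metrics"
    print(urllib.request.urlopen(url).read().decode())
    server.shutdown()
```

Note that the target knows nothing about who scrapes it — Prometheus simply needs the host:port, which service discovery can supply dynamically.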
Time Series Data
Prometheus stores all data as time series. Each time series is uniquely identified by a combination of a metric name and key-value label pairs.
http_requests_total{method="GET", handler="/api/users", status="200"}
Metric Types
| Type | Description | Use Case | PromQL Usage |
|---|---|---|---|
| Counter | Monotonically increasing cumulative value | Request count, error count | rate(), increase() |
| Gauge | Value that can go up and down | CPU usage, memory, temperature | Direct use, avg_over_time() |
| Histogram | Measures distribution of values across buckets | Response time, request size | histogram_quantile() |
| Summary | Calculates quantiles on the client side | Response time (non-aggregatable) | Direct use (not recommended) |
Histogram vs Summary: Histograms allow server-side quantile calculation, making aggregation across multiple instances possible. Summaries calculate quantiles on the client side, making aggregation mathematically impossible. In most cases, Histogram is recommended.
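The reason histogram buckets aggregate cleanly is that each bucket is just a counter, so counts from multiple instances can be summed element-wise by upper bound — the same operation as PromQL's `sum by (le) (rate(..._bucket[5m]))`. A rough sketch (bucket data is illustrative):

```python
from collections import defaultdict

def merge_buckets(instances):
    # Element-wise sum of cumulative bucket counts keyed by upper bound ('le').
    # Valid because bucket counts are plain counters; pre-computed Summary
    # quantiles admit no such merge operation.
    merged = defaultdict(float)
    for buckets in instances:
        for le, count in buckets:
            merged[le] += count
    return sorted(merged.items())
```

Averaging two per-instance p95 values, by contrast, does not yield the global p95 — which is why Summary quantiles cannot be aggregated.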
2. Architecture
The Prometheus ecosystem consists of several components working together organically.
graph TB
subgraph Targets
A[Application /metrics]
B[Node Exporter]
C[cAdvisor]
D[Custom Exporter]
end
subgraph "Prometheus Server"
E[Retrieval<br/>Scrape Engine]
F[TSDB<br/>Time Series Database]
G[HTTP Server<br/>PromQL API]
end
H[Service Discovery<br/>Kubernetes / Consul / DNS]
I[Pushgateway<br/>For Short-lived Jobs]
J[Alertmanager<br/>Alert Routing/Grouping]
K[Grafana<br/>Dashboards]
H --> E
A --> E
B --> E
C --> E
D --> E
I --> E
E --> F
F --> G
G --> K
G --> J
J --> L[Slack / PagerDuty / Email]
| Component | Role |
|---|---|
| Prometheus Server | Scrapes metrics, stores in TSDB, runs PromQL query engine |
| Exporters | Expose target system metrics in Prometheus format (node_exporter, mysqld_exporter, etc.) |
| Pushgateway | Relays metrics from short-lived batch jobs that are difficult to scrape |
| Alertmanager | Routes, groups, deduplicates, and silences alerts based on alerting rules |
| Service Discovery | Dynamically discovers scrape targets from Kubernetes, Consul, DNS, etc. |
3. Installation and Configuration
Full Stack Setup with Docker Compose
# docker-compose.yml
version: '3.8'
services:
prometheus:
image: prom/prometheus:v2.53.0
container_name: prometheus
ports:
- '9090:9090'
volumes:
- ./prometheus/prometheus.yml:/etc/prometheus/prometheus.yml
- ./prometheus/rules/:/etc/prometheus/rules/
- prometheus_data:/prometheus
command:
- '--config.file=/etc/prometheus/prometheus.yml'
- '--storage.tsdb.path=/prometheus'
- '--storage.tsdb.retention.time=30d'
- '--web.enable-lifecycle'
- '--web.enable-admin-api'
restart: unless-stopped
alertmanager:
image: prom/alertmanager:v0.27.0
container_name: alertmanager
ports:
- '9093:9093'
volumes:
- ./alertmanager/alertmanager.yml:/etc/alertmanager/alertmanager.yml
restart: unless-stopped
grafana:
image: grafana/grafana:11.1.0
container_name: grafana
ports:
- '3000:3000'
environment:
- GF_SECURITY_ADMIN_PASSWORD=admin
volumes:
- grafana_data:/var/lib/grafana
- ./grafana/provisioning/:/etc/grafana/provisioning/
restart: unless-stopped
node-exporter:
image: prom/node-exporter:v1.8.1
container_name: node-exporter
ports:
- '9100:9100'
restart: unless-stopped
volumes:
prometheus_data:
grafana_data:
prometheus.yml Configuration
# prometheus/prometheus.yml
global:
scrape_interval: 15s # Default scrape interval
evaluation_interval: 15s # Alert rule evaluation interval
scrape_timeout: 10s # Scrape timeout
alerting:
alertmanagers:
- static_configs:
- targets:
- alertmanager:9093
rule_files:
- /etc/prometheus/rules/*.yml
scrape_configs:
# Prometheus self-monitoring
- job_name: 'prometheus'
static_configs:
- targets: ['localhost:9090']
# Node Exporter
- job_name: 'node-exporter'
static_configs:
- targets: ['node-exporter:9100']
# Application (static targets with extra labels)
- job_name: 'app'
metrics_path: /metrics
scheme: http
static_configs:
- targets: ['app:8080']
labels:
env: production
team: backend
# Kubernetes Service Discovery (reference)
- job_name: 'kubernetes-pods'
kubernetes_sd_configs:
- role: pod
relabel_configs:
- source_labels: [__meta_kubernetes_pod_annotation_prometheus_io_scrape]
action: keep
regex: true
- source_labels: [__meta_kubernetes_pod_annotation_prometheus_io_path]
action: replace
target_label: __metrics_path__
regex: (.+)
- source_labels: [__meta_kubernetes_namespace]
action: replace
target_label: namespace
- source_labels: [__meta_kubernetes_pod_name]
action: replace
target_label: pod
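The relabeling rules above can feel opaque at first. The toy function below sketches the semantics of the two actions used here — `keep` drops the target entirely when the regex does not match, and `replace` writes a captured value into a target label. This is a simplified model for intuition, not the real implementation (Prometheus regexes are fully anchored, which `re.fullmatch` mimics):

```python
import re

def relabel(labels, rules):
    # Apply a list of relabel rules to a target's label set.
    # Returns the rewritten labels, or None if a 'keep' rule drops the target.
    labels = dict(labels)
    for rule in rules:
        value = ";".join(labels.get(name, "") for name in rule["source_labels"])
        match = re.fullmatch(rule.get("regex", "(.*)"), value)
        if rule["action"] == "keep":
            if match is None:
                return None  # target is dropped from scraping
        elif rule["action"] == "replace" and match is not None:
            # Prometheus writes "$1"-style group references; re uses "\1"
            replacement = rule.get("replacement", "$1").replace("$", "\\")
            labels[rule["target_label"]] = match.expand(replacement)
    return labels
```

For example, a pod lacking the `prometheus.io/scrape: "true"` annotation would be returned as None (not scraped), while a matching pod gets a clean `pod` label copied from the discovery metadata.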
4. PromQL in Practice
rate() and irate()
# Per-second request rate over 5 minutes (suitable for alerts)
sum(rate(http_requests_total[5m])) by (service)
# Instantaneous rate of change for dashboards ($__rate_interval is a Grafana variable)
sum(irate(http_requests_total[$__rate_interval])) by (service)
Use rate() for alerting rules and irate() for dashboards. The range window for rate() should be at least 4 times the scrape_interval.
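What rate() does under the hood can be approximated in a few lines: sum the increases between consecutive samples (treating any drop as a counter reset) and divide by the window. This is a simplified sketch — the real engine also extrapolates to the window boundaries:

```python
def counter_increase(samples):
    # samples: list of (timestamp_seconds, counter_value), oldest first.
    # A drop in value means the process restarted and the counter reset to 0,
    # so the post-reset value itself is the increase since the reset.
    total = 0.0
    prev = samples[0][1]
    for _, value in samples[1:]:
        total += value if value < prev else value - prev
        prev = value
    return total

def simple_rate(samples):
    # Per-second average increase over the sample window, like PromQL's rate()
    # minus the boundary extrapolation.
    span = samples[-1][0] - samples[0][0]
    return counter_increase(samples) / span
```

This also shows why the window must contain several samples: with fewer than two points there is nothing to subtract, and the query returns no data.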
histogram_quantile()
# 95th percentile response time
histogram_quantile(0.95,
sum(rate(http_request_duration_seconds_bucket[5m])) by (le, service)
)
# 50th percentile (median)
histogram_quantile(0.50,
sum(rate(http_request_duration_seconds_bucket[5m])) by (le)
)
Aggregation Operators
# Top 5 services by error rate
topk(5,
sum(rate(http_requests_total{status=~"5.."}[5m])) by (service)
/
sum(rate(http_requests_total[5m])) by (service)
)
# Memory usage by namespace
sum(container_memory_usage_bytes{container!=""}) by (namespace)
# Nodes with CPU usage exceeding 80% (average idle fraction below 20%)
(1 - avg by (instance) (rate(node_cpu_seconds_total{mode="idle"}[5m]))) > 0.8
# Predict disk space 4 hours from now using predict_linear
predict_linear(
node_filesystem_avail_bytes{mountpoint="/"}[6h], 4 * 3600
) < 0
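predict_linear() is an ordinary least-squares fit over the samples in the range, extrapolated forward. A compact sketch of the same idea (assumes at least two samples with distinct timestamps):

```python
def predict_linear(samples, seconds_ahead):
    # Ordinary least-squares fit over (timestamp, value) samples, evaluated
    # seconds_ahead past the newest sample — the same idea as PromQL's
    # predict_linear().
    n = len(samples)
    mean_t = sum(t for t, _ in samples) / n
    mean_v = sum(v for _, v in samples) / n
    covariance = sum((t - mean_t) * (v - mean_v) for t, v in samples)
    variance = sum((t - mean_t) ** 2 for t, _ in samples)
    slope = covariance / variance
    intercept = mean_v - slope * mean_t
    return intercept + slope * (samples[-1][0] + seconds_ahead)
```

A negative predicted value for available bytes is exactly the `< 0` condition the disk query above alerts on.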
Useful Functions Reference
| Function | Purpose | Example |
|---|---|---|
| rate() | Average per-second rate over time range | rate(http_requests_total[5m]) |
| irate() | Instantaneous rate between last two points | irate(http_requests_total[5m]) |
| increase() | Total increase over time range | increase(http_requests_total[1h]) |
| histogram_quantile() | Calculate quantile from histogram | histogram_quantile(0.99, ...) |
| predict_linear() | Predict future value via linear regression | predict_linear(disk_free[6h], 3600*4) |
| absent() | Returns 1 if time series is missing | absent(up{job="app"}) |
| changes() | Number of value changes in time range | changes(process_start_time_seconds[1h]) |
5. Alertmanager Alerting Setup
Defining Alerting Rules
# prometheus/rules/alerts.yml
groups:
- name: instance-alerts
rules:
- alert: InstanceDown
expr: up == 0
for: 3m
labels:
severity: critical
annotations:
summary: 'Instance {{ $labels.instance }} is down'
description: '{{ $labels.instance }} of job {{ $labels.job }} has been unreachable for more than 3 minutes.'
- alert: HighErrorRate
expr: |
sum(rate(http_requests_total{status=~"5.."}[5m])) by (service)
/
sum(rate(http_requests_total[5m])) by (service)
> 0.05
for: 5m
labels:
severity: warning
annotations:
summary: '{{ $labels.service }} error rate exceeds {{ $value | humanizePercentage }}'
description: 'Service {{ $labels.service }} has a 5xx error rate exceeding 5%.'
- alert: HighMemoryUsage
expr: |
(1 - node_memory_MemAvailable_bytes / node_memory_MemTotal_bytes) > 0.9
for: 10m
labels:
severity: warning
annotations:
summary: 'Node memory usage exceeds 90%'
- alert: DiskSpaceRunningOut
expr: |
predict_linear(node_filesystem_avail_bytes{mountpoint="/"}[6h], 24*3600) < 0
for: 30m
labels:
severity: warning
annotations:
summary: 'Disk space predicted to run out within 24 hours'
- alert: HighLatencyP95
expr: |
histogram_quantile(0.95,
sum(rate(http_request_duration_seconds_bucket[5m])) by (le, service)
) > 1.0
for: 5m
labels:
severity: warning
annotations:
summary: '{{ $labels.service }} P95 latency exceeds 1 second'
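Alerting rules like these can be unit-tested with promtool before deployment. The fragment below is a sketch of such a test for the InstanceDown rule (file name and rule path are illustrative; run with `promtool test rules rules-test.yml`). With `up` going to 0 from the 2-minute mark and `for: 3m`, the alert should be firing when evaluated at 6 minutes:

```yaml
# rules-test.yml
rule_files:
  - rules/alerts.yml
evaluation_interval: 1m
tests:
  - interval: 1m
    input_series:
      - series: 'up{job="app", instance="app:8080"}'
        values: '1 1 0 0 0 0 0'
    alert_rule_test:
      - eval_time: 6m
        alertname: InstanceDown
        exp_alerts:
          - exp_labels:
              severity: critical
              job: app
              instance: app:8080
```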
Alertmanager Configuration
# alertmanager/alertmanager.yml
global:
resolve_timeout: 5m
slack_api_url: 'https://hooks.slack.com/services/YOUR/SLACK/WEBHOOK'
route:
receiver: 'default-slack'
group_by: ['alertname', 'service']
group_wait: 30s
group_interval: 5m
repeat_interval: 4h
routes:
- match:
severity: critical
receiver: 'pagerduty-critical'
repeat_interval: 1h
- match:
severity: warning
receiver: 'slack-warnings'
repeat_interval: 4h
receivers:
- name: 'default-slack'
slack_configs:
- channel: '#alerts'
title: '{{ .GroupLabels.alertname }}'
text: '{{ range .Alerts }}{{ .Annotations.summary }}{{ end }}'
send_resolved: true
- name: 'pagerduty-critical'
pagerduty_configs:
- service_key: 'YOUR_PAGERDUTY_SERVICE_KEY'
severity: critical
- name: 'slack-warnings'
slack_configs:
- channel: '#alerts-warning'
send_resolved: true
inhibit_rules:
- source_match:
severity: critical
target_match:
severity: warning
equal: ['alertname', 'instance']
6. Grafana Dashboard Integration
Data Source Provisioning
# grafana/provisioning/datasources/prometheus.yml
apiVersion: 1
datasources:
- name: Prometheus
type: prometheus
access: proxy
url: http://prometheus:9090
isDefault: true
editable: true
jsonData:
timeInterval: '15s'
httpMethod: POST
Example Dashboard Panel
{
"title": "Service Error Rate",
"type": "timeseries",
"datasource": "Prometheus",
"targets": [
{
"expr": "sum(rate(http_requests_total{status=~\"5..\"}[5m])) by (service) / sum(rate(http_requests_total[5m])) by (service)",
"legendFormat": "{{ service }}"
}
],
"fieldConfig": {
"defaults": {
"unit": "percentunit",
"thresholds": {
"steps": [
{ "value": 0, "color": "green" },
{ "value": 0.01, "color": "yellow" },
{ "value": 0.05, "color": "red" }
]
}
}
}
}
Recommended Dashboard Panels
| Panel | PromQL | Purpose |
|---|---|---|
| Error Rate | sum(rate(http_requests_total{status=~"5.."}[5m])) by (service) / sum(rate(http_requests_total[5m])) by (service) | Error rate trend per service |
| QPS | sum(rate(http_requests_total[5m])) by (service) | Queries per second |
| P95 Latency | histogram_quantile(0.95, sum(rate(http_request_duration_seconds_bucket[5m])) by (le, service)) | 95th percentile response time |
| CPU Usage | 1 - avg(rate(node_cpu_seconds_total{mode="idle"}[5m])) by (instance) | CPU usage per node |
| Memory Usage | 1 - node_memory_MemAvailable_bytes / node_memory_MemTotal_bytes | Memory usage per node |
| Disk Free | node_filesystem_avail_bytes{mountpoint="/"} | Root filesystem remaining capacity |
7. Best Practices for Large-Scale Operations
Federation
Federation is a pattern where a global Prometheus server scrapes selected metrics from multiple local Prometheus servers. Each team or cluster runs its own local Prometheus, and only metrics needed for a global view are federated upward.
# Global Prometheus scrape_configs
scrape_configs:
- job_name: 'federate'
honor_labels: true
metrics_path: '/federate'
params:
'match[]':
- '{job=~".+"}' # All job metrics
- '{__name__=~"job:.*"}' # Only recording rule results
static_configs:
- targets:
- 'prometheus-team-a:9090'
- 'prometheus-team-b:9090'
Long-Term Storage: Thanos / Cortex / Mimir
Prometheus local TSDB is designed for 15 to 30 days of retention. For longer retention and global querying, consider the following solutions.
| Solution | Characteristics | Best For |
|---|---|---|
| Thanos | Sidecar pattern, object storage, global querying | Adding sidecars to existing Prometheus |
| Cortex | Multi-tenant, horizontally scalable, microservices architecture | Large-scale SaaS environments |
| Grafana Mimir | Cortex fork, improved performance, Grafana ecosystem integration | When using the Grafana stack |
| VictoriaMetrics | High performance, low resource usage, PromQL compatible | When cost optimization is critical |
Cardinality Management
High cardinality (explosive growth of label value combinations) is the most common cause of Prometheus performance degradation.
# Drop unnecessary labels with relabel_configs
relabel_configs:
- action: labeldrop
regex: '(pod_template_hash|controller_revision_hash)'
# Drop high-cardinality metrics with metric_relabel_configs
metric_relabel_configs:
- source_labels: [__name__]
regex: '(go_gc_.*|go_memstats_.*)'
action: drop
Cardinality inspection PromQL:
# Top 10 metrics by time series count
topk(10, count by (__name__) ({__name__=~".+"}))
# Check cardinality for a specific metric
count(http_requests_total) by (service, method, status)
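The danger is easy to underestimate because label cardinalities multiply. A back-of-envelope check (all figures illustrative, not from a real system):

```python
# Cardinality is the product of distinct values per label.
services, methods, statuses = 20, 8, 12
series_without_user = services * methods * statuses
series_with_user_id = series_without_user * 50_000  # adding a user_id label

print(series_without_user)   # a perfectly manageable series count
print(series_with_user_id)   # tens of millions of series — an unusable Prometheus
```

One unbounded label turns a trivially small metric into one that dominates memory and disk, which is why identifiers belong in logs or traces, not labels.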
8. Operational Checklist
| Item | Recommended Setting | Notes |
|---|---|---|
| Retention Period | 15-30 days (local TSDB) | Long-term: use Thanos/Mimir with object storage |
| Backup | Use TSDB snapshot API | POST /api/v1/admin/tsdb/snapshot |
| HA Setup | 2 identical Prometheus instances + Alertmanager clustering | Alertmanager uses gossip protocol to prevent duplicate alerts |
| Security | TLS + Basic Auth or OAuth2 Proxy | Configure via --web.config.file for TLS/Auth |
| Resources | ~2GB RAM per 1 million time series | Account for WAL and head chunks memory |
| Scrape Interval | 15s (default), 10s for critical metrics | Too short increases load, too long reduces resolution |
| Alert Testing | amtool check-config, promtool check rules | Integrate into CI/CD pipeline |
| Recording Rules | Pre-compute frequently used queries | Improves dashboard performance, follow record: naming convention |
| Cardinality Monitoring | Watch prometheus_tsdb_head_series trend | Investigate root cause metric on sudden increase |
9. Common Mistakes
- Using too short a range window with rate() -- Use at least 4x the scrape_interval (e.g., `[1m]` for a 15s interval, `[2m]` for 30s). A window that is too short may contain fewer than two samples, producing gaps in the result.
- Using irate() in alerting rules -- irate() calculates the instantaneous rate and is sensitive to noise. Always use rate() for alerting rules.
- Defining alerts without a `for` clause -- With `for: 0s`, a single spike triggers an alert. Set a minimum of 3-5 minutes for the `for` period.
- Using high-cardinality labels -- Putting unique values like user_id, request_id, or trace_id in labels causes unbounded time series growth. Store such values in logs or traces instead.
- Overusing the Summary metric type -- Summary quantiles cannot be aggregated across instances. Use Histogram when running multiple instances.
- Overusing the Pushgateway -- Pushgateway is designed for short-lived batch jobs only. Routing long-running service metrics through Pushgateway makes it impossible to detect target downtime.
- Ignoring Alertmanager routing priority -- Route matching proceeds from top to bottom. Place more specific matches first and the default receiver last.
- Not following the Recording Rules naming convention -- Follow the `level:metric:operations` format (e.g., `job:http_requests_total:rate5m`) for easy identification in dashboards and alerts.
- Mismatch between TSDB retention period and disk capacity -- Setting retention to 90 days when the disk can only hold 30 days of data will fill the disk and crash Prometheus. Use `--storage.tsdb.retention.size` in parallel to enforce a size-based retention limit.
- Not monitoring the monitoring system itself -- Always monitor Prometheus's own `up`, `prometheus_tsdb_head_series`, and `prometheus_engine_query_duration_seconds` metrics.
10. Summary
Prometheus is the standard monitoring system for cloud-native environments. Here are the key takeaways:
- The pull-based model naturally adapts to dynamic environments and integrates with service discovery
- Among the 4 metric types (Counter, Gauge, Histogram, Summary), prefer Histogram
- PromQL provides powerful functions such as rate(), histogram_quantile(), and predict_linear()
- Alertmanager reduces alert fatigue through routing, grouping, inhibition, and deduplication
- Grafana integration enables dashboards for error rates, QPS, latency, and resource usage
- For large-scale environments, use Thanos/Mimir for long-term storage, Federation for global views, and cardinality management for sustained performance
- Regularly review the operational checklist and avoid common mistakes
By adapting the configurations and queries covered in this guide to your specific environment, you can build a stable and scalable monitoring system.
Quiz
Q1: What is the main reason Prometheus uses a pull-based model instead of a push-based model?
In the pull-based model, monitoring targets do not need to know about Prometheus, and it integrates naturally with service discovery. Additionally, when a target goes down, scrape failure is detected immediately. This makes it particularly well-suited for dynamically scaling cloud-native environments.
Q2: Why should you apply rate() to Counter-type metrics instead of using the raw value?
A Counter is a monotonically increasing cumulative value, so looking at the raw value just shows an ever-growing number that provides little meaningful insight. Applying rate() calculates the per-second rate of change, allowing you to understand current throughput (e.g., requests per second). Additionally, rate() automatically handles counter resets caused by restarts.
Q3: What is the key difference between Histogram and Summary, and why is Histogram recommended?
Histogram allows server-side quantile calculation using the histogram_quantile() function, making it possible to aggregate data across multiple instances. Summary calculates quantiles on the client side, making cross-instance aggregation mathematically impossible. Since running multiple instances is standard in microservices environments, Histogram is recommended.
Q4: Why is using irate() instead of rate() problematic in alerting rules?
irate() calculates the instantaneous rate between the two most recent data points, making it highly sensitive to short-lived spikes. When used in alerting, it triggers on transient noise, causing a surge in false positives and alert fatigue. rate() averages over the entire specified time range, making it resilient to temporary fluctuations.
Q5: What are the roles of group_by, group_wait, and group_interval in Alertmanager?
group_by specifies the labels used to group alerts together (e.g., alertname, service). group_wait is the initial waiting period for alerts in the same group to accumulate (e.g., 30 seconds). group_interval is the interval for resending when new alerts join an already-notified group. Properly combining these settings prevents alert storms.
Q6: Why is high cardinality dangerous for Prometheus, and how can it be mitigated?
When label value combinations grow explosively, the number of time series surges, causing Prometheus memory (head chunks) and disk usage to increase exponentially. Mitigation strategies include avoiding unique values like user_id in labels, using metric_relabel_configs to drop unnecessary metrics, and monitoring the prometheus_tsdb_head_series metric to track time series count.
Q7: What are the main differences between Thanos and Grafana Mimir?
Thanos uses a sidecar pattern added to existing Prometheus instances, uploading blocks to object storage for long-term retention and global querying. Grafana Mimir (a Cortex fork) is a self-contained, horizontally scalable microservices architecture that receives data via remote_write, with strengths in multi-tenancy and Grafana ecosystem integration. If you want minimal changes to existing Prometheus, choose Thanos; if you use the Grafana stack, Mimir is the better fit.
Q8: When is the predict_linear() function useful, and how is it applied in alerting?
predict_linear() performs linear regression based on historical data to predict values at a future point in time. It is useful for proactively detecting gradual issues like disk space exhaustion. For example, predict_linear(node_filesystem_avail_bytes[6h], 24*3600) < 0 means "based on the trend over the last 6 hours, the disk will be full within 24 hours," allowing a preemptive alert to be sent.