Prometheus Complete Guide — Metrics, PromQL, Alerting, Dashboards, and Best Practices

Introduction

Modern infrastructure is distributed across microservices, containers, and serverless functions, with hundreds of components operating simultaneously. To answer the question "Is the system healthy right now?" in such an environment, systematic metric collection and monitoring are essential. Prometheus is a CNCF graduated project and the de facto standard monitoring system for the Kubernetes ecosystem, providing pull-based metric collection, a powerful query language called PromQL, and a flexible alerting system.

This article covers everything from Prometheus core concepts to architecture, installation and configuration, practical PromQL queries, Alertmanager alerting setup, Grafana dashboard integration, best practices for large-scale operations, an operational checklist, and common mistakes — all in a single comprehensive guide. This is a production-oriented, hands-on resource that you can apply immediately.


1. Core Concepts

Pull-Based Model

Unlike push-based monitoring systems where targets send metrics to a central server, Prometheus uses a pull-based model where the Prometheus server periodically scrapes each target's /metrics endpoint. The advantages of this approach include:

  • Monitoring targets do not need to know about Prometheus
  • It naturally adapts to dynamic environments when combined with service discovery
  • When a target goes down, scrape failure is detected immediately

Time Series Data

Prometheus stores all data as time series. Each time series is uniquely identified by a combination of a metric name and key-value label pairs.

http_requests_total{method="GET", handler="/api/users", status="200"}
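
To make series identity concrete, here is a toy Python sketch (the name series_key is ours, not from any client library): a series is the metric name plus the *set* of label pairs, so label order does not matter, but changing any label value creates a new series.

```python
def series_key(name, labels):
    """A time series is identified by its metric name plus the
    set of label key/value pairs; sorting makes order irrelevant."""
    return (name, tuple(sorted(labels.items())))

# The same series, regardless of label ordering:
a = series_key("http_requests_total", {"method": "GET", "status": "200"})
b = series_key("http_requests_total", {"status": "200", "method": "GET"})
# Changing any label value produces a different series:
c = series_key("http_requests_total", {"method": "GET", "status": "500"})
```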

Metric Types

| Type | Description | Use Case | PromQL Usage |
|---|---|---|---|
| Counter | Monotonically increasing cumulative value | Request count, error count | rate(), increase() |
| Gauge | Value that can go up and down | CPU usage, memory, temperature | Direct use, avg_over_time() |
| Histogram | Measures distribution of values across buckets | Response time, request size | histogram_quantile() |
| Summary | Calculates quantiles on the client side | Response time (non-aggregatable) | Direct use (not recommended) |

Histogram vs Summary: Histograms allow server-side quantile calculation, making aggregation across multiple instances possible. Summaries calculate quantiles on the client side, making aggregation mathematically impossible. In most cases, Histogram is recommended.
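
To see why server-side quantile calculation works, here is a minimal stdlib Python sketch of the linear interpolation histogram_quantile() performs over cumulative buckets. This is a simplification for illustration, not a reproduction of the actual PromQL implementation:

```python
def histogram_quantile(q, buckets):
    """Approximate quantile q from cumulative histogram buckets.

    `buckets` is a sorted list of (upper_bound, cumulative_count);
    the final bound should be float('inf') (the +Inf bucket).
    Observations are assumed evenly distributed within a bucket,
    which is the same assumption PromQL makes.
    """
    total = buckets[-1][1]
    rank = q * total                      # position of the target observation
    lower_bound, lower_count = 0.0, 0.0
    for upper_bound, cumulative in buckets:
        if cumulative >= rank:
            if upper_bound == float("inf"):
                return lower_bound        # no upper edge to interpolate toward
            # Linear interpolation inside the target bucket
            fraction = (rank - lower_count) / (cumulative - lower_count)
            return lower_bound + (upper_bound - lower_bound) * fraction
        lower_bound, lower_count = upper_bound, cumulative
    return float("nan")
```

Because the buckets are just counters, they can be summed across instances before this calculation — which is exactly what sum(rate(...)) by (le) does.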

2. Architecture

The Prometheus ecosystem consists of several components working in concert.

graph TB
    subgraph Targets
        A[Application /metrics]
        B[Node Exporter]
        C[cAdvisor]
        D[Custom Exporter]
    end

    subgraph "Prometheus Server"
        E[Retrieval<br/>Scrape Engine]
        F[TSDB<br/>Time Series Database]
        G[HTTP Server<br/>PromQL API]
    end

    H[Service Discovery<br/>Kubernetes / Consul / DNS]
    I[Pushgateway<br/>For Short-lived Jobs]
    J[Alertmanager<br/>Alert Routing/Grouping]
    K[Grafana<br/>Dashboards]

    H --> E
    A --> E
    B --> E
    C --> E
    D --> E
    I --> E
    E --> F
    F --> G
    G --> K
    G --> J
    J --> L[Slack / PagerDuty / Email]

| Component | Role |
|---|---|
| Prometheus Server | Scrapes metrics, stores them in the TSDB, and runs the PromQL query engine |
| Exporters | Expose target system metrics in Prometheus format (node_exporter, mysqld_exporter, etc.) |
| Pushgateway | Relays metrics from short-lived batch jobs that are difficult to scrape |
| Alertmanager | Routes, groups, deduplicates, and silences alerts based on alerting rules |
| Service Discovery | Dynamically discovers scrape targets from Kubernetes, Consul, DNS, etc. |

3. Installation and Configuration

Full Stack Setup with Docker Compose

# docker-compose.yml
version: '3.8'

services:
  prometheus:
    image: prom/prometheus:v2.53.0
    container_name: prometheus
    ports:
      - '9090:9090'
    volumes:
      - ./prometheus/prometheus.yml:/etc/prometheus/prometheus.yml
      - ./prometheus/rules/:/etc/prometheus/rules/
      - prometheus_data:/prometheus
    command:
      - '--config.file=/etc/prometheus/prometheus.yml'
      - '--storage.tsdb.path=/prometheus'
      - '--storage.tsdb.retention.time=30d'
      - '--web.enable-lifecycle'
      - '--web.enable-admin-api'
    restart: unless-stopped

  alertmanager:
    image: prom/alertmanager:v0.27.0
    container_name: alertmanager
    ports:
      - '9093:9093'
    volumes:
      - ./alertmanager/alertmanager.yml:/etc/alertmanager/alertmanager.yml
    restart: unless-stopped

  grafana:
    image: grafana/grafana:11.1.0
    container_name: grafana
    ports:
      - '3000:3000'
    environment:
      - GF_SECURITY_ADMIN_PASSWORD=admin
    volumes:
      - grafana_data:/var/lib/grafana
      - ./grafana/provisioning/:/etc/grafana/provisioning/
    restart: unless-stopped

  node-exporter:
    image: prom/node-exporter:v1.8.1
    container_name: node-exporter
    ports:
      - '9100:9100'
    restart: unless-stopped

volumes:
  prometheus_data:
  grafana_data:

prometheus.yml Configuration

# prometheus/prometheus.yml
global:
  scrape_interval: 15s # Default scrape interval
  evaluation_interval: 15s # Alert rule evaluation interval
  scrape_timeout: 10s # Scrape timeout

alerting:
  alertmanagers:
    - static_configs:
        - targets:
            - alertmanager:9093

rule_files:
  - /etc/prometheus/rules/*.yml

scrape_configs:
  # Prometheus self-monitoring
  - job_name: 'prometheus'
    static_configs:
      - targets: ['localhost:9090']

  # Node Exporter
  - job_name: 'node-exporter'
    static_configs:
      - targets: ['node-exporter:9100']

  # Application (using relabel_configs)
  - job_name: 'app'
    metrics_path: /metrics
    scheme: http
    static_configs:
      - targets: ['app:8080']
        labels:
          env: production
          team: backend

  # Kubernetes Service Discovery (reference)
  - job_name: 'kubernetes-pods'
    kubernetes_sd_configs:
      - role: pod
    relabel_configs:
      - source_labels: [__meta_kubernetes_pod_annotation_prometheus_io_scrape]
        action: keep
        regex: true
      - source_labels: [__meta_kubernetes_pod_annotation_prometheus_io_path]
        action: replace
        target_label: __metrics_path__
        regex: (.+)
      - source_labels: [__meta_kubernetes_namespace]
        action: replace
        target_label: namespace
      - source_labels: [__meta_kubernetes_pod_name]
        action: replace
        target_label: pod

4. PromQL in Practice

rate() and irate()

# Per-second request rate over 5 minutes (suitable for alerts)
sum(rate(http_requests_total[5m])) by (service)

# Instantaneous rate of change (suitable for dashboard visualization)
sum(irate(http_requests_total[$__rate_interval])) by (service)

Use rate() for alerting rules and irate() for dashboards. The range window for rate() should be at least 4 times the scrape_interval.
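
As an illustration of what rate() computes over a Counter, here is a simplified stdlib Python sketch: it sums the increases between consecutive samples in the window and compensates for counter resets (a sample lower than its predecessor means the process restarted from zero). The real PromQL implementation additionally extrapolates to the window boundaries, which this sketch omits:

```python
def prom_rate(samples, window_seconds):
    """Simplified rate(): per-second increase of a counter over a window,
    handling counter resets. `samples` are the raw counter values
    scraped inside the window, oldest first."""
    increase = 0.0
    prev = samples[0]
    for value in samples[1:]:
        if value < prev:
            # Counter reset (e.g., process restart): the counter started
            # again from zero, so the whole new value is an increase.
            increase += value
        else:
            increase += value - prev
        prev = value
    return increase / window_seconds
```

The reset handling is why rate() must be applied to the raw counter, never to a value that has already been aggregated with sum().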

histogram_quantile()

# 95th percentile response time
histogram_quantile(0.95,
  sum(rate(http_request_duration_seconds_bucket[5m])) by (le, service)
)

# 50th percentile (median)
histogram_quantile(0.50,
  sum(rate(http_request_duration_seconds_bucket[5m])) by (le)
)

Aggregation Operators

# Top 5 services by error rate
topk(5,
  sum(rate(http_requests_total{status=~"5.."}[5m])) by (service)
  /
  sum(rate(http_requests_total[5m])) by (service)
)

# Memory usage by namespace
sum(container_memory_usage_bytes{container!=""}) by (namespace)

# Nodes with CPU usage exceeding 80% (1 minus the idle ratio over 5m)
(1 - avg by (instance) (rate(node_cpu_seconds_total{mode="idle"}[5m]))) > 0.8

# Predict disk space 4 hours from now using predict_linear
predict_linear(
  node_filesystem_avail_bytes{mountpoint="/"}[6h], 4 * 3600
) < 0
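
Under the hood, predict_linear() is an ordinary least-squares line fit extrapolated into the future. A stdlib Python sketch of the idea (timestamps in seconds; an illustration of the method, not the exact PromQL code):

```python
def predict_linear(samples, seconds_ahead):
    """Least-squares linear fit over (timestamp, value) samples,
    extrapolated `seconds_ahead` past the last timestamp."""
    n = len(samples)
    mean_t = sum(t for t, _ in samples) / n
    mean_v = sum(v for _, v in samples) / n
    cov = sum((t - mean_t) * (v - mean_v) for t, v in samples)
    var = sum((t - mean_t) ** 2 for t, _ in samples)
    slope = cov / var                     # e.g., bytes freed/consumed per second
    intercept = mean_v - slope * mean_t
    last_t = samples[-1][0]
    return slope * (last_t + seconds_ahead) + intercept
```

With disk-space samples trending downward, a negative predicted value means "the disk will be full before that point," which is what the `< 0` comparison in the query above checks.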

Useful Functions Reference

| Function | Purpose | Example |
|---|---|---|
| rate() | Average per-second rate over a time range | rate(http_requests_total[5m]) |
| irate() | Instantaneous rate between the last two points | irate(http_requests_total[5m]) |
| increase() | Total increase over a time range | increase(http_requests_total[1h]) |
| histogram_quantile() | Calculate a quantile from a histogram | histogram_quantile(0.99, ...) |
| predict_linear() | Predict a future value via linear regression | predict_linear(disk_free[6h], 3600*4) |
| absent() | Returns 1 if the time series is missing | absent(up{job="app"}) |
| changes() | Number of value changes in a time range | changes(process_start_time_seconds[1h]) |

5. Alertmanager Alerting Setup

Defining Alerting Rules

# prometheus/rules/alerts.yml
groups:
  - name: instance-alerts
    rules:
      - alert: InstanceDown
        expr: up == 0
        for: 3m
        labels:
          severity: critical
        annotations:
          summary: 'Instance {{ $labels.instance }} is down'
          description: '{{ $labels.instance }} of job {{ $labels.job }} has been unreachable for more than 3 minutes.'

      - alert: HighErrorRate
        expr: |
          sum(rate(http_requests_total{status=~"5.."}[5m])) by (service)
          /
          sum(rate(http_requests_total[5m])) by (service)
          > 0.05
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: '{{ $labels.service }} error rate exceeds {{ $value | humanizePercentage }}'
          description: 'Service {{ $labels.service }} has a 5xx error rate exceeding 5%.'

      - alert: HighMemoryUsage
        expr: |
          (1 - node_memory_MemAvailable_bytes / node_memory_MemTotal_bytes) > 0.9
        for: 10m
        labels:
          severity: warning
        annotations:
          summary: 'Node memory usage exceeds 90%'

      - alert: DiskSpaceRunningOut
        expr: |
          predict_linear(node_filesystem_avail_bytes{mountpoint="/"}[6h], 24*3600) < 0
        for: 30m
        labels:
          severity: warning
        annotations:
          summary: 'Disk space predicted to run out within 24 hours'

      - alert: HighLatencyP95
        expr: |
          histogram_quantile(0.95,
            sum(rate(http_request_duration_seconds_bucket[5m])) by (le, service)
          ) > 1.0
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: '{{ $labels.service }} P95 latency exceeds 1 second'

Alertmanager Configuration

# alertmanager/alertmanager.yml
global:
  resolve_timeout: 5m
  slack_api_url: 'https://hooks.slack.com/services/YOUR/SLACK/WEBHOOK'

route:
  receiver: 'default-slack'
  group_by: ['alertname', 'service']
  group_wait: 30s
  group_interval: 5m
  repeat_interval: 4h
  routes:
    - match:
        severity: critical
      receiver: 'pagerduty-critical'
      repeat_interval: 1h
    - match:
        severity: warning
      receiver: 'slack-warnings'
      repeat_interval: 4h

receivers:
  - name: 'default-slack'
    slack_configs:
      - channel: '#alerts'
        title: '{{ .GroupLabels.alertname }}'
        text: '{{ range .Alerts }}{{ .Annotations.summary }}{{ end }}'
        send_resolved: true

  - name: 'pagerduty-critical'
    pagerduty_configs:
      - service_key: 'YOUR_PAGERDUTY_SERVICE_KEY'
        severity: critical

  - name: 'slack-warnings'
    slack_configs:
      - channel: '#alerts-warning'
        send_resolved: true

inhibit_rules:
  - source_match:
      severity: critical
    target_match:
      severity: warning
    equal: ['alertname', 'instance']

6. Grafana Dashboard Integration

Data Source Provisioning

# grafana/provisioning/datasources/prometheus.yml
apiVersion: 1

datasources:
  - name: Prometheus
    type: prometheus
    access: proxy
    url: http://prometheus:9090
    isDefault: true
    editable: true
    jsonData:
      timeInterval: '15s'
      httpMethod: POST

Example Dashboard Panel

{
  "title": "Service Error Rate",
  "type": "timeseries",
  "datasource": "Prometheus",
  "targets": [
    {
      "expr": "sum(rate(http_requests_total{status=~\"5..\"}[5m])) by (service) / sum(rate(http_requests_total[5m])) by (service)",
      "legendFormat": "{{ service }}"
    }
  ],
  "fieldConfig": {
    "defaults": {
      "unit": "percentunit",
      "thresholds": {
        "steps": [
          { "value": 0, "color": "green" },
          { "value": 0.01, "color": "yellow" },
          { "value": 0.05, "color": "red" }
        ]
      }
    }
  }
}

| Panel | PromQL | Purpose |
|---|---|---|
| Error Rate | sum(rate(http_requests_total{status=~"5.."}[5m])) by (service) / sum(rate(http_requests_total[5m])) by (service) | Error-rate trend per service |
| QPS | sum(rate(http_requests_total[5m])) by (service) | Queries per second |
| P95 Latency | histogram_quantile(0.95, sum(rate(http_request_duration_seconds_bucket[5m])) by (le, service)) | 95th percentile response time |
| CPU Usage | 1 - avg(rate(node_cpu_seconds_total{mode="idle"}[5m])) by (instance) | CPU usage per node |
| Memory Usage | 1 - node_memory_MemAvailable_bytes / node_memory_MemTotal_bytes | Memory usage per node |
| Disk Free | node_filesystem_avail_bytes{mountpoint="/"} | Root filesystem remaining capacity |

7. Best Practices for Large-Scale Operations

Federation

Federation is a pattern where a global Prometheus server scrapes selected metrics from multiple local Prometheus servers. Each team or cluster runs its own local Prometheus, and only metrics needed for a global view are federated upward.

# Global Prometheus scrape_configs
scrape_configs:
  - job_name: 'federate'
    honor_labels: true
    metrics_path: '/federate'
    params:
      'match[]':
        - '{job=~".+"}' # All job metrics
        - '{__name__=~"job:.*"}' # Only recording rule results
    static_configs:
      - targets:
          - 'prometheus-team-a:9090'
          - 'prometheus-team-b:9090'

Long-Term Storage: Thanos / Cortex / Mimir

Prometheus local TSDB is designed for 15 to 30 days of retention. For longer retention and global querying, consider the following solutions.

| Solution | Characteristics | Best For |
|---|---|---|
| Thanos | Sidecar pattern, object storage, global querying | Adding sidecars to existing Prometheus |
| Cortex | Multi-tenant, horizontally scalable, microservices architecture | Large-scale SaaS environments |
| Grafana Mimir | Cortex fork, improved performance, Grafana ecosystem integration | When using the Grafana stack |
| VictoriaMetrics | High performance, low resource usage, PromQL-compatible | When cost optimization is critical |

Cardinality Management

High cardinality (explosive growth of label value combinations) is the most common cause of Prometheus performance degradation.

# Drop unnecessary labels with relabel_configs
relabel_configs:
  - action: labeldrop
    regex: '(pod_template_hash|controller_revision_hash)'

# Drop high-cardinality metrics with metric_relabel_configs
metric_relabel_configs:
  - source_labels: [__name__]
    regex: '(go_gc_.*|go_memstats_.*)'
    action: drop

Cardinality inspection PromQL:

# Top 10 metrics by time series count
topk(10, count by (__name__) ({__name__=~".+"}))

# Check cardinality for a specific metric
count(http_requests_total) by (service, method, status)
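
As a back-of-the-envelope check before adding a label, the worst-case series count for one metric is the product of the distinct values each label can take. A quick stdlib Python sketch (the function name is ours):

```python
from math import prod

def worst_case_series(label_values):
    """Upper bound on time-series count for one metric: the product of
    distinct values per label. Actual cardinality is usually lower,
    but this shows why an unbounded label (user_id, request_id)
    is catastrophic: the product grows without limit."""
    return prod(len(values) for values in label_values.values())

# 2 methods x 3 statuses x 50 handlers -> up to 300 series for one metric
bound = worst_case_series({
    "method": ["GET", "POST"],
    "status": ["200", "404", "500"],
    "handler": [f"/api/{i}" for i in range(50)],  # hypothetical handlers
})
```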

8. Operational Checklist

| Item | Recommended Setting | Notes |
|---|---|---|
| Retention Period | 15-30 days (local TSDB) | Long-term: use Thanos/Mimir with object storage |
| Backup | Use the TSDB snapshot API | POST /api/v1/admin/tsdb/snapshot |
| HA Setup | 2 identical Prometheus instances + Alertmanager clustering | Alertmanager uses a gossip protocol to prevent duplicate alerts |
| Security | TLS + Basic Auth or OAuth2 Proxy | Configure TLS/auth via --web.config.file |
| Resources | ~2 GB RAM per 1 million time series | Account for WAL and head-chunk memory |
| Scrape Interval | 15s (default), 10s for critical metrics | Too short increases load; too long reduces resolution |
| Alert Testing | amtool check-config, promtool check rules | Integrate into the CI/CD pipeline |
| Recording Rules | Pre-compute frequently used queries | Improves dashboard performance; follow the record: naming convention |
| Cardinality Monitoring | Watch the prometheus_tsdb_head_series trend | On a sudden increase, investigate the root-cause metric |

9. Common Mistakes

  1. Using too short a range window with rate() -- Use at least 4x the scrape_interval (e.g., [1m] for 15s interval, [2m] for 30s). A window that is too short causes missing data points, resulting in zero values.

  2. Using irate() in alerting rules -- irate() calculates the instantaneous rate and is sensitive to noise. Always use rate() for alerting rules.

  3. Defining alerts without a for clause -- With for: 0s, a single spike triggers an alert. Set a minimum of 3-5 minutes for the for period.

  4. Using high-cardinality labels -- Putting unique values like user_id, request_id, or trace_id in labels causes unbounded time series growth. Store such values in logs or traces instead.

  5. Overusing the Summary metric type -- Summary quantiles cannot be aggregated across instances. Use Histogram when running multiple instances.

  6. Overusing the Pushgateway -- Pushgateway is designed for short-lived batch jobs only. Routing long-running service metrics through Pushgateway makes it impossible to detect target downtime.

  7. Ignoring Alertmanager routing priority -- Route matching proceeds from top to bottom. Place more specific matches first and the default receiver last.

  8. Not following Recording Rules naming convention -- Follow the level:metric:operations format (e.g., job:http_requests_total:rate5m) for easy identification in dashboards and alerts.

  9. Mismatch between TSDB retention period and disk capacity -- Setting retention to 90 days when the disk can only hold 30 days of data will eventually fill the disk and crash Prometheus. Set --storage.tsdb.retention.size alongside the time-based retention to enforce a disk-based limit.

  10. Not monitoring the monitoring system itself -- Always monitor Prometheus's own up, prometheus_tsdb_head_series, and prometheus_engine_query_duration_seconds metrics.

10. Summary

Prometheus is the standard monitoring system for cloud-native environments. Here are the key takeaways:

  • The pull-based model naturally adapts to dynamic environments and integrates with service discovery
  • Among the 4 metric types (Counter, Gauge, Histogram, Summary), prefer Histogram
  • PromQL provides powerful functions such as rate(), histogram_quantile(), and predict_linear()
  • Alertmanager reduces alert fatigue through routing, grouping, inhibition, and deduplication
  • Grafana integration enables dashboards for error rates, QPS, latency, and resource usage
  • For large-scale environments, use Thanos/Mimir for long-term storage, Federation for global views, and cardinality management for sustained performance
  • Regularly review the operational checklist and avoid common mistakes

By adapting the configurations and queries covered in this guide to your specific environment, you can build a stable and scalable monitoring system.

Quiz

Q1: What is the main reason Prometheus uses a pull-based model instead of a push-based model?

In the pull-based model, monitoring targets do not need to know about Prometheus, and it integrates naturally with service discovery. Additionally, when a target goes down, scrape failure is detected immediately. This makes it particularly well-suited for dynamically scaling cloud-native environments.

Q2: Why should you apply rate() to Counter-type metrics instead of using the raw value?

A Counter is a monotonically increasing cumulative value, so looking at the raw value just shows an ever-growing number that provides little meaningful insight. Applying rate() calculates the per-second rate of change, allowing you to understand current throughput (e.g., requests per second). Additionally, rate() automatically handles counter resets caused by restarts.

Q3: What is the key difference between Histogram and Summary, and why is Histogram recommended?

Histogram allows server-side quantile calculation using the histogram_quantile() function, making it possible to aggregate data across multiple instances. Summary calculates quantiles on the client side, making cross-instance aggregation mathematically impossible. Since running multiple instances is standard in microservices environments, Histogram is recommended.

Q4: Why is using irate() instead of rate() problematic in alerting rules?

irate() calculates the instantaneous rate between the two most recent data points, making it highly sensitive to short-lived spikes. When used in alerting, it triggers on transient noise, causing a surge in false positives and alert fatigue. rate() averages over the entire specified time range, making it resilient to temporary fluctuations.

Q5: What are the roles of group_by, group_wait, and group_interval in Alertmanager?

group_by specifies the labels used to group alerts together (e.g., alertname, service). group_wait is the initial waiting period for alerts in the same group to accumulate (e.g., 30 seconds). group_interval is the interval for resending when new alerts join an already-notified group. Properly combining these settings prevents alert storms.

Q6: Why is high cardinality dangerous for Prometheus, and how can it be mitigated?

When label value combinations grow explosively, the number of time series surges, causing Prometheus memory (head chunks) and disk usage to increase exponentially. Mitigation strategies include avoiding unique values like user_id in labels, using metric_relabel_configs to drop unnecessary metrics, and monitoring the prometheus_tsdb_head_series metric to track time series count.

Q7: What are the main differences between Thanos and Grafana Mimir?

Thanos uses a sidecar pattern added to existing Prometheus instances, uploading blocks to object storage for long-term retention and global querying. Grafana Mimir (a Cortex fork) is a self-contained, horizontally scalable microservices architecture that receives data via remote_write, with strengths in multi-tenancy and Grafana ecosystem integration. If you want minimal changes to existing Prometheus, choose Thanos; if you use the Grafana stack, Mimir is the better fit.

Q8: When is the predict_linear() function useful, and how is it applied in alerting?

predict_linear() performs linear regression based on historical data to predict values at a future point in time. It is useful for proactively detecting gradual issues like disk space exhaustion. For example, predict_linear(node_filesystem_avail_bytes[6h], 24*3600) < 0 means "based on the trend over the last 6 hours, the disk will be full within 24 hours," allowing a preemptive alert to be sent.