Author: Youngju Kim (@fjvbn20031)
- 1. Overview
- 2. Main Server Components
- 3. Goroutine Model and Lifecycle Management
- 4. Configuration Reload Mechanism
- 5. Target Discovery Pipeline
- 6. HTTP Client Configuration and TLS
- 7. Internal Metrics
- 8. Prometheus Operator Integration
- 9. Performance Optimization Tips
- 10. Summary
1. Overview
Prometheus is a CNCF graduated project and the de facto standard monitoring system for cloud-native environments. This post analyzes the Prometheus server's internal architecture at the source code level, examining how each component interacts and how lifecycles are managed.
The Prometheus main server is written in Go, with cmd/prometheus/main.go as the entry point. The server consists of multiple independent components, each running as a separate goroutine group.
2. Main Server Components
2.1 Overall Architecture
The Prometheus server consists of these core components:
             +---------------+
             |  Web UI/API   |
             +-------+-------+
                     |
      +--------------+--------------+
      |              |              |
+-----v------+ +-----v------+ +-----v------+
|   Scrape   | |    Rule    | |  Notifier  |
|   Manager  | |   Manager  | | (Alertmgr) |
+-----+------+ +-----+------+ +------------+
      |              |
      +------+-------+
             |
     +-------v-------+
     |     TSDB      |
     +---------------+

     +---------------+
     |   Discovery   |----> delivers targets to the Scrape Manager
     |    Manager    |
     +---------------+
2.2 Component Roles
Scrape Manager: The core engine that collects metrics from targets. It receives target lists from the Discovery Manager and manages scraping loops for each target.
TSDB (Time Series Database): Handles storage and querying of time series data. A hierarchical storage system consisting of WAL, Head Block, and Persistent Blocks.
Rule Manager: Periodically evaluates Recording Rules and Alerting Rules. Evaluation results are written to TSDB or forwarded to the Notifier.
Notifier: Delivers active alerts to Alertmanager.
Web UI/API: HTTP server providing PromQL query endpoints, management APIs, and the built-in UI.
Discovery Manager: Discovers targets from various service discovery sources and delivers them to the Scrape Manager.
3. Goroutine Model and Lifecycle Management
3.1 Actor/Run Group Pattern
Prometheus uses the Run Group pattern from the oklog/run library to manage component lifecycles. Each component is registered with two functions:
// Each component registers as an (execute, interrupt) pair
g.Add(
    func() error {
        // execute: run the component's main logic
        return component.Run(ctx)
    },
    func(err error) {
        // interrupt: cleanup on shutdown
        component.Stop()
    },
)
The core behavior of the Run Group:
- All component execute functions start simultaneously
- When any execute function returns an error or exits, all other components' interrupt functions are called
- It waits until all execute functions have returned
This pattern naturally implements graceful shutdown. For example, when SIGTERM is received, the signal handler's execute returns, causing all other components to shut down sequentially.
3.2 Main Goroutine Groups
main goroutine
|
+-- Signal Handler goroutine
+-- Scrape Discovery Manager goroutine
+-- Notify Discovery Manager goroutine
+-- Scrape Manager goroutine
+-- Rule Manager goroutine
+-- TSDB goroutine
+-- Web Handler goroutine
+-- Notifier goroutine
+-- Remote Storage goroutine
Each goroutine runs independently, communicating through channels or synchronization primitives.
3.3 Component Initialization Order
Initialization follows the dependency order:
- TSDB initialization: Storage must be ready first
- Discovery Manager start: Target discovery begins
- Scrape Manager start: Receives targets from Discovery Manager and begins scraping
- Rule Manager start: Begins rule evaluation
- Notifier start: Prepares for alert delivery
- Web Handler start: HTTP server starts last to accept requests
4. Configuration Reload Mechanism
4.1 Reload Triggers
Prometheus supports two methods for configuration reload:
SIGHUP signal: Reload via OS signal
kill -HUP $(pidof prometheus)
HTTP API: Available when --web.enable-lifecycle flag is enabled
curl -X POST http://localhost:9090/-/reload
4.2 Reload Process
Configuration reload is handled in the reloadConfig function. The complete flow:
1. Receive SIGHUP or /-/reload call
2. Parse prometheus.yml file
3. Validate configuration
4. Propagate new configuration to each component:
a. Update Remote Storage configuration
b. Update Notifier configuration
c. Update Discovery Manager configuration
d. Update Scrape Manager configuration
e. Update Rule Manager configuration
f. Update Web Handler configuration
5. Update reload success/failure metrics
4.3 Per-Component Reload Behavior
Scrape Manager: Computes the diff between existing scrape pools and the new configuration. Only changed jobs are recreated while unchanged jobs are maintained, minimizing unnecessary disruption.
Rule Manager: Recreates all rule groups. Restores the state of existing rules (alert states, etc.) to the new rules.
Discovery Manager: Applies new service discovery configuration. Reuses unchanged providers from the existing configuration.
Notifier: Updates Alertmanager endpoint configuration.
4.4 Reload Failure Handling
If configuration file parsing or validation fails, the reload is aborted and the existing configuration is preserved. Reload failures can be monitored via the prometheus_config_last_reload_successful metric:
prometheus_config_last_reload_successful == 0
5. Target Discovery Pipeline
5.1 Complete Flow
The full pipeline from target discovery to scraping:
Service Discovery Provider
|
v
Discovery Manager (target groups)
|
v
Scrape Manager (apply relabel_configs)
|
v
Scrape Pool (per job)
|
v
Scrape Loop (per target)
|
v
HTTP GET /metrics
|
v
TSDB Appender
5.2 Discovery Manager
The Discovery Manager unifies multiple service discovery providers:
kubernetes_sd_config --|
consul_sd_config --|--> Discovery Manager --> Target Groups Channel
file_sd_config --|
static_config --|
Each provider runs in a separate goroutine, delivering target group changes through channels. The Discovery Manager collects these and forwards them to the Scrape Manager.
Provider updates are buffered for a configurable duration (default 5 seconds) before being delivered in batch. This prevents frequent changes from overloading the Scrape Manager.
5.3 Scrape Manager
When the Scrape Manager receives target groups from the Discovery Manager:
- Finds the corresponding Scrape Pool for each scrape job
- Applies relabel_configs to new targets
- Keeps or drops targets based on relabeling results
- Creates Scrape Loops for new targets
- Terminates Scrape Loops for disappeared targets
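The diff in the steps above boils down to a set comparison. This is a simplification of what the real scrape pool sync does; `syncTargets` is a hypothetical name:

```go
package main

import "fmt"

// syncTargets computes which scrape loops to start and which to stop
// when a new target list arrives from discovery.
func syncTargets(active, discovered []string) (start, stop []string) {
	oldSet := map[string]bool{}
	for _, t := range active {
		oldSet[t] = true
	}
	newSet := map[string]bool{}
	for _, t := range discovered {
		newSet[t] = true
		if !oldSet[t] {
			start = append(start, t) // new target: create a scrape loop
		}
	}
	for _, t := range active {
		if !newSet[t] {
			stop = append(stop, t) // target disappeared: stop its loop
		}
	}
	return start, stop
}

func main() {
	start, stop := syncTargets(
		[]string{"10.0.0.1:9100", "10.0.0.2:9100"},
		[]string{"10.0.0.2:9100", "10.0.0.3:9100"},
	)
	fmt.Println("start:", start, "stop:", stop)
	// start: [10.0.0.3:9100] stop: [10.0.0.1:9100]
}
```

Unchanged targets appear in neither list, which is what keeps a reload or discovery update from disturbing running scrape loops.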
5.4 Scrape Pool
A Scrape Pool is a group of targets belonging to the same scrape job. Each Pool manages:
- Active target list
- Dropped target list (for debugging)
- Shared scraping configuration (interval, timeout, metrics path, etc.)
5.5 Scrape Loop
One Scrape Loop runs as a goroutine for each target:
Scrape Loop Cycle:
1. Wait for scrape_interval
2. Send HTTP GET request (within scrape_timeout)
3. Parse response (Prometheus Exposition Format)
4. Apply metric_relabel_configs
5. Write samples to TSDB
6. Update up metric (1=success, 0=failure)
7. Update internal metrics (scrape_duration_seconds, etc.)
8. Repeat from step 1
6. HTTP Client Configuration and TLS
6.1 HTTP Client Construction
The HTTP client for each scrape target is configured according to the scrape job settings:
scrape_configs:
  - job_name: 'secure-targets'
    scheme: https
    tls_config:
      ca_file: /path/to/ca.pem
      cert_file: /path/to/cert.pem
      key_file: /path/to/key.pem
      insecure_skip_verify: false
    basic_auth:
      username: admin
      password_file: /path/to/password
    authorization:
      type: Bearer
      credentials_file: /path/to/token
Note that basic_auth and authorization are mutually exclusive within a single scrape job; a real configuration would use one or the other, not both as shown here.
6.2 TLS Configuration
Prometheus supports various TLS configurations:
- CA Certificate: For server certificate validation
- Client Certificate: For mTLS (mutual TLS) authentication
- Server Name Verification: SNI configuration via the server_name field
- Automatic Certificate Renewal: File-based credentials are periodically re-read
6.3 Authentication Mechanisms
Prometheus supports multiple authentication methods:
- Basic Auth: Username/password based
- Bearer Token: OAuth2 tokens, etc.
- OAuth2 Client: Client credentials flow support
- AWS SigV4: For AWS service access
Credential files (password_file, credentials_file, etc.) are re-read on each scrape cycle, so credentials can be updated without restarting Prometheus.
7. Internal Metrics
7.1 Key Self-Monitoring Metrics
Prometheus exposes various internal metrics for self-monitoring:
Scraping related:
- prometheus_target_scrape_pool_targets: Number of targets in each scrape pool
- prometheus_target_scrapes_exceeded_sample_limit_total: Sample limit exceeded count
- scrape_duration_seconds: Scrape duration
- scrape_samples_scraped: Number of samples scraped
TSDB related:
- prometheus_tsdb_head_series: Number of series in the Head Block
- prometheus_tsdb_head_samples_appended_total: Number of samples appended
- prometheus_tsdb_compactions_total: Number of compactions performed
- prometheus_tsdb_wal_corruptions_total: Number of WAL corruptions
Rule evaluation related:
- prometheus_rule_evaluation_duration_seconds: Rule evaluation duration
- prometheus_rule_group_last_duration_seconds: Last rule group evaluation time
Configuration related:
- prometheus_config_last_reload_successful: Whether the last reload succeeded
- prometheus_config_last_reload_success_timestamp_seconds: Timestamp of the last successful reload
7.2 Performance Tuning Indicators
Key indicators to monitor in production:
1. Scrape lag: scrape_duration_seconds > scrape_interval
- If scraping takes longer than the configured interval, data gaps may occur
2. Series cardinality: prometheus_tsdb_head_series
- Rapid increase signals cardinality explosion
3. Rule evaluation lag: prometheus_rule_group_last_duration_seconds > evaluation_interval
- Rule evaluation exceeding the interval causes alert delays
4. WAL segment growth: prometheus_tsdb_wal_segment_current
- This tracks the index of the WAL segment currently being written; if it keeps climbing while old segments are not being truncated at checkpoints, head compaction or checkpointing may be stalled
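As a starting point, the indicators above can be turned into alerting rules. The thresholds below are illustrative, not recommendations:

```yaml
groups:
  - name: prometheus-self-monitoring
    rules:
      - alert: PrometheusConfigReloadFailed
        expr: prometheus_config_last_reload_successful == 0
        for: 5m
        annotations:
          summary: "Prometheus configuration reload has failed"
      - alert: PrometheusHighSeriesCardinality
        expr: prometheus_tsdb_head_series > 2e6   # threshold is illustrative
        for: 15m
        annotations:
          summary: "Head series count is unusually high"
```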
8. Prometheus Operator Integration
8.1 Prometheus Operator Architecture
In Kubernetes environments, Prometheus is typically deployed via Prometheus Operator:
Prometheus Operator
|
+-- watches Prometheus CRD
| |
| +-- generates prometheus.yml
| +-- manages StatefulSet
|
+-- watches ServiceMonitor CRD
| |
| +-- converts to scrape_configs
|
+-- watches PodMonitor CRD
| |
| +-- converts to scrape_configs
|
+-- watches PrometheusRule CRD
|
+-- converts to rule files
8.2 Config Reloader Sidecar
Prometheus Operator uses a config-reloader sidecar container to automatically detect configuration changes and send reload requests to Prometheus:
1. Operator updates ConfigMap/Secret
2. Config Reloader detects mounted file changes
3. Sends /-/reload POST request
4. Prometheus applies new configuration
This sidecar monitors file changes using inotify (Linux), with periodic polling as a fallback.
9. Performance Optimization Tips
9.1 Scraping Optimization
- scrape_interval tuning: Set intervals matching metric change velocity. Too short increases load, too long reduces precision
- sample_limit configuration: Limit maximum samples per target to prevent cardinality explosion
- metric_relabel_configs: Drop unnecessary metrics immediately after scraping
9.2 TSDB Optimization
- --storage.tsdb.min-block-duration: Minimum span of data kept in the Head Block before it is compacted into a persistent block (default 2h)
- --storage.tsdb.max-block-duration: Maximum time range a compacted block may cover (default: 10% of the retention period, capped at 31 days)
- --storage.tsdb.wal-compression: Enable WAL compression to reduce disk I/O
9.3 Query Optimization
- --query.max-concurrency: Limit concurrent queries (default 20)
- --query.timeout: Set query timeout (default 2 minutes)
- --query.max-samples: Limit maximum samples per query (default 50 million)
10. Summary
The Prometheus server's internal architecture demonstrates a clean design leveraging Go's concurrency model. The core design patterns are lifecycle management through Run Groups, channel-based inter-component communication, and pipeline-style data flow.
In the next post, we will analyze the TSDB internals in greater depth, examining WAL segment structure, chunk encoding, and block compaction algorithms at the source code level.