Author: Youngju Kim (@fjvbn20031)
- 1. Overview
- 2. Rule Manager
- 3. Alert State Machine
- 4. Delivery to Alertmanager
- 5. Alertmanager Internals
- 6. Alert Grouping
- 7. Inhibition
- 8. Silencing
- 9. Deduplication
- 10. Alertmanager HA Cluster
- 11. Notification Delivery
- 12. Monitoring and Debugging
- 13. Summary
1. Overview
The Prometheus alerting pipeline detects anomalies based on metric data and delivers notifications to appropriate receivers. The pipeline consists of two major components:
- Prometheus Server: Rule evaluation and Alert generation
- Alertmanager: Alert routing, grouping, inhibition, silencing, and notification delivery
This post analyzes the Rule Manager evaluation loop, Alert state machine, and Alertmanager internals at the source code level.
2. Rule Manager
2.1 Overall Structure
prometheus.yml rule_files
|
v
+-------------------+
| Rule Manager |
| |
| +-- Rule Group 1 (evaluation_interval: 15s)
| | +-- recording rule A
| | +-- alerting rule B
| | +-- alerting rule C
| |
| +-- Rule Group 2 (evaluation_interval: 30s)
| | +-- recording rule D
| | +-- alerting rule E
| |
+-------------------+
|
Alert delivery
|
v
+-------------------+
| Alertmanager |
+-------------------+
2.2 Rule Group
A Rule Group is a collection of related rules that are evaluated sequentially at the same interval:
groups:
  - name: node_alerts
    interval: 15s  # defaults to global.evaluation_interval
    rules:
      - record: instance:node_cpu:rate5m
        expr: 1 - avg by(instance) (rate(node_cpu_seconds_total{mode="idle"}[5m]))
      - alert: HighCPU
        expr: instance:node_cpu:rate5m > 0.9
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: 'CPU usage exceeds 90%'
2.3 Evaluation Loop
Rule Group evaluation loop:
1. evaluation_interval timer fires
|
v
2. Evaluate rules in group sequentially
(recording rules first, then alerting rules)
|
v
3. Execute each rule's PromQL expression against TSDB
|
v
4. Recording rule: store result as new time series in TSDB
Alerting rule: pass result to Alert state machine
|
v
5. Send active Alerts to Alertmanager
|
v
6. Wait until next evaluation cycle
Rules are evaluated sequentially within a group, so recording rule results can be referenced by alerting rules in the same group.
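The sequential evaluation within a group can be sketched as follows. This is a minimal Python sketch with hypothetical structures (a dict standing in for the TSDB, lambdas standing in for PromQL); the real implementation lives in Prometheus's rules package.

```python
# Minimal sketch of one rule-group evaluation tick.
# `tsdb` is a stand-in for the storage layer: series name -> value.

def evaluate_group(rules, tsdb):
    """Evaluate rules in order; recording-rule results are written back
    immediately, so later rules in the same group can reference them."""
    fired = []
    for rule in rules:
        value = rule["expr"](tsdb)          # evaluate the "PromQL" expression
        if rule["kind"] == "recording":
            tsdb[rule["name"]] = value      # store as a new series
        elif value:                          # alerting rule matched
            fired.append(rule["name"])
    return fired

# A recording rule feeding an alerting rule in the same group:
tsdb = {"cpu_idle_rate": 0.05}
rules = [
    {"kind": "recording", "name": "instance:node_cpu:rate5m",
     "expr": lambda db: 1 - db["cpu_idle_rate"]},
    {"kind": "alerting", "name": "HighCPU",
     "expr": lambda db: db["instance:node_cpu:rate5m"] > 0.9},
]
print(evaluate_group(rules, tsdb))  # -> ['HighCPU']
```

Because the recording rule runs first, the alerting rule sees the freshly computed value 0.95 in the same cycle.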
2.4 Evaluation Timing
Evaluation timing management:
- Each Rule Group runs in an independent goroutine
- Evaluation start times are aligned to evaluation_interval
(e.g., 15s interval: :00, :15, :30...)
- If evaluation takes longer than interval, next evaluation is skipped
- Skipped evaluations are recorded as metrics:
prometheus_rule_group_iterations_missed_total
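The interval alignment can be sketched as a small timestamp calculation (a simplification: Prometheus additionally derives a per-group offset from a hash of the group name so that groups sharing an interval don't all fire at once):

```python
def next_aligned_eval(now: float, interval: float, offset: float = 0.0) -> float:
    """Next evaluation timestamp on the interval grid. `offset` models the
    per-group hash offset Prometheus uses to spread load."""
    return ((now - offset) // interval + 1) * interval + offset

print(next_aligned_eval(47.0, 15.0))             # -> 60.0 (grid :00, :15, :30, :45)
print(next_aligned_eval(47.0, 15.0, offset=5.0)) # -> 50.0 (grid :05, :20, :35, :50)
```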
3. Alert State Machine
3.1 State Transitions
Alert state machine:
+----------+
| inactive |
+----+-----+
|
| Expression result exists (matched)
v
+---------+
| pending | (waiting for 'for' duration)
+----+----+
|
| 'for' duration elapsed
v
+---------+
| firing | (sent to Alertmanager)
+----+----+
|
| Expression result absent (not matched)
v
+----------+
| resolved | (resolved, sent to Alertmanager)
+----+-----+
|
| Next evaluation cycle
v
+----------+
| inactive |
+----------+
Note: If expression result is absent while in pending, transitions directly to inactive
3.2 for Duration
The for field specifies how long a condition must persist before the Alert transitions to firing:
for duration behavior:
Time ExprResult State
0s true inactive -> pending (ActiveAt = 0s)
15s true pending (elapsed: 15s)
30s true pending (elapsed: 30s)
...
5m true pending -> firing (for: 5m satisfied)
5m15s true firing (continues sending)
5m30s false firing -> resolved
5m45s - resolved -> inactive
Without for (for: 0s):
0s true inactive -> firing (immediately)
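The transitions above, including the `for` delay, can be condensed into one step function. This is a hedged sketch with hypothetical names, not Prometheus's actual types:

```python
# `active_since` is when the expression first matched.

def step(state, active_since, matched, now, for_duration):
    """Return (new_state, new_active_since) after one evaluation cycle."""
    if matched:
        if state in ("inactive", "resolved"):
            return ("pending", now) if for_duration > 0 else ("firing", now)
        if state == "pending" and now - active_since >= for_duration:
            return ("firing", active_since)
        return (state, active_since)
    # expression result absent
    if state == "firing":
        return ("resolved", None)   # resolved is sent, then goes inactive
    return ("inactive", None)       # pending drops straight back to inactive

# Walk a timeline with for: 30s
state, since = "inactive", None
for t in (0, 15, 30):
    state, since = step(state, since, matched=True, now=t, for_duration=30)
print(state)  # -> firing
```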
3.3 Alert Identity
Each Alert instance is uniquely identified by its label set:
Alert identity:
alertname + expression result labels + additional labels field
= Alert's unique fingerprint
Example:
alert: HighCPU
expr: instance:node_cpu:rate5m > 0.9
labels:
severity: warning
If result label is instance="node-1":
fingerprint = hash(alertname=HighCPU, instance=node-1, severity=warning)
Same rule with different instance values produces separate Alerts
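A sketch of the fingerprint idea: hash the label set in a canonical (sorted) order so that label ordering doesn't matter. Prometheus actually uses an FNV-64a hash over the label pairs; SHA-256 here is purely for illustration.

```python
import hashlib

def fingerprint(labels: dict) -> str:
    """Hash of the sorted label set (illustrative, not Prometheus's FNV)."""
    canonical = ",".join(f"{k}={labels[k]}" for k in sorted(labels))
    return hashlib.sha256(canonical.encode()).hexdigest()[:16]

a = fingerprint({"alertname": "HighCPU", "instance": "node-1", "severity": "warning"})
b = fingerprint({"severity": "warning", "instance": "node-1", "alertname": "HighCPU"})
c = fingerprint({"alertname": "HighCPU", "instance": "node-2", "severity": "warning"})
print(a == b, a == c)  # -> True False
```

The same rule evaluated against `node-1` and `node-2` yields different fingerprints, hence two independent Alert instances.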
4. Delivery to Alertmanager
4.1 Delivery Mechanism
Alert delivery flow:
1. Collect active Alerts from evaluation loop
(firing + resolved)
|
v
2. Serialize Alerts to API format
POST /api/v2/alerts
|
v
3. Send to all configured Alertmanager instances
(alerting.alertmanagers configuration)
|
v
4. On delivery failure, resend in next evaluation cycle
(Alerts are resent every evaluation cycle)
4.2 Alert Data Format
Alert data structure:
- labels: Alert identification label map
- annotations: Additional info (summary, description, etc.)
- startsAt: Alert start time
- endsAt: Alert end time (resolved or expected end)
- generatorURL: Prometheus expression link
Firing state:
endsAt = current_time + 4 * evaluation_interval
(to prevent expiry before next send)
Resolved state:
endsAt = resolution time
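A sketch of how one entry for `POST /api/v2/alerts` might be assembled, including the endsAt rule above (field names follow the API format; the generatorURL is a placeholder):

```python
import json
from datetime import datetime, timedelta, timezone

def to_api_alert(labels, annotations, starts_at, now, eval_interval, resolved_at=None):
    """Build one alert for POST /api/v2/alerts. Firing alerts push endsAt
    ~4 evaluation intervals ahead so they don't expire before the next
    resend; resolved alerts carry the actual resolution time."""
    ends = resolved_at if resolved_at else now + 4 * eval_interval
    return {
        "labels": labels,
        "annotations": annotations,
        "startsAt": starts_at.isoformat(),
        "endsAt": ends.isoformat(),
        "generatorURL": "http://prometheus:9090/graph",  # placeholder
    }

now = datetime(2026, 3, 20, 10, 0, tzinfo=timezone.utc)
alert = to_api_alert(
    {"alertname": "HighCPU", "instance": "node-1"},
    {"summary": "CPU usage exceeds 90%"},
    starts_at=now - timedelta(minutes=6),
    now=now,
    eval_interval=timedelta(seconds=15),
)
print(json.dumps(alert, indent=2))
```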
4.3 Alertmanager Discovery
Alertmanager discovery:
1. Static configuration:

   alerting:
     alertmanagers:
       - static_configs:
           - targets: ['alertmanager-1:9093', 'alertmanager-2:9093']

2. Service discovery:

   alerting:
     alertmanagers:
       - kubernetes_sd_configs:
           - role: pod
         relabel_configs:
           - source_labels: [__meta_kubernetes_pod_label_app]
             regex: alertmanager
             action: keep
Prometheus sends Alerts to all discovered Alertmanagers.
5. Alertmanager Internals
5.1 Processing Pipeline
Alertmanager processing pipeline:
Alert received (API)
|
v
Dispatcher
|-- Routing tree matching
|-- Alert grouping
|
v
Notification Pipeline
|-- Inhibit (inhibition check)
|-- Silence (silencing check)
|-- Wait (group wait)
|-- Dedup (deduplication)
|-- Retry (delivery retry)
|-- Notify (actual delivery)
|
v
Receiver
|-- email
|-- Slack
|-- PagerDuty
|-- Webhook
|-- ...
5.2 Routing Tree
The routing tree is a hierarchical structure that matches Alerts to appropriate receivers:
# alertmanager.yml
route:
  receiver: 'default-receiver'
  group_by: ['alertname', 'cluster']
  group_wait: 30s
  group_interval: 5m
  repeat_interval: 4h
  routes:
    - match:
        severity: critical
      receiver: 'pagerduty-critical'
      group_wait: 10s
      routes:
        - match:
            service: database
          receiver: 'dba-pagerduty'
    - match:
        severity: warning
      receiver: 'slack-warnings'
      group_by: ['alertname', 'service']
Routing tree matching:
root (default-receiver)
|
+-- severity=critical -> pagerduty-critical
| |
| +-- service=database -> dba-pagerduty
|
+-- severity=warning -> slack-warnings
Matching order:
1. Traverse child routes top to bottom
2. Select first matching route (continue: false by default)
3. If continue: true, continue checking next routes
4. If no child matches, use current node's receiver
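The matching order above can be sketched as a small recursive walk (simplified: exact `match` only, and the `continue: true` multi-receiver fan-out is omitted):

```python
def route_alert(route, labels):
    """Return the receiver for an alert's labels. First matching child wins;
    if no child matches, the current node's receiver applies. Real
    Alertmanager keeps scanning siblings when `continue: true` is set."""
    for child in route.get("routes", []):
        if all(labels.get(k) == v for k, v in child.get("match", {}).items()):
            return route_alert(child, labels)
    return route["receiver"]

# The routing tree from the configuration above:
tree = {
    "receiver": "default-receiver",
    "routes": [
        {"match": {"severity": "critical"}, "receiver": "pagerduty-critical",
         "routes": [{"match": {"service": "database"},
                     "receiver": "dba-pagerduty"}]},
        {"match": {"severity": "warning"}, "receiver": "slack-warnings"},
    ],
}
print(route_alert(tree, {"severity": "critical", "service": "database"}))  # -> dba-pagerduty
print(route_alert(tree, {"severity": "warning"}))                          # -> slack-warnings
print(route_alert(tree, {"severity": "info"}))                             # -> default-receiver
```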
6. Alert Grouping
6.1 Grouping Mechanism
Grouping behavior:
group_by: ['alertname', 'cluster']
Alert 1: alertname=HighCPU, cluster=prod, instance=node-1
Alert 2: alertname=HighCPU, cluster=prod, instance=node-2
Alert 3: alertname=HighCPU, cluster=staging, instance=node-3
Alert 4: alertname=DiskFull, cluster=prod, instance=node-1
Group result:
Group 1: (alertname=HighCPU, cluster=prod) -> Alert 1, 2
Group 2: (alertname=HighCPU, cluster=staging) -> Alert 3
Group 3: (alertname=DiskFull, cluster=prod) -> Alert 4
Each group is sent as a single notification.
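The grouping above reduces to computing a group key from the `group_by` label values, sketched here with the same four alerts:

```python
def group_key(alert_labels, group_by):
    """An alert's group is identified by the values of its group_by labels;
    all other labels (e.g. instance) are ignored for grouping."""
    return tuple((k, alert_labels.get(k, "")) for k in sorted(group_by))

alerts = [
    {"alertname": "HighCPU", "cluster": "prod", "instance": "node-1"},
    {"alertname": "HighCPU", "cluster": "prod", "instance": "node-2"},
    {"alertname": "HighCPU", "cluster": "staging", "instance": "node-3"},
    {"alertname": "DiskFull", "cluster": "prod", "instance": "node-1"},
]
groups = {}
for a in alerts:
    groups.setdefault(group_key(a, ["alertname", "cluster"]), []).append(a)
print(len(groups))  # -> 3
```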
6.2 Grouping Timing
Grouping timing parameters:
group_wait: 30s
- Wait time before sending first notification for a new group
- Collects Alerts added to the same group during this window
- Prevents duplicate notifications during initial Alert storms
group_interval: 5m
- Interval for resending notifications when new Alerts join a group
- Not applied if only existing Alerts are present
repeat_interval: 4h
- Interval for resending notifications for unchanged groups
- Periodic reminder so receivers don't miss Alerts
7. Inhibition
7.1 Inhibition Rules
Inhibition suppresses notifications for certain Alerts when other specific Alerts are active:
inhibit_rules:
  - source_match:
      severity: critical
    target_match:
      severity: warning
    equal: ['alertname', 'cluster']
Inhibition behavior:
Source Alert (exists and firing):
alertname=HighCPU, cluster=prod, severity=critical
Target Alert (to be inhibited):
alertname=HighCPU, cluster=prod, severity=warning
Since equal fields match, target Alert notifications are inhibited.
When critical is firing for the same issue, warning is not notified.
7.2 Inhibition Processing
Inhibition processing flow:
1. Inhibit stage runs in Notification Pipeline
2. Check active Alert list for source_match
3. If matching source Alert exists
4. Verify equal field values match
5. If equal, suppress target Alert notification
6. Inhibited Alerts still appear in UI (state is maintained)
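The check in steps 2-5 can be sketched as follows (a simplification of Alertmanager's inhibitor, exact-match only):

```python
def is_inhibited(target, active_alerts, rule):
    """Target is muted if some *other* active alert matches source_match
    and agrees with the target on every label listed in `equal`."""
    def matches(labels, m):
        return all(labels.get(k) == v for k, v in m.items())
    if not matches(target, rule["target_match"]):
        return False
    return any(
        matches(src, rule["source_match"])
        and all(src.get(k) == target.get(k) for k in rule["equal"])
        and src != target
        for src in active_alerts
    )

rule = {"source_match": {"severity": "critical"},
        "target_match": {"severity": "warning"},
        "equal": ["alertname", "cluster"]}
critical = {"alertname": "HighCPU", "cluster": "prod", "severity": "critical"}
warning = {"alertname": "HighCPU", "cluster": "prod", "severity": "warning"}
print(is_inhibited(warning, [critical, warning], rule))  # -> True
```

A warning in a different cluster would not be inhibited, since the `equal` labels no longer agree.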
8. Silencing
8.1 Creating Silences
A silence temporarily suppresses notifications for Alerts matching specific conditions:
Silence configuration:
- matchers: Label matchers (regex supported)
- startsAt: Silence start time
- endsAt: Silence end time
- createdBy: Creator
- comment: Reason
Example:
matchers:
  alertname = HighCPU
  cluster = prod
startsAt: 2026-03-20T10:00:00Z
endsAt: 2026-03-20T14:00:00Z
comment: "Planned maintenance"
8.2 Silence Processing
Silence matching:
1. Runs at the Silence stage in Notification Pipeline
2. Iterates over active silence list
3. Evaluates each silence's matchers against Alert labels
4. If all matchers match, the Alert is silenced
5. Silenced Alerts are displayed in Alertmanager UI
6. Notifications automatically resume after silence expires
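Silence matching (steps 2-4 above) boils down to: all matchers match and the current time falls inside the window. A sketch with equality and regex matchers, mirroring the API's `isRegex` flag:

```python
import re
from datetime import datetime, timezone

def is_silenced(labels, silence, now):
    """True if `now` is inside the silence window and every matcher matches."""
    if not (silence["startsAt"] <= now < silence["endsAt"]):
        return False
    for m in silence["matchers"]:
        value = labels.get(m["name"], "")
        if m.get("isRegex"):
            if not re.fullmatch(m["value"], value):
                return False
        elif value != m["value"]:
            return False
    return True

silence = {
    "startsAt": datetime(2026, 3, 20, 10, 0, tzinfo=timezone.utc),
    "endsAt": datetime(2026, 3, 20, 14, 0, tzinfo=timezone.utc),
    "matchers": [{"name": "alertname", "value": "HighCPU"},
                 {"name": "cluster", "value": "prod"}],
}
now = datetime(2026, 3, 20, 11, 0, tzinfo=timezone.utc)
print(is_silenced({"alertname": "HighCPU", "cluster": "prod"}, silence, now))  # -> True
```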
9. Deduplication
9.1 Deduplication Mechanism
Deduplication scenarios:
1. Repeated receipt of the same Alert:
- Prometheus resends firing Alerts every evaluation cycle
- Alertmanager prevents duplicate notifications for the same Alert
- Delivery records stored in Notification Log
2. HA cluster deduplication:
- Multiple Prometheus instances send the same Alert
- Alertmanager cluster shares delivery state via gossip
- Only one Alertmanager sends the actual notification
9.2 Notification Log
Notification Log:
- Stores notification delivery records for each Alert group
- Key: group fingerprint + receiver
- Value: last delivery time, delivered Alert fingerprint list
- Compared with repeat_interval to decide resend
- Synchronized via gossip protocol in HA cluster
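The resend decision against the Notification Log can be sketched as: notify when the set of firing alerts changed, or when repeat_interval has elapsed (hypothetical structure; the real log also tracks resolved fingerprints).

```python
from datetime import datetime, timedelta, timezone

def needs_notify(nflog, group_key, receiver, firing_fps, now, repeat_interval):
    """Resend only if the firing-alert set changed, or repeat_interval has
    elapsed since the last delivery for this (group, receiver) pair."""
    entry = nflog.get((group_key, receiver))
    if entry is None:
        return True                      # never notified before
    if entry["firing"] != firing_fps:
        return True                      # group membership changed
    return now - entry["sent_at"] >= repeat_interval

nflog = {}
key, rcv = "alertname=HighCPU,cluster=prod", "slack-warnings"
t0 = datetime(2026, 3, 20, 10, 0, tzinfo=timezone.utc)
four_h = timedelta(hours=4)

print(needs_notify(nflog, key, rcv, {"fp1"}, t0, four_h))  # -> True (first time)
nflog[(key, rcv)] = {"firing": {"fp1"}, "sent_at": t0}     # record delivery
t1 = t0 + timedelta(minutes=5)
print(needs_notify(nflog, key, rcv, {"fp1"}, t1, four_h))          # -> False (unchanged)
print(needs_notify(nflog, key, rcv, {"fp1", "fp2"}, t1, four_h))   # -> True (new alert joined)
```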
10. Alertmanager HA Cluster
10.1 Cluster Architecture
Alertmanager HA architecture:
Prometheus 1 --+--> [Alertmanager 1] <--+
               |                        |
Prometheus 2 --+--> [Alertmanager 2] <--+--- gossip protocol (Memberlist)
               |                        |
               +--> [Alertmanager 3] <--+
                          |
                          v  (only one instance delivers)
                   Slack / PagerDuty
10.2 Gossip Protocol
Alertmanager uses a Memberlist (HashiCorp)-based gossip protocol:
Data synchronized via gossip:
1. Notification Log:
- Which Alert groups have been notified
- Prevents duplicate delivery if another instance already sent
2. Silence state:
- Created/modified/deleted silence information
- Same silences applied across all instances
Synchronization mechanism:
- Periodically propagates state to random peers
- Full state sync when new instance joins
- Eventual consistency model
10.3 HA Configuration
# Starting Alertmanager cluster
alertmanager \
  --config.file=alertmanager.yml \
  --cluster.listen-address=0.0.0.0:9094 \
  --cluster.peer=alertmanager-1:9094 \
  --cluster.peer=alertmanager-2:9094
HA behavior:
1. All instances receive Alerts
2. All instances perform routing/grouping independently
3. Dedup stage runs in Notification Pipeline
4. Check Notification Log via gossip
5. Only send actual notification if not yet delivered
6. Record delivery result in Notification Log and propagate via gossip
11. Notification Delivery
11.1 Receiver Types
Built-in receivers:
- email: SMTP email
- slack: Slack webhook
- pagerduty: PagerDuty Events API
- opsgenie: OpsGenie API
- victorops: VictorOps API
- webhook: Generic HTTP webhook
- wechat: WeChat
- pushover: Pushover
- sns: AWS SNS
- telegram: Telegram Bot API
- webex: Webex Teams
- msteams: Microsoft Teams
11.2 Template System
Notification templates:
Alertmanager uses Go templates to compose notification content.
Available data:
.Status: firing or resolved
.Alerts: Alert list
.GroupLabels: Labels used for grouping
.CommonLabels: Labels common to all Alerts
.ExternalURL: Alertmanager external URL
Data available per Alert:
.Labels: Alert labels
.Annotations: Alert annotations
.StartsAt: Start time
.EndsAt: End time
.GeneratorURL: Prometheus link
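As an illustration, a Slack receiver could render these fields as below (hypothetical receiver and channel names; `toUpper` is one of Alertmanager's built-in template functions):

```yaml
receivers:
  - name: 'slack-warnings'
    slack_configs:
      - channel: '#alerts'
        title: '[{{ .Status | toUpper }}] {{ .GroupLabels.alertname }}'
        text: >-
          {{ range .Alerts }}
          {{ .Annotations.summary }} ({{ .Labels.instance }})
          {{ end }}
```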
11.3 Retry Mechanism
Notification delivery retry:
1. On notification delivery failure
2. Retry with exponential backoff
Initial interval: 1s, max interval: 5m
3. After max retries exhausted, log the failure
4. Try again at next repeat_interval
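The backoff schedule can be sketched as follows (a simplification: real implementations add jitter to avoid synchronized retries across instances):

```python
def backoff_schedule(initial=1.0, cap=300.0, max_retries=10):
    """Exponential backoff delays in seconds, capped at 5 minutes."""
    delay, out = initial, []
    for _ in range(max_retries):
        out.append(delay)
        delay = min(delay * 2, cap)  # double, but never exceed the cap
    return out

print(backoff_schedule())
# -> [1.0, 2.0, 4.0, 8.0, 16.0, 32.0, 64.0, 128.0, 256.0, 300.0]
```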
12. Monitoring and Debugging
12.1 Prometheus Server Metrics
Rule evaluation metrics:
prometheus_rule_evaluations_total: Total rule evaluation count
prometheus_rule_evaluation_failures_total: Rule evaluation failure count
prometheus_rule_group_duration_seconds: Group evaluation duration
prometheus_rule_group_iterations_missed_total: Skipped evaluation count
Alert metrics:
ALERTS{alertstate=...}: synthetic time series exposing active Alerts by state
prometheus_notifications_total: Alertmanager delivery count
prometheus_notifications_errors_total: Delivery failure count
prometheus_notifications_dropped_total: Dropped notification count
prometheus_notifications_queue_length: Delivery queue length
12.2 Alertmanager Metrics
Alertmanager metrics:
alertmanager_alerts: Current active Alerts
alertmanager_alerts_received_total: Received Alert count
alertmanager_alerts_invalid_total: Invalid Alert count
alertmanager_notifications_total: Sent notification count (by receiver)
alertmanager_notifications_failed_total: Delivery failure count
alertmanager_silences: Active silence count
alertmanager_cluster_members: Cluster member count
alertmanager_cluster_messages_received_total: Gossip message count
12.3 Common Troubleshooting
1. Alert not firing:
- Manually execute PromQL expression to verify results
- Verify for duration has elapsed sufficiently
- Check evaluation_interval setting
2. Notifications not received:
- Check Alertmanager connectivity (Prometheus /targets)
- Verify routing tree matching (amtool config routes test)
- Check silences (Alertmanager UI)
- Review inhibition rules
3. Duplicate notifications:
- Verify group_by settings
- Check repeat_interval
- Verify HA cluster gossip health
13. Summary
The Prometheus alerting pipeline consists of the Rule Manager's periodic evaluation, the Alert state machine's precise state management, and Alertmanager's multi-stage processing pipeline. Grouping mitigates notification storms, inhibition and silencing suppress unnecessary notifications, and gossip-based HA clustering ensures high availability.