Authors
- Youngju Kim (@fjvbn20031)
- 1. Overview
- 2. Discovery Manager Architecture
- 3. Provider Interface
- 4. Kubernetes SD
- 5. Relabeling Mechanism
- 6. File-based SD
- 7. HTTP SD
- 8. Other SD Implementations
- 9. Target Lifecycle
- 10. Performance Considerations
- 11. Debugging and Troubleshooting
- 12. Summary
1. Overview
Prometheus Service Discovery is the core subsystem that automatically discovers and manages monitoring targets. In cloud-native environments where service instances are dynamically created and destroyed, static configuration alone cannot track all targets.
This post analyzes the Discovery Manager architecture, Provider interface, SD implementation details, relabeling mechanism, and target lifecycle at the source code level.
2. Discovery Manager Architecture
2.1 Overall Structure
prometheus.yml scrape_configs
|
v
+-------------------+
| Discovery Manager |
| |
| +-- Provider 1 (kubernetes_sd) -- goroutine
| +-- Provider 2 (consul_sd) -- goroutine
| +-- Provider 3 (file_sd) -- goroutine
| +-- Provider 4 (static_config) -- goroutine
| |
+--------+----------+
|
Target Groups Channel
|
v
+-------------------+
| Scrape Manager |
+-------------------+
2.2 Discovery Manager Responsibilities
- Provider management: Managing the lifecycle of each service discovery provider
- Update collection: Collecting target group changes from providers
- Batch delivery: Buffering changes and delivering them to the Scrape Manager in batches
- Config reload: Recreating providers when new configuration is applied
2.3 Update Flow
Provider detects change
|
v
Sends update to internal channel
|
v
Discovery Manager receives
|
v
Waits 5 seconds for additional updates (debouncing)
|
v
Merges latest target groups from all providers
|
v
Delivers complete target map to Scrape Manager
Debouncing prevents excessive Scrape Manager updates in environments with frequent changes.
3. Provider Interface
3.1 Discoverer Interface
All service discovery providers implement the Discoverer interface:
type Discoverer interface {
    Run(ctx context.Context, up chan<- []*targetgroup.Group)
}
Run method contract:
- Maintain execution until context is cancelled
- Send complete group list to up channel on target group changes
- Send all currently known targets on initial run
3.2 TargetGroup Structure
TargetGroup:
Source: "kubernetes/pod/default/nginx-abc123" (unique identifier)
Targets:
- __address__: "10.0.1.5:8080"
- __address__: "10.0.1.6:8080"
Labels:
__meta_kubernetes_namespace: "default"
__meta_kubernetes_pod_name: "nginx-abc123"
...
3.3 Provider Registration
Providers are registered via factory functions:
Registration process:
1. Parse *_sd_config sections from config file
2. Call NewDiscoverer factory for the SD type
3. Register provider with Discovery Manager
4. Execute Run() in separate goroutine
4. Kubernetes SD
4.1 kubernetes_sd Overview
Kubernetes SD is the most widely used service discovery in Prometheus. It uses the Kubernetes API server's Watch mechanism to detect resource changes in real time.
4.2 Role Types
kubernetes_sd supports 6 role types:
1. node:
- Kubernetes node list
- __address__: Node Kubelet address
- __meta_kubernetes_node_name, __meta_kubernetes_node_label_*
2. service:
- Kubernetes Service list
- __address__: Service DNS name and service port
- __meta_kubernetes_service_name, __meta_kubernetes_service_port_name
3. pod:
- Kubernetes Pod list
- __address__: Pod IP:Container Port
- __meta_kubernetes_pod_name, __meta_kubernetes_pod_container_name
4. endpoints:
- Kubernetes Endpoints list
- __address__: Individual endpoint address
- __meta_kubernetes_endpoints_name
5. endpointslice:
- Kubernetes EndpointSlice list (Endpoints extension)
- More efficient for large clusters
6. ingress:
- Kubernetes Ingress list
- __address__: Ingress host
- __meta_kubernetes_ingress_name, __meta_kubernetes_ingress_path
4.3 Watch Mechanism
Kubernetes SD Watch operation:
1. Initial sync:
a. List API to retrieve full resource list
b. Convert all resources to TargetGroups
c. Send full list to Discovery Manager
2. Watch start:
a. Connect to Watch API event stream
b. Incremental updates based on resourceVersion
3. Event processing:
ADDED -> Create new TargetGroup
MODIFIED -> Update existing TargetGroup
DELETED -> Update TargetGroup with empty Targets
4. Error handling:
Watch disconnect -> Auto-reconnect
410 Gone error -> Full re-list
Timeout -> Restart Watch
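The event-processing step above can be sketched as a Go switch. The types here are simplified stand-ins (the real code uses client-go's `watch.Event` and `targetgroup.Group`), but the deletion semantics are the point: a DELETED resource is re-sent as a group with the same Source and empty Targets.

```go
package main

import "fmt"

// EventType mirrors the Kubernetes watch event kinds discussed above.
type EventType string

const (
	Added    EventType = "ADDED"
	Modified EventType = "MODIFIED"
	Deleted  EventType = "DELETED"
)

// Group is a simplified stand-in for targetgroup.Group.
type Group struct {
	Source  string
	Targets []string
}

// applyEvent updates the known-groups map: ADDED/MODIFIED store the new
// state; DELETED stores a group with no Targets, which signals the
// Discovery Manager to drop those targets.
func applyEvent(groups map[string]Group, typ EventType, g Group) Group {
	switch typ {
	case Added, Modified:
		groups[g.Source] = g
		return g
	case Deleted:
		empty := Group{Source: g.Source} // empty Targets = deletion signal
		groups[g.Source] = empty
		return empty
	}
	return g
}

func main() {
	known := map[string]Group{}
	g := Group{Source: "pod/default/nginx", Targets: []string{"10.0.1.5:8080"}}
	applyEvent(known, Added, g)
	out := applyEvent(known, Deleted, g)
	fmt.Println(len(out.Targets)) // deletion is an update with zero targets
}
```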
4.4 Informer/Reflector Pattern
kubernetes_sd leverages the client-go Informer pattern:
Informer structure:
Reflector
|-- List: Initial full sync
|-- Watch: Incremental update reception
|-- Store: Save to local cache
v
Informer
|-- EventHandler: Register event callbacks
|-- Indexer: Indexes for efficient lookup
v
Prometheus SD
|-- Convert events to TargetGroups
|-- Deliver to Discovery Manager
4.5 __meta Labels
kubernetes_sd provides rich meta labels:
Common:
__meta_kubernetes_namespace
Pod role:
__meta_kubernetes_pod_name
__meta_kubernetes_pod_ip
__meta_kubernetes_pod_container_name
__meta_kubernetes_pod_container_port_name
__meta_kubernetes_pod_container_port_number
__meta_kubernetes_pod_label_*
__meta_kubernetes_pod_annotation_*
__meta_kubernetes_pod_node_name
__meta_kubernetes_pod_ready
__meta_kubernetes_pod_phase
Node role:
__meta_kubernetes_node_name
__meta_kubernetes_node_label_*
__meta_kubernetes_node_annotation_*
__meta_kubernetes_node_address_*
Service role:
__meta_kubernetes_service_name
__meta_kubernetes_service_port_name
__meta_kubernetes_service_port_number
__meta_kubernetes_service_label_*
__meta_kubernetes_service_annotation_*
5. Relabeling Mechanism
5.1 Relabeling Overview
Relabeling is a powerful mechanism for transforming target labels. It is applied in two stages:
Stage 1: relabel_configs (before scraping)
- Process __meta_* labels from discovery
- Determine target labels (job, instance, etc.)
- Decide target keep/drop
Stage 2: metric_relabel_configs (after scraping)
- Transform labels of collected metrics
- Drop unnecessary metrics
- Transform label names/values
5.2 Relabeling Actions
replace: Match the concatenated source labels against the regex; on match, write the replacement to the target label
keep: Drop targets whose source labels do not match the regex
drop: Drop targets whose source labels match the regex
hashmod: Set the target label to the hash of the source labels modulo `modulus` (useful for sharding)
labelmap: Copy the values of labels whose names match the regex to new label names given by the replacement
labeldrop: Drop labels whose names match the regex
labelkeep: Drop labels whose names do not match the regex
lowercase: Set the target label to the lowercased concatenation of the source labels
uppercase: Set the target label to the uppercased concatenation of the source labels
keepequal: Drop targets where the concatenated source labels do not equal the target label
dropequal: Drop targets where the concatenated source labels equal the target label
5.3 Relabeling Processing Flow
__meta_* labels + __address__ + __scheme__ + __metrics_path__
|
v
Apply relabel_configs sequentially
|
v
__address__ -> copy to instance label
Remove internal labels (__scheme__, __metrics_path__, etc.)
|
v
Final target label set determined
5.4 Common Relabeling Patterns
Pod annotation-based scraping:
relabel_configs:
  - source_labels: [__meta_kubernetes_pod_annotation_prometheus_io_scrape]
    action: keep
    regex: true
  - source_labels: [__meta_kubernetes_pod_annotation_prometheus_io_path]
    action: replace
    target_label: __metrics_path__
    regex: (.+)
  - source_labels: [__address__, __meta_kubernetes_pod_annotation_prometheus_io_port]
    action: replace
    regex: ([^:]+)(?::\d+)?;(\d+)
    replacement: $1:$2
    target_label: __address__
Namespace filtering:
relabel_configs:
  - source_labels: [__meta_kubernetes_namespace]
    action: keep
    regex: production|staging
6. File-based SD
6.1 Overview
File SD reads targets from JSON or YAML files and is the simplest form of dynamic discovery:
scrape_configs:
  - job_name: 'file_sd'
    file_sd_configs:
      - files:
          - '/etc/prometheus/targets/*.json'
        refresh_interval: 5m
6.2 Target File Formats
JSON format:
[
  {
    "targets": ["10.0.1.1:9090", "10.0.1.2:9090"],
    "labels": {
      "env": "production",
      "team": "backend"
    }
  }
]
YAML format:
- targets:
    - '10.0.1.1:9090'
    - '10.0.1.2:9090'
  labels:
    env: production
    team: backend
6.3 File Watch Mechanism
File SD operation:
1. Initial load: Read all files matching the pattern
2. fsnotify watch: Subscribe to file change events (inotify on Linux)
3. Periodic polling: Full reload every refresh_interval (default 5m)
4. File change: Re-parse only changed files, update TargetGroups
5. File deletion: Remove all targets from that file
6.4 Custom SD Bridge Pattern
File SD serves as a bridge for custom service discovery:
Custom SD process (external):
1. Query proprietary service registry
2. Generate results as JSON/YAML files
3. Save to directory monitored by Prometheus
Prometheus (File SD):
1. Detect file changes
2. Update target list
3. Start/stop scraping
This pattern is useful for integrating with service registries not directly supported by Prometheus (Zookeeper, etcd, etc.).
7. HTTP SD
7.1 Overview
HTTP SD periodically polls an HTTP endpoint for the target list:
scrape_configs:
  - job_name: 'http_sd'
    http_sd_configs:
      - url: 'http://service-registry:8080/targets'
        refresh_interval: 30s
7.2 Response Format
The HTTP SD endpoint must return a JSON array:
[
  {
    "targets": ["10.0.1.1:9090"],
    "labels": {
      "__meta_datacenter": "us-east",
      "__meta_env": "production"
    }
  }
]
7.3 Operation
HTTP SD processing:
1. Send GET request to URL every refresh_interval
2. Parse response JSON
3. Convert to TargetGroups
4. Deliver to Discovery Manager
5. On HTTP error, keep previous results
6. Consecutive failures are surfaced via the prometheus_sd_http_failures_total metric
8. Other SD Implementations
8.1 Consul SD
Consul SD:
- Uses Consul Service Catalog API
- Watch/Blocking Query for change detection
- Filtering by service name, tags, datacenter
- __meta_consul_service, __meta_consul_tags meta labels
8.2 EC2 SD
EC2 SD:
- Uses AWS EC2 DescribeInstances API
- Filtering by region, availability zone, tags
- IAM role or access key authentication
- __meta_ec2_instance_id, __meta_ec2_tag_* meta labels
- Polling based on refresh_interval (no Watch support)
8.3 DNS SD
DNS SD:
- DNS SRV or A/AAAA record lookups
- Periodic polling approach
- Useful for simple environments
- __meta_dns_name meta label
9. Target Lifecycle
9.1 Target State Transitions
Target lifecycle:
Discovered
|
v
Relabeled
|
+-- Drop decision --> Dropped
|
v
Active (scraping starts)
|
+-- Scrape success --> up=1
+-- Scrape failure --> up=0
|
v
Disappeared
|
v
Stale (stale marker added)
|
v
Removed (Scrape Loop terminated)
9.2 Dropped Targets
Conditions for target dropping:
1. drop action applied in relabel_configs
2. Not matching keep action in relabel_configs
3. __address__ label is empty
4. Duplicate target (identical label set)
Dropped targets appear in the "Dropped Targets" section of the /targets UI,
useful for debugging discovery and relabeling configuration.
9.3 Target Health Monitoring
Built-in metrics:
up: 0 or 1 (scrape success)
scrape_duration_seconds: scrape duration
scrape_samples_scraped: samples collected
scrape_samples_post_metric_relabeling: samples after relabeling
scrape_series_added: newly added series
API endpoints:
GET /api/v1/targets: full target list and status
GET /api/v1/targets/metadata: per-target metric metadata
10. Performance Considerations
10.1 Optimization for Large Clusters
Kubernetes SD performance tips:
1. Namespace restriction: Narrow watch scope with namespaces field
2. Label selectors: Filter resources with selectors field
3. attach_metadata: Disable if unnecessary
4. EndpointSlice: More efficient than Endpoints for large clusters
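Tips 1 and 2 translate into configuration like the following sketch; the namespace name and selector value are placeholders:

```yaml
scrape_configs:
  - job_name: 'kubernetes-pods'
    kubernetes_sd_configs:
      - role: pod
        namespaces:
          names: ['production']    # tip 1: restrict the watch scope
        selectors:
          - role: pod
            label: 'app.kubernetes.io/managed-by=helm'  # tip 2: server-side filtering
```

Both options push filtering to the API server, so Prometheus never receives (or caches) objects it would only drop later in relabeling.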
10.2 Discovery Load Monitoring
Key metrics:
prometheus_sd_discovered_targets: discovered targets (per SD type)
prometheus_sd_received_updates_total: received updates
prometheus_sd_updates_total: delivered updates
prometheus_sd_updates_delayed_total: delayed updates
11. Debugging and Troubleshooting
11.1 Common Issues
1. Targets not discovered:
- Check SD configuration (namespace, selector, etc.)
- Check Prometheus logs for SD errors
- Verify discovered targets on /service-discovery UI
2. Targets in Dropped state:
- Check relabel_configs rule order
- Validate keep/drop action regex
- Check Dropped Targets on /targets page
3. Target labels differ from expected:
- Check __meta_* labels (/service-discovery page)
- Verify relabel_configs replace rules
- Check honor_labels setting for label conflicts
11.2 Debugging Tools
1. /service-discovery UI:
- Shows raw target groups from each SD provider
- Full __meta_* label visibility
2. /targets UI:
- Active/dropped target list
- Per-target health, last scrape time
- Applied labels
3. Prometheus logs:
- Enable detailed SD logs with --log.level=debug
- Per-provider connection/error logs
12. Summary
Prometheus service discovery supports diverse infrastructure environments through its plugin architecture. The Kubernetes SD Watch mechanism, powerful label transformation via relabeling, and custom extension through File SD/HTTP SD are the key elements.
In the next post, we will analyze the Prometheus alerting pipeline, covering the Rule Manager evaluation mechanism, Alert state machine, and Alertmanager routing and notification delivery.