[Prometheus] PromQL Engine Internals: From Parser to Execution Engine

1. Overview

PromQL (Prometheus Query Language) is the core query language for Prometheus, a functional expression language for selecting and aggregating time series data. This post analyzes the PromQL engine internals at the source code level.

The PromQL engine code resides in the promql/ package, consisting of the parser, AST (Abstract Syntax Tree), and evaluation engine.

2. Complete Query Processing Pipeline

PromQL String
    |
    v
+----------+
| Lexer    |  (tokenization)
+----------+
    |
    v
+----------+
| Parser   |  (syntax analysis)
+----------+
    |
    v
+----------+
| AST      |  (abstract syntax tree)
+----------+
    |
    v
+----------+
| Analyzer |  (optimization/validation)
+----------+
    |
    v
+----------+
| Engine   |  (evaluation/execution)
+----------+
    |
    v
Result (Matrix/Vector/Scalar)

3. Lexer

3.1 Tokenization Process

The lexer converts a PromQL string into a token stream:

Input: rate(http_requests_total[5m]) > 10

Token stream:
  IDENTIFIER("rate")
  LEFT_PAREN
  IDENTIFIER("http_requests_total")
  LEFT_BRACKET
  DURATION("5m")
  RIGHT_BRACKET
  RIGHT_PAREN
  GTR
  NUMBER(10)
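The tokenization above can be sketched with a small regex-based tokenizer. This is a hypothetical Python simplification covering only the tokens in the example; the real lexer is a hand-written state machine in Go:

```python
import re

# Hypothetical simplified tokenizer, NOT the actual Prometheus lexer.
# DURATION is tried before NUMBER so "5m" is not split into "5" + "m".
TOKEN_SPEC = [
    ("DURATION",      r"\d+(ms|s|m|h|d|w|y)"),
    ("NUMBER",        r"\d+(\.\d+)?"),
    ("IDENTIFIER",    r"[a-zA-Z_:][a-zA-Z0-9_:]*"),
    ("LEFT_PAREN",    r"\("),
    ("RIGHT_PAREN",   r"\)"),
    ("LEFT_BRACKET",  r"\["),
    ("RIGHT_BRACKET", r"\]"),
    ("GTR",           r">"),
    ("SKIP",          r"\s+"),
]
MASTER = re.compile("|".join(f"(?P<{n}>{p})" for n, p in TOKEN_SPEC))

def tokenize(query):
    # Emit (type, text) pairs, dropping whitespace.
    return [(m.lastgroup, m.group())
            for m in MASTER.finditer(query)
            if m.lastgroup != "SKIP"]
```

Running it on the example query yields exactly the token stream shown above.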

3.2 Token Types

Major token types recognized by the PromQL lexer:

Literals:
  NUMBER      - integer/float (e.g., 42, 3.14)
  STRING      - string literal (e.g., "hello")
  DURATION    - time range (e.g., 5m, 1h, 30s)

Identifiers:
  IDENTIFIER  - metric names, label names
  METRIC_IDENTIFIER - metric identifier

Operators:
  ADD, SUB, MUL, DIV, MOD, POW
  EQL, NEQ, LSS, GTR, LTE, GTE
  AND, OR, UNLESS

Keywords:
  BY, WITHOUT, ON, IGNORING
  GROUP_LEFT, GROUP_RIGHT
  OFFSET, BOOL
  SUM, AVG, MIN, MAX, COUNT, ...

3.3 Lexer Implementation

The lexer is implemented as a manual state machine in Go. Key characteristics:

  • 1-char lookahead: Examines current and next character to determine token boundaries
  • Duration parsing: Numbers followed by time units (ms, s, m, h, d, w, y) are recognized as DURATION tokens
  • String escaping: Supports single-quote, double-quote, and backtick string literals
  • No comments: PromQL has no comment syntax

4. Parser

4.1 Parser Structure

In current Prometheus, the parser is generated with goyacc from a yacc grammar (promql/parser/generated_parser.y), with operator precedences declared in the grammar; earlier versions used a hand-written Pratt parser (Top-Down Operator Precedence). Conceptually, precedence-driven parsing proceeds as follows:

Parser flow:
1. Read the next token from the stream
2. Call the prefix parsing function (literals, unary ops, parentheses, etc.)
3. While the following infix operator has sufficient precedence, perform infix parsing
4. Return the AST node

4.2 Operator Precedence

Precedence (lowest first):
1. OR
2. AND, UNLESS
3. ==, !=, <, >, <=, >= (comparison)
4. +, - (addition/subtraction)
5. *, /, % (multiplication/division)
6. ^ (power, right-associative)
7. Unary +, -
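The precedence table above can be exercised with a minimal precedence-climbing loop, the core of the Pratt approach. This is a Python sketch over bare string operands, not Prometheus code:

```python
# Precedence levels from the table above (higher binds tighter).
PRECEDENCE = {"or": 1, "and": 2, "unless": 2,
              "==": 3, "!=": 3, "<": 3, ">": 3, "<=": 3, ">=": 3,
              "+": 4, "-": 4, "*": 5, "/": 5, "%": 5, "^": 6}
RIGHT_ASSOC = {"^"}  # power is right-associative

def parse_expr(tokens, pos=0, min_prec=1):
    # Prefix step: consume one operand (simplified to a bare token).
    lhs, pos = tokens[pos], pos + 1
    # Infix step: fold operators whose precedence is high enough.
    while (pos < len(tokens) and tokens[pos] in PRECEDENCE
           and PRECEDENCE[tokens[pos]] >= min_prec):
        op = tokens[pos]
        # Right-associative ops recurse at the same level, others one higher.
        next_min = PRECEDENCE[op] + (0 if op in RIGHT_ASSOC else 1)
        rhs, pos = parse_expr(tokens, pos + 1, next_min)
        lhs = (op, lhs, rhs)
    return lhs, pos
```

With this, `1 + 2 * 3` parses as `("+", "1", ("*", "2", "3"))` and `2 ^ 3 ^ 2` nests to the right.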

4.3 Parsing Rules

Key parsing rules:

Vector Selector Parsing:

When curly braces follow a metric name, parse label matchers
Label matching operators: =, !=, =~, !~

Function Call Parsing:

IDENTIFIER + LEFT_PAREN -> function call
Validate name against registered function list
Verify argument count and types

Range Vector Parsing:

Vector selector + LEFT_BRACKET + DURATION + RIGHT_BRACKET
Example: http_requests_total[5m]

5. AST (Abstract Syntax Tree)

5.1 Node Types

The AST consists of the following node types:

Expr (expression interface)
  |
  +-- NumberLiteral        (numeric literal)
  +-- StringLiteral        (string literal)
  +-- VectorSelector       (instant vector selector)
  +-- MatrixSelector       (range vector selector)
  +-- AggregateExpr        (aggregation expression)
  +-- BinaryExpr           (binary operation)
  +-- UnaryExpr            (unary operation)
  +-- Call                 (function call)
  +-- ParenExpr            (parenthesized expression)
  +-- SubqueryExpr         (subquery expression)
  +-- StepInvariantExpr    (step-invariant expression)

5.2 AST Example

Query: sum(rate(http_requests_total[5m])) by (job)

AggregateExpr (sum, by=[job])
  |
  +-- Call (rate)
        |
        +-- MatrixSelector
              |
              +-- VectorSelector (http_requests_total)
              +-- Range: 5m
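The same tree can be modeled with a few dataclasses. These are hypothetical miniatures of the promql/parser node types with simplified field names:

```python
from dataclasses import dataclass, field

@dataclass
class VectorSelector:
    name: str

@dataclass
class MatrixSelector:
    vector: VectorSelector
    range_s: int              # range in seconds (5m -> 300)

@dataclass
class Call:
    func: str
    args: list

@dataclass
class AggregateExpr:
    op: str
    expr: object
    grouping: list = field(default_factory=list)

# sum(rate(http_requests_total[5m])) by (job)
query_ast = AggregateExpr(
    op="sum",
    expr=Call("rate", [MatrixSelector(VectorSelector("http_requests_total"), 300)]),
    grouping=["job"],
)
```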

5.3 VectorSelector Structure

VectorSelector:
  Name: "http_requests_total"
  LabelMatchers:
    - __name__ = "http_requests_total"
    - job = "prometheus"       (user-specified)
    - instance =~ ".*:9090"    (regex matching)
  Offset: 0s                   (offset modifier)
  Timestamp: nil               (@ modifier)
  OriginalOffset: 0s
  StartOrEnd: 0                (start() or end())

6. Query Evaluation Engine

6.1 Engine Configuration

Engine:
  +-- queryable      (TSDB storage interface)
  +-- maxSamples     (maximum sample count limit)
  +-- timeout        (query timeout)
  +-- activeQueries  (active query counter)
  +-- maxConcurrency (maximum concurrent queries)

6.2 Instant Query vs Range Query

Instant Query: Evaluation at a single point in time

Request: GET /api/v1/query?query=up&time=1609459200

Processing:
1. Generate AST
2. Evaluate once at time=1609459200
3. Result: Vector (single value per series)

Range Query: Step-by-step evaluation over a time range

Request: GET /api/v1/query_range?query=up&start=1609455600&end=1609459200&step=60

Processing:
1. Generate AST
2. Evaluate repeatedly from start to end at step (60s) intervals
3. Each step evaluates like an instant query
4. Result: Matrix (time-indexed value array per series)

6.3 Step Evaluation Mechanism

Each step in a Range Query is an independent instant evaluation:

query: rate(http_requests_total[5m])
start: T0, end: T0+300s, step: 60s

Step evaluations:
  T0:      rate(samples from T0-5m to T0)
  T0+60:   rate(samples from T0+60-5m to T0+60)
  T0+120:  rate(samples from T0+120-5m to T0+120)
  T0+180:  rate(samples from T0+180-5m to T0+180)
  T0+240:  rate(samples from T0+240-5m to T0+240)
  T0+300:  rate(samples from T0+300-5m to T0+300)
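The step loop above reduces to a few lines. In this Python sketch, `eval_at` stands in for a full instant evaluation at one timestamp:

```python
def range_query(eval_at, start, end, step):
    """A range query is repeated instant evaluation (simplified sketch).
    eval_at(t) returns [(labels, value), ...] for one timestamp."""
    results = {}   # labels -> [(t, value), ...], i.e. a Matrix
    t = start
    while t <= end:
        for labels, value in eval_at(t):   # one instant evaluation
            results.setdefault(labels, []).append((t, value))
        t += step
    return results
```

For start=0, end=300, step=60 this performs six instant evaluations, matching the six steps listed above.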

6.4 Lookback Delta

The Lookback Delta (default 5 minutes) is the time window within which an instant vector selector searches for samples:

Evaluation time: T
lookback delta: 5m

Sample search range: (T - 5m, T]

Behavior:
1. Find most recent sample within 5 minutes before T
2. Include in result if not a stale marker
3. Exclude series if no sample within 5 minutes

Important: An exact sample at time T is not required
           Uses the most recent sample within lookback delta
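The selection rule can be sketched as follows. This is a Python simplification; a plain NaN stands in for Prometheus's special stale-marker NaN:

```python
import math

def select_instant_sample(samples, t, lookback=300):
    """Most recent sample in (t - lookback, t], or None.
    samples is a time-sorted list of (timestamp, value) pairs."""
    candidate = None
    for ts, v in samples:
        if t - lookback < ts <= t:
            candidate = (ts, v)      # keep overwriting: latest wins
        elif ts > t:
            break
    if candidate is None or math.isnan(candidate[1]):
        return None                  # no sample, or stale marker
    return candidate
```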

7. Function Evaluation

7.1 Function Registry

All PromQL functions are registered in a central registry:

Function registration:
  Name:       "rate"
  ArgTypes:   [ValueTypeMatrix]     (argument types)
  ReturnType: ValueTypeVector       (return type)
  Call:       funcRate              (actual implementation)
  Variadic:   0                     (0 = fixed arity, not variadic)

7.2 rate() Implementation

Internal operation of rate():

rate(counter[5m]) evaluation:

1. Extract samples from 5-minute range: [s1, s2, s3, ..., sN]
2. Counter reset detection and correction:
   - If value decreases, add previous value to cumulative total
3. Calculate corrected total increase:
   resultValue = (last - first) + counterCorrection
4. Extrapolation:
   - Calculate time ratio beyond range start/end
   - durationToStart = first.time - rangeStart
   - durationToEnd = rangeEnd - last.time
   - Apply extrapolation ratio
5. Per-second rate:
   result = extrapolatedIncrease / rangeDurationSeconds
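Steps 1-5 can be condensed into a sketch. This is a Python simplification of Prometheus's extrapolatedRate; it omits edge cases such as capping extrapolation at the counter's zero point:

```python
def simple_rate(samples, range_start, range_end):
    """Simplified rate(): counter-reset correction plus extrapolation.
    samples is a time-sorted list of (timestamp, value) pairs."""
    if len(samples) < 2:
        return None
    # Steps 1-3: total increase with counter-reset correction.
    correction, prev = 0.0, samples[0][1]
    for _, v in samples[1:]:
        if v < prev:                 # counter reset detected
            correction += prev
        prev = v
    increase = samples[-1][1] - samples[0][1] + correction
    # Step 4: extrapolate toward the range edges, but no further than
    # half an average sample interval past the first/last sample.
    sampled = samples[-1][0] - samples[0][0]
    avg_interval = sampled / (len(samples) - 1)
    extrapolate = sampled
    for gap in (samples[0][0] - range_start, range_end - samples[-1][0]):
        extrapolate += gap if gap < avg_interval * 1.1 else avg_interval / 2
    increase *= extrapolate / sampled
    # Step 5: per-second rate over the full range.
    return increase / (range_end - range_start)
```

A counter increasing by 1/s with samples covering the whole 5-minute range yields a rate of exactly 1.0.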

7.3 histogram_quantile() Implementation

histogram_quantile(phi, buckets) evaluation:

1. Sort buckets by le label
2. Collect cumulative count for each bucket
3. Calculate target_count = phi (e.g., 0.95) * total_count
4. Find bucket containing target_count
5. Linear interpolation between bucket boundaries:
   bucketStart + (bucketEnd - bucketStart) *
   (target - countBelow) / (countInBucket)

Notes:
- le="+Inf" bucket must exist
- Returns NaN if buckets are empty
- Inaccurate if bucket boundaries do not match actual distribution
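The bucket walk can be sketched in a simplified form (Python; classic histograms only, without Prometheus's full edge-case handling):

```python
def histogram_quantile(phi, buckets):
    """Quantile estimate over cumulative buckets.
    buckets maps the `le` upper bound -> cumulative count and
    must include float('inf'); interpolation is linear in-bucket."""
    bounds = sorted(buckets)
    total = buckets[bounds[-1]]          # count of the +Inf bucket
    if total == 0:
        return float("nan")
    rank = phi * total                   # target cumulative count
    count_below, lower = 0.0, 0.0        # lowest bucket starts at 0
    for le in bounds:
        if buckets[le] >= rank:
            if le == float("inf"):
                return lower             # cannot interpolate into +Inf
            in_bucket = buckets[le] - count_below
            return lower + (le - lower) * (rank - count_below) / in_bucket
        count_below, lower = buckets[le], le
    return float("nan")
```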

7.4 Aggregation Function Evaluation

sum by (job) (metric) evaluation:

1. Collect all series of metric
2. Group by job label value
3. Sum values within each group
4. Result series: only job label retained

Implementation:
- Group management via hashmap
- Label set hash used as key
- Exact label comparison on hash collision
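A minimal grouping sketch in Python; a sorted label tuple plays the role of the label-set hash here, so the collision re-check the real engine needs does not arise:

```python
def sum_by(series, grouping):
    """sum by (<grouping>): series is a list of (labels_dict, value)
    pairs; only the grouping labels survive in the result."""
    groups = {}
    for labels, value in series:
        # Group key: the values of the grouping labels only.
        key = tuple((k, labels.get(k, "")) for k in sorted(grouping))
        groups[key] = groups.get(key, 0.0) + value
    return groups
```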

8. Vector Matching (Binary Operations)

8.1 One-to-One Matching

metric_a + metric_b

Matching algorithm:
1. Index series from both vectors by label set
2. Find pairs with identical matching label sets
3. Perform operation (addition here)
4. Result series uses left-side label set

Performance:
- Hash join approach with O(n+m) time complexity
- One side is indexed in a hash table keyed by label signature (in the Prometheus implementation, the right-hand/"one" side)
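The matching algorithm above can be sketched as a hash join (Python; series are (labels, value) pairs with the metric name assumed already dropped from the labels):

```python
def binop_one_to_one(lhs, rhs, op):
    """One-to-one vector matching: index one side by label signature,
    then join. Result keeps the left-side labels."""
    index = {}
    for labels, value in rhs:
        sig = tuple(sorted(labels.items()))
        if sig in index:   # two rhs series with identical labels
            raise ValueError("found duplicate series for the match group")
        index[sig] = value
    out = []
    for labels, value in lhs:
        sig = tuple(sorted(labels.items()))
        if sig in index:
            out.append((labels, op(value, index[sig])))
    return out
```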

8.2 Many-to-One/One-to-Many Matching

metric_a * on(job) group_left metric_b

Processing:
1. Determine matching key with on(job)
2. Index metric_b (one side) by job in hashmap
3. For each series in metric_a (many side):
   a. Look up matching item in metric_b by job label
   b. Perform operation
   c. Retain all labels from metric_a (group_left)
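The group_left processing above can be sketched as follows (Python; `on` lists the matching labels, the "one" side must be unique per key, and each "many"-side series keeps all of its own labels):

```python
def binop_group_left(lhs, rhs, on, op):
    """Many-to-one matching with on(...) group_left (simplified)."""
    index = {}
    for labels, value in rhs:                     # the "one" side
        key = tuple(labels.get(k, "") for k in on)
        if key in index:
            raise ValueError("found duplicate series for the match group")
        index[key] = value
    out = []
    for labels, value in lhs:                     # the "many" side
        key = tuple(labels.get(k, "") for k in on)
        if key in index:
            # group_left: retain all labels from the many side.
            out.append((labels, op(value, index[key])))
    return out
```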

9. Subqueries

9.1 Subquery Processing

max_over_time(rate(http_requests_total[5m])[30m:1m])

Evaluation:
1. Outer function (max_over_time) needs range vector
2. Subquery part: rate(http_requests_total[5m])[30m:1m]
3. Inner evaluation:
   - From 30 minutes before current time (T) to T
   - Evaluate rate(http_requests_total[5m]) at 1-minute intervals
   - Generate 30 instant vector results
4. Convert to range vector
5. max_over_time calculates maximum for each series

9.2 Subquery Optimization

Key optimizations for subquery evaluation:
1. Reuse inner query range vectors (same series)
2. Step alignment: align subquery step with outer step
3. Use global evaluation_interval when resolution not specified

10. Memory Management

10.1 Sample Counting

--query.max-samples (default 50,000,000)

Counting approach:
- Accumulate samples loaded/generated at each step
- For range vectors, count all samples within range
- Immediately abort query when limit exceeded

10.2 Query Concurrency Control

--query.max-concurrency (default 20)

Implementation:
- Semaphore limits concurrent queries
- New queries wait for semaphore acquisition
- Return error if acquisition fails within timeout
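A semaphore-based gate can be sketched as below (Python; the class and parameter names are illustrative, not Prometheus's):

```python
import threading

class QueryGate:
    """Limits concurrent queries, like --query.max-concurrency."""
    def __init__(self, max_concurrency=20):
        self._sem = threading.BoundedSemaphore(max_concurrency)

    def run(self, query_fn, timeout=2.0):
        # New queries wait for the semaphore; error on timeout.
        if not self._sem.acquire(timeout=timeout):
            raise TimeoutError("query queue wait exceeded timeout")
        try:
            return query_fn()
        finally:
            self._sem.release()
```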

10.3 Point Pool

The PromQL engine uses point pools to minimize memory allocation:

Point pool operation:
1. Allocate point slices from pool at query start
2. Reuse between step evaluations
3. Return to pool at query completion
4. Uses sync.Pool to reduce GC pressure

11. Native Histogram Processing

11.1 Native Histogram Aggregation

Native Histograms have different processing paths than float samples:
1. Align histogram schemas (resolution)
2. Merge at bucket level
3. Auto-convert schemas when needed during aggregation

sum(native_histogram_metric) evaluation:
1. Collect histograms from each series
2. Unify to lower resolution if schemas differ
3. Sum per-bucket counts
4. Sum sum and count fields

12. Performance Optimization Techniques

12.1 Series Set Merging

When same series exist across multiple blocks:
1. Search series set in each block
2. Consolidate via merge sort
3. Filter at chunk level by time range
4. Deduplicate

12.2 StepInvariantExpr Optimization

Detect expressions returning identical results regardless of step:
- Numeric literals
- Scalar functions without offset
- Certain aggregation results

When detected:
- Evaluate once and copy result to all steps
- Eliminate unnecessary repeated evaluation

12.3 Query Preprocessing

Optimization passes:
1. Constant folding: 2 + 3 -> 5
2. Label matching optimization: generate index hints
3. Remove unnecessary label operations
4. Early return for empty results
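Constant folding, the first pass, can be sketched over a tiny expression tree (Python; a node is either a number, a non-constant placeholder string, or an (op, lhs, rhs) tuple):

```python
import operator

OPS = {"+": operator.add, "-": operator.sub,
       "*": operator.mul, "/": operator.truediv}

def fold_constants(node):
    """Recursively replace all-constant subtrees with their value."""
    if not isinstance(node, tuple):
        return node                      # leaf: number or selector
    op, lhs, rhs = node
    lhs, rhs = fold_constants(lhs), fold_constants(rhs)
    if isinstance(lhs, (int, float)) and isinstance(rhs, (int, float)):
        return OPS[op](lhs, rhs)         # both sides constant: fold
    return (op, lhs, rhs)
```

For example, `2 + 3` folds to `5`, while `(2 + 3) * up` folds only its constant subtree.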

13. Error Handling and Limitations

13.1 Common Error Conditions

1. "query processing would load too many samples"
   - max-samples limit exceeded
   - Fix: reduce query range or increase limit

2. "query timed out in expression evaluation"
   - query.timeout exceeded
   - Fix: optimize query or increase timeout

3. "found duplicate series for the match group"
   - Many-to-one relationship without group_left/right
   - Fix: add group_left or group_right

4. "many-to-many matching not allowed"
   - Many-on-both-sides matching not supported
   - Fix: aggregate one side to reduce to one

13.2 Known Limitations

1. Limited join capability: no SQL-style joins
2. Limited string operations: no string comparison/manipulation
3. No sort guarantee: series order in results is non-deterministic
4. Subquery nesting limits: excessive nesting risks memory explosion

14. Summary

The PromQL engine efficiently processes a functional query language specialized for time series data. Key aspects include clean, precedence-driven syntax analysis, step-based range query evaluation, flexible sample lookup via lookback delta, and memory efficiency through point pools and StepInvariantExpr optimization.

In the next post, we will analyze the Prometheus service discovery mechanism, covering the kubernetes_sd watch mechanism, relabeling internals, and target lifecycle.