LLM Serving: Speculative Decoding Production Benchmark 2026


Benchmark Purpose and Scope

This document presents production benchmark results comparing speculative decoding techniques under identical hardware, an identical prompt set, and an identical measurement methodology, as of early 2026. The goal is to provide figures from a real-world serving environment rather than the idealized conditions of academic papers.

Techniques compared:

  • Vanilla Speculative Decoding (independent draft model) - Leviathan et al., arXiv:2211.17192
  • Medusa (multiple decoding heads) - Cai et al., arXiv:2401.10774
  • EAGLE-1/EAGLE-3 (feature-level speculation) - Li et al., arXiv:2401.15077 / arXiv:2503.01840
  • Prompt Lookup Decoding (N-gram matching, no training required)

Techniques excluded and reasons:

  • Staged Speculative Decoding: Low practical adoption rate relative to implementation complexity
  • REST (Retrieval-based): Requires serving architecture changes due to external datastore dependency

Test Environment

Hardware

GPU: 4x NVIDIA A100 80GB SXM (NVLink connected)
CPU: AMD EPYC 7763 64-Core
RAM: 512GB DDR4
OS: Ubuntu 22.04 LTS
CUDA: 12.4
Driver: 550.90.07

Software Stack

vLLM: 0.7.3
PyTorch: 2.5.1
Transformers: 4.47.0
Python: 3.11.10

Target Models

| Model | Parameters | Tensor Parallel | Notes |
|---|---|---|---|
| Llama 3.1 70B Instruct | 70B | TP=4 | Primary benchmark model |
| Qwen 2.5 72B Instruct | 72B | TP=4 | Cross-validation |
| Mistral Large 2 (123B) | 123B | TP=4 | Large model verification |

Prompt Set Composition

The prompt set was designed to reflect actual production traffic distribution, not a single type.

# Prompt set composition (500 prompts)
prompt_distribution = {
    "short_qa": 150,          # 1-2 sentence questions, expected output 50-100 tokens
    "summarization": 100,      # 500-1000 word document summary, expected output 150-300 tokens
    "code_generation": 80,     # function/class generation, expected output 100-500 tokens
    "creative_writing": 50,    # stories/essays, expected output 300-800 tokens
    "translation": 70,         # KR-EN/EN-KR translation, expected output 100-300 tokens
    "structured_output": 50,   # JSON/YAML generation, expected output 50-200 tokens
}
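
For reference, this distribution can be expanded into the flat prompt list consumed by the benchmark runner below. The sample texts and the per-category `max_tokens` caps here are illustrative placeholders, not the actual evaluation prompts.

```python
# Illustrative helper: flatten {category: count} into the list of
# {text, category, max_tokens} dicts that run_full_benchmark expects.
# The token caps below are placeholder values, not the benchmark's own.
MAX_TOKENS_BY_CATEGORY = {
    "short_qa": 128,
    "summarization": 384,
    "code_generation": 512,
    "creative_writing": 1024,
    "translation": 384,
    "structured_output": 256,
}

def build_prompt_set(distribution: dict[str, int],
                     texts: dict[str, list[str]]) -> list[dict]:
    """Expand category counts into a flat prompt list, cycling through
    the available texts for each category."""
    prompts = []
    for category, count in distribution.items():
        pool = texts[category]
        for i in range(count):
            prompts.append({
                "text": pool[i % len(pool)],
                "category": category,
                "max_tokens": MAX_TOKENS_BY_CATEGORY[category],
            })
    return prompts
```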

Benchmark Execution Code

import json
import time
import statistics
from dataclasses import dataclass, asdict
from openai import OpenAI

@dataclass
class BenchmarkResult:
    method: str
    model: str
    draft_model: str
    num_spec_tokens: int
    prompt_category: str
    ttft_ms: float
    tpot_ms: float
    e2e_ms: float
    output_tokens: int
    accept_ratio: float
    gpu_memory_gb: float

def run_single_benchmark(
    client: OpenAI,
    model: str,
    prompt: str,
    max_tokens: int,
) -> dict:
    """Run benchmark for a single prompt"""
    start = time.perf_counter()
    first_token_time = None
    token_count = 0

    stream = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": prompt}],
        max_tokens=max_tokens,
        temperature=0.0,  # unified greedy decoding
        stream=True,
    )

    for chunk in stream:
        delta = chunk.choices[0].delta
        if delta.content:
            if first_token_time is None:
                first_token_time = time.perf_counter()
            token_count += 1

    end = time.perf_counter()

    ttft = (first_token_time - start) if first_token_time else (end - start)
    e2e = end - start
    tpot = (end - first_token_time) / max(token_count - 1, 1) if first_token_time else e2e

    return {
        "ttft_ms": round(ttft * 1000, 2),
        "tpot_ms": round(tpot * 1000, 2),
        "e2e_ms": round(e2e * 1000, 2),
        "output_tokens": token_count,
    }

def run_full_benchmark(
    base_url: str,
    model: str,
    prompts: list[dict],
    warmup_runs: int = 5,
    measure_runs: int = 3,
) -> list[dict]:
    """Run benchmark across the full prompt set"""
    client = OpenAI(base_url=f"{base_url}/v1", api_key="benchmark")

    # Warmup
    for i in range(warmup_runs):
        run_single_benchmark(client, model, prompts[0]["text"], 64)

    results = []
    for run_idx in range(measure_runs):
        for prompt in prompts:
            result = run_single_benchmark(
                client, model, prompt["text"], prompt["max_tokens"]
            )
            result["category"] = prompt["category"]
            result["run"] = run_idx
            results.append(result)

    return results

Benchmark Results: Llama 3.1 70B Instruct

Overall Summary (500 prompts, 3-run average)

| Technique | Draft Model | Accept Ratio | TPOT P50 (ms) | TPOT P95 (ms) | E2E Speedup | Additional GPU Memory |
|---|---|---|---|---|---|---|
| Baseline (no spec) | - | - | 42.3 | 58.7 | 1.00x | 0 GB |
| Vanilla SD | Llama 3.1 8B | 0.62 | 19.8 | 31.2 | 1.95x | ~16 GB |
| Medusa-2 | 5 heads | 0.68 | 17.1 | 27.8 | 2.21x | ~0.8 GB |
| EAGLE-1 | EAGLE head | 0.73 | 14.9 | 24.1 | 2.52x | ~1.5 GB |
| EAGLE-3 | EAGLE-3 head | 0.81 | 12.3 | 19.6 | 2.89x | ~1.8 GB |
| Prompt Lookup | N-gram (n=3) | 0.41 | 28.5 | 45.3 | 1.38x | 0 GB |

Per-Category Detailed Results (EAGLE-3)

Performance varies significantly by prompt type, so category-level results should be reviewed separately.

| Category | Accept Ratio | E2E Speedup | TPOT P50 (ms) | Notes |
|---|---|---|---|---|
| short_qa | 0.84 | 2.95x | 11.8 | Short output but high prediction accuracy |
| summarization | 0.83 | 3.12x | 11.5 | Highest speedup |
| code_generation | 0.72 | 2.31x | 15.2 | High syntax prediction but drops on logic |
| creative_writing | 0.76 | 2.67x | 13.4 | Surprisingly high accept ratio |
| translation | 0.85 | 3.05x | 11.9 | Translation is highly dependent on input, making prediction easier |
| structured_output | 0.88 | 3.21x | 10.9 | Structural patterns of JSON/YAML favor prediction |

# Accept ratio measurements by temperature (EAGLE-3, Llama 3.1 70B)
temperature_results = {
    0.0: {"accept_ratio": 0.81, "speedup": 2.89},
    0.3: {"accept_ratio": 0.76, "speedup": 2.61},
    0.5: {"accept_ratio": 0.71, "speedup": 2.38},
    0.7: {"accept_ratio": 0.64, "speedup": 2.08},
    1.0: {"accept_ratio": 0.55, "speedup": 1.72},
    1.2: {"accept_ratio": 0.47, "speedup": 1.41},
    1.5: {"accept_ratio": 0.38, "speedup": 1.15},  # virtually no effect
}

# Conclusion: disable speculative decoding when temperature > 1.0
TEMP_THRESHOLD = 1.0
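
One way to act on this threshold is request routing; a minimal sketch, assuming two serving pools, one with speculation enabled and one without (the URLs are hypothetical):

```python
# Route requests to a spec-decoding pool or a baseline pool based on
# the requested temperature. Endpoints below are placeholder names.
SPEC_URL = "http://spec-server:8000"      # hypothetical endpoint
BASELINE_URL = "http://base-server:8000"  # hypothetical endpoint
TEMP_THRESHOLD = 1.0

def select_backend(temperature: float) -> str:
    """Above the threshold, speculation no longer pays off
    (speedup fell to ~1.15x at temperature 1.5 in the table above)."""
    return BASELINE_URL if temperature > TEMP_THRESHOLD else SPEC_URL
```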

Benchmark Results: Concurrency Impact

In actual production, serving must handle many concurrent requests rather than one at a time. We measured how the effectiveness of speculative decoding changes as concurrency increases.

Performance by Concurrent Request Count (EAGLE-3, Llama 3.1 70B)

| Concurrency | Throughput (tokens/s) | E2E Speedup | Accept Ratio | Notes |
|---|---|---|---|---|
| 1 | 81.3 | 2.89x | 0.81 | Optimal conditions |
| 2 | 156.2 | 2.71x | 0.80 | Nearly maintained |
| 4 | 289.5 | 2.43x | 0.79 | Slight decrease |
| 8 | 498.7 | 1.98x | 0.78 | Decline begins |
| 16 | 721.3 | 1.52x | 0.77 | Notable decrease |
| 32 | 890.4 | 1.18x | 0.76 | Minimal effect |
| 64 | 965.1 | 0.95x | 0.75 | Negative impact occurs |

# Benchmark execution code by concurrency level
import asyncio
import statistics
import time

import aiohttp

async def concurrent_benchmark(
    base_url: str,
    model: str,
    prompts: list[dict],
    concurrency: int,
) -> dict:
    """Benchmark by concurrent request count"""
    semaphore = asyncio.Semaphore(concurrency)
    results = []

    async def single_request(session, prompt):
        async with semaphore:
            start = time.perf_counter()
            payload = {
                "model": model,
                "messages": [{"role": "user", "content": prompt["text"]}],
                "max_tokens": prompt["max_tokens"],
                "temperature": 0.0,
                "stream": False,
            }
            async with session.post(
                f"{base_url}/v1/chat/completions",
                json=payload,
            ) as resp:
                data = await resp.json()
                end = time.perf_counter()
                tokens = data["usage"]["completion_tokens"]
                return {
                    "e2e_ms": (end - start) * 1000,
                    "output_tokens": tokens,
                    "tokens_per_sec": tokens / (end - start),
                }

    async with aiohttp.ClientSession() as session:
        tasks = [single_request(session, p) for p in prompts]
        total_start = time.perf_counter()
        results = await asyncio.gather(*tasks)
        total_elapsed = time.perf_counter() - total_start

    total_tokens = sum(r["output_tokens"] for r in results)
    return {
        "concurrency": concurrency,
        "total_tokens": total_tokens,
        "total_elapsed_sec": round(total_elapsed, 2),
        "throughput_tokens_per_sec": round(total_tokens / total_elapsed, 1),
        "avg_e2e_ms": round(statistics.mean(r["e2e_ms"] for r in results), 1),
        "p95_e2e_ms": round(
            sorted(r["e2e_ms"] for r in results)[int(len(results) * 0.95)], 1
        ),
    }

Key observation: At concurrency 32 and above, the benefits of speculative decoding largely disappear, and at 64, end-to-end latency is actually worse than the baseline (0.95x speedup). Once the system becomes compute-bound at high batch sizes, the draft model's extra computation and KV cache memory contention turn speculation into pure overhead.
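
A simple admission rule can be derived from these measurements; the sketch below hard-codes the table above rather than modeling the effect, so the numbers only apply to this hardware and model combination:

```python
# Decide whether speculative decoding pays off at the current concurrency,
# using a nearest-at-or-above lookup over the measured speedups above.
MEASURED_SPEEDUP = {1: 2.89, 2: 2.71, 4: 2.43, 8: 1.98, 16: 1.52, 32: 1.18, 64: 0.95}

def estimated_speedup(concurrency: int) -> float:
    """Return the speedup measured at the nearest concurrency >= the input;
    beyond 64, assume the worst measured value."""
    for c in sorted(MEASURED_SPEEDUP):
        if concurrency <= c:
            return MEASURED_SPEEDUP[c]
    return MEASURED_SPEEDUP[64]

def spec_decoding_worthwhile(concurrency: int, min_speedup: float = 1.2) -> bool:
    return estimated_speedup(concurrency) >= min_speedup
```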

Benchmark Results: Cost Efficiency Analysis

Since serving costs are determined by GPU time, "how many more requests can be processed with the same budget" is more important in production than simple latency improvement.

Cost Efficiency Comparison (A100 80GB x4, $13.04/hour basis)

| Technique | Hourly Throughput (requests) | Cost per Request ($) | Cost Reduction vs. Baseline |
|---|---|---|---|
| Baseline | 3,200 | $0.00408 | - |
| Vanilla SD | 5,440 | $0.00240 | 41% reduction |
| EAGLE-3 | 7,680 | $0.00170 | 58% reduction |
| Medusa-2 | 6,400 | $0.00204 | 50% reduction |

Note: The above figures are based on concurrency 8. In actual serving, numbers vary depending on traffic patterns, request size distribution, and SLO requirements.
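
The cost-per-request column follows directly from the hourly GPU rate divided by hourly throughput:

```python
# The cost-per-request figures are simply hourly GPU cost / hourly throughput.
HOURLY_COST = 13.04  # 4x A100 80GB, $/hour (the rate used in the table)

def cost_per_request(requests_per_hour: int) -> float:
    return HOURLY_COST / requests_per_hour

baseline = cost_per_request(3200)   # ≈ $0.00408
eagle3 = cost_per_request(7680)     # ≈ $0.00170
reduction = 1 - eagle3 / baseline   # ≈ 0.583 → the 58% figure above
```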

Draft Model Training/Maintenance Costs

| Technique | Initial Training Cost | Training Time | Retraining Needed on Target Model Update |
|---|---|---|---|
| Vanilla SD | $0 (uses existing model) | 0 | Not required |
| Medusa-2 | ~$50 (1x A100, 3 hours) | 3 hours | Required |
| EAGLE-1 | ~$100 (1x A100, 6 hours) | 6 hours | Required |
| EAGLE-3 | ~$200 (1x A100, 12 hours) | 12 hours | Required |
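
Combining the two tables gives a quick break-even estimate for EAGLE-3's one-time training cost. The figures are copied from above; converting requests into serving hours assumes the concurrency-8 throughput, so treat the result as an order-of-magnitude estimate:

```python
# Back-of-envelope break-even for EAGLE-3's one-time training cost,
# using the cost-per-request figures from the cost efficiency table.
TRAINING_COST = 200.0     # EAGLE-3 one-time training, $
BASELINE_COST = 0.00408   # $/request, baseline
EAGLE3_COST = 0.00170     # $/request, EAGLE-3
EAGLE3_THROUGHPUT = 7680  # requests/hour

savings_per_request = BASELINE_COST - EAGLE3_COST
break_even_requests = TRAINING_COST / savings_per_request    # ≈ 84,000 requests
hours_to_break_even = break_even_requests / EAGLE3_THROUGHPUT  # ≈ 11 hours of serving
```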

num_speculative_tokens Sensitivity Analysis

We measured performance changes based on num_speculative_tokens values using the EAGLE-3 + Llama 3.1 70B combination.

| num_speculative_tokens | Accept Ratio | Mean Accepted Length | TPOT P50 (ms) | E2E Speedup | GPU Memory Increase |
|---|---|---|---|---|---|
| 1 | 0.91 | 0.91 | 35.2 | 1.20x | +0.3 GB |
| 3 | 0.86 | 2.58 | 16.8 | 2.52x | +0.8 GB |
| 5 | 0.81 | 4.05 | 12.3 | 2.89x | +1.8 GB |
| 7 | 0.76 | 5.32 | 11.1 | 3.02x | +2.9 GB |
| 10 | 0.69 | 6.90 | 10.8 | 2.95x | +4.5 GB |
| 15 | 0.58 | 8.70 | 11.5 | 2.71x | +7.2 GB |

Conclusion: num_speculative_tokens = 5-7 is the optimal range. Beyond 7, the declining accept ratio offsets the additional accepted tokens, and GPU memory consumption keeps growing for no throughput gain.
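
This plateau is predicted by the standard expectation from Leviathan et al. (arXiv:2211.17192): assuming a constant per-token acceptance probability alpha and draft length k, the expected tokens per target step is (1 - alpha^(k+1)) / (1 - alpha), which saturates at 1/(1 - alpha) as k grows. A quick numerical check, with alpha = 0.8 as a stand-in value:

```python
# Expected tokens generated per target-model step with draft length k and
# constant per-token acceptance probability alpha (Leviathan et al.):
#   E[tokens] = (1 - alpha**(k + 1)) / (1 - alpha)
# The marginal gain per extra draft token shrinks rapidly, matching the
# plateau around k = 7 in the table above.
def expected_tokens_per_step(alpha: float, k: int) -> float:
    if alpha == 1.0:
        return float(k + 1)
    return (1 - alpha ** (k + 1)) / (1 - alpha)

for k in (1, 3, 5, 7, 10, 15):
    print(k, round(expected_tokens_per_step(0.8, k), 2))
```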

Reproducible Benchmark Execution

Full Benchmark Script

#!/bin/bash
# run_benchmark.sh - Full benchmark execution script
set -euo pipefail

MODEL="meta-llama/Llama-3.1-70B-Instruct"
DRAFT_MODELS=(
    "none"                                    # baseline
    "meta-llama/Llama-3.1-8B-Instruct"       # vanilla SD
    "eagle3-llama3.1-70b-instruct"            # EAGLE-3
)
SPEC_TOKENS=(0 5 5)
METHODS=("baseline" "vanilla" "eagle3")

RESULTS_DIR="benchmark_results/$(date +%Y%m%d_%H%M%S)"
mkdir -p "$RESULTS_DIR"

for i in "${!METHODS[@]}"; do
    method="${METHODS[$i]}"
    draft="${DRAFT_MODELS[$i]}"
    n_tokens="${SPEC_TOKENS[$i]}"

    echo "=== Running benchmark: $method ==="

    # Start server
    if [ "$method" = "baseline" ]; then
        python -m vllm.entrypoints.openai.api_server \
            --model "$MODEL" \
            --tensor-parallel-size 4 \
            --gpu-memory-utilization 0.92 \
            --port 8000 &
    elif [ "$method" = "eagle3" ]; then
        python -m vllm.entrypoints.openai.api_server \
            --model "$MODEL" \
            --speculative-model "$draft" \
            --speculative-method eagle \
            --num-speculative-tokens "$n_tokens" \
            --tensor-parallel-size 4 \
            --gpu-memory-utilization 0.92 \
            --port 8000 &
    else
        python -m vllm.entrypoints.openai.api_server \
            --model "$MODEL" \
            --speculative-model "$draft" \
            --num-speculative-tokens "$n_tokens" \
            --tensor-parallel-size 4 \
            --gpu-memory-utilization 0.92 \
            --port 8000 &
    fi

    SERVER_PID=$!
    # Wait until the server reports healthy (vLLM exposes a /health endpoint)
    for _ in $(seq 1 120); do
        curl -sf http://localhost:8000/health > /dev/null && break
        sleep 5
    done

    # Run benchmark
    python benchmark.py \
        --prompts eval_prompts.json \
        --model "$MODEL" \
        --warmup 10 \
        --runs 3 \
        --output "$RESULTS_DIR/${method}.json"

    # Stop server
    kill $SERVER_PID
    wait $SERVER_PID 2>/dev/null || true
    sleep 10
done

# Aggregate results
python aggregate_results.py --input-dir "$RESULTS_DIR" --output "$RESULTS_DIR/summary.json"
echo "Results saved to $RESULTS_DIR/summary.json"

Result Aggregation and Comparison

# aggregate_results.py
import json
import sys
from pathlib import Path

def aggregate_results(results_dir: str) -> dict:
    """Read benchmark result files and generate comparison table"""
    results_path = Path(results_dir)
    summary = {}

    for result_file in sorted(results_path.glob("*.json")):
        if result_file.name == "summary.json":
            continue
        method = result_file.stem
        data = json.loads(result_file.read_text())

        # Calculate P50, P95
        tpot_values = [r["tpot_ms"] for r in data]
        e2e_values = [r["e2e_ms"] for r in data]

        summary[method] = {
            "tpot_p50": round(sorted(tpot_values)[len(tpot_values) // 2], 2),
            "tpot_p95": round(sorted(tpot_values)[int(len(tpot_values) * 0.95)], 2),
            "e2e_p50": round(sorted(e2e_values)[len(e2e_values) // 2], 2),
            "e2e_p95": round(sorted(e2e_values)[int(len(e2e_values) * 0.95)], 2),
            "total_requests": len(data),
        }

    # Calculate speedup (relative to baseline)
    if "baseline" in summary:
        baseline_e2e = summary["baseline"]["e2e_p50"]
        for method in summary:
            summary[method]["speedup"] = round(
                baseline_e2e / summary[method]["e2e_p50"], 2
            )

    return summary

if __name__ == "__main__":
    results_dir = sys.argv[1] if len(sys.argv) > 1 else "benchmark_results/latest"
    summary = aggregate_results(results_dir)
    print(json.dumps(summary, indent=2))

Production Application Recommendations

Here are practical application guidelines based on the benchmark results.

Technique Selection Decision Tree

Q: Do you frequently swap the target model?
├─ Yes: Vanilla SD (no retraining needed)
│       or Prompt Lookup (no training needed)
└─ No: Q: Is there spare GPU memory?
       ├─ Yes (>10GB spare): EAGLE-3 (best performance)
       └─ No:  Q: Do you have lightweight training infrastructure?
              ├─ Yes: Medusa-2 (memory efficient)
              └─ No:  Vanilla SD (Llama 3.1 8B as draft)
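
The same tree as a function, for embedding in deployment tooling. The technique names and the 10 GB memory threshold follow the tree above:

```python
# Decision tree above, expressed as a function.
def select_technique(frequent_model_swaps: bool,
                     spare_gpu_memory_gb: float,
                     has_training_infra: bool) -> str:
    if frequent_model_swaps:
        # No retraining on model swap: Vanilla SD or Prompt Lookup
        return "vanilla_sd_or_prompt_lookup"
    if spare_gpu_memory_gb > 10:
        return "eagle3"     # best performance when memory allows
    if has_training_infra:
        return "medusa2"    # memory efficient, needs lightweight training
    return "vanilla_sd"     # Llama 3.1 8B as draft
```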

| SLO | Recommended Technique | num_speculative_tokens | Notes |
|---|---|---|---|
| TPOT P95 under 20 ms | EAGLE-3 | 5 | Accept ratio over 0.8 expected |
| TPOT P95 under 35 ms | Vanilla SD or Medusa | 5 | Low operational complexity |
| Maximum throughput | EAGLE-3, concurrency 8 or less | 7 | Concurrency limit mandatory |
| Minimum cost | EAGLE-3 | 5 | 58% cost reduction |

Benchmark Limitations

The following limitations should be recognized when interpreting these benchmark results.

  1. Hardware dependency: Results are based on A100 SXM. Speedup ratios differ on A10G or L4. Especially with PCIe connections without NVLink, TP communication overhead increases and speedup decreases.
  2. Prompt set bias: Korean/English mixed prompts were used. Services with a high proportion of pure code generation or mathematical reasoning may see different accept ratios.
  3. vLLM version dependency: vLLM's speculative decoding implementation is rapidly improving, so numbers may vary by version.
  4. KV cache impact: max_model_len was fixed at 4096. Increasing to 8192 or above changes concurrency handling capacity due to KV cache memory constraints.
  5. No quantization applied: This benchmark is based on bf16 precision. Applying speculative decoding to GPTQ/AWQ quantized models requires a separate benchmark.

Quiz

Q1. Which technique recorded the highest E2E speedup in this benchmark, and what was the value?

Answer: EAGLE-3 recorded the highest speedup at 2.89x. In the structured output (JSON/YAML) category, it reached up to 3.21x.

Q2. Why is the speculative decoding speedup 0.95x at concurrency 64?

Answer: At high concurrency, the system is already compute-bound, so the memory-bound mitigation benefit of speculation disappears. The additional computation from the draft model and KV cache memory contention actually degrade performance.

Q3. Why does speedup decrease when num_speculative_tokens is increased to 15?

Answer: As the number of speculative tokens increases, the accept ratio drops sharply for later positions. Of 15 tokens, only an average of 8.7 are accepted, and the draft computation for the remaining 6.3 is wasted. This overhead offsets the benefit of additional accepted tokens.

Q4. Why is Prompt Lookup Decoding's accept ratio low at 0.41?

Answer: Prompt Lookup reuses N-grams from the input text, so matching probability is low for tasks with little lexical overlap between input and output (translation, creative writing, etc.). It is only effective for tasks like document summarization that directly reuse input words.

Q5. Why is EAGLE-3's additional GPU memory more than twice that of Medusa-2?

Answer: Medusa consists of lightweight heads made of a few linear layers, while EAGLE-3 includes transformer layers that process the target model's features, resulting in more parameters. However, this more complex structure yields an accept ratio of 0.81, higher than Medusa's 0.68.

Q6. Why is the accept ratio high at 0.85 for translation tasks?

Answer: Translation is strongly conditioned on the source sentence's structure and vocabulary, making it easy for the draft model to predict the next token. Especially in Korean-English translation, the correspondence patterns of high-frequency expressions are clear, resulting in high prediction accuracy.

Q7. What should be changed first when reproducing this benchmark in your own environment?

Answer: The prompt set composition. You need to create a prompt set that reflects your service's actual traffic distribution (request types, input/output lengths, temperature distribution) for a meaningful benchmark. Results from a general-purpose prompt set can differ significantly from your environment.

Q8. What prerequisites are needed to achieve the 58% cost reduction?

Answer: Prerequisites include completed EAGLE-3 draft head training, operation at concurrency 8 or below, temperature settings close to 0, and spare GPU memory (+2GB or more). In actual production, concurrency fluctuates constantly with traffic, so dynamic speculative decoding on/off routing is essential.
