LLM Serving: Speculative Decoding Production Benchmark 2026


Benchmark Purpose and Scope

This document presents production benchmark results comparing speculative decoding techniques under identical hardware, an identical prompt set, and an identical measurement methodology, as of early 2026. The goal is to provide figures from a real-world serving environment rather than the idealized conditions of academic papers.

Techniques compared:

  • Vanilla Speculative Decoding (independent draft model) - Leviathan et al., arXiv:2211.17192
  • Medusa (multiple decoding heads) - Cai et al., arXiv:2401.10774
  • EAGLE-1/EAGLE-3 (feature-level speculation) - Li et al., arXiv:2401.15077 / arXiv:2503.01840
  • Prompt Lookup Decoding (N-gram matching, no training required)

Techniques excluded and reasons:

  • Staged Speculative Decoding: Low practical adoption rate relative to implementation complexity
  • REST (Retrieval-based): Requires serving architecture changes due to external datastore dependency

Test Environment

Hardware

GPU: 4x NVIDIA A100 80GB SXM (NVLink connected)
CPU: AMD EPYC 7763 64-Core
RAM: 512GB DDR4
OS: Ubuntu 22.04 LTS
CUDA: 12.4
Driver: 550.90.07

Software Stack

vLLM: 0.7.3
PyTorch: 2.5.1
Transformers: 4.47.0
Python: 3.11.10

Target Models

| Model | Parameters | Tensor Parallel | Notes |
|---|---|---|---|
| Llama 3.1 70B Instruct | 70B | TP=4 | Primary benchmark model |
| Qwen 2.5 72B Instruct | 72B | TP=4 | Cross-validation |
| Mistral Large 2 (123B) | 123B | TP=4 | Large model verification |

Prompt Set Composition

The prompt set was designed to reflect actual production traffic distribution, not a single type.

# Prompt set composition (500 prompts)
prompt_distribution = {
    "short_qa": 150,          # 1-2 sentence questions, expected output 50-100 tokens
    "summarization": 100,      # 500-1000 word document summary, expected output 150-300 tokens
    "code_generation": 80,     # function/class generation, expected output 100-500 tokens
    "creative_writing": 50,    # stories/essays, expected output 300-800 tokens
    "translation": 70,         # KR-EN/EN-KR translation, expected output 100-300 tokens
    "structured_output": 50,   # JSON/YAML generation, expected output 50-200 tokens
}
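
For reference, this distribution can be expanded into the flat prompt list consumed by the benchmark runner below. The sample texts and the per-category `max_tokens` caps here are illustrative placeholders, not the actual evaluation prompts.

```python
# Illustrative helper: flatten {category: count} into the list of
# {text, category, max_tokens} dicts that run_full_benchmark expects.
# The token caps below are placeholder values, not the benchmark's own.
MAX_TOKENS_BY_CATEGORY = {
    "short_qa": 128,
    "summarization": 384,
    "code_generation": 512,
    "creative_writing": 1024,
    "translation": 384,
    "structured_output": 256,
}

def build_prompt_set(distribution: dict[str, int],
                     texts: dict[str, list[str]]) -> list[dict]:
    """Expand category counts into a flat prompt list, cycling through
    the available texts for each category."""
    prompts = []
    for category, count in distribution.items():
        pool = texts[category]
        for i in range(count):
            prompts.append({
                "text": pool[i % len(pool)],
                "category": category,
                "max_tokens": MAX_TOKENS_BY_CATEGORY[category],
            })
    return prompts
```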

Benchmark Execution Code

import json
import time
import statistics
from dataclasses import dataclass, asdict
from openai import OpenAI

@dataclass
class BenchmarkResult:
    method: str
    model: str
    draft_model: str
    num_spec_tokens: int
    prompt_category: str
    ttft_ms: float
    tpot_ms: float
    e2e_ms: float
    output_tokens: int
    accept_ratio: float
    gpu_memory_gb: float

def run_single_benchmark(
    client: OpenAI,
    model: str,
    prompt: str,
    max_tokens: int,
) -> dict:
    """Run benchmark for a single prompt"""
    start = time.perf_counter()
    first_token_time = None
    token_count = 0

    stream = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": prompt}],
        max_tokens=max_tokens,
        temperature=0.0,  # unified greedy decoding
        stream=True,
    )

    for chunk in stream:
        delta = chunk.choices[0].delta
        if delta.content:
            if first_token_time is None:
                first_token_time = time.perf_counter()
            token_count += 1

    end = time.perf_counter()

    ttft = (first_token_time - start) if first_token_time else (end - start)
    e2e = end - start
    tpot = (end - first_token_time) / max(token_count - 1, 1) if first_token_time else e2e

    return {
        "ttft_ms": round(ttft * 1000, 2),
        "tpot_ms": round(tpot * 1000, 2),
        "e2e_ms": round(e2e * 1000, 2),
        "output_tokens": token_count,
    }

def run_full_benchmark(
    base_url: str,
    model: str,
    prompts: list[dict],
    warmup_runs: int = 5,
    measure_runs: int = 3,
) -> list[dict]:
    """Run benchmark across the full prompt set"""
    client = OpenAI(base_url=f"{base_url}/v1", api_key="benchmark")

    # Warmup
    for i in range(warmup_runs):
        run_single_benchmark(client, model, prompts[0]["text"], 64)

    results = []
    for run_idx in range(measure_runs):
        for prompt in prompts:
            result = run_single_benchmark(
                client, model, prompt["text"], prompt["max_tokens"]
            )
            result["category"] = prompt["category"]
            result["run"] = run_idx
            results.append(result)

    return results

Benchmark Results: Llama 3.1 70B Instruct

Overall Summary (500 prompts, 3-run average)

| Technique | Draft Model | Accept Ratio | TPOT P50 (ms) | TPOT P95 (ms) | E2E Speedup | Additional GPU Memory |
|---|---|---|---|---|---|---|
| Baseline (no spec) | - | - | 42.3 | 58.7 | 1.00x | 0 GB |
| Vanilla SD | Llama 3.1 8B | 0.62 | 19.8 | 31.2 | 1.95x | ~16 GB |
| Medusa-2 | 5 heads | 0.68 | 17.1 | 27.8 | 2.21x | ~0.8 GB |
| EAGLE-1 | EAGLE head | 0.73 | 14.9 | 24.1 | 2.52x | ~1.5 GB |
| EAGLE-3 | EAGLE-3 head | 0.81 | 12.3 | 19.6 | 2.89x | ~1.8 GB |
| Prompt Lookup | N-gram (n=3) | 0.41 | 28.5 | 45.3 | 1.38x | 0 GB |

Per-Category Detailed Results (EAGLE-3)

Performance varies significantly by prompt type, so category-level results should be reviewed separately.

| Category | Accept Ratio | E2E Speedup | TPOT P50 (ms) | Notes |
|---|---|---|---|---|
| short_qa | 0.84 | 2.95x | 11.8 | Short output but high prediction accuracy |
| summarization | 0.83 | 3.12x | 11.5 | Highest speedup |
| code_generation | 0.72 | 2.31x | 15.2 | High syntax prediction but drops on logic |
| creative_writing | 0.76 | 2.67x | 13.4 | Surprisingly high accept ratio |
| translation | 0.85 | 3.05x | 11.9 | Translation is highly dependent on input, making prediction easier |
| structured_output | 0.88 | 3.21x | 10.9 | Structural patterns of JSON/YAML favor prediction |

# Accept ratio measurements by temperature (EAGLE-3, Llama 3.1 70B)
temperature_results = {
    0.0: {"accept_ratio": 0.81, "speedup": 2.89},
    0.3: {"accept_ratio": 0.76, "speedup": 2.61},
    0.5: {"accept_ratio": 0.71, "speedup": 2.38},
    0.7: {"accept_ratio": 0.64, "speedup": 2.08},
    1.0: {"accept_ratio": 0.55, "speedup": 1.72},
    1.2: {"accept_ratio": 0.47, "speedup": 1.41},
    1.5: {"accept_ratio": 0.38, "speedup": 1.15},  # virtually no effect
}

# Conclusion: disable speculative decoding when temperature > 1.0
TEMP_THRESHOLD = 1.0
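
One way to act on this threshold is request routing; a minimal sketch, assuming two serving pools, one with speculation enabled and one without (the URLs are hypothetical):

```python
# Route requests to a spec-decoding pool or a baseline pool based on
# the requested temperature. Endpoints below are placeholder names.
SPEC_URL = "http://spec-server:8000"      # hypothetical endpoint
BASELINE_URL = "http://base-server:8000"  # hypothetical endpoint
TEMP_THRESHOLD = 1.0

def select_backend(temperature: float) -> str:
    """Above the threshold, speculation no longer pays off
    (speedup fell to ~1.15x at temperature 1.5 in the table above)."""
    return BASELINE_URL if temperature > TEMP_THRESHOLD else SPEC_URL
```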

Benchmark Results: Concurrency Impact

In actual production, serving must handle many concurrent requests rather than one at a time. We measured how the effectiveness of speculative decoding changes as concurrency increases.

Performance by Concurrent Request Count (EAGLE-3, Llama 3.1 70B)

| Concurrency | Throughput (tokens/s) | E2E Speedup | Accept Ratio | Notes |
|---|---|---|---|---|
| 1 | 81.3 | 2.89x | 0.81 | Optimal conditions |
| 2 | 156.2 | 2.71x | 0.80 | Nearly maintained |
| 4 | 289.5 | 2.43x | 0.79 | Slight decrease |
| 8 | 498.7 | 1.98x | 0.78 | Decline begins |
| 16 | 721.3 | 1.52x | 0.77 | Notable decrease |
| 32 | 890.4 | 1.18x | 0.76 | Minimal effect |
| 64 | 965.1 | 0.95x | 0.75 | Negative impact occurs |

# Benchmark execution code by concurrency level
import asyncio
import statistics
import time

import aiohttp

async def concurrent_benchmark(
    base_url: str,
    model: str,
    prompts: list[dict],
    concurrency: int,
) -> dict:
    """Benchmark by concurrent request count"""
    semaphore = asyncio.Semaphore(concurrency)
    results = []

    async def single_request(session, prompt):
        async with semaphore:
            start = time.perf_counter()
            payload = {
                "model": model,
                "messages": [{"role": "user", "content": prompt["text"]}],
                "max_tokens": prompt["max_tokens"],
                "temperature": 0.0,
                "stream": False,
            }
            async with session.post(
                f"{base_url}/v1/chat/completions",
                json=payload,
            ) as resp:
                data = await resp.json()
                end = time.perf_counter()
                tokens = data["usage"]["completion_tokens"]
                return {
                    "e2e_ms": (end - start) * 1000,
                    "output_tokens": tokens,
                    "tokens_per_sec": tokens / (end - start),
                }

    async with aiohttp.ClientSession() as session:
        tasks = [single_request(session, p) for p in prompts]
        total_start = time.perf_counter()
        results = await asyncio.gather(*tasks)
        total_elapsed = time.perf_counter() - total_start

    total_tokens = sum(r["output_tokens"] for r in results)
    return {
        "concurrency": concurrency,
        "total_tokens": total_tokens,
        "total_elapsed_sec": round(total_elapsed, 2),
        "throughput_tokens_per_sec": round(total_tokens / total_elapsed, 1),
        "avg_e2e_ms": round(statistics.mean(r["e2e_ms"] for r in results), 1),
        "p95_e2e_ms": round(
            sorted(r["e2e_ms"] for r in results)[int(len(results) * 0.95)], 1
        ),
    }

Key observation: At concurrency 32 and above, the benefits of speculative decoding largely disappear, and at 64, end-to-end latency is actually worse than the baseline (0.95x speedup). Once the system becomes compute-bound at high batch sizes, the draft model's extra computation and KV cache memory contention turn speculation into pure overhead.
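
A simple admission rule can be derived from these measurements; the sketch below hard-codes the table above rather than modeling the effect, so the numbers only apply to this hardware and model combination:

```python
# Decide whether speculative decoding pays off at the current concurrency,
# using a nearest-at-or-above lookup over the measured speedups above.
MEASURED_SPEEDUP = {1: 2.89, 2: 2.71, 4: 2.43, 8: 1.98, 16: 1.52, 32: 1.18, 64: 0.95}

def estimated_speedup(concurrency: int) -> float:
    """Return the speedup measured at the nearest concurrency >= the input;
    beyond 64, assume the worst measured value."""
    for c in sorted(MEASURED_SPEEDUP):
        if concurrency <= c:
            return MEASURED_SPEEDUP[c]
    return MEASURED_SPEEDUP[64]

def spec_decoding_worthwhile(concurrency: int, min_speedup: float = 1.2) -> bool:
    return estimated_speedup(concurrency) >= min_speedup
```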

Benchmark Results: Cost Efficiency Analysis

Since serving costs are determined by GPU time, "how many more requests can be processed with the same budget" is more important in production than simple latency improvement.

Cost Efficiency Comparison (A100 80GB x4, $13.04/hour basis)

| Technique | Hourly Throughput (requests) | Cost per Request ($) | Cost Reduction vs. Baseline |
|---|---|---|---|
| Baseline | 3,200 | $0.00408 | - |
| Vanilla SD | 5,440 | $0.00240 | 41% reduction |
| EAGLE-3 | 7,680 | $0.00170 | 58% reduction |
| Medusa-2 | 6,400 | $0.00204 | 50% reduction |

Note: The above figures are based on concurrency 8. In actual serving, numbers vary depending on traffic patterns, request size distribution, and SLO requirements.
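
The cost-per-request column follows directly from the hourly GPU rate divided by hourly throughput:

```python
# The cost-per-request figures are simply hourly GPU cost / hourly throughput.
HOURLY_COST = 13.04  # 4x A100 80GB, $/hour (the rate used in the table)

def cost_per_request(requests_per_hour: int) -> float:
    return HOURLY_COST / requests_per_hour

baseline = cost_per_request(3200)   # ≈ $0.00408
eagle3 = cost_per_request(7680)     # ≈ $0.00170
reduction = 1 - eagle3 / baseline   # ≈ 0.583 → the 58% figure above
```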

Draft Model Training/Maintenance Costs

| Technique | Initial Training Cost | Training Time | Retraining Needed on Target Model Update |
|---|---|---|---|
| Vanilla SD | $0 (uses existing model) | 0 | Not required |
| Medusa-2 | ~$50 (1x A100, 3 hours) | 3 hours | Required |
| EAGLE-1 | ~$100 (1x A100, 6 hours) | 6 hours | Required |
| EAGLE-3 | ~$200 (1x A100, 12 hours) | 12 hours | Required |
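
Combining the two tables gives a quick break-even estimate for EAGLE-3's one-time training cost. The figures are copied from above; converting requests into serving hours assumes the concurrency-8 throughput, so treat the result as an order-of-magnitude estimate:

```python
# Back-of-envelope break-even for EAGLE-3's one-time training cost,
# using the cost-per-request figures from the cost efficiency table.
TRAINING_COST = 200.0     # EAGLE-3 one-time training, $
BASELINE_COST = 0.00408   # $/request, baseline
EAGLE3_COST = 0.00170     # $/request, EAGLE-3
EAGLE3_THROUGHPUT = 7680  # requests/hour

savings_per_request = BASELINE_COST - EAGLE3_COST
break_even_requests = TRAINING_COST / savings_per_request    # ≈ 84,000 requests
hours_to_break_even = break_even_requests / EAGLE3_THROUGHPUT  # ≈ 11 hours of serving
```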

num_speculative_tokens Sensitivity Analysis

We measured performance changes based on num_speculative_tokens values using the EAGLE-3 + Llama 3.1 70B combination.

| num_speculative_tokens | Accept Ratio | Mean Accepted Length | TPOT P50 (ms) | E2E Speedup | GPU Memory Increase |
|---|---|---|---|---|---|
| 1 | 0.91 | 0.91 | 35.2 | 1.20x | +0.3 GB |
| 3 | 0.86 | 2.58 | 16.8 | 2.52x | +0.8 GB |
| 5 | 0.81 | 4.05 | 12.3 | 2.89x | +1.8 GB |
| 7 | 0.76 | 5.32 | 11.1 | 3.02x | +2.9 GB |
| 10 | 0.69 | 6.90 | 10.8 | 2.95x | +4.5 GB |
| 15 | 0.58 | 8.70 | 11.5 | 2.71x | +7.2 GB |

Conclusion: num_speculative_tokens = 5-7 is the optimal range. Beyond 7, the declining accept ratio offsets the additional accepted tokens, and GPU memory consumption keeps growing for no throughput gain.
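
This plateau is predicted by the standard expectation from Leviathan et al. (arXiv:2211.17192): assuming a constant per-token acceptance probability alpha and draft length k, the expected tokens per target step is (1 - alpha^(k+1)) / (1 - alpha), which saturates at 1/(1 - alpha) as k grows. A quick numerical check, with alpha = 0.8 as a stand-in value:

```python
# Expected tokens generated per target-model step with draft length k and
# constant per-token acceptance probability alpha (Leviathan et al.):
#   E[tokens] = (1 - alpha**(k + 1)) / (1 - alpha)
# The marginal gain per extra draft token shrinks rapidly, matching the
# plateau around k = 7 in the table above.
def expected_tokens_per_step(alpha: float, k: int) -> float:
    if alpha == 1.0:
        return float(k + 1)
    return (1 - alpha ** (k + 1)) / (1 - alpha)

for k in (1, 3, 5, 7, 10, 15):
    print(k, round(expected_tokens_per_step(0.8, k), 2))
```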

Reproducible Benchmark Execution

Full Benchmark Script

#!/bin/bash
# run_benchmark.sh - Full benchmark execution script
set -euo pipefail

MODEL="meta-llama/Llama-3.1-70B-Instruct"
DRAFT_MODELS=(
    "none"                                    # baseline
    "meta-llama/Llama-3.1-8B-Instruct"       # vanilla SD
    "eagle3-llama3.1-70b-instruct"            # EAGLE-3
)
SPEC_TOKENS=(0 5 5)
METHODS=("baseline" "vanilla" "eagle3")

RESULTS_DIR="benchmark_results/$(date +%Y%m%d_%H%M%S)"
mkdir -p "$RESULTS_DIR"

for i in "${!METHODS[@]}"; do
    method="${METHODS[$i]}"
    draft="${DRAFT_MODELS[$i]}"
    n_tokens="${SPEC_TOKENS[$i]}"

    echo "=== Running benchmark: $method ==="

    # Start server
    if [ "$method" = "baseline" ]; then
        python -m vllm.entrypoints.openai.api_server \
            --model "$MODEL" \
            --tensor-parallel-size 4 \
            --gpu-memory-utilization 0.92 \
            --port 8000 &
    elif [ "$method" = "eagle3" ]; then
        python -m vllm.entrypoints.openai.api_server \
            --model "$MODEL" \
            --speculative-model "$draft" \
            --speculative-method eagle \
            --num-speculative-tokens "$n_tokens" \
            --tensor-parallel-size 4 \
            --gpu-memory-utilization 0.92 \
            --port 8000 &
    else
        python -m vllm.entrypoints.openai.api_server \
            --model "$MODEL" \
            --speculative-model "$draft" \
            --num-speculative-tokens "$n_tokens" \
            --tensor-parallel-size 4 \
            --gpu-memory-utilization 0.92 \
            --port 8000 &
    fi

    SERVER_PID=$!
    # Wait until the server reports healthy (vLLM exposes a /health endpoint)
    for _ in $(seq 1 120); do
        curl -sf http://localhost:8000/health > /dev/null && break
        sleep 5
    done

    # Run benchmark
    python benchmark.py \
        --prompts eval_prompts.json \
        --model "$MODEL" \
        --warmup 10 \
        --runs 3 \
        --output "$RESULTS_DIR/${method}.json"

    # Stop server
    kill $SERVER_PID
    wait $SERVER_PID 2>/dev/null || true
    sleep 10
done

# Aggregate results
python aggregate_results.py --input-dir "$RESULTS_DIR" --output "$RESULTS_DIR/summary.json"
echo "Results saved to $RESULTS_DIR/summary.json"

Result Aggregation and Comparison

# aggregate_results.py
import json
import sys
from pathlib import Path

def aggregate_results(results_dir: str) -> dict:
    """Read benchmark result files and generate comparison table"""
    results_path = Path(results_dir)
    summary = {}

    for result_file in sorted(results_path.glob("*.json")):
        if result_file.name == "summary.json":
            continue
        method = result_file.stem
        data = json.loads(result_file.read_text())

        # Calculate P50, P95
        tpot_values = [r["tpot_ms"] for r in data]
        e2e_values = [r["e2e_ms"] for r in data]

        summary[method] = {
            "tpot_p50": round(sorted(tpot_values)[len(tpot_values) // 2], 2),
            "tpot_p95": round(sorted(tpot_values)[int(len(tpot_values) * 0.95)], 2),
            "e2e_p50": round(sorted(e2e_values)[len(e2e_values) // 2], 2),
            "e2e_p95": round(sorted(e2e_values)[int(len(e2e_values) * 0.95)], 2),
            "total_requests": len(data),
        }

    # Calculate speedup (relative to baseline)
    if "baseline" in summary:
        baseline_e2e = summary["baseline"]["e2e_p50"]
        for method in summary:
            summary[method]["speedup"] = round(
                baseline_e2e / summary[method]["e2e_p50"], 2
            )

    return summary

if __name__ == "__main__":
    results_dir = sys.argv[1] if len(sys.argv) > 1 else "benchmark_results/latest"
    summary = aggregate_results(results_dir)
    print(json.dumps(summary, indent=2))

Production Application Recommendations

Here are practical application guidelines based on the benchmark results.

Technique Selection Decision Tree

Q: Do you frequently swap the target model?
├─ Yes: Vanilla SD (no retraining needed)
│       or Prompt Lookup (no training needed)
└─ No: Q: Is there spare GPU memory?
       ├─ Yes (>10GB spare): EAGLE-3 (best performance)
       └─ No:  Q: Do you have lightweight training infrastructure?
              ├─ Yes: Medusa-2 (memory efficient)
              └─ No:  Vanilla SD (Llama 3.1 8B as draft)
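
The same tree as a function, for embedding in deployment tooling. The technique names and the 10 GB memory threshold follow the tree above:

```python
# Decision tree above, expressed as a function.
def select_technique(frequent_model_swaps: bool,
                     spare_gpu_memory_gb: float,
                     has_training_infra: bool) -> str:
    if frequent_model_swaps:
        # No retraining on model swap: Vanilla SD or Prompt Lookup
        return "vanilla_sd_or_prompt_lookup"
    if spare_gpu_memory_gb > 10:
        return "eagle3"     # best performance when memory allows
    if has_training_infra:
        return "medusa2"    # memory efficient, needs lightweight training
    return "vanilla_sd"     # Llama 3.1 8B as draft
```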

| SLO | Recommended Technique | num_speculative_tokens | Notes |
|---|---|---|---|
| TPOT P95 under 20 ms | EAGLE-3 | 5 | Accept ratio over 0.8 expected |
| TPOT P95 under 35 ms | Vanilla SD or Medusa | 5 | Low operational complexity |
| Maximum throughput | EAGLE-3, concurrency 8 or less | 7 | Concurrency limit mandatory |
| Minimum cost | EAGLE-3 | 5 | 58% cost reduction |

Benchmark Limitations

The following limitations should be recognized when interpreting these benchmark results.

  1. Hardware dependency: Results are based on A100 SXM. Speedup ratios differ on A10G or L4. Especially with PCIe connections without NVLink, TP communication overhead increases and speedup decreases.
  2. Prompt set bias: Korean/English mixed prompts were used. Services with a high proportion of pure code generation or mathematical reasoning may see different accept ratios.
  3. vLLM version dependency: vLLM's speculative decoding implementation is rapidly improving, so numbers may vary by version.
  4. KV cache impact: max_model_len was fixed at 4096. Increasing to 8192 or above changes concurrency handling capacity due to KV cache memory constraints.
  5. No quantization applied: This benchmark is based on bf16 precision. Applying speculative decoding to GPTQ/AWQ quantized models requires a separate benchmark.

Quiz

Q1. Which technique recorded the highest E2E speedup in this benchmark, and what was the value?

Answer: EAGLE-3 recorded the highest speedup at 2.89x. In the structured output (JSON/YAML) category, it reached up to 3.21x.

Q2. Why is the speculative decoding speedup 0.95x at concurrency 64?

Answer: At high concurrency, the system is already compute-bound, so the memory-bound mitigation benefit of speculation disappears. The additional computation from the draft model and KV cache memory contention actually degrade performance.

Q3. Why does speedup decrease when num_speculative_tokens is increased to 15?

Answer: As the number of speculative tokens increases, the accept ratio drops sharply for later positions. Of 15 tokens, only an average of 8.7 are accepted, and the draft computation for the remaining 6.3 is wasted. This overhead offsets the benefit of additional accepted tokens.

Q4. Why is Prompt Lookup Decoding's accept ratio low at 0.41?

Answer: Prompt Lookup reuses N-grams from the input text, so matching probability is low for tasks with little lexical overlap between input and output (translation, creative writing, etc.). It is only effective for tasks like document summarization that directly reuse input words.

Q5. Why is EAGLE-3's additional GPU memory more than twice that of Medusa-2?

Answer: Medusa consists of lightweight heads made of a few linear layers, while EAGLE-3 includes transformer layers that process the target model's features, resulting in more parameters. However, this more complex structure yields an accept ratio of 0.81, higher than Medusa's 0.68.

Q6. Why is the accept ratio high at 0.85 for translation tasks?

Answer: Translation is strongly conditioned on the source sentence's structure and vocabulary, making it easy for the draft model to predict the next token. Especially in Korean-English translation, the correspondence patterns of high-frequency expressions are clear, resulting in high prediction accuracy.

Q7. What should be changed first when reproducing this benchmark in your own environment?

Answer: The prompt set composition. You need to create a prompt set that reflects your service's actual traffic distribution (request types, input/output lengths, temperature distribution) for a meaningful benchmark. Results from a general-purpose prompt set can differ significantly from your environment.

Q8. What prerequisites are needed to achieve the 58% cost reduction?

Answer: Prerequisites include completed EAGLE-3 draft head training, operation at concurrency 8 or below, temperature settings close to 0, and spare GPU memory (+2GB or more). In actual production, concurrency fluctuates constantly with traffic, so dynamic speculative decoding on/off routing is essential.
