
- Benchmark Purpose and Scope
- Test Environment
- Benchmark Results: Llama 3.1 70B Instruct
- Benchmark Results: Concurrency Impact
- Benchmark Results: Cost Efficiency Analysis
- num_speculative_tokens Sensitivity Analysis
- Reproducible Benchmark Execution
- Production Application Recommendations
- Benchmark Limitations
- Quiz
- References
Benchmark Purpose and Scope
This document presents production benchmark results comparing speculative decoding techniques under identical hardware, an identical prompt set, and an identical measurement methodology, as of early 2026. The goal is to provide figures from a realistic serving environment rather than the idealized conditions reported in academic papers.
Techniques compared:
- Vanilla Speculative Decoding (independent draft model) - Leviathan et al., arXiv:2211.17192
- Medusa (multiple decoding heads) - Cai et al., arXiv:2401.10774
- EAGLE-1/EAGLE-3 (feature-level speculation) - Li et al., arXiv:2401.15077 / arXiv:2503.01840
- Prompt Lookup Decoding (N-gram matching, no training required)
Techniques excluded and reasons:
- Staged Speculative Decoding: Low practical adoption rate relative to implementation complexity
- REST (Retrieval-based): Requires serving architecture changes due to external datastore dependency
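Of the compared techniques, Prompt Lookup Decoding is simple enough to sketch directly: it matches the last few generated tokens against earlier text and proposes the continuation as draft tokens. A minimal illustration over plain token IDs (the function name and shape are ours, not a library API):

```python
def prompt_lookup_draft(
    tokens: list[int],
    ngram_size: int = 3,
    num_draft: int = 5,
) -> list[int]:
    """Propose draft tokens by finding the most recent earlier
    occurrence of the last `ngram_size` tokens and copying what
    followed it. Returns [] when no match exists."""
    if len(tokens) <= ngram_size:
        return []
    pattern = tokens[-ngram_size:]
    # Search backwards, excluding the trailing occurrence itself
    for start in range(len(tokens) - ngram_size - 1, -1, -1):
        if tokens[start:start + ngram_size] == pattern:
            cont = tokens[start + ngram_size:start + ngram_size + num_draft]
            if cont:
                return cont
    return []
```

Because the "draft model" is a string match, there is nothing to train and no extra GPU memory, which is why its accept ratio depends entirely on lexical overlap between input and output.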
Test Environment
Hardware
GPU: 4x NVIDIA A100 80GB SXM (NVLink connected)
CPU: AMD EPYC 7763 64-Core
RAM: 512GB DDR4
OS: Ubuntu 22.04 LTS
CUDA: 12.4
Driver: 550.90.07
Software Stack
vLLM: 0.7.3
PyTorch: 2.5.1
Transformers: 4.47.0
Python: 3.11.10
Target Models
| Model | Parameters | Tensor Parallel | Notes |
|---|---|---|---|
| Llama 3.1 70B Instruct | 70B | TP=4 | Primary benchmark model |
| Qwen 2.5 72B Instruct | 72B | TP=4 | Cross-validation |
| Mistral Large 2 (123B) | 123B | TP=4 | Large model verification |
Prompt Set Composition
The prompt set was designed to reflect actual production traffic distribution, not a single type.
```python
# Prompt set composition (500 prompts)
prompt_distribution = {
    "short_qa": 150,          # 1-2 sentence questions, expected output 50-100 tokens
    "summarization": 100,     # 500-1000 word document summary, expected output 150-300 tokens
    "code_generation": 80,    # function/class generation, expected output 100-500 tokens
    "creative_writing": 50,   # stories/essays, expected output 300-800 tokens
    "translation": 70,        # KR-EN/EN-KR translation, expected output 100-300 tokens
    "structured_output": 50,  # JSON/YAML generation, expected output 50-200 tokens
}
```
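Before a run it is worth sanity-checking that the composition sums to the intended 500 prompts. A small helper (hypothetical, not part of the harness; real prompt texts would be attached per category):

```python
def expand_prompt_set(distribution: dict[str, int]) -> list[dict]:
    """Turn the category->count map into a flat list of prompt
    placeholders, preserving the production traffic mix."""
    prompts = []
    for category, count in distribution.items():
        for i in range(count):
            prompts.append({"category": category, "id": f"{category}_{i:03d}"})
    return prompts

prompt_distribution = {
    "short_qa": 150, "summarization": 100, "code_generation": 80,
    "creative_writing": 50, "translation": 70, "structured_output": 50,
}
prompts = expand_prompt_set(prompt_distribution)
assert len(prompts) == 500  # matches the 500-prompt set above
```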
Benchmark Execution Code
```python
import json
import time
import statistics
from dataclasses import dataclass, asdict
from openai import OpenAI


@dataclass
class BenchmarkResult:
    method: str
    model: str
    draft_model: str
    num_spec_tokens: int
    prompt_category: str
    ttft_ms: float
    tpot_ms: float
    e2e_ms: float
    output_tokens: int
    accept_ratio: float
    gpu_memory_gb: float


def run_single_benchmark(
    client: OpenAI,
    model: str,
    prompt: str,
    max_tokens: int,
) -> dict:
    """Run benchmark for a single prompt"""
    start = time.perf_counter()
    first_token_time = None
    token_count = 0
    stream = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": prompt}],
        max_tokens=max_tokens,
        temperature=0.0,  # unified greedy decoding
        stream=True,
    )
    for chunk in stream:
        delta = chunk.choices[0].delta
        if delta.content:
            if first_token_time is None:
                first_token_time = time.perf_counter()
            token_count += 1
    end = time.perf_counter()
    ttft = (first_token_time - start) if first_token_time else (end - start)
    e2e = end - start
    tpot = (end - first_token_time) / max(token_count - 1, 1) if first_token_time else e2e
    return {
        "ttft_ms": round(ttft * 1000, 2),
        "tpot_ms": round(tpot * 1000, 2),
        "e2e_ms": round(e2e * 1000, 2),
        "output_tokens": token_count,
    }


def run_full_benchmark(
    base_url: str,
    model: str,
    prompts: list[dict],
    warmup_runs: int = 5,
    measure_runs: int = 3,
) -> list[dict]:
    """Run benchmark across the full prompt set"""
    client = OpenAI(base_url=f"{base_url}/v1", api_key="benchmark")
    # Warmup
    for _ in range(warmup_runs):
        run_single_benchmark(client, model, prompts[0]["text"], 64)
    results = []
    for run_idx in range(measure_runs):
        for prompt in prompts:
            result = run_single_benchmark(
                client, model, prompt["text"], prompt["max_tokens"]
            )
            result["category"] = prompt["category"]
            result["run"] = run_idx
            results.append(result)
    return results
```
Benchmark Results: Llama 3.1 70B Instruct
Overall Summary (500 prompts, 3-run average)
| Technique | Draft Model | Accept Ratio | TPOT P50 (ms) | TPOT P95 (ms) | E2E Speedup | Additional GPU Memory |
|---|---|---|---|---|---|---|
| Baseline (no spec) | - | - | 42.3 | 58.7 | 1.00x | 0 GB |
| Vanilla SD | Llama 3.1 8B | 0.62 | 19.8 | 31.2 | 1.95x | ~16 GB |
| Medusa-2 | 5 heads | 0.68 | 17.1 | 27.8 | 2.21x | ~0.8 GB |
| EAGLE-1 | EAGLE head | 0.73 | 14.9 | 24.1 | 2.52x | ~1.5 GB |
| EAGLE-3 | EAGLE-3 head | 0.81 | 12.3 | 19.6 | 2.89x | ~1.8 GB |
| Prompt Lookup | N-gram (n=3) | 0.41 | 28.5 | 45.3 | 1.38x | 0 GB |
Per-Category Detailed Results (EAGLE-3)
Performance varies significantly by prompt type, so category-level results should be reviewed separately.
| Category | Accept Ratio | E2E Speedup | TPOT P50 (ms) | Notes |
|---|---|---|---|---|
| short_qa | 0.84 | 2.95x | 11.8 | Short output but high prediction accuracy |
| summarization | 0.83 | 3.12x | 11.5 | Highest speedup |
| code_generation | 0.72 | 2.31x | 15.2 | High syntax prediction but drops on logic |
| creative_writing | 0.76 | 2.67x | 13.4 | Surprisingly high accept ratio |
| translation | 0.85 | 3.05x | 11.9 | Translation is highly dependent on input, making prediction easier |
| structured_output | 0.88 | 3.21x | 10.9 | Structural patterns of JSON/YAML favor prediction |
Accept Ratio Trends by Temperature
```python
# Accept ratio measurements by temperature (EAGLE-3, Llama 3.1 70B)
temperature_results = {
    0.0: {"accept_ratio": 0.81, "speedup": 2.89},
    0.3: {"accept_ratio": 0.76, "speedup": 2.61},
    0.5: {"accept_ratio": 0.71, "speedup": 2.38},
    0.7: {"accept_ratio": 0.64, "speedup": 2.08},
    1.0: {"accept_ratio": 0.55, "speedup": 1.72},
    1.2: {"accept_ratio": 0.47, "speedup": 1.41},
    1.5: {"accept_ratio": 0.38, "speedup": 1.15},  # virtually no effect
}

# Conclusion: disable speculative decoding when temperature > 1.0
TEMP_THRESHOLD = 1.0
```
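A request router can apply that threshold directly and estimate the expected payoff by interpolating the measured points. A server-agnostic sketch (helper names are ours; how the toggle is wired up depends on your serving stack):

```python
def should_speculate(temperature: float, threshold: float = 1.0) -> bool:
    """Enable speculative decoding only where the measured
    speedup justifies the draft overhead."""
    return temperature <= threshold

def estimate_speedup(temperature: float) -> float:
    """Linearly interpolate expected E2E speedup from the
    measured (temperature, speedup) points above."""
    points = [(0.0, 2.89), (0.3, 2.61), (0.5, 2.38),
              (0.7, 2.08), (1.0, 1.72), (1.2, 1.41), (1.5, 1.15)]
    if temperature <= points[0][0]:
        return points[0][1]
    for (t0, s0), (t1, s1) in zip(points, points[1:]):
        if temperature <= t1:
            frac = (temperature - t0) / (t1 - t0)
            return round(s0 + frac * (s1 - s0), 2)
    return points[-1][1]  # beyond 1.5: effectively no benefit
```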
Benchmark Results: Concurrency Impact
Real production serving handles many concurrent requests, not single requests in isolation. We measured how the effectiveness of speculative decoding changes as concurrency increases.
Performance by Concurrent Request Count (EAGLE-3, Llama 3.1 70B)
| Concurrency | Throughput (tokens/s) | E2E Speedup | Accept Ratio | Notes |
|---|---|---|---|---|
| 1 | 81.3 | 2.89x | 0.81 | Optimal conditions |
| 2 | 156.2 | 2.71x | 0.80 | Nearly maintained |
| 4 | 289.5 | 2.43x | 0.79 | Slight decrease |
| 8 | 498.7 | 1.98x | 0.78 | Decline begins |
| 16 | 721.3 | 1.52x | 0.77 | Notable decrease |
| 32 | 890.4 | 1.18x | 0.76 | Minimal effect |
| 64 | 965.1 | 0.95x | 0.75 | Negative impact occurs |
```python
# Benchmark execution code by concurrency level
import asyncio
import statistics
import time

import aiohttp


async def concurrent_benchmark(
    base_url: str,
    model: str,
    prompts: list[dict],
    concurrency: int,
) -> dict:
    """Benchmark by concurrent request count"""
    semaphore = asyncio.Semaphore(concurrency)

    async def single_request(session, prompt):
        async with semaphore:
            start = time.perf_counter()
            payload = {
                "model": model,
                "messages": [{"role": "user", "content": prompt["text"]}],
                "max_tokens": prompt["max_tokens"],
                "temperature": 0.0,
                "stream": False,
            }
            async with session.post(
                f"{base_url}/v1/chat/completions",
                json=payload,
            ) as resp:
                data = await resp.json()
                end = time.perf_counter()
            tokens = data["usage"]["completion_tokens"]
            return {
                "e2e_ms": (end - start) * 1000,
                "output_tokens": tokens,
                "tokens_per_sec": tokens / (end - start),
            }

    async with aiohttp.ClientSession() as session:
        tasks = [single_request(session, p) for p in prompts]
        total_start = time.perf_counter()
        results = await asyncio.gather(*tasks)
        total_elapsed = time.perf_counter() - total_start

    total_tokens = sum(r["output_tokens"] for r in results)
    return {
        "concurrency": concurrency,
        "total_tokens": total_tokens,
        "total_elapsed_sec": round(total_elapsed, 2),
        "throughput_tokens_per_sec": round(total_tokens / total_elapsed, 1),
        "avg_e2e_ms": round(statistics.mean(r["e2e_ms"] for r in results), 1),
        "p95_e2e_ms": round(
            sorted(r["e2e_ms"] for r in results)[int(len(results) * 0.95)], 1
        ),
    }
```
Key observation: from concurrency 32 the benefit of speculative decoding becomes marginal (1.18x), and at 64 it turns negative (0.95x): requests complete more slowly than without speculation, even though aggregate throughput still rises. The causes are KV cache memory contention and the draft model's extra compute in an already compute-bound batch.
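One mitigation is to disable speculation once the in-flight batch crosses the crossover point measured above. Some vLLM versions expose a comparable batch-size-based disable setting (verify against your version's documentation); a server-agnostic sketch with hysteresis so the toggle does not flap as traffic oscillates around the crossover (class and thresholds are ours):

```python
class SpeculationGate:
    """Hysteresis gate: disable speculation above `high` in-flight
    requests, re-enable below `low`. Defaults reflect the measured
    crossover (~1.18x at concurrency 32, <1.0x at 64)."""

    def __init__(self, low: int = 24, high: int = 32):
        self.low, self.high = low, high
        self.enabled = True

    def update(self, in_flight: int) -> bool:
        # Disable once the batch is compute-bound; re-enable only
        # after load drops clearly below the crossover.
        if self.enabled and in_flight >= self.high:
            self.enabled = False
        elif not self.enabled and in_flight <= self.low:
            self.enabled = True
        return self.enabled
```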
Benchmark Results: Cost Efficiency Analysis
Since serving costs are determined by GPU time, "how many more requests can be processed with the same budget" is more important in production than simple latency improvement.
Cost Efficiency Comparison (A100 80GB x4, $13.04/hour basis)
| Technique | Hourly Throughput (requests) | Cost per Request ($) | Cost Reduction vs. Baseline |
|---|---|---|---|
| Baseline | 3,200 | $0.00408 | - |
| Vanilla SD | 5,440 | $0.00240 | 41% reduction |
| EAGLE-3 | 7,680 | $0.00170 | 58% reduction |
| Medusa-2 | 6,400 | $0.00204 | 50% reduction |
Note: The above figures are based on concurrency 8. In actual serving, numbers vary depending on traffic patterns, request size distribution, and SLO requirements.
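The per-request figures are simply hourly cluster cost divided by hourly throughput; a quick sanity check against the table (cluster price taken from the $13.04/hour figure above):

```python
GPU_COST_PER_HOUR = 13.04  # 4x A100 80GB SXM, as priced above

def cost_per_request(requests_per_hour: int) -> float:
    """Dollar cost per request at a given sustained throughput."""
    return round(GPU_COST_PER_HOUR / requests_per_hour, 5)

def cost_reduction_vs_baseline(requests_per_hour: int, baseline_rph: int = 3200) -> float:
    """Fractional cost reduction relative to the baseline row."""
    return round(1 - cost_per_request(requests_per_hour) / cost_per_request(baseline_rph), 2)
```

`cost_per_request(7680)` reproduces the EAGLE-3 row (~$0.0017, a 58% reduction vs. baseline).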
Draft Model Training/Maintenance Costs
| Technique | Initial Training Cost | Training Time | Retraining Needed on Target Model Update |
|---|---|---|---|
| Vanilla SD | $0 (uses existing model) | 0 | Not required |
| Medusa-2 | ~$50 (1x A100, 3 hours) | 3 hours | Required |
| EAGLE-1 | ~$100 (1x A100, 6 hours) | 6 hours | Required |
| EAGLE-3 | ~$200 (1x A100, 12 hours) | 12 hours | Required |
num_speculative_tokens Sensitivity Analysis
We measured performance changes based on num_speculative_tokens values using the EAGLE-3 + Llama 3.1 70B combination.
| num_speculative_tokens | Accept Ratio | Mean Accepted Length | TPOT P50 (ms) | E2E Speedup | GPU Memory Increase |
|---|---|---|---|---|---|
| 1 | 0.91 | 0.91 | 35.2 | 1.20x | +0.3 GB |
| 3 | 0.86 | 2.58 | 16.8 | 2.52x | +0.8 GB |
| 5 | 0.81 | 4.05 | 12.3 | 2.89x | +1.8 GB |
| 7 | 0.76 | 5.32 | 11.1 | 3.02x | +2.9 GB |
| 10 | 0.69 | 6.90 | 10.8 | 2.95x | +4.5 GB |
| 15 | 0.58 | 8.70 | 11.5 | 2.71x | +7.2 GB |
Conclusion: num_speculative_tokens=5-7 is the optimal range. Beyond 7, the falling accept ratio cancels out the gain from longer drafts, while GPU memory consumption keeps growing.
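The shape of this table follows from a first-order model: with accept ratio r defined as the mean fraction of the k drafted tokens accepted, each verification step yields r*k draft tokens plus one bonus token from the target's own forward pass, while drafting cost grows linearly in k. A simplified sketch (the draft-cost ratio is an assumed constant, and the model ignores the growing verification and memory cost, so use it for relative comparisons only):

```python
def tokens_per_step(accept_ratio: float, k: int) -> float:
    """Mean tokens produced per verification step: accepted
    draft tokens plus the bonus token the target model emits."""
    return accept_ratio * k + 1

def modeled_speedup(accept_ratio: float, k: int, draft_cost: float = 0.05) -> float:
    """Speedup vs. one token per target forward pass, charging
    `draft_cost` target-forward-equivalents per drafted token.
    draft_cost=0.05 is an assumption, not a measured value."""
    return round(tokens_per_step(accept_ratio, k) / (1 + draft_cost * k), 2)
```

For k=5 with r=0.81 this gives about 5.05 tokens per step; the absolute speedup it predicts is optimistic, since real verification is not free, but the diminishing return as r falls with k matches the measured trend.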
Reproducible Benchmark Execution
Full Benchmark Script
```bash
#!/bin/bash
# run_benchmark.sh - Full benchmark execution script
set -euo pipefail

MODEL="meta-llama/Llama-3.1-70B-Instruct"
DRAFT_MODELS=(
    "none"                                 # baseline
    "meta-llama/Llama-3.1-8B-Instruct"     # vanilla SD
    "eagle3-llama3.1-70b-instruct"         # EAGLE-3
)
SPEC_TOKENS=(0 5 5)
METHODS=("baseline" "vanilla" "eagle3")

RESULTS_DIR="benchmark_results/$(date +%Y%m%d_%H%M%S)"
mkdir -p "$RESULTS_DIR"

for i in "${!METHODS[@]}"; do
    method="${METHODS[$i]}"
    draft="${DRAFT_MODELS[$i]}"
    n_tokens="${SPEC_TOKENS[$i]}"
    echo "=== Running benchmark: $method ==="

    # Start server
    if [ "$method" = "baseline" ]; then
        python -m vllm.entrypoints.openai.api_server \
            --model "$MODEL" \
            --tensor-parallel-size 4 \
            --gpu-memory-utilization 0.92 \
            --port 8000 &
    elif [ "$method" = "eagle3" ]; then
        python -m vllm.entrypoints.openai.api_server \
            --model "$MODEL" \
            --speculative-model "$draft" \
            --speculative-method eagle \
            --num-speculative-tokens "$n_tokens" \
            --tensor-parallel-size 4 \
            --gpu-memory-utilization 0.92 \
            --port 8000 &
    else
        python -m vllm.entrypoints.openai.api_server \
            --model "$MODEL" \
            --speculative-model "$draft" \
            --num-speculative-tokens "$n_tokens" \
            --tensor-parallel-size 4 \
            --gpu-memory-utilization 0.92 \
            --port 8000 &
    fi
    SERVER_PID=$!

    # Wait for the server to come up (a 70B TP=4 load can take minutes)
    until curl -sf http://localhost:8000/health > /dev/null; do
        sleep 5
    done

    # Run benchmark
    python benchmark.py \
        --prompts eval_prompts.json \
        --model "$MODEL" \
        --warmup 10 \
        --runs 3 \
        --output "$RESULTS_DIR/${method}.json"

    # Stop server
    kill $SERVER_PID
    wait $SERVER_PID 2>/dev/null || true
    sleep 10
done

# Aggregate results (aggregate_results.py takes the directory positionally
# and prints the summary to stdout)
python aggregate_results.py "$RESULTS_DIR" > "$RESULTS_DIR/summary.json"
echo "Results saved to $RESULTS_DIR/summary.json"
```
Result Aggregation and Comparison
```python
# aggregate_results.py
import json
import sys
from pathlib import Path


def aggregate_results(results_dir: str) -> dict:
    """Read benchmark result files and generate a comparison table"""
    results_path = Path(results_dir)
    summary = {}
    for result_file in sorted(results_path.glob("*.json")):
        if result_file.name == "summary.json":
            continue
        method = result_file.stem
        data = json.loads(result_file.read_text())
        # Calculate P50, P95
        tpot_values = sorted(r["tpot_ms"] for r in data)
        e2e_values = sorted(r["e2e_ms"] for r in data)
        summary[method] = {
            "tpot_p50": round(tpot_values[len(tpot_values) // 2], 2),
            "tpot_p95": round(tpot_values[int(len(tpot_values) * 0.95)], 2),
            "e2e_p50": round(e2e_values[len(e2e_values) // 2], 2),
            "e2e_p95": round(e2e_values[int(len(e2e_values) * 0.95)], 2),
            "total_requests": len(data),
        }
    # Calculate speedup (relative to baseline)
    if "baseline" in summary:
        baseline_e2e = summary["baseline"]["e2e_p50"]
        for method in summary:
            summary[method]["speedup"] = round(
                baseline_e2e / summary[method]["e2e_p50"], 2
            )
    return summary


if __name__ == "__main__":
    results_dir = sys.argv[1] if len(sys.argv) > 1 else "benchmark_results/latest"
    summary = aggregate_results(results_dir)
    print(json.dumps(summary, indent=2))
```
Production Application Recommendations
Here are practical application guidelines based on the benchmark results.
Technique Selection Decision Tree
```
Q: Do you frequently swap the target model?
├─ Yes: Vanilla SD (no retraining needed)
│        or Prompt Lookup (no training needed)
└─ No: Q: Is there spare GPU memory?
        ├─ Yes (>10GB spare): EAGLE-3 (best performance)
        └─ No: Q: Do you have lightweight training infrastructure?
                ├─ Yes: Medusa-2 (memory efficient)
                └─ No: Vanilla SD (Llama 3.1 8B as draft)
```
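For config automation, the same tree can be encoded as a pure function (names and thresholds transcribed from the tree above; the return values are our own labels):

```python
def choose_method(
    frequent_model_swaps: bool,
    spare_gpu_memory_gb: float,
    has_training_infra: bool,
) -> str:
    """Encode the technique-selection decision tree above."""
    if frequent_model_swaps:
        return "vanilla_sd_or_prompt_lookup"  # no (re)training required
    if spare_gpu_memory_gb > 10:
        return "eagle3"                        # best performance
    if has_training_infra:
        return "medusa2"                       # memory efficient
    return "vanilla_sd"                        # Llama 3.1 8B as draft
```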
Recommended Settings by SLO
| SLO | Recommended Technique | num_speculative_tokens | Notes |
|---|---|---|---|
| TPOT P95 under 20ms | EAGLE-3 | 5 | Accept ratio over 0.8 expected |
| TPOT P95 under 35ms | Vanilla SD or Medusa | 5 | Low operational complexity |
| Maximum throughput | EAGLE-3, concurrency 8 or less | 7 | Concurrency limit mandatory |
| Minimum cost | EAGLE-3 | 5 | 58% cost reduction |
Benchmark Limitations
The following limitations should be recognized when interpreting these benchmark results.
- Hardware dependency: Results are based on A100 SXM. Speedup ratios differ on A10G or L4. Especially with PCIe connections without NVLink, TP communication overhead increases and speedup decreases.
- Prompt set bias: Korean/English mixed prompts were used. Services with a high proportion of pure code generation or mathematical reasoning may see different accept ratios.
- vLLM version dependency: vLLM's speculative decoding implementation is rapidly improving, so numbers may vary by version.
- KV cache impact: max_model_len was fixed at 4096. Raising it to 8192 or above changes concurrency handling capacity due to KV cache memory constraints.
- No quantization applied: This benchmark used bf16 precision. Applying speculative decoding to GPTQ/AWQ-quantized models requires a separate benchmark.
Quiz
Q1. Which technique recorded the highest E2E speedup in this benchmark, and what was the value?
Answer: EAGLE-3 recorded the highest speedup at 2.89x. In the structured output (JSON/YAML) category, it reached up to 3.21x.
Q2. Why is the speculative decoding speedup 0.95x at concurrency 64?
Answer: At high concurrency, the system is already compute-bound, so the memory-bound mitigation benefit of speculation disappears. The additional computation from the draft model and KV cache memory contention actually degrade performance.
Q3. Why does speedup decrease when num_speculative_tokens is increased to 15?
Answer: As the number of speculative tokens increases, the accept ratio drops sharply for later positions. Of 15 tokens, only an average of 8.7 are accepted, and the draft computation for the remaining 6.3 is wasted. This overhead offsets the benefit of additional accepted tokens.
Q4. Why is Prompt Lookup Decoding's accept ratio low at 0.41?
Answer: Prompt Lookup reuses N-grams from the input text, so matching probability is low for tasks with little lexical overlap between input and output (translation, creative writing, etc.). It is only effective for tasks like document summarization that directly reuse input words.
Q5. Why is EAGLE-3's additional GPU memory more than twice that of Medusa-2?
Answer: Medusa consists of lightweight heads made of a few linear layers, while EAGLE-3 includes transformer layers that process the target model's features, resulting in more parameters. However, this complex structure yields an accept ratio of 0.81, higher than Medusa's 0.68.
Q6. Why is the accept ratio high at 0.85 for translation tasks?
Answer: Translation is strongly conditioned on the source sentence's structure and vocabulary, making it easy for the draft model to predict the next token. Especially in Korean-English translation, the correspondence patterns of high-frequency expressions are clear, resulting in high prediction accuracy.
Q7. What should be changed first when reproducing this benchmark in your own environment?
Answer: The prompt set composition. You need to create a prompt set that reflects your service's actual traffic distribution (request types, input/output lengths, temperature distribution) for a meaningful benchmark. Results from a general-purpose prompt set can differ significantly from your environment.
Q8. What prerequisites are needed to achieve the 58% cost reduction?
Answer: Prerequisites include completed EAGLE-3 draft head training, operation at concurrency 8 or below, temperature settings close to 0, and spare GPU memory (+2GB or more). In actual production, concurrency fluctuates constantly with traffic, so dynamic speculative decoding on/off routing is essential.
References
- Fast Inference from Transformers via Speculative Decoding (arXiv:2211.17192)
- Medusa: Simple LLM Inference Acceleration Framework (arXiv:2401.10774)
- EAGLE: Speculative Sampling Requires Rethinking Feature Uncertainty (arXiv:2401.15077)
- EAGLE-3: Scaling up Inference Acceleration (arXiv:2503.01840)
- vLLM Speculative Decoding Documentation
- Speculators v0.3.0 - Training Support for vLLM
- vLLM Releases - GitHub