LLM Serving: Speculative Decoding Production Benchmark 2026

Benchmark Purpose and Scope

This document presents production benchmark results comparing speculative decoding techniques under the same hardware, the same prompt set, and the same measurement methodology as of early 2026. The goal is to provide real-world serving environment figures, not idealized conditions from academic papers.

Techniques compared:

  • Vanilla Speculative Decoding (independent draft model) - Leviathan et al., arXiv:2211.17192
  • Medusa (multiple decoding heads) - Cai et al., arXiv:2401.10774
  • EAGLE-1/EAGLE-3 (feature-level speculation) - Li et al., arXiv:2401.15077 / arXiv:2503.01840
  • Prompt Lookup Decoding (N-gram matching, no training required)

Techniques excluded and reasons:

  • Staged Speculative Decoding: Low practical adoption rate relative to implementation complexity
  • REST (Retrieval-based): Requires serving architecture changes due to external datastore dependency

Test Environment

Hardware

GPU: 4x NVIDIA A100 80GB SXM (NVLink connected)
CPU: AMD EPYC 7763 64-Core
RAM: 512GB DDR4
OS: Ubuntu 22.04 LTS
CUDA: 12.4
Driver: 550.90.07

Software Stack

vLLM: 0.7.3
PyTorch: 2.5.1
Transformers: 4.47.0
Python: 3.11.10

Target Models

| Model | Parameters | Tensor Parallel | Notes |
| --- | --- | --- | --- |
| Llama 3.1 70B Instruct | 70B | TP=4 | Primary benchmark model |
| Qwen 2.5 72B Instruct | 72B | TP=4 | Cross-validation |
| Mistral Large 2 (123B) | 123B | TP=4 | Large model verification |

Prompt Set Composition

The prompt set was designed to reflect actual production traffic distribution, not a single type.

# Prompt set composition (500 prompts)
prompt_distribution = {
    "short_qa": 150,          # 1-2 sentence questions, expected output 50-100 tokens
    "summarization": 100,      # 500-1000 word document summary, expected output 150-300 tokens
    "code_generation": 80,     # function/class generation, expected output 100-500 tokens
    "creative_writing": 50,    # stories/essays, expected output 300-800 tokens
    "translation": 70,         # KR-EN/EN-KR translation, expected output 100-300 tokens
    "structured_output": 50,   # JSON/YAML generation, expected output 50-200 tokens
}

Benchmark Execution Code

import json
import time
import statistics
from dataclasses import dataclass, asdict
from openai import OpenAI

@dataclass
class BenchmarkResult:
    method: str
    model: str
    draft_model: str
    num_spec_tokens: int
    prompt_category: str
    ttft_ms: float
    tpot_ms: float
    e2e_ms: float
    output_tokens: int
    accept_ratio: float
    gpu_memory_gb: float

def run_single_benchmark(
    client: OpenAI,
    model: str,
    prompt: str,
    max_tokens: int,
) -> dict:
    """Run benchmark for a single prompt"""
    start = time.perf_counter()
    first_token_time = None
    token_count = 0

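    stream = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": prompt}],
        max_tokens=max_tokens,
        temperature=0.0,  # greedy decoding for every method
        stream=True,
        # Request a final usage chunk so output tokens are counted exactly.
        stream_options={"include_usage": True},
    )

    chunk_count = 0
    for chunk in stream:
        # With speculative decoding a single chunk can carry several accepted
        # tokens, so counting chunks can undercount; prefer the reported usage.
        usage = getattr(chunk, "usage", None)
        if usage is not None:
            token_count = usage.completion_tokens
        if not chunk.choices:  # the usage-only chunk has no choices
            continue
        delta = chunk.choices[0].delta
        if delta.content:
            if first_token_time is None:
                first_token_time = time.perf_counter()
            chunk_count += 1

    end = time.perf_counter()
    if token_count == 0:
        token_count = chunk_count  # fallback when the server reports no usage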

    ttft = (first_token_time - start) if first_token_time else (end - start)
    e2e = end - start
    tpot = (end - first_token_time) / max(token_count - 1, 1) if first_token_time else e2e

    return {
        "ttft_ms": round(ttft * 1000, 2),
        "tpot_ms": round(tpot * 1000, 2),
        "e2e_ms": round(e2e * 1000, 2),
        "output_tokens": token_count,
    }

def run_full_benchmark(
    base_url: str,
    model: str,
    prompts: list[dict],
    warmup_runs: int = 5,
    measure_runs: int = 3,
) -> list[dict]:
    """Run benchmark across the full prompt set"""
    client = OpenAI(base_url=f"{base_url}/v1", api_key="benchmark")

    # Warmup
    for i in range(warmup_runs):
        run_single_benchmark(client, model, prompts[0]["text"], 64)

    results = []
    for run_idx in range(measure_runs):
        for prompt in prompts:
            result = run_single_benchmark(
                client, model, prompt["text"], prompt["max_tokens"]
            )
            result["category"] = prompt["category"]
            result["run"] = run_idx
            results.append(result)

    return results

Benchmark Results: Llama 3.1 70B Instruct

Overall Summary (500 prompts, 3-run average)

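| Technique | Draft Model | Accept Ratio | TPOT P50 (ms) | TPOT P95 (ms) | E2E Speedup | Additional GPU Memory |
| --- | --- | --- | --- | --- | --- | --- |
| Baseline (no spec) | - | - | 42.3 | 58.7 | 1.00x | 0 GB |
| Vanilla SD | Llama 3.1 8B | 0.62 | 19.8 | 31.2 | 1.95x | ~16 GB |
| Medusa-2 | 5 heads | 0.68 | 17.1 | 27.8 | 2.21x | ~0.8 GB |
| EAGLE-1 | EAGLE head | 0.73 | 14.9 | 24.1 | 2.52x | ~1.5 GB |
| EAGLE-3 | EAGLE-3 head | 0.81 | 12.3 | 19.6 | 2.89x | ~1.8 GB |
| Prompt Lookup | N-gram (n=3) | 0.41 | 28.5 | 45.3 | 1.38x | 0 GB |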

Per-Category Detailed Results (EAGLE-3)

Performance varies significantly by prompt type, so category-level results should be reviewed separately.

| Category | Accept Ratio | E2E Speedup | TPOT P50 (ms) | Notes |
| --- | --- | --- | --- | --- |
| short_qa | 0.84 | 2.95x | 11.8 | Short output but high prediction accuracy |
| summarization | 0.83 | 3.12x | 11.5 | Second-highest speedup |
| code_generation | 0.72 | 2.31x | 15.2 | High syntax prediction but drops on logic |
| creative_writing | 0.76 | 2.67x | 13.4 | Surprisingly high accept ratio |
| translation | 0.85 | 3.05x | 11.9 | Translation is highly dependent on input, making prediction easier |
| structured_output | 0.88 | 3.21x | 10.9 | Highest speedup; structural patterns of JSON/YAML favor prediction |

Accept Ratio Trend by Temperature

# Accept ratio measurements by temperature (EAGLE-3, Llama 3.1 70B)
temperature_results = {
    0.0: {"accept_ratio": 0.81, "speedup": 2.89},
    0.3: {"accept_ratio": 0.76, "speedup": 2.61},
    0.5: {"accept_ratio": 0.71, "speedup": 2.38},
    0.7: {"accept_ratio": 0.64, "speedup": 2.08},
    1.0: {"accept_ratio": 0.55, "speedup": 1.72},
    1.2: {"accept_ratio": 0.47, "speedup": 1.41},
    1.5: {"accept_ratio": 0.38, "speedup": 1.15},  # virtually no effect
}

# Conclusion: disable speculative decoding when temperature > 1.0
TEMP_THRESHOLD = 1.0
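
The measurements above can drive a simple per-request gate. The following is a minimal sketch, assuming that linear interpolation between the measured temperatures is an acceptable approximation; `expected_speedup` and `use_speculation` are hypothetical helper names that are not part of the benchmark code.

```python
# Measured (temperature, E2E speedup) pairs for EAGLE-3 on Llama 3.1 70B,
# copied from the results above.
MEASURED = [(0.0, 2.89), (0.3, 2.61), (0.5, 2.38), (0.7, 2.08),
            (1.0, 1.72), (1.2, 1.41), (1.5, 1.15)]
TEMP_THRESHOLD = 1.0


def expected_speedup(temperature: float) -> float:
    """Linearly interpolate the measured speedup; clamp outside the measured range."""
    if temperature <= MEASURED[0][0]:
        return MEASURED[0][1]
    if temperature >= MEASURED[-1][0]:
        return MEASURED[-1][1]
    for (t0, s0), (t1, s1) in zip(MEASURED, MEASURED[1:]):
        if t0 <= temperature <= t1:
            return s0 + (temperature - t0) / (t1 - t0) * (s1 - s0)
    raise ValueError("temperature must be a finite number")


def use_speculation(temperature: float) -> bool:
    """Apply the rule stated above: disable speculation when temperature > 1.0."""
    return temperature <= TEMP_THRESHOLD


if __name__ == "__main__":
    for t in (0.0, 0.4, 1.1):
        print(t, round(expected_speedup(t), 2), use_speculation(t))
```

The interpolated value is only an estimate for temperatures that were not benchmarked; the on/off decision itself uses nothing but the stated threshold.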

Benchmark Results: Concurrency Impact

In actual production, multiple concurrent requests must be handled rather than single requests. We measured how speculative decoding effectiveness changes as concurrency increases.

Performance by Concurrent Request Count (EAGLE-3, Llama 3.1 70B)

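| Concurrency | Throughput (tokens/s) | E2E Speedup | Accept Ratio | Notes |
| --- | --- | --- | --- | --- |
| 1 | 81.3 | 2.89x | 0.81 | Optimal conditions |
| 2 | 156.2 | 2.71x | 0.80 | Nearly maintained |
| 4 | 289.5 | 2.43x | 0.79 | Slight decrease |
| 8 | 498.7 | 1.98x | 0.78 | Decline begins |
| 16 | 721.3 | 1.52x | 0.77 | Notable decrease |
| 32 | 890.4 | 1.18x | 0.76 | Minimal effect |
| 64 | 965.1 | 0.95x | 0.75 | Negative impact occurs |
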
# Benchmark execution code by concurrency level
import asyncio
import statistics  # needed for statistics.mean in the summary below
import time

import aiohttp

async def concurrent_benchmark(
    base_url: str,
    model: str,
    prompts: list[dict],
    concurrency: int,
) -> dict:
    """Benchmark by concurrent request count"""
    semaphore = asyncio.Semaphore(concurrency)
    results = []

    async def single_request(session, prompt):
        async with semaphore:
            start = time.perf_counter()
            payload = {
                "model": model,
                "messages": [{"role": "user", "content": prompt["text"]}],
                "max_tokens": prompt["max_tokens"],
                "temperature": 0.0,
                "stream": False,
            }
            async with session.post(
                f"{base_url}/v1/chat/completions",
                json=payload,
            ) as resp:
                data = await resp.json()
                end = time.perf_counter()
                tokens = data["usage"]["completion_tokens"]
                return {
                    "e2e_ms": (end - start) * 1000,
                    "output_tokens": tokens,
                    "tokens_per_sec": tokens / (end - start),
                }

    async with aiohttp.ClientSession() as session:
        tasks = [single_request(session, p) for p in prompts]
        total_start = time.perf_counter()
        results = await asyncio.gather(*tasks)
        total_elapsed = time.perf_counter() - total_start

    total_tokens = sum(r["output_tokens"] for r in results)
    return {
        "concurrency": concurrency,
        "total_tokens": total_tokens,
        "total_elapsed_sec": round(total_elapsed, 2),
        "throughput_tokens_per_sec": round(total_tokens / total_elapsed, 1),
        "avg_e2e_ms": round(statistics.mean(r["e2e_ms"] for r in results), 1),
        "p95_e2e_ms": round(
            sorted(r["e2e_ms"] for r in results)[int(len(results) * 0.95)], 1
        ),
    }

Key observation: By concurrency 32 the benefit of speculative decoding has largely disappeared (1.18x), and at concurrency 64 it becomes a net loss (0.95x relative to the non-speculative baseline). At high concurrency, batched decoding is already compute-bound, so the draft model's extra computation and the added KV cache memory contention outweigh the tokens saved.
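
One way to act on this observation is to gate speculation on the number of in-flight requests. The sketch below assumes a serving layer that can route each request to a speculative or a non-speculative replica; `conservative_speedup`, `should_speculate`, and the `min_speedup` default of 1.2 are illustrative assumptions, not measured thresholds.

```python
import bisect

# (concurrency, E2E speedup) measured with EAGLE-3 on Llama 3.1 70B (table above).
MEASURED = [(1, 2.89), (2, 2.71), (4, 2.43), (8, 1.98),
            (16, 1.52), (32, 1.18), (64, 0.95)]
LEVELS = [c for c, _ in MEASURED]


def conservative_speedup(in_flight: int) -> float:
    """Speedup measured at the smallest benchmarked concurrency >= in_flight."""
    idx = bisect.bisect_left(LEVELS, in_flight)
    idx = min(idx, len(MEASURED) - 1)  # above 64 only the last measurement is known
    return MEASURED[idx][1]


def should_speculate(in_flight: int, min_speedup: float = 1.2) -> bool:
    """Route to the speculative path only while the expected gain clears the bar."""
    return conservative_speedup(in_flight) >= min_speedup


if __name__ == "__main__":
    for n in (1, 10, 20, 100):
        print(n, conservative_speedup(n), should_speculate(n))
```

Rounding up to the next benchmarked level is deliberately pessimistic: an unmeasured load of 10 requests is treated like 16 rather than like 8.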

Benchmark Results: Cost Efficiency Analysis

Since serving costs are determined by GPU time, "how many more requests can be processed with the same budget" is more important in production than simple latency improvement.

Cost Efficiency Comparison (A100 80GB x4, $13.04/hour basis)

| Technique | Hourly Throughput (requests) | Cost per Request ($) | Cost Reduction vs. Baseline |
| --- | --- | --- | --- |
| Baseline | 3,200 | $0.00408 | - |
| Vanilla SD | 5,440 | $0.00240 | 41% reduction |
| EAGLE-3 | 7,680 | $0.00170 | 58% reduction |
| Medusa-2 | 6,400 | $0.00204 | 50% reduction |

Note: The above figures are based on concurrency 8. In actual serving, numbers vary depending on traffic patterns, request size distribution, and SLO requirements.
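
The per-request figures follow directly from the hourly price and the hourly throughput. A short sketch of that arithmetic, using only the numbers in the table above (the function names are illustrative):

```python
GPU_NODE_USD_PER_HOUR = 13.04  # 4x A100 80GB, as stated above

# Requests served per hour at concurrency 8, from the table above.
REQUESTS_PER_HOUR = {
    "Baseline": 3200,
    "Vanilla SD": 5440,
    "EAGLE-3": 7680,
    "Medusa-2": 6400,
}


def cost_per_request(technique: str) -> float:
    """Hourly node price divided by the requests served in that hour."""
    return GPU_NODE_USD_PER_HOUR / REQUESTS_PER_HOUR[technique]


def cost_reduction(technique: str, baseline: str = "Baseline") -> float:
    """Fractional saving per request; the hourly price cancels out of the ratio."""
    return 1 - REQUESTS_PER_HOUR[baseline] / REQUESTS_PER_HOUR[technique]


if __name__ == "__main__":
    for name in REQUESTS_PER_HOUR:
        print(f"{name}: ${cost_per_request(name):.5f}, "
              f"{cost_reduction(name):.0%} reduction")
```

Because the node price cancels out, the reduction depends only on the throughput ratio, so it carries over to other GPU pricing as long as the throughput ratio holds.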

Draft Model Training/Maintenance Costs

| Technique | Initial Training Cost | Training Time | Retraining Needed on Target Model Update |
| --- | --- | --- | --- |
| Vanilla SD | $0 (uses existing model) | 0 | Not required |
| Medusa-2 | ~$50 (1x A100, 3 hours) | 3 hours | Required |
| EAGLE-1 | ~$100 (1x A100, 6 hours) | 6 hours | Required |
| EAGLE-3 | ~$200 (1x A100, 12 hours) | 12 hours | Required |
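
To put the training cost in context, it can be compared with the per-request saving from the cost efficiency table. This is a rough sketch, assuming the training cost is a one-off expense and that traffic matches the concurrency 8 conditions of that table; `break_even_requests` is a hypothetical helper.

```python
import math


def break_even_requests(
    training_cost_usd: float,
    baseline_cost_per_request: float,
    spec_cost_per_request: float,
) -> int:
    """Number of requests after which the per-request saving repays training."""
    saving = baseline_cost_per_request - spec_cost_per_request
    if saving <= 0:
        raise ValueError("speculative decoding must be cheaper than the baseline")
    return math.ceil(training_cost_usd / saving)


if __name__ == "__main__":
    # Costs per request from the cost efficiency table; training costs from the table above.
    eagle3 = break_even_requests(200, 0.00408, 0.00170)
    medusa2 = break_even_requests(50, 0.00408, 0.00204)
    print("EAGLE-3 break-even requests:", eagle3)
    print("EAGLE-3 break-even hours at 7,680 requests/hour:", round(eagle3 / 7680, 1))
    print("Medusa-2 break-even requests:", medusa2)
```

Under these assumptions the training cost is recovered within hours of steady traffic, so the recurring cost that matters more is the retraining needed whenever the target model changes.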

num_speculative_tokens Sensitivity Analysis

We measured performance changes based on num_speculative_tokens values using the EAGLE-3 + Llama 3.1 70B combination.

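| num_speculative_tokens | Accept Ratio | Mean Accepted Length | TPOT P50 (ms) | E2E Speedup | GPU Memory Increase |
| --- | --- | --- | --- | --- | --- |
| 1 | 0.91 | 0.91 | 35.2 | 1.20x | +0.3 GB |
| 3 | 0.86 | 2.58 | 16.8 | 2.52x | +0.8 GB |
| 5 | 0.81 | 4.05 | 12.3 | 2.89x | +1.8 GB |
| 7 | 0.76 | 5.32 | 11.1 | 3.02x | +2.9 GB |
| 10 | 0.69 | 6.90 | 10.8 | 2.95x | +4.5 GB |
| 15 | 0.58 | 8.70 | 11.5 | 2.71x | +7.2 GB |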

Conclusion: num_speculative_tokens between 5 and 7 is the optimal range. Beyond 7, the falling accept ratio cancels out the gain from longer drafts (E2E speedup drops from 3.02x at 7 to 2.95x at 10 and 2.71x at 15), while GPU memory consumption keeps growing.
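
In the table above, the mean accepted length equals num_speculative_tokens multiplied by the accept ratio, which makes the wasted draft work easy to quantify. The sketch below reuses the measured rows; `best_setting` and its memory budget parameter are illustrative assumptions about how one might pick a value, not part of the benchmark.

```python
# (num_speculative_tokens, accept ratio, E2E speedup, extra GPU memory in GB),
# copied from the sensitivity table above.
SWEEP = [
    (1, 0.91, 1.20, 0.3),
    (3, 0.86, 2.52, 0.8),
    (5, 0.81, 2.89, 1.8),
    (7, 0.76, 3.02, 2.9),
    (10, 0.69, 2.95, 4.5),
    (15, 0.58, 2.71, 7.2),
]


def mean_accepted_length(k: int, accept_ratio: float) -> float:
    """Average number of draft tokens accepted per speculation step."""
    return k * accept_ratio


def wasted_draft_tokens(k: int, accept_ratio: float) -> float:
    """Average number of drafted tokens that are rejected per step."""
    return k - mean_accepted_length(k, accept_ratio)


def best_setting(max_extra_memory_gb: float) -> int:
    """Pick the measured setting with the highest speedup within a memory budget."""
    candidates = [row for row in SWEEP if row[3] <= max_extra_memory_gb]
    if not candidates:
        raise ValueError("no measured setting fits the memory budget")
    return max(candidates, key=lambda row: row[2])[0]


if __name__ == "__main__":
    for k, ratio, speedup, _ in SWEEP:
        print(k, round(mean_accepted_length(k, ratio), 2),
              round(wasted_draft_tokens(k, ratio), 2), speedup)
    print("best within 2 GB:", best_setting(2.0))
```

At 15 speculative tokens the wasted share reaches 6.3 tokens per step, which is consistent with the speedup falling below the value measured at 7.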

Reproducible Benchmark Execution

Full Benchmark Script

#!/bin/bash
# run_benchmark.sh - Full benchmark execution script
set -euo pipefail

MODEL="meta-llama/Llama-3.1-70B-Instruct"
DRAFT_MODELS=(
    "none"                                    # baseline
    "meta-llama/Llama-3.1-8B-Instruct"       # vanilla SD
    "eagle3-llama3.1-70b-instruct"            # EAGLE-3
)
SPEC_TOKENS=(0 5 5)
METHODS=("baseline" "vanilla" "eagle3")

RESULTS_DIR="benchmark_results/$(date +%Y%m%d_%H%M%S)"
mkdir -p "$RESULTS_DIR"

for i in "${!METHODS[@]}"; do
    method="${METHODS[$i]}"
    draft="${DRAFT_MODELS[$i]}"
    n_tokens="${SPEC_TOKENS[$i]}"

    echo "=== Running benchmark: $method ==="

    # Start server
    if [ "$method" = "baseline" ]; then
        python -m vllm.entrypoints.openai.api_server \
            --model "$MODEL" \
            --tensor-parallel-size 4 \
            --gpu-memory-utilization 0.92 \
            --port 8000 &
    elif [ "$method" = "eagle3" ]; then
        python -m vllm.entrypoints.openai.api_server \
            --model "$MODEL" \
            --speculative-model "$draft" \
            --speculative-method eagle \
            --num-speculative-tokens "$n_tokens" \
            --tensor-parallel-size 4 \
            --gpu-memory-utilization 0.92 \
            --port 8000 &
    else
        python -m vllm.entrypoints.openai.api_server \
            --model "$MODEL" \
            --speculative-model "$draft" \
            --num-speculative-tokens "$n_tokens" \
            --tensor-parallel-size 4 \
            --gpu-memory-utilization 0.92 \
            --port 8000 &
    fi

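    SERVER_PID=$!
    # Wait for model loading: poll the server's /health endpoint instead of a fixed sleep,
    # since loading a 70B model can take longer than 60 seconds
    until curl -sf "http://localhost:8000/health" > /dev/null; do
        kill -0 "$SERVER_PID" 2>/dev/null || { echo "vLLM server exited during startup" >&2; exit 1; }
        sleep 5
    done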

    # Run benchmark
    python benchmark.py \
        --prompts eval_prompts.json \
        --model "$MODEL" \
        --warmup 10 \
        --runs 3 \
        --output "$RESULTS_DIR/${method}.json"

    # Stop server
    kill $SERVER_PID
    wait $SERVER_PID 2>/dev/null || true
    sleep 10
done

# Aggregate results
# aggregate_results.py takes the results directory as its first positional argument and prints JSON to stdout
python aggregate_results.py "$RESULTS_DIR" > "$RESULTS_DIR/summary.json"
echo "Results saved to $RESULTS_DIR/summary.json"

Result Aggregation and Comparison

# aggregate_results.py
import json
import sys
from pathlib import Path

def aggregate_results(results_dir: str) -> dict:
    """Read benchmark result files and generate comparison table"""
    results_path = Path(results_dir)
    summary = {}

    for result_file in sorted(results_path.glob("*.json")):
        if result_file.name == "summary.json":
            continue
        method = result_file.stem
        with open(result_file) as f:  # close the file handle deterministically
            data = json.load(f)

        # Calculate P50, P95
        tpot_values = [r["tpot_ms"] for r in data]
        e2e_values = [r["e2e_ms"] for r in data]

        summary[method] = {
            "tpot_p50": round(sorted(tpot_values)[len(tpot_values) // 2], 2),
            "tpot_p95": round(sorted(tpot_values)[int(len(tpot_values) * 0.95)], 2),
            "e2e_p50": round(sorted(e2e_values)[len(e2e_values) // 2], 2),
            "e2e_p95": round(sorted(e2e_values)[int(len(e2e_values) * 0.95)], 2),
            "total_requests": len(data),
        }

    # Calculate speedup (relative to baseline)
    if "baseline" in summary:
        baseline_e2e = summary["baseline"]["e2e_p50"]
        for method in summary:
            summary[method]["speedup"] = round(
                baseline_e2e / summary[method]["e2e_p50"], 2
            )

    return summary

if __name__ == "__main__":
    results_dir = sys.argv[1] if len(sys.argv) > 1 else "benchmark_results/latest"
    summary = aggregate_results(results_dir)
    print(json.dumps(summary, indent=2))
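
The aggregation script picks percentiles by indexing the sorted list at `int(len(values) * 0.95)`, which can land one element above the nearest-rank P95. As a point of comparison, here is a small sketch of the nearest-rank definition; `percentile_nearest_rank` is a hypothetical helper, and which percentile definition to use is a choice for the reader rather than something the benchmark prescribes.

```python
import math


def percentile_nearest_rank(values: list[float], p: float) -> float:
    """Nearest-rank percentile: the smallest value with at least p% of data at or below it."""
    if not values:
        raise ValueError("values must not be empty")
    if not 0 < p <= 100:
        raise ValueError("p must be in (0, 100]")
    ordered = sorted(values)
    rank = math.ceil(p * len(ordered) / 100)  # 1-based rank
    return ordered[rank - 1]


if __name__ == "__main__":
    data = list(range(1, 101))  # 1..100
    print("nearest-rank P95:", percentile_nearest_rank(data, 95))
    print("index-based P95: ", sorted(data)[int(len(data) * 0.95)])
```

With 1,500 samples per method the two definitions differ by a single rank, so the effect on the reported figures is small, but a consistent definition matters when comparing against numbers from other tools.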

Production Application Recommendations

Here are practical application guidelines based on the benchmark results.

Technique Selection Decision Tree

Q: Do you frequently swap the target model?
├─ Yes: Vanilla SD (no retraining needed)
│       or Prompt Lookup (no training needed)
└─ No: Q: Is there spare GPU memory?
       ├─ Yes (>10GB spare): EAGLE-3 (best performance)
       └─ No:  Q: Do you have lightweight training infrastructure?
              ├─ Yes: Medusa-2 (memory efficient)
              └─ No:  Vanilla SD (Llama 3.1 8B as draft)
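
The decision tree above can be expressed as a small function, which is convenient when the choice is made per deployment by configuration tooling. This is a direct transcription of the tree under the same thresholds; the function and parameter names are illustrative.

```python
def choose_technique(
    swaps_target_often: bool,
    spare_gpu_memory_gb: float,
    has_training_infra: bool,
) -> str:
    """Walk the decision tree above from the root to a leaf."""
    if swaps_target_often:
        # No retraining (Vanilla SD) or no training at all (Prompt Lookup).
        return "Vanilla SD or Prompt Lookup"
    if spare_gpu_memory_gb > 10:
        return "EAGLE-3"
    if has_training_infra:
        return "Medusa-2"
    return "Vanilla SD"


if __name__ == "__main__":
    print(choose_technique(False, 12.0, True))
    print(choose_technique(False, 4.0, True))
    print(choose_technique(True, 4.0, False))
```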

Recommended Settings by SLO

| SLO | Recommended Technique | num_speculative_tokens | Notes |
| --- | --- | --- | --- |
| TPOT P95 under 20ms | EAGLE-3 | 5 | Accept ratio over 0.8 expected |
| TPOT P95 under 35ms | Vanilla SD or Medusa | 5 | Low operational complexity |
| Maximum throughput | EAGLE-3, concurrency 8 or less | 7 | Concurrency limit mandatory |
| Minimum cost | EAGLE-3 | 5 | 58% cost reduction |
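
The latency rows of these recommendations can be checked against the measured TPOT P95 values from the overall summary. The sketch below filters techniques by a latency budget; `techniques_meeting_slo` is a hypothetical helper, and it assumes the summary figures (measured with greedy decoding in this benchmark) are representative of the target workload.

```python
# technique -> (TPOT P95 in ms, additional GPU memory in GB), from the overall summary table.
SUMMARY = {
    "Vanilla SD": (31.2, 16.0),
    "Medusa-2": (27.8, 0.8),
    "EAGLE-1": (24.1, 1.5),
    "EAGLE-3": (19.6, 1.8),
    "Prompt Lookup": (45.3, 0.0),
}


def techniques_meeting_slo(tpot_p95_budget_ms: float) -> list[str]:
    """Techniques whose measured TPOT P95 is under the budget, fastest first."""
    passing = [
        (name, p95) for name, (p95, _mem) in SUMMARY.items()
        if p95 < tpot_p95_budget_ms
    ]
    return [name for name, _ in sorted(passing, key=lambda item: item[1])]


if __name__ == "__main__":
    print(techniques_meeting_slo(20))
    print(techniques_meeting_slo(35))
```

Several techniques clear the 35 ms budget, so the choice among them comes down to memory headroom and operational complexity, as the notes column indicates.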

Benchmark Limitations

The following limitations should be recognized when interpreting these benchmark results.

  1. Hardware dependency: Results are based on A100 SXM. Speedup ratios differ on A10G or L4. Especially with PCIe connections without NVLink, TP communication overhead increases and speedup decreases.
  2. Prompt set bias: Korean/English mixed prompts were used. Services with a high proportion of pure code generation or mathematical reasoning may see different accept ratios.
  3. vLLM version dependency: vLLM's speculative decoding implementation is rapidly improving, so numbers may vary by version.
  4. KV cache impact: max_model_len was fixed at 4096. Increasing to 8192 or above changes concurrency handling capacity due to KV cache memory constraints.
  5. No quantization applied: This benchmark is based on bf16 precision. Applying speculative decoding to GPTQ/AWQ quantized models requires a separate benchmark.
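
To make the KV cache limitation concrete, the per-sequence cache size can be estimated from the model shape. This is a back-of-the-envelope sketch, assuming the published Llama 3.1 70B configuration (80 layers, 8 KV heads with grouped-query attention, head dimension 128) and bf16 values; it gives the total across all tensor-parallel ranks and ignores paged-allocation overhead and the draft head's own cache.

```python
def kv_cache_bytes_per_token(
    num_layers: int,
    num_kv_heads: int,
    head_dim: int,
    bytes_per_value: int,
) -> int:
    """Bytes stored per token: one key and one value vector per layer and KV head."""
    return 2 * num_layers * num_kv_heads * head_dim * bytes_per_value


# Assumed Llama 3.1 70B shape; bytes_per_value=2 corresponds to bf16.
LLAMA_3_1_70B = dict(num_layers=80, num_kv_heads=8, head_dim=128, bytes_per_value=2)


def kv_cache_gib(seq_len: int) -> float:
    """KV cache size in GiB for one sequence of seq_len tokens."""
    return kv_cache_bytes_per_token(**LLAMA_3_1_70B) * seq_len / 2**30


if __name__ == "__main__":
    for seq_len in (4096, 8192):
        print(f"{seq_len} tokens: {kv_cache_gib(seq_len):.2f} GiB per sequence")
```

Doubling max_model_len doubles the worst-case cache per sequence, so the number of full-length sequences that fit in the same memory is roughly halved.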

Quiz

Q1. Which technique recorded the highest E2E speedup in this benchmark, and what was the value?

Answer: EAGLE-3 recorded the highest speedup at 2.89x. In the structured output (JSON/YAML) category, it reached up to 3.21x.

Q2. Why is the speculative decoding speedup 0.95x at concurrency 64?

Answer: At high concurrency, the system is already compute-bound, so the memory-bound mitigation benefit of speculation disappears. The additional computation from the draft model and KV cache memory contention actually degrade performance.

Q3. Why does speedup decrease when num_speculative_tokens is increased to 15?

Answer: As the number of speculative tokens increases, the accept ratio drops sharply for later positions. Of 15 tokens, only an average of 8.7 are accepted, and the draft computation for the remaining 6.3 is wasted. This overhead offsets the benefit of additional accepted tokens.

Q4. Why is Prompt Lookup Decoding's accept ratio low at 0.41?

Answer: Prompt Lookup reuses N-grams from the input text, so matching probability is low for tasks with little lexical overlap between input and output (translation, creative writing, etc.). It is only effective for tasks like document summarization that directly reuse input words.

Q5. Why is EAGLE-3's additional GPU memory more than twice that of Medusa-2?

Answer: Medusa consists of lightweight heads made of a few linear layers, while EAGLE-3 includes transformer layers that process the target model's features, resulting in more parameters. However, this complex structure yields an accept ratio of 0.81, higher than Medusa's 0.68.

Q6. Why is the accept ratio high at 0.85 for translation tasks?

Answer: Translation is strongly conditioned on the source sentence's structure and vocabulary, making it easy for the draft model to predict the next token. Especially in Korean-English translation, the correspondence patterns of high-frequency expressions are clear, resulting in high prediction accuracy.

Q7. What should be changed first when reproducing this benchmark in your own environment?

Answer: The prompt set composition. You need to create a prompt set that reflects your service's actual traffic distribution (request types, input/output lengths, temperature distribution) for a meaningful benchmark. Results from a general-purpose prompt set can differ significantly from your environment.

Q8. What prerequisites are needed to achieve the 58% cost reduction?

Answer: Prerequisites include completed EAGLE-3 draft head training, operation at concurrency 8 or below, temperature settings close to 0, and spare GPU memory (+2GB or more). In actual production, concurrency fluctuates constantly with traffic, so dynamic speculative decoding on/off routing is essential.
