LLM Serving: Speculative Decoding Production Benchmark 2026

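Benchmark Purpose and Scope
Test Environment
Benchmark Results: Llama 3.1 70B Instruct
Benchmark Results: Impact of Concurrency
- Performance by Concurrency (EAGLE-3, Llama 3.1 70B)
Benchmark Results: Cost-Efficiency Analysis
- Cost-Efficiency Comparison (4x A100 80GB, $13.04 per hour)
- Training and Maintenance Cost by Draft Model
num_speculative_tokens Sensitivity Analysis
Running a Reproducible Benchmark
- Full Benchmark Script
- Aggregating and Comparing Results
Production Recommendations
- Decision Tree for Choosing a Technique
- Recommended Settings by SLO
Limitations of the Benchmark
Quiz
References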

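Benchmark Purpose and Scope

This document presents production benchmark results, as of early 2026, that compare speculative decoding techniques on the same hardware, the same prompt set, and the same measurement methodology. The goal is to report numbers from a realistic serving environment rather than from the idealized conditions of academic papers.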

Techniques compared:

Vanilla Speculative Decoding (independent draft model) - Leviathan et al., arXiv:2211.17192
Medusa (multiple decoding heads) - Cai et al., arXiv:2401.10774
EAGLE-1/EAGLE-3 (feature-level speculation) - Li et al., arXiv:2401.15077 / arXiv:2503.01840
Prompt Lookup Decoding (N-gram matching, no training required)

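Techniques excluded, and why:

Staged Speculative Decoding: low practical adoption relative to its implementation complexity
REST (Retrieval-based): depends on an external datastore, which requires changes to the serving architecture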

Test Environment

Hardware

GPU: 4x NVIDIA A100 80GB SXM (NVLink-connected)
CPU: AMD EPYC 7763 64-Core
RAM: 512GB DDR4
OS: Ubuntu 22.04 LTS
CUDA: 12.4
Driver: 550.90.07

Software Stack

vLLM: 0.7.3
PyTorch: 2.5.1
Transformers: 4.47.0
Python: 3.11.10

Target Models

Model	Parameters	Tensor Parallel	Notes
Llama 3.1 70B Instruct	70B	TP=4	Primary benchmark model
Qwen 2.5 72B Instruct	72B	TP=4	Cross-validation
Mistral Large 2 (123B)	123B	TP=4	Large-model check

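Prompt Set Composition

Rather than a single prompt type, the prompt set was built to reflect the distribution of real production traffic.

# Prompt set composition (500 prompts)
prompt_distribution = {
    "short_qa": 150,          # 1-2 sentence questions, expected output 50-100 tokens
    "summarization": 100,      # summaries of 500-1000 word documents, expected output 150-300 tokens
    "code_generation": 80,     # function/class generation, expected output 100-500 tokens
    "creative_writing": 50,    # stories/essays, expected output 300-800 tokens
    "translation": 70,         # Korean-English / English-Korean translation, expected output 100-300 tokens
    "structured_output": 50,   # JSON/YAML generation, expected output 50-200 tokens
}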

Benchmark Code

import json
import time
import statistics
from dataclasses import dataclass, asdict
from openai import OpenAI

@dataclass
class BenchmarkResult:
    method: str
    model: str
    draft_model: str
    num_spec_tokens: int
    prompt_category: str
    ttft_ms: float
    tpot_ms: float
    e2e_ms: float
    output_tokens: int
    accept_ratio: float
    gpu_memory_gb: float

def run_single_benchmark(
    client: OpenAI,
    model: str,
    prompt: str,
    max_tokens: int,
) -> dict:
    """Run the benchmark for a single prompt and return latency metrics."""
    start = time.perf_counter()
    first_token_time = None
    chunk_count = 0
    usage_tokens = None

    stream = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": prompt}],
        max_tokens=max_tokens,
        temperature=0.0,  # greedy decoding for every method
        stream=True,
        # Request a final usage chunk so the token count is exact.
        stream_options={"include_usage": True},
    )

    for chunk in stream:
        if chunk.usage is not None:
            usage_tokens = chunk.usage.completion_tokens
        # The usage-only chunk arrives with an empty `choices` list.
        if not chunk.choices:
            continue
        delta = chunk.choices[0].delta
        if delta.content:
            if first_token_time is None:
                first_token_time = time.perf_counter()
            chunk_count += 1

    end = time.perf_counter()

    # Prefer the server-reported count: with speculative decoding a streamed
    # chunk may carry several accepted tokens, so counting chunks can undercount.
    token_count = usage_tokens if usage_tokens is not None else chunk_count

    # TTFT: request start to first content chunk. TPOT: decode time per token
    # after the first one. Fall back to E2E if no content was streamed.
    ttft = (first_token_time - start) if first_token_time else (end - start)
    e2e = end - start
    tpot = (end - first_token_time) / max(token_count - 1, 1) if first_token_time else e2e

    return {
        "ttft_ms": round(ttft * 1000, 2),
        "tpot_ms": round(tpot * 1000, 2),
        "e2e_ms": round(e2e * 1000, 2),
        "output_tokens": token_count,
    }

def run_full_benchmark(
    base_url: str,
    model: str,
    prompts: list[dict],
    warmup_runs: int = 5,
    measure_runs: int = 3,
) -> list[dict]:
    """Run the benchmark over the full prompt set."""
    client = OpenAI(base_url=f"{base_url}/v1", api_key="benchmark")

    # Warmup: results are discarded
    for _ in range(warmup_runs):
        run_single_benchmark(client, model, prompts[0]["text"], 64)

    results = []
    for run_idx in range(measure_runs):
        for prompt in prompts:
            result = run_single_benchmark(
                client, model, prompt["text"], prompt["max_tokens"]
            )
            result["category"] = prompt["category"]
            result["run"] = run_idx
            results.append(result)

    return results
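The per-request rows returned by run_full_benchmark can be reduced to per-category percentiles before comparing methods. The following is a minimal sketch, not part of the original harness: the helper names `percentile` and `summarize_by_category` are hypothetical, and the percentile rule simply mirrors the `sorted(values)[int(n * p)]` indexing used elsewhere in this document.

```python
from collections import defaultdict


def percentile(values, pct):
    """Index-based percentile: sorted(values)[int(n * pct)], clamped to the last item."""
    if not values:
        raise ValueError("values must not be empty")
    ordered = sorted(values)
    index = min(int(len(ordered) * pct), len(ordered) - 1)
    return ordered[index]


def summarize_by_category(results):
    """Group result rows by their 'category' key and report TPOT P50/P95 per group."""
    grouped = defaultdict(list)
    for row in results:
        grouped[row["category"]].append(row["tpot_ms"])
    return {
        category: {
            "n": len(tpots),
            "tpot_p50": percentile(tpots, 0.50),
            "tpot_p95": percentile(tpots, 0.95),
        }
        for category, tpots in grouped.items()
    }


# Illustrative rows only; real rows come from run_full_benchmark.
sample = [
    {"category": "short_qa", "tpot_ms": 11.0},
    {"category": "short_qa", "tpot_ms": 12.0},
    {"category": "short_qa", "tpot_ms": 13.0},
    {"category": "code_generation", "tpot_ms": 15.0},
]
summary = summarize_by_category(sample)
```

With very few samples per category the P95 collapses to the maximum, so the 3-run, 500-prompt setup above is closer to the minimum useful size than to a generous one.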

Benchmark Results: Llama 3.1 70B Instruct

Overall Summary (500 prompts, mean of 3 runs)

Method	Draft Model	Accept Ratio	TPOT P50 (ms)	TPOT P95 (ms)	E2E Speedup	Extra GPU Memory
Baseline (no spec)	-	-	42.3	58.7	1.00x	0 GB
Vanilla SD	Llama 3.1 8B	0.62	19.8	31.2	1.95x	~16 GB
Medusa-2	5 heads	0.68	17.1	27.8	2.21x	~0.8 GB
EAGLE-1	EAGLE head	0.73	14.9	24.1	2.52x	~1.5 GB
EAGLE-3	EAGLE-3 head	0.81	12.3	19.6	2.89x	~1.8 GB
Prompt Lookup	N-gram (n=3)	0.41	28.5	45.3	1.38x	0 GB

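Detailed Results by Category (EAGLE-3)

Performance varies widely by prompt type, so the per-category results should be checked separately.

Category	Accept Ratio	E2E Speedup	TPOT P50 (ms)	Notes
short_qa	0.84	2.95x	11.8	Short outputs, but high prediction accuracy
summarization	0.83	3.12x	11.5	Second-largest speedup
code_generation	0.72	2.31x	15.2	Syntax is predicted well, but accuracy drops in logic-heavy parts
creative_writing	0.76	2.67x	13.4	Surprisingly high accept ratio
translation	0.85	3.05x	11.9	Output depends heavily on the input, which makes prediction easier
structured_output	0.88	3.21x	10.9	Structural patterns in JSON/YAML favor prediction; largest speedup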
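Because speedup differs so much by category, the headline number depends on the traffic mix. As a rough first-order sketch (a request-count-weighted average, which is an assumption and not the measurement method used here), weighting the per-category speedups by the 500-prompt distribution comes out at roughly the 2.89x overall figure, and the same function can be fed a different mix. The names `CATEGORY_SPEEDUP`, `PROMPT_MIX`, and `weighted_speedup` are hypothetical.

```python
# Per-category E2E speedups for EAGLE-3, copied from the table above.
CATEGORY_SPEEDUP = {
    "short_qa": 2.95,
    "summarization": 3.12,
    "code_generation": 2.31,
    "creative_writing": 2.67,
    "translation": 3.05,
    "structured_output": 3.21,
}

# Prompt counts from the benchmark prompt set (500 prompts in total).
PROMPT_MIX = {
    "short_qa": 150,
    "summarization": 100,
    "code_generation": 80,
    "creative_writing": 50,
    "translation": 70,
    "structured_output": 50,
}


def weighted_speedup(mix, speedups=CATEGORY_SPEEDUP):
    """Request-count-weighted mean speedup; a rough estimate, not a measurement."""
    total = sum(mix.values())
    if total == 0:
        raise ValueError("mix must contain at least one request")
    return sum(count * speedups[category] for category, count in mix.items()) / total


benchmark_mix_estimate = weighted_speedup(PROMPT_MIX)

# Hypothetical code-heavy service: 80% code generation, 20% short Q&A.
code_heavy_estimate = weighted_speedup({"code_generation": 400, "short_qa": 100})
```

A code-heavy mix lands noticeably below the benchmark mix, which is why the prompt set should be replaced with a service's own traffic distribution before trusting the summary row.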

Accept Ratio as a Function of Temperature

# Accept ratio measured at each temperature (EAGLE-3, Llama 3.1 70B)
temperature_results = {
    0.0: {"accept_ratio": 0.81, "speedup": 2.89},
    0.3: {"accept_ratio": 0.76, "speedup": 2.61},
    0.5: {"accept_ratio": 0.71, "speedup": 2.38},
    0.7: {"accept_ratio": 0.64, "speedup": 2.08},
    1.0: {"accept_ratio": 0.55, "speedup": 1.72},
    1.2: {"accept_ratio": 0.47, "speedup": 1.41},
    1.5: {"accept_ratio": 0.38, "speedup": 1.15},  # almost no benefit
}

# Conclusion: disabling speculative decoding is recommended when temperature > 1.0
TEMP_THRESHOLD = 1.0
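The threshold above can be applied per request. The sketch below is one possible way to do that, assuming the serving layer can see each request's temperature; `should_speculate` and `estimated_speedup` are hypothetical helpers, and the linear interpolation between measured points is an assumption rather than a measured curve.

```python
# Measured E2E speedup by temperature (EAGLE-3, Llama 3.1 70B), from the results above.
SPEEDUP_BY_TEMPERATURE = {
    0.0: 2.89,
    0.3: 2.61,
    0.5: 2.38,
    0.7: 2.08,
    1.0: 1.72,
    1.2: 1.41,
    1.5: 1.15,
}
TEMP_THRESHOLD = 1.0


def should_speculate(temperature, threshold=TEMP_THRESHOLD):
    """Enable speculative decoding only at or below the recommended threshold."""
    return temperature <= threshold


def estimated_speedup(temperature):
    """Linearly interpolate between measured points; clamp outside the measured range."""
    points = sorted(SPEEDUP_BY_TEMPERATURE.items())
    if temperature <= points[0][0]:
        return points[0][1]
    if temperature >= points[-1][0]:
        return points[-1][1]
    for (t0, s0), (t1, s1) in zip(points, points[1:]):
        if t0 <= temperature <= t1:
            fraction = (temperature - t0) / (t1 - t0)
            return s0 + fraction * (s1 - s0)
    raise AssertionError("unreachable: temperature is inside the measured range")
```

Interpolated values are only a planning aid; a temperature that matters for a given service should be measured directly.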

Benchmark Results: Impact of Concurrency

A production system handles many concurrent requests, not one at a time. We measured how the benefit of speculative decoding changes as concurrency increases.

Performance by Concurrency (EAGLE-3, Llama 3.1 70B)

Concurrency	Throughput (tokens/s)	E2E Speedup	Accept Ratio	Notes
1	81.3	2.89x	0.81	Best case
2	156.2	2.71x	0.80	Nearly unchanged
4	289.5	2.43x	0.79	Slight drop
8	498.7	1.98x	0.78	Decline begins
16	721.3	1.52x	0.77	Clear decline
32	890.4	1.18x	0.76	Marginal benefit
64	965.1	0.95x	0.75	Counterproductive

# Benchmark code for a given concurrency level
import asyncio
import statistics
import time

import aiohttp

async def concurrent_benchmark(
    base_url: str,
    model: str,
    prompts: list[dict],
    concurrency: int,
) -> dict:
    """Run all prompts with at most `concurrency` requests in flight."""
    semaphore = asyncio.Semaphore(concurrency)

    async def single_request(session, prompt):
        async with semaphore:
            # Start timing only after a slot is acquired, so queueing
            # behind the semaphore is not counted as request latency.
            start = time.perf_counter()
            payload = {
                "model": model,
                "messages": [{"role": "user", "content": prompt["text"]}],
                "max_tokens": prompt["max_tokens"],
                "temperature": 0.0,
                "stream": False,
            }
            async with session.post(
                f"{base_url}/v1/chat/completions",
                json=payload,
            ) as resp:
                data = await resp.json()
                end = time.perf_counter()
                tokens = data["usage"]["completion_tokens"]
                return {
                    "e2e_ms": (end - start) * 1000,
                    "output_tokens": tokens,
                    "tokens_per_sec": tokens / (end - start),
                }

    async with aiohttp.ClientSession() as session:
        tasks = [single_request(session, p) for p in prompts]
        total_start = time.perf_counter()
        results = await asyncio.gather(*tasks)
        total_elapsed = time.perf_counter() - total_start

    total_tokens = sum(r["output_tokens"] for r in results)
    return {
        "concurrency": concurrency,
        "total_tokens": total_tokens,
        "total_elapsed_sec": round(total_elapsed, 2),
        "throughput_tokens_per_sec": round(total_tokens / total_elapsed, 1),
        "avg_e2e_ms": round(statistics.mean(r["e2e_ms"] for r in results), 1),
        "p95_e2e_ms": round(
            sorted(r["e2e_ms"] for r in results)[int(len(results) * 0.95)], 1
        ),
    }

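Key observation: At concurrency 32 the benefit of speculative decoding is marginal (1.18x), and at 64 it turns negative: the E2E speedup falls to 0.95x, which is slower than the baseline. The causes are contention for KV cache memory and the extra compute overhead of the draft model.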
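Since the benefit disappears as load rises, one option is to route requests between a speculative and a non-speculative deployment based on the current number of in-flight requests. The sketch below only illustrates the decision rule using the measured table; `pick_backend` and the 1.2x cutoff are assumptions, and the in-flight counter would have to come from the serving layer.

```python
# Measured E2E speedup by concurrency (EAGLE-3, Llama 3.1 70B), from the table above.
SPEEDUP_BY_CONCURRENCY = {1: 2.89, 2: 2.71, 4: 2.43, 8: 1.98, 16: 1.52, 32: 1.18, 64: 0.95}


def pick_backend(in_flight, min_speedup=1.2):
    """Choose 'speculative' or 'baseline' for the next request.

    Uses the nearest measured concurrency level at or above `in_flight`,
    which is the conservative choice because speedup falls as load rises.
    `min_speedup` is an assumed cutoff, not a value from the benchmark.
    """
    for level in sorted(SPEEDUP_BY_CONCURRENCY):
        if in_flight <= level:
            speedup = SPEEDUP_BY_CONCURRENCY[level]
            return "speculative" if speedup >= min_speedup else "baseline"
    # Beyond the highest measured level the trend is already below 1.0x.
    return "baseline"
```

With the assumed cutoff, load up to 16 in-flight requests stays on the speculative backend and anything above that falls back to the baseline.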

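Benchmark Results: Cost-Efficiency Analysis

Serving cost is determined by GPU time, so in production the more important question is not raw latency improvement but "how many more requests can the same budget serve?"

Cost-Efficiency Comparison (4x A100 80GB, $13.04 per hour)

Method	Throughput (requests/hour)	Cost per Request ($)	Cost Reduction vs. Baseline
Baseline	3,200	$0.00408	-
Vanilla SD	5,440	$0.00240	41% lower
EAGLE-3	7,680	$0.00170	58% lower
Medusa-2	6,400	$0.00204	50% lower

Note: These figures were measured at concurrency 8. In actual serving they vary with the traffic pattern, the request-size distribution, and SLO requirements.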
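The table follows from two inputs, the hourly instance price and the requests served per hour. A minimal sketch of that arithmetic, using only the numbers above (the function names are hypothetical):

```python
HOURLY_COST_USD = 13.04  # 4x A100 80GB, as stated above

# Requests served per hour at concurrency 8, from the table above.
REQUESTS_PER_HOUR = {
    "Baseline": 3200,
    "Vanilla SD": 5440,
    "EAGLE-3": 7680,
    "Medusa-2": 6400,
}


def cost_per_request(requests_per_hour, hourly_cost=HOURLY_COST_USD):
    """Cost of one request when the instance is billed by the hour."""
    return hourly_cost / requests_per_hour


def savings_vs_baseline(method):
    """Fractional cost reduction relative to the baseline row."""
    baseline = cost_per_request(REQUESTS_PER_HOUR["Baseline"])
    return 1 - cost_per_request(REQUESTS_PER_HOUR[method]) / baseline


for name in REQUESTS_PER_HOUR:
    print(f"{name}: ${cost_per_request(REQUESTS_PER_HOUR[name]):.5f} per request, "
          f"{savings_vs_baseline(name):.0%} lower than baseline")
```

Because the hourly price cancels out of the ratio, the savings percentage depends only on the throughput ratio; a different GPU price changes the per-request cost but not the relative ranking.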

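Training and Maintenance Cost by Draft Model

Method	Initial Training Cost	Training Time	Retraining Needed When Target Model Is Updated
Vanilla SD	$0 (uses an existing model)	0	No
Medusa-2	~$50 (1x A100, 3 hours)	3 hours	Yes
EAGLE-1	~$100 (1x A100, 6 hours)	6 hours	Yes
EAGLE-3	~$200 (1x A100, 12 hours)	12 hours	Yes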
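Training cost matters mainly in relation to how quickly it is recovered. As a back-of-the-envelope sketch, the break-even request count is the training cost divided by the per-request saving from the cost-efficiency table. This assumes the concurrency-8 per-request costs hold and ignores retraining after target-model updates; `break_even_requests` is a hypothetical helper, and EAGLE-1 is omitted because the cost table has no per-request figure for it.

```python
import math

BASELINE_COST_PER_REQUEST = 0.00408  # USD, from the cost-efficiency table

# (one-off training cost in USD, cost per request in USD), both from the tables in this section.
METHODS = {
    "Vanilla SD": (0.0, 0.00240),
    "Medusa-2": (50.0, 0.00204),
    "EAGLE-3": (200.0, 0.00170),
}


def break_even_requests(method):
    """Requests needed before per-request savings repay the training cost."""
    training_cost, cost_per_request = METHODS[method]
    saving = BASELINE_COST_PER_REQUEST - cost_per_request
    if saving <= 0:
        raise ValueError(f"{method} does not save money per request")
    return math.ceil(training_cost / saving)


break_even = {name: break_even_requests(name) for name in METHODS}
```

At the throughput levels in the cost table, tens of thousands of requests correspond to hours rather than weeks of traffic, so the retraining requirement is usually the larger operational consideration.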

num_speculative_tokens Sensitivity Analysis

We measured how performance changes with num_speculative_tokens, using the EAGLE-3 and Llama 3.1 70B combination.

num_speculative_tokens	Accept Ratio	Mean Accepted Length	TPOT P50 (ms)	E2E Speedup	GPU Memory Increase
1	0.91	0.91	35.2	1.20x	+0.3 GB
3	0.86	2.58	16.8	2.52x	+0.8 GB
5	0.81	4.05	12.3	2.89x	+1.8 GB
7	0.76	5.32	11.1	3.02x	+2.9 GB
10	0.69	6.90	10.8	2.95x	+4.5 GB
15	0.58	8.70	11.5	2.71x	+7.2 GB

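Conclusion: num_speculative_tokens in the range 5 to 7 is optimal. Beyond 7, the falling accept ratio cancels out the gain from longer drafts (E2E speedup peaks at 3.02x for 7 and drops to 2.95x at 10), while GPU memory consumption keeps growing.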
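In the table, mean accepted length equals num_speculative_tokens multiplied by the accept ratio, and the best setting depends on how much extra GPU memory is available. A small sketch that encodes the measured rows and picks the highest-speedup setting within a memory budget (`best_setting` is a hypothetical helper, and the budget values in the usage lines are examples):

```python
# (num_speculative_tokens, accept_ratio, e2e_speedup, extra_gpu_memory_gb) from the table above.
SENSITIVITY_ROWS = [
    (1, 0.91, 1.20, 0.3),
    (3, 0.86, 2.52, 0.8),
    (5, 0.81, 2.89, 1.8),
    (7, 0.76, 3.02, 2.9),
    (10, 0.69, 2.95, 4.5),
    (15, 0.58, 2.71, 7.2),
]


def mean_accepted_length(num_tokens, accept_ratio):
    """Average number of draft tokens accepted per speculation step."""
    return num_tokens * accept_ratio


def best_setting(max_extra_memory_gb):
    """Return the num_speculative_tokens with the highest speedup that fits the budget.

    Returns None when no measured setting fits.
    """
    candidates = [row for row in SENSITIVITY_ROWS if row[3] <= max_extra_memory_gb]
    if not candidates:
        return None
    return max(candidates, key=lambda row: row[2])[0]


tight_budget_choice = best_setting(2.0)
loose_budget_choice = best_setting(10.0)
```

Even with an unconstrained budget the selection stops at 7, matching the conclusion that larger values only add memory cost.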

Running a Reproducible Benchmark

Full Benchmark Script

#!/bin/bash
# run_benchmark.sh - runs the full benchmark
# Speculative-decoding flag names differ across vLLM releases;
# check the server's --help output for the installed version.
set -euo pipefail

MODEL="meta-llama/Llama-3.1-70B-Instruct"
DRAFT_MODELS=(
    "none"                                    # baseline
    "meta-llama/Llama-3.1-8B-Instruct"       # vanilla SD
    "eagle3-llama3.1-70b-instruct"            # EAGLE-3
)
SPEC_TOKENS=(0 5 5)
METHODS=("baseline" "vanilla" "eagle3")

RESULTS_DIR="benchmark_results/$(date +%Y%m%d_%H%M%S)"
mkdir -p "$RESULTS_DIR"

for i in "${!METHODS[@]}"; do
    method="${METHODS[$i]}"
    draft="${DRAFT_MODELS[$i]}"
    n_tokens="${SPEC_TOKENS[$i]}"

    echo "=== Running benchmark: $method ==="

    # Start the server
    if [ "$method" = "baseline" ]; then
        python -m vllm.entrypoints.openai.api_server \
            --model "$MODEL" \
            --tensor-parallel-size 4 \
            --gpu-memory-utilization 0.92 \
            --port 8000 &
    elif [ "$method" = "eagle3" ]; then
        python -m vllm.entrypoints.openai.api_server \
            --model "$MODEL" \
            --speculative-model "$draft" \
            --speculative-method eagle \
            --num-speculative-tokens "$n_tokens" \
            --tensor-parallel-size 4 \
            --gpu-memory-utilization 0.92 \
            --port 8000 &
    else
        python -m vllm.entrypoints.openai.api_server \
            --model "$MODEL" \
            --speculative-model "$draft" \
            --num-speculative-tokens "$n_tokens" \
            --tensor-parallel-size 4 \
            --gpu-memory-utilization 0.92 \
            --port 8000 &
    fi

    SERVER_PID=$!

    # Wait for the model to load: poll the health endpoint (up to 30 minutes)
    # instead of relying on a fixed sleep, which is too short for large models.
    ready=0
    for _ in $(seq 1 180); do
        if curl -sf "http://localhost:8000/health" > /dev/null; then
            ready=1
            break
        fi
        sleep 10
    done
    if [ "$ready" -ne 1 ]; then
        echo "Server did not become healthy: $method" >&2
        kill "$SERVER_PID" || true
        exit 1
    fi

    # Run the benchmark
    python benchmark.py \
        --prompts eval_prompts.json \
        --model "$MODEL" \
        --warmup 10 \
        --runs 3 \
        --output "$RESULTS_DIR/${method}.json"

    # Stop the server
    kill "$SERVER_PID"
    wait "$SERVER_PID" 2>/dev/null || true
    sleep 10
done

# Aggregate the results
python aggregate_results.py --input-dir "$RESULTS_DIR" --output "$RESULTS_DIR/summary.json"
echo "Results saved to $RESULTS_DIR/summary.json"

Aggregating and Comparing Results

# aggregate_results.py
import argparse
import json
from pathlib import Path

def aggregate_results(results_dir: str) -> dict:
    """Read the benchmark result files and build a comparison table."""
    results_path = Path(results_dir)
    summary = {}

    for result_file in sorted(results_path.glob("*.json")):
        if result_file.name == "summary.json":
            continue
        method = result_file.stem
        with open(result_file) as f:
            data = json.load(f)
        if not data:
            continue  # skip empty result files instead of failing on indexing

        # Compute P50 and P95
        tpot_values = sorted(r["tpot_ms"] for r in data)
        e2e_values = sorted(r["e2e_ms"] for r in data)

        summary[method] = {
            "tpot_p50": round(tpot_values[len(tpot_values) // 2], 2),
            "tpot_p95": round(tpot_values[int(len(tpot_values) * 0.95)], 2),
            "e2e_p50": round(e2e_values[len(e2e_values) // 2], 2),
            "e2e_p95": round(e2e_values[int(len(e2e_values) * 0.95)], 2),
            "total_requests": len(data),
        }

    # Compute speedup relative to the baseline (ratio of E2E P50 values)
    if "baseline" in summary:
        baseline_e2e = summary["baseline"]["e2e_p50"]
        for method in summary:
            summary[method]["speedup"] = round(
                baseline_e2e / summary[method]["e2e_p50"], 2
            )

    return summary

if __name__ == "__main__":
    # Accept the same flags that run_benchmark.sh passes.
    parser = argparse.ArgumentParser(description="Aggregate benchmark results")
    parser.add_argument("--input-dir", default="benchmark_results/latest")
    parser.add_argument("--output", default=None, help="optional path for summary JSON")
    args = parser.parse_args()

    summary = aggregate_results(args.input_dir)
    if args.output:
        Path(args.output).write_text(json.dumps(summary, indent=2))
    print(json.dumps(summary, indent=2))

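Production Recommendations

These are practical guidelines for adoption, based on the benchmark results.

Decision Tree for Choosing a Technique

Q: Do you replace the target model frequently?
├─ Yes: Vanilla SD (no retraining needed)
│       or Prompt Lookup (no training needed)
└─ No: Q: Is there spare GPU memory?
       ├─ Yes (>10 GB free): EAGLE-3 (best performance)
       └─ No:  Q: Do you have lightweight training infrastructure?
              ├─ Yes: Medusa-2 (memory-efficient)
              └─ No:  Vanilla SD (Llama 3.1 8B as draft)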
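The tree above translates directly into a small function, which can be convenient in capacity-planning scripts. This is a sketch that mirrors the branches as written; `choose_method` and its parameter names are hypothetical.

```python
def choose_method(swaps_target_often, free_gpu_memory_gb, has_training_infra):
    """Walk the decision tree above and return the recommended technique."""
    if swaps_target_often:
        # Neither option needs retraining when the target model changes.
        return "Vanilla SD or Prompt Lookup"
    if free_gpu_memory_gb > 10:
        return "EAGLE-3"
    if has_training_infra:
        return "Medusa-2"
    return "Vanilla SD"


# Example: a stable target model, 24 GB of headroom, no training pipeline.
recommendation = choose_method(
    swaps_target_often=False,
    free_gpu_memory_gb=24,
    has_training_infra=False,
)
```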

Recommended Settings by SLO

SLO	Recommended Method	num_speculative_tokens	Notes
TPOT P95 < 20ms	EAGLE-3	5	Expect an accept ratio of 0.8 or higher
TPOT P95 < 35ms	Vanilla SD or Medusa	5	Low operational complexity
Maximize throughput	EAGLE-3, concurrency 8 or below	7	Concurrency limit required
Minimize cost	EAGLE-3	5	58% cost reduction

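Limitations of the Benchmark

Keep the following limitations in mind when interpreting these results.

Hardware dependence: The results are for A100 SXM; the speedup ratio will differ on A10G or L4. In particular, on PCIe connections without NVLink, TP communication overhead grows and the speedup shrinks.
Prompt set bias: The prompts are a mix of Korean and English. Services dominated by pure code generation or mathematical reasoning may see different accept ratios.
vLLM version dependence: vLLM's speculative decoding implementation is improving quickly, so the numbers can change from version to version.
KV cache effects: max_model_len was fixed at 4096. Raising it to 8192 or more changes the concurrency the server can sustain because of KV cache memory constraints.
No quantization: This benchmark uses bf16 precision. Applying speculative decoding to GPTQ/AWQ quantized models requires a separate benchmark.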
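To make the KV cache limitation concrete, the per-sequence cache size can be estimated from the model architecture. The sketch below assumes the commonly published Llama 3.1 70B configuration (80 layers, 8 key/value heads, head dimension 128) and bf16 storage; these values are assumptions that should be confirmed against the config.json of the checkpoint actually served, and the result is the total across the tensor-parallel group, not per GPU.

```python
# Assumed architecture values for Llama 3.1 70B; verify against the served checkpoint.
NUM_LAYERS = 80
NUM_KV_HEADS = 8       # grouped-query attention
HEAD_DIM = 128
BYTES_PER_VALUE = 2    # bf16


def kv_bytes_per_token():
    """Bytes of KV cache per token; the factor 2 covers keys and values."""
    return 2 * NUM_LAYERS * NUM_KV_HEADS * HEAD_DIM * BYTES_PER_VALUE


def kv_gib_per_sequence(max_model_len):
    """KV cache size in GiB for one sequence that fills max_model_len tokens."""
    return kv_bytes_per_token() * max_model_len / 2**30


for length in (4096, 8192):
    print(f"max_model_len={length}: {kv_gib_per_sequence(length):.2f} GiB per full sequence")
```

Doubling max_model_len doubles the worst-case cache per sequence, which halves the number of full-length sequences that fit in the same memory and therefore shifts the concurrency level at which speculation stops paying off.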

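Quiz

Q1. Which technique recorded the highest E2E speedup in this benchmark, and what was the figure?

Answer: ||EAGLE-3 recorded the highest speedup at 2.89x. In the structured output (JSON/YAML) category it reached 3.21x.||

Q2. Why is the speedup of speculative decoding only 0.95x at concurrency 64?

Answer: ||At high concurrency the system is already compute-bound, so the benefit speculation brings by relieving the memory-bound bottleneck disappears, and the draft model's extra compute plus contention for KV cache memory end up degrading performance.||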

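Q3. Why does the speedup actually fall when num_speculative_tokens is raised to 15?

Answer: ||As the number of speculated tokens grows, the accept ratio at later positions drops sharply. Of 15 tokens, an average of only 8.7 are accepted, so the draft compute for the remaining 6.3 is wasted. This overhead cancels out the benefit of the additional accepted tokens.||

Q4. Why is the accept ratio of Prompt Lookup Decoding as low as 0.41?

Answer: ||Prompt Lookup reuses N-grams from the input text, so the match rate is low on tasks with little lexical overlap between input and output (translation, creative writing, and so on). It is effective only on tasks such as document summarization that reuse input wording verbatim.||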

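Q5. Why does EAGLE-3 need more than twice the extra GPU memory of Medusa-2?

Answer: ||Medusa is a lightweight head made of a few linear layers, whereas EAGLE-3 includes a transformer layer that processes the target model's features, so it has more parameters. In return, this more complex structure yields an accept ratio of 0.81, higher than Medusa's 0.68.||

Q6. Why is the accept ratio as high as 0.85 on the translation task?

Answer: ||Translation is strongly conditioned on the structure and vocabulary of the source sentence, so the draft model can predict the next token easily. In Korean-to-English translation in particular, high-frequency expressions have clear corresponding patterns, which raises prediction accuracy.||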

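Q7. When reproducing this benchmark in your own environment, which setting should you change first?

Answer: ||The prompt set composition. A benchmark is only meaningful if the prompt set reflects your service's actual traffic distribution (request types, input/output lengths, temperature distribution). Results from a general-purpose prompt set can differ substantially from your own environment.||

Q8. What preconditions are required to achieve the 58% cost reduction?

Answer: ||The preconditions are a trained EAGLE-3 draft head, operation at concurrency 8 or below, a temperature setting close to 0, and spare GPU memory (at least +2 GB). In real production, concurrency fluctuates constantly with traffic, so dynamic routing that turns speculative decoding on and off is essential.||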