Accelerating LLM Inference 2-3x with Speculative Decoding: From Theory to Production

1. The Fundamental Bottleneck of LLM Inference
- 1.1 Arithmetic Intensity Analysis
2. Speculative Decoding Principles
3. Advanced Techniques
4. Using Speculative Decoding in vLLM
5. Choosing the Optimal K Value
6. Practical Considerations
- 6.1 Draft Model Selection Criteria
- 6.2 Caveats in Batch Environments
7. Quiz

1. The Fundamental Bottleneck of LLM Inference

Autoregressive decoding in LLMs is inherently serial:

Token 1 generation → Token 2 generation → Token 3 generation → ...
       ↓                    ↓                    ↓
   Full model           Full model           Full model
   Forward Pass         Forward Pass         Forward Pass

Each decoding step runs a full forward pass through the model (70B parameters in this running example) to produce a single token. At small batch sizes this step is memory-bandwidth bound: the GPU spends most of its time streaming weights from memory while its compute units sit largely idle.

1.1 Arithmetic Intensity Analysis

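70B model, FP16:
- Model size: ~140GB (70B parameters × 2 bytes)
- Generating 1 token: ~140GB of weights read from memory
- A100 80GB memory bandwidth: ~2TB/s
- Theoretical maximum: 2000/140 ≈ 14 tokens/s

140GB does not fit on a single 80GB GPU, so this estimate treats the weights as if they were streamed at one GPU's bandwidth.

In practice ~10 tokens/s, due to KV Cache reads and other overhead.
→ GPU compute utilization: on the order of 1% or less (about 1.4 TFLOP/s of useful work against a dense FP16 peak of 312 TFLOP/s)

Key insight: a forward pass reads every weight once, whether it processes 1 token or K tokens. Scoring several tokens per pass amortizes that read and raises efficiency.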
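
The arithmetic above can be reproduced with a few lines of Python. This is a rough roofline-style sketch, not a measurement: `decode_estimate` is a hypothetical helper, the 312 TFLOP/s figure is the nominal dense FP16 peak of an A100, and the count of 2 FLOPs per parameter per token ignores attention and KV Cache traffic.

```python
# Back-of-the-envelope estimate for single-stream decoding.
# All hardware numbers are nominal peak figures and are assumptions.

def decode_estimate(params_billion, bytes_per_param, bandwidth_gb_s, peak_tflops):
    weight_gb = params_billion * bytes_per_param        # GB of weights read per token
    tokens_per_s = bandwidth_gb_s / weight_gb            # bandwidth-bound ceiling
    flops_per_token = 2 * params_billion * 1e9           # one multiply-add per weight
    achieved_tflops = tokens_per_s * flops_per_token / 1e12
    utilization = achieved_tflops / peak_tflops          # fraction of peak compute
    return weight_gb, tokens_per_s, utilization

weight_gb, tokens_per_s, utilization = decode_estimate(70, 2, 2000, 312)
print(f"weights: {weight_gb:.0f} GB")
print(f"ceiling: {tokens_per_s:.1f} tokens/s")
print(f"compute utilization at the ceiling: {utilization:.2%}")
```

Changing `bytes_per_param` to 1 (8-bit weights) doubles the ceiling, which is why quantization and speculative decoding are often combined.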

2. Speculative Decoding Principles

2.1 Core Idea

A small Draft model quickly proposes K tokens, and a large Target model verifies all K simultaneously in a single forward pass:

Draft Model (1B):  t1 → t2 → t3 → t4 → t5  (quickly propose 5)
                    ↓    ↓    ↓    ↓    ↓
Target Model (70B): ✅   ✅   ✅   ❌   -    (verify at once)
                                    ↓
                              t4' resampled  (correction after rejection)

Result: [t1, t2, t3, t4'] → 4 tokens from 1 forward pass of 70B!

2.2 Mathematical Guarantee: Rejection Sampling

The core of Speculative Decoding is the mathematical guarantee that the output distribution is exactly identical to the Target model's distribution.

Given Draft model distribution $q(x)$ and Target model distribution $p(x)$:

Acceptance probability for a draft token $x$ sampled from $q$:

$$\alpha(x) = \min\left(1, \frac{p(x)}{q(x)}\right)$$

Correction distribution on rejection, where $\text{norm}$ rescales the values so that they sum to 1:

$$p'(x) = \text{norm}\left(\max(0, p(x) - q(x))\right)$$

After this process, the final output distribution is exactly $p(x)$.
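
The guarantee can be checked numerically on a toy vocabulary. The sketch below is an illustration with made-up probabilities; `output_distribution` is a hypothetical helper that computes, for one accept/reject step, the probability of emitting each token: the mass accepted directly from the draft plus the rejection mass redistributed through the correction distribution.

```python
def output_distribution(p, q):
    """Distribution of the token emitted by one accept/reject step.

    p: target probabilities, q: draft probabilities (same vocabulary order).
    """
    # A draft sample x ~ q is accepted with mass q(x) * min(1, p(x)/q(x)) = min(p(x), q(x)).
    accept_mass = [min(pi, qi) for pi, qi in zip(p, q)]
    accept_rate = sum(accept_mass)

    # Correction distribution: normalized positive part of p - q.
    residual = [max(0.0, pi - qi) for pi, qi in zip(p, q)]
    total = sum(residual)
    correction = [r / total for r in residual] if total > 0 else [0.0] * len(p)

    # Emitted token is either an accepted draft token or a resample after rejection.
    reject_rate = 1.0 - accept_rate
    out = [a + reject_rate * c for a, c in zip(accept_mass, correction)]
    return out, accept_rate

p = [0.6, 0.3, 0.1]   # toy target distribution (assumed values)
q = [0.3, 0.5, 0.2]   # toy draft distribution (assumed values)
out, alpha = output_distribution(p, q)
print("output distribution:", [round(x, 6) for x in out])
print("acceptance rate:", round(alpha, 6))
```

For these inputs the emitted distribution matches `p` and the acceptance rate is 0.7, which equals the sum of `min(p(x), q(x))` over the vocabulary. The closer the draft is to the target, the larger that sum becomes.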

import torch

@torch.no_grad()
def speculative_decode(draft_model, target_model, input_ids, K=5):
    """Core Speculative Decoding step (assumes batch size 1).

    Both models are assumed to return an object with a `.logits` tensor of
    shape [batch, seq_len, vocab], as Hugging Face causal LMs do.
    """
    # 1) Draft K tokens, keeping the full draft distribution at each step
    draft_tokens, draft_dists = [], []
    current = input_ids.clone()
    for _ in range(K):
        probs = torch.softmax(draft_model(current).logits[:, -1], dim=-1)
        token = torch.multinomial(probs, 1)
        draft_tokens.append(token)
        draft_dists.append(probs)
        current = torch.cat([current, token], dim=-1)

    # 2) Score the prompt plus all K draft tokens in one Target forward pass
    target_probs = torch.softmax(target_model(current).logits, dim=-1)

    # 3) Rejection Sampling
    accepted = []
    n = input_ids.shape[-1]
    for i in range(K):
        # Logits at position n+i-1 predict the token at position n+i
        p = target_probs[:, n + i - 1]
        q = draft_dists[i]
        p_x = p.gather(-1, draft_tokens[i])
        q_x = q.gather(-1, draft_tokens[i])

        # Accept with probability min(1, p(x)/q(x))
        accept_prob = torch.clamp(p_x / q_x, max=1.0)
        if torch.rand_like(accept_prob).item() < accept_prob.item():
            accepted.append(draft_tokens[i])
        else:
            # Rejection: resample from norm(max(0, p - q)) and stop.
            # Reusing the stored draft distribution avoids a second Draft pass.
            residual = torch.clamp(p - q, min=0)
            residual = residual / residual.sum(dim=-1, keepdim=True)
            accepted.append(torch.multinomial(residual, 1))
            break
    else:
        # All K accepted: sample one bonus token from the Target model
        accepted.append(torch.multinomial(target_probs[:, n + K - 1], 1))

    return torch.cat(accepted, dim=-1)

2.3 Acceptance Rate and Speed Improvement

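With a per-token acceptance rate $\alpha$ (assumed independent across positions), the expected number of tokens generated per Target forward pass is:

$$E[\text{tokens per step}] = \frac{1 - \alpha^{K+1}}{1 - \alpha}$$

The speedup is smaller than the token count because the K Draft forward passes also take time.

Draft-Target Pair	Acceptance Rate α	Avg Tokens per Step (K=5)	Speedup
GPT-2 → GPT-4	0.4	1.7	1.3x
Llama-68M → Llama-70B	0.7	2.9	2.3x
Llama-1B → Llama-70B	0.8	3.7	2.8x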
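
The expected-token formula is a finite geometric series, which can be verified directly. The sketch below assumes each draft token is accepted independently with probability α; both function names are hypothetical.

```python
def expected_tokens(alpha: float, k: int) -> float:
    """Closed form: (1 - alpha^(k+1)) / (1 - alpha)."""
    if alpha == 1.0:
        return float(k + 1)   # every draft token accepted, plus the bonus token
    return (1.0 - alpha ** (k + 1)) / (1.0 - alpha)

def expected_tokens_by_sum(alpha: float, k: int) -> float:
    """Same quantity from first principles.

    A step emits at least j+1 tokens exactly when the first j draft tokens
    are all accepted, which happens with probability alpha^j.
    """
    return sum(alpha ** j for j in range(k + 1))

for alpha in (0.4, 0.7, 0.8):
    closed = expected_tokens(alpha, 5)
    summed = expected_tokens_by_sum(alpha, 5)
    print(f"alpha={alpha}: {closed:.2f} tokens/step (sum check: {summed:.2f})")
```

Note that even with α = 0 the step still yields one token, since a rejected first draft token is replaced by a sample from the correction distribution.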

3. Advanced Techniques

3.1 Self-Speculative Decoding (Without a Draft Model)

Leverages the Target model's own Early Exit or Layer Skipping without a separate Draft model:

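import torch.nn as nn

# Layer Skip approach (sketch: assumes the model defines self.layers,
# self.norm and self.lm_head, and that each layer maps hidden states to hidden states)
class SelfSpeculativeModel(nn.Module):
    def draft_forward(self, x):
        """Fast draft using only the first 8 layers"""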
        for layer in self.layers[:8]:
            x = layer(x)
        return self.lm_head(self.norm(x))

    def verify_forward(self, x):
        """Verification with all layers"""
        for layer in self.layers:
            x = layer(x)
        return self.lm_head(self.norm(x))

Advantages: No need to load a separate Draft model, saves memory

3.2 Medusa: Multi-Head Speculative Decoding

Instead of a Draft model, adds multiple LM Heads to simultaneously predict tokens at multiple positions:

          Target LM Head → t[n+1]
Input → Hidden States →
          Medusa Head 1 → t[n+2]  (prediction)
          Medusa Head 2 → t[n+3]  (prediction)
          Medusa Head 3 → t[n+4]  (prediction)
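
One simple way to use the heads' guesses is greedy verification: the Target model scores the guessed continuation in one pass, and the longest prefix that matches its own top choices is kept. The sketch below is a simplified illustration with made-up token IDs; `verify_greedy` is a hypothetical helper, and Medusa itself evaluates several candidate continuations at once with tree attention, which is omitted here.

```python
from typing import List

def verify_greedy(candidates: List[int], target_argmax: List[int]) -> List[int]:
    """Return the tokens emitted by one verification pass.

    candidates:    guessed tokens for the next m positions
    target_argmax: the Target model's top token at those m positions plus one
                   more, from a single forward pass over prompt + candidates.
                   Entry i is only meaningful if candidates[:i] were all accepted,
                   which is the only case in which it is read.
    """
    if len(target_argmax) != len(candidates) + 1:
        raise ValueError("target_argmax needs one more entry than candidates")
    emitted = []
    for guess, truth in zip(candidates, target_argmax):
        if guess != truth:
            emitted.append(truth)   # first mismatch: keep the Target's token and stop
            return emitted
        emitted.append(guess)
    emitted.append(target_argmax[len(candidates)])   # all matched: bonus token
    return emitted

print(verify_greedy([11, 42, 7], [11, 42, 9, 3]))   # third guess is wrong
print(verify_greedy([11, 42, 7], [11, 42, 7, 3]))   # every guess is right
```

Greedy matching reproduces the Target model's greedy output exactly. Sampling with temperature needs the rejection-sampling rule from Section 2.2 or a relaxed acceptance criterion.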

3.3 Apple Mirror Speculative Decoding (2026)

Research from Apple (January 2026) that addresses the serial verification bottleneck of existing Speculative Decoding:

- Mirror Model: A lightweight version of the Target model that performs Draft and Verify simultaneously
- Existing: Draft → Verify → Draft → Verify (serial)
- Mirror: Draft₁ + Verify₀ → Draft₂ + Verify₁ → ... (pipelined)

4. Using Speculative Decoding in vLLM

4.1 Configuration

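from vllm import LLM, SamplingParams

# Specify the Draft model alongside the Target model.
# Note: these keyword arguments follow older vLLM releases. Newer releases
# expect a `speculative_config` dict instead, so check the docs for your version.
llm = LLM(
    model="meta-llama/Llama-3.1-70B-Instruct",
    speculative_model="meta-llama/Llama-3.2-1B-Instruct",
    num_speculative_tokens=5,
    tensor_parallel_size=4,
    gpu_memory_utilization=0.9,
)

params = SamplingParams(temperature=0.7, max_tokens=512)
outputs = llm.generate(["Explain quantum computing:"], params)
print(outputs[0].outputs[0].text)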

4.2 Benchmark Script

# Run one configuration at a time: both servers bind to the same default port.
# The speculative flags below follow older vLLM releases; check your version.

# Standard decoding (baseline)
python -m vllm.entrypoints.openai.api_server \
    --model meta-llama/Llama-3.1-70B-Instruct \
    --tensor-parallel-size 4

# Speculative Decoding
python -m vllm.entrypoints.openai.api_server \
    --model meta-llama/Llama-3.1-70B-Instruct \
    --speculative-model meta-llama/Llama-3.2-1B-Instruct \
    --num-speculative-tokens 5 \
    --tensor-parallel-size 4
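
The commands above only start the servers. To compare the two configurations, something has to send the same request to each and time it. A minimal client sketch using only the standard library follows; it assumes the server is reachable at `http://localhost:8000` and relies on the `usage.completion_tokens` field of the OpenAI-compatible `/v1/completions` response. The helper names are hypothetical, and the measured time includes prompt processing as well as decoding.

```python
import json
import time
import urllib.request

def build_payload(model: str, prompt: str, max_tokens: int = 256) -> bytes:
    """JSON body for the OpenAI-compatible /v1/completions endpoint."""
    body = {"model": model, "prompt": prompt,
            "max_tokens": max_tokens, "temperature": 0.0}
    return json.dumps(body).encode("utf-8")

def tokens_per_second(completion_tokens: int, elapsed_s: float) -> float:
    """Generated tokens divided by wall-clock time for one request."""
    if elapsed_s <= 0:
        raise ValueError("elapsed_s must be positive")
    return completion_tokens / elapsed_s

def measure(base_url: str, model: str, prompt: str) -> float:
    """Send one completion request and return generated tokens per second."""
    request = urllib.request.Request(
        f"{base_url}/v1/completions",
        data=build_payload(model, prompt),
        headers={"Content-Type": "application/json"},
    )
    start = time.perf_counter()
    with urllib.request.urlopen(request) as response:
        result = json.load(response)
    elapsed = time.perf_counter() - start
    return tokens_per_second(result["usage"]["completion_tokens"], elapsed)

# Example call (requires a running server, so it is left commented out):
# rate = measure("http://localhost:8000",
#                "meta-llama/Llama-3.1-70B-Instruct",
#                "Explain quantum computing:")
# print(f"{rate:.1f} tokens/s")
```

Running `measure` several times per configuration and averaging gives a steadier number, since the first request after startup often includes warm-up cost.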

4.3 Using with TensorRT-LLM

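from tensorrt_llm import BuildConfig

# Build settings for the Target engine so that it can verify externally supplied
# draft tokens; the Draft model is built as its own engine.
# Option names and accepted values vary across TensorRT-LLM releases, so treat
# this as a sketch and check the documentation for your version.
build_config = BuildConfig(
    max_batch_size=8,
    max_input_len=2048,
    max_seq_len=4096,
    speculative_decoding_mode="draft_tokens_external",
    max_draft_len=5,
)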

5. Choosing the Optimal K Value

import time

def find_optimal_k(draft_model, target_model, test_prompts, k_range=range(1, 11)):
    """Search for the number of speculative tokens with the highest throughput.

    Assumes a `speculative_generate` helper (not defined here) that runs the
    full decoding loop and returns the generated token IDs for one prompt.
    """
    results = {}

    for k in k_range:
        start = time.perf_counter()  # monotonic clock, suited to benchmarking
        total_tokens = 0

        for prompt in test_prompts:
            output = speculative_generate(
                draft_model, target_model, prompt,
                num_speculative_tokens=k, max_tokens=256
            )
            total_tokens += len(output)

        elapsed = time.perf_counter() - start
        throughput = total_tokens / elapsed
        results[k] = throughput
        print(f"K={k}: {throughput:.1f} tokens/s")

    # Pick the K with the highest measured throughput
    optimal_k = max(results, key=results.get)
    print(f"\nOptimal K = {optimal_k} ({results[optimal_k]:.1f} tokens/s)")
    return optimal_k

General guidelines:

- Stronger Draft (higher acceptance rate): use a larger K (7-10)
- Weaker Draft: use a smaller K (3-5)
- Code generation: K=5-7 (repetitive patterns keep the acceptance rate high)
- Creative text: K=3-4 (more varied output lowers the acceptance rate)
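
Before running a full sweep, a simple cost model gives a rough starting point for K. The sketch below assumes that verifying K+1 tokens costs the same as one ordinary Target pass (reasonable only while decoding is memory-bound), that acceptance is independent across positions, and that one Draft pass costs a fixed fraction `c` of a Target pass. The function names and the value `c=0.05` are assumptions chosen for illustration, so measured optima can differ.

```python
def expected_speedup(alpha: float, k: int, c: float) -> float:
    """Expected speedup over plain decoding under a simple cost model.

    alpha: per-token acceptance rate
    k:     number of draft tokens per step
    c:     cost of one Draft forward pass relative to one Target pass
    One step costs k*c + 1 Target-pass units and yields
    (1 - alpha^(k+1)) / (1 - alpha) tokens on average.
    """
    tokens = (1.0 - alpha ** (k + 1)) / (1.0 - alpha) if alpha < 1.0 else k + 1.0
    return tokens / (k * c + 1.0)

def best_k(alpha: float, c: float, k_max: int = 16) -> int:
    """K in 1..k_max that maximizes the modeled speedup."""
    return max(range(1, k_max + 1), key=lambda k: expected_speedup(alpha, k, c))

for alpha in (0.4, 0.7, 0.8, 0.9):
    k = best_k(alpha, c=0.05)   # c=0.05 is an assumed cost ratio
    print(f"alpha={alpha}: best K={k}, speedup={expected_speedup(alpha, k, 0.05):.2f}x")
```

The model shows the same trend as the guidelines: the higher the acceptance rate and the cheaper the Draft model, the larger the K that pays off.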

6. Practical Considerations

6.1 Draft Model Selection Criteria

- Same tokenizer: Different tokenizers cause token alignment issues
- Same family: Llama-1B → Llama-70B (same training data, high acceptance rate)
- Appropriate size ratio: 1/50 to 1/10 of Target (too large increases Draft overhead)
- Fast inference: Low latency is critical for the Draft model

6.2 Caveats in Batch Environments

Speculative Decoding becomes less effective as batch size increases:

- Requests in a batch have different acceptance rates, which causes synchronization issues
- In batches that are already compute-bound, the extra verification work is a burden
- It is better suited to latency optimization than to throughput optimization
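
A rough way to see where the compute-bound regime begins is to compare the time spent reading weights with the time spent on arithmetic. The sketch below is an estimate under stated assumptions: nominal A100 figures (312 TFLOP/s dense FP16, 2 TB/s), FP16 weights, 2 FLOPs per weight per token, and no accounting for attention or KV Cache traffic. Both function names are hypothetical.

```python
def tokens_to_saturate(peak_tflops: float, bandwidth_tb_s: float,
                       bytes_per_param: float = 2.0) -> float:
    """Tokens per forward pass at which weight reads stop being the bottleneck.

    Each weight costs bytes_per_param bytes to read and 2 FLOPs per token
    scored, so compute time equals memory time when
    tokens = peak_flops * bytes_per_param / (2 * bandwidth).
    """
    return (peak_tflops * 1e12) * bytes_per_param / (2.0 * bandwidth_tb_s * 1e12)

def is_compute_bound(batch_size: int, k: int, saturation: float) -> bool:
    """A verification pass scores k draft tokens plus 1 more per request."""
    return batch_size * (k + 1) >= saturation

saturation = tokens_to_saturate(peak_tflops=312, bandwidth_tb_s=2.0)
print(f"saturation at about {saturation:.0f} tokens per pass")
for batch_size in (1, 8, 32, 128):
    print(batch_size, is_compute_bound(batch_size, k=5, saturation=saturation))
```

Under these assumptions a verification pass with K=5 crosses the threshold somewhere between batch sizes 8 and 32. Beyond that point the extra tokens scored during verification are no longer free, which is why the benefit shrinks for large batches.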

7. Quiz

Q1. Why doesn't Speculative Decoding change the output distribution?

Thanks to Rejection Sampling. A draft token is accepted with probability $\min(1, p(x)/q(x))$; on rejection, a replacement is sampled from the normalized correction distribution $\max(0, p(x)-q(x))$. Together these make the final distribution exactly equal to the Target distribution $p(x)$.

Q2. Why is LLM inference Memory-Bandwidth Bound?

Generating one token requires reading all model weights from memory, but the actual computation (FLOPs) is small. Memory bandwidth is the bottleneck relative to GPU compute capacity. 70B FP16 = 140GB must be read for every token.

Q3. What are the pros and cons of Self-Speculative Decoding?

Pros: No separate Draft model needed, saves memory. Cons: Since only a subset of the Target model's layers is used, acceptance rate may be lower than with a dedicated Draft model.

Q4. Why is a K value that's too large inefficient?

The probability that all K draft tokens are accepted falls off exponentially ($\alpha^K$), so later tokens are increasingly likely to be discarded. The K Draft forward passes are paid for regardless, so the Draft compute spent on discarded tokens is wasted.

Q5. Why does Speculative Decoding become less effective with large batch sizes?

(1) Synchronization overhead due to varying acceptance rates within a batch (2) Large batches are already compute-bound so GPU utilization is already high (3) Increased memory overhead from managing draft tokens.

Q6. How does the Medusa approach differ from standard Speculative Decoding?

Instead of a separate Draft model, multiple LM Heads are added to the Target model to simultaneously predict tokens at multiple positions in a single forward pass. No additional model loading required.
