- 1. The Fundamental Bottleneck of LLM Inference
- 2. Speculative Decoding Principles
- 3. Advanced Techniques
- 4. Using Speculative Decoding in vLLM
- 5. Choosing the Optimal K Value
- 6. Practical Considerations
- 7. Quiz
1. The Fundamental Bottleneck of LLM Inference
Autoregressive decoding in LLMs is inherently serial:
Token 1 generation → Token 2 generation → Token 3 generation → ...
↓ ↓ ↓
Full model Full model Full model
Forward Pass Forward Pass Forward Pass
Each token generation requires a full forward pass through a 70B model, and this process is Memory-Bandwidth Bound. GPU compute capacity is underutilized while memory bandwidth becomes the bottleneck.
1.1 Arithmetic Intensity Analysis
70B model, FP16:
- Model size: ~140GB
- 1 token generation: 140GB memory read
- A100 80GB memory bandwidth: 2TB/s
- Theoretical maximum: 2000/140 ≈ 14 tokens/s
In practice ~10 tokens/s due to KV Cache access, etc.
→ GPU compute utilization: 1-2%
Key insight: Whether generating 1 token or K tokens, the cost of reading model weights is the same. Processing multiple tokens per read improves efficiency.
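The bandwidth arithmetic above can be sanity-checked in a few lines (the ~140 GB and ~2 TB/s figures are the approximate values quoted above, not measured numbers):

```python
# Back-of-the-envelope decode throughput for a memory-bandwidth-bound model.
def max_tokens_per_sec(model_bytes: float, bandwidth_bytes_per_sec: float) -> float:
    """Every generated token streams all weights from HBM once,
    so throughput is bounded by bandwidth / model size."""
    return bandwidth_bytes_per_sec / model_bytes

GB = 1e9
model_size = 70e9 * 2        # 70B params x 2 bytes (FP16) ~ 140 GB
a100_bandwidth = 2000 * GB   # ~2 TB/s HBM bandwidth

print(f"{max_tokens_per_sec(model_size, a100_bandwidth):.1f} tokens/s")  # ~14
```

Real systems land below this bound because of KV Cache reads, kernel launch overhead, and imperfect bandwidth utilization.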
2. Speculative Decoding Principles
2.1 Core Idea
A small Draft model quickly proposes K tokens, and a large Target model verifies all K simultaneously in a single forward pass:
Draft Model (1B): t1 → t2 → t3 → t4 → t5 (quickly propose 5)
↓ ↓ ↓ ↓ ↓
Target Model (70B): ✅ ✅ ✅ ❌ - (verify at once)
↓
t4' resampled (replacement after rejection)
Result: [t1, t2, t3, t4'] → 4 tokens from 1 forward pass of 70B!
2.2 Mathematical Guarantee: Rejection Sampling
The core of Speculative Decoding is the mathematical guarantee that the output distribution is exactly identical to the Target model's distribution.
Given the Draft model distribution $q(x)$ and the Target model distribution $p(x)$:

Acceptance probability: $\min\left(1, \dfrac{p(x)}{q(x)}\right)$

Correction distribution on rejection: $p'(x) = \mathrm{norm}\left(\max(0,\ p(x) - q(x))\right)$

After this process, the final output distribution is exactly $p(x)$.
```python
import torch

def speculative_decode(draft_model, target_model, input_ids, K=5):
    """Core Speculative Decoding algorithm"""
    # 1) Generate K tokens with the Draft model
    draft_tokens = []
    draft_probs = []
    current = input_ids.clone()
    for _ in range(K):
        logits = draft_model(current).logits[:, -1]
        probs = torch.softmax(logits, dim=-1)
        token = torch.multinomial(probs, 1)
        draft_tokens.append(token)
        draft_probs.append(probs.gather(-1, token))
        current = torch.cat([current, token], dim=-1)

    # 2) Verify all K tokens at once with the Target model
    all_tokens = torch.cat([input_ids] + draft_tokens, dim=-1)
    target_logits = target_model(all_tokens).logits

    # 3) Rejection Sampling
    accepted = []
    n = input_ids.shape[-1]
    for i in range(K):
        target_prob = torch.softmax(target_logits[:, n + i - 1], dim=-1)
        p_target = target_prob.gather(-1, draft_tokens[i])
        q_draft = draft_probs[i]
        # Accept with probability min(1, p/q)
        accept_prob = torch.min(torch.ones_like(p_target), p_target / q_draft)
        if torch.rand(1) < accept_prob:
            accepted.append(draft_tokens[i])
        else:
            # Rejection: sample a replacement token from the
            # correction distribution norm(max(0, p - q))
            q_full = torch.softmax(
                draft_model(all_tokens[:, :n + i]).logits[:, -1], dim=-1
            )
            residual = torch.clamp(target_prob - q_full, min=0)
            residual = residual / residual.sum(dim=-1, keepdim=True)
            accepted.append(torch.multinomial(residual, 1))
            break
    else:
        # Bonus token when all K are accepted
        bonus = torch.multinomial(
            torch.softmax(target_logits[:, n + K - 1], dim=-1), 1
        )
        accepted.append(bonus)
    return torch.cat(accepted, dim=-1)
```
2.3 Acceptance Rate and Speed Improvement
With acceptance rate $\alpha$, the expected number of tokens generated per Target forward pass is:

$E[\text{tokens}] = \dfrac{1 - \alpha^{K+1}}{1 - \alpha}$
| Draft-Target Pair | Acceptance Rate α | K=5 Avg Tokens | Speedup |
|---|---|---|---|
| GPT-2 → GPT-4 | 0.4 | 1.6 | 1.3x |
| Llama-68M → Llama-70B | 0.7 | 2.8 | 2.3x |
| Llama-1B → Llama-70B | 0.8 | 3.6 | 2.8x |
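The "Avg Tokens" column follows the closed form $E[\text{tokens}] = (1 - \alpha^{K+1})/(1 - \alpha)$; a quick numerical check (the table entries are rounded approximations, so they sit slightly below the exact values):

```python
# Expected tokens generated per Target forward pass, given acceptance rate alpha.
def expected_tokens(alpha: float, k: int) -> float:
    """Closed form (1 - alpha^(K+1)) / (1 - alpha) for K speculative tokens."""
    return (1 - alpha ** (k + 1)) / (1 - alpha)

for alpha in (0.4, 0.7, 0.8):
    print(f"alpha={alpha}: {expected_tokens(alpha, 5):.2f} tokens per verify step")
```

The takeaway matches the table: a higher acceptance rate α compounds across the window, so a stronger Draft model yields disproportionately more accepted tokens per Target pass.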
3. Advanced Techniques
3.1 Self-Speculative Decoding (Without a Draft Model)
Leverages the Target model's own Early Exit or Layer Skipping without a separate Draft model:
```python
import torch.nn as nn

# Layer Skip approach
class SelfSpeculativeModel(nn.Module):
    def draft_forward(self, x):
        """Fast draft using only the first 8 layers"""
        for layer in self.layers[:8]:
            x = layer(x)
        return self.lm_head(self.norm(x))

    def verify_forward(self, x):
        """Verification with all layers"""
        for layer in self.layers:
            x = layer(x)
        return self.lm_head(self.norm(x))
```
Advantages: No need to load a separate Draft model, saves memory
3.2 Medusa: Multi-Head Speculative Decoding
Instead of a Draft model, adds multiple LM Heads to simultaneously predict tokens at multiple positions:
Input → Hidden States → Target LM Head → t[n+1]
                      → Medusa Head 1  → t[n+2] (prediction)
                      → Medusa Head 2  → t[n+3] (prediction)
                      → Medusa Head 3  → t[n+4] (prediction)
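The extra heads can be sketched as small projection stacks on top of the shared hidden state. This is an illustrative simplification of the Medusa design (layer sizes, activation, and the `MedusaHeads` name are assumptions, not the reference implementation):

```python
import torch
import torch.nn as nn

class MedusaHeads(nn.Module):
    """Illustrative Medusa-style heads: each head predicts the token one
    step further ahead from the same last-position hidden state, so a
    single Target forward pass yields several candidate tokens."""
    def __init__(self, hidden_size: int, vocab_size: int, num_heads: int = 3):
        super().__init__()
        self.heads = nn.ModuleList([
            nn.Sequential(
                nn.Linear(hidden_size, hidden_size),  # per-head projection
                nn.SiLU(),
                nn.Linear(hidden_size, vocab_size),   # per-head LM head
            )
            for _ in range(num_heads)
        ])

    def forward(self, hidden: torch.Tensor) -> list:
        # hidden: [batch, hidden_size] from the last position
        return [head(hidden) for head in self.heads]  # one logits tensor per future position

heads = MedusaHeads(hidden_size=64, vocab_size=100)
logits = heads(torch.randn(2, 64))
print(len(logits), logits[0].shape)
```

The candidate tokens from these heads are then verified against the Target LM head's own distribution, just like Draft-model proposals in standard Speculative Decoding.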
3.3 Apple Mirror Speculative Decoding (2026)
Apple's latest research (2026.01). Addresses the serial verification bottleneck of existing Speculative Decoding:
- Mirror Model: A lightweight version of the Target model that performs Draft and Verify simultaneously
- Existing: Draft → Verify → Draft → Verify (serial)
- Mirror: Draft₁ + Verify₀ → Draft₂ + Verify₁ → ... (pipelined)
4. Using Speculative Decoding in vLLM
4.1 Configuration
```python
from vllm import LLM, SamplingParams

# Specify Draft model
llm = LLM(
    model="meta-llama/Llama-3.1-70B-Instruct",
    speculative_model="meta-llama/Llama-3.2-1B-Instruct",
    num_speculative_tokens=5,
    tensor_parallel_size=4,
    gpu_memory_utilization=0.9,
)

params = SamplingParams(temperature=0.7, max_tokens=512)
outputs = llm.generate(["Explain quantum computing:"], params)
```
4.2 Benchmark Script
```bash
# Standard decoding
python -m vllm.entrypoints.openai.api_server \
    --model meta-llama/Llama-3.1-70B-Instruct \
    --tensor-parallel-size 4

# Speculative Decoding
python -m vllm.entrypoints.openai.api_server \
    --model meta-llama/Llama-3.1-70B-Instruct \
    --speculative-model meta-llama/Llama-3.2-1B-Instruct \
    --num-speculative-tokens 5 \
    --tensor-parallel-size 4
```
4.3 Using with TensorRT-LLM
```python
import tensorrt_llm
from tensorrt_llm import BuildConfig

# Build Draft and Target models simultaneously
build_config = BuildConfig(
    max_batch_size=8,
    max_input_len=2048,
    max_seq_len=4096,
    speculative_decoding_mode="draft_tokens_external",
    max_draft_len=5,
)
```
5. Choosing the Optimal K Value
```python
import time

def find_optimal_k(draft_model, target_model, test_prompts, k_range=range(1, 11)):
    """Search for the optimal number of speculative tokens"""
    results = {}
    for k in k_range:
        start = time.time()
        total_tokens = 0
        for prompt in test_prompts:
            output = speculative_generate(  # any K-step speculative generation loop
                draft_model, target_model, prompt,
                num_speculative_tokens=k, max_tokens=256
            )
            total_tokens += len(output)
        elapsed = time.time() - start
        throughput = total_tokens / elapsed
        results[k] = throughput
        print(f"K={k}: {throughput:.1f} tokens/s")
    optimal_k = max(results, key=results.get)
    print(f"\nOptimal K = {optimal_k} ({results[optimal_k]:.1f} tokens/s)")
    return optimal_k
```
General guidelines:
- Stronger Draft (higher acceptance rate): Use larger K (7-10)
- Weaker Draft: Use smaller K (3-5)
- Code generation: K=5-7 (high acceptance rate due to repetitive patterns)
- Creative text: K=3-4 (lower acceptance rate due to high diversity)
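If you have estimates of the acceptance rate α and the Draft/Target cost ratio, K can also be chosen analytically before running an empirical sweep. This is a simplified cost model (the `cost_ratio` parameter and the helper names are assumptions for illustration; real deployments should still be benchmarked as above):

```python
# Simplified analytic model for choosing K from acceptance rate and draft cost.
def expected_speedup(alpha: float, k: int, cost_ratio: float) -> float:
    """Expected tokens per verify step, divided by the relative cost of
    K draft passes plus one target pass (cost_ratio = draft/target cost)."""
    expected_tokens = (1 - alpha ** (k + 1)) / (1 - alpha)
    return expected_tokens / (k * cost_ratio + 1)

def best_k(alpha: float, cost_ratio: float, k_max: int = 10) -> int:
    """Pick the K in [1, k_max] that maximizes the modeled speedup."""
    return max(range(1, k_max + 1),
               key=lambda k: expected_speedup(alpha, k, cost_ratio))

print(best_k(alpha=0.8, cost_ratio=0.02))  # strong draft favors larger K
print(best_k(alpha=0.4, cost_ratio=0.02))  # weak draft favors smaller K
```

The model reproduces the guideline above: as α falls, the marginal draft token is more likely to be rejected, so the optimum shifts toward smaller K.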
6. Practical Considerations
6.1 Draft Model Selection Criteria
- Same tokenizer: Different tokenizers cause token alignment issues
- Same family: Llama-1B → Llama-70B (same training data, high acceptance rate)
- Appropriate size ratio: 1/50 to 1/10 of Target (too large increases Draft overhead)
- Fast inference: Low latency is critical for the Draft model
6.2 Caveats in Batch Environments
Speculative Decoding becomes less effective as batch size increases:
- Synchronization issues due to varying acceptance rates across requests in a batch
- Additional computation is burdensome in already compute-bound batches
- Better suited for latency optimization than throughput optimization
7. Quiz
Q1. Why doesn't Speculative Decoding change the output distribution?
Thanks to Rejection Sampling. By accepting each draft token with probability $\min(1, p(x)/q(x))$ and resampling from the correction distribution $\mathrm{norm}(\max(0, p(x) - q(x)))$ on rejection, the final output distribution exactly equals the Target distribution $p(x)$.
Q2. Why is LLM inference Memory-Bandwidth Bound?
Generating one token requires reading all model weights from memory, but the actual computation (FLOPs) is small. Memory bandwidth is the bottleneck relative to GPU compute capacity. 70B FP16 = 140GB must be read for every token.
Q3. What are the pros and cons of Self-Speculative Decoding?
Pros: No separate Draft model needed, saves memory. Cons: Since only a subset of the Target model's layers is used, acceptance rate may be lower than with a dedicated Draft model.
Q4. Why is a K value that's too large inefficient?
The probability that the i-th draft token is even reached and accepted falls off geometrically (roughly $\alpha^i$), so later tokens in the speculation window are increasingly likely to be discarded. Since the cost of K forward passes through the Draft model is always incurred, Draft compute is wasted on rejected tokens.
Q5. Why does Speculative Decoding become less effective with large batch sizes?
(1) Synchronization overhead due to varying acceptance rates within a batch (2) Large batches are already compute-bound so GPU utilization is already high (3) Increased memory overhead from managing draft tokens.
Q6. How does the Medusa approach differ from standard Speculative Decoding?
Instead of a separate Draft model, multiple LM Heads are added to the Target model to simultaneously predict tokens at multiple positions in a single forward pass. No additional model loading required.