- 1. What is Quantization?
- 2. GPTQ (Accurate Post-Training Quantization)
- 3. AWQ (Activation-aware Weight Quantization)
- 4. GGUF (llama.cpp Format)
- 5. Using Quantized Models with vLLM
- 6. Benchmark Comparison
- 7. Practical Selection Guide
- 8. Quiz

1. What is Quantization?
Quantization is a technique that represents model weights at lower precision to reduce memory usage and speed up inference. Reducing FP16 (16-bit) weights to INT4 (4-bit) cuts weight memory by roughly 4x.
Types of Quantization
- PTQ (Post-Training Quantization): Quantization after training. GPTQ, AWQ, and GGUF all use PTQ
- QAT (Quantization-Aware Training): Simulates quantization during training
Memory by Bit Width
| Precision | 7B Model Memory | 70B Model Memory |
|---|---|---|
| FP32 | 28 GB | 280 GB |
| FP16 | 14 GB | 140 GB |
| INT8 | 7 GB | 70 GB |
| INT4 | 3.5 GB | 35 GB |
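The table values follow directly from bits per parameter; a quick sanity check in plain Python (weights only, ignoring KV cache and activations):

```python
# Bits per weight for each precision in the table above
BITS = {"FP32": 32, "FP16": 16, "INT8": 8, "INT4": 4}

def weight_memory_gb(num_params: float, precision: str) -> float:
    """Weights-only memory in GB for a given parameter count and precision."""
    return num_params * BITS[precision] / 8 / 1e9

for precision in BITS:
    print(f"7B @ {precision}: {weight_memory_gb(7e9, precision):.1f} GB")
# FP16 gives 14.0 GB and INT4 gives 3.5 GB, matching the table
```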
2. GPTQ (Accurate Post-Training Quantization)
GPTQ performs layer-wise quantization based on Optimal Brain Quantization (OBQ). It uses calibration data to minimize quantization error.
Running GPTQ Quantization
```python
from datasets import load_dataset
from optimum.gptq import GPTQQuantizer
from transformers import AutoModelForCausalLM, AutoTokenizer

# Load the FP16 model
model_id = "meta-llama/Llama-3.1-8B-Instruct"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id, torch_dtype="auto", device_map="auto"
)

# Prepare calibration data (raw text; the quantizer tokenizes it internally)
dataset = load_dataset("wikitext", "wikitext-2-raw-v1", split="train")
calibration_data = [t for t in dataset["text"] if t.strip()][:128]

# GPTQ quantization
quantizer = GPTQQuantizer(
    bits=4,
    group_size=128,
    desc_act=True,      # quantize columns in order of decreasing activation size
    damp_percent=0.01,  # Hessian dampening for numerical stability
    dataset=calibration_data,
)
quantized_model = quantizer.quantize_model(model, tokenizer)

# Save the quantized model
quantized_model.save_pretrained("./llama-3.1-8b-gptq-4bit")
tokenizer.save_pretrained("./llama-3.1-8b-gptq-4bit")
```
GPTQ Characteristics
- GPU-optimized kernels (ExLlama, Marlin)
- Requires calibration data (typically 128-256 samples)
- Smaller group_size means higher accuracy but slower speed
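The group_size tradeoff is easy to see with a toy quantizer. This sketch is plain absmax round-to-nearest (not GPTQ itself, which additionally corrects error using Hessian information), but the scale-per-group effect is the same:

```python
import random

def quantize_groupwise(weights, group_size, bits=4):
    """Absmax round-to-nearest quantization per group, then dequantize."""
    qmax = 2 ** (bits - 1) - 1  # 7 for signed INT4
    out = []
    for i in range(0, len(weights), group_size):
        group = weights[i:i + group_size]
        scale = max(abs(v) for v in group) / qmax  # one scale per group
        out.extend(round(v / scale) * scale for v in group)
    return out

def mse(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b)) / len(a)

random.seed(0)
w = [random.gauss(0, 1) for _ in range(1024)]
for gs in (32, 128, 1024):
    print(f"group_size={gs:4d}  MSE={mse(w, quantize_groupwise(w, gs)):.5f}")
# Smaller groups -> a tighter scale per group -> lower error, at the cost of
# storing more scales and doing more dequantization work at inference time.
```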
3. AWQ (Activation-aware Weight Quantization)
AWQ is a method that analyzes activation distributions to protect important weight channels. Instead of quantizing all weights equally, it scales weights of channels with large activation magnitudes to preserve precision.
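AWQ's core trick can be sketched with toy numbers (this is illustrative only, not the real algorithm, which searches for per-channel scales using calibration data): scaling a salient weight channel up by s and its input activation down by s leaves the output y = (w·s)(x/s) unchanged mathematically, but shrinks that channel's effective quantization error by s.

```python
import random

def quantize_absmax(ws, bits=4):
    """Round-to-nearest absmax quantization with one shared scale."""
    qmax = 2 ** (bits - 1) - 1
    step = max(abs(w) for w in ws) / qmax
    return [round(w / step) * step for w in ws]

def trial(rng, protect_scale):
    """Output error of one quantized dot product, with channel 0 salient."""
    w = [rng.gauss(0, 0.5) for _ in range(64)]
    x = [rng.gauss(0, 1) for _ in range(64)]
    x[0] = 50.0                       # channel 0 carries an outsized activation
    s = [protect_scale] + [1.0] * 63  # AWQ-style: scale only the salient channel
    wq = quantize_absmax([wi * si for wi, si in zip(w, s)])
    y_ref = sum(wi * xi for wi, xi in zip(w, x))
    y_q = sum(wqi * (xi / si) for wqi, xi, si in zip(wq, x, s))
    return abs(y_ref - y_q)

rng = random.Random(0)
plain = sum(trial(rng, 1.0) for _ in range(200)) / 200
protected = sum(trial(rng, 2.0) for _ in range(200)) / 200
print(f"mean output error, no protection:           {plain:.3f}")
print(f"mean output error, salient channel scaled:  {protected:.3f}")
```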
Running AWQ Quantization
```python
from awq import AutoAWQForCausalLM
from transformers import AutoTokenizer

model_id = "meta-llama/Llama-3.1-8B-Instruct"
quant_path = "./llama-3.1-8b-awq-4bit"

# Load model and tokenizer
model = AutoAWQForCausalLM.from_pretrained(model_id)
tokenizer = AutoTokenizer.from_pretrained(model_id)

# AWQ quantization config
quant_config = {
    "zero_point": True,   # asymmetric quantization with a zero point
    "q_group_size": 128,
    "w_bit": 4,
    "version": "GEMM",    # GEMM (batched) or GEMV (batch size 1)
}

# Run quantization (a default calibration set is loaded if none is given)
model.quantize(tokenizer, quant_config=quant_config)

# Save
model.save_quantized(quant_path)
tokenizer.save_pretrained(quant_path)
```
Key Differences Between AWQ and GPTQ
| Item | GPTQ | AWQ |
|---|---|---|
| Approach | Layer-wise error minimization | Activation-based channel protection |
| Calibration | 128+ samples required | Possible with fewer samples |
| Quantization Speed | Slow (several hours) | Fast (tens of minutes) |
| Inference Speed | Fast with ExLlama kernel | Fast with GEMM kernel |
| Accuracy | High | Similar or slightly better |
4. GGUF (llama.cpp Format)
GGUF is the model file format used by llama.cpp (successor to GGML); its quantized models support hybrid CPU + GPU inference by offloading part of the layers to the GPU.
GGUF Quantization Method Types
| Method | Bits | Description |
|---|---|---|
| Q2_K | 2.5 | Extreme compression, significant quality loss |
| Q4_K_M | 4.8 | Most popular balance point |
| Q5_K_M | 5.5 | High quality retention |
| Q6_K | 6.6 | Near FP16 quality |
| Q8_0 | 8.0 | Nearly lossless |
| IQ4_XS | 4.3 | Importance matrix based |
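Because the "Bits" column is effective bits per weight, file size can be estimated directly from it (rough estimate; ignores metadata and the few tensors left unquantized):

```python
# Effective bits per weight, taken from the table above
BPW = {"Q2_K": 2.5, "Q4_K_M": 4.8, "Q5_K_M": 5.5,
       "Q6_K": 6.6, "Q8_0": 8.0, "IQ4_XS": 4.3}

def gguf_size_gb(num_params: float, method: str) -> float:
    """Approximate GGUF file size in GB for a given quantization method."""
    return num_params * BPW[method] / 8 / 1e9

for m in ("Q4_K_M", "Q8_0"):
    print(f"8B model as {m}: ~{gguf_size_gb(8e9, m):.1f} GB")
```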
Running GGUF Conversion
```shell
# Build llama.cpp (current versions use CMake; binaries land in build/bin/)
git clone https://github.com/ggerganov/llama.cpp
cd llama.cpp
cmake -B build && cmake --build build -j$(nproc)

# Convert a HuggingFace model to GGUF (FP16, still unquantized)
python convert_hf_to_gguf.py \
    ../Llama-3.1-8B-Instruct \
    --outfile llama-3.1-8b-f16.gguf \
    --outtype f16

# Apply quantization
./build/bin/llama-quantize \
    llama-3.1-8b-f16.gguf \
    llama-3.1-8b-Q4_K_M.gguf \
    Q4_K_M

# More accurate quantization with an importance matrix (IQ series)
./build/bin/llama-imatrix \
    -m llama-3.1-8b-f16.gguf \
    -f calibration.txt \
    -o imatrix.dat
./build/bin/llama-quantize \
    --imatrix imatrix.dat \
    llama-3.1-8b-f16.gguf \
    llama-3.1-8b-IQ4_XS.gguf \
    IQ4_XS
```
5. Using Quantized Models with vLLM
```shell
# GPTQ model
vllm serve ./llama-3.1-8b-gptq-4bit \
    --quantization gptq \
    --max-model-len 8192

# AWQ model
vllm serve ./llama-3.1-8b-awq-4bit \
    --quantization awq \
    --max-model-len 8192

# GGUF model (vLLM 0.6+; pass the base model's tokenizer for a correct chat template)
vllm serve ./llama-3.1-8b-Q4_K_M.gguf \
    --tokenizer meta-llama/Llama-3.1-8B-Instruct \
    --max-model-len 8192
```
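Whichever variant is served, vLLM exposes the same OpenAI-compatible API. A minimal client sketch using only the standard library (port 8000 is vLLM's default; the model name must match the path passed to `vllm serve`):

```python
import json
import urllib.request

# OpenAI-compatible chat completion request against a local vLLM server
payload = {
    "model": "./llama-3.1-8b-awq-4bit",  # same path as given to `vllm serve`
    "messages": [
        {"role": "user", "content": "Summarize INT4 quantization in one sentence."}
    ],
    "max_tokens": 128,
}
req = urllib.request.Request(
    "http://localhost:8000/v1/chat/completions",
    data=json.dumps(payload).encode(),
    headers={"Content-Type": "application/json"},
)
try:
    with urllib.request.urlopen(req, timeout=10) as resp:
        print(json.load(resp)["choices"][0]["message"]["content"])
except OSError:
    print("vLLM server not reachable -- start it with `vllm serve` first")
```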
llama.cpp Server
```shell
# -ngl: number of layers to offload to GPU, -c: context length
./build/bin/llama-server \
    -m llama-3.1-8b-Q4_K_M.gguf \
    --host 0.0.0.0 \
    --port 8080 \
    -ngl 99 \
    -c 8192 \
    --n-predict 2048
```
6. Benchmark Comparison
Benchmark for 7B model (RTX 4090):
| Method | VRAM | tok/s (generation) | Perplexity | Quantization Time |
|---|---|---|---|---|
| FP16 | 14 GB | 95 | 5.12 | - |
| GPTQ-4bit | 4.2 GB | 145 | 5.28 | 2-4 hours |
| AWQ-4bit | 4.0 GB | 150 | 5.25 | 30 min |
| GGUF Q4_K_M | 4.5 GB | 130 | 5.30 | 5 min |
| GGUF Q5_K_M | 5.3 GB | 120 | 5.18 | 5 min |
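Reading the table as ratios makes the tradeoff concrete (all numbers taken from the FP16 and AWQ-4bit rows above):

```python
# FP16 vs AWQ-4bit rows from the benchmark table
fp16 = {"vram_gb": 14.0, "tok_s": 95, "ppl": 5.12}
awq4 = {"vram_gb": 4.0, "tok_s": 150, "ppl": 5.25}

compression = fp16["vram_gb"] / awq4["vram_gb"]           # 3.5x less VRAM
speedup = awq4["tok_s"] / fp16["tok_s"]                   # ~1.58x faster decode
ppl_increase = (awq4["ppl"] - fp16["ppl"]) / fp16["ppl"]  # ~2.5% worse perplexity
print(f"{compression:.1f}x VRAM reduction, {speedup:.2f}x throughput, "
      f"{ppl_increase:.1%} perplexity increase")
```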
7. Practical Selection Guide
- GPU-only serving (vLLM) → AWQ (fast quantization + good performance)
- GPU-only serving (accuracy-focused) → GPTQ (`desc_act=True`)
- CPU + GPU hybrid → GGUF Q4_K_M
- MacBook / CPU-only → GGUF Q4_K_M or Q5_K_M
- Extreme compression → GGUF IQ4_XS (with imatrix)
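The guide above condenses into a small lookup function (the profile labels here are my own shorthand, not an official taxonomy):

```python
def pick_quantization(hardware: str, priority: str = "balanced") -> str:
    """Map a deployment scenario to the format suggested in the guide above."""
    if hardware == "gpu":
        return "GPTQ (desc_act=True)" if priority == "accuracy" else "AWQ 4-bit"
    if hardware == "hybrid":   # partial GPU offload via llama.cpp
        return "GGUF Q4_K_M"
    if hardware == "cpu":      # MacBook / CPU-only machines
        return "GGUF Q4_K_M or Q5_K_M"
    if hardware == "minimal":  # squeeze into the smallest footprint
        return "GGUF IQ4_XS (with imatrix)"
    raise ValueError(f"unknown hardware profile: {hardware!r}")

print(pick_quantization("gpu"))              # AWQ 4-bit
print(pick_quantization("gpu", "accuracy"))  # GPTQ (desc_act=True)
```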
8. Quiz
Q1: Why is AWQ faster to quantize than GPTQ?
GPTQ computes the Hessian inverse for each layer and quantizes weights sequentially, which is computationally expensive. AWQ only analyzes activation distributions to identify important channels and applies scaling, enabling fast quantization without complex optimization processes.
Q2: What do the K and M in GGUF Q4_K_M stand for?
K: Refers to the K-quant method. It is a mixed quantization approach that applies different bit widths depending on the importance of each layer. M: Stands for Medium quality. There are S (Small), M (Medium), and L (Large) variants, where L allocates higher bits to more layers.
Q3: Should you choose GPTQ or AWQ for vLLM?
In most cases, AWQ is recommended:
- Quantization speed is much faster (30 minutes vs several hours)
- Inference performance is similar or slightly better
- The perplexity difference is negligible
However, if maximum accuracy is required, GPTQ (desc_act=True) may have a slight edge.