Deep Learning Model Quantization Complete Guide: Master INT8, INT4, GPTQ, AWQ, GGUF
- Author: Youngju Kim (@fjvbn20031)
Introduction
As deep learning models grow increasingly large, inference costs and memory requirements have skyrocketed. GPT-3 has 175B parameters, Llama 3 has 70B, and storing them in full FP32 precision requires 700GB and 280GB of memory respectively — completely infeasible for typical GPUs.
Model Quantization is the key technology to solve this problem. By compressing 32-bit floating-point (FP32) weights to 8-bit or 4-bit integers, it reduces memory by 4–8x and improves inference speed by 2–4x. Remarkably, quality loss is minimal.
In this guide, we thoroughly explore quantization from mathematical foundations to the latest techniques: GPTQ, AWQ, GGUF, and bitsandbytes.
1. Quantization Fundamentals: Understanding Number Representations
1.1 Floating-Point Formats
Understanding the floating-point formats used in modern deep learning is the starting point for quantization.
FP32 (Float32)
- Sign (1 bit) + Exponent (8 bits) + Mantissa (23 bits) = 32 bits total
- Range: approximately -3.4e38 to 3.4e38
- Precision: ~7 decimal digits
FP16 (Float16)
- Sign (1 bit) + Exponent (5 bits) + Mantissa (10 bits) = 16 bits total
- Range: -65504 to 65504 (much narrower than FP32)
- Precision: ~3 decimal digits
- Narrow range causes overflow/underflow during training; requires loss (gradient) scaling
BF16 (Brain Float16)
- Sign (1 bit) + Exponent (8 bits) + Mantissa (7 bits) = 16 bits total
- Maintains the same exponent range as FP32 while reducing mantissa bits
- Far lower overflow risk, making it safer for deep learning training
- Developed by Google Brain, natively supported on modern GPUs (A100, H100)
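The range gap is easy to see in code: a value just past FP16's 65504 maximum overflows to infinity, while BF16 still represents it, only coarsely:

```python
import torch

# FP16 overflows past 65504; BF16 keeps FP32's 8-bit exponent range
x = torch.tensor(70000.0)
print(x.to(torch.float16))   # inf (out of FP16 range)
print(x.to(torch.bfloat16))  # 70144 (coarse due to the 7-bit mantissa, but finite)
```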
import torch
import numpy as np
# Check memory size of each data type
x_fp32 = torch.tensor([1.5, -2.3, 0.7], dtype=torch.float32)
x_fp16 = torch.tensor([1.5, -2.3, 0.7], dtype=torch.float16)
x_bf16 = torch.tensor([1.5, -2.3, 0.7], dtype=torch.bfloat16)
print(f"FP32: {x_fp32.element_size()} bytes per element") # 4 bytes
print(f"FP16: {x_fp16.element_size()} bytes per element") # 2 bytes
print(f"BF16: {x_bf16.element_size()} bytes per element") # 2 bytes
# Memory calculation for a 7B parameter model
params = 7e9
fp32_memory_gb = params * 4 / 1e9
fp16_memory_gb = params * 2 / 1e9
int8_memory_gb = params * 1 / 1e9
int4_memory_gb = params * 0.5 / 1e9
print(f"\n7B Model Memory Requirements:")
print(f"FP32: {fp32_memory_gb:.1f} GB") # 28.0 GB
print(f"FP16: {fp16_memory_gb:.1f} GB") # 14.0 GB
print(f"INT8: {int8_memory_gb:.1f} GB") # 7.0 GB
print(f"INT4: {int4_memory_gb:.1f} GB") # 3.5 GB
1.2 Integer Representations
The core of quantization is mapping floating-point values to integers.
- INT8: -128 to 127 (signed) or 0 to 255 (unsigned)
- INT4: -8 to 7 (signed) or 0 to 15 (unsigned)
- INT2: -2 to 1 (signed) or 0 to 3 (unsigned)
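PyTorch exposes these ranges directly via torch.iinfo, a handy sanity check when writing quantization code:

```python
import torch

for dtype in (torch.int8, torch.uint8, torch.int32):
    info = torch.iinfo(dtype)
    print(dtype, info.min, info.max)
```

Note that standard PyTorch tensors have no plain 4-bit integer dtype, which is why INT4 schemes typically pack two values into one int8 byte.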
1.3 Quantization Formula
The fundamental formula to convert a floating-point value x to integer q:
q = clamp(round(x / scale) + zero_point, q_min, q_max)
Dequantization:
x_approx = scale * (q - zero_point)
Where:
- scale: quantization scale factor (scale = (max_val - min_val) / (q_max - q_min))
- zero_point: the offset representing which real value integer 0 corresponds to
- q_min, q_max: integer range bounds (-128, 127 for INT8)
import torch
import numpy as np
def symmetric_quantize(x: torch.Tensor, num_bits: int = 8):
"""Symmetric quantization implementation"""
q_max = 2 ** (num_bits - 1) - 1 # 127 for INT8
q_min = -q_max # -127
# Compute scale
max_abs = x.abs().max()
scale = max_abs / q_max
# Quantize
q = torch.clamp(torch.round(x / scale), q_min, q_max).to(torch.int8)
return q, scale
def asymmetric_quantize(x: torch.Tensor, num_bits: int = 8):
"""Asymmetric quantization implementation"""
q_max = 2 ** num_bits - 1 # 255 for UINT8
q_min = 0
# Compute scale and zero_point
min_val = x.min()
max_val = x.max()
scale = (max_val - min_val) / (q_max - q_min)
zero_point = q_min - torch.round(min_val / scale)
zero_point = torch.clamp(zero_point, q_min, q_max).to(torch.int32)
# Quantize
q = torch.clamp(torch.round(x / scale) + zero_point, q_min, q_max).to(torch.uint8)
return q, scale, zero_point
def dequantize(q: torch.Tensor, scale: torch.Tensor, zero_point: torch.Tensor = None):
"""Dequantization"""
if zero_point is None:
return scale * q.float()
return scale * (q.float() - zero_point.float())
# Test
x = torch.randn(100)
print(f"Original data range: [{x.min():.4f}, {x.max():.4f}]")
# Symmetric quantization
q_sym, scale_sym = symmetric_quantize(x)
x_reconstructed_sym = dequantize(q_sym, scale_sym)
error_sym = (x - x_reconstructed_sym).abs().mean()
print(f"Symmetric quantization mean error: {error_sym:.6f}")
# Asymmetric quantization
q_asym, scale_asym, zp_asym = asymmetric_quantize(x)
x_reconstructed_asym = dequantize(q_asym, scale_asym, zp_asym)
error_asym = (x - x_reconstructed_asym).abs().mean()
print(f"Asymmetric quantization mean error: {error_asym:.6f}")
1.4 Symmetric vs Asymmetric Quantization
Symmetric Quantization
- zero_point = 0
- Symmetric positive/negative range
- Suitable for weights (mostly zero-centered distribution)
- Simpler computation: x_approx = scale * q
Asymmetric Quantization
- zero_point != 0
- Can represent arbitrary ranges
- Suitable for activations (e.g., non-negative distributions after ReLU)
- More complex computation: x_approx = scale * (q - zero_point)
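A quick experiment shows why asymmetric quantization suits ReLU-style activations: on non-negative data, symmetric INT8 never uses its negative codes, roughly doubling the error. This sketch is self-contained (it inlines both schemes rather than reusing the functions above):

```python
import torch

torch.manual_seed(0)
x = torch.relu(torch.randn(10_000))  # non-negative, like post-ReLU activations

# Symmetric INT8: half the range (the negative codes) is never used on this data
scale_sym = x.abs().max() / 127
x_sym = torch.round(x / scale_sym).clamp(-127, 127) * scale_sym
err_sym = (x - x_sym).abs().mean()

# Asymmetric UINT8: maps [min, max] onto the full [0, 255] range
scale_asym = (x.max() - x.min()) / 255
zp = torch.round(-x.min() / scale_asym)
x_asym = ((torch.round(x / scale_asym) + zp).clamp(0, 255) - zp) * scale_asym
err_asym = (x - x_asym).abs().mean()

print(f"symmetric err: {err_sym:.6f}, asymmetric err: {err_asym:.6f}")
```

The asymmetric step size is about half the symmetric one here, so the mean error is roughly halved.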
1.5 Quantization Granularity
Determines how many parameters share a single scale/zero_point.
Per-Tensor: One scale for the entire tensor
- Minimal memory overhead
- Largest precision loss
Per-Channel (Per-Row/Column): Individual scale per channel
- Separate scale for each row/column of the weight matrix
- Effectively handles distribution differences across channels
Per-Group (Per-Block): Individual scale per fixed-size group
- Typical group_size = 128
- Compromise between per-channel and per-tensor
- Commonly used in GPTQ and AWQ
import torch
def per_group_quantize(weight: torch.Tensor, group_size: int = 128, num_bits: int = 4):
"""Per-Group quantization implementation"""
rows, cols = weight.shape
# Split into groups (assumes the element count is divisible by group_size)
weight_grouped = weight.reshape(-1, group_size)
# Max/min per group
max_vals = weight_grouped.max(dim=1, keepdim=True)[0]
min_vals = weight_grouped.min(dim=1, keepdim=True)[0]
q_max = 2 ** num_bits - 1 # 15 for INT4
# Compute scales (clamped to avoid division by zero in constant groups)
scales = (max_vals - min_vals) / q_max
scales = torch.clamp(scales, min=1e-8)
zero_points = torch.round(-min_vals / scales)
# Quantize
q = torch.clamp(torch.round(weight_grouped / scales) + zero_points, 0, q_max)
# Dequantize
weight_dequant = scales * (q - zero_points)
weight_dequant = weight_dequant.reshape(rows, cols)
return q, scales, zero_points, weight_dequant
# Example: Transformer weight quantization
weight = torch.randn(4096, 4096) # Llama-style weight
q, scales, zp, weight_dequant = per_group_quantize(weight, group_size=128, num_bits=4)
error = (weight - weight_dequant).abs().mean()
print(f"Per-Group INT4 quantization mean error: {error:.6f}")
print(f"Compression ratio: {weight.element_size() * weight.numel() / (q.numel() / 2 + scales.numel() * 4):.2f}x")
2. Post-Training Quantization (PTQ)
PTQ quantizes an already-trained model without retraining — the most practical approach and most widely used.
2.1 Calibration Dataset
PTQ uses a small calibration dataset to determine appropriate scale and zero_point values.
from transformers import AutoModelForCausalLM, AutoTokenizer
import torch
from datasets import load_dataset
def collect_calibration_data(model_name: str, num_samples: int = 128):
"""Collect calibration data"""
tokenizer = AutoTokenizer.from_pretrained(model_name)
# WikiText-2 or C4 dataset is commonly used
dataset = load_dataset("wikitext", "wikitext-2-raw-v1", split="train")
texts = []
for item in dataset:
if len(item['text'].strip()) > 100:
texts.append(item['text'].strip())
if len(texts) >= num_samples:
break
# Tokenize
encoded = [
tokenizer(text, return_tensors="pt", max_length=2048, truncation=True)
for text in texts
]
return encoded
def collect_activation_stats(model, calibration_data, layer_name: str):
"""Collect activation statistics for a specific layer"""
stats = {"min": float("inf"), "max": float("-inf")}
def hook_fn(module, input, output):
with torch.no_grad():
act = output.detach().float()
stats["min"] = min(stats["min"], act.min().item())
stats["max"] = max(stats["max"], act.max().item())
# Register hook
target_layer = dict(model.named_modules())[layer_name]
handle = target_layer.register_forward_hook(hook_fn)
# Run calibration data
model.eval()
with torch.no_grad():
for batch in calibration_data[:32]:
model(**batch)
handle.remove()
return stats
2.2 Min-Max Calibration
The simplest method: uses the global minimum and maximum values from calibration data.
class MinMaxCalibrator:
"""Min-Max calibrator"""
def __init__(self):
self.min_val = float("inf")
self.max_val = float("-inf")
def update(self, tensor: torch.Tensor):
self.min_val = min(self.min_val, tensor.min().item())
self.max_val = max(self.max_val, tensor.max().item())
def compute_scale_zp(self, num_bits: int = 8, symmetric: bool = True):
q_max = 2 ** (num_bits - 1) - 1 if symmetric else 2 ** num_bits - 1
if symmetric:
max_abs = max(abs(self.min_val), abs(self.max_val))
scale = max_abs / q_max
zero_point = 0
else:
scale = (self.max_val - self.min_val) / q_max
zero_point = -round(self.min_val / scale)
return scale, zero_point
2.3 Histogram Calibration
To reduce the impact of outliers, finds the optimal range based on the distribution histogram.
import numpy as np
from scipy import stats
class HistogramCalibrator:
"""Histogram-based calibrator (minimizes KL Divergence)"""
def __init__(self, num_bins: int = 2048):
self.num_bins = num_bins
self.histogram = None
self.bin_edges = None
def update(self, tensor: torch.Tensor):
data = tensor.detach().float().numpy().flatten()
if self.histogram is None:
self.histogram, self.bin_edges = np.histogram(data, bins=self.num_bins)
else:
new_hist, _ = np.histogram(data, bins=self.bin_edges)
self.histogram += new_hist
def compute_optimal_range(self, num_bits: int = 8):
"""Search for optimal range minimizing KL Divergence"""
num_quantized_bins = 2 ** num_bits - 1
best_kl = float("inf")
best_threshold = None
for i in range(num_quantized_bins, len(self.histogram)):
reference = self.histogram[:i].copy().astype(float)
reference /= reference.sum()
quantized = np.zeros(i)
bin_size = i / num_quantized_bins
for j in range(num_quantized_bins):
start = int(j * bin_size)
end = int((j + 1) * bin_size)
quantized[start:end] = reference[start:end].sum() / (end - start)
quantized = np.where(quantized == 0, 1e-10, quantized)
reference_clipped = np.where(reference == 0, 1e-10, reference)
kl = stats.entropy(reference_clipped, quantized)
if kl < best_kl:
best_kl = kl
best_threshold = self.bin_edges[i]
return -best_threshold, best_threshold
2.4 Impact on Perplexity
The most common metric for measuring quantization quality is Perplexity (PPL).
import torch
import math
from transformers import AutoModelForCausalLM, AutoTokenizer
def compute_perplexity(model, tokenizer, text: str, device: str = "cuda"):
"""Compute perplexity"""
encodings = tokenizer(text, return_tensors="pt")
input_ids = encodings.input_ids.to(device)
max_length = 1024
stride = 512
nlls = []
prev_end_loc = 0
for begin_loc in range(0, input_ids.size(1), stride):
end_loc = min(begin_loc + max_length, input_ids.size(1))
trg_len = end_loc - prev_end_loc
input_ids_chunk = input_ids[:, begin_loc:end_loc]
target_ids = input_ids_chunk.clone()
target_ids[:, :-trg_len] = -100
with torch.no_grad():
outputs = model(input_ids_chunk, labels=target_ids)
neg_log_likelihood = outputs.loss
nlls.append(neg_log_likelihood)
prev_end_loc = end_loc
if end_loc == input_ids.size(1):
break
ppl = torch.exp(torch.stack(nlls).mean())
return ppl.item()
# Example PPL comparison
# FP16: PPL ~5.68
# INT8: PPL ~5.71 (~0.5% increase)
# INT4 (GPTQ): PPL ~5.89 (~3.7% increase)
# INT4 (naive): PPL ~6.52 (~14.8% increase)
3. Quantization-Aware Training (QAT)
QAT simulates quantization during training so the model adapts to quantization noise.
3.1 Fake Quantization
Simulates quantization effects in FP32 instead of actual INT8 operations.
import torch
import torch.nn as nn
import torch.nn.functional as F
class FakeQuantize(nn.Module):
"""Fake quantization module"""
def __init__(self, num_bits: int = 8, symmetric: bool = True):
super().__init__()
self.num_bits = num_bits
self.symmetric = symmetric
self.register_buffer('scale', torch.tensor(1.0))
self.register_buffer('zero_point', torch.tensor(0))
self.register_buffer('fake_quant_enabled', torch.tensor(1))
if symmetric:
self.q_min = -(2 ** (num_bits - 1))
self.q_max = 2 ** (num_bits - 1) - 1
else:
self.q_min = 0
self.q_max = 2 ** num_bits - 1
def forward(self, x: torch.Tensor) -> torch.Tensor:
if self.fake_quant_enabled.item() == 0:
return x
# Update scale with exponential moving average during training
if self.training:
with torch.no_grad():
if self.symmetric:
max_abs = x.abs().max()
new_scale = max_abs / self.q_max
else:
new_scale = (x.max() - x.min()) / (self.q_max - self.q_min)
self.scale.copy_(0.9 * self.scale + 0.1 * new_scale)
# Fake quantize: quantize then dequantize
x_scaled = x / self.scale
x_clipped = torch.clamp(x_scaled, self.q_min, self.q_max)
x_rounded = torch.round(x_clipped)
x_dequant = x_rounded * self.scale
return x_dequant
3.2 STE (Straight-Through Estimator)
class STERound(torch.autograd.Function):
"""Straight-Through Estimator for round()"""
@staticmethod
def forward(ctx, x):
return torch.round(x)
@staticmethod
def backward(ctx, grad_output):
# Pass gradient through round() unchanged (identity approximation)
return grad_output
class STEClamp(torch.autograd.Function):
"""Straight-Through Estimator for clamp()"""
@staticmethod
def forward(ctx, x, min_val, max_val):
ctx.save_for_backward(x)
ctx.min_val = min_val
ctx.max_val = max_val
return torch.clamp(x, min_val, max_val)
@staticmethod
def backward(ctx, grad_output):
x, = ctx.saved_tensors
# Pass gradient only within clamp range
grad = grad_output * ((x >= ctx.min_val) & (x <= ctx.max_val)).float()
return grad, None, None
class QATLinear(nn.Module):
"""Linear layer with QAT applied"""
def __init__(self, in_features, out_features, num_bits=8):
super().__init__()
self.linear = nn.Linear(in_features, out_features)
self.weight_fake_quant = FakeQuantize(num_bits=num_bits)
self.act_fake_quant = FakeQuantize(num_bits=num_bits, symmetric=False)
def forward(self, x):
# Activation quantization
x_q = self.act_fake_quant(x)
# Weight quantization
w_q = self.weight_fake_quant(self.linear.weight)
# FP32 compute (INT8 in actual deployment)
return F.linear(x_q, w_q, self.linear.bias)
3.3 When is QAT Needed?
- When PTQ quality loss is too high: Especially effective for small models (BERT-small, etc.)
- Quantizing to INT4 or lower: Essential for extreme compression
- Precision-sensitive tasks: Object detection, ASR, etc.
# QAT training workflow
import torch.optim as optim
from torch.quantization import prepare_qat, convert
def train_qat_model(model, train_loader, num_epochs=10):
"""QAT training example"""
model.qconfig = torch.quantization.get_default_qat_qconfig('fbgemm')
model_prepared = prepare_qat(model.train())
optimizer = optim.Adam(model_prepared.parameters(), lr=1e-5)
for epoch in range(num_epochs):
for batch in train_loader:
inputs, labels = batch
outputs = model_prepared(inputs)
loss = F.cross_entropy(outputs, labels)
optimizer.zero_grad()
loss.backward()
optimizer.step()
# Convert to INT8 model
model_prepared.eval()
model_quantized = convert(model_prepared)
return model_quantized
4. PyTorch Quantization API
4.1 torch.ao.quantization
PyTorch's official quantization API.
import torch
from torch.ao.quantization import (
get_default_qconfig,
get_default_qat_qconfig,
prepare,
prepare_qat,
convert
)
# Static quantization (PTQ)
def static_quantization_example():
"""Static quantization example"""
model = MyModel()
model.eval()
# Backend config (fbgemm: x86, qnnpack: ARM)
model.qconfig = get_default_qconfig('fbgemm')
# Prepare for calibration
model_prepared = prepare(model)
# Collect statistics from calibration data
with torch.no_grad():
for data in calibration_loader:
model_prepared(data)
# Convert to INT8 model
model_quantized = convert(model_prepared)
return model_quantized
# Dynamic quantization (effective for LSTM, Linear)
def dynamic_quantization_example():
"""Dynamic quantization example"""
model = MyModel()
model_quantized = torch.quantization.quantize_dynamic(
model,
{nn.Linear, nn.LSTM}, # Layer types to quantize
dtype=torch.qint8
)
return model_quantized
4.2 FX Graph Mode Quantization
A more flexible and powerful quantization approach.
from torch.ao.quantization.quantize_fx import prepare_fx, convert_fx
from torch.ao.quantization import QConfigMapping
def fx_quantization_example(model, calibration_data):
"""FX Graph Mode quantization"""
model.eval()
qconfig_mapping = QConfigMapping().set_global(
get_default_qconfig('fbgemm')
)
example_inputs = (torch.randn(1, 3, 224, 224),)
model_prepared = prepare_fx(
model,
qconfig_mapping,
example_inputs
)
with torch.no_grad():
for batch in calibration_data:
model_prepared(batch)
model_quantized = convert_fx(model_prepared)
return model_quantized
5. GPTQ: Accurate Post-Training Quantization
GPTQ, published in 2022, is an LLM-specific quantization algorithm that minimizes quality loss even at INT4. (arXiv:2210.17323)
5.1 GPTQ Algorithm Principles
GPTQ is based on OBQ (Optimal Brain Quantization). The core idea: quantize weights layer by layer sequentially, then compensate for quantization errors in the already-quantized weights by updating the remaining weights.
OBQ error minimization objective:
argmin_Q ||WX - QX||_F^2
Where W is the original weight, Q is the quantized weight, and X is the input activation.
Hessian-based weight update:
After quantizing each weight, the resulting error is propagated to the remaining weights using the inverse Hessian H^(-1).
import torch
import math
def gptq_quantize_weight(weight: torch.Tensor,
hessian: torch.Tensor,
num_bits: int = 4,
group_size: int = 128,
damp_percent: float = 0.01):
"""
Quantize weights using the GPTQ algorithm
Args:
weight: [out_features, in_features] weight matrix
hessian: [in_features, in_features] Hessian matrix (H = 2 * X @ X.T)
num_bits: quantization bit count
group_size: group size
damp_percent: damping ratio for Hessian stabilization
"""
W = weight.clone().float()
n_rows, n_cols = W.shape
# Hessian damping (numerical stability)
H = hessian.clone().float()
dead_cols = torch.diag(H) == 0
H[dead_cols, dead_cols] = 1
W[:, dead_cols] = 0
damp = damp_percent * H.diag().mean()
H.diagonal().add_(damp)
# Inverse Hessian via Cholesky decomposition
H_inv = torch.linalg.cholesky(H)
H_inv = torch.cholesky_inverse(H_inv)
H_inv = torch.linalg.cholesky(H_inv, upper=True)
Q = torch.zeros_like(W)
Losses = torch.zeros_like(W)
q_max = 2 ** (num_bits - 1) - 1
for col_idx in range(n_cols):
w_col = W[:, col_idx]
h_inv_diag = H_inv[col_idx, col_idx]
# Compute per-group scale
if group_size != -1 and col_idx % group_size == 0:
group_end = min(col_idx + group_size, n_cols)
w_group = W[:, col_idx:group_end]
max_abs = w_group.abs().max(dim=1)[0].unsqueeze(1)
scale = max_abs / q_max
scale = torch.clamp(scale, min=1e-8)
# Quantize
q_col = torch.clamp(torch.round(w_col / scale.squeeze()), -q_max, q_max)
q_col = q_col * scale.squeeze()
Q[:, col_idx] = q_col
# Quantization error
err = (w_col - q_col) / h_inv_diag
Losses[:, col_idx] = err ** 2 / 2
# Propagate error to remaining weights (the key step!)
W[:, col_idx + 1:] -= err.unsqueeze(1) * H_inv[col_idx, col_idx + 1:].unsqueeze(0)
return Q, Losses
5.2 Using AutoGPTQ
Practical GPTQ quantization uses the AutoGPTQ library.
from auto_gptq import AutoGPTQForCausalLM, BaseQuantizeConfig
from transformers import AutoTokenizer
import torch
def quantize_with_gptq(
model_name: str,
output_dir: str,
bits: int = 4,
group_size: int = 128
):
"""Quantize model with AutoGPTQ"""
tokenizer = AutoTokenizer.from_pretrained(model_name, use_fast=True)
quantize_config = BaseQuantizeConfig(
bits=bits,
group_size=group_size,
damp_percent=0.01,
desc_act=False,
sym=True,
true_sequential=True
)
model = AutoGPTQForCausalLM.from_pretrained(
model_name,
quantize_config=quantize_config
)
# Prepare calibration data
from datasets import load_dataset
dataset = load_dataset("wikitext", "wikitext-2-raw-v1", split="train")
calibration_data = []
for text in dataset["text"][:128]:
if len(text.strip()) > 50:
encoded = tokenizer(
text.strip(),
return_tensors="pt",
max_length=2048,
truncation=True
)
calibration_data.append(encoded["input_ids"].squeeze())
print(f"Starting GPTQ {bits}bit quantization...")
model.quantize(calibration_data)
model.save_quantized(output_dir, use_safetensors=True)
tokenizer.save_pretrained(output_dir)
print(f"Quantization complete: {output_dir}")
return model, tokenizer
def load_gptq_model(model_dir: str, device: str = "cuda"):
"""Load GPTQ quantized model"""
model = AutoGPTQForCausalLM.from_quantized(
model_dir,
device=device,
use_triton=False,
disable_exllama=False,
inject_fused_attention=True,
inject_fused_mlp=True
)
tokenizer = AutoTokenizer.from_pretrained(model_dir)
return model, tokenizer
6. AWQ: Activation-aware Weight Quantization
AWQ, published in 2023, analyzes activation distributions to protect important weight channels. (arXiv:2306.00978)
6.1 Differences from GPTQ
| Feature | GPTQ | AWQ |
|---|---|---|
| Approach | Hessian-based error compensation | Activation-based scaling |
| Calibration data | Required (128+ samples) | Required (32+ samples) |
| Speed | Slow (1–4 hours) | Fast (tens of minutes) |
| Quality | Excellent | Excellent (comparable or better) |
| Key feature | Per-channel optimization | Activation outlier handling |
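AWQ's activation-awareness can be sketched with a toy experiment (synthetic data, illustrative shapes only, with W laid out as [in, out]): weights on input channels with large activations dominate the output error, so protecting just those few channels recovers most of the accuracy. AWQ then converts this observation into an equivalent per-channel scaling so no mixed-precision kernel is needed:

```python
import torch

torch.manual_seed(0)
W = torch.randn(256, 256)          # toy weight, rows = input channels
x_mag = torch.rand(256) * 0.1      # typical per-channel activation magnitude
x_mag[:8] = 10.0                   # ~3% "salient" channels with large activations

def int4_roundtrip(w):
    # per-tensor symmetric INT4 quantize-dequantize
    scale = w.abs().max() / 7
    return torch.round(w / scale).clamp(-7, 7) * scale

# Output-error proxy: each weight row is weighted by its channel's activation size
err_naive = ((W - int4_roundtrip(W)) * x_mag.unsqueeze(1)).norm()

# Protect only the salient rows (keep them in full precision)
W_mixed = int4_roundtrip(W)
W_mixed[:8] = W[:8]
err_protected = ((W - W_mixed) * x_mag.unsqueeze(1)).norm()

print(err_naive.item(), err_protected.item())
```

Protecting 3% of channels removes the dominant error term, which is exactly the effect AWQ reproduces with scaling.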
6.2 Using AutoAWQ
from awq import AutoAWQForCausalLM
from transformers import AutoTokenizer
def quantize_with_awq(
model_name: str,
output_dir: str,
bits: int = 4,
group_size: int = 128
):
"""Quantize model with AutoAWQ"""
tokenizer = AutoTokenizer.from_pretrained(
model_name,
trust_remote_code=True
)
model = AutoAWQForCausalLM.from_pretrained(
model_name,
low_cpu_mem_usage=True,
use_cache=False
)
quant_config = {
"zero_point": True,
"q_group_size": group_size,
"w_bit": bits,
"version": "GEMM"
}
print(f"Starting AWQ {bits}bit quantization...")
model.quantize(tokenizer, quant_config=quant_config)
model.save_quantized(output_dir)
tokenizer.save_pretrained(output_dir)
print(f"AWQ quantization complete: {output_dir}")
return model
def load_awq_model(model_dir: str, device: str = "cuda"):
"""Load AWQ quantized model"""
model = AutoAWQForCausalLM.from_quantized(
model_dir,
fuse_layers=True,
trust_remote_code=True,
safetensors=True
)
tokenizer = AutoTokenizer.from_pretrained(model_dir)
return model, tokenizer
7. GGUF/GGML: The llama.cpp Ecosystem
GGUF (GPT-Generated Unified Format) is the model format for the llama.cpp project, enabling efficient LLM execution even on CPU.
7.1 Understanding GGUF
GGUF was introduced in 2023 to replace GGML. It stores model metadata, hyperparameters, and tokenizer information in a single file.
7.2 Quantization Levels Comparison
| Format | Bits | Memory (7B) | PPL Increase | Recommended Use |
|---|---|---|---|---|
| Q2_K | 2.6 | 2.8 GB | High | Extreme compression |
| Q3_K_S | 3.0 | 3.3 GB | Medium | Memory saving |
| Q4_0 | 4.0 | 3.8 GB | Low | Balanced |
| Q4_K_M | 4.1 | 4.1 GB | Very low | General recommendation |
| Q5_0 | 5.0 | 4.7 GB | Minimal | High quality |
| Q5_K_M | 5.1 | 4.8 GB | Minimal | High quality recommended |
| Q6_K | 6.0 | 5.5 GB | Nearly none | Near FP16 |
| Q8_0 | 8.0 | 7.2 GB | None | Reference use |
| F16 | 16.0 | 13.5 GB | None | Baseline |
K-quants (Q4_K_M, Q5_K_M, etc.) keep some layers at higher precision to improve quality.
7.3 Building and Using llama.cpp
# Clone and build llama.cpp
git clone https://github.com/ggerganov/llama.cpp
cd llama.cpp
# Build with CUDA support
cmake -B build -DGGML_CUDA=ON
cmake --build build --config Release -j $(nproc)
# CPU-only build
cmake -B build
cmake --build build --config Release -j $(nproc)
# Convert HuggingFace model to GGUF
python convert_hf_to_gguf.py \
meta-llama/Llama-2-7b-hf \
--outfile llama2-7b-f16.gguf \
--outtype f16
# Quantize to Q4_K_M
./build/bin/llama-quantize \
llama2-7b-f16.gguf \
llama2-7b-q4_k_m.gguf \
Q4_K_M
# Run inference
./build/bin/llama-cli \
-m llama2-7b-q4_k_m.gguf \
-p "The future of AI is" \
-n 100 \
--ctx-size 4096 \
--threads 8 \
--n-gpu-layers 35
7.4 Python Bindings (llama-cpp-python)
from llama_cpp import Llama
# Load model
llm = Llama(
model_path="./llama2-7b-q4_k_m.gguf",
n_ctx=4096,
n_gpu_layers=35,
n_threads=8,
verbose=False
)
# Text generation
output = llm(
"Once upon a time",
max_tokens=200,
temperature=0.7,
top_p=0.9,
stop=["</s>", "\n\n"]
)
print(output["choices"][0]["text"])
# Chat completion format
response = llm.create_chat_completion(
messages=[
{"role": "system", "content": "You are a helpful assistant."},
{"role": "user", "content": "What is machine learning?"}
],
max_tokens=500,
temperature=0.7
)
print(response["choices"][0]["message"]["content"])
# Streaming output
for chunk in llm.create_chat_completion(
messages=[{"role": "user", "content": "Tell me a joke"}],
stream=True
):
delta = chunk["choices"][0].get("delta", {})
if "content" in delta:
print(delta["content"], end="", flush=True)
8. bitsandbytes: LLM Quantization Library
bitsandbytes, developed by Tim Dettmers, integrates seamlessly with HuggingFace transformers.
8.1 LLM.int8() — 8-bit Mixed Precision
LLM.int8() handles activation outliers in FP16 during matrix multiplication while using INT8 for the rest.
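A simplified sketch of the decomposition (toy shapes, naive per-tensor INT8 in place of the library's vector-wise scheme): outlier activation columns run in floating point, everything else through INT8, and the two partial products are summed:

```python
import torch

torch.manual_seed(0)
X = torch.randn(4, 64)
X[:, 0] = 50.0                          # one outlier activation channel
W = torch.randn(64, 64)                 # [in, out]
outlier = X.abs().max(dim=0)[0] > 6.0   # 6.0 is LLM.int8()'s default threshold

def int8_matmul(X, W):
    # naive per-tensor symmetric INT8 round-trip on both operands
    sx, sw = X.abs().max() / 127, W.abs().max() / 127
    Xq = torch.round(X / sx).clamp(-127, 127)
    Wq = torch.round(W / sw).clamp(-127, 127)
    return (Xq @ Wq) * sx * sw

ref = X @ W
naive = int8_matmul(X, W)
# decomposition: FP matmul for outlier columns, INT8 for the rest
mixed = X[:, outlier] @ W[outlier] + int8_matmul(X[:, ~outlier], W[~outlier])

err_naive_mm = (naive - ref).abs().mean()
err_mixed_mm = (mixed - ref).abs().mean()
print(err_naive_mm.item(), err_mixed_mm.item())
```

In the naive version the single outlier inflates the activation scale and swamps all normal values; the decomposition avoids that.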
from transformers import AutoModelForCausalLM, AutoTokenizer
import torch
# Load INT8 model
model_8bit = AutoModelForCausalLM.from_pretrained(
"meta-llama/Llama-2-7b-hf",
load_in_8bit=True,
device_map="auto"
)
def print_model_size(model, label):
"""Print model memory usage"""
total_params = sum(p.numel() for p in model.parameters())
total_bytes = sum(
p.numel() * p.element_size() for p in model.parameters()
)
print(f"{label}: {total_params/1e9:.2f}B params, {total_bytes/1e9:.2f} GB")
print_model_size(model_8bit, "INT8 model")
# INT8 model: 6.74B params, ~7.0 GB
8.2 4-bit Quantization (Used in QLoRA)
import bitsandbytes as bnb
from transformers import BitsAndBytesConfig
# NF4 quantization config (QLoRA)
bnb_config_nf4 = BitsAndBytesConfig(
load_in_4bit=True,
bnb_4bit_quant_type="nf4",
bnb_4bit_compute_dtype=torch.bfloat16,
bnb_4bit_use_double_quant=True, # Double quantization
)
# FP4 quantization config
bnb_config_fp4 = BitsAndBytesConfig(
load_in_4bit=True,
bnb_4bit_quant_type="fp4",
bnb_4bit_compute_dtype=torch.float16,
)
# Load model
model_4bit = AutoModelForCausalLM.from_pretrained(
"meta-llama/Llama-2-7b-hf",
quantization_config=bnb_config_nf4,
device_map="auto"
)
print_model_size(model_4bit, "NF4 model")
# NF4 model: 6.74B params, ~4.0 GB (with double quantization)
# QLoRA fine-tuning setup
from peft import LoraConfig, get_peft_model, prepare_model_for_kbit_training
model_4bit = prepare_model_for_kbit_training(model_4bit)
lora_config = LoraConfig(
r=64,
lora_alpha=16,
target_modules=["q_proj", "v_proj"],
lora_dropout=0.05,
bias="none",
task_type="CAUSAL_LM"
)
model_lora = get_peft_model(model_4bit, lora_config)
model_lora.print_trainable_parameters()
# prints the trainable parameter count and percentage (only the LoRA adapter weights are trainable)
8.3 NF4 vs FP4
NF4 (Normal Float 4)
- Non-linear 4-bit quantization assuming a normal distribution
- Leverages the observation that weight distributions are approximately normal
- Better representational power at the same bit count
FP4 (Float 4)
- Floating-point based 4-bit
- Can represent wider ranges
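The benefit of normal-shaped levels can be sketched numerically. The real NF4 codebook is a fixed 16-value table from the QLoRA paper; here we only approximate the idea with equal-probability-mass quantiles of a standard normal and compare against uniform INT4-style levels on Gaussian weights:

```python
import torch

torch.manual_seed(0)
w = torch.randn(100_000)  # LLM weights are approximately normally distributed

def quantize_to_levels(x, levels):
    # nearest-codebook quantization after absmax scaling into [-1, 1]
    absmax = x.abs().max()
    idx = (x.unsqueeze(1) / absmax - levels.unsqueeze(0)).abs().argmin(dim=1)
    return levels[idx] * absmax

# INT4-style: 16 uniformly spaced levels
uniform_levels = torch.linspace(-1, 1, 16)

# NF4-style: 16 levels at equal-probability-mass quantiles of N(0, 1)
normal = torch.distributions.Normal(0.0, 1.0)
quantiles = torch.linspace(0.5 / 16, 1 - 0.5 / 16, 16)
nf_levels = normal.icdf(quantiles)
nf_levels = nf_levels / nf_levels.abs().max()   # normalize to [-1, 1]

err_uniform = (w - quantize_to_levels(w, uniform_levels)).abs().mean()
err_nf = (w - quantize_to_levels(w, nf_levels)).abs().mean()
print(f"uniform: {err_uniform:.4f}, normal-quantile: {err_nf:.4f}")
```

The quantile levels are denser near zero, where most of the weight mass sits, so the mean error drops at the same bit count.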
9. SmoothQuant: W8A8 Quantization
SmoothQuant quantizes both weights (W) and activations (A) to INT8 for faster inference.
9.1 The Activation Outlier Problem
LLM activations exhibit very large values (outliers) in specific channels, making W8A8 quantization challenging.
9.2 Migration Scaling
SmoothQuant's key insight: transfer the difficulty from activations to weights.
Y = (X * diag(s)^(-1)) * (diag(s) * W)
= X_smooth * W_smooth
def smooth_quantize(
model,
calibration_samples,
alpha: float = 0.5
):
"""
Apply SmoothQuant
Args:
alpha: migration strength (0=weights only, 1=activations only)
Recommended: 0.5 (equal distribution)
"""
act_scales = {}
def collect_scales(name):
def hook(module, input, output):
inp = input[0].detach()
if inp.dim() == 3:
inp = inp.reshape(-1, inp.size(-1))
channel_max = inp.abs().max(dim=0)[0]
if name not in act_scales:
act_scales[name] = channel_max
else:
act_scales[name] = torch.maximum(act_scales[name], channel_max)
return hook
handles = []
for name, module in model.named_modules():
if isinstance(module, torch.nn.Linear):
handles.append(module.register_forward_hook(collect_scales(name)))
with torch.no_grad():
for sample in calibration_samples:
model(**sample)
for h in handles:
h.remove()
for name, module in model.named_modules():
if isinstance(module, torch.nn.Linear) and name in act_scales:
act_scale = act_scales[name]
weight_scale = module.weight.abs().max(dim=0)[0]
# Compute migration scale
smooth_scale = (act_scale ** alpha) / (weight_scale ** (1 - alpha))
smooth_scale = torch.clamp(smooth_scale, min=1e-5)
# Apply scale to weights: W_smooth = W * diag(s) per input channel
# (a full implementation folds the matching X / diag(s) into the preceding LayerNorm)
module.weight.data = module.weight.data * smooth_scale.unsqueeze(0)
return model, act_scales
10. SpQR: Sparse Quantization Representation
SpQR stores important weights (outliers) in FP16 separately and quantizes the remainder to low precision.
import torch
def spqr_quantize(weight: torch.Tensor,
num_bits: int = 3,
outlier_threshold_percentile: float = 1.0):
"""
SpQR quantization (simplified version)
Core: store top p% outliers as FP16, quantize rest to low bits
"""
threshold = torch.quantile(weight.abs(), 1 - outlier_threshold_percentile / 100)
outlier_mask = weight.abs() > threshold
# Store outliers (FP16)
outlier_values = weight.clone()
outlier_values[~outlier_mask] = 0
# Quantize remainder
regular_weight = weight.clone()
regular_weight[outlier_mask] = 0
q_max = 2 ** (num_bits - 1) - 1
group_size = 16
rows, cols = regular_weight.shape
regular_grouped = regular_weight.reshape(-1, group_size)
max_abs = regular_grouped.abs().max(dim=1, keepdim=True)[0]
scales = max_abs / q_max
scales = torch.clamp(scales, min=1e-8)
q = torch.clamp(torch.round(regular_grouped / scales), -q_max, q_max).to(torch.int8)
regular_dequant = (scales * q.float()).reshape(rows, cols)
reconstructed = regular_dequant + outlier_values
error = (weight - reconstructed).abs().mean().item()
outlier_memory = outlier_mask.sum().item() * 2
regular_memory = (~outlier_mask).sum().item() * (num_bits / 8)
total_memory = outlier_memory + regular_memory
original_memory = weight.numel() * weight.element_size()
compression_ratio = original_memory / total_memory
print(f"Outlier ratio: {outlier_mask.float().mean():.2%}")
print(f"Mean reconstruction error: {error:.6f}")
print(f"Compression ratio: {compression_ratio:.2f}x")
return q, scales, outlier_values, outlier_mask
11. Quantization Benchmark Comparison
11.1 Llama-2-7B Benchmark
import time
import torch
import GPUtil
def benchmark_quantization(model, tokenizer, device="cuda", num_runs=50):
    """Benchmark a (quantized) model: memory, latency, throughput"""
    prompt = "The history of artificial intelligence began"
    inputs = tokenizer(prompt, return_tensors="pt").to(device)

    # Warmup (also triggers CUDA kernel compilation)
    with torch.no_grad():
        for _ in range(5):
            model.generate(**inputs, max_new_tokens=50, do_sample=False)

    # Measure memory after the model has actually run, not before
    memory_used_gb = 0.0
    if device == "cuda":
        torch.cuda.synchronize()
        gpu = GPUtil.getGPUs()[0]
        memory_used_gb = gpu.memoryUsed / 1024

    # Measure speed
    if device == "cuda":
        torch.cuda.synchronize()
    start = time.time()
    with torch.no_grad():
        for _ in range(num_runs):
            model.generate(**inputs, max_new_tokens=50, do_sample=False)
    if device == "cuda":
        torch.cuda.synchronize()
    elapsed = time.time() - start

    avg_time = elapsed / num_runs
    tokens_per_second = 50 / avg_time  # 50 new tokens per run

    return {
        "memory_gb": memory_used_gb,
        "avg_time_ms": avg_time * 1000,
        "tokens_per_second": tokens_per_second,
    }
# Example results (A100 80GB, Llama-2-7B)
benchmark_results = {
    "FP16": {"memory_gb": 13.5, "tokens_per_second": 52.3, "ppl": 5.68},
    "INT8 (bitsandbytes)": {"memory_gb": 7.8, "tokens_per_second": 38.1, "ppl": 5.71},
    "INT4 GPTQ": {"memory_gb": 4.5, "tokens_per_second": 65.2, "ppl": 5.89},
    "INT4 AWQ": {"memory_gb": 4.3, "tokens_per_second": 68.7, "ppl": 5.86},
    "Q4_K_M (GGUF, CPU)": {"memory_gb": 4.1, "tokens_per_second": 45.2, "ppl": 5.91},
    "INT4 NF4": {"memory_gb": 4.0, "tokens_per_second": 31.5, "ppl": 5.94},
}

print("=" * 80)
print(f"{'Method':<25} {'Memory (GB)':<12} {'Tok/s':<12} {'PPL':<8}")
print("=" * 80)
for method, stats in benchmark_results.items():
    print(f"{method:<25} {stats['memory_gb']:<12.1f} {stats['tokens_per_second']:<12.1f} {stats['ppl']:<8.2f}")
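The example numbers above are easier to compare as ratios against the FP16 baseline. The sketch below reproduces a subset of the table so it runs standalone; the PPL deltas show how little quality is traded for the memory savings.

```python
# Example figures copied from the A100 / Llama-2-7B table above
results = {
    "FP16":                {"memory_gb": 13.5, "tokens_per_second": 52.3, "ppl": 5.68},
    "INT8 (bitsandbytes)": {"memory_gb": 7.8,  "tokens_per_second": 38.1, "ppl": 5.71},
    "INT4 GPTQ":           {"memory_gb": 4.5,  "tokens_per_second": 65.2, "ppl": 5.89},
    "INT4 AWQ":            {"memory_gb": 4.3,  "tokens_per_second": 68.7, "ppl": 5.86},
}

base = results["FP16"]
for method, s in results.items():
    compression = base["memory_gb"] / s["memory_gb"]   # memory reduction vs FP16
    speedup = s["tokens_per_second"] / base["tokens_per_second"]
    ppl_delta = s["ppl"] - base["ppl"]                 # quality cost vs FP16
    print(f"{method:<22} {compression:4.2f}x memory, "
          f"{speedup:4.2f}x speed, PPL {ppl_delta:+.2f}")
```

INT4 AWQ ends up roughly 3.1x smaller and 1.3x faster than FP16 at a perplexity cost of only +0.18, which is why it is a common default for GPU serving.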
12. Practical Guide: Choosing the Right Quantization Method
12.1 Strategy by Model Size
Small models under 7B:
- GGUF Q4_K_M: optimal for local CPU execution
- AWQ INT4: recommended for GPU server deployment
- FP16: viable if memory allows (a 7B model needs ~14GB, so a 24GB GPU suffices)
Mid-size models 13B–30B:
- GPTQ INT4 or AWQ INT4: runs on a single 24GB GPU
- GGUF Q4_K_M: can run in 16GB RAM
Large models 70B+:
- GPTQ INT4: runs on a single A100 80GB
- GPTQ INT2: for extreme compression
- Multi-GPU + Tensor Parallel combination
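The size brackets above follow from a simple rule of thumb: weight memory is roughly parameters x bits / 8, ignoring activations and the KV cache. A minimal sketch:

```python
def weight_memory_gb(params_billion: float, bits: int) -> float:
    """Approximate weight-only memory: 1B params at 8 bits ~= 1 GB."""
    return params_billion * bits / 8

# Tabulate the common sizes against the common precisions
for params in (7, 13, 70):
    row = ", ".join(
        f"{name}: {weight_memory_gb(params, bits):5.1f} GB"
        for name, bits in [("FP16", 16), ("INT8", 8), ("INT4", 4)]
    )
    print(f"{params:>3}B -> {row}")
```

This recovers the claims in the lists: 70B at INT4 is ~35GB (fits a single A100 80GB), while 13B at INT4 is ~6.5GB (fits a 24GB GPU with room for the KV cache). Budget extra headroom on top of these figures for activations.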
12.2 Strategy by Task
def recommend_quantization(
    task: str,
    model_size_b: float,
    gpu_memory_gb: float,
    cpu_only: bool = False,
    fine_tuning_needed: bool = False,
):
    """Recommend quantization based on task and environment"""
    recommendations = []

    if cpu_only:
        recommendations.append({
            "method": "GGUF Q4_K_M",
            "reason": "Optimized for CPU inference, based on llama.cpp",
            "library": "llama-cpp-python",
        })
        return recommendations

    if fine_tuning_needed:
        recommendations.append({
            "method": "bitsandbytes NF4 + QLoRA",
            "reason": "Fine-tuning capable, LoRA adapter training with ~4GB overhead",
            "library": "bitsandbytes + peft",
        })
        return recommendations

    # Weight-only memory estimates: bytes per parameter x billions of parameters
    fp16_memory = model_size_b * 2
    int8_memory = model_size_b * 1
    int4_memory = model_size_b * 0.5

    if fp16_memory <= gpu_memory_gb * 0.8:
        recommendations.append({
            "method": "FP16 (baseline)",
            "reason": "Memory is sufficient, best quality",
            "memory_gb": fp16_memory,
        })

    if int8_memory <= gpu_memory_gb * 0.8:
        if task in ["chat", "completion", "summarization"]:
            recommendations.append({
                "method": "INT8 (bitsandbytes LLM.int8())",
                "reason": "Good balance of quality and memory",
                "library": "bitsandbytes",
                "memory_gb": int8_memory,
            })

    if int4_memory <= gpu_memory_gb * 0.8:
        recommendations.append({
            "method": "AWQ INT4",
            "reason": "Fast inference, excellent quality",
            "library": "autoawq",
            "memory_gb": int4_memory,
        })
        recommendations.append({
            "method": "GPTQ INT4",
            "reason": "Best INT4 quality, slower quantization process",
            "library": "auto-gptq",
            "memory_gb": int4_memory,
        })

    return recommendations
# Example usage
recommendations = recommend_quantization(
    task="chat",
    model_size_b=7.0,
    gpu_memory_gb=16.0,
    fine_tuning_needed=False,
)

for rec in recommendations:
    print(f"\nMethod: {rec['method']}")
    print(f"Reason: {rec['reason']}")
    if 'library' in rec:
        print(f"Library: {rec['library']}")
    if 'memory_gb' in rec:
        print(f"Expected memory: {rec['memory_gb']:.1f} GB")
12.3 Complete Quantization Pipeline
import torch
import os
from auto_gptq import AutoGPTQForCausalLM, BaseQuantizeConfig
from awq import AutoAWQForCausalLM
from transformers import AutoTokenizer
class QuantizationPipeline:
    """Unified quantization pipeline"""

    def __init__(self, model_name: str, output_base_dir: str):
        self.model_name = model_name
        self.output_base_dir = output_base_dir
        self.tokenizer = AutoTokenizer.from_pretrained(model_name)
        os.makedirs(output_base_dir, exist_ok=True)

    def quantize_gptq(self, bits: int = 4, group_size: int = 128):
        """GPTQ quantization"""
        output_dir = os.path.join(self.output_base_dir, f"gptq-{bits}bit")
        config = BaseQuantizeConfig(
            bits=bits,
            group_size=group_size,
            sym=True,
            desc_act=False,
        )
        model = AutoGPTQForCausalLM.from_pretrained(
            self.model_name,
            quantize_config=config,
        )
        calibration_data = self._prepare_calibration_data()
        model.quantize(calibration_data)
        model.save_quantized(output_dir)
        self.tokenizer.save_pretrained(output_dir)
        print(f"GPTQ {bits}bit saved: {output_dir}")
        return output_dir

    def quantize_awq(self, bits: int = 4, group_size: int = 128):
        """AWQ quantization"""
        output_dir = os.path.join(self.output_base_dir, f"awq-{bits}bit")
        model = AutoAWQForCausalLM.from_pretrained(
            self.model_name,
            low_cpu_mem_usage=True,
        )
        quant_config = {
            "zero_point": True,
            "q_group_size": group_size,
            "w_bit": bits,
            "version": "GEMM",
        }
        model.quantize(self.tokenizer, quant_config=quant_config)
        model.save_quantized(output_dir)
        self.tokenizer.save_pretrained(output_dir)
        print(f"AWQ {bits}bit saved: {output_dir}")
        return output_dir

    def _prepare_calibration_data(self, num_samples: int = 128):
        """Prepare calibration data (auto-gptq expects dicts with input_ids / attention_mask)"""
        from datasets import load_dataset
        dataset = load_dataset("wikitext", "wikitext-2-raw-v1", split="train")
        data = []
        for text in dataset["text"]:
            if len(text.strip()) > 50:
                encoded = self.tokenizer(
                    text.strip(),
                    return_tensors="pt",
                    max_length=2048,
                    truncation=True,
                )
                data.append({
                    "input_ids": encoded["input_ids"],
                    "attention_mask": encoded["attention_mask"],
                })
            if len(data) >= num_samples:
                break
        return data
Conclusion
Model quantization is a foundational technology for democratizing LLMs. To summarize what we covered:
- Fundamentals: The math behind compressing FP32 to INT4 (scale, zero_point)
- PTQ vs QAT: PTQ is practical without retraining; QAT is essential for extreme compression
- GPTQ: Best INT4 quality via Hessian-based error compensation
- AWQ: Fast and efficient quantization based on activation distributions
- GGUF: Optimized for CPU execution, multiple quality levels available
- bitsandbytes: HuggingFace integration, essential for QLoRA fine-tuning
Recommended strategies:
- Local execution: GGUF Q4_K_M
- GPU server deployment: AWQ 4-bit
- Quality-critical scenarios: GPTQ 4-bit or FP16
- Fine-tuning needed: bitsandbytes NF4 + QLoRA
Quantization technology is evolving rapidly, with 2-bit methods like QuIP# and AQLM emerging. The journey toward smaller, faster models continues.
References
- GPTQ: arXiv:2210.17323
- AWQ: arXiv:2306.00978
- llama.cpp: github.com/ggerganov/llama.cpp
- bitsandbytes: github.com/TimDettmers/bitsandbytes
- SmoothQuant: arXiv:2211.10438
- SpQR: arXiv:2306.03078
- PyTorch Quantization: pytorch.org/docs/stable/quantization.html