Deep Learning Model Quantization Complete Guide: Master INT8, INT4, GPTQ, AWQ, GGUF
- Author: Youngju Kim (@fjvbn20031)
Introduction
As deep learning models grow increasingly large, inference costs and memory requirements have skyrocketed. GPT-3 has 175B parameters, Llama 3 has 70B, and storing them in full FP32 precision requires 700GB and 280GB of memory respectively — completely infeasible for typical GPUs.
Model Quantization is the key technology to solve this problem. By compressing 32-bit floating-point (FP32) weights to 8-bit or 4-bit integers, it reduces memory by 4–8x and improves inference speed by 2–4x. Remarkably, quality loss is minimal.
In this guide, we thoroughly explore quantization from mathematical foundations to the latest techniques: GPTQ, AWQ, GGUF, and bitsandbytes.
1. Quantization Fundamentals: Understanding Number Representations
1.1 Floating-Point Formats
Understanding the floating-point formats used in modern deep learning is the starting point for quantization.
FP32 (Float32)
- Sign (1 bit) + Exponent (8 bits) + Mantissa (23 bits) = 32 bits total
- Range: approximately -3.4e38 to 3.4e38
- Precision: ~7 decimal digits
FP16 (Float16)
- Sign (1 bit) + Exponent (5 bits) + Mantissa (10 bits) = 16 bits total
- Range: -65504 to 65504 (much narrower than FP32)
- Precision: ~3 decimal digits
- Narrow range causes overflow/underflow during training; requires loss (gradient) scaling
BF16 (Brain Float16)
- Sign (1 bit) + Exponent (8 bits) + Mantissa (7 bits) = 16 bits total
- Maintains the same exponent range as FP32 while reducing mantissa bits
- Far lower overflow risk, making it safer for deep learning training
- Developed by Google Brain, natively supported on modern GPUs (A100, H100)
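The range gap is easy to see in code: a value just past FP16's 65504 maximum overflows to infinity, while BF16 still represents it, only coarsely:

```python
import torch

# FP16 overflows past 65504; BF16 keeps FP32's 8-bit exponent range
x = torch.tensor(70000.0)
print(x.to(torch.float16))   # inf (out of FP16 range)
print(x.to(torch.bfloat16))  # 70144 (coarse due to the 7-bit mantissa, but finite)
```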
import torch
import numpy as np
# Check memory size of each data type
x_fp32 = torch.tensor([1.5, -2.3, 0.7], dtype=torch.float32)
x_fp16 = torch.tensor([1.5, -2.3, 0.7], dtype=torch.float16)
x_bf16 = torch.tensor([1.5, -2.3, 0.7], dtype=torch.bfloat16)
print(f"FP32: {x_fp32.element_size()} bytes per element") # 4 bytes
print(f"FP16: {x_fp16.element_size()} bytes per element") # 2 bytes
print(f"BF16: {x_bf16.element_size()} bytes per element") # 2 bytes
# Memory calculation for a 7B parameter model
params = 7e9
fp32_memory_gb = params * 4 / 1e9
fp16_memory_gb = params * 2 / 1e9
int8_memory_gb = params * 1 / 1e9
int4_memory_gb = params * 0.5 / 1e9
print(f"\n7B Model Memory Requirements:")
print(f"FP32: {fp32_memory_gb:.1f} GB") # 28.0 GB
print(f"FP16: {fp16_memory_gb:.1f} GB") # 14.0 GB
print(f"INT8: {int8_memory_gb:.1f} GB") # 7.0 GB
print(f"INT4: {int4_memory_gb:.1f} GB") # 3.5 GB
1.2 Integer Representations
The core of quantization is mapping floating-point values to integers.
- INT8: -128 to 127 (signed) or 0 to 255 (unsigned)
- INT4: -8 to 7 (signed) or 0 to 15 (unsigned)
- INT2: -2 to 1 (signed) or 0 to 3 (unsigned)
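PyTorch exposes these ranges directly via torch.iinfo, a handy sanity check when writing quantization code:

```python
import torch

for dtype in (torch.int8, torch.uint8, torch.int32):
    info = torch.iinfo(dtype)
    print(dtype, info.min, info.max)
```

Note that standard PyTorch tensors have no plain 4-bit integer dtype, which is why INT4 schemes typically pack two values into one int8 byte.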
1.3 Quantization Formula
The fundamental formula to convert a floating-point value x to integer q:
q = clamp(round(x / scale) + zero_point, q_min, q_max)
Dequantization:
x_approx = scale * (q - zero_point)
Where:
- scale: quantization scale factor (scale = (max_val - min_val) / (q_max - q_min))
- zero_point: the offset representing which real value integer 0 corresponds to
- q_min, q_max: integer range bounds (-128, 127 for INT8)
import torch
import numpy as np
def symmetric_quantize(x: torch.Tensor, num_bits: int = 8):
"""Symmetric quantization implementation"""
q_max = 2 ** (num_bits - 1) - 1 # 127 for INT8
q_min = -q_max # -127
# Compute scale
max_abs = x.abs().max()
scale = max_abs / q_max
# Quantize
q = torch.clamp(torch.round(x / scale), q_min, q_max).to(torch.int8)
return q, scale
def asymmetric_quantize(x: torch.Tensor, num_bits: int = 8):
"""Asymmetric quantization implementation"""
q_max = 2 ** num_bits - 1 # 255 for UINT8
q_min = 0
# Compute scale and zero_point
min_val = x.min()
max_val = x.max()
scale = (max_val - min_val) / (q_max - q_min)
zero_point = q_min - torch.round(min_val / scale)
zero_point = torch.clamp(zero_point, q_min, q_max).to(torch.int32)
# Quantize
q = torch.clamp(torch.round(x / scale) + zero_point, q_min, q_max).to(torch.uint8)
return q, scale, zero_point
def dequantize(q: torch.Tensor, scale: torch.Tensor, zero_point: torch.Tensor = None):
"""Dequantization"""
if zero_point is None:
return scale * q.float()
return scale * (q.float() - zero_point.float())
# Test
x = torch.randn(100)
print(f"Original data range: [{x.min():.4f}, {x.max():.4f}]")
# Symmetric quantization
q_sym, scale_sym = symmetric_quantize(x)
x_reconstructed_sym = dequantize(q_sym, scale_sym)
error_sym = (x - x_reconstructed_sym).abs().mean()
print(f"Symmetric quantization mean error: {error_sym:.6f}")
# Asymmetric quantization
q_asym, scale_asym, zp_asym = asymmetric_quantize(x)
x_reconstructed_asym = dequantize(q_asym, scale_asym, zp_asym)
error_asym = (x - x_reconstructed_asym).abs().mean()
print(f"Asymmetric quantization mean error: {error_asym:.6f}")
1.4 Symmetric vs Asymmetric Quantization
Symmetric Quantization
- zero_point = 0
- Symmetric positive/negative range
- Suitable for weights (mostly zero-centered distribution)
- Simpler computation: x_approx = scale * q
Asymmetric Quantization
- zero_point != 0
- Can represent arbitrary ranges
- Suitable for activations (e.g., non-negative distributions after ReLU)
- More complex computation: x_approx = scale * (q - zero_point)
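A quick experiment shows why asymmetric quantization suits ReLU-style activations: on non-negative data, symmetric INT8 never uses its negative codes, roughly doubling the error. This sketch is self-contained (it inlines both schemes rather than reusing the functions above):

```python
import torch

torch.manual_seed(0)
x = torch.relu(torch.randn(10_000))  # non-negative, like post-ReLU activations

# Symmetric INT8: half the range (the negative codes) is never used on this data
scale_sym = x.abs().max() / 127
x_sym = torch.round(x / scale_sym).clamp(-127, 127) * scale_sym
err_sym = (x - x_sym).abs().mean()

# Asymmetric UINT8: maps [min, max] onto the full [0, 255] range
scale_asym = (x.max() - x.min()) / 255
zp = torch.round(-x.min() / scale_asym)
x_asym = ((torch.round(x / scale_asym) + zp).clamp(0, 255) - zp) * scale_asym
err_asym = (x - x_asym).abs().mean()

print(f"symmetric err: {err_sym:.6f}, asymmetric err: {err_asym:.6f}")
```

The asymmetric step size is about half the symmetric one here, so the mean error is roughly halved.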
1.5 Quantization Granularity
Determines how many parameters share a single scale/zero_point.
Per-Tensor: One scale for the entire tensor
- Minimal memory overhead
- Largest precision loss
Per-Channel (Per-Row/Column): Individual scale per channel
- Separate scale for each row/column of the weight matrix
- Effectively handles distribution differences across channels
Per-Group (Per-Block): Individual scale per fixed-size group
- Typical group_size = 128
- Compromise between per-channel and per-tensor
- Commonly used in GPTQ and AWQ
import torch
def per_group_quantize(weight: torch.Tensor, group_size: int = 128, num_bits: int = 4):
"""Per-Group quantization implementation"""
rows, cols = weight.shape
# Split into groups (assumes the element count is divisible by group_size)
weight_grouped = weight.reshape(-1, group_size)
# Max/min per group
max_vals = weight_grouped.max(dim=1, keepdim=True)[0]
min_vals = weight_grouped.min(dim=1, keepdim=True)[0]
q_max = 2 ** num_bits - 1 # 15 for INT4
# Compute scales (clamped to avoid division by zero in constant groups)
scales = (max_vals - min_vals) / q_max
scales = torch.clamp(scales, min=1e-8)
zero_points = torch.round(-min_vals / scales)
# Quantize
q = torch.clamp(torch.round(weight_grouped / scales) + zero_points, 0, q_max)
# Dequantize
weight_dequant = scales * (q - zero_points)
weight_dequant = weight_dequant.reshape(rows, cols)
return q, scales, zero_points, weight_dequant
# Example: Transformer weight quantization
weight = torch.randn(4096, 4096) # Llama-style weight
q, scales, zp, weight_dequant = per_group_quantize(weight, group_size=128, num_bits=4)
error = (weight - weight_dequant).abs().mean()
print(f"Per-Group INT4 quantization mean error: {error:.6f}")
print(f"Compression ratio: {weight.element_size() * weight.numel() / (q.numel() / 2 + scales.numel() * 4):.2f}x")
2. Post-Training Quantization (PTQ)
PTQ quantizes an already-trained model without retraining — the most practical approach and most widely used.
2.1 Calibration Dataset
PTQ uses a small calibration dataset to determine appropriate scale and zero_point values.
from transformers import AutoModelForCausalLM, AutoTokenizer
import torch
from datasets import load_dataset
def collect_calibration_data(model_name: str, num_samples: int = 128):
"""Collect calibration data"""
tokenizer = AutoTokenizer.from_pretrained(model_name)
# WikiText-2 or C4 dataset is commonly used
dataset = load_dataset("wikitext", "wikitext-2-raw-v1", split="train")
texts = []
for item in dataset:
if len(item['text'].strip()) > 100:
texts.append(item['text'].strip())
if len(texts) >= num_samples:
break
# Tokenize
encoded = [
tokenizer(text, return_tensors="pt", max_length=2048, truncation=True)
for text in texts
]
return encoded
def collect_activation_stats(model, calibration_data, layer_name: str):
"""Collect activation statistics for a specific layer"""
stats = {"min": float("inf"), "max": float("-inf")}
def hook_fn(module, input, output):
with torch.no_grad():
act = output.detach().float()
stats["min"] = min(stats["min"], act.min().item())
stats["max"] = max(stats["max"], act.max().item())
# Register hook
target_layer = dict(model.named_modules())[layer_name]
handle = target_layer.register_forward_hook(hook_fn)
# Run calibration data
model.eval()
with torch.no_grad():
for batch in calibration_data[:32]:
model(**batch)
handle.remove()
return stats
2.2 Min-Max Calibration
The simplest method: uses the global minimum and maximum values from calibration data.
class MinMaxCalibrator:
"""Min-Max calibrator"""
def __init__(self):
self.min_val = float("inf")
self.max_val = float("-inf")
def update(self, tensor: torch.Tensor):
self.min_val = min(self.min_val, tensor.min().item())
self.max_val = max(self.max_val, tensor.max().item())
def compute_scale_zp(self, num_bits: int = 8, symmetric: bool = True):
q_max = 2 ** (num_bits - 1) - 1 if symmetric else 2 ** num_bits - 1
if symmetric:
max_abs = max(abs(self.min_val), abs(self.max_val))
scale = max_abs / q_max
zero_point = 0
else:
scale = (self.max_val - self.min_val) / q_max
zero_point = -round(self.min_val / scale)
return scale, zero_point
2.3 Histogram Calibration
To reduce the impact of outliers, finds the optimal range based on the distribution histogram.
import numpy as np
from scipy import stats
class HistogramCalibrator:
"""Histogram-based calibrator (minimizes KL Divergence)"""
def __init__(self, num_bins: int = 2048):
self.num_bins = num_bins
self.histogram = None
self.bin_edges = None
def update(self, tensor: torch.Tensor):
data = tensor.detach().float().numpy().flatten()
if self.histogram is None:
self.histogram, self.bin_edges = np.histogram(data, bins=self.num_bins)
else:
new_hist, _ = np.histogram(data, bins=self.bin_edges)
self.histogram += new_hist
def compute_optimal_range(self, num_bits: int = 8):
"""Search for optimal range minimizing KL Divergence"""
num_quantized_bins = 2 ** num_bits - 1
best_kl = float("inf")
best_threshold = None
for i in range(num_quantized_bins, len(self.histogram)):
reference = self.histogram[:i].copy().astype(float)
reference /= reference.sum()
quantized = np.zeros(i)
bin_size = i / num_quantized_bins
for j in range(num_quantized_bins):
start = int(j * bin_size)
end = int((j + 1) * bin_size)
quantized[start:end] = reference[start:end].sum() / (end - start)
quantized = np.where(quantized == 0, 1e-10, quantized)
reference_clipped = np.where(reference == 0, 1e-10, reference)
kl = stats.entropy(reference_clipped, quantized)
if kl < best_kl:
best_kl = kl
best_threshold = self.bin_edges[i]
return -best_threshold, best_threshold
2.4 Impact on Perplexity
The most common metric for measuring quantization quality is Perplexity (PPL).
import torch
import math
from transformers import AutoModelForCausalLM, AutoTokenizer
def compute_perplexity(model, tokenizer, text: str, device: str = "cuda"):
"""Compute perplexity"""
encodings = tokenizer(text, return_tensors="pt")
input_ids = encodings.input_ids.to(device)
max_length = 1024
stride = 512
nlls = []
prev_end_loc = 0
for begin_loc in range(0, input_ids.size(1), stride):
end_loc = min(begin_loc + max_length, input_ids.size(1))
trg_len = end_loc - prev_end_loc
input_ids_chunk = input_ids[:, begin_loc:end_loc]
target_ids = input_ids_chunk.clone()
target_ids[:, :-trg_len] = -100
with torch.no_grad():
outputs = model(input_ids_chunk, labels=target_ids)
neg_log_likelihood = outputs.loss
nlls.append(neg_log_likelihood)
prev_end_loc = end_loc
if end_loc == input_ids.size(1):
break
ppl = torch.exp(torch.stack(nlls).mean())
return ppl.item()
# Example PPL comparison
# FP16: PPL ~5.68
# INT8: PPL ~5.71 (~0.5% increase)
# INT4 (GPTQ): PPL ~5.89 (~3.7% increase)
# INT4 (naive): PPL ~6.52 (~14.8% increase)
3. Quantization-Aware Training (QAT)
QAT simulates quantization during training so the model adapts to quantization noise.
3.1 Fake Quantization
Simulates quantization effects in FP32 instead of actual INT8 operations.
import torch
import torch.nn as nn
import torch.nn.functional as F
class FakeQuantize(nn.Module):
"""Fake quantization module"""
def __init__(self, num_bits: int = 8, symmetric: bool = True):
super().__init__()
self.num_bits = num_bits
self.symmetric = symmetric
self.register_buffer('scale', torch.tensor(1.0))
self.register_buffer('zero_point', torch.tensor(0))
self.register_buffer('fake_quant_enabled', torch.tensor(1))
if symmetric:
self.q_min = -(2 ** (num_bits - 1))
self.q_max = 2 ** (num_bits - 1) - 1
else:
self.q_min = 0
self.q_max = 2 ** num_bits - 1
def forward(self, x: torch.Tensor) -> torch.Tensor:
if self.fake_quant_enabled.item() == 0:
return x
# Update scale with exponential moving average during training
if self.training:
with torch.no_grad():
if self.symmetric:
max_abs = x.abs().max()
new_scale = max_abs / self.q_max
else:
new_scale = (x.max() - x.min()) / (self.q_max - self.q_min)
self.scale.copy_(0.9 * self.scale + 0.1 * new_scale)
# Fake quantize: quantize then dequantize
x_scaled = x / self.scale
x_clipped = torch.clamp(x_scaled, self.q_min, self.q_max)
x_rounded = torch.round(x_clipped)
x_dequant = x_rounded * self.scale
return x_dequant
3.2 STE (Straight-Through Estimator)
class STERound(torch.autograd.Function):
"""Straight-Through Estimator for round()"""
@staticmethod
def forward(ctx, x):
return torch.round(x)
@staticmethod
def backward(ctx, grad_output):
# Pass gradient through round() unchanged (identity approximation)
return grad_output
class STEClamp(torch.autograd.Function):
"""Straight-Through Estimator for clamp()"""
@staticmethod
def forward(ctx, x, min_val, max_val):
ctx.save_for_backward(x)
ctx.min_val = min_val
ctx.max_val = max_val
return torch.clamp(x, min_val, max_val)
@staticmethod
def backward(ctx, grad_output):
x, = ctx.saved_tensors
# Pass gradient only within clamp range
grad = grad_output * ((x >= ctx.min_val) & (x <= ctx.max_val)).float()
return grad, None, None
class QATLinear(nn.Module):
"""Linear layer with QAT applied"""
def __init__(self, in_features, out_features, num_bits=8):
super().__init__()
self.linear = nn.Linear(in_features, out_features)
self.weight_fake_quant = FakeQuantize(num_bits=num_bits)
self.act_fake_quant = FakeQuantize(num_bits=num_bits, symmetric=False)
def forward(self, x):
# Activation quantization
x_q = self.act_fake_quant(x)
# Weight quantization
w_q = self.weight_fake_quant(self.linear.weight)
# FP32 compute (INT8 in actual deployment)
return F.linear(x_q, w_q, self.linear.bias)
3.3 When is QAT Needed?
- When PTQ quality loss is too high: Especially effective for small models (BERT-small, etc.)
- Quantizing to INT4 or lower: Essential for extreme compression
- Precision-sensitive tasks: Object detection, ASR, etc.
# QAT training workflow
import torch.optim as optim
from torch.quantization import prepare_qat, convert
def train_qat_model(model, train_loader, num_epochs=10):
"""QAT training example"""
model.qconfig = torch.quantization.get_default_qat_qconfig('fbgemm')
model_prepared = prepare_qat(model.train())
optimizer = optim.Adam(model_prepared.parameters(), lr=1e-5)
for epoch in range(num_epochs):
for batch in train_loader:
inputs, labels = batch
outputs = model_prepared(inputs)
loss = F.cross_entropy(outputs, labels)
optimizer.zero_grad()
loss.backward()
optimizer.step()
# Convert to INT8 model
model_prepared.eval()
model_quantized = convert(model_prepared)
return model_quantized
4. PyTorch Quantization API
4.1 torch.ao.quantization
PyTorch's official quantization API.
import torch
from torch.ao.quantization import (
get_default_qconfig,
get_default_qat_qconfig,
prepare,
prepare_qat,
convert
)
# Static quantization (PTQ)
def static_quantization_example():
"""Static quantization example"""
model = MyModel()
model.eval()
# Backend config (fbgemm: x86, qnnpack: ARM)
model.qconfig = get_default_qconfig('fbgemm')
# Prepare for calibration
model_prepared = prepare(model)
# Collect statistics from calibration data
with torch.no_grad():
for data in calibration_loader:
model_prepared(data)
# Convert to INT8 model
model_quantized = convert(model_prepared)
return model_quantized
# Dynamic quantization (effective for LSTM, Linear)
def dynamic_quantization_example():
"""Dynamic quantization example"""
model = MyModel()
model_quantized = torch.quantization.quantize_dynamic(
model,
{nn.Linear, nn.LSTM}, # Layer types to quantize
dtype=torch.qint8
)
return model_quantized
4.2 FX Graph Mode Quantization
A more flexible and powerful quantization approach.
from torch.ao.quantization.quantize_fx import prepare_fx, convert_fx
from torch.ao.quantization import QConfigMapping
def fx_quantization_example(model, calibration_data):
"""FX Graph Mode quantization"""
model.eval()
qconfig_mapping = QConfigMapping().set_global(
get_default_qconfig('fbgemm')
)
example_inputs = (torch.randn(1, 3, 224, 224),)
model_prepared = prepare_fx(
model,
qconfig_mapping,
example_inputs
)
with torch.no_grad():
for batch in calibration_data:
model_prepared(batch)
model_quantized = convert_fx(model_prepared)
return model_quantized
5. GPTQ: Accurate Post-Training Quantization
GPTQ, published in 2022, is an LLM-specific quantization algorithm that minimizes quality loss even at INT4. (arXiv:2210.17323)
5.1 GPTQ Algorithm Principles
GPTQ is based on OBQ (Optimal Brain Quantization). The core idea: quantize weights layer by layer sequentially, then compensate for quantization errors in the already-quantized weights by updating the remaining weights.
OBQ error minimization objective:
argmin_Q ||WX - QX||_F^2
Where W is the original weight, Q is the quantized weight, and X is the input activation.
Hessian-based weight update:
After quantizing each weight, the resulting error is propagated to the remaining weights using the inverse Hessian H^(-1).
import torch
import math
def gptq_quantize_weight(weight: torch.Tensor,
hessian: torch.Tensor,
num_bits: int = 4,
group_size: int = 128,
damp_percent: float = 0.01):
"""
Quantize weights using the GPTQ algorithm
Args:
weight: [out_features, in_features] weight matrix
hessian: [in_features, in_features] Hessian matrix (H = 2 * X @ X.T)
num_bits: quantization bit count
group_size: group size
damp_percent: damping ratio for Hessian stabilization
"""
W = weight.clone().float()
n_rows, n_cols = W.shape
# Hessian damping (numerical stability)
H = hessian.clone().float()
dead_cols = torch.diag(H) == 0
H[dead_cols, dead_cols] = 1
W[:, dead_cols] = 0
damp = damp_percent * H.diag().mean()
H.diagonal().add_(damp)
# Inverse Hessian via Cholesky decomposition
H_inv = torch.linalg.cholesky(H)
H_inv = torch.cholesky_inverse(H_inv)
H_inv = torch.linalg.cholesky(H_inv, upper=True)
Q = torch.zeros_like(W)
Losses = torch.zeros_like(W)
q_max = 2 ** (num_bits - 1) - 1
for col_idx in range(n_cols):
w_col = W[:, col_idx]
h_inv_diag = H_inv[col_idx, col_idx]
# Compute per-group scale
if group_size != -1 and col_idx % group_size == 0:
group_end = min(col_idx + group_size, n_cols)
w_group = W[:, col_idx:group_end]
max_abs = w_group.abs().max(dim=1)[0].unsqueeze(1)
scale = max_abs / q_max
scale = torch.clamp(scale, min=1e-8)
# Quantize
q_col = torch.clamp(torch.round(w_col / scale.squeeze()), -q_max, q_max)
q_col = q_col * scale.squeeze()
Q[:, col_idx] = q_col
# Quantization error
err = (w_col - q_col) / h_inv_diag
Losses[:, col_idx] = err ** 2 / 2
# Propagate error to remaining weights (the key step!)
W[:, col_idx + 1:] -= err.unsqueeze(1) * H_inv[col_idx, col_idx + 1:].unsqueeze(0)
return Q, Losses
5.2 Using AutoGPTQ
Practical GPTQ quantization uses the AutoGPTQ library.
from auto_gptq import AutoGPTQForCausalLM, BaseQuantizeConfig
from transformers import AutoTokenizer
import torch
def quantize_with_gptq(
model_name: str,
output_dir: str,
bits: int = 4,
group_size: int = 128
):
"""Quantize model with AutoGPTQ"""
tokenizer = AutoTokenizer.from_pretrained(model_name, use_fast=True)
quantize_config = BaseQuantizeConfig(
bits=bits,
group_size=group_size,
damp_percent=0.01,
desc_act=False,
sym=True,
true_sequential=True
)
model = AutoGPTQForCausalLM.from_pretrained(
model_name,
quantize_config=quantize_config
)
# Prepare calibration data
from datasets import load_dataset
dataset = load_dataset("wikitext", "wikitext-2-raw-v1", split="train")
calibration_data = []
for text in dataset["text"][:128]:
if len(text.strip()) > 50:
encoded = tokenizer(
text.strip(),
return_tensors="pt",
max_length=2048,
truncation=True
)
calibration_data.append(encoded["input_ids"].squeeze())
print(f"Starting GPTQ {bits}bit quantization...")
model.quantize(calibration_data)
model.save_quantized(output_dir, use_safetensors=True)
tokenizer.save_pretrained(output_dir)
print(f"Quantization complete: {output_dir}")
return model, tokenizer
def load_gptq_model(model_dir: str, device: str = "cuda"):
"""Load GPTQ quantized model"""
model = AutoGPTQForCausalLM.from_quantized(
model_dir,
device=device,
use_triton=False,
disable_exllama=False,
inject_fused_attention=True,
inject_fused_mlp=True
)
tokenizer = AutoTokenizer.from_pretrained(model_dir)
return model, tokenizer
6. AWQ: Activation-aware Weight Quantization
AWQ, published in 2023, analyzes activation distributions to protect important weight channels. (arXiv:2306.00978)
6.1 Differences from GPTQ
| Feature | GPTQ | AWQ |
|---|---|---|
| Approach | Hessian-based error compensation | Activation-based scaling |
| Calibration data | Required (128+ samples) | Required (32+ samples) |
| Speed | Slow (1–4 hours) | Fast (tens of minutes) |
| Quality | Excellent | Excellent (comparable or better) |
| Key feature | Per-channel optimization | Activation outlier handling |
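AWQ's activation-awareness can be sketched with a toy experiment (synthetic data, illustrative shapes only, with W laid out as [in, out]): weights on input channels with large activations dominate the output error, so protecting just those few channels recovers most of the accuracy. AWQ then converts this observation into an equivalent per-channel scaling so no mixed-precision kernel is needed:

```python
import torch

torch.manual_seed(0)
W = torch.randn(256, 256)          # toy weight, rows = input channels
x_mag = torch.rand(256) * 0.1      # typical per-channel activation magnitude
x_mag[:8] = 10.0                   # ~3% "salient" channels with large activations

def int4_roundtrip(w):
    # per-tensor symmetric INT4 quantize-dequantize
    scale = w.abs().max() / 7
    return torch.round(w / scale).clamp(-7, 7) * scale

# Output-error proxy: each weight row is weighted by its channel's activation size
err_naive = ((W - int4_roundtrip(W)) * x_mag.unsqueeze(1)).norm()

# Protect only the salient rows (keep them in full precision)
W_mixed = int4_roundtrip(W)
W_mixed[:8] = W[:8]
err_protected = ((W - W_mixed) * x_mag.unsqueeze(1)).norm()

print(err_naive.item(), err_protected.item())
```

Protecting 3% of channels removes the dominant error term, which is exactly the effect AWQ reproduces with scaling.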
6.2 Using AutoAWQ
from awq import AutoAWQForCausalLM
from transformers import AutoTokenizer
def quantize_with_awq(
model_name: str,
output_dir: str,
bits: int = 4,
group_size: int = 128
):
"""Quantize model with AutoAWQ"""
tokenizer = AutoTokenizer.from_pretrained(
model_name,
trust_remote_code=True
)
model = AutoAWQForCausalLM.from_pretrained(
model_name,
low_cpu_mem_usage=True,
use_cache=False
)
quant_config = {
"zero_point": True,
"q_group_size": group_size,
"w_bit": bits,
"version": "GEMM"
}
print(f"Starting AWQ {bits}bit quantization...")
model.quantize(tokenizer, quant_config=quant_config)
model.save_quantized(output_dir)
tokenizer.save_pretrained(output_dir)
print(f"AWQ quantization complete: {output_dir}")
return model
def load_awq_model(model_dir: str, device: str = "cuda"):
"""Load AWQ quantized model"""
model = AutoAWQForCausalLM.from_quantized(
model_dir,
fuse_layers=True,
trust_remote_code=True,
safetensors=True
)
tokenizer = AutoTokenizer.from_pretrained(model_dir)
return model, tokenizer
7. GGUF/GGML: The llama.cpp Ecosystem
GGUF (GPT-Generated Unified Format) is the model format for the llama.cpp project, enabling efficient LLM execution even on CPU.
7.1 Understanding GGUF
GGUF was introduced in 2023 to replace GGML. It stores model metadata, hyperparameters, and tokenizer information in a single file.
7.2 Quantization Levels Comparison
| Format | Bits | Memory (7B) | PPL Increase | Recommended Use |
|---|---|---|---|---|
| Q2_K | 2.6 | 2.8 GB | High | Extreme compression |
| Q3_K_S | 3.0 | 3.3 GB | Medium | Memory saving |
| Q4_0 | 4.0 | 3.8 GB | Low | Balanced |
| Q4_K_M | 4.1 | 4.1 GB | Very low | General recommendation |
| Q5_0 | 5.0 | 4.7 GB | Minimal | High quality |
| Q5_K_M | 5.1 | 4.8 GB | Minimal | High quality recommended |
| Q6_K | 6.0 | 5.5 GB | Nearly none | Near FP16 |
| Q8_0 | 8.0 | 7.2 GB | None | Reference use |
| F16 | 16.0 | 13.5 GB | None | Baseline |
K-quants (Q4_K_M, Q5_K_M, etc.) keep some layers at higher precision to improve quality.
7.3 Building and Using llama.cpp
# Clone and build llama.cpp
git clone https://github.com/ggerganov/llama.cpp
cd llama.cpp
# Build with CUDA support
cmake -B build -DGGML_CUDA=ON
cmake --build build --config Release -j $(nproc)
# CPU-only build
cmake -B build
cmake --build build --config Release -j $(nproc)
# Convert HuggingFace model to GGUF
python convert_hf_to_gguf.py \
meta-llama/Llama-2-7b-hf \
--outfile llama2-7b-f16.gguf \
--outtype f16
# Quantize to Q4_K_M
./build/bin/llama-quantize \
llama2-7b-f16.gguf \
llama2-7b-q4_k_m.gguf \
Q4_K_M
# Run inference
./build/bin/llama-cli \
-m llama2-7b-q4_k_m.gguf \
-p "The future of AI is" \
-n 100 \
--ctx-size 4096 \
--threads 8 \
--n-gpu-layers 35
7.4 Python Bindings (llama-cpp-python)
from llama_cpp import Llama
# Load model
llm = Llama(
model_path="./llama2-7b-q4_k_m.gguf",
n_ctx=4096,
n_gpu_layers=35,
n_threads=8,
verbose=False
)
# Text generation
output = llm(
"Once upon a time",
max_tokens=200,
temperature=0.7,
top_p=0.9,
stop=["</s>", "\n\n"]
)
print(output["choices"][0]["text"])
# Chat completion format
response = llm.create_chat_completion(
messages=[
{"role": "system", "content": "You are a helpful assistant."},
{"role": "user", "content": "What is machine learning?"}
],
max_tokens=500,
temperature=0.7
)
print(response["choices"][0]["message"]["content"])
# Streaming output
for chunk in llm.create_chat_completion(
messages=[{"role": "user", "content": "Tell me a joke"}],
stream=True
):
delta = chunk["choices"][0].get("delta", {})
if "content" in delta:
print(delta["content"], end="", flush=True)
8. bitsandbytes: LLM Quantization Library
bitsandbytes, developed by Tim Dettmers, integrates seamlessly with HuggingFace transformers.
8.1 LLM.int8() — 8-bit Mixed Precision
LLM.int8() handles activation outliers in FP16 during matrix multiplication while using INT8 for the rest.
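A simplified sketch of the decomposition (toy shapes, naive per-tensor INT8 in place of the library's vector-wise scheme): outlier activation columns run in floating point, everything else through INT8, and the two partial products are summed:

```python
import torch

torch.manual_seed(0)
X = torch.randn(4, 64)
X[:, 0] = 50.0                          # one outlier activation channel
W = torch.randn(64, 64)                 # [in, out]
outlier = X.abs().max(dim=0)[0] > 6.0   # 6.0 is LLM.int8()'s default threshold

def int8_matmul(X, W):
    # naive per-tensor symmetric INT8 round-trip on both operands
    sx, sw = X.abs().max() / 127, W.abs().max() / 127
    Xq = torch.round(X / sx).clamp(-127, 127)
    Wq = torch.round(W / sw).clamp(-127, 127)
    return (Xq @ Wq) * sx * sw

ref = X @ W
naive = int8_matmul(X, W)
# decomposition: FP matmul for outlier columns, INT8 for the rest
mixed = X[:, outlier] @ W[outlier] + int8_matmul(X[:, ~outlier], W[~outlier])

err_naive_mm = (naive - ref).abs().mean()
err_mixed_mm = (mixed - ref).abs().mean()
print(err_naive_mm.item(), err_mixed_mm.item())
```

In the naive version the single outlier inflates the activation scale and swamps all normal values; the decomposition avoids that.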
from transformers import AutoModelForCausalLM, AutoTokenizer
import torch
# Load INT8 model
model_8bit = AutoModelForCausalLM.from_pretrained(
"meta-llama/Llama-2-7b-hf",
load_in_8bit=True,
device_map="auto"
)
def print_model_size(model, label):
"""Print model memory usage"""
total_params = sum(p.numel() for p in model.parameters())
total_bytes = sum(
p.numel() * p.element_size() for p in model.parameters()
)
print(f"{label}: {total_params/1e9:.2f}B params, {total_bytes/1e9:.2f} GB")
print_model_size(model_8bit, "INT8 model")
# INT8 model: 6.74B params, ~7.0 GB
8.2 4-bit Quantization (Used in QLoRA)
import bitsandbytes as bnb
from transformers import BitsAndBytesConfig
# NF4 quantization config (QLoRA)
bnb_config_nf4 = BitsAndBytesConfig(
load_in_4bit=True,
bnb_4bit_quant_type="nf4",
bnb_4bit_compute_dtype=torch.bfloat16,
bnb_4bit_use_double_quant=True, # Double quantization
)
# FP4 quantization config
bnb_config_fp4 = BitsAndBytesConfig(
load_in_4bit=True,
bnb_4bit_quant_type="fp4",
bnb_4bit_compute_dtype=torch.float16,
)
# Load model
model_4bit = AutoModelForCausalLM.from_pretrained(
"meta-llama/Llama-2-7b-hf",
quantization_config=bnb_config_nf4,
device_map="auto"
)
print_model_size(model_4bit, "NF4 model")
# NF4 model: 6.74B params, ~4.0 GB (with double quantization)
# QLoRA fine-tuning setup
from peft import LoraConfig, get_peft_model, prepare_model_for_kbit_training
model_4bit = prepare_model_for_kbit_training(model_4bit)
lora_config = LoraConfig(
r=64,
lora_alpha=16,
target_modules=["q_proj", "v_proj"],
lora_dropout=0.05,
bias="none",
task_type="CAUSAL_LM"
)
model_lora = get_peft_model(model_4bit, lora_config)
model_lora.print_trainable_parameters()
# prints the trainable parameter count and percentage (only the LoRA adapter weights are trainable)
8.3 NF4 vs FP4
NF4 (Normal Float 4)
- Non-linear 4-bit quantization assuming a normal distribution
- Leverages the observation that weight distributions are approximately normal
- Better representational power at the same bit count
FP4 (Float 4)
- Floating-point based 4-bit
- Can represent wider ranges
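The benefit of normal-shaped levels can be sketched numerically. The real NF4 codebook is a fixed 16-value table from the QLoRA paper; here we only approximate the idea with equal-probability-mass quantiles of a standard normal and compare against uniform INT4-style levels on Gaussian weights:

```python
import torch

torch.manual_seed(0)
w = torch.randn(100_000)  # LLM weights are approximately normally distributed

def quantize_to_levels(x, levels):
    # nearest-codebook quantization after absmax scaling into [-1, 1]
    absmax = x.abs().max()
    idx = (x.unsqueeze(1) / absmax - levels.unsqueeze(0)).abs().argmin(dim=1)
    return levels[idx] * absmax

# INT4-style: 16 uniformly spaced levels
uniform_levels = torch.linspace(-1, 1, 16)

# NF4-style: 16 levels at equal-probability-mass quantiles of N(0, 1)
normal = torch.distributions.Normal(0.0, 1.0)
quantiles = torch.linspace(0.5 / 16, 1 - 0.5 / 16, 16)
nf_levels = normal.icdf(quantiles)
nf_levels = nf_levels / nf_levels.abs().max()   # normalize to [-1, 1]

err_uniform = (w - quantize_to_levels(w, uniform_levels)).abs().mean()
err_nf = (w - quantize_to_levels(w, nf_levels)).abs().mean()
print(f"uniform: {err_uniform:.4f}, normal-quantile: {err_nf:.4f}")
```

The quantile levels are denser near zero, where most of the weight mass sits, so the mean error drops at the same bit count.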
9. SmoothQuant: W8A8 Quantization
SmoothQuant quantizes both weights (W) and activations (A) to INT8 for faster inference.
9.1 The Activation Outlier Problem
LLM activations exhibit very large values (outliers) in specific channels, making W8A8 quantization challenging.
9.2 Migration Scaling
SmoothQuant's key insight: transfer the difficulty from activations to weights.
Y = (X * diag(s)^(-1)) * (diag(s) * W)
= X_smooth * W_smooth
def smooth_quantize(
model,
calibration_samples,
alpha: float = 0.5
):
"""
Apply SmoothQuant
Args:
alpha: migration strength (0=weights only, 1=activations only)
Recommended: 0.5 (equal distribution)
"""
act_scales = {}
def collect_scales(name):
def hook(module, input, output):
inp = input[0].detach()
if inp.dim() == 3:
inp = inp.reshape(-1, inp.size(-1))
channel_max = inp.abs().max(dim=0)[0]
if name not in act_scales:
act_scales[name] = channel_max
else:
act_scales[name] = torch.maximum(act_scales[name], channel_max)
return hook
handles = []
for name, module in model.named_modules():
if isinstance(module, torch.nn.Linear):
handles.append(module.register_forward_hook(collect_scales(name)))
with torch.no_grad():
for sample in calibration_samples:
model(**sample)
for h in handles:
h.remove()
for name, module in model.named_modules():
if isinstance(module, torch.nn.Linear) and name in act_scales:
act_scale = act_scales[name]
weight_scale = module.weight.abs().max(dim=0)[0]
# Compute migration scale
smooth_scale = (act_scale ** alpha) / (weight_scale ** (1 - alpha))
smooth_scale = torch.clamp(smooth_scale, min=1e-5)
# Apply scale to weights: W_smooth = W * diag(s) per input channel
# (a full implementation folds the matching X / diag(s) into the preceding LayerNorm)
module.weight.data = module.weight.data * smooth_scale.unsqueeze(0)
return model, act_scales
10. SpQR: Sparse Quantization Representation
SpQR stores important weights (outliers) in FP16 separately and quantizes the remainder to low precision.
import torch
def spqr_quantize(weight: torch.Tensor,
num_bits: int = 3,
outlier_threshold_percentile: float = 1.0):
"""
SpQR quantization (simplified version)
Core: store top p% outliers as FP16, quantize rest to low bits
"""
threshold = torch.quantile(weight.abs(), 1 - outlier_threshold_percentile / 100)
outlier_mask = weight.abs() > threshold
# Store outliers (FP16)
outlier_values = weight.clone()
outlier_values[~outlier_mask] = 0
# Quantize remainder
regular_weight = weight.clone()
regular_weight[outlier_mask] = 0
q_max = 2 ** (num_bits - 1) - 1
group_size = 16
rows, cols = regular_weight.shape
regular_grouped = regular_weight.reshape(-1, group_size)
max_abs = regular_grouped.abs().max(dim=1, keepdim=True)[0]
scales = max_abs / q_max
scales = torch.clamp(scales, min=1e-8)
q = torch.clamp(torch.round(regular_grouped / scales), -q_max, q_max).to(torch.int8)
regular_dequant = (scales * q.float()).reshape(rows, cols)
reconstructed = regular_dequant + outlier_values
error = (weight - reconstructed).abs().mean().item()
outlier_memory = outlier_mask.sum().item() * 2
regular_memory = (~outlier_mask).sum().item() * (num_bits / 8)
total_memory = outlier_memory + regular_memory
original_memory = weight.numel() * weight.element_size()
compression_ratio = original_memory / total_memory
print(f"Outlier ratio: {outlier_mask.float().mean():.2%}")
print(f"Mean reconstruction error: {error:.6f}")
print(f"Compression ratio: {compression_ratio:.2f}x")
return q, scales, outlier_values, outlier_mask
11. Quantization Benchmark Comparison
11.1 Llama-2-7B Benchmark
import time
import torch
import GPUtil
def benchmark_quantization(model, tokenizer, device="cuda", num_runs=50):
    """Benchmark a (quantized) model: memory, latency, throughput"""
    prompt = "The history of artificial intelligence began"
    inputs = tokenizer(prompt, return_tensors="pt").to(device)

    # Warmup (also triggers CUDA kernel compilation)
    with torch.no_grad():
        for _ in range(5):
            model.generate(**inputs, max_new_tokens=50, do_sample=False)

    # Measure memory after the model has actually run, not before
    memory_used_gb = 0.0
    if device == "cuda":
        torch.cuda.synchronize()
        gpu = GPUtil.getGPUs()[0]
        memory_used_gb = gpu.memoryUsed / 1024

    # Measure speed
    if device == "cuda":
        torch.cuda.synchronize()
    start = time.time()
    with torch.no_grad():
        for _ in range(num_runs):
            model.generate(**inputs, max_new_tokens=50, do_sample=False)
    if device == "cuda":
        torch.cuda.synchronize()
    elapsed = time.time() - start

    avg_time = elapsed / num_runs
    tokens_per_second = 50 / avg_time  # 50 new tokens per run

    return {
        "memory_gb": memory_used_gb,
        "avg_time_ms": avg_time * 1000,
        "tokens_per_second": tokens_per_second,
    }
# Example results (A100 80GB, Llama-2-7B)
benchmark_results = {
    "FP16": {"memory_gb": 13.5, "tokens_per_second": 52.3, "ppl": 5.68},
    "INT8 (bitsandbytes)": {"memory_gb": 7.8, "tokens_per_second": 38.1, "ppl": 5.71},
    "INT4 GPTQ": {"memory_gb": 4.5, "tokens_per_second": 65.2, "ppl": 5.89},
    "INT4 AWQ": {"memory_gb": 4.3, "tokens_per_second": 68.7, "ppl": 5.86},
    "Q4_K_M (GGUF, CPU)": {"memory_gb": 4.1, "tokens_per_second": 45.2, "ppl": 5.91},
    "INT4 NF4": {"memory_gb": 4.0, "tokens_per_second": 31.5, "ppl": 5.94},
}

print("=" * 80)
print(f"{'Method':<25} {'Memory (GB)':<12} {'Tok/s':<12} {'PPL':<8}")
print("=" * 80)
for method, stats in benchmark_results.items():
    print(f"{method:<25} {stats['memory_gb']:<12.1f} {stats['tokens_per_second']:<12.1f} {stats['ppl']:<8.2f}")
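The example numbers above are easier to compare as ratios against the FP16 baseline. The sketch below reproduces a subset of the table so it runs standalone; the PPL deltas show how little quality is traded for the memory savings.

```python
# Example figures copied from the A100 / Llama-2-7B table above
results = {
    "FP16":                {"memory_gb": 13.5, "tokens_per_second": 52.3, "ppl": 5.68},
    "INT8 (bitsandbytes)": {"memory_gb": 7.8,  "tokens_per_second": 38.1, "ppl": 5.71},
    "INT4 GPTQ":           {"memory_gb": 4.5,  "tokens_per_second": 65.2, "ppl": 5.89},
    "INT4 AWQ":            {"memory_gb": 4.3,  "tokens_per_second": 68.7, "ppl": 5.86},
}

base = results["FP16"]
for method, s in results.items():
    compression = base["memory_gb"] / s["memory_gb"]   # memory reduction vs FP16
    speedup = s["tokens_per_second"] / base["tokens_per_second"]
    ppl_delta = s["ppl"] - base["ppl"]                 # quality cost vs FP16
    print(f"{method:<22} {compression:4.2f}x memory, "
          f"{speedup:4.2f}x speed, PPL {ppl_delta:+.2f}")
```

INT4 AWQ ends up roughly 3.1x smaller and 1.3x faster than FP16 at a perplexity cost of only +0.18, which is why it is a common default for GPU serving.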
12. Practical Guide: Choosing the Right Quantization Method
12.1 Strategy by Model Size
Small models under 7B:
- GGUF Q4_K_M: optimal for local CPU execution
- AWQ INT4: recommended for GPU server deployment
- FP16: viable if memory allows (a 7B model needs ~14GB, so a 24GB GPU suffices)
Mid-size models 13B–30B:
- GPTQ INT4 or AWQ INT4: runs on a single 24GB GPU
- GGUF Q4_K_M: can run in 16GB RAM
Large models 70B+:
- GPTQ INT4: runs on a single A100 80GB
- GPTQ INT2: for extreme compression
- Multi-GPU + Tensor Parallel combination
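The size brackets above follow from a simple rule of thumb: weight memory is roughly parameters x bits / 8, ignoring activations and the KV cache. A minimal sketch:

```python
def weight_memory_gb(params_billion: float, bits: int) -> float:
    """Approximate weight-only memory: 1B params at 8 bits ~= 1 GB."""
    return params_billion * bits / 8

# Tabulate the common sizes against the common precisions
for params in (7, 13, 70):
    row = ", ".join(
        f"{name}: {weight_memory_gb(params, bits):5.1f} GB"
        for name, bits in [("FP16", 16), ("INT8", 8), ("INT4", 4)]
    )
    print(f"{params:>3}B -> {row}")
```

This recovers the claims in the lists: 70B at INT4 is ~35GB (fits a single A100 80GB), while 13B at INT4 is ~6.5GB (fits a 24GB GPU with room for the KV cache). Budget extra headroom on top of these figures for activations.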
12.2 Strategy by Task
def recommend_quantization(
    task: str,
    model_size_b: float,
    gpu_memory_gb: float,
    cpu_only: bool = False,
    fine_tuning_needed: bool = False,
):
    """Recommend quantization based on task and environment"""
    recommendations = []

    if cpu_only:
        recommendations.append({
            "method": "GGUF Q4_K_M",
            "reason": "Optimized for CPU inference, based on llama.cpp",
            "library": "llama-cpp-python",
        })
        return recommendations

    if fine_tuning_needed:
        recommendations.append({
            "method": "bitsandbytes NF4 + QLoRA",
            "reason": "Fine-tuning capable, LoRA adapter training with ~4GB overhead",
            "library": "bitsandbytes + peft",
        })
        return recommendations

    # Weight-only memory estimates: bytes per parameter x billions of parameters
    fp16_memory = model_size_b * 2
    int8_memory = model_size_b * 1
    int4_memory = model_size_b * 0.5

    if fp16_memory <= gpu_memory_gb * 0.8:
        recommendations.append({
            "method": "FP16 (baseline)",
            "reason": "Memory is sufficient, best quality",
            "memory_gb": fp16_memory,
        })

    if int8_memory <= gpu_memory_gb * 0.8:
        if task in ["chat", "completion", "summarization"]:
            recommendations.append({
                "method": "INT8 (bitsandbytes LLM.int8())",
                "reason": "Good balance of quality and memory",
                "library": "bitsandbytes",
                "memory_gb": int8_memory,
            })

    if int4_memory <= gpu_memory_gb * 0.8:
        recommendations.append({
            "method": "AWQ INT4",
            "reason": "Fast inference, excellent quality",
            "library": "autoawq",
            "memory_gb": int4_memory,
        })
        recommendations.append({
            "method": "GPTQ INT4",
            "reason": "Best INT4 quality, slower quantization process",
            "library": "auto-gptq",
            "memory_gb": int4_memory,
        })

    return recommendations
# Example usage
recommendations = recommend_quantization(
    task="chat",
    model_size_b=7.0,
    gpu_memory_gb=16.0,
    fine_tuning_needed=False,
)

for rec in recommendations:
    print(f"\nMethod: {rec['method']}")
    print(f"Reason: {rec['reason']}")
    if 'library' in rec:
        print(f"Library: {rec['library']}")
    if 'memory_gb' in rec:
        print(f"Expected memory: {rec['memory_gb']:.1f} GB")
12.3 Complete Quantization Pipeline
import torch
import os
from auto_gptq import AutoGPTQForCausalLM, BaseQuantizeConfig
from awq import AutoAWQForCausalLM
from transformers import AutoTokenizer
class QuantizationPipeline:
    """Unified quantization pipeline"""

    def __init__(self, model_name: str, output_base_dir: str):
        self.model_name = model_name
        self.output_base_dir = output_base_dir
        self.tokenizer = AutoTokenizer.from_pretrained(model_name)
        os.makedirs(output_base_dir, exist_ok=True)

    def quantize_gptq(self, bits: int = 4, group_size: int = 128):
        """GPTQ quantization"""
        output_dir = os.path.join(self.output_base_dir, f"gptq-{bits}bit")
        config = BaseQuantizeConfig(
            bits=bits,
            group_size=group_size,
            sym=True,
            desc_act=False,
        )
        model = AutoGPTQForCausalLM.from_pretrained(
            self.model_name,
            quantize_config=config,
        )
        calibration_data = self._prepare_calibration_data()
        model.quantize(calibration_data)
        model.save_quantized(output_dir)
        self.tokenizer.save_pretrained(output_dir)
        print(f"GPTQ {bits}bit saved: {output_dir}")
        return output_dir

    def quantize_awq(self, bits: int = 4, group_size: int = 128):
        """AWQ quantization"""
        output_dir = os.path.join(self.output_base_dir, f"awq-{bits}bit")
        model = AutoAWQForCausalLM.from_pretrained(
            self.model_name,
            low_cpu_mem_usage=True,
        )
        quant_config = {
            "zero_point": True,
            "q_group_size": group_size,
            "w_bit": bits,
            "version": "GEMM",
        }
        model.quantize(self.tokenizer, quant_config=quant_config)
        model.save_quantized(output_dir)
        self.tokenizer.save_pretrained(output_dir)
        print(f"AWQ {bits}bit saved: {output_dir}")
        return output_dir

    def _prepare_calibration_data(self, num_samples: int = 128):
        """Prepare calibration data (auto-gptq expects dicts with input_ids / attention_mask)"""
        from datasets import load_dataset
        dataset = load_dataset("wikitext", "wikitext-2-raw-v1", split="train")
        data = []
        for text in dataset["text"]:
            if len(text.strip()) > 50:
                encoded = self.tokenizer(
                    text.strip(),
                    return_tensors="pt",
                    max_length=2048,
                    truncation=True,
                )
                data.append({
                    "input_ids": encoded["input_ids"],
                    "attention_mask": encoded["attention_mask"],
                })
            if len(data) >= num_samples:
                break
        return data
Conclusion
Model quantization is a foundational technology for democratizing LLMs. To summarize what we covered:
- Fundamentals: The math behind compressing FP32 to INT4 (scale, zero_point)
- PTQ vs QAT: PTQ is practical without retraining; QAT is essential for extreme compression
- GPTQ: Best INT4 quality via Hessian-based error compensation
- AWQ: Fast and efficient quantization based on activation distributions
- GGUF: Optimized for CPU execution, multiple quality levels available
- bitsandbytes: HuggingFace integration, essential for QLoRA fine-tuning
Recommended strategies:
- Local execution: GGUF Q4_K_M
- GPU server deployment: AWQ 4-bit
- Quality-critical scenarios: GPTQ 4-bit or FP16
- Fine-tuning needed: bitsandbytes NF4 + QLoRA
Quantization technology is evolving rapidly, with 2-bit methods like QuIP# and AQLM emerging. The journey toward smaller, faster models continues.
References
- GPTQ: arXiv:2210.17323
- AWQ: arXiv:2306.00978
- llama.cpp: github.com/ggerganov/llama.cpp
- bitsandbytes: github.com/TimDettmers/bitsandbytes
- SmoothQuant: arXiv:2211.10438
- SpQR: arXiv:2306.03078
- PyTorch Quantization: pytorch.org/docs/stable/quantization.html