Deep Learning Model Quantization Complete Guide: Master INT8, INT4, GPTQ, AWQ, GGUF


Introduction

As deep learning models grow increasingly large, inference costs and memory requirements have skyrocketed. GPT-3 has 175B parameters, Llama 3 has 70B, and storing them in full FP32 precision requires 700GB and 280GB of memory respectively — completely infeasible for typical GPUs.

Model Quantization is the key technology to solve this problem. By compressing 32-bit floating-point (FP32) weights to 8-bit or 4-bit integers, it reduces memory by 4–8x and improves inference speed by 2–4x. Remarkably, quality loss is minimal.

In this guide, we thoroughly explore quantization from mathematical foundations to the latest techniques: GPTQ, AWQ, GGUF, and bitsandbytes.


1. Quantization Fundamentals: Understanding Number Representations

1.1 Floating-Point Formats

Understanding the floating-point formats used in modern deep learning is the starting point for quantization.

FP32 (Float32)

  • Sign (1 bit) + Exponent (8 bits) + Mantissa (23 bits) = 32 bits total
  • Range: approximately -3.4e38 to 3.4e38
  • Precision: ~7 decimal digits

FP16 (Float16)

  • Sign (1 bit) + Exponent (5 bits) + Mantissa (10 bits) = 16 bits total
  • Range: -65504 to 65504 (much narrower than FP32)
  • Precision: ~3 decimal digits
  • Risk of overflow/underflow during training; requires loss (gradient) scaling

BF16 (Brain Float16)

  • Sign (1 bit) + Exponent (8 bits) + Mantissa (7 bits) = 16 bits total
  • Maintains the same exponent range as FP32 while reducing mantissa bits
  • Much lower overflow risk, making it safer for deep learning training
  • Developed by Google Brain, natively supported on modern GPUs (A100, H100)
import torch
import numpy as np

# Check memory size of each data type
x_fp32 = torch.tensor([1.5, -2.3, 0.7], dtype=torch.float32)
x_fp16 = torch.tensor([1.5, -2.3, 0.7], dtype=torch.float16)
x_bf16 = torch.tensor([1.5, -2.3, 0.7], dtype=torch.bfloat16)

print(f"FP32: {x_fp32.element_size()} bytes per element")  # 4 bytes
print(f"FP16: {x_fp16.element_size()} bytes per element")  # 2 bytes
print(f"BF16: {x_bf16.element_size()} bytes per element")  # 2 bytes

# Memory calculation for a 7B parameter model
params = 7e9
fp32_memory_gb = params * 4 / 1e9
fp16_memory_gb = params * 2 / 1e9
int8_memory_gb = params * 1 / 1e9
int4_memory_gb = params * 0.5 / 1e9

print(f"\n7B Model Memory Requirements:")
print(f"FP32: {fp32_memory_gb:.1f} GB")   # 28.0 GB
print(f"FP16: {fp16_memory_gb:.1f} GB")   # 14.0 GB
print(f"INT8: {int8_memory_gb:.1f} GB")   # 7.0 GB
print(f"INT4: {int4_memory_gb:.1f} GB")   # 3.5 GB

1.2 Integer Representations

The core of quantization is mapping floating-point values to integers.

  • INT8: -128 to 127 (signed) or 0 to 255 (unsigned)
  • INT4: -8 to 7 (signed) or 0 to 15 (unsigned)
  • INT2: -2 to 1 (signed) or 0 to 3 (unsigned)
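These ranges follow directly from the bit width, which a few lines can verify:

```python
def int_range(num_bits: int, signed: bool = True):
    """Representable range of a two's-complement (signed) or unsigned integer."""
    if signed:
        return -(2 ** (num_bits - 1)), 2 ** (num_bits - 1) - 1
    return 0, 2 ** num_bits - 1

for bits in (8, 4, 2):
    print(f"INT{bits}: signed {int_range(bits)}, unsigned {int_range(bits, signed=False)}")
# INT8: signed (-128, 127), unsigned (0, 255)
# INT4: signed (-8, 7), unsigned (0, 15)
# INT2: signed (-2, 1), unsigned (0, 3)
```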

1.3 Quantization Formula

The fundamental formula to convert a floating-point value x to integer q:

q = clamp(round(x / scale) + zero_point, q_min, q_max)

Dequantization:

x_approx = scale * (q - zero_point)

Where:

  • scale: quantization scale factor (scale = (max_val - min_val) / (q_max - q_min))
  • zero_point: the integer value that the real number 0.0 maps to (an offset into the integer range)
  • q_min, q_max: integer range bounds (-128, 127 for INT8)
import torch
import numpy as np

def symmetric_quantize(x: torch.Tensor, num_bits: int = 8):
    """Symmetric quantization implementation"""
    q_max = 2 ** (num_bits - 1) - 1  # 127 for INT8
    q_min = -q_max  # -127

    # Compute scale
    max_abs = x.abs().max()
    scale = max_abs / q_max

    # Quantize
    q = torch.clamp(torch.round(x / scale), q_min, q_max).to(torch.int8)

    return q, scale

def asymmetric_quantize(x: torch.Tensor, num_bits: int = 8):
    """Asymmetric quantization implementation"""
    q_max = 2 ** num_bits - 1  # 255 for UINT8
    q_min = 0

    # Compute scale and zero_point
    min_val = x.min()
    max_val = x.max()
    scale = (max_val - min_val) / (q_max - q_min)
    zero_point = q_min - torch.round(min_val / scale)
    zero_point = torch.clamp(zero_point, q_min, q_max).to(torch.int32)

    # Quantize
    q = torch.clamp(torch.round(x / scale) + zero_point, q_min, q_max).to(torch.uint8)

    return q, scale, zero_point

def dequantize(q: torch.Tensor, scale: torch.Tensor, zero_point: torch.Tensor = None):
    """Dequantization"""
    if zero_point is None:
        return scale * q.float()
    return scale * (q.float() - zero_point.float())

# Test
x = torch.randn(100)
print(f"Original data range: [{x.min():.4f}, {x.max():.4f}]")

# Symmetric quantization
q_sym, scale_sym = symmetric_quantize(x)
x_reconstructed_sym = dequantize(q_sym, scale_sym)
error_sym = (x - x_reconstructed_sym).abs().mean()
print(f"Symmetric quantization mean error: {error_sym:.6f}")

# Asymmetric quantization
q_asym, scale_asym, zp_asym = asymmetric_quantize(x)
x_reconstructed_asym = dequantize(q_asym, scale_asym, zp_asym)
error_asym = (x - x_reconstructed_asym).abs().mean()
print(f"Asymmetric quantization mean error: {error_asym:.6f}")

1.4 Symmetric vs Asymmetric Quantization

Symmetric Quantization

  • zero_point = 0
  • Symmetric positive/negative range
  • Suitable for weights (mostly zero-centered distribution)
  • Simpler computation: x_approx = scale * q

Asymmetric Quantization

  • zero_point != 0
  • Can represent arbitrary ranges
  • Suitable for activations (always non-negative after ReLU)
  • More complex computation: x_approx = scale * (q - zero_point)
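A quick experiment makes the trade-off concrete: for non-negative ReLU outputs, symmetric quantization wastes the entire negative half of the integer range, roughly doubling the error compared to asymmetric quantization.

```python
import torch

torch.manual_seed(0)
acts = torch.relu(torch.randn(10000))  # non-negative, ReLU-style activations

# Symmetric INT8: the negative codes -127..-1 are never used
scale_sym = acts.abs().max() / 127
q_sym = torch.clamp(torch.round(acts / scale_sym), -127, 127)
err_sym = (acts - q_sym * scale_sym).abs().mean()

# Asymmetric UINT8: [min, max] is mapped onto the full 0..255 range
scale_asym = (acts.max() - acts.min()) / 255
zp = torch.round(-acts.min() / scale_asym)
q_asym = torch.clamp(torch.round(acts / scale_asym) + zp, 0, 255)
err_asym = (acts - (q_asym - zp) * scale_asym).abs().mean()

print(f"symmetric error:  {err_sym:.6f}")
print(f"asymmetric error: {err_asym:.6f}")  # roughly half the symmetric error
```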

1.5 Quantization Granularity

Determines how many parameters share a single scale/zero_point.

Per-Tensor: One scale for the entire tensor

  • Minimal memory overhead
  • Largest precision loss

Per-Channel (Per-Row/Column): Individual scale per channel

  • Separate scale for each row/column of the weight matrix
  • Effectively handles distribution differences across channels

Per-Group (Per-Block): Individual scale per fixed-size group

  • Typical group_size = 128
  • Compromise between per-channel and per-tensor
  • Commonly used in GPTQ and AWQ
import torch

def per_group_quantize(weight: torch.Tensor, group_size: int = 128, num_bits: int = 4):
    """Per-Group quantization implementation"""
    rows, cols = weight.shape

    # Split into groups
    weight_grouped = weight.reshape(-1, group_size)

    # Max/min per group
    max_vals = weight_grouped.max(dim=1, keepdim=True)[0]
    min_vals = weight_grouped.min(dim=1, keepdim=True)[0]

    q_max = 2 ** num_bits - 1  # 15 for INT4

    # Compute scales
    scales = (max_vals - min_vals) / q_max
    scales = torch.clamp(scales, min=1e-8)  # guard against constant (zero-range) groups
    zero_points = torch.round(-min_vals / scales)

    # Quantize
    q = torch.clamp(torch.round(weight_grouped / scales) + zero_points, 0, q_max)

    # Dequantize
    weight_dequant = scales * (q - zero_points)
    weight_dequant = weight_dequant.reshape(rows, cols)

    return q, scales, zero_points, weight_dequant

# Example: Transformer weight quantization
weight = torch.randn(4096, 4096)  # Llama-style weight
q, scales, zp, weight_dequant = per_group_quantize(weight, group_size=128, num_bits=4)

error = (weight - weight_dequant).abs().mean()
print(f"Per-Group INT4 quantization mean error: {error:.6f}")
packed_bytes = q.numel() / 2 + scales.numel() * 4  # 2 packed INT4 values per byte + FP32 scales
print(f"Compression ratio: {weight.element_size() * weight.numel() / packed_bytes:.2f}x")

2. Post-Training Quantization (PTQ)

PTQ quantizes an already-trained model without retraining — the most practical and widely used approach.

2.1 Calibration Dataset

PTQ uses a small calibration dataset to determine appropriate scale and zero_point values.

from transformers import AutoModelForCausalLM, AutoTokenizer
import torch
from datasets import load_dataset

def collect_calibration_data(model_name: str, num_samples: int = 128):
    """Collect calibration data"""
    tokenizer = AutoTokenizer.from_pretrained(model_name)

    # WikiText-2 or C4 dataset is commonly used
    dataset = load_dataset("wikitext", "wikitext-2-raw-v1", split="train")

    texts = []
    for item in dataset:
        if len(item['text'].strip()) > 100:
            texts.append(item['text'].strip())
        if len(texts) >= num_samples:
            break

    # Tokenize
    encoded = [
        tokenizer(text, return_tensors="pt", max_length=2048, truncation=True)
        for text in texts
    ]

    return encoded

def collect_activation_stats(model, calibration_data, layer_name: str):
    """Collect activation statistics for a specific layer"""
    stats = {"min": float("inf"), "max": float("-inf")}

    def hook_fn(module, input, output):
        with torch.no_grad():
            act = output.detach().float()
            stats["min"] = min(stats["min"], act.min().item())
            stats["max"] = max(stats["max"], act.max().item())

    # Register hook
    target_layer = dict(model.named_modules())[layer_name]
    handle = target_layer.register_forward_hook(hook_fn)

    # Run calibration data
    model.eval()
    with torch.no_grad():
        for batch in calibration_data[:32]:
            model(**batch)

    handle.remove()
    return stats

2.2 Min-Max Calibration

The simplest method: uses the global minimum and maximum values from calibration data.

class MinMaxCalibrator:
    """Min-Max calibrator"""

    def __init__(self):
        self.min_val = float("inf")
        self.max_val = float("-inf")

    def update(self, tensor: torch.Tensor):
        self.min_val = min(self.min_val, tensor.min().item())
        self.max_val = max(self.max_val, tensor.max().item())

    def compute_scale_zp(self, num_bits: int = 8, symmetric: bool = True):
        q_max = 2 ** (num_bits - 1) - 1 if symmetric else 2 ** num_bits - 1

        if symmetric:
            max_abs = max(abs(self.min_val), abs(self.max_val))
            scale = max_abs / q_max
            zero_point = 0
        else:
            scale = (self.max_val - self.min_val) / q_max
            zero_point = -round(self.min_val / scale)

        return scale, zero_point

2.3 Histogram Calibration

To reduce the impact of outliers, finds the optimal range based on the distribution histogram.

import numpy as np
from scipy import stats

class HistogramCalibrator:
    """Histogram-based calibrator (minimizes KL Divergence)"""

    def __init__(self, num_bins: int = 2048):
        self.num_bins = num_bins
        self.histogram = None
        self.bin_edges = None

    def update(self, tensor: torch.Tensor):
        data = tensor.detach().float().numpy().flatten()

        if self.histogram is None:
            self.histogram, self.bin_edges = np.histogram(data, bins=self.num_bins)
        else:
            new_hist, _ = np.histogram(data, bins=self.bin_edges)
            self.histogram += new_hist

    def compute_optimal_range(self, num_bits: int = 8):
        """Search for optimal range minimizing KL Divergence"""
        num_quantized_bins = 2 ** num_bits - 1

        best_kl = float("inf")
        best_threshold = None

        for i in range(num_quantized_bins, len(self.histogram)):
            reference = self.histogram[:i].copy().astype(float)
            reference /= reference.sum()

            quantized = np.zeros(i)
            bin_size = i / num_quantized_bins

            for j in range(num_quantized_bins):
                start = int(j * bin_size)
                end = int((j + 1) * bin_size)
                quantized[start:end] = reference[start:end].sum() / (end - start)

            quantized = np.where(quantized == 0, 1e-10, quantized)
            reference_clipped = np.where(reference == 0, 1e-10, reference)

            kl = stats.entropy(reference_clipped, quantized)

            if kl < best_kl:
                best_kl = kl
                best_threshold = self.bin_edges[i]

        return -best_threshold, best_threshold

2.4 Impact on Perplexity

The most common metric for measuring quantization quality is Perplexity (PPL).

import torch
import math
from transformers import AutoModelForCausalLM, AutoTokenizer

def compute_perplexity(model, tokenizer, text: str, device: str = "cuda"):
    """Compute perplexity"""
    encodings = tokenizer(text, return_tensors="pt")
    input_ids = encodings.input_ids.to(device)

    max_length = 1024
    stride = 512

    nlls = []
    prev_end_loc = 0

    for begin_loc in range(0, input_ids.size(1), stride):
        end_loc = min(begin_loc + max_length, input_ids.size(1))
        trg_len = end_loc - prev_end_loc

        input_ids_chunk = input_ids[:, begin_loc:end_loc]
        target_ids = input_ids_chunk.clone()
        target_ids[:, :-trg_len] = -100

        with torch.no_grad():
            outputs = model(input_ids_chunk, labels=target_ids)
            neg_log_likelihood = outputs.loss

        nlls.append(neg_log_likelihood)
        prev_end_loc = end_loc

        if end_loc == input_ids.size(1):
            break

    ppl = torch.exp(torch.stack(nlls).mean())
    return ppl.item()

# Example PPL comparison
# FP16:          PPL ~5.68
# INT8:          PPL ~5.71 (~0.5% increase)
# INT4 (GPTQ):   PPL ~5.89 (~3.7% increase)
# INT4 (naive):  PPL ~6.52 (~14.8% increase)

3. Quantization-Aware Training (QAT)

QAT simulates quantization during training so the model adapts to quantization noise.

3.1 Fake Quantization

Simulates quantization effects in FP32 instead of actual INT8 operations.

import torch
import torch.nn as nn
import torch.nn.functional as F

class FakeQuantize(nn.Module):
    """Fake quantization module"""

    def __init__(self, num_bits: int = 8, symmetric: bool = True):
        super().__init__()
        self.num_bits = num_bits
        self.symmetric = symmetric

        self.register_buffer('scale', torch.tensor(1.0))
        self.register_buffer('zero_point', torch.tensor(0.0))
        self.register_buffer('fake_quant_enabled', torch.tensor([1]))

        if symmetric:
            self.q_min = -(2 ** (num_bits - 1))
            self.q_max = 2 ** (num_bits - 1) - 1
        else:
            self.q_min = 0
            self.q_max = 2 ** num_bits - 1

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        if self.fake_quant_enabled[0] == 0:
            return x

        # Update scale/zero_point with exponential moving average during training
        if self.training:
            with torch.no_grad():
                if self.symmetric:
                    max_abs = x.abs().max()
                    new_scale = max_abs / self.q_max
                    new_zp = torch.zeros_like(self.zero_point)
                else:
                    new_scale = (x.max() - x.min()) / (self.q_max - self.q_min)
                    new_zp = torch.round(-x.min() / new_scale)

                self.scale.copy_(0.9 * self.scale + 0.1 * new_scale)
                self.zero_point.copy_(torch.round(0.9 * self.zero_point + 0.1 * new_zp))

        # Fake quantize: quantize then dequantize
        # (without the zero_point shift, asymmetric mode would clip all negative inputs)
        x_scaled = x / self.scale + self.zero_point
        x_rounded = torch.round(x_scaled)
        x_clipped = torch.clamp(x_rounded, self.q_min, self.q_max)
        x_dequant = (x_clipped - self.zero_point) * self.scale

        return x_dequant

3.2 STE (Straight-Through Estimator)

class STERound(torch.autograd.Function):
    """Straight-Through Estimator for round()"""

    @staticmethod
    def forward(ctx, x):
        return torch.round(x)

    @staticmethod
    def backward(ctx, grad_output):
        # Pass gradient through round() unchanged (identity approximation)
        return grad_output

class STEClamp(torch.autograd.Function):
    """Straight-Through Estimator for clamp()"""

    @staticmethod
    def forward(ctx, x, min_val, max_val):
        ctx.save_for_backward(x)
        ctx.min_val = min_val
        ctx.max_val = max_val
        return torch.clamp(x, min_val, max_val)

    @staticmethod
    def backward(ctx, grad_output):
        x, = ctx.saved_tensors
        # Pass gradient only within clamp range
        grad = grad_output * ((x >= ctx.min_val) & (x <= ctx.max_val)).float()
        return grad, None, None

class QATLinear(nn.Module):
    """Linear layer with QAT applied"""

    def __init__(self, in_features, out_features, num_bits=8):
        super().__init__()
        self.linear = nn.Linear(in_features, out_features)
        self.weight_fake_quant = FakeQuantize(num_bits=num_bits)
        self.act_fake_quant = FakeQuantize(num_bits=num_bits, symmetric=False)

    def forward(self, x):
        # Activation quantization
        x_q = self.act_fake_quant(x)
        # Weight quantization
        w_q = self.weight_fake_quant(self.linear.weight)
        # FP32 compute (INT8 in actual deployment)
        return F.linear(x_q, w_q, self.linear.bias)
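To confirm the estimator behaves as intended, here is a self-contained check (restating STERound from above): the forward pass rounds for real, while the backward pass lets the gradient through as the identity.

```python
import torch

class STERound(torch.autograd.Function):
    @staticmethod
    def forward(ctx, x):
        return torch.round(x)          # real rounding in the forward pass

    @staticmethod
    def backward(ctx, grad_output):
        return grad_output             # identity gradient: round() no longer kills backprop

x = torch.tensor([0.3, 1.7, -2.4], requires_grad=True)
y = STERound.apply(x)
y.sum().backward()

print(y)       # tensor([ 0.,  2., -2.])
print(x.grad)  # tensor([1., 1., 1.])
```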

3.3 When is QAT Needed?

  • When PTQ quality loss is too high: Especially effective for small models (BERT-small, etc.)
  • Quantizing to INT4 or lower: Essential for extreme compression
  • Precision-sensitive tasks: Object detection, ASR, etc.
# QAT training workflow
import torch.optim as optim
from torch.quantization import prepare_qat, convert

def train_qat_model(model, train_loader, num_epochs=10):
    """QAT training example"""

    model.qconfig = torch.quantization.get_default_qat_qconfig('fbgemm')
    model_prepared = prepare_qat(model.train())

    optimizer = optim.Adam(model_prepared.parameters(), lr=1e-5)

    for epoch in range(num_epochs):
        for batch in train_loader:
            inputs, labels = batch
            outputs = model_prepared(inputs)
            loss = F.cross_entropy(outputs, labels)

            optimizer.zero_grad()
            loss.backward()
            optimizer.step()

    # Convert to INT8 model
    model_prepared.eval()
    model_quantized = convert(model_prepared)

    return model_quantized

4. PyTorch Quantization API

4.1 torch.ao.quantization

PyTorch's official quantization API.

import torch
from torch.ao.quantization import (
    get_default_qconfig,
    get_default_qat_qconfig,
    prepare,
    prepare_qat,
    convert
)

# Static quantization (PTQ)
def static_quantization_example():
    """Static quantization example"""
    model = MyModel()
    model.eval()

    # Backend config (fbgemm: x86, qnnpack: ARM)
    model.qconfig = get_default_qconfig('fbgemm')

    # Prepare for calibration
    model_prepared = prepare(model)

    # Collect statistics from calibration data
    with torch.no_grad():
        for data in calibration_loader:
            model_prepared(data)

    # Convert to INT8 model
    model_quantized = convert(model_prepared)

    return model_quantized

# Dynamic quantization (effective for LSTM, Linear)
def dynamic_quantization_example():
    """Dynamic quantization example"""
    model = MyModel()

    model_quantized = torch.quantization.quantize_dynamic(
        model,
        {nn.Linear, nn.LSTM},  # Layer types to quantize
        dtype=torch.qint8
    )

    return model_quantized

4.2 FX Graph Mode Quantization

A more flexible and powerful quantization approach.

from torch.ao.quantization.quantize_fx import prepare_fx, convert_fx
from torch.ao.quantization import QConfigMapping

def fx_quantization_example(model, calibration_data):
    """FX Graph Mode quantization"""
    model.eval()

    qconfig_mapping = QConfigMapping().set_global(
        get_default_qconfig('fbgemm')
    )

    example_inputs = (torch.randn(1, 3, 224, 224),)

    model_prepared = prepare_fx(
        model,
        qconfig_mapping,
        example_inputs
    )

    with torch.no_grad():
        for batch in calibration_data:
            model_prepared(batch)

    model_quantized = convert_fx(model_prepared)

    return model_quantized

5. GPTQ: Accurate Post-Training Quantization

GPTQ, published in 2022, is an LLM-specific quantization algorithm that minimizes quality loss even at INT4. (arXiv:2210.17323)

5.1 GPTQ Algorithm Principles

GPTQ is based on OBQ (Optimal Brain Quantization). The core idea: quantize each layer's weights one column at a time, and after quantizing each column, compensate for its quantization error by updating the remaining (not-yet-quantized) weights.

OBQ error minimization objective:

argmin_Q ||WX - QX||_F^2

Where W is the original weight, Q is the quantized weight, and X is the input activation.

Hessian-based weight update:

After quantizing each weight, the resulting error is propagated to the remaining weights using the inverse Hessian H^(-1).

import torch
import math

def gptq_quantize_weight(weight: torch.Tensor,
                          hessian: torch.Tensor,
                          num_bits: int = 4,
                          group_size: int = 128,
                          damp_percent: float = 0.01):
    """
    Quantize weights using the GPTQ algorithm

    Args:
        weight: [out_features, in_features] weight matrix
        hessian: [in_features, in_features] Hessian matrix (H = 2 * X @ X.T)
        num_bits: quantization bit count
        group_size: group size
        damp_percent: damping ratio for Hessian stabilization
    """
    W = weight.clone().float()
    n_rows, n_cols = W.shape

    # Hessian damping (numerical stability)
    H = hessian.clone().float()
    dead_cols = torch.diag(H) == 0
    H[dead_cols, dead_cols] = 1
    W[:, dead_cols] = 0

    damp = damp_percent * H.diag().mean()
    H.diagonal().add_(damp)

    # Inverse Hessian via Cholesky decomposition
    H_inv = torch.linalg.cholesky(H)
    H_inv = torch.cholesky_inverse(H_inv)
    H_inv = torch.linalg.cholesky(H_inv, upper=True)

    Q = torch.zeros_like(W)
    Losses = torch.zeros_like(W)

    q_max = 2 ** (num_bits - 1) - 1

    for col_idx in range(n_cols):
        w_col = W[:, col_idx]
        h_inv_diag = H_inv[col_idx, col_idx]

        # Compute per-group scale
        if group_size != -1 and col_idx % group_size == 0:
            group_end = min(col_idx + group_size, n_cols)
            w_group = W[:, col_idx:group_end]
            max_abs = w_group.abs().max(dim=1)[0].unsqueeze(1)
            scale = max_abs / q_max
            scale = torch.clamp(scale, min=1e-8)

        # Quantize
        q_col = torch.clamp(torch.round(w_col / scale.squeeze()), -q_max, q_max)
        q_col = q_col * scale.squeeze()
        Q[:, col_idx] = q_col

        # Quantization error
        err = (w_col - q_col) / h_inv_diag
        Losses[:, col_idx] = err ** 2 / 2

        # Propagate error to remaining weights (the key step!)
        W[:, col_idx + 1:] -= err.unsqueeze(1) * H_inv[col_idx, col_idx + 1:].unsqueeze(0)

    return Q, Losses

5.2 Using AutoGPTQ

Practical GPTQ quantization uses the AutoGPTQ library.

from auto_gptq import AutoGPTQForCausalLM, BaseQuantizeConfig
from transformers import AutoTokenizer
import torch

def quantize_with_gptq(
    model_name: str,
    output_dir: str,
    bits: int = 4,
    group_size: int = 128
):
    """Quantize model with AutoGPTQ"""

    tokenizer = AutoTokenizer.from_pretrained(model_name, use_fast=True)

    quantize_config = BaseQuantizeConfig(
        bits=bits,
        group_size=group_size,
        damp_percent=0.01,
        desc_act=False,
        sym=True,
        true_sequential=True
    )

    model = AutoGPTQForCausalLM.from_pretrained(
        model_name,
        quantize_config=quantize_config
    )

    # Prepare calibration data
    from datasets import load_dataset
    dataset = load_dataset("wikitext", "wikitext-2-raw-v1", split="train")

    calibration_data = []
    for text in dataset["text"][:128]:
        if len(text.strip()) > 50:
            encoded = tokenizer(
                text.strip(),
                return_tensors="pt",
                max_length=2048,
                truncation=True
            )
            calibration_data.append(encoded["input_ids"].squeeze())

    print(f"Starting GPTQ {bits}bit quantization...")
    model.quantize(calibration_data)

    model.save_quantized(output_dir, use_safetensors=True)
    tokenizer.save_pretrained(output_dir)

    print(f"Quantization complete: {output_dir}")
    return model, tokenizer


def load_gptq_model(model_dir: str, device: str = "cuda"):
    """Load GPTQ quantized model"""

    model = AutoGPTQForCausalLM.from_quantized(
        model_dir,
        device=device,
        use_triton=False,
        disable_exllama=False,
        inject_fused_attention=True,
        inject_fused_mlp=True
    )

    tokenizer = AutoTokenizer.from_pretrained(model_dir)

    return model, tokenizer

6. AWQ: Activation-aware Weight Quantization

AWQ, published in 2023, analyzes activation distributions to protect important weight channels. (arXiv:2306.00978)

6.1 Differences from GPTQ

Feature             GPTQ                               AWQ
------------------  ---------------------------------  --------------------------------
Approach            Hessian-based error compensation   Activation-based scaling
Calibration data    Required (128+ samples)            Required (32+ samples)
Speed               Slow (1–4 hours)                   Fast (tens of minutes)
Quality             Excellent                          Excellent (comparable or better)
Key feature         Per-channel optimization           Activation outlier handling
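The core idea can be sketched in a few lines (a simplified illustration with assumed names, not the AutoAWQ implementation, which additionally grid-searches the best scaling exponent per layer): scale up salient input channels before quantization so they lose less precision, then fold the inverse scale back out.

```python
import torch

def awq_style_scale(weight: torch.Tensor, act_magnitude: torch.Tensor,
                    alpha: float = 0.5, num_bits: int = 4):
    """Sketch of AWQ-style scaling. weight: [out, in]; act_magnitude: [in]."""
    # Per-input-channel scale derived from activation statistics
    s = act_magnitude.clamp(min=1e-5) ** alpha
    w_scaled = weight * s.unsqueeze(0)

    # Plain symmetric quantization of the scaled weight
    q_max = 2 ** (num_bits - 1) - 1
    scale = w_scaled.abs().max() / q_max
    w_q = torch.clamp(torch.round(w_scaled / scale), -q_max, q_max) * scale

    # Undo the channel scale: mathematically (X / s) @ (s * W).T == X @ W.T
    return w_q / s.unsqueeze(0), s

torch.manual_seed(0)
W = torch.randn(64, 64)
act_mag = torch.rand(64) * 10 + 0.1  # hypothetical per-channel activation magnitudes
W_q, s = awq_style_scale(W, act_mag)
print(f"mean reconstruction error: {(W - W_q).abs().mean():.4f}")
```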

6.2 Using AutoAWQ

from awq import AutoAWQForCausalLM
from transformers import AutoTokenizer

def quantize_with_awq(
    model_name: str,
    output_dir: str,
    bits: int = 4,
    group_size: int = 128
):
    """Quantize model with AutoAWQ"""

    tokenizer = AutoTokenizer.from_pretrained(
        model_name,
        trust_remote_code=True
    )

    model = AutoAWQForCausalLM.from_pretrained(
        model_name,
        low_cpu_mem_usage=True,
        use_cache=False
    )

    quant_config = {
        "zero_point": True,
        "q_group_size": group_size,
        "w_bit": bits,
        "version": "GEMM"
    }

    print(f"Starting AWQ {bits}bit quantization...")
    model.quantize(tokenizer, quant_config=quant_config)

    model.save_quantized(output_dir)
    tokenizer.save_pretrained(output_dir)

    print(f"AWQ quantization complete: {output_dir}")
    return model

def load_awq_model(model_dir: str, device: str = "cuda"):
    """Load AWQ quantized model"""

    model = AutoAWQForCausalLM.from_quantized(
        model_dir,
        fuse_layers=True,
        trust_remote_code=True,
        safetensors=True
    )

    tokenizer = AutoTokenizer.from_pretrained(model_dir)

    return model, tokenizer

7. GGUF/GGML: The llama.cpp Ecosystem

GGUF (GPT-Generated Unified Format) is the model format for the llama.cpp project, enabling efficient LLM execution even on CPU.

7.1 Understanding GGUF

GGUF was introduced in 2023 to replace GGML. It stores model metadata, hyperparameters, and tokenizer information in a single file.
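The fixed part of the header is easy to inspect by hand. A minimal reader, assuming the field layout from the GGUF specification in the ggml repository (magic, version, tensor count, metadata key-value count, all little-endian):

```python
import struct

def read_gguf_header(path: str):
    """Read the fixed-size GGUF header fields."""
    with open(path, "rb") as f:
        magic = f.read(4)
        if magic != b"GGUF":
            raise ValueError(f"not a GGUF file (magic={magic!r})")
        version, = struct.unpack("<I", f.read(4))          # format version
        tensor_count, = struct.unpack("<Q", f.read(8))     # number of tensors in the file
        metadata_kv_count, = struct.unpack("<Q", f.read(8))
    return {"version": version, "tensors": tensor_count, "metadata_kvs": metadata_kv_count}

# Usage (assumes a local GGUF file):
# print(read_gguf_header("llama2-7b-q4_k_m.gguf"))
```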

7.2 Quantization Levels Comparison

Format    Bits   Memory (7B)   PPL Increase   Recommended Use
-------   ----   -----------   ------------   ----------------------
Q2_K      2.6    2.8 GB        High           Extreme compression
Q3_K_S    3.0    3.3 GB        Medium         Memory saving
Q4_0      4.0    3.8 GB        Low            Balanced
Q4_K_M    4.1    4.1 GB        Very low       General recommendation
Q5_0      5.0    4.7 GB        Minimal        High quality
Q5_K_M    5.1    4.8 GB        Minimal        High quality, recommended
Q6_K      6.0    5.5 GB        Nearly none    Near FP16
Q8_0      8.0    7.2 GB        None           Reference use
F16       16.0   13.5 GB       None           Baseline

K-quants (Q4_K_M, Q5_K_M, etc.) keep some layers at higher precision to improve quality.

7.3 Building and Using llama.cpp

# Clone and build llama.cpp
git clone https://github.com/ggerganov/llama.cpp
cd llama.cpp

# Build with CUDA support
cmake -B build -DGGML_CUDA=ON
cmake --build build --config Release -j $(nproc)

# CPU-only build
cmake -B build
cmake --build build --config Release -j $(nproc)

# Convert a HuggingFace model to GGUF
# (convert_hf_to_gguf.py takes the local model directory as a positional argument)
python convert_hf_to_gguf.py ./Llama-2-7b-hf \
    --outfile llama2-7b-f16.gguf \
    --outtype f16

# Quantize to Q4_K_M
./build/bin/llama-quantize \
    llama2-7b-f16.gguf \
    llama2-7b-q4_k_m.gguf \
    Q4_K_M

# Run inference
./build/bin/llama-cli \
    -m llama2-7b-q4_k_m.gguf \
    -p "The future of AI is" \
    -n 100 \
    --ctx-size 4096 \
    --threads 8 \
    --n-gpu-layers 35

7.4 Python Bindings (llama-cpp-python)

from llama_cpp import Llama

# Load model
llm = Llama(
    model_path="./llama2-7b-q4_k_m.gguf",
    n_ctx=4096,
    n_gpu_layers=35,
    n_threads=8,
    verbose=False
)

# Text generation
output = llm(
    "Once upon a time",
    max_tokens=200,
    temperature=0.7,
    top_p=0.9,
    stop=["</s>", "\n\n"]
)

print(output["choices"][0]["text"])

# Chat completion format
response = llm.create_chat_completion(
    messages=[
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": "What is machine learning?"}
    ],
    max_tokens=500,
    temperature=0.7
)

print(response["choices"][0]["message"]["content"])

# Streaming output
for chunk in llm.create_chat_completion(
    messages=[{"role": "user", "content": "Tell me a joke"}],
    stream=True
):
    delta = chunk["choices"][0].get("delta", {})
    if "content" in delta:
        print(delta["content"], end="", flush=True)

8. bitsandbytes: LLM Quantization Library

bitsandbytes, developed by Tim Dettmers, integrates seamlessly with HuggingFace transformers.

8.1 LLM.int8() — 8-bit Mixed Precision

During matrix multiplication, LLM.int8() keeps the columns containing activation outliers in FP16 and computes the rest in INT8.
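The decomposition can be sketched as follows (an illustration only: bitsandbytes uses vector-wise scales and fused CUDA kernels, while per-tensor scales are used here for brevity):

```python
import torch

def llm_int8_style_matmul(X, W, threshold=6.0):
    """Sketch of LLM.int8()-style decomposition: feature columns of X whose
    magnitude exceeds a threshold are multiplied in full precision; the rest
    go through a simulated INT8 path."""
    outlier_cols = (X.abs().max(dim=0).values > threshold)

    # Full-precision path for outlier features
    y_fp = X[:, outlier_cols] @ W[outlier_cols, :]

    # Simulated INT8 path for the rest (per-tensor symmetric quantization)
    Xr, Wr = X[:, ~outlier_cols], W[~outlier_cols, :]
    sx = Xr.abs().max() / 127
    sw = Wr.abs().max() / 127
    Xq = torch.clamp(torch.round(Xr / sx), -127, 127)
    Wq = torch.clamp(torch.round(Wr / sw), -127, 127)
    y_int8 = (Xq @ Wq) * sx * sw

    return y_fp + y_int8

torch.manual_seed(0)
X = torch.randn(4, 64)
X[:, 0] *= 20                     # plant an outlier feature channel
W = torch.randn(64, 32)
y = llm_int8_style_matmul(X, W)
print(f"max abs error vs full precision: {(y - X @ W).abs().max():.4f}")
```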

from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig
import torch

# Load INT8 model (passing load_in_8bit as a bare kwarg is deprecated;
# use BitsAndBytesConfig instead)
model_8bit = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-2-7b-hf",
    quantization_config=BitsAndBytesConfig(load_in_8bit=True),
    device_map="auto"
)

def print_model_size(model, label):
    """Print model memory usage"""
    total_params = sum(p.numel() for p in model.parameters())
    total_bytes = sum(
        p.numel() * p.element_size() for p in model.parameters()
    )
    print(f"{label}: {total_params/1e9:.2f}B params, {total_bytes/1e9:.2f} GB")

print_model_size(model_8bit, "INT8 model")
# INT8 model: 6.74B params, ~7.0 GB

8.2 4-bit Quantization (Used in QLoRA)

import bitsandbytes as bnb
from transformers import BitsAndBytesConfig

# NF4 quantization config (QLoRA)
bnb_config_nf4 = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
    bnb_4bit_use_double_quant=True,  # Double quantization
)

# FP4 quantization config
bnb_config_fp4 = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="fp4",
    bnb_4bit_compute_dtype=torch.float16,
)

# Load model
model_4bit = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-2-7b-hf",
    quantization_config=bnb_config_nf4,
    device_map="auto"
)

print_model_size(model_4bit, "NF4 model")
# NF4 model: 6.74B params, ~4.0 GB (with double quantization)

# QLoRA fine-tuning setup
from peft import LoraConfig, get_peft_model, prepare_model_for_kbit_training

model_4bit = prepare_model_for_kbit_training(model_4bit)

lora_config = LoraConfig(
    r=64,
    lora_alpha=16,
    target_modules=["q_proj", "v_proj"],
    lora_dropout=0.05,
    bias="none",
    task_type="CAUSAL_LM"
)

model_lora = get_peft_model(model_4bit, lora_config)
model_lora.print_trainable_parameters()
# trainable params: 4,194,304 || all params: 3,504,607,232 || trainable%: 0.1197

8.3 NF4 vs FP4

NF4 (Normal Float 4)

  • Non-linear 4-bit quantization assuming a normal distribution
  • Leverages the observation that weight distributions are approximately normal
  • Better representational power at the same bit count

FP4 (Float 4)

  • Floating-point based 4-bit
  • Can represent wider ranges
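The difference is easiest to see by building an NF4-style codebook. The real NF4 levels are 16 fixed constants in bitsandbytes; the sketch below derives approximate levels from normal-distribution quantiles to illustrate the idea:

```python
import torch

# 16 levels chosen as quantiles of N(0, 1), normalized to [-1, 1].
# These are approximate illustrations, not the exact bitsandbytes constants.
probs = (torch.arange(16, dtype=torch.float64) + 0.5) / 16
levels = torch.erfinv(2 * probs - 1) * (2 ** 0.5)   # inverse normal CDF
levels = levels / levels.abs().max()

def nf4_style_quantize(x: torch.Tensor):
    """Absmax-normalize, then snap each value to the nearest codebook level."""
    absmax = x.abs().max()
    x_norm = x / absmax
    idx = (x_norm.unsqueeze(-1) - levels).abs().argmin(dim=-1)  # nearest level
    return idx.to(torch.uint8), absmax

def nf4_style_dequantize(idx, absmax):
    return (levels[idx.long()] * absmax).float()

torch.manual_seed(0)
w = torch.randn(4096)  # weights are roughly normal, which is what NF4 exploits
idx, absmax = nf4_style_quantize(w)
w_hat = nf4_style_dequantize(idx, absmax)
print(f"NF4-style mean abs error: {(w - w_hat).abs().mean():.4f}")
```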

9. SmoothQuant: W8A8 Quantization

SmoothQuant quantizes both weights (W) and activations (A) to INT8 for faster inference.

9.1 The Activation Outlier Problem

LLM activations exhibit very large values (outliers) in specific channels, making W8A8 quantization challenging.
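A rough diagnostic for this problem is to compare each channel's peak magnitude against the typical channel (the helper name and ratio threshold below are illustrative):

```python
import torch

def find_outlier_channels(activations: torch.Tensor, ratio: float = 20.0):
    """Flag channels whose peak magnitude dwarfs the typical channel —
    the outliers that make W8A8 quantization hard."""
    flat = activations.reshape(-1, activations.size(-1))   # [tokens, channels]
    channel_max = flat.abs().max(dim=0).values
    typical = channel_max.median()
    return torch.nonzero(channel_max > ratio * typical).flatten(), channel_max

torch.manual_seed(0)
acts = torch.randn(8, 128, 512)        # [batch, seq, hidden], synthetic
acts[..., 42] *= 100                    # plant one outlier channel
outliers, cmax = find_outlier_channels(acts)
print(f"outlier channels: {outliers.tolist()}")   # [42]
```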

9.2 Migration Scaling

SmoothQuant's key insight: transfer the difficulty from activations to weights.

Y = (X * diag(s)^(-1)) * (diag(s) * W)
  = X_smooth * W_smooth
def smooth_quantize(
    model,
    calibration_samples,
    alpha: float = 0.5
):
    """
    Apply SmoothQuant

    Args:
        alpha: migration strength — how much quantization difficulty is
               shifted from activations onto weights (0 = none, 1 = all)
               Recommended: 0.5 (balanced split)
    """

    act_scales = {}

    def collect_scales(name):
        def hook(module, input, output):
            inp = input[0].detach()
            if inp.dim() == 3:
                inp = inp.reshape(-1, inp.size(-1))

            channel_max = inp.abs().max(dim=0)[0]

            if name not in act_scales:
                act_scales[name] = channel_max
            else:
                act_scales[name] = torch.maximum(act_scales[name], channel_max)
        return hook

    handles = []
    for name, module in model.named_modules():
        if isinstance(module, torch.nn.Linear):
            handles.append(module.register_forward_hook(collect_scales(name)))

    with torch.no_grad():
        for sample in calibration_samples:
            model(**sample)

    for h in handles:
        h.remove()

    for name, module in model.named_modules():
        if isinstance(module, torch.nn.Linear) and name in act_scales:
            act_scale = act_scales[name]
            weight_scale = module.weight.abs().max(dim=0)[0]

            # Compute migration scale
            smooth_scale = (act_scale ** alpha) / (weight_scale ** (1 - alpha))
            smooth_scale = torch.clamp(smooth_scale, min=1e-5)

            # Fold the scale into the weights (W_smooth = diag(s) * W).
            # In a real deployment, the matching 1/s must also be fused into the
            # preceding op (e.g. LayerNorm) so activations arrive pre-divided.
            module.weight.data = module.weight.data * smooth_scale.unsqueeze(0)

    return model, act_scales
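The key property is that migration scaling leaves the layer's output mathematically unchanged. This can be checked on toy tensors (note: here W has shape (in, out) for a plain matmul, unlike the (out, in) layout of nn.Linear above):

```python
import torch

torch.manual_seed(0)
X = torch.randn(8, 16)
X[:, 3] *= 50.0                      # synthetic outlier activation channel
W = torch.randn(16, 32)

# Per-input-channel smoothing scale with alpha = 0.5
act_scale = X.abs().max(dim=0)[0]
w_scale = W.abs().max(dim=1)[0]
s = (act_scale ** 0.5) / (w_scale ** 0.5)

X_smooth = X / s                     # activations become easier to quantize
W_smooth = s.unsqueeze(1) * W        # weights absorb the difficulty

Y, Y_smooth = X @ W, X_smooth @ W_smooth
print(torch.allclose(Y, Y_smooth, rtol=1e-4, atol=1e-3))  # identical output
print(X.abs().max().item(), X_smooth.abs().max().item())  # dynamic range shrinks
```

The activation dynamic range shrinks substantially while the product X·W is preserved exactly (up to floating-point rounding), which is what makes INT8 activation quantization viable afterwards.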

10. SpQR: Sparse-Quantized Representation

SpQR stores important weights (outliers) in FP16 separately and quantizes the remainder to low precision.

import torch

def spqr_quantize(weight: torch.Tensor,
                   num_bits: int = 3,
                   outlier_threshold_percentile: float = 1.0):
    """
    SpQR quantization (simplified version)

    Core: store top p% outliers as FP16, quantize rest to low bits
    """

    threshold = torch.quantile(weight.abs(), 1 - outlier_threshold_percentile / 100)
    outlier_mask = weight.abs() > threshold

    # Store outliers (FP16)
    outlier_values = weight.clone()
    outlier_values[~outlier_mask] = 0

    # Quantize remainder
    regular_weight = weight.clone()
    regular_weight[outlier_mask] = 0

    q_max = 2 ** (num_bits - 1) - 1
    group_size = 16

    rows, cols = regular_weight.shape
    assert regular_weight.numel() % group_size == 0, "size must be divisible by group_size"
    regular_grouped = regular_weight.reshape(-1, group_size)

    max_abs = regular_grouped.abs().max(dim=1, keepdim=True)[0]
    scales = max_abs / q_max
    scales = torch.clamp(scales, min=1e-8)

    q = torch.clamp(torch.round(regular_grouped / scales), -q_max, q_max).to(torch.int8)
    regular_dequant = (scales * q.float()).reshape(rows, cols)

    reconstructed = regular_dequant + outlier_values

    error = (weight - reconstructed).abs().mean().item()
    # 2 bytes per FP16 outlier; real SpQR also stores sparse indices (omitted here)
    outlier_memory = outlier_mask.sum().item() * 2
    regular_memory = (~outlier_mask).sum().item() * (num_bits / 8)
    total_memory = outlier_memory + regular_memory
    original_memory = weight.numel() * weight.element_size()
    compression_ratio = original_memory / total_memory

    print(f"Outlier ratio: {outlier_mask.float().mean():.2%}")
    print(f"Mean reconstruction error: {error:.6f}")
    print(f"Compression ratio: {compression_ratio:.2f}x")

    return q, scales, outlier_values, outlier_mask
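Why the outlier split helps can be seen in a self-contained toy comparison (this is an illustration of the principle, not the real SpQR pipeline): without the split, a single large value inflates its group's scale and degrades every neighbor in the group.

```python
import torch

torch.manual_seed(0)
w = torch.randn(4096)
w[::500] *= 30.0                       # inject a few large outliers

def quant3(t: torch.Tensor) -> torch.Tensor:
    """Symmetric 3-bit round-trip over groups of 16 (q_max = 2^(3-1) - 1 = 3)."""
    g = t.reshape(-1, 16)
    scale = g.abs().max(dim=1, keepdim=True)[0].clamp(min=1e-8) / 3
    return (torch.clamp(torch.round(g / scale), -3, 3) * scale).reshape(t.shape)

# Plain 3-bit: outliers inflate their group scales, hurting neighbors
err_plain = (w - quant3(w)).abs().mean()

# SpQR-style: keep the top 1% of |w| in full precision, quantize the rest
thr = torch.quantile(w.abs(), 0.99)
mask = w.abs() > thr
rest = torch.where(mask, torch.zeros_like(w), w)
err_spqr = (w - (quant3(rest) + w * mask)).abs().mean()

print(f"plain 3-bit error: {err_plain:.4f}")
print(f"with outlier split: {err_spqr:.4f}")
```

Extracting the outliers both stores the hardest values exactly and deflates the group scales for everything else, so the mean error drops on two fronts at once.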

11. Quantization Benchmark Comparison

11.1 Llama-2-7B Benchmark

import time
import torch
import GPUtil

def benchmark_quantization(model, tokenizer, device="cuda", num_runs=50):
    """Benchmark quantized model"""

    prompt = "The history of artificial intelligence began"
    inputs = tokenizer(prompt, return_tensors="pt").to(device)

    memory_used_gb = None
    if device == "cuda":
        torch.cuda.synchronize()
        gpu = GPUtil.getGPUs()[0]
        memory_used_gb = gpu.memoryUsed / 1024  # whole-GPU usage; model is already loaded

    # Warmup
    with torch.no_grad():
        for _ in range(5):
            outputs = model.generate(
                **inputs,
                max_new_tokens=50,
                do_sample=False
            )

    # Measure speed
    if device == "cuda":
        torch.cuda.synchronize()
    start = time.time()

    with torch.no_grad():
        for _ in range(num_runs):
            outputs = model.generate(
                **inputs,
                max_new_tokens=50,
                do_sample=False
            )

    if device == "cuda":
        torch.cuda.synchronize()
    elapsed = time.time() - start

    avg_time = elapsed / num_runs
    # Count actually generated tokens (generation may stop early at EOS)
    new_tokens = outputs.shape[1] - inputs["input_ids"].shape[1]
    tokens_per_second = new_tokens / avg_time

    return {
        "memory_gb": memory_used_gb,
        "avg_time_ms": avg_time * 1000,
        "tokens_per_second": tokens_per_second
    }

# Example results (A100 80GB, Llama-2-7B)
benchmark_results = {
    "FP16": {"memory_gb": 13.5, "tokens_per_second": 52.3, "ppl": 5.68},
    "INT8 (bitsandbytes)": {"memory_gb": 7.8, "tokens_per_second": 38.1, "ppl": 5.71},
    "INT4 GPTQ": {"memory_gb": 4.5, "tokens_per_second": 65.2, "ppl": 5.89},
    "INT4 AWQ": {"memory_gb": 4.3, "tokens_per_second": 68.7, "ppl": 5.86},
    "Q4_K_M (GGUF, CPU)": {"memory_gb": 4.1, "tokens_per_second": 45.2, "ppl": 5.91},
    "INT4 NF4": {"memory_gb": 4.0, "tokens_per_second": 31.5, "ppl": 5.94},
}

print("=" * 80)
print(f"{'Method':<25} {'Memory (GB)':<12} {'Tok/s':<12} {'PPL':<8}")
print("=" * 80)
for method, stats in benchmark_results.items():
    print(f"{method:<25} {stats['memory_gb']:<12.1f} {stats['tokens_per_second']:<12.1f} {stats['ppl']:<8.2f}")

12. Practical Guide: Choosing the Right Quantization Method

12.1 Strategy by Model Size

Small models under 7B:

  • GGUF Q4_K_M: optimal for local CPU execution
  • AWQ INT4: recommended for GPU server deployment
  • FP16 viable if memory allows (under 24GB GPU)

Mid-size models 13B–30B:

  • GPTQ INT4 or AWQ INT4: runs on a single 24GB GPU
  • GGUF Q4_K_M: can run in 16GB RAM

Large models 70B+:

  • GPTQ INT4: runs on a single A100 80GB
  • GPTQ INT3/INT2: extreme compression, at a noticeable quality cost
  • Combine with multi-GPU tensor parallelism when a single GPU is not enough

12.2 Strategy by Task

def recommend_quantization(
    task: str,
    model_size_b: float,
    gpu_memory_gb: float,
    cpu_only: bool = False,
    fine_tuning_needed: bool = False
):
    """Recommend quantization based on task and environment"""

    recommendations = []

    if cpu_only:
        recommendations.append({
            "method": "GGUF Q4_K_M",
            "reason": "Optimized for CPU inference, based on llama.cpp",
            "library": "llama-cpp-python"
        })
        return recommendations

    if fine_tuning_needed:
        recommendations.append({
            "method": "bitsandbytes NF4 + QLoRA",
            "reason": "Fine-tuning capable, LoRA adapter training with ~4GB overhead",
            "library": "bitsandbytes + peft"
        })
        return recommendations

    fp16_memory = model_size_b * 2
    int8_memory = model_size_b * 1
    int4_memory = model_size_b * 0.5

    if fp16_memory <= gpu_memory_gb * 0.8:
        recommendations.append({
            "method": "FP16 (baseline)",
            "reason": "Memory is sufficient, best quality",
            "memory_gb": fp16_memory
        })

    if int8_memory <= gpu_memory_gb * 0.8:
        if task in ["chat", "completion", "summarization"]:
            recommendations.append({
                "method": "INT8 (bitsandbytes LLM.int8())",
                "reason": "Optimal balance of quality and memory",
                "library": "bitsandbytes",
                "memory_gb": int8_memory
            })

    if int4_memory <= gpu_memory_gb * 0.8:
        recommendations.append({
            "method": "AWQ INT4",
            "reason": "Fast inference, excellent quality",
            "library": "autoawq",
            "memory_gb": int4_memory
        })
        recommendations.append({
            "method": "GPTQ INT4",
            "reason": "Best INT4 quality, slower quantization process",
            "library": "auto-gptq",
            "memory_gb": int4_memory
        })

    return recommendations

# Example usage
recommendations = recommend_quantization(
    task="chat",
    model_size_b=7.0,
    gpu_memory_gb=16.0,
    fine_tuning_needed=False
)

for rec in recommendations:
    print(f"\nMethod: {rec['method']}")
    print(f"Reason: {rec['reason']}")
    if 'library' in rec:
        print(f"Library: {rec['library']}")
    if 'memory_gb' in rec:
        print(f"Expected memory: {rec['memory_gb']:.1f} GB")

12.3 Complete Quantization Pipeline

import torch
import os
from auto_gptq import AutoGPTQForCausalLM, BaseQuantizeConfig
from awq import AutoAWQForCausalLM
from transformers import AutoTokenizer

class QuantizationPipeline:
    """Unified quantization pipeline"""

    def __init__(self, model_name: str, output_base_dir: str):
        self.model_name = model_name
        self.output_base_dir = output_base_dir
        self.tokenizer = AutoTokenizer.from_pretrained(model_name)
        os.makedirs(output_base_dir, exist_ok=True)

    def quantize_gptq(self, bits: int = 4, group_size: int = 128):
        """GPTQ quantization"""
        output_dir = os.path.join(self.output_base_dir, f"gptq-{bits}bit")

        config = BaseQuantizeConfig(
            bits=bits,
            group_size=group_size,
            sym=True,
            desc_act=False
        )

        model = AutoGPTQForCausalLM.from_pretrained(
            self.model_name,
            quantize_config=config
        )

        calibration_data = self._prepare_calibration_data()
        model.quantize(calibration_data)
        model.save_quantized(output_dir)
        self.tokenizer.save_pretrained(output_dir)

        print(f"GPTQ {bits}bit saved: {output_dir}")
        return output_dir

    def quantize_awq(self, bits: int = 4, group_size: int = 128):
        """AWQ quantization"""
        output_dir = os.path.join(self.output_base_dir, f"awq-{bits}bit")

        model = AutoAWQForCausalLM.from_pretrained(
            self.model_name,
            low_cpu_mem_usage=True
        )

        quant_config = {
            "zero_point": True,
            "q_group_size": group_size,
            "w_bit": bits,
            "version": "GEMM"
        }

        model.quantize(self.tokenizer, quant_config=quant_config)
        model.save_quantized(output_dir)
        self.tokenizer.save_pretrained(output_dir)

        print(f"AWQ {bits}bit saved: {output_dir}")
        return output_dir

    def _prepare_calibration_data(self, num_samples: int = 128):
        """Prepare calibration data"""
        from datasets import load_dataset

        dataset = load_dataset("wikitext", "wikitext-2-raw-v1", split="train")

        data = []
        for text in dataset["text"]:
            if len(text.strip()) > 50:
                encoded = self.tokenizer(
                    text.strip(),
                    return_tensors="pt",
                    max_length=2048,
                    truncation=True
                )
                # auto-gptq expects a list of dicts with input_ids / attention_mask
                data.append({
                    "input_ids": encoded["input_ids"],
                    "attention_mask": encoded["attention_mask"]
                })
                if len(data) >= num_samples:
                    break

        return data

Conclusion

Model quantization is a foundational technology for democratizing LLMs. To summarize what we covered:

  1. Fundamentals: The math behind compressing FP32 to INT4 (scale, zero_point)
  2. PTQ vs QAT: PTQ is practical without retraining; QAT is essential for extreme compression
  3. GPTQ: Best INT4 quality via Hessian-based error compensation
  4. AWQ: Fast and efficient quantization based on activation distributions
  5. GGUF: Optimized for CPU execution, multiple quality levels available
  6. bitsandbytes: HuggingFace integration, essential for QLoRA fine-tuning

Recommended strategies:

  • Local execution: GGUF Q4_K_M
  • GPU server deployment: AWQ 4-bit
  • Quality-critical scenarios: GPTQ 4-bit or FP16
  • Fine-tuning needed: bitsandbytes NF4 + QLoRA

Quantization technology is evolving rapidly, with 2-bit methods like QuIP# and AQLM emerging. The journey toward smaller, faster models continues.

References