Introduction

As deep learning models grow increasingly large, inference costs and memory requirements have skyrocketed. GPT-3 has 175B parameters, Llama 3 has 70B, and storing them in full FP32 precision requires 700GB and 280GB of memory respectively — completely infeasible for typical GPUs.

**Model Quantization** is the key technology for solving this problem. By compressing 32-bit floating-point (FP32) weights to 8-bit or 4-bit integers, it reduces weight memory by 4–8x and can improve inference speed by roughly 2–4x when the hardware has efficient low-precision kernels. With a well-chosen method, the quality loss is usually small.

In this guide, we thoroughly explore quantization from mathematical foundations to the latest techniques: GPTQ, AWQ, GGUF, and bitsandbytes.

1. Quantization Fundamentals: Understanding Number Representations

1.1 Floating-Point Formats

Understanding the floating-point formats used in modern deep learning is the starting point for quantization.

**FP32 (Float32)**

- Sign (1 bit) + Exponent (8 bits) + Mantissa (23 bits) = 32 bits total

- Range: approximately -3.4e38 to 3.4e38

- Precision: ~7 decimal digits

**FP16 (Float16)**

- Sign (1 bit) + Exponent (5 bits) + Mantissa (10 bits) = 16 bits total

- Range: -65504 to 65504 (much narrower than FP32)

- Precision: ~3 decimal digits

- Risk of overflow during training; requires gradient scaling

**BF16 (Brain Float16)**

- Sign (1 bit) + Exponent (8 bits) + Mantissa (7 bits) = 16 bits total

- Maintains the same exponent range as FP32 while reducing mantissa bits

- Much lower overflow risk than FP16, which makes it safer for deep learning training

- Developed by Google Brain, natively supported on modern GPUs (A100, H100)
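The range versus precision trade-off between FP16 and BF16 can be checked without a GPU. The sketch below is only an illustration: `to_fp16` and `to_bf16` are hypothetical helper names, FP16 rounding relies on the `e` format of Python's `struct` module, and BF16 is emulated by rounding an FP32 bit pattern to its top 16 bits (NaN and infinity inputs are not handled).

```python
import struct


def to_fp16(x: float) -> float:
    """Round-trip a value through IEEE half precision; out-of-range values become inf."""
    try:
        return struct.unpack("<e", struct.pack("<e", x))[0]
    except OverflowError:
        return float("inf") if x > 0 else float("-inf")


def to_bf16(x: float) -> float:
    """Emulate BF16 by keeping the top 16 bits of the FP32 pattern (round to nearest even)."""
    bits = struct.unpack("<I", struct.pack("<f", x))[0]
    rounding = 0x7FFF + ((bits >> 16) & 1)
    bits = ((bits + rounding) >> 16) << 16
    return struct.unpack("<f", struct.pack("<I", bits & 0xFFFFFFFF))[0]


for value in (1.001, 70000.0):
    print(f"{value}: fp16={to_fp16(value)}, bf16={to_bf16(value)}")
```

FP16 keeps more digits for small values but overflows just above 65504, while BF16 keeps the FP32 exponent range at the cost of coarser rounding.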

import torch

# Check memory size of each data type

x_fp32 = torch.tensor([1.5, -2.3, 0.7], dtype=torch.float32)

x_fp16 = torch.tensor([1.5, -2.3, 0.7], dtype=torch.float16)

x_bf16 = torch.tensor([1.5, -2.3, 0.7], dtype=torch.bfloat16)

print(f"FP32: {x_fp32.element_size()} bytes per element") # 4 bytes

print(f"FP16: {x_fp16.element_size()} bytes per element") # 2 bytes

print(f"BF16: {x_bf16.element_size()} bytes per element") # 2 bytes

# Memory calculation for a 7B parameter model

params = 7e9

fp32_memory_gb = params * 4 / 1e9

fp16_memory_gb = params * 2 / 1e9

int8_memory_gb = params * 1 / 1e9

int4_memory_gb = params * 0.5 / 1e9

print(f"\n7B Model Memory Requirements:")

print(f"FP32: {fp32_memory_gb:.1f} GB") # 28.0 GB

print(f"FP16: {fp16_memory_gb:.1f} GB") # 14.0 GB

print(f"INT8: {int8_memory_gb:.1f} GB") # 7.0 GB

print(f"INT4: {int4_memory_gb:.1f} GB") # 3.5 GB

1.2 Integer Representations

The core of quantization is mapping floating-point values to integers.

**INT8**: -128 to 127 (signed) or 0 to 255 (unsigned)

**INT4**: -8 to 7 (signed) or 0 to 15 (unsigned)

**INT2**: -2 to 1 (signed) or 0 to 3 (unsigned)

1.3 Quantization Formula

The fundamental formula to convert a floating-point value x to integer q:

q = clamp(round(x / scale) + zero_point, q_min, q_max)

Dequantization:

x_approx = scale * (q - zero_point)

Where:

- scale: quantization scale factor (scale = (max_val - min_val) / (q_max - q_min))

- zero_point: the integer value that the real value 0.0 maps to (an offset that shifts the integer range)

- q_min, q_max: integer range bounds (-128, 127 for INT8)

from typing import Optional

import torch


def symmetric_quantize(x: torch.Tensor, num_bits: int = 8):
    """Symmetric quantization implementation"""
    q_max = 2 ** (num_bits - 1) - 1  # 127 for INT8
    q_min = -q_max  # -127

    # Compute scale (clamped so an all-zero tensor does not divide by zero)
    max_abs = x.abs().max()
    scale = torch.clamp(max_abs / q_max, min=1e-8)

    # Quantize (int8 storage is sufficient for num_bits <= 8)
    q = torch.clamp(torch.round(x / scale), q_min, q_max).to(torch.int8)
    return q, scale


def asymmetric_quantize(x: torch.Tensor, num_bits: int = 8):
    """Asymmetric quantization implementation"""
    q_max = 2 ** num_bits - 1  # 255 for UINT8
    q_min = 0

    # Compute scale and zero_point
    min_val = x.min()
    max_val = x.max()
    scale = torch.clamp((max_val - min_val) / (q_max - q_min), min=1e-8)
    zero_point = q_min - torch.round(min_val / scale)
    zero_point = torch.clamp(zero_point, q_min, q_max).to(torch.int32)

    # Quantize
    q = torch.clamp(torch.round(x / scale) + zero_point, q_min, q_max).to(torch.uint8)
    return q, scale, zero_point


def dequantize(q: torch.Tensor, scale: torch.Tensor, zero_point: Optional[torch.Tensor] = None):
    """Dequantization"""
    if zero_point is None:
        return scale * q.float()
    return scale * (q.float() - zero_point.float())


# Test
x = torch.randn(100)
print(f"Original data range: [{x.min():.4f}, {x.max():.4f}]")

# Symmetric quantization
q_sym, scale_sym = symmetric_quantize(x)
x_reconstructed_sym = dequantize(q_sym, scale_sym)
error_sym = (x - x_reconstructed_sym).abs().mean()
print(f"Symmetric quantization mean error: {error_sym:.6f}")

# Asymmetric quantization
q_asym, scale_asym, zp_asym = asymmetric_quantize(x)
x_reconstructed_asym = dequantize(q_asym, scale_asym, zp_asym)
error_asym = (x - x_reconstructed_asym).abs().mean()
print(f"Asymmetric quantization mean error: {error_asym:.6f}")

1.4 Symmetric vs Asymmetric Quantization

**Symmetric Quantization**

- zero_point = 0

- Symmetric positive/negative range

- Suitable for weights (mostly zero-centered distribution)

- Simpler computation: x_approx = scale \* q

**Asymmetric Quantization**

- zero_point != 0

- Can represent arbitrary ranges

- Suitable for activations (always non-negative after ReLU)

- More complex computation: x_approx = scale \* (q - zero_point)
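To make the activation case concrete, the following pure-Python sketch compares both schemes on non-negative, ReLU-like values. The helper name `mean_quant_error` and the synthetic data are assumptions made for illustration; the arithmetic follows the quantization formula in section 1.3.

```python
def mean_quant_error(values, bits=8, symmetric=True):
    """Mean absolute error after a quantize-dequantize round trip."""
    if symmetric:
        q_max = 2 ** (bits - 1) - 1
        lo, hi = -q_max, q_max
        scale = max(abs(v) for v in values) / q_max
        zero_point = 0
    else:
        lo, hi = 0, 2 ** bits - 1
        v_min, v_max = min(values), max(values)
        scale = (v_max - v_min) / (hi - lo)
        zero_point = round(-v_min / scale)

    total = 0.0
    for v in values:
        q = min(max(round(v / scale) + zero_point, lo), hi)
        total += abs(v - scale * (q - zero_point))
    return total / len(values)


# Synthetic post-ReLU activations: 0.00, 0.01, ..., 6.00 (never negative)
activations = [i / 100 for i in range(601)]

print("symmetric :", mean_quant_error(activations, symmetric=True))
print("asymmetric:", mean_quant_error(activations, symmetric=False))
```

Symmetric quantization spends half of its integer codes on negative values that never occur here, so its step size is about twice as large as in the asymmetric case.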

1.5 Quantization Granularity

Determines how many parameters share a single scale/zero_point.

**Per-Tensor**: One scale for the entire tensor

- Minimal memory overhead

- Largest precision loss

**Per-Channel (Per-Row/Column)**: Individual scale per channel

- Separate scale for each row/column of the weight matrix

- Effectively handles distribution differences across channels

**Per-Group (Per-Block)**: Individual scale per fixed-size group

- Typical group_size = 128

- Compromise between per-channel and per-tensor

- Commonly used in GPTQ and AWQ
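The effect of granularity is easiest to see with a single outlier. In this pure-Python sketch (the helper names and the synthetic weights are assumptions for illustration), one large value forces a coarse scale on the whole tensor in the per-tensor case, but only on its own group in the per-group case.

```python
def quant_dequant(values, bits=4):
    """Symmetric quantize-dequantize with one scale shared by all `values`."""
    q_max = 2 ** (bits - 1) - 1
    scale = max(abs(v) for v in values) / q_max
    if scale == 0:
        return list(values)
    return [scale * max(-q_max, min(q_max, round(v / scale))) for v in values]


def mean_abs_error(original, approx):
    return sum(abs(a - b) for a, b in zip(original, approx)) / len(original)


# 256 small weights in [-0.1, 0.1] plus one outlier
weights = [0.01 * ((i * 37) % 21 - 10) for i in range(256)]
weights[5] = 8.0

# Per-tensor: one scale for everything
per_tensor = quant_dequant(weights)

# Per-group: one scale per block of 64 values
group_size = 64
per_group = []
for start in range(0, len(weights), group_size):
    per_group.extend(quant_dequant(weights[start:start + group_size]))

err_tensor = mean_abs_error(weights, per_tensor)
err_group = mean_abs_error(weights, per_group)
print(f"per-tensor error: {err_tensor:.4f}")
print(f"per-group error:  {err_group:.4f}")
```

The price of the lower error is one extra scale per group, which is the memory overhead mentioned above.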

def per_group_quantize(weight: torch.Tensor, group_size: int = 128, num_bits: int = 4):
    """Per-Group quantization implementation"""
    rows, cols = weight.shape

    # Split into groups (rows * cols must be divisible by group_size)
    weight_grouped = weight.reshape(-1, group_size)

    # Max/min per group
    max_vals = weight_grouped.max(dim=1, keepdim=True)[0]
    min_vals = weight_grouped.min(dim=1, keepdim=True)[0]
    q_max = 2 ** num_bits - 1  # 15 for INT4

    # Compute scales (clamped so a constant group does not divide by zero)
    scales = torch.clamp((max_vals - min_vals) / q_max, min=1e-8)
    zero_points = torch.round(-min_vals / scales)

    # Quantize
    q = torch.clamp(torch.round(weight_grouped / scales) + zero_points, 0, q_max)

    # Dequantize
    weight_dequant = scales * (q - zero_points)
    weight_dequant = weight_dequant.reshape(rows, cols)
    return q, scales, zero_points, weight_dequant


# Example: Transformer weight quantization
weight = torch.randn(4096, 4096)  # Llama-style weight
q, scales, zp, weight_dequant = per_group_quantize(weight, group_size=128, num_bits=4)
error = (weight - weight_dequant).abs().mean()
print(f"Per-Group INT4 quantization mean error: {error:.6f}")

# Packed INT4 values take half a byte each; scales are counted as FP32 (zero points are not counted)
print(f"Compression ratio: {weight.element_size() * weight.numel() / (q.numel() / 2 + scales.numel() * 4):.2f}x")

2. Post-Training Quantization (PTQ)

PTQ quantizes an already-trained model without retraining — the most practical approach and most widely used.

2.1 Calibration Dataset

PTQ uses a small calibration dataset to determine appropriate scale and zero_point values.

from transformers import AutoModelForCausalLM, AutoTokenizer

from datasets import load_dataset

def collect_calibration_data(model_name: str, num_samples: int = 128):

"""Collect calibration data"""

tokenizer = AutoTokenizer.from_pretrained(model_name)

WikiText-2 or C4 dataset is commonly used

dataset = load_dataset("wikitext", "wikitext-2-raw-v1", split="train")

texts = []

for item in dataset:

if len(item['text'].strip()) > 100:

texts.append(item['text'].strip())

if len(texts) >= num_samples:

break

Tokenize

encoded = [

tokenizer(text, return_tensors="pt", max_length=2048, truncation=True)

for text in texts

]

return encoded

def collect_activation_stats(model, calibration_data, layer_name: str):

"""Collect activation statistics for a specific layer"""

stats = {"min": float("inf"), "max": float("-inf")}

def hook_fn(module, input, output):

with torch.no_grad():

act = output.detach().float()

stats["min"] = min(stats["min"], act.min().item())

stats["max"] = max(stats["max"], act.max().item())

Register hook

target_layer = dict(model.named_modules())[layer_name]

handle = target_layer.register_forward_hook(hook_fn)

Run calibration data

model.eval()

with torch.no_grad():

for batch in calibration_data[:32]:

model(**batch)

handle.remove()

return stats

2.2 Min-Max Calibration

The simplest method: uses the global minimum and maximum values from calibration data.

class MinMaxCalibrator:

"""Min-Max calibrator"""

def __init__(self):

self.min_val = float("inf")

self.max_val = float("-inf")

def update(self, tensor: torch.Tensor):

self.min_val = min(self.min_val, tensor.min().item())

self.max_val = max(self.max_val, tensor.max().item())

def compute_scale_zp(self, num_bits: int = 8, symmetric: bool = True):

q_max = 2 ** (num_bits - 1) - 1 if symmetric else 2 ** num_bits - 1

if symmetric:

max_abs = max(abs(self.min_val), abs(self.max_val))

scale = max_abs / q_max

zero_point = 0

else:

scale = (self.max_val - self.min_val) / q_max

zero_point = -round(self.min_val / scale)

return scale, zero_point

2.3 Histogram Calibration

To reduce the impact of outliers, finds the optimal range based on the distribution histogram.

import numpy as np

from scipy import stats

class HistogramCalibrator:

"""Histogram-based calibrator (minimizes KL Divergence)"""

def __init__(self, num_bins: int = 2048):

self.num_bins = num_bins

self.histogram = None

self.bin_edges = None

def update(self, tensor: torch.Tensor):

data = tensor.detach().abs().float().cpu().numpy().flatten()  # magnitudes, because the returned range is symmetric

if self.histogram is None:

self.histogram, self.bin_edges = np.histogram(data, bins=self.num_bins)

else:

new_hist, _ = np.histogram(data, bins=self.bin_edges)

self.histogram += new_hist

def compute_optimal_range(self, num_bits: int = 8):

"""Search for optimal range minimizing KL Divergence"""

num_quantized_bins = 2 ** num_bits - 1

best_kl = float("inf")

best_threshold = None

for i in range(num_quantized_bins, len(self.histogram)):

reference = self.histogram[:i].copy().astype(float)

reference /= reference.sum()

quantized = np.zeros(i)

bin_size = i / num_quantized_bins

for j in range(num_quantized_bins):

start = int(j * bin_size)

end = int((j + 1) * bin_size)

quantized[start:end] = reference[start:end].sum() / (end - start)

quantized = np.where(quantized == 0, 1e-10, quantized)

reference_clipped = np.where(reference == 0, 1e-10, reference)

kl = stats.entropy(reference_clipped, quantized)

if kl < best_kl:

best_kl = kl

best_threshold = self.bin_edges[i]

return -best_threshold, best_threshold

2.4 Impact on Perplexity

The most common metric for measuring quantization quality is Perplexity (PPL).

from transformers import AutoModelForCausalLM, AutoTokenizer

def compute_perplexity(model, tokenizer, text: str, device: str = "cuda"):

"""Compute perplexity"""

encodings = tokenizer(text, return_tensors="pt")

input_ids = encodings.input_ids.to(device)

max_length = 1024

stride = 512

nlls = []

prev_end_loc = 0

for begin_loc in range(0, input_ids.size(1), stride):

end_loc = min(begin_loc + max_length, input_ids.size(1))

trg_len = end_loc - prev_end_loc

input_ids_chunk = input_ids[:, begin_loc:end_loc]

target_ids = input_ids_chunk.clone()

target_ids[:, :-trg_len] = -100

with torch.no_grad():

outputs = model(input_ids_chunk, labels=target_ids)

neg_log_likelihood = outputs.loss

nlls.append(neg_log_likelihood)

prev_end_loc = end_loc

if end_loc == input_ids.size(1):

break

ppl = torch.exp(torch.stack(nlls).mean())

return ppl.item()

Example PPL comparison

FP16: PPL ~5.68

INT8: PPL ~5.71 (~0.5% increase)

INT4 (GPTQ): PPL ~5.89 (~3.7% increase)

INT4 (naive): PPL ~6.52 (~14.8% increase)
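For reference, perplexity is the exponential of the mean negative log-likelihood per token, and the percentages above are plain relative increases over the FP16 baseline. A small sketch, with `perplexity` and `relative_increase` as hypothetical helper names:

```python
import math


def perplexity(token_log_probs):
    """exp of the mean negative log-likelihood per token."""
    return math.exp(-sum(token_log_probs) / len(token_log_probs))


def relative_increase(baseline_ppl, quantized_ppl):
    """Relative PPL increase in percent."""
    return 100.0 * (quantized_ppl - baseline_ppl) / baseline_ppl


# A model that assigns probability 1/4 to every token has perplexity 4
print(perplexity([math.log(0.25)] * 8))

# Relative increases for the example figures listed above
for name, ppl in [("INT8", 5.71), ("INT4 (GPTQ)", 5.89), ("INT4 (naive)", 6.52)]:
    print(f"{name}: +{relative_increase(5.68, ppl):.1f}%")
```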

3. Quantization-Aware Training (QAT)

QAT simulates quantization during training so the model adapts to quantization noise.

3.1 Fake Quantization

Simulates quantization effects in FP32 instead of actual INT8 operations.

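import torch
import torch.nn as nn
import torch.nn.functional as F


class FakeQuantize(nn.Module):
    """Fake quantization module"""

    def __init__(self, num_bits: int = 8, symmetric: bool = True):
        super().__init__()
        self.num_bits = num_bits
        self.symmetric = symmetric
        self.register_buffer('scale', torch.tensor(1.0))
        self.register_buffer('zero_point', torch.tensor(0.0))
        self.register_buffer('fake_quant_enabled', torch.tensor(1))
        if symmetric:
            self.q_min = -(2 ** (num_bits - 1))
            self.q_max = 2 ** (num_bits - 1) - 1
        else:
            self.q_min = 0
            self.q_max = 2 ** num_bits - 1

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # A 0-dim buffer cannot be indexed, so read it with .item()
        if self.fake_quant_enabled.item() == 0:
            return x

        # Update scale and zero_point with an exponential moving average during training
        if self.training:
            with torch.no_grad():
                if self.symmetric:
                    new_scale = x.abs().max() / self.q_max
                    new_zp = torch.zeros_like(self.zero_point)
                else:
                    new_scale = (x.max() - x.min()) / (self.q_max - self.q_min)
                    # Without a zero point, negative inputs would all clip to q_min = 0
                    new_zp = self.q_min - x.min() / new_scale.clamp(min=1e-8)
                self.scale.copy_((0.9 * self.scale + 0.1 * new_scale).clamp(min=1e-8))
                self.zero_point.copy_(0.9 * self.zero_point + 0.1 * new_zp)

        # Fake quantize: quantize then dequantize
        zp = torch.round(self.zero_point)
        x_scaled = x / self.scale + zp
        x_clipped = torch.clamp(x_scaled, self.q_min, self.q_max)
        # Straight-through rounding: round() in the forward pass, identity in the
        # backward pass (plain torch.round has zero gradient almost everywhere)
        x_rounded = x_clipped + (torch.round(x_clipped) - x_clipped).detach()
        x_dequant = (x_rounded - zp) * self.scale
        return x_dequant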

3.2 STE (Straight-Through Estimator)

class STERound(torch.autograd.Function):

"""Straight-Through Estimator for round()"""

@staticmethod

def forward(ctx, x):

return torch.round(x)

@staticmethod

def backward(ctx, grad_output):

# Pass gradient through round() unchanged (identity approximation)

return grad_output

class STEClamp(torch.autograd.Function):

"""Straight-Through Estimator for clamp()"""

@staticmethod

def forward(ctx, x, min_val, max_val):

ctx.save_for_backward(x)

ctx.min_val = min_val

ctx.max_val = max_val

return torch.clamp(x, min_val, max_val)

@staticmethod

def backward(ctx, grad_output):

x, = ctx.saved_tensors

# Pass gradient only within clamp range

grad = grad_output * ((x >= ctx.min_val) & (x <= ctx.max_val)).float()

return grad, None, None

class QATLinear(nn.Module):

"""Linear layer with QAT applied"""

def __init__(self, in_features, out_features, num_bits=8):

super().__init__()

self.linear = nn.Linear(in_features, out_features)

self.weight_fake_quant = FakeQuantize(num_bits=num_bits)

self.act_fake_quant = FakeQuantize(num_bits=num_bits, symmetric=False)

def forward(self, x):

# Activation quantization

x_q = self.act_fake_quant(x)

# Weight quantization

w_q = self.weight_fake_quant(self.linear.weight)

# FP32 compute (INT8 in actual deployment)

return F.linear(x_q, w_q, self.linear.bias)
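Why the straight-through estimator matters can be shown on a single scalar weight without autograd. The sketch below is a toy under stated assumptions (a fixed scale of 0.25, a squared-error loss against a target of 0.8, and hypothetical helper names): the true derivative of `round()` is zero almost everywhere, so without STE the latent weight never moves.

```python
def fake_quant(w, scale=0.25, q_max=7):
    """Quantize-dequantize a scalar onto a grid with step `scale`."""
    q = max(-q_max, min(q_max, round(w / scale)))
    return q * scale


def train(use_ste, steps=100, lr=0.1, target=0.8):
    """Minimize (fake_quant(w) - target)^2 by gradient descent on the latent weight w."""
    w = 0.1
    for _ in range(steps):
        w_q = fake_quant(w)
        grad_wq = 2.0 * (w_q - target)      # dL/dw_q
        dwq_dw = 1.0 if use_ste else 0.0    # STE pretends d(round)/dw = 1
        w -= lr * grad_wq * dwq_dw
    return w


w_ste = train(use_ste=True)
w_plain = train(use_ste=False)
print("with STE:   ", w_ste, "->", fake_quant(w_ste))
print("without STE:", w_plain, "->", fake_quant(w_plain))
```

With STE the quantized weight settles on a grid point next to the target (0.75 or 1.0); without it the weight stays at its initial value.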

3.3 When is QAT Needed?

- **When PTQ quality loss is too high**: Especially effective for small models (BERT-small, etc.)

- **Quantizing to INT4 or lower**: Essential for extreme compression

- **Precision-sensitive tasks**: Object detection, ASR, etc.

# QAT training workflow

import torch

import torch.nn.functional as F

import torch.optim as optim

from torch.ao.quantization import prepare_qat, convert

def train_qat_model(model, train_loader, num_epochs=10):

"""QAT training example"""

model.qconfig = torch.quantization.get_default_qat_qconfig('fbgemm')

model_prepared = prepare_qat(model.train())

optimizer = optim.Adam(model_prepared.parameters(), lr=1e-5)

for epoch in range(num_epochs):

for batch in train_loader:

inputs, labels = batch

outputs = model_prepared(inputs)

loss = F.cross_entropy(outputs, labels)

optimizer.zero_grad()

loss.backward()

optimizer.step()

# Convert to INT8 model

model_prepared.eval()

model_quantized = convert(model_prepared)

return model_quantized

4. PyTorch Quantization API

4.1 torch.ao.quantization

PyTorch's official quantization API.

from torch.ao.quantization import (

get_default_qconfig,

get_default_qat_qconfig,

prepare,

prepare_qat,

convert

)

# Static quantization (PTQ)

def static_quantization_example():

"""Static quantization example"""

model = MyModel()

model.eval()

# Backend config (fbgemm: x86, qnnpack: ARM)

model.qconfig = get_default_qconfig('fbgemm')

# Prepare for calibration

model_prepared = prepare(model)

# Collect statistics from calibration data

with torch.no_grad():

for data in calibration_loader:

model_prepared(data)

# Convert to INT8 model

model_quantized = convert(model_prepared)

return model_quantized

# Dynamic quantization (effective for LSTM, Linear)

def dynamic_quantization_example():

"""Dynamic quantization example"""

model = MyModel()

model_quantized = torch.quantization.quantize_dynamic(

model,

{nn.Linear, nn.LSTM}, # Layer types to quantize

dtype=torch.qint8

)

return model_quantized

4.2 FX Graph Mode Quantization

A more flexible and powerful quantization approach.

from torch.ao.quantization.quantize_fx import prepare_fx, convert_fx

from torch.ao.quantization import QConfigMapping

def fx_quantization_example(model, calibration_data):

"""FX Graph Mode quantization"""

model.eval()

qconfig_mapping = QConfigMapping().set_global(

get_default_qconfig('fbgemm')

)

example_inputs = (torch.randn(1, 3, 224, 224),)

model_prepared = prepare_fx(

model,

qconfig_mapping,

example_inputs

)

with torch.no_grad():

for batch in calibration_data:

model_prepared(batch)

model_quantized = convert_fx(model_prepared)

return model_quantized

5. GPTQ: Accurate Post-Training Quantization

GPTQ, published in 2022, is an LLM-specific quantization algorithm that minimizes quality loss even at INT4. (arXiv:2210.17323)

5.1 GPTQ Algorithm Principles

GPTQ is based on OBQ (Optimal Brain Quantization). The core idea: process the model layer by layer, quantize the weights of each layer one column at a time, and after each step update the not-yet-quantized weights to compensate for the error that was just introduced.

**OBQ error minimization objective**:

argmin_Q ||WX - QX||_F^2

Where W is the original weight, Q is the quantized weight, and X is the input activation.

**Hessian-based weight update**:

After quantizing each weight, the resulting error is propagated to the remaining weights using the inverse Hessian H^(-1).
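The compensation step can be checked by hand on a layer with only two input channels. The sketch below is a simplified illustration, not the GPTQ implementation: it uses an explicit 2x2 inverse instead of a Cholesky factor, takes H = X X^T (the constant factor 2 cancels), quantizes only the first weight, and all names and numbers are made up for the example.

```python
def quantize_to_grid(w, step=0.25):
    return round(w / step) * step


def output_error(w, w_hat, H):
    """(w - w_hat)^T H (w - w_hat), which equals ||wX - w_hat X||^2 when H = X X^T."""
    d = [a - b for a, b in zip(w, w_hat)]
    return sum(d[i] * H[i][j] * d[j] for i in range(2) for j in range(2))


# Two input channels observed over four calibration samples
X = [[1.0, 2.0, 3.0, 4.0],
     [2.0, 1.0, 4.0, 3.0]]
H = [[sum(a * b for a, b in zip(X[i], X[j])) for j in range(2)] for i in range(2)]

# Explicit inverse of the 2x2 Hessian
det = H[0][0] * H[1][1] - H[0][1] * H[1][0]
H_inv = [[H[1][1] / det, -H[0][1] / det],
         [-H[1][0] / det, H[0][0] / det]]

w = [0.30, -0.60]              # one output row of the weight matrix
q0 = quantize_to_grid(w[0])    # quantize the first weight only
err = (w[0] - q0) / H_inv[0][0]

naive = [q0, w[1]]                              # leave the second weight untouched
compensated = [q0, w[1] - err * H_inv[0][1]]    # shift it to absorb the error

print("naive error:      ", output_error(w, naive, H))
print("compensated error:", output_error(w, compensated, H))
```

Because the two input channels are strongly correlated in this data, most of the output error caused by rounding the first weight can be absorbed by the second one.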

def gptq_quantize_weight(weight: torch.Tensor,

hessian: torch.Tensor,

num_bits: int = 4,

group_size: int = 128,

damp_percent: float = 0.01):

"""

Quantize weights using the GPTQ algorithm

Args:

weight: [out_features, in_features] weight matrix

hessian: [in_features, in_features] Hessian matrix (H = 2 * X @ X.T)

num_bits: quantization bit count

group_size: group size

damp_percent: damping ratio for Hessian stabilization

"""

W = weight.clone().float()

n_rows, n_cols = W.shape

# Hessian damping (numerical stability)

H = hessian.clone().float()

dead_cols = torch.diag(H) == 0

H[dead_cols, dead_cols] = 1

W[:, dead_cols] = 0

damp = damp_percent * H.diag().mean()

H.diagonal().add_(damp)

# Inverse Hessian via Cholesky decomposition

H_inv = torch.linalg.cholesky(H)

H_inv = torch.cholesky_inverse(H_inv)

H_inv = torch.linalg.cholesky(H_inv, upper=True)

Q = torch.zeros_like(W)

Losses = torch.zeros_like(W)

q_max = 2 ** (num_bits - 1) - 1

for col_idx in range(n_cols):

w_col = W[:, col_idx]

h_inv_diag = H_inv[col_idx, col_idx]

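# Compute per-group scale (group_size == -1 means a single scale per row, computed once)

if col_idx == 0 or (group_size != -1 and col_idx % group_size == 0):

group_end = n_cols if group_size == -1 else min(col_idx + group_size, n_cols)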

w_group = W[:, col_idx:group_end]

max_abs = w_group.abs().max(dim=1)[0].unsqueeze(1)

scale = max_abs / q_max

scale = torch.clamp(scale, min=1e-8)

# Quantize

q_col = torch.clamp(torch.round(w_col / scale.squeeze()), -q_max, q_max)

q_col = q_col * scale.squeeze()

Q[:, col_idx] = q_col

# Quantization error

err = (w_col - q_col) / h_inv_diag

Losses[:, col_idx] = err ** 2 / 2

# Propagate error to remaining weights (the key step!)

W[:, col_idx + 1:] -= err.unsqueeze(1) * H_inv[col_idx, col_idx + 1:].unsqueeze(0)

return Q, Losses

5.2 Using AutoGPTQ

Practical GPTQ quantization uses the AutoGPTQ library.

from auto_gptq import AutoGPTQForCausalLM, BaseQuantizeConfig

from transformers import AutoTokenizer

def quantize_with_gptq(

model_name: str,

output_dir: str,

bits: int = 4,

group_size: int = 128

):

"""Quantize model with AutoGPTQ"""

tokenizer = AutoTokenizer.from_pretrained(model_name, use_fast=True)

quantize_config = BaseQuantizeConfig(

bits=bits,

group_size=group_size,

damp_percent=0.01,

desc_act=False,

sym=True,

true_sequential=True

)

model = AutoGPTQForCausalLM.from_pretrained(

model_name,

quantize_config=quantize_config

)

# Prepare calibration data

from datasets import load_dataset

dataset = load_dataset("wikitext", "wikitext-2-raw-v1", split="train")

calibration_data = []

for text in dataset["text"][:128]:

if len(text.strip()) > 50:

encoded = tokenizer(

text.strip(),

return_tensors="pt",

max_length=2048,

truncation=True

)

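# AutoGPTQ expects a list of dicts with input_ids and attention_mask

calibration_data.append({"input_ids": encoded["input_ids"], "attention_mask": encoded["attention_mask"]})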

print(f"Starting GPTQ {bits}bit quantization...")

model.quantize(calibration_data)

model.save_quantized(output_dir, use_safetensors=True)

tokenizer.save_pretrained(output_dir)

print(f"Quantization complete: {output_dir}")

return model, tokenizer

def load_gptq_model(model_dir: str, device: str = "cuda"):

"""Load GPTQ quantized model"""

model = AutoGPTQForCausalLM.from_quantized(

model_dir,

device=device,

use_triton=False,

disable_exllama=False,

inject_fused_attention=True,

inject_fused_mlp=True

)

tokenizer = AutoTokenizer.from_pretrained(model_dir)

return model, tokenizer

6. AWQ: Activation-aware Weight Quantization

AWQ, published in 2023, analyzes activation distributions to protect important weight channels. (arXiv:2306.00978)

6.1 Differences from GPTQ

| Feature | GPTQ | AWQ |
| ---------------- | -------------------------------- | -------------------------------- |
| Approach | Hessian-based error compensation | Activation-based scaling |
| Calibration data | Required (128+ samples) | Required (32+ samples) |
| Speed | Slow (1–4 hours) | Fast (tens of minutes) |
| Quality | Excellent | Excellent (comparable or better) |
| Key feature | Per-channel optimization | Activation outlier handling |

6.2 Using AutoAWQ

from awq import AutoAWQForCausalLM

from transformers import AutoTokenizer

def quantize_with_awq(

model_name: str,

output_dir: str,

bits: int = 4,

group_size: int = 128

):

"""Quantize model with AutoAWQ"""

tokenizer = AutoTokenizer.from_pretrained(

model_name,

trust_remote_code=True

)

model = AutoAWQForCausalLM.from_pretrained(

model_name,

low_cpu_mem_usage=True,

use_cache=False

)

quant_config = {

"zero_point": True,

"q_group_size": group_size,

"w_bit": bits,

"version": "GEMM"

}

print(f"Starting AWQ {bits}bit quantization...")

model.quantize(tokenizer, quant_config=quant_config)

model.save_quantized(output_dir)

tokenizer.save_pretrained(output_dir)

print(f"AWQ quantization complete: {output_dir}")

return model

def load_awq_model(model_dir: str, device: str = "cuda"):

"""Load AWQ quantized model"""

model = AutoAWQForCausalLM.from_quantized(

model_dir,

fuse_layers=True,

trust_remote_code=True,

safetensors=True

)

tokenizer = AutoTokenizer.from_pretrained(model_dir)

return model, tokenizer

7. GGUF/GGML: The llama.cpp Ecosystem

GGUF (GPT-Generated Unified Format) is the model format for the llama.cpp project, enabling efficient LLM execution even on CPU.

7.1 Understanding GGUF

GGUF was introduced in 2023 to replace GGML. It stores model metadata, hyperparameters, and tokenizer information in a single file.

7.2 Quantization Levels Comparison

| Type | Bits | Size (7B model) | Quality loss | Use case |
| ------ | ---- | ----------- | ------------ | ------------------------ |
| Q2_K | 2.6 | 2.8 GB | High | Extreme compression |
| Q3_K_S | 3.0 | 3.3 GB | Medium | Memory saving |
| Q4_0 | 4.0 | 3.8 GB | Low | Balanced |
| Q4_K_M | 4.1 | 4.1 GB | Very low | General recommendation |
| Q5_0 | 5.0 | 4.7 GB | Minimal | High quality |
| Q5_K_M | 5.1 | 4.8 GB | Minimal | High quality recommended |
| Q6_K | 6.0 | 5.5 GB | Nearly none | Near FP16 |
| Q8_0 | 8.0 | 7.2 GB | None | Reference use |
| F16 | 16.0 | 13.5 GB | None | Baseline |

K-quants (Q4_K_M, Q5_K_M, etc.) keep some layers at higher precision to improve quality.

7.3 Building and Using llama.cpp

# Clone and build llama.cpp
git clone https://github.com/ggerganov/llama.cpp
cd llama.cpp

# Build with CUDA support
cmake -B build -DGGML_CUDA=ON
cmake --build build --config Release -j $(nproc)

# CPU-only build (alternative to the CUDA build above)
cmake -B build
cmake --build build --config Release -j $(nproc)

# Convert a HuggingFace model to GGUF
# (the script takes a local model directory as a positional argument;
# ./Llama-2-7b-hf is a placeholder for wherever meta-llama/Llama-2-7b-hf was downloaded)
python convert_hf_to_gguf.py \
  ./Llama-2-7b-hf \
  --outfile llama2-7b-f16.gguf \
  --outtype f16

# Quantize to Q4_K_M
./build/bin/llama-quantize \
  llama2-7b-f16.gguf \
  llama2-7b-q4_k_m.gguf \
  Q4_K_M

# Run inference
./build/bin/llama-cli \
  -m llama2-7b-q4_k_m.gguf \
  -p "The future of AI is" \
  -n 100 \
  --ctx-size 4096 \
  --threads 8 \
  --n-gpu-layers 35

7.4 Python Bindings (llama-cpp-python)

from llama_cpp import Llama

# Load model

llm = Llama(

model_path="./llama2-7b-q4_k_m.gguf",

n_ctx=4096,

n_gpu_layers=35,

n_threads=8,

verbose=False

)

# Text generation

output = llm(

"Once upon a time",

max_tokens=200,

temperature=0.7,

top_p=0.9,

stop=["</s>", "\n\n"]

)

print(output["choices"][0]["text"])

# Chat completion format
response = llm.create_chat_completion(
    messages=[
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": "What is machine learning?"}
    ],
    max_tokens=500,
    temperature=0.7
)
print(response["choices"][0]["message"]["content"])

# Streaming output
for chunk in llm.create_chat_completion(
    messages=[{"role": "user", "content": "Tell me a joke"}],
    stream=True
):
    delta = chunk["choices"][0].get("delta", {})
    if "content" in delta:
        print(delta["content"], end="", flush=True)

8. bitsandbytes: LLM Quantization Library

bitsandbytes, developed by Tim Dettmers, integrates seamlessly with HuggingFace transformers.

8.1 LLM.int8() — 8-bit Mixed Precision

LLM.int8() handles activation outliers in FP16 during matrix multiplication while using INT8 for the rest.
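The decomposition idea can be sketched in pure Python on a single activation vector. This is a simplified illustration rather than the bitsandbytes kernels: the helper names and the data are made up, the outlier threshold of 6.0 is an assumed default, and the "high-precision" path is ordinary Python floats standing in for FP16.

```python
def absmax_quantize(values, q_max=127):
    """Return (integer codes, scale) for symmetric absmax quantization."""
    scale = max(abs(v) for v in values) / q_max
    if scale == 0:
        scale = 1.0
    return [round(v / scale) for v in values], scale


def int8_matvec(x, W, n_out):
    """y = x @ W with x quantized per vector and W quantized per output column."""
    if not x:
        return [0.0] * n_out
    x_q, s_x = absmax_quantize(x)
    y = []
    for j in range(n_out):
        col_q, s_w = absmax_quantize([row[j] for row in W])
        acc = sum(a * b for a, b in zip(x_q, col_q))  # integer accumulation
        y.append(acc * s_x * s_w)
    return y


def mixed_matvec(x, W, threshold=6.0):
    """Outlier input dimensions go through the float path, the rest through INT8."""
    n_out = len(W[0])
    outlier = [i for i, v in enumerate(x) if abs(v) >= threshold]
    regular = [i for i in range(len(x)) if i not in outlier]
    y_int8 = int8_matvec([x[i] for i in regular], [W[i] for i in regular], n_out)
    y_float = [sum(x[i] * W[i][j] for i in outlier) for j in range(n_out)]
    return [a + b for a, b in zip(y_int8, y_float)]


x = [0.5, -1.2, 60.0, 0.8]  # dimension 2 is an activation outlier
W = [[0.2, -0.5], [0.7, 0.1], [-0.3, 0.4], [0.9, -0.6]]

exact = [sum(x[i] * W[i][j] for i in range(4)) for j in range(2)]
full_int8 = int8_matvec(x, W, 2)
mixed = mixed_matvec(x, W)


def total_abs_error(y):
    return sum(abs(a - b) for a, b in zip(y, exact))


print("all-INT8 error:       ", total_abs_error(full_int8))
print("mixed-precision error:", total_abs_error(mixed))
```

Removing the outlier dimension from the INT8 path shrinks the activation scale, so the remaining values are quantized with a much finer step.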

from transformers import AutoModelForCausalLM, AutoTokenizer

# Load INT8 model

model_8bit = AutoModelForCausalLM.from_pretrained(

"meta-llama/Llama-2-7b-hf",

load_in_8bit=True,

device_map="auto"

)

def print_model_size(model, label):

"""Print model memory usage"""

total_params = sum(p.numel() for p in model.parameters())

total_bytes = sum(

p.numel() * p.element_size() for p in model.parameters()

)

print(f"{label}: {total_params/1e9:.2f}B params, {total_bytes/1e9:.2f} GB")

print_model_size(model_8bit, "INT8 model")

# INT8 model: 6.74B params, ~7.0 GB

8.2 4-bit Quantization (Used in QLoRA)

import torch

from transformers import BitsAndBytesConfig

# NF4 quantization config (QLoRA)

bnb_config_nf4 = BitsAndBytesConfig(

load_in_4bit=True,

bnb_4bit_quant_type="nf4",

bnb_4bit_compute_dtype=torch.bfloat16,

bnb_4bit_use_double_quant=True, # Double quantization

)

FP4 quantization config

bnb_config_fp4 = BitsAndBytesConfig(

load_in_4bit=True,

bnb_4bit_quant_type="fp4",

bnb_4bit_compute_dtype=torch.float16,

)

Load model

model_4bit = AutoModelForCausalLM.from_pretrained(

"meta-llama/Llama-2-7b-hf",

quantization_config=bnb_config_nf4,

device_map="auto"

)

print_model_size(model_4bit, "NF4 model")

NF4 model: 6.74B params, ~4.0 GB (with double quantization)

QLoRA fine-tuning setup

from peft import LoraConfig, get_peft_model, prepare_model_for_kbit_training

model_4bit = prepare_model_for_kbit_training(model_4bit)

lora_config = LoraConfig(

r=64,

lora_alpha=16,

target_modules=["q_proj", "v_proj"],

lora_dropout=0.05,

bias="none",

task_type="CAUSAL_LM"

)

model_lora = get_peft_model(model_4bit, lora_config)

model_lora.print_trainable_parameters()

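# Note: the output "trainable params: 4,194,304 || all params: 3,504,607,232 || trainable%: 0.1197" corresponds to r=8; with r=64 as configured above, the adapters on q_proj and v_proj (32 layers, hidden size 4096) have 32 * 2 * 2 * 4096 * 64 = 33,554,432 trainable params.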

8.3 NF4 vs FP4

**NF4 (Normal Float 4)**

- Non-linear 4-bit quantization assuming a normal distribution

- Leverages the observation that weight distributions are approximately normal

- Better representational power at the same bit count

**FP4 (Float 4)**

- Floating-point based 4-bit

- Can represent wider ranges

9. SmoothQuant: W8A8 Quantization

SmoothQuant quantizes both weights (W) and activations (A) to INT8 for faster inference.

9.1 The Activation Outlier Problem

LLM activations exhibit very large values (outliers) in specific channels, making W8A8 quantization challenging.

9.2 Migration Scaling

SmoothQuant's key insight: transfer the difficulty from activations to weights.

Y = (X * diag(s)^(-1)) * (diag(s) * W)

= X_smooth * W_smooth
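The identity above can be verified numerically. In this pure-Python sketch (toy data and hypothetical helper names; `W` is stored as input channels by output channels, the transpose of the PyTorch `nn.Linear` layout), dividing each activation channel by `s` and multiplying the matching weight row by `s` leaves the output unchanged while shrinking the activation outlier.

```python
def smooth_scales(X, W, alpha=0.5):
    """s_j = max|X[:, j]|^alpha / max|W[j, :]|^(1 - alpha) for each input channel j."""
    scales = []
    for j in range(len(W)):
        act_max = max(abs(row[j]) for row in X)
        w_max = max(abs(v) for v in W[j])
        scales.append(act_max ** alpha / w_max ** (1 - alpha))
    return scales


def matmul(A, B):
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)] for row in A]


# 2 tokens x 3 input channels; channel 1 carries an outlier
X = [[0.5, 40.0, -1.0],
     [-0.8, -36.0, 0.6]]
# 3 input channels x 2 output channels
W = [[0.30, -0.20],
     [0.05, 0.04],
     [-0.40, 0.10]]

s = smooth_scales(X, W)
X_smooth = [[v / s[j] for j, v in enumerate(row)] for row in X]
W_smooth = [[v * s[j] for v in W[j]] for j in range(len(W))]

Y = matmul(X, W)
Y_smooth = matmul(X_smooth, W_smooth)

max_before = max(abs(v) for row in X for v in row)
max_after = max(abs(v) for row in X_smooth for v in row)
print("max |X| before:", max_before, "after:", max_after)
```

With `alpha = 0.5` each channel ends up with the same maximum magnitude on the activation side and the weight side, which is the "equal distribution" of difficulty.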

def smooth_quantize(

model,

calibration_samples,

alpha: float = 0.5

):

"""

Apply SmoothQuant

Args:

alpha: migration strength (higher values move more of the quantization difficulty from activations to weights)

Recommended: 0.5 (equal distribution)

"""

act_scales = {}

def collect_scales(name):

def hook(module, input, output):

inp = input[0].detach()

if inp.dim() == 3:

inp = inp.reshape(-1, inp.size(-1))

channel_max = inp.abs().max(dim=0)[0]

if name not in act_scales:

act_scales[name] = channel_max

else:

act_scales[name] = torch.maximum(act_scales[name], channel_max)

return hook

handles = []

for name, module in model.named_modules():

if isinstance(module, torch.nn.Linear):

handles.append(module.register_forward_hook(collect_scales(name)))

with torch.no_grad():

for sample in calibration_samples:

model(**sample)

for h in handles:

h.remove()

for name, module in model.named_modules():

if isinstance(module, torch.nn.Linear) and name in act_scales:

act_scale = act_scales[name]

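weight_scale = module.weight.abs().max(dim=0)[0].clamp(min=1e-5)

# Compute migration scale

smooth_scale = (act_scale ** alpha) / (weight_scale ** (1 - alpha))

smooth_scale = torch.clamp(smooth_scale, min=1e-5)

# Apply scale to weights: W_smooth = diag(s) * W, so each input channel is multiplied by s.
# The matching division of the activations by s has to be folded into the preceding
# operation (for example the LayerNorm weights); that step is not shown here.

module.weight.data = module.weight.data * smooth_scale.unsqueeze(0)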

return model, act_scales

10. SpQR: Sparse Quantization Representation

SpQR stores important weights (outliers) in FP16 separately and quantizes the remainder to low precision.

def spqr_quantize(weight: torch.Tensor,

num_bits: int = 3,

outlier_threshold_percentile: float = 1.0):

"""

SpQR quantization (simplified version)

Core: store top p% outliers as FP16, quantize rest to low bits

"""

threshold = torch.quantile(weight.abs(), 1 - outlier_threshold_percentile / 100)

outlier_mask = weight.abs() > threshold

Store outliers (FP16)

outlier_values = weight.clone()

outlier_values[~outlier_mask] = 0

Quantize remainder

regular_weight = weight.clone()

regular_weight[outlier_mask] = 0

q_max = 2 ** (num_bits - 1) - 1

group_size = 16

rows, cols = regular_weight.shape

regular_grouped = regular_weight.reshape(-1, group_size)

max_abs = regular_grouped.abs().max(dim=1, keepdim=True)[0]

scales = max_abs / q_max

scales = torch.clamp(scales, min=1e-8)

q = torch.clamp(torch.round(regular_grouped / scales), -q_max, q_max).to(torch.int8)

regular_dequant = (scales * q.float()).reshape(rows, cols)

reconstructed = regular_dequant + outlier_values

error = (weight - reconstructed).abs().mean().item()

outlier_memory = outlier_mask.sum().item() * 2

regular_memory = (~outlier_mask).sum().item() * (num_bits / 8)

total_memory = outlier_memory + regular_memory

original_memory = weight.numel() * weight.element_size()

compression_ratio = original_memory / total_memory

print(f"Outlier ratio: {outlier_mask.float().mean():.2%}")

print(f"Mean reconstruction error: {error:.6f}")

print(f"Compression ratio: {compression_ratio:.2f}x")

return q, scales, outlier_values, outlier_mask

11. Quantization Benchmark Comparison

11.1 Llama-2-7B Benchmark

import time

import torch

import GPUtil  # third-party package (pip install GPUtil)

def benchmark_quantization(model, tokenizer, device="cuda", num_runs=50):

"""Benchmark quantized model"""

prompt = "The history of artificial intelligence began"

inputs = tokenizer(prompt, return_tensors="pt").to(device)

memory_used_gb = None  # stays None when not running on CUDA

if device == "cuda":

torch.cuda.synchronize()

gpu = GPUtil.getGPUs()[0]

memory_used_gb = gpu.memoryUsed / 1024

# Warmup

with torch.no_grad():

for _ in range(5):

outputs = model.generate(

**inputs,

max_new_tokens=50,

do_sample=False

)

# Measure speed

if device == "cuda":

torch.cuda.synchronize()

start = time.time()

with torch.no_grad():

for _ in range(num_runs):

outputs = model.generate(

**inputs,

max_new_tokens=50,

do_sample=False

)

if device == "cuda":

torch.cuda.synchronize()

elapsed = time.time() - start

avg_time = elapsed / num_runs

tokens_per_second = 50 / avg_time

return {

"memory_gb": memory_used_gb,

"avg_time_ms": avg_time * 1000,

"tokens_per_second": tokens_per_second

}
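Speed and memory are only part of the picture: the example results that follow also report perplexity (PPL). As a reminder of what that number means, here is a minimal sketch that computes perplexity from per-token log-probabilities. The helper name `perplexity` is hypothetical, and obtaining the log-probabilities from a model (for example with a forward pass over an evaluation set) is assumed to happen elsewhere.

```python
import math
from typing import Sequence


def perplexity(token_log_probs: Sequence[float]) -> float:
    """Perplexity = exp(mean negative log-likelihood), for natural-log inputs."""
    if not token_log_probs:
        raise ValueError("need at least one token log-probability")
    mean_nll = -sum(token_log_probs) / len(token_log_probs)
    return math.exp(mean_nll)


# Sanity check: a model that assigns probability 1/4 to every token is as
# uncertain as a uniform choice among 4 tokens, so its perplexity is about 4.
uniform_log_probs = [math.log(0.25)] * 10
print(f"Perplexity: {perplexity(uniform_log_probs):.3f}")
```

Lower is better. Quantization error shows up as a small rise in this value relative to the FP16 baseline.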

Example results (A100 80GB, Llama-2-7B)

benchmark_results = {

"FP16": {"memory_gb": 13.5, "tokens_per_second": 52.3, "ppl": 5.68},

"INT8 (bitsandbytes)": {"memory_gb": 7.8, "tokens_per_second": 38.1, "ppl": 5.71},

"INT4 GPTQ": {"memory_gb": 4.5, "tokens_per_second": 65.2, "ppl": 5.89},

"INT4 AWQ": {"memory_gb": 4.3, "tokens_per_second": 68.7, "ppl": 5.86},

"Q4_K_M (GGUF, CPU)": {"memory_gb": 4.1, "tokens_per_second": 45.2, "ppl": 5.91},

"INT4 NF4": {"memory_gb": 4.0, "tokens_per_second": 31.5, "ppl": 5.94},

}

print("=" * 80)

print(f"{'Method':<25} {'Memory (GB)':<12} {'Tok/s':<12} {'PPL':<8}")

print("=" * 80)

for method, stats in benchmark_results.items():
    print(f"{method:<25} {stats['memory_gb']:<12.1f} {stats['tokens_per_second']:<12.1f} {stats['ppl']:<8.2f}")
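Absolute numbers are easier to compare once they are expressed relative to the FP16 baseline. The sketch below does that for the GPU rows of the example results above. The helper name `relative_to_baseline` is hypothetical, and the figures are copied from the table rather than measured again.

```python
# GPU rows copied from the example results above
# (the GGUF row is omitted because its memory figure is system RAM, not GPU memory)
results = {
    "FP16": {"memory_gb": 13.5, "ppl": 5.68},
    "INT8 (bitsandbytes)": {"memory_gb": 7.8, "ppl": 5.71},
    "INT4 GPTQ": {"memory_gb": 4.5, "ppl": 5.89},
    "INT4 AWQ": {"memory_gb": 4.3, "ppl": 5.86},
    "INT4 NF4": {"memory_gb": 4.0, "ppl": 5.94},
}


def relative_to_baseline(results: dict, baseline: str = "FP16") -> dict:
    """Express memory as a reduction factor and PPL as a percentage increase."""
    base = results[baseline]
    relative = {}
    for method, stats in results.items():
        relative[method] = {
            "memory_reduction": base["memory_gb"] / stats["memory_gb"],
            "ppl_increase_pct": 100 * (stats["ppl"] - base["ppl"]) / base["ppl"],
        }
    return relative


for method, rel in relative_to_baseline(results).items():
    print(f"{method:<22} {rel['memory_reduction']:.2f}x smaller, "
          f"PPL +{rel['ppl_increase_pct']:.1f}%")
```

On these figures every 4-bit method cuts memory by at least 3x relative to FP16 while raising perplexity by less than 5%.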

12. Practical Guide: Choosing the Right Quantization Method

12.1 Strategy by Model Size

**Small models (up to 7B)**:

- GGUF Q4_K_M: optimal for local CPU execution

- AWQ INT4: recommended for GPU server deployment

- FP16: viable when memory allows (a 7B model needs about 14GB for weights, so it fits on a 24GB GPU)

**Mid-size models 13B–30B**:

- GPTQ INT4 or AWQ INT4: runs on a single 24GB GPU

- GGUF Q4_K_M: a 13B model runs in 16GB RAM; 30B-class models need more (roughly 20GB for the weights alone)

**Large models 70B+**:

- GPTQ INT4: runs on a single A100 80GB

- GPTQ INT2: for extreme compression, at a substantial cost in quality

- Multi-GPU + Tensor Parallel combination
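The sizing claims above follow from simple arithmetic: weight memory is roughly the parameter count times the bits per weight. A minimal sketch, assuming weights only and 1 GB = 10^9 bytes (the helper name `weight_memory_gb` is hypothetical):

```python
def weight_memory_gb(params_billion: float, bits_per_weight: float) -> float:
    """Weights-only footprint in GB; excludes the KV cache and activations."""
    return params_billion * bits_per_weight / 8


# Print a small grid for the model sizes discussed above
for size in (7, 13, 30, 70):
    row = ", ".join(
        f"{bits}-bit: {weight_memory_gb(size, bits):5.1f} GB" for bits in (16, 8, 4)
    )
    print(f"{size:>3}B  {row}")
```

These are lower bounds. Group scales, the KV cache, and activations add to them, so leave headroom when matching a model to a GPU.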

12.2 Strategy by Task

def recommend_quantization(
    task: str,
    model_size_b: float,
    gpu_memory_gb: float,
    cpu_only: bool = False,
    fine_tuning_needed: bool = False,
) -> list:
    """Recommend quantization based on task and environment"""
    recommendations = []

    # CPU-only environments: GGUF is the only option considered
    if cpu_only:
        recommendations.append({
            "method": "GGUF Q4_K_M",
            "reason": "Optimized for CPU inference, based on llama.cpp",
            "library": "llama-cpp-python"
        })
        return recommendations

    # Fine-tuning: train LoRA adapters on top of a 4-bit base model
    if fine_tuning_needed:
        recommendations.append({
            "method": "bitsandbytes NF4 + QLoRA",
            "reason": "Fine-tuning capable, LoRA adapter training with ~4GB overhead",
            "library": "bitsandbytes + peft"
        })
        return recommendations

    # Weights-only estimates in GB (model_size_b is in billions of parameters)
    fp16_memory = model_size_b * 2
    int8_memory = model_size_b * 1
    int4_memory = model_size_b * 0.5

    # The 0.8 factor keeps 20% of GPU memory free for the KV cache and activations
    if fp16_memory <= gpu_memory_gb * 0.8:
        recommendations.append({
            "method": "FP16 (baseline)",
            "reason": "Memory is sufficient, best quality",
            "memory_gb": fp16_memory
        })

    if int8_memory <= gpu_memory_gb * 0.8:
        if task in ["chat", "completion", "summarization"]:
            # AutoAWQ targets 4-bit weights, so the 8-bit option here is bitsandbytes
            recommendations.append({
                "method": "bitsandbytes INT8",
                "reason": "Near-FP16 quality at about half the memory",
                "library": "bitsandbytes",
                "memory_gb": int8_memory
            })

    if int4_memory <= gpu_memory_gb * 0.8:
        recommendations.append({
            "method": "AWQ INT4",
            "reason": "Fast inference, excellent quality",
            "library": "autoawq",
            "memory_gb": int4_memory
        })
        recommendations.append({
            "method": "GPTQ INT4",
            "reason": "High INT4 quality, slower quantization process",
            "library": "auto-gptq",
            "memory_gb": int4_memory
        })

    return recommendations

Example usage

recommendations = recommend_quantization(

task="chat",

model_size_b=7.0,

gpu_memory_gb=16.0,

fine_tuning_needed=False

)

for rec in recommendations:
    print(f"\nMethod: {rec['method']}")
    print(f"Reason: {rec['reason']}")
    if 'library' in rec:
        print(f"Library: {rec['library']}")
    if 'memory_gb' in rec:
        print(f"Expected memory: {rec['memory_gb']:.1f} GB")

12.3 Complete Quantization Pipeline

import os

from auto_gptq import AutoGPTQForCausalLM, BaseQuantizeConfig
from awq import AutoAWQForCausalLM
from transformers import AutoTokenizer


class QuantizationPipeline:
    """Unified quantization pipeline"""

    def __init__(self, model_name: str, output_base_dir: str):
        self.model_name = model_name
        self.output_base_dir = output_base_dir
        self.tokenizer = AutoTokenizer.from_pretrained(model_name)
        os.makedirs(output_base_dir, exist_ok=True)

    def quantize_gptq(self, bits: int = 4, group_size: int = 128):
        """GPTQ quantization"""
        output_dir = os.path.join(self.output_base_dir, f"gptq-{bits}bit")

        config = BaseQuantizeConfig(
            bits=bits,
            group_size=group_size,
            sym=True,
            desc_act=False
        )
        model = AutoGPTQForCausalLM.from_pretrained(
            self.model_name,
            quantize_config=config
        )

        # GPTQ needs calibration samples to estimate the layer-wise Hessians
        calibration_data = self._prepare_calibration_data()
        model.quantize(calibration_data)

        model.save_quantized(output_dir)
        self.tokenizer.save_pretrained(output_dir)
        print(f"GPTQ {bits}bit saved: {output_dir}")
        return output_dir

    def quantize_awq(self, bits: int = 4, group_size: int = 128):
        """AWQ quantization"""
        output_dir = os.path.join(self.output_base_dir, f"awq-{bits}bit")

        model = AutoAWQForCausalLM.from_pretrained(
            self.model_name,
            low_cpu_mem_usage=True
        )

        # Note: AutoAWQ's kernels target 4-bit weights, so keep bits=4 here
        quant_config = {
            "zero_point": True,
            "q_group_size": group_size,
            "w_bit": bits,
            "version": "GEMM"
        }

        # AutoAWQ loads its own default calibration set when none is passed
        model.quantize(self.tokenizer, quant_config=quant_config)

        model.save_quantized(output_dir)
        self.tokenizer.save_pretrained(output_dir)
        print(f"AWQ {bits}bit saved: {output_dir}")
        return output_dir

    def _prepare_calibration_data(self, num_samples: int = 128):
        """Prepare calibration data"""
        from datasets import load_dataset

        dataset = load_dataset("wikitext", "wikitext-2-raw-v1", split="train")

        data = []
        for text in dataset["text"]:
            # Skip empty lines and short headings
            if len(text.strip()) > 50:
                encoded = self.tokenizer(
                    text.strip(),
                    return_tensors="pt",
                    max_length=2048,
                    truncation=True
                )
                # AutoGPTQ expects each example as a dict with
                # input_ids and attention_mask, not a bare tensor
                data.append({
                    "input_ids": encoded["input_ids"],
                    "attention_mask": encoded["attention_mask"]
                })
            if len(data) >= num_samples:
                break

        return data

Conclusion

Model quantization is a foundational technology for democratizing LLMs. To summarize what we covered:

1. **Fundamentals**: The math behind compressing FP32 to INT4 (scale, zero_point)

2. **PTQ vs QAT**: PTQ is practical without retraining; QAT is essential for extreme compression

3. **GPTQ**: High INT4 quality via Hessian-based error compensation

4. **AWQ**: Fast and efficient quantization based on activation distributions

5. **GGUF**: Optimized for CPU execution, multiple quality levels available

6. **bitsandbytes**: HuggingFace integration, essential for QLoRA fine-tuning

**Recommended strategies**:

- Local execution: GGUF Q4_K_M

- GPU server deployment: AWQ 4-bit

- Quality-critical scenarios: GPTQ 4-bit or FP16

- Fine-tuning needed: bitsandbytes NF4 + QLoRA

Quantization technology is evolving rapidly, with 2-bit methods like QuIP# and AQLM emerging. The journey toward smaller, faster models continues.

References

- GPTQ: [arXiv:2210.17323](https://arxiv.org/abs/2210.17323)

- AWQ: [arXiv:2306.00978](https://arxiv.org/abs/2306.00978)

- llama.cpp: [github.com/ggerganov/llama.cpp](https://github.com/ggerganov/llama.cpp)

- bitsandbytes: [github.com/TimDettmers/bitsandbytes](https://github.com/TimDettmers/bitsandbytes)

- SmoothQuant: [arXiv:2211.10438](https://arxiv.org/abs/2211.10438)

- SpQR: [arXiv:2306.03078](https://arxiv.org/abs/2306.03078)

- PyTorch Quantization: [pytorch.org/docs/stable/quantization.html](https://pytorch.org/docs/stable/quantization.html)