Deep Learning Model Quantization, Mastered: INT8, INT4, GPTQ, AWQ, and GGUF
Introduction
As deep learning models grow ever larger, inference cost and memory requirements have exploded. GPT-3 has 175B parameters and Llama 3 reaches 70B; stored at full FP32 precision they require roughly 700 GB and 280 GB of memory respectively. That is beyond what an ordinary GPU can even load.
**Model quantization** is the key technique for solving this problem. Compressing 32-bit floating-point (FP32) weights to 8-bit or 4-bit integers cuts memory by roughly 4 to 8x and speeds up inference by about 2 to 4x, with surprisingly little quality loss.
This article digs into everything from the mathematics of quantization to modern techniques such as GPTQ, AWQ, GGUF, and bitsandbytes.
1. Quantization Basics: Understanding Number Representations
1.1 Floating-Point Formats
Understanding the floating-point formats used in modern deep learning is the starting point for quantization.
FP32 (Float32)
- Sign (1 bit) + exponent (8 bits) + mantissa (23 bits) = 32 bits total
- Range: roughly -3.4e38 to 3.4e38
- Precision: about 7 decimal digits
FP16 (Float16)
- Sign (1 bit) + exponent (5 bits) + mantissa (10 bits) = 16 bits total
- Range: -65504 to 65504 (far narrower than FP32)
- Precision: about 3 decimal digits
- Prone to overflow, so training requires gradient scaling
BF16 (Brain Float16)
- Sign (1 bit) + exponent (8 bits) + mantissa (7 bits) = 16 bits total
- Keeps the same exponent range as FP32 and shrinks only the mantissa
- No overflow risk; safer for deep learning training
- Developed at Google Brain; natively supported on recent GPUs (A100, H100)
import torch
import numpy as np
# Check the memory footprint of each dtype
x_fp32 = torch.tensor([1.5, -2.3, 0.7], dtype=torch.float32)
x_fp16 = torch.tensor([1.5, -2.3, 0.7], dtype=torch.float16)
x_bf16 = torch.tensor([1.5, -2.3, 0.7], dtype=torch.bfloat16)
print(f"FP32: {x_fp32.element_size()} bytes per element") # 4 bytes
print(f"FP16: {x_fp16.element_size()} bytes per element") # 2 bytes
print(f"BF16: {x_bf16.element_size()} bytes per element") # 2 bytes
# Model memory example (a 7B-parameter model)
params = 7e9
fp32_memory_gb = params * 4 / 1e9
fp16_memory_gb = params * 2 / 1e9
int8_memory_gb = params * 1 / 1e9
int4_memory_gb = params * 0.5 / 1e9
print(f"\n7B model memory requirements:")
print(f"FP32: {fp32_memory_gb:.1f} GB") # 28.0 GB
print(f"FP16: {fp16_memory_gb:.1f} GB") # 14.0 GB
print(f"INT8: {int8_memory_gb:.1f} GB") # 7.0 GB
print(f"INT4: {int4_memory_gb:.1f} GB") # 3.5 GB
1.2 Integer Formats
The core of quantization is mapping floating-point values to integers.
INT8: -128 to 127 (signed) or 0 to 255 (unsigned)
INT4: -8 to 7 (signed) or 0 to 15 (unsigned)
INT2: -2 to 1 (signed) or 0 to 3 (unsigned)
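These ranges all follow from the bit width; a small helper (illustrative, not from any library) makes the pattern explicit:

```python
def int_range(num_bits: int, signed: bool = True) -> tuple:
    """(min, max) representable by a signed (two's-complement) or unsigned integer."""
    if signed:
        return -(2 ** (num_bits - 1)), 2 ** (num_bits - 1) - 1
    return 0, 2 ** num_bits - 1

print(int_range(8))                 # (-128, 127)
print(int_range(4))                 # (-8, 7)
print(int_range(2, signed=False))   # (0, 3)
```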
1.3 The Quantization Formula
The basic formula for converting a floating-point value x to an integer q:
q = clamp(round(x / scale) + zero_point, q_min, q_max)
Dequantization:
x_approx = scale * (q - zero_point)
where:
- scale: the quantization scale factor (scale = (max_val - min_val) / (q_max - q_min))
- zero_point: the integer offset corresponding to real-valued zero
- q_min, q_max: the integer range (-128 and 127 for INT8)
import torch
import numpy as np
def symmetric_quantize(x: torch.Tensor, num_bits: int = 8):
    """Symmetric quantization."""
    q_max = 2 ** (num_bits - 1) - 1  # 127 for INT8
    q_min = -q_max  # -127
    # Compute the scale
    max_abs = x.abs().max()
    scale = max_abs / q_max
    # Quantize
    q = torch.clamp(torch.round(x / scale), q_min, q_max).to(torch.int8)
return q, scale
def asymmetric_quantize(x: torch.Tensor, num_bits: int = 8):
    """Asymmetric quantization."""
    q_max = 2 ** num_bits - 1  # 255 for UINT8
    q_min = 0
    # Compute the scale and zero_point
    min_val = x.min()
    max_val = x.max()
    scale = (max_val - min_val) / (q_max - q_min)
    zero_point = q_min - torch.round(min_val / scale)
    zero_point = torch.clamp(zero_point, q_min, q_max).to(torch.int32)
    # Quantize
    q = torch.clamp(torch.round(x / scale) + zero_point, q_min, q_max).to(torch.uint8)
return q, scale, zero_point
def dequantize(q: torch.Tensor, scale: torch.Tensor, zero_point: torch.Tensor = None):
    """Dequantize."""
if zero_point is None:
return scale * q.float()
return scale * (q.float() - zero_point.float())
# Test
x = torch.randn(100)
print(f"Original data range: [{x.min():.4f}, {x.max():.4f}]")
# Symmetric quantization
q_sym, scale_sym = symmetric_quantize(x)
x_reconstructed_sym = dequantize(q_sym, scale_sym)
error_sym = (x - x_reconstructed_sym).abs().mean()
print(f"Symmetric quantization mean error: {error_sym:.6f}")
# Asymmetric quantization
q_asym, scale_asym, zp_asym = asymmetric_quantize(x)
x_reconstructed_asym = dequantize(q_asym, scale_asym, zp_asym)
error_asym = (x - x_reconstructed_asym).abs().mean()
print(f"Asymmetric quantization mean error: {error_asym:.6f}")
1.4 Symmetric vs. Asymmetric Quantization
Symmetric quantization
- zero_point = 0
- The positive and negative ranges are symmetric
- Well suited to weight quantization (weights are mostly zero-centered)
- Simple arithmetic: x_approx = scale * q
Asymmetric quantization
- zero_point != 0
- Can represent an arbitrary range
- Well suited to activation quantization (post-ReLU values are always non-negative)
- More complex arithmetic: x_approx = scale * (q - zero_point)
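To see why this matters for post-ReLU activations, here is a NumPy sketch (mirroring the PyTorch functions above) comparing round-trip error on an all-positive tensor. The symmetric scheme wastes half its integer range on negative values that never occur, so its step size, and hence its error, is roughly twice that of the asymmetric scheme.

```python
import numpy as np

def quant_error(x: np.ndarray, num_bits: int, symmetric: bool) -> float:
    """Quantize-dequantize round trip; returns the mean absolute error."""
    if symmetric:
        q_max = 2 ** (num_bits - 1) - 1
        scale = np.abs(x).max() / q_max
        q = np.clip(np.round(x / scale), -q_max, q_max)
        x_hat = q * scale
    else:
        q_max = 2 ** num_bits - 1
        scale = (x.max() - x.min()) / q_max
        zp = np.round(-x.min() / scale)
        q = np.clip(np.round(x / scale) + zp, 0, q_max)
        x_hat = (q - zp) * scale
    return float(np.abs(x - x_hat).mean())

rng = np.random.default_rng(0)
relu_act = np.maximum(rng.standard_normal(10_000), 0)  # all non-negative, like post-ReLU
err_sym = quant_error(relu_act, 8, symmetric=True)
err_asym = quant_error(relu_act, 8, symmetric=False)
print(f"symmetric: {err_sym:.6f}, asymmetric: {err_asym:.6f}")
```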
1.5 Quantization Granularity
Granularity determines how many parameters share a single scale/zero_point.
Per-Tensor: one scale for the entire tensor
- Minimal memory overhead
- Largest precision loss
Per-Channel (Per-Row/Column): a separate scale per channel
- Each row/column of the weight matrix gets its own scale
- Handles per-channel distribution differences effectively
Per-Group (Per-Block): a separate scale per fixed-size group
- group_size = 128 is typical
- A middle ground between Per-Channel and Per-Tensor
- The scheme used by GPTQ and AWQ
import torch
def per_group_quantize(weight: torch.Tensor, group_size: int = 128, num_bits: int = 4):
    """Per-group quantization."""
    rows, cols = weight.shape
    # Split into groups (assumes the element count is a multiple of group_size)
    weight_grouped = weight.reshape(-1, group_size)
    # Per-group max/min
    max_vals = weight_grouped.max(dim=1, keepdim=True)[0]
    min_vals = weight_grouped.min(dim=1, keepdim=True)[0]
    q_max = 2 ** num_bits - 1  # 15 for INT4
    # Compute the scales
    scales = (max_vals - min_vals) / q_max
    zero_points = torch.round(-min_vals / scales)
    # Quantize
    q = torch.clamp(torch.round(weight_grouped / scales) + zero_points, 0, q_max)
    # Dequantize
    weight_dequant = scales * (q - zero_points)
    weight_dequant = weight_dequant.reshape(rows, cols)
    return q, scales, zero_points, weight_dequant
# Example: quantizing a Transformer weight matrix
weight = torch.randn(4096, 4096)  # Llama-style weight
q, scales, zp, weight_dequant = per_group_quantize(weight, group_size=128, num_bits=4)
error = (weight - weight_dequant).abs().mean()
print(f"Per-group INT4 quantization mean error: {error:.6f}")
print(f"Compression ratio: {weight.element_size() * weight.numel() / (q.numel() / 2 + scales.numel() * 4):.2f}x")
2. Post-Training Quantization (PTQ)
PTQ quantizes an already-trained model without any retraining. Its practicality makes it the most widely used approach.
2.1 Calibration Dataset
PTQ uses a small calibration dataset to determine appropriate scale/zero_point values.
from transformers import AutoModelForCausalLM, AutoTokenizer
import torch
from datasets import load_dataset
def collect_calibration_data(model_name: str, num_samples: int = 128):
    """Collect calibration data."""
    tokenizer = AutoTokenizer.from_pretrained(model_name)
    # WikiText-2 or C4 is the usual choice
    dataset = load_dataset("wikitext", "wikitext-2-raw-v1", split="train")
    texts = []
    for item in dataset:
        if len(item['text'].strip()) > 100:
            texts.append(item['text'].strip())
        if len(texts) >= num_samples:
            break
    # Tokenize
    encoded = [
        tokenizer(text, return_tensors="pt", max_length=2048, truncation=True)
        for text in texts
    ]
    return encoded
# Collect activation statistics with the calibration data
def collect_activation_stats(model, calibration_data, layer_name: str):
    """Collect activation statistics for a specific layer."""
    stats = {"min": float("inf"), "max": float("-inf"), "histogram": []}
    def hook_fn(module, input, output):
        with torch.no_grad():
            act = output.detach().float()
            stats["min"] = min(stats["min"], act.min().item())
            stats["max"] = max(stats["max"], act.max().item())
    # Register the hook
    target_layer = dict(model.named_modules())[layer_name]
    handle = target_layer.register_forward_hook(hook_fn)
    # Run the calibration data
    model.eval()
    with torch.no_grad():
        for batch in calibration_data[:32]:
            model(**batch)
    handle.remove()
    return stats
2.2 Min-Max Calibration
The simplest method: use the global minimum and maximum over the calibration data.
class MinMaxCalibrator:
    """Min-max calibrator."""
def __init__(self):
self.min_val = float("inf")
self.max_val = float("-inf")
def update(self, tensor: torch.Tensor):
self.min_val = min(self.min_val, tensor.min().item())
self.max_val = max(self.max_val, tensor.max().item())
def compute_scale_zp(self, num_bits: int = 8, symmetric: bool = True):
q_max = 2 ** (num_bits - 1) - 1 if symmetric else 2 ** num_bits - 1
if symmetric:
max_abs = max(abs(self.min_val), abs(self.max_val))
scale = max_abs / q_max
zero_point = 0
else:
scale = (self.max_val - self.min_val) / q_max
zero_point = -round(self.min_val / scale)
return scale, zero_point
2.3 Histogram Calibration
To reduce the influence of outliers, this approach searches for an optimal range based on a histogram of the distribution.
import numpy as np
from scipy import stats
class HistogramCalibrator:
    """Histogram-based calibrator (minimizes the KL divergence)."""
def __init__(self, num_bins: int = 2048):
self.num_bins = num_bins
self.histogram = None
self.bin_edges = None
def update(self, tensor: torch.Tensor):
data = tensor.detach().float().numpy().flatten()
if self.histogram is None:
self.histogram, self.bin_edges = np.histogram(data, bins=self.num_bins)
else:
new_hist, _ = np.histogram(data, bins=self.bin_edges)
self.histogram += new_hist
def compute_optimal_range(self, num_bits: int = 8):
        """Search for the range that minimizes the KL divergence."""
num_quantized_bins = 2 ** num_bits - 1
best_kl = float("inf")
best_threshold = None
        # Search over candidate thresholds
for i in range(num_quantized_bins, len(self.histogram)):
            # Compress the histogram into num_quantized_bins bins
reference = self.histogram[:i].copy().astype(float)
reference /= reference.sum()
            # Approximate KL divergence
quantized = np.zeros(i)
bin_size = i / num_quantized_bins
for j in range(num_quantized_bins):
start = int(j * bin_size)
end = int((j + 1) * bin_size)
quantized[start:end] = reference[start:end].sum() / (end - start)
            # Avoid empty bins
quantized = np.where(quantized == 0, 1e-10, quantized)
reference_clipped = np.where(reference == 0, 1e-10, reference)
kl = stats.entropy(reference_clipped, quantized)
if kl < best_kl:
best_kl = kl
best_threshold = self.bin_edges[i]
return -best_threshold, best_threshold
2.4 Impact on Perplexity
The most common metric for measuring quantization quality is perplexity (PPL).
import torch
import math
from transformers import AutoModelForCausalLM, AutoTokenizer
def compute_perplexity(model, tokenizer, text: str, device: str = "cuda"):
    """Compute perplexity."""
encodings = tokenizer(text, return_tensors="pt")
input_ids = encodings.input_ids.to(device)
max_length = 1024
stride = 512
nlls = []
prev_end_loc = 0
for begin_loc in range(0, input_ids.size(1), stride):
end_loc = min(begin_loc + max_length, input_ids.size(1))
trg_len = end_loc - prev_end_loc
input_ids_chunk = input_ids[:, begin_loc:end_loc]
target_ids = input_ids_chunk.clone()
target_ids[:, :-trg_len] = -100
with torch.no_grad():
outputs = model(input_ids_chunk, labels=target_ids)
neg_log_likelihood = outputs.loss
nlls.append(neg_log_likelihood)
prev_end_loc = end_loc
if end_loc == input_ids.size(1):
break
ppl = torch.exp(torch.stack(nlls).mean())
return ppl.item()
# Example PPL comparison by method
# FP16: PPL ≈ 5.68
# INT8: PPL ≈ 5.71 (about a 0.5% increase)
# INT4 (GPTQ): PPL ≈ 5.89 (about a 3.7% increase)
# INT4 (naive): PPL ≈ 6.52 (about a 14.8% increase)
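The relative increases quoted in those comments are simply (PPL_quant - PPL_base) / PPL_base; a one-liner reproduces them:

```python
def ppl_increase_pct(baseline: float, quantized: float) -> float:
    """Relative perplexity increase over the FP16 baseline, in percent."""
    return (quantized - baseline) / baseline * 100

print(f"INT8: {ppl_increase_pct(5.68, 5.71):.1f}%")        # 0.5%
print(f"INT4 GPTQ: {ppl_increase_pct(5.68, 5.89):.1f}%")   # 3.7%
print(f"INT4 naive: {ppl_increase_pct(5.68, 6.52):.1f}%")  # 14.8%
```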
3. Quantization-Aware Training (QAT)
QAT simulates quantization during training so the model learns to adapt to quantization noise.
3.1 Fake Quantization
Instead of real INT8 arithmetic, the effect of quantization is simulated in FP32.
import torch
import torch.nn as nn
import torch.nn.functional as F
class FakeQuantize(nn.Module):
    """Fake-quantization module."""
def __init__(self, num_bits: int = 8, symmetric: bool = True):
super().__init__()
self.num_bits = num_bits
self.symmetric = symmetric
self.register_buffer('scale', torch.tensor(1.0))
self.register_buffer('zero_point', torch.tensor(0))
self.register_buffer('fake_quant_enabled', torch.tensor(1))
if symmetric:
self.q_min = -(2 ** (num_bits - 1))
self.q_max = 2 ** (num_bits - 1) - 1
else:
self.q_min = 0
self.q_max = 2 ** num_bits - 1
def forward(self, x: torch.Tensor) -> torch.Tensor:
if self.fake_quant_enabled[0] == 0:
return x
        # Update the scale (moving average)
if self.training:
with torch.no_grad():
if self.symmetric:
max_abs = x.abs().max()
new_scale = max_abs / self.q_max
else:
new_scale = (x.max() - x.min()) / (self.q_max - self.q_min)
                # Update the scale with an exponential moving average
self.scale.copy_(0.9 * self.scale + 0.1 * new_scale)
        # Fake quantize: quantize, then dequantize
x_scaled = x / self.scale
x_clipped = torch.clamp(x_scaled, self.q_min, self.q_max)
x_rounded = torch.round(x_clipped)
x_dequant = x_rounded * self.scale
return x_dequant
3.2 STE (Straight-Through Estimator)
class STERound(torch.autograd.Function):
"""Straight-Through Estimator for round()"""
@staticmethod
def forward(ctx, x):
return torch.round(x)
@staticmethod
def backward(ctx, grad_output):
        # Pass the gradient through round() unchanged (identity approximation)
return grad_output
class STEClamp(torch.autograd.Function):
"""Straight-Through Estimator for clamp()"""
@staticmethod
def forward(ctx, x, min_val, max_val):
ctx.save_for_backward(x)
ctx.min_val = min_val
ctx.max_val = max_val
return torch.clamp(x, min_val, max_val)
@staticmethod
def backward(ctx, grad_output):
x, = ctx.saved_tensors
        # Pass gradients only within the clamp range
grad = grad_output * ((x >= ctx.min_val) & (x <= ctx.max_val)).float()
return grad, None, None
class QATLinear(nn.Module):
    """Linear layer with QAT applied."""
def __init__(self, in_features, out_features, num_bits=8):
super().__init__()
self.linear = nn.Linear(in_features, out_features)
self.weight_fake_quant = FakeQuantize(num_bits=num_bits)
self.act_fake_quant = FakeQuantize(num_bits=num_bits, symmetric=False)
def forward(self, x):
        # Quantize activations
x_q = self.act_fake_quant(x)
        # Quantize weights
w_q = self.weight_fake_quant(self.linear.weight)
        # FP32 compute here (INT8 at deployment time)
return F.linear(x_q, w_q, self.linear.bias)
3.3 When Do You Need QAT?
- When PTQ loses too much quality: especially effective for small models (e.g., BERT-small)
- When quantizing to INT4 or below: essential for preserving quality under extreme compression
- For precision-sensitive tasks: object detection, ASR, and the like
# QAT training workflow
import torch.optim as optim
from torch.quantization import prepare_qat, convert
def train_qat_model(model, train_loader, num_epochs=10):
    """QAT training example."""
    # Prepare for QAT
model.qconfig = torch.quantization.get_default_qat_qconfig('fbgemm')
model_prepared = prepare_qat(model.train())
optimizer = optim.Adam(model_prepared.parameters(), lr=1e-5)
for epoch in range(num_epochs):
for batch in train_loader:
inputs, labels = batch
outputs = model_prepared(inputs)
loss = F.cross_entropy(outputs, labels)
optimizer.zero_grad()
loss.backward()
optimizer.step()
    # Convert to an INT8 model
model_prepared.eval()
model_quantized = convert(model_prepared)
return model_quantized
4. PyTorch Quantization APIs
4.1 torch.ao.quantization
PyTorch's official quantization API.
import torch
import torch.nn as nn
from torch.ao.quantization import (
    get_default_qconfig,
    get_default_qat_qconfig,
    prepare,
    prepare_qat,
    convert
)
# Static quantization (PTQ)
def static_quantization_example():
    """Static quantization example."""
    model = MyModel()
    model.eval()
    # Backend selection (fbgemm: x86, qnnpack: ARM)
    model.qconfig = get_default_qconfig('fbgemm')
    # Prepare for calibration
    model_prepared = prepare(model)
    # Collect statistics with calibration data
    with torch.no_grad():
        for data in calibration_loader:
            model_prepared(data)
    # Convert to an INT8 model
    model_quantized = convert(model_prepared)
    return model_quantized
# Dynamic quantization (effective for LSTM and Linear layers)
def dynamic_quantization_example():
    """Dynamic quantization example."""
    model = MyModel()
    model_quantized = torch.quantization.quantize_dynamic(
        model,
        {nn.Linear, nn.LSTM},  # layer types to quantize
        dtype=torch.qint8
    )
    return model_quantized
4.2 FX Graph Mode Quantization
A more flexible and powerful quantization path.
from torch.ao.quantization.quantize_fx import prepare_fx, convert_fx
from torch.ao.quantization import QConfigMapping, get_default_qconfig
def fx_quantization_example(model, calibration_data):
    """FX graph mode quantization."""
    model.eval()
    # QConfig setup
    qconfig_mapping = QConfigMapping().set_global(
        get_default_qconfig('fbgemm')
    )
    # Example inputs
    example_inputs = (torch.randn(1, 3, 224, 224),)
    # FX graph-based preparation
    model_prepared = prepare_fx(
        model,
        qconfig_mapping,
        example_inputs
    )
    # Calibrate
    with torch.no_grad():
        for batch in calibration_data:
            model_prepared(batch)
    # Convert
    model_quantized = convert_fx(model_prepared)
    return model_quantized
5. GPTQ: Accurate Post-Training Quantization
GPTQ, published in 2022, is an LLM-focused quantization algorithm that keeps quality loss minimal even at INT4. (arXiv:2210.17323)
5.1 How the GPTQ Algorithm Works
GPTQ builds on OBQ (Optimal Brain Quantization). The core idea is to quantize each layer's weights sequentially while compensating the remaining weights for the error introduced by the weights already quantized.
The OBQ error-minimization objective:
argmin_Q ||WX - QX||_F^2
where W is the original weight matrix, Q the quantized weights, and X the input activations.
Hessian-based weight updates:
After each weight is quantized, the resulting error is propagated to the remaining weights using the inverse Hessian H^(-1).
# Core GPTQ algorithm (simplified)
import torch
import math
def gptq_quantize_weight(weight: torch.Tensor,
hessian: torch.Tensor,
num_bits: int = 4,
group_size: int = 128,
damp_percent: float = 0.01):
    """
    Quantize a weight matrix with the GPTQ algorithm.
    Args:
        weight: [out_features, in_features] weight matrix
        hessian: [in_features, in_features] Hessian matrix (H = 2 * X @ X.T)
        num_bits: number of quantization bits
        group_size: group size
        damp_percent: damping ratio for Hessian stabilization
    """
W = weight.clone().float()
n_rows, n_cols = W.shape
    # Hessian damping (numerical stability)
H = hessian.clone().float()
dead_cols = torch.diag(H) == 0
H[dead_cols, dead_cols] = 1
W[:, dead_cols] = 0
damp = damp_percent * H.diag().mean()
H.diagonal().add_(damp)
    # Inverse Hessian (via Cholesky decomposition)
H_inv = torch.linalg.cholesky(H)
H_inv = torch.cholesky_inverse(H_inv)
H_inv = torch.linalg.cholesky(H_inv, upper=True)
Q = torch.zeros_like(W)
Losses = torch.zeros_like(W)
q_max = 2 ** (num_bits - 1) - 1
for col_idx in range(n_cols):
        w_col = W[:, col_idx]  # weights of the current column
        h_inv_diag = H_inv[col_idx, col_idx]  # diagonal entry of the inverse Hessian
        # Per-group scale computation
if group_size != -1 and col_idx % group_size == 0:
group_end = min(col_idx + group_size, n_cols)
w_group = W[:, col_idx:group_end]
max_abs = w_group.abs().max(dim=1)[0].unsqueeze(1)
scale = max_abs / q_max
scale = torch.clamp(scale, min=1e-8)
        # Quantize
q_col = torch.clamp(torch.round(w_col / scale.squeeze()), -q_max, q_max)
q_col = q_col * scale.squeeze()
Q[:, col_idx] = q_col
        # Quantization error
err = (w_col - q_col) / h_inv_diag
Losses[:, col_idx] = err ** 2 / 2
        # Propagate the error to the remaining weights (the key step!)
W[:, col_idx + 1:] -= err.unsqueeze(1) * H_inv[col_idx, col_idx + 1:].unsqueeze(0)
return Q, Losses
def collect_hessian(model_layer, calibration_data, device='cuda'):
    """Accumulate Hessians from the calibration data."""
hessians = {}
def make_hook(name):
def hook(module, input, output):
inp = input[0].detach().float()
if inp.dim() == 3:
inp = inp.reshape(-1, inp.size(-1))
if name not in hessians:
hessians[name] = torch.zeros(inp.size(1), inp.size(1), device=device)
hessians[name] += 2 * inp.T @ inp
return hook
handles = []
for name, module in model_layer.named_modules():
if isinstance(module, torch.nn.Linear):
handles.append(module.register_forward_hook(make_hook(name)))
with torch.no_grad():
for batch in calibration_data:
model_layer(batch.to(device))
for h in handles:
h.remove()
return hessians
5.2 Using AutoGPTQ
For practical GPTQ quantization, use the AutoGPTQ library.
from auto_gptq import AutoGPTQForCausalLM, BaseQuantizeConfig
from transformers import AutoTokenizer
import torch
def quantize_with_gptq(
model_name: str,
output_dir: str,
bits: int = 4,
group_size: int = 128
):
    """Quantize a model with AutoGPTQ."""
    tokenizer = AutoTokenizer.from_pretrained(model_name, use_fast=True)
    # Quantization config
    quantize_config = BaseQuantizeConfig(
        bits=bits,  # 4 or 8
        group_size=group_size,  # 128 recommended
        damp_percent=0.01,  # Hessian damping
        desc_act=False,  # activation reordering (better quality, slower)
        sym=True,  # symmetric quantization
        true_sequential=True  # quantize layers sequentially
    )
    # Load the model
model = AutoGPTQForCausalLM.from_pretrained(
model_name,
quantize_config=quantize_config
)
    # Prepare calibration data
from datasets import load_dataset
dataset = load_dataset("wikitext", "wikitext-2-raw-v1", split="train")
calibration_data = []
for text in dataset["text"][:128]:
if len(text.strip()) > 50:
encoded = tokenizer(
text.strip(),
return_tensors="pt",
max_length=2048,
truncation=True
)
calibration_data.append(encoded["input_ids"].squeeze())
    # Run GPTQ quantization
    print(f"Starting GPTQ {bits}-bit quantization...")
model.quantize(calibration_data)
    # Save
model.save_quantized(output_dir, use_safetensors=True)
tokenizer.save_pretrained(output_dir)
    print(f"Quantization finished: {output_dir}")
return model, tokenizer
def load_gptq_model(model_dir: str, device: str = "cuda"):
    """Load a GPTQ-quantized model."""
model = AutoGPTQForCausalLM.from_quantized(
model_dir,
device=device,
        use_triton=False,  # whether to use Triton kernels
        disable_exllama=False,  # use the ExLlama kernels (faster)
inject_fused_attention=True,
inject_fused_mlp=True
)
tokenizer = AutoTokenizer.from_pretrained(model_dir)
return model, tokenizer
# Usage
# model, tokenizer = quantize_with_gptq("meta-llama/Llama-2-7b-hf", "./llama2-7b-gptq-4bit")
# model, tokenizer = load_gptq_model("./llama2-7b-gptq-4bit")
6. AWQ: Activation-aware Weight Quantization
AWQ, published in 2023, analyzes activation distributions to protect the most important weight channels. (arXiv:2306.00978)
6.1 Differences from GPTQ
| Aspect | GPTQ | AWQ |
|---|---|---|
| Approach | Hessian-based error compensation | Activation-based scaling |
| Calibration data | Required (128+ samples) | Required (32+ samples) |
| Speed | Slow (1-4 hours) | Fast (tens of minutes) |
| Quality | Excellent | Excellent (similar or better) |
| Distinctive feature | Per-channel optimization | Handles activation outliers |
6.2 The Core Idea of AWQ
LLM weights contain a small set of salient channels. Their activations are large, and quantization error on these channels hurts overall performance disproportionately. AWQ multiplies the weights of the salient channels by a scale factor so that their relative quantization error shrinks.
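The effect is easy to reproduce in a toy NumPy sketch (an illustration of the idea, not the actual AWQ scale search): input channel 0 carries large activations but small weights, so scaling its weights up by s while scaling its activations down by 1/s leaves the full-precision product unchanged yet shrinks that channel's quantization error.

```python
import numpy as np

def quant_dequant(w: np.ndarray, num_bits: int = 4) -> np.ndarray:
    """Per-tensor symmetric round-trip quantization."""
    q_max = 2 ** (num_bits - 1) - 1
    scale = np.abs(w).max() / q_max
    return np.clip(np.round(w / scale), -q_max, q_max) * scale

rng = np.random.default_rng(0)
W = rng.standard_normal((64, 64)) * 0.02
W[0] *= 0.2                       # salient input channel: small weights...
X = rng.standard_normal((16, 64))
X[:, 0] *= 50.0                   # ...but very large activations

s = np.ones(64)
s[0] = 4.0                        # per-channel scale for the salient channel

Y_ref = X @ W                                     # full-precision reference
Y_naive = X @ quant_dequant(W)                    # plain INT4 round trip
Y_awq = (X / s) @ quant_dequant(W * s[:, None])   # AWQ-style: (X/s) @ (s*W)

err_naive = np.abs(Y_ref - Y_naive).mean()
err_awq = np.abs(Y_ref - Y_awq).mean()
print(f"naive: {err_naive:.4f}, awq-style: {err_awq:.4f}")
```

Because the scaled-up channel still stays below the tensor's maximum, the quantization step is unchanged while the salient channel's effective error drops by a factor of s.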
from awq import AutoAWQForCausalLM
from transformers import AutoTokenizer
def quantize_with_awq(
model_name: str,
output_dir: str,
bits: int = 4,
group_size: int = 128
):
    """Quantize a model with AutoAWQ."""
tokenizer = AutoTokenizer.from_pretrained(
model_name,
trust_remote_code=True
)
model = AutoAWQForCausalLM.from_pretrained(
model_name,
low_cpu_mem_usage=True,
use_cache=False
)
    # AWQ quantization config
    quant_config = {
        "zero_point": True,  # asymmetric quantization
        "q_group_size": group_size,
        "w_bit": bits,
        "version": "GEMM"  # GEMM, or GEMV (optimized for small batches)
    }
    # Run quantization
    print(f"Starting AWQ {bits}-bit quantization...")
model.quantize(tokenizer, quant_config=quant_config)
    # Save
    model.save_quantized(output_dir)
    tokenizer.save_pretrained(output_dir)
    print(f"AWQ quantization finished: {output_dir}")
return model
def load_awq_model(model_dir: str, device: str = "cuda"):
    """Load an AWQ-quantized model."""
model = AutoAWQForCausalLM.from_quantized(
model_dir,
        fuse_layers=True,  # fuse layers for speed
trust_remote_code=True,
safetensors=True
)
tokenizer = AutoTokenizer.from_pretrained(model_dir)
return model, tokenizer
# Integration with Hugging Face transformers
from transformers import AutoModelForCausalLM
def load_awq_with_transformers(model_dir: str):
    """Load an AWQ model with transformers."""
model = AutoModelForCausalLM.from_pretrained(
model_dir,
device_map="auto"
)
tokenizer = AutoTokenizer.from_pretrained(model_dir)
return model, tokenizer
7. GGUF/GGML: The llama.cpp Ecosystem
GGUF (GPT-Generated Unified Format) is the model format of the llama.cpp project, enabling efficient LLM inference even on CPUs.
7.1 Understanding the GGUF Format
GGUF was introduced in 2023 as the successor to the GGML format. It bundles model metadata, hyperparameters, and tokenizer information into a single file.
GGUF file layout:
┌─────────────────────────────────┐
│ magic number (GGUF)             │
│ version                         │
│ tensor count                    │
│ metadata KV pairs               │
│  - model architecture           │
│  - context length               │
│  - attention head count         │
│  - embedding dimension          │
├─────────────────────────────────┤
│ tensor info (name, type, shape) │
├─────────────────────────────────┤
│ tensor data                     │
└─────────────────────────────────┘
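The fixed-size front of that layout can be read with the struct module. A minimal parser sketch, assuming the GGUF v3 little-endian header (4-byte magic "GGUF", uint32 version, uint64 tensor count, uint64 metadata KV count) and run here against a synthetic buffer rather than a real model file:

```python
import struct

def parse_gguf_header(data: bytes) -> dict:
    """Parse the fixed-size portion of a GGUF header (little-endian)."""
    magic, version, n_tensors, n_kv = struct.unpack_from("<4sIQQ", data, 0)
    if magic != b"GGUF":
        raise ValueError(f"not a GGUF file: magic={magic!r}")
    return {"version": version, "tensor_count": n_tensors, "metadata_kv_count": n_kv}

# Synthetic header: version 3, 291 tensors, 24 metadata key-value pairs
header = struct.pack("<4sIQQ", b"GGUF", 3, 291, 24)
print(parse_gguf_header(header))
```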
7.2 Comparing Quantization Levels
| Format | Bits | Memory (7B) | PPL increase | Recommended use |
|---|---|---|---|---|
| Q2_K | 2.6 | 2.8 GB | High | Extreme compression |
| Q3_K_S | 3.0 | 3.3 GB | Moderate | Memory savings |
| Q4_0 | 4.0 | 3.8 GB | Low | Balanced |
| Q4_K_M | 4.1 | 4.1 GB | Very low | General recommendation |
| Q5_0 | 5.0 | 4.7 GB | Minimal | High quality |
| Q5_K_M | 5.1 | 4.8 GB | Minimal | High-quality recommendation |
| Q6_K | 6.0 | 5.5 GB | Nearly none | Close to FP16 |
| Q8_0 | 8.0 | 7.2 GB | None | Reference |
| F16 | 16.0 | 13.5 GB | None | Baseline |
K-quants (Q4_K_M, Q5_K_M, etc.) keep parts of some layers at higher precision to improve quality.
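The memory column is, to a first approximation, params × bits-per-weight / 8; real GGUF files come out a bit different because of per-block scales, metadata, and the exact parameter count (a "7B" Llama actually has about 6.74B parameters). A rough estimator:

```python
def weights_size_gb(num_params: float, bits_per_weight: float) -> float:
    """Approximate size of the quantized weights alone, in GB."""
    return num_params * bits_per_weight / 8 / 1e9

for fmt, bpw in [("Q4_K_M", 4.1), ("Q8_0", 8.0), ("F16", 16.0)]:
    print(f"{fmt}: {weights_size_gb(7e9, bpw):.1f} GB")
```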
7.3 Building and Using llama.cpp
# Clone and build llama.cpp
git clone https://github.com/ggerganov/llama.cpp
cd llama.cpp
# Build with CUDA support
cmake -B build -DGGML_CUDA=ON
cmake --build build --config Release -j $(nproc)
# CPU-only build
cmake -B build
cmake --build build --config Release -j $(nproc)
# Convert a HuggingFace model to GGUF (the script takes a local model directory as a positional argument)
python convert_hf_to_gguf.py ./Llama-2-7b-hf \
--outfile llama2-7b-f16.gguf \
--outtype f16
# Quantize the GGUF file (Q4_K_M)
./build/bin/llama-quantize \
llama2-7b-f16.gguf \
llama2-7b-q4_k_m.gguf \
Q4_K_M
# Run inference
./build/bin/llama-cli \
-m llama2-7b-q4_k_m.gguf \
-p "The future of AI is" \
-n 100 \
--ctx-size 4096 \
--threads 8 \
--n-gpu-layers 35
7.4 Python Bindings (llama-cpp-python)
from llama_cpp import Llama
# Load the model
llm = Llama(
    model_path="./llama2-7b-q4_k_m.gguf",
    n_ctx=4096,  # context length
    n_gpu_layers=35,  # number of layers to offload to the GPU (-1 for all)
    n_threads=8,  # number of CPU threads
verbose=False
)
# Text generation
output = llm(
"Once upon a time",
max_tokens=200,
temperature=0.7,
top_p=0.9,
stop=["</s>", "\n\n"]
)
print(output["choices"][0]["text"])
# Chat-completion format
response = llm.create_chat_completion(
messages=[
{"role": "system", "content": "You are a helpful assistant."},
{"role": "user", "content": "What is machine learning?"}
],
max_tokens=500,
temperature=0.7
)
print(response["choices"][0]["message"]["content"])
# Streaming output
for chunk in llm.create_chat_completion(
messages=[{"role": "user", "content": "Tell me a joke"}],
stream=True
):
delta = chunk["choices"][0].get("delta", {})
if "content" in delta:
print(delta["content"], end="", flush=True)
8. bitsandbytes: An LLM Quantization Library
bitsandbytes, developed by Tim Dettmers, integrates tightly with HuggingFace transformers.
8.1 LLM.int8(): 8-bit Mixed Precision
LLM.int8() routes activation outliers through FP16 during matrix multiplication while using INT8 for everything else.
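A NumPy sketch of that decomposition (an illustration of the idea, not the actual bitsandbytes kernels): feature dimensions whose activation magnitude exceeds a threshold take the floating-point path, and everything else goes through INT8 with row-wise and column-wise scales.

```python
import numpy as np

def quantize_rows_int8(a: np.ndarray):
    """Row-wise symmetric INT8 quantization."""
    scale = np.abs(a).max(axis=1, keepdims=True) / 127.0
    q = np.clip(np.round(a / scale), -127, 127).astype(np.int8)
    return q, scale

def llm_int8_matmul(X: np.ndarray, W: np.ndarray, threshold: float = 6.0) -> np.ndarray:
    """LLM.int8()-style mixed-precision matmul: outlier dims in FP, the rest in INT8."""
    outlier = np.abs(X).max(axis=0) > threshold  # feature dims containing outliers
    # INT8 path: per-row scales for X, per-output-column scales for W
    qx, sx = quantize_rows_int8(X[:, ~outlier])
    qw, sw = quantize_rows_int8(W[~outlier].T)
    y_int8 = (qx.astype(np.int32) @ qw.astype(np.int32).T) * (sx * sw.T)
    # FP path for the outlier dims only
    y_fp = X[:, outlier] @ W[outlier]
    return y_int8 + y_fp

rng = np.random.default_rng(1)
X = rng.standard_normal((4, 64))
X[:, 0] *= 20.0
X[0, 0] = 40.0                    # guarantee an outlier in feature dim 0
W = rng.standard_normal((64, 32)) * 0.1
Y = llm_int8_matmul(X, W)
print(f"mean abs error vs FP: {np.abs(Y - X @ W).mean():.4f}")
```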
from transformers import AutoModelForCausalLM, AutoTokenizer
import torch
# Load the model in INT8
model_8bit = AutoModelForCausalLM.from_pretrained(
"meta-llama/Llama-2-7b-hf",
load_in_8bit=True,
device_map="auto"
)
# Check memory usage
def print_model_size(model, label):
    """Print a model's memory usage."""
total_params = sum(p.numel() for p in model.parameters())
total_bytes = sum(
p.numel() * p.element_size() for p in model.parameters()
)
print(f"{label}: {total_params/1e9:.2f}B params, {total_bytes/1e9:.2f} GB")
print_model_size(model_8bit, "INT8 model")
# INT8 model: 6.74B params, ~7.0 GB
8.2 4-bit Quantization (as Used by QLoRA)
import bitsandbytes as bnb
from transformers import BitsAndBytesConfig
# NF4 quantization config (QLoRA)
bnb_config_nf4 = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",  # NF4 or FP4
    bnb_4bit_compute_dtype=torch.bfloat16,  # dtype used for compute
    bnb_4bit_use_double_quant=True,  # double quantization (quantize the quantization constants too)
)
# FP4 quantization config
bnb_config_fp4 = BitsAndBytesConfig(
load_in_4bit=True,
bnb_4bit_quant_type="fp4",
bnb_4bit_compute_dtype=torch.float16,
)
# Load the model
model_4bit = AutoModelForCausalLM.from_pretrained(
"meta-llama/Llama-2-7b-hf",
quantization_config=bnb_config_nf4,
device_map="auto"
)
print_model_size(model_4bit, "NF4 model")
# NF4 model: 6.74B params, ~4.0 GB (with double quantization)
# QLoRA fine-tuning setup
from peft import LoraConfig, get_peft_model, prepare_model_for_kbit_training
model_4bit = prepare_model_for_kbit_training(model_4bit)
lora_config = LoraConfig(
r=64,
lora_alpha=16,
target_modules=["q_proj", "v_proj"],
lora_dropout=0.05,
bias="none",
task_type="CAUSAL_LM"
)
model_lora = get_peft_model(model_4bit, lora_config)
model_lora.print_trainable_parameters()
# trainable params: 4,194,304 || all params: 3,504,607,232 || trainable%: 0.1197
8.3 NF4 vs. FP4
NF4 (Normal Float 4)
- Non-linear 4-bit quantization that assumes a normal distribution
- Exploits the fact that weight distributions are close to Gaussian
- Better representational power at the same bit width
FP4 (Float 4)
- Floating-point-based 4 bits
- Can represent a wider range
import numpy as np
from scipy import stats
# NF4 quantization points
def get_nf4_quantization_points():
    """The 16 NF4 quantization points (simplified construction)."""
    # Quantiles of the standard normal distribution
    nf4_points = []
    for i in range(16):
        quantile = (i + 0.5) / 16
        nf4_points.append(stats.norm.ppf(quantile))
    # Normalize to [-1, 1]
    max_val = max(abs(p) for p in nf4_points)
    nf4_points = [p / max_val for p in nf4_points]
    return nf4_points
# Actual bitsandbytes NF4 levels for reference (the construction above is an approximation):
# [-1.0, -0.6961, -0.5250, -0.3949, -0.2844, -0.1848, -0.0911, 0.0000,
#  0.0796, 0.1609, 0.2461, 0.3379, 0.4407, 0.5626, 0.7230, 1.0]
9. SmoothQuant: W8A8 Quantization
SmoothQuant quantizes both the weights (W) and the activations (A) to INT8 for faster inference.
9.1 The Activation-Outlier Problem
LLM activations develop very large values (outliers) in particular channels, which makes W8A8 quantization difficult.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer
def analyze_activation_outliers(model, tokenizer, text: str, threshold: float = 100.0):
    """Analyze activation outliers."""
activations = {}
def make_hook(name):
def hook(module, input, output):
act = output.detach().float()
max_val = act.abs().max().item()
outlier_ratio = (act.abs() > threshold).float().mean().item()
activations[name] = {
"max": max_val,
"outlier_ratio": outlier_ratio,
"std": act.std().item()
}
return hook
handles = []
for name, module in model.named_modules():
if isinstance(module, torch.nn.Linear):
handles.append(module.register_forward_hook(make_hook(name)))
input_ids = tokenizer(text, return_tensors="pt").input_ids.cuda()
with torch.no_grad():
model(input_ids)
for h in handles:
h.remove()
    # Sort layers by outlier magnitude
sorted_acts = sorted(
activations.items(),
key=lambda x: x[1]["max"],
reverse=True
)
    print("Top 10 layers with the largest outliers:")
for name, stats in sorted_acts[:10]:
print(f" {name}: max={stats['max']:.1f}, outlier_ratio={stats['outlier_ratio']:.3%}")
return activations
9.2 Migration Scaling
The core of SmoothQuant: migrate the quantization difficulty from the activations into the weights.
Y = (X * diag(s)^(-1)) * (diag(s) * W)
= X_smooth * W_smooth
def smooth_quantize(
model,
calibration_samples,
alpha: float = 0.5
):
    """
    Apply SmoothQuant.
    Args:
        alpha: migration strength (0 = all to weights, 1 = all to activations)
            Recommended: 0.5 (even split)
    """
    # Collect activation statistics
act_scales = {}
def collect_scales(name):
def hook(module, input, output):
inp = input[0].detach()
if inp.dim() == 3:
inp = inp.reshape(-1, inp.size(-1))
channel_max = inp.abs().max(dim=0)[0]
if name not in act_scales:
act_scales[name] = channel_max
else:
act_scales[name] = torch.maximum(act_scales[name], channel_max)
return hook
handles = []
for name, module in model.named_modules():
if isinstance(module, torch.nn.Linear):
handles.append(module.register_forward_hook(collect_scales(name)))
with torch.no_grad():
for sample in calibration_samples:
model(**sample)
for h in handles:
h.remove()
    # Compute and apply the scales
for name, module in model.named_modules():
if isinstance(module, torch.nn.Linear) and name in act_scales:
act_scale = act_scales[name]
weight_scale = module.weight.abs().max(dim=0)[0]
            # Compute the migration scale
smooth_scale = (act_scale ** alpha) / (weight_scale ** (1 - alpha))
smooth_scale = torch.clamp(smooth_scale, min=1e-5)
            # Apply the scale to the weights
module.weight.data = module.weight.data / smooth_scale.unsqueeze(0)
            # Apply the inverse scale to the output of the preceding layer (LayerNorm, etc.)
            # (a real implementation locates and modifies that preceding layer)
return model, act_scales
10. SpQR: Sparse-Quantized Representation
SpQR stores the important weights (outliers) separately in FP16 and quantizes the rest at low precision.
import torch
def spqr_quantize(weight: torch.Tensor,
num_bits: int = 3,
outlier_threshold_percentile: float = 1.0):
    """
    SpQR quantization (simplified).
    Key idea: keep the top p% outliers in FP16 and quantize the rest at a low bit width.
    """
    # Outlier threshold
    threshold = torch.quantile(weight.abs(), 1 - outlier_threshold_percentile / 100)
    # Outlier mask
    outlier_mask = weight.abs() > threshold
    # Store outliers (FP16)
    outlier_values = weight.clone()
    outlier_values[~outlier_mask] = 0
    # Quantize the rest
    regular_weight = weight.clone()
    regular_weight[outlier_mask] = 0
    # Per-group quantization
q_max = 2 ** (num_bits - 1) - 1
group_size = 16
rows, cols = regular_weight.shape
regular_grouped = regular_weight.reshape(-1, group_size)
max_abs = regular_grouped.abs().max(dim=1, keepdim=True)[0]
scales = max_abs / q_max
scales = torch.clamp(scales, min=1e-8)
q = torch.clamp(torch.round(regular_grouped / scales), -q_max, q_max).to(torch.int8)
regular_dequant = (scales * q.float()).reshape(rows, cols)
    # Final reconstruction
reconstructed = regular_dequant + outlier_values
error = (weight - reconstructed).abs().mean().item()
    # Memory accounting
outlier_memory = outlier_mask.sum().item() * 2 # FP16 = 2 bytes
regular_memory = (~outlier_mask).sum().item() * (num_bits / 8)
total_memory = outlier_memory + regular_memory
original_memory = weight.numel() * weight.element_size()
compression_ratio = original_memory / total_memory
    print(f"Outlier ratio: {outlier_mask.float().mean():.2%}")
    print(f"Mean reconstruction error: {error:.6f}")
    print(f"Compression ratio: {compression_ratio:.2f}x")
return q, scales, outlier_values, outlier_mask
11. Quantization Benchmarks
11.1 Comparison on Llama-2-7B
import time
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer
import psutil
import GPUtil
def benchmark_quantization(model, tokenizer, device="cuda", num_runs=50):
    """Benchmark a quantized model."""
prompt = "The history of artificial intelligence began"
inputs = tokenizer(prompt, return_tensors="pt").to(device)
    # Memory usage
if device == "cuda":
torch.cuda.synchronize()
gpu = GPUtil.getGPUs()[0]
memory_used_gb = gpu.memoryUsed / 1024
else:
memory_used_gb = psutil.virtual_memory().used / 1e9
    # Warm-up
with torch.no_grad():
for _ in range(5):
outputs = model.generate(
**inputs,
max_new_tokens=50,
do_sample=False
)
    # Timing
if device == "cuda":
torch.cuda.synchronize()
start = time.time()
with torch.no_grad():
for _ in range(num_runs):
outputs = model.generate(
**inputs,
max_new_tokens=50,
do_sample=False
)
if device == "cuda":
torch.cuda.synchronize()
elapsed = time.time() - start
avg_time = elapsed / num_runs
tokens_per_second = 50 / avg_time
return {
"memory_gb": memory_used_gb,
"avg_time_ms": avg_time * 1000,
"tokens_per_second": tokens_per_second
}
# Example results (A100 80GB, Llama-2-7B)
benchmark_results = {
"FP16": {"memory_gb": 13.5, "tokens_per_second": 52.3, "ppl": 5.68},
"INT8 (bitsandbytes)": {"memory_gb": 7.8, "tokens_per_second": 38.1, "ppl": 5.71},
"INT4 GPTQ": {"memory_gb": 4.5, "tokens_per_second": 65.2, "ppl": 5.89},
"INT4 AWQ": {"memory_gb": 4.3, "tokens_per_second": 68.7, "ppl": 5.86},
"Q4_K_M (GGUF)": {"memory_gb": 4.1, "tokens_per_second": 45.2, "ppl": 5.91}, # CPU
"INT4 NF4": {"memory_gb": 4.0, "tokens_per_second": 31.5, "ppl": 5.94},
}
print("=" * 80)
print(f"{'Method':<25} {'Memory (GB)':<12} {'Tokens/s':<12} {'PPL':<8}")
print("=" * 80)
for method, stats in benchmark_results.items():
print(f"{method:<25} {stats['memory_gb']:<12.1f} {stats['tokens_per_second']:<12.1f} {stats['ppl']:<8.2f}")
12. Practical Guide: Choosing the Right Quantization Method
12.1 Strategy by Model Size
Small models (7B and below):
- GGUF Q4_K_M: best for local CPU execution
- AWQ INT4: recommended for GPU server deployment
- FP16 is also an option (fits comfortably on a 24GB GPU)
Mid-size models (13B-30B):
- GPTQ INT4 or AWQ INT4: runs on a single 24GB GPU
- GGUF Q4_K_M: runs even with 16GB of RAM
Large models (70B and above):
- GPTQ INT4: runs on a single A100 80GB
- GPTQ INT2: when extreme compression is required
- Multi-GPU with tensor parallelism
12.2 Strategy by Task
def recommend_quantization(
task: str,
model_size_b: float,
gpu_memory_gb: float,
cpu_only: bool = False,
fine_tuning_needed: bool = False
):
"""Recommend a quantization method for the given task and environment"""
recommendations = []
if cpu_only:
recommendations.append({
"method": "GGUF Q4_K_M",
"reason": "Optimized for CPU inference, built on llama.cpp",
"library": "llama-cpp-python"
})
return recommendations
if fine_tuning_needed:
recommendations.append({
"method": "bitsandbytes NF4 + QLoRA",
"reason": "Enables fine-tuning; LoRA adapters train with ~4GB of extra memory",
"library": "bitsandbytes + peft"
})
return recommendations
# Memory requirement calculation
fp16_memory = model_size_b * 2 # FP16 = 2 bytes per param
int8_memory = model_size_b * 1 # INT8 = 1 byte per param
int4_memory = model_size_b * 0.5 # INT4 = 0.5 bytes per param
if fp16_memory <= gpu_memory_gb * 0.8:
recommendations.append({
"method": "FP16 (default)",
"reason": "Ample memory headroom, highest quality",
"memory_gb": fp16_memory
})
if int8_memory <= gpu_memory_gb * 0.8:
if task in ["chat", "completion", "summarization"]:
recommendations.append({
"method": "INT8 (bitsandbytes LLM.int8())",
"reason": "Halves memory with minimal quality loss",
"library": "bitsandbytes",
"memory_gb": int8_memory
})
if int4_memory <= gpu_memory_gb * 0.8:
recommendations.append({
"method": "AWQ INT4",
"reason": "Fast inference with strong quality",
"library": "autoawq",
"memory_gb": int4_memory
})
recommendations.append({
"method": "GPTQ INT4",
"reason": "Best INT4 quality, but slow quantization process",
"library": "auto-gptq",
"memory_gb": int4_memory
})
return recommendations
# Usage example
recommendations = recommend_quantization(
task="chat",
model_size_b=7.0,
gpu_memory_gb=16.0,
fine_tuning_needed=False
)
for rec in recommendations:
print(f"\nMethod: {rec['method']}")
print(f"Reason: {rec['reason']}")
if 'library' in rec:
print(f"Library: {rec['library']}")
if 'memory_gb' in rec:
print(f"Estimated memory: {rec['memory_gb']:.1f} GB")
12.3 A Complete Quantization Pipeline
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer
from auto_gptq import AutoGPTQForCausalLM, BaseQuantizeConfig
from awq import AutoAWQForCausalLM
import json
import os
class QuantizationPipeline:
"""Unified quantization pipeline"""
def __init__(self, model_name: str, output_base_dir: str):
self.model_name = model_name
self.output_base_dir = output_base_dir
self.tokenizer = AutoTokenizer.from_pretrained(model_name)
os.makedirs(output_base_dir, exist_ok=True)
def quantize_gptq(self, bits: int = 4, group_size: int = 128):
"""GPTQ quantization"""
output_dir = os.path.join(self.output_base_dir, f"gptq-{bits}bit")
config = BaseQuantizeConfig(
bits=bits,
group_size=group_size,
sym=True,
desc_act=False
)
model = AutoGPTQForCausalLM.from_pretrained(
self.model_name,
quantize_config=config
)
# Calibration data
calibration_data = self._prepare_calibration_data()
model.quantize(calibration_data)
model.save_quantized(output_dir)
self.tokenizer.save_pretrained(output_dir)
print(f"GPTQ {bits}-bit saved to: {output_dir}")
return output_dir
def quantize_awq(self, bits: int = 4, group_size: int = 128):
"""AWQ quantization"""
output_dir = os.path.join(self.output_base_dir, f"awq-{bits}bit")
model = AutoAWQForCausalLM.from_pretrained(
self.model_name,
low_cpu_mem_usage=True
)
quant_config = {
"zero_point": True,
"q_group_size": group_size,
"w_bit": bits,
"version": "GEMM"
}
model.quantize(self.tokenizer, quant_config=quant_config)
model.save_quantized(output_dir)
self.tokenizer.save_pretrained(output_dir)
print(f"AWQ {bits}-bit saved to: {output_dir}")
return output_dir
def _prepare_calibration_data(self, num_samples: int = 128):
"""Prepare calibration data"""
from datasets import load_dataset
dataset = load_dataset("wikitext", "wikitext-2-raw-v1", split="train")
data = []
for text in dataset["text"]:
if len(text.strip()) > 50:
encoded = self.tokenizer(
text.strip(),
return_tensors="pt",
max_length=2048,
truncation=True
)
data.append({
"input_ids": encoded["input_ids"],
"attention_mask": encoded["attention_mask"]
})
if len(data) >= num_samples:
break
return data
def evaluate_all(self, test_text: str = None):
"""Evaluate all quantized models"""
if test_text is None:
from datasets import load_dataset
dataset = load_dataset("wikitext", "wikitext-2-raw-v1", split="test")
test_text = " ".join(dataset["text"][:10])
results = {}
# FP16 baseline
model_fp16 = AutoModelForCausalLM.from_pretrained(
self.model_name,
torch_dtype=torch.float16,
device_map="auto"
)
# Compute PPL for each quantized variant here (omitted in this stub)
# Print per-model evaluation results
print("\n=== Quantization Evaluation Results ===")
print(f"Model: {self.model_name}")
print(f"{'Method':<20} {'PPL':<10} {'Memory (GB)':<12}")
print("-" * 42)
return results
# Run the full pipeline
pipeline = QuantizationPipeline(
model_name="meta-llama/Llama-2-7b-hf",
output_base_dir="./quantized_models"
)
# GPTQ 4-bit quantization
gptq_dir = pipeline.quantize_gptq(bits=4)
# AWQ 4-bit quantization
awq_dir = pipeline.quantize_awq(bits=4)
Conclusion
Model quantization is a key technology for democratizing LLMs. To recap what this guide covered:
- Fundamentals: the math behind compressing FP32 to INT4 (scale, zero_point)
- PTQ vs QAT: PTQ needs no retraining and is the practical choice; QAT is essential for extreme compression
- GPTQ: Hessian-based error compensation for the best INT4 quality
- AWQ: fast, efficient quantization driven by activation distributions
- GGUF: best suited to CPU execution, with a wide range of quality levels
- bitsandbytes: integrated with HuggingFace, essential for QLoRA fine-tuning
Recommended strategies:
- Local execution: GGUF Q4_K_M
- GPU server deployment: AWQ 4-bit
- When quality matters most: GPTQ 4-bit or FP16
- Fine-tuning required: bitsandbytes NF4 + QLoRA
Quantization techniques are evolving rapidly, with 2-bit methods such as QuIP# and AQLM now emerging. The journey toward smaller, faster models continues.
References
- GPTQ: arXiv:2210.17323
- AWQ: arXiv:2306.00978
- llama.cpp: github.com/ggerganov/llama.cpp
- bitsandbytes: github.com/TimDettmers/bitsandbytes
- SmoothQuant: arXiv:2211.10438
- SpQR: arXiv:2306.03078
- PyTorch 양자화: pytorch.org/docs/stable/quantization.html
Deep Learning Model Quantization Complete Guide: Master INT8, INT4, GPTQ, AWQ, GGUF
Introduction
As deep learning models grow increasingly large, inference costs and memory requirements have skyrocketed. GPT-3 has 175B parameters, Llama 3 has 70B, and storing them in full FP32 precision requires 700GB and 280GB of memory respectively — completely infeasible for typical GPUs.
Model Quantization is the key technology to solve this problem. By compressing 32-bit floating-point (FP32) weights to 8-bit or 4-bit integers, it reduces memory by 4–8x and improves inference speed by 2–4x. Remarkably, quality loss is minimal.
In this guide, we thoroughly explore quantization from mathematical foundations to the latest techniques: GPTQ, AWQ, GGUF, and bitsandbytes.
1. Quantization Fundamentals: Understanding Number Representations
1.1 Floating-Point Formats
Understanding the floating-point formats used in modern deep learning is the starting point for quantization.
FP32 (Float32)
- Sign (1 bit) + Exponent (8 bits) + Mantissa (23 bits) = 32 bits total
- Range: approximately -3.4e38 to 3.4e38
- Precision: ~7 decimal digits
FP16 (Float16)
- Sign (1 bit) + Exponent (5 bits) + Mantissa (10 bits) = 16 bits total
- Range: -65504 to 65504 (much narrower than FP32)
- Precision: ~3 decimal digits
- Risk of overflow during training; requires gradient scaling
BF16 (Brain Float16)
- Sign (1 bit) + Exponent (8 bits) + Mantissa (7 bits) = 16 bits total
- Maintains the same exponent range as FP32 while reducing mantissa bits
- No overflow risk, safer for deep learning training
- Developed by Google Brain, natively supported on modern GPUs (A100, H100)
import torch
import numpy as np
# Check memory size of each data type
x_fp32 = torch.tensor([1.5, -2.3, 0.7], dtype=torch.float32)
x_fp16 = torch.tensor([1.5, -2.3, 0.7], dtype=torch.float16)
x_bf16 = torch.tensor([1.5, -2.3, 0.7], dtype=torch.bfloat16)
print(f"FP32: {x_fp32.element_size()} bytes per element") # 4 bytes
print(f"FP16: {x_fp16.element_size()} bytes per element") # 2 bytes
print(f"BF16: {x_bf16.element_size()} bytes per element") # 2 bytes
# Memory calculation for a 7B parameter model
params = 7e9
fp32_memory_gb = params * 4 / 1e9
fp16_memory_gb = params * 2 / 1e9
int8_memory_gb = params * 1 / 1e9
int4_memory_gb = params * 0.5 / 1e9
print(f"\n7B Model Memory Requirements:")
print(f"FP32: {fp32_memory_gb:.1f} GB") # 28.0 GB
print(f"FP16: {fp16_memory_gb:.1f} GB") # 14.0 GB
print(f"INT8: {int8_memory_gb:.1f} GB") # 7.0 GB
print(f"INT4: {int4_memory_gb:.1f} GB") # 3.5 GB
1.2 Integer Representations
The core of quantization is mapping floating-point values to integers.
INT8: -128 to 127 (signed) or 0 to 255 (unsigned)
INT4: -8 to 7 (signed) or 0 to 15 (unsigned)
INT2: -2 to 1 (signed) or 0 to 3 (unsigned)
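These ranges follow directly from the bit width; a quick sketch to derive them for any precision:

```python
def int_range(num_bits: int, signed: bool = True):
    """Representable (min, max) integers for a given bit width."""
    if signed:
        return -(2 ** (num_bits - 1)), 2 ** (num_bits - 1) - 1
    return 0, 2 ** num_bits - 1

for bits in (8, 4, 2):
    lo_s, hi_s = int_range(bits)
    lo_u, hi_u = int_range(bits, signed=False)
    print(f"INT{bits}: {lo_s} to {hi_s} (signed), {lo_u} to {hi_u} (unsigned)")
```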
1.3 Quantization Formula
The fundamental formula to convert a floating-point value x to integer q:
q = clamp(round(x / scale) + zero_point, q_min, q_max)
Dequantization:
x_approx = scale * (q - zero_point)
Where:
- scale: quantization scale factor (scale = (max_val - min_val) / (q_max - q_min))
- zero_point: the offset representing which real value integer 0 corresponds to
- q_min, q_max: integer range bounds (-128, 127 for INT8)
import torch
import numpy as np
def symmetric_quantize(x: torch.Tensor, num_bits: int = 8):
"""Symmetric quantization implementation"""
q_max = 2 ** (num_bits - 1) - 1 # 127 for INT8
q_min = -q_max # -127
# Compute scale
max_abs = x.abs().max()
scale = max_abs / q_max
# Quantize
q = torch.clamp(torch.round(x / scale), q_min, q_max).to(torch.int8)
return q, scale
def asymmetric_quantize(x: torch.Tensor, num_bits: int = 8):
"""Asymmetric quantization implementation"""
q_max = 2 ** num_bits - 1 # 255 for UINT8
q_min = 0
# Compute scale and zero_point
min_val = x.min()
max_val = x.max()
scale = (max_val - min_val) / (q_max - q_min)
zero_point = q_min - torch.round(min_val / scale)
zero_point = torch.clamp(zero_point, q_min, q_max).to(torch.int32)
# Quantize
q = torch.clamp(torch.round(x / scale) + zero_point, q_min, q_max).to(torch.uint8)
return q, scale, zero_point
def dequantize(q: torch.Tensor, scale: torch.Tensor, zero_point: torch.Tensor = None):
"""Dequantization"""
if zero_point is None:
return scale * q.float()
return scale * (q.float() - zero_point.float())
# Test
x = torch.randn(100)
print(f"Original data range: [{x.min():.4f}, {x.max():.4f}]")
# Symmetric quantization
q_sym, scale_sym = symmetric_quantize(x)
x_reconstructed_sym = dequantize(q_sym, scale_sym)
error_sym = (x - x_reconstructed_sym).abs().mean()
print(f"Symmetric quantization mean error: {error_sym:.6f}")
# Asymmetric quantization
q_asym, scale_asym, zp_asym = asymmetric_quantize(x)
x_reconstructed_asym = dequantize(q_asym, scale_asym, zp_asym)
error_asym = (x - x_reconstructed_asym).abs().mean()
print(f"Asymmetric quantization mean error: {error_asym:.6f}")
1.4 Symmetric vs Asymmetric Quantization
Symmetric Quantization
- zero_point = 0
- Symmetric positive/negative range
- Suitable for weights (mostly zero-centered distribution)
- Simpler computation: x_approx = scale * q
Asymmetric Quantization
- zero_point != 0
- Can represent arbitrary ranges
- Suitable for activations (always non-negative after ReLU)
- More complex computation: x_approx = scale * (q - zero_point)
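A quick numeric check of this guidance: the minimal sketch below (NumPy, synthetic data) compares both schemes on ReLU-style non-negative activations, where the asymmetric variant can spend all of its levels on the occupied range while the symmetric grid wastes its negative half.

```python
import numpy as np

def sym_quant_dequant(x, bits=8):
    # Symmetric: zero_point = 0, grid covers [-q_max, q_max]
    q_max = 2 ** (bits - 1) - 1
    scale = np.abs(x).max() / q_max
    return scale * np.clip(np.round(x / scale), -q_max, q_max)

def asym_quant_dequant(x, bits=8):
    # Asymmetric: zero_point shifts the grid onto [min, max]
    q_max = 2 ** bits - 1
    scale = (x.max() - x.min()) / q_max
    zp = np.round(-x.min() / scale)
    q = np.clip(np.round(x / scale) + zp, 0, q_max)
    return scale * (q - zp)

rng = np.random.default_rng(0)
act = np.maximum(rng.standard_normal(10_000), 0.0)  # post-ReLU activations
err_sym = np.abs(act - sym_quant_dequant(act)).mean()
err_asym = np.abs(act - asym_quant_dequant(act)).mean()
print(f"symmetric: {err_sym:.6f}, asymmetric: {err_asym:.6f}")
```

On this data the asymmetric error is roughly half the symmetric one, since the step size shrinks by about 2x when the negative half of the grid is not wasted.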
1.5 Quantization Granularity
Determines how many parameters share a single scale/zero_point.
Per-Tensor: One scale for the entire tensor
- Minimal memory overhead
- Largest precision loss
Per-Channel (Per-Row/Column): Individual scale per channel
- Separate scale for each row/column of the weight matrix
- Effectively handles distribution differences across channels
Per-Group (Per-Block): Individual scale per fixed-size group
- Typical group_size = 128
- Compromise between per-channel and per-tensor
- Commonly used in GPTQ and AWQ
import torch
def per_group_quantize(weight: torch.Tensor, group_size: int = 128, num_bits: int = 4):
"""Per-Group quantization implementation"""
rows, cols = weight.shape
# Split into groups
weight_grouped = weight.reshape(-1, group_size)
# Max/min per group
max_vals = weight_grouped.max(dim=1, keepdim=True)[0]
min_vals = weight_grouped.min(dim=1, keepdim=True)[0]
q_max = 2 ** num_bits - 1 # 15 for INT4
# Compute scales
scales = (max_vals - min_vals) / q_max
zero_points = torch.round(-min_vals / scales)
# Quantize
q = torch.clamp(torch.round(weight_grouped / scales) + zero_points, 0, q_max)
# Dequantize
weight_dequant = scales * (q - zero_points)
weight_dequant = weight_dequant.reshape(rows, cols)
return q, scales, zero_points, weight_dequant
# Example: Transformer weight quantization
weight = torch.randn(4096, 4096) # Llama-style weight
q, scales, zp, weight_dequant = per_group_quantize(weight, group_size=128, num_bits=4)
error = (weight - weight_dequant).abs().mean()
print(f"Per-Group INT4 quantization mean error: {error:.6f}")
print(f"Compression ratio: {weight.element_size() * weight.numel() / (q.numel() / 2 + (scales.numel() + zero_points.numel()) * 4):.2f}x")
2. Post-Training Quantization (PTQ)
PTQ quantizes an already-trained model without retraining — the most practical approach and most widely used.
2.1 Calibration Dataset
PTQ uses a small calibration dataset to determine appropriate scale and zero_point values.
from transformers import AutoModelForCausalLM, AutoTokenizer
import torch
from datasets import load_dataset
def collect_calibration_data(model_name: str, num_samples: int = 128):
"""Collect calibration data"""
tokenizer = AutoTokenizer.from_pretrained(model_name)
# WikiText-2 or C4 dataset is commonly used
dataset = load_dataset("wikitext", "wikitext-2-raw-v1", split="train")
texts = []
for item in dataset:
if len(item['text'].strip()) > 100:
texts.append(item['text'].strip())
if len(texts) >= num_samples:
break
# Tokenize
encoded = [
tokenizer(text, return_tensors="pt", max_length=2048, truncation=True)
for text in texts
]
return encoded
def collect_activation_stats(model, calibration_data, layer_name: str):
"""Collect activation statistics for a specific layer"""
stats = {"min": float("inf"), "max": float("-inf")}
def hook_fn(module, input, output):
with torch.no_grad():
act = output.detach().float()
stats["min"] = min(stats["min"], act.min().item())
stats["max"] = max(stats["max"], act.max().item())
# Register hook
target_layer = dict(model.named_modules())[layer_name]
handle = target_layer.register_forward_hook(hook_fn)
# Run calibration data
model.eval()
with torch.no_grad():
for batch in calibration_data[:32]:
model(**batch)
handle.remove()
return stats
2.2 Min-Max Calibration
The simplest method: uses the global minimum and maximum values from calibration data.
class MinMaxCalibrator:
"""Min-Max calibrator"""
def __init__(self):
self.min_val = float("inf")
self.max_val = float("-inf")
def update(self, tensor: torch.Tensor):
self.min_val = min(self.min_val, tensor.min().item())
self.max_val = max(self.max_val, tensor.max().item())
def compute_scale_zp(self, num_bits: int = 8, symmetric: bool = True):
q_max = 2 ** (num_bits - 1) - 1 if symmetric else 2 ** num_bits - 1
if symmetric:
max_abs = max(abs(self.min_val), abs(self.max_val))
scale = max_abs / q_max
zero_point = 0
else:
scale = (self.max_val - self.min_val) / q_max
zero_point = -round(self.min_val / scale)
return scale, zero_point
2.3 Histogram Calibration
To reduce the impact of outliers, finds the optimal range based on the distribution histogram.
import numpy as np
from scipy import stats
class HistogramCalibrator:
"""Histogram-based calibrator (minimizes KL Divergence)"""
def __init__(self, num_bins: int = 2048):
self.num_bins = num_bins
self.histogram = None
self.bin_edges = None
def update(self, tensor: torch.Tensor):
data = tensor.detach().float().numpy().flatten()
if self.histogram is None:
self.histogram, self.bin_edges = np.histogram(data, bins=self.num_bins)
else:
new_hist, _ = np.histogram(data, bins=self.bin_edges)
self.histogram += new_hist
def compute_optimal_range(self, num_bits: int = 8):
"""Search for optimal range minimizing KL Divergence"""
num_quantized_bins = 2 ** num_bits - 1
best_kl = float("inf")
best_threshold = None
for i in range(num_quantized_bins, len(self.histogram)):
reference = self.histogram[:i].copy().astype(float)
reference /= reference.sum()
quantized = np.zeros(i)
bin_size = i / num_quantized_bins
for j in range(num_quantized_bins):
start = int(j * bin_size)
end = int((j + 1) * bin_size)
quantized[start:end] = reference[start:end].sum() / (end - start)
quantized = np.where(quantized == 0, 1e-10, quantized)
reference_clipped = np.where(reference == 0, 1e-10, reference)
kl = stats.entropy(reference_clipped, quantized)
if kl < best_kl:
best_kl = kl
best_threshold = self.bin_edges[i]
return -best_threshold, best_threshold
2.4 Impact on Perplexity
The most common metric for measuring quantization quality is Perplexity (PPL).
import torch
import math
from transformers import AutoModelForCausalLM, AutoTokenizer
def compute_perplexity(model, tokenizer, text: str, device: str = "cuda"):
"""Compute perplexity"""
encodings = tokenizer(text, return_tensors="pt")
input_ids = encodings.input_ids.to(device)
max_length = 1024
stride = 512
nlls = []
prev_end_loc = 0
for begin_loc in range(0, input_ids.size(1), stride):
end_loc = min(begin_loc + max_length, input_ids.size(1))
trg_len = end_loc - prev_end_loc
input_ids_chunk = input_ids[:, begin_loc:end_loc]
target_ids = input_ids_chunk.clone()
target_ids[:, :-trg_len] = -100
with torch.no_grad():
outputs = model(input_ids_chunk, labels=target_ids)
neg_log_likelihood = outputs.loss
nlls.append(neg_log_likelihood)
prev_end_loc = end_loc
if end_loc == input_ids.size(1):
break
ppl = torch.exp(torch.stack(nlls).mean())
return ppl.item()
# Example PPL comparison
# FP16: PPL ~5.68
# INT8: PPL ~5.71 (~0.5% increase)
# INT4 (GPTQ): PPL ~5.89 (~3.7% increase)
# INT4 (naive): PPL ~6.52 (~14.8% increase)
3. Quantization-Aware Training (QAT)
QAT simulates quantization during training so the model adapts to quantization noise.
3.1 Fake Quantization
Simulates quantization effects in FP32 instead of actual INT8 operations.
import torch
import torch.nn as nn
import torch.nn.functional as F
class FakeQuantize(nn.Module):
"""Fake quantization module"""
def __init__(self, num_bits: int = 8, symmetric: bool = True):
super().__init__()
self.num_bits = num_bits
self.symmetric = symmetric
self.register_buffer('scale', torch.tensor(1.0))
self.register_buffer('zero_point', torch.tensor(0))
self.register_buffer('fake_quant_enabled', torch.tensor(1))
if symmetric:
self.q_min = -(2 ** (num_bits - 1))
self.q_max = 2 ** (num_bits - 1) - 1
else:
self.q_min = 0
self.q_max = 2 ** num_bits - 1
def forward(self, x: torch.Tensor) -> torch.Tensor:
if self.fake_quant_enabled.item() == 0:
return x
# Update scale with exponential moving average during training
if self.training:
with torch.no_grad():
if self.symmetric:
max_abs = x.abs().max()
new_scale = max_abs / self.q_max
else:
new_scale = (x.max() - x.min()) / (self.q_max - self.q_min)
self.scale.copy_(0.9 * self.scale + 0.1 * new_scale)
# Fake quantize: quantize then dequantize
x_scaled = x / self.scale
x_clipped = torch.clamp(x_scaled, self.q_min, self.q_max)
x_rounded = torch.round(x_clipped)
x_dequant = x_rounded * self.scale
return x_dequant
3.2 STE (Straight-Through Estimator)
class STERound(torch.autograd.Function):
"""Straight-Through Estimator for round()"""
@staticmethod
def forward(ctx, x):
return torch.round(x)
@staticmethod
def backward(ctx, grad_output):
# Pass gradient through round() unchanged (identity approximation)
return grad_output
class STEClamp(torch.autograd.Function):
"""Straight-Through Estimator for clamp()"""
@staticmethod
def forward(ctx, x, min_val, max_val):
ctx.save_for_backward(x)
ctx.min_val = min_val
ctx.max_val = max_val
return torch.clamp(x, min_val, max_val)
@staticmethod
def backward(ctx, grad_output):
x, = ctx.saved_tensors
# Pass gradient only within clamp range
grad = grad_output * ((x >= ctx.min_val) & (x <= ctx.max_val)).float()
return grad, None, None
class QATLinear(nn.Module):
"""Linear layer with QAT applied"""
def __init__(self, in_features, out_features, num_bits=8):
super().__init__()
self.linear = nn.Linear(in_features, out_features)
self.weight_fake_quant = FakeQuantize(num_bits=num_bits)
self.act_fake_quant = FakeQuantize(num_bits=num_bits, symmetric=False)
def forward(self, x):
# Activation quantization
x_q = self.act_fake_quant(x)
# Weight quantization
w_q = self.weight_fake_quant(self.linear.weight)
# FP32 compute (INT8 in actual deployment)
return F.linear(x_q, w_q, self.linear.bias)
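To see why the STE matters, compare gradients with and without it: torch.round has a zero derivative almost everywhere, so without the STE no gradient would reach the weights. A minimal sketch (redefining STERound from above so the snippet stands alone):

```python
import torch

class STERound(torch.autograd.Function):
    @staticmethod
    def forward(ctx, x):
        return torch.round(x)

    @staticmethod
    def backward(ctx, grad_output):
        return grad_output  # identity: gradient passes straight through

# Plain round(): derivative is zero almost everywhere, so the gradient dies
x_plain = torch.tensor([0.3, 1.7, -2.4], requires_grad=True)
torch.round(x_plain).sum().backward()
print(x_plain.grad)  # tensor([0., 0., 0.])

# STE round(): gradient flows as if round() were the identity
x_ste = torch.tensor([0.3, 1.7, -2.4], requires_grad=True)
STERound.apply(x_ste).sum().backward()
print(x_ste.grad)    # tensor([1., 1., 1.])
```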
3.3 When is QAT Needed?
- When PTQ quality loss is too high: Especially effective for small models (BERT-small, etc.)
- Quantizing to INT4 or lower: Essential for extreme compression
- Precision-sensitive tasks: Object detection, ASR, etc.
# QAT training workflow
import torch.optim as optim
from torch.quantization import prepare_qat, convert
def train_qat_model(model, train_loader, num_epochs=10):
"""QAT training example"""
model.qconfig = torch.quantization.get_default_qat_qconfig('fbgemm')
model_prepared = prepare_qat(model.train())
optimizer = optim.Adam(model_prepared.parameters(), lr=1e-5)
for epoch in range(num_epochs):
for batch in train_loader:
inputs, labels = batch
outputs = model_prepared(inputs)
loss = F.cross_entropy(outputs, labels)
optimizer.zero_grad()
loss.backward()
optimizer.step()
# Convert to INT8 model
model_prepared.eval()
model_quantized = convert(model_prepared)
return model_quantized
4. PyTorch Quantization API
4.1 torch.ao.quantization
PyTorch's official quantization API.
import torch
import torch.nn as nn
from torch.ao.quantization import (
get_default_qconfig,
get_default_qat_qconfig,
prepare,
prepare_qat,
convert
)
# Static quantization (PTQ)
def static_quantization_example():
"""Static quantization example"""
model = MyModel()
model.eval()
# Backend config (fbgemm: x86, qnnpack: ARM)
model.qconfig = get_default_qconfig('fbgemm')
# Prepare for calibration
model_prepared = prepare(model)
# Collect statistics from calibration data
with torch.no_grad():
for data in calibration_loader:
model_prepared(data)
# Convert to INT8 model
model_quantized = convert(model_prepared)
return model_quantized
# Dynamic quantization (effective for LSTM, Linear)
def dynamic_quantization_example():
"""Dynamic quantization example"""
model = MyModel()
model_quantized = torch.quantization.quantize_dynamic(
model,
{nn.Linear, nn.LSTM}, # Layer types to quantize
dtype=torch.qint8
)
return model_quantized
4.2 FX Graph Mode Quantization
A more flexible and powerful quantization approach.
from torch.ao.quantization.quantize_fx import prepare_fx, convert_fx
from torch.ao.quantization import QConfigMapping
def fx_quantization_example(model, calibration_data):
"""FX Graph Mode quantization"""
model.eval()
qconfig_mapping = QConfigMapping().set_global(
get_default_qconfig('fbgemm')
)
example_inputs = (torch.randn(1, 3, 224, 224),)
model_prepared = prepare_fx(
model,
qconfig_mapping,
example_inputs
)
with torch.no_grad():
for batch in calibration_data:
model_prepared(batch)
model_quantized = convert_fx(model_prepared)
return model_quantized
5. GPTQ: Accurate Post-Training Quantization
GPTQ, published in 2022, is an LLM-specific quantization algorithm that minimizes quality loss even at INT4. (arXiv:2210.17323)
5.1 GPTQ Algorithm Principles
GPTQ is based on OBQ (Optimal Brain Quantization). The core idea: quantize weights layer by layer sequentially, then compensate for quantization errors in the already-quantized weights by updating the remaining weights.
OBQ error minimization objective:
argmin_Q ||WX - QX||_F^2
Where W is the original weight, Q is the quantized weight, and X is the input activation.
Hessian-based weight update:
After quantizing each weight, the resulting error is propagated to the remaining weights using the inverse Hessian H^(-1).
import torch
import math
def gptq_quantize_weight(weight: torch.Tensor,
hessian: torch.Tensor,
num_bits: int = 4,
group_size: int = 128,
damp_percent: float = 0.01):
"""
Quantize weights using the GPTQ algorithm
Args:
weight: [out_features, in_features] weight matrix
hessian: [in_features, in_features] Hessian matrix (H = 2 * X @ X.T)
num_bits: quantization bit count
group_size: group size
damp_percent: damping ratio for Hessian stabilization
"""
W = weight.clone().float()
n_rows, n_cols = W.shape
if group_size == -1:
group_size = n_cols  # treat the whole row as a single group
# Hessian damping (numerical stability)
H = hessian.clone().float()
dead_cols = torch.diag(H) == 0
H[dead_cols, dead_cols] = 1
W[:, dead_cols] = 0
damp = damp_percent * H.diag().mean()
H.diagonal().add_(damp)
# Inverse Hessian via Cholesky decomposition
H_inv = torch.linalg.cholesky(H)
H_inv = torch.cholesky_inverse(H_inv)
H_inv = torch.linalg.cholesky(H_inv, upper=True)
Q = torch.zeros_like(W)
Losses = torch.zeros_like(W)
q_max = 2 ** (num_bits - 1) - 1
for col_idx in range(n_cols):
w_col = W[:, col_idx]
h_inv_diag = H_inv[col_idx, col_idx]
# Compute per-group scale
if group_size != -1 and col_idx % group_size == 0:
group_end = min(col_idx + group_size, n_cols)
w_group = W[:, col_idx:group_end]
max_abs = w_group.abs().max(dim=1)[0].unsqueeze(1)
scale = max_abs / q_max
scale = torch.clamp(scale, min=1e-8)
# Quantize
q_col = torch.clamp(torch.round(w_col / scale.squeeze()), -q_max, q_max)
q_col = q_col * scale.squeeze()
Q[:, col_idx] = q_col
# Quantization error
err = (w_col - q_col) / h_inv_diag
Losses[:, col_idx] = err ** 2 / 2
# Propagate error to remaining weights (the key step!)
W[:, col_idx + 1:] -= err.unsqueeze(1) * H_inv[col_idx, col_idx + 1:].unsqueeze(0)
return Q, Losses
5.2 Using AutoGPTQ
Practical GPTQ quantization uses the AutoGPTQ library.
from auto_gptq import AutoGPTQForCausalLM, BaseQuantizeConfig
from transformers import AutoTokenizer
import torch
def quantize_with_gptq(
model_name: str,
output_dir: str,
bits: int = 4,
group_size: int = 128
):
"""Quantize model with AutoGPTQ"""
tokenizer = AutoTokenizer.from_pretrained(model_name, use_fast=True)
quantize_config = BaseQuantizeConfig(
bits=bits,
group_size=group_size,
damp_percent=0.01,
desc_act=False,
sym=True,
true_sequential=True
)
model = AutoGPTQForCausalLM.from_pretrained(
model_name,
quantize_config=quantize_config
)
# Prepare calibration data
from datasets import load_dataset
dataset = load_dataset("wikitext", "wikitext-2-raw-v1", split="train")
calibration_data = []
for text in dataset["text"][:128]:
if len(text.strip()) > 50:
encoded = tokenizer(
text.strip(),
return_tensors="pt",
max_length=2048,
truncation=True
)
calibration_data.append({
"input_ids": encoded["input_ids"],
"attention_mask": encoded["attention_mask"]
})
print(f"Starting GPTQ {bits}bit quantization...")
model.quantize(calibration_data)
model.save_quantized(output_dir, use_safetensors=True)
tokenizer.save_pretrained(output_dir)
print(f"Quantization complete: {output_dir}")
return model, tokenizer
def load_gptq_model(model_dir: str, device: str = "cuda"):
"""Load GPTQ quantized model"""
model = AutoGPTQForCausalLM.from_quantized(
model_dir,
device=device,
use_triton=False,
disable_exllama=False,
inject_fused_attention=True,
inject_fused_mlp=True
)
tokenizer = AutoTokenizer.from_pretrained(model_dir)
return model, tokenizer
6. AWQ: Activation-aware Weight Quantization
AWQ, published in 2023, analyzes activation distributions to protect important weight channels. (arXiv:2306.00978)
6.1 Differences from GPTQ
| Feature | GPTQ | AWQ |
|---|---|---|
| Approach | Hessian-based error compensation | Activation-based scaling |
| Calibration data | Required (128+ samples) | Required (32+ samples) |
| Speed | Slow (1–4 hours) | Fast (tens of minutes) |
| Quality | Excellent | Excellent (comparable or better) |
| Key feature | Per-channel optimization | Activation outlier handling |
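The observation behind AWQ can be reproduced in a few lines: with round-to-nearest (RTN) INT4, most of the output error comes from the few input channels with large activations, so protecting just those channels recovers most of the quality. A minimal sketch on synthetic data (the dimensions and the "salient channel" setup are illustrative, not from the paper):

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.standard_normal((256, 512))
X[:, :8] *= 20.0                          # a few salient activation channels
W = rng.standard_normal((512, 128)) * 0.02
Y = X @ W

def rtn_int4(w):
    """Round-to-nearest INT4 with a single per-tensor scale (illustrative)."""
    scale = np.abs(w).max() / 7
    return scale * np.clip(np.round(w / scale), -7, 7)

# Quantize everything: error is dominated by the salient channels
err_rtn = np.abs(Y - X @ rtn_int4(W)).mean()

# Keep only the salient rows of W (~1.5% here) in full precision
W_mixed = rtn_int4(W)
W_mixed[:8] = W[:8]
err_mixed = np.abs(Y - X @ W_mixed).mean()

# The scaling identity AWQ uses to avoid mixed precision is exact pre-quantization
s = np.abs(X).mean(axis=0) ** 0.5         # s_j = mean|X_j|^alpha, alpha = 0.5
assert np.allclose(Y, (X / s) @ (W * s[:, None]))

print(f"RTN INT4: {err_rtn:.5f}, salient-in-FP16: {err_mixed:.5f}")
```

AWQ then folds this protection into the weights via the per-channel scales s instead of keeping any rows in FP16, which preserves a uniform INT4 kernel.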
6.2 Using AutoAWQ
from awq import AutoAWQForCausalLM
from transformers import AutoTokenizer
def quantize_with_awq(
model_name: str,
output_dir: str,
bits: int = 4,
group_size: int = 128
):
"""Quantize model with AutoAWQ"""
tokenizer = AutoTokenizer.from_pretrained(
model_name,
trust_remote_code=True
)
model = AutoAWQForCausalLM.from_pretrained(
model_name,
low_cpu_mem_usage=True,
use_cache=False
)
quant_config = {
"zero_point": True,
"q_group_size": group_size,
"w_bit": bits,
"version": "GEMM"
}
print(f"Starting AWQ {bits}bit quantization...")
model.quantize(tokenizer, quant_config=quant_config)
model.save_quantized(output_dir)
tokenizer.save_pretrained(output_dir)
print(f"AWQ quantization complete: {output_dir}")
return model
def load_awq_model(model_dir: str, device: str = "cuda"):
"""Load AWQ quantized model"""
model = AutoAWQForCausalLM.from_quantized(
model_dir,
fuse_layers=True,
trust_remote_code=True,
safetensors=True
)
tokenizer = AutoTokenizer.from_pretrained(model_dir)
return model, tokenizer
7. GGUF/GGML: The llama.cpp Ecosystem
GGUF (GPT-Generated Unified Format) is the model format for the llama.cpp project, enabling efficient LLM execution even on CPU.
7.1 Understanding GGUF
GGUF was introduced in 2023 to replace GGML. It stores model metadata, hyperparameters, and tokenizer information in a single file.
7.2 Quantization Levels Comparison
| Format | Bits | Memory (7B) | PPL Increase | Recommended Use |
|---|---|---|---|---|
| Q2_K | 2.6 | 2.8 GB | High | Extreme compression |
| Q3_K_S | 3.0 | 3.3 GB | Medium | Memory saving |
| Q4_0 | 4.0 | 3.8 GB | Low | Balanced |
| Q4_K_M | 4.1 | 4.1 GB | Very low | General recommendation |
| Q5_0 | 5.0 | 4.7 GB | Minimal | High quality |
| Q5_K_M | 5.1 | 4.8 GB | Minimal | High quality recommended |
| Q6_K | 6.0 | 5.5 GB | Nearly none | Near FP16 |
| Q8_0 | 8.0 | 7.2 GB | None | Reference use |
| F16 | 16.0 | 13.5 GB | None | Baseline |
K-quants (Q4_K_M, Q5_K_M, etc.) keep some layers at higher precision to improve quality.
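The fractional bit counts in the table come from averaging over mixed precisions plus per-block scale overhead. A back-of-the-envelope sketch (the 85/15 split below is a made-up illustration, not the actual Q4_K_M tensor layout):

```python
# Hypothetical mix: 85% of weights at 4.5 bits, 15% at 6.5 bits
avg_bpw = 0.85 * 4.5 + 0.15 * 6.5        # effective bits per weight
memory_gb = 7e9 * avg_bpw / 8 / 1e9      # for a 7B-parameter model
print(f"{avg_bpw:.2f} bits/weight -> {memory_gb:.2f} GB")
```

This lands in the same ballpark as the Q4_K_M row in the table above.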
7.3 Building and Using llama.cpp
# Clone and build llama.cpp
git clone https://github.com/ggerganov/llama.cpp
cd llama.cpp
# Build with CUDA support
cmake -B build -DGGML_CUDA=ON
cmake --build build --config Release -j $(nproc)
# CPU-only build
cmake -B build
cmake --build build --config Release -j $(nproc)
# Convert HuggingFace model to GGUF
python convert_hf_to_gguf.py ./Llama-2-7b-hf \
--outfile llama2-7b-f16.gguf \
--outtype f16
# Quantize to Q4_K_M
./build/bin/llama-quantize \
llama2-7b-f16.gguf \
llama2-7b-q4_k_m.gguf \
Q4_K_M
# Run inference
./build/bin/llama-cli \
-m llama2-7b-q4_k_m.gguf \
-p "The future of AI is" \
-n 100 \
--ctx-size 4096 \
--threads 8 \
--n-gpu-layers 35
7.4 Python Bindings (llama-cpp-python)
from llama_cpp import Llama
# Load model
llm = Llama(
model_path="./llama2-7b-q4_k_m.gguf",
n_ctx=4096,
n_gpu_layers=35,
n_threads=8,
verbose=False
)
# Text generation
output = llm(
"Once upon a time",
max_tokens=200,
temperature=0.7,
top_p=0.9,
stop=["</s>", "\n\n"]
)
print(output["choices"][0]["text"])
# Chat completion format
response = llm.create_chat_completion(
messages=[
{"role": "system", "content": "You are a helpful assistant."},
{"role": "user", "content": "What is machine learning?"}
],
max_tokens=500,
temperature=0.7
)
print(response["choices"][0]["message"]["content"])
# Streaming output
for chunk in llm.create_chat_completion(
messages=[{"role": "user", "content": "Tell me a joke"}],
stream=True
):
delta = chunk["choices"][0].get("delta", {})
if "content" in delta:
print(delta["content"], end="", flush=True)
8. bitsandbytes: LLM Quantization Library
bitsandbytes, developed by Tim Dettmers, integrates seamlessly with HuggingFace transformers.
8.1 LLM.int8() — 8-bit Mixed Precision
LLM.int8() handles activation outliers in FP16 during matrix multiplication while using INT8 for the rest.
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig
import torch

# Load INT8 model (recent transformers versions expect a BitsAndBytesConfig
# instead of the deprecated load_in_8bit=True keyword)
model_8bit = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-2-7b-hf",
    quantization_config=BitsAndBytesConfig(load_in_8bit=True),
    device_map="auto"
)

def print_model_size(model, label):
    """Print model memory usage"""
    total_params = sum(p.numel() for p in model.parameters())
    total_bytes = sum(
        p.numel() * p.element_size() for p in model.parameters()
    )
    print(f"{label}: {total_params/1e9:.2f}B params, {total_bytes/1e9:.2f} GB")

print_model_size(model_8bit, "INT8 model")
# INT8 model: 6.74B params, ~7.0 GB
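The mixed-precision decomposition at the heart of LLM.int8() can be sketched in a few lines of plain PyTorch. This is a toy illustration, not the actual bitsandbytes kernel; the 6.0 threshold mirrors the paper's default outlier cutoff.

```python
import torch

torch.manual_seed(0)
X = torch.randn(8, 64)
X[:, 3] *= 50                              # one outlier hidden dimension
W = torch.randn(64, 32)

# Split hidden dimensions into outlier and regular sets
outlier = X.abs().max(dim=0).values > 6.0

# Regular part: symmetric per-tensor INT8 quantization, INT8 matmul
Xr, Wr = X[:, ~outlier], W[~outlier]
sx, sw = Xr.abs().max() / 127, Wr.abs().max() / 127
Xq = torch.round(Xr / sx).clamp(-127, 127)
Wq = torch.round(Wr / sw).clamp(-127, 127)
Y_int8 = (Xq @ Wq) * (sx * sw)

# Outlier part: kept at higher precision (FP16 in the real kernel)
Y_high = X[:, outlier] @ W[outlier]

Y = Y_int8 + Y_high
rel_err = (Y - X @ W).abs().mean() / (X @ W).abs().mean()
print(f"outlier dims: {int(outlier.sum())}, relative error: {rel_err:.4f}")
```

Without the split, the single outlier column would inflate the INT8 scale and with it the rounding error on every other channel.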
8.2 4-bit Quantization (Used in QLoRA)
import bitsandbytes as bnb
from transformers import BitsAndBytesConfig

# NF4 quantization config (QLoRA)
bnb_config_nf4 = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
    bnb_4bit_use_double_quant=True,  # Double quantization
)

# FP4 quantization config
bnb_config_fp4 = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="fp4",
    bnb_4bit_compute_dtype=torch.float16,
)

# Load model
model_4bit = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-2-7b-hf",
    quantization_config=bnb_config_nf4,
    device_map="auto"
)
print_model_size(model_4bit, "NF4 model")
# NF4 model: 6.74B params, ~4.0 GB (with double quantization)

# QLoRA fine-tuning setup
from peft import LoraConfig, get_peft_model, prepare_model_for_kbit_training

model_4bit = prepare_model_for_kbit_training(model_4bit)
lora_config = LoraConfig(
    r=64,
    lora_alpha=16,
    target_modules=["q_proj", "v_proj"],
    lora_dropout=0.05,
    bias="none",
    task_type="CAUSAL_LM"
)
model_lora = get_peft_model(model_4bit, lora_config)
model_lora.print_trainable_parameters()
# trainable params: ~33.6M (r=64 on q_proj/v_proj across 32 layers), under 1% of the total
8.3 NF4 vs FP4
NF4 (Normal Float 4)
- Non-linear 4-bit quantization assuming a normal distribution
- Leverages the observation that weight distributions are approximately normal
- Better representational power at the same bit count
FP4 (Float 4)
- Floating-point based 4-bit
- Can represent wider ranges
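The NF4 construction can be illustrated with a short sketch. This reproduces the idea, placing code values at evenly spaced quantiles of a standard normal, not the exact 16-value table that bitsandbytes ships:

```python
import torch

# Place 16 code values at equally spaced quantiles of N(0, 1),
# then normalize into [-1, 1]; tails are clipped to avoid +/- infinity.
norm = torch.distributions.Normal(0.0, 1.0)
probs = torch.linspace(0.02, 0.98, 16)
levels = norm.icdf(probs)
levels = levels / levels.abs().max()

# Levels cluster near 0, where normally distributed weights concentrate
print(levels)
center_gap = (levels[8] - levels[7]).item()
edge_gap = (levels[1] - levels[0]).item()
print(f"spacing near 0: {center_gap:.3f}, spacing at the edge: {edge_gap:.3f}")
```

The denser spacing near zero is exactly why NF4 beats uniform INT4 on normally distributed weights.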
9. SmoothQuant: W8A8 Quantization
SmoothQuant quantizes both weights (W) and activations (A) to INT8 for faster inference.
9.1 The Activation Outlier Problem
LLM activations exhibit very large values (outliers) in specific channels, making W8A8 quantization challenging.
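A synthetic example makes the problem concrete, assuming plain symmetric per-tensor absmax INT8 quantization: a single channel scaled 100x inflates the rounding error on every other channel.

```python
import torch

torch.manual_seed(0)

def int8_error(x: torch.Tensor) -> float:
    """Mean absolute error of symmetric per-tensor absmax INT8 quantization."""
    scale = x.abs().max() / 127
    q = torch.round(x / scale).clamp(-127, 127)
    return (x - q * scale).abs().mean().item()

acts = torch.randn(1024, 64)       # well-behaved activations
err_clean = int8_error(acts)

acts_outlier = acts.clone()
acts_outlier[:, 0] *= 100          # one outlier channel, as seen in LLMs
err_outlier = int8_error(acts_outlier)

print(f"error without outlier: {err_clean:.5f}")
print(f"error with outlier:    {err_outlier:.5f}")  # roughly two orders of magnitude worse
```

The outlier stretches the quantization scale, so the 254 available steps are wasted on a range that almost no value occupies.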
9.2 Migration Scaling
SmoothQuant's key insight: transfer the difficulty from activations to weights.
Y = X * W = (X * diag(s)^(-1)) * (diag(s) * W) = X_smooth * W_smooth
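Because the migration only moves a diagonal scale between the two factors, the layer output is mathematically unchanged, which a quick numeric check confirms (illustrative code, not part of the SmoothQuant library):

```python
import torch

torch.manual_seed(0)
X = torch.randn(4, 16)        # activations (batch x in_features)
W = torch.randn(16, 8)        # weights (in_features x out_features)
s = torch.rand(16) + 0.5      # per-input-channel migration scale

X_smooth = X / s              # X @ diag(s)^-1
W_smooth = s.unsqueeze(1) * W # diag(s) @ W

# The smoothed product matches the original up to float rounding
diff = (X_smooth @ W_smooth - X @ W).abs().max().item()
print(f"max difference: {diff:.2e}")
```

What changes is only how the quantization difficulty is shared: activations become flatter and weights absorb part of the dynamic range.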
import torch

def smooth_quantize(
    model,
    calibration_samples,
    alpha: float = 0.5
):
    """
    Apply SmoothQuant migration scaling to all Linear layers.
    Args:
        alpha: migration strength (0=weights only, 1=activations only)
               Recommended: 0.5 (equal distribution)
    """
    act_scales = {}

    def collect_scales(name):
        def hook(module, input, output):
            inp = input[0].detach()
            if inp.dim() == 3:
                inp = inp.reshape(-1, inp.size(-1))
            channel_max = inp.abs().max(dim=0)[0]
            if name not in act_scales:
                act_scales[name] = channel_max
            else:
                act_scales[name] = torch.maximum(act_scales[name], channel_max)
        return hook

    # Collect per-input-channel activation maxima on calibration data
    handles = []
    for name, module in model.named_modules():
        if isinstance(module, torch.nn.Linear):
            handles.append(module.register_forward_hook(collect_scales(name)))
    with torch.no_grad():
        for sample in calibration_samples:
            model(**sample)
    for h in handles:
        h.remove()

    for name, module in model.named_modules():
        if isinstance(module, torch.nn.Linear) and name in act_scales:
            act_scale = act_scales[name]
            weight_scale = module.weight.abs().max(dim=0)[0]
            # Migration scale: s_j = max|X_j|^alpha / max|W_j|^(1-alpha)
            smooth_scale = (act_scale ** alpha) / (weight_scale ** (1 - alpha))
            smooth_scale = torch.clamp(smooth_scale, min=1e-5)
            # W_smooth = diag(s) @ W: multiply each input channel of the weight
            # by s (the matching 1/s must be folded into the preceding
            # LayerNorm/op so the layer output stays unchanged)
            module.weight.data = module.weight.data * smooth_scale.unsqueeze(0)
    return model, act_scales
10. SpQR: Sparse Quantization Representation
SpQR stores important weights (outliers) in FP16 separately and quantizes the remainder to low precision.
import torch
def spqr_quantize(weight: torch.Tensor,
                  num_bits: int = 3,
                  outlier_threshold_percentile: float = 1.0):
    """
    SpQR quantization (simplified version)
    Core: store top p% outliers as FP16, quantize rest to low bits
    """
    threshold = torch.quantile(weight.abs(), 1 - outlier_threshold_percentile / 100)
    outlier_mask = weight.abs() > threshold

    # Store outliers (FP16)
    outlier_values = weight.clone()
    outlier_values[~outlier_mask] = 0

    # Quantize remainder in groups of 16
    regular_weight = weight.clone()
    regular_weight[outlier_mask] = 0
    q_max = 2 ** (num_bits - 1) - 1
    group_size = 16
    rows, cols = regular_weight.shape
    regular_grouped = regular_weight.reshape(-1, group_size)
    max_abs = regular_grouped.abs().max(dim=1, keepdim=True)[0]
    scales = torch.clamp(max_abs / q_max, min=1e-8)
    q = torch.clamp(torch.round(regular_grouped / scales), -q_max, q_max).to(torch.int8)
    regular_dequant = (scales * q.float()).reshape(rows, cols)

    reconstructed = regular_dequant + outlier_values
    error = (weight - reconstructed).abs().mean().item()

    # Memory accounting: 2 bytes per FP16 outlier, num_bits/8 per regular weight
    outlier_memory = outlier_mask.sum().item() * 2
    regular_memory = (~outlier_mask).sum().item() * (num_bits / 8)
    total_memory = outlier_memory + regular_memory
    original_memory = weight.numel() * weight.element_size()
    compression_ratio = original_memory / total_memory

    print(f"Outlier ratio: {outlier_mask.float().mean():.2%}")
    print(f"Mean reconstruction error: {error:.6f}")
    print(f"Compression ratio: {compression_ratio:.2f}x")
    return q, scales, outlier_values, outlier_mask
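The memory arithmetic behind SpQR is easy to sanity-check on its own: with 1% of weights kept in FP16 and the rest at 3 bits, the average cost per weight is about 0.39 bytes, roughly a 10x compression over FP32 (ignoring the index overhead for locating outliers, which the full format also pays):

```python
import torch

torch.manual_seed(0)
w = torch.randn(512, 512)

# Mark the top 1% of weights (by magnitude) as FP16 outliers
thr = torch.quantile(w.abs(), 0.99)
mask = w.abs() > thr

bytes_outliers = mask.sum().item() * 2        # FP16 outliers: 2 bytes each
bytes_regular = (~mask).sum().item() * 3 / 8  # 3-bit regular weights
ratio = (w.numel() * 4) / (bytes_outliers + bytes_regular)
print(f"outlier fraction: {mask.float().mean():.2%}, compression vs FP32: {ratio:.1f}x")
```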
11. Quantization Benchmark Comparison
11.1 Llama-2-7B Benchmark
import time
import torch
import GPUtil
def benchmark_quantization(model, tokenizer, device="cuda", num_runs=50):
    """Benchmark a (quantized) model: memory, latency, tokens/sec"""
    prompt = "The history of artificial intelligence began"
    inputs = tokenizer(prompt, return_tensors="pt").to(device)

    memory_used_gb = None  # stays None on CPU, where GPUtil does not apply
    if device == "cuda":
        torch.cuda.synchronize()
        gpu = GPUtil.getGPUs()[0]
        memory_used_gb = gpu.memoryUsed / 1024  # MiB -> GiB

    # Warmup
    with torch.no_grad():
        for _ in range(5):
            model.generate(
                **inputs,
                max_new_tokens=50,
                do_sample=False
            )

    # Measure speed
    if device == "cuda":
        torch.cuda.synchronize()
    start = time.time()
    with torch.no_grad():
        for _ in range(num_runs):
            model.generate(
                **inputs,
                max_new_tokens=50,
                do_sample=False
            )
    if device == "cuda":
        torch.cuda.synchronize()
    elapsed = time.time() - start

    avg_time = elapsed / num_runs
    tokens_per_second = 50 / avg_time
    return {
        "memory_gb": memory_used_gb,
        "avg_time_ms": avg_time * 1000,
        "tokens_per_second": tokens_per_second
    }
# Example results (A100 80GB, Llama-2-7B)
benchmark_results = {
"FP16": {"memory_gb": 13.5, "tokens_per_second": 52.3, "ppl": 5.68},
"INT8 (bitsandbytes)": {"memory_gb": 7.8, "tokens_per_second": 38.1, "ppl": 5.71},
"INT4 GPTQ": {"memory_gb": 4.5, "tokens_per_second": 65.2, "ppl": 5.89},
"INT4 AWQ": {"memory_gb": 4.3, "tokens_per_second": 68.7, "ppl": 5.86},
"Q4_K_M (GGUF, CPU)": {"memory_gb": 4.1, "tokens_per_second": 45.2, "ppl": 5.91},
"INT4 NF4": {"memory_gb": 4.0, "tokens_per_second": 31.5, "ppl": 5.94},
}
print("=" * 80)
print(f"{'Method':<25} {'Memory (GB)':<12} {'Tok/s':<12} {'PPL':<8}")
print("=" * 80)
for method, stats in benchmark_results.items():
    print(f"{method:<25} {stats['memory_gb']:<12.1f} {stats['tokens_per_second']:<12.1f} {stats['ppl']:<8.2f}")
12. Practical Guide: Choosing the Right Quantization Method
12.1 Strategy by Model Size
Small models under 7B:
- GGUF Q4_K_M: optimal for local CPU execution
- AWQ INT4: recommended for GPU server deployment
- FP16 viable if memory allows (under 24GB GPU)
Mid-size models 13B–30B:
- GPTQ INT4 or AWQ INT4: runs on a single 24GB GPU
- GGUF Q4_K_M: a 13B model runs in roughly 16GB of RAM
Large models 70B+:
- GPTQ INT4: runs on a single A100 80GB
- GPTQ INT2/INT3: extreme compression, with noticeable quality loss
- Multi-GPU + Tensor Parallel combination
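The size guidance above follows from a simple rule of thumb: params x bits / 8, plus some overhead for scales and zero-points. The helper below is illustrative; the 10% overhead figure is an assumption, and the KV cache still comes on top.

```python
def quantized_weight_gb(params_b: float, bits: float, overhead: float = 1.10) -> float:
    """Approximate weight memory in GB for a params_b-billion-parameter model."""
    return params_b * bits / 8 * overhead

# e.g. a 70B model at 4-bit needs ~38.5 GB, which is why it fits on one A100 80GB
for params_b in (7, 13, 70):
    for bits in (16, 8, 4):
        print(f"{params_b:>3}B @ {bits:>2}-bit: ~{quantized_weight_gb(params_b, bits):.1f} GB")
```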
12.2 Strategy by Task
def recommend_quantization(
    task: str,
    model_size_b: float,
    gpu_memory_gb: float,
    cpu_only: bool = False,
    fine_tuning_needed: bool = False
):
    """Recommend a quantization method based on task and environment"""
    recommendations = []

    if cpu_only:
        recommendations.append({
            "method": "GGUF Q4_K_M",
            "reason": "Optimized for CPU inference, based on llama.cpp",
            "library": "llama-cpp-python"
        })
        return recommendations

    if fine_tuning_needed:
        recommendations.append({
            "method": "bitsandbytes NF4 + QLoRA",
            "reason": "Fine-tuning capable, LoRA adapter training with ~4GB overhead",
            "library": "bitsandbytes + peft"
        })
        return recommendations

    # Rough weight-memory estimates (GB); keep ~20% headroom for the KV cache
    fp16_memory = model_size_b * 2
    int8_memory = model_size_b * 1
    int4_memory = model_size_b * 0.5

    if fp16_memory <= gpu_memory_gb * 0.8:
        recommendations.append({
            "method": "FP16 (baseline)",
            "reason": "Memory is sufficient, best quality",
            "memory_gb": fp16_memory
        })
    if int8_memory <= gpu_memory_gb * 0.8:
        if task in ["chat", "completion", "summarization"]:
            recommendations.append({
                "method": "INT8 (bitsandbytes LLM.int8())",
                "reason": "Good balance of quality and memory",
                "library": "bitsandbytes",
                "memory_gb": int8_memory
            })
    if int4_memory <= gpu_memory_gb * 0.8:
        recommendations.append({
            "method": "AWQ INT4",
            "reason": "Fast inference, excellent quality",
            "library": "autoawq",
            "memory_gb": int4_memory
        })
        recommendations.append({
            "method": "GPTQ INT4",
            "reason": "Best INT4 quality, slower quantization process",
            "library": "auto-gptq",
            "memory_gb": int4_memory
        })
    return recommendations
# Example usage
recommendations = recommend_quantization(
task="chat",
model_size_b=7.0,
gpu_memory_gb=16.0,
fine_tuning_needed=False
)
for rec in recommendations:
    print(f"\nMethod: {rec['method']}")
    print(f"Reason: {rec['reason']}")
    if 'library' in rec:
        print(f"Library: {rec['library']}")
    if 'memory_gb' in rec:
        print(f"Expected memory: {rec['memory_gb']:.1f} GB")
12.3 Complete Quantization Pipeline
import torch
import os
from auto_gptq import AutoGPTQForCausalLM, BaseQuantizeConfig
from awq import AutoAWQForCausalLM
from transformers import AutoTokenizer
class QuantizationPipeline:
    """Unified quantization pipeline"""

    def __init__(self, model_name: str, output_base_dir: str):
        self.model_name = model_name
        self.output_base_dir = output_base_dir
        self.tokenizer = AutoTokenizer.from_pretrained(model_name)
        os.makedirs(output_base_dir, exist_ok=True)

    def quantize_gptq(self, bits: int = 4, group_size: int = 128):
        """GPTQ quantization"""
        output_dir = os.path.join(self.output_base_dir, f"gptq-{bits}bit")
        config = BaseQuantizeConfig(
            bits=bits,
            group_size=group_size,
            sym=True,
            desc_act=False
        )
        model = AutoGPTQForCausalLM.from_pretrained(
            self.model_name,
            quantize_config=config
        )
        calibration_data = self._prepare_calibration_data()
        model.quantize(calibration_data)
        model.save_quantized(output_dir)
        self.tokenizer.save_pretrained(output_dir)
        print(f"GPTQ {bits}bit saved: {output_dir}")
        return output_dir

    def quantize_awq(self, bits: int = 4, group_size: int = 128):
        """AWQ quantization"""
        output_dir = os.path.join(self.output_base_dir, f"awq-{bits}bit")
        model = AutoAWQForCausalLM.from_pretrained(
            self.model_name,
            low_cpu_mem_usage=True
        )
        quant_config = {
            "zero_point": True,
            "q_group_size": group_size,
            "w_bit": bits,
            "version": "GEMM"
        }
        model.quantize(self.tokenizer, quant_config=quant_config)
        model.save_quantized(output_dir)
        self.tokenizer.save_pretrained(output_dir)
        print(f"AWQ {bits}bit saved: {output_dir}")
        return output_dir

    def _prepare_calibration_data(self, num_samples: int = 128):
        """Prepare calibration data (auto-gptq expects a list of dicts)"""
        from datasets import load_dataset

        dataset = load_dataset("wikitext", "wikitext-2-raw-v1", split="train")
        data = []
        for text in dataset["text"]:
            if len(text.strip()) > 50:
                encoded = self.tokenizer(
                    text.strip(),
                    return_tensors="pt",
                    max_length=2048,
                    truncation=True
                )
                data.append({
                    "input_ids": encoded["input_ids"],
                    "attention_mask": encoded["attention_mask"]
                })
            if len(data) >= num_samples:
                break
        return data
Conclusion
Model quantization is a foundational technology for democratizing LLMs. To summarize what we covered:
- Fundamentals: The math behind compressing FP32 to INT4 (scale, zero_point)
- PTQ vs QAT: PTQ is practical without retraining; QAT is essential for extreme compression
- GPTQ: Best INT4 quality via Hessian-based error compensation
- AWQ: Fast and efficient quantization based on activation distributions
- GGUF: Optimized for CPU execution, multiple quality levels available
- bitsandbytes: HuggingFace integration, essential for QLoRA fine-tuning
Recommended strategies:
- Local execution: GGUF Q4_K_M
- GPU server deployment: AWQ 4-bit
- Quality-critical scenarios: GPTQ 4-bit or FP16
- Fine-tuning needed: bitsandbytes NF4 + QLoRA
Quantization technology is evolving rapidly, with 2-bit methods like QuIP# and AQLM emerging. The journey toward smaller, faster models continues.
References
- GPTQ: arXiv:2210.17323
- AWQ: arXiv:2306.00978
- llama.cpp: github.com/ggerganov/llama.cpp
- bitsandbytes: github.com/TimDettmers/bitsandbytes
- SmoothQuant: arXiv:2211.10438
- SpQR: arXiv:2306.03078
- PyTorch Quantization: pytorch.org/docs/stable/quantization.html