Deep Learning Model Quantization, Mastered: INT8, INT4, GPTQ, AWQ, and GGUF
Introduction
As deep learning models grow ever larger, inference cost and memory requirements have exploded. GPT-3 has 175B parameters and Llama 3 reaches 70B; stored at full FP32 precision they require roughly 700 GB and 280 GB of memory respectively. That is beyond what an ordinary GPU can even load.
**Model quantization** is the key technique for solving this problem. Compressing 32-bit floating-point (FP32) weights to 8-bit or 4-bit integers cuts memory by roughly 4 to 8x and speeds up inference by about 2 to 4x, with surprisingly little quality loss.
This article digs into everything from the mathematics of quantization to modern techniques such as GPTQ, AWQ, GGUF, and bitsandbytes.
1. Quantization Basics: Understanding Number Representations
1.1 Floating-Point Formats
Understanding the floating-point formats used in modern deep learning is the starting point for quantization.
FP32 (Float32)
- Sign (1 bit) + exponent (8 bits) + mantissa (23 bits) = 32 bits total
- Range: roughly -3.4e38 to 3.4e38
- Precision: about 7 decimal digits
FP16 (Float16)
- Sign (1 bit) + exponent (5 bits) + mantissa (10 bits) = 16 bits total
- Range: -65504 to 65504 (far narrower than FP32)
- Precision: about 3 decimal digits
- Prone to overflow, so training requires gradient scaling
BF16 (Brain Float16)
- Sign (1 bit) + exponent (8 bits) + mantissa (7 bits) = 16 bits total
- Keeps the same exponent range as FP32 and shrinks only the mantissa
- No overflow risk; safer for deep learning training
- Developed at Google Brain; natively supported on recent GPUs (A100, H100)
import torch
import numpy as np
# Check the memory footprint of each dtype
x_fp32 = torch.tensor([1.5, -2.3, 0.7], dtype=torch.float32)
x_fp16 = torch.tensor([1.5, -2.3, 0.7], dtype=torch.float16)
x_bf16 = torch.tensor([1.5, -2.3, 0.7], dtype=torch.bfloat16)
print(f"FP32: {x_fp32.element_size()} bytes per element") # 4 bytes
print(f"FP16: {x_fp16.element_size()} bytes per element") # 2 bytes
print(f"BF16: {x_bf16.element_size()} bytes per element") # 2 bytes
# Model memory example (a 7B-parameter model)
params = 7e9
fp32_memory_gb = params * 4 / 1e9
fp16_memory_gb = params * 2 / 1e9
int8_memory_gb = params * 1 / 1e9
int4_memory_gb = params * 0.5 / 1e9
print(f"\n7B model memory requirements:")
print(f"FP32: {fp32_memory_gb:.1f} GB") # 28.0 GB
print(f"FP16: {fp16_memory_gb:.1f} GB") # 14.0 GB
print(f"INT8: {int8_memory_gb:.1f} GB") # 7.0 GB
print(f"INT4: {int4_memory_gb:.1f} GB") # 3.5 GB
1.2 Integer Formats
The core of quantization is mapping floating-point values to integers.
INT8: -128 to 127 (signed) or 0 to 255 (unsigned)
INT4: -8 to 7 (signed) or 0 to 15 (unsigned)
INT2: -2 to 1 (signed) or 0 to 3 (unsigned)
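These ranges all follow from the bit width; a small helper (illustrative, not from any library) makes the pattern explicit:

```python
def int_range(num_bits: int, signed: bool = True) -> tuple:
    """(min, max) representable by a signed (two's-complement) or unsigned integer."""
    if signed:
        return -(2 ** (num_bits - 1)), 2 ** (num_bits - 1) - 1
    return 0, 2 ** num_bits - 1

print(int_range(8))                 # (-128, 127)
print(int_range(4))                 # (-8, 7)
print(int_range(2, signed=False))   # (0, 3)
```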
1.3 The Quantization Formula
The basic formula for converting a floating-point value x to an integer q:
q = clamp(round(x / scale) + zero_point, q_min, q_max)
Dequantization:
x_approx = scale * (q - zero_point)
where:
- scale: the quantization scale factor (scale = (max_val - min_val) / (q_max - q_min))
- zero_point: the integer offset corresponding to real-valued zero
- q_min, q_max: the integer range (-128 and 127 for INT8)
import torch
import numpy as np
def symmetric_quantize(x: torch.Tensor, num_bits: int = 8):
    """Symmetric quantization."""
    q_max = 2 ** (num_bits - 1) - 1  # 127 for INT8
    q_min = -q_max  # -127
    # Compute the scale
    max_abs = x.abs().max()
    scale = max_abs / q_max
    # Quantize
    q = torch.clamp(torch.round(x / scale), q_min, q_max).to(torch.int8)
return q, scale
def asymmetric_quantize(x: torch.Tensor, num_bits: int = 8):
    """Asymmetric quantization."""
    q_max = 2 ** num_bits - 1  # 255 for UINT8
    q_min = 0
    # Compute the scale and zero_point
    min_val = x.min()
    max_val = x.max()
    scale = (max_val - min_val) / (q_max - q_min)
    zero_point = q_min - torch.round(min_val / scale)
    zero_point = torch.clamp(zero_point, q_min, q_max).to(torch.int32)
    # Quantize
    q = torch.clamp(torch.round(x / scale) + zero_point, q_min, q_max).to(torch.uint8)
return q, scale, zero_point
def dequantize(q: torch.Tensor, scale: torch.Tensor, zero_point: torch.Tensor = None):
    """Dequantize."""
if zero_point is None:
return scale * q.float()
return scale * (q.float() - zero_point.float())
# Test
x = torch.randn(100)
print(f"Original data range: [{x.min():.4f}, {x.max():.4f}]")
# Symmetric quantization
q_sym, scale_sym = symmetric_quantize(x)
x_reconstructed_sym = dequantize(q_sym, scale_sym)
error_sym = (x - x_reconstructed_sym).abs().mean()
print(f"Symmetric quantization mean error: {error_sym:.6f}")
# Asymmetric quantization
q_asym, scale_asym, zp_asym = asymmetric_quantize(x)
x_reconstructed_asym = dequantize(q_asym, scale_asym, zp_asym)
error_asym = (x - x_reconstructed_asym).abs().mean()
print(f"Asymmetric quantization mean error: {error_asym:.6f}")
1.4 Symmetric vs. Asymmetric Quantization
Symmetric quantization
- zero_point = 0
- The positive and negative ranges are symmetric
- Well suited to weight quantization (weights are mostly zero-centered)
- Simple arithmetic: x_approx = scale * q
Asymmetric quantization
- zero_point != 0
- Can represent an arbitrary range
- Well suited to activation quantization (post-ReLU values are always non-negative)
- More complex arithmetic: x_approx = scale * (q - zero_point)
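To see why this matters for post-ReLU activations, here is a NumPy sketch (mirroring the PyTorch functions above) comparing round-trip error on an all-positive tensor. The symmetric scheme wastes half its integer range on negative values that never occur, so its step size, and hence its error, is roughly twice that of the asymmetric scheme.

```python
import numpy as np

def quant_error(x: np.ndarray, num_bits: int, symmetric: bool) -> float:
    """Quantize-dequantize round trip; returns the mean absolute error."""
    if symmetric:
        q_max = 2 ** (num_bits - 1) - 1
        scale = np.abs(x).max() / q_max
        q = np.clip(np.round(x / scale), -q_max, q_max)
        x_hat = q * scale
    else:
        q_max = 2 ** num_bits - 1
        scale = (x.max() - x.min()) / q_max
        zp = np.round(-x.min() / scale)
        q = np.clip(np.round(x / scale) + zp, 0, q_max)
        x_hat = (q - zp) * scale
    return float(np.abs(x - x_hat).mean())

rng = np.random.default_rng(0)
relu_act = np.maximum(rng.standard_normal(10_000), 0)  # all non-negative, like post-ReLU
err_sym = quant_error(relu_act, 8, symmetric=True)
err_asym = quant_error(relu_act, 8, symmetric=False)
print(f"symmetric: {err_sym:.6f}, asymmetric: {err_asym:.6f}")
```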
1.5 Quantization Granularity
Granularity determines how many parameters share a single scale/zero_point.
Per-Tensor: one scale for the entire tensor
- Minimal memory overhead
- Largest precision loss
Per-Channel (Per-Row/Column): a separate scale per channel
- Each row/column of the weight matrix gets its own scale
- Handles per-channel distribution differences effectively
Per-Group (Per-Block): a separate scale per fixed-size group
- group_size = 128 is typical
- A middle ground between Per-Channel and Per-Tensor
- The scheme used by GPTQ and AWQ
import torch
def per_group_quantize(weight: torch.Tensor, group_size: int = 128, num_bits: int = 4):
    """Per-group quantization."""
    rows, cols = weight.shape
    # Split into groups (assumes the element count is a multiple of group_size)
    weight_grouped = weight.reshape(-1, group_size)
    # Per-group max/min
    max_vals = weight_grouped.max(dim=1, keepdim=True)[0]
    min_vals = weight_grouped.min(dim=1, keepdim=True)[0]
    q_max = 2 ** num_bits - 1  # 15 for INT4
    # Compute the scales
    scales = (max_vals - min_vals) / q_max
    zero_points = torch.round(-min_vals / scales)
    # Quantize
    q = torch.clamp(torch.round(weight_grouped / scales) + zero_points, 0, q_max)
    # Dequantize
    weight_dequant = scales * (q - zero_points)
    weight_dequant = weight_dequant.reshape(rows, cols)
    return q, scales, zero_points, weight_dequant
# Example: quantizing a Transformer weight matrix
weight = torch.randn(4096, 4096)  # Llama-style weight
q, scales, zp, weight_dequant = per_group_quantize(weight, group_size=128, num_bits=4)
error = (weight - weight_dequant).abs().mean()
print(f"Per-group INT4 quantization mean error: {error:.6f}")
print(f"Compression ratio: {weight.element_size() * weight.numel() / (q.numel() / 2 + scales.numel() * 4):.2f}x")
2. Post-Training Quantization (PTQ)
PTQ quantizes an already-trained model without any retraining. Its practicality makes it the most widely used approach.
2.1 Calibration Dataset
PTQ uses a small calibration dataset to determine appropriate scale/zero_point values.
from transformers import AutoModelForCausalLM, AutoTokenizer
import torch
from datasets import load_dataset
def collect_calibration_data(model_name: str, num_samples: int = 128):
    """Collect calibration data."""
    tokenizer = AutoTokenizer.from_pretrained(model_name)
    # WikiText-2 or C4 is the usual choice
    dataset = load_dataset("wikitext", "wikitext-2-raw-v1", split="train")
    texts = []
    for item in dataset:
        if len(item['text'].strip()) > 100:
            texts.append(item['text'].strip())
        if len(texts) >= num_samples:
            break
    # Tokenize
    encoded = [
        tokenizer(text, return_tensors="pt", max_length=2048, truncation=True)
        for text in texts
    ]
    return encoded
# Collect activation statistics with the calibration data
def collect_activation_stats(model, calibration_data, layer_name: str):
    """Collect activation statistics for a specific layer."""
    stats = {"min": float("inf"), "max": float("-inf"), "histogram": []}
    def hook_fn(module, input, output):
        with torch.no_grad():
            act = output.detach().float()
            stats["min"] = min(stats["min"], act.min().item())
            stats["max"] = max(stats["max"], act.max().item())
    # Register the hook
    target_layer = dict(model.named_modules())[layer_name]
    handle = target_layer.register_forward_hook(hook_fn)
    # Run the calibration data
    model.eval()
    with torch.no_grad():
        for batch in calibration_data[:32]:
            model(**batch)
    handle.remove()
    return stats
2.2 Min-Max Calibration
The simplest method: use the global minimum and maximum over the calibration data.
class MinMaxCalibrator:
    """Min-max calibrator."""
def __init__(self):
self.min_val = float("inf")
self.max_val = float("-inf")
def update(self, tensor: torch.Tensor):
self.min_val = min(self.min_val, tensor.min().item())
self.max_val = max(self.max_val, tensor.max().item())
def compute_scale_zp(self, num_bits: int = 8, symmetric: bool = True):
q_max = 2 ** (num_bits - 1) - 1 if symmetric else 2 ** num_bits - 1
if symmetric:
max_abs = max(abs(self.min_val), abs(self.max_val))
scale = max_abs / q_max
zero_point = 0
else:
scale = (self.max_val - self.min_val) / q_max
zero_point = -round(self.min_val / scale)
return scale, zero_point
2.3 Histogram Calibration
To reduce the influence of outliers, this approach searches for an optimal range based on a histogram of the distribution.
import numpy as np
from scipy import stats
class HistogramCalibrator:
    """Histogram-based calibrator (minimizes the KL divergence)."""
def __init__(self, num_bins: int = 2048):
self.num_bins = num_bins
self.histogram = None
self.bin_edges = None
def update(self, tensor: torch.Tensor):
data = tensor.detach().float().numpy().flatten()
if self.histogram is None:
self.histogram, self.bin_edges = np.histogram(data, bins=self.num_bins)
else:
new_hist, _ = np.histogram(data, bins=self.bin_edges)
self.histogram += new_hist
def compute_optimal_range(self, num_bits: int = 8):
        """Search for the range that minimizes the KL divergence."""
num_quantized_bins = 2 ** num_bits - 1
best_kl = float("inf")
best_threshold = None
        # Search over candidate thresholds
for i in range(num_quantized_bins, len(self.histogram)):
            # Compress the histogram into num_quantized_bins bins
reference = self.histogram[:i].copy().astype(float)
reference /= reference.sum()
            # Approximate KL divergence
quantized = np.zeros(i)
bin_size = i / num_quantized_bins
for j in range(num_quantized_bins):
start = int(j * bin_size)
end = int((j + 1) * bin_size)
quantized[start:end] = reference[start:end].sum() / (end - start)
            # Avoid empty bins
quantized = np.where(quantized == 0, 1e-10, quantized)
reference_clipped = np.where(reference == 0, 1e-10, reference)
kl = stats.entropy(reference_clipped, quantized)
if kl < best_kl:
best_kl = kl
best_threshold = self.bin_edges[i]
return -best_threshold, best_threshold
2.4 Impact on Perplexity
The most common metric for measuring quantization quality is perplexity (PPL).
import torch
import math
from transformers import AutoModelForCausalLM, AutoTokenizer
def compute_perplexity(model, tokenizer, text: str, device: str = "cuda"):
    """Compute perplexity."""
encodings = tokenizer(text, return_tensors="pt")
input_ids = encodings.input_ids.to(device)
max_length = 1024
stride = 512
nlls = []
prev_end_loc = 0
for begin_loc in range(0, input_ids.size(1), stride):
end_loc = min(begin_loc + max_length, input_ids.size(1))
trg_len = end_loc - prev_end_loc
input_ids_chunk = input_ids[:, begin_loc:end_loc]
target_ids = input_ids_chunk.clone()
target_ids[:, :-trg_len] = -100
with torch.no_grad():
outputs = model(input_ids_chunk, labels=target_ids)
neg_log_likelihood = outputs.loss
nlls.append(neg_log_likelihood)
prev_end_loc = end_loc
if end_loc == input_ids.size(1):
break
ppl = torch.exp(torch.stack(nlls).mean())
return ppl.item()
# Example PPL comparison by method
# FP16: PPL ≈ 5.68
# INT8: PPL ≈ 5.71 (about a 0.5% increase)
# INT4 (GPTQ): PPL ≈ 5.89 (about a 3.7% increase)
# INT4 (naive): PPL ≈ 6.52 (about a 14.8% increase)
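The relative increases quoted in those comments are simply (PPL_quant - PPL_base) / PPL_base; a one-liner reproduces them:

```python
def ppl_increase_pct(baseline: float, quantized: float) -> float:
    """Relative perplexity increase over the FP16 baseline, in percent."""
    return (quantized - baseline) / baseline * 100

print(f"INT8: {ppl_increase_pct(5.68, 5.71):.1f}%")        # 0.5%
print(f"INT4 GPTQ: {ppl_increase_pct(5.68, 5.89):.1f}%")   # 3.7%
print(f"INT4 naive: {ppl_increase_pct(5.68, 6.52):.1f}%")  # 14.8%
```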
3. Quantization-Aware Training (QAT)
QAT simulates quantization during training so the model learns to adapt to quantization noise.
3.1 Fake Quantization
Instead of real INT8 arithmetic, the effect of quantization is simulated in FP32.
import torch
import torch.nn as nn
import torch.nn.functional as F
class FakeQuantize(nn.Module):
    """Fake-quantization module."""
def __init__(self, num_bits: int = 8, symmetric: bool = True):
super().__init__()
self.num_bits = num_bits
self.symmetric = symmetric
self.register_buffer('scale', torch.tensor(1.0))
self.register_buffer('zero_point', torch.tensor(0))
self.register_buffer('fake_quant_enabled', torch.tensor(1))
if symmetric:
self.q_min = -(2 ** (num_bits - 1))
self.q_max = 2 ** (num_bits - 1) - 1
else:
self.q_min = 0
self.q_max = 2 ** num_bits - 1
def forward(self, x: torch.Tensor) -> torch.Tensor:
if self.fake_quant_enabled[0] == 0:
return x
        # Update the scale (moving average)
if self.training:
with torch.no_grad():
if self.symmetric:
max_abs = x.abs().max()
new_scale = max_abs / self.q_max
else:
new_scale = (x.max() - x.min()) / (self.q_max - self.q_min)
                # Update the scale with an exponential moving average
self.scale.copy_(0.9 * self.scale + 0.1 * new_scale)
        # Fake quantize: quantize, then dequantize
x_scaled = x / self.scale
x_clipped = torch.clamp(x_scaled, self.q_min, self.q_max)
x_rounded = torch.round(x_clipped)
x_dequant = x_rounded * self.scale
return x_dequant
3.2 STE (Straight-Through Estimator)
class STERound(torch.autograd.Function):
"""Straight-Through Estimator for round()"""
@staticmethod
def forward(ctx, x):
return torch.round(x)
@staticmethod
def backward(ctx, grad_output):
        # Pass the gradient through round() unchanged (identity approximation)
return grad_output
class STEClamp(torch.autograd.Function):
"""Straight-Through Estimator for clamp()"""
@staticmethod
def forward(ctx, x, min_val, max_val):
ctx.save_for_backward(x)
ctx.min_val = min_val
ctx.max_val = max_val
return torch.clamp(x, min_val, max_val)
@staticmethod
def backward(ctx, grad_output):
x, = ctx.saved_tensors
        # Pass gradients only within the clamp range
grad = grad_output * ((x >= ctx.min_val) & (x <= ctx.max_val)).float()
return grad, None, None
class QATLinear(nn.Module):
    """Linear layer with QAT applied."""
def __init__(self, in_features, out_features, num_bits=8):
super().__init__()
self.linear = nn.Linear(in_features, out_features)
self.weight_fake_quant = FakeQuantize(num_bits=num_bits)
self.act_fake_quant = FakeQuantize(num_bits=num_bits, symmetric=False)
def forward(self, x):
        # Quantize activations
x_q = self.act_fake_quant(x)
        # Quantize weights
w_q = self.weight_fake_quant(self.linear.weight)
        # FP32 compute here (INT8 at deployment time)
return F.linear(x_q, w_q, self.linear.bias)
3.3 When Do You Need QAT?
- When PTQ loses too much quality: especially effective for small models (e.g., BERT-small)
- When quantizing to INT4 or below: essential for preserving quality under extreme compression
- For precision-sensitive tasks: object detection, ASR, and the like
# QAT training workflow
import torch.optim as optim
from torch.quantization import prepare_qat, convert
def train_qat_model(model, train_loader, num_epochs=10):
    """QAT training example."""
    # Prepare for QAT
model.qconfig = torch.quantization.get_default_qat_qconfig('fbgemm')
model_prepared = prepare_qat(model.train())
optimizer = optim.Adam(model_prepared.parameters(), lr=1e-5)
for epoch in range(num_epochs):
for batch in train_loader:
inputs, labels = batch
outputs = model_prepared(inputs)
loss = F.cross_entropy(outputs, labels)
optimizer.zero_grad()
loss.backward()
optimizer.step()
    # Convert to an INT8 model
model_prepared.eval()
model_quantized = convert(model_prepared)
return model_quantized
4. PyTorch Quantization APIs
4.1 torch.ao.quantization
PyTorch's official quantization API.
import torch
import torch.nn as nn
from torch.ao.quantization import (
    get_default_qconfig,
    get_default_qat_qconfig,
    prepare,
    prepare_qat,
    convert
)
# Static quantization (PTQ)
def static_quantization_example():
    """Static quantization example."""
    model = MyModel()
    model.eval()
    # Backend selection (fbgemm: x86, qnnpack: ARM)
    model.qconfig = get_default_qconfig('fbgemm')
    # Prepare for calibration
    model_prepared = prepare(model)
    # Collect statistics with calibration data
    with torch.no_grad():
        for data in calibration_loader:
            model_prepared(data)
    # Convert to an INT8 model
    model_quantized = convert(model_prepared)
    return model_quantized
# Dynamic quantization (effective for LSTM and Linear layers)
def dynamic_quantization_example():
    """Dynamic quantization example."""
    model = MyModel()
    model_quantized = torch.quantization.quantize_dynamic(
        model,
        {nn.Linear, nn.LSTM},  # layer types to quantize
        dtype=torch.qint8
    )
    return model_quantized
4.2 FX Graph Mode Quantization
A more flexible and powerful quantization path.
from torch.ao.quantization.quantize_fx import prepare_fx, convert_fx
from torch.ao.quantization import QConfigMapping, get_default_qconfig
def fx_quantization_example(model, calibration_data):
    """FX graph mode quantization."""
    model.eval()
    # QConfig setup
    qconfig_mapping = QConfigMapping().set_global(
        get_default_qconfig('fbgemm')
    )
    # Example inputs
    example_inputs = (torch.randn(1, 3, 224, 224),)
    # FX graph-based preparation
    model_prepared = prepare_fx(
        model,
        qconfig_mapping,
        example_inputs
    )
    # Calibrate
    with torch.no_grad():
        for batch in calibration_data:
            model_prepared(batch)
    # Convert
    model_quantized = convert_fx(model_prepared)
    return model_quantized
5. GPTQ: Accurate Post-Training Quantization
GPTQ, published in 2022, is an LLM-focused quantization algorithm that keeps quality loss minimal even at INT4. (arXiv:2210.17323)
5.1 How the GPTQ Algorithm Works
GPTQ builds on OBQ (Optimal Brain Quantization). The core idea is to quantize each layer's weights sequentially while compensating the remaining weights for the error introduced by the weights already quantized.
The OBQ error-minimization objective:
argmin_Q ||WX - QX||_F^2
where W is the original weight matrix, Q the quantized weights, and X the input activations.
Hessian-based weight updates:
After each weight is quantized, the resulting error is propagated to the remaining weights using the inverse Hessian H^(-1).
# Core GPTQ algorithm (simplified)
import torch
import math
def gptq_quantize_weight(weight: torch.Tensor,
hessian: torch.Tensor,
num_bits: int = 4,
group_size: int = 128,
damp_percent: float = 0.01):
    """
    Quantize a weight matrix with the GPTQ algorithm.
    Args:
        weight: [out_features, in_features] weight matrix
        hessian: [in_features, in_features] Hessian matrix (H = 2 * X @ X.T)
        num_bits: number of quantization bits
        group_size: group size
        damp_percent: damping ratio for Hessian stabilization
    """
W = weight.clone().float()
n_rows, n_cols = W.shape
    # Hessian damping (numerical stability)
H = hessian.clone().float()
dead_cols = torch.diag(H) == 0
H[dead_cols, dead_cols] = 1
W[:, dead_cols] = 0
damp = damp_percent * H.diag().mean()
H.diagonal().add_(damp)
    # Inverse Hessian (via Cholesky decomposition)
H_inv = torch.linalg.cholesky(H)
H_inv = torch.cholesky_inverse(H_inv)
H_inv = torch.linalg.cholesky(H_inv, upper=True)
Q = torch.zeros_like(W)
Losses = torch.zeros_like(W)
q_max = 2 ** (num_bits - 1) - 1
for col_idx in range(n_cols):
        w_col = W[:, col_idx]  # weights of the current column
        h_inv_diag = H_inv[col_idx, col_idx]  # diagonal entry of the inverse Hessian
        # Per-group scale computation
if group_size != -1 and col_idx % group_size == 0:
group_end = min(col_idx + group_size, n_cols)
w_group = W[:, col_idx:group_end]
max_abs = w_group.abs().max(dim=1)[0].unsqueeze(1)
scale = max_abs / q_max
scale = torch.clamp(scale, min=1e-8)
        # Quantize
q_col = torch.clamp(torch.round(w_col / scale.squeeze()), -q_max, q_max)
q_col = q_col * scale.squeeze()
Q[:, col_idx] = q_col
        # Quantization error
err = (w_col - q_col) / h_inv_diag
Losses[:, col_idx] = err ** 2 / 2
        # Propagate the error to the remaining weights (the key step!)
W[:, col_idx + 1:] -= err.unsqueeze(1) * H_inv[col_idx, col_idx + 1:].unsqueeze(0)
return Q, Losses
def collect_hessian(model_layer, calibration_data, device='cuda'):
    """Accumulate Hessians from the calibration data."""
hessians = {}
def make_hook(name):
def hook(module, input, output):
inp = input[0].detach().float()
if inp.dim() == 3:
inp = inp.reshape(-1, inp.size(-1))
if name not in hessians:
hessians[name] = torch.zeros(inp.size(1), inp.size(1), device=device)
hessians[name] += 2 * inp.T @ inp
return hook
handles = []
for name, module in model_layer.named_modules():
if isinstance(module, torch.nn.Linear):
handles.append(module.register_forward_hook(make_hook(name)))
with torch.no_grad():
for batch in calibration_data:
model_layer(batch.to(device))
for h in handles:
h.remove()
return hessians
5.2 Using AutoGPTQ
For practical GPTQ quantization, use the AutoGPTQ library.
from auto_gptq import AutoGPTQForCausalLM, BaseQuantizeConfig
from transformers import AutoTokenizer
import torch
def quantize_with_gptq(
model_name: str,
output_dir: str,
bits: int = 4,
group_size: int = 128
):
    """Quantize a model with AutoGPTQ."""
    tokenizer = AutoTokenizer.from_pretrained(model_name, use_fast=True)
    # Quantization config
    quantize_config = BaseQuantizeConfig(
        bits=bits,  # 4 or 8
        group_size=group_size,  # 128 recommended
        damp_percent=0.01,  # Hessian damping
        desc_act=False,  # activation reordering (better quality, slower)
        sym=True,  # symmetric quantization
        true_sequential=True  # quantize layers sequentially
    )
    # Load the model
model = AutoGPTQForCausalLM.from_pretrained(
model_name,
quantize_config=quantize_config
)
    # Prepare calibration data
from datasets import load_dataset
dataset = load_dataset("wikitext", "wikitext-2-raw-v1", split="train")
calibration_data = []
for text in dataset["text"][:128]:
if len(text.strip()) > 50:
encoded = tokenizer(
text.strip(),
return_tensors="pt",
max_length=2048,
truncation=True
)
calibration_data.append(encoded["input_ids"].squeeze())
    # Run GPTQ quantization
    print(f"Starting GPTQ {bits}-bit quantization...")
model.quantize(calibration_data)
    # Save
model.save_quantized(output_dir, use_safetensors=True)
tokenizer.save_pretrained(output_dir)
    print(f"Quantization finished: {output_dir}")
return model, tokenizer
def load_gptq_model(model_dir: str, device: str = "cuda"):
    """Load a GPTQ-quantized model."""
model = AutoGPTQForCausalLM.from_quantized(
model_dir,
device=device,
        use_triton=False,  # whether to use Triton kernels
        disable_exllama=False,  # use the ExLlama kernels (faster)
inject_fused_attention=True,
inject_fused_mlp=True
)
tokenizer = AutoTokenizer.from_pretrained(model_dir)
return model, tokenizer
# Usage
# model, tokenizer = quantize_with_gptq("meta-llama/Llama-2-7b-hf", "./llama2-7b-gptq-4bit")
# model, tokenizer = load_gptq_model("./llama2-7b-gptq-4bit")
6. AWQ: Activation-aware Weight Quantization
AWQ, published in 2023, analyzes activation distributions to protect the most important weight channels. (arXiv:2306.00978)
6.1 Differences from GPTQ
| Aspect | GPTQ | AWQ |
|---|---|---|
| Approach | Hessian-based error compensation | Activation-based scaling |
| Calibration data | Required (128+ samples) | Required (32+ samples) |
| Speed | Slow (1-4 hours) | Fast (tens of minutes) |
| Quality | Excellent | Excellent (similar or better) |
| Distinctive feature | Per-channel optimization | Handles activation outliers |
6.2 The Core Idea of AWQ
LLM weights contain a small set of salient channels. Their activations are large, and quantization error on these channels hurts overall performance disproportionately. AWQ multiplies the weights of the salient channels by a scale factor so that their relative quantization error shrinks.
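The effect is easy to reproduce in a toy NumPy sketch (an illustration of the idea, not the actual AWQ scale search): input channel 0 carries large activations but small weights, so scaling its weights up by s while scaling its activations down by 1/s leaves the full-precision product unchanged yet shrinks that channel's quantization error.

```python
import numpy as np

def quant_dequant(w: np.ndarray, num_bits: int = 4) -> np.ndarray:
    """Per-tensor symmetric round-trip quantization."""
    q_max = 2 ** (num_bits - 1) - 1
    scale = np.abs(w).max() / q_max
    return np.clip(np.round(w / scale), -q_max, q_max) * scale

rng = np.random.default_rng(0)
W = rng.standard_normal((64, 64)) * 0.02
W[0] *= 0.2                       # salient input channel: small weights...
X = rng.standard_normal((16, 64))
X[:, 0] *= 50.0                   # ...but very large activations

s = np.ones(64)
s[0] = 4.0                        # per-channel scale for the salient channel

Y_ref = X @ W                                     # full-precision reference
Y_naive = X @ quant_dequant(W)                    # plain INT4 round trip
Y_awq = (X / s) @ quant_dequant(W * s[:, None])   # AWQ-style: (X/s) @ (s*W)

err_naive = np.abs(Y_ref - Y_naive).mean()
err_awq = np.abs(Y_ref - Y_awq).mean()
print(f"naive: {err_naive:.4f}, awq-style: {err_awq:.4f}")
```

Because the scaled-up channel still stays below the tensor's maximum, the quantization step is unchanged while the salient channel's effective error drops by a factor of s.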
from awq import AutoAWQForCausalLM
from transformers import AutoTokenizer
def quantize_with_awq(
model_name: str,
output_dir: str,
bits: int = 4,
group_size: int = 128
):
    """Quantize a model with AutoAWQ."""
tokenizer = AutoTokenizer.from_pretrained(
model_name,
trust_remote_code=True
)
model = AutoAWQForCausalLM.from_pretrained(
model_name,
low_cpu_mem_usage=True,
use_cache=False
)
    # AWQ quantization config
    quant_config = {
        "zero_point": True,  # asymmetric quantization
        "q_group_size": group_size,
        "w_bit": bits,
        "version": "GEMM"  # GEMM, or GEMV (optimized for small batches)
    }
    # Run quantization
    print(f"Starting AWQ {bits}-bit quantization...")
model.quantize(tokenizer, quant_config=quant_config)
    # Save
    model.save_quantized(output_dir)
    tokenizer.save_pretrained(output_dir)
    print(f"AWQ quantization finished: {output_dir}")
return model
def load_awq_model(model_dir: str, device: str = "cuda"):
    """Load an AWQ-quantized model."""
model = AutoAWQForCausalLM.from_quantized(
model_dir,
        fuse_layers=True,  # fuse layers for speed
trust_remote_code=True,
safetensors=True
)
tokenizer = AutoTokenizer.from_pretrained(model_dir)
return model, tokenizer
# Integration with Hugging Face transformers
from transformers import AutoModelForCausalLM
def load_awq_with_transformers(model_dir: str):
    """Load an AWQ model with transformers."""
model = AutoModelForCausalLM.from_pretrained(
model_dir,
device_map="auto"
)
tokenizer = AutoTokenizer.from_pretrained(model_dir)
return model, tokenizer
7. GGUF/GGML: The llama.cpp Ecosystem
GGUF (GPT-Generated Unified Format) is the model format of the llama.cpp project, enabling efficient LLM inference even on CPUs.
7.1 Understanding the GGUF Format
GGUF was introduced in 2023 as the successor to the GGML format. It bundles model metadata, hyperparameters, and tokenizer information into a single file.
GGUF file layout:
┌─────────────────────────────────┐
│ magic number (GGUF)             │
│ version                         │
│ tensor count                    │
│ metadata KV pairs               │
│  - model architecture           │
│  - context length               │
│  - attention head count         │
│  - embedding dimension          │
├─────────────────────────────────┤
│ tensor info (name, type, shape) │
├─────────────────────────────────┤
│ tensor data                     │
└─────────────────────────────────┘
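The fixed-size front of that layout can be read with the struct module. A minimal parser sketch, assuming the GGUF v3 little-endian header (4-byte magic "GGUF", uint32 version, uint64 tensor count, uint64 metadata KV count) and run here against a synthetic buffer rather than a real model file:

```python
import struct

def parse_gguf_header(data: bytes) -> dict:
    """Parse the fixed-size portion of a GGUF header (little-endian)."""
    magic, version, n_tensors, n_kv = struct.unpack_from("<4sIQQ", data, 0)
    if magic != b"GGUF":
        raise ValueError(f"not a GGUF file: magic={magic!r}")
    return {"version": version, "tensor_count": n_tensors, "metadata_kv_count": n_kv}

# Synthetic header: version 3, 291 tensors, 24 metadata key-value pairs
header = struct.pack("<4sIQQ", b"GGUF", 3, 291, 24)
print(parse_gguf_header(header))
```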
7.2 Comparing Quantization Levels
| Format | Bits | Memory (7B) | PPL increase | Recommended use |
|---|---|---|---|---|
| Q2_K | 2.6 | 2.8 GB | High | Extreme compression |
| Q3_K_S | 3.0 | 3.3 GB | Moderate | Memory savings |
| Q4_0 | 4.0 | 3.8 GB | Low | Balanced |
| Q4_K_M | 4.1 | 4.1 GB | Very low | General recommendation |
| Q5_0 | 5.0 | 4.7 GB | Minimal | High quality |
| Q5_K_M | 5.1 | 4.8 GB | Minimal | High-quality recommendation |
| Q6_K | 6.0 | 5.5 GB | Nearly none | Close to FP16 |
| Q8_0 | 8.0 | 7.2 GB | None | Reference |
| F16 | 16.0 | 13.5 GB | None | Baseline |
K-quants (Q4_K_M, Q5_K_M, etc.) keep parts of some layers at higher precision to improve quality.
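The memory column is, to a first approximation, params × bits-per-weight / 8; real GGUF files come out a bit different because of per-block scales, metadata, and the exact parameter count (a "7B" Llama actually has about 6.74B parameters). A rough estimator:

```python
def weights_size_gb(num_params: float, bits_per_weight: float) -> float:
    """Approximate size of the quantized weights alone, in GB."""
    return num_params * bits_per_weight / 8 / 1e9

for fmt, bpw in [("Q4_K_M", 4.1), ("Q8_0", 8.0), ("F16", 16.0)]:
    print(f"{fmt}: {weights_size_gb(7e9, bpw):.1f} GB")
```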
7.3 Building and Using llama.cpp
# Clone and build llama.cpp
git clone https://github.com/ggerganov/llama.cpp
cd llama.cpp
# Build with CUDA support
cmake -B build -DGGML_CUDA=ON
cmake --build build --config Release -j $(nproc)
# CPU-only build
cmake -B build
cmake --build build --config Release -j $(nproc)
# Convert a HuggingFace model to GGUF (the script takes a local model directory as a positional argument)
python convert_hf_to_gguf.py ./Llama-2-7b-hf \
--outfile llama2-7b-f16.gguf \
--outtype f16
# Quantize the GGUF file (Q4_K_M)
./build/bin/llama-quantize \
llama2-7b-f16.gguf \
llama2-7b-q4_k_m.gguf \
Q4_K_M
# Run inference
./build/bin/llama-cli \
-m llama2-7b-q4_k_m.gguf \
-p "The future of AI is" \
-n 100 \
--ctx-size 4096 \
--threads 8 \
--n-gpu-layers 35
7.4 Python Bindings (llama-cpp-python)
from llama_cpp import Llama
# Load the model
llm = Llama(
    model_path="./llama2-7b-q4_k_m.gguf",
    n_ctx=4096,  # context length
    n_gpu_layers=35,  # number of layers to offload to the GPU (-1 for all)
    n_threads=8,  # number of CPU threads
verbose=False
)
# Text generation
output = llm(
"Once upon a time",
max_tokens=200,
temperature=0.7,
top_p=0.9,
stop=["</s>", "\n\n"]
)
print(output["choices"][0]["text"])
# Chat-completion format
response = llm.create_chat_completion(
messages=[
{"role": "system", "content": "You are a helpful assistant."},
{"role": "user", "content": "What is machine learning?"}
],
max_tokens=500,
temperature=0.7
)
print(response["choices"][0]["message"]["content"])
# Streaming output
for chunk in llm.create_chat_completion(
messages=[{"role": "user", "content": "Tell me a joke"}],
stream=True
):
delta = chunk["choices"][0].get("delta", {})
if "content" in delta:
print(delta["content"], end="", flush=True)
8. bitsandbytes: An LLM Quantization Library
bitsandbytes, developed by Tim Dettmers, integrates tightly with HuggingFace transformers.
8.1 LLM.int8(): 8-bit Mixed Precision
LLM.int8() routes activation outliers through FP16 during matrix multiplication while using INT8 for everything else.
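A NumPy sketch of that decomposition (an illustration of the idea, not the actual bitsandbytes kernels): feature dimensions whose activation magnitude exceeds a threshold take the floating-point path, and everything else goes through INT8 with row-wise and column-wise scales.

```python
import numpy as np

def quantize_rows_int8(a: np.ndarray):
    """Row-wise symmetric INT8 quantization."""
    scale = np.abs(a).max(axis=1, keepdims=True) / 127.0
    q = np.clip(np.round(a / scale), -127, 127).astype(np.int8)
    return q, scale

def llm_int8_matmul(X: np.ndarray, W: np.ndarray, threshold: float = 6.0) -> np.ndarray:
    """LLM.int8()-style mixed-precision matmul: outlier dims in FP, the rest in INT8."""
    outlier = np.abs(X).max(axis=0) > threshold  # feature dims containing outliers
    # INT8 path: per-row scales for X, per-output-column scales for W
    qx, sx = quantize_rows_int8(X[:, ~outlier])
    qw, sw = quantize_rows_int8(W[~outlier].T)
    y_int8 = (qx.astype(np.int32) @ qw.astype(np.int32).T) * (sx * sw.T)
    # FP path for the outlier dims only
    y_fp = X[:, outlier] @ W[outlier]
    return y_int8 + y_fp

rng = np.random.default_rng(1)
X = rng.standard_normal((4, 64))
X[:, 0] *= 20.0
X[0, 0] = 40.0                    # guarantee an outlier in feature dim 0
W = rng.standard_normal((64, 32)) * 0.1
Y = llm_int8_matmul(X, W)
print(f"mean abs error vs FP: {np.abs(Y - X @ W).mean():.4f}")
```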
from transformers import AutoModelForCausalLM, AutoTokenizer
import torch
# Load the model in INT8
model_8bit = AutoModelForCausalLM.from_pretrained(
"meta-llama/Llama-2-7b-hf",
load_in_8bit=True,
device_map="auto"
)
# Check memory usage
def print_model_size(model, label):
    """Print a model's memory usage."""
total_params = sum(p.numel() for p in model.parameters())
total_bytes = sum(
p.numel() * p.element_size() for p in model.parameters()
)
print(f"{label}: {total_params/1e9:.2f}B params, {total_bytes/1e9:.2f} GB")
print_model_size(model_8bit, "INT8 model")
# INT8 model: 6.74B params, ~7.0 GB
8.2 4-bit Quantization (as Used by QLoRA)
import bitsandbytes as bnb
from transformers import BitsAndBytesConfig
# NF4 quantization config (QLoRA)
bnb_config_nf4 = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",  # NF4 or FP4
    bnb_4bit_compute_dtype=torch.bfloat16,  # dtype used for compute
    bnb_4bit_use_double_quant=True,  # double quantization (quantize the quantization constants too)
)
# FP4 quantization config
bnb_config_fp4 = BitsAndBytesConfig(
load_in_4bit=True,
bnb_4bit_quant_type="fp4",
bnb_4bit_compute_dtype=torch.float16,
)
# Load the model
model_4bit = AutoModelForCausalLM.from_pretrained(
"meta-llama/Llama-2-7b-hf",
quantization_config=bnb_config_nf4,
device_map="auto"
)
print_model_size(model_4bit, "NF4 model")
# NF4 model: 6.74B params, ~4.0 GB (with double quantization)
# QLoRA fine-tuning setup
from peft import LoraConfig, get_peft_model, prepare_model_for_kbit_training
model_4bit = prepare_model_for_kbit_training(model_4bit)
lora_config = LoraConfig(
r=64,
lora_alpha=16,
target_modules=["q_proj", "v_proj"],
lora_dropout=0.05,
bias="none",
task_type="CAUSAL_LM"
)
model_lora = get_peft_model(model_4bit, lora_config)
model_lora.print_trainable_parameters()
# trainable params: 4,194,304 || all params: 3,504,607,232 || trainable%: 0.1197
8.3 NF4 vs. FP4
NF4 (Normal Float 4)
- Non-linear 4-bit quantization that assumes a normal distribution
- Exploits the fact that weight distributions are close to Gaussian
- Better representational power at the same bit width
FP4 (Float 4)
- Floating-point-based 4 bits
- Can represent a wider range
import numpy as np
from scipy import stats
# NF4 quantization points
def get_nf4_quantization_points():
    """The 16 NF4 quantization points (simplified construction)."""
    # Quantiles of the standard normal distribution
    nf4_points = []
    for i in range(16):
        quantile = (i + 0.5) / 16
        nf4_points.append(stats.norm.ppf(quantile))
    # Normalize to [-1, 1]
    max_val = max(abs(p) for p in nf4_points)
    nf4_points = [p / max_val for p in nf4_points]
    return nf4_points
# Actual bitsandbytes NF4 levels for reference (the construction above is an approximation):
# [-1.0, -0.6961, -0.5250, -0.3949, -0.2844, -0.1848, -0.0911, 0.0000,
#  0.0796, 0.1609, 0.2461, 0.3379, 0.4407, 0.5626, 0.7230, 1.0]
9. SmoothQuant: W8A8 Quantization
SmoothQuant quantizes both the weights (W) and the activations (A) to INT8 for faster inference.
9.1 The Activation-Outlier Problem
LLM activations develop very large values (outliers) in particular channels, which makes W8A8 quantization difficult.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer
def analyze_activation_outliers(model, tokenizer, text: str, threshold: float = 100.0):
    """Analyze activation outliers."""
activations = {}
def make_hook(name):
def hook(module, input, output):
act = output.detach().float()
max_val = act.abs().max().item()
outlier_ratio = (act.abs() > threshold).float().mean().item()
activations[name] = {
"max": max_val,
"outlier_ratio": outlier_ratio,
"std": act.std().item()
}
return hook
handles = []
for name, module in model.named_modules():
if isinstance(module, torch.nn.Linear):
handles.append(module.register_forward_hook(make_hook(name)))
input_ids = tokenizer(text, return_tensors="pt").input_ids.cuda()
with torch.no_grad():
model(input_ids)
for h in handles:
h.remove()
    # Sort layers by outlier magnitude
sorted_acts = sorted(
activations.items(),
key=lambda x: x[1]["max"],
reverse=True
)
    print("Top 10 layers with the largest outliers:")
for name, stats in sorted_acts[:10]:
print(f" {name}: max={stats['max']:.1f}, outlier_ratio={stats['outlier_ratio']:.3%}")
return activations
9.2 Migration Scaling
The core of SmoothQuant: migrate the quantization difficulty from the activations into the weights.
Y = (X * diag(s)^(-1)) * (diag(s) * W)
= X_smooth * W_smooth
def smooth_quantize(
model,
calibration_samples,
alpha: float = 0.5
):
    """
    Apply SmoothQuant.
    Args:
        alpha: migration strength (0 = all to weights, 1 = all to activations)
            Recommended: 0.5 (even split)
    """
    # Collect activation statistics
act_scales = {}
def collect_scales(name):
def hook(module, input, output):
inp = input[0].detach()
if inp.dim() == 3:
inp = inp.reshape(-1, inp.size(-1))
channel_max = inp.abs().max(dim=0)[0]
if name not in act_scales:
act_scales[name] = channel_max
else:
act_scales[name] = torch.maximum(act_scales[name], channel_max)
return hook
handles = []
for name, module in model.named_modules():
if isinstance(module, torch.nn.Linear):
handles.append(module.register_forward_hook(collect_scales(name)))
with torch.no_grad():
for sample in calibration_samples:
model(**sample)
for h in handles:
h.remove()
    # Compute and apply the scales
for name, module in model.named_modules():
if isinstance(module, torch.nn.Linear) and name in act_scales:
act_scale = act_scales[name]
weight_scale = module.weight.abs().max(dim=0)[0]
            # Compute the migration scale
smooth_scale = (act_scale ** alpha) / (weight_scale ** (1 - alpha))
smooth_scale = torch.clamp(smooth_scale, min=1e-5)
            # Apply the scale to the weights
module.weight.data = module.weight.data / smooth_scale.unsqueeze(0)
            # Apply the inverse scale to the output of the preceding layer (LayerNorm, etc.)
            # (a real implementation locates and modifies that preceding layer)
return model, act_scales
10. SpQR: Sparse-Quantized Representation
SpQR stores the important weights (outliers) separately in FP16 and quantizes the rest at low precision.
import torch
def spqr_quantize(weight: torch.Tensor,
num_bits: int = 3,
outlier_threshold_percentile: float = 1.0):
    """
    SpQR quantization (simplified).
    Key idea: keep the top p% outliers in FP16 and quantize the rest at a low bit width.
    """
    # Outlier threshold
    threshold = torch.quantile(weight.abs(), 1 - outlier_threshold_percentile / 100)
    # Outlier mask
    outlier_mask = weight.abs() > threshold
    # Store outliers (FP16)
    outlier_values = weight.clone()
    outlier_values[~outlier_mask] = 0
    # Quantize the rest
    regular_weight = weight.clone()
    regular_weight[outlier_mask] = 0
    # Per-group quantization
q_max = 2 ** (num_bits - 1) - 1
group_size = 16
rows, cols = regular_weight.shape
regular_grouped = regular_weight.reshape(-1, group_size)
max_abs = regular_grouped.abs().max(dim=1, keepdim=True)[0]
scales = max_abs / q_max
scales = torch.clamp(scales, min=1e-8)
q = torch.clamp(torch.round(regular_grouped / scales), -q_max, q_max).to(torch.int8)
regular_dequant = (scales * q.float()).reshape(rows, cols)
    # Final reconstruction
reconstructed = regular_dequant + outlier_values
error = (weight - reconstructed).abs().mean().item()
    # Memory accounting
outlier_memory = outlier_mask.sum().item() * 2 # FP16 = 2 bytes
regular_memory = (~outlier_mask).sum().item() * (num_bits / 8)
total_memory = outlier_memory + regular_memory
original_memory = weight.numel() * weight.element_size()
compression_ratio = original_memory / total_memory
    print(f"Outlier ratio: {outlier_mask.float().mean():.2%}")
    print(f"Mean reconstruction error: {error:.6f}")
    print(f"Compression ratio: {compression_ratio:.2f}x")
return q, scales, outlier_values, outlier_mask
11. Quantization Benchmarks
11.1 Comparison on Llama-2-7B
import time
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer
import psutil
import GPUtil
def benchmark_quantization(model, tokenizer, device="cuda", num_runs=50):
    """Benchmark a quantized model."""
prompt = "The history of artificial intelligence began"
inputs = tokenizer(prompt, return_tensors="pt").to(device)
    # Memory usage
if device == "cuda":
torch.cuda.synchronize()
gpu = GPUtil.getGPUs()[0]
memory_used_gb = gpu.memoryUsed / 1024
else:
memory_used_gb = psutil.virtual_memory().used / 1e9
    # Warm-up
with torch.no_grad():
for _ in range(5):
outputs = model.generate(
**inputs,
max_new_tokens=50,
do_sample=False
)
    # Timing
if device == "cuda":
torch.cuda.synchronize()
start = time.time()
with torch.no_grad():
for _ in range(num_runs):
outputs = model.generate(
**inputs,
max_new_tokens=50,
do_sample=False
)
if device == "cuda":
torch.cuda.synchronize()
elapsed = time.time() - start
avg_time = elapsed / num_runs
tokens_per_second = 50 / avg_time
return {
"memory_gb": memory_used_gb,
"avg_time_ms": avg_time * 1000,
"tokens_per_second": tokens_per_second
}
# Example results (A100 80GB, Llama-2-7B)
benchmark_results = {
"FP16": {"memory_gb": 13.5, "tokens_per_second": 52.3, "ppl": 5.68},
"INT8 (bitsandbytes)": {"memory_gb": 7.8, "tokens_per_second": 38.1, "ppl": 5.71},
"INT4 GPTQ": {"memory_gb": 4.5, "tokens_per_second": 65.2, "ppl": 5.89},
"INT4 AWQ": {"memory_gb": 4.3, "tokens_per_second": 68.7, "ppl": 5.86},
"Q4_K_M (GGUF)": {"memory_gb": 4.1, "tokens_per_second": 45.2, "ppl": 5.91}, # CPU
"INT4 NF4": {"memory_gb": 4.0, "tokens_per_second": 31.5, "ppl": 5.94},
}
print("=" * 80)
print(f"{'Method':<25} {'Memory (GB)':<12} {'Tokens/s':<12} {'PPL':<8}")
print("=" * 80)
for method, stats in benchmark_results.items():
print(f"{method:<25} {stats['memory_gb']:<12.1f} {stats['tokens_per_second']:<12.1f} {stats['ppl']:<8.2f}")
12. Practical Guide: Choosing the Right Quantization Method
12.1 Strategy by Model Size
Small models (7B and below):
- GGUF Q4_K_M: best for local CPU execution
- AWQ INT4: recommended for GPU server deployment
- FP16 is also an option (fits comfortably on a 24GB GPU)
Mid-size models (13B-30B):
- GPTQ INT4 or AWQ INT4: runs on a single 24GB GPU
- GGUF Q4_K_M: runs even with 16GB of RAM
Large models (70B and above):
- GPTQ INT4: runs on a single A100 80GB
- GPTQ INT2: when extreme compression is required
- Multi-GPU with tensor parallelism
12.2 Strategy by Task
def recommend_quantization(
task: str,
model_size_b: float,
gpu_memory_gb: float,
cpu_only: bool = False,
fine_tuning_needed: bool = False
):
"""Recommend a quantization method for the given task and environment"""
recommendations = []
if cpu_only:
recommendations.append({
"method": "GGUF Q4_K_M",
"reason": "Optimized for CPU inference, built on llama.cpp",
"library": "llama-cpp-python"
})
return recommendations
if fine_tuning_needed:
recommendations.append({
"method": "bitsandbytes NF4 + QLoRA",
"reason": "Enables fine-tuning; LoRA adapters train with ~4GB of extra memory",
"library": "bitsandbytes + peft"
})
return recommendations
# Memory requirement calculation
fp16_memory = model_size_b * 2 # FP16 = 2 bytes per param
int8_memory = model_size_b * 1 # INT8 = 1 byte per param
int4_memory = model_size_b * 0.5 # INT4 = 0.5 bytes per param
if fp16_memory <= gpu_memory_gb * 0.8:
recommendations.append({
"method": "FP16 (default)",
"reason": "Ample memory headroom, highest quality",
"memory_gb": fp16_memory
})
if int8_memory <= gpu_memory_gb * 0.8:
if task in ["chat", "completion", "summarization"]:
recommendations.append({
"method": "INT8 (bitsandbytes LLM.int8())",
"reason": "Halves memory with minimal quality loss",
"library": "bitsandbytes",
"memory_gb": int8_memory
})
if int4_memory <= gpu_memory_gb * 0.8:
recommendations.append({
"method": "AWQ INT4",
"reason": "Fast inference with strong quality",
"library": "autoawq",
"memory_gb": int4_memory
})
recommendations.append({
"method": "GPTQ INT4",
"reason": "Best INT4 quality, but slow quantization process",
"library": "auto-gptq",
"memory_gb": int4_memory
})
return recommendations
# Usage example
recommendations = recommend_quantization(
task="chat",
model_size_b=7.0,
gpu_memory_gb=16.0,
fine_tuning_needed=False
)
for rec in recommendations:
print(f"\nMethod: {rec['method']}")
print(f"Reason: {rec['reason']}")
if 'library' in rec:
print(f"Library: {rec['library']}")
if 'memory_gb' in rec:
print(f"Estimated memory: {rec['memory_gb']:.1f} GB")
12.3 A Complete Quantization Pipeline
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer
from auto_gptq import AutoGPTQForCausalLM, BaseQuantizeConfig
from awq import AutoAWQForCausalLM
import json
import os
class QuantizationPipeline:
"""Unified quantization pipeline"""
def __init__(self, model_name: str, output_base_dir: str):
self.model_name = model_name
self.output_base_dir = output_base_dir
self.tokenizer = AutoTokenizer.from_pretrained(model_name)
os.makedirs(output_base_dir, exist_ok=True)
def quantize_gptq(self, bits: int = 4, group_size: int = 128):
"""GPTQ quantization"""
output_dir = os.path.join(self.output_base_dir, f"gptq-{bits}bit")
config = BaseQuantizeConfig(
bits=bits,
group_size=group_size,
sym=True,
desc_act=False
)
model = AutoGPTQForCausalLM.from_pretrained(
self.model_name,
quantize_config=config
)
# Calibration data
calibration_data = self._prepare_calibration_data()
model.quantize(calibration_data)
model.save_quantized(output_dir)
self.tokenizer.save_pretrained(output_dir)
print(f"GPTQ {bits}-bit saved to: {output_dir}")
return output_dir
def quantize_awq(self, bits: int = 4, group_size: int = 128):
"""AWQ quantization"""
output_dir = os.path.join(self.output_base_dir, f"awq-{bits}bit")
model = AutoAWQForCausalLM.from_pretrained(
self.model_name,
low_cpu_mem_usage=True
)
quant_config = {
"zero_point": True,
"q_group_size": group_size,
"w_bit": bits,
"version": "GEMM"
}
model.quantize(self.tokenizer, quant_config=quant_config)
model.save_quantized(output_dir)
self.tokenizer.save_pretrained(output_dir)
print(f"AWQ {bits}-bit saved to: {output_dir}")
return output_dir
def _prepare_calibration_data(self, num_samples: int = 128):
"""Prepare calibration data"""
from datasets import load_dataset
dataset = load_dataset("wikitext", "wikitext-2-raw-v1", split="train")
data = []
for text in dataset["text"]:
if len(text.strip()) > 50:
encoded = self.tokenizer(
text.strip(),
return_tensors="pt",
max_length=2048,
truncation=True
)
data.append({
"input_ids": encoded["input_ids"],
"attention_mask": encoded["attention_mask"]
})
if len(data) >= num_samples:
break
return data
def evaluate_all(self, test_text: str = None):
"""Evaluate all quantized models"""
if test_text is None:
from datasets import load_dataset
dataset = load_dataset("wikitext", "wikitext-2-raw-v1", split="test")
test_text = " ".join(dataset["text"][:10])
results = {}
# FP16 baseline
model_fp16 = AutoModelForCausalLM.from_pretrained(
self.model_name,
torch_dtype=torch.float16,
device_map="auto"
)
# Compute PPL for each quantized variant here (omitted in this stub)
# Print per-model evaluation results
print("\n=== Quantization Evaluation Results ===")
print(f"Model: {self.model_name}")
print(f"{'Method':<20} {'PPL':<10} {'Memory (GB)':<12}")
print("-" * 42)
return results
# Run the full pipeline
pipeline = QuantizationPipeline(
model_name="meta-llama/Llama-2-7b-hf",
output_base_dir="./quantized_models"
)
# GPTQ 4-bit quantization
gptq_dir = pipeline.quantize_gptq(bits=4)
# AWQ 4-bit quantization
awq_dir = pipeline.quantize_awq(bits=4)
Conclusion
Model quantization is a key technology for democratizing LLMs. To recap what this guide covered:
- Fundamentals: the math behind compressing FP32 to INT4 (scale, zero_point)
- PTQ vs QAT: PTQ needs no retraining and is the practical choice; QAT is essential for extreme compression
- GPTQ: Hessian-based error compensation for the best INT4 quality
- AWQ: fast, efficient quantization driven by activation distributions
- GGUF: best suited to CPU execution, with a wide range of quality levels
- bitsandbytes: integrated with HuggingFace, essential for QLoRA fine-tuning
Recommended strategies:
- Local execution: GGUF Q4_K_M
- GPU server deployment: AWQ 4-bit
- When quality matters most: GPTQ 4-bit or FP16
- Fine-tuning required: bitsandbytes NF4 + QLoRA
Quantization techniques are evolving rapidly, with 2-bit methods such as QuIP# and AQLM now emerging. The journey toward smaller, faster models continues.
References
- GPTQ: arXiv:2210.17323
- AWQ: arXiv:2306.00978
- llama.cpp: github.com/ggerganov/llama.cpp
- bitsandbytes: github.com/TimDettmers/bitsandbytes
- SmoothQuant: arXiv:2211.10438
- SpQR: arXiv:2306.03078
- PyTorch 양자화: pytorch.org/docs/stable/quantization.html
Deep Learning Model Quantization Complete Guide: Master INT8, INT4, GPTQ, AWQ, GGUF
Introduction
As deep learning models grow increasingly large, inference costs and memory requirements have skyrocketed. GPT-3 has 175B parameters, Llama 3 has 70B, and storing them in full FP32 precision requires 700GB and 280GB of memory respectively — completely infeasible for typical GPUs.
Model Quantization is the key technology to solve this problem. By compressing 32-bit floating-point (FP32) weights to 8-bit or 4-bit integers, it reduces memory by 4–8x and improves inference speed by 2–4x. Remarkably, quality loss is minimal.
In this guide, we thoroughly explore quantization from mathematical foundations to the latest techniques: GPTQ, AWQ, GGUF, and bitsandbytes.
1. Quantization Fundamentals: Understanding Number Representations
1.1 Floating-Point Formats
Understanding the floating-point formats used in modern deep learning is the starting point for quantization.
FP32 (Float32)
- Sign (1 bit) + Exponent (8 bits) + Mantissa (23 bits) = 32 bits total
- Range: approximately -3.4e38 to 3.4e38
- Precision: ~7 decimal digits
FP16 (Float16)
- Sign (1 bit) + Exponent (5 bits) + Mantissa (10 bits) = 16 bits total
- Range: -65504 to 65504 (much narrower than FP32)
- Precision: ~3 decimal digits
- Risk of overflow during training; requires gradient scaling
BF16 (Brain Float16)
- Sign (1 bit) + Exponent (8 bits) + Mantissa (7 bits) = 16 bits total
- Maintains the same exponent range as FP32 while reducing mantissa bits
- No overflow risk, safer for deep learning training
- Developed by Google Brain, natively supported on modern GPUs (A100, H100)
import torch
import numpy as np
# Check memory size of each data type
x_fp32 = torch.tensor([1.5, -2.3, 0.7], dtype=torch.float32)
x_fp16 = torch.tensor([1.5, -2.3, 0.7], dtype=torch.float16)
x_bf16 = torch.tensor([1.5, -2.3, 0.7], dtype=torch.bfloat16)
print(f"FP32: {x_fp32.element_size()} bytes per element") # 4 bytes
print(f"FP16: {x_fp16.element_size()} bytes per element") # 2 bytes
print(f"BF16: {x_bf16.element_size()} bytes per element") # 2 bytes
# Memory calculation for a 7B parameter model
params = 7e9
fp32_memory_gb = params * 4 / 1e9
fp16_memory_gb = params * 2 / 1e9
int8_memory_gb = params * 1 / 1e9
int4_memory_gb = params * 0.5 / 1e9
print(f"\n7B Model Memory Requirements:")
print(f"FP32: {fp32_memory_gb:.1f} GB") # 28.0 GB
print(f"FP16: {fp16_memory_gb:.1f} GB") # 14.0 GB
print(f"INT8: {int8_memory_gb:.1f} GB") # 7.0 GB
print(f"INT4: {int4_memory_gb:.1f} GB") # 3.5 GB
1.2 Integer Representations
The core of quantization is mapping floating-point values to integers.
INT8: -128 to 127 (signed) or 0 to 255 (unsigned)
INT4: -8 to 7 (signed) or 0 to 15 (unsigned)
INT2: -2 to 1 (signed) or 0 to 3 (unsigned)
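These ranges follow directly from the bit width; a quick sketch to derive them for any precision:

```python
def int_range(num_bits: int, signed: bool = True):
    """Representable (min, max) integers for a given bit width."""
    if signed:
        return -(2 ** (num_bits - 1)), 2 ** (num_bits - 1) - 1
    return 0, 2 ** num_bits - 1

for bits in (8, 4, 2):
    lo_s, hi_s = int_range(bits)
    lo_u, hi_u = int_range(bits, signed=False)
    print(f"INT{bits}: {lo_s} to {hi_s} (signed), {lo_u} to {hi_u} (unsigned)")
```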
1.3 Quantization Formula
The fundamental formula to convert a floating-point value x to integer q:
q = clamp(round(x / scale) + zero_point, q_min, q_max)
Dequantization:
x_approx = scale * (q - zero_point)
Where:
- scale: quantization scale factor (scale = (max_val - min_val) / (q_max - q_min))
- zero_point: the offset representing which real value integer 0 corresponds to
- q_min, q_max: integer range bounds (-128, 127 for INT8)
import torch
import numpy as np
def symmetric_quantize(x: torch.Tensor, num_bits: int = 8):
"""Symmetric quantization implementation"""
q_max = 2 ** (num_bits - 1) - 1 # 127 for INT8
q_min = -q_max # -127
# Compute scale
max_abs = x.abs().max()
scale = max_abs / q_max
# Quantize
q = torch.clamp(torch.round(x / scale), q_min, q_max).to(torch.int8)
return q, scale
def asymmetric_quantize(x: torch.Tensor, num_bits: int = 8):
"""Asymmetric quantization implementation"""
q_max = 2 ** num_bits - 1 # 255 for UINT8
q_min = 0
# Compute scale and zero_point
min_val = x.min()
max_val = x.max()
scale = (max_val - min_val) / (q_max - q_min)
zero_point = q_min - torch.round(min_val / scale)
zero_point = torch.clamp(zero_point, q_min, q_max).to(torch.int32)
# Quantize
q = torch.clamp(torch.round(x / scale) + zero_point, q_min, q_max).to(torch.uint8)
return q, scale, zero_point
def dequantize(q: torch.Tensor, scale: torch.Tensor, zero_point: torch.Tensor = None):
"""Dequantization"""
if zero_point is None:
return scale * q.float()
return scale * (q.float() - zero_point.float())
# Test
x = torch.randn(100)
print(f"Original data range: [{x.min():.4f}, {x.max():.4f}]")
# Symmetric quantization
q_sym, scale_sym = symmetric_quantize(x)
x_reconstructed_sym = dequantize(q_sym, scale_sym)
error_sym = (x - x_reconstructed_sym).abs().mean()
print(f"Symmetric quantization mean error: {error_sym:.6f}")
# Asymmetric quantization
q_asym, scale_asym, zp_asym = asymmetric_quantize(x)
x_reconstructed_asym = dequantize(q_asym, scale_asym, zp_asym)
error_asym = (x - x_reconstructed_asym).abs().mean()
print(f"Asymmetric quantization mean error: {error_asym:.6f}")
1.4 Symmetric vs Asymmetric Quantization
Symmetric Quantization
- zero_point = 0
- Symmetric positive/negative range
- Suitable for weights (mostly zero-centered distribution)
- Simpler computation: x_approx = scale * q
Asymmetric Quantization
- zero_point != 0
- Can represent arbitrary ranges
- Suitable for activations (always non-negative after ReLU)
- More complex computation: x_approx = scale * (q - zero_point)
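A quick numeric check of this guidance: the minimal sketch below (NumPy, synthetic data) compares both schemes on ReLU-style non-negative activations, where the asymmetric variant can spend all of its levels on the occupied range while the symmetric grid wastes its negative half.

```python
import numpy as np

def sym_quant_dequant(x, bits=8):
    # Symmetric: zero_point = 0, grid covers [-q_max, q_max]
    q_max = 2 ** (bits - 1) - 1
    scale = np.abs(x).max() / q_max
    return scale * np.clip(np.round(x / scale), -q_max, q_max)

def asym_quant_dequant(x, bits=8):
    # Asymmetric: zero_point shifts the grid onto [min, max]
    q_max = 2 ** bits - 1
    scale = (x.max() - x.min()) / q_max
    zp = np.round(-x.min() / scale)
    q = np.clip(np.round(x / scale) + zp, 0, q_max)
    return scale * (q - zp)

rng = np.random.default_rng(0)
act = np.maximum(rng.standard_normal(10_000), 0.0)  # post-ReLU activations
err_sym = np.abs(act - sym_quant_dequant(act)).mean()
err_asym = np.abs(act - asym_quant_dequant(act)).mean()
print(f"symmetric: {err_sym:.6f}, asymmetric: {err_asym:.6f}")
```

On this data the asymmetric error is roughly half the symmetric one, since the step size shrinks by about 2x when the negative half of the grid is not wasted.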
1.5 Quantization Granularity
Determines how many parameters share a single scale/zero_point.
Per-Tensor: One scale for the entire tensor
- Minimal memory overhead
- Largest precision loss
Per-Channel (Per-Row/Column): Individual scale per channel
- Separate scale for each row/column of the weight matrix
- Effectively handles distribution differences across channels
Per-Group (Per-Block): Individual scale per fixed-size group
- Typical group_size = 128
- Compromise between per-channel and per-tensor
- Commonly used in GPTQ and AWQ
import torch
def per_group_quantize(weight: torch.Tensor, group_size: int = 128, num_bits: int = 4):
"""Per-Group quantization implementation"""
rows, cols = weight.shape
# Split into groups
weight_grouped = weight.reshape(-1, group_size)
# Max/min per group
max_vals = weight_grouped.max(dim=1, keepdim=True)[0]
min_vals = weight_grouped.min(dim=1, keepdim=True)[0]
q_max = 2 ** num_bits - 1 # 15 for INT4
# Compute scales
scales = (max_vals - min_vals) / q_max
zero_points = torch.round(-min_vals / scales)
# Quantize
q = torch.clamp(torch.round(weight_grouped / scales) + zero_points, 0, q_max)
# Dequantize
weight_dequant = scales * (q - zero_points)
weight_dequant = weight_dequant.reshape(rows, cols)
return q, scales, zero_points, weight_dequant
# Example: Transformer weight quantization
weight = torch.randn(4096, 4096) # Llama-style weight
q, scales, zp, weight_dequant = per_group_quantize(weight, group_size=128, num_bits=4)
error = (weight - weight_dequant).abs().mean()
print(f"Per-Group INT4 quantization mean error: {error:.6f}")
print(f"Compression ratio: {weight.element_size() * weight.numel() / (q.numel() / 2 + (scales.numel() + zero_points.numel()) * 4):.2f}x")
2. Post-Training Quantization (PTQ)
PTQ quantizes an already-trained model without retraining — the most practical approach and most widely used.
2.1 Calibration Dataset
PTQ uses a small calibration dataset to determine appropriate scale and zero_point values.
from transformers import AutoModelForCausalLM, AutoTokenizer
import torch
from datasets import load_dataset
def collect_calibration_data(model_name: str, num_samples: int = 128):
"""Collect calibration data"""
tokenizer = AutoTokenizer.from_pretrained(model_name)
# WikiText-2 or C4 dataset is commonly used
dataset = load_dataset("wikitext", "wikitext-2-raw-v1", split="train")
texts = []
for item in dataset:
if len(item['text'].strip()) > 100:
texts.append(item['text'].strip())
if len(texts) >= num_samples:
break
# Tokenize
encoded = [
tokenizer(text, return_tensors="pt", max_length=2048, truncation=True)
for text in texts
]
return encoded
def collect_activation_stats(model, calibration_data, layer_name: str):
"""Collect activation statistics for a specific layer"""
stats = {"min": float("inf"), "max": float("-inf")}
def hook_fn(module, input, output):
with torch.no_grad():
act = output.detach().float()
stats["min"] = min(stats["min"], act.min().item())
stats["max"] = max(stats["max"], act.max().item())
# Register hook
target_layer = dict(model.named_modules())[layer_name]
handle = target_layer.register_forward_hook(hook_fn)
# Run calibration data
model.eval()
with torch.no_grad():
for batch in calibration_data[:32]:
model(**batch)
handle.remove()
return stats
2.2 Min-Max Calibration
The simplest method: uses the global minimum and maximum values from calibration data.
class MinMaxCalibrator:
"""Min-Max calibrator"""
def __init__(self):
self.min_val = float("inf")
self.max_val = float("-inf")
def update(self, tensor: torch.Tensor):
self.min_val = min(self.min_val, tensor.min().item())
self.max_val = max(self.max_val, tensor.max().item())
def compute_scale_zp(self, num_bits: int = 8, symmetric: bool = True):
q_max = 2 ** (num_bits - 1) - 1 if symmetric else 2 ** num_bits - 1
if symmetric:
max_abs = max(abs(self.min_val), abs(self.max_val))
scale = max_abs / q_max
zero_point = 0
else:
scale = (self.max_val - self.min_val) / q_max
zero_point = -round(self.min_val / scale)
return scale, zero_point
2.3 Histogram Calibration
To reduce the impact of outliers, finds the optimal range based on the distribution histogram.
import numpy as np
from scipy import stats
class HistogramCalibrator:
"""Histogram-based calibrator (minimizes KL Divergence)"""
def __init__(self, num_bins: int = 2048):
self.num_bins = num_bins
self.histogram = None
self.bin_edges = None
def update(self, tensor: torch.Tensor):
data = tensor.detach().float().numpy().flatten()
if self.histogram is None:
self.histogram, self.bin_edges = np.histogram(data, bins=self.num_bins)
else:
new_hist, _ = np.histogram(data, bins=self.bin_edges)
self.histogram += new_hist
def compute_optimal_range(self, num_bits: int = 8):
"""Search for optimal range minimizing KL Divergence"""
num_quantized_bins = 2 ** num_bits - 1
best_kl = float("inf")
best_threshold = None
for i in range(num_quantized_bins, len(self.histogram)):
reference = self.histogram[:i].copy().astype(float)
reference /= reference.sum()
quantized = np.zeros(i)
bin_size = i / num_quantized_bins
for j in range(num_quantized_bins):
start = int(j * bin_size)
end = int((j + 1) * bin_size)
quantized[start:end] = reference[start:end].sum() / (end - start)
quantized = np.where(quantized == 0, 1e-10, quantized)
reference_clipped = np.where(reference == 0, 1e-10, reference)
kl = stats.entropy(reference_clipped, quantized)
if kl < best_kl:
best_kl = kl
best_threshold = self.bin_edges[i]
return -best_threshold, best_threshold
2.4 Impact on Perplexity
The most common metric for measuring quantization quality is Perplexity (PPL).
import torch
import math
from transformers import AutoModelForCausalLM, AutoTokenizer
def compute_perplexity(model, tokenizer, text: str, device: str = "cuda"):
"""Compute perplexity"""
encodings = tokenizer(text, return_tensors="pt")
input_ids = encodings.input_ids.to(device)
max_length = 1024
stride = 512
nlls = []
prev_end_loc = 0
for begin_loc in range(0, input_ids.size(1), stride):
end_loc = min(begin_loc + max_length, input_ids.size(1))
trg_len = end_loc - prev_end_loc
input_ids_chunk = input_ids[:, begin_loc:end_loc]
target_ids = input_ids_chunk.clone()
target_ids[:, :-trg_len] = -100
with torch.no_grad():
outputs = model(input_ids_chunk, labels=target_ids)
neg_log_likelihood = outputs.loss
nlls.append(neg_log_likelihood)
prev_end_loc = end_loc
if end_loc == input_ids.size(1):
break
ppl = torch.exp(torch.stack(nlls).mean())
return ppl.item()
# Example PPL comparison
# FP16: PPL ~5.68
# INT8: PPL ~5.71 (~0.5% increase)
# INT4 (GPTQ): PPL ~5.89 (~3.7% increase)
# INT4 (naive): PPL ~6.52 (~14.8% increase)
3. Quantization-Aware Training (QAT)
QAT simulates quantization during training so the model adapts to quantization noise.
3.1 Fake Quantization
Simulates quantization effects in FP32 instead of actual INT8 operations.
import torch
import torch.nn as nn
import torch.nn.functional as F
class FakeQuantize(nn.Module):
"""Fake quantization module"""
def __init__(self, num_bits: int = 8, symmetric: bool = True):
super().__init__()
self.num_bits = num_bits
self.symmetric = symmetric
self.register_buffer('scale', torch.tensor(1.0))
self.register_buffer('zero_point', torch.tensor(0))
self.register_buffer('fake_quant_enabled', torch.tensor(1))
if symmetric:
self.q_min = -(2 ** (num_bits - 1))
self.q_max = 2 ** (num_bits - 1) - 1
else:
self.q_min = 0
self.q_max = 2 ** num_bits - 1
def forward(self, x: torch.Tensor) -> torch.Tensor:
if self.fake_quant_enabled.item() == 0:
return x
# Update scale with exponential moving average during training
if self.training:
with torch.no_grad():
if self.symmetric:
max_abs = x.abs().max()
new_scale = max_abs / self.q_max
else:
new_scale = (x.max() - x.min()) / (self.q_max - self.q_min)
self.scale.copy_(0.9 * self.scale + 0.1 * new_scale)
# Fake quantize: quantize then dequantize
x_scaled = x / self.scale
x_clipped = torch.clamp(x_scaled, self.q_min, self.q_max)
x_rounded = torch.round(x_clipped)
x_dequant = x_rounded * self.scale
return x_dequant
3.2 STE (Straight-Through Estimator)
class STERound(torch.autograd.Function):
"""Straight-Through Estimator for round()"""
@staticmethod
def forward(ctx, x):
return torch.round(x)
@staticmethod
def backward(ctx, grad_output):
# Pass gradient through round() unchanged (identity approximation)
return grad_output
class STEClamp(torch.autograd.Function):
"""Straight-Through Estimator for clamp()"""
@staticmethod
def forward(ctx, x, min_val, max_val):
ctx.save_for_backward(x)
ctx.min_val = min_val
ctx.max_val = max_val
return torch.clamp(x, min_val, max_val)
@staticmethod
def backward(ctx, grad_output):
x, = ctx.saved_tensors
# Pass gradient only within clamp range
grad = grad_output * ((x >= ctx.min_val) & (x <= ctx.max_val)).float()
return grad, None, None
class QATLinear(nn.Module):
"""Linear layer with QAT applied"""
def __init__(self, in_features, out_features, num_bits=8):
super().__init__()
self.linear = nn.Linear(in_features, out_features)
self.weight_fake_quant = FakeQuantize(num_bits=num_bits)
self.act_fake_quant = FakeQuantize(num_bits=num_bits, symmetric=False)
def forward(self, x):
# Activation quantization
x_q = self.act_fake_quant(x)
# Weight quantization
w_q = self.weight_fake_quant(self.linear.weight)
# FP32 compute (INT8 in actual deployment)
return F.linear(x_q, w_q, self.linear.bias)
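To see why the STE matters, compare gradients with and without it: torch.round has a zero derivative almost everywhere, so without the STE no gradient would reach the weights. A minimal sketch (redefining STERound from above so the snippet stands alone):

```python
import torch

class STERound(torch.autograd.Function):
    @staticmethod
    def forward(ctx, x):
        return torch.round(x)

    @staticmethod
    def backward(ctx, grad_output):
        return grad_output  # identity: gradient passes straight through

# Plain round(): derivative is zero almost everywhere, so the gradient dies
x_plain = torch.tensor([0.3, 1.7, -2.4], requires_grad=True)
torch.round(x_plain).sum().backward()
print(x_plain.grad)  # tensor([0., 0., 0.])

# STE round(): gradient flows as if round() were the identity
x_ste = torch.tensor([0.3, 1.7, -2.4], requires_grad=True)
STERound.apply(x_ste).sum().backward()
print(x_ste.grad)    # tensor([1., 1., 1.])
```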
3.3 When is QAT Needed?
- When PTQ quality loss is too high: Especially effective for small models (BERT-small, etc.)
- Quantizing to INT4 or lower: Essential for extreme compression
- Precision-sensitive tasks: Object detection, ASR, etc.
# QAT training workflow
import torch.optim as optim
from torch.quantization import prepare_qat, convert
def train_qat_model(model, train_loader, num_epochs=10):
"""QAT training example"""
model.qconfig = torch.quantization.get_default_qat_qconfig('fbgemm')
model_prepared = prepare_qat(model.train())
optimizer = optim.Adam(model_prepared.parameters(), lr=1e-5)
for epoch in range(num_epochs):
for batch in train_loader:
inputs, labels = batch
outputs = model_prepared(inputs)
loss = F.cross_entropy(outputs, labels)
optimizer.zero_grad()
loss.backward()
optimizer.step()
# Convert to INT8 model
model_prepared.eval()
model_quantized = convert(model_prepared)
return model_quantized
4. PyTorch Quantization API
4.1 torch.ao.quantization
PyTorch's official quantization API.
import torch
import torch.nn as nn
from torch.ao.quantization import (
get_default_qconfig,
get_default_qat_qconfig,
prepare,
prepare_qat,
convert
)
# Static quantization (PTQ)
def static_quantization_example():
"""Static quantization example"""
model = MyModel()
model.eval()
# Backend config (fbgemm: x86, qnnpack: ARM)
model.qconfig = get_default_qconfig('fbgemm')
# Prepare for calibration
model_prepared = prepare(model)
# Collect statistics from calibration data
with torch.no_grad():
for data in calibration_loader:
model_prepared(data)
# Convert to INT8 model
model_quantized = convert(model_prepared)
return model_quantized
# Dynamic quantization (effective for LSTM, Linear)
def dynamic_quantization_example():
"""Dynamic quantization example"""
model = MyModel()
model_quantized = torch.quantization.quantize_dynamic(
model,
{nn.Linear, nn.LSTM}, # Layer types to quantize
dtype=torch.qint8
)
return model_quantized
4.2 FX Graph Mode Quantization
A more flexible and powerful quantization approach.
from torch.ao.quantization.quantize_fx import prepare_fx, convert_fx
from torch.ao.quantization import QConfigMapping
def fx_quantization_example(model, calibration_data):
"""FX Graph Mode quantization"""
model.eval()
qconfig_mapping = QConfigMapping().set_global(
get_default_qconfig('fbgemm')
)
example_inputs = (torch.randn(1, 3, 224, 224),)
model_prepared = prepare_fx(
model,
qconfig_mapping,
example_inputs
)
with torch.no_grad():
for batch in calibration_data:
model_prepared(batch)
model_quantized = convert_fx(model_prepared)
return model_quantized
5. GPTQ: Accurate Post-Training Quantization
GPTQ, published in 2022, is an LLM-specific quantization algorithm that minimizes quality loss even at INT4. (arXiv:2210.17323)
5.1 GPTQ Algorithm Principles
GPTQ is based on OBQ (Optimal Brain Quantization). The core idea: quantize weights layer by layer sequentially, then compensate for quantization errors in the already-quantized weights by updating the remaining weights.
OBQ error minimization objective:
argmin_Q ||WX - QX||_F^2
Where W is the original weight, Q is the quantized weight, and X is the input activation.
Hessian-based weight update:
After quantizing each weight, the resulting error is propagated to the remaining weights using the inverse Hessian H^(-1).
import torch
import math
def gptq_quantize_weight(weight: torch.Tensor,
hessian: torch.Tensor,
num_bits: int = 4,
group_size: int = 128,
damp_percent: float = 0.01):
"""
Quantize weights using the GPTQ algorithm
Args:
weight: [out_features, in_features] weight matrix
hessian: [in_features, in_features] Hessian matrix (H = 2 * X @ X.T)
num_bits: quantization bit count
group_size: group size
damp_percent: damping ratio for Hessian stabilization
"""
W = weight.clone().float()
n_rows, n_cols = W.shape
if group_size == -1:
group_size = n_cols  # treat the whole row as a single group
# Hessian damping (numerical stability)
H = hessian.clone().float()
dead_cols = torch.diag(H) == 0
H[dead_cols, dead_cols] = 1
W[:, dead_cols] = 0
damp = damp_percent * H.diag().mean()
H.diagonal().add_(damp)
# Inverse Hessian via Cholesky decomposition
H_inv = torch.linalg.cholesky(H)
H_inv = torch.cholesky_inverse(H_inv)
H_inv = torch.linalg.cholesky(H_inv, upper=True)
Q = torch.zeros_like(W)
Losses = torch.zeros_like(W)
q_max = 2 ** (num_bits - 1) - 1
for col_idx in range(n_cols):
w_col = W[:, col_idx]
h_inv_diag = H_inv[col_idx, col_idx]
# Compute per-group scale
if group_size != -1 and col_idx % group_size == 0:
group_end = min(col_idx + group_size, n_cols)
w_group = W[:, col_idx:group_end]
max_abs = w_group.abs().max(dim=1)[0].unsqueeze(1)
scale = max_abs / q_max
scale = torch.clamp(scale, min=1e-8)
# Quantize
q_col = torch.clamp(torch.round(w_col / scale.squeeze()), -q_max, q_max)
q_col = q_col * scale.squeeze()
Q[:, col_idx] = q_col
# Quantization error
err = (w_col - q_col) / h_inv_diag
Losses[:, col_idx] = err ** 2 / 2
# Propagate error to remaining weights (the key step!)
W[:, col_idx + 1:] -= err.unsqueeze(1) * H_inv[col_idx, col_idx + 1:].unsqueeze(0)
return Q, Losses
5.2 Using AutoGPTQ
Practical GPTQ quantization uses the AutoGPTQ library.
from auto_gptq import AutoGPTQForCausalLM, BaseQuantizeConfig
from transformers import AutoTokenizer
import torch
def quantize_with_gptq(
model_name: str,
output_dir: str,
bits: int = 4,
group_size: int = 128
):
"""Quantize model with AutoGPTQ"""
tokenizer = AutoTokenizer.from_pretrained(model_name, use_fast=True)
quantize_config = BaseQuantizeConfig(
bits=bits,
group_size=group_size,
damp_percent=0.01,
desc_act=False,
sym=True,
true_sequential=True
)
model = AutoGPTQForCausalLM.from_pretrained(
model_name,
quantize_config=quantize_config
)
# Prepare calibration data
from datasets import load_dataset
dataset = load_dataset("wikitext", "wikitext-2-raw-v1", split="train")
calibration_data = []
for text in dataset["text"][:128]:
if len(text.strip()) > 50:
encoded = tokenizer(
text.strip(),
return_tensors="pt",
max_length=2048,
truncation=True
)
calibration_data.append({
"input_ids": encoded["input_ids"],
"attention_mask": encoded["attention_mask"]
})
print(f"Starting GPTQ {bits}bit quantization...")
model.quantize(calibration_data)
model.save_quantized(output_dir, use_safetensors=True)
tokenizer.save_pretrained(output_dir)
print(f"Quantization complete: {output_dir}")
return model, tokenizer
def load_gptq_model(model_dir: str, device: str = "cuda"):
"""Load GPTQ quantized model"""
model = AutoGPTQForCausalLM.from_quantized(
model_dir,
device=device,
use_triton=False,
disable_exllama=False,
inject_fused_attention=True,
inject_fused_mlp=True
)
tokenizer = AutoTokenizer.from_pretrained(model_dir)
return model, tokenizer
6. AWQ: Activation-aware Weight Quantization
AWQ, published in 2023, analyzes activation distributions to protect important weight channels. (arXiv:2306.00978)
6.1 Differences from GPTQ
| Feature | GPTQ | AWQ |
|---|---|---|
| Approach | Hessian-based error compensation | Activation-based scaling |
| Calibration data | Required (128+ samples) | Required (32+ samples) |
| Speed | Slow (1–4 hours) | Fast (tens of minutes) |
| Quality | Excellent | Excellent (comparable or better) |
| Key feature | Per-channel optimization | Activation outlier handling |
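The observation behind AWQ can be reproduced in a few lines: with round-to-nearest (RTN) INT4, most of the output error comes from the few input channels with large activations, so protecting just those channels recovers most of the quality. A minimal sketch on synthetic data (the dimensions and the "salient channel" setup are illustrative, not from the paper):

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.standard_normal((256, 512))
X[:, :8] *= 20.0                          # a few salient activation channels
W = rng.standard_normal((512, 128)) * 0.02
Y = X @ W

def rtn_int4(w):
    """Round-to-nearest INT4 with a single per-tensor scale (illustrative)."""
    scale = np.abs(w).max() / 7
    return scale * np.clip(np.round(w / scale), -7, 7)

# Quantize everything: error is dominated by the salient channels
err_rtn = np.abs(Y - X @ rtn_int4(W)).mean()

# Keep only the salient rows of W (~1.5% here) in full precision
W_mixed = rtn_int4(W)
W_mixed[:8] = W[:8]
err_mixed = np.abs(Y - X @ W_mixed).mean()

# The scaling identity AWQ uses to avoid mixed precision is exact pre-quantization
s = np.abs(X).mean(axis=0) ** 0.5         # s_j = mean|X_j|^alpha, alpha = 0.5
assert np.allclose(Y, (X / s) @ (W * s[:, None]))

print(f"RTN INT4: {err_rtn:.5f}, salient-in-FP16: {err_mixed:.5f}")
```

AWQ then folds this protection into the weights via the per-channel scales s instead of keeping any rows in FP16, which preserves a uniform INT4 kernel.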
6.2 Using AutoAWQ
from awq import AutoAWQForCausalLM
from transformers import AutoTokenizer
def quantize_with_awq(
model_name: str,
output_dir: str,
bits: int = 4,
group_size: int = 128
):
"""Quantize model with AutoAWQ"""
tokenizer = AutoTokenizer.from_pretrained(
model_name,
trust_remote_code=True
)
model = AutoAWQForCausalLM.from_pretrained(
model_name,
low_cpu_mem_usage=True,
use_cache=False
)
quant_config = {
"zero_point": True,
"q_group_size": group_size,
"w_bit": bits,
"version": "GEMM"
}
print(f"Starting AWQ {bits}bit quantization...")
model.quantize(tokenizer, quant_config=quant_config)
model.save_quantized(output_dir)
tokenizer.save_pretrained(output_dir)
print(f"AWQ quantization complete: {output_dir}")
return model
def load_awq_model(model_dir: str, device: str = "cuda"):
"""Load AWQ quantized model"""
model = AutoAWQForCausalLM.from_quantized(
model_dir,
fuse_layers=True,
trust_remote_code=True,
safetensors=True
)
tokenizer = AutoTokenizer.from_pretrained(model_dir)
return model, tokenizer
7. GGUF/GGML: The llama.cpp Ecosystem
GGUF (GPT-Generated Unified Format) is the model format for the llama.cpp project, enabling efficient LLM execution even on CPU.
7.1 Understanding GGUF
GGUF was introduced in 2023 to replace GGML. It stores model metadata, hyperparameters, and tokenizer information in a single file.
7.2 Quantization Levels Comparison
| Format | Bits | Memory (7B) | PPL Increase | Recommended Use |
|---|---|---|---|---|
| Q2_K | 2.6 | 2.8 GB | High | Extreme compression |
| Q3_K_S | 3.0 | 3.3 GB | Medium | Memory saving |
| Q4_0 | 4.0 | 3.8 GB | Low | Balanced |
| Q4_K_M | 4.1 | 4.1 GB | Very low | General recommendation |
| Q5_0 | 5.0 | 4.7 GB | Minimal | High quality |
| Q5_K_M | 5.1 | 4.8 GB | Minimal | High quality recommended |
| Q6_K | 6.0 | 5.5 GB | Nearly none | Near FP16 |
| Q8_0 | 8.0 | 7.2 GB | None | Reference use |
| F16 | 16.0 | 13.5 GB | None | Baseline |
K-quants (Q4_K_M, Q5_K_M, etc.) keep some layers at higher precision to improve quality.
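The fractional bit counts in the table come from averaging over mixed precisions plus per-block scale overhead. A back-of-the-envelope sketch (the 85/15 split below is a made-up illustration, not the actual Q4_K_M tensor layout):

```python
# Hypothetical mix: 85% of weights at 4.5 bits, 15% at 6.5 bits
avg_bpw = 0.85 * 4.5 + 0.15 * 6.5        # effective bits per weight
memory_gb = 7e9 * avg_bpw / 8 / 1e9      # for a 7B-parameter model
print(f"{avg_bpw:.2f} bits/weight -> {memory_gb:.2f} GB")
```

This lands in the same ballpark as the Q4_K_M row in the table above.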
7.3 Building and Using llama.cpp
# Clone and build llama.cpp
git clone https://github.com/ggerganov/llama.cpp
cd llama.cpp
# Build with CUDA support
cmake -B build -DGGML_CUDA=ON
cmake --build build --config Release -j $(nproc)
# CPU-only build
cmake -B build
cmake --build build --config Release -j $(nproc)
# Convert HuggingFace model to GGUF
python convert_hf_to_gguf.py ./Llama-2-7b-hf \
--outfile llama2-7b-f16.gguf \
--outtype f16
# Quantize to Q4_K_M
./build/bin/llama-quantize \
llama2-7b-f16.gguf \
llama2-7b-q4_k_m.gguf \
Q4_K_M
# Run inference
./build/bin/llama-cli \
-m llama2-7b-q4_k_m.gguf \
-p "The future of AI is" \
-n 100 \
--ctx-size 4096 \
--threads 8 \
--n-gpu-layers 35
7.4 Python Bindings (llama-cpp-python)
from llama_cpp import Llama
# Load model
llm = Llama(
model_path="./llama2-7b-q4_k_m.gguf",
n_ctx=4096,
n_gpu_layers=35,
n_threads=8,
verbose=False
)
# Text generation
output = llm(
"Once upon a time",
max_tokens=200,
temperature=0.7,
top_p=0.9,
stop=["</s>", "\n\n"]
)
print(output["choices"][0]["text"])
# Chat completion format
response = llm.create_chat_completion(
messages=[
{"role": "system", "content": "You are a helpful assistant."},
{"role": "user", "content": "What is machine learning?"}
],
max_tokens=500,
temperature=0.7
)
print(response["choices"][0]["message"]["content"])
# Streaming output
for chunk in llm.create_chat_completion(
messages=[{"role": "user", "content": "Tell me a joke"}],
stream=True
):
delta = chunk["choices"][0].get("delta", {})
if "content" in delta:
print(delta["content"], end="", flush=True)
8. bitsandbytes: LLM Quantization Library
bitsandbytes, developed by Tim Dettmers, integrates seamlessly with HuggingFace transformers.
8.1 LLM.int8() — 8-bit Mixed Precision
LLM.int8() handles activation outliers in FP16 during matrix multiplication while using INT8 for the rest.
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig
import torch

# Load INT8 model (recent transformers versions expect a BitsAndBytesConfig
# instead of the deprecated load_in_8bit=True keyword)
model_8bit = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-2-7b-hf",
    quantization_config=BitsAndBytesConfig(load_in_8bit=True),
    device_map="auto"
)

def print_model_size(model, label):
    """Print model memory usage"""
    total_params = sum(p.numel() for p in model.parameters())
    total_bytes = sum(
        p.numel() * p.element_size() for p in model.parameters()
    )
    print(f"{label}: {total_params/1e9:.2f}B params, {total_bytes/1e9:.2f} GB")

print_model_size(model_8bit, "INT8 model")
# INT8 model: 6.74B params, ~7.0 GB
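The mixed-precision decomposition at the heart of LLM.int8() can be sketched in a few lines of plain PyTorch. This is a toy illustration, not the actual bitsandbytes kernel; the 6.0 threshold mirrors the paper's default outlier cutoff.

```python
import torch

torch.manual_seed(0)
X = torch.randn(8, 64)
X[:, 3] *= 50                              # one outlier hidden dimension
W = torch.randn(64, 32)

# Split hidden dimensions into outlier and regular sets
outlier = X.abs().max(dim=0).values > 6.0

# Regular part: symmetric per-tensor INT8 quantization, INT8 matmul
Xr, Wr = X[:, ~outlier], W[~outlier]
sx, sw = Xr.abs().max() / 127, Wr.abs().max() / 127
Xq = torch.round(Xr / sx).clamp(-127, 127)
Wq = torch.round(Wr / sw).clamp(-127, 127)
Y_int8 = (Xq @ Wq) * (sx * sw)

# Outlier part: kept at higher precision (FP16 in the real kernel)
Y_high = X[:, outlier] @ W[outlier]

Y = Y_int8 + Y_high
rel_err = (Y - X @ W).abs().mean() / (X @ W).abs().mean()
print(f"outlier dims: {int(outlier.sum())}, relative error: {rel_err:.4f}")
```

Without the split, the single outlier column would inflate the INT8 scale and with it the rounding error on every other channel.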
8.2 4-bit Quantization (Used in QLoRA)
import bitsandbytes as bnb
from transformers import BitsAndBytesConfig

# NF4 quantization config (QLoRA)
bnb_config_nf4 = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
    bnb_4bit_use_double_quant=True,  # Double quantization
)

# FP4 quantization config
bnb_config_fp4 = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="fp4",
    bnb_4bit_compute_dtype=torch.float16,
)

# Load model
model_4bit = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-2-7b-hf",
    quantization_config=bnb_config_nf4,
    device_map="auto"
)
print_model_size(model_4bit, "NF4 model")
# NF4 model: 6.74B params, ~4.0 GB (with double quantization)

# QLoRA fine-tuning setup
from peft import LoraConfig, get_peft_model, prepare_model_for_kbit_training

model_4bit = prepare_model_for_kbit_training(model_4bit)
lora_config = LoraConfig(
    r=64,
    lora_alpha=16,
    target_modules=["q_proj", "v_proj"],
    lora_dropout=0.05,
    bias="none",
    task_type="CAUSAL_LM"
)
model_lora = get_peft_model(model_4bit, lora_config)
model_lora.print_trainable_parameters()
# trainable params: ~33.6M (r=64 on q_proj/v_proj across 32 layers), under 1% of the total
8.3 NF4 vs FP4
NF4 (Normal Float 4)
- Non-linear 4-bit quantization assuming a normal distribution
- Leverages the observation that weight distributions are approximately normal
- Better representational power at the same bit count
FP4 (Float 4)
- Floating-point based 4-bit
- Can represent wider ranges
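The NF4 construction can be illustrated with a short sketch. This reproduces the idea, placing code values at evenly spaced quantiles of a standard normal, not the exact 16-value table that bitsandbytes ships:

```python
import torch

# Place 16 code values at equally spaced quantiles of N(0, 1),
# then normalize into [-1, 1]; tails are clipped to avoid +/- infinity.
norm = torch.distributions.Normal(0.0, 1.0)
probs = torch.linspace(0.02, 0.98, 16)
levels = norm.icdf(probs)
levels = levels / levels.abs().max()

# Levels cluster near 0, where normally distributed weights concentrate
print(levels)
center_gap = (levels[8] - levels[7]).item()
edge_gap = (levels[1] - levels[0]).item()
print(f"spacing near 0: {center_gap:.3f}, spacing at the edge: {edge_gap:.3f}")
```

The denser spacing near zero is exactly why NF4 beats uniform INT4 on normally distributed weights.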
9. SmoothQuant: W8A8 Quantization
SmoothQuant quantizes both weights (W) and activations (A) to INT8 for faster inference.
9.1 The Activation Outlier Problem
LLM activations exhibit very large values (outliers) in specific channels, making W8A8 quantization challenging.
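A synthetic example makes the problem concrete, assuming plain symmetric per-tensor absmax INT8 quantization: a single channel scaled 100x inflates the rounding error on every other channel.

```python
import torch

torch.manual_seed(0)

def int8_error(x: torch.Tensor) -> float:
    """Mean absolute error of symmetric per-tensor absmax INT8 quantization."""
    scale = x.abs().max() / 127
    q = torch.round(x / scale).clamp(-127, 127)
    return (x - q * scale).abs().mean().item()

acts = torch.randn(1024, 64)       # well-behaved activations
err_clean = int8_error(acts)

acts_outlier = acts.clone()
acts_outlier[:, 0] *= 100          # one outlier channel, as seen in LLMs
err_outlier = int8_error(acts_outlier)

print(f"error without outlier: {err_clean:.5f}")
print(f"error with outlier:    {err_outlier:.5f}")  # roughly two orders of magnitude worse
```

The outlier stretches the quantization scale, so the 254 available steps are wasted on a range that almost no value occupies.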
9.2 Migration Scaling
SmoothQuant's key insight: transfer the difficulty from activations to weights.
Y = X * W = (X * diag(s)^(-1)) * (diag(s) * W) = X_smooth * W_smooth
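Because the migration only moves a diagonal scale between the two factors, the layer output is mathematically unchanged, which a quick numeric check confirms (illustrative code, not part of the SmoothQuant library):

```python
import torch

torch.manual_seed(0)
X = torch.randn(4, 16)        # activations (batch x in_features)
W = torch.randn(16, 8)        # weights (in_features x out_features)
s = torch.rand(16) + 0.5      # per-input-channel migration scale

X_smooth = X / s              # X @ diag(s)^-1
W_smooth = s.unsqueeze(1) * W # diag(s) @ W

# The smoothed product matches the original up to float rounding
diff = (X_smooth @ W_smooth - X @ W).abs().max().item()
print(f"max difference: {diff:.2e}")
```

What changes is only how the quantization difficulty is shared: activations become flatter and weights absorb part of the dynamic range.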
import torch

def smooth_quantize(
    model,
    calibration_samples,
    alpha: float = 0.5
):
    """
    Apply SmoothQuant migration scaling to all Linear layers.
    Args:
        alpha: migration strength (0=weights only, 1=activations only)
               Recommended: 0.5 (equal distribution)
    """
    act_scales = {}

    def collect_scales(name):
        def hook(module, input, output):
            inp = input[0].detach()
            if inp.dim() == 3:
                inp = inp.reshape(-1, inp.size(-1))
            channel_max = inp.abs().max(dim=0)[0]
            if name not in act_scales:
                act_scales[name] = channel_max
            else:
                act_scales[name] = torch.maximum(act_scales[name], channel_max)
        return hook

    # Collect per-input-channel activation maxima on calibration data
    handles = []
    for name, module in model.named_modules():
        if isinstance(module, torch.nn.Linear):
            handles.append(module.register_forward_hook(collect_scales(name)))
    with torch.no_grad():
        for sample in calibration_samples:
            model(**sample)
    for h in handles:
        h.remove()

    for name, module in model.named_modules():
        if isinstance(module, torch.nn.Linear) and name in act_scales:
            act_scale = act_scales[name]
            weight_scale = module.weight.abs().max(dim=0)[0]
            # Migration scale: s_j = max|X_j|^alpha / max|W_j|^(1-alpha)
            smooth_scale = (act_scale ** alpha) / (weight_scale ** (1 - alpha))
            smooth_scale = torch.clamp(smooth_scale, min=1e-5)
            # W_smooth = diag(s) @ W: multiply each input channel of the weight
            # by s (the matching 1/s must be folded into the preceding
            # LayerNorm/op so the layer output stays unchanged)
            module.weight.data = module.weight.data * smooth_scale.unsqueeze(0)
    return model, act_scales
10. SpQR: Sparse Quantization Representation
SpQR stores important weights (outliers) in FP16 separately and quantizes the remainder to low precision.
import torch
def spqr_quantize(weight: torch.Tensor,
                  num_bits: int = 3,
                  outlier_threshold_percentile: float = 1.0):
    """
    SpQR quantization (simplified version)
    Core: store top p% outliers as FP16, quantize rest to low bits
    """
    threshold = torch.quantile(weight.abs(), 1 - outlier_threshold_percentile / 100)
    outlier_mask = weight.abs() > threshold

    # Store outliers (FP16)
    outlier_values = weight.clone()
    outlier_values[~outlier_mask] = 0

    # Quantize remainder in groups of 16
    regular_weight = weight.clone()
    regular_weight[outlier_mask] = 0
    q_max = 2 ** (num_bits - 1) - 1
    group_size = 16
    rows, cols = regular_weight.shape
    regular_grouped = regular_weight.reshape(-1, group_size)
    max_abs = regular_grouped.abs().max(dim=1, keepdim=True)[0]
    scales = torch.clamp(max_abs / q_max, min=1e-8)
    q = torch.clamp(torch.round(regular_grouped / scales), -q_max, q_max).to(torch.int8)
    regular_dequant = (scales * q.float()).reshape(rows, cols)

    reconstructed = regular_dequant + outlier_values
    error = (weight - reconstructed).abs().mean().item()

    # Memory accounting: 2 bytes per FP16 outlier, num_bits/8 per regular weight
    outlier_memory = outlier_mask.sum().item() * 2
    regular_memory = (~outlier_mask).sum().item() * (num_bits / 8)
    total_memory = outlier_memory + regular_memory
    original_memory = weight.numel() * weight.element_size()
    compression_ratio = original_memory / total_memory

    print(f"Outlier ratio: {outlier_mask.float().mean():.2%}")
    print(f"Mean reconstruction error: {error:.6f}")
    print(f"Compression ratio: {compression_ratio:.2f}x")
    return q, scales, outlier_values, outlier_mask
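The memory arithmetic behind SpQR is easy to sanity-check on its own: with 1% of weights kept in FP16 and the rest at 3 bits, the average cost per weight is about 0.39 bytes, roughly a 10x compression over FP32 (ignoring the index overhead for locating outliers, which the full format also pays):

```python
import torch

torch.manual_seed(0)
w = torch.randn(512, 512)

# Mark the top 1% of weights (by magnitude) as FP16 outliers
thr = torch.quantile(w.abs(), 0.99)
mask = w.abs() > thr

bytes_outliers = mask.sum().item() * 2        # FP16 outliers: 2 bytes each
bytes_regular = (~mask).sum().item() * 3 / 8  # 3-bit regular weights
ratio = (w.numel() * 4) / (bytes_outliers + bytes_regular)
print(f"outlier fraction: {mask.float().mean():.2%}, compression vs FP32: {ratio:.1f}x")
```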
11. Quantization Benchmark Comparison
11.1 Llama-2-7B Benchmark
import time
import torch
import GPUtil
def benchmark_quantization(model, tokenizer, device="cuda", num_runs=50):
    """Benchmark a (quantized) model: memory, latency, tokens/sec"""
    prompt = "The history of artificial intelligence began"
    inputs = tokenizer(prompt, return_tensors="pt").to(device)

    memory_used_gb = None  # stays None on CPU, where GPUtil does not apply
    if device == "cuda":
        torch.cuda.synchronize()
        gpu = GPUtil.getGPUs()[0]
        memory_used_gb = gpu.memoryUsed / 1024  # MiB -> GiB

    # Warmup
    with torch.no_grad():
        for _ in range(5):
            model.generate(
                **inputs,
                max_new_tokens=50,
                do_sample=False
            )

    # Measure speed
    if device == "cuda":
        torch.cuda.synchronize()
    start = time.time()
    with torch.no_grad():
        for _ in range(num_runs):
            model.generate(
                **inputs,
                max_new_tokens=50,
                do_sample=False
            )
    if device == "cuda":
        torch.cuda.synchronize()
    elapsed = time.time() - start

    avg_time = elapsed / num_runs
    tokens_per_second = 50 / avg_time
    return {
        "memory_gb": memory_used_gb,
        "avg_time_ms": avg_time * 1000,
        "tokens_per_second": tokens_per_second
    }
# Example results (A100 80GB, Llama-2-7B)
benchmark_results = {
"FP16": {"memory_gb": 13.5, "tokens_per_second": 52.3, "ppl": 5.68},
"INT8 (bitsandbytes)": {"memory_gb": 7.8, "tokens_per_second": 38.1, "ppl": 5.71},
"INT4 GPTQ": {"memory_gb": 4.5, "tokens_per_second": 65.2, "ppl": 5.89},
"INT4 AWQ": {"memory_gb": 4.3, "tokens_per_second": 68.7, "ppl": 5.86},
"Q4_K_M (GGUF, CPU)": {"memory_gb": 4.1, "tokens_per_second": 45.2, "ppl": 5.91},
"INT4 NF4": {"memory_gb": 4.0, "tokens_per_second": 31.5, "ppl": 5.94},
}
print("=" * 80)
print(f"{'Method':<25} {'Memory (GB)':<12} {'Tok/s':<12} {'PPL':<8}")
print("=" * 80)
for method, stats in benchmark_results.items():
    print(f"{method:<25} {stats['memory_gb']:<12.1f} {stats['tokens_per_second']:<12.1f} {stats['ppl']:<8.2f}")
12. Practical Guide: Choosing the Right Quantization Method
12.1 Strategy by Model Size
Small models under 7B:
- GGUF Q4_K_M: optimal for local CPU execution
- AWQ INT4: recommended for GPU server deployment
- FP16 viable if memory allows (under 24GB GPU)
Mid-size models 13B–30B:
- GPTQ INT4 or AWQ INT4: runs on a single 24GB GPU
- GGUF Q4_K_M: a 13B model runs in roughly 16GB of RAM
Large models 70B+:
- GPTQ INT4: runs on a single A100 80GB
- GPTQ INT2/INT3: extreme compression, with noticeable quality loss
- Multi-GPU + Tensor Parallel combination
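The size guidance above follows from a simple rule of thumb: params x bits / 8, plus some overhead for scales and zero-points. The helper below is illustrative; the 10% overhead figure is an assumption, and the KV cache still comes on top.

```python
def quantized_weight_gb(params_b: float, bits: float, overhead: float = 1.10) -> float:
    """Approximate weight memory in GB for a params_b-billion-parameter model."""
    return params_b * bits / 8 * overhead

# e.g. a 70B model at 4-bit needs ~38.5 GB, which is why it fits on one A100 80GB
for params_b in (7, 13, 70):
    for bits in (16, 8, 4):
        print(f"{params_b:>3}B @ {bits:>2}-bit: ~{quantized_weight_gb(params_b, bits):.1f} GB")
```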
12.2 Strategy by Task
def recommend_quantization(
    task: str,
    model_size_b: float,
    gpu_memory_gb: float,
    cpu_only: bool = False,
    fine_tuning_needed: bool = False
):
    """Recommend a quantization method based on task and environment"""
    recommendations = []

    if cpu_only:
        recommendations.append({
            "method": "GGUF Q4_K_M",
            "reason": "Optimized for CPU inference, based on llama.cpp",
            "library": "llama-cpp-python"
        })
        return recommendations

    if fine_tuning_needed:
        recommendations.append({
            "method": "bitsandbytes NF4 + QLoRA",
            "reason": "Fine-tuning capable, LoRA adapter training with ~4GB overhead",
            "library": "bitsandbytes + peft"
        })
        return recommendations

    # Rough weight-memory estimates (GB); keep ~20% headroom for the KV cache
    fp16_memory = model_size_b * 2
    int8_memory = model_size_b * 1
    int4_memory = model_size_b * 0.5

    if fp16_memory <= gpu_memory_gb * 0.8:
        recommendations.append({
            "method": "FP16 (baseline)",
            "reason": "Memory is sufficient, best quality",
            "memory_gb": fp16_memory
        })
    if int8_memory <= gpu_memory_gb * 0.8:
        if task in ["chat", "completion", "summarization"]:
            recommendations.append({
                "method": "INT8 (bitsandbytes LLM.int8())",
                "reason": "Good balance of quality and memory",
                "library": "bitsandbytes",
                "memory_gb": int8_memory
            })
    if int4_memory <= gpu_memory_gb * 0.8:
        recommendations.append({
            "method": "AWQ INT4",
            "reason": "Fast inference, excellent quality",
            "library": "autoawq",
            "memory_gb": int4_memory
        })
        recommendations.append({
            "method": "GPTQ INT4",
            "reason": "Best INT4 quality, slower quantization process",
            "library": "auto-gptq",
            "memory_gb": int4_memory
        })
    return recommendations
# Example usage
recommendations = recommend_quantization(
task="chat",
model_size_b=7.0,
gpu_memory_gb=16.0,
fine_tuning_needed=False
)
for rec in recommendations:
    print(f"\nMethod: {rec['method']}")
    print(f"Reason: {rec['reason']}")
    if 'library' in rec:
        print(f"Library: {rec['library']}")
    if 'memory_gb' in rec:
        print(f"Expected memory: {rec['memory_gb']:.1f} GB")
12.3 Complete Quantization Pipeline
import torch
import os
from auto_gptq import AutoGPTQForCausalLM, BaseQuantizeConfig
from awq import AutoAWQForCausalLM
from transformers import AutoTokenizer
class QuantizationPipeline:
    """Unified quantization pipeline"""

    def __init__(self, model_name: str, output_base_dir: str):
        self.model_name = model_name
        self.output_base_dir = output_base_dir
        self.tokenizer = AutoTokenizer.from_pretrained(model_name)
        os.makedirs(output_base_dir, exist_ok=True)

    def quantize_gptq(self, bits: int = 4, group_size: int = 128):
        """GPTQ quantization"""
        output_dir = os.path.join(self.output_base_dir, f"gptq-{bits}bit")
        config = BaseQuantizeConfig(
            bits=bits,
            group_size=group_size,
            sym=True,
            desc_act=False
        )
        model = AutoGPTQForCausalLM.from_pretrained(
            self.model_name,
            quantize_config=config
        )
        calibration_data = self._prepare_calibration_data()
        model.quantize(calibration_data)
        model.save_quantized(output_dir)
        self.tokenizer.save_pretrained(output_dir)
        print(f"GPTQ {bits}bit saved: {output_dir}")
        return output_dir

    def quantize_awq(self, bits: int = 4, group_size: int = 128):
        """AWQ quantization"""
        output_dir = os.path.join(self.output_base_dir, f"awq-{bits}bit")
        model = AutoAWQForCausalLM.from_pretrained(
            self.model_name,
            low_cpu_mem_usage=True
        )
        quant_config = {
            "zero_point": True,
            "q_group_size": group_size,
            "w_bit": bits,
            "version": "GEMM"
        }
        model.quantize(self.tokenizer, quant_config=quant_config)
        model.save_quantized(output_dir)
        self.tokenizer.save_pretrained(output_dir)
        print(f"AWQ {bits}bit saved: {output_dir}")
        return output_dir

    def _prepare_calibration_data(self, num_samples: int = 128):
        """Prepare calibration data (auto-gptq expects a list of dicts)"""
        from datasets import load_dataset

        dataset = load_dataset("wikitext", "wikitext-2-raw-v1", split="train")
        data = []
        for text in dataset["text"]:
            if len(text.strip()) > 50:
                encoded = self.tokenizer(
                    text.strip(),
                    return_tensors="pt",
                    max_length=2048,
                    truncation=True
                )
                data.append({
                    "input_ids": encoded["input_ids"],
                    "attention_mask": encoded["attention_mask"]
                })
            if len(data) >= num_samples:
                break
        return data
Conclusion
Model quantization is a foundational technology for democratizing LLMs. To summarize what we covered:
- Fundamentals: The math behind compressing FP32 to INT4 (scale, zero_point)
- PTQ vs QAT: PTQ is practical without retraining; QAT is essential for extreme compression
- GPTQ: Best INT4 quality via Hessian-based error compensation
- AWQ: Fast and efficient quantization based on activation distributions
- GGUF: Optimized for CPU execution, multiple quality levels available
- bitsandbytes: HuggingFace integration, essential for QLoRA fine-tuning
Recommended strategies:
- Local execution: GGUF Q4_K_M
- GPU server deployment: AWQ 4-bit
- Quality-critical scenarios: GPTQ 4-bit or FP16
- Fine-tuning needed: bitsandbytes NF4 + QLoRA
Quantization technology is evolving rapidly, with 2-bit methods like QuIP# and AQLM emerging. The journey toward smaller, faster models continues.
References
- GPTQ: arXiv:2210.17323
- AWQ: arXiv:2306.00978
- llama.cpp: github.com/ggerganov/llama.cpp
- bitsandbytes: github.com/TimDettmers/bitsandbytes
- SmoothQuant: arXiv:2211.10438
- SpQR: arXiv:2306.03078
- PyTorch Quantization: pytorch.org/docs/stable/quantization.html