Complete LLM Quantization Comparison — GPTQ vs AWQ vs GGUF
- 1. What is Quantization?
- 2. GPTQ (GPU-Optimized Quantization)
- 3. AWQ (Activation-aware Weight Quantization)
- 4. GGUF (llama.cpp Format)
- 5. Using Quantized Models with vLLM
- 6. Benchmark Comparison
- 7. Practical Selection Guide
- 8. Quiz

1. What is Quantization?
Quantization is a technique that represents model weights at lower precision to improve memory usage and inference speed. Reducing FP16 (16-bit) weights to INT4 (4-bit) saves approximately 4x memory.
Types of Quantization
- PTQ (Post-Training Quantization): Quantization after training. GPTQ, AWQ, and GGUF all use PTQ
- QAT (Quantization-Aware Training): Simulates quantization during training
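The PTQ idea fits in a few lines. A minimal, library-agnostic sketch of symmetric absmax INT8 quantization with illustrative values (not tied to any specific framework):

```python
# Minimal post-training quantization sketch (library-agnostic, illustrative):
# symmetric absmax INT8 quantize/dequantize round-trip on a weight vector.
def quantize_int8(weights):
    scale = max(abs(w) for w in weights) / 127.0  # map [-max, max] to [-127, 127]
    q = [max(-127, min(127, round(w / scale))) for w in weights]
    return q, scale

def dequantize(q, scale):
    return [v * scale for v in q]

weights = [0.31, -1.20, 0.05, 2.54, -0.88]
q, scale = quantize_int8(weights)
restored = dequantize(q, scale)
max_err = max(abs(a - b) for a, b in zip(weights, restored))
# max_err is bounded by scale / 2: half of one quantization step
```

The round-trip error per weight never exceeds half a quantization step, which is why larger outliers (which inflate `scale`) hurt everyone else in the tensor.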
Memory by Bit Width
| Precision | 7B Model Memory | 70B Model Memory |
|---|---|---|
| FP32 | 28 GB | 280 GB |
| FP16 | 14 GB | 140 GB |
| INT8 | 7 GB | 70 GB |
| INT4 | 3.5 GB | 35 GB |
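The table rows follow from a one-line formula; a quick sanity check (weights only, ignoring KV cache and activation memory):

```python
# Sanity check for the table above: weight memory is simply
# parameter count x bits per weight / 8 bytes. KV cache and
# activations are extra and not counted here.
def weight_memory_gb(n_params: float, bits: int) -> float:
    return n_params * bits / 8 / 1e9

for bits in (32, 16, 8, 4):
    print(f"7B @ {bits}-bit: {weight_memory_gb(7e9, bits):.1f} GB")
```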
2. GPTQ (GPU-Optimized Quantization)
GPTQ performs layer-wise quantization based on Optimal Brain Quantization (OBQ). It uses calibration data to minimize quantization error.
Running GPTQ Quantization
from transformers import AutoModelForCausalLM, AutoTokenizer
from optimum.gptq import GPTQQuantizer
# Load model
model_id = "meta-llama/Llama-3.1-8B-Instruct"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id, torch_dtype="auto", device_map="auto"
)
# Prepare calibration data (GPTQQuantizer takes raw text strings;
# empty wikitext rows are skipped)
from datasets import load_dataset
dataset = load_dataset("wikitext", "wikitext-2-raw-v1", split="train")
calibration_data = [row["text"] for row in dataset.select(range(512)) if row["text"].strip()][:128]
# GPTQ quantization
quantizer = GPTQQuantizer(
    bits=4,
    group_size=128,
    desc_act=True,  # quantize in activation order
    damp_percent=0.01,
    dataset=calibration_data,
)
quantized_model = quantizer.quantize_model(model, tokenizer)
# Save
quantized_model.save_pretrained("./llama-3.1-8b-gptq-4bit")
tokenizer.save_pretrained("./llama-3.1-8b-gptq-4bit")
GPTQ Characteristics
- GPU-optimized kernels (ExLlama, Marlin)
- Requires calibration data (typically 128-256 samples)
- Smaller group_size means higher accuracy but slower speed
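The group_size trade-off can be illustrated with a small sketch. This is not GPTQ itself (GPTQ additionally reorders columns and error-compensates using Hessian information); it only shows the grouping mechanics: each group shares one scale, so smaller groups track local weight magnitudes more closely, at the cost of storing and applying more scales.

```python
# Group-wise quantization sketch: one scale per group of consecutive
# weights. Illustrative only; GPTQ adds Hessian-based error compensation
# on top of this grouping.
def quantize_grouped(weights, bits=4, group_size=4):
    qmax = 2 ** (bits - 1) - 1  # 7 for signed 4-bit
    quantized, scales = [], []
    for i in range(0, len(weights), group_size):
        group = weights[i:i + group_size]
        scale = max(abs(w) for w in group) / qmax or 1.0  # guard all-zero groups
        scales.append(scale)
        quantized.extend(max(-qmax, min(qmax, round(w / scale))) for w in group)
    return quantized, scales

def dequantize_grouped(quantized, scales, group_size=4):
    return [q * scales[i // group_size] for i, q in enumerate(quantized)]

# The second group's outlier (3.0) forces a coarse scale on its neighbors;
# a smaller group_size would isolate the outlier and reduce their error.
w = [0.1, -0.2, 0.7, -0.05, 3.0, 1.5, -2.0, 0.25]
q, scales = quantize_grouped(w)
restored = dequantize_grouped(q, scales)
```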
3. AWQ (Activation-aware Weight Quantization)
AWQ is a method that analyzes activation distributions to protect important weight channels. Instead of quantizing all weights equally, it scales weights of channels with large activation magnitudes to preserve precision.
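The scaling trick can be shown with toy numbers. This is a conceptual sketch, not the AutoAWQ implementation, and all values are made up: multiplying a salient weight channel by s before quantization and dividing the matching activation by s leaves the exact product unchanged while reducing the relative quantization error on that channel.

```python
# Toy illustration of AWQ's channel scaling (conceptual, not the AutoAWQ
# implementation). All numbers are invented for the example.
def q4(w, scale):
    """Quantize one weight to signed 4-bit with a fixed group scale."""
    return max(-7, min(7, round(w / scale))) * scale

w, x = 0.011, 100.0   # small weight, large activation: a salient channel
scale = 1.0 / 7       # group scale dominated by other, larger weights

naive = q4(w, scale) * x            # w rounds to 0: the output is lost entirely
s = 16.0                            # per-channel factor (AWQ searches for this)
aware = q4(w * s, scale) * (x / s)  # (s*w) quantized, times x/s, stays near w*x = 1.1
```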
Running AWQ Quantization
from awq import AutoAWQForCausalLM
from transformers import AutoTokenizer
model_id = "meta-llama/Llama-3.1-8B-Instruct"
quant_path = "./llama-3.1-8b-awq-4bit"
# Load model
model = AutoAWQForCausalLM.from_pretrained(model_id)
tokenizer = AutoTokenizer.from_pretrained(model_id)
# AWQ quantization config
quant_config = {
    "zero_point": True,
    "q_group_size": 128,
    "w_bit": 4,
    "version": "GEMM",  # GEMM or GEMV
}
# Run quantization
model.quantize(tokenizer, quant_config=quant_config)
# Save
model.save_quantized(quant_path)
tokenizer.save_pretrained(quant_path)
Key Differences Between AWQ and GPTQ
| Item | GPTQ | AWQ |
|---|---|---|
| Approach | Layer-wise error minimization | Activation-based channel protection |
| Calibration | 128+ samples required | Possible with fewer samples |
| Quantization Speed | Slow (several hours) | Fast (tens of minutes) |
| Inference Speed | Fast with ExLlama kernel | Fast with GEMM kernel |
| Accuracy | High | Similar or slightly better |
4. GGUF (llama.cpp Format)
GGUF is a quantization format used by llama.cpp that supports hybrid CPU + GPU inference.
GGUF Quantization Method Types
| Method | Effective bits/weight | Description |
|---|---|---|
| Q2_K | 2.5 | Extreme compression, significant quality loss |
| Q4_K_M | 4.8 | Most popular balance point |
| Q5_K_M | 5.5 | High quality retention |
| Q6_K | 6.6 | Near FP16 quality |
| Q8_0 | 8.0 | Nearly lossless |
| IQ4_XS | 4.3 | Importance matrix based |
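Because these are effective bits per weight, approximate file sizes follow directly. A rough estimate, assuming an 8B-parameter model and ignoring the small metadata/tokenizer overhead that real GGUF files carry:

```python
# Rough GGUF file-size estimate from effective bits per weight.
# Assumes 8e9 parameters (Llama-3.1-8B); real files add a small
# metadata/tokenizer overhead on top of this.
def gguf_size_gb(n_params: float, bits_per_weight: float) -> float:
    return n_params * bits_per_weight / 8 / 1e9

sizes = {name: round(gguf_size_gb(8e9, bpw), 1)
         for name, bpw in [("Q4_K_M", 4.8), ("Q5_K_M", 5.5), ("Q8_0", 8.0)]}
# e.g. Q4_K_M lands around 4.8 GB for an 8B model before overhead
```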
Running GGUF Conversion
# Build llama.cpp
git clone https://github.com/ggerganov/llama.cpp
cd llama.cpp && make -j$(nproc)
# Convert HuggingFace model to GGUF
python convert_hf_to_gguf.py \
../Llama-3.1-8B-Instruct \
--outfile llama-3.1-8b-f16.gguf \
--outtype f16
# Apply quantization
./llama-quantize \
llama-3.1-8b-f16.gguf \
llama-3.1-8b-Q4_K_M.gguf \
Q4_K_M
# More accurate quantization with importance matrix (IQ series)
./llama-imatrix \
-m llama-3.1-8b-f16.gguf \
-f calibration.txt \
-o imatrix.dat
./llama-quantize \
--imatrix imatrix.dat \
llama-3.1-8b-f16.gguf \
llama-3.1-8b-IQ4_XS.gguf \
IQ4_XS
5. Using Quantized Models with vLLM
# GPTQ model
vllm serve ./llama-3.1-8b-gptq-4bit \
--quantization gptq \
--max-model-len 8192
# AWQ model
vllm serve ./llama-3.1-8b-awq-4bit \
--quantization awq \
--max-model-len 8192
# GGUF model (vLLM 0.6+)
vllm serve ./llama-3.1-8b-Q4_K_M.gguf \
--max-model-len 8192
llama.cpp Server
./llama-server \
    -m llama-3.1-8b-Q4_K_M.gguf \
    --host 0.0.0.0 \
    --port 8080 \
    -ngl 99 \
    -c 8192 \
    --n-predict 2048
# -ngl: number of layers offloaded to the GPU; -c: context length
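Both servers expose an OpenAI-compatible HTTP API, so a single client can target either. A minimal sketch; the ports follow the commands above, and the model name is an assumption (check /v1/models for what your server actually reports):

```python
import json

# Sketch of an OpenAI-compatible request body that works against both
# backends. Assumptions: vLLM on port 8000 (its default), llama-server
# on port 8080 as started above; the model name must match what the
# server lists under /v1/models.
def build_chat_request(model: str, prompt: str, max_tokens: int = 256) -> bytes:
    payload = {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
        "max_tokens": max_tokens,
        "temperature": 0.7,
    }
    return json.dumps(payload).encode("utf-8")

body = build_chat_request("./llama-3.1-8b-awq-4bit", "Summarize AWQ in one sentence.")
# POST `body` with Content-Type: application/json to
#   http://localhost:8000/v1/chat/completions   (vLLM)
#   http://localhost:8080/v1/chat/completions   (llama-server)
```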
6. Benchmark Comparison
Benchmark for 7B model (RTX 4090):
| Method | VRAM | tok/s (generation) | Perplexity | Quantization Time |
|---|---|---|---|---|
| FP16 | 14 GB | 95 | 5.12 | - |
| GPTQ-4bit | 4.2 GB | 145 | 5.28 | 2-4 hours |
| AWQ-4bit | 4.0 GB | 150 | 5.25 | 30 min |
| GGUF Q4_K_M | 4.5 GB | 130 | 5.30 | 5 min |
| GGUF Q5_K_M | 5.3 GB | 120 | 5.18 | 5 min |
7. Practical Selection Guide
- GPU-only serving (vLLM) → AWQ (fast quantization + good performance)
- GPU-only serving (accuracy-focused) → GPTQ with desc_act=True
- CPU + GPU hybrid → GGUF Q4_K_M
- MacBook / CPU-only → GGUF Q4_K_M or Q5_K_M
- Extreme compression → GGUF IQ4_XS (with imatrix)
8. Quiz
Q1: Why is AWQ faster to quantize than GPTQ?
GPTQ computes the Hessian inverse for each layer and quantizes weights sequentially, which is computationally expensive. AWQ only analyzes activation distributions to identify important channels and applies scaling, enabling fast quantization without complex optimization processes.
Q2: What do the K and M in GGUF Q4_K_M stand for?
K: Refers to the K-quant method. It is a mixed quantization approach that applies different bit widths depending on the importance of each layer. M: Stands for Medium quality. There are S (Small), M (Medium), and L (Large) variants, where L allocates higher bits to more layers.
Q3: Should you choose GPTQ or AWQ for vLLM?
In most cases, AWQ is recommended:
- Quantization is much faster (about 30 minutes vs several hours)
- Inference performance is similar or slightly better
- The perplexity difference is negligible
However, if maximum accuracy is required, GPTQ (desc_act=True) may have a slight edge.