- Introduction
- Quantization Fundamentals
- GPTQ: Generative Pre-trained Transformer Quantization
- AWQ: Activation-aware Weight Quantization
- GGUF: The Universal Format for the GPU-poor
- BitsAndBytes and QLoRA Integration
- Comprehensive Comparison: GPTQ vs AWQ vs GGUF vs BitsAndBytes
- Model-Specific Quantization Quality Guide
- Operational Considerations and Troubleshooting
- Conclusion
- References

Introduction
Loading a 70B-parameter LLM in FP16 requires approximately 140GB of VRAM; even two A100 80GB GPUs are barely sufficient. Quantization changes this picture: with INT4, the same 70B model shrinks to approximately 35GB of weights and fits on a single A100 80GB.
Between 2024 and 2026, quantization techniques have evolved rapidly. Various methods including GPTQ, AWQ, GGUF, and BitsAndBytes have emerged, each with different accuracy-speed-memory trade-offs. In this article, we analyze the principles of each technique in depth and guide you to the optimal choice through practical code and benchmarks.
Quantization Fundamentals
The Journey from FP16 to INT4
Quantization is the process of converting high-precision floating-point (FP16/FP32) weights to lower-bit integers (INT8/INT4).
| Data Type | Bits | Range | 70B Model Memory |
|---|---|---|---|
| FP32 | 32bit | ~+/-3.4x10^38 | ~280GB |
| FP16 | 16bit | ~+/-6.5x10^4 | ~140GB |
| INT8 | 8bit | -128 ~ 127 | ~70GB |
| INT4 | 4bit | -8 ~ 7 | ~35GB |
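The memory figures in the table follow directly from parameter count times bits per parameter; a quick sanity check (weights only, ignoring KV cache and activation memory):

```python
def model_memory_gb(n_params: float, bits: int) -> float:
    """Weights-only memory in GB: parameters x bits, converted to bytes."""
    return n_params * bits / 8 / 1e9

for dtype, bits in [("FP32", 32), ("FP16", 16), ("INT8", 8), ("INT4", 4)]:
    print(f"{dtype:>4}: {model_memory_gb(70e9, bits):.0f} GB")  # 280, 140, 70, 35
```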
Absmax Quantization
The simplest quantization method, scaling based on the absolute maximum value of the tensor.
import torch

def absmax_quantize(tensor: torch.Tensor, bits: int = 8) -> tuple:
    """Absmax quantization: scaling based on the absolute maximum value."""
    qmax = 2 ** (bits - 1) - 1  # 127 for INT8
    scale = tensor.abs().max() / qmax
    quantized = torch.round(tensor / scale).clamp(-qmax, qmax).to(torch.int8)
    return quantized, scale

def absmax_dequantize(quantized: torch.Tensor, scale: float) -> torch.Tensor:
    """Dequantization: restore to the original scale."""
    return quantized.float() * scale

# Example: quantize FP16 weights
weight = torch.randn(4096, 4096, dtype=torch.float16)
q_weight, scale = absmax_quantize(weight.float())
restored = absmax_dequantize(q_weight, scale)
error = (weight.float() - restored).abs().mean()
print(f"Mean quantization error: {error:.6f}")
Zero-Point Quantization
Absmax is inefficient when the distribution is not symmetric. Zero-Point quantization handles asymmetric distributions.
import torch

def zeropoint_quantize(tensor: torch.Tensor, bits: int = 8) -> tuple:
    """Zero-point quantization: handles asymmetric distributions."""
    qmin = -(2 ** (bits - 1))
    qmax = 2 ** (bits - 1) - 1
    rmin, rmax = tensor.min(), tensor.max()
    scale = (rmax - rmin) / (qmax - qmin)
    zero_point = torch.round(qmin - rmin / scale).clamp(qmin, qmax)
    quantized = torch.round(tensor / scale + zero_point).clamp(qmin, qmax).to(torch.int8)
    return quantized, scale, zero_point

def zeropoint_dequantize(quantized: torch.Tensor, scale: float, zero_point: float) -> torch.Tensor:
    """Zero-point dequantization."""
    return (quantized.float() - zero_point) * scale

# Test with an asymmetric distribution (positive-biased, like ReLU output)
weight = torch.randn(4096, 4096).abs()  # Positive only
q_weight, scale, zp = zeropoint_quantize(weight)
restored = zeropoint_dequantize(q_weight, scale, zp)
error = (weight - restored).abs().mean()
print(f"Zero-point quantization mean error: {error:.6f}")
Group-wise Quantization
Quantizing in small groups (typically 128 elements) rather than the entire tensor significantly improves accuracy. Both GPTQ and AWQ use group_size=128 as the default setting.
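The idea can be sketched by extending the absmax scheme above to one scale per 128-element group; the per-tensor comparison below illustrates why group-wise scales help (one large outlier no longer inflates the step size for the whole tensor):

```python
import torch

def groupwise_absmax_quantize(tensor: torch.Tensor, group_size: int = 128,
                              bits: int = 4) -> tuple:
    """Absmax quantization with one scale per group instead of per tensor."""
    qmax = 2 ** (bits - 1) - 1                       # 7 for INT4
    groups = tensor.reshape(-1, group_size)          # one row per group
    scales = groups.abs().amax(dim=1, keepdim=True) / qmax
    quantized = torch.round(groups / scales).clamp(-qmax, qmax)
    return quantized.to(torch.int8), scales

def groupwise_dequantize(quantized, scales, shape):
    return (quantized.float() * scales).reshape(shape)

weight = torch.randn(1024, 1024)
q, s = groupwise_absmax_quantize(weight)
restored = groupwise_dequantize(q, s, weight.shape)

# Per-tensor INT4 absmax for comparison
scale_t = weight.abs().max() / 7
restored_t = torch.round(weight / scale_t).clamp(-7, 7) * scale_t

print(f"Per-tensor INT4 error: {(weight - restored_t).abs().mean():.5f}")
print(f"Group-wise INT4 error: {(weight - restored).abs().mean():.5f}")
```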
GPTQ: Generative Pre-trained Transformer Quantization
Algorithm Details
GPTQ is a Post-Training Quantization (PTQ) method proposed by Frantar et al. in 2022, based on the OBQ (Optimal Brain Quantization) algorithm. The core ideas are:
- Layer-wise quantization: Rather than quantizing the entire model at once, it processes layer by layer sequentially
- Hessian-based correction: Compensates for quantization-induced output error by distributing it to remaining weights using the inverse Hessian matrix
- Column order optimization: Optimizes the quantization order to minimize error propagation
Mathematically, when quantizing each weight w_q, the remaining weights are updated as follows:
delta_w = -(w - quant(w)) / H_inv[q,q] * H_inv[:, q]
Here, H_inv is the inverse Hessian matrix. This allows quantization error to be redistributed across other weights.
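The update rule can be demonstrated on a toy layer. The sketch below is an illustrative OBQ-style loop under simplified assumptions (a single output row, fixed left-to-right order, exact inverse-Hessian updates); real GPTQ uses Cholesky factors and block processing for efficiency, and all names here are ours, not AutoGPTQ's:

```python
import torch

torch.manual_seed(0)
d, n = 16, 256
X = torch.randn(d, n)                   # calibration inputs (d features, n samples)
H = X @ X.T + 0.01 * torch.eye(d)       # damped Hessian of the layer-wise loss
w = torch.randn(d)                      # one output row of the weight matrix
scale = w.abs().max() / 7               # INT4 symmetric grid

def quant(x):
    return torch.round(x / scale).clamp(-7, 7) * scale

w_rtn = quant(w)                        # baseline: round-to-nearest

w_work, Hinv = w.clone(), torch.linalg.inv(H)
for q in range(d):
    wq = quant(w_work[q])
    err = (w_work[q] - wq) / Hinv[q, q]
    w_work -= err * Hinv[:, q]          # spread the error to remaining weights
    w_work[q] = wq                      # freeze the quantized weight
    # OBS update: remove index q from the inverse Hessian
    Hinv -= torch.outer(Hinv[:, q], Hinv[q, :]) / Hinv[q, q]
    Hinv[q, :] = 0.0
    Hinv[:, q] = 0.0

err_rtn = (X.T @ (w - w_rtn)).pow(2).mean()
err_gptq = (X.T @ (w - w_work)).pow(2).mean()
print(f"Output MSE  RTN: {err_rtn:.4f}  GPTQ-style: {err_gptq:.4f}")
```

Because each weight's rounding error is compensated by the still-unquantized weights, the output error is typically well below plain round-to-nearest even though every weight lands on the same INT4 grid.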
Quantization with AutoGPTQ
from auto_gptq import AutoGPTQForCausalLM, BaseQuantizeConfig
from transformers import AutoTokenizer
import torch

# 1. Quantization configuration
model_name = "meta-llama/Llama-3.1-8B-Instruct"
quantize_config = BaseQuantizeConfig(
    bits=4,             # 4-bit quantization
    group_size=128,     # Groups of 128 elements
    desc_act=True,      # Use activation order
    damp_percent=0.01,  # Hessian damping ratio
    sym=True,           # Symmetric quantization
)

# 2. Load model and tokenizer
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoGPTQForCausalLM.from_pretrained(
    model_name,
    quantize_config=quantize_config,
    torch_dtype=torch.float16,
)

# 3. Prepare calibration data
calibration_data = [
    tokenizer("The meaning of life is", return_tensors="pt"),
    tokenizer("Artificial intelligence has", return_tensors="pt"),
    tokenizer("In the context of machine learning", return_tensors="pt"),
    # In practice, 128-256 diverse samples are recommended
]

# 4. Execute quantization (approximately 15-30 min for Llama-3.1-8B)
model.quantize(calibration_data)

# 5. Save quantized model
output_dir = "./llama-3.1-8b-gptq-4bit"
model.save_quantized(output_dir)
tokenizer.save_pretrained(output_dir)
print(f"Quantization complete. Saved to: {output_dir}")
GPTQ Pros and Cons
Pros:
- Maintains high accuracy relative to original even at 4-bit quantization
- Significant inference speed improvement with Marlin kernel (2.6x speedup)
- Native Hugging Face Transformers support
Cons:
- Requires calibration data (typically 128-256 samples)
- Relatively long quantization time (15-30 minutes for 8B model)
- GPU-only (no CPU inference support)
AWQ: Activation-aware Weight Quantization
Principles and Differentiators
AWQ is a method proposed by Lin et al. at MIT in 2024, and the core observation is that not all weights are equally important.
- Salient weight identification: Identifies important weight channels (approximately 1%) based on activation magnitude
- Selective protection: Applies scaling factors to important weights to reduce quantization error
- Regular quantization for the rest: Applies standard quantization to the remaining 99% of weights
The key difference from GPTQ is that Hessian inverse matrix computation is unnecessary. AWQ achieves similar or better quality with a simpler statistics-based approach while quantizing faster.
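The mechanism can be illustrated on a toy linear layer: find the few input channels with outlier activations, boost their weight columns before group-wise quantization, then fold the inverse scale back (a real implementation folds it into the preceding operation so inference cost is unchanged). The scaling factor and the 1% threshold below are illustrative; AWQ grid-searches the scaling exponent per layer:

```python
import torch

torch.manual_seed(0)
d_in, d_out, n = 512, 512, 256
W = torch.randn(d_out, d_in) * 0.02
X = torch.randn(n, d_in)
X[:, :5] *= 20                        # ~1% of input channels carry outlier activations

def fake_quant(W, bits=4, group=128):
    """Group-wise absmax fake-quantization along the input dimension."""
    qmax = 2 ** (bits - 1) - 1
    g = W.reshape(W.shape[0], -1, group)
    scale = g.abs().amax(dim=2, keepdim=True) / qmax
    return (torch.round(g / scale).clamp(-qmax, qmax) * scale).reshape(W.shape)

W_rtn = fake_quant(W)                 # plain round-to-nearest baseline

# AWQ-style: detect salient channels by mean activation magnitude,
# scale their weight columns up before quantization, fold the inverse back
act_mag = X.abs().mean(dim=0)
k = max(1, d_in // 100)               # protect ~1% of channels
salient = act_mag.topk(k).indices
s = torch.ones(d_in)
s[salient] = 2.0                      # illustrative factor; AWQ searches this
W_awq = fake_quant(W * s) / s

err_rtn = (X @ (W - W_rtn).T).pow(2).mean()
err_awq = (X @ (W - W_awq).T).pow(2).mean()
print(f"Output MSE  RTN: {err_rtn:.3e}  AWQ-style: {err_awq:.3e}")
```

Shrinking the effective quantization step only on the channels that dominate the output error is why a simple activation statistic can stand in for GPTQ's Hessian machinery.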
Quantization with AutoAWQ
from awq import AutoAWQForCausalLM
from transformers import AutoTokenizer

# 1. Load model
model_path = "meta-llama/Llama-3.1-8B-Instruct"
model = AutoAWQForCausalLM.from_pretrained(
    model_path,
    low_cpu_mem_usage=True,
    use_cache=False,
)
tokenizer = AutoTokenizer.from_pretrained(model_path, trust_remote_code=True)

# 2. Quantization configuration
quant_config = {
    "zero_point": True,   # Use zero-point quantization
    "q_group_size": 128,  # Group size
    "w_bit": 4,           # 4-bit quantization
    "version": "GEMM",    # GEMM kernel (general) vs GEMV (batch=1 optimized)
}

# 3. Execute quantization (faster than GPTQ, approximately 10 minutes)
model.quantize(tokenizer, quant_config=quant_config)

# 4. Save
output_dir = "./llama-3.1-8b-awq-4bit"
model.save_quantized(output_dir)
tokenizer.save_pretrained(output_dir)
print(f"AWQ quantization complete: {output_dir}")
AWQ Pros and Cons
Pros:
- Faster quantization speed than GPTQ
- High quality retention with activation-aware approach (HumanEval Pass@1: 51.8%)
- High throughput of 741 tok/s with Marlin-AWQ kernel
- Native vLLM support
Cons:
- GPU-only (no CPU inference support)
- Smaller ecosystem compared to GGUF
- Compatibility issues with some model architectures
GGUF: The Universal Format for the GPU-poor
GGUF Format and the llama.cpp Ecosystem
GGUF (GPT-Generated Unified Format) is a file format, not a quantization algorithm. Created by the llama.cpp project, this format supports CPU and GPU hybrid inference.
Key features of GGUF:
- Single file: Model weights, tokenizer, and metadata are all contained in one file
- CPU inference support: Inference is possible without a GPU
- Hybrid offloading: Some layers on GPU, the rest on CPU
- Various quantization levels: Fine-grained quality-size control from Q2_K to Q8_0
GGUF Quantization Type Comparison
| Quant Type | Bits | Llama-3.1-8B Size | Perplexity Increase | Recommended Use |
|---|---|---|---|---|
| Q2_K | 2.6bit | ~2.8GB | +2.5~3.0 | Extreme memory limits |
| Q3_K_M | 3.3bit | ~3.5GB | +1.0~1.5 | Memory-constrained env |
| Q4_K_M | 4.5bit | ~4.9GB | +0.3~0.5 | General recommended |
| Q5_K_M | 5.3bit | ~5.7GB | +0.1~0.2 | Quality-focused |
| Q6_K | 6.6bit | ~6.6GB | +0.05 | Near lossless |
| Q8_0 | 8.0bit | ~8.5GB | ~0 | Close to lossless |
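The sizes in the table are roughly parameter count times effective bits per weight, plus overhead: K-quants keep some tensors (e.g., embeddings and the output head) at higher precision, so real files run somewhat larger than the raw estimate. A rough lower-bound estimator, with the ~8.0e9 parameter count taken as an assumption:

```python
def gguf_size_estimate_gb(n_params: float, bits_per_weight: float) -> float:
    """Lower-bound file size: parameters x effective bits, in GB."""
    return n_params * bits_per_weight / 8 / 1e9

# Llama-3.1-8B has roughly 8.0e9 parameters
for name, bpw in [("Q2_K", 2.6), ("Q4_K_M", 4.5), ("Q6_K", 6.6), ("Q8_0", 8.0)]:
    est = gguf_size_estimate_gb(8.0e9, bpw)
    print(f"{name:>7}: >= {est:.1f} GB (table lists slightly more)")
```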
GGUF Quantization with llama.cpp
# 1. Build llama.cpp
git clone https://github.com/ggml-org/llama.cpp
cd llama.cpp
cmake -B build -DGGML_CUDA=ON   # For GPU acceleration
cmake --build build --config Release

# 2. Convert HF model to GGUF FP16 (run from the repo root)
python convert_hf_to_gguf.py \
    /path/to/Llama-3.1-8B-Instruct \
    --outtype f16 \
    --outfile llama-3.1-8b-f16.gguf

# 3. Quantize to Q4_K_M
./build/bin/llama-quantize \
    llama-3.1-8b-f16.gguf \
    llama-3.1-8b-q4_k_m.gguf \
    Q4_K_M

# 4. Inference test
./build/bin/llama-cli \
    -m llama-3.1-8b-q4_k_m.gguf \
    -p "Explain quantum computing in simple terms:" \
    -n 256 \
    --n-gpu-layers 35   # Number of layers to offload to GPU
Python Integration with llama-cpp-python
from llama_cpp import Llama

# Load GGUF model (with GPU layer offloading)
llm = Llama(
    model_path="./llama-3.1-8b-q4_k_m.gguf",
    n_gpu_layers=35,  # Number of layers on GPU (-1 for all)
    n_ctx=4096,       # Context length
    n_threads=8,      # CPU thread count
    verbose=False,
)

# Text generation
output = llm(
    "Explain the difference between TCP and UDP:",
    max_tokens=512,
    temperature=0.7,
    top_p=0.9,
    stop=["\n\n"],
)
print(output["choices"][0]["text"])

# ChatCompletion API style
response = llm.create_chat_completion(
    messages=[
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": "What is quantization in ML?"},
    ],
    max_tokens=512,
    temperature=0.7,
)
print(response["choices"][0]["message"]["content"])
BitsAndBytes and QLoRA Integration
4-bit Inference and Fine-tuning
BitsAndBytes integrates directly with Hugging Face Transformers, applying quantization at loading time without a separate quantization process. When combined with QLoRA, LoRA adapters can be trained on top of quantized models.
import torch
from transformers import (
    AutoModelForCausalLM,
    AutoTokenizer,
    BitsAndBytesConfig,
)

# 4-bit quantization configuration
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",              # NormalFloat4 (QLoRA paper)
    bnb_4bit_compute_dtype=torch.bfloat16,  # Compute in BF16
    bnb_4bit_use_double_quant=True,         # Double quantization (extra memory savings)
)

# Load model (automatic quantization on load)
model_name = "meta-llama/Llama-3.1-8B-Instruct"
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    quantization_config=bnb_config,
    device_map="auto",
    trust_remote_code=True,
)
tokenizer = AutoTokenizer.from_pretrained(model_name)

# Check memory usage
print(f"Model memory: {model.get_memory_footprint() / 1e9:.2f} GB")

# Inference
inputs = tokenizer("Explain gradient descent:", return_tensors="pt").to("cuda")
outputs = model.generate(**inputs, max_new_tokens=200)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
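The QLoRA pairing mentioned above attaches trainable LoRA adapters on top of the frozen NF4 weights. A minimal sketch using peft, assuming the `model` loaded in the snippet above; the adapter hyperparameters are illustrative, not prescriptive:

```python
from peft import LoraConfig, get_peft_model, prepare_model_for_kbit_training

# Prepare the 4-bit model for training (gradient checkpointing, norm casting)
model = prepare_model_for_kbit_training(model)

lora_config = LoraConfig(
    r=16,                     # adapter rank
    lora_alpha=32,            # adapter scaling factor
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],
    lora_dropout=0.05,
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, lora_config)
model.print_trainable_parameters()  # typically well under 1% of total parameters
```

Only the adapter weights receive gradients; the quantized base model stays frozen, which is what makes fine-tuning an 8B model feasible on a single consumer GPU.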
BitsAndBytes Characteristics
Pros:
- No pre-quantization process needed (applied instantly on load)
- Natural integration with QLoRA fine-tuning
- Perfect Hugging Face ecosystem support
- High quality retention with NF4 quantization
Cons:
- Inference speed slower than GPTQ/AWQ (no custom kernels)
- Difficult to save/share quantized models as files
- Primarily CUDA-only (no CPU inference; ROCm support exists but remains partial as of 2025)
Comprehensive Comparison: GPTQ vs AWQ vs GGUF vs BitsAndBytes
Key Characteristics Comparison Table
| Item | GPTQ | AWQ | GGUF | BitsAndBytes |
|---|---|---|---|---|
| Quantization method | PTQ (Hessian-based) | PTQ (Activation-aware) | PTQ (various methods) | Dynamic quant |
| Default bits | 4bit | 4bit | 2~8bit selectable | 4bit (NF4) |
| Calibration | Required (128-256) | Required (small amount) | Not required | Not required |
| Quantization time | 15-30 min (8B) | ~10 min (8B) | Minutes | Instant on load |
| GPU inference | Very fast | Fastest | Fast (with offload) | Moderate |
| CPU inference | Not supported | Not supported | Supported | Not supported |
| vLLM support | Supported | Supported | Partial support | Supported |
| Fine-tuning | Limited | Limited | Not supported | QLoRA optimal |
| Model sharing | HF Hub upload | HF Hub upload | Single file deploy | Difficult |
| Perplexity (4bit) | Baseline +0.3~0.5 | Baseline +0.2~0.4 | Q4_K_M: +0.3~0.5 | Baseline +0.2~0.4 |
Inference Performance Benchmark (Llama-3.1-8B, A100 80GB)
| Method | Throughput (tok/s) | VRAM Usage | HumanEval Pass@1 | Latency (TTFT) |
|---|---|---|---|---|
| FP16 (baseline) | ~350 | 16GB | 53.2% | 45ms |
| GPTQ 4bit | ~520 | 5.5GB | 50.6% | 32ms |
| GPTQ + Marlin | ~712 | 5.5GB | 50.6% | 25ms |
| AWQ 4bit | ~550 | 5.2GB | 51.8% | 30ms |
| AWQ + Marlin | ~741 | 5.2GB | 51.8% | 23ms |
| GGUF Q4_K_M | ~280 | 4.9GB | 51.8% | 55ms |
| BitsAndBytes NF4 | ~300 | 5.8GB | 51.8% | 50ms |
The key takeaway from this benchmark is that kernel implementation matters more than the algorithm itself: with the Marlin kernel, GPTQ throughput rises from ~520 to ~712 tok/s and AWQ from ~550 to ~741 tok/s, roughly doubling the FP16 baseline.
Serving Quantized Models with vLLM
from vllm import LLM, SamplingParams

# Serve AWQ model (Marlin kernel auto-applied)
llm = LLM(
    model="casperhansen/llama-3.1-8b-instruct-awq",
    quantization="awq_marlin",  # Explicitly specify the Marlin kernel
    max_model_len=8192,
    gpu_memory_utilization=0.85,
    dtype="half",
)

sampling_params = SamplingParams(
    temperature=0.7,
    top_p=0.9,
    max_tokens=1024,
)

prompts = [
    "Explain the CAP theorem in distributed systems.",
    "Write a Python function to implement binary search.",
    "What are the SOLID principles in software engineering?",
]

outputs = llm.generate(prompts, sampling_params)
for output in outputs:
    prompt = output.prompt
    generated = output.outputs[0].text
    print(f"Prompt: {prompt[:50]}...")
    print(f"Output: {generated[:200]}...")
    print("---")
Model-Specific Quantization Quality Guide
Recommended Quantization by Model Size
| Model Size | Consumer GPU (24GB) | Server GPU (80GB) | CPU Only |
|---|---|---|---|
| 7~8B | GPTQ/AWQ 4bit | FP16 recommended | GGUF Q4_K_M |
| 13B | GPTQ/AWQ 4bit | FP16 or AWQ 4bit | GGUF Q4_K_M |
| 34B | GGUF Q4_K_M | AWQ 4bit + Marlin | GGUF Q3_K_M |
| 70B | GGUF Q3_K_M (partial) | AWQ 4bit + Marlin | GGUF Q2_K |
Recommended Quantization Level by Task
- Code generation: AWQ 4bit recommended (highest accuracy on HumanEval)
- Long-form generation/summarization: Q5_K_M or higher recommended (repetition increases at lower quantization)
- Classification/NER: INT4 is sufficient (minimal accuracy impact)
- Mathematical reasoning: Q6_K or higher, or AWQ 4bit recommended (numerically sensitive)
Operational Considerations and Troubleshooting
Common Issues
1. OOM during GPTQ quantization
If out-of-memory occurs during calibration, try setting desc_act=False or increasing damp_percent.
2. Strange output from AWQ models
The quantization version (GEMM vs GEMV) might not match the inference framework. When using vLLM, it is safest to explicitly specify awq_marlin.
3. Slow GGUF model
Increase the n_gpu_layers value to offload more layers to the GPU. Setting it higher than the total number of layers puts everything on the GPU.
4. torch version conflict with BitsAndBytes
The combination of bitsandbytes>=0.43.0 and torch>=2.1.0 is recommended. CUDA version 12.1 or higher is also more stable.
Quantization Quality Verification Checklist
- Perplexity measurement: Confirm perplexity increase is within 0.5 compared to original
- Task-specific benchmarks: Verify quality on actual use tasks (code generation, summarization, etc.)
- Edge case testing: Validate on boundary cases like long inputs, multilingual, math problems
- Latency profiling: Measure TTFT (Time to First Token) and TPS (Tokens per Second)
- Memory monitoring: Track actual VRAM usage in the serving environment
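For the first checklist item, perplexity reduces to the exponential of the mean next-token cross-entropy. A minimal sketch on raw logits (model loading omitted; a uniform distribution over the vocabulary should yield perplexity equal to the vocabulary size):

```python
import torch
import torch.nn.functional as F

def perplexity(logits: torch.Tensor, labels: torch.Tensor) -> float:
    """Perplexity from next-token logits.

    logits: (seq_len, vocab) predictions; labels: (seq_len,) target ids.
    Shift by one so position t predicts token t+1, as in causal LMs.
    """
    loss = F.cross_entropy(logits[:-1], labels[1:])
    return float(torch.exp(loss))

# Sanity check: uniform logits over V tokens give perplexity = V
V = 100
logits = torch.zeros(10, V)
labels = torch.randint(0, V, (10,))
print(perplexity(logits, labels))  # approximately 100
```

Run the same computation on held-out text through both the FP16 and quantized models; the checklist target is a perplexity increase within 0.5.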
Conclusion
LLM quantization is not simply about reducing model size -- it is a core technology that determines deployment feasibility. As of 2026, the recommendations can be summarized as follows:
- Production GPU serving: AWQ + Marlin kernel (highest throughput + high quality)
- Development/experimentation: BitsAndBytes NF4 (no quantization process needed, QLoRA integration)
- Edge/CPU deployment: GGUF Q4_K_M (versatility, single file deployment)
- Legacy compatibility: GPTQ (broadest ecosystem support)
Quantization techniques continue to evolve. As the rise of the Marlin kernel shows, kernel-level optimization now has a greater impact on real-world performance than the choice of algorithm itself. When evaluating new quantization methods, check not only the theoretical advantages but also benchmarks in your actual serving environment.
References
- Frantar et al., "GPTQ: Accurate Post-Training Quantization for Generative Pre-trained Transformers" (2022)
- Lin et al., "AWQ: Activation-aware Weight Quantization for LLM Compression and Acceleration" (2024)
- Hugging Face Transformers Quantization Guide
- vLLM Quantization Documentation
- llama.cpp GitHub Repository - GGUF Format
- The Complete Guide to LLM Quantization with vLLM: Benchmarks
- AutoGPTQ GitHub Repository
- AutoAWQ - Applying AWQ Quantization
- BitsAndBytes Foundation GitHub
- Dettmers et al., "QLoRA: Efficient Finetuning of Quantized LLMs" (2023)