- Part 1: vLLM
- Part 2: Ollama
- Part 3: Comparison and Practice
Part 1: vLLM
1. Introduction to vLLM
vLLM is a high-performance LLM inference and serving engine developed at UC Berkeley. Since its release alongside the PagedAttention paper in 2023, it has established itself as the de facto standard for production LLM serving. As of March 2026, the latest version is v0.16.x, with the transition to V1 architecture underway.
1.1 Core Principles of PagedAttention
In traditional LLM inference, each sequence's KV cache is allocated as a single contiguous region of GPU memory. Because this region is pre-reserved for the maximum sequence length, 60-80% of the reserved memory ends up wasted in practice.
PagedAttention introduces the operating system's Virtual Memory Paging concept to KV Cache management.
┌──────────────────────────────────────────────────┐
│ Traditional KV Cache                             │
│   Seq 1: [used][used][used][waste][waste][waste] │
│   Seq 2: [used][waste][waste][waste][waste]      │
│   Seq 3: [used][used][waste][waste][waste]       │
│   → 60~80% Memory Waste                          │
├──────────────────────────────────────────────────┤
│ PagedAttention KV Cache                          │
│   Physical Blocks: [B0][B1][B2][B3][B4][B5]...   │
│   Block Table:                                   │
│     Seq 1 → [B0, B3, B5]  (logical → physical)   │
│     Seq 2 → [B1, B4]                             │
│     Seq 3 → [B2, B6]                             │
│   → < 4% Memory Waste                            │
└──────────────────────────────────────────────────┘
The core mechanisms are as follows.
- Fixed-size blocks: KV Cache is split into fixed-size blocks (default 16 tokens)
- Block Table: Maintains a table mapping logical block numbers of sequences to physical block addresses
- Dynamic allocation: Physical blocks are allocated only as needed during token generation
- Copy-on-Write: When branching sequences (e.g., Beam Search), physical blocks are shared and copied only when modification is needed
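The four mechanisms above can be sketched as a toy allocator. This is an illustration only: the class and method names are invented for this sketch, not vLLM's internals; only the block size of 16 matches vLLM's default.

```python
# Toy model of PagedAttention block management (invented names,
# illustration only). A physical block is allocated lazily, one
# block at a time, as a sequence grows.
BLOCK_SIZE = 16  # tokens per KV-cache block (vLLM's default)

class Allocator:
    """Pool of free physical block IDs."""
    def __init__(self, num_blocks):
        self.free = list(range(num_blocks))
    def pop(self):
        return self.free.pop(0)

class BlockTable:
    """Maps a sequence's logical block indices to physical block IDs."""
    def __init__(self, allocator):
        self.allocator = allocator
        self.physical_blocks = []   # logical index -> physical block ID
        self.num_tokens = 0

    def append_token(self):
        # Allocate a new physical block only when the last one is full.
        if self.num_tokens % BLOCK_SIZE == 0:
            self.physical_blocks.append(self.allocator.pop())
        self.num_tokens += 1

alloc = Allocator(num_blocks=1024)
seq = BlockTable(alloc)
for _ in range(37):              # generate 37 tokens
    seq.append_token()

# ceil(37 / 16) = 3 blocks; waste is bounded by one partially
# filled block (< 16 tokens) per sequence, not by max_model_len.
print(seq.physical_blocks)       # [0, 1, 2]
```

Copy-on-Write extends this table: a forked sequence initially points at the same physical IDs and only copies a block when it needs to write into it.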
1.2 Continuous Batching
Traditional Static Batching holds every slot until all sequences in the batch complete. Continuous Batching instead evicts completed sequences and admits new requests at every decoding step.
Static Batching:
Step 1: [Seq1, Seq2, Seq3, Seq4] ← Waits even if Seq2 completes
Step 2: [Seq1, Seq2, Seq3, Seq4]
Step 3: [Seq1, ___, Seq3, Seq4] ← Slot wasted after Seq2 ends
...
Step N: Next batch starts after all sequences complete
Continuous Batching:
Step 1: [Seq1, Seq2, Seq3, Seq4]
Step 2: [Seq1, Seq5, Seq3, Seq4] ← Seq5 inserted immediately after Seq2 completes
Step 3: [Seq1, Seq5, Seq6, Seq4] ← Seq6 inserted immediately after Seq3 completes
→ Minimizes GPU idle time, maximizes throughput
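The throughput gap in the diagrams above can be made concrete with a small step-count simulation. The helper names are invented; each sequence is reduced to the number of decode steps it needs, and one batch slot serves one sequence per step.

```python
# Minimal simulation contrasting static and continuous batching.
# lengths = decode steps each sequence needs; batch has 4 slots.
from collections import deque

def static_batching(lengths, batch_size=4):
    """Every slot in a batch is held until the longest member finishes."""
    steps = 0
    q = deque(lengths)
    while q:
        batch = [q.popleft() for _ in range(min(batch_size, len(q)))]
        steps += max(batch)          # short sequences idle in their slots
    return steps

def continuous_batching(lengths, batch_size=4):
    """A finished sequence's slot is refilled on the very next step."""
    q = deque(lengths)
    running, steps = [], 0
    while q or running:
        while q and len(running) < batch_size:
            running.append(q.popleft())   # admit waiting requests
        steps += 1
        running = [r - 1 for r in running if r - 1 > 0]
    return steps

lengths = [10, 2, 3, 10, 8, 7]
print(static_batching(lengths))      # 18 steps (10 + 8 over two batches)
print(continuous_batching(lengths))  # 10 steps: freed slots are reused
```

With the same total work, continuous batching finishes in the time of the single longest sequence because no slot ever sits idle while requests are queued.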
1.3 Supported Models
vLLM supports virtually all major Transformer-based LLM architectures.
| Category | Supported Models |
|---|---|
| Meta Llama Family | Llama 2, Llama 3, Llama 3.1, Llama 3.2, Llama 3.3, Llama 4 |
| Mistral Family | Mistral 7B, Mixtral 8x7B, Mixtral 8x22B, Mistral Large, Mistral Small |
| Qwen Family | Qwen, Qwen 1.5, Qwen 2, Qwen 2.5, Qwen 3, QwQ |
| Google Family | Gemma, Gemma 2, Gemma 3 |
| DeepSeek Family | DeepSeek V2, DeepSeek V3, DeepSeek-R1 |
| Others | Phi-3/4, Yi, InternLM 2/3, Command R, DBRX, Falcon, StarCoder 2 |
| Multimodal | LLaVA, InternVL, Pixtral, Qwen-VL, MiniCPM-V |
| Embedding | E5-Mistral, GTE-Qwen, Jina Embeddings |
1.4 LLM Serving Engine Comparison
| Item | vLLM | TGI | TensorRT-LLM | llama.cpp |
|---|---|---|---|---|
| Developer | UC Berkeley / vLLM Project | Hugging Face | NVIDIA | Georgi Gerganov |
| Language | Python/C++/CUDA | Rust/Python | C++/CUDA | C/C++ |
| Core Technology | PagedAttention | Continuous Batching | FP8/INT4 kernel optimization | GGUF quantization |
| Multi-GPU | TP + PP | TP | TP + PP | Limited |
| Quantization | AWQ, GPTQ, FP8, BnB | AWQ, GPTQ, BnB | FP8, INT4, INT8 | GGUF (Q2~Q8) |
| API Compat | OpenAI compatible | OpenAI compatible | Triton | Custom API |
| Install Difficulty | Medium | Medium | High | Low |
| Production Ready | Very High | High | Very High | Low~Medium |
| Community | Very Active | Active | NVIDIA-led | Very Active |
2. vLLM Installation and Startup
2.1 pip Installation
# Basic installation (CUDA 12.x)
pip install vllm
# Specific version installation
pip install vllm==0.16.0
# CUDA 11.8 environment
pip install vllm --extra-index-url https://download.pytorch.org/whl/cu118
2.2 conda Installation
conda create -n vllm python=3.11 -y
conda activate vllm
pip install vllm
2.3 Docker Installation
# Official Docker image (NVIDIA GPU)
docker run --runtime nvidia --gpus all \
-v ~/.cache/huggingface:/root/.cache/huggingface \
--env "HUGGING_FACE_HUB_TOKEN=<hf_token>" \
-p 8000:8000 \
--ipc=host \
vllm/vllm-openai:latest \
--model meta-llama/Llama-3.1-8B-Instruct
# ROCm (AMD GPU)
docker run --device /dev/kfd --device /dev/dri \
-v ~/.cache/huggingface:/root/.cache/huggingface \
-p 8000:8000 \
vllm/vllm-openai:latest-rocm \
--model meta-llama/Llama-3.1-8B-Instruct
2.4 Basic Server Start
# vllm serve command (recommended)
vllm serve meta-llama/Llama-3.1-8B-Instruct \
--host 0.0.0.0 \
--port 8000
# Direct Python module execution (legacy)
python -m vllm.entrypoints.openai.api_server \
--model meta-llama/Llama-3.1-8B-Instruct \
--host 0.0.0.0 \
--port 8000
# Start with YAML config file
vllm serve meta-llama/Llama-3.1-8B-Instruct \
--config config.yaml
config.yaml example:
# vLLM server configuration file
host: '0.0.0.0'
port: 8000
tensor_parallel_size: 2
gpu_memory_utilization: 0.90
max_model_len: 8192
dtype: 'auto'
enforce_eager: false
enable_prefix_caching: true
2.5 Offline Batch Inference
You can perform batch inference directly from Python code without starting a server.
from vllm import LLM, SamplingParams
# Load model
llm = LLM(
model="meta-llama/Llama-3.1-8B-Instruct",
tensor_parallel_size=1,
gpu_memory_utilization=0.90,
)
# Set sampling parameters
sampling_params = SamplingParams(
temperature=0.7,
top_p=0.9,
max_tokens=512,
)
# Prompt list
prompts = [
"Explain PagedAttention in simple terms.",
"What is continuous batching?",
"Compare vLLM and TensorRT-LLM.",
]
# Run batch inference
outputs = llm.generate(prompts, sampling_params)
for output in outputs:
prompt = output.prompt
generated = output.outputs[0].text
print(f"Prompt: {prompt!r}")
print(f"Output: {generated!r}\n")
2.6 OpenAI-Compatible API Server
The vLLM server provides OpenAI API-compatible endpoints.
# Start server
vllm serve meta-llama/Llama-3.1-8B-Instruct \
--served-model-name llama-3.1-8b \
--api-key my-secret-key
# Call Chat Completion with curl
curl http://localhost:8000/v1/chat/completions \
-H "Content-Type: application/json" \
-H "Authorization: Bearer my-secret-key" \
-d '{
"model": "llama-3.1-8b",
"messages": [
{"role": "system", "content": "You are a helpful assistant."},
{"role": "user", "content": "What is PagedAttention?"}
],
"temperature": 0.7,
"max_tokens": 512
}'
# Call with OpenAI SDK
from openai import OpenAI
client = OpenAI(
base_url="http://localhost:8000/v1",
api_key="my-secret-key",
)
response = client.chat.completions.create(
model="llama-3.1-8b",
messages=[
{"role": "system", "content": "You are a helpful assistant."},
{"role": "user", "content": "Explain the advantages of vLLM."},
],
temperature=0.7,
max_tokens=512,
)
print(response.choices[0].message.content)
3. Complete vLLM CLI Arguments Reference
Here is a categorized summary of key CLI arguments that can be passed to vllm serve. You can check the full list with vllm serve --help, or query by group with vllm serve --help=ModelConfig.
3.1 Model-Related Arguments
| Argument | Type | Default | Description |
|---|---|---|---|
--model | str | facebook/opt-125m | HuggingFace model ID or local path |
--tokenizer | str | None (same as model) | Specify a separate tokenizer |
--revision | str | None | Specific Git revision of the model (branch, tag, commit hash) |
--tokenizer-revision | str | None | Specific revision of the tokenizer |
--dtype | str | "auto" | Model weight data type (auto, float16, bfloat16, float32) |
--max-model-len | int | None (follows model config) | Maximum sequence length (sum of input + output tokens) |
--trust-remote-code | flag | False | Allow HuggingFace remote code execution |
--download-dir | str | None | Model download directory |
--load-format | str | "auto" | Model load format (auto, pt, safetensors, npcache, dummy, bitsandbytes) |
--config-format | str | "auto" | Model configuration format (auto, hf, mistral) |
--seed | int | 0 | Random seed for reproducibility |
3.2 Server-Related Arguments
| Argument | Type | Default | Description |
|---|---|---|---|
--host | str | "0.0.0.0" | Host address to bind |
--port | int | 8000 | Server port number |
--uvicorn-log-level | str | "info" | Uvicorn log level |
--api-key | str | None | API authentication key (Bearer token) |
--served-model-name | str | None | Model name for the API (uses --model value if unset) |
--chat-template | str | None | Jinja2 chat template file path or string |
--response-role | str | "assistant" | Role in chat completion responses |
--ssl-keyfile | str | None | SSL key file path |
--ssl-certfile | str | None | SSL certificate file path |
--allowed-origins | list | ["*"] | CORS allowed origin list |
--middleware | list | None | FastAPI middleware classes |
--max-log-len | int | None | Maximum prompt/output length in logs |
--disable-log-requests | flag | False | Disable request logging |
3.3 Parallelism-Related Arguments
| Argument | Type | Default | Description |
|---|---|---|---|
--tensor-parallel-size (-tp) | int | 1 | Number of GPUs for Tensor Parallelism |
--pipeline-parallel-size (-pp) | int | 1 | Number of Pipeline Parallelism stages |
--distributed-executor-backend | str | None | Distributed execution backend (ray, mp) |
--ray-workers-use-nsight | flag | False | Use Nsight profiler with Ray workers |
--data-parallel-size (-dp) | int | 1 | Number of Data Parallelism processes |
Usage examples:
# 4-GPU Tensor Parallelism
vllm serve meta-llama/Llama-3.1-70B-Instruct \
--tensor-parallel-size 4
# 2-GPU Tensor + 2-way Pipeline (4 GPUs total)
vllm serve meta-llama/Llama-3.1-70B-Instruct \
--tensor-parallel-size 2 \
--pipeline-parallel-size 2
# Ray distributed backend (multi-node)
vllm serve meta-llama/Llama-3.1-70B-Instruct \
--tensor-parallel-size 8 \
--distributed-executor-backend ray
3.4 Memory and Performance Arguments
| Argument | Type | Default | Description |
|---|---|---|---|
--gpu-memory-utilization | float | 0.90 | GPU memory usage ratio (0.0~1.0) |
--max-num-seqs | int | 256 | Maximum concurrent sequences |
--max-num-batched-tokens | int | None (auto) | Maximum tokens processed per step |
--block-size | int | 16 | PagedAttention block size (in tokens) |
--swap-space | float | 4 | CPU swap space size (GiB) |
--enforce-eager | flag | False | Disable CUDA Graph, force Eager mode |
--max-seq-len-to-capture | int | 8192 | Maximum sequence length for CUDA Graph capture |
--disable-custom-all-reduce | flag | False | Disable custom All-Reduce |
--enable-prefix-caching | flag | True (v1) | Enable Automatic Prefix Caching |
--enable-chunked-prefill | flag | True (v1) | Enable Chunked Prefill |
--num-scheduler-steps | int | 1 | Decoding steps per scheduler (Multi-Step Scheduling) |
--kv-cache-dtype | str | "auto" | KV Cache data type (auto, fp8, fp8_e5m2, fp8_e4m3) |
Usage examples:
# Memory optimization settings
vllm serve meta-llama/Llama-3.1-8B-Instruct \
--gpu-memory-utilization 0.95 \
--max-num-seqs 128 \
--max-model-len 4096 \
--enable-prefix-caching \
--enable-chunked-prefill
# Eager mode (debugging/compatibility)
vllm serve meta-llama/Llama-3.1-8B-Instruct \
--enforce-eager \
--gpu-memory-utilization 0.85
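The interplay of --gpu-memory-utilization, --max-model-len, and --max-num-seqs comes down to KV-cache arithmetic. The sketch below works the numbers for Llama-3.1-8B (32 layers, 8 KV heads after GQA, head dim 128, fp16); the function name and the 20 GiB budget are illustrative assumptions, not vLLM APIs.

```python
# Back-of-the-envelope KV-cache sizing. The factor of 2 covers the
# separate K and V tensors; dtype_bytes=2 assumes fp16/bf16.
def kv_bytes_per_token(layers, kv_heads, head_dim, dtype_bytes=2):
    return 2 * layers * kv_heads * head_dim * dtype_bytes

# Llama-3.1-8B: 32 layers, 8 KV heads, head dim 128
per_token = kv_bytes_per_token(32, 8, 128)
print(per_token // 1024, "KiB per token")       # 128 KiB per token

# Tokens that fit in a hypothetical 20 GiB KV-cache budget
# (what remains after weights under gpu-memory-utilization):
budget = 20 * 1024**3
print(budget // per_token, "cacheable tokens")  # 163840 tokens
```

Dividing the cacheable-token count by your typical sequence length gives a rough ceiling on useful --max-num-seqs; --kv-cache-dtype fp8 halves the per-token cost.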
3.5 Quantization-Related Arguments
| Argument | Type | Default | Description |
|---|---|---|---|
--quantization (-q) | str | None | Select quantization method |
--load-format | str | "auto" | Model load format |
--quantization supported values:
| Value | Description | Notes |
|---|---|---|
awq | AWQ (Activation-aware Weight Quantization) | 4-bit, fast inference |
gptq | GPTQ (Post-Training Quantization) | 4-bit, ExLlamaV2 kernel |
gptq_marlin | GPTQ + Marlin kernel | 4-bit, faster kernel |
awq_marlin | AWQ + Marlin kernel | 4-bit, faster kernel |
squeezellm | SqueezeLLM | Sparse quantization |
fp8 | FP8 (8-bit floating point) | H100/MI300x only |
bitsandbytes | BitsAndBytes | 4-bit NF4 |
gguf | GGUF format | llama.cpp compatible |
compressed-tensors | Compressed Tensors | General purpose |
experts_int8 | MoE Expert INT8 | MoE models only |
Usage examples:
# AWQ quantized model
vllm serve TheBloke/Llama-2-7B-AWQ \
--quantization awq
# GPTQ quantized model
vllm serve TheBloke/Llama-2-7B-GPTQ \
--quantization gptq
# FP8 quantization (H100 and above)
vllm serve neuralmagic/Meta-Llama-3.1-8B-Instruct-FP8 \
--quantization fp8
# BitsAndBytes 4-bit (GPU memory saving)
vllm serve meta-llama/Llama-3.1-8B-Instruct \
--quantization bitsandbytes \
--load-format bitsandbytes
3.6 LoRA-Related Arguments
| Argument | Type | Default | Description |
|---|---|---|---|
--enable-lora | flag | False | Enable LoRA adapter serving |
--max-loras | int | 1 | Maximum number of simultaneously loaded LoRAs |
--max-lora-rank | int | 16 | Maximum LoRA rank |
--lora-extra-vocab-size | int | 256 | Extra vocabulary size for LoRA adapters |
--lora-modules | list | None | LoRA adapter list (name=path format) |
--long-lora-scaling-factors | list | None | Long LoRA scaling factors |
Usage example:
# LoRA adapter serving
vllm serve meta-llama/Llama-3.1-8B-Instruct \
--enable-lora \
--max-loras 4 \
--max-lora-rank 64 \
--lora-modules \
adapter1=/path/to/lora1 \
adapter2=/path/to/lora2
3.7 Speculative Decoding Arguments
| Argument | Type | Default | Description |
|---|---|---|---|
--speculative-model | str | None | Draft model (small model or [ngram]) |
--num-speculative-tokens | int | None | Number of tokens to speculatively generate |
--speculative-draft-tensor-parallel-size | int | None | TP size for the draft model |
--speculative-disable-by-batch-size | int | None | Disable when batch size exceeds threshold |
--ngram-prompt-lookup-max | int | None | Maximum lookup size for N-gram speculation |
--ngram-prompt-lookup-min | int | None | Minimum lookup size for N-gram speculation |
Usage examples:
# Using a separate draft model
vllm serve meta-llama/Llama-3.1-70B-Instruct \
--speculative-model meta-llama/Llama-3.2-1B-Instruct \
--num-speculative-tokens 5 \
--tensor-parallel-size 4
# N-gram based speculative decoding (no additional model needed)
vllm serve meta-llama/Llama-3.1-8B-Instruct \
--speculative-model "[ngram]" \
--num-speculative-tokens 5 \
--ngram-prompt-lookup-max 4
4. vLLM Sampling Parameters
vLLM supports OpenAI API-compatible parameters plus additional advanced parameters.
4.1 Complete Parameter Reference
| Parameter | Type | Default | Range | Description |
|---|---|---|---|---|
temperature | float | 1.0 | >= 0.0 | Lower is more deterministic, higher is more creative. 0 = Greedy |
top_p | float | 1.0 | (0.0, 1.0] | Nucleus sampling. Sample only top tokens by cumulative probability |
top_k | int | -1 | -1 or >= 1 | Consider only top k tokens. -1 disables |
min_p | float | 0.0 | [0.0, 1.0] | Minimum probability threshold. Filters by ratio to highest probability |
frequency_penalty | float | 0.0 | [-2.0, 2.0] | Frequency-based penalty. Positive suppresses repetition |
presence_penalty | float | 0.0 | [-2.0, 2.0] | Presence-based penalty. Penalizes tokens that appeared at least once |
repetition_penalty | float | 1.0 | > 0.0 | Repetition penalty (1.0 disables, greater than 1.0 suppresses) |
max_tokens | int | 16 | >= 1 | Maximum tokens to generate |
stop | list | None | - | List of stop strings |
seed | int | None | - | Random seed (ensures reproducibility) |
n | int | 1 | >= 1 | Number of responses per prompt |
best_of | int | None | >= n | Generate best_of candidates and select the best |
use_beam_search | bool | False | - | Enable Beam Search |
logprobs | int | None | [0, 20] | Number of per-token log probabilities to return |
prompt_logprobs | int | None | [0, 20] | Number of prompt token log probabilities to return |
skip_special_tokens | bool | True | - | Whether to skip special tokens |
spaces_between_special_tokens | bool | True | - | Insert spaces between special tokens |
guided_json | object | None | - | JSON Schema-based structured output |
guided_regex | str | None | - | Regex-based structured output |
guided_choice | list | None | - | Choice-based structured output |
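How temperature, top_k, top_p, and min_p interact can be seen by filtering a plain probability list. This is a simplified sketch over raw logits, not vLLM's sampler: the exact ordering of filters in vLLM differs, and the function name is invented.

```python
# Which token IDs remain candidates after each sampling filter.
import math

def sample_filter(logits, temperature=1.0, top_k=-1, top_p=1.0, min_p=0.0):
    if temperature == 0:                       # temperature 0 = greedy
        return [max(range(len(logits)), key=lambda i: logits[i])]
    scaled = [l / temperature for l in logits] # flatten/sharpen distribution
    z = sum(math.exp(l) for l in scaled)
    probs = [math.exp(l) / z for l in scaled]  # softmax
    order = sorted(range(len(probs)), key=lambda i: -probs[i])
    if top_k > 0:
        order = order[:top_k]                  # keep the k most likely
    if min_p > 0:
        cutoff = min_p * probs[order[0]]       # relative to the best token
        order = [i for i in order if probs[i] >= cutoff]
    kept, cum = [], 0.0
    for i in order:                            # nucleus (top-p) filter
        kept.append(i)
        cum += probs[i]
        if cum >= top_p:
            break
    return kept

logits = [2.0, 1.0, 0.1, -1.0]
print(sample_filter(logits, temperature=0))    # [0]     greedy pick
print(sample_filter(logits, top_k=2))          # [0, 1]  two best survive
```

In the real sampler one token is then drawn from the renormalized survivors; the filters only decide which tokens are eligible.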
4.2 API Call Examples with curl
# Basic Chat Completion
curl http://localhost:8000/v1/chat/completions \
-H "Content-Type: application/json" \
-d '{
"model": "meta-llama/Llama-3.1-8B-Instruct",
"messages": [
{"role": "user", "content": "What is the population of Seoul?"}
],
"temperature": 0.3,
"top_p": 0.9,
"max_tokens": 256,
"frequency_penalty": 0.5,
"seed": 42
}'
# Structured Output (JSON mode)
curl http://localhost:8000/v1/chat/completions \
-H "Content-Type: application/json" \
-d '{
"model": "meta-llama/Llama-3.1-8B-Instruct",
"messages": [
{"role": "user", "content": "Give me the population of Seoul, Busan, and Daegu in JSON"}
],
"response_format": {
"type": "json_schema",
"json_schema": {
"name": "city_population",
"schema": {
"type": "object",
"properties": {
"cities": {
"type": "array",
"items": {
"type": "object",
"properties": {
"name": {"type": "string"},
"population": {"type": "integer"}
},
"required": ["name", "population"]
}
}
},
"required": ["cities"]
}
}
},
"temperature": 0.1,
"max_tokens": 512
}'
# Returning logprobs
curl http://localhost:8000/v1/chat/completions \
-H "Content-Type: application/json" \
-d '{
"model": "meta-llama/Llama-3.1-8B-Instruct",
"messages": [
{"role": "user", "content": "1+1=?"}
],
"logprobs": true,
"top_logprobs": 5,
"max_tokens": 10
}'
4.3 Python requests Example
import requests
import json
url = "http://localhost:8000/v1/chat/completions"
headers = {"Content-Type": "application/json"}
payload = {
"model": "meta-llama/Llama-3.1-8B-Instruct",
"messages": [
{"role": "system", "content": "You are a helpful Korean assistant."},
{"role": "user", "content": "What is quantum computing?"},
],
"temperature": 0.7,
"top_p": 0.9,
"top_k": 50,
"max_tokens": 1024,
"repetition_penalty": 1.1,
"stop": ["\n\n\n"],
}
response = requests.post(url, headers=headers, json=payload)
result = response.json()
print(result["choices"][0]["message"]["content"])
4.4 Streaming Example with OpenAI SDK
from openai import OpenAI
client = OpenAI(base_url="http://localhost:8000/v1", api_key="token")
# Streaming response
stream = client.chat.completions.create(
model="meta-llama/Llama-3.1-8B-Instruct",
messages=[
{"role": "user", "content": "Implement quicksort in Python"},
],
temperature=0.2,
max_tokens=2048,
stream=True,
)
for chunk in stream:
if chunk.choices[0].delta.content:
print(chunk.choices[0].delta.content, end="", flush=True)
print()
5. Complete vLLM Environment Variables Reference
vLLM controls runtime behavior through various environment variables. Here is a categorized summary of key environment variables.
5.1 Core Environment Variables
| Environment Variable | Default | Description |
|---|---|---|
VLLM_TARGET_DEVICE | "cuda" | Target device (cuda, rocm, neuron, cpu, xpu) |
VLLM_USE_V1 | True | Use V1 code path |
VLLM_WORKER_MULTIPROC_METHOD | "fork" | Multiprocess spawn method (spawn, fork) |
VLLM_ALLOW_LONG_MAX_MODEL_LEN | False | Allow max_model_len longer than model config |
CUDA_VISIBLE_DEVICES | None | GPU device numbers to use |
5.2 Attention and Kernel Related
| Environment Variable | Default | Description |
|---|---|---|
VLLM_ATTENTION_BACKEND | None | Attention backend (deprecated, use --attention-backend from v0.14) |
VLLM_USE_TRITON_FLASH_ATTN | True | Use Triton Flash Attention |
VLLM_FLASH_ATTN_VERSION | None | Force Flash Attention version (2 or 3) |
VLLM_USE_FLASHINFER_SAMPLER | None | Use FlashInfer sampler |
VLLM_FLASHINFER_FORCE_TENSOR_CORES | False | Force FlashInfer tensor core usage |
VLLM_USE_TRITON_AWQ | False | Use Triton AWQ kernel |
VLLM_USE_DEEP_GEMM | False | Use DeepGemm kernel (MoE operations) |
VLLM_MLA_DISABLE | False | Disable MLA Attention optimization |
5.3 Logging Related
| Environment Variable | Default | Description |
|---|---|---|
VLLM_CONFIGURE_LOGGING | 1 | Auto-configure vLLM logging (0 to disable) |
VLLM_LOGGING_LEVEL | "INFO" | Default logging level |
VLLM_LOGGING_CONFIG_PATH | None | Custom logging config file path |
VLLM_LOGGING_PREFIX | "" | Prefix to prepend to log messages |
VLLM_LOG_BATCHSIZE_INTERVAL | -1 | Batch size logging interval (seconds, -1 disables) |
VLLM_TRACE_FUNCTION | 0 | Enable function call tracing |
VLLM_DEBUG_LOG_API_SERVER_RESPONSE | False | API response debug logging |
5.4 Distributed Processing Related
| Environment Variable | Default | Description |
|---|---|---|
VLLM_HOST_IP | "" | Node IP for distributed setup |
VLLM_PORT | 0 | Distributed communication port |
VLLM_NCCL_SO_PATH | None | NCCL library file path |
NCCL_DEBUG | None | NCCL debug level (INFO, WARN, TRACE) |
NCCL_SOCKET_IFNAME | None | Network interface for NCCL communication |
VLLM_PP_LAYER_PARTITION | None | Pipeline Parallelism layer partition strategy |
VLLM_DP_RANK | 0 | Data Parallel process rank |
VLLM_DP_SIZE | 1 | Data Parallel world size |
VLLM_DP_MASTER_IP | "127.0.0.1" | Data Parallel master node IP |
VLLM_DP_MASTER_PORT | 0 | Data Parallel master node port |
VLLM_USE_RAY_SPMD_WORKER | False | Ray SPMD worker execution |
VLLM_USE_RAY_COMPILED_DAG | False | Use Ray Compiled Graph API |
VLLM_SKIP_P2P_CHECK | False | Skip GPU P2P capability check |
5.5 HuggingFace and External Services
| Environment Variable | Default | Description |
|---|---|---|
HF_TOKEN | None | HuggingFace API token |
HUGGING_FACE_HUB_TOKEN | None | HuggingFace Hub token (legacy) |
VLLM_USE_MODELSCOPE | False | Load models from ModelScope |
VLLM_API_KEY | None | vLLM API server auth key |
VLLM_NO_USAGE_STATS | False | Disable usage stats collection |
VLLM_DO_NOT_TRACK | False | Opt out of tracking |
5.6 Cache and Paths
| Environment Variable | Default | Description |
|---|---|---|
VLLM_CONFIG_ROOT | ~/.config/vllm | Config file root directory |
VLLM_CACHE_ROOT | ~/.cache/vllm | Cache file root directory |
VLLM_ASSETS_CACHE | ~/.cache/vllm/assets | Downloaded assets cache path |
VLLM_RPC_BASE_PATH | System temp | IPC multiprocessing path |
5.7 Environment Variable Usage Examples
# Multi-GPU + logging + HF token setup
export CUDA_VISIBLE_DEVICES=0,1,2,3
export HF_TOKEN="hf_xxxxxxxxxxxx"
export VLLM_LOGGING_LEVEL="DEBUG"
export VLLM_WORKER_MULTIPROC_METHOD="spawn"
vllm serve meta-llama/Llama-3.1-70B-Instruct \
--tensor-parallel-size 4 \
--gpu-memory-utilization 0.90
# Passing environment variables in Docker
docker run --runtime nvidia --gpus all \
-e CUDA_VISIBLE_DEVICES=0,1 \
-e HF_TOKEN="hf_xxxxxxxxxxxx" \
-e VLLM_LOGGING_LEVEL="INFO" \
-e VLLM_WORKER_MULTIPROC_METHOD="spawn" \
-p 8000:8000 \
--ipc=host \
vllm/vllm-openai:latest \
--model meta-llama/Llama-3.1-8B-Instruct \
--tensor-parallel-size 2
6. Advanced vLLM Configuration
6.1 Multi-GPU Setup
Tensor Parallelism (TP): Distributes each layer of the model across multiple GPUs. The most commonly used approach on a single node.
# TP=4 (distribute model across 4 GPUs)
vllm serve meta-llama/Llama-3.1-70B-Instruct \
--tensor-parallel-size 4 \
--gpu-memory-utilization 0.90
Pipeline Parallelism (PP): Places model layers sequentially across multiple GPUs. Advantageous in slow interconnect environments.
# PP=2, TP=2 (4 GPUs total, 2x2 configuration)
vllm serve meta-llama/Llama-3.1-70B-Instruct \
--tensor-parallel-size 2 \
--pipeline-parallel-size 2
Multi-node setup (using Ray):
# Master node
ray start --head --port=6379
# Worker node
ray start --address=<master-ip>:6379
# Run vLLM (from master)
vllm serve meta-llama/Llama-3.1-405B-Instruct \
--tensor-parallel-size 8 \
--pipeline-parallel-size 2 \
--distributed-executor-backend ray
6.2 Quantization Details
AWQ (Activation-aware Weight Quantization):
# Using pre-quantized AWQ model
vllm serve TheBloke/Llama-2-13B-chat-AWQ \
--quantization awq \
--max-model-len 4096
# Faster with Marlin kernel (SM 80+ GPU)
vllm serve TheBloke/Llama-2-13B-chat-AWQ \
--quantization awq_marlin
GPTQ (Post-Training Quantization):
# GPTQ model (ExLlamaV2 kernel auto-used)
vllm serve TheBloke/Llama-2-13B-chat-GPTQ \
--quantization gptq
# Using Marlin kernel
vllm serve TheBloke/Llama-2-13B-chat-GPTQ \
--quantization gptq_marlin
FP8 (8-bit Floating Point): Hardware acceleration supported on H100, MI300x and above GPUs.
# Pre-quantized FP8 model
vllm serve neuralmagic/Meta-Llama-3.1-8B-Instruct-FP8 \
--quantization fp8
# Dynamic FP8 quantization (no pre-quantization needed)
vllm serve meta-llama/Llama-3.1-8B-Instruct \
--quantization fp8 \
--kv-cache-dtype fp8
BitsAndBytes 4-bit NF4: Instant quantization without calibration data.
vllm serve meta-llama/Llama-3.1-8B-Instruct \
--quantization bitsandbytes \
--load-format bitsandbytes \
--enforce-eager # BnB requires Eager mode
6.3 LoRA Serving
# Enable LoRA adapters
vllm serve meta-llama/Llama-3.1-8B-Instruct \
--enable-lora \
--max-loras 4 \
--max-lora-rank 64 \
--lora-modules \
korean-chat=/path/to/korean-lora \
code-assist=/path/to/code-lora
Specifying a LoRA model in API calls:
from openai import OpenAI
client = OpenAI(base_url="http://localhost:8000/v1", api_key="token")
# Use a specific LoRA adapter
response = client.chat.completions.create(
model="korean-chat", # LoRA adapter name
messages=[{"role": "user", "content": "Hello!"}],
temperature=0.7,
max_tokens=256,
)
6.4 Prefix Caching & Chunked Prefill
Automatic Prefix Caching: Reuses KV Cache for common prompt prefixes to reduce TTFT. Especially effective when many requests share the same system prompt.
# Enabled by default in v1, requires explicit flag in v0
vllm serve meta-llama/Llama-3.1-8B-Instruct \
--enable-prefix-caching
Chunked Prefill: Splits long prompts into chunks and interleaves Prefill and Decode. Prevents long prompts from blocking Decode of shorter requests.
vllm serve meta-llama/Llama-3.1-8B-Instruct \
--enable-chunked-prefill \
--max-num-batched-tokens 2048
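Prefix-cache reuse happens at whole-block granularity: only complete blocks of the shared prefix are served from cache, the trailing partial block is recomputed. A small worked example, assuming vLLM's default block size of 16 (the function name is invented for illustration):

```python
# Tokens of a shared prefix that come from cached KV blocks.
def cached_tokens(shared_prefix_len, block_size=16):
    return (shared_prefix_len // block_size) * block_size

# A 500-token system prompt shared by all requests:
print(cached_tokens(500))        # 496 tokens (31 full blocks)

# Prefill for a 600-token request recomputes only the remainder,
# which is where the TTFT reduction comes from:
print(600 - cached_tokens(500))  # 104 tokens instead of 600
```

The effect compounds with concurrency: every request after the first skips the shared blocks entirely.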
6.5 Structured Output (Guided Decoding)
# JSON Schema-based structured output
curl http://localhost:8000/v1/chat/completions \
-H "Content-Type: application/json" \
-d '{
"model": "meta-llama/Llama-3.1-8B-Instruct",
"messages": [
{"role": "user", "content": "Please provide Seoul weather info in JSON"}
],
"response_format": {
"type": "json_schema",
"json_schema": {
"name": "weather_info",
"schema": {
"type": "object",
"properties": {
"city": {"type": "string"},
"temperature_celsius": {"type": "number"},
"condition": {"type": "string"},
"humidity_percent": {"type": "integer"}
},
"required": ["city", "temperature_celsius", "condition"]
}
}
}
}'
# Regex-based output (Completion API)
curl http://localhost:8000/v1/completions \
-H "Content-Type: application/json" \
-d '{
"model": "meta-llama/Llama-3.1-8B-Instruct",
"prompt": "Generate a valid email address:",
"extra_body": {
"guided_regex": "[a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\\.[a-zA-Z]{2,}"
},
"max_tokens": 50
}'
6.6 Docker Deployment
# docker-compose.yaml
version: '3.8'
services:
vllm:
image: vllm/vllm-openai:latest
runtime: nvidia
deploy:
resources:
reservations:
devices:
- driver: nvidia
count: all
capabilities: [gpu]
ports:
- '8000:8000'
volumes:
- ~/.cache/huggingface:/root/.cache/huggingface
environment:
- HF_TOKEN=${HF_TOKEN}
- VLLM_LOGGING_LEVEL=INFO
- VLLM_WORKER_MULTIPROC_METHOD=spawn
ipc: host
command: >
--model meta-llama/Llama-3.1-8B-Instruct
--host 0.0.0.0
--port 8000
--tensor-parallel-size 2
--gpu-memory-utilization 0.90
--max-model-len 8192
--enable-prefix-caching
healthcheck:
test: ['CMD', 'curl', '-f', 'http://localhost:8000/health']
interval: 30s
timeout: 10s
retries: 3
start_period: 120s
# Run with Docker Compose
HF_TOKEN=hf_xxxx docker compose up -d
# Check logs
docker compose logs -f vllm
6.7 Kubernetes Deployment
# vllm-deployment.yaml
apiVersion: apps/v1
kind: Deployment
metadata:
name: vllm-llama3
namespace: ai-serving
labels:
app: vllm
spec:
replicas: 1
selector:
matchLabels:
app: vllm
template:
metadata:
labels:
app: vllm
spec:
containers:
- name: vllm
image: vllm/vllm-openai:latest
ports:
- containerPort: 8000
name: http
args:
- '--model'
- 'meta-llama/Llama-3.1-8B-Instruct'
- '--host'
- '0.0.0.0'
- '--port'
- '8000'
- '--tensor-parallel-size'
- '2'
- '--gpu-memory-utilization'
- '0.90'
- '--max-model-len'
- '8192'
env:
- name: HF_TOKEN
valueFrom:
secretKeyRef:
name: hf-secret
key: token
- name: VLLM_WORKER_MULTIPROC_METHOD
value: 'spawn'
resources:
limits:
nvidia.com/gpu: '2'
requests:
nvidia.com/gpu: '2'
memory: '32Gi'
cpu: '8'
volumeMounts:
- name: shm
mountPath: /dev/shm
- name: model-cache
mountPath: /root/.cache/huggingface
readinessProbe:
httpGet:
path: /health
port: 8000
initialDelaySeconds: 120
periodSeconds: 10
livenessProbe:
httpGet:
path: /health
port: 8000
initialDelaySeconds: 180
periodSeconds: 30
volumes:
- name: shm
emptyDir:
medium: Memory
sizeLimit: 2Gi
- name: model-cache
persistentVolumeClaim:
claimName: model-cache-pvc
nodeSelector:
nvidia.com/gpu.product: 'NVIDIA-A100-SXM4-80GB'
---
apiVersion: v1
kind: Service
metadata:
name: vllm-service
namespace: ai-serving
spec:
selector:
app: vllm
ports:
- port: 8000
targetPort: 8000
name: http
type: ClusterIP
---
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
name: vllm-hpa
namespace: ai-serving
spec:
scaleTargetRef:
apiVersion: apps/v1
kind: Deployment
name: vllm-llama3
minReplicas: 1
maxReplicas: 4
metrics:
- type: Pods
pods:
metric:
name: vllm_num_requests_running
target:
type: AverageValue
averageValue: '50'
Part 2: Ollama
7. Introduction to Ollama
Ollama is an open-source tool that makes it easy to run LLMs in local environments. Much as Docker pulls container images on demand, a single command such as ollama run llama3.1 downloads the model and drops you straight into an interactive chat.
7.1 Architecture Features
- GGUF-based: Uses llama.cpp's GGUF (GPT-Generated Unified Format) quantized models
- llama.cpp engine: Internally uses llama.cpp as the inference engine
- Single binary: Go server + llama.cpp C++ engine distributed as a single binary
- Automatic GPU acceleration: Auto-detects NVIDIA CUDA, AMD ROCm, Apple Metal for GPU offloading
- Model registry: Pull/push pre-quantized models from ollama.com/library, much like Docker Hub
7.2 Supported Models
| Category | Models | Size |
|---|---|---|
| Meta Llama | llama3.1, llama3.2, llama3.3 | 1B ~ 405B |
| Mistral | mistral, mixtral | 7B ~ 8x22B |
| Google | gemma, gemma2, gemma3 | 2B ~ 27B |
| Microsoft | phi3, phi4 | 3.8B ~ 14B |
| DeepSeek | deepseek-r1, deepseek-v3, deepseek-coder-v2 | 1.5B ~ 671B |
| Qwen | qwen, qwen2, qwen2.5, qwen3 | 0.5B ~ 72B |
| Code | codellama, starcoder2, qwen2.5-coder | 3B ~ 34B |
| Embedding | nomic-embed-text, mxbai-embed-large, all-minilm | - |
| Multimodal | llava, bakllava, llama3.2-vision | 7B ~ 90B |
8. Ollama Installation and Startup
8.1 Platform-Specific Installation
macOS:
# Homebrew
brew install ollama
# Or official install script
curl -fsSL https://ollama.com/install.sh | sh
Linux:
# Official install script (recommended)
curl -fsSL https://ollama.com/install.sh | sh
# Or manual installation
sudo curl -L https://ollama.com/download/ollama-linux-amd64 -o /usr/local/bin/ollama
sudo chmod +x /usr/local/bin/ollama
Windows:
Download and run the Windows installer from the official website (ollama.com).
Docker:
# CPU only
docker run -d -v ollama:/root/.ollama -p 11434:11434 \
--name ollama ollama/ollama
# NVIDIA GPU
docker run -d --gpus=all -v ollama:/root/.ollama -p 11434:11434 \
--name ollama ollama/ollama
# AMD GPU (ROCm)
docker run -d --device /dev/kfd --device /dev/dri \
-v ollama:/root/.ollama -p 11434:11434 \
--name ollama ollama/ollama:rocm
8.2 Basic Usage
# Start server (if not auto-started in background)
ollama serve
# Download model and start chatting
ollama run llama3.1
# Specify a tag (size/quantization)
ollama run llama3.1:8b
ollama run llama3.1:70b-instruct-q4_K_M
ollama run qwen2.5:32b-instruct-q5_K_M
# Download model only (without running)
ollama pull llama3.1:8b
# One-line prompt
ollama run llama3.1 "What is PagedAttention?"
9. Complete Ollama CLI Commands Reference
9.1 Command Summary
| Command | Description | Key Options |
|---|---|---|
| ollama serve | Start Ollama server | Check env vars with --help |
| ollama run <model> | Run model (auto-pulls if missing) | --verbose, --nowordwrap, --format json |
| ollama pull <model> | Download model | --insecure |
| ollama push <model> | Upload model to registry | --insecure |
| ollama create <model> | Create custom model from Modelfile | -f <Modelfile>, --quantize |
| ollama list / ollama ls | List installed models | - |
| ollama show <model> | Show model details | --modelfile, --parameters, --system, --template, --license |
| ollama cp <src> <dst> | Copy model | - |
| ollama rm <model> | Delete model | - |
| ollama ps | List running models | - |
| ollama stop <model> | Stop a running model | - |
| ollama signin | Sign in to ollama.com | - |
| ollama signout | Sign out from ollama.com | - |
9.2 Detailed Command Examples
ollama serve - Start server:
# Default start (localhost:11434)
ollama serve
# Change bind address via environment variable
OLLAMA_HOST=0.0.0.0:11434 ollama serve
# Debug mode
OLLAMA_DEBUG=1 ollama serve
ollama run - Run model:
# Interactive mode
ollama run llama3.1
# One-line prompt
ollama run llama3.1 "Explain quantum computing"
# JSON format output
ollama run llama3.1 "List 3 Korean cities" --format json
# Multimodal (image input)
ollama run llama3.2-vision "What's in this image? /path/to/image.png"
# Verbose mode (display performance stats)
ollama run llama3.1 --verbose
# With system prompt
ollama run llama3.1 --system "You are a Korean translator."
ollama create - Create custom model:
# Create from Modelfile
ollama create my-model -f ./Modelfile
# Create from GGUF file
ollama create my-model -f ./Modelfile-from-gguf
# Quantization conversion
ollama create my-model-q4 --quantize q4_K_M -f ./Modelfile
ollama show - Check model info:
# Full info
ollama show llama3.1
# Output Modelfile
ollama show llama3.1 --modelfile
# Check parameters
ollama show llama3.1 --parameters
# Check system prompt
ollama show llama3.1 --system
# Check template
ollama show llama3.1 --template
ollama ps - Running models:
$ ollama ps
NAME ID SIZE PROCESSOR UNTIL
llama3.1:8b af2e33d4e25 6.7 GB 100% GPU 4 minutes from now
qwen2.5:7b 845dbda0ea48 4.7 GB 100% GPU 3 minutes from now
10. Ollama API Endpoints
Ollama provides both a REST API and an OpenAI-compatible API. The default address is http://localhost:11434.
10.1 Native API Endpoints
| Endpoint | Method | Description |
|---|---|---|
| /api/generate | POST | Text Completion generation |
| /api/chat | POST | Chat Completion generation |
| /api/embed | POST | Generate embedding vectors |
| /api/tags | GET | List local models |
| /api/show | POST | Model details |
| /api/pull | POST | Download model |
| /api/push | POST | Upload model |
| /api/create | POST | Create custom model |
| /api/copy | POST | Copy model |
| /api/delete | DELETE | Delete model |
| /api/ps | GET | List running models |
| /api/version | GET | Ollama version info |
10.2 OpenAI-Compatible Endpoints
| Endpoint | Method | Description |
|---|---|---|
| /v1/chat/completions | POST | OpenAI Chat Completion compatible |
| /v1/completions | POST | OpenAI Completion compatible |
| /v1/models | GET | Model list (OpenAI format) |
| /v1/embeddings | POST | Embeddings (OpenAI format) |
10.3 API Call Examples
Generate (Completion):
# Basic generation
curl http://localhost:11434/api/generate -d '{
"model": "llama3.1",
"prompt": "Why is the sky blue?",
"stream": false
}'
# Streaming (default)
curl http://localhost:11434/api/generate -d '{
"model": "llama3.1",
"prompt": "Write a haiku about coding",
"options": {
"temperature": 0.7,
"num_predict": 100
}
}'
# JSON format output
curl http://localhost:11434/api/generate -d '{
"model": "llama3.1",
"prompt": "List 3 programming languages as JSON",
"format": "json",
"stream": false
}'
Chat (Conversation):
curl http://localhost:11434/api/chat -d '{
"model": "llama3.1",
"messages": [
{"role": "system", "content": "You are a helpful Korean assistant."},
{"role": "user", "content": "Recommend tourist spots in Seoul."}
],
"stream": false,
"options": {
"temperature": 0.8,
"top_p": 0.9,
"num_ctx": 4096,
"num_predict": 512
}
}'
Embed (Embeddings):
# Single text embedding
curl http://localhost:11434/api/embed -d '{
"model": "nomic-embed-text",
"input": "Hello, world!"
}'
# Multiple text embeddings
curl http://localhost:11434/api/embed -d '{
"model": "nomic-embed-text",
"input": ["Hello world", "Goodbye world"]
}'
OpenAI-Compatible API:
# OpenAI format Chat Completion
curl http://localhost:11434/v1/chat/completions \
-H "Content-Type: application/json" \
-d '{
"model": "llama3.1",
"messages": [
{"role": "user", "content": "Hello!"}
],
"temperature": 0.7,
"max_tokens": 256
}'
# Model list
curl http://localhost:11434/v1/models
Calling from Python:
import requests
# Generate API
response = requests.post("http://localhost:11434/api/generate", json={
"model": "llama3.1",
"prompt": "Explain Docker in Korean",
"stream": False,
"options": {
"temperature": 0.7,
"num_predict": 512,
},
})
print(response.json()["response"])
# Chat API
response = requests.post("http://localhost:11434/api/chat", json={
"model": "llama3.1",
"messages": [
{"role": "user", "content": "What is Kubernetes?"},
],
"stream": False,
})
print(response.json()["message"]["content"])
# Using Ollama with OpenAI SDK
from openai import OpenAI
client = OpenAI(
base_url="http://localhost:11434/v1",
api_key="ollama", # Ollama doesn't require an API key, any value works
)
response = client.chat.completions.create(
model="llama3.1",
messages=[
{"role": "user", "content": "Explain Python's GIL."},
],
temperature=0.7,
max_tokens=512,
)
print(response.choices[0].message.content)
11. Ollama Parameters (Modelfile & API)
11.1 Modelfile Structure
A Modelfile defines an Ollama custom model. It has a structure similar to a Dockerfile.
# Specify base model (required)
FROM llama3.1:8b
# Parameter settings
PARAMETER temperature 0.7
PARAMETER top_p 0.9
PARAMETER top_k 40
PARAMETER num_ctx 4096
PARAMETER num_predict 512
PARAMETER repeat_penalty 1.1
PARAMETER stop "<|eot_id|>"
PARAMETER stop "<|end_of_text|>"
# System prompt
SYSTEM """
You are a friendly Korean AI assistant.
You provide accurate and concise answers, explaining with examples when needed.
"""
# Chat template (Jinja2 or Go template)
TEMPLATE """
{{- if .System }}<|start_header_id|>system<|end_header_id|>
{{ .System }}<|eot_id|>{{ end }}
{{- range .Messages }}<|start_header_id|>{{ .Role }}<|end_header_id|>
{{ .Content }}<|eot_id|>{{ end }}<|start_header_id|>assistant<|end_header_id|>
"""
# LoRA adapter (optional)
ADAPTER /path/to/lora-adapter.gguf
# License info (optional)
LICENSE """
Apache 2.0
"""
| Directive | Description | Required |
|---|---|---|
| FROM | Base model (model name or GGUF file path) | Required |
| PARAMETER | Model parameter settings | Optional |
| TEMPLATE | Prompt template | Optional |
| SYSTEM | System prompt | Optional |
| ADAPTER | LoRA/QLoRA adapter path | Optional |
| LICENSE | License information | Optional |
| MESSAGE | Pre-set conversation history | Optional |
11.2 PARAMETER Options Detail
| Parameter | Type | Default | Range/Description |
|---|---|---|---|
| temperature | float | 0.8 | 0.0~2.0. Higher is more creative, lower is more deterministic |
| top_p | float | 0.9 | 0.0~1.0. Nucleus sampling probability threshold |
| top_k | int | 40 | 1~100. Consider only top k tokens |
| min_p | float | 0.0 | 0.0~1.0. Minimum probability filtering |
| num_predict | int | -1 | Maximum tokens to generate (-1: unlimited, -2: until context fills) |
| num_ctx | int | 2048 | Context window size (in tokens) |
| repeat_penalty | float | 1.1 | Repetition penalty (1.0 disables) |
| repeat_last_n | int | 64 | Repetition check range (0: disabled, -1: num_ctx) |
| seed | int | 0 | Random seed (0 means different results each time) |
| stop | string | - | Stop string (multiple can be specified) |
| num_gpu | int | auto | Number of layers to offload to GPU (0: CPU only) |
| num_thread | int | auto | Number of CPU threads |
| num_batch | int | 512 | Prompt processing batch size |
| mirostat | int | 0 | Mirostat sampling (0: disabled, 1: Mirostat, 2: Mirostat 2.0) |
| mirostat_eta | float | 0.1 | Mirostat learning rate |
| mirostat_tau | float | 5.0 | Mirostat target entropy |
| tfs_z | float | 1.0 | Tail-Free Sampling (1.0 disables) |
| typical_p | float | 1.0 | Locally Typical Sampling (1.0 disables) |
| use_mlock | bool | false | Lock model in memory (prevent swap) |
| num_keep | int | 0 | Number of tokens to keep during context recycling |
| penalize_newline | bool | true | Apply penalty to newline tokens |
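The sampling parameters above compose: top_k first prunes the distribution to the k most probable tokens, then top_p keeps the smallest prefix whose cumulative probability reaches the threshold. A minimal illustrative sketch of that filtering (plain Python for intuition only, not Ollama's actual llama.cpp sampler):

```python
def filter_candidates(probs: dict[str, float], top_k: int, top_p: float) -> dict[str, float]:
    """Apply top-k, then nucleus (top-p) filtering, and renormalize."""
    # top-k: keep only the k most probable tokens
    ranked = sorted(probs.items(), key=lambda kv: kv[1], reverse=True)[:top_k]
    # top-p: keep the smallest prefix whose cumulative probability >= top_p
    kept, cum = [], 0.0
    for tok, p in ranked:
        kept.append((tok, p))
        cum += p
        if cum >= top_p:
            break
    total = sum(p for _, p in kept)
    return {tok: p / total for tok, p in kept}

probs = {"the": 0.5, "a": 0.3, "cat": 0.15, "dog": 0.05}
print(filter_candidates(probs, top_k=3, top_p=0.9))
```

With temperature applied beforehand (dividing logits by `temperature` before softmax), a lower temperature sharpens the distribution so fewer tokens survive these filters.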
11.3 Using Parameters in API
Pass parameters via the options field in API calls.
curl http://localhost:11434/api/chat -d '{
"model": "llama3.1",
"messages": [
{"role": "user", "content": "Hello"}
],
"options": {
"temperature": 0.3,
"top_p": 0.9,
"top_k": 50,
"num_ctx": 8192,
"num_predict": 1024,
"repeat_penalty": 1.2,
"seed": 42,
"stop": ["<|eot_id|>"]
}
}'
12. Complete Ollama Environment Variables Reference
12.1 Server and Network
| Environment Variable | Default | Description |
|---|---|---|
| OLLAMA_HOST | 127.0.0.1:11434 | Server bind address and port |
| OLLAMA_ORIGINS | None | CORS allowed origins (comma-separated) |
| OLLAMA_KEEP_ALIVE | 5m | Idle time before model unload (5m, 1h, -1=permanent) |
| OLLAMA_MAX_QUEUE | 512 | Maximum queue size (requests rejected when exceeded) |
| OLLAMA_NUM_PARALLEL | 1 | Concurrent requests per model |
| OLLAMA_MAX_LOADED_MODELS | 1 (CPU), GPUs*3 | Maximum simultaneously loaded models |
12.2 Storage and Paths
| Environment Variable | Default | Description |
|---|---|---|
| OLLAMA_MODELS | OS default path | Model storage directory |
| OLLAMA_TMPDIR | System temp | Temporary file directory |
| OLLAMA_NOPRUNE | None | Disable unused blob cleanup on boot |
Default model storage paths by platform:
| OS | Default Path |
|---|---|
| macOS | ~/.ollama/models |
| Linux | /usr/share/ollama/.ollama/models |
| Windows | C:\Users\<user>\.ollama\models |
12.3 GPU and Performance
| Environment Variable | Default | Description |
|---|---|---|
| OLLAMA_FLASH_ATTENTION | 0 | Enable Flash Attention (set to 1) |
| OLLAMA_KV_CACHE_TYPE | f16 | KV Cache quantization type (f16, q8_0, q4_0) |
| OLLAMA_GPU_OVERHEAD | 0 | VRAM to reserve per GPU (bytes) |
| OLLAMA_LLM_LIBRARY | auto | Force specific LLM library |
| CUDA_VISIBLE_DEVICES | All GPUs | NVIDIA GPU device numbers to use |
| ROCR_VISIBLE_DEVICES | All GPUs | AMD GPU device numbers to use |
| GPU_DEVICE_ORDINAL | All GPUs | GPU order to use |
12.4 Logging and Debug
| Environment Variable | Default | Description |
|---|---|---|
| OLLAMA_DEBUG | 0 | Enable debug logging (set to 1) |
| OLLAMA_NOHISTORY | 0 | Disable readline history in interactive mode |
12.5 Context and Inference
| Environment Variable | Default | Description |
|---|---|---|
| OLLAMA_CONTEXT_LENGTH | 4096 | Default context window size |
| OLLAMA_NO_CLOUD | 0 | Disable cloud features (set to 1) |
| HTTPS_PROXY / HTTP_PROXY | None | Proxy server settings |
| NO_PROXY | None | Proxy bypass hosts |
12.6 How to Set Environment Variables
macOS (launchctl):
# Set environment variables
launchctl setenv OLLAMA_HOST "0.0.0.0:11434"
launchctl setenv OLLAMA_MODELS "/Volumes/ExternalSSD/ollama/models"
launchctl setenv OLLAMA_FLASH_ATTENTION "1"
launchctl setenv OLLAMA_KV_CACHE_TYPE "q8_0"
launchctl setenv OLLAMA_NUM_PARALLEL "4"
launchctl setenv OLLAMA_KEEP_ALIVE "-1"
# Restart Ollama
brew services restart ollama
Linux (systemd):
# Create systemd service override
sudo systemctl edit ollama
# Add the following in the editor:
[Service]
Environment="OLLAMA_HOST=0.0.0.0:11434"
Environment="OLLAMA_MODELS=/data/ollama/models"
Environment="OLLAMA_FLASH_ATTENTION=1"
Environment="OLLAMA_KV_CACHE_TYPE=q8_0"
Environment="OLLAMA_NUM_PARALLEL=4"
Environment="OLLAMA_KEEP_ALIVE=-1"
Environment="OLLAMA_MAX_LOADED_MODELS=3"
Environment="CUDA_VISIBLE_DEVICES=0,1"
# Restart service
sudo systemctl daemon-reload
sudo systemctl restart ollama
Docker:
docker run -d --gpus=all \
-e OLLAMA_HOST=0.0.0.0:11434 \
-e OLLAMA_FLASH_ATTENTION=1 \
-e OLLAMA_KV_CACHE_TYPE=q8_0 \
-e OLLAMA_NUM_PARALLEL=4 \
-e OLLAMA_KEEP_ALIVE=-1 \
-v /data/ollama:/root/.ollama \
-p 11434:11434 \
--name ollama \
ollama/ollama
13. Advanced Ollama Usage
13.1 Modelfile Writing Guide
Korean Assistant Model:
FROM llama3.1:8b
PARAMETER temperature 0.7
PARAMETER top_p 0.9
PARAMETER num_ctx 4096
PARAMETER num_predict 1024
PARAMETER repeat_penalty 1.15
PARAMETER stop "<|eot_id|>"
PARAMETER stop "<|end_of_text|>"
SYSTEM """
You are a Korean AI assistant well-versed in Korean culture and history.
You always respond accurately and kindly in Korean, using English technical terms alongside when needed.
You provide answers in a structured format.
"""
MESSAGE user Hello, please introduce yourself.
MESSAGE assistant Hello! I'm an AI assistant specialized in Korean. I can help with various topics including Korean culture, history, technology, and more. Feel free to ask me anything!
# Create model
ollama create korean-assistant -f ./Modelfile-korean
# Run
ollama run korean-assistant "Tell me about the three grand palaces of Seoul"
Code Review Model:
FROM qwen2.5-coder:7b
PARAMETER temperature 0.2
PARAMETER top_p 0.85
PARAMETER num_ctx 8192
PARAMETER num_predict 2048
SYSTEM """
You are an expert code reviewer. Analyze code for:
1. Bugs and potential issues
2. Performance improvements
3. Security vulnerabilities
4. Code style and best practices
Provide specific, actionable feedback with corrected code examples.
"""
Quantization Level Selection Guide:
| Quantization | Size Ratio | Quality | Speed | Recommended Use |
|---|---|---|---|---|
| Q2_K | ~30% | Low | Very Fast | Testing only |
| Q3_K_M | ~37% | Fair | Fast | Memory-constrained |
| Q4_0 | ~42% | Good | Fast | General use (default) |
| Q4_K_M | ~45% | Good+ | Fast | General use (recommended) |
| Q5_K_M | ~53% | Great | Medium | Quality-focused |
| Q6_K | ~62% | Excellent | Medium | High quality required |
| Q8_0 | ~80% | Best | Slow | Near-original quality |
| F16 | 100% | Original | Slow | Baseline/benchmark |
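A quantization level corresponds roughly to a bits-per-weight figure, so a GGUF file size can be estimated as parameter count × bits-per-weight ÷ 8. A rough estimator (the bits-per-weight values below are approximations; real GGUF files keep some tensors such as embeddings at higher precision, so actual downloads run somewhat larger):

```python
# Approximate effective bits per weight for common GGUF quantization levels.
# These are ballpark figures, not exact format constants.
BITS_PER_WEIGHT = {
    "Q2_K": 2.6, "Q3_K_M": 3.9, "Q4_0": 4.5, "Q4_K_M": 4.8,
    "Q5_K_M": 5.7, "Q6_K": 6.6, "Q8_0": 8.5, "F16": 16.0,
}

def approx_size_gb(n_params: float, quant: str) -> float:
    """Rough model file size in GB: parameters x bits-per-weight / 8."""
    return n_params * BITS_PER_WEIGHT[quant] / 8 / 1e9

# An 8B model at different quantization levels
for quant in ("Q4_K_M", "Q5_K_M", "Q8_0", "F16"):
    print(f"{quant:7s} ~{approx_size_gb(8e9, quant):.1f} GB")
```

For llama3.1:8b this lands near the ~4.9 GB download Ollama reports for its default Q4_K_M tag.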
13.2 GPU Acceleration Setup
NVIDIA GPU:
# Check NVIDIA driver
nvidia-smi
# Use specific GPU only
CUDA_VISIBLE_DEVICES=0 ollama serve
# Multi-GPU
CUDA_VISIBLE_DEVICES=0,1 ollama serve
AMD GPU (ROCm):
# Check ROCm driver
rocm-smi
# Specify GPU
ROCR_VISIBLE_DEVICES=0 ollama serve
Apple Silicon (Metal):
On macOS, Metal GPU acceleration is automatically enabled. No separate configuration needed.
# Check GPU usage (Processor column in ollama ps)
ollama ps
# NAME ID SIZE PROCESSOR UNTIL
# llama3.1:8b af2e33d4e25 6.7 GB 100% GPU 4 minutes from now
13.3 Docker Deployment
# docker-compose.yaml
version: '3.8'
services:
ollama:
image: ollama/ollama
deploy:
resources:
reservations:
devices:
- driver: nvidia
count: all
capabilities: [gpu]
ports:
- '11434:11434'
volumes:
- ollama_data:/root/.ollama
environment:
- OLLAMA_HOST=0.0.0.0:11434
- OLLAMA_FLASH_ATTENTION=1
- OLLAMA_KV_CACHE_TYPE=q8_0
- OLLAMA_NUM_PARALLEL=4
- OLLAMA_KEEP_ALIVE=24h
restart: unless-stopped
healthcheck:
# the ollama/ollama image does not ship curl; probe the server via the CLI
test: ['CMD', 'ollama', 'ls']
interval: 30s
timeout: 5s
retries: 3
# Model initialization (optional)
ollama-init:
image: curlimages/curl:latest
depends_on:
ollama:
condition: service_healthy
entrypoint: >
sh -c "
curl -s http://ollama:11434/api/pull -d '{\"name\": \"llama3.1:8b\"}' &&
curl -s http://ollama:11434/api/pull -d '{\"name\": \"nomic-embed-text\"}'
"
volumes:
ollama_data:
13.4 Multimodal Model Usage
# Run LLaVA model
ollama run llava "What's in this image? /path/to/photo.jpg"
# Llama 3.2 Vision
ollama run llama3.2-vision "Describe this image in Korean. /path/to/image.png"
import requests
import base64
# Encode image to base64
with open("image.jpg", "rb") as f:
image_base64 = base64.b64encode(f.read()).decode("utf-8")
response = requests.post("http://localhost:11434/api/chat", json={
"model": "llava",
"messages": [
{
"role": "user",
"content": "What's in this image?",
"images": [image_base64],
}
],
"stream": False,
})
print(response.json()["message"]["content"])
13.5 Tool Calling / Function Calling
Ollama supports OpenAI-compatible Tool Calling.
from openai import OpenAI
import json
client = OpenAI(base_url="http://localhost:11434/v1", api_key="ollama")
tools = [
{
"type": "function",
"function": {
"name": "get_weather",
"description": "Get current weather for a city",
"parameters": {
"type": "object",
"properties": {
"city": {"type": "string", "description": "City name"},
"unit": {
"type": "string",
"enum": ["celsius", "fahrenheit"],
},
},
"required": ["city"],
},
},
}
]
response = client.chat.completions.create(
model="llama3.1",
messages=[
{"role": "user", "content": "What's the current weather in Seoul?"}
],
tools=tools,
tool_choice="auto",
)
message = response.choices[0].message
if message.tool_calls:
for tool_call in message.tool_calls:
print(f"Function: {tool_call.function.name}")
print(f"Arguments: {tool_call.function.arguments}")
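Note that the model only emits the call; executing the function and sending the result back for a final answer is the application's job. A minimal dispatch sketch (get_weather here is a local stub for illustration, not a real weather API):

```python
import json

# Stub implementation; a real application would call an actual service here.
def get_weather(city: str, unit: str = "celsius") -> dict:
    return {"city": city, "temperature": 21, "unit": unit}

# Local registry mapping tool names to implementations
TOOL_REGISTRY = {"get_weather": get_weather}

def run_tool_call(name: str, arguments_json: str, tool_call_id: str) -> dict:
    """Execute one tool call and package the result as a 'tool' role message."""
    result = TOOL_REGISTRY[name](**json.loads(arguments_json))
    return {"role": "tool", "tool_call_id": tool_call_id, "content": json.dumps(result)}

# After receiving message.tool_calls, execute each call and request a final answer:
# messages.append(message)
# for tc in message.tool_calls:
#     messages.append(run_tool_call(tc.function.name, tc.function.arguments, tc.id))
# final = client.chat.completions.create(model="llama3.1", messages=messages)
```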
Part 3: Comparison and Practice
14. vLLM vs Ollama Comparison
14.1 Comprehensive Comparison Table
| Item | vLLM | Ollama |
|---|---|---|
| Primary Use | Production API serving, high-throughput inference | Local development, prototyping, personal use |
| Engine | Custom engine (PagedAttention) | llama.cpp |
| Model Format | HF Safetensors, AWQ, GPTQ, FP8 | GGUF (quantized) |
| API | OpenAI compatible | Native + OpenAI compatible |
| Install Difficulty | Medium (Python/CUDA env required) | Very Easy (single binary) |
| GPU Required | Nearly essential (NVIDIA/AMD) | Optional (runs on CPU) |
| Multi-GPU | TP + PP (up to hundreds of GPUs) | Auto-distributed (limited) |
| Concurrency | Hundreds~thousands of requests | Default 1~4 parallel |
| Quantization | AWQ, GPTQ, FP8, BnB | GGUF Q2~Q8, F16 |
| Continuous Batching | Supported | Not supported (llama.cpp limitation) |
| PagedAttention | Core technology | Not supported |
| Prefix Caching | Supported (automatic) | Not supported |
| LoRA Serving | Multi-LoRA concurrent serving | Single LoRA |
| Structured Output | JSON Schema, Regex, Grammar | JSON mode |
| Speculative Decoding | Supported (Draft model, N-gram) | Not supported |
| Streaming | Supported | Supported |
| Docker Deployment | Official image (GPU) | Official image (CPU/GPU) |
| Kubernetes | Official guide + Production Stack | Community Helm Chart |
| Memory Efficiency | Very high (less than 4% waste) | High (GGUF quantization) |
| License | Apache 2.0 | MIT |
14.2 Throughput Comparison (Llama 3.1 8B, RTX 4090)
| Concurrent Users | vLLM (tokens/s) | Ollama (tokens/s) | Ratio |
|---|---|---|---|
| 1 | ~140 | ~65 | 2.2x |
| 5 | ~500 | ~120 | 4.2x |
| 10 | ~800 | ~150 | 5.3x |
| 50 | ~1,200 | ~150 | 8.0x |
| 100 | ~1,500 | ~150 (queued) | 10.0x |
In Red Hat's benchmark, vLLM showed 793 TPS vs Ollama 41 TPS on the same hardware -- a 19x difference. This varies depending on concurrent requests, batch size, and model size.
15. Performance Benchmarks
15.1 Throughput Comparison
| Metric | vLLM | Ollama | Notes |
|---|---|---|---|
| Single Request TPS | 100~140 tok/s | 50~70 tok/s | RTX 4090, Llama 3.1 8B |
| 10 Concurrent Total TPS | 700~900 tok/s | 120~200 tok/s | Continuous Batching effect |
| 50 Concurrent Total TPS | 1,000~1,500 tok/s | ~150 tok/s | Ollama queues requests |
| Batch Inference (1K prompts) | 2,000~3,000 tok/s | Not supported | vLLM offline inference |
15.2 Latency Comparison
| Metric | vLLM | Ollama | Notes |
|---|---|---|---|
| TTFT (Time To First Token) | 50~200 ms | 100~500 ms | Varies by prompt length |
| TPOT (Time Per Output Token) | 7~15 ms | 15~25 ms | Single request basis |
| P99 Latency | 80~150 ms | 500~700 ms | 10 concurrent requests |
| Model Loading Time | 30~120 sec | 5~30 sec | GGUF loads faster |
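TTFT and TPOT can be measured client-side by timestamping streamed tokens. The metric arithmetic, given per-token arrival times measured from the moment the request was sent (an illustrative helper, not part of either project):

```python
def latency_metrics(token_times: list[float]) -> dict[str, float]:
    """token_times: arrival time in seconds of each streamed token,
    measured from when the request was sent."""
    ttft = token_times[0]  # Time To First Token
    # TPOT: average gap between consecutive output tokens (excludes TTFT)
    gaps = [b - a for a, b in zip(token_times, token_times[1:])]
    tpot = sum(gaps) / len(gaps)
    return {"ttft_s": ttft, "tpot_s": tpot, "decode_tok_per_s": 1 / tpot}

# e.g. first token after 120 ms, then one token every 10 ms
times = [0.12 + 0.01 * i for i in range(100)]
m = latency_metrics(times)
print(f"TTFT {m['ttft_s']*1000:.0f} ms, TPOT {m['tpot_s']*1000:.1f} ms")
```

Feeding this with timestamps recorded while iterating a streaming response (`stream=True` in either API) reproduces the table's TTFT/TPOT columns for your own hardware.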
15.3 Memory Usage Comparison (Llama 3.1 8B)
| Configuration | vLLM GPU Memory | Ollama GPU Memory | Notes |
|---|---|---|---|
| FP16 | ~16 GB | N/A | vLLM default |
| FP8 | ~9 GB | N/A | H100 only |
| AWQ 4-bit | ~5 GB | N/A | vLLM quantized |
| GPTQ 4-bit | ~5 GB | N/A | vLLM quantized |
| Q4_K_M (GGUF) | N/A | ~5.5 GB | Ollama default |
| Q5_K_M (GGUF) | N/A | ~6.2 GB | Higher quality |
| Q8_0 (GGUF) | N/A | ~9 GB | Best quantization quality |
| KV Cache included (4K ctx) | +0.5~2 GB | +0.5~1.5 GB | Proportional to sequences |
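The KV Cache row can be sanity-checked from the model architecture. For Llama 3.1 8B (32 layers, 8 KV heads under GQA, head dim 128), each token stores one key and one value vector per layer:

```python
def kv_cache_bytes(n_layers: int, n_kv_heads: int, head_dim: int,
                   ctx_len: int, bytes_per_elem: int = 2) -> int:
    """KV cache for one sequence: K and V, per layer, per token."""
    per_token = 2 * n_layers * n_kv_heads * head_dim * bytes_per_elem
    return per_token * ctx_len

# Llama 3.1 8B: 32 layers, 8 KV heads (GQA), head_dim 128, FP16 cache
size = kv_cache_bytes(32, 8, 128, ctx_len=4096)
print(f"{size / 2**30:.2f} GiB")  # 0.50 GiB
```

That 0.5 GiB per 4K-context sequence is why quantized KV cache (vLLM's FP8 KV cache, Ollama's OLLAMA_KV_CACHE_TYPE=q8_0) roughly halves this overhead.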
16. Recommended Scenarios
16.1 Individual Developer Local Environment
Recommended: Ollama
# Install and use immediately
ollama run llama3.1
# VS Code + Continue extension integration
# Set Ollama endpoint in settings.json
Reason: Simple installation, runs on CPU, supports macOS/Windows/Linux. Easy integration with IDE extensions.
16.2 Production API Serving
Recommended: vLLM
vllm serve meta-llama/Llama-3.1-70B-Instruct \
--tensor-parallel-size 4 \
--gpu-memory-utilization 0.90 \
--max-num-seqs 256 \
--enable-prefix-caching \
--enable-chunked-prefill \
--api-key ${API_KEY}
Reason: Overwhelming concurrent request handling with Continuous Batching. High memory efficiency with PagedAttention. Mature multi-GPU support, Kubernetes deployment, and monitoring integration.
16.3 Edge/IoT Environments
Recommended: Ollama + High Quantization
# Small model + high quantization
ollama run phi3:3.8b-mini-instruct-4k-q4_0
# Or Qwen 0.5B
ollama run qwen2.5:0.5b
Reason: Simple deployment as single binary. Runs on low-spec hardware with GGUF quantization. CPU-only inference support.
16.4 Large-Scale Batch Inference
Recommended: vLLM Offline Inference
from vllm import LLM, SamplingParams
llm = LLM(
model="meta-llama/Llama-3.1-8B-Instruct",
tensor_parallel_size=2,
gpu_memory_utilization=0.95,
)
# Process thousands of prompts at once
prompts = load_prompts_from_file("prompts.jsonl") # 10,000+ prompts
sampling_params = SamplingParams(temperature=0.0, max_tokens=512)
outputs = llm.generate(prompts, sampling_params)
save_outputs(outputs, "results.jsonl")
Reason: Batch scheduling that maximizes GPU memory utilization. Efficiently processes thousands to tens of thousands of prompts.
16.5 RAG Pipeline
Both work -- choose based on situation:
# Ollama-based RAG (development/small-scale)
from langchain_ollama import OllamaLLM, OllamaEmbeddings
llm = OllamaLLM(model="llama3.1")
embeddings = OllamaEmbeddings(model="nomic-embed-text")
# vLLM-based RAG (production)
from langchain_openai import ChatOpenAI, OpenAIEmbeddings
llm = ChatOpenAI(
base_url="http://vllm-server:8000/v1",
api_key="token",
model="meta-llama/Llama-3.1-8B-Instruct",
)
17. Request Tracing Integration
Tracking LLM requests in production environments is essential for debugging, auditing, and performance monitoring.
17.1 vLLM Request ID Tracking
vLLM's OpenAI-compatible server assigns each request an ID. To correlate requests with your own tracking, send a custom X-Request-ID header (with the OpenAI SDK, via extra_headers).
from openai import OpenAI
import uuid
client = OpenAI(base_url="http://localhost:8000/v1", api_key="token")
# Pass custom request_id
xid = str(uuid.uuid4())
response = client.chat.completions.create(
model="meta-llama/Llama-3.1-8B-Instruct",
messages=[{"role": "user", "content": "Hello"}],
extra_headers={"X-Request-ID": xid},
)
print(f"XID: {xid}")
print(f"Response ID: {response.id}")
17.2 Ollama Request Tracking
Ollama's native API does not support a separate request ID, so handle it at the reverse proxy level.
import requests
import uuid
xid = str(uuid.uuid4())
response = requests.post(
"http://localhost:11434/api/chat",
headers={"X-Request-ID": xid},
json={
"model": "llama3.1",
"messages": [{"role": "user", "content": "Hello"}],
"stream": False,
},
)
# Include xid in logging
import logging
logger = logging.getLogger(__name__)
logger.info(f"[xid={xid}] Response: {response.status_code}")
17.3 X-Request-ID Forwarding at API Gateway
NGINX Configuration:
upstream vllm_backend {
server vllm-server:8000;
}
server {
listen 80;
location /v1/ {
# Use the client's X-Request-ID, or fall back to NGINX's built-in $request_id
set $xid $http_x_request_id;
if ($xid = "") {
set $xid $request_id;
}
proxy_pass http://vllm_backend;
proxy_set_header X-Request-ID $xid;
proxy_set_header Host $host;
# Add X-Request-ID to response headers
add_header X-Request-ID $xid always;
# Include xid in access log
access_log /var/log/nginx/vllm_access.log combined_with_xid;
}
}
# Log format definition (belongs in the http context)
log_format combined_with_xid '$remote_addr - $remote_user [$time_local] '
'"$request" $status $body_bytes_sent '
'"$http_referer" "$http_user_agent" '
'xid="$xid"';
17.4 OpenTelemetry Integration
# vLLM + OpenTelemetry distributed tracing
from opentelemetry import trace
from opentelemetry.exporter.otlp.proto.grpc.trace_exporter import OTLPSpanExporter
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import BatchSpanProcessor
# Initialize Tracer
provider = TracerProvider()
exporter = OTLPSpanExporter(endpoint="http://jaeger:4317")
provider.add_span_processor(BatchSpanProcessor(exporter))
trace.set_tracer_provider(provider)
tracer = trace.get_tracer(__name__)
# Wrap LLM call as a Span
def call_llm(prompt: str, xid: str) -> str:
with tracer.start_as_current_span("llm_inference") as span:
span.set_attribute("xid", xid)
span.set_attribute("model", "llama-3.1-8b")
span.set_attribute("prompt_length", len(prompt))
response = client.chat.completions.create(
model="meta-llama/Llama-3.1-8B-Instruct",
messages=[{"role": "user", "content": prompt}],
extra_headers={"X-Request-ID": xid},
)
result = response.choices[0].message.content
span.set_attribute("response_length", len(result))
span.set_attribute("tokens_used", response.usage.total_tokens)
return result
17.5 xid Usage Patterns in Logging
Python Example:
import logging
import uuid
from contextvars import ContextVar
# Manage xid with Context Variable
request_xid: ContextVar[str] = ContextVar("request_xid", default="")
class XIDFilter(logging.Filter):
def filter(self, record):
record.xid = request_xid.get("")
return True
# Logger setup
handler = logging.StreamHandler()
handler.setFormatter(logging.Formatter(
"%(asctime)s [%(levelname)s] [xid=%(xid)s] %(message)s"
))
handler.addFilter(XIDFilter())
logger = logging.getLogger("llm_service")
logger.addHandler(handler)
logger.setLevel(logging.INFO)
# Usage
async def handle_request(prompt: str):
xid = str(uuid.uuid4())
request_xid.set(xid)
logger.info(f"Received prompt: {prompt[:50]}...")
response = await call_llm(prompt, xid)
logger.info(f"Generated {len(response)} chars")
return {"xid": xid, "response": response}
Go Example:
package main
import (
"context"
"log/slog"
"net/http"
"github.com/google/uuid"
)
type contextKey string
const xidKey contextKey = "xid"
// XID Middleware
func xidMiddleware(next http.Handler) http.Handler {
return http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
xid := r.Header.Get("X-Request-ID")
if xid == "" {
xid = uuid.New().String()
}
ctx := context.WithValue(r.Context(), xidKey, xid)
w.Header().Set("X-Request-ID", xid)
slog.Info("request received",
"xid", xid,
"method", r.Method,
"path", r.URL.Path,
)
next.ServeHTTP(w, r.WithContext(ctx))
})
}
// Ollama call function
func callOllama(ctx context.Context, prompt string) (string, error) {
xid := ctx.Value(xidKey).(string)
slog.Info("calling ollama",
"xid", xid,
"prompt_len", len(prompt),
)
// ... Ollama API call logic ...
response := "" // placeholder: decode the actual Ollama API response here
slog.Info("ollama response received",
"xid", xid,
"response_len", len(response),
)
return response, nil
}
18. References
vLLM
- vLLM Official Documentation
- vLLM GitHub
- vLLM Server Arguments
- vLLM Environment Variables
- vLLM Docker Deployment Guide
- vLLM Kubernetes Deployment Guide
- vLLM Quantization
- vLLM Production Stack (GitHub)
Ollama
- Ollama Official Documentation
- Ollama GitHub
- Ollama API Documentation
- Ollama Modelfile Reference
- Ollama CLI Reference
- Ollama FAQ (including environment variables)
- Ollama Model Library
Papers and Technical Resources
- Efficient Memory Management for Large Language Model Serving with PagedAttention (2023)
- Inside vLLM: Anatomy of a High-Throughput LLM Inference System (vLLM Blog, 2025)
- Ollama vs. vLLM: Performance Benchmarking (Red Hat Developer, 2025)
- vLLM vs Ollama vs llama.cpp vs TGI vs TensorRT-LLM: 2025 Guide (ITECS)
Related Projects
- llama.cpp (GitHub) -- Ollama's inference engine
- HuggingFace Text Generation Inference (TGI)
- NVIDIA TensorRT-LLM
- OpenAI API Reference