- Part 1: vLLM
- Part 2: Ollama
- Part 3: Comparison and Practice
Part 1: vLLM
1. Introduction to vLLM
vLLM is a high-performance LLM inference and serving engine developed at UC Berkeley. Since its release alongside the PagedAttention paper in 2023, it has established itself as the de facto standard for production LLM serving. As of March 2026, the latest version is v0.16.x, with the transition to V1 architecture underway.
1.1 Core Principles of PagedAttention
In traditional LLM inference, each sequence's KV cache is allocated as a single contiguous region of GPU memory. Because this region is pre-reserved for the maximum sequence length, 60-80% of the reserved memory ends up wasted in practice.
PagedAttention introduces the operating system's Virtual Memory Paging concept to KV Cache management.
┌──────────────────────────────────────────────────┐
│ Traditional KV Cache                             │
│   Seq 1: [used][used][used][waste][waste][waste] │
│   Seq 2: [used][waste][waste][waste][waste]      │
│   Seq 3: [used][used][waste][waste][waste]       │
│   → 60~80% Memory Waste                          │
├──────────────────────────────────────────────────┤
│ PagedAttention KV Cache                          │
│   Physical Blocks: [B0][B1][B2][B3][B4][B5]...   │
│   Block Table:                                   │
│     Seq 1 → [B0, B3, B5]  (logical → physical)   │
│     Seq 2 → [B1, B4]                             │
│     Seq 3 → [B2, B6]                             │
│   → < 4% Memory Waste                            │
└──────────────────────────────────────────────────┘
The core mechanisms are as follows.
- Fixed-size blocks: KV Cache is split into fixed-size blocks (default 16 tokens)
- Block Table: Maintains a table mapping logical block numbers of sequences to physical block addresses
- Dynamic allocation: Physical blocks are allocated only as needed during token generation
- Copy-on-Write: When branching sequences (e.g., Beam Search), physical blocks are shared and copied only when modification is needed
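The four mechanisms above can be sketched as a toy allocator. This is an illustration only: the class and method names are invented for this sketch, not vLLM's internals; only the block size of 16 matches vLLM's default.

```python
# Toy model of PagedAttention block management (invented names,
# illustration only). A physical block is allocated lazily, one
# block at a time, as a sequence grows.
BLOCK_SIZE = 16  # tokens per KV-cache block (vLLM's default)

class Allocator:
    """Pool of free physical block IDs."""
    def __init__(self, num_blocks):
        self.free = list(range(num_blocks))
    def pop(self):
        return self.free.pop(0)

class BlockTable:
    """Maps a sequence's logical block indices to physical block IDs."""
    def __init__(self, allocator):
        self.allocator = allocator
        self.physical_blocks = []   # logical index -> physical block ID
        self.num_tokens = 0

    def append_token(self):
        # Allocate a new physical block only when the last one is full.
        if self.num_tokens % BLOCK_SIZE == 0:
            self.physical_blocks.append(self.allocator.pop())
        self.num_tokens += 1

alloc = Allocator(num_blocks=1024)
seq = BlockTable(alloc)
for _ in range(37):              # generate 37 tokens
    seq.append_token()

# ceil(37 / 16) = 3 blocks; waste is bounded by one partially
# filled block (< 16 tokens) per sequence, not by max_model_len.
print(seq.physical_blocks)       # [0, 1, 2]
```

Copy-on-Write extends this table: a forked sequence initially points at the same physical IDs and only copies a block when it needs to write into it.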
1.2 Continuous Batching
Traditional Static Batching holds every slot until all sequences in the batch complete. Continuous Batching instead evicts completed sequences and admits new requests at every decoding step.
Static Batching:
Step 1: [Seq1, Seq2, Seq3, Seq4] ← Waits even if Seq2 completes
Step 2: [Seq1, Seq2, Seq3, Seq4]
Step 3: [Seq1, ___, Seq3, Seq4] ← Slot wasted after Seq2 ends
...
Step N: Next batch starts after all sequences complete
Continuous Batching:
Step 1: [Seq1, Seq2, Seq3, Seq4]
Step 2: [Seq1, Seq5, Seq3, Seq4] ← Seq5 inserted immediately after Seq2 completes
Step 3: [Seq1, Seq5, Seq6, Seq4] ← Seq6 inserted immediately after Seq3 completes
→ Minimizes GPU idle time, maximizes throughput
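The throughput gap in the diagrams above can be made concrete with a small step-count simulation. The helper names are invented; each sequence is reduced to the number of decode steps it needs, and one batch slot serves one sequence per step.

```python
# Minimal simulation contrasting static and continuous batching.
# lengths = decode steps each sequence needs; batch has 4 slots.
from collections import deque

def static_batching(lengths, batch_size=4):
    """Every slot in a batch is held until the longest member finishes."""
    steps = 0
    q = deque(lengths)
    while q:
        batch = [q.popleft() for _ in range(min(batch_size, len(q)))]
        steps += max(batch)          # short sequences idle in their slots
    return steps

def continuous_batching(lengths, batch_size=4):
    """A finished sequence's slot is refilled on the very next step."""
    q = deque(lengths)
    running, steps = [], 0
    while q or running:
        while q and len(running) < batch_size:
            running.append(q.popleft())   # admit waiting requests
        steps += 1
        running = [r - 1 for r in running if r - 1 > 0]
    return steps

lengths = [10, 2, 3, 10, 8, 7]
print(static_batching(lengths))      # 18 steps (10 + 8 over two batches)
print(continuous_batching(lengths))  # 10 steps: freed slots are reused
```

With the same total work, continuous batching finishes in the time of the single longest sequence because no slot ever sits idle while requests are queued.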
1.3 Supported Models
vLLM supports virtually all major Transformer-based LLM architectures.
| Category | Supported Models |
|---|---|
| Meta Llama Family | Llama 2, Llama 3, Llama 3.1, Llama 3.2, Llama 3.3, Llama 4 |
| Mistral Family | Mistral 7B, Mixtral 8x7B, Mixtral 8x22B, Mistral Large, Mistral Small |
| Qwen Family | Qwen, Qwen 1.5, Qwen 2, Qwen 2.5, Qwen 3, QwQ |
| Google Family | Gemma, Gemma 2, Gemma 3 |
| DeepSeek Family | DeepSeek V2, DeepSeek V3, DeepSeek-R1 |
| Others | Phi-3/4, Yi, InternLM 2/3, Command R, DBRX, Falcon, StarCoder 2 |
| Multimodal | LLaVA, InternVL, Pixtral, Qwen-VL, MiniCPM-V |
| Embedding | E5-Mistral, GTE-Qwen, Jina Embeddings |
1.4 LLM Serving Engine Comparison
| Item | vLLM | TGI | TensorRT-LLM | llama.cpp |
|---|---|---|---|---|
| Developer | UC Berkeley / vLLM Project | Hugging Face | NVIDIA | Georgi Gerganov |
| Language | Python/C++/CUDA | Rust/Python | C++/CUDA | C/C++ |
| Core Technology | PagedAttention | Continuous Batching | FP8/INT4 kernel optimization | GGUF quantization |
| Multi-GPU | TP + PP | TP | TP + PP | Limited |
| Quantization | AWQ, GPTQ, FP8, BnB | AWQ, GPTQ, BnB | FP8, INT4, INT8 | GGUF (Q2~Q8) |
| API Compat | OpenAI compatible | OpenAI compatible | Triton | Custom API |
| Install Difficulty | Medium | Medium | High | Low |
| Production Ready | Very High | High | Very High | Low~Medium |
| Community | Very Active | Active | NVIDIA-led | Very Active |
2. vLLM Installation and Startup
2.1 pip Installation
# Basic installation (CUDA 12.x)
pip install vllm
# Specific version installation
pip install vllm==0.16.0
# CUDA 11.8 environment
pip install vllm --extra-index-url https://download.pytorch.org/whl/cu118
2.2 conda Installation
conda create -n vllm python=3.11 -y
conda activate vllm
pip install vllm
2.3 Docker Installation
# Official Docker image (NVIDIA GPU)
docker run --runtime nvidia --gpus all \
-v ~/.cache/huggingface:/root/.cache/huggingface \
--env "HUGGING_FACE_HUB_TOKEN=<hf_token>" \
-p 8000:8000 \
--ipc=host \
vllm/vllm-openai:latest \
--model meta-llama/Llama-3.1-8B-Instruct
# ROCm (AMD GPU)
docker run --device /dev/kfd --device /dev/dri \
-v ~/.cache/huggingface:/root/.cache/huggingface \
-p 8000:8000 \
vllm/vllm-openai:latest-rocm \
--model meta-llama/Llama-3.1-8B-Instruct
2.4 Basic Server Start
# vllm serve command (recommended)
vllm serve meta-llama/Llama-3.1-8B-Instruct \
--host 0.0.0.0 \
--port 8000
# Direct Python module execution (legacy)
python -m vllm.entrypoints.openai.api_server \
--model meta-llama/Llama-3.1-8B-Instruct \
--host 0.0.0.0 \
--port 8000
# Start with YAML config file
vllm serve meta-llama/Llama-3.1-8B-Instruct \
--config config.yaml
config.yaml example:
# vLLM server configuration file
host: '0.0.0.0'
port: 8000
tensor_parallel_size: 2
gpu_memory_utilization: 0.90
max_model_len: 8192
dtype: 'auto'
enforce_eager: false
enable_prefix_caching: true
2.5 Offline Batch Inference
You can perform batch inference directly from Python code without starting a server.
from vllm import LLM, SamplingParams
# Load model
llm = LLM(
model="meta-llama/Llama-3.1-8B-Instruct",
tensor_parallel_size=1,
gpu_memory_utilization=0.90,
)
# Set sampling parameters
sampling_params = SamplingParams(
temperature=0.7,
top_p=0.9,
max_tokens=512,
)
# Prompt list
prompts = [
"Explain PagedAttention in simple terms.",
"What is continuous batching?",
"Compare vLLM and TensorRT-LLM.",
]
# Run batch inference
outputs = llm.generate(prompts, sampling_params)
for output in outputs:
prompt = output.prompt
generated = output.outputs[0].text
print(f"Prompt: {prompt!r}")
print(f"Output: {generated!r}\n")
2.6 OpenAI-Compatible API Server
The vLLM server provides OpenAI API-compatible endpoints.
# Start server
vllm serve meta-llama/Llama-3.1-8B-Instruct \
--served-model-name llama-3.1-8b \
--api-key my-secret-key
# Call Chat Completion with curl
curl http://localhost:8000/v1/chat/completions \
-H "Content-Type: application/json" \
-H "Authorization: Bearer my-secret-key" \
-d '{
"model": "llama-3.1-8b",
"messages": [
{"role": "system", "content": "You are a helpful assistant."},
{"role": "user", "content": "What is PagedAttention?"}
],
"temperature": 0.7,
"max_tokens": 512
}'
# Call with OpenAI SDK
from openai import OpenAI
client = OpenAI(
base_url="http://localhost:8000/v1",
api_key="my-secret-key",
)
response = client.chat.completions.create(
model="llama-3.1-8b",
messages=[
{"role": "system", "content": "You are a helpful assistant."},
{"role": "user", "content": "Explain the advantages of vLLM."},
],
temperature=0.7,
max_tokens=512,
)
print(response.choices[0].message.content)
3. Complete vLLM CLI Arguments Reference
Here is a categorized summary of key CLI arguments that can be passed to vllm serve. You can check the full list with vllm serve --help, or query by group with vllm serve --help=ModelConfig.
3.1 Model-Related Arguments
| Argument | Type | Default | Description |
|---|---|---|---|
--model | str | facebook/opt-125m | HuggingFace model ID or local path |
--tokenizer | str | None (same as model) | Specify a separate tokenizer |
--revision | str | None | Specific Git revision of the model (branch, tag, commit hash) |
--tokenizer-revision | str | None | Specific revision of the tokenizer |
--dtype | str | "auto" | Model weight data type (auto, float16, bfloat16, float32) |
--max-model-len | int | None (follows model config) | Maximum sequence length (sum of input + output tokens) |
--trust-remote-code | flag | False | Allow HuggingFace remote code execution |
--download-dir | str | None | Model download directory |
--load-format | str | "auto" | Model load format (auto, pt, safetensors, npcache, dummy, bitsandbytes) |
--config-format | str | "auto" | Model configuration format (auto, hf, mistral) |
--seed | int | 0 | Random seed for reproducibility |
3.2 Server-Related Arguments
| Argument | Type | Default | Description |
|---|---|---|---|
--host | str | "0.0.0.0" | Host address to bind |
--port | int | 8000 | Server port number |
--uvicorn-log-level | str | "info" | Uvicorn log level |
--api-key | str | None | API authentication key (Bearer token) |
--served-model-name | str | None | Model name for the API (uses --model value if unset) |
--chat-template | str | None | Jinja2 chat template file path or string |
--response-role | str | "assistant" | Role in chat completion responses |
--ssl-keyfile | str | None | SSL key file path |
--ssl-certfile | str | None | SSL certificate file path |
--allowed-origins | list | ["*"] | CORS allowed origin list |
--middleware | list | None | FastAPI middleware classes |
--max-log-len | int | None | Maximum prompt/output length in logs |
--disable-log-requests | flag | False | Disable request logging |
3.3 Parallelism-Related Arguments
| Argument | Type | Default | Description |
|---|---|---|---|
--tensor-parallel-size (-tp) | int | 1 | Number of GPUs for Tensor Parallelism |
--pipeline-parallel-size (-pp) | int | 1 | Number of Pipeline Parallelism stages |
--distributed-executor-backend | str | None | Distributed execution backend (ray, mp) |
--ray-workers-use-nsight | flag | False | Use Nsight profiler with Ray workers |
--data-parallel-size (-dp) | int | 1 | Number of Data Parallelism processes |
Usage examples:
# 4-GPU Tensor Parallelism
vllm serve meta-llama/Llama-3.1-70B-Instruct \
--tensor-parallel-size 4
# 2-GPU Tensor + 2-way Pipeline (4 GPUs total)
vllm serve meta-llama/Llama-3.1-70B-Instruct \
--tensor-parallel-size 2 \
--pipeline-parallel-size 2
# Ray distributed backend (multi-node)
vllm serve meta-llama/Llama-3.1-70B-Instruct \
--tensor-parallel-size 8 \
--distributed-executor-backend ray
3.4 Memory and Performance Arguments
| Argument | Type | Default | Description |
|---|---|---|---|
--gpu-memory-utilization | float | 0.90 | GPU memory usage ratio (0.0~1.0) |
--max-num-seqs | int | 256 | Maximum concurrent sequences |
--max-num-batched-tokens | int | None (auto) | Maximum tokens processed per step |
--block-size | int | 16 | PagedAttention block size (in tokens) |
--swap-space | float | 4 | CPU swap space size (GiB) |
--enforce-eager | flag | False | Disable CUDA Graph, force Eager mode |
--max-seq-len-to-capture | int | 8192 | Maximum sequence length for CUDA Graph capture |
--disable-custom-all-reduce | flag | False | Disable custom All-Reduce |
--enable-prefix-caching | flag | True (v1) | Enable Automatic Prefix Caching |
--enable-chunked-prefill | flag | True (v1) | Enable Chunked Prefill |
--num-scheduler-steps | int | 1 | Decoding steps per scheduler (Multi-Step Scheduling) |
--kv-cache-dtype | str | "auto" | KV Cache data type (auto, fp8, fp8_e5m2, fp8_e4m3) |
Usage examples:
# Memory optimization settings
vllm serve meta-llama/Llama-3.1-8B-Instruct \
--gpu-memory-utilization 0.95 \
--max-num-seqs 128 \
--max-model-len 4096 \
--enable-prefix-caching \
--enable-chunked-prefill
# Eager mode (debugging/compatibility)
vllm serve meta-llama/Llama-3.1-8B-Instruct \
--enforce-eager \
--gpu-memory-utilization 0.85
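The interplay of --gpu-memory-utilization, --max-model-len, and --max-num-seqs comes down to KV-cache arithmetic. The sketch below works the numbers for Llama-3.1-8B (32 layers, 8 KV heads after GQA, head dim 128, fp16); the function name and the 20 GiB budget are illustrative assumptions, not vLLM APIs.

```python
# Back-of-the-envelope KV-cache sizing. The factor of 2 covers the
# separate K and V tensors; dtype_bytes=2 assumes fp16/bf16.
def kv_bytes_per_token(layers, kv_heads, head_dim, dtype_bytes=2):
    return 2 * layers * kv_heads * head_dim * dtype_bytes

# Llama-3.1-8B: 32 layers, 8 KV heads, head dim 128
per_token = kv_bytes_per_token(32, 8, 128)
print(per_token // 1024, "KiB per token")       # 128 KiB per token

# Tokens that fit in a hypothetical 20 GiB KV-cache budget
# (what remains after weights under gpu-memory-utilization):
budget = 20 * 1024**3
print(budget // per_token, "cacheable tokens")  # 163840 tokens
```

Dividing the cacheable-token count by your typical sequence length gives a rough ceiling on useful --max-num-seqs; --kv-cache-dtype fp8 halves the per-token cost.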
3.5 Quantization-Related Arguments
| Argument | Type | Default | Description |
|---|---|---|---|
--quantization (-q) | str | None | Select quantization method |
--load-format | str | "auto" | Model load format |
--quantization supported values:
| Value | Description | Notes |
|---|---|---|
awq | AWQ (Activation-aware Weight Quantization) | 4-bit, fast inference |
gptq | GPTQ (Post-Training Quantization) | 4-bit, ExLlamaV2 kernel |
gptq_marlin | GPTQ + Marlin kernel | 4-bit, faster kernel |
awq_marlin | AWQ + Marlin kernel | 4-bit, faster kernel |
squeezellm | SqueezeLLM | Sparse quantization |
fp8 | FP8 (8-bit floating point) | H100/MI300x only |
bitsandbytes | BitsAndBytes | 4-bit NF4 |
gguf | GGUF format | llama.cpp compatible |
compressed-tensors | Compressed Tensors | General purpose |
experts_int8 | MoE Expert INT8 | MoE models only |
Usage examples:
# AWQ quantized model
vllm serve TheBloke/Llama-2-7B-AWQ \
--quantization awq
# GPTQ quantized model
vllm serve TheBloke/Llama-2-7B-GPTQ \
--quantization gptq
# FP8 quantization (H100 and above)
vllm serve neuralmagic/Meta-Llama-3.1-8B-Instruct-FP8 \
--quantization fp8
# BitsAndBytes 4-bit (GPU memory saving)
vllm serve meta-llama/Llama-3.1-8B-Instruct \
--quantization bitsandbytes \
--load-format bitsandbytes
3.6 LoRA-Related Arguments
| Argument | Type | Default | Description |
|---|---|---|---|
--enable-lora | flag | False | Enable LoRA adapter serving |
--max-loras | int | 1 | Maximum number of simultaneously loaded LoRAs |
--max-lora-rank | int | 16 | Maximum LoRA rank |
--lora-extra-vocab-size | int | 256 | Extra vocabulary size for LoRA adapters |
--lora-modules | list | None | LoRA adapter list (name=path format) |
--long-lora-scaling-factors | list | None | Long LoRA scaling factors |
Usage example:
# LoRA adapter serving
vllm serve meta-llama/Llama-3.1-8B-Instruct \
--enable-lora \
--max-loras 4 \
--max-lora-rank 64 \
--lora-modules \
adapter1=/path/to/lora1 \
adapter2=/path/to/lora2
3.7 Speculative Decoding Arguments
| Argument | Type | Default | Description |
|---|---|---|---|
--speculative-model | str | None | Draft model (small model or [ngram]) |
--num-speculative-tokens | int | None | Number of tokens to speculatively generate |
--speculative-draft-tensor-parallel-size | int | None | TP size for the draft model |
--speculative-disable-by-batch-size | int | None | Disable when batch size exceeds threshold |
--ngram-prompt-lookup-max | int | None | Maximum lookup size for N-gram speculation |
--ngram-prompt-lookup-min | int | None | Minimum lookup size for N-gram speculation |
Usage examples:
# Using a separate draft model
vllm serve meta-llama/Llama-3.1-70B-Instruct \
--speculative-model meta-llama/Llama-3.2-1B-Instruct \
--num-speculative-tokens 5 \
--tensor-parallel-size 4
# N-gram based speculative decoding (no additional model needed)
vllm serve meta-llama/Llama-3.1-8B-Instruct \
--speculative-model "[ngram]" \
--num-speculative-tokens 5 \
--ngram-prompt-lookup-max 4
4. vLLM Sampling Parameters
vLLM supports OpenAI API-compatible parameters plus additional advanced parameters.
4.1 Complete Parameter Reference
| Parameter | Type | Default | Range | Description |
|---|---|---|---|---|
temperature | float | 1.0 | >= 0.0 | Lower is more deterministic, higher is more creative. 0 = Greedy |
top_p | float | 1.0 | (0.0, 1.0] | Nucleus sampling. Sample only top tokens by cumulative probability |
top_k | int | -1 | -1 or >= 1 | Consider only top k tokens. -1 disables |
min_p | float | 0.0 | [0.0, 1.0] | Minimum probability threshold. Filters by ratio to highest probability |
frequency_penalty | float | 0.0 | [-2.0, 2.0] | Frequency-based penalty. Positive suppresses repetition |
presence_penalty | float | 0.0 | [-2.0, 2.0] | Presence-based penalty. Penalizes tokens that appeared at least once |
repetition_penalty | float | 1.0 | > 0.0 | Repetition penalty (1.0 disables, greater than 1.0 suppresses) |
max_tokens | int | 16 | >= 1 | Maximum tokens to generate |
stop | list | None | - | List of stop strings |
seed | int | None | - | Random seed (ensures reproducibility) |
n | int | 1 | >= 1 | Number of responses per prompt |
best_of | int | None | >= n | Generate best_of candidates and select the best |
use_beam_search | bool | False | - | Enable Beam Search |
logprobs | int | None | [0, 20] | Number of per-token log probabilities to return |
prompt_logprobs | int | None | [0, 20] | Number of prompt token log probabilities to return |
skip_special_tokens | bool | True | - | Whether to skip special tokens |
spaces_between_special_tokens | bool | True | - | Insert spaces between special tokens |
guided_json | object | None | - | JSON Schema-based structured output |
guided_regex | str | None | - | Regex-based structured output |
guided_choice | list | None | - | Choice-based structured output |
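How temperature, top_k, top_p, and min_p interact can be seen by filtering a plain probability list. This is a simplified sketch over raw logits, not vLLM's sampler: the exact ordering of filters in vLLM differs, and the function name is invented.

```python
# Which token IDs remain candidates after each sampling filter.
import math

def sample_filter(logits, temperature=1.0, top_k=-1, top_p=1.0, min_p=0.0):
    if temperature == 0:                       # temperature 0 = greedy
        return [max(range(len(logits)), key=lambda i: logits[i])]
    scaled = [l / temperature for l in logits] # flatten/sharpen distribution
    z = sum(math.exp(l) for l in scaled)
    probs = [math.exp(l) / z for l in scaled]  # softmax
    order = sorted(range(len(probs)), key=lambda i: -probs[i])
    if top_k > 0:
        order = order[:top_k]                  # keep the k most likely
    if min_p > 0:
        cutoff = min_p * probs[order[0]]       # relative to the best token
        order = [i for i in order if probs[i] >= cutoff]
    kept, cum = [], 0.0
    for i in order:                            # nucleus (top-p) filter
        kept.append(i)
        cum += probs[i]
        if cum >= top_p:
            break
    return kept

logits = [2.0, 1.0, 0.1, -1.0]
print(sample_filter(logits, temperature=0))    # [0]     greedy pick
print(sample_filter(logits, top_k=2))          # [0, 1]  two best survive
```

In the real sampler one token is then drawn from the renormalized survivors; the filters only decide which tokens are eligible.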
4.2 API Call Examples with curl
# Basic Chat Completion
curl http://localhost:8000/v1/chat/completions \
-H "Content-Type: application/json" \
-d '{
"model": "meta-llama/Llama-3.1-8B-Instruct",
"messages": [
{"role": "user", "content": "What is the population of Seoul?"}
],
"temperature": 0.3,
"top_p": 0.9,
"max_tokens": 256,
"frequency_penalty": 0.5,
"seed": 42
}'
# Structured Output (JSON mode)
curl http://localhost:8000/v1/chat/completions \
-H "Content-Type: application/json" \
-d '{
"model": "meta-llama/Llama-3.1-8B-Instruct",
"messages": [
{"role": "user", "content": "Give me the population of Seoul, Busan, and Daegu in JSON"}
],
"response_format": {
"type": "json_schema",
"json_schema": {
"name": "city_population",
"schema": {
"type": "object",
"properties": {
"cities": {
"type": "array",
"items": {
"type": "object",
"properties": {
"name": {"type": "string"},
"population": {"type": "integer"}
},
"required": ["name", "population"]
}
}
},
"required": ["cities"]
}
}
},
"temperature": 0.1,
"max_tokens": 512
}'
# Returning logprobs
curl http://localhost:8000/v1/chat/completions \
-H "Content-Type: application/json" \
-d '{
"model": "meta-llama/Llama-3.1-8B-Instruct",
"messages": [
{"role": "user", "content": "1+1=?"}
],
"logprobs": true,
"top_logprobs": 5,
"max_tokens": 10
}'
4.3 Python requests Example
import requests
import json
url = "http://localhost:8000/v1/chat/completions"
headers = {"Content-Type": "application/json"}
payload = {
"model": "meta-llama/Llama-3.1-8B-Instruct",
"messages": [
{"role": "system", "content": "You are a helpful Korean assistant."},
{"role": "user", "content": "What is quantum computing?"},
],
"temperature": 0.7,
"top_p": 0.9,
"top_k": 50,
"max_tokens": 1024,
"repetition_penalty": 1.1,
"stop": ["\n\n\n"],
}
response = requests.post(url, headers=headers, json=payload)
result = response.json()
print(result["choices"][0]["message"]["content"])
4.4 Streaming Example with OpenAI SDK
from openai import OpenAI
client = OpenAI(base_url="http://localhost:8000/v1", api_key="token")
# Streaming response
stream = client.chat.completions.create(
model="meta-llama/Llama-3.1-8B-Instruct",
messages=[
{"role": "user", "content": "Implement quicksort in Python"},
],
temperature=0.2,
max_tokens=2048,
stream=True,
)
for chunk in stream:
if chunk.choices[0].delta.content:
print(chunk.choices[0].delta.content, end="", flush=True)
print()
5. Complete vLLM Environment Variables Reference
vLLM controls runtime behavior through various environment variables. Here is a categorized summary of key environment variables.
5.1 Core Environment Variables
| Environment Variable | Default | Description |
|---|---|---|
VLLM_TARGET_DEVICE | "cuda" | Target device (cuda, rocm, neuron, cpu, xpu) |
VLLM_USE_V1 | True | Use V1 code path |
VLLM_WORKER_MULTIPROC_METHOD | "fork" | Multiprocess spawn method (spawn, fork) |
VLLM_ALLOW_LONG_MAX_MODEL_LEN | False | Allow max_model_len longer than model config |
CUDA_VISIBLE_DEVICES | None | GPU device numbers to use |
5.2 Attention and Kernel Related
| Environment Variable | Default | Description |
|---|---|---|
VLLM_ATTENTION_BACKEND | None | Attention backend (deprecated, use --attention-backend from v0.14) |
VLLM_USE_TRITON_FLASH_ATTN | True | Use Triton Flash Attention |
VLLM_FLASH_ATTN_VERSION | None | Force Flash Attention version (2 or 3) |
VLLM_USE_FLASHINFER_SAMPLER | None | Use FlashInfer sampler |
VLLM_FLASHINFER_FORCE_TENSOR_CORES | False | Force FlashInfer tensor core usage |
VLLM_USE_TRITON_AWQ | False | Use Triton AWQ kernel |
VLLM_USE_DEEP_GEMM | False | Use DeepGemm kernel (MoE operations) |
VLLM_MLA_DISABLE | False | Disable MLA Attention optimization |
5.3 Logging Related
| Environment Variable | Default | Description |
|---|---|---|
VLLM_CONFIGURE_LOGGING | 1 | Auto-configure vLLM logging (0 to disable) |
VLLM_LOGGING_LEVEL | "INFO" | Default logging level |
VLLM_LOGGING_CONFIG_PATH | None | Custom logging config file path |
VLLM_LOGGING_PREFIX | "" | Prefix to prepend to log messages |
VLLM_LOG_BATCHSIZE_INTERVAL | -1 | Batch size logging interval (seconds, -1 disables) |
VLLM_TRACE_FUNCTION | 0 | Enable function call tracing |
VLLM_DEBUG_LOG_API_SERVER_RESPONSE | False | API response debug logging |
5.4 Distributed Processing Related
| Environment Variable | Default | Description |
|---|---|---|
VLLM_HOST_IP | "" | Node IP for distributed setup |
VLLM_PORT | 0 | Distributed communication port |
VLLM_NCCL_SO_PATH | None | NCCL library file path |
NCCL_DEBUG | None | NCCL debug level (INFO, WARN, TRACE) |
NCCL_SOCKET_IFNAME | None | Network interface for NCCL communication |
VLLM_PP_LAYER_PARTITION | None | Pipeline Parallelism layer partition strategy |
VLLM_DP_RANK | 0 | Data Parallel process rank |
VLLM_DP_SIZE | 1 | Data Parallel world size |
VLLM_DP_MASTER_IP | "127.0.0.1" | Data Parallel master node IP |
VLLM_DP_MASTER_PORT | 0 | Data Parallel master node port |
VLLM_USE_RAY_SPMD_WORKER | False | Ray SPMD worker execution |
VLLM_USE_RAY_COMPILED_DAG | False | Use Ray Compiled Graph API |
VLLM_SKIP_P2P_CHECK | False | Skip GPU P2P capability check |
5.5 HuggingFace and External Services
| Environment Variable | Default | Description |
|---|---|---|
HF_TOKEN | None | HuggingFace API token |
HUGGING_FACE_HUB_TOKEN | None | HuggingFace Hub token (legacy) |
VLLM_USE_MODELSCOPE | False | Load models from ModelScope |
VLLM_API_KEY | None | vLLM API server auth key |
VLLM_NO_USAGE_STATS | False | Disable usage stats collection |
VLLM_DO_NOT_TRACK | False | Opt out of tracking |
5.6 Cache and Paths
| Environment Variable | Default | Description |
|---|---|---|
VLLM_CONFIG_ROOT | ~/.config/vllm | Config file root directory |
VLLM_CACHE_ROOT | ~/.cache/vllm | Cache file root directory |
VLLM_ASSETS_CACHE | ~/.cache/vllm/assets | Downloaded assets cache path |
VLLM_RPC_BASE_PATH | System temp | IPC multiprocessing path |
5.7 Environment Variable Usage Examples
# Multi-GPU + logging + HF token setup
export CUDA_VISIBLE_DEVICES=0,1,2,3
export HF_TOKEN="hf_xxxxxxxxxxxx"
export VLLM_LOGGING_LEVEL="DEBUG"
export VLLM_WORKER_MULTIPROC_METHOD="spawn"
vllm serve meta-llama/Llama-3.1-70B-Instruct \
--tensor-parallel-size 4 \
--gpu-memory-utilization 0.90
# Passing environment variables in Docker
docker run --runtime nvidia --gpus all \
-e CUDA_VISIBLE_DEVICES=0,1 \
-e HF_TOKEN="hf_xxxxxxxxxxxx" \
-e VLLM_LOGGING_LEVEL="INFO" \
-e VLLM_WORKER_MULTIPROC_METHOD="spawn" \
-p 8000:8000 \
--ipc=host \
vllm/vllm-openai:latest \
--model meta-llama/Llama-3.1-8B-Instruct \
--tensor-parallel-size 2
6. Advanced vLLM Configuration
6.1 Multi-GPU Setup
Tensor Parallelism (TP): Distributes each layer of the model across multiple GPUs. The most commonly used approach on a single node.
# TP=4 (distribute model across 4 GPUs)
vllm serve meta-llama/Llama-3.1-70B-Instruct \
--tensor-parallel-size 4 \
--gpu-memory-utilization 0.90
Pipeline Parallelism (PP): Places model layers sequentially across multiple GPUs. Advantageous in slow interconnect environments.
# PP=2, TP=2 (4 GPUs total, 2x2 configuration)
vllm serve meta-llama/Llama-3.1-70B-Instruct \
--tensor-parallel-size 2 \
--pipeline-parallel-size 2
Multi-node setup (using Ray):
# Master node
ray start --head --port=6379
# Worker node
ray start --address=<master-ip>:6379
# Run vLLM (from master)
vllm serve meta-llama/Llama-3.1-405B-Instruct \
--tensor-parallel-size 8 \
--pipeline-parallel-size 2 \
--distributed-executor-backend ray
6.2 Quantization Details
AWQ (Activation-aware Weight Quantization):
# Using pre-quantized AWQ model
vllm serve TheBloke/Llama-2-13B-chat-AWQ \
--quantization awq \
--max-model-len 4096
# Faster with Marlin kernel (SM 80+ GPU)
vllm serve TheBloke/Llama-2-13B-chat-AWQ \
--quantization awq_marlin
GPTQ (Post-Training Quantization):
# GPTQ model (ExLlamaV2 kernel auto-used)
vllm serve TheBloke/Llama-2-13B-chat-GPTQ \
--quantization gptq
# Using Marlin kernel
vllm serve TheBloke/Llama-2-13B-chat-GPTQ \
--quantization gptq_marlin
FP8 (8-bit Floating Point): Hardware acceleration supported on H100, MI300x and above GPUs.
# Pre-quantized FP8 model
vllm serve neuralmagic/Meta-Llama-3.1-8B-Instruct-FP8 \
--quantization fp8
# Dynamic FP8 quantization (no pre-quantization needed)
vllm serve meta-llama/Llama-3.1-8B-Instruct \
--quantization fp8 \
--kv-cache-dtype fp8
BitsAndBytes 4-bit NF4: Instant quantization without calibration data.
vllm serve meta-llama/Llama-3.1-8B-Instruct \
--quantization bitsandbytes \
--load-format bitsandbytes \
--enforce-eager # BnB requires Eager mode
6.3 LoRA Serving
# Enable LoRA adapters
vllm serve meta-llama/Llama-3.1-8B-Instruct \
--enable-lora \
--max-loras 4 \
--max-lora-rank 64 \
--lora-modules \
korean-chat=/path/to/korean-lora \
code-assist=/path/to/code-lora
Specifying a LoRA model in API calls:
from openai import OpenAI
client = OpenAI(base_url="http://localhost:8000/v1", api_key="token")
# Use a specific LoRA adapter
response = client.chat.completions.create(
model="korean-chat", # LoRA adapter name
messages=[{"role": "user", "content": "Hello!"}],
temperature=0.7,
max_tokens=256,
)
6.4 Prefix Caching & Chunked Prefill
Automatic Prefix Caching: Reuses KV Cache for common prompt prefixes to reduce TTFT. Especially effective when many requests share the same system prompt.
# Enabled by default in v1, requires explicit flag in v0
vllm serve meta-llama/Llama-3.1-8B-Instruct \
--enable-prefix-caching
Chunked Prefill: Splits long prompts into chunks and interleaves Prefill and Decode. Prevents long prompts from blocking Decode of shorter requests.
vllm serve meta-llama/Llama-3.1-8B-Instruct \
--enable-chunked-prefill \
--max-num-batched-tokens 2048
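Prefix-cache reuse happens at whole-block granularity: only complete blocks of the shared prefix are served from cache, the trailing partial block is recomputed. A small worked example, assuming vLLM's default block size of 16 (the function name is invented for illustration):

```python
# Tokens of a shared prefix that come from cached KV blocks.
def cached_tokens(shared_prefix_len, block_size=16):
    return (shared_prefix_len // block_size) * block_size

# A 500-token system prompt shared by all requests:
print(cached_tokens(500))        # 496 tokens (31 full blocks)

# Prefill for a 600-token request recomputes only the remainder,
# which is where the TTFT reduction comes from:
print(600 - cached_tokens(500))  # 104 tokens instead of 600
```

The effect compounds with concurrency: every request after the first skips the shared blocks entirely.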
6.5 Structured Output (Guided Decoding)
# JSON Schema-based structured output
curl http://localhost:8000/v1/chat/completions \
-H "Content-Type: application/json" \
-d '{
"model": "meta-llama/Llama-3.1-8B-Instruct",
"messages": [
{"role": "user", "content": "Please provide Seoul weather info in JSON"}
],
"response_format": {
"type": "json_schema",
"json_schema": {
"name": "weather_info",
"schema": {
"type": "object",
"properties": {
"city": {"type": "string"},
"temperature_celsius": {"type": "number"},
"condition": {"type": "string"},
"humidity_percent": {"type": "integer"}
},
"required": ["city", "temperature_celsius", "condition"]
}
}
}
}'
# Regex-based output (Completion API)
curl http://localhost:8000/v1/completions \
-H "Content-Type: application/json" \
-d '{
"model": "meta-llama/Llama-3.1-8B-Instruct",
"prompt": "Generate a valid email address:",
"extra_body": {
"guided_regex": "[a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\\.[a-zA-Z]{2,}"
},
"max_tokens": 50
}'
6.6 Docker Deployment
# docker-compose.yaml
version: '3.8'
services:
vllm:
image: vllm/vllm-openai:latest
runtime: nvidia
deploy:
resources:
reservations:
devices:
- driver: nvidia
count: all
capabilities: [gpu]
ports:
- '8000:8000'
volumes:
- ~/.cache/huggingface:/root/.cache/huggingface
environment:
- HF_TOKEN=${HF_TOKEN}
- VLLM_LOGGING_LEVEL=INFO
- VLLM_WORKER_MULTIPROC_METHOD=spawn
ipc: host
command: >
--model meta-llama/Llama-3.1-8B-Instruct
--host 0.0.0.0
--port 8000
--tensor-parallel-size 2
--gpu-memory-utilization 0.90
--max-model-len 8192
--enable-prefix-caching
healthcheck:
test: ['CMD', 'curl', '-f', 'http://localhost:8000/health']
interval: 30s
timeout: 10s
retries: 3
start_period: 120s
# Run with Docker Compose
HF_TOKEN=hf_xxxx docker compose up -d
# Check logs
docker compose logs -f vllm
6.7 Kubernetes Deployment
# vllm-deployment.yaml
apiVersion: apps/v1
kind: Deployment
metadata:
name: vllm-llama3
namespace: ai-serving
labels:
app: vllm
spec:
replicas: 1
selector:
matchLabels:
app: vllm
template:
metadata:
labels:
app: vllm
spec:
containers:
- name: vllm
image: vllm/vllm-openai:latest
ports:
- containerPort: 8000
name: http
args:
- '--model'
- 'meta-llama/Llama-3.1-8B-Instruct'
- '--host'
- '0.0.0.0'
- '--port'
- '8000'
- '--tensor-parallel-size'
- '2'
- '--gpu-memory-utilization'
- '0.90'
- '--max-model-len'
- '8192'
env:
- name: HF_TOKEN
valueFrom:
secretKeyRef:
name: hf-secret
key: token
- name: VLLM_WORKER_MULTIPROC_METHOD
value: 'spawn'
resources:
limits:
nvidia.com/gpu: '2'
requests:
nvidia.com/gpu: '2'
memory: '32Gi'
cpu: '8'
volumeMounts:
- name: shm
mountPath: /dev/shm
- name: model-cache
mountPath: /root/.cache/huggingface
readinessProbe:
httpGet:
path: /health
port: 8000
initialDelaySeconds: 120
periodSeconds: 10
livenessProbe:
httpGet:
path: /health
port: 8000
initialDelaySeconds: 180
periodSeconds: 30
volumes:
- name: shm
emptyDir:
medium: Memory
sizeLimit: 2Gi
- name: model-cache
persistentVolumeClaim:
claimName: model-cache-pvc
nodeSelector:
nvidia.com/gpu.product: 'NVIDIA-A100-SXM4-80GB'
---
apiVersion: v1
kind: Service
metadata:
name: vllm-service
namespace: ai-serving
spec:
selector:
app: vllm
ports:
- port: 8000
targetPort: 8000
name: http
type: ClusterIP
---
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
name: vllm-hpa
namespace: ai-serving
spec:
scaleTargetRef:
apiVersion: apps/v1
kind: Deployment
name: vllm-llama3
minReplicas: 1
maxReplicas: 4
metrics:
- type: Pods
pods:
metric:
name: vllm_num_requests_running
target:
type: AverageValue
averageValue: '50'
Part 2: Ollama
7. Introduction to Ollama
Ollama is an open-source tool that makes it easy to run LLMs in local environments. Much as Docker pulls container images on demand, a single command such as ollama run llama3.1 downloads the model and drops you straight into an interactive chat.
7.1 Architecture Features
- GGUF-based: Uses llama.cpp's GGUF (GPT-Generated Unified Format) quantized models
- llama.cpp engine: Internally uses llama.cpp as the inference engine
- Single binary: Go server + llama.cpp C++ engine distributed as a single binary
- Automatic GPU acceleration: Auto-detects NVIDIA CUDA, AMD ROCm, Apple Metal for GPU offloading
- Model registry: Pull/push pre-quantized models from ollama.com/library, much like Docker Hub
7.2 Supported Models
| Category | Models | Size |
|---|---|---|
| Meta Llama | llama3.1, llama3.2, llama3.3 | 1B ~ 405B |
| Mistral | mistral, mixtral | 7B ~ 8x22B |
| Google | gemma, gemma2, gemma3 | 2B ~ 27B |
| Microsoft | phi3, phi4 | 3.8B ~ 14B |
| DeepSeek | deepseek-r1, deepseek-v3, deepseek-coder-v2 | 1.5B ~ 671B |
| Qwen | qwen, qwen2, qwen2.5, qwen3 | 0.5B ~ 72B |
| Code | codellama, starcoder2, qwen2.5-coder | 3B ~ 34B |
| Embedding | nomic-embed-text, mxbai-embed-large, all-minilm | - |
| Multimodal | llava, bakllava, llama3.2-vision | 7B ~ 90B |
8. Ollama Installation and Startup
8.1 Platform-Specific Installation
macOS:
# Homebrew
brew install ollama
# Or official install script
curl -fsSL https://ollama.com/install.sh | sh
Linux:
# Official install script (recommended)
curl -fsSL https://ollama.com/install.sh | sh
# Or manual installation
sudo curl -L https://ollama.com/download/ollama-linux-amd64 -o /usr/local/bin/ollama
sudo chmod +x /usr/local/bin/ollama
Windows:
Download and run the Windows installer from the official website (ollama.com).
Docker:
# CPU only
docker run -d -v ollama:/root/.ollama -p 11434:11434 \
--name ollama ollama/ollama
# NVIDIA GPU
docker run -d --gpus=all -v ollama:/root/.ollama -p 11434:11434 \
--name ollama ollama/ollama
# AMD GPU (ROCm)
docker run -d --device /dev/kfd --device /dev/dri \
-v ollama:/root/.ollama -p 11434:11434 \
--name ollama ollama/ollama:rocm
8.2 Basic Usage
# Start server (if not auto-started in background)
ollama serve
# Download model and start chatting
ollama run llama3.1
# Specify a tag (size/quantization)
ollama run llama3.1:8b
ollama run llama3.1:70b-instruct-q4_K_M
ollama run qwen2.5:32b-instruct-q5_K_M
# Download model only (without running)
ollama pull llama3.1:8b
# One-line prompt
ollama run llama3.1 "What is PagedAttention?"
9. Complete Ollama CLI Commands Reference
9.1 Command Summary
| Command | Description | Key Options |
|---|---|---|
| ollama serve | Start Ollama server | Check env vars with --help |
| ollama run <model> | Run model (auto-pulls if missing) | --verbose, --nowordwrap, --format json |
| ollama pull <model> | Download model | --insecure |
| ollama push <model> | Upload model to registry | --insecure |
| ollama create <model> | Create custom model from Modelfile | -f <Modelfile>, --quantize |
| ollama list / ollama ls | List installed models | - |
| ollama show <model> | Show model details | --modelfile, --parameters, --system, --template, --license |
| ollama cp <src> <dst> | Copy model | - |
| ollama rm <model> | Delete model | - |
| ollama ps | List running models | - |
| ollama stop <model> | Stop a running model | - |
| ollama signin | Sign in to ollama.com | - |
| ollama signout | Sign out from ollama.com | - |
9.2 Detailed Command Examples
ollama serve - Start server:
# Default start (localhost:11434)
ollama serve
# Change bind address via environment variable
OLLAMA_HOST=0.0.0.0:11434 ollama serve
# Debug mode
OLLAMA_DEBUG=1 ollama serve
ollama run - Run model:
# Interactive mode
ollama run llama3.1
# One-line prompt
ollama run llama3.1 "Explain quantum computing"
# JSON format output
ollama run llama3.1 "List 3 Korean cities" --format json
# Multimodal (image input)
ollama run llama3.2-vision "What's in this image? /path/to/image.png"
# Verbose mode (display performance stats)
ollama run llama3.1 --verbose
# With system prompt
ollama run llama3.1 --system "You are a Korean translator."
ollama create - Create custom model:
# Create from Modelfile
ollama create my-model -f ./Modelfile
# Create from GGUF file
ollama create my-model -f ./Modelfile-from-gguf
# Quantization conversion
ollama create my-model-q4 --quantize q4_K_M -f ./Modelfile
ollama show - Check model info:
# Full info
ollama show llama3.1
# Output Modelfile
ollama show llama3.1 --modelfile
# Check parameters
ollama show llama3.1 --parameters
# Check system prompt
ollama show llama3.1 --system
# Check template
ollama show llama3.1 --template
ollama ps - Running models:
$ ollama ps
NAME ID SIZE PROCESSOR UNTIL
llama3.1:8b af2e33d4e25 6.7 GB 100% GPU 4 minutes from now
qwen2.5:7b 845dbda0ea48 4.7 GB 100% GPU 3 minutes from now
10. Ollama API Endpoints
Ollama provides both a REST API and an OpenAI-compatible API. The default address is http://localhost:11434.
10.1 Native API Endpoints
| Endpoint | Method | Description |
|---|---|---|
| /api/generate | POST | Text Completion generation |
| /api/chat | POST | Chat Completion generation |
| /api/embed | POST | Generate embedding vectors |
| /api/tags | GET | List local models |
| /api/show | POST | Model details |
| /api/pull | POST | Download model |
| /api/push | POST | Upload model |
| /api/create | POST | Create custom model |
| /api/copy | POST | Copy model |
| /api/delete | DELETE | Delete model |
| /api/ps | GET | List running models |
| /api/version | GET | Ollama version info |
10.2 OpenAI-Compatible Endpoints
| Endpoint | Method | Description |
|---|---|---|
| /v1/chat/completions | POST | OpenAI Chat Completion compatible |
| /v1/completions | POST | OpenAI Completion compatible |
| /v1/models | GET | Model list (OpenAI format) |
| /v1/embeddings | POST | Embeddings (OpenAI format) |
10.3 API Call Examples
Generate (Completion):
# Basic generation
curl http://localhost:11434/api/generate -d '{
"model": "llama3.1",
"prompt": "Why is the sky blue?",
"stream": false
}'
# Streaming (default)
curl http://localhost:11434/api/generate -d '{
"model": "llama3.1",
"prompt": "Write a haiku about coding",
"options": {
"temperature": 0.7,
"num_predict": 100
}
}'
# JSON format output
curl http://localhost:11434/api/generate -d '{
"model": "llama3.1",
"prompt": "List 3 programming languages as JSON",
"format": "json",
"stream": false
}'
Chat (Conversation):
curl http://localhost:11434/api/chat -d '{
"model": "llama3.1",
"messages": [
{"role": "system", "content": "You are a helpful Korean assistant."},
{"role": "user", "content": "Recommend tourist spots in Seoul."}
],
"stream": false,
"options": {
"temperature": 0.8,
"top_p": 0.9,
"num_ctx": 4096,
"num_predict": 512
}
}'
Embed (Embeddings):
# Single text embedding
curl http://localhost:11434/api/embed -d '{
"model": "nomic-embed-text",
"input": "Hello, world!"
}'
# Multiple text embeddings
curl http://localhost:11434/api/embed -d '{
"model": "nomic-embed-text",
"input": ["Hello world", "Goodbye world"]
}'
OpenAI-Compatible API:
# OpenAI format Chat Completion
curl http://localhost:11434/v1/chat/completions \
-H "Content-Type: application/json" \
-d '{
"model": "llama3.1",
"messages": [
{"role": "user", "content": "Hello!"}
],
"temperature": 0.7,
"max_tokens": 256
}'
# Model list
curl http://localhost:11434/v1/models
Calling from Python:
import requests
# Generate API
response = requests.post("http://localhost:11434/api/generate", json={
"model": "llama3.1",
"prompt": "Explain Docker in Korean",
"stream": False,
"options": {
"temperature": 0.7,
"num_predict": 512,
},
})
print(response.json()["response"])
# Chat API
response = requests.post("http://localhost:11434/api/chat", json={
"model": "llama3.1",
"messages": [
{"role": "user", "content": "What is Kubernetes?"},
],
"stream": False,
})
print(response.json()["message"]["content"])
# Using Ollama with OpenAI SDK
from openai import OpenAI
client = OpenAI(
base_url="http://localhost:11434/v1",
api_key="ollama", # Ollama doesn't require an API key, any value works
)
response = client.chat.completions.create(
model="llama3.1",
messages=[
{"role": "user", "content": "Explain Python's GIL."},
],
temperature=0.7,
max_tokens=512,
)
print(response.choices[0].message.content)
11. Ollama Parameters (Modelfile & API)
11.1 Modelfile Structure
A Modelfile defines an Ollama custom model. It has a structure similar to a Dockerfile.
# Specify base model (required)
FROM llama3.1:8b
# Parameter settings
PARAMETER temperature 0.7
PARAMETER top_p 0.9
PARAMETER top_k 40
PARAMETER num_ctx 4096
PARAMETER num_predict 512
PARAMETER repeat_penalty 1.1
PARAMETER stop "<|eot_id|>"
PARAMETER stop "<|end_of_text|>"
# System prompt
SYSTEM """
You are a friendly Korean AI assistant.
You provide accurate and concise answers, explaining with examples when needed.
"""
# Chat template (Jinja2 or Go template)
TEMPLATE """
{{- if .System }}<|start_header_id|>system<|end_header_id|>
{{ .System }}<|eot_id|>{{ end }}
{{- range .Messages }}<|start_header_id|>{{ .Role }}<|end_header_id|>
{{ .Content }}<|eot_id|>{{ end }}<|start_header_id|>assistant<|end_header_id|>
"""
# LoRA adapter (optional)
ADAPTER /path/to/lora-adapter.gguf
# License info (optional)
LICENSE """
Apache 2.0
"""
| Directive | Description | Required |
|---|---|---|
| FROM | Base model (model name or GGUF file path) | Required |
| PARAMETER | Model parameter settings | Optional |
| TEMPLATE | Prompt template | Optional |
| SYSTEM | System prompt | Optional |
| ADAPTER | LoRA/QLoRA adapter path | Optional |
| LICENSE | License information | Optional |
| MESSAGE | Pre-set conversation history | Optional |
11.2 PARAMETER Options Detail
| Parameter | Type | Default | Range/Description |
|---|---|---|---|
| temperature | float | 0.8 | 0.0~2.0. Higher is more creative, lower is more deterministic |
| top_p | float | 0.9 | 0.0~1.0. Nucleus sampling probability threshold |
| top_k | int | 40 | 1~100. Consider only top k tokens |
| min_p | float | 0.0 | 0.0~1.0. Minimum probability filtering |
| num_predict | int | -1 | Maximum tokens to generate (-1: unlimited, -2: until context fills) |
| num_ctx | int | 2048 | Context window size (in tokens) |
| repeat_penalty | float | 1.1 | Repetition penalty (1.0 disables) |
| repeat_last_n | int | 64 | Repetition check range (0: disabled, -1: num_ctx) |
| seed | int | 0 | Random seed (0 means different results each time) |
| stop | string | - | Stop string (multiple can be specified) |
| num_gpu | int | auto | Number of layers to offload to GPU (0: CPU only) |
| num_thread | int | auto | Number of CPU threads |
| num_batch | int | 512 | Prompt processing batch size |
| mirostat | int | 0 | Mirostat sampling (0: disabled, 1: Mirostat, 2: Mirostat 2.0) |
| mirostat_eta | float | 0.1 | Mirostat learning rate |
| mirostat_tau | float | 5.0 | Mirostat target entropy |
| tfs_z | float | 1.0 | Tail-Free Sampling (1.0 disables) |
| typical_p | float | 1.0 | Locally Typical Sampling (1.0 disables) |
| use_mlock | bool | false | Lock model in memory (prevent swap) |
| num_keep | int | 0 | Number of tokens to keep during context recycling |
| penalize_newline | bool | true | Apply penalty to newline tokens |
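The sampling parameters above compose: top_k first prunes the distribution to the k most probable tokens, then top_p keeps the smallest prefix whose cumulative probability reaches the threshold. A minimal illustrative sketch of that filtering (plain Python for intuition only, not Ollama's actual llama.cpp sampler):

```python
def filter_candidates(probs: dict[str, float], top_k: int, top_p: float) -> dict[str, float]:
    """Apply top-k, then nucleus (top-p) filtering, and renormalize."""
    # top-k: keep only the k most probable tokens
    ranked = sorted(probs.items(), key=lambda kv: kv[1], reverse=True)[:top_k]
    # top-p: keep the smallest prefix whose cumulative probability >= top_p
    kept, cum = [], 0.0
    for tok, p in ranked:
        kept.append((tok, p))
        cum += p
        if cum >= top_p:
            break
    total = sum(p for _, p in kept)
    return {tok: p / total for tok, p in kept}

probs = {"the": 0.5, "a": 0.3, "cat": 0.15, "dog": 0.05}
print(filter_candidates(probs, top_k=3, top_p=0.9))
```

With temperature applied beforehand (dividing logits by `temperature` before softmax), a lower temperature sharpens the distribution so fewer tokens survive these filters.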
11.3 Using Parameters in API
Pass parameters via the options field in API calls.
curl http://localhost:11434/api/chat -d '{
"model": "llama3.1",
"messages": [
{"role": "user", "content": "Hello"}
],
"options": {
"temperature": 0.3,
"top_p": 0.9,
"top_k": 50,
"num_ctx": 8192,
"num_predict": 1024,
"repeat_penalty": 1.2,
"seed": 42,
"stop": ["<|eot_id|>"]
}
}'
12. Complete Ollama Environment Variables Reference
12.1 Server and Network
| Environment Variable | Default | Description |
|---|---|---|
| OLLAMA_HOST | 127.0.0.1:11434 | Server bind address and port |
| OLLAMA_ORIGINS | None | CORS allowed origins (comma-separated) |
| OLLAMA_KEEP_ALIVE | 5m | Idle time before model unload (5m, 1h, -1=permanent) |
| OLLAMA_MAX_QUEUE | 512 | Maximum queue size (requests rejected when exceeded) |
| OLLAMA_NUM_PARALLEL | 1 | Concurrent requests per model |
| OLLAMA_MAX_LOADED_MODELS | 1 (CPU), GPUs*3 | Maximum simultaneously loaded models |
12.2 Storage and Paths
| Environment Variable | Default | Description |
|---|---|---|
| OLLAMA_MODELS | OS default path | Model storage directory |
| OLLAMA_TMPDIR | System temp | Temporary file directory |
| OLLAMA_NOPRUNE | None | Disable unused blob cleanup on boot |
Default model storage paths by platform:
| OS | Default Path |
|---|---|
| macOS | ~/.ollama/models |
| Linux | /usr/share/ollama/.ollama/models |
| Windows | C:\Users\<user>\.ollama\models |
12.3 GPU and Performance
| Environment Variable | Default | Description |
|---|---|---|
| OLLAMA_FLASH_ATTENTION | 0 | Enable Flash Attention (set to 1) |
| OLLAMA_KV_CACHE_TYPE | f16 | KV Cache quantization type (f16, q8_0, q4_0) |
| OLLAMA_GPU_OVERHEAD | 0 | VRAM to reserve per GPU (bytes) |
| OLLAMA_LLM_LIBRARY | auto | Force specific LLM library |
| CUDA_VISIBLE_DEVICES | All GPUs | NVIDIA GPU device numbers to use |
| ROCR_VISIBLE_DEVICES | All GPUs | AMD GPU device numbers to use |
| GPU_DEVICE_ORDINAL | All GPUs | GPU order to use |
12.4 Logging and Debug
| Environment Variable | Default | Description |
|---|---|---|
| OLLAMA_DEBUG | 0 | Enable debug logging (set to 1) |
| OLLAMA_NOHISTORY | 0 | Disable readline history in interactive mode |
12.5 Context and Inference
| Environment Variable | Default | Description |
|---|---|---|
| OLLAMA_CONTEXT_LENGTH | 4096 | Default context window size |
| OLLAMA_NO_CLOUD | 0 | Disable cloud features (set to 1) |
| HTTPS_PROXY / HTTP_PROXY | None | Proxy server settings |
| NO_PROXY | None | Proxy bypass hosts |
12.6 How to Set Environment Variables
macOS (launchctl):
# Set environment variables
launchctl setenv OLLAMA_HOST "0.0.0.0:11434"
launchctl setenv OLLAMA_MODELS "/Volumes/ExternalSSD/ollama/models"
launchctl setenv OLLAMA_FLASH_ATTENTION "1"
launchctl setenv OLLAMA_KV_CACHE_TYPE "q8_0"
launchctl setenv OLLAMA_NUM_PARALLEL "4"
launchctl setenv OLLAMA_KEEP_ALIVE "-1"
# Restart Ollama
brew services restart ollama
Linux (systemd):
# Create systemd service override
sudo systemctl edit ollama
# Add the following in the editor:
[Service]
Environment="OLLAMA_HOST=0.0.0.0:11434"
Environment="OLLAMA_MODELS=/data/ollama/models"
Environment="OLLAMA_FLASH_ATTENTION=1"
Environment="OLLAMA_KV_CACHE_TYPE=q8_0"
Environment="OLLAMA_NUM_PARALLEL=4"
Environment="OLLAMA_KEEP_ALIVE=-1"
Environment="OLLAMA_MAX_LOADED_MODELS=3"
Environment="CUDA_VISIBLE_DEVICES=0,1"
# Restart service
sudo systemctl daemon-reload
sudo systemctl restart ollama
Docker:
docker run -d --gpus=all \
-e OLLAMA_HOST=0.0.0.0:11434 \
-e OLLAMA_FLASH_ATTENTION=1 \
-e OLLAMA_KV_CACHE_TYPE=q8_0 \
-e OLLAMA_NUM_PARALLEL=4 \
-e OLLAMA_KEEP_ALIVE=-1 \
-v /data/ollama:/root/.ollama \
-p 11434:11434 \
--name ollama \
ollama/ollama
13. Advanced Ollama Usage
13.1 Modelfile Writing Guide
Korean Assistant Model:
FROM llama3.1:8b
PARAMETER temperature 0.7
PARAMETER top_p 0.9
PARAMETER num_ctx 4096
PARAMETER num_predict 1024
PARAMETER repeat_penalty 1.15
PARAMETER stop "<|eot_id|>"
PARAMETER stop "<|end_of_text|>"
SYSTEM """
You are a Korean AI assistant well-versed in Korean culture and history.
You always respond accurately and kindly in Korean, using English technical terms alongside when needed.
You provide answers in a structured format.
"""
MESSAGE user Hello, please introduce yourself.
MESSAGE assistant Hello! I'm an AI assistant specialized in Korean. I can help with various topics including Korean culture, history, technology, and more. Feel free to ask me anything!
# Create model
ollama create korean-assistant -f ./Modelfile-korean
# Run
ollama run korean-assistant "Tell me about the three grand palaces of Seoul"
Code Review Model:
FROM qwen2.5-coder:7b
PARAMETER temperature 0.2
PARAMETER top_p 0.85
PARAMETER num_ctx 8192
PARAMETER num_predict 2048
SYSTEM """
You are an expert code reviewer. Analyze code for:
1. Bugs and potential issues
2. Performance improvements
3. Security vulnerabilities
4. Code style and best practices
Provide specific, actionable feedback with corrected code examples.
"""
Quantization Level Selection Guide:
| Quantization | Size Ratio | Quality | Speed | Recommended Use |
|---|---|---|---|---|
| Q2_K | ~30% | Low | Very Fast | Testing only |
| Q3_K_M | ~37% | Fair | Fast | Memory-constrained |
| Q4_0 | ~42% | Good | Fast | General use (default) |
| Q4_K_M | ~45% | Good+ | Fast | General use (recommended) |
| Q5_K_M | ~53% | Great | Medium | Quality-focused |
| Q6_K | ~62% | Excellent | Medium | High quality required |
| Q8_0 | ~80% | Best | Slow | Near-original quality |
| F16 | 100% | Original | Slow | Baseline/benchmark |
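A quantization level corresponds roughly to a bits-per-weight figure, so a GGUF file size can be estimated as parameter count × bits-per-weight ÷ 8. A rough estimator (the bits-per-weight values below are approximations; real GGUF files keep some tensors such as embeddings at higher precision, so actual downloads run somewhat larger):

```python
# Approximate effective bits per weight for common GGUF quantization levels.
# These are ballpark figures, not exact format constants.
BITS_PER_WEIGHT = {
    "Q2_K": 2.6, "Q3_K_M": 3.9, "Q4_0": 4.5, "Q4_K_M": 4.8,
    "Q5_K_M": 5.7, "Q6_K": 6.6, "Q8_0": 8.5, "F16": 16.0,
}

def approx_size_gb(n_params: float, quant: str) -> float:
    """Rough model file size in GB: parameters x bits-per-weight / 8."""
    return n_params * BITS_PER_WEIGHT[quant] / 8 / 1e9

# An 8B model at different quantization levels
for quant in ("Q4_K_M", "Q5_K_M", "Q8_0", "F16"):
    print(f"{quant:7s} ~{approx_size_gb(8e9, quant):.1f} GB")
```

For llama3.1:8b this lands near the ~4.9 GB download Ollama reports for its default Q4_K_M tag.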
13.2 GPU Acceleration Setup
NVIDIA GPU:
# Check NVIDIA driver
nvidia-smi
# Use specific GPU only
CUDA_VISIBLE_DEVICES=0 ollama serve
# Multi-GPU
CUDA_VISIBLE_DEVICES=0,1 ollama serve
AMD GPU (ROCm):
# Check ROCm driver
rocm-smi
# Specify GPU
ROCR_VISIBLE_DEVICES=0 ollama serve
Apple Silicon (Metal):
On macOS, Metal GPU acceleration is automatically enabled. No separate configuration needed.
# Check GPU usage (Processor column in ollama ps)
ollama ps
# NAME ID SIZE PROCESSOR UNTIL
# llama3.1:8b af2e33d4e25 6.7 GB 100% GPU 4 minutes from now
13.3 Docker Deployment
# docker-compose.yaml
version: '3.8'
services:
ollama:
image: ollama/ollama
deploy:
resources:
reservations:
devices:
- driver: nvidia
count: all
capabilities: [gpu]
ports:
- '11434:11434'
volumes:
- ollama_data:/root/.ollama
environment:
- OLLAMA_HOST=0.0.0.0:11434
- OLLAMA_FLASH_ATTENTION=1
- OLLAMA_KV_CACHE_TYPE=q8_0
- OLLAMA_NUM_PARALLEL=4
- OLLAMA_KEEP_ALIVE=24h
restart: unless-stopped
healthcheck:
# the ollama/ollama image does not ship curl; probe the server via the CLI
test: ['CMD', 'ollama', 'ls']
interval: 30s
timeout: 5s
retries: 3
# Model initialization (optional)
ollama-init:
image: curlimages/curl:latest
depends_on:
ollama:
condition: service_healthy
entrypoint: >
sh -c "
curl -s http://ollama:11434/api/pull -d '{\"name\": \"llama3.1:8b\"}' &&
curl -s http://ollama:11434/api/pull -d '{\"name\": \"nomic-embed-text\"}'
"
volumes:
ollama_data:
13.4 Multimodal Model Usage
# Run LLaVA model
ollama run llava "What's in this image? /path/to/photo.jpg"
# Llama 3.2 Vision
ollama run llama3.2-vision "Describe this image in Korean. /path/to/image.png"
import requests
import base64
# Encode image to base64
with open("image.jpg", "rb") as f:
image_base64 = base64.b64encode(f.read()).decode("utf-8")
response = requests.post("http://localhost:11434/api/chat", json={
"model": "llava",
"messages": [
{
"role": "user",
"content": "What's in this image?",
"images": [image_base64],
}
],
"stream": False,
})
print(response.json()["message"]["content"])
13.5 Tool Calling / Function Calling
Ollama supports OpenAI-compatible Tool Calling.
from openai import OpenAI
import json
client = OpenAI(base_url="http://localhost:11434/v1", api_key="ollama")
tools = [
{
"type": "function",
"function": {
"name": "get_weather",
"description": "Get current weather for a city",
"parameters": {
"type": "object",
"properties": {
"city": {"type": "string", "description": "City name"},
"unit": {
"type": "string",
"enum": ["celsius", "fahrenheit"],
},
},
"required": ["city"],
},
},
}
]
response = client.chat.completions.create(
model="llama3.1",
messages=[
{"role": "user", "content": "What's the current weather in Seoul?"}
],
tools=tools,
tool_choice="auto",
)
message = response.choices[0].message
if message.tool_calls:
for tool_call in message.tool_calls:
print(f"Function: {tool_call.function.name}")
print(f"Arguments: {tool_call.function.arguments}")
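Note that the model only emits the call; executing the function and sending the result back for a final answer is the application's job. A minimal dispatch sketch (get_weather here is a local stub for illustration, not a real weather API):

```python
import json

# Stub implementation; a real application would call an actual service here.
def get_weather(city: str, unit: str = "celsius") -> dict:
    return {"city": city, "temperature": 21, "unit": unit}

# Local registry mapping tool names to implementations
TOOL_REGISTRY = {"get_weather": get_weather}

def run_tool_call(name: str, arguments_json: str, tool_call_id: str) -> dict:
    """Execute one tool call and package the result as a 'tool' role message."""
    result = TOOL_REGISTRY[name](**json.loads(arguments_json))
    return {"role": "tool", "tool_call_id": tool_call_id, "content": json.dumps(result)}

# After receiving message.tool_calls, execute each call and request a final answer:
# messages.append(message)
# for tc in message.tool_calls:
#     messages.append(run_tool_call(tc.function.name, tc.function.arguments, tc.id))
# final = client.chat.completions.create(model="llama3.1", messages=messages)
```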
Part 3: Comparison and Practice
14. vLLM vs Ollama Comparison
14.1 Comprehensive Comparison Table
| Item | vLLM | Ollama |
|---|---|---|
| Primary Use | Production API serving, high-throughput inference | Local development, prototyping, personal use |
| Engine | Custom engine (PagedAttention) | llama.cpp |
| Model Format | HF Safetensors, AWQ, GPTQ, FP8 | GGUF (quantized) |
| API | OpenAI compatible | Native + OpenAI compatible |
| Install Difficulty | Medium (Python/CUDA env required) | Very Easy (single binary) |
| GPU Required | Nearly essential (NVIDIA/AMD) | Optional (runs on CPU) |
| Multi-GPU | TP + PP (up to hundreds of GPUs) | Auto-distributed (limited) |
| Concurrency | Hundreds~thousands of requests | Default 1~4 parallel |
| Quantization | AWQ, GPTQ, FP8, BnB | GGUF Q2~Q8, F16 |
| Continuous Batching | Supported | Not supported (llama.cpp limitation) |
| PagedAttention | Core technology | Not supported |
| Prefix Caching | Supported (automatic) | Not supported |
| LoRA Serving | Multi-LoRA concurrent serving | Single LoRA |
| Structured Output | JSON Schema, Regex, Grammar | JSON mode |
| Speculative Decoding | Supported (Draft model, N-gram) | Not supported |
| Streaming | Supported | Supported |
| Docker Deployment | Official image (GPU) | Official image (CPU/GPU) |
| Kubernetes | Official guide + Production Stack | Community Helm Chart |
| Memory Efficiency | Very high (less than 4% waste) | High (GGUF quantization) |
| License | Apache 2.0 | MIT |
14.2 Throughput Comparison (Llama 3.1 8B, RTX 4090)
| Concurrent Users | vLLM (tokens/s) | Ollama (tokens/s) | Ratio |
|---|---|---|---|
| 1 | ~140 | ~65 | 2.2x |
| 5 | ~500 | ~120 | 4.2x |
| 10 | ~800 | ~150 | 5.3x |
| 50 | ~1,200 | ~150 | 8.0x |
| 100 | ~1,500 | ~150 (queued) | 10.0x |
In Red Hat's benchmark, vLLM showed 793 TPS vs Ollama 41 TPS on the same hardware -- a 19x difference. This varies depending on concurrent requests, batch size, and model size.
15. Performance Benchmarks
15.1 Throughput Comparison
| Metric | vLLM | Ollama | Notes |
|---|---|---|---|
| Single Request TPS | 100~140 tok/s | 50~70 tok/s | RTX 4090, Llama 3.1 8B |
| 10 Concurrent Total TPS | 700~900 tok/s | 120~200 tok/s | Continuous Batching effect |
| 50 Concurrent Total TPS | 1,000~1,500 tok/s | ~150 tok/s | Ollama queues requests |
| Batch Inference (1K prompts) | 2,000~3,000 tok/s | Not supported | vLLM offline inference |
15.2 Latency Comparison
| Metric | vLLM | Ollama | Notes |
|---|---|---|---|
| TTFT (Time To First Token) | 50~200 ms | 100~500 ms | Varies by prompt length |
| TPOT (Time Per Output Token) | 7~15 ms | 15~25 ms | Single request basis |
| P99 Latency | 80~150 ms | 500~700 ms | 10 concurrent requests |
| Model Loading Time | 30~120 sec | 5~30 sec | GGUF loads faster |
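TTFT and TPOT can be measured client-side by timestamping streamed tokens. The metric arithmetic, given per-token arrival times measured from the moment the request was sent (an illustrative helper, not part of either project):

```python
def latency_metrics(token_times: list[float]) -> dict[str, float]:
    """token_times: arrival time in seconds of each streamed token,
    measured from when the request was sent."""
    ttft = token_times[0]  # Time To First Token
    # TPOT: average gap between consecutive output tokens (excludes TTFT)
    gaps = [b - a for a, b in zip(token_times, token_times[1:])]
    tpot = sum(gaps) / len(gaps)
    return {"ttft_s": ttft, "tpot_s": tpot, "decode_tok_per_s": 1 / tpot}

# e.g. first token after 120 ms, then one token every 10 ms
times = [0.12 + 0.01 * i for i in range(100)]
m = latency_metrics(times)
print(f"TTFT {m['ttft_s']*1000:.0f} ms, TPOT {m['tpot_s']*1000:.1f} ms")
```

Feeding this with timestamps recorded while iterating a streaming response (`stream=True` in either API) reproduces the table's TTFT/TPOT columns for your own hardware.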
15.3 Memory Usage Comparison (Llama 3.1 8B)
| Configuration | vLLM GPU Memory | Ollama GPU Memory | Notes |
|---|---|---|---|
| FP16 | ~16 GB | N/A | vLLM default |
| FP8 | ~9 GB | N/A | H100 only |
| AWQ 4-bit | ~5 GB | N/A | vLLM quantized |
| GPTQ 4-bit | ~5 GB | N/A | vLLM quantized |
| Q4_K_M (GGUF) | N/A | ~5.5 GB | Ollama default |
| Q5_K_M (GGUF) | N/A | ~6.2 GB | Higher quality |
| Q8_0 (GGUF) | N/A | ~9 GB | Best quantization quality |
| KV Cache included (4K ctx) | +0.5~2 GB | +0.5~1.5 GB | Proportional to sequences |
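The KV Cache row can be sanity-checked from the model architecture. For Llama 3.1 8B (32 layers, 8 KV heads under GQA, head dim 128), each token stores one key and one value vector per layer:

```python
def kv_cache_bytes(n_layers: int, n_kv_heads: int, head_dim: int,
                   ctx_len: int, bytes_per_elem: int = 2) -> int:
    """KV cache for one sequence: K and V, per layer, per token."""
    per_token = 2 * n_layers * n_kv_heads * head_dim * bytes_per_elem
    return per_token * ctx_len

# Llama 3.1 8B: 32 layers, 8 KV heads (GQA), head_dim 128, FP16 cache
size = kv_cache_bytes(32, 8, 128, ctx_len=4096)
print(f"{size / 2**30:.2f} GiB")  # 0.50 GiB
```

That 0.5 GiB per 4K-context sequence is why quantized KV cache (vLLM's FP8 KV cache, Ollama's OLLAMA_KV_CACHE_TYPE=q8_0) roughly halves this overhead.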
16. Recommended Scenarios
16.1 Individual Developer Local Environment
Recommended: Ollama
# Install and use immediately
ollama run llama3.1
# VS Code + Continue extension integration
# Set Ollama endpoint in settings.json
Reason: Simple installation, runs on CPU, supports macOS/Windows/Linux. Easy integration with IDE extensions.
16.2 Production API Serving
Recommended: vLLM
vllm serve meta-llama/Llama-3.1-70B-Instruct \
--tensor-parallel-size 4 \
--gpu-memory-utilization 0.90 \
--max-num-seqs 256 \
--enable-prefix-caching \
--enable-chunked-prefill \
--api-key ${API_KEY}
Reason: Overwhelming concurrent request handling with Continuous Batching. High memory efficiency with PagedAttention. Mature multi-GPU support, Kubernetes deployment, and monitoring integration.
16.3 Edge/IoT Environments
Recommended: Ollama + High Quantization
# Small model + high quantization
ollama run phi3:3.8b-mini-instruct-4k-q4_0
# Or Qwen 0.5B
ollama run qwen2.5:0.5b
Reason: Simple deployment as single binary. Runs on low-spec hardware with GGUF quantization. CPU-only inference support.
16.4 Large-Scale Batch Inference
Recommended: vLLM Offline Inference
from vllm import LLM, SamplingParams
llm = LLM(
model="meta-llama/Llama-3.1-8B-Instruct",
tensor_parallel_size=2,
gpu_memory_utilization=0.95,
)
# Process thousands of prompts at once
prompts = load_prompts_from_file("prompts.jsonl") # 10,000+ prompts
sampling_params = SamplingParams(temperature=0.0, max_tokens=512)
outputs = llm.generate(prompts, sampling_params)
save_outputs(outputs, "results.jsonl")
Reason: Batch scheduling that maximizes GPU memory utilization. Efficiently processes thousands to tens of thousands of prompts.
16.5 RAG Pipeline
Both work -- choose based on situation:
# Ollama-based RAG (development/small-scale)
from langchain_ollama import OllamaLLM, OllamaEmbeddings
llm = OllamaLLM(model="llama3.1")
embeddings = OllamaEmbeddings(model="nomic-embed-text")
# vLLM-based RAG (production)
from langchain_openai import ChatOpenAI, OpenAIEmbeddings
llm = ChatOpenAI(
base_url="http://vllm-server:8000/v1",
api_key="token",
model="meta-llama/Llama-3.1-8B-Instruct",
)
17. Request Tracing Integration
Tracking LLM requests in production environments is essential for debugging, auditing, and performance monitoring.
17.1 vLLM Request ID Tracking
vLLM's OpenAI-compatible server assigns each request an ID. To correlate requests with your own tracking, send a custom X-Request-ID header (with the OpenAI SDK, via extra_headers).
from openai import OpenAI
import uuid
client = OpenAI(base_url="http://localhost:8000/v1", api_key="token")
# Pass custom request_id
xid = str(uuid.uuid4())
response = client.chat.completions.create(
model="meta-llama/Llama-3.1-8B-Instruct",
messages=[{"role": "user", "content": "Hello"}],
extra_headers={"X-Request-ID": xid},
)
print(f"XID: {xid}")
print(f"Response ID: {response.id}")
17.2 Ollama Request Tracking
Ollama's native API does not support a separate request ID, so handle it at the reverse proxy level.
import requests
import uuid
xid = str(uuid.uuid4())
response = requests.post(
"http://localhost:11434/api/chat",
headers={"X-Request-ID": xid},
json={
"model": "llama3.1",
"messages": [{"role": "user", "content": "Hello"}],
"stream": False,
},
)
# Include xid in logging
import logging
logger = logging.getLogger(__name__)
logger.info(f"[xid={xid}] Response: {response.status_code}")
17.3 X-Request-ID Forwarding at API Gateway
NGINX Configuration:
upstream vllm_backend {
server vllm-server:8000;
}
server {
listen 80;
location /v1/ {
# Use the client's X-Request-ID, or fall back to NGINX's built-in $request_id
set $xid $http_x_request_id;
if ($xid = "") {
set $xid $request_id;
}
proxy_pass http://vllm_backend;
proxy_set_header X-Request-ID $xid;
proxy_set_header Host $host;
# Add X-Request-ID to response headers
add_header X-Request-ID $xid always;
# Include xid in access log
access_log /var/log/nginx/vllm_access.log combined_with_xid;
}
}
# Log format definition (belongs in the http context)
log_format combined_with_xid '$remote_addr - $remote_user [$time_local] '
'"$request" $status $body_bytes_sent '
'"$http_referer" "$http_user_agent" '
'xid="$xid"';
17.4 OpenTelemetry Integration
# vLLM + OpenTelemetry distributed tracing
from opentelemetry import trace
from opentelemetry.exporter.otlp.proto.grpc.trace_exporter import OTLPSpanExporter
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import BatchSpanProcessor
# Initialize Tracer
provider = TracerProvider()
exporter = OTLPSpanExporter(endpoint="http://jaeger:4317")
provider.add_span_processor(BatchSpanProcessor(exporter))
trace.set_tracer_provider(provider)
tracer = trace.get_tracer(__name__)
# Wrap LLM call as a Span
def call_llm(prompt: str, xid: str) -> str:
with tracer.start_as_current_span("llm_inference") as span:
span.set_attribute("xid", xid)
span.set_attribute("model", "llama-3.1-8b")
span.set_attribute("prompt_length", len(prompt))
response = client.chat.completions.create(
model="meta-llama/Llama-3.1-8B-Instruct",
messages=[{"role": "user", "content": prompt}],
extra_headers={"X-Request-ID": xid},
)
result = response.choices[0].message.content
span.set_attribute("response_length", len(result))
span.set_attribute("tokens_used", response.usage.total_tokens)
return result
17.5 xid Usage Patterns in Logging
Python Example:
import logging
import uuid
from contextvars import ContextVar
# Manage xid with Context Variable
request_xid: ContextVar[str] = ContextVar("request_xid", default="")
class XIDFilter(logging.Filter):
def filter(self, record):
record.xid = request_xid.get("")
return True
# Logger setup
handler = logging.StreamHandler()
handler.setFormatter(logging.Formatter(
"%(asctime)s [%(levelname)s] [xid=%(xid)s] %(message)s"
))
handler.addFilter(XIDFilter())
logger = logging.getLogger("llm_service")
logger.addHandler(handler)
logger.setLevel(logging.INFO)
# Usage
async def handle_request(prompt: str):
xid = str(uuid.uuid4())
request_xid.set(xid)
logger.info(f"Received prompt: {prompt[:50]}...")
response = await call_llm(prompt, xid)
logger.info(f"Generated {len(response)} chars")
return {"xid": xid, "response": response}
Go Example:
package main
import (
"context"
"log/slog"
"net/http"
"github.com/google/uuid"
)
type contextKey string
const xidKey contextKey = "xid"
// XID Middleware
func xidMiddleware(next http.Handler) http.Handler {
return http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
xid := r.Header.Get("X-Request-ID")
if xid == "" {
xid = uuid.New().String()
}
ctx := context.WithValue(r.Context(), xidKey, xid)
w.Header().Set("X-Request-ID", xid)
slog.Info("request received",
"xid", xid,
"method", r.Method,
"path", r.URL.Path,
)
next.ServeHTTP(w, r.WithContext(ctx))
})
}
// Ollama call function
func callOllama(ctx context.Context, prompt string) (string, error) {
xid := ctx.Value(xidKey).(string)
slog.Info("calling ollama",
"xid", xid,
"prompt_len", len(prompt),
)
// ... Ollama API call logic ...
response := "" // placeholder: decode the actual Ollama API response here
slog.Info("ollama response received",
"xid", xid,
"response_len", len(response),
)
return response, nil
}
18. References
vLLM
- vLLM Official Documentation
- vLLM GitHub
- vLLM Server Arguments
- vLLM Environment Variables
- vLLM Docker Deployment Guide
- vLLM Kubernetes Deployment Guide
- vLLM Quantization
- vLLM Production Stack (GitHub)
Ollama
- Ollama Official Documentation
- Ollama GitHub
- Ollama API Documentation
- Ollama Modelfile Reference
- Ollama CLI Reference
- Ollama FAQ (including environment variables)
- Ollama Model Library
Papers and Technical Resources
- Efficient Memory Management for Large Language Model Serving with PagedAttention (2023)
- Inside vLLM: Anatomy of a High-Throughput LLM Inference System (vLLM Blog, 2025)
- Ollama vs. vLLM: Performance Benchmarking (Red Hat Developer, 2025)
- vLLM vs Ollama vs llama.cpp vs TGI vs TensorRT-LLM: 2025 Guide (ITECS)
Related Projects
- llama.cpp (GitHub) -- Ollama's inference engine
- HuggingFace Text Generation Inference (TGI)
- NVIDIA TensorRT-LLM
- OpenAI API Reference