vLLM & Ollama Complete Guide: Running LLM Serving Engines, Parameters, and Environment Variables


Part 1: vLLM

1. Introduction to vLLM

vLLM is a high-performance LLM inference and serving engine developed at UC Berkeley. Since its release in 2023 alongside the PagedAttention paper, it has become the de facto standard for production LLM serving. As of March 2026, the latest version is v0.16.x, and the transition to the V1 architecture is underway.

1.1 How PagedAttention Works

In traditional LLM inference, the KV cache for each sequence is allocated as a contiguous block of GPU memory. Because memory is reserved up front for the maximum sequence length, 60~80% of it is typically wasted.

PagedAttention applies the operating-system concept of virtual memory paging to KV cache management.

┌─────────────────────────────────────────────────┐
│ Traditional KV Cache                            │
│  Seq 1: [used][used][used][waste][waste][waste] │
│  Seq 2: [used][waste][waste][waste][waste]      │
│  Seq 3: [used][used][waste][waste][waste]       │
│              → 60~80% Memory Waste              │
├─────────────────────────────────────────────────┤
│ PagedAttention KV Cache                         │
│  Physical Blocks: [B0][B1][B2][B3][B4][B5]...   │
│  Block Table:                                   │
│    Seq 1 → [B0, B3, B5]  (logical → physical)   │
│    Seq 2 → [B1, B4]                             │
│    Seq 3 → [B2, B6]                             │
│              → < 4% Memory Waste                │
└─────────────────────────────────────────────────┘

The core mechanisms are:

  • Fixed-size blocks: the KV cache is split into fixed-size blocks (16 tokens by default)
  • Block table: a table maps each sequence's logical block numbers to physical block addresses
  • Dynamic allocation: physical blocks are allocated only as needed while tokens are generated
  • Copy-on-write: when a sequence forks (e.g. in beam search), physical blocks are shared and copied only when a modification is required
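The block-table idea above can be sketched in a few lines of Python. This is a toy illustration of logical-to-physical block mapping and on-demand allocation, not vLLM's actual implementation; the class names are invented for the example.

```python
# Toy sketch of PagedAttention-style block allocation (not vLLM internals).
BLOCK_SIZE = 16  # tokens per block, vLLM's default

class BlockAllocator:
    """Hands out physical block IDs from a free list, on demand."""
    def __init__(self, num_blocks: int):
        self.free = list(range(num_blocks))

    def allocate(self) -> int:
        return self.free.pop(0)

class Sequence:
    """Maps logical block numbers to physical blocks as tokens arrive."""
    def __init__(self, allocator: BlockAllocator):
        self.allocator = allocator
        self.block_table: list[int] = []  # logical index -> physical block ID
        self.num_tokens = 0

    def append_token(self) -> None:
        if self.num_tokens % BLOCK_SIZE == 0:
            # Current block is full (or the sequence is empty): grab a new one.
            self.block_table.append(self.allocator.allocate())
        self.num_tokens += 1

allocator = BlockAllocator(num_blocks=8)
seq = Sequence(allocator)
for _ in range(33):  # 33 tokens -> ceil(33 / 16) = 3 physical blocks
    seq.append_token()
print(seq.block_table)  # [0, 1, 2]
```

Memory is consumed one block at a time, so at most one partially filled block per sequence is wasted, which is where the "< 4% waste" figure comes from.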

1.2 Continuous Batching

Traditional static batching waits until every sequence in a batch has finished. Continuous batching instead removes completed sequences and inserts new requests at every decoding step.

Static Batching:
Step 1: [Seq1, Seq2, Seq3, Seq4]
Step 2: [Seq1, Seq2, Seq3, Seq4]   ← still held even after Seq2 finishes
Step 3: [Seq1, ____, Seq3, Seq4]   ← slot wasted after Seq2 ends
...
Step N: next batch starts only after every sequence finishes

Continuous Batching:
Step 1: [Seq1, Seq2, Seq3, Seq4]
Step 2: [Seq1, Seq5, Seq3, Seq4]   ← Seq5 inserted as soon as Seq2 finishes
Step 3: [Seq1, Seq5, Seq6, Seq4]   ← Seq6 inserted as soon as Seq3 finishes
→ GPU idle time minimized, throughput maximized
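The throughput difference can be made concrete with a toy simulation. The sketch below is invented for illustration (no vLLM APIs): it counts total decode steps under each policy, assuming each sequence needs a fixed number of steps to finish.

```python
# Toy comparison of static vs. continuous batching (illustration only).
from collections import deque

def simulate(policy: str, lengths: list[int], batch_size: int = 4) -> int:
    """Return total decode steps needed to finish all sequences.

    "static" holds a batch until its longest sequence finishes;
    "continuous" backfills freed slots from the queue every step.
    """
    queue = deque(lengths)
    steps = 0
    if policy == "static":
        while queue:
            batch = [queue.popleft() for _ in range(min(batch_size, len(queue)))]
            steps += max(batch)  # whole batch held until the longest finishes
    else:
        running: list[int] = []
        while queue or running:
            while queue and len(running) < batch_size:
                running.append(queue.popleft())  # backfill freed slots
            steps += 1
            running = [r - 1 for r in running if r > 1]  # finished seqs leave
    return steps

lengths = [8, 2, 5, 8, 3, 4, 6, 1]
print(simulate("static", lengths))      # batches [8,2,5,8] and [3,4,6,1]: 8 + 6 = 14
print(simulate("continuous", lengths))  # 11 steps: short sequences free slots early
```

With the same workload, continuous batching finishes in fewer steps because a short sequence's slot is reused immediately instead of idling until the batch's longest sequence completes.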

1.3 Supported Models

vLLM supports nearly every major Transformer-based LLM architecture.

| Category | Supported models |
| --- | --- |
| Meta Llama family | Llama 2, Llama 3, Llama 3.1, Llama 3.2, Llama 3.3, Llama 4 |
| Mistral family | Mistral 7B, Mixtral 8x7B, Mixtral 8x22B, Mistral Large, Mistral Small |
| Qwen family | Qwen, Qwen 1.5, Qwen 2, Qwen 2.5, Qwen 3, QwQ |
| Google family | Gemma, Gemma 2, Gemma 3 |
| DeepSeek family | DeepSeek V2, DeepSeek V3, DeepSeek-R1 |
| Others | Phi-3/4, Yi, InternLM 2/3, Command R, DBRX, Falcon, StarCoder 2 |
| Multimodal | LLaVA, InternVL, Pixtral, Qwen-VL, MiniCPM-V |
| Embedding | E5-Mistral, GTE-Qwen, Jina Embeddings |

1.4 LLM Serving Engine Comparison

| Item | vLLM | TGI | TensorRT-LLM | llama.cpp |
| --- | --- | --- | --- | --- |
| Developer | UC Berkeley / vLLM Project | Hugging Face | NVIDIA | Georgi Gerganov |
| Language | Python/C++/CUDA | Rust/Python | C++/CUDA | C/C++ |
| Key technology | PagedAttention | Continuous Batching | FP8/INT4 kernel optimization | GGUF quantization |
| Multi-GPU | TP + PP | TP | TP + PP | Limited |
| Quantization | AWQ, GPTQ, FP8, BnB | AWQ, GPTQ, BnB | FP8, INT4, INT8 | GGUF (Q2~Q8) |
| API compatibility | OpenAI-compatible | OpenAI-compatible | Triton | Own API |
| Install difficulty | Medium | Medium | High | Low |
| Production readiness | Very high | High | Very high | Low~Medium |
| Community | Very active | Active | NVIDIA-led | Very active |

2. Installing and Running vLLM

2.1 pip install

# Basic install (CUDA 12.x)
pip install vllm

# Install a specific version
pip install vllm==0.16.0

# CUDA 11.8 environment
pip install vllm --extra-index-url https://download.pytorch.org/whl/cu118

2.2 conda install

conda create -n vllm python=3.11 -y
conda activate vllm
pip install vllm

2.3 Docker install

# Official Docker image (NVIDIA GPU)
docker run --runtime nvidia --gpus all \
  -v ~/.cache/huggingface:/root/.cache/huggingface \
  --env "HUGGING_FACE_HUB_TOKEN=<hf_token>" \
  -p 8000:8000 \
  --ipc=host \
  vllm/vllm-openai:latest \
  --model meta-llama/Llama-3.1-8B-Instruct

# ROCm (AMD GPU)
docker run --device /dev/kfd --device /dev/dri \
  -v ~/.cache/huggingface:/root/.cache/huggingface \
  -p 8000:8000 \
  vllm/vllm-openai:latest-rocm \
  --model meta-llama/Llama-3.1-8B-Instruct

2.4 Starting a Basic Server

# vllm serve command (recommended)
vllm serve meta-llama/Llama-3.1-8B-Instruct \
  --host 0.0.0.0 \
  --port 8000

# Run the Python module directly (legacy)
python -m vllm.entrypoints.openai.api_server \
  --model meta-llama/Llama-3.1-8B-Instruct \
  --host 0.0.0.0 \
  --port 8000

# Start with a YAML config file
vllm serve meta-llama/Llama-3.1-8B-Instruct \
  --config config.yaml

Example config.yaml:

# vLLM server config file
host: '0.0.0.0'
port: 8000
tensor_parallel_size: 2
gpu_memory_utilization: 0.90
max_model_len: 8192
dtype: 'auto'
enforce_eager: false
enable_prefix_caching: true

2.5 Offline Batch Inference

You can run batch inference directly from Python code without starting a server.

from vllm import LLM, SamplingParams

# Load the model
llm = LLM(
    model="meta-llama/Llama-3.1-8B-Instruct",
    tensor_parallel_size=1,
    gpu_memory_utilization=0.90,
)

# Configure sampling parameters
sampling_params = SamplingParams(
    temperature=0.7,
    top_p=0.9,
    max_tokens=512,
)

# Prompt list
prompts = [
    "Explain PagedAttention in simple terms.",
    "What is continuous batching?",
    "Compare vLLM and TensorRT-LLM.",
]

# Run batch inference
outputs = llm.generate(prompts, sampling_params)

for output in outputs:
    prompt = output.prompt
    generated = output.outputs[0].text
    print(f"Prompt: {prompt!r}")
    print(f"Output: {generated!r}\n")

2.6 OpenAI-Compatible API Server

The vLLM server exposes OpenAI API-compatible endpoints.

# Start the server
vllm serve meta-llama/Llama-3.1-8B-Instruct \
  --served-model-name llama-3.1-8b \
  --api-key my-secret-key

# Call Chat Completion with curl
curl http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -H "Authorization: Bearer my-secret-key" \
  -d '{
    "model": "llama-3.1-8b",
    "messages": [
      {"role": "system", "content": "You are a helpful assistant."},
      {"role": "user", "content": "What is PagedAttention?"}
    ],
    "temperature": 0.7,
    "max_tokens": 512
  }'
# Call with the OpenAI SDK
from openai import OpenAI

client = OpenAI(
    base_url="http://localhost:8000/v1",
    api_key="my-secret-key",
)

response = client.chat.completions.create(
    model="llama-3.1-8b",
    messages=[
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": "Explain the advantages of vLLM."},
    ],
    temperature=0.7,
    max_tokens=512,
)

print(response.choices[0].message.content)

3. vLLM CLI Arguments: Complete Reference

This section organizes the main CLI arguments accepted by vllm serve, by category. Run vllm serve --help for the full list, or query a single group with vllm serve --help=ModelConfig.

3.1 Model Arguments

| Argument | Type | Default | Description |
| --- | --- | --- | --- |
| --model | str | facebook/opt-125m | HuggingFace model ID or local path |
| --tokenizer | str | None (same as model) | Use a separate tokenizer |
| --revision | str | None | Specific Git revision of the model (branch, tag, commit hash) |
| --tokenizer-revision | str | None | Specific revision of the tokenizer |
| --dtype | str | "auto" | Weight data type (auto, float16, bfloat16, float32) |
| --max-model-len | int | None (from model config) | Maximum sequence length (input + output tokens) |
| --trust-remote-code | flag | False | Allow execution of remote HuggingFace code |
| --download-dir | str | None | Model download directory |
| --load-format | str | "auto" | Model load format (auto, pt, safetensors, npcache, dummy, bitsandbytes) |
| --config-format | str | "auto" | Model config format (auto, hf, mistral) |
| --seed | int | 0 | Random seed for reproducibility |

3.2 Server Arguments

| Argument | Type | Default | Description |
| --- | --- | --- | --- |
| --host | str | "0.0.0.0" | Host address to bind |
| --port | int | 8000 | Server port number |
| --uvicorn-log-level | str | "info" | Uvicorn log level |
| --api-key | str | None | API authentication key (Bearer token) |
| --served-model-name | str | None | Model name exposed by the API (defaults to the --model value) |
| --chat-template | str | None | Path to, or literal string of, a Jinja2 chat template |
| --response-role | str | "assistant" | Role used in chat completion responses |
| --ssl-keyfile | str | None | Path to the SSL key file |
| --ssl-certfile | str | None | Path to the SSL certificate file |
| --allowed-origins | list | ["*"] | CORS allowed origins |
| --middleware | list | None | FastAPI middleware classes |
| --max-log-len | int | None | Maximum prompt/output length to log |
| --disable-log-requests | flag | False | Disable request logging |

3.3 Parallelism Arguments

| Argument | Type | Default | Description |
| --- | --- | --- | --- |
| --tensor-parallel-size (-tp) | int | 1 | Number of GPUs for Tensor Parallelism |
| --pipeline-parallel-size (-pp) | int | 1 | Number of Pipeline Parallelism stages |
| --distributed-executor-backend | str | None | Distributed execution backend (ray, mp) |
| --ray-workers-use-nsight | flag | False | Use the Nsight profiler in Ray workers |
| --data-parallel-size (-dp) | int | 1 | Number of Data Parallelism processes |

Example usage:

# 4-GPU Tensor Parallelism
vllm serve meta-llama/Llama-3.1-70B-Instruct \
  --tensor-parallel-size 4

# 2-GPU Tensor + 2-way Pipeline (4 GPUs total)
vllm serve meta-llama/Llama-3.1-70B-Instruct \
  --tensor-parallel-size 2 \
  --pipeline-parallel-size 2

# Use the Ray distributed backend (multi-node)
vllm serve meta-llama/Llama-3.1-70B-Instruct \
  --tensor-parallel-size 8 \
  --distributed-executor-backend ray

3.4 Memory and Performance Arguments

| Argument | Type | Default | Description |
| --- | --- | --- | --- |
| --gpu-memory-utilization | float | 0.90 | Fraction of GPU memory to use (0.0~1.0) |
| --max-num-seqs | int | 256 | Maximum number of concurrent sequences |
| --max-num-batched-tokens | int | None (auto) | Maximum tokens processed per step |
| --block-size | int | 16 | PagedAttention block size (in tokens) |
| --swap-space | float | 4 | CPU swap space size (GiB) |
| --enforce-eager | flag | False | Disable CUDA Graphs, force eager mode |
| --max-seq-len-to-capture | int | 8192 | Maximum sequence length for CUDA Graph capture |
| --disable-custom-all-reduce | flag | False | Disable the custom all-reduce kernel |
| --enable-prefix-caching | flag | True (V1) | Enable automatic prefix caching |
| --enable-chunked-prefill | flag | True (V1) | Enable chunked prefill |
| --num-scheduler-steps | int | 1 | Decoding steps per scheduler call (multi-step scheduling) |
| --kv-cache-dtype | str | "auto" | KV cache data type (auto, fp8, fp8_e5m2, fp8_e4m3) |

Example usage:

# Memory-optimized settings
vllm serve meta-llama/Llama-3.1-8B-Instruct \
  --gpu-memory-utilization 0.95 \
  --max-num-seqs 128 \
  --max-model-len 4096 \
  --enable-prefix-caching \
  --enable-chunked-prefill

# Eager mode (debugging/compatibility)
vllm serve meta-llama/Llama-3.1-8B-Instruct \
  --enforce-eager \
  --gpu-memory-utilization 0.85

3.5 Quantization Arguments

| Argument | Type | Default | Description |
| --- | --- | --- | --- |
| --quantization (-q) | str | None | Quantization method |
| --load-format | str | "auto" | Model load format |

Supported --quantization values:

| Value | Description | Notes |
| --- | --- | --- |
| awq | AWQ (Activation-aware Weight Quantization) | 4-bit, fast inference |
| gptq | GPTQ (Post-Training Quantization) | 4-bit, ExLlamaV2 kernel |
| gptq_marlin | GPTQ + Marlin kernel | 4-bit, faster kernel |
| awq_marlin | AWQ + Marlin kernel | 4-bit, faster kernel |
| squeezellm | SqueezeLLM | Sparse quantization |
| fp8 | FP8 (8-bit floating point) | H100/MI300x only |
| bitsandbytes | BitsAndBytes | 4-bit NF4 |
| gguf | GGUF format | llama.cpp compatible |
| compressed-tensors | Compressed Tensors | General purpose |
| experts_int8 | MoE expert INT8 | MoE models only |

Example usage:

# AWQ-quantized model
vllm serve TheBloke/Llama-2-7B-AWQ \
  --quantization awq

# GPTQ-quantized model
vllm serve TheBloke/Llama-2-7B-GPTQ \
  --quantization gptq

# FP8 quantization (H100 or newer)
vllm serve neuralmagic/Meta-Llama-3.1-8B-Instruct-FP8 \
  --quantization fp8

# BitsAndBytes 4-bit (saves GPU memory)
vllm serve meta-llama/Llama-3.1-8B-Instruct \
  --quantization bitsandbytes \
  --load-format bitsandbytes

3.6 LoRA Arguments

| Argument | Type | Default | Description |
| --- | --- | --- | --- |
| --enable-lora | flag | False | Enable LoRA adapter serving |
| --max-loras | int | 1 | Maximum number of LoRAs loaded concurrently |
| --max-lora-rank | int | 16 | Maximum LoRA rank |
| --lora-extra-vocab-size | int | 256 | Extra vocabulary size for LoRA adapters |
| --lora-modules | list | None | LoRA adapter list (name=path format) |
| --long-lora-scaling-factors | list | None | Long LoRA scaling factors |

Example usage:

# Serve LoRA adapters
vllm serve meta-llama/Llama-3.1-8B-Instruct \
  --enable-lora \
  --max-loras 4 \
  --max-lora-rank 64 \
  --lora-modules \
    adapter1=/path/to/lora1 \
    adapter2=/path/to/lora2

3.7 Speculative Decoding Arguments

| Argument | Type | Default | Description |
| --- | --- | --- | --- |
| --speculative-model | str | None | Draft model (a small model, or [ngram]) |
| --num-speculative-tokens | int | None | Number of tokens to generate speculatively |
| --speculative-draft-tensor-parallel-size | int | None | TP size for the draft model |
| --speculative-disable-by-batch-size | int | None | Disable speculation above this batch size |
| --ngram-prompt-lookup-max | int | None | Maximum lookup window for n-gram speculation |
| --ngram-prompt-lookup-min | int | None | Minimum lookup window for n-gram speculation |

Example usage:

# Use a separate draft model
vllm serve meta-llama/Llama-3.1-70B-Instruct \
  --speculative-model meta-llama/Llama-3.2-1B-Instruct \
  --num-speculative-tokens 5 \
  --tensor-parallel-size 4

# N-gram-based speculative decoding (no extra model needed)
vllm serve meta-llama/Llama-3.1-8B-Instruct \
  --speculative-model "[ngram]" \
  --num-speculative-tokens 5 \
  --ngram-prompt-lookup-max 4

4. vLLM Sampling Parameters

vLLM supports the OpenAI API-compatible parameters plus additional advanced ones.

4.1 Parameter Reference

| Parameter | Type | Default | Range | Description |
| --- | --- | --- | --- | --- |
| temperature | float | 1.0 | >= 0.0 | Lower = more deterministic, higher = more creative; 0 means greedy |
| top_p | float | 1.0 | (0.0, 1.0] | Nucleus sampling: keep top tokens by cumulative probability |
| top_k | int | -1 | -1 or >= 1 | Consider only the top k tokens; -1 disables |
| min_p | float | 0.0 | [0.0, 1.0] | Minimum probability threshold, as a ratio of the top token's probability |
| frequency_penalty | float | 0.0 | [-2.0, 2.0] | Frequency-based penalty; positive discourages repetition, negative encourages it |
| presence_penalty | float | 0.0 | [-2.0, 2.0] | Presence-based penalty; penalizes any token that has already appeared |
| repetition_penalty | float | 1.0 | > 0.0 | Repetition penalty (1.0 disables, >1.0 discourages repetition) |
| max_tokens | int | 16 | >= 1 | Maximum number of tokens to generate |
| stop | list | None | - | Strings that stop generation |
| seed | int | None | - | Random seed (for reproducibility) |
| n | int | 1 | >= 1 | Number of responses per prompt |
| best_of | int | None | >= n | Generate best_of candidates and return the best n |
| use_beam_search | bool | False | - | Enable beam search |
| logprobs | int | None | [0, 20] | Number of per-token log probabilities to return |
| prompt_logprobs | int | None | [0, 20] | Number of log probabilities to return for prompt tokens |
| skip_special_tokens | bool | True | - | Omit special tokens from output |
| spaces_between_special_tokens | bool | True | - | Insert spaces between special tokens |
| guided_json | object | None | - | Structured output from a JSON Schema |
| guided_regex | str | None | - | Structured output from a regular expression |
| guided_choice | list | None | - | Structured output from a fixed set of choices |
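How top_k, top_p, and min_p interact is easier to see in code. The sketch below is a toy re-implementation of the filtering rules described in the table, invented for illustration; it is not vLLM's actual sampler.

```python
# Toy illustration of top_k / min_p / top_p candidate filtering.
def filter_candidates(probs: dict[str, float], top_k: int = -1,
                      top_p: float = 1.0, min_p: float = 0.0) -> list[str]:
    """Return the tokens that survive the three filters, in descending probability."""
    ranked = sorted(probs, key=probs.get, reverse=True)
    if top_k >= 1:
        ranked = ranked[:top_k]                # keep only the top-k tokens
    if min_p > 0.0:
        cutoff = min_p * probs[ranked[0]]      # ratio of the best token's probability
        ranked = [t for t in ranked if probs[t] >= cutoff]
    kept, cum = [], 0.0
    for t in ranked:                           # nucleus (top_p) filter
        kept.append(t)
        cum += probs[t]
        if cum >= top_p:
            break
    return kept

probs = {"the": 0.5, "a": 0.25, "an": 0.15, "this": 0.07, "that": 0.03}
print(filter_candidates(probs, top_k=4))   # ['the', 'a', 'an', 'this']
print(filter_candidates(probs, top_p=0.8)) # ['the', 'a', 'an']
print(filter_candidates(probs, min_p=0.2)) # cutoff 0.2 * 0.5 = 0.1 -> ['the', 'a', 'an']
```

Note that min_p is relative (a fraction of the top token's probability), so it adapts to how peaked the distribution is, while top_p is an absolute cumulative-mass cutoff.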

4.2 API Call Examples with curl

# Basic Chat Completion
curl http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "meta-llama/Llama-3.1-8B-Instruct",
    "messages": [
      {"role": "user", "content": "What is the population of Seoul?"}
    ],
    "temperature": 0.3,
    "top_p": 0.9,
    "max_tokens": 256,
    "frequency_penalty": 0.5,
    "seed": 42
  }'

# Structured Output (JSON mode)
curl http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "meta-llama/Llama-3.1-8B-Instruct",
    "messages": [
      {"role": "user", "content": "Give me the populations of Seoul, Busan, and Daegu as JSON"}
    ],
    "response_format": {
      "type": "json_schema",
      "json_schema": {
        "name": "city_population",
        "schema": {
          "type": "object",
          "properties": {
            "cities": {
              "type": "array",
              "items": {
                "type": "object",
                "properties": {
                  "name": {"type": "string"},
                  "population": {"type": "integer"}
                },
                "required": ["name", "population"]
              }
            }
          },
          "required": ["cities"]
        }
      }
    },
    "temperature": 0.1,
    "max_tokens": 512
  }'

# Return logprobs
curl http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "meta-llama/Llama-3.1-8B-Instruct",
    "messages": [
      {"role": "user", "content": "1+1=?"}
    ],
    "logprobs": true,
    "top_logprobs": 5,
    "max_tokens": 10
  }'

4.3 Python requests Example

import requests
import json

url = "http://localhost:8000/v1/chat/completions"
headers = {"Content-Type": "application/json"}

payload = {
    "model": "meta-llama/Llama-3.1-8B-Instruct",
    "messages": [
        {"role": "system", "content": "You are a helpful Korean assistant."},
        {"role": "user", "content": "What is quantum computing?"},
    ],
    "temperature": 0.7,
    "top_p": 0.9,
    "top_k": 50,
    "max_tokens": 1024,
    "repetition_penalty": 1.1,
    "stop": ["\n\n\n"],
}

response = requests.post(url, headers=headers, json=payload)
result = response.json()
print(result["choices"][0]["message"]["content"])

4.4 Streaming Example with the OpenAI SDK

from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="token")

# Streaming response
stream = client.chat.completions.create(
    model="meta-llama/Llama-3.1-8B-Instruct",
    messages=[
        {"role": "user", "content": "Implement quicksort in Python"},
    ],
    temperature=0.2,
    max_tokens=2048,
    stream=True,
)

for chunk in stream:
    if chunk.choices[0].delta.content:
        print(chunk.choices[0].delta.content, end="", flush=True)
print()

5. vLLM Environment Variables

vLLM controls runtime behavior through a variety of environment variables, organized here by category.

5.1 Core Environment Variables

| Variable | Default | Description |
| --- | --- | --- |
| VLLM_TARGET_DEVICE | "cuda" | Target device (cuda, rocm, neuron, cpu, xpu) |
| VLLM_USE_V1 | True | Whether to use the V1 code path |
| VLLM_WORKER_MULTIPROC_METHOD | "fork" | Multiprocessing start method (spawn, fork) |
| VLLM_ALLOW_LONG_MAX_MODEL_LEN | False | Allow max_model_len longer than the model config |
| CUDA_VISIBLE_DEVICES | None | GPU device indices to use |

5.2 Attention and Kernel Variables

| Variable | Default | Description |
| --- | --- | --- |
| VLLM_ATTENTION_BACKEND | None | Attention backend (deprecated; use --attention-backend since v0.14) |
| VLLM_USE_TRITON_FLASH_ATTN | True | Use Triton Flash Attention |
| VLLM_FLASH_ATTN_VERSION | None | Force a Flash Attention version (2 or 3) |
| VLLM_USE_FLASHINFER_SAMPLER | None | Use the FlashInfer sampler |
| VLLM_FLASHINFER_FORCE_TENSOR_CORES | False | Force FlashInfer to use tensor cores |
| VLLM_USE_TRITON_AWQ | False | Use the Triton AWQ kernel |
| VLLM_USE_DEEP_GEMM | False | Use DeepGemm kernels (MoE ops) |
| VLLM_MLA_DISABLE | False | Disable the MLA attention optimization |

5.3 Logging Variables

| Variable | Default | Description |
| --- | --- | --- |
| VLLM_CONFIGURE_LOGGING | 1 | Auto-configure vLLM logging (0 disables) |
| VLLM_LOGGING_LEVEL | "INFO" | Default logging level |
| VLLM_LOGGING_CONFIG_PATH | None | Path to a custom logging config file |
| VLLM_LOGGING_PREFIX | "" | Prefix prepended to log messages |
| VLLM_LOG_BATCHSIZE_INTERVAL | -1 | Batch-size logging interval in seconds (-1 disables) |
| VLLM_TRACE_FUNCTION | 0 | Enable function-call tracing |
| VLLM_DEBUG_LOG_API_SERVER_RESPONSE | False | Debug-log API server responses |

5.4 Distributed Processing Variables

| Variable | Default | Description |
| --- | --- | --- |
| VLLM_HOST_IP | "" | Node IP in a distributed setup |
| VLLM_PORT | 0 | Distributed communication port |
| VLLM_NCCL_SO_PATH | None | Path to the NCCL library file |
| NCCL_DEBUG | None | NCCL debug level (INFO, WARN, TRACE) |
| NCCL_SOCKET_IFNAME | None | Network interface for NCCL communication |
| VLLM_PP_LAYER_PARTITION | None | Pipeline Parallelism layer partition strategy |
| VLLM_DP_RANK | 0 | Data Parallel process rank |
| VLLM_DP_SIZE | 1 | Data Parallel world size |
| VLLM_DP_MASTER_IP | "127.0.0.1" | Data Parallel master node IP |
| VLLM_DP_MASTER_PORT | 0 | Data Parallel master node port |
| VLLM_USE_RAY_SPMD_WORKER | False | Run Ray SPMD workers |
| VLLM_USE_RAY_COMPILED_DAG | False | Use the Ray Compiled Graph API |
| VLLM_SKIP_P2P_CHECK | False | Skip the GPU peer-to-peer capability check |

5.5 HuggingFace and External Services

| Variable | Default | Description |
| --- | --- | --- |
| HF_TOKEN | None | HuggingFace API token |
| HUGGING_FACE_HUB_TOKEN | None | HuggingFace Hub token (legacy) |
| VLLM_USE_MODELSCOPE | False | Load models from ModelScope |
| VLLM_API_KEY | None | Authentication key for the vLLM API server |
| VLLM_NO_USAGE_STATS | False | Disable usage statistics collection |
| VLLM_DO_NOT_TRACK | False | Opt out of tracking |

5.6 Cache and Path Variables

| Variable | Default | Description |
| --- | --- | --- |
| VLLM_CONFIG_ROOT | ~/.config/vllm | Root directory for config files |
| VLLM_CACHE_ROOT | ~/.cache/vllm | Root directory for cache files |
| VLLM_ASSETS_CACHE | ~/.cache/vllm/assets | Cache path for downloaded assets |
| VLLM_RPC_BASE_PATH | system temp | Path for IPC multiprocessing |

5.7 Environment Variable Examples

# Multi-GPU + logging + HF token setup
export CUDA_VISIBLE_DEVICES=0,1,2,3
export HF_TOKEN="hf_xxxxxxxxxxxx"
export VLLM_LOGGING_LEVEL="DEBUG"
export VLLM_WORKER_MULTIPROC_METHOD="spawn"

vllm serve meta-llama/Llama-3.1-70B-Instruct \
  --tensor-parallel-size 4 \
  --gpu-memory-utilization 0.90

# Pass environment variables in Docker
docker run --runtime nvidia --gpus all \
  -e CUDA_VISIBLE_DEVICES=0,1 \
  -e HF_TOKEN="hf_xxxxxxxxxxxx" \
  -e VLLM_LOGGING_LEVEL="INFO" \
  -e VLLM_WORKER_MULTIPROC_METHOD="spawn" \
  -p 8000:8000 \
  --ipc=host \
  vllm/vllm-openai:latest \
  --model meta-llama/Llama-3.1-8B-Instruct \
  --tensor-parallel-size 2

6. Advanced vLLM Configuration

6.1 Multi-GPU Setup

Tensor Parallelism (TP): splits each layer of the model across multiple GPUs. This is the most common approach on a single node.

# TP=4 (model sharded across 4 GPUs)
vllm serve meta-llama/Llama-3.1-70B-Instruct \
  --tensor-parallel-size 4 \
  --gpu-memory-utilization 0.90

Pipeline Parallelism (PP): places the model's layers sequentially across multiple GPUs. This is advantageous when the interconnect is slow.

# PP=2, TP=2 (4 GPUs total, 2×2 layout)
vllm serve meta-llama/Llama-3.1-70B-Instruct \
  --tensor-parallel-size 2 \
  --pipeline-parallel-size 2

Multi-node setup (with Ray):

# Head node
ray start --head --port=6379

# Worker node
ray start --address=<master-ip>:6379

# Run vLLM (on the head node)
vllm serve meta-llama/Llama-3.1-405B-Instruct \
  --tensor-parallel-size 8 \
  --pipeline-parallel-size 2 \
  --distributed-executor-backend ray

6.2 Quantization in Detail

AWQ (Activation-aware Weight Quantization):

# Use a pre-quantized AWQ model
vllm serve TheBloke/Llama-2-13B-chat-AWQ \
  --quantization awq \
  --max-model-len 4096

# Faster with the Marlin kernel (SM 80+ GPUs)
vllm serve TheBloke/Llama-2-13B-chat-AWQ \
  --quantization awq_marlin

GPTQ (Post-Training Quantization):

# GPTQ model (ExLlamaV2 kernel used automatically)
vllm serve TheBloke/Llama-2-13B-chat-GPTQ \
  --quantization gptq

# Use the Marlin kernel
vllm serve TheBloke/Llama-2-13B-chat-GPTQ \
  --quantization gptq_marlin

FP8 (8-bit Floating Point): hardware-accelerated on H100, MI300x, and newer GPUs.

# Pre-quantized FP8 model
vllm serve neuralmagic/Meta-Llama-3.1-8B-Instruct-FP8 \
  --quantization fp8

# Dynamic FP8 quantization (no pre-quantization needed)
vllm serve meta-llama/Llama-3.1-8B-Instruct \
  --quantization fp8 \
  --kv-cache-dtype fp8
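A back-of-the-envelope calculation shows why --kv-cache-dtype fp8 matters. The sketch below assumes Llama-3.1-8B's published configuration (32 layers, 8 KV heads under GQA, head dimension 128); the formula itself is the standard KV cache sizing rule.

```python
# KV cache size per token = 2 (K and V) * layers * kv_heads * head_dim * dtype bytes.
def kv_bytes_per_token(layers: int, kv_heads: int, head_dim: int, dtype_bytes: int) -> int:
    """Bytes of KV cache consumed per token of context."""
    return 2 * layers * kv_heads * head_dim * dtype_bytes

fp16 = kv_bytes_per_token(32, 8, 128, 2)  # 131072 bytes = 128 KiB per token
fp8 = kv_bytes_per_token(32, 8, 128, 1)   # 65536 bytes = 64 KiB per token
print(fp16 // 1024, "KiB/token fp16,", fp8 // 1024, "KiB/token fp8")
# An 8192-token sequence: 8192 * 128 KiB = exactly 1 GiB of KV cache in fp16,
# halved to 512 MiB with an fp8 KV cache -- roughly doubling concurrent capacity.
```

The same arithmetic explains why --max-model-len and --max-num-seqs trade off against each other under a fixed --gpu-memory-utilization budget.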

BitsAndBytes 4-bit NF4: quantizes on the fly, no calibration data required.

vllm serve meta-llama/Llama-3.1-8B-Instruct \
  --quantization bitsandbytes \
  --load-format bitsandbytes \
  --enforce-eager  # BnB requires eager mode

6.3 LoRA Serving

# Enable LoRA adapters
vllm serve meta-llama/Llama-3.1-8B-Instruct \
  --enable-lora \
  --max-loras 4 \
  --max-lora-rank 64 \
  --lora-modules \
    korean-chat=/path/to/korean-lora \
    code-assist=/path/to/code-lora

Selecting a LoRA model in an API call:

from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="token")

# Use a specific LoRA adapter
response = client.chat.completions.create(
    model="korean-chat",  # LoRA adapter name
    messages=[{"role": "user", "content": "안녕하세요!"}],
    temperature=0.7,
    max_tokens=256,
)

6.4 Prefix Caching & Chunked Prefill

Automatic Prefix Caching: reuses the KV cache of a shared prompt prefix to reduce TTFT (time to first token). It is especially effective when many requests share the same system prompt.

# Enabled by default in V1; must be set explicitly in V0
vllm serve meta-llama/Llama-3.1-8B-Instruct \
  --enable-prefix-caching
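The mechanism behind prefix caching can be sketched in plain Python: KV blocks are keyed by a hash of the full token prefix up to the end of each block, so requests sharing a prompt prefix hit the same cached blocks. This is a toy model invented for illustration, not vLLM's implementation.

```python
# Toy sketch of automatic prefix caching (not vLLM internals).
BLOCK_SIZE = 4  # tiny block size to keep the example readable

cache: dict[int, str] = {}  # block hash -> (stand-in for) cached KV data

def prefill(tokens: list[str]) -> tuple[int, int]:
    """Return (cache hits, blocks computed) for a prompt's full blocks."""
    hits = computed = 0
    for i in range(0, len(tokens) - len(tokens) % BLOCK_SIZE, BLOCK_SIZE):
        key = hash(tuple(tokens[: i + BLOCK_SIZE]))  # key depends on the whole prefix
        if key in cache:
            hits += 1           # KV for this block already computed earlier
        else:
            cache[key] = f"kv-block-{key}"
            computed += 1
    return hits, computed

system = "You are a helpful assistant . Answer briefly .".split()  # shared prefix
print(prefill(system + "What is vLLM ?".split()))    # (0, 3): first request, all misses
print(prefill(system + "What is Ollama ?".split()))  # (2, 1): shared-prefix blocks hit
```

Only blocks whose entire prefix matches are reused, which is why the benefit is largest for a long, fixed system prompt followed by a short user question.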

Chunked Prefill: splits long prompts into chunks and interleaves prefill with decode, preventing a long prompt from blocking the decode steps of shorter requests.

vllm serve meta-llama/Llama-3.1-8B-Instruct \
  --enable-chunked-prefill \
  --max-num-batched-tokens 2048

6.5 Structured Output (Guided Decoding)

# Structured output from a JSON Schema
curl http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "meta-llama/Llama-3.1-8B-Instruct",
    "messages": [
      {"role": "user", "content": "Provide weather information for Seoul as JSON"}
    ],
    "response_format": {
      "type": "json_schema",
      "json_schema": {
        "name": "weather_info",
        "schema": {
          "type": "object",
          "properties": {
            "city": {"type": "string"},
            "temperature_celsius": {"type": "number"},
            "condition": {"type": "string"},
            "humidity_percent": {"type": "integer"}
          },
          "required": ["city", "temperature_celsius", "condition"]
        }
      }
    }
  }'

# Regex-constrained output (Completion API)
curl http://localhost:8000/v1/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "meta-llama/Llama-3.1-8B-Instruct",
    "prompt": "Generate a valid email address:",
    "guided_regex": "[a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\\.[a-zA-Z]{2,}",
    "max_tokens": 50
  }'
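A third guided-decoding mode, guided_choice (from the parameter table in Section 4.1), constrains output to a fixed label set, which is useful for classification. The sketch below only builds the request body and does not send it, so it runs without a server; the model name and endpoint comment are placeholders for your deployment.

```python
# Sketch of a guided_choice request: the model must answer with one of the
# listed labels. Sent as raw JSON to vLLM's /v1/chat/completions (via the
# OpenAI SDK you would pass it through extra_body instead).
import json

def build_guided_choice_payload(text: str, choices: list[str]) -> dict:
    """Build a chat-completions body whose output is constrained to choices."""
    return {
        "model": "meta-llama/Llama-3.1-8B-Instruct",
        "messages": [
            {"role": "user", "content": f"Classify the sentiment: {text}"}
        ],
        "guided_choice": choices,  # vLLM guided decoding parameter
        "temperature": 0.0,
        "max_tokens": 5,
    }

payload = build_guided_choice_payload(
    "The new release is fantastic!", ["positive", "negative", "neutral"]
)
print(json.dumps(payload, indent=2))
# Send with: requests.post("http://localhost:8000/v1/chat/completions", json=payload)
```

Because decoding is constrained at the token level, the response is guaranteed to be exactly one of the listed choices, so no output parsing or retry logic is needed.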

6.6 Docker Deployment

# docker-compose.yaml
version: '3.8'

services:
  vllm:
    image: vllm/vllm-openai:latest
    runtime: nvidia
    deploy:
      resources:
        reservations:
          devices:
            - driver: nvidia
              count: all
              capabilities: [gpu]
    ports:
      - '8000:8000'
    volumes:
      - ~/.cache/huggingface:/root/.cache/huggingface
    environment:
      - HF_TOKEN=${HF_TOKEN}
      - VLLM_LOGGING_LEVEL=INFO
      - VLLM_WORKER_MULTIPROC_METHOD=spawn
    ipc: host
    command: >
      --model meta-llama/Llama-3.1-8B-Instruct
      --host 0.0.0.0
      --port 8000
      --tensor-parallel-size 2
      --gpu-memory-utilization 0.90
      --max-model-len 8192
      --enable-prefix-caching
    healthcheck:
      test: ['CMD', 'curl', '-f', 'http://localhost:8000/health']
      interval: 30s
      timeout: 10s
      retries: 3
      start_period: 120s

# Run with Docker Compose
HF_TOKEN=hf_xxxx docker compose up -d

# Check logs
docker compose logs -f vllm

6.7 Kubernetes Deployment

# vllm-deployment.yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: vllm-llama3
  namespace: ai-serving
  labels:
    app: vllm
spec:
  replicas: 1
  selector:
    matchLabels:
      app: vllm
  template:
    metadata:
      labels:
        app: vllm
    spec:
      containers:
        - name: vllm
          image: vllm/vllm-openai:latest
          ports:
            - containerPort: 8000
              name: http
          args:
            - '--model'
            - 'meta-llama/Llama-3.1-8B-Instruct'
            - '--host'
            - '0.0.0.0'
            - '--port'
            - '8000'
            - '--tensor-parallel-size'
            - '2'
            - '--gpu-memory-utilization'
            - '0.90'
            - '--max-model-len'
            - '8192'
          env:
            - name: HF_TOKEN
              valueFrom:
                secretKeyRef:
                  name: hf-secret
                  key: token
            - name: VLLM_WORKER_MULTIPROC_METHOD
              value: 'spawn'
          resources:
            limits:
              nvidia.com/gpu: '2'
            requests:
              nvidia.com/gpu: '2'
              memory: '32Gi'
              cpu: '8'
          volumeMounts:
            - name: shm
              mountPath: /dev/shm
            - name: model-cache
              mountPath: /root/.cache/huggingface
          readinessProbe:
            httpGet:
              path: /health
              port: 8000
            initialDelaySeconds: 120
            periodSeconds: 10
          livenessProbe:
            httpGet:
              path: /health
              port: 8000
            initialDelaySeconds: 180
            periodSeconds: 30
      volumes:
        - name: shm
          emptyDir:
            medium: Memory
            sizeLimit: 2Gi
        - name: model-cache
          persistentVolumeClaim:
            claimName: model-cache-pvc
      nodeSelector:
        nvidia.com/gpu.product: 'NVIDIA-A100-SXM4-80GB'
---
apiVersion: v1
kind: Service
metadata:
  name: vllm-service
  namespace: ai-serving
spec:
  selector:
    app: vllm
  ports:
    - port: 8000
      targetPort: 8000
      name: http
  type: ClusterIP
---
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: vllm-hpa
  namespace: ai-serving
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: vllm-llama3
  minReplicas: 1
  maxReplicas: 4
  metrics:
    - type: Pods
      pods:
        metric:
          name: vllm_num_requests_running
        target:
          type: AverageValue
          averageValue: '50'

Part 2: Ollama

7. Introduction to Ollama

Ollama is an open-source tool that makes it easy to run LLMs locally. Much like Docker, a single command such as ollama run llama3.1 downloads a model and starts an interactive chat.

7.1 Architecture Highlights

  • GGUF-based: uses llama.cpp's GGUF (GPT-Generated Unified Format) quantized models
  • llama.cpp engine: uses llama.cpp internally as its inference engine
  • Single binary: ships a Go server plus the llama.cpp C++ engine as one binary
  • Automatic GPU acceleration: auto-detects NVIDIA CUDA, AMD ROCm, and Apple Metal for GPU offloading
  • Model registry: pull/push pre-quantized models from ollama.com/library, Docker Hub-style

7.2 Supported Models

| Category | Models | Sizes |
| --- | --- | --- |
| Meta Llama | llama3.1, llama3.2, llama3.3 | 1B ~ 405B |
| Mistral | mistral, mixtral | 7B ~ 8x22B |
| Google | gemma, gemma2, gemma3 | 2B ~ 27B |
| Microsoft | phi3, phi4 | 3.8B ~ 14B |
| DeepSeek | deepseek-r1, deepseek-v3, deepseek-coder-v2 | 1.5B ~ 671B |
| Qwen | qwen, qwen2, qwen2.5, qwen3 | 0.5B ~ 72B |
| Code-focused | codellama, starcoder2, qwen2.5-coder | 3B ~ 34B |
| Embedding | nomic-embed-text, mxbai-embed-large, all-minilm | - |
| Multimodal | llava, bakllava, llama3.2-vision | 7B ~ 90B |

8. Installing and Running Ollama

8.1 Installation by Platform

macOS:

# Homebrew
brew install ollama

# Or the official install script
curl -fsSL https://ollama.com/install.sh | sh

Linux:

# Official install script (recommended)
curl -fsSL https://ollama.com/install.sh | sh

# Or manual install
curl -L https://ollama.com/download/ollama-linux-amd64 -o /usr/local/bin/ollama
chmod +x /usr/local/bin/ollama

Windows:

Download and run the Windows installer from the official website (ollama.com).

Docker:

# CPU only
docker run -d -v ollama:/root/.ollama -p 11434:11434 \
  --name ollama ollama/ollama

# NVIDIA GPU
docker run -d --gpus=all -v ollama:/root/.ollama -p 11434:11434 \
  --name ollama ollama/ollama

# AMD GPU (ROCm)
docker run -d --device /dev/kfd --device /dev/dri \
  -v ollama:/root/.ollama -p 11434:11434 \
  --name ollama ollama/ollama:rocm

8.2 Basic Usage

# Start the server (if it doesn't auto-start in the background)
ollama serve

# Download a model and chat
ollama run llama3.1

# Specify a tag (size/quantization)
ollama run llama3.1:8b
ollama run llama3.1:70b-instruct-q4_K_M
ollama run qwen2.5:32b-instruct-q5_K_M

# Download only (don't run)
ollama pull llama3.1:8b

# One-shot prompt
ollama run llama3.1 "What is PagedAttention?"

9. Ollama CLI Commands: Complete Reference

9.1 Command Reference

| Command | Description | Key options |
| --- | --- | --- |
| ollama serve | Start the Ollama server | --help to see environment variables |
| ollama run <model> | Run a model (auto-pulls if missing) | --verbose, --nowordwrap, --format json |
| ollama pull <model> | Download a model | --insecure |
| ollama push <model> | Upload a model to the registry | --insecure |
| ollama create <model> | Create a custom model from a Modelfile | -f <Modelfile>, --quantize |
| ollama list / ollama ls | List installed models | - |
| ollama show <model> | Show model details | --modelfile, --parameters, --system, --template, --license |
| ollama cp <src> <dst> | Copy a model | - |
| ollama rm <model> | Delete a model | - |
| ollama ps | List running models | - |
| ollama stop <model> | Stop a running model | - |
| ollama signin | Sign in to ollama.com | - |
| ollama signout | Sign out of ollama.com | - |

9.2 Detailed Examples per Command

ollama serve - start the server:

# Default start (localhost:11434)
ollama serve

# Change the bind address via an environment variable
OLLAMA_HOST=0.0.0.0:11434 ollama serve

# Debug mode
OLLAMA_DEBUG=1 ollama serve

ollama run - run a model:

# Interactive mode
ollama run llama3.1

# One-shot prompt
ollama run llama3.1 "Explain quantum computing"

# JSON-formatted output
ollama run llama3.1 "List 3 Korean cities" --format json

# Multimodal (image input)
ollama run llama3.2-vision "What's in this image? /path/to/image.png"

# Verbose mode (show performance stats)
ollama run llama3.1 --verbose

# With a system prompt
ollama run llama3.1 --system "You are a Korean translator."

ollama create - create a custom model:

# From a Modelfile
ollama create my-model -f ./Modelfile

# From a GGUF file
ollama create my-model -f ./Modelfile-from-gguf

# Quantize during creation
ollama create my-model-q4 --quantize q4_K_M -f ./Modelfile

ollama show - inspect a model:

# Full information
ollama show llama3.1

# Print the Modelfile
ollama show llama3.1 --modelfile

# Check parameters
ollama show llama3.1 --parameters

# Check the system prompt
ollama show llama3.1 --system

# Check the template
ollama show llama3.1 --template

ollama ps - list running models:

$ ollama ps
NAME              ID            SIZE     PROCESSOR    UNTIL
llama3.1:8b       af2e33d4e25   6.7 GB   100% GPU     4 minutes from now
qwen2.5:7b        845dbda0ea48  4.7 GB   100% GPU     3 minutes from now

10. Ollama API Endpoints

Ollama provides both a native REST API and an OpenAI-compatible API. The default base URL is http://localhost:11434.

10.1 Native API Endpoints

| Endpoint | Method | Description |
| --- | --- | --- |
| /api/generate | POST | Generate a text completion |
| /api/chat | POST | Generate a chat completion |
| /api/embed | POST | Generate embedding vectors |
| /api/tags | GET | List local models |
| /api/show | POST | Show model details |
| /api/pull | POST | Download a model |
| /api/push | POST | Upload a model |
| /api/create | POST | Create a custom model |
| /api/copy | POST | Copy a model |
| /api/delete | DELETE | Delete a model |
| /api/ps | GET | List running models |
| /api/version | GET | Ollama version info |

10.2 OpenAI-Compatible Endpoints

| Endpoint | Method | Description |
| --- | --- | --- |
| /v1/chat/completions | POST | OpenAI Chat Completion-compatible |
| /v1/completions | POST | OpenAI Completion-compatible |
| /v1/models | GET | Model list (OpenAI format) |
| /v1/embeddings | POST | Embeddings (OpenAI format) |

10.3 API Call Examples

Generate (completion):

# Basic generation
curl http://localhost:11434/api/generate -d '{
  "model": "llama3.1",
  "prompt": "Why is the sky blue?",
  "stream": false
}'

# Streaming (the default)
curl http://localhost:11434/api/generate -d '{
  "model": "llama3.1",
  "prompt": "Write a haiku about coding",
  "options": {
    "temperature": 0.7,
    "num_predict": 100
  }
}'

# JSON-formatted output
curl http://localhost:11434/api/generate -d '{
  "model": "llama3.1",
  "prompt": "List 3 programming languages as JSON",
  "format": "json",
  "stream": false
}'

Chat:

curl http://localhost:11434/api/chat -d '{
  "model": "llama3.1",
  "messages": [
    {"role": "system", "content": "You are a helpful Korean assistant."},
    {"role": "user", "content": "Recommend some attractions in Seoul."}
  ],
  "stream": false,
  "options": {
    "temperature": 0.8,
    "top_p": 0.9,
    "num_ctx": 4096,
    "num_predict": 512
  }
}'

Embed (embeddings):

# Embed a single text
curl http://localhost:11434/api/embed -d '{
  "model": "nomic-embed-text",
  "input": "Hello, world!"
}'

# Embed multiple texts
curl http://localhost:11434/api/embed -d '{
  "model": "nomic-embed-text",
  "input": ["Hello world", "Goodbye world"]
}'
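Embeddings are usually consumed by computing similarity between vectors. The sketch below pairs a plain-Python cosine similarity with a helper for Ollama's native /api/embed endpoint; `embed` is only defined, not called, so the snippet runs without a live server, and it assumes nomic-embed-text has already been pulled.

```python
# Cosine similarity over Ollama /api/embed results (helper uses stdlib only).
import json
import math
import urllib.request

def embed(texts: list[str], url: str = "http://localhost:11434/api/embed") -> list[list[float]]:
    """Fetch embeddings from Ollama's native embed endpoint (one vector per text)."""
    body = json.dumps({"model": "nomic-embed-text", "input": texts}).encode()
    req = urllib.request.Request(url, data=body,
                                 headers={"Content-Type": "application/json"})
    with urllib.request.urlopen(req) as resp:
        return json.loads(resp.read())["embeddings"]

def cosine(a: list[float], b: list[float]) -> float:
    """Cosine similarity between two vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm

# cosine() works on any vectors; with a live server you would pass embed([...]):
print(cosine([3.0, 4.0], [3.0, 4.0]))  # identical direction -> 1.0
print(cosine([1.0, 0.0], [0.0, 1.0]))  # orthogonal -> 0.0
```

With a running server, `v1, v2 = embed(["Hello world", "Goodbye world"])` followed by `cosine(v1, v2)` gives the semantic similarity of the two strings.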

OpenAI-compatible API:

# OpenAI-style Chat Completion
curl http://localhost:11434/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "llama3.1",
    "messages": [
      {"role": "user", "content": "Hello!"}
    ],
    "temperature": 0.7,
    "max_tokens": 256
  }'

# List models
curl http://localhost:11434/v1/models

Calling from Python:

import requests

# Generate API
response = requests.post("http://localhost:11434/api/generate", json={
    "model": "llama3.1",
    "prompt": "Explain Docker in Korean",
    "stream": False,
    "options": {
        "temperature": 0.7,
        "num_predict": 512,
    },
})
print(response.json()["response"])

# Chat API
response = requests.post("http://localhost:11434/api/chat", json={
    "model": "llama3.1",
    "messages": [
        {"role": "user", "content": "What is Kubernetes?"},
    ],
    "stream": False,
})
print(response.json()["message"]["content"])

# Use Ollama via the OpenAI SDK
from openai import OpenAI

client = OpenAI(
    base_url="http://localhost:11434/v1",
    api_key="ollama",  # Ollama needs no API key; any value works
)

response = client.chat.completions.create(
    model="llama3.1",
    messages=[
        {"role": "user", "content": "Explain Python's GIL."},
    ],
    temperature=0.7,
    max_tokens=512,
)
print(response.choices[0].message.content)

11. Ollama 파라미터 (Modelfile & API)

11.1 Modelfile 구조

Modelfile은 Ollama 커스텀 모델을 정의하는 파일이다. Dockerfile과 유사한 구조를 가진다.

# 기본 모델 지정 (필수)
FROM llama3.1:8b

# 파라미터 설정
PARAMETER temperature 0.7
PARAMETER top_p 0.9
PARAMETER top_k 40
PARAMETER num_ctx 4096
PARAMETER num_predict 512
PARAMETER repeat_penalty 1.1
PARAMETER stop "<|eot_id|>"
PARAMETER stop "<|end_of_text|>"

# 시스템 프롬프트
SYSTEM """
당신은 친절한 한국어 AI 어시스턴트입니다.
정확하고 간결한 답변을 제공하며, 필요할 때 예시를 들어 설명합니다.
"""

# 대화 템플릿 (Go text/template 문법)
TEMPLATE """
{{- if .System }}<|start_header_id|>system<|end_header_id|>
{{ .System }}<|eot_id|>{{ end }}
{{- range .Messages }}<|start_header_id|>{{ .Role }}<|end_header_id|>
{{ .Content }}<|eot_id|>{{ end }}<|start_header_id|>assistant<|end_header_id|>
"""

# LoRA 어댑터 적용 (선택)
ADAPTER /path/to/lora-adapter.gguf

# 라이선스 정보 (선택)
LICENSE """
Apache 2.0
"""

| 지시어 | 설명 | 필수 여부 |
|---|---|---|
| FROM | 기본 모델 (모델 이름 또는 GGUF 파일 경로) | 필수 |
| PARAMETER | 모델 파라미터 설정 | 선택 |
| TEMPLATE | 프롬프트 템플릿 | 선택 |
| SYSTEM | 시스템 프롬프트 | 선택 |
| ADAPTER | LoRA/QLoRA 어댑터 경로 | 선택 |
| LICENSE | 라이선스 정보 | 선택 |
| MESSAGE | 대화 히스토리 사전 설정 | 선택 |

11.2 PARAMETER 옵션 상세

| 파라미터 | 타입 | 기본값 | 범위/설명 |
|---|---|---|---|
| temperature | float | 0.8 | 0.0~2.0. 높을수록 창의적, 낮을수록 결정적 |
| top_p | float | 0.9 | 0.0~1.0. Nucleus sampling 확률 임계값 |
| top_k | int | 40 | 1~100. 상위 k개 토큰만 고려 |
| min_p | float | 0.0 | 0.0~1.0. 최소 확률 필터링 |
| num_predict | int | -1 | 생성할 최대 토큰 수 (-1: 무제한, -2: 컨텍스트 채울 때까지) |
| num_ctx | int | 2048 | 컨텍스트 윈도우 크기 (토큰 수) |
| repeat_penalty | float | 1.1 | 반복 페널티 (1.0이면 비활성화) |
| repeat_last_n | int | 64 | 반복 체크 범위 (0: 비활성화, -1: num_ctx) |
| seed | int | 0 | 랜덤 시드 (0이면 매번 다른 결과) |
| stop | string | - | 생성 중단 문자열 (여러 개 지정 가능) |
| num_gpu | int | auto | GPU에 오프로드할 레이어 수 (0: CPU only) |
| num_thread | int | auto | CPU 스레드 수 |
| num_batch | int | 512 | 프롬프트 처리 배치 크기 |
| mirostat | int | 0 | Mirostat 샘플링 (0: 비활성화, 1: Mirostat, 2: Mirostat 2.0) |
| mirostat_eta | float | 0.1 | Mirostat 학습률 |
| mirostat_tau | float | 5.0 | Mirostat 타겟 엔트로피 |
| tfs_z | float | 1.0 | Tail-Free Sampling (1.0이면 비활성화) |
| typical_p | float | 1.0 | Locally Typical Sampling (1.0이면 비활성화) |
| use_mlock | bool | false | 모델을 메모리에 고정 (swap 방지) |
| num_keep | int | 0 | 컨텍스트 재활용 시 유지할 토큰 수 |
| penalize_newline | bool | true | 줄바꿈 토큰에 페널티 적용 |

11.3 API에서 파라미터 사용

API 호출 시 options 필드로 파라미터를 전달한다.

curl http://localhost:11434/api/chat -d '{
  "model": "llama3.1",
  "messages": [
    {"role": "user", "content": "Hello"}
  ],
  "options": {
    "temperature": 0.3,
    "top_p": 0.9,
    "top_k": 50,
    "num_ctx": 8192,
    "num_predict": 1024,
    "repeat_penalty": 1.2,
    "seed": 42,
    "stop": ["<|eot_id|>"]
  }
}'

12. Ollama 환경변수 총정리

12.1 서버 및 네트워크

| 환경변수 | 기본값 | 설명 |
|---|---|---|
| OLLAMA_HOST | 127.0.0.1:11434 | 서버 바인드 주소와 포트 |
| OLLAMA_ORIGINS | 없음 | CORS 허용 origin (쉼표 구분) |
| OLLAMA_KEEP_ALIVE | 5m | 모델 언로드까지 유휴 시간 (5m, 1h, -1=영구 로드) |
| OLLAMA_MAX_QUEUE | 512 | 최대 대기열 크기 (초과 시 요청 거부) |
| OLLAMA_NUM_PARALLEL | 1 | 모델당 동시 요청 처리 수 |
| OLLAMA_MAX_LOADED_MODELS | 1 (CPU), GPU수×3 | 동시 로드 가능한 최대 모델 수 |
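
OLLAMA_KEEP_ALIVE는 서버 전역 기본값이며, 개별 요청의 keep_alive 필드로 재정의할 수도 있다. 아래는 이를 정리한 최소 스케치다 (로컬 Ollama 서버와 llama3.1 모델이 준비되어 있다는 가정).

```python
# 요청별 keep_alive 재정의 스케치: 전역 OLLAMA_KEEP_ALIVE 값을
# 개별 요청의 keep_alive 필드가 덮어쓴다.
def build_generate_payload(prompt: str, model: str = "llama3.1",
                           keep_alive: str = "10m") -> dict:
    # keep_alive: "5m", "1h" 같은 기간 문자열, "0"(응답 직후 언로드), "-1"(영구 로드)
    return {
        "model": model,
        "prompt": prompt,
        "stream": False,
        "keep_alive": keep_alive,
    }

def generate(prompt: str, keep_alive: str = "10m") -> str:
    # 실제 호출부 -- 로컬 서버(11434 포트)가 떠 있다고 가정
    import requests  # 문서의 다른 예제와 동일하게 requests 사용
    resp = requests.post(
        "http://localhost:11434/api/generate",
        json=build_generate_payload(prompt, keep_alive=keep_alive),
        timeout=300,
    )
    resp.raise_for_status()
    return resp.json()["response"]
```

자주 쓰는 모델은 "-1"로 상주시키고, 드물게 쓰는 대형 모델은 짧은 keep_alive로 VRAM을 빨리 반환하는 식으로 조합할 수 있다.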

12.2 스토리지 및 경로

| 환경변수 | 기본값 | 설명 |
|---|---|---|
| OLLAMA_MODELS | OS별 기본 경로 | 모델 저장 디렉토리 |
| OLLAMA_TMPDIR | 시스템 temp | 임시 파일 디렉토리 |
| OLLAMA_NOPRUNE | 없음 | 부팅 시 미사용 blob 정리 비활성화 |

플랫폼별 기본 모델 저장 경로:

| OS | 기본 경로 |
|---|---|
| macOS | ~/.ollama/models |
| Linux | /usr/share/ollama/.ollama/models |
| Windows | C:\Users\<user>\.ollama\models |

12.3 GPU 및 성능

| 환경변수 | 기본값 | 설명 |
|---|---|---|
| OLLAMA_FLASH_ATTENTION | 0 | Flash Attention 활성화 (1로 설정) |
| OLLAMA_KV_CACHE_TYPE | f16 | KV Cache 양자화 타입 (f16, q8_0, q4_0) |
| OLLAMA_GPU_OVERHEAD | 0 | GPU당 예약할 VRAM (바이트) |
| OLLAMA_LLM_LIBRARY | auto | 사용할 LLM 라이브러리 강제 지정 |
| CUDA_VISIBLE_DEVICES | 전체 GPU | 사용할 NVIDIA GPU 디바이스 번호 |
| ROCR_VISIBLE_DEVICES | 전체 GPU | 사용할 AMD GPU 디바이스 번호 |
| GPU_DEVICE_ORDINAL | 전체 GPU | 사용할 GPU 순서 |

12.4 로깅 및 디버그

| 환경변수 | 기본값 | 설명 |
|---|---|---|
| OLLAMA_DEBUG | 0 | 디버그 로깅 활성화 (1로 설정) |
| OLLAMA_NOHISTORY | 0 | 대화형 모드에서 readline 히스토리 비활성화 |

12.5 컨텍스트 및 추론

| 환경변수 | 기본값 | 설명 |
|---|---|---|
| OLLAMA_CONTEXT_LENGTH | 4096 | 기본 컨텍스트 윈도우 크기 |
| OLLAMA_NO_CLOUD | 0 | 클라우드 기능 비활성화 (1로 설정) |
| HTTPS_PROXY / HTTP_PROXY | 없음 | 프록시 서버 설정 |
| NO_PROXY | 없음 | 프록시 우회 호스트 |

12.6 환경변수 설정 방법

macOS (launchctl):

# 환경변수 설정
launchctl setenv OLLAMA_HOST "0.0.0.0:11434"
launchctl setenv OLLAMA_MODELS "/Volumes/ExternalSSD/ollama/models"
launchctl setenv OLLAMA_FLASH_ATTENTION "1"
launchctl setenv OLLAMA_KV_CACHE_TYPE "q8_0"
launchctl setenv OLLAMA_NUM_PARALLEL "4"
launchctl setenv OLLAMA_KEEP_ALIVE "-1"

# Ollama 재시작
brew services restart ollama

Linux (systemd):

# systemd 서비스 오버라이드 생성
sudo systemctl edit ollama

# 에디터에서 다음 내용 추가:
[Service]
Environment="OLLAMA_HOST=0.0.0.0:11434"
Environment="OLLAMA_MODELS=/data/ollama/models"
Environment="OLLAMA_FLASH_ATTENTION=1"
Environment="OLLAMA_KV_CACHE_TYPE=q8_0"
Environment="OLLAMA_NUM_PARALLEL=4"
Environment="OLLAMA_KEEP_ALIVE=-1"
Environment="OLLAMA_MAX_LOADED_MODELS=3"
Environment="CUDA_VISIBLE_DEVICES=0,1"

# 서비스 재시작
sudo systemctl daemon-reload
sudo systemctl restart ollama

Docker:

docker run -d --gpus=all \
  -e OLLAMA_HOST=0.0.0.0:11434 \
  -e OLLAMA_FLASH_ATTENTION=1 \
  -e OLLAMA_KV_CACHE_TYPE=q8_0 \
  -e OLLAMA_NUM_PARALLEL=4 \
  -e OLLAMA_KEEP_ALIVE=-1 \
  -v /data/ollama:/root/.ollama \
  -p 11434:11434 \
  --name ollama \
  ollama/ollama

13. Ollama 고급 활용

13.1 Modelfile 작성 가이드

한국어 어시스턴트 모델:

FROM llama3.1:8b

PARAMETER temperature 0.7
PARAMETER top_p 0.9
PARAMETER num_ctx 4096
PARAMETER num_predict 1024
PARAMETER repeat_penalty 1.15
PARAMETER stop "<|eot_id|>"
PARAMETER stop "<|end_of_text|>"

SYSTEM """
당신은 대한민국의 문화와 역사에 정통한 한국어 AI 어시스턴트입니다.
항상 정확하고 친절하게 한국어로 응답하며, 필요시 영어 기술 용어를 병기합니다.
답변은 구조적으로 정리하여 제공합니다.
"""

MESSAGE user 안녕하세요, 자기소개 해주세요.
MESSAGE assistant 안녕하세요! 저는 한국어에 특화된 AI 어시스턴트입니다. 한국의 문화, 역사, 기술 등 다양한 주제에 대해 도움을 드릴 수 있습니다. 무엇이든 물어보세요!

# 모델 생성
ollama create korean-assistant -f ./Modelfile-korean

# 실행
ollama run korean-assistant "서울 3대 궁궐에 대해 알려줘"

코드 리뷰 모델:

FROM qwen2.5-coder:7b

PARAMETER temperature 0.2
PARAMETER top_p 0.85
PARAMETER num_ctx 8192
PARAMETER num_predict 2048

SYSTEM """
You are an expert code reviewer. Analyze code for:
1. Bugs and potential issues
2. Performance improvements
3. Security vulnerabilities
4. Code style and best practices

Provide specific, actionable feedback with corrected code examples.
"""

양자화 레벨 선택 가이드:

| 양자화 | 크기 비율 | 품질 | 속도 | 추천 용도 |
|---|---|---|---|---|
| Q2_K | ~30% | 낮음 | 매우 빠름 | 테스트용 |
| Q3_K_M | ~37% | 보통 | 빠름 | 메모리 제한 환경 |
| Q4_0 | ~42% | 양호 | 빠름 | 일반 사용 (기본) |
| Q4_K_M | ~45% | 양호+ | 빠름 | 일반 사용 (권장) |
| Q5_K_M | ~53% | 우수 | 보통 | 품질 중시 |
| Q6_K | ~62% | 매우 우수 | 보통 | 높은 품질 요구 |
| Q8_0 | ~80% | 최우수 | 느림 | 원본에 가까운 품질 |
| F16 | 100% | 원본 | 느림 | 기준선/벤치마크 |
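
양자화 레벨별 파일 크기는 가중치당 평균 비트 수로 대략 환산해 볼 수 있다. 아래는 그 근사 계산 스케치다 (BITS_PER_WEIGHT 값은 대략적인 가정치이며, 실제 GGUF 파일은 스케일·메타데이터 때문에 조금 더 크다).

```python
# 양자화 레벨별 모델 파일 크기 근사 (스케치).
# 가중치당 평균 비트 수만으로 계산한 단순 근사다.
BITS_PER_WEIGHT = {
    "Q2_K": 2.6, "Q3_K_M": 3.9, "Q4_0": 4.5, "Q4_K_M": 4.8,
    "Q5_K_M": 5.7, "Q6_K": 6.6, "Q8_0": 8.5, "F16": 16.0,
}

def approx_size_gb(num_params_billion: float, quant: str) -> float:
    """파라미터 수(10억 단위)와 양자화 레벨로 대략적인 GB 크기를 추정."""
    bits = BITS_PER_WEIGHT[quant]
    # 1e9 * num_params_billion 개의 가중치 * bits / 8 바이트, GB로 환산
    return num_params_billion * bits / 8

# 예: 8B 모델은 F16에서 약 16 GB, Q4_K_M에서 약 4.8 GB
```

VRAM 예산이 정해져 있을 때 어떤 양자화 레벨까지 내려가야 하는지 가늠하는 용도로 쓸 수 있다.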

13.2 GPU 가속 설정

NVIDIA GPU:

# NVIDIA 드라이버 확인
nvidia-smi

# 특정 GPU만 사용
CUDA_VISIBLE_DEVICES=0 ollama serve

# 멀티 GPU
CUDA_VISIBLE_DEVICES=0,1 ollama serve

AMD GPU (ROCm):

# ROCm 드라이버 확인
rocm-smi

# 특정 GPU 지정
ROCR_VISIBLE_DEVICES=0 ollama serve

Apple Silicon (Metal):

macOS에서는 자동으로 Metal GPU 가속이 활성화된다. 별도 설정이 불필요하다.

# GPU 사용 확인 (ollama ps에서 Processor 컬럼)
ollama ps
# NAME           ID            SIZE    PROCESSOR     UNTIL
# llama3.1:8b    af2e33d4e25   6.7 GB  100% GPU      4 minutes from now

13.3 Docker 배포

# docker-compose.yaml
version: '3.8'

services:
  ollama:
    image: ollama/ollama
    deploy:
      resources:
        reservations:
          devices:
            - driver: nvidia
              count: all
              capabilities: [gpu]
    ports:
      - '11434:11434'
    volumes:
      - ollama_data:/root/.ollama
    environment:
      - OLLAMA_HOST=0.0.0.0:11434
      - OLLAMA_FLASH_ATTENTION=1
      - OLLAMA_KV_CACHE_TYPE=q8_0
      - OLLAMA_NUM_PARALLEL=4
      - OLLAMA_KEEP_ALIVE=24h
    restart: unless-stopped
    healthcheck:
      test: ['CMD', 'curl', '-f', 'http://localhost:11434/api/version']
      interval: 30s
      timeout: 5s
      retries: 3

  # 모델 초기화 (선택)
  ollama-init:
    image: curlimages/curl:latest
    depends_on:
      ollama:
        condition: service_healthy
    entrypoint: >
      sh -c "
        curl -s http://ollama:11434/api/pull -d '{\"name\": \"llama3.1:8b\"}' &&
        curl -s http://ollama:11434/api/pull -d '{\"name\": \"nomic-embed-text\"}'
      "

volumes:
  ollama_data:

13.4 멀티모달 모델 활용

# LLaVA 모델 실행
ollama run llava "What's in this image? /path/to/photo.jpg"

# Llama 3.2 Vision
ollama run llama3.2-vision "이 이미지를 한국어로 설명해주세요. /path/to/image.png"

import requests
import base64

# 이미지를 base64로 인코딩
with open("image.jpg", "rb") as f:
    image_base64 = base64.b64encode(f.read()).decode("utf-8")

response = requests.post("http://localhost:11434/api/chat", json={
    "model": "llava",
    "messages": [
        {
            "role": "user",
            "content": "이 이미지에 무엇이 있나요?",
            "images": [image_base64],
        }
    ],
    "stream": False,
})
print(response.json()["message"]["content"])

13.5 Tool Calling / Function Calling

Ollama는 OpenAI 호환 Tool Calling을 지원한다.

from openai import OpenAI
import json

client = OpenAI(base_url="http://localhost:11434/v1", api_key="ollama")

tools = [
    {
        "type": "function",
        "function": {
            "name": "get_weather",
            "description": "Get current weather for a city",
            "parameters": {
                "type": "object",
                "properties": {
                    "city": {"type": "string", "description": "City name"},
                    "unit": {
                        "type": "string",
                        "enum": ["celsius", "fahrenheit"],
                    },
                },
                "required": ["city"],
            },
        },
    }
]

response = client.chat.completions.create(
    model="llama3.1",
    messages=[
        {"role": "user", "content": "서울의 현재 날씨는?"}
    ],
    tools=tools,
    tool_choice="auto",
)

message = response.choices[0].message
if message.tool_calls:
    for tool_call in message.tool_calls:
        print(f"Function: {tool_call.function.name}")
        print(f"Arguments: {tool_call.function.arguments}")

Part 3: 비교 및 실전

14. vLLM vs Ollama 비교

14.1 종합 비교표

| 항목 | vLLM | Ollama |
|---|---|---|
| 주요 용도 | 프로덕션 API 서빙, 고처리량 추론 | 로컬 개발, 프로토타이핑, 개인 사용 |
| 엔진 | 자체 엔진 (PagedAttention) | llama.cpp |
| 모델 형식 | HF Safetensors, AWQ, GPTQ, FP8 | GGUF (양자화) |
| API | OpenAI 호환 | 네이티브 + OpenAI 호환 |
| 설치 난이도 | 중간 (Python/CUDA 환경 필요) | 매우 쉬움 (단일 바이너리) |
| GPU 요구 | 거의 필수 (NVIDIA/AMD) | 선택 (CPU에서도 동작) |
| 멀티 GPU | TP + PP (최대 수백 GPU) | 자동 분산 (제한적) |
| 동시 처리 | 수백~수천 요청 | 기본 1~4 병렬 |
| 양자화 | AWQ, GPTQ, FP8, BnB | GGUF Q2~Q8, F16 |
| Continuous Batching | 지원 | 미지원 (llama.cpp 제한) |
| PagedAttention | 핵심 기술 | 미지원 |
| Prefix Caching | 지원 (자동) | 미지원 |
| LoRA 서빙 | 멀티 LoRA 동시 서빙 | 단일 LoRA |
| Structured Output | JSON Schema, Regex, Grammar | JSON 모드 |
| Speculative Decoding | 지원 (Draft model, N-gram) | 미지원 |
| Streaming | 지원 | 지원 |
| Docker 배포 | 공식 이미지 (GPU) | 공식 이미지 (CPU/GPU) |
| Kubernetes | 공식 가이드 + Production Stack | 커뮤니티 Helm Chart |
| 메모리 효율 | 매우 높음 (< 4% 낭비) | 높음 (GGUF 양자화) |
| 라이선스 | Apache 2.0 | MIT |

14.2 처리량 비교 (Llama 3.1 8B, RTX 4090)

| 동시 사용자 | vLLM (tokens/s) | Ollama (tokens/s) | 배수 |
|---|---|---|---|
| 1 | ~140 | ~65 | 2.2x |
| 5 | ~500 | ~120 | 4.2x |
| 10 | ~800 | ~150 | 5.3x |
| 50 | ~1,200 | ~150 | 8.0x |
| 100 | ~1,500 | ~150 (큐 대기) | 10.0x |

Red Hat의 벤치마크에서는 동일 하드웨어에서 vLLM이 793 TPS vs Ollama 41 TPS로 19배 차이를 보인 사례도 있다. 이는 동시 요청 수, 배치 크기, 모델 크기에 따라 달라진다.


15. 성능 벤치마크

15.1 Throughput (처리량) 비교

| 메트릭 | vLLM | Ollama | 비고 |
|---|---|---|---|
| 단일 요청 TPS | 100~140 tok/s | 50~70 tok/s | RTX 4090, Llama 3.1 8B |
| 10 동시 요청 총 TPS | 700~900 tok/s | 120~200 tok/s | Continuous Batching 효과 |
| 50 동시 요청 총 TPS | 1,000~1,500 tok/s | ~150 tok/s | Ollama는 큐 대기 발생 |
| 배치 추론 (1K prompts) | 2,000~3,000 tok/s | 지원 안 함 | vLLM offline inference |

15.2 Latency (지연시간) 비교

| 메트릭 | vLLM | Ollama | 비고 |
|---|---|---|---|
| TTFT (Time To First Token) | 50~200 ms | 100~500 ms | 프롬프트 길이에 따라 변동 |
| TPOT (Time Per Output Token) | 7~15 ms | 15~25 ms | 단일 요청 기준 |
| P99 Latency | 80~150 ms | 500~700 ms | 10 동시 요청 기준 |
| 모델 로딩 시간 | 30~120 초 | 5~30 초 | GGUF가 더 빠름 |
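
TTFT와 TPOT는 스트리밍 응답에서 각 토큰(청크)의 도착 시각을 기록하면 직접 계산할 수 있다. 아래는 계산 로직만 떼어낸 스케치다 (타임스탬프 수집부는 스트리밍 루프에서 time.monotonic()으로 기록한다고 가정한 것이다).

```python
# TTFT/TPOT 계산 스케치.
# start: 요청 전송 시각, token_times: 각 출력 토큰이 도착한 시각 목록.
def compute_latency_metrics(start: float, token_times: list[float]) -> dict:
    if not token_times:
        raise ValueError("no tokens received")
    ttft = token_times[0] - start            # 첫 토큰까지의 시간
    if len(token_times) > 1:
        # 첫 토큰 이후 구간의 토큰당 평균 생성 시간
        tpot = (token_times[-1] - token_times[0]) / (len(token_times) - 1)
    else:
        tpot = 0.0
    return {"ttft_s": ttft, "tpot_s": tpot}

# 예: 0.12초에 첫 토큰, 이후 10ms 간격으로 4개 도착
metrics = compute_latency_metrics(0.0, [0.12, 0.13, 0.14, 0.15, 0.16])
```

같은 함수를 vLLM과 Ollama 양쪽의 스트리밍 API에 적용하면 위 표와 같은 항목을 직접 재현해 볼 수 있다.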

15.3 메모리 사용량 비교 (Llama 3.1 8B)

| 설정 | vLLM GPU 메모리 | Ollama GPU 메모리 | 비고 |
|---|---|---|---|
| FP16 | ~16 GB | N/A | vLLM 기본 |
| FP8 | ~9 GB | N/A | H100 전용 |
| AWQ 4-bit | ~5 GB | N/A | vLLM 양자화 |
| GPTQ 4-bit | ~5 GB | N/A | vLLM 양자화 |
| Q4_K_M (GGUF) | N/A | ~5.5 GB | Ollama 기본 |
| Q5_K_M (GGUF) | N/A | ~6.2 GB | 더 높은 품질 |
| Q8_0 (GGUF) | N/A | ~9 GB | 최고 품질 양자화 |
| KV Cache 포함 (4K ctx) | +0.5~2 GB | +0.5~1.5 GB | 시퀀스 수에 비례 |

16. 실전 시나리오별 추천

16.1 개인 개발자 로컬 환경

추천: Ollama

# 설치 후 즉시 사용
ollama run llama3.1

# VS Code + Continue 확장 연동
# settings.json에 Ollama 엔드포인트 설정

이유: 설치가 간단하고, CPU에서도 동작하며, macOS/Windows/Linux 모두 지원. IDE 확장과의 연동이 쉬움.

16.2 프로덕션 API 서빙

추천: vLLM

vllm serve meta-llama/Llama-3.1-70B-Instruct \
  --tensor-parallel-size 4 \
  --gpu-memory-utilization 0.90 \
  --max-num-seqs 256 \
  --enable-prefix-caching \
  --enable-chunked-prefill \
  --api-key ${API_KEY}

이유: Continuous Batching으로 동시 요청 처리 능력이 압도적. PagedAttention으로 메모리 효율성이 높음. 멀티 GPU 지원, Kubernetes 배포, 모니터링 연동이 성숙함.

16.3 엣지/IoT 환경

추천: Ollama + 높은 양자화

# 작은 모델 + 고양자화
ollama run phi3:3.8b-mini-instruct-4k-q4_0

# 또는 Qwen 0.5B
ollama run qwen2.5:0.5b

이유: 단일 바이너리로 배포 간편. GGUF 양자화로 저사양에서도 동작. CPU 전용 추론 지원.

16.4 대규모 배치 추론

추천: vLLM Offline Inference

from vllm import LLM, SamplingParams

llm = LLM(
    model="meta-llama/Llama-3.1-8B-Instruct",
    tensor_parallel_size=2,
    gpu_memory_utilization=0.95,
)

# 수천 개의 프롬프트를 한 번에 처리
prompts = load_prompts_from_file("prompts.jsonl")  # 10,000+ prompts
sampling_params = SamplingParams(temperature=0.0, max_tokens=512)

outputs = llm.generate(prompts, sampling_params)
save_outputs(outputs, "results.jsonl")

이유: GPU 메모리를 최대한 활용하는 배치 스케줄링. 수천~수만 개 프롬프트를 효율적으로 처리.

16.5 RAG 파이프라인

둘 다 가능 -- 상황에 따라 선택:

# Ollama 기반 RAG (개발/소규모)
from langchain_ollama import OllamaLLM, OllamaEmbeddings

llm = OllamaLLM(model="llama3.1")
embeddings = OllamaEmbeddings(model="nomic-embed-text")

# vLLM 기반 RAG (프로덕션)
from langchain_openai import ChatOpenAI, OpenAIEmbeddings

llm = ChatOpenAI(
    base_url="http://vllm-server:8000/v1",
    api_key="token",
    model="meta-llama/Llama-3.1-8B-Instruct",
)

17. 요청 추적 (Request Tracing) 연동

프로덕션 환경에서 LLM 요청을 추적하는 것은 디버깅, 감사(audit), 성능 모니터링에 필수적이다.

17.1 vLLM의 Request ID 추적

vLLM은 OpenAI API 호환 서버에서 자동으로 request_id를 생성한다. 커스텀 ID를 전달하려면 X-Request-ID 헤더를 사용한다 (OpenAI SDK에서는 extra_headers로 전달).

from openai import OpenAI
import uuid

client = OpenAI(base_url="http://localhost:8000/v1", api_key="token")

# 커스텀 request_id 전달
xid = str(uuid.uuid4())

response = client.chat.completions.create(
    model="meta-llama/Llama-3.1-8B-Instruct",
    messages=[{"role": "user", "content": "Hello"}],
    extra_headers={"X-Request-ID": xid},
)

print(f"XID: {xid}")
print(f"Response ID: {response.id}")

17.2 Ollama의 요청 추적

Ollama 네이티브 API는 별도의 request ID를 지원하지 않으므로, 리버스 프록시에서 처리한다.

import requests
import uuid

xid = str(uuid.uuid4())

response = requests.post(
    "http://localhost:11434/api/chat",
    headers={"X-Request-ID": xid},
    json={
        "model": "llama3.1",
        "messages": [{"role": "user", "content": "Hello"}],
        "stream": False,
    },
)

# 로깅에 xid 포함
import logging
logger = logging.getLogger(__name__)
logger.info(f"[xid={xid}] Response: {response.status_code}")

17.3 API Gateway에서 X-Request-ID 전달

NGINX 설정:

upstream vllm_backend {
    server vllm-server:8000;
}

server {
    listen 80;

    location /v1/ {
        # X-Request-ID가 없으면 NGINX 내장 $request_id(자동 생성 고유 ID)로 대체
        set $xid $http_x_request_id;
        if ($xid = "") {
            set $xid $request_id;
        }

        proxy_pass http://vllm_backend;
        proxy_set_header X-Request-ID $xid;
        proxy_set_header Host $host;

        # 응답 헤더에 X-Request-ID 추가
        add_header X-Request-ID $xid always;

        # 액세스 로그에 request_id 포함
        access_log /var/log/nginx/vllm_access.log combined_with_xid;
    }
}

# 로그 포맷 정의 (http 블록에 위치)
log_format combined_with_xid '$remote_addr - $remote_user [$time_local] '
    '"$request" $status $body_bytes_sent '
    '"$http_referer" "$http_user_agent" '
    'xid="$xid"';

17.4 OpenTelemetry 연동

# vLLM + OpenTelemetry 분산 추적
from opentelemetry import trace
from opentelemetry.exporter.otlp.proto.grpc.trace_exporter import OTLPSpanExporter
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import BatchSpanProcessor

# Tracer 초기화
provider = TracerProvider()
exporter = OTLPSpanExporter(endpoint="http://jaeger:4317")
provider.add_span_processor(BatchSpanProcessor(exporter))
trace.set_tracer_provider(provider)
tracer = trace.get_tracer(__name__)

# LLM 호출을 Span으로 래핑
def call_llm(prompt: str, xid: str) -> str:
    with tracer.start_as_current_span("llm_inference") as span:
        span.set_attribute("xid", xid)
        span.set_attribute("model", "llama-3.1-8b")
        span.set_attribute("prompt_length", len(prompt))

        response = client.chat.completions.create(
            model="meta-llama/Llama-3.1-8B-Instruct",
            messages=[{"role": "user", "content": prompt}],
            extra_headers={"X-Request-ID": xid},
        )

        result = response.choices[0].message.content
        span.set_attribute("response_length", len(result))
        span.set_attribute("tokens_used", response.usage.total_tokens)

        return result

17.5 로깅에서 xid 활용 패턴

Python 예제:

import logging
import uuid
from contextvars import ContextVar

# Context Variable로 xid 관리
request_xid: ContextVar[str] = ContextVar("request_xid", default="")

class XIDFilter(logging.Filter):
    def filter(self, record):
        record.xid = request_xid.get("")
        return True

# 로거 설정
handler = logging.StreamHandler()
handler.setFormatter(logging.Formatter(
    "%(asctime)s [%(levelname)s] [xid=%(xid)s] %(message)s"
))
handler.addFilter(XIDFilter())

logger = logging.getLogger("llm_service")
logger.addHandler(handler)
logger.setLevel(logging.INFO)

# 사용
async def handle_request(prompt: str):
    xid = str(uuid.uuid4())
    request_xid.set(xid)

    logger.info(f"Received prompt: {prompt[:50]}...")

    response = await call_llm(prompt, xid)

    logger.info(f"Generated {len(response)} chars")
    return {"xid": xid, "response": response}

Go 예제:

package main

import (
    "context"
    "log/slog"
    "net/http"

    "github.com/google/uuid"
)

type contextKey string
const xidKey contextKey = "xid"

// XID 미들웨어
func xidMiddleware(next http.Handler) http.Handler {
    return http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
        xid := r.Header.Get("X-Request-ID")
        if xid == "" {
            xid = uuid.New().String()
        }

        ctx := context.WithValue(r.Context(), xidKey, xid)
        w.Header().Set("X-Request-ID", xid)

        slog.Info("request received",
            "xid", xid,
            "method", r.Method,
            "path", r.URL.Path,
        )

        next.ServeHTTP(w, r.WithContext(ctx))
    })
}

// Ollama 호출 함수
func callOllama(ctx context.Context, prompt string) (string, error) {
    xid := ctx.Value(xidKey).(string)

    slog.Info("calling ollama",
        "xid", xid,
        "prompt_len", len(prompt),
    )

    // ... Ollama API 호출 로직 ...
    var response string // 실제 구현에서는 API 응답 본문을 담는다

    slog.Info("ollama response received",
        "xid", xid,
        "response_len", len(response),
    )

    return response, nil
}

18. 참고 자료 (References)

vLLM

Ollama

논문 및 기술 자료

관련 프로젝트

The Complete Guide to vLLM & Ollama: LLM Serving Engine Setup, Parameters, and Environment Variables


Part 1: vLLM

1. Introduction to vLLM

vLLM is a high-performance LLM inference and serving engine developed at UC Berkeley. Since its release alongside the PagedAttention paper in 2023, it has established itself as the de facto standard for production LLM serving. As of March 2026, the latest version is v0.16.x, with the transition to V1 architecture underway.

1.1 Core Principles of PagedAttention

In traditional LLM inference, KV Cache is allocated in contiguous GPU memory blocks per sequence. This approach pre-reserves memory based on the maximum sequence length, resulting in 60-80% memory waste in practice.

PagedAttention introduces the operating system's Virtual Memory Paging concept to KV Cache management.

┌─────────────────────────────────────────────────┐
│ Traditional KV Cache                            │
│  Seq 1: [used][used][used][waste][waste][waste] │
│  Seq 2: [used][waste][waste][waste][waste]      │
│  Seq 3: [used][used][waste][waste][waste]       │
│              → 60~80% Memory Waste              │
├─────────────────────────────────────────────────┤
│ PagedAttention KV Cache                         │
│  Physical Blocks: [B0][B1][B2][B3][B4][B5]...   │
│  Block Table:                                   │
│    Seq 1 → [B0, B3, B5]  (logical → physical)   │
│    Seq 2 → [B1, B4]                             │
│    Seq 3 → [B2, B6]                             │
│              → < 4% Memory Waste                │
└─────────────────────────────────────────────────┘

The core mechanisms are as follows.

  • Fixed-size blocks: KV Cache is split into fixed-size blocks (default 16 tokens)
  • Block Table: Maintains a table mapping logical block numbers of sequences to physical block addresses
  • Dynamic allocation: Physical blocks are allocated only as needed during token generation
  • Copy-on-Write: When branching sequences (e.g., Beam Search), physical blocks are shared and copied only when modification is needed
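
The mechanisms above can be sketched as a toy block manager. This is an illustrative model only, not vLLM's actual implementation: physical blocks come from a shared free list, and each sequence's block table maps logical block order to physical block IDs.

```python
# Toy sketch of a PagedAttention-style block table (illustrative, not vLLM code).
BLOCK_SIZE = 16  # tokens per block (vLLM's default)

class BlockManager:
    def __init__(self, num_physical_blocks: int):
        self.free = list(range(num_physical_blocks))  # shared free list
        self.table: dict[str, list[int]] = {}         # seq_id -> physical block IDs

    def ensure_capacity(self, seq_id: str, num_tokens: int) -> None:
        """Allocate physical blocks on demand for num_tokens total tokens."""
        blocks = self.table.setdefault(seq_id, [])
        needed = -(-num_tokens // BLOCK_SIZE)         # ceil division
        while len(blocks) < needed:
            blocks.append(self.free.pop(0))

    def release(self, seq_id: str) -> None:
        """Return a finished sequence's blocks to the free list."""
        self.free.extend(self.table.pop(seq_id, []))

mgr = BlockManager(num_physical_blocks=8)
mgr.ensure_capacity("seq1", 40)  # 40 tokens -> 3 blocks
mgr.ensure_capacity("seq2", 10)  # 10 tokens -> 1 block
# mgr.table == {"seq1": [0, 1, 2], "seq2": [3]}
```

The point of the indirection is visible here: a sequence's blocks need not be contiguous, and freed blocks are immediately reusable by any other sequence.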

1.2 Continuous Batching

Traditional Static Batching waits until all sequences in a batch complete. Continuous Batching removes completed sequences and inserts new requests at every decoding step.

Static Batching:
Step 1: [Seq1, Seq2, Seq3, Seq4]   ← waits even if Seq2 completes
Step 2: [Seq1, Seq2, Seq3, Seq4]
Step 3: [Seq1, ____, Seq3, Seq4]   ← slot wasted after Seq2 ends
...
Step N: next batch starts only after all sequences complete

Continuous Batching:
Step 1: [Seq1, Seq2, Seq3, Seq4]
Step 2: [Seq1, Seq5, Seq3, Seq4]   ← Seq5 inserted immediately after Seq2 completes
Step 3: [Seq1, Seq5, Seq6, Seq4]   ← Seq6 inserted immediately after Seq3 completes
→ Minimizes GPU idle time, maximizes throughput
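
The throughput gap can be demonstrated with a toy simulation — a sketch under simplified assumptions (one token per step, known sequence lengths, no prefill cost):

```python
import heapq

# Toy comparison of static vs continuous batching (a sketch, not a real scheduler).
def static_steps(lengths: list[int], batch: int) -> int:
    """Each batch runs until its longest sequence finishes."""
    steps = 0
    for i in range(0, len(lengths), batch):
        steps += max(lengths[i:i + batch])
    return steps

def continuous_steps(lengths: list[int], batch: int) -> int:
    """Finished slots are backfilled immediately with waiting requests."""
    pending = list(lengths)
    running: list[int] = []  # min-heap of finish times for in-flight sequences
    t = 0
    while pending or running:
        while pending and len(running) < batch:
            heapq.heappush(running, t + pending.pop(0))  # backfill a free slot
        t = heapq.heappop(running)  # advance to the next completion
    return t

# With lengths [10, 2, 2, 2] and batch capacity 2:
# static: max(10,2) + max(2,2) = 12 steps; continuous: 10 steps
```

Even in this tiny example the short sequences slot in behind the long one instead of holding up an entire batch; with realistic request mixes the gap widens accordingly.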

1.3 Supported Models

vLLM supports virtually all major Transformer-based LLM architectures.

| Category | Supported Models |
|---|---|
| Meta Llama Family | Llama 2, Llama 3, Llama 3.1, Llama 3.2, Llama 3.3, Llama 4 |
| Mistral Family | Mistral 7B, Mixtral 8x7B, Mixtral 8x22B, Mistral Large, Mistral Small |
| Qwen Family | Qwen, Qwen 1.5, Qwen 2, Qwen 2.5, Qwen 3, QwQ |
| Google Family | Gemma, Gemma 2, Gemma 3 |
| DeepSeek Family | DeepSeek V2, DeepSeek V3, DeepSeek-R1 |
| Others | Phi-3/4, Yi, InternLM 2/3, Command R, DBRX, Falcon, StarCoder 2 |
| Multimodal | LLaVA, InternVL, Pixtral, Qwen-VL, MiniCPM-V |
| Embedding | E5-Mistral, GTE-Qwen, Jina Embeddings |

1.4 LLM Serving Engine Comparison

| Item | vLLM | TGI | TensorRT-LLM | llama.cpp |
|---|---|---|---|---|
| Developer | UC Berkeley / vLLM Project | Hugging Face | NVIDIA | Georgi Gerganov |
| Language | Python/C++/CUDA | Rust/Python | C++/CUDA | C/C++ |
| Core Technology | PagedAttention | Continuous Batching | FP8/INT4 kernel optimization | GGUF quantization |
| Multi-GPU | TP + PP | TP | TP + PP | Limited |
| Quantization | AWQ, GPTQ, FP8, BnB | AWQ, GPTQ, BnB | FP8, INT4, INT8 | GGUF (Q2~Q8) |
| API Compat | OpenAI compatible | OpenAI compatible | Triton | Custom API |
| Install Difficulty | Medium | Medium | High | Low |
| Production Ready | Very High | High | Very High | Low~Medium |
| Community | Very Active | Active | NVIDIA-led | Very Active |

2. vLLM Installation and Startup

2.1 pip Installation

# Basic installation (CUDA 12.x)
pip install vllm

# Specific version installation
pip install vllm==0.16.0

# CUDA 11.8 environment
pip install vllm --extra-index-url https://download.pytorch.org/whl/cu118

2.2 conda Installation

conda create -n vllm python=3.11 -y
conda activate vllm
pip install vllm

2.3 Docker Installation

# Official Docker image (NVIDIA GPU)
docker run --runtime nvidia --gpus all \
  -v ~/.cache/huggingface:/root/.cache/huggingface \
  --env "HUGGING_FACE_HUB_TOKEN=<hf_token>" \
  -p 8000:8000 \
  --ipc=host \
  vllm/vllm-openai:latest \
  --model meta-llama/Llama-3.1-8B-Instruct

# ROCm (AMD GPU)
docker run --device /dev/kfd --device /dev/dri \
  -v ~/.cache/huggingface:/root/.cache/huggingface \
  -p 8000:8000 \
  vllm/vllm-openai:latest-rocm \
  --model meta-llama/Llama-3.1-8B-Instruct

2.4 Basic Server Start

# vllm serve command (recommended)
vllm serve meta-llama/Llama-3.1-8B-Instruct \
  --host 0.0.0.0 \
  --port 8000

# Direct Python module execution (legacy)
python -m vllm.entrypoints.openai.api_server \
  --model meta-llama/Llama-3.1-8B-Instruct \
  --host 0.0.0.0 \
  --port 8000

# Start with YAML config file
vllm serve meta-llama/Llama-3.1-8B-Instruct \
  --config config.yaml

config.yaml example:

# vLLM server configuration file
host: '0.0.0.0'
port: 8000
tensor_parallel_size: 2
gpu_memory_utilization: 0.90
max_model_len: 8192
dtype: 'auto'
enforce_eager: false
enable_prefix_caching: true

2.5 Offline Batch Inference

You can perform batch inference directly from Python code without starting a server.

from vllm import LLM, SamplingParams

# Load model
llm = LLM(
    model="meta-llama/Llama-3.1-8B-Instruct",
    tensor_parallel_size=1,
    gpu_memory_utilization=0.90,
)

# Set sampling parameters
sampling_params = SamplingParams(
    temperature=0.7,
    top_p=0.9,
    max_tokens=512,
)

# Prompt list
prompts = [
    "Explain PagedAttention in simple terms.",
    "What is continuous batching?",
    "Compare vLLM and TensorRT-LLM.",
]

# Run batch inference
outputs = llm.generate(prompts, sampling_params)

for output in outputs:
    prompt = output.prompt
    generated = output.outputs[0].text
    print(f"Prompt: {prompt!r}")
    print(f"Output: {generated!r}\n")

2.6 OpenAI-Compatible API Server

The vLLM server provides OpenAI API-compatible endpoints.

# Start server
vllm serve meta-llama/Llama-3.1-8B-Instruct \
  --served-model-name llama-3.1-8b \
  --api-key my-secret-key

# Call Chat Completion with curl
curl http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -H "Authorization: Bearer my-secret-key" \
  -d '{
    "model": "llama-3.1-8b",
    "messages": [
      {"role": "system", "content": "You are a helpful assistant."},
      {"role": "user", "content": "What is PagedAttention?"}
    ],
    "temperature": 0.7,
    "max_tokens": 512
  }'

# Call with OpenAI SDK
from openai import OpenAI

client = OpenAI(
    base_url="http://localhost:8000/v1",
    api_key="my-secret-key",
)

response = client.chat.completions.create(
    model="llama-3.1-8b",
    messages=[
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": "Explain the advantages of vLLM."},
    ],
    temperature=0.7,
    max_tokens=512,
)

print(response.choices[0].message.content)

3. Complete vLLM CLI Arguments Reference

Here is a categorized summary of key CLI arguments that can be passed to vllm serve. You can check the full list with vllm serve --help, or query by group with vllm serve --help=ModelConfig.

3.1 Model Arguments

| Argument | Type | Default | Description |
|---|---|---|---|
| --model | str | facebook/opt-125m | HuggingFace model ID or local path |
| --tokenizer | str | None (same as model) | Specify a separate tokenizer |
| --revision | str | None | Specific Git revision of the model (branch, tag, commit hash) |
| --tokenizer-revision | str | None | Specific revision of the tokenizer |
| --dtype | str | "auto" | Model weight data type (auto, float16, bfloat16, float32) |
| --max-model-len | int | None (follows model config) | Maximum sequence length (sum of input + output tokens) |
| --trust-remote-code | flag | False | Allow HuggingFace remote code execution |
| --download-dir | str | None | Model download directory |
| --load-format | str | "auto" | Model load format (auto, pt, safetensors, npcache, dummy, bitsandbytes) |
| --config-format | str | "auto" | Model configuration format (auto, hf, mistral) |
| --seed | int | 0 | Random seed for reproducibility |
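
As with the other argument groups, the model arguments combine on one command line. A sketch (the model name and cache path are illustrative placeholders):

```shell
# Pin the weight dtype and context length, and cache downloads on a fast local disk
vllm serve meta-llama/Llama-3.1-8B-Instruct \
  --dtype bfloat16 \
  --max-model-len 8192 \
  --download-dir /data/hf-cache \
  --seed 42
```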
3.2 Server and API Arguments

| Argument | Type | Default | Description |
|---|---|---|---|
| --host | str | "0.0.0.0" | Host address to bind |
| --port | int | 8000 | Server port number |
| --uvicorn-log-level | str | "info" | Uvicorn log level |
| --api-key | str | None | API authentication key (Bearer token) |
| --served-model-name | str | None | Model name for the API (uses --model value if unset) |
| --chat-template | str | None | Jinja2 chat template file path or string |
| --response-role | str | "assistant" | Role in chat completion responses |
| --ssl-keyfile | str | None | SSL key file path |
| --ssl-certfile | str | None | SSL certificate file path |
| --allowed-origins | list | ["*"] | CORS allowed origin list |
| --middleware | list | None | FastAPI middleware classes |
| --max-log-len | int | None | Maximum prompt/output length in logs |
| --disable-log-requests | flag | False | Disable request logging |
3.3 Parallelism Arguments

| Argument | Type | Default | Description |
|---|---|---|---|
| --tensor-parallel-size (-tp) | int | 1 | Number of GPUs for Tensor Parallelism |
| --pipeline-parallel-size (-pp) | int | 1 | Number of Pipeline Parallelism stages |
| --distributed-executor-backend | str | None | Distributed execution backend (ray, mp) |
| --ray-workers-use-nsight | flag | False | Use Nsight profiler with Ray workers |
| --data-parallel-size (-dp) | int | 1 | Number of Data Parallelism processes |

Usage examples:

# 4-GPU Tensor Parallelism
vllm serve meta-llama/Llama-3.1-70B-Instruct \
  --tensor-parallel-size 4

# 2-GPU Tensor + 2-way Pipeline (4 GPUs total)
vllm serve meta-llama/Llama-3.1-70B-Instruct \
  --tensor-parallel-size 2 \
  --pipeline-parallel-size 2

# Ray distributed backend (multi-node)
vllm serve meta-llama/Llama-3.1-70B-Instruct \
  --tensor-parallel-size 8 \
  --distributed-executor-backend ray

3.4 Memory and Performance Arguments

| Argument | Type | Default | Description |
|---|---|---|---|
| --gpu-memory-utilization | float | 0.90 | GPU memory usage ratio (0.0~1.0) |
| --max-num-seqs | int | 256 | Maximum concurrent sequences |
| --max-num-batched-tokens | int | None (auto) | Maximum tokens processed per step |
| --block-size | int | 16 | PagedAttention block size (in tokens) |
| --swap-space | float | 4 | CPU swap space size (GiB) |
| --enforce-eager | flag | False | Disable CUDA Graph, force Eager mode |
| --max-seq-len-to-capture | int | 8192 | Maximum sequence length for CUDA Graph capture |
| --disable-custom-all-reduce | flag | False | Disable custom All-Reduce |
| --enable-prefix-caching | flag | True (v1) | Enable Automatic Prefix Caching |
| --enable-chunked-prefill | flag | True (v1) | Enable Chunked Prefill |
| --num-scheduler-steps | int | 1 | Decoding steps per scheduler call (Multi-Step Scheduling) |
| --kv-cache-dtype | str | "auto" | KV Cache data type (auto, fp8, fp8_e5m2, fp8_e4m3) |

Usage examples:

# Memory optimization settings
vllm serve meta-llama/Llama-3.1-8B-Instruct \
  --gpu-memory-utilization 0.95 \
  --max-num-seqs 128 \
  --max-model-len 4096 \
  --enable-prefix-caching \
  --enable-chunked-prefill

# Eager mode (debugging/compatibility)
vllm serve meta-llama/Llama-3.1-8B-Instruct \
  --enforce-eager \
  --gpu-memory-utilization 0.85
3.5 Quantization Arguments

| Argument | Type | Default | Description |
|---|---|---|---|
| --quantization (-q) | str | None | Select quantization method |
| --load-format | str | "auto" | Model load format |

--quantization supported values:

| Value | Description | Notes |
|---|---|---|
| awq | AWQ (Activation-aware Weight Quantization) | 4-bit, fast inference |
| gptq | GPTQ (Post-Training Quantization) | 4-bit, ExLlamaV2 kernel |
| gptq_marlin | GPTQ + Marlin kernel | 4-bit, faster kernel |
| awq_marlin | AWQ + Marlin kernel | 4-bit, faster kernel |
| squeezellm | SqueezeLLM | Sparse quantization |
| fp8 | FP8 (8-bit floating point) | H100/MI300x only |
| bitsandbytes | BitsAndBytes | 4-bit NF4 |
| gguf | GGUF format | llama.cpp compatible |
| compressed-tensors | Compressed Tensors | General purpose |
| experts_int8 | MoE Expert INT8 | MoE models only |

Usage examples:

# AWQ quantized model
vllm serve TheBloke/Llama-2-7B-AWQ \
  --quantization awq

# GPTQ quantized model
vllm serve TheBloke/Llama-2-7B-GPTQ \
  --quantization gptq

# FP8 quantization (H100 and above)
vllm serve neuralmagic/Meta-Llama-3.1-8B-Instruct-FP8 \
  --quantization fp8

# BitsAndBytes 4-bit (GPU memory saving)
vllm serve meta-llama/Llama-3.1-8B-Instruct \
  --quantization bitsandbytes \
  --load-format bitsandbytes
3.6 LoRA Arguments

| Argument | Type | Default | Description |
|---|---|---|---|
| --enable-lora | flag | False | Enable LoRA adapter serving |
| --max-loras | int | 1 | Maximum number of simultaneously loaded LoRAs |
| --max-lora-rank | int | 16 | Maximum LoRA rank |
| --lora-extra-vocab-size | int | 256 | Extra vocabulary size for LoRA adapters |
| --lora-modules | list | None | LoRA adapter list (name=path format) |
| --long-lora-scaling-factors | list | None | Long LoRA scaling factors |

Usage example:

# LoRA adapter serving
vllm serve meta-llama/Llama-3.1-8B-Instruct \
  --enable-lora \
  --max-loras 4 \
  --max-lora-rank 64 \
  --lora-modules \
    adapter1=/path/to/lora1 \
    adapter2=/path/to/lora2

3.7 Speculative Decoding Arguments

| Argument | Type | Default | Description |
|---|---|---|---|
| --speculative-model | str | None | Draft model (small model or [ngram]) |
| --num-speculative-tokens | int | None | Number of tokens to speculatively generate |
| --speculative-draft-tensor-parallel-size | int | None | TP size for the draft model |
| --speculative-disable-by-batch-size | int | None | Disable when batch size exceeds threshold |
| --ngram-prompt-lookup-max | int | None | Maximum lookup size for N-gram speculation |
| --ngram-prompt-lookup-min | int | None | Minimum lookup size for N-gram speculation |

Usage examples:

# Using a separate draft model
vllm serve meta-llama/Llama-3.1-70B-Instruct \
  --speculative-model meta-llama/Llama-3.2-1B-Instruct \
  --num-speculative-tokens 5 \
  --tensor-parallel-size 4

# N-gram based speculative decoding (no additional model needed)
vllm serve meta-llama/Llama-3.1-8B-Instruct \
  --speculative-model "[ngram]" \
  --num-speculative-tokens 5 \
  --ngram-prompt-lookup-max 4

4. vLLM Sampling Parameters

vLLM supports OpenAI API-compatible parameters plus additional advanced parameters.

4.1 Complete Parameter Reference

| Parameter | Type | Default | Range | Description |
| --- | --- | --- | --- | --- |
| temperature | float | 1.0 | >= 0.0 | Lower is more deterministic, higher is more creative. 0 = greedy |
| top_p | float | 1.0 | (0.0, 1.0] | Nucleus sampling: sample only from the top tokens by cumulative probability |
| top_k | int | -1 | -1 or >= 1 | Consider only the top k tokens. -1 disables |
| min_p | float | 0.0 | [0.0, 1.0] | Minimum probability threshold, as a ratio of the highest-probability token |
| frequency_penalty | float | 0.0 | [-2.0, 2.0] | Frequency-based penalty. Positive values suppress repetition |
| presence_penalty | float | 0.0 | [-2.0, 2.0] | Presence-based penalty. Penalizes tokens that appeared at least once |
| repetition_penalty | float | 1.0 | > 0.0 | Repetition penalty (1.0 disables, > 1.0 suppresses) |
| max_tokens | int | 16 | >= 1 | Maximum tokens to generate |
| stop | list | None | - | List of stop strings |
| seed | int | None | - | Random seed (ensures reproducibility) |
| n | int | 1 | >= 1 | Number of responses per prompt |
| best_of | int | None | >= n | Generate best_of candidates and return the best |
| use_beam_search | bool | False | - | Enable beam search |
| logprobs | int | None | [0, 20] | Number of per-token log probabilities to return |
| prompt_logprobs | int | None | [0, 20] | Number of prompt-token log probabilities to return |
| skip_special_tokens | bool | True | - | Whether to skip special tokens in the output |
| spaces_between_special_tokens | bool | True | - | Insert spaces between special tokens |
| guided_json | object | None | - | JSON Schema-based structured output |
| guided_regex | str | None | - | Regex-based structured output |
| guided_choice | list | None | - | Choice-based structured output |
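How temperature, top_k, top_p, and min_p compose can be illustrated with a toy filter over a token-to-probability table (a sketch of the semantics above, not vLLM's actual kernel code):

```python
import math

def filter_logits(logits, temperature=1.0, top_k=-1, top_p=1.0, min_p=0.0):
    # Toy sampling filters over a dict of token -> logit.
    # (temperature=0, i.e. greedy decoding, is not handled in this sketch.)
    scaled = {t: l / temperature for t, l in logits.items()}
    m = max(scaled.values())
    exps = {t: math.exp(l - m) for t, l in scaled.items()}
    z = sum(exps.values())
    probs = {t: e / z for t, e in exps.items()}
    ranked = sorted(probs.items(), key=lambda kv: kv[1], reverse=True)
    # top_k: keep only the k most probable tokens (-1 disables)
    if top_k > 0:
        ranked = ranked[:top_k]
    # top_p: keep the smallest prefix whose cumulative probability >= top_p
    if top_p < 1.0:
        kept, cum = [], 0.0
        for t, p in ranked:
            kept.append((t, p))
            cum += p
            if cum >= top_p:
                break
        ranked = kept
    # min_p: drop tokens below min_p * (highest token probability)
    if min_p > 0.0:
        cutoff = min_p * ranked[0][1]
        ranked = [(t, p) for t, p in ranked if p >= cutoff]
    return [t for t, _ in ranked]

logits = {"the": 4.0, "a": 3.0, "an": 1.0, "zebra": -2.0}
print(filter_logits(logits, top_k=3))   # → ['the', 'a', 'an']
print(filter_logits(logits, top_p=0.9)) # → ['the', 'a']
print(filter_logits(logits, min_p=0.1)) # → ['the', 'a']
```

Note that min_p adapts to the shape of the distribution: when the model is confident, the cutoff rises and fewer tokens survive.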

4.2 API Call Examples with curl

# Basic Chat Completion
curl http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "meta-llama/Llama-3.1-8B-Instruct",
    "messages": [
      {"role": "user", "content": "What is the population of Seoul?"}
    ],
    "temperature": 0.3,
    "top_p": 0.9,
    "max_tokens": 256,
    "frequency_penalty": 0.5,
    "seed": 42
  }'

# Structured Output (JSON mode)
curl http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "meta-llama/Llama-3.1-8B-Instruct",
    "messages": [
      {"role": "user", "content": "Give me the population of Seoul, Busan, and Daegu in JSON"}
    ],
    "response_format": {
      "type": "json_schema",
      "json_schema": {
        "name": "city_population",
        "schema": {
          "type": "object",
          "properties": {
            "cities": {
              "type": "array",
              "items": {
                "type": "object",
                "properties": {
                  "name": {"type": "string"},
                  "population": {"type": "integer"}
                },
                "required": ["name", "population"]
              }
            }
          },
          "required": ["cities"]
        }
      }
    },
    "temperature": 0.1,
    "max_tokens": 512
  }'

# Returning logprobs
curl http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "meta-llama/Llama-3.1-8B-Instruct",
    "messages": [
      {"role": "user", "content": "1+1=?"}
    ],
    "logprobs": true,
    "top_logprobs": 5,
    "max_tokens": 10
  }'

4.3 Python requests Example

import requests
import json

url = "http://localhost:8000/v1/chat/completions"
headers = {"Content-Type": "application/json"}

payload = {
    "model": "meta-llama/Llama-3.1-8B-Instruct",
    "messages": [
        {"role": "system", "content": "You are a helpful Korean assistant."},
        {"role": "user", "content": "What is quantum computing?"},
    ],
    "temperature": 0.7,
    "top_p": 0.9,
    "top_k": 50,
    "max_tokens": 1024,
    "repetition_penalty": 1.1,
    "stop": ["\n\n\n"],
}

response = requests.post(url, headers=headers, json=payload)
result = response.json()
print(result["choices"][0]["message"]["content"])

4.4 Streaming Example with OpenAI SDK

from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="token")

# Streaming response
stream = client.chat.completions.create(
    model="meta-llama/Llama-3.1-8B-Instruct",
    messages=[
        {"role": "user", "content": "Implement quicksort in Python"},
    ],
    temperature=0.2,
    max_tokens=2048,
    stream=True,
)

for chunk in stream:
    if chunk.choices[0].delta.content:
        print(chunk.choices[0].delta.content, end="", flush=True)
print()

5. Complete vLLM Environment Variables Reference

vLLM controls runtime behavior through various environment variables. Here is a categorized summary of key environment variables.

5.1 Core Environment Variables

| Environment Variable | Default | Description |
| --- | --- | --- |
| VLLM_TARGET_DEVICE | "cuda" | Target device (cuda, rocm, neuron, cpu, xpu) |
| VLLM_USE_V1 | True | Use the V1 code path |
| VLLM_WORKER_MULTIPROC_METHOD | "fork" | Multiprocess start method (spawn, fork) |
| VLLM_ALLOW_LONG_MAX_MODEL_LEN | False | Allow max_model_len longer than the model config |
| CUDA_VISIBLE_DEVICES | None | GPU device numbers to use |

5.2 Attention and Kernels

| Environment Variable | Default | Description |
| --- | --- | --- |
| VLLM_ATTENTION_BACKEND | None | Attention backend (deprecated; use --attention-backend from v0.14) |
| VLLM_USE_TRITON_FLASH_ATTN | True | Use Triton Flash Attention |
| VLLM_FLASH_ATTN_VERSION | None | Force a Flash Attention version (2 or 3) |
| VLLM_USE_FLASHINFER_SAMPLER | None | Use the FlashInfer sampler |
| VLLM_FLASHINFER_FORCE_TENSOR_CORES | False | Force FlashInfer tensor core usage |
| VLLM_USE_TRITON_AWQ | False | Use the Triton AWQ kernel |
| VLLM_USE_DEEP_GEMM | False | Use the DeepGEMM kernel (MoE operations) |
| VLLM_MLA_DISABLE | False | Disable the MLA attention optimization |

5.3 Logging and Debugging

| Environment Variable | Default | Description |
| --- | --- | --- |
| VLLM_CONFIGURE_LOGGING | 1 | Auto-configure vLLM logging (0 to disable) |
| VLLM_LOGGING_LEVEL | "INFO" | Default logging level |
| VLLM_LOGGING_CONFIG_PATH | None | Custom logging config file path |
| VLLM_LOGGING_PREFIX | "" | Prefix prepended to log messages |
| VLLM_LOG_BATCHSIZE_INTERVAL | -1 | Batch size logging interval in seconds (-1 disables) |
| VLLM_TRACE_FUNCTION | 0 | Enable function call tracing |
| VLLM_DEBUG_LOG_API_SERVER_RESPONSE | False | API response debug logging |

5.4 Distributed Execution

| Environment Variable | Default | Description |
| --- | --- | --- |
| VLLM_HOST_IP | "" | Node IP for distributed setups |
| VLLM_PORT | 0 | Distributed communication port |
| VLLM_NCCL_SO_PATH | None | NCCL library file path |
| NCCL_DEBUG | None | NCCL debug level (INFO, WARN, TRACE) |
| NCCL_SOCKET_IFNAME | None | Network interface for NCCL communication |
| VLLM_PP_LAYER_PARTITION | None | Pipeline parallelism layer partition strategy |
| VLLM_DP_RANK | 0 | Data parallel process rank |
| VLLM_DP_SIZE | 1 | Data parallel world size |
| VLLM_DP_MASTER_IP | "127.0.0.1" | Data parallel master node IP |
| VLLM_DP_MASTER_PORT | 0 | Data parallel master node port |
| VLLM_USE_RAY_SPMD_WORKER | False | Ray SPMD worker execution |
| VLLM_USE_RAY_COMPILED_DAG | False | Use the Ray Compiled Graph API |
| VLLM_SKIP_P2P_CHECK | False | Skip the GPU P2P capability check |

5.5 HuggingFace and External Services

| Environment Variable | Default | Description |
| --- | --- | --- |
| HF_TOKEN | None | HuggingFace API token |
| HUGGING_FACE_HUB_TOKEN | None | HuggingFace Hub token (legacy) |
| VLLM_USE_MODELSCOPE | False | Load models from ModelScope |
| VLLM_API_KEY | None | vLLM API server auth key |
| VLLM_NO_USAGE_STATS | False | Disable usage stats collection |
| VLLM_DO_NOT_TRACK | False | Opt out of tracking |

5.6 Cache and Paths

| Environment Variable | Default | Description |
| --- | --- | --- |
| VLLM_CONFIG_ROOT | ~/.config/vllm | Config file root directory |
| VLLM_CACHE_ROOT | ~/.cache/vllm | Cache file root directory |
| VLLM_ASSETS_CACHE | ~/.cache/vllm/assets | Downloaded assets cache path |
| VLLM_RPC_BASE_PATH | System temp | IPC multiprocessing path |

5.7 Environment Variable Usage Examples

# Multi-GPU + logging + HF token setup
export CUDA_VISIBLE_DEVICES=0,1,2,3
export HF_TOKEN="hf_xxxxxxxxxxxx"
export VLLM_LOGGING_LEVEL="DEBUG"
export VLLM_WORKER_MULTIPROC_METHOD="spawn"

vllm serve meta-llama/Llama-3.1-70B-Instruct \
  --tensor-parallel-size 4 \
  --gpu-memory-utilization 0.90

# Passing environment variables in Docker
docker run --runtime nvidia --gpus all \
  -e CUDA_VISIBLE_DEVICES=0,1 \
  -e HF_TOKEN="hf_xxxxxxxxxxxx" \
  -e VLLM_LOGGING_LEVEL="INFO" \
  -e VLLM_WORKER_MULTIPROC_METHOD="spawn" \
  -p 8000:8000 \
  --ipc=host \
  vllm/vllm-openai:latest \
  --model meta-llama/Llama-3.1-8B-Instruct \
  --tensor-parallel-size 2

6. Advanced vLLM Configuration

6.1 Multi-GPU Setup

Tensor Parallelism (TP): Distributes each layer of the model across multiple GPUs. The most commonly used approach on a single node.

# TP=4 (distribute model across 4 GPUs)
vllm serve meta-llama/Llama-3.1-70B-Instruct \
  --tensor-parallel-size 4 \
  --gpu-memory-utilization 0.90

Pipeline Parallelism (PP): Places model layers sequentially across multiple GPUs. Advantageous in slow interconnect environments.

# PP=2, TP=2 (4 GPUs total, 2x2 configuration)
vllm serve meta-llama/Llama-3.1-70B-Instruct \
  --tensor-parallel-size 2 \
  --pipeline-parallel-size 2

Multi-node setup (using Ray):

# Master node
ray start --head --port=6379

# Worker node
ray start --address=<master-ip>:6379

# Run vLLM (from master)
vllm serve meta-llama/Llama-3.1-405B-Instruct \
  --tensor-parallel-size 8 \
  --pipeline-parallel-size 2 \
  --distributed-executor-backend ray

6.2 Quantization Details

AWQ (Activation-aware Weight Quantization):

# Using pre-quantized AWQ model
vllm serve TheBloke/Llama-2-13B-chat-AWQ \
  --quantization awq \
  --max-model-len 4096

# Faster with Marlin kernel (SM 80+ GPU)
vllm serve TheBloke/Llama-2-13B-chat-AWQ \
  --quantization awq_marlin

GPTQ (Post-Training Quantization):

# GPTQ model (ExLlamaV2 kernel auto-used)
vllm serve TheBloke/Llama-2-13B-chat-GPTQ \
  --quantization gptq

# Using Marlin kernel
vllm serve TheBloke/Llama-2-13B-chat-GPTQ \
  --quantization gptq_marlin

FP8 (8-bit Floating Point): Hardware-accelerated on H100, MI300X, and newer GPUs.

# Pre-quantized FP8 model
vllm serve neuralmagic/Meta-Llama-3.1-8B-Instruct-FP8 \
  --quantization fp8

# Dynamic FP8 quantization (no pre-quantization needed)
vllm serve meta-llama/Llama-3.1-8B-Instruct \
  --quantization fp8 \
  --kv-cache-dtype fp8
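The core idea behind dynamic quantization can be sketched with simple symmetric absmax scaling: pick a scale so the largest magnitude maps to the top of the quantized range, round, and keep the scale for dequantization. This is an illustration of the principle only; real FP8 kernels use hardware FP8 formats with per-tensor or per-channel scales:

```python
def quantize_absmax(weights, qmax=127):
    # Symmetric (absmax) quantization: scale so max |w| maps to qmax,
    # round to integers, then dequantize with the stored scale.
    scale = max(abs(w) for w in weights) / qmax
    q = [round(w / scale) for w in weights]
    deq = [v * scale for v in q]
    return q, scale, deq

w = [0.5, -1.27, 0.02]
q, scale, deq = quantize_absmax(w)
print(q)                # → [50, -127, 2]
print(round(scale, 4))  # → 0.01
```

The rounding step is where quantization error comes from; calibration-based schemes like AWQ and GPTQ exist to place that error where it hurts model quality least.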

BitsAndBytes 4-bit NF4: Instant quantization without calibration data.

vllm serve meta-llama/Llama-3.1-8B-Instruct \
  --quantization bitsandbytes \
  --load-format bitsandbytes \
  --enforce-eager  # BnB requires Eager mode

6.3 LoRA Serving

# Enable LoRA adapters
vllm serve meta-llama/Llama-3.1-8B-Instruct \
  --enable-lora \
  --max-loras 4 \
  --max-lora-rank 64 \
  --lora-modules \
    korean-chat=/path/to/korean-lora \
    code-assist=/path/to/code-lora

Specifying a LoRA model in API calls:

from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="token")

# Use a specific LoRA adapter
response = client.chat.completions.create(
    model="korean-chat",  # LoRA adapter name
    messages=[{"role": "user", "content": "Hello!"}],
    temperature=0.7,
    max_tokens=256,
)

6.4 Prefix Caching & Chunked Prefill

Automatic Prefix Caching: Reuses KV Cache for common prompt prefixes to reduce TTFT. Especially effective when many requests share the same system prompt.

# Enabled by default in v1, requires explicit flag in v0
vllm serve meta-llama/Llama-3.1-8B-Instruct \
  --enable-prefix-caching
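Conceptually, prefix caching keys each KV block by a hash chained from the previous block's hash, so requests sharing a prompt prefix resolve to the same physical blocks. A minimal sketch of the idea (not vLLM's actual hashing scheme):

```python
def block_hashes(token_ids, block_size=16):
    # Each full block's key chains the previous block's hash with the
    # block's tokens, so identical prefixes yield identical block keys.
    hashes, prev = [], None
    for i in range(0, len(token_ids) - len(token_ids) % block_size, block_size):
        prev = hash((prev, tuple(token_ids[i:i + block_size])))
        hashes.append(prev)
    return hashes

sys_prompt = list(range(32))  # shared system prompt = 2 full blocks
req_a = block_hashes(sys_prompt + [100, 101] * 8)
req_b = block_hashes(sys_prompt + [200, 201] * 8)
print(req_a[:2] == req_b[:2])  # → True: shared prefix blocks match
print(req_a[2] == req_b[2])    # → False: requests diverge at block 3
```

A cache hit means the prefill for those blocks is skipped entirely, which is why shared system prompts cut TTFT so sharply.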

Chunked Prefill: Splits long prompts into chunks and interleaves Prefill and Decode. Prevents long prompts from blocking Decode of shorter requests.

vllm serve meta-llama/Llama-3.1-8B-Instruct \
  --enable-chunked-prefill \
  --max-num-batched-tokens 2048
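The scheduling idea can be sketched as a per-step token budget shared between decode and a prefill chunk, with decode taking priority (a simplified model of the real scheduler):

```python
def schedule_step(waiting_prefill_tokens, decode_seqs, max_num_batched_tokens=2048):
    # Decode tokens (1 per running sequence) get priority; the leftover
    # budget becomes this step's prefill chunk, so long prompts never
    # starve decoding of already-running requests.
    decode_tokens = decode_seqs
    prefill_chunk = min(waiting_prefill_tokens,
                        max_num_batched_tokens - decode_tokens)
    return decode_tokens, prefill_chunk

print(schedule_step(waiting_prefill_tokens=8000, decode_seqs=100))  # → (100, 1948)
```

An 8,000-token prompt is thus processed over several steps, each of which also advances every running decode sequence by one token.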

6.5 Structured Output (Guided Decoding)

# JSON Schema-based structured output
curl http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "meta-llama/Llama-3.1-8B-Instruct",
    "messages": [
      {"role": "user", "content": "Please provide Seoul weather info in JSON"}
    ],
    "response_format": {
      "type": "json_schema",
      "json_schema": {
        "name": "weather_info",
        "schema": {
          "type": "object",
          "properties": {
            "city": {"type": "string"},
            "temperature_celsius": {"type": "number"},
            "condition": {"type": "string"},
            "humidity_percent": {"type": "integer"}
          },
          "required": ["city", "temperature_celsius", "condition"]
        }
      }
    }
  }'

# Regex-based output (Completion API)
curl http://localhost:8000/v1/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "meta-llama/Llama-3.1-8B-Instruct",
    "prompt": "Generate a valid email address:",
    "extra_body": {
      "guided_regex": "[a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\\.[a-zA-Z]{2,}"
    },
    "max_tokens": 50
  }'
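Guided decoding works by masking, at every step, any token that would violate the constraint. For guided_choice the idea reduces to prefix matching; a character-level sketch (real implementations such as Outlines compile the constraint to a token-level finite-state machine):

```python
def allowed_next(choices, generated):
    # Only characters that keep `generated` a prefix of some choice
    # are allowed at the next decoding step.
    valid = [c for c in choices if c.startswith(generated)]
    return sorted({c[len(generated)] for c in valid if len(c) > len(generated)})

choices = ["positive", "negative", "neutral"]
print(allowed_next(choices, ""))    # → ['n', 'p']
print(allowed_next(choices, "ne"))  # → ['g', 'u']
```

When the allowed set becomes empty, a complete choice has been emitted and generation stops, which is why guided output can never drift outside the schema.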

6.6 Docker Deployment

# docker-compose.yaml
version: '3.8'

services:
  vllm:
    image: vllm/vllm-openai:latest
    runtime: nvidia
    deploy:
      resources:
        reservations:
          devices:
            - driver: nvidia
              count: all
              capabilities: [gpu]
    ports:
      - '8000:8000'
    volumes:
      - ~/.cache/huggingface:/root/.cache/huggingface
    environment:
      - HF_TOKEN=${HF_TOKEN}
      - VLLM_LOGGING_LEVEL=INFO
      - VLLM_WORKER_MULTIPROC_METHOD=spawn
    ipc: host
    command: >
      --model meta-llama/Llama-3.1-8B-Instruct
      --host 0.0.0.0
      --port 8000
      --tensor-parallel-size 2
      --gpu-memory-utilization 0.90
      --max-model-len 8192
      --enable-prefix-caching
    healthcheck:
      test: ['CMD', 'curl', '-f', 'http://localhost:8000/health']
      interval: 30s
      timeout: 10s
      retries: 3
      start_period: 120s

# Run with Docker Compose
HF_TOKEN=hf_xxxx docker compose up -d

# Check logs
docker compose logs -f vllm

6.7 Kubernetes Deployment

# vllm-deployment.yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: vllm-llama3
  namespace: ai-serving
  labels:
    app: vllm
spec:
  replicas: 1
  selector:
    matchLabels:
      app: vllm
  template:
    metadata:
      labels:
        app: vllm
    spec:
      containers:
        - name: vllm
          image: vllm/vllm-openai:latest
          ports:
            - containerPort: 8000
              name: http
          args:
            - '--model'
            - 'meta-llama/Llama-3.1-8B-Instruct'
            - '--host'
            - '0.0.0.0'
            - '--port'
            - '8000'
            - '--tensor-parallel-size'
            - '2'
            - '--gpu-memory-utilization'
            - '0.90'
            - '--max-model-len'
            - '8192'
          env:
            - name: HF_TOKEN
              valueFrom:
                secretKeyRef:
                  name: hf-secret
                  key: token
            - name: VLLM_WORKER_MULTIPROC_METHOD
              value: 'spawn'
          resources:
            limits:
              nvidia.com/gpu: '2'
            requests:
              nvidia.com/gpu: '2'
              memory: '32Gi'
              cpu: '8'
          volumeMounts:
            - name: shm
              mountPath: /dev/shm
            - name: model-cache
              mountPath: /root/.cache/huggingface
          readinessProbe:
            httpGet:
              path: /health
              port: 8000
            initialDelaySeconds: 120
            periodSeconds: 10
          livenessProbe:
            httpGet:
              path: /health
              port: 8000
            initialDelaySeconds: 180
            periodSeconds: 30
      volumes:
        - name: shm
          emptyDir:
            medium: Memory
            sizeLimit: 2Gi
        - name: model-cache
          persistentVolumeClaim:
            claimName: model-cache-pvc
      nodeSelector:
        nvidia.com/gpu.product: 'NVIDIA-A100-SXM4-80GB'
---
apiVersion: v1
kind: Service
metadata:
  name: vllm-service
  namespace: ai-serving
spec:
  selector:
    app: vllm
  ports:
    - port: 8000
      targetPort: 8000
      name: http
  type: ClusterIP
---
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: vllm-hpa
  namespace: ai-serving
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: vllm-llama3
  minReplicas: 1
  maxReplicas: 4
  metrics:
    - type: Pods
      pods:
        metric:
          name: vllm_num_requests_running
        target:
          type: AverageValue
          averageValue: '50'

Part 2: Ollama

7. Introduction to Ollama

Ollama is an open-source tool that makes it easy to run LLMs locally. Much as Docker pulls and runs images, a single command such as ollama run llama3.1 downloads a model and drops you straight into a chat session.

7.1 Architecture Features

  • GGUF-based: Uses llama.cpp's GGUF (GPT-Generated Unified Format) quantized models
  • llama.cpp engine: Internally uses llama.cpp as the inference engine
  • Single binary: Go server + llama.cpp C++ engine distributed as a single binary
  • Automatic GPU acceleration: Auto-detects NVIDIA CUDA, AMD ROCm, Apple Metal for GPU offloading
  • Model registry: Pull/push pre-quantized models from ollama.com/library like Docker Hub

7.2 Supported Models

| Category | Models | Size |
| --- | --- | --- |
| Meta Llama | llama3.1, llama3.2, llama3.3 | 1B ~ 405B |
| Mistral | mistral, mixtral | 7B ~ 8x22B |
| Google | gemma, gemma2, gemma3 | 2B ~ 27B |
| Microsoft | phi3, phi4 | 3.8B ~ 14B |
| DeepSeek | deepseek-r1, deepseek-v3, deepseek-coder-v2 | 1.5B ~ 671B |
| Qwen | qwen, qwen2, qwen2.5, qwen3 | 0.5B ~ 72B |
| Code | codellama, starcoder2, qwen2.5-coder | 3B ~ 34B |
| Embedding | nomic-embed-text, mxbai-embed-large, all-minilm | - |
| Multimodal | llava, bakllava, llama3.2-vision | 7B ~ 90B |

8. Ollama Installation and Startup

8.1 Platform-Specific Installation

macOS:

# Homebrew
brew install ollama

# Or official install script
curl -fsSL https://ollama.com/install.sh | sh

Linux:

# Official install script (recommended)
curl -fsSL https://ollama.com/install.sh | sh

# Or manual installation
sudo curl -L https://ollama.com/download/ollama-linux-amd64 -o /usr/local/bin/ollama
sudo chmod +x /usr/local/bin/ollama

Windows:

Download and run the Windows installer from the official website (ollama.com).

Docker:

# CPU only
docker run -d -v ollama:/root/.ollama -p 11434:11434 \
  --name ollama ollama/ollama

# NVIDIA GPU
docker run -d --gpus=all -v ollama:/root/.ollama -p 11434:11434 \
  --name ollama ollama/ollama

# AMD GPU (ROCm)
docker run -d --device /dev/kfd --device /dev/dri \
  -v ollama:/root/.ollama -p 11434:11434 \
  --name ollama ollama/ollama:rocm

8.2 Basic Usage

# Start server (if not auto-started in background)
ollama serve

# Download model and start chatting
ollama run llama3.1

# Specify a tag (size/quantization)
ollama run llama3.1:8b
ollama run llama3.1:70b-instruct-q4_K_M
ollama run qwen2.5:32b-instruct-q5_K_M

# Download model only (without running)
ollama pull llama3.1:8b

# One-line prompt
ollama run llama3.1 "What is PagedAttention?"

9. Complete Ollama CLI Commands Reference

9.1 Command Summary

| Command | Description | Key Options |
| --- | --- | --- |
| ollama serve | Start the Ollama server | Check env vars with --help |
| ollama run <model> | Run a model (auto-pulls if missing) | --verbose, --nowordwrap, --format json |
| ollama pull <model> | Download a model | --insecure |
| ollama push <model> | Upload a model to the registry | --insecure |
| ollama create <model> | Create a custom model from a Modelfile | -f <Modelfile>, --quantize |
| ollama list / ollama ls | List installed models | - |
| ollama show <model> | Show model details | --modelfile, --parameters, --system, --template, --license |
| ollama cp <src> <dst> | Copy a model | - |
| ollama rm <model> | Delete a model | - |
| ollama ps | List running models | - |
| ollama stop <model> | Stop a running model | - |
| ollama signin | Sign in to ollama.com | - |
| ollama signout | Sign out from ollama.com | - |

9.2 Detailed Command Examples

ollama serve - Start server:

# Default start (localhost:11434)
ollama serve

# Change bind address via environment variable
OLLAMA_HOST=0.0.0.0:11434 ollama serve

# Debug mode
OLLAMA_DEBUG=1 ollama serve

ollama run - Run model:

# Interactive mode
ollama run llama3.1

# One-line prompt
ollama run llama3.1 "Explain quantum computing"

# JSON format output
ollama run llama3.1 "List 3 Korean cities" --format json

# Multimodal (image input)
ollama run llama3.2-vision "What's in this image? /path/to/image.png"

# Verbose mode (display performance stats)
ollama run llama3.1 --verbose

# With system prompt
ollama run llama3.1 --system "You are a Korean translator."

ollama create - Create custom model:

# Create from Modelfile
ollama create my-model -f ./Modelfile

# Create from GGUF file
ollama create my-model -f ./Modelfile-from-gguf

# Quantization conversion
ollama create my-model-q4 --quantize q4_K_M -f ./Modelfile

ollama show - Check model info:

# Full info
ollama show llama3.1

# Output Modelfile
ollama show llama3.1 --modelfile

# Check parameters
ollama show llama3.1 --parameters

# Check system prompt
ollama show llama3.1 --system

# Check template
ollama show llama3.1 --template

ollama ps - Running models:

$ ollama ps
NAME              ID            SIZE     PROCESSOR    UNTIL
llama3.1:8b       a1f2e33d4e25  6.7 GB   100% GPU     4 minutes from now
qwen2.5:7b        845dbda0ea48  4.7 GB   100% GPU     3 minutes from now

10. Ollama API Endpoints

Ollama provides both a REST API and an OpenAI-compatible API. The default address is http://localhost:11434.

10.1 Native API Endpoints

| Endpoint | Method | Description |
| --- | --- | --- |
| /api/generate | POST | Text completion generation |
| /api/chat | POST | Chat completion generation |
| /api/embed | POST | Generate embedding vectors |
| /api/tags | GET | List local models |
| /api/show | POST | Model details |
| /api/pull | POST | Download a model |
| /api/push | POST | Upload a model |
| /api/create | POST | Create a custom model |
| /api/copy | POST | Copy a model |
| /api/delete | DELETE | Delete a model |
| /api/ps | GET | List running models |
| /api/version | GET | Ollama version info |

10.2 OpenAI-Compatible Endpoints

EndpointMethodDescription
/v1/chat/completionsPOSTOpenAI Chat Completion compatible
/v1/completionsPOSTOpenAI Completion compatible
/v1/modelsGETModel list (OpenAI format)
/v1/embeddingsPOSTEmbeddings (OpenAI format)

10.3 API Call Examples

Generate (Completion):

# Basic generation
curl http://localhost:11434/api/generate -d '{
  "model": "llama3.1",
  "prompt": "Why is the sky blue?",
  "stream": false
}'

# Streaming (default)
curl http://localhost:11434/api/generate -d '{
  "model": "llama3.1",
  "prompt": "Write a haiku about coding",
  "options": {
    "temperature": 0.7,
    "num_predict": 100
  }
}'

# JSON format output
curl http://localhost:11434/api/generate -d '{
  "model": "llama3.1",
  "prompt": "List 3 programming languages as JSON",
  "format": "json",
  "stream": false
}'
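When stream is true (the default), the response body is newline-delimited JSON: each line carries a response fragment and the final line has "done": true plus timing statistics. A small offline sketch of assembling such a stream (the chunk contents below are made-up examples):

```python
import json

def assemble_stream(ndjson_lines):
    # Concatenate the 'response' fragments until a chunk reports done.
    text = ""
    for line in ndjson_lines:
        chunk = json.loads(line)
        text += chunk.get("response", "")
        if chunk.get("done"):
            break
    return text

stream = [
    '{"model":"llama3.1","response":"Hello","done":false}',
    '{"model":"llama3.1","response":", world!","done":false}',
    '{"model":"llama3.1","response":"","done":true}',
]
print(assemble_stream(stream))  # → Hello, world!
```

In a real client you would iterate over the HTTP response line by line (e.g. response.iter_lines() with requests) instead of a canned list.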

Chat (Conversation):

curl http://localhost:11434/api/chat -d '{
  "model": "llama3.1",
  "messages": [
    {"role": "system", "content": "You are a helpful Korean assistant."},
    {"role": "user", "content": "Recommend tourist spots in Seoul."}
  ],
  "stream": false,
  "options": {
    "temperature": 0.8,
    "top_p": 0.9,
    "num_ctx": 4096,
    "num_predict": 512
  }
}'

Embed (Embeddings):

# Single text embedding
curl http://localhost:11434/api/embed -d '{
  "model": "nomic-embed-text",
  "input": "Hello, world!"
}'

# Multiple text embeddings
curl http://localhost:11434/api/embed -d '{
  "model": "nomic-embed-text",
  "input": ["Hello world", "Goodbye world"]
}'

OpenAI-Compatible API:

# OpenAI format Chat Completion
curl http://localhost:11434/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "llama3.1",
    "messages": [
      {"role": "user", "content": "Hello!"}
    ],
    "temperature": 0.7,
    "max_tokens": 256
  }'

# Model list
curl http://localhost:11434/v1/models

Calling from Python:

import requests

# Generate API
response = requests.post("http://localhost:11434/api/generate", json={
    "model": "llama3.1",
    "prompt": "Explain Docker in Korean",
    "stream": False,
    "options": {
        "temperature": 0.7,
        "num_predict": 512,
    },
})
print(response.json()["response"])

# Chat API
response = requests.post("http://localhost:11434/api/chat", json={
    "model": "llama3.1",
    "messages": [
        {"role": "user", "content": "What is Kubernetes?"},
    ],
    "stream": False,
})
print(response.json()["message"]["content"])

# Using Ollama with OpenAI SDK
from openai import OpenAI

client = OpenAI(
    base_url="http://localhost:11434/v1",
    api_key="ollama",  # Ollama doesn't require an API key, any value works
)

response = client.chat.completions.create(
    model="llama3.1",
    messages=[
        {"role": "user", "content": "Explain Python's GIL."},
    ],
    temperature=0.7,
    max_tokens=512,
)
print(response.choices[0].message.content)

11. Ollama Parameters (Modelfile & API)

11.1 Modelfile Structure

A Modelfile defines an Ollama custom model. It has a structure similar to a Dockerfile.

# Specify base model (required)
FROM llama3.1:8b

# Parameter settings
PARAMETER temperature 0.7
PARAMETER top_p 0.9
PARAMETER top_k 40
PARAMETER num_ctx 4096
PARAMETER num_predict 512
PARAMETER repeat_penalty 1.1
PARAMETER stop "<|eot_id|>"
PARAMETER stop "<|end_of_text|>"

# System prompt
SYSTEM """
You are a friendly Korean AI assistant.
You provide accurate and concise answers, explaining with examples when needed.
"""

# Chat template (Go template syntax)
TEMPLATE """
{{- if .System }}<|start_header_id|>system<|end_header_id|>
{{ .System }}<|eot_id|>{{ end }}
{{- range .Messages }}<|start_header_id|>{{ .Role }}<|end_header_id|>
{{ .Content }}<|eot_id|>{{ end }}<|start_header_id|>assistant<|end_header_id|>
"""

# LoRA adapter (optional)
ADAPTER /path/to/lora-adapter.gguf

# License info (optional)
LICENSE """
Apache 2.0
"""

| Directive | Description | Required |
| --- | --- | --- |
| FROM | Base model (model name or GGUF file path) | Required |
| PARAMETER | Model parameter settings | Optional |
| TEMPLATE | Prompt template | Optional |
| SYSTEM | System prompt | Optional |
| ADAPTER | LoRA/QLoRA adapter path | Optional |
| LICENSE | License information | Optional |
| MESSAGE | Pre-set conversation history | Optional |

11.2 PARAMETER Options Detail

| Parameter | Type | Default | Range/Description |
| --- | --- | --- | --- |
| temperature | float | 0.8 | 0.0~2.0. Higher is more creative, lower is more deterministic |
| top_p | float | 0.9 | 0.0~1.0. Nucleus sampling probability threshold |
| top_k | int | 40 | 1~100. Consider only the top k tokens |
| min_p | float | 0.0 | 0.0~1.0. Minimum probability filtering |
| num_predict | int | -1 | Maximum tokens to generate (-1: unlimited, -2: until context fills) |
| num_ctx | int | 2048 | Context window size (in tokens) |
| repeat_penalty | float | 1.1 | Repetition penalty (1.0 disables) |
| repeat_last_n | int | 64 | Repetition check window (0: disabled, -1: num_ctx) |
| seed | int | 0 | Random seed (0 means different results each time) |
| stop | string | - | Stop string (multiple can be specified) |
| num_gpu | int | auto | Number of layers to offload to GPU (0: CPU only) |
| num_thread | int | auto | Number of CPU threads |
| num_batch | int | 512 | Prompt processing batch size |
| mirostat | int | 0 | Mirostat sampling (0: disabled, 1: Mirostat, 2: Mirostat 2.0) |
| mirostat_eta | float | 0.1 | Mirostat learning rate |
| mirostat_tau | float | 5.0 | Mirostat target entropy |
| tfs_z | float | 1.0 | Tail-free sampling (1.0 disables) |
| typical_p | float | 1.0 | Locally typical sampling (1.0 disables) |
| use_mlock | bool | false | Lock model in memory (prevent swapping) |
| num_keep | int | 0 | Number of tokens to keep when the context is recycled |
| penalize_newline | bool | true | Apply penalty to newline tokens |
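repeat_penalty follows the llama.cpp convention: the logit of any token seen in the recent window (repeat_last_n) is divided by the penalty when positive and multiplied when negative, making recently used tokens less likely. A toy sketch:

```python
def apply_repeat_penalty(logits, recent_tokens, penalty=1.1):
    # Penalize tokens in the recent window: divide positive logits,
    # multiply negative ones (both push probability down).
    out = dict(logits)
    for t in set(recent_tokens):
        if t in out:
            out[t] = out[t] / penalty if out[t] > 0 else out[t] * penalty
    return out

logits = {"the": 2.0, "cat": -1.0, "dog": 3.0}
print(apply_repeat_penalty(logits, ["the", "cat"], penalty=2.0))
# → {'the': 1.0, 'cat': -2.0, 'dog': 3.0}
```

This is why values well above ~1.3 tend to degrade fluency: common function words get punished along with genuine repetition.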

11.3 Using Parameters in API

Pass parameters via the options field in API calls.

curl http://localhost:11434/api/chat -d '{
  "model": "llama3.1",
  "messages": [
    {"role": "user", "content": "Hello"}
  ],
  "options": {
    "temperature": 0.3,
    "top_p": 0.9,
    "top_k": 50,
    "num_ctx": 8192,
    "num_predict": 1024,
    "repeat_penalty": 1.2,
    "seed": 42,
    "stop": ["<|eot_id|>"]
  }
}'

12. Complete Ollama Environment Variables Reference

12.1 Server and Network

| Environment Variable | Default | Description |
| --- | --- | --- |
| OLLAMA_HOST | 127.0.0.1:11434 | Server bind address and port |
| OLLAMA_ORIGINS | None | CORS allowed origins (comma-separated) |
| OLLAMA_KEEP_ALIVE | 5m | Idle time before model unload (5m, 1h, -1 = keep loaded forever) |
| OLLAMA_MAX_QUEUE | 512 | Maximum queue size (requests rejected when exceeded) |
| OLLAMA_NUM_PARALLEL | 1 | Concurrent requests per model |
| OLLAMA_MAX_LOADED_MODELS | 1 (CPU), GPUs*3 | Maximum simultaneously loaded models |

12.2 Storage and Paths

| Environment Variable | Default | Description |
| --- | --- | --- |
| OLLAMA_MODELS | OS default path | Model storage directory |
| OLLAMA_TMPDIR | System temp | Temporary file directory |
| OLLAMA_NOPRUNE | None | Disable unused blob cleanup on startup |

Default model storage paths by platform:

| OS | Default Path |
| --- | --- |
| macOS | ~/.ollama/models |
| Linux | /usr/share/ollama/.ollama/models |
| Windows | C:\Users\<user>\.ollama\models |

12.3 GPU and Performance

| Environment Variable | Default | Description |
| --- | --- | --- |
| OLLAMA_FLASH_ATTENTION | 0 | Enable Flash Attention (set to 1) |
| OLLAMA_KV_CACHE_TYPE | f16 | KV cache quantization type (f16, q8_0, q4_0) |
| OLLAMA_GPU_OVERHEAD | 0 | VRAM to reserve per GPU (bytes) |
| OLLAMA_LLM_LIBRARY | auto | Force a specific LLM library |
| CUDA_VISIBLE_DEVICES | All GPUs | NVIDIA GPU device numbers to use |
| ROCR_VISIBLE_DEVICES | All GPUs | AMD GPU device numbers to use |
| GPU_DEVICE_ORDINAL | All GPUs | GPU order to use |
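The impact of OLLAMA_KV_CACHE_TYPE is easy to estimate: KV cache size scales linearly with element size, so q8_0 (1 byte per element) roughly halves the f16 (2 bytes) footprint at the same context length. A back-of-the-envelope calculator (the model shape below is an assumed Llama-3.1-8B-like configuration):

```python
def kv_cache_gib(layers, kv_heads, head_dim, ctx_tokens, bytes_per_elem=2):
    # 2 (K and V) x layers x KV heads x head dim x context x element size
    return 2 * layers * kv_heads * head_dim * ctx_tokens * bytes_per_elem / 2**30

# Assumed shape: 32 layers, 8 KV heads (GQA), head_dim 128, 8k context
print(round(kv_cache_gib(32, 8, 128, 8192, 2), 2))  # → 1.0 (f16)
print(round(kv_cache_gib(32, 8, 128, 8192, 1), 2))  # → 0.5 (q8_0)
```

The saved VRAM can go toward a larger num_ctx or more parallel requests, which is why q8_0 is a popular setting on consumer GPUs.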

12.4 Logging and Debug

| Environment Variable | Default | Description |
| --- | --- | --- |
| OLLAMA_DEBUG | 0 | Enable debug logging (set to 1) |
| OLLAMA_NOHISTORY | 0 | Disable readline history in interactive mode |

12.5 Context and Inference

| Environment Variable | Default | Description |
| --- | --- | --- |
| OLLAMA_CONTEXT_LENGTH | 4096 | Default context window size |
| OLLAMA_NO_CLOUD | 0 | Disable cloud features (set to 1) |
| HTTPS_PROXY / HTTP_PROXY | None | Proxy server settings |
| NO_PROXY | None | Proxy bypass hosts |

12.6 How to Set Environment Variables

macOS (launchctl):

# Set environment variables
launchctl setenv OLLAMA_HOST "0.0.0.0:11434"
launchctl setenv OLLAMA_MODELS "/Volumes/ExternalSSD/ollama/models"
launchctl setenv OLLAMA_FLASH_ATTENTION "1"
launchctl setenv OLLAMA_KV_CACHE_TYPE "q8_0"
launchctl setenv OLLAMA_NUM_PARALLEL "4"
launchctl setenv OLLAMA_KEEP_ALIVE "-1"

# Restart Ollama
brew services restart ollama

Linux (systemd):

# Create systemd service override
sudo systemctl edit ollama

# Add the following in the editor:
[Service]
Environment="OLLAMA_HOST=0.0.0.0:11434"
Environment="OLLAMA_MODELS=/data/ollama/models"
Environment="OLLAMA_FLASH_ATTENTION=1"
Environment="OLLAMA_KV_CACHE_TYPE=q8_0"
Environment="OLLAMA_NUM_PARALLEL=4"
Environment="OLLAMA_KEEP_ALIVE=-1"
Environment="OLLAMA_MAX_LOADED_MODELS=3"
Environment="CUDA_VISIBLE_DEVICES=0,1"

# Restart service
sudo systemctl daemon-reload
sudo systemctl restart ollama

Docker:

docker run -d --gpus=all \
  -e OLLAMA_HOST=0.0.0.0:11434 \
  -e OLLAMA_FLASH_ATTENTION=1 \
  -e OLLAMA_KV_CACHE_TYPE=q8_0 \
  -e OLLAMA_NUM_PARALLEL=4 \
  -e OLLAMA_KEEP_ALIVE=-1 \
  -v /data/ollama:/root/.ollama \
  -p 11434:11434 \
  --name ollama \
  ollama/ollama

13. Advanced Ollama Usage

13.1 Modelfile Writing Guide

Korean Assistant Model:

FROM llama3.1:8b

PARAMETER temperature 0.7
PARAMETER top_p 0.9
PARAMETER num_ctx 4096
PARAMETER num_predict 1024
PARAMETER repeat_penalty 1.15
PARAMETER stop "<|eot_id|>"
PARAMETER stop "<|end_of_text|>"

SYSTEM """
You are a Korean AI assistant well-versed in Korean culture and history.
You always respond accurately and kindly in Korean, using English technical terms alongside when needed.
You provide answers in a structured format.
"""

MESSAGE user Hello, please introduce yourself.
MESSAGE assistant Hello! I'm an AI assistant specialized in Korean. I can help with various topics including Korean culture, history, technology, and more. Feel free to ask me anything!

# Create model
ollama create korean-assistant -f ./Modelfile-korean

# Run
ollama run korean-assistant "Tell me about the three grand palaces of Seoul"

Code Review Model:

FROM qwen2.5-coder:7b

PARAMETER temperature 0.2
PARAMETER top_p 0.85
PARAMETER num_ctx 8192
PARAMETER num_predict 2048

SYSTEM """
You are an expert code reviewer. Analyze code for:
1. Bugs and potential issues
2. Performance improvements
3. Security vulnerabilities
4. Code style and best practices

Provide specific, actionable feedback with corrected code examples.
"""

Quantization Level Selection Guide:

| Quantization | Size vs F16 | Quality | Speed | Recommended Use |
| --- | --- | --- | --- | --- |
| Q2_K | ~30% | Low | Very fast | Testing only |
| Q3_K_M | ~37% | Fair | Fast | Memory-constrained |
| Q4_0 | ~42% | Good | Fast | General use (default) |
| Q4_K_M | ~45% | Good+ | Fast | General use (recommended) |
| Q5_K_M | ~53% | Great | Medium | Quality-focused |
| Q6_K | ~62% | Excellent | Medium | High quality required |
| Q8_0 | ~80% | Best | Slow | Near-original quality |
| F16 | 100% | Original | Slow | Baseline/benchmark |
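The size ratios above make rough file-size estimates straightforward: F16 is about 2 GB per billion parameters, and each quantization level scales that baseline. A quick sketch of the arithmetic (actual GGUF files vary slightly due to metadata and mixed-precision tensors):

```python
def quant_size_gb(num_params_b, size_ratio):
    # F16 baseline: 2 bytes per parameter ~= 2 GB per billion parameters.
    f16_gb = num_params_b * 2
    return f16_gb * size_ratio

# 8B model at Q4_K_M (~45% of F16)
print(round(quant_size_gb(8, 0.45), 1))  # → 7.2 (GB)
```

Add the KV cache on top of this figure when checking whether a model fits in VRAM.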

13.2 GPU Acceleration Setup

NVIDIA GPU:

# Check NVIDIA driver
nvidia-smi

# Use specific GPU only
CUDA_VISIBLE_DEVICES=0 ollama serve

# Multi-GPU
CUDA_VISIBLE_DEVICES=0,1 ollama serve

AMD GPU (ROCm):

# Check ROCm driver
rocm-smi

# Specify GPU
ROCR_VISIBLE_DEVICES=0 ollama serve

Apple Silicon (Metal):

On macOS, Metal GPU acceleration is automatically enabled. No separate configuration needed.

# Check GPU usage (Processor column in ollama ps)
ollama ps
# NAME           ID            SIZE    PROCESSOR     UNTIL
# llama3.1:8b    af2e33d4e25   6.7 GB  100% GPU      4 minutes from now
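The same information is available programmatically via the `/api/ps` endpoint; comparing its `size` and `size_vram` fields shows how much of a model is GPU-resident. A sketch (field names per the Ollama REST API — verify against your version):

```python
import json
import urllib.request

def gpu_fraction(model_entry: dict) -> float:
    """Fraction of a loaded model held in VRAM (1.0 means fully on GPU)."""
    size = model_entry.get("size", 0)
    vram = model_entry.get("size_vram", 0)
    return vram / size if size else 0.0

def loaded_models(host: str = "http://localhost:11434") -> list:
    """Query /api/ps for currently loaded models."""
    with urllib.request.urlopen(f"{host}/api/ps") as resp:
        return json.loads(resp.read()).get("models", [])

# With a running server:
# for m in loaded_models():
#     print(f"{m['name']}: {gpu_fraction(m):.0%} GPU")
```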

13.3 Docker Deployment

# docker-compose.yaml
version: '3.8'

services:
  ollama:
    image: ollama/ollama
    deploy:
      resources:
        reservations:
          devices:
            - driver: nvidia
              count: all
              capabilities: [gpu]
    ports:
      - '11434:11434'
    volumes:
      - ollama_data:/root/.ollama
    environment:
      - OLLAMA_HOST=0.0.0.0:11434
      - OLLAMA_FLASH_ATTENTION=1
      - OLLAMA_KV_CACHE_TYPE=q8_0
      - OLLAMA_NUM_PARALLEL=4
      - OLLAMA_KEEP_ALIVE=24h
    restart: unless-stopped
    healthcheck:
      # Note: the ollama/ollama image does not ship curl; probe with the
      # bundled CLI instead ('ollama ps' fails until the server responds)
      test: ['CMD', 'ollama', 'ps']
      interval: 30s
      timeout: 5s
      retries: 3

  # Model initialization (optional)
  ollama-init:
    image: curlimages/curl:latest
    depends_on:
      ollama:
        condition: service_healthy
    entrypoint: >
      sh -c "
        curl -s http://ollama:11434/api/pull -d '{\"name\": \"llama3.1:8b\"}' &&
        curl -s http://ollama:11434/api/pull -d '{\"name\": \"nomic-embed-text\"}'
      "

volumes:
  ollama_data:
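Scripts that drive this stack should wait for the server before pulling models. A small stdlib-only readiness poll against the same `/api/version` endpoint the healthcheck uses:

```python
import time
import urllib.error
import urllib.request

def wait_for_ollama(url: str = "http://localhost:11434/api/version",
                    timeout_s: float = 120, interval_s: float = 2) -> bool:
    """Poll the version endpoint until the server answers or time runs out."""
    deadline = time.monotonic() + timeout_s
    while time.monotonic() < deadline:
        try:
            with urllib.request.urlopen(url, timeout=5) as resp:
                if resp.status == 200:
                    return True
        except (urllib.error.URLError, OSError):
            pass  # server not up yet; retry
        time.sleep(interval_s)
    return False

# if wait_for_ollama():
#     print("ollama is ready; safe to pull models")
```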

13.4 Multimodal Model Usage

# Run LLaVA model
ollama run llava "What's in this image? /path/to/photo.jpg"

# Llama 3.2 Vision
ollama run llama3.2-vision "Describe this image in Korean. /path/to/image.png"

Via the HTTP API, images are passed as base64-encoded strings:

import requests
import base64

# Encode image to base64
with open("image.jpg", "rb") as f:
    image_base64 = base64.b64encode(f.read()).decode("utf-8")

response = requests.post("http://localhost:11434/api/chat", json={
    "model": "llava",
    "messages": [
        {
            "role": "user",
            "content": "What's in this image?",
            "images": [image_base64],
        }
    ],
    "stream": False,
})
print(response.json()["message"]["content"])

13.5 Tool Calling / Function Calling

Ollama supports OpenAI-compatible Tool Calling.

from openai import OpenAI
import json

client = OpenAI(base_url="http://localhost:11434/v1", api_key="ollama")

tools = [
    {
        "type": "function",
        "function": {
            "name": "get_weather",
            "description": "Get current weather for a city",
            "parameters": {
                "type": "object",
                "properties": {
                    "city": {"type": "string", "description": "City name"},
                    "unit": {
                        "type": "string",
                        "enum": ["celsius", "fahrenheit"],
                    },
                },
                "required": ["city"],
            },
        },
    }
]

response = client.chat.completions.create(
    model="llama3.1",
    messages=[
        {"role": "user", "content": "What's the current weather in Seoul?"}
    ],
    tools=tools,
    tool_choice="auto",
)

message = response.choices[0].message
if message.tool_calls:
    for tool_call in message.tool_calls:
        print(f"Function: {tool_call.function.name}")
        print(f"Arguments: {tool_call.function.arguments}")

Part 3: Comparison and Practice

14. vLLM vs Ollama Comparison

14.1 Comprehensive Comparison Table

| Item | vLLM | Ollama |
|---|---|---|
| Primary Use | Production API serving, high-throughput inference | Local development, prototyping, personal use |
| Engine | Custom engine (PagedAttention) | llama.cpp |
| Model Format | HF Safetensors, AWQ, GPTQ, FP8 | GGUF (quantized) |
| API | OpenAI compatible | Native + OpenAI compatible |
| Install Difficulty | Medium (Python/CUDA env required) | Very easy (single binary) |
| GPU Required | Nearly essential (NVIDIA/AMD) | Optional (runs on CPU) |
| Multi-GPU | TP + PP (up to hundreds of GPUs) | Auto-distributed (limited) |
| Concurrency | Hundreds to thousands of requests | Default 1~4 parallel |
| Quantization | AWQ, GPTQ, FP8, BnB | GGUF Q2~Q8, F16 |
| Continuous Batching | Supported | Not supported (llama.cpp limitation) |
| PagedAttention | Core technology | Not supported |
| Prefix Caching | Supported (automatic) | Not supported |
| LoRA Serving | Multi-LoRA concurrent serving | Single LoRA |
| Structured Output | JSON Schema, Regex, Grammar | JSON mode |
| Speculative Decoding | Supported (draft model, n-gram) | Not supported |
| Streaming | Supported | Supported |
| Docker Deployment | Official image (GPU) | Official image (CPU/GPU) |
| Kubernetes | Official guide + Production Stack | Community Helm chart |
| Memory Efficiency | Very high (less than 4% waste) | High (GGUF quantization) |
| License | Apache 2.0 | MIT |

14.2 Throughput Comparison (Llama 3.1 8B, RTX 4090)

| Concurrent Users | vLLM (tokens/s) | Ollama (tokens/s) | Ratio |
|---|---|---|---|
| 1 | ~140 | ~65 | 2.2x |
| 5 | ~500 | ~120 | 4.2x |
| 10 | ~800 | ~150 | 5.3x |
| 50 | ~1,200 | ~150 | 8.0x |
| 100 | ~1,500 | ~150 (queued) | 10.0x |

In Red Hat's benchmark, vLLM reached 793 TPS versus Ollama's 41 TPS on the same hardware, roughly a 19x difference. The gap depends on the number of concurrent requests, batch size, and model size.
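Numbers like these can be reproduced with a simple concurrent load generator. A minimal sketch using a thread pool against any OpenAI-compatible endpoint — the endpoint URL and model name in the comments are placeholders:

```python
import time
from concurrent.futures import ThreadPoolExecutor

def tokens_per_second(total_tokens: int, elapsed_s: float) -> float:
    return total_tokens / elapsed_s if elapsed_s > 0 else 0.0

def bench(send_one, concurrency: int = 10, n_requests: int = 50) -> float:
    """Fire n_requests calls with `concurrency` workers; return aggregate tok/s.
    send_one() performs one completion and returns its completion-token count."""
    start = time.monotonic()
    with ThreadPoolExecutor(max_workers=concurrency) as pool:
        counts = list(pool.map(lambda _: send_one(), range(n_requests)))
    return tokens_per_second(sum(counts), time.monotonic() - start)

# Against an OpenAI-compatible endpoint (vLLM :8000, or Ollama :11434/v1):
# from openai import OpenAI
# client = OpenAI(base_url="http://localhost:8000/v1", api_key="token")
# def send_one():
#     r = client.chat.completions.create(
#         model="meta-llama/Llama-3.1-8B-Instruct", max_tokens=128,
#         messages=[{"role": "user", "content": "Write one sentence."}])
#     return r.usage.completion_tokens
# print(f"{bench(send_one, concurrency=10, n_requests=50):.0f} tok/s")
```

Vary `concurrency` to reproduce the scaling curve above; with Ollama, throughput plateaus once requests exceed `OLLAMA_NUM_PARALLEL`.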


15. Performance Benchmarks

15.1 Throughput Comparison

| Metric | vLLM | Ollama | Notes |
|---|---|---|---|
| Single request TPS | 100~140 tok/s | 50~70 tok/s | RTX 4090, Llama 3.1 8B |
| 10 concurrent, total TPS | 700~900 tok/s | 120~200 tok/s | Continuous Batching effect |
| 50 concurrent, total TPS | 1,000~1,500 tok/s | ~150 tok/s | Ollama queues requests |
| Batch inference (1K prompts) | 2,000~3,000 tok/s | Not supported | vLLM offline inference |

15.2 Latency Comparison

| Metric | vLLM | Ollama | Notes |
|---|---|---|---|
| TTFT (Time To First Token) | 50~200 ms | 100~500 ms | Varies by prompt length |
| TPOT (Time Per Output Token) | 7~15 ms | 15~25 ms | Single request basis |
| P99 latency | 80~150 ms | 500~700 ms | 10 concurrent requests |
| Model loading time | 30~120 sec | 5~30 sec | GGUF loads faster |
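TTFT can be measured client-side by timing the arrival of the first streamed chunk. A small helper that works with any chunk iterator, such as an OpenAI SDK stream (the endpoint and model in the comments are assumptions):

```python
import time

def measure_ttft(stream_events) -> float:
    """Seconds from iteration start to the first streamed chunk."""
    start = time.monotonic()
    for _ in stream_events:
        return time.monotonic() - start
    return float("nan")  # stream produced nothing

# stream = client.chat.completions.create(
#     model="meta-llama/Llama-3.1-8B-Instruct", stream=True,
#     messages=[{"role": "user", "content": "Hi"}])
# print(f"TTFT: {measure_ttft(stream) * 1000:.0f} ms")
```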

15.3 Memory Usage Comparison (Llama 3.1 8B)

| Configuration | vLLM GPU Memory | Ollama GPU Memory | Notes |
|---|---|---|---|
| FP16 | ~16 GB | N/A | vLLM default |
| FP8 | ~9 GB | N/A | H100 only |
| AWQ 4-bit | ~5 GB | N/A | vLLM quantized |
| GPTQ 4-bit | ~5 GB | N/A | vLLM quantized |
| Q4_K_M (GGUF) | N/A | ~5.5 GB | Ollama default |
| Q5_K_M (GGUF) | N/A | ~6.2 GB | Higher quality |
| Q8_0 (GGUF) | N/A | ~9 GB | Best quantization quality |
| KV Cache included (4K ctx) | +0.5~2 GB | +0.5~1.5 GB | Proportional to sequences |
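The KV-cache row can be sanity-checked with the standard formula: 2 (K and V) x layers x KV heads x head dim x context length x bytes per element. For Llama 3.1 8B, which uses GQA (32 layers, 8 KV heads, head dim 128):

```python
def kv_cache_bytes(num_layers: int, num_kv_heads: int, head_dim: int,
                   ctx_len: int, bytes_per_elem: int = 2) -> int:
    """K and V per layer: 2 x kv_heads x head_dim x ctx x element size."""
    return 2 * num_layers * num_kv_heads * head_dim * ctx_len * bytes_per_elem

# Llama 3.1 8B (GQA): 32 layers, 8 KV heads, head dim 128, FP16 elements
gib = kv_cache_bytes(32, 8, 128, 4096) / 1024**3
print(f"FP16 KV cache @ 4K ctx: {gib:.2f} GiB per sequence")  # 0.50 GiB
```

This matches the low end of the table; a quantized q8_0 KV cache halves it again.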

16. Scenario-Based Recommendations

16.1 Individual Developer Local Environment

Recommended: Ollama

# Install and use immediately
ollama run llama3.1

# VS Code + Continue extension integration
# Set Ollama endpoint in settings.json

Reason: Simple installation, runs on CPU, supports macOS/Windows/Linux. Easy integration with IDE extensions.

16.2 Production API Serving

Recommended: vLLM

vllm serve meta-llama/Llama-3.1-70B-Instruct \
  --tensor-parallel-size 4 \
  --gpu-memory-utilization 0.90 \
  --max-num-seqs 256 \
  --enable-prefix-caching \
  --enable-chunked-prefill \
  --api-key ${API_KEY}

Reason: Overwhelming concurrent request handling with Continuous Batching. High memory efficiency with PagedAttention. Mature multi-GPU support, Kubernetes deployment, and monitoring integration.

16.3 Edge/IoT Environments

Recommended: Ollama + High Quantization

# Small model + high quantization
ollama run phi3:3.8b-mini-instruct-4k-q4_0

# Or Qwen 0.5B
ollama run qwen2.5:0.5b

Reason: Simple deployment as single binary. Runs on low-spec hardware with GGUF quantization. CPU-only inference support.

16.4 Large-Scale Batch Inference

Recommended: vLLM Offline Inference

from vllm import LLM, SamplingParams

llm = LLM(
    model="meta-llama/Llama-3.1-8B-Instruct",
    tensor_parallel_size=2,
    gpu_memory_utilization=0.95,
)

# Process thousands of prompts at once
prompts = load_prompts_from_file("prompts.jsonl")  # 10,000+ prompts
sampling_params = SamplingParams(temperature=0.0, max_tokens=512)

outputs = llm.generate(prompts, sampling_params)
save_outputs(outputs, "results.jsonl")

Reason: Batch scheduling that maximizes GPU memory utilization. Efficiently processes thousands to tens of thousands of prompts.

16.5 RAG Pipeline

Both work -- choose based on situation:

# Ollama-based RAG (development/small-scale)
from langchain_ollama import OllamaLLM, OllamaEmbeddings

llm = OllamaLLM(model="llama3.1")
embeddings = OllamaEmbeddings(model="nomic-embed-text")

# vLLM-based RAG (production)
from langchain_openai import ChatOpenAI, OpenAIEmbeddings

llm = ChatOpenAI(
    base_url="http://vllm-server:8000/v1",
    api_key="token",
    model="meta-llama/Llama-3.1-8B-Instruct",
)
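Behind either client, the retrieval step is the same: embed the query, score documents by cosine similarity, take the top k. A dependency-free sketch that pairs with Ollama's `/api/embeddings` endpoint (the request/response shape is assumed from the Ollama REST API — verify against your version):

```python
import json
import math
import urllib.request

def cosine(a, b) -> float:
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb) if na and nb else 0.0

def top_k(query_vec, doc_vecs, k: int = 3) -> list:
    """Indices of the k document vectors most similar to the query."""
    order = sorted(range(len(doc_vecs)),
                   key=lambda i: cosine(query_vec, doc_vecs[i]), reverse=True)
    return order[:k]

def embed(text: str, model: str = "nomic-embed-text",
          host: str = "http://localhost:11434") -> list:
    """Fetch an embedding from Ollama's /api/embeddings endpoint."""
    data = json.dumps({"model": model, "prompt": text}).encode()
    req = urllib.request.Request(f"{host}/api/embeddings", data=data,
                                 headers={"Content-Type": "application/json"})
    with urllib.request.urlopen(req) as resp:
        return json.loads(resp.read())["embedding"]

# docs = ["Seoul has three grand palaces.", "vLLM uses PagedAttention."]
# vecs = [embed(d) for d in docs]
# print(top_k(embed("Tell me about Korean palaces"), vecs, k=1))
```

In production, a vector database replaces the in-memory scan, but the scoring logic is identical.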

17. Request Tracing Integration

Tracking LLM requests in production environments is essential for debugging, auditing, and performance monitoring.

17.1 vLLM Request ID Tracking

vLLM's OpenAI API-compatible server assigns every request an ID automatically. To correlate it with your own ID, send an X-Request-ID header via extra_headers.

from openai import OpenAI
import uuid

client = OpenAI(base_url="http://localhost:8000/v1", api_key="token")

# Pass custom request_id
xid = str(uuid.uuid4())

response = client.chat.completions.create(
    model="meta-llama/Llama-3.1-8B-Instruct",
    messages=[{"role": "user", "content": "Hello"}],
    extra_headers={"X-Request-ID": xid},
)

print(f"XID: {xid}")
print(f"Response ID: {response.id}")

17.2 Ollama Request Tracking

Ollama's native API has no dedicated request-ID field, so generate an ID on the client (or at a reverse proxy) and carry it through your own logs.

import requests
import uuid

xid = str(uuid.uuid4())

response = requests.post(
    "http://localhost:11434/api/chat",
    headers={"X-Request-ID": xid},
    json={
        "model": "llama3.1",
        "messages": [{"role": "user", "content": "Hello"}],
        "stream": False,
    },
)

# Include xid in logging
import logging
logger = logging.getLogger(__name__)
logger.info(f"[xid={xid}] Response: {response.status_code}")

17.3 X-Request-ID Forwarding at API Gateway

NGINX Configuration:

upstream vllm_backend {
    server vllm-server:8000;
}

server {
    listen 80;

    location /v1/ {
        # Use the client's X-Request-ID if present; otherwise fall back to
        # nginx's built-in $request_id (a random 32-character hex value)
        set $req_id $http_x_request_id;
        if ($req_id = "") {
            set $req_id $request_id;
        }

        proxy_pass http://vllm_backend;
        proxy_set_header X-Request-ID $req_id;
        proxy_set_header Host $host;

        # Echo the request ID back in the response headers
        add_header X-Request-ID $req_id always;

        # Include the request ID in the access log
        access_log /var/log/nginx/vllm_access.log combined_with_xid;
    }
}

# Log format definition (goes in the http context)
log_format combined_with_xid '$remote_addr - $remote_user [$time_local] '
    '"$request" $status $body_bytes_sent '
    '"$http_referer" "$http_user_agent" '
    'xid="$req_id"';
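The gateway's fallback rule can be mirrored (and unit-tested) client-side. The helper below reproduces the "keep the incoming ID, otherwise mint one" behavior; `effective_request_id` is a hypothetical name, not part of any library:

```python
import uuid

def effective_request_id(incoming):
    """Mirror of the gateway rule: keep the client's ID, else mint a UUID."""
    return incoming if incoming else str(uuid.uuid4())

# Client-side round-trip check against the gateway (URL is an assumption):
# import requests
# xid = str(uuid.uuid4())
# r = requests.get("http://gateway/v1/models", headers={"X-Request-ID": xid})
# assert r.headers.get("X-Request-ID") == xid
```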

17.4 OpenTelemetry Integration

# vLLM + OpenTelemetry distributed tracing
from opentelemetry import trace
from opentelemetry.exporter.otlp.proto.grpc.trace_exporter import OTLPSpanExporter
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import BatchSpanProcessor

# Initialize Tracer
provider = TracerProvider()
exporter = OTLPSpanExporter(endpoint="http://jaeger:4317")
provider.add_span_processor(BatchSpanProcessor(exporter))
trace.set_tracer_provider(provider)
tracer = trace.get_tracer(__name__)

# Wrap LLM call as a Span
def call_llm(prompt: str, xid: str) -> str:
    with tracer.start_as_current_span("llm_inference") as span:
        span.set_attribute("xid", xid)
        span.set_attribute("model", "llama-3.1-8b")
        span.set_attribute("prompt_length", len(prompt))

        response = client.chat.completions.create(
            model="meta-llama/Llama-3.1-8B-Instruct",
            messages=[{"role": "user", "content": prompt}],
            extra_headers={"X-Request-ID": xid},
        )

        result = response.choices[0].message.content
        span.set_attribute("response_length", len(result))
        span.set_attribute("tokens_used", response.usage.total_tokens)

        return result

17.5 xid Usage Patterns in Logging

Python Example:

import logging
import uuid
from contextvars import ContextVar

# Manage xid with Context Variable
request_xid: ContextVar[str] = ContextVar("request_xid", default="")

class XIDFilter(logging.Filter):
    def filter(self, record):
        record.xid = request_xid.get("")
        return True

# Logger setup
handler = logging.StreamHandler()
handler.setFormatter(logging.Formatter(
    "%(asctime)s [%(levelname)s] [xid=%(xid)s] %(message)s"
))
handler.addFilter(XIDFilter())

logger = logging.getLogger("llm_service")
logger.addHandler(handler)
logger.setLevel(logging.INFO)

# Usage
async def handle_request(prompt: str):
    xid = str(uuid.uuid4())
    request_xid.set(xid)

    logger.info(f"Received prompt: {prompt[:50]}...")

    response = await call_llm(prompt, xid)

    logger.info(f"Generated {len(response)} chars")
    return {"xid": xid, "response": response}

Go Example:

package main

import (
    "context"
    "log/slog"
    "net/http"

    "github.com/google/uuid"
)

type contextKey string
const xidKey contextKey = "xid"

// XID Middleware
func xidMiddleware(next http.Handler) http.Handler {
    return http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
        xid := r.Header.Get("X-Request-ID")
        if xid == "" {
            xid = uuid.New().String()
        }

        ctx := context.WithValue(r.Context(), xidKey, xid)
        w.Header().Set("X-Request-ID", xid)

        slog.Info("request received",
            "xid", xid,
            "method", r.Method,
            "path", r.URL.Path,
        )

        next.ServeHTTP(w, r.WithContext(ctx))
    })
}

// Ollama call function
func callOllama(ctx context.Context, prompt string) (string, error) {
    xid, _ := ctx.Value(xidKey).(string) // empty if middleware didn't run

    slog.Info("calling ollama",
        "xid", xid,
        "prompt_len", len(prompt),
    )

    // ... Ollama API call logic; fills in response ...
    var response string

    slog.Info("ollama response received",
        "xid", xid,
        "response_len", len(response),
    )

    return response, nil
}

