LiteLLM Complete Guide 2025: Unify 100+ LLMs Behind a Single API Proxy Server
- Introduction: Why LiteLLM?
- 1. LiteLLM Architecture: SDK vs Proxy
- 2. SDK Usage: Unified LLM Calls
- 3. Proxy Server Configuration
- 4. Supported Providers and Models
- 5. Model Routing Strategies
- 6. Load Balancing and Fallbacks
- 7. Virtual Keys and Team Management
- 8. Budget Management and Cost Tracking
- 9. Rate Limiting
- 10. Guardrails: PII Masking and Content Filtering
- 11. Logging and Monitoring
- 12. Caching
- 13. Production Deployment
- 14. LiteLLM vs Alternatives
- 15. Real-World Example: A Complete Production Setup
- Quiz
- References
Introduction: Why LiteLLM?
As of 2025, the AI market is moving faster than ever. New models ship every month — OpenAI's GPT-4o, Anthropic's Claude Opus 4, Google's Gemini 2.0, Meta's Llama 3.1, Mistral's Large — and every provider has its own API format, authentication scheme, and pricing policy.
In this environment, organizations face three core problems.
1. Vendor Lock-In
Once your code is written against one provider's API, switching carries a high migration cost even when a better model appears.
# Problem: each provider has a different API format
# OpenAI (legacy 0.x SDK shown; removed in openai>=1.0)
import openai
response = openai.ChatCompletion.create(model="gpt-4", messages=[...])
# Anthropic
import anthropic
response = anthropic.messages.create(model="claude-3-opus", messages=[...])
# Google
import google.generativeai as genai
response = genai.GenerativeModel("gemini-pro").generate_content(...)
2. Cost Management Complexity
With multiple models in play, tracking spend and enforcing budgets per provider becomes complicated.
3. Availability and Reliability
Relying on a single provider means an outage there brings your whole system down.
LiteLLM solves all three problems with a single solution.
┌──────────────────────────────────────────────────┐
│ Your Application │
│ (OpenAI SDK Format) │
└─────────────────────┬────────────────────────────┘
│
▼
┌──────────────────────────────────────────────────┐
│ LiteLLM Proxy Server │
│ ┌──────────┐ ┌──────────┐ ┌──────────────────┐ │
│ │ Cost │ │ Load │ │ Virtual Keys │ │
│ │ Tracking│ │ Balance │ │ + Budget Mgmt │ │
│ └──────────┘ └──────────┘ └──────────────────┘ │
└───┬──────────────┬──────────────┬────────────────┘
│ │ │
▼ ▼ ▼
┌────────┐ ┌──────────┐ ┌───────────┐
│ OpenAI │ │ Anthropic│ │ Google │
│ GPT-4o │ │ Claude │ │ Gemini │
└────────┘ └──────────┘ └───────────┘
▲ ▲ ▲
│ │ │
┌────────┐ ┌──────────┐ ┌───────────┐
│ AWS │ │ Azure │ │ Ollama │
│Bedrock │ │ OpenAI │ │ (Local) │
└────────┘ └──────────┘ └───────────┘
1. LiteLLM Architecture: SDK vs Proxy
LiteLLM can be used in two modes.
1.1 SDK Mode (Python Package)
Import it directly into your application. Best for simple projects.
# pip install litellm
from litellm import completion

# Call OpenAI
response = completion(
    model="gpt-4o",
    messages=[{"role": "user", "content": "Hello!"}]
)

# Call Anthropic (same format!)
response = completion(
    model="anthropic/claude-sonnet-4-20250514",
    messages=[{"role": "user", "content": "Hello!"}]
)

# Call Google Gemini (same format!)
response = completion(
    model="gemini/gemini-2.0-flash",
    messages=[{"role": "user", "content": "Hello!"}]
)
1.2 Proxy Mode (Server)
Run it as a standalone server so every application uses a single endpoint. Best for teams and organizations.
# Start the proxy server
litellm --config config.yaml --port 4000
# Call it from any language using the OpenAI SDK format
import openai

client = openai.OpenAI(
    api_key="sk-litellm-your-virtual-key",
    base_url="http://localhost:4000"
)
response = client.chat.completions.create(
    model="gpt-4o",
    messages=[{"role": "user", "content": "Hello!"}]
)
1.3 SDK vs Proxy Comparison
| Feature | SDK Mode | Proxy Mode |
|---|---|---|
| Usage | Python package import | HTTP server (REST API) |
| Language support | Python only | All languages (OpenAI SDK compatible) |
| Team management | Not available | Per-team management via virtual keys |
| Budget management | Basic | Advanced (per-team/per-key) |
| Load balancing | Manual in code | Automatic via config |
| Best for | Personal projects, prototypes | Teams/organizations, production |
2. SDK Usage: Unified LLM Calls
2.1 Chat Completion
The core feature of the LiteLLM SDK: call 100+ models through an identical interface.
from litellm import completion
import os

# Set API keys via environment variables
os.environ["OPENAI_API_KEY"] = "sk-..."
os.environ["ANTHROPIC_API_KEY"] = "sk-ant-..."
os.environ["GEMINI_API_KEY"] = "AI..."

# Same function, different models
models = [
    "gpt-4o",
    "anthropic/claude-sonnet-4-20250514",
    "gemini/gemini-2.0-flash",
    "groq/llama-3.1-70b-versatile",
    "mistral/mistral-large-latest",
]

for model in models:
    response = completion(
        model=model,
        messages=[
            {"role": "system", "content": "You are a helpful assistant."},
            {"role": "user", "content": "Explain quantum computing in 2 sentences."}
        ],
        temperature=0.7,
        max_tokens=200,
    )
    print(f"[{model}]: {response.choices[0].message.content}")
2.2 Streaming Responses
from litellm import completion

response = completion(
    model="anthropic/claude-sonnet-4-20250514",
    messages=[{"role": "user", "content": "Write a short poem about coding."}],
    stream=True,
)
for chunk in response:
    if chunk.choices[0].delta.content:
        print(chunk.choices[0].delta.content, end="", flush=True)
2.3 Embedding Calls
from litellm import embedding

# OpenAI embedding
response = embedding(
    model="text-embedding-3-small",
    input=["Hello world", "How are you?"]
)
# Cohere embedding (same format!)
response = embedding(
    model="cohere/embed-english-v3.0",
    input=["Hello world", "How are you?"]
)
print(f"Embedding dimension: {len(response.data[0]['embedding'])}")
2.4 Image Generation
from litellm import image_generation
response = image_generation(
model="dall-e-3",
prompt="A futuristic city powered by AI, digital art style",
n=1,
size="1024x1024",
)
print(f"Image URL: {response.data[0]['url']}")
2.5 Async Calls
import asyncio
from litellm import acompletion

async def generate_responses():
    tasks = [
        acompletion(
            model="gpt-4o-mini",
            messages=[{"role": "user", "content": f"Tell me fact #{i} about space."}]
        )
        for i in range(5)
    ]
    responses = await asyncio.gather(*tasks)
    for i, resp in enumerate(responses):
        print(f"Fact #{i}: {resp.choices[0].message.content[:100]}")

asyncio.run(generate_responses())
3. Proxy Server Configuration
3.1 Basic Configuration File (config.yaml)
model_list:
  - model_name: gpt-4o
    litellm_params:
      model: openai/gpt-4o
      api_key: os.environ/OPENAI_API_KEY
  - model_name: claude-sonnet
    litellm_params:
      model: anthropic/claude-sonnet-4-20250514
      api_key: os.environ/ANTHROPIC_API_KEY
  - model_name: gemini-flash
    litellm_params:
      model: gemini/gemini-2.0-flash
      api_key: os.environ/GEMINI_API_KEY
  - model_name: llama-local
    litellm_params:
      model: ollama/llama3.1:70b
      api_base: http://localhost:11434

litellm_settings:
  drop_params: true
  set_verbose: false
  num_retries: 3
  request_timeout: 120

general_settings:
  master_key: sk-master-key-1234  # use a strong random value in production
  database_url: os.environ/DATABASE_URL
3.2 Getting Started with Docker
# Write the Docker Compose file
cat > docker-compose.yaml << 'EOF'
services:
  litellm:
    image: ghcr.io/berriai/litellm:main-latest
    ports:
      - "4000:4000"
    volumes:
      - ./config.yaml:/app/config.yaml
    command: ["--config", "/app/config.yaml", "--port", "4000"]
    environment:
      - OPENAI_API_KEY
      - ANTHROPIC_API_KEY
      - GEMINI_API_KEY
      - DATABASE_URL=postgresql://user:pass@db:5432/litellm
    depends_on:
      - db
  db:
    image: postgres:16
    environment:
      POSTGRES_USER: user
      POSTGRES_PASSWORD: pass
      POSTGRES_DB: litellm
    volumes:
      - pgdata:/var/lib/postgresql/data
volumes:
  pgdata:
EOF
# Start
docker compose up -d
3.3 Testing the Proxy Server
# Health check
curl http://localhost:4000/health
# Chat completion
curl -X POST http://localhost:4000/v1/chat/completions \
  -H "Authorization: Bearer sk-master-key-1234" \
  -H "Content-Type: application/json" \
  -d '{
    "model": "gpt-4o",
    "messages": [{"role": "user", "content": "Hello!"}]
  }'
# List models
curl http://localhost:4000/v1/models \
  -H "Authorization: Bearer sk-master-key-1234"
4. Supported Providers and Models
LiteLLM supports 100+ LLM providers. The major ones are summarized below.
4.1 Major Providers
┌────────────────────┬──────────────────────────────────────┐
│ Provider           │ Supported Models (examples)          │
├────────────────────┼──────────────────────────────────────┤
│ OpenAI             │ gpt-4o, gpt-4o-mini, o1, o3-mini     │
│ Anthropic          │ claude-opus-4, claude-sonnet, haiku  │
│ Google (Gemini)    │ gemini-2.0-flash, gemini-1.5-pro     │
│ AWS Bedrock        │ Claude, Titan, Llama via Bedrock     │
│ Azure OpenAI       │ gpt-4o (Azure-hosted)                │
│ Mistral AI         │ mistral-large, mistral-small         │
│ Groq               │ llama-3.1-70b, mixtral-8x7b          │
│ Together AI        │ llama-3.1, CodeLlama, Qwen           │
│ Ollama (local)     │ llama3.1, codellama, mistral, etc.   │
│ vLLM (self-hosted) │ any HuggingFace model                │
│ Cohere             │ command-r-plus, embed models         │
│ Deepseek           │ deepseek-chat, deepseek-coder        │
│ Fireworks AI       │ llama, mixtral, etc.                 │
│ Perplexity         │ pplx-70b-online, etc.                │
└────────────────────┴──────────────────────────────────────┘
4.2 AWS Bedrock Configuration
model_list:
  - model_name: bedrock-claude
    litellm_params:
      model: bedrock/anthropic.claude-3-sonnet-20240229-v1:0
      aws_access_key_id: os.environ/AWS_ACCESS_KEY_ID
      aws_secret_access_key: os.environ/AWS_SECRET_ACCESS_KEY
      aws_region_name: us-east-1
  - model_name: bedrock-llama
    litellm_params:
      model: bedrock/meta.llama3-1-70b-instruct-v1:0
      aws_region_name: us-west-2
4.3 Azure OpenAI Configuration
model_list:
  - model_name: azure-gpt4
    litellm_params:
      model: azure/gpt-4o-deployment-name
      api_base: https://your-resource.openai.azure.com/
      api_key: os.environ/AZURE_API_KEY
      api_version: "2024-06-01"
4.4 Ollama (Local LLM) Configuration
model_list:
  - model_name: local-llama
    litellm_params:
      model: ollama/llama3.1:70b
      api_base: http://localhost:11434
      stream: true
  - model_name: local-codellama
    litellm_params:
      model: ollama/codellama:34b
      api_base: http://localhost:11434
5. Model Routing Strategies
5.1 Cost-Based Routing (Lowest Cost)
Automatically selects the cheapest model.
router_settings:
  routing_strategy: cost-based-routing

model_list:
  - model_name: general-chat
    litellm_params:
      model: gpt-4o-mini
      # Input: $0.15/1M tokens, Output: $0.60/1M tokens
  - model_name: general-chat
    litellm_params:
      model: anthropic/claude-3-haiku-20240307
      # Input: $0.25/1M tokens, Output: $1.25/1M tokens
  - model_name: general-chat
    litellm_params:
      model: gemini/gemini-2.0-flash
      # Input: $0.075/1M tokens, Output: $0.30/1M tokens
With this configuration, calling general-chat automatically selects Gemini Flash, the cheapest of the three.
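The selection logic can be sketched in a few lines — a simplified illustration, not LiteLLM's actual router code; the prices are the example figures from the config above.

```python
# Illustrative sketch: pick the cheapest deployment for a model group by
# comparing per-token prices (USD per 1M tokens, from the example above).

DEPLOYMENTS = [
    {"model": "gpt-4o-mini", "input_cost": 0.15, "output_cost": 0.60},
    {"model": "anthropic/claude-3-haiku-20240307", "input_cost": 0.25, "output_cost": 1.25},
    {"model": "gemini/gemini-2.0-flash", "input_cost": 0.075, "output_cost": 0.30},
]

def cheapest_deployment(deployments, input_tokens=1000, output_tokens=1000):
    """Return the deployment with the lowest estimated cost for a typical request."""
    def estimate(d):
        return (d["input_cost"] * input_tokens + d["output_cost"] * output_tokens) / 1_000_000
    return min(deployments, key=estimate)

print(cheapest_deployment(DEPLOYMENTS)["model"])  # gemini/gemini-2.0-flash
```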
5.2 Latency-Based Routing (Lowest Latency)
Selects the model with the fastest response time.
router_settings:
  routing_strategy: latency-based-routing
  routing_strategy_args:
    ttl: 60  # re-measure latency every 60 seconds
5.3 Usage-Based Routing
Routes to whichever deployment has spare capacity, based on RPM (requests per minute).
router_settings:
  routing_strategy: usage-based-routing

model_list:
  - model_name: fast-model
    litellm_params:
      model: gpt-4o-mini
      rpm: 500   # max 500 requests per minute
  - model_name: fast-model
    litellm_params:
      model: gemini/gemini-2.0-flash
      rpm: 1000  # max 1000 requests per minute
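The idea behind usage-based routing can be sketched as follows — an illustration only, not LiteLLM's implementation; the "current_rpm" values are hypothetical and would come from live usage tracking.

```python
# Sketch of usage-based routing: send the request to the deployment with
# the most remaining RPM headroom.

deployments = [
    {"model": "gpt-4o-mini", "rpm_limit": 500, "current_rpm": 450},
    {"model": "gemini/gemini-2.0-flash", "rpm_limit": 1000, "current_rpm": 300},
]

def pick_by_headroom(deployments):
    # Only consider deployments that still have capacity this minute
    available = [d for d in deployments if d["current_rpm"] < d["rpm_limit"]]
    if not available:
        raise RuntimeError("All deployments are at their RPM limit")
    return max(available, key=lambda d: d["rpm_limit"] - d["current_rpm"])

print(pick_by_headroom(deployments)["model"])  # gemini/gemini-2.0-flash
```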
5.4 Custom Tag-Based Routing
# Control routing per request via metadata
response = client.chat.completions.create(
    model="general-chat",
    messages=[{"role": "user", "content": "Complex analysis..."}],
    extra_body={
        "metadata": {
            "tags": ["high-quality", "long-context"]
        }
    }
)
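Conceptually, tag-based routing matches request tags against deployment tags. The sketch below illustrates that matching rule with made-up data; it is not LiteLLM's internal logic.

```python
# Sketch of tag matching: a deployment qualifies when it carries every
# tag the request asks for.

deployments = [
    {"model": "gpt-4o-mini", "tags": ["cheap"]},
    {"model": "claude-sonnet", "tags": ["high-quality", "long-context"]},
]

def match_by_tags(deployments, requested_tags):
    for d in deployments:
        if set(requested_tags).issubset(d["tags"]):
            return d
    return None  # no deployment satisfies the requested tags

print(match_by_tags(deployments, ["high-quality", "long-context"])["model"])  # claude-sonnet
```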
6. Load Balancing and Fallbacks
6.1 Load Balancing Configuration
Registering multiple deployments under the same model_name enables automatic load balancing.
model_list:
  # Spread load across multiple API keys for the same model
  - model_name: gpt-4o
    litellm_params:
      model: openai/gpt-4o
      api_key: sk-key-1
      rpm: 100
  - model_name: gpt-4o
    litellm_params:
      model: openai/gpt-4o
      api_key: sk-key-2
      rpm: 100

  # Spread load across providers
  - model_name: fast-model
    litellm_params:
      model: openai/gpt-4o-mini
      rpm: 500
  - model_name: fast-model
    litellm_params:
      model: anthropic/claude-3-haiku-20240307
      rpm: 300

router_settings:
  routing_strategy: simple-shuffle  # random pick, weighted by rpm
  allowed_fails: 3                  # disable a deployment after 3 failures
  cooldown_time: 60                 # re-enable it after 60 seconds
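The allowed_fails / cooldown_time behavior can be sketched as a small state tracker — a simplified illustration with an injectable clock, not LiteLLM's source.

```python
# Sketch: after `allowed_fails` consecutive failures, a deployment is
# cooled down for `cooldown_time` seconds before it can be used again.

class CooldownTracker:
    def __init__(self, allowed_fails=3, cooldown_time=60):
        self.allowed_fails = allowed_fails
        self.cooldown_time = cooldown_time
        self.fail_counts = {}      # deployment -> consecutive failures
        self.cooldown_until = {}   # deployment -> timestamp when usable again

    def record_failure(self, deployment, now):
        self.fail_counts[deployment] = self.fail_counts.get(deployment, 0) + 1
        if self.fail_counts[deployment] >= self.allowed_fails:
            self.cooldown_until[deployment] = now + self.cooldown_time
            self.fail_counts[deployment] = 0

    def record_success(self, deployment):
        self.fail_counts[deployment] = 0

    def is_available(self, deployment, now):
        return now >= self.cooldown_until.get(deployment, 0)

tracker = CooldownTracker()
for _ in range(3):
    tracker.record_failure("gpt-4o/key-1", now=100)
print(tracker.is_available("gpt-4o/key-1", now=110))  # False (cooling down)
print(tracker.is_available("gpt-4o/key-1", now=161))  # True (cooldown elapsed)
```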
6.2 Fallback Chains
If the primary model fails, requests automatically switch to an alternate model. Fallbacks are declared as a list of mappings from a model group to its ordered alternates.
litellm_settings:
  fallbacks:
    - {"gpt-4o": ["anthropic/claude-sonnet-4-20250514", "gemini/gemini-1.5-pro"]}
    - {"claude-sonnet": ["gpt-4o", "gemini/gemini-1.5-pro"]}
  # Switch to another model on content policy violations
  content_policy_fallbacks:
    - {"gpt-4o": ["anthropic/claude-sonnet-4-20250514"]}
  # Switch to a larger model when the context window is exceeded
  context_window_fallbacks:
    - {"gpt-4o-mini": ["gpt-4o", "anthropic/claude-sonnet-4-20250514"]}
6.3 Fallback Flow
User request: Chat Completion with "gpt-4o"
        │
        ▼
[1] Try gpt-4o
        │
        ├── Success → return response
        │
        └── Failure (429 Rate Limit / 500 Error)
                │
                ▼
[2] Try claude-sonnet-4 (1st fallback)
        │
        ├── Success → return response
        │
        └── Failure
                │
                ▼
[3] Try gemini-1.5-pro (2nd fallback)
        │
        ├── Success → return response
        │
        └── Failure → return error
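The flow above reduces to a simple loop: try each model in order and return the first success. This is a hedged sketch — LiteLLM handles this internally when `fallbacks` is configured; the callables here are stand-ins for real API calls.

```python
# Minimal fallback chain: try models in order, return the first success.

def complete_with_fallbacks(chain, call_fn):
    """chain: ordered model names; call_fn(model) returns a response or raises."""
    last_error = None
    for model in chain:
        try:
            return model, call_fn(model)
        except Exception as exc:  # e.g. 429 rate limit, 500 server error
            last_error = exc
    raise RuntimeError(f"All models failed: {last_error}")

def fake_call(model):
    # Stand-in for a real API call: the primary model fails, others succeed
    if model == "gpt-4o":
        raise RuntimeError("429 rate limit")
    return f"response from {model}"

model, resp = complete_with_fallbacks(
    ["gpt-4o", "claude-sonnet-4", "gemini-1.5-pro"], fake_call
)
print(model)  # claude-sonnet-4
```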
7. Virtual Keys and Team Management
7.1 Generating Virtual Keys
# Generate a new virtual key with the master key
curl -X POST http://localhost:4000/key/generate \
-H "Authorization: Bearer sk-master-key-1234" \
-H "Content-Type: application/json" \
-d '{
"key_alias": "frontend-team-key",
"duration": "30d",
"max_budget": 100.0,
"models": ["gpt-4o", "claude-sonnet"],
"max_parallel_requests": 10,
"tpm_limit": 100000,
"rpm_limit": 100,
"metadata": {
"team": "frontend",
"environment": "production"
}
}'
7.2 Creating and Managing Teams
# Create a team
curl -X POST http://localhost:4000/team/new \
-H "Authorization: Bearer sk-master-key-1234" \
-H "Content-Type: application/json" \
-d '{
"team_alias": "ml-engineering",
"max_budget": 500.0,
"models": ["gpt-4o", "claude-sonnet", "gemini-flash"],
"members_with_roles": [
{"role": "admin", "user_id": "user-alice@company.com"},
{"role": "user", "user_id": "user-bob@company.com"}
],
"metadata": {
"department": "engineering",
"cost_center": "ENG-001"
}
}'
# Assign a key to the team
curl -X POST http://localhost:4000/key/generate \
-H "Authorization: Bearer sk-master-key-1234" \
-H "Content-Type: application/json" \
-d '{
"team_id": "team-ml-engineering-id",
"key_alias": "ml-team-prod-key",
"max_budget": 200.0
}'
7.3 Per-Key Model Access Control
# Define model access groups in config.yaml
model_list:
  - model_name: premium-model
    litellm_params:
      model: openai/gpt-4o
    model_info:
      access_groups: ["premium-tier"]
  - model_name: basic-model
    litellm_params:
      model: openai/gpt-4o-mini
    model_info:
      access_groups: ["basic-tier", "premium-tier"]
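The access check itself is straightforward set logic: a key may call a model when it shares at least one access group with it. The sketch below illustrates that rule; it is not the proxy's source code.

```python
# Sketch of access-group enforcement for the config above.

MODEL_ACCESS = {
    "premium-model": {"premium-tier"},
    "basic-model": {"basic-tier", "premium-tier"},
}

def can_access(key_groups, model_name):
    """True when the key's groups intersect the model's access groups."""
    allowed = MODEL_ACCESS.get(model_name, set())
    return bool(allowed & set(key_groups))

print(can_access(["basic-tier"], "basic-model"))    # True
print(can_access(["basic-tier"], "premium-model"))  # False
```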
8. Budget Management and Cost Tracking
8.1 Global Budget
general_settings:
  max_budget: 10000.0   # $10,000 total budget
  budget_duration: 30d  # reset every 30 days (monthly)
8.2 Per-Key / Per-Team Budgets
# Update a key's budget
curl -X POST http://localhost:4000/key/update \
  -H "Authorization: Bearer sk-master-key-1234" \
  -H "Content-Type: application/json" \
  -d '{
    "key": "sk-virtual-key-xxx",
    "max_budget": 50.0,
    "budget_duration": "30d",
    "soft_budget": 40.0
  }'
Reaching soft_budget triggers an alert; reaching max_budget blocks further requests.
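The soft/max budget semantics described above can be summarized in one function — an illustrative sketch of the behavior, not LiteLLM's implementation.

```python
# Sketch of soft vs. hard budget checks: crossing soft_budget triggers an
# alert, crossing max_budget blocks the request.

def check_budget(spend, soft_budget, max_budget):
    if spend >= max_budget:
        return "blocked"
    if spend >= soft_budget:
        return "alert"
    return "ok"

print(check_budget(30.0, soft_budget=40.0, max_budget=50.0))  # ok
print(check_budget(45.0, soft_budget=40.0, max_budget=50.0))  # alert
print(check_budget(50.0, soft_budget=40.0, max_budget=50.0))  # blocked
```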
8.3 Spend Tracking API
# Total spend
curl http://localhost:4000/spend/logs \
  -H "Authorization: Bearer sk-master-key-1234" \
  -G \
  -d "start_date=2025-01-01" \
  -d "end_date=2025-01-31"
# Spend by model
curl http://localhost:4000/spend/tags \
  -H "Authorization: Bearer sk-master-key-1234" \
  -G \
  -d "group_by=model"
# Spend by team
curl http://localhost:4000/spend/tags \
  -H "Authorization: Bearer sk-master-key-1234" \
  -G \
  -d "group_by=team"
8.4 Budget Alert Webhooks
general_settings:
  alerting:
    - slack
  alerting_threshold: 300  # alert when a request hangs for more than 300 seconds
  alert_types:
    - budget_alerts
    - spend_reports
    - failed_tracking

environment_variables:
  SLACK_WEBHOOK_URL: os.environ/SLACK_WEBHOOK_URL
8.5 Example Cost Dashboard
┌─────────────────────────────────────────────────────┐
│ LiteLLM Cost Dashboard (January 2025) │
├─────────────────────────────────────────────────────┤
│ │
│ Total Spend: $3,245.67 / $10,000.00 budget │
│ ████████████████░░░░░░░░░░ 32.5% │
│ │
│ By Model: │
│ ├─ gpt-4o: $1,823.45 (56.2%) │
│ ├─ claude-sonnet: $891.23 (27.5%) │
│ ├─ gemini-flash: $234.56 (7.2%) │
│ └─ gpt-4o-mini: $296.43 (9.1%) │
│ │
│ By Team: │
│ ├─ ML Engineering: $1,456.78 │
│ ├─ Frontend: $987.65 │
│ └─ Data Science: $801.24 │
│ │
│ Daily Trend: ▁▂▃▅▆▇█▇▆▅▃▂▁▂▃▅▆ │
└─────────────────────────────────────────────────────┘
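The per-model breakdown behind a dashboard like this is a simple aggregation over raw spend logs. The sketch below shows that grouping with made-up sample data; it is not how LiteLLM computes its dashboard.

```python
# Sketch: group spend logs by model and compute each model's share.

from collections import defaultdict

logs = [
    {"model": "gpt-4o", "spend": 1.20},
    {"model": "gpt-4o", "spend": 0.80},
    {"model": "gemini-flash", "spend": 0.50},
]

def spend_by_model(logs):
    totals = defaultdict(float)
    for entry in logs:
        totals[entry["model"]] += entry["spend"]
    grand_total = sum(totals.values())
    return {m: (amount, 100 * amount / grand_total) for m, amount in totals.items()}

for model, (amount, pct) in spend_by_model(logs).items():
    print(f"{model}: ${amount:.2f} ({pct:.1f}%)")
# gpt-4o: $2.00 (80.0%)
# gemini-flash: $0.50 (20.0%)
```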
9. Rate Limiting
9.1 Key-Level Rate Limits
curl -X POST http://localhost:4000/key/generate \
-H "Authorization: Bearer sk-master-key-1234" \
-H "Content-Type: application/json" \
-d '{
"key_alias": "rate-limited-key",
"rpm_limit": 60,
"tpm_limit": 100000,
"max_parallel_requests": 5
}'
9.2 Model-Level Rate Limits
model_list:
  - model_name: gpt-4o
    litellm_params:
      model: openai/gpt-4o
      api_key: os.environ/OPENAI_API_KEY
      rpm: 200     # max requests per minute for this deployment
      tpm: 400000  # max tokens per minute for this deployment
9.3 Global Rate Limits
general_settings:
  global_max_parallel_requests: 100  # cap on concurrent requests across the proxy
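An RPM limit like the ones above is conceptually a sliding-window counter. The sketch below illustrates the idea with an injectable clock; LiteLLM's actual enforcement is internal to the proxy.

```python
# Sketch of a per-key RPM limiter: count requests in a sliding 60-second
# window and reject anything above the limit.

from collections import deque

class RpmLimiter:
    def __init__(self, rpm_limit):
        self.rpm_limit = rpm_limit
        self.timestamps = deque()

    def allow(self, now):
        # Drop requests that have fallen out of the 60s window
        while self.timestamps and now - self.timestamps[0] >= 60:
            self.timestamps.popleft()
        if len(self.timestamps) >= self.rpm_limit:
            return False
        self.timestamps.append(now)
        return True

limiter = RpmLimiter(rpm_limit=3)
print([limiter.allow(t) for t in (0, 1, 2, 3)])  # [True, True, True, False]
print(limiter.allow(61))  # True — the earliest requests left the window
```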
10. Guardrails: PII Masking and Content Filtering
10.1 PII Masking
litellm_settings:
  guardrails:
    - guardrail_name: pii-masking
      litellm_params:
        guardrail: presidio
        mode: pre_call           # mask PII before the request reaches the LLM
        output_parse_pii: true   # also mask PII in responses
# A request containing PII
response = client.chat.completions.create(
    model="gpt-4o",
    messages=[{
        "role": "user",
        "content": "My email is john@example.com and phone is 010-1234-5678"
    }],
    extra_body={
        "metadata": {
            "guardrails": ["pii-masking"]
        }
    }
)
# The LLM receives "My email is [EMAIL] and phone is [PHONE]"
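What such a guardrail does to the request can be sketched with two regexes for just email and phone numbers. Real deployments use Presidio's recognizers, which cover far more entity types; this is only an illustration.

```python
# Minimal PII-masking sketch covering only email addresses and
# Korean-style phone numbers.

import re

EMAIL = re.compile(r"[\w.+-]+@[\w-]+\.[\w.]+")
PHONE = re.compile(r"\b\d{2,3}-\d{3,4}-\d{4}\b")

def mask_pii(text):
    text = EMAIL.sub("[EMAIL]", text)
    return PHONE.sub("[PHONE]", text)

print(mask_pii("My email is john@example.com and phone is 010-1234-5678"))
# My email is [EMAIL] and phone is [PHONE]
```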
10.2 Content Filtering (Lakera Guard)
litellm_settings:
  guardrails:
    - guardrail_name: content-filter
      litellm_params:
        guardrail: lakera
        mode: pre_call  # filter before the request is sent
        api_key: os.environ/LAKERA_API_KEY
10.3 Prompt Injection Defense
litellm_settings:
  guardrails:
    - guardrail_name: prompt-injection
      litellm_params:
        guardrail: lakera
        mode: pre_call
        prompt_injection: true
10.4 Custom Guardrails
# custom_guardrail.py
# Sketch based on LiteLLM's CustomGuardrail base class; check the current
# docs for the exact hook names and signatures.
from litellm.integrations.custom_guardrail import CustomGuardrail

class CustomContentFilter(CustomGuardrail):
    def __init__(self, **kwargs):
        self.blocked_words = ["harmful", "dangerous", "illegal"]
        super().__init__(**kwargs)

    async def async_pre_call_hook(self, user_api_key_dict, cache, data, call_type):
        user_message = data.get("messages", [])[-1].get("content", "")
        for word in self.blocked_words:
            if word.lower() in user_message.lower():
                raise ValueError(f"Blocked content detected: {word}")
        return data

    async def async_post_call_success_hook(self, data, user_api_key_dict, response):
        # Post-process the response here
        return response
11. Logging and Monitoring
11.1 Langfuse Integration
litellm_settings:
  success_callback: ["langfuse"]
  failure_callback: ["langfuse"]

environment_variables:
  LANGFUSE_PUBLIC_KEY: os.environ/LANGFUSE_PUBLIC_KEY
  LANGFUSE_SECRET_KEY: os.environ/LANGFUSE_SECRET_KEY
  LANGFUSE_HOST: https://cloud.langfuse.com
11.2 LangSmith Integration
litellm_settings:
  success_callback: ["langsmith"]

environment_variables:
  LANGCHAIN_API_KEY: os.environ/LANGCHAIN_API_KEY
  LANGCHAIN_PROJECT: my-litellm-project
11.3 Custom Callbacks
# custom_callback.py
from litellm.integrations.custom_logger import CustomLogger

class MyCustomLogger(CustomLogger):
    async def async_log_success_event(self, kwargs, response_obj, start_time, end_time):
        print(f"Model: {kwargs.get('model')}")
        print(f"Cost: ${kwargs.get('response_cost', 0):.6f}")
        print(f"Latency: {(end_time - start_time).total_seconds():.2f}s")
        print(f"Tokens: {response_obj.usage.total_tokens}")

    async def async_log_failure_event(self, kwargs, response_obj, start_time, end_time):
        print(f"FAILED - Model: {kwargs.get('model')}")
        print(f"Error: {kwargs.get('exception')}")
11.4 Prometheus Metrics
litellm_settings:
  success_callback: ["prometheus"]
  failure_callback: ["prometheus"]
Key metrics:
# Request counts
litellm_requests_total
litellm_requests_failed_total
# Latency
litellm_request_duration_seconds_bucket
litellm_request_duration_seconds_sum
# Spend
litellm_spend_total
# Tokens
litellm_tokens_total
litellm_input_tokens_total
litellm_output_tokens_total
12. Caching
12.1 Redis Caching
litellm_settings:
  cache: true
  cache_params:
    type: redis
    host: redis-host
    port: 6379
    password: os.environ/REDIS_PASSWORD
    ttl: 3600  # cache for 1 hour
    # Which call types to cache
    supported_call_types:
      - acompletion
      - completion
12.2 In-Memory Caching
litellm_settings:
  cache: true
  cache_params:
    type: local
    ttl: 600  # cache for 10 minutes
12.3 Cache Control
# Use the cache
response = client.chat.completions.create(
    model="gpt-4o",
    messages=[{"role": "user", "content": "What is 2+2?"}],
    extra_body={
        "metadata": {
            "cache": {
                "no-cache": False,  # use cached responses
                "ttl": 3600         # TTL for this request
            }
        }
    }
)
# Bypass the cache
response = client.chat.completions.create(
    model="gpt-4o",
    messages=[{"role": "user", "content": "What is the latest news?"}],
    extra_body={
        "metadata": {
            "cache": {
                "no-cache": True  # always get a fresh response
            }
        }
    }
)
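Under the hood, a response cache needs a stable key per (model, messages) pair plus TTL-based expiry. The sketch below illustrates both ideas; LiteLLM's actual cache key logic lives in the library and may differ.

```python
# Sketch: key cache entries by hashing the model plus normalized messages,
# and expire entries after a TTL.

import hashlib
import json

def cache_key(model, messages):
    payload = json.dumps({"model": model, "messages": messages}, sort_keys=True)
    return hashlib.sha256(payload.encode()).hexdigest()

class TtlCache:
    def __init__(self, ttl):
        self.ttl = ttl
        self.store = {}  # key -> (expires_at, value)

    def get(self, key, now):
        entry = self.store.get(key)
        if entry and now < entry[0]:
            return entry[1]
        return None  # missing or expired

    def set(self, key, value, now):
        self.store[key] = (now + self.ttl, value)

cache = TtlCache(ttl=3600)
key = cache_key("gpt-4o", [{"role": "user", "content": "What is 2+2?"}])
cache.set(key, "4", now=0)
print(cache.get(key, now=100))   # 4
print(cache.get(key, now=4000))  # None — expired
```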
13. Production Deployment
13.1 Kubernetes Deployment
# litellm-deployment.yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: litellm-proxy
  labels:
    app: litellm
spec:
  replicas: 3
  selector:
    matchLabels:
      app: litellm
  template:
    metadata:
      labels:
        app: litellm
    spec:
      containers:
        - name: litellm
          image: ghcr.io/berriai/litellm:main-latest
          ports:
            - containerPort: 4000
          args: ["--config", "/app/config.yaml", "--port", "4000"]
          volumeMounts:
            - name: config
              mountPath: /app/config.yaml
              subPath: config.yaml
          env:
            - name: OPENAI_API_KEY
              valueFrom:
                secretKeyRef:
                  name: litellm-secrets
                  key: openai-api-key
            - name: ANTHROPIC_API_KEY
              valueFrom:
                secretKeyRef:
                  name: litellm-secrets
                  key: anthropic-api-key
            - name: DATABASE_URL
              valueFrom:
                secretKeyRef:
                  name: litellm-secrets
                  key: database-url
          resources:
            requests:
              cpu: "500m"
              memory: "512Mi"
            limits:
              cpu: "2000m"
              memory: "2Gi"
          livenessProbe:
            httpGet:
              path: /health/liveliness
              port: 4000
            initialDelaySeconds: 30
            periodSeconds: 10
          readinessProbe:
            httpGet:
              path: /health/readiness
              port: 4000
            initialDelaySeconds: 10
            periodSeconds: 5
      volumes:
        - name: config
          configMap:
            name: litellm-config
---
apiVersion: v1
kind: Service
metadata:
  name: litellm-service
spec:
  selector:
    app: litellm
  ports:
    - port: 80
      targetPort: 4000
  type: ClusterIP
---
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: litellm-hpa
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: litellm-proxy
  minReplicas: 2
  maxReplicas: 10
  metrics:
    - type: Resource
      resource:
        name: cpu
        target:
          type: Utilization
          averageUtilization: 70
13.2 Using the Helm Chart
# Install with Helm
helm repo add litellm https://litellm.github.io/litellm-helm
helm repo update
helm install litellm litellm/litellm-helm \
--set masterKey=sk-your-master-key \
--set replicaCount=3 \
--set database.useExisting=true \
--set database.url=postgresql://user:pass@host:5432/litellm \
--namespace litellm \
--create-namespace
13.3 Production Checklist
Production Deployment Checklist
================================
[ ] PostgreSQL database configured and connectivity verified
[ ] Master key set to a strong random value
[ ] All API keys managed via environment variables/secrets
[ ] HTTPS/TLS configured (Ingress or LoadBalancer)
[ ] At least 2 replicas
[ ] HPA (Horizontal Pod Autoscaler) configured
[ ] Liveness/readiness probes configured
[ ] Prometheus + Grafana monitoring in place
[ ] Slack/PagerDuty alerting configured
[ ] Redis cache configured (optional)
[ ] Budgets and rate limits configured
[ ] Guardrails configured (if needed)
[ ] Logging (Langfuse/LangSmith) configured
[ ] Backup and recovery strategy in place
[ ] Load testing completed
14. LiteLLM vs Alternatives
14.1 LiteLLM vs OpenRouter
| Feature | LiteLLM | OpenRouter |
|---|---|---|
| Hosting | Self-hosted | Cloud service |
| Data privacy | Full control | Routed through third-party servers |
| Cost | Open source, free | Margin added on API prices |
| Team management | Virtual keys/teams/budgets | Limited |
| Customization | Fully customizable | Limited |
| Setup difficulty | Medium (needs Docker/K8s) | Easy (just an API key) |
| Local models | Supported (Ollama/vLLM) | Not supported |
14.2 LiteLLM vs Portkey
| Feature | LiteLLM | Portkey |
|---|---|---|
| Open source | Yes (Apache 2.0) | Partial (Gateway only) |
| Proxy server | Included | Included |
| Virtual keys | Included | Included |
| AI Gateway | Basic | Advanced (prompt management, etc.) |
| Pricing | Free | Paid plans available |
| Community | Active (12K+ GitHub stars) | Growing |
14.3 Which One Should You Choose?
┌─────────────────────────────────────────────────┐
│ Recommended solution by scenario                │
├─────────────────────────────────────────────────┤
│                                                 │
│ "I want to get started fast"                    │
│ → OpenRouter (sign up and use immediately)      │
│                                                 │
│ "Data security matters"                         │
│ → LiteLLM (self-hosted, full control)           │
│                                                 │
│ "I need team/org management"                    │
│ → LiteLLM or Portkey                            │
│                                                 │
│ "I also use local models"                       │
│ → LiteLLM (native Ollama/vLLM support)          │
│                                                 │
│ "I need enterprise features"                    │
│ → LiteLLM Enterprise or Portkey Enterprise      │
│                                                 │
└─────────────────────────────────────────────────┘
15. Real-World Example: A Complete Production Setup
Below is a complete config.yaml you could adapt for a real production deployment.
# production-config.yaml
model_list:
  # === Premium tier ===
  - model_name: premium-chat
    litellm_params:
      model: openai/gpt-4o
      api_key: os.environ/OPENAI_API_KEY
      rpm: 200
      tpm: 400000
    model_info:
      access_groups: ["premium"]
  - model_name: premium-chat
    litellm_params:
      model: anthropic/claude-sonnet-4-20250514
      api_key: os.environ/ANTHROPIC_API_KEY
      rpm: 150
    model_info:
      access_groups: ["premium"]

  # === Basic tier ===
  - model_name: basic-chat
    litellm_params:
      model: openai/gpt-4o-mini
      api_key: os.environ/OPENAI_API_KEY
      rpm: 500
    model_info:
      access_groups: ["basic", "premium"]
  - model_name: basic-chat
    litellm_params:
      model: gemini/gemini-2.0-flash
      api_key: os.environ/GEMINI_API_KEY
      rpm: 1000
    model_info:
      access_groups: ["basic", "premium"]

  # === Coding only ===
  - model_name: code-model
    litellm_params:
      model: anthropic/claude-sonnet-4-20250514
      api_key: os.environ/ANTHROPIC_API_KEY
    model_info:
      access_groups: ["premium"]

  # === Embeddings ===
  - model_name: embedding
    litellm_params:
      model: openai/text-embedding-3-small
      api_key: os.environ/OPENAI_API_KEY

# === Router settings ===
router_settings:
  routing_strategy: latency-based-routing
  allowed_fails: 3
  cooldown_time: 60
  num_retries: 3
  timeout: 120
  retry_after: 5

# === LiteLLM settings ===
litellm_settings:
  drop_params: true
  set_verbose: false
  num_retries: 3
  request_timeout: 120
  # Fallbacks
  fallbacks:
    - {"premium-chat": ["basic-chat"]}
    - {"basic-chat": ["premium-chat"]}
  context_window_fallbacks:
    - {"basic-chat": ["premium-chat"]}
  # Caching
  cache: true
  cache_params:
    type: redis
    host: os.environ/REDIS_HOST
    port: 6379
    password: os.environ/REDIS_PASSWORD
    ttl: 3600
  # Callbacks
  success_callback: ["langfuse", "prometheus"]
  failure_callback: ["langfuse", "prometheus"]
  # Guardrails
  guardrails:
    - guardrail_name: pii-masking
      litellm_params:
        guardrail: presidio
        mode: pre_call

# === General settings ===
general_settings:
  master_key: os.environ/LITELLM_MASTER_KEY
  database_url: os.environ/DATABASE_URL
  max_budget: 10000
  budget_duration: 30d
  alerting:
    - slack
  alerting_threshold: 300  # alert when a request hangs for more than 300 seconds
  global_max_parallel_requests: 200

# === Environment variables ===
environment_variables:
  LANGFUSE_PUBLIC_KEY: os.environ/LANGFUSE_PUBLIC_KEY
  LANGFUSE_SECRET_KEY: os.environ/LANGFUSE_SECRET_KEY
  SLACK_WEBHOOK_URL: os.environ/SLACK_WEBHOOK_URL
Quiz
Let's check what we've learned.
Q1. What are LiteLLM's two usage modes, and when is each appropriate?
A1. SDK mode and Proxy mode.
- SDK mode: import the Python package directly. Suited to personal projects and prototypes.
- Proxy mode: run as a standalone HTTP server. Suited to teams/organizations, multi-language clients, and production setups that need virtual keys and budget management.
The key difference is that Proxy mode provides enterprise features: virtual keys, team management, budgets, and logging.
Q2. What is model fallback, and which kinds does LiteLLM support?
A2. Fallback is the mechanism that automatically switches to an alternate model when the primary model fails.
LiteLLM supports three kinds:
- General fallbacks (fallbacks): switch to an alternate model on call failures (429, 500 errors, etc.)
- Content policy fallbacks (content_policy_fallbacks): switch models on content policy violations
- Context window fallbacks (context_window_fallbacks): switch to a larger model when the input exceeds the model's context window
Q3. How does LiteLLM's cost-based routing work?
A3. Register multiple providers' models under the same model_name and set routing_strategy to cost-based-routing; LiteLLM compares each model's input/output token prices and routes to the cheapest one.
For example, register GPT-4o-mini, Claude Haiku, and Gemini Flash under one name, and the lowest-priced model is selected automatically, optimizing cost while maintaining quality.
Q4. What are virtual keys, and what do they provide?
A4. Virtual keys are API keys issued by the LiteLLM proxy. They provide fine-grained controls without exposing the underlying providers' real API keys.
Key features:
- Budget limits: per-key maximum budget (max_budget)
- Model access control: restrict a key to specific models
- Rate limits: RPM, TPM, and max concurrent requests
- Team binding: assign keys to teams for team-level management
- Expiry: set a key's validity period
- Usage tracking: per-key cost, token, and request counts
Q5. Name five things you must consider when deploying LiteLLM to production.
A5. Production deployment essentials:
- Database: a PostgreSQL connection is required (stores virtual keys, spend tracking, and team data)
- Security: a strong random master key, all API keys in secrets, HTTPS/TLS
- High availability: at least 2 replicas, HPA, liveness/readiness probes
- Monitoring: Prometheus metrics, Grafana dashboards, Slack/PagerDuty alerts
- Cost control: global, per-team, and per-key budgets, soft-budget alerts, rate limits
Redis caching, logging (Langfuse), guardrails, and a backup strategy are also recommended.
References
- LiteLLM documentation - https://docs.litellm.ai/
- LiteLLM GitHub - https://github.com/BerriAI/litellm
- LiteLLM Proxy Server quick start - https://docs.litellm.ai/docs/proxy/quick_start
- LiteLLM supported providers - https://docs.litellm.ai/docs/providers
- LiteLLM virtual keys - https://docs.litellm.ai/docs/proxy/virtual_keys
- LiteLLM routing strategies - https://docs.litellm.ai/docs/routing
- LiteLLM budget management - https://docs.litellm.ai/docs/proxy/users
- LiteLLM Guardrails - https://docs.litellm.ai/docs/proxy/guardrails
- LiteLLM Kubernetes deployment - https://docs.litellm.ai/docs/proxy/deploy
- OpenRouter - https://openrouter.ai/
- Portkey AI - https://portkey.ai/
- Langfuse LLM Observability - https://langfuse.com/
- LiteLLM enterprise features - https://docs.litellm.ai/docs/proxy/enterprise
- Prometheus monitoring - https://prometheus.io/
LiteLLM Complete Guide 2025: Unify 100+ LLMs with a Single API Proxy Server
- Introduction: Why LiteLLM?
- 1. LiteLLM Architecture: SDK vs Proxy
- 2. SDK Usage: Unified LLM Calls
- 3. Proxy Server Configuration
- 4. Supported Providers and Models
- 5. Model Routing Strategies
- 6. Load Balancing and Fallback
- 7. Virtual Keys and Team Management
- 8. Budget Management and Cost Tracking
- 9. Rate Limiting
- 10. Guardrails: PII Masking and Content Filtering
- 11. Logging and Monitoring
- 12. Caching
- 13. Production Deployment
- 14. LiteLLM vs Alternatives
- 15. Real-World Example: Complete Production Config
- Quiz
- References
Introduction: Why LiteLLM?
In 2025, the AI landscape is evolving faster than ever. New models launch every month: OpenAI's GPT-4o, Anthropic's Claude Opus 4, Google's Gemini 2.0, Meta's Llama 3.1, Mistral's Large, and more. Each provider has its own API format, authentication scheme, and pricing model.
Organizations face three critical challenges in this environment.
1. Vendor Lock-In
When you build against one provider's API, switching to a better model incurs significant refactoring costs.
# Problem: Each provider has a different API format
# OpenAI
import openai
response = openai.ChatCompletion.create(model="gpt-4", messages=[...])
# Anthropic
import anthropic
response = anthropic.messages.create(model="claude-3-opus", messages=[...])
# Google
import google.generativeai as genai
response = genai.GenerativeModel("gemini-pro").generate_content(...)
2. Cost Management Complexity
Using multiple models means tracking costs across different providers with different pricing structures becomes a nightmare.
3. Availability and Reliability
Depending on a single provider means a service outage takes your entire system down.
LiteLLM solves all three problems with a single solution.
┌──────────────────────────────────────────────────┐
│ Your Application │
│ (OpenAI SDK Format) │
└─────────────────────┬────────────────────────────┘
│
▼
┌──────────────────────────────────────────────────┐
│ LiteLLM Proxy Server │
│ ┌──────────┐ ┌──────────┐ ┌──────────────────┐ │
│ │ Cost │ │ Load │ │ Virtual Keys │ │
│ │ Tracking│ │ Balance │ │ + Budget Mgmt │ │
│ └──────────┘ └──────────┘ └──────────────────┘ │
└───┬──────────────┬──────────────┬────────────────┘
│ │ │
▼ ▼ ▼
┌────────┐ ┌──────────┐ ┌───────────┐
│ OpenAI │ │ Anthropic│ │ Google │
│ GPT-4o │ │ Claude │ │ Gemini │
└────────┘ └──────────┘ └───────────┘
▲ ▲ ▲
│ │ │
┌────────┐ ┌──────────┐ ┌───────────┐
│ AWS │ │ Azure │ │ Ollama │
│Bedrock │ │ OpenAI │ │ (Local) │
└────────┘ └──────────┘ └───────────┘
1. LiteLLM Architecture: SDK vs Proxy
LiteLLM operates in two modes.
1.1 SDK Mode (Python Package)
Import directly into your application. Best for simple projects and prototypes.
# pip install litellm
from litellm import completion
# Call OpenAI
response = completion(
model="gpt-4o",
messages=[{"role": "user", "content": "Hello!"}]
)
# Call Anthropic (same format!)
response = completion(
model="anthropic/claude-sonnet-4-20250514",
messages=[{"role": "user", "content": "Hello!"}]
)
# Call Google Gemini (same format!)
response = completion(
model="gemini/gemini-2.0-flash",
messages=[{"role": "user", "content": "Hello!"}]
)
1.2 Proxy Mode (Server)
Run as a standalone server so all applications use a single endpoint. Best for teams and organizations.
# Start proxy server
litellm --config config.yaml --port 4000
# Any language can call using OpenAI SDK format
import openai
client = openai.OpenAI(
api_key="sk-litellm-your-virtual-key",
base_url="http://localhost:4000"
)
response = client.chat.completions.create(
model="gpt-4o",
messages=[{"role": "user", "content": "Hello!"}]
)
1.3 SDK vs Proxy Comparison
| Feature | SDK Mode | Proxy Mode |
|---|---|---|
| Usage | Python package import | HTTP Server (REST API) |
| Language Support | Python only | All languages (OpenAI SDK compatible) |
| Team Management | Not available | Virtual keys with team management |
| Budget Management | Basic | Advanced (per-team/per-key) |
| Load Balancing | Manual in code | Automatic via config |
| Best For | Personal projects, prototypes | Teams, organizations, production |
2. SDK Usage: Unified LLM Calls
2.1 Chat Completion
The core SDK feature. Call 100+ models with an identical interface.
from litellm import completion
import os
# Set API keys via environment variables
os.environ["OPENAI_API_KEY"] = "sk-..."
os.environ["ANTHROPIC_API_KEY"] = "sk-ant-..."
os.environ["GEMINI_API_KEY"] = "AI..."
# Same function, different models
models = [
"gpt-4o",
"anthropic/claude-sonnet-4-20250514",
"gemini/gemini-2.0-flash",
"groq/llama-3.1-70b-versatile",
"mistral/mistral-large-latest",
]
for model in models:
response = completion(
model=model,
messages=[
{"role": "system", "content": "You are a helpful assistant."},
{"role": "user", "content": "Explain quantum computing in 2 sentences."}
],
temperature=0.7,
max_tokens=200,
)
print(f"[{model}]: {response.choices[0].message.content}")
2.2 Streaming Responses
from litellm import completion
response = completion(
    model="anthropic/claude-sonnet-4-20250514",
    messages=[{"role": "user", "content": "Write a short poem about coding."}],
    stream=True,
)
for chunk in response:
    if chunk.choices[0].delta.content:
        print(chunk.choices[0].delta.content, end="", flush=True)
2.3 Embedding Calls
from litellm import embedding
# OpenAI Embedding
response = embedding(
    model="text-embedding-3-small",
    input=["Hello world", "How are you?"]
)
# Cohere Embedding (same format!)
response = embedding(
    model="cohere/embed-english-v3.0",
    input=["Hello world", "How are you?"]
)
print(f"Embedding dimension: {len(response.data[0]['embedding'])}")
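Embedding vectors are usually consumed by a similarity search. As a minimal sketch (pure Python, no API call — with real responses you would compare the `response.data[i]['embedding']` lists), cosine similarity between two vectors looks like this:

```python
import math

def cosine_similarity(a, b):
    """Cosine similarity between two embedding vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

# Toy vectors; identical direction gives similarity 1.0
print(cosine_similarity([1.0, 0.0], [1.0, 0.0]))
print(cosine_similarity([1.0, 0.0], [0.0, 1.0]))
```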
2.4 Image Generation
from litellm import image_generation
response = image_generation(
    model="dall-e-3",
    prompt="A futuristic city powered by AI, digital art style",
    n=1,
    size="1024x1024",
)
print(f"Image URL: {response.data[0]['url']}")
2.5 Async Calls
import asyncio
from litellm import acompletion
async def generate_responses():
    tasks = [
        acompletion(
            model="gpt-4o-mini",
            messages=[{"role": "user", "content": f"Tell me fact #{i} about space."}]
        )
        for i in range(5)
    ]
    responses = await asyncio.gather(*tasks)
    for i, resp in enumerate(responses):
        print(f"Fact #{i}: {resp.choices[0].message.content[:100]}")

asyncio.run(generate_responses())
3. Proxy Server Configuration
3.1 Basic Configuration File (config.yaml)
model_list:
  - model_name: gpt-4o
    litellm_params:
      model: openai/gpt-4o
      api_key: os.environ/OPENAI_API_KEY
  - model_name: claude-sonnet
    litellm_params:
      model: anthropic/claude-sonnet-4-20250514
      api_key: os.environ/ANTHROPIC_API_KEY
  - model_name: gemini-flash
    litellm_params:
      model: gemini/gemini-2.0-flash
      api_key: os.environ/GEMINI_API_KEY
  - model_name: llama-local
    litellm_params:
      model: ollama/llama3.1:70b
      api_base: http://localhost:11434

litellm_settings:
  drop_params: true
  set_verbose: false
  num_retries: 3
  request_timeout: 120

general_settings:
  master_key: sk-master-key-1234
  database_url: os.environ/DATABASE_URL
3.2 Getting Started with Docker
# Docker Compose file
cat > docker-compose.yaml << 'EOF'
version: "3.9"
services:
  litellm:
    image: ghcr.io/berriai/litellm:main-latest
    ports:
      - "4000:4000"
    volumes:
      - ./config.yaml:/app/config.yaml
    command: ["--config", "/app/config.yaml", "--port", "4000"]
    environment:
      - OPENAI_API_KEY
      - ANTHROPIC_API_KEY
      - GEMINI_API_KEY
      - DATABASE_URL=postgresql://user:pass@db:5432/litellm
    depends_on:
      - db
  db:
    image: postgres:16
    environment:
      POSTGRES_USER: user
      POSTGRES_PASSWORD: pass
      POSTGRES_DB: litellm
    volumes:
      - pgdata:/var/lib/postgresql/data
volumes:
  pgdata:
EOF
# Launch
docker compose up -d
3.3 Testing the Proxy Server
# Health check
curl http://localhost:4000/health

# Chat Completion call
curl -X POST http://localhost:4000/v1/chat/completions \
  -H "Authorization: Bearer sk-master-key-1234" \
  -H "Content-Type: application/json" \
  -d '{
    "model": "gpt-4o",
    "messages": [{"role": "user", "content": "Hello!"}]
  }'

# List models
curl http://localhost:4000/v1/models \
  -H "Authorization: Bearer sk-master-key-1234"
4. Supported Providers and Models
LiteLLM supports 100+ LLM providers. Here are the major ones.
4.1 Major Providers
┌────────────────────┬──────────────────────────────────────┐
│ Provider │ Supported Models (examples) │
├────────────────────┼──────────────────────────────────────┤
│ OpenAI │ gpt-4o, gpt-4o-mini, o1, o3-mini │
│ Anthropic │ claude-opus-4, claude-sonnet, haiku │
│ Google (Gemini) │ gemini-2.0-flash, gemini-1.5-pro │
│ AWS Bedrock │ Claude, Titan, Llama via Bedrock │
│ Azure OpenAI │ gpt-4o (Azure hosted) │
│ Mistral AI │ mistral-large, mistral-small │
│ Groq │ llama-3.1-70b, mixtral-8x7b │
│ Together AI │ llama-3.1, CodeLlama, Qwen │
│ Ollama (Local) │ llama3.1, codellama, mistral │
│ vLLM (Self-hosted) │ Any HuggingFace model │
│ Cohere │ command-r-plus, embed models │
│ Deepseek │ deepseek-chat, deepseek-coder │
│ Fireworks AI │ llama, mixtral, etc. │
│ Perplexity │ pplx-70b-online, etc. │
└────────────────────┴──────────────────────────────────────┘
4.2 AWS Bedrock Configuration
model_list:
  - model_name: bedrock-claude
    litellm_params:
      model: bedrock/anthropic.claude-3-sonnet-20240229-v1:0
      aws_access_key_id: os.environ/AWS_ACCESS_KEY_ID
      aws_secret_access_key: os.environ/AWS_SECRET_ACCESS_KEY
      aws_region_name: us-east-1
  - model_name: bedrock-llama
    litellm_params:
      model: bedrock/meta.llama3-1-70b-instruct-v1:0
      aws_region_name: us-west-2
4.3 Azure OpenAI Configuration
model_list:
  - model_name: azure-gpt4
    litellm_params:
      model: azure/gpt-4o-deployment-name
      api_base: https://your-resource.openai.azure.com/
      api_key: os.environ/AZURE_API_KEY
      api_version: "2024-06-01"
4.4 Ollama (Local LLM) Configuration
model_list:
  - model_name: local-llama
    litellm_params:
      model: ollama/llama3.1:70b
      api_base: http://localhost:11434
      stream: true
  - model_name: local-codellama
    litellm_params:
      model: ollama/codellama:34b
      api_base: http://localhost:11434
5. Model Routing Strategies
5.1 Cost-Based Routing (Lowest Cost)
Automatically selects the cheapest model.
router_settings:
  routing_strategy: cost-based-routing

model_list:
  - model_name: general-chat
    litellm_params:
      model: gpt-4o-mini
      # Input: $0.15/1M tokens, Output: $0.60/1M tokens
  - model_name: general-chat
    litellm_params:
      model: anthropic/claude-3-haiku-20240307
      # Input: $0.25/1M tokens, Output: $1.25/1M tokens
  - model_name: general-chat
    litellm_params:
      model: gemini/gemini-2.0-flash
      # Input: $0.075/1M tokens, Output: $0.30/1M tokens
With this setup, calling general-chat will automatically select Gemini Flash as the cheapest option.
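To see why Gemini Flash wins, the selection logic amounts to picking the lowest estimated cost per request. A simplified sketch, using the hypothetical price table from the comments above (the real router reads prices from LiteLLM's model cost map):

```python
# Hypothetical per-1M-token prices, taken from the config comments above
PRICES = {
    "gpt-4o-mini": {"input": 0.15, "output": 0.60},
    "anthropic/claude-3-haiku-20240307": {"input": 0.25, "output": 1.25},
    "gemini/gemini-2.0-flash": {"input": 0.075, "output": 0.30},
}

def cheapest_model(input_tokens, output_tokens):
    """Pick the deployment with the lowest estimated cost for this request."""
    def cost(model):
        p = PRICES[model]
        return (input_tokens * p["input"] + output_tokens * p["output"]) / 1_000_000
    return min(PRICES, key=cost)

print(cheapest_model(1000, 500))  # -> gemini/gemini-2.0-flash
```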
5.2 Latency-Based Routing (Lowest Latency)
Selects the model with the fastest response time.
router_settings:
  routing_strategy: latency-based-routing
  routing_strategy_args:
    ttl: 60  # Re-measure latency every 60 seconds
5.3 Usage-Based Routing
Routes to the model with the most available capacity based on RPM (requests per minute).
router_settings:
  routing_strategy: usage-based-routing

model_list:
  - model_name: fast-model
    litellm_params:
      model: gpt-4o-mini
      rpm: 500   # Max 500 requests per minute
  - model_name: fast-model
    litellm_params:
      model: gemini/gemini-2.0-flash
      rpm: 1000  # Max 1000 requests per minute
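The core idea can be sketched as "route to the deployment with the most unused RPM headroom" (a toy model; field names like `rpm_used` are illustrative, not LiteLLM internals):

```python
def pick_by_capacity(deployments):
    """Route to the deployment with the most unused RPM headroom."""
    return max(deployments, key=lambda d: d["rpm_limit"] - d["rpm_used"])

deployments = [
    {"model": "gpt-4o-mini", "rpm_limit": 500, "rpm_used": 480},
    {"model": "gemini/gemini-2.0-flash", "rpm_limit": 1000, "rpm_used": 300},
]
print(pick_by_capacity(deployments)["model"])  # -> gemini/gemini-2.0-flash
```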
6. Load Balancing and Fallback
6.1 Load Balancing Configuration
Registering multiple deployments under the same model_name automatically enables load balancing.
model_list:
  # Distribute across multiple API keys for the same model
  - model_name: gpt-4o
    litellm_params:
      model: openai/gpt-4o
      api_key: sk-key-1
      rpm: 100
  - model_name: gpt-4o
    litellm_params:
      model: openai/gpt-4o
      api_key: sk-key-2
      rpm: 100
  # Distribute across different providers
  - model_name: fast-model
    litellm_params:
      model: openai/gpt-4o-mini
      rpm: 500
  - model_name: fast-model
    litellm_params:
      model: anthropic/claude-3-haiku-20240307
      rpm: 300

router_settings:
  routing_strategy: simple-shuffle  # Round-robin style
  allowed_fails: 3                  # Deactivate a deployment after 3 failures
  cooldown_time: 60                 # Reactivate after 60 seconds
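The `allowed_fails` / `cooldown_time` behavior can be illustrated with a small tracker, simplified from what the router does internally (the real router also shuffles among healthy deployments):

```python
class CooldownTracker:
    """After `allowed_fails` failures, bench a deployment for `cooldown_time` seconds."""
    def __init__(self, allowed_fails=3, cooldown_time=60.0):
        self.allowed_fails = allowed_fails
        self.cooldown_time = cooldown_time
        self.fail_counts = {}     # deployment -> consecutive failures
        self.cooldown_until = {}  # deployment -> timestamp when it becomes usable again

    def record_failure(self, deployment, now):
        self.fail_counts[deployment] = self.fail_counts.get(deployment, 0) + 1
        if self.fail_counts[deployment] >= self.allowed_fails:
            self.cooldown_until[deployment] = now + self.cooldown_time
            self.fail_counts[deployment] = 0

    def is_available(self, deployment, now):
        return now >= self.cooldown_until.get(deployment, 0.0)

tracker = CooldownTracker(allowed_fails=3, cooldown_time=60.0)
for _ in range(3):
    tracker.record_failure("sk-key-1", now=100.0)
print(tracker.is_available("sk-key-1", now=130.0))  # False (cooling down)
print(tracker.is_available("sk-key-1", now=161.0))  # True (cooldown elapsed)
```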
6.2 Fallback Chain Configuration
Automatically switch to an alternative model when the primary fails.
litellm_settings:
  # On API failures (429 / 5xx), try the listed alternatives in order
  fallbacks:
    - gpt-4o: ["anthropic/claude-sonnet-4-20250514", "gemini/gemini-1.5-pro"]
    - claude-sonnet: ["gpt-4o", "gemini/gemini-1.5-pro"]
  # Fallback on content policy violations
  content_policy_fallbacks:
    - gpt-4o: ["anthropic/claude-sonnet-4-20250514"]
  # Fallback when the context window is exceeded
  context_window_fallbacks:
    - gpt-4o-mini: ["gpt-4o", "anthropic/claude-sonnet-4-20250514"]
6.3 Fallback Flow
User Request: Chat Completion with "gpt-4o"
        |
        v
[1] Try gpt-4o
        |
        +-- Success --> Return response
        |
        +-- Failure (429 Rate Limit / 500 Error)
                |
                v
[2] Try claude-sonnet-4 (1st fallback)
        |
        +-- Success --> Return response
        |
        +-- Failure
                |
                v
[3] Try gemini-1.5-pro (2nd fallback)
        |
        +-- Success --> Return response
        |
        +-- Failure --> Return error
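The flow above boils down to a try-in-order loop. A minimal sketch with stand-in callables in place of real API calls (the names and exceptions are illustrative only):

```python
def complete_with_fallbacks(call_fns, chain):
    """Try each model in order; return (model, response) from the first success."""
    last_error = None
    for model in chain:
        try:
            return model, call_fns[model]()
        except Exception as err:  # e.g. rate limit or server error
            last_error = err
    raise RuntimeError(f"All fallbacks exhausted: {last_error}")

def failing():
    raise TimeoutError("429 rate limited")

calls = {
    "gpt-4o": failing,
    "claude-sonnet-4": failing,
    "gemini-1.5-pro": lambda: "Hello from Gemini!",
}
model, reply = complete_with_fallbacks(calls, ["gpt-4o", "claude-sonnet-4", "gemini-1.5-pro"])
print(model, reply)  # -> gemini-1.5-pro Hello from Gemini!
```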
7. Virtual Keys and Team Management
7.1 Creating Virtual Keys
# Generate a new virtual key with the master key
curl -X POST http://localhost:4000/key/generate \
  -H "Authorization: Bearer sk-master-key-1234" \
  -H "Content-Type: application/json" \
  -d '{
    "key_alias": "frontend-team-key",
    "duration": "30d",
    "max_budget": 100.0,
    "models": ["gpt-4o", "claude-sonnet"],
    "max_parallel_requests": 10,
    "tpm_limit": 100000,
    "rpm_limit": 100,
    "metadata": {
      "team": "frontend",
      "environment": "production"
    }
  }'
7.2 Team Creation and Management
# Create a team
curl -X POST http://localhost:4000/team/new \
  -H "Authorization: Bearer sk-master-key-1234" \
  -H "Content-Type: application/json" \
  -d '{
    "team_alias": "ml-engineering",
    "max_budget": 500.0,
    "models": ["gpt-4o", "claude-sonnet", "gemini-flash"],
    "members_with_roles": [
      {"role": "admin", "user_id": "user-alice@company.com"},
      {"role": "user", "user_id": "user-bob@company.com"}
    ],
    "metadata": {
      "department": "engineering",
      "cost_center": "ENG-001"
    }
  }'

# Assign keys to a team
curl -X POST http://localhost:4000/key/generate \
  -H "Authorization: Bearer sk-master-key-1234" \
  -H "Content-Type: application/json" \
  -d '{
    "team_id": "team-ml-engineering-id",
    "key_alias": "ml-team-prod-key",
    "max_budget": 200.0
  }'
7.3 Model Access Control per Key
# Model access groups in config.yaml
model_list:
  - model_name: premium-model
    litellm_params:
      model: openai/gpt-4o
    model_info:
      access_groups: ["premium-tier"]
  - model_name: basic-model
    litellm_params:
      model: openai/gpt-4o-mini
    model_info:
      access_groups: ["basic-tier", "premium-tier"]
8. Budget Management and Cost Tracking
8.1 Global Budget Settings
general_settings:
  max_budget: 10000.0   # Total budget: $10,000
  budget_duration: 30d  # Reset every 30 days (~monthly)
8.2 Per-Key/Per-Team Budget
# Update key budget
curl -X POST http://localhost:4000/key/update \
  -H "Authorization: Bearer sk-master-key-1234" \
  -H "Content-Type: application/json" \
  -d '{
    "key": "sk-virtual-key-xxx",
    "max_budget": 50.0,
    "budget_duration": "30d",
    "soft_budget": 40.0
  }'
When soft_budget is reached, an alert is sent. When max_budget is reached, requests are blocked.
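The soft/hard budget semantics can be expressed as a simple classification of current spend (illustrative only; the proxy tracks spend in PostgreSQL and enforces this on every request):

```python
def budget_status(spend, soft_budget, max_budget):
    """Classify current spend against the key's soft and hard budgets."""
    if spend >= max_budget:
        return "blocked"  # requests are rejected
    if spend >= soft_budget:
        return "alert"    # alert is sent, requests still allowed
    return "ok"

print(budget_status(42.0, soft_budget=40.0, max_budget=50.0))  # -> alert
```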
8.3 Cost Tracking API
# Query total spend
curl http://localhost:4000/spend/logs \
  -H "Authorization: Bearer sk-master-key-1234" \
  -G \
  -d "start_date=2025-01-01" \
  -d "end_date=2025-01-31"

# Query spend by model
curl http://localhost:4000/spend/tags \
  -H "Authorization: Bearer sk-master-key-1234" \
  -G \
  -d "group_by=model"

# Query spend by team
curl http://localhost:4000/spend/tags \
  -H "Authorization: Bearer sk-master-key-1234" \
  -G \
  -d "group_by=team"
8.4 Budget Alert Webhooks
general_settings:
  alerting:
    - slack
  alerting_threshold: 300  # Seconds a request may hang before a slow-request alert fires
  alert_types:
    - budget_alerts
    - spend_reports
    - failed_tracking

environment_variables:
  SLACK_WEBHOOK_URL: os.environ/SLACK_WEBHOOK_URL
9. Rate Limiting
9.1 Key-Level Rate Limiting
curl -X POST http://localhost:4000/key/generate \
  -H "Authorization: Bearer sk-master-key-1234" \
  -H "Content-Type: application/json" \
  -d '{
    "key_alias": "rate-limited-key",
    "rpm_limit": 60,
    "tpm_limit": 100000,
    "max_parallel_requests": 5
  }'
9.2 Model-Level Rate Limiting
model_list:
  - model_name: gpt-4o
    litellm_params:
      model: openai/gpt-4o
      api_key: os.environ/OPENAI_API_KEY
      rpm: 200     # Max requests per minute for this deployment
      tpm: 400000  # Max tokens per minute for this deployment
9.3 Global Rate Limiting
general_settings:
  global_max_parallel_requests: 100  # Global concurrent request limit
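Conceptually, an RPM limit is a sliding-window counter over the last 60 seconds. A toy sketch of the idea (the proxy enforces this server-side, typically backed by Redis when running multiple instances):

```python
from collections import deque

class RpmLimiter:
    """Allow at most `rpm` requests in any 60-second window."""
    def __init__(self, rpm):
        self.rpm = rpm
        self.timestamps = deque()  # timestamps of allowed requests

    def allow(self, now):
        # Drop requests that have aged out of the 60s window
        while self.timestamps and now - self.timestamps[0] >= 60.0:
            self.timestamps.popleft()
        if len(self.timestamps) < self.rpm:
            self.timestamps.append(now)
            return True
        return False

limiter = RpmLimiter(rpm=2)
print(limiter.allow(0.0), limiter.allow(1.0), limiter.allow(2.0))  # True True False
```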
10. Guardrails: PII Masking and Content Filtering
10.1 PII (Personally Identifiable Information) Masking
litellm_settings:
  guardrails:
    - guardrail_name: pii-masking
      litellm_params:
        guardrail: presidio
        mode: during_call        # Mask PII before sending to the LLM
        output_parse_pii: true   # Also mask PII in responses
# Request containing PII
response = client.chat.completions.create(
    model="gpt-4o",
    messages=[{
        "role": "user",
        "content": "My email is john@example.com and phone is 555-123-4567"
    }],
    extra_body={
        "metadata": {
            "guardrails": ["pii-masking"]
        }
    }
)
# LLM receives: "My email is [EMAIL] and phone is [PHONE]"
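To make that transformation concrete, here is a regex-only sketch of what a PII masker does. This is purely illustrative: the real Presidio guardrail uses NLP-based entity recognizers, not two regexes.

```python
import re

# Illustrative patterns only; real PII detection is far more thorough
PII_PATTERNS = {
    "[EMAIL]": re.compile(r"[\w.+-]+@[\w-]+\.[\w.-]+"),
    "[PHONE]": re.compile(r"\b\d{3}-\d{3}-\d{4}\b"),
}

def mask_pii(text):
    """Replace recognized PII spans with placeholder tags."""
    for placeholder, pattern in PII_PATTERNS.items():
        text = pattern.sub(placeholder, text)
    return text

print(mask_pii("My email is john@example.com and phone is 555-123-4567"))
# -> My email is [EMAIL] and phone is [PHONE]
```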
10.2 Content Filtering (Lakera Guard)
litellm_settings:
  guardrails:
    - guardrail_name: content-filter
      litellm_params:
        guardrail: lakera
        mode: pre_call  # Filter before sending the request
        api_key: os.environ/LAKERA_API_KEY
10.3 Prompt Injection Defense
litellm_settings:
  guardrails:
    - guardrail_name: prompt-injection
      litellm_params:
        guardrail: lakera
        mode: pre_call
        prompt_injection: true
10.4 Custom Guardrails
# custom_guardrail.py
from litellm.integrations.custom_guardrail import CustomGuardrail

class CustomContentFilter(CustomGuardrail):
    def __init__(self, **kwargs):
        self.blocked_words = ["harmful", "dangerous", "illegal"]
        super().__init__(**kwargs)

    async def async_pre_call_hook(self, user_api_key_dict, cache, data, call_type):
        # Reject the request before it reaches the LLM
        messages = data.get("messages", [])
        user_message = messages[-1].get("content", "") if messages else ""
        for word in self.blocked_words:
            if word in user_message.lower():
                raise ValueError(f"Blocked content detected: {word}")
        return data

    async def async_post_call_success_hook(self, data, user_api_key_dict, response):
        # Optionally inspect or modify the LLM response here
        return response
11. Logging and Monitoring
11.1 Langfuse Integration
litellm_settings:
  success_callback: ["langfuse"]
  failure_callback: ["langfuse"]

environment_variables:
  LANGFUSE_PUBLIC_KEY: os.environ/LANGFUSE_PUBLIC_KEY
  LANGFUSE_SECRET_KEY: os.environ/LANGFUSE_SECRET_KEY
  LANGFUSE_HOST: https://cloud.langfuse.com
11.2 LangSmith Integration
litellm_settings:
  success_callback: ["langsmith"]

environment_variables:
  LANGCHAIN_API_KEY: os.environ/LANGCHAIN_API_KEY
  LANGCHAIN_PROJECT: my-litellm-project
11.3 Custom Callbacks
# custom_callback.py
from litellm.integrations.custom_logger import CustomLogger

class MyCustomLogger(CustomLogger):
    async def async_log_success_event(self, kwargs, response_obj, start_time, end_time):
        print(f"Model: {kwargs.get('model')}")
        print(f"Cost: ${kwargs.get('response_cost', 0):.6f}")
        print(f"Latency: {(end_time - start_time).total_seconds():.2f}s")
        print(f"Tokens: {response_obj.usage.total_tokens}")

    async def async_log_failure_event(self, kwargs, response_obj, start_time, end_time):
        print(f"FAILED - Model: {kwargs.get('model')}")
        print(f"Error: {kwargs.get('exception')}")
11.4 Prometheus Metrics
litellm_settings:
  success_callback: ["prometheus"]
  failure_callback: ["prometheus"]
Key metrics:
# Request counts
litellm_requests_total
litellm_requests_failed_total
# Latency
litellm_request_duration_seconds_bucket
litellm_request_duration_seconds_sum
# Cost
litellm_spend_total
# Tokens
litellm_tokens_total
litellm_input_tokens_total
litellm_output_tokens_total
12. Caching
12.1 Redis Caching
litellm_settings:
  cache: true
  cache_params:
    type: redis  # for semantic caching (similar questions share a response), use type: redis-semantic
    host: redis-host
    port: 6379
    password: os.environ/REDIS_PASSWORD
    ttl: 3600  # 1-hour cache
    # Restrict caching to these call types
    supported_call_types:
      - acompletion
      - completion
12.2 In-Memory Caching
litellm_settings:
  cache: true
  cache_params:
    type: local
    ttl: 600  # 10-minute cache
12.3 Cache Control
# Use cache
response = client.chat.completions.create(
    model="gpt-4o",
    messages=[{"role": "user", "content": "What is 2+2?"}],
    extra_body={
        "metadata": {
            "cache": {
                "no-cache": False,  # Use cache
                "ttl": 3600         # TTL for this request
            }
        }
    }
)

# Skip cache
response = client.chat.completions.create(
    model="gpt-4o",
    messages=[{"role": "user", "content": "What is the latest news?"}],
    extra_body={
        "metadata": {
            "cache": {
                "no-cache": True  # Always get a fresh response
            }
        }
    }
)
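Under the hood, a response cache keys on the request contents, so identical requests hit the same entry. A simplified sketch of deterministic cache-key generation (LiteLLM's actual key derivation includes more fields; this just shows the principle):

```python
import hashlib
import json

def cache_key(model, messages, **params):
    """Build a deterministic cache key from the request payload."""
    payload = json.dumps(
        {"model": model, "messages": messages, **params},
        sort_keys=True,
    )
    return hashlib.sha256(payload.encode()).hexdigest()

k1 = cache_key("gpt-4o", [{"role": "user", "content": "What is 2+2?"}])
k2 = cache_key("gpt-4o", [{"role": "user", "content": "What is 2+2?"}])
k3 = cache_key("gpt-4o", [{"role": "user", "content": "What is 3+3?"}])
print(k1 == k2, k1 == k3)  # True False
```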
13. Production Deployment
13.1 Kubernetes Deployment
# litellm-deployment.yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: litellm-proxy
  labels:
    app: litellm
spec:
  replicas: 3
  selector:
    matchLabels:
      app: litellm
  template:
    metadata:
      labels:
        app: litellm
    spec:
      containers:
        - name: litellm
          image: ghcr.io/berriai/litellm:main-latest
          ports:
            - containerPort: 4000
          args: ["--config", "/app/config.yaml", "--port", "4000"]
          volumeMounts:
            - name: config
              mountPath: /app/config.yaml
              subPath: config.yaml
          env:
            - name: OPENAI_API_KEY
              valueFrom:
                secretKeyRef:
                  name: litellm-secrets
                  key: openai-api-key
            - name: ANTHROPIC_API_KEY
              valueFrom:
                secretKeyRef:
                  name: litellm-secrets
                  key: anthropic-api-key
            - name: DATABASE_URL
              valueFrom:
                secretKeyRef:
                  name: litellm-secrets
                  key: database-url
          resources:
            requests:
              cpu: "500m"
              memory: "512Mi"
            limits:
              cpu: "2000m"
              memory: "2Gi"
          livenessProbe:
            httpGet:
              path: /health/liveliness
              port: 4000
            initialDelaySeconds: 30
            periodSeconds: 10
          readinessProbe:
            httpGet:
              path: /health/readiness
              port: 4000
            initialDelaySeconds: 10
            periodSeconds: 5
      volumes:
        - name: config
          configMap:
            name: litellm-config
---
apiVersion: v1
kind: Service
metadata:
  name: litellm-service
spec:
  selector:
    app: litellm
  ports:
    - port: 80
      targetPort: 4000
  type: ClusterIP
---
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: litellm-hpa
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: litellm-proxy
  minReplicas: 2
  maxReplicas: 10
  metrics:
    - type: Resource
      resource:
        name: cpu
        target:
          type: Utilization
          averageUtilization: 70
13.2 Using Helm Chart
# Install with Helm
helm repo add litellm https://litellm.github.io/litellm-helm
helm repo update

helm install litellm litellm/litellm-helm \
  --set masterKey=sk-your-master-key \
  --set replicaCount=3 \
  --set database.useExisting=true \
  --set database.url=postgresql://user:pass@host:5432/litellm \
  --namespace litellm \
  --create-namespace
13.3 Production Checklist
Production Deployment Checklist
================================
[ ] PostgreSQL DB configured and connectivity verified
[ ] Master Key set to a strong random value
[ ] All API keys managed via env vars / secrets
[ ] HTTPS/TLS configured (Ingress or LoadBalancer)
[ ] 2+ replicas configured
[ ] HPA (Horizontal Pod Autoscaler) configured
[ ] Liveness/Readiness Probes configured
[ ] Prometheus + Grafana monitoring set up
[ ] Slack/PagerDuty alerting configured
[ ] Redis cache set up (optional)
[ ] Budget and rate limits configured
[ ] Guardrails configured (if needed)
[ ] Logging (Langfuse/LangSmith) set up
[ ] Backup and recovery strategy documented
[ ] Load testing completed
14. LiteLLM vs Alternatives
14.1 LiteLLM vs OpenRouter
| Feature | LiteLLM | OpenRouter |
|---|---|---|
| Hosting | Self-hosted | Cloud service |
| Data Privacy | Full control | Third-party servers |
| Cost | Open-source free | API margin added |
| Team Management | Virtual keys/teams/budgets | Limited |
| Customization | Fully customizable | Limited |
| Setup Difficulty | Medium (Docker/K8s) | Easy (API key only) |
| Local Models | Supported (Ollama/vLLM) | Not supported |
14.2 LiteLLM vs Portkey
| Feature | LiteLLM | Portkey |
|---|---|---|
| Open Source | Yes (Apache 2.0) | Partial (Gateway only) |
| Proxy Server | Included | Included |
| Virtual Keys | Included | Included |
| AI Gateway | Basic | Advanced (Prompt mgmt) |
| Pricing | Free | Premium plans available |
| Community | Active (12K+ GitHub Stars) | Growing |
14.3 When to Choose What
Decision Guide
================================
"I want to get started quickly"
-> OpenRouter (sign up and use immediately)
"Data security is critical"
-> LiteLLM (self-hosted, full control)
"I need team/org management"
-> LiteLLM or Portkey
"I also use local models"
-> LiteLLM (native Ollama/vLLM support)
"I need enterprise features"
-> LiteLLM Enterprise or Portkey Enterprise
15. Real-World Example: Complete Production Config
Below is a complete config.yaml suitable for production use.
# production-config.yaml
model_list:
  # === Premium Tier ===
  - model_name: premium-chat
    litellm_params:
      model: openai/gpt-4o
      api_key: os.environ/OPENAI_API_KEY
      rpm: 200
      tpm: 400000
    model_info:
      access_groups: ["premium"]
  - model_name: premium-chat
    litellm_params:
      model: anthropic/claude-sonnet-4-20250514
      api_key: os.environ/ANTHROPIC_API_KEY
      rpm: 150
    model_info:
      access_groups: ["premium"]

  # === Basic Tier ===
  - model_name: basic-chat
    litellm_params:
      model: openai/gpt-4o-mini
      api_key: os.environ/OPENAI_API_KEY
      rpm: 500
    model_info:
      access_groups: ["basic", "premium"]
  - model_name: basic-chat
    litellm_params:
      model: gemini/gemini-2.0-flash
      api_key: os.environ/GEMINI_API_KEY
      rpm: 1000
    model_info:
      access_groups: ["basic", "premium"]

  # === Coding Dedicated ===
  - model_name: code-model
    litellm_params:
      model: anthropic/claude-sonnet-4-20250514
      api_key: os.environ/ANTHROPIC_API_KEY
    model_info:
      access_groups: ["premium"]

  # === Embedding ===
  - model_name: embedding
    litellm_params:
      model: openai/text-embedding-3-small
      api_key: os.environ/OPENAI_API_KEY

# === Router Settings ===
router_settings:
  routing_strategy: latency-based-routing
  allowed_fails: 3
  cooldown_time: 60
  num_retries: 3
  timeout: 120
  retry_after: 5

# === LiteLLM Settings ===
litellm_settings:
  drop_params: true
  set_verbose: false
  num_retries: 3
  request_timeout: 120
  # Fallbacks
  fallbacks:
    - premium-chat: ["basic-chat"]
    - basic-chat: ["premium-chat"]
  context_window_fallbacks:
    - basic-chat: ["premium-chat"]
  # Caching
  cache: true
  cache_params:
    type: redis
    host: os.environ/REDIS_HOST
    port: 6379
    password: os.environ/REDIS_PASSWORD
    ttl: 3600
  # Callbacks
  success_callback: ["langfuse", "prometheus"]
  failure_callback: ["langfuse", "prometheus"]
  # Guardrails
  guardrails:
    - guardrail_name: pii-masking
      litellm_params:
        guardrail: presidio
        mode: during_call

# === General Settings ===
general_settings:
  master_key: os.environ/LITELLM_MASTER_KEY
  database_url: os.environ/DATABASE_URL
  max_budget: 10000
  budget_duration: 30d  # Reset every 30 days (~monthly)
  alerting:
    - slack
  alerting_threshold: 300  # Seconds a request may hang before a slow-request alert fires
  global_max_parallel_requests: 200

# === Environment Variables ===
environment_variables:
  LANGFUSE_PUBLIC_KEY: os.environ/LANGFUSE_PUBLIC_KEY
  LANGFUSE_SECRET_KEY: os.environ/LANGFUSE_SECRET_KEY
  SLACK_WEBHOOK_URL: os.environ/SLACK_WEBHOOK_URL
Quiz
Test your understanding of LiteLLM concepts.
Q1. What are the two modes of using LiteLLM, and when is each appropriate?
A1. LiteLLM has SDK Mode and Proxy Mode.
- SDK Mode: Import the Python package directly. Best for personal projects and prototypes.
- Proxy Mode: Run as a standalone HTTP server. Best for teams/organizations, multi-language clients, virtual key management, budget management, and production deployments.
The key difference is that Proxy Mode provides enterprise features like virtual keys, team management, budget management, and logging.
Q2. What is Model Fallback, and what types of fallback does LiteLLM support?
A2. Model fallback is a mechanism that automatically switches to an alternative model when the primary model fails.
LiteLLM supports three types of fallback:
- General fallbacks: Switch to alternative models on API failures (429, 500 errors)
- Content policy fallbacks: Switch when content policy is violated
- Context window fallbacks: Switch to a model with a larger context window when input exceeds limits
Q3. How does cost-based routing work in LiteLLM?
A3. When you register multiple provider models under the same model_name and set routing_strategy to cost-based-routing, LiteLLM compares the input/output token prices for each model and automatically routes to the cheapest one.
For example, if GPT-4o-mini, Claude Haiku, and Gemini Flash are all registered under the same name, the model with the lowest price per token is automatically selected. This optimizes cost while maintaining quality.
Q4. What are Virtual Keys and what are their main capabilities?
A4. Virtual Keys are API keys generated by the LiteLLM proxy. They avoid exposing actual LLM provider API keys while providing extensive control features.
Key capabilities:
- Budget limits: Set maximum budget per key (max_budget)
- Model access control: Restrict which models a key can access
- Rate limiting: Set RPM, TPM, and concurrent request limits
- Team association: Assign keys to teams for team-level management
- Duration control: Set key expiration dates
- Usage tracking: Track cost, tokens, and request counts per key
Q5. What are the top 5 must-consider items when deploying LiteLLM to production?
A5. Critical production deployment considerations:
- Database: PostgreSQL connection required (stores virtual keys, cost tracking, team management data)
- Security: Strong random Master Key, all API keys in secrets, HTTPS/TLS enabled
- High Availability: 2+ replicas, HPA configured, Liveness/Readiness Probes set up
- Monitoring: Prometheus metrics collection, Grafana dashboards, Slack/PagerDuty alerts configured
- Cost Control: Global budget, per-team/per-key budgets, soft budget alerts, rate limits configured
Additionally recommended: Redis caching, logging (Langfuse), Guardrails, and backup strategy.
References
- LiteLLM Official Documentation - https://docs.litellm.ai/
- LiteLLM GitHub Repository - https://github.com/BerriAI/litellm
- LiteLLM Proxy Server Quick Start - https://docs.litellm.ai/docs/proxy/quick_start
- LiteLLM Supported Providers - https://docs.litellm.ai/docs/providers
- LiteLLM Virtual Keys Guide - https://docs.litellm.ai/docs/proxy/virtual_keys
- LiteLLM Routing Strategies - https://docs.litellm.ai/docs/routing
- LiteLLM Budget Management - https://docs.litellm.ai/docs/proxy/users
- LiteLLM Guardrails - https://docs.litellm.ai/docs/proxy/guardrails
- LiteLLM Kubernetes Deployment - https://docs.litellm.ai/docs/proxy/deploy
- OpenRouter Official Site - https://openrouter.ai/
- Portkey AI Official Site - https://portkey.ai/
- Langfuse LLM Observability - https://langfuse.com/
- LiteLLM vs Alternatives - https://docs.litellm.ai/docs/proxy/enterprise
- Prometheus Monitoring - https://prometheus.io/