- Authors
  - Youngju Kim (@fjvbn20031)
- Introduction: Why LiteLLM?
- 1. LiteLLM Architecture: SDK vs Proxy
- 2. SDK Usage: Unified LLM Calls
- 3. Proxy Server Configuration
- 4. Supported Providers and Models
- 5. Model Routing Strategies
- 6. Load Balancing and Fallback
- 7. Virtual Keys and Team Management
- 8. Budget Management and Cost Tracking
- 9. Rate Limiting
- 10. Guardrails: PII Masking and Content Filtering
- 11. Logging and Monitoring
- 12. Caching
- 13. Production Deployment
- 14. LiteLLM vs Alternatives
- 15. Real-World Example: Complete Production Config
- Quiz
- References
Introduction: Why LiteLLM?
In 2025, the AI landscape is evolving faster than ever. New models launch every month: OpenAI's GPT-4o, Anthropic's Claude Opus 4, Google's Gemini 2.0, Meta's Llama 3.1, Mistral's Large, and more. Each provider has its own API format, authentication scheme, and pricing model.
Organizations face three critical challenges in this environment.
1. Vendor Lock-In
When you build against one provider's API, switching to a better model incurs significant refactoring costs.
# Problem: Each provider has a different API format
# OpenAI
import openai
response = openai.ChatCompletion.create(model="gpt-4", messages=[...])
# Anthropic
import anthropic
response = anthropic.messages.create(model="claude-3-opus", messages=[...])
# Google
import google.generativeai as genai
response = genai.GenerativeModel("gemini-pro").generate_content(...)
2. Cost Management Complexity
With multiple models in play, tracking costs across providers, each with its own pricing structure, quickly becomes a nightmare.
3. Availability and Reliability
Depending on a single provider means a service outage takes your entire system down.
LiteLLM solves all three problems with a single solution.
┌──────────────────────────────────────────────────┐
│ Your Application │
│ (OpenAI SDK Format) │
└─────────────────────┬────────────────────────────┘
│
▼
┌──────────────────────────────────────────────────┐
│ LiteLLM Proxy Server │
│ ┌──────────┐ ┌──────────┐ ┌──────────────────┐ │
│ │ Cost │ │ Load │ │ Virtual Keys │ │
│ │ Tracking│ │ Balance │ │ + Budget Mgmt │ │
│ └──────────┘ └──────────┘ └──────────────────┘ │
└───┬──────────────┬──────────────┬────────────────┘
│ │ │
▼ ▼ ▼
┌────────┐ ┌──────────┐ ┌───────────┐
│ OpenAI │ │ Anthropic│ │ Google │
│ GPT-4o │ │ Claude │ │ Gemini │
└────────┘ └──────────┘ └───────────┘
▲ ▲ ▲
│ │ │
┌────────┐ ┌──────────┐ ┌───────────┐
│ AWS │ │ Azure │ │ Ollama │
│Bedrock │ │ OpenAI │ │ (Local) │
└────────┘ └──────────┘ └───────────┘
1. LiteLLM Architecture: SDK vs Proxy
LiteLLM operates in two modes.
1.1 SDK Mode (Python Package)
Import directly into your application. Best for simple projects and prototypes.
# pip install litellm
from litellm import completion

# Call OpenAI
response = completion(
    model="gpt-4o",
    messages=[{"role": "user", "content": "Hello!"}]
)

# Call Anthropic (same format!)
response = completion(
    model="anthropic/claude-sonnet-4-20250514",
    messages=[{"role": "user", "content": "Hello!"}]
)

# Call Google Gemini (same format!)
response = completion(
    model="gemini/gemini-2.0-flash",
    messages=[{"role": "user", "content": "Hello!"}]
)
1.2 Proxy Mode (Server)
Run as a standalone server so all applications use a single endpoint. Best for teams and organizations.
# Start proxy server
litellm --config config.yaml --port 4000

# Any language can call using OpenAI SDK format
import openai

client = openai.OpenAI(
    api_key="sk-litellm-your-virtual-key",
    base_url="http://localhost:4000"
)

response = client.chat.completions.create(
    model="gpt-4o",
    messages=[{"role": "user", "content": "Hello!"}]
)
1.3 SDK vs Proxy Comparison
| Feature | SDK Mode | Proxy Mode |
|---|---|---|
| Usage | Python package import | HTTP Server (REST API) |
| Language Support | Python only | All languages (OpenAI SDK compatible) |
| Team Management | Not available | Virtual keys with team management |
| Budget Management | Basic | Advanced (per-team/per-key) |
| Load Balancing | Manual in code | Automatic via config |
| Best For | Personal projects, prototypes | Teams, organizations, production |
2. SDK Usage: Unified LLM Calls
2.1 Chat Completion
The core SDK feature. Call 100+ models with an identical interface.
from litellm import completion
import os

# Set API keys via environment variables
os.environ["OPENAI_API_KEY"] = "sk-..."
os.environ["ANTHROPIC_API_KEY"] = "sk-ant-..."
os.environ["GEMINI_API_KEY"] = "AI..."

# Same function, different models
models = [
    "gpt-4o",
    "anthropic/claude-sonnet-4-20250514",
    "gemini/gemini-2.0-flash",
    "groq/llama-3.1-70b-versatile",
    "mistral/mistral-large-latest",
]

for model in models:
    response = completion(
        model=model,
        messages=[
            {"role": "system", "content": "You are a helpful assistant."},
            {"role": "user", "content": "Explain quantum computing in 2 sentences."}
        ],
        temperature=0.7,
        max_tokens=200,
    )
    print(f"[{model}]: {response.choices[0].message.content}")
2.2 Streaming Responses
from litellm import completion

response = completion(
    model="anthropic/claude-sonnet-4-20250514",
    messages=[{"role": "user", "content": "Write a short poem about coding."}],
    stream=True,
)

for chunk in response:
    if chunk.choices[0].delta.content:
        print(chunk.choices[0].delta.content, end="", flush=True)
2.3 Embedding Calls
from litellm import embedding

# OpenAI Embedding
response = embedding(
    model="text-embedding-3-small",
    input=["Hello world", "How are you?"]
)

# Cohere Embedding (same format!)
response = embedding(
    model="cohere/embed-english-v3.0",
    input=["Hello world", "How are you?"]
)

print(f"Embedding dimension: {len(response.data[0]['embedding'])}")
2.4 Image Generation
from litellm import image_generation

response = image_generation(
    model="dall-e-3",
    prompt="A futuristic city powered by AI, digital art style",
    n=1,
    size="1024x1024",
)

print(f"Image URL: {response.data[0]['url']}")
2.5 Async Calls
import asyncio
from litellm import acompletion

async def generate_responses():
    tasks = [
        acompletion(
            model="gpt-4o-mini",
            messages=[{"role": "user", "content": f"Tell me fact #{i} about space."}]
        )
        for i in range(5)
    ]
    responses = await asyncio.gather(*tasks)
    for i, resp in enumerate(responses):
        print(f"Fact #{i}: {resp.choices[0].message.content[:100]}")

asyncio.run(generate_responses())
3. Proxy Server Configuration
3.1 Basic Configuration File (config.yaml)
model_list:
  - model_name: gpt-4o
    litellm_params:
      model: openai/gpt-4o
      api_key: os.environ/OPENAI_API_KEY

  - model_name: claude-sonnet
    litellm_params:
      model: anthropic/claude-sonnet-4-20250514
      api_key: os.environ/ANTHROPIC_API_KEY

  - model_name: gemini-flash
    litellm_params:
      model: gemini/gemini-2.0-flash
      api_key: os.environ/GEMINI_API_KEY

  - model_name: llama-local
    litellm_params:
      model: ollama/llama3.1:70b
      api_base: http://localhost:11434

litellm_settings:
  drop_params: true
  set_verbose: false
  num_retries: 3
  request_timeout: 120

general_settings:
  master_key: sk-master-key-1234
  database_url: os.environ/DATABASE_URL
3.2 Getting Started with Docker
# Docker Compose file
cat > docker-compose.yaml << 'EOF'
version: "3.9"
services:
  litellm:
    image: ghcr.io/berriai/litellm:main-latest
    ports:
      - "4000:4000"
    volumes:
      - ./config.yaml:/app/config.yaml
    command: ["--config", "/app/config.yaml", "--port", "4000"]
    environment:
      - OPENAI_API_KEY
      - ANTHROPIC_API_KEY
      - GEMINI_API_KEY
      - DATABASE_URL=postgresql://user:pass@db:5432/litellm
    depends_on:
      - db
  db:
    image: postgres:16
    environment:
      POSTGRES_USER: user
      POSTGRES_PASSWORD: pass
      POSTGRES_DB: litellm
    volumes:
      - pgdata:/var/lib/postgresql/data
volumes:
  pgdata:
EOF

# Launch
docker compose up -d
3.3 Testing the Proxy Server
# Health check
curl http://localhost:4000/health

# Chat Completion call
curl -X POST http://localhost:4000/v1/chat/completions \
  -H "Authorization: Bearer sk-master-key-1234" \
  -H "Content-Type: application/json" \
  -d '{
    "model": "gpt-4o",
    "messages": [{"role": "user", "content": "Hello!"}]
  }'

# List models
curl http://localhost:4000/v1/models \
  -H "Authorization: Bearer sk-master-key-1234"
4. Supported Providers and Models
LiteLLM supports 100+ LLM providers. Here are the major ones.
4.1 Major Providers
┌────────────────────┬──────────────────────────────────────┐
│ Provider │ Supported Models (examples) │
├────────────────────┼──────────────────────────────────────┤
│ OpenAI │ gpt-4o, gpt-4o-mini, o1, o3-mini │
│ Anthropic │ claude-opus-4, claude-sonnet, haiku │
│ Google (Gemini) │ gemini-2.0-flash, gemini-1.5-pro │
│ AWS Bedrock │ Claude, Titan, Llama via Bedrock │
│ Azure OpenAI │ gpt-4o (Azure hosted) │
│ Mistral AI │ mistral-large, mistral-small │
│ Groq │ llama-3.1-70b, mixtral-8x7b │
│ Together AI │ llama-3.1, CodeLlama, Qwen │
│ Ollama (Local) │ llama3.1, codellama, mistral │
│ vLLM (Self-hosted) │ Any HuggingFace model │
│ Cohere │ command-r-plus, embed models │
│ Deepseek │ deepseek-chat, deepseek-coder │
│ Fireworks AI │ llama, mixtral, etc. │
│ Perplexity │ pplx-70b-online, etc. │
└────────────────────┴──────────────────────────────────────┘
4.2 AWS Bedrock Configuration
model_list:
  - model_name: bedrock-claude
    litellm_params:
      model: bedrock/anthropic.claude-3-sonnet-20240229-v1:0
      aws_access_key_id: os.environ/AWS_ACCESS_KEY_ID
      aws_secret_access_key: os.environ/AWS_SECRET_ACCESS_KEY
      aws_region_name: us-east-1

  - model_name: bedrock-llama
    litellm_params:
      model: bedrock/meta.llama3-1-70b-instruct-v1:0
      aws_region_name: us-west-2
4.3 Azure OpenAI Configuration
model_list:
  - model_name: azure-gpt4
    litellm_params:
      model: azure/gpt-4o-deployment-name
      api_base: https://your-resource.openai.azure.com/
      api_key: os.environ/AZURE_API_KEY
      api_version: "2024-06-01"
4.4 Ollama (Local LLM) Configuration
model_list:
  - model_name: local-llama
    litellm_params:
      model: ollama/llama3.1:70b
      api_base: http://localhost:11434
      stream: true

  - model_name: local-codellama
    litellm_params:
      model: ollama/codellama:34b
      api_base: http://localhost:11434
5. Model Routing Strategies
5.1 Cost-Based Routing (Lowest Cost)
Automatically selects the cheapest model.
router_settings:
  routing_strategy: cost-based-routing

model_list:
  - model_name: general-chat
    litellm_params:
      model: gpt-4o-mini
    # Input: $0.15/1M tokens, Output: $0.60/1M tokens
  - model_name: general-chat
    litellm_params:
      model: anthropic/claude-3-haiku-20240307
    # Input: $0.25/1M tokens, Output: $1.25/1M tokens
  - model_name: general-chat
    litellm_params:
      model: gemini/gemini-2.0-flash
    # Input: $0.075/1M tokens, Output: $0.30/1M tokens
With this setup, calling general-chat will automatically select Gemini Flash as the cheapest option.
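The comparison the router performs can be sketched in plain Python. This is a toy illustration of cost-based selection, not LiteLLM internals; the prices come from the comments above and should be checked against current provider pricing.

```python
# Toy sketch of the comparison behind cost-based routing.
# Prices are per 1M tokens as (input, output), taken from the comments above.
pricing = {
    "gpt-4o-mini": (0.15, 0.60),
    "claude-3-haiku": (0.25, 1.25),
    "gemini-2.0-flash": (0.075, 0.30),
}

def cheapest(pricing, input_tokens=1000, output_tokens=500):
    """Return the deployment with the lowest estimated cost for a typical request."""
    def cost(name):
        inp, out = pricing[name]
        return (input_tokens * inp + output_tokens * out) / 1_000_000
    return min(pricing, key=cost)

print(cheapest(pricing))  # gemini-2.0-flash
```

For the request shape above (1,000 input / 500 output tokens), Gemini Flash wins at roughly $0.000225 per request.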
5.2 Latency-Based Routing (Lowest Latency)
Selects the model with the fastest response time.
router_settings:
  routing_strategy: latency-based-routing
  routing_strategy_args:
    ttl: 60  # Re-measure latency every 60 seconds
5.3 Usage-Based Routing
Routes to the model with the most available capacity based on RPM (requests per minute).
router_settings:
  routing_strategy: usage-based-routing

model_list:
  - model_name: fast-model
    litellm_params:
      model: gpt-4o-mini
      rpm: 500   # Max 500 requests per minute
  - model_name: fast-model
    litellm_params:
      model: gemini/gemini-2.0-flash
      rpm: 1000  # Max 1000 requests per minute
6. Load Balancing and Fallback
6.1 Load Balancing Configuration
Registering multiple deployments under the same model_name automatically enables load balancing.
model_list:
  # Distribute across multiple API keys for the same model
  - model_name: gpt-4o
    litellm_params:
      model: openai/gpt-4o
      api_key: sk-key-1
      rpm: 100
  - model_name: gpt-4o
    litellm_params:
      model: openai/gpt-4o
      api_key: sk-key-2
      rpm: 100

  # Distribute across different providers
  - model_name: fast-model
    litellm_params:
      model: openai/gpt-4o-mini
      rpm: 500
  - model_name: fast-model
    litellm_params:
      model: anthropic/claude-3-haiku-20240307
      rpm: 300

router_settings:
  routing_strategy: simple-shuffle  # Round-robin style
  allowed_fails: 3   # Deactivate deployment after 3 failures
  cooldown_time: 60  # Reactivate after 60 seconds
6.2 Fallback Chain Configuration
Automatically switch to an alternative model when the primary fails.
litellm_settings:
  # Fallback on API failures (429 rate limits, 5xx errors)
  fallbacks: [
    {"gpt-4o": ["anthropic/claude-sonnet-4-20250514", "gemini/gemini-1.5-pro"]},
    {"claude-sonnet": ["gpt-4o", "gemini/gemini-1.5-pro"]}
  ]

  # Fallback on content policy violations
  content_policy_fallbacks: [
    {"gpt-4o": ["anthropic/claude-sonnet-4-20250514"]}
  ]

  # Fallback when context window is exceeded
  context_window_fallbacks: [
    {"gpt-4o-mini": ["gpt-4o", "anthropic/claude-sonnet-4-20250514"]}
  ]
6.3 Fallback Flow
User Request: Chat Completion with "gpt-4o"
|
v
[1] Try gpt-4o
|
+-- Success --> Return response
|
+-- Failure (429 Rate Limit / 500 Error)
|
v
[2] Try claude-sonnet-4 (1st fallback)
|
+-- Success --> Return response
|
+-- Failure
|
v
[3] Try gemini-1.5-pro (2nd fallback)
|
+-- Success --> Return response
|
+-- Failure --> Return error
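Conceptually, the flow above is an ordered try-loop that the proxy automates for you. A minimal, provider-agnostic sketch (the `call_with_fallbacks` helper and `send` callable are hypothetical stand-ins for any completion call, not LiteLLM's API):

```python
# Conceptual sketch of a fallback chain -- not LiteLLM's implementation.
def call_with_fallbacks(models, send):
    """Try each model in order; return (model, response) for the first success."""
    errors = []
    for model in models:
        try:
            return model, send(model)
        except Exception as e:  # e.g. rate limit (429) or server error (500)
            errors.append((model, e))
    raise RuntimeError(f"All fallbacks exhausted: {errors}")

# Usage with a fake sender that fails for the first two models:
def fake_send(model):
    if model != "gemini-1.5-pro":
        raise ConnectionError("503 from " + model)
    return {"content": "ok"}

model, resp = call_with_fallbacks(
    ["gpt-4o", "claude-sonnet-4", "gemini-1.5-pro"], fake_send
)
print(model)  # gemini-1.5-pro
```

The proxy adds what this sketch omits: per-deployment cooldowns, retry budgets, and separate chains for content policy and context window failures.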
7. Virtual Keys and Team Management
7.1 Creating Virtual Keys
# Generate a new virtual key with the master key
curl -X POST http://localhost:4000/key/generate \
  -H "Authorization: Bearer sk-master-key-1234" \
  -H "Content-Type: application/json" \
  -d '{
    "key_alias": "frontend-team-key",
    "duration": "30d",
    "max_budget": 100.0,
    "models": ["gpt-4o", "claude-sonnet"],
    "max_parallel_requests": 10,
    "tpm_limit": 100000,
    "rpm_limit": 100,
    "metadata": {
      "team": "frontend",
      "environment": "production"
    }
  }'
7.2 Team Creation and Management
# Create a team
curl -X POST http://localhost:4000/team/new \
-H "Authorization: Bearer sk-master-key-1234" \
-H "Content-Type: application/json" \
-d '{
"team_alias": "ml-engineering",
"max_budget": 500.0,
"models": ["gpt-4o", "claude-sonnet", "gemini-flash"],
"members_with_roles": [
{"role": "admin", "user_id": "user-alice@company.com"},
{"role": "user", "user_id": "user-bob@company.com"}
],
"metadata": {
"department": "engineering",
"cost_center": "ENG-001"
}
}'
# Assign keys to a team
curl -X POST http://localhost:4000/key/generate \
-H "Authorization: Bearer sk-master-key-1234" \
-H "Content-Type: application/json" \
-d '{
"team_id": "team-ml-engineering-id",
"key_alias": "ml-team-prod-key",
"max_budget": 200.0
}'
7.3 Model Access Control per Key
# Model access groups in config.yaml
model_list:
- model_name: premium-model
litellm_params:
model: openai/gpt-4o
model_info:
access_groups: ["premium-tier"]
- model_name: basic-model
litellm_params:
model: openai/gpt-4o-mini
model_info:
access_groups: ["basic-tier", "premium-tier"]
8. Budget Management and Cost Tracking
8.1 Global Budget Settings
general_settings:
  max_budget: 10000.0    # Total monthly budget: $10,000
  budget_duration: 30d   # Reset every 30 days (monthly)
8.2 Per-Key/Per-Team Budget
# Update key budget
curl -X POST http://localhost:4000/key/update \
-H "Authorization: Bearer sk-master-key-1234" \
-H "Content-Type: application/json" \
-d '{
"key": "sk-virtual-key-xxx",
"max_budget": 50.0,
"budget_duration": "1m",
"soft_budget": 40.0
}'
When soft_budget is reached, an alert is sent. When max_budget is reached, requests are blocked.
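The soft/hard budget behavior just described boils down to a small decision rule. A conceptual sketch (the `budget_action` function is illustrative, not LiteLLM internals):

```python
# Conceptual sketch of soft vs hard budget enforcement.
def budget_action(spend, soft_budget, max_budget):
    """Return what the proxy does at a given spend level for a key."""
    if spend >= max_budget:
        return "block"  # hard budget reached: requests are rejected
    if spend >= soft_budget:
        return "alert"  # soft budget reached: requests pass, an alert fires
    return "allow"

print(budget_action(30.0, soft_budget=40.0, max_budget=50.0))  # allow
print(budget_action(45.0, soft_budget=40.0, max_budget=50.0))  # alert
print(budget_action(50.0, soft_budget=40.0, max_budget=50.0))  # block
```

Setting `soft_budget` comfortably below `max_budget` gives your team time to react before requests start failing.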
8.3 Cost Tracking API
# Query total spend
curl http://localhost:4000/spend/logs \
-H "Authorization: Bearer sk-master-key-1234" \
-G \
-d "start_date=2025-01-01" \
-d "end_date=2025-01-31"
# Query spend by model
curl http://localhost:4000/spend/tags \
-H "Authorization: Bearer sk-master-key-1234" \
-G \
-d "group_by=model"
# Query spend by team
curl http://localhost:4000/spend/tags \
-H "Authorization: Bearer sk-master-key-1234" \
-G \
-d "group_by=team"
8.4 Budget Alert Webhooks
general_settings:
  alerting:
    - slack
  alerting_threshold: 300  # Alert if a request hangs for more than 300 seconds
  alert_types:
    - budget_alerts
    - spend_reports
    - failed_tracking

environment_variables:
  SLACK_WEBHOOK_URL: os.environ/SLACK_WEBHOOK_URL
9. Rate Limiting
9.1 Key-Level Rate Limiting
curl -X POST http://localhost:4000/key/generate \
  -H "Authorization: Bearer sk-master-key-1234" \
  -H "Content-Type: application/json" \
  -d '{
    "key_alias": "rate-limited-key",
    "rpm_limit": 60,
    "tpm_limit": 100000,
    "max_parallel_requests": 5
  }'
9.2 Model-Level Rate Limiting
model_list:
  - model_name: gpt-4o
    litellm_params:
      model: openai/gpt-4o
      api_key: os.environ/OPENAI_API_KEY
      rpm: 200     # Max requests per minute for this deployment
      tpm: 400000  # Max tokens per minute for this deployment
9.3 Global Rate Limiting
general_settings:
  global_max_parallel_requests: 100  # Global concurrent request limit
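An rpm limit is conceptually a sliding-window counter per key or deployment. A minimal sketch of the idea (illustration only, not LiteLLM's actual implementation; the `RpmLimiter` class is hypothetical, with an injectable clock so the behavior is easy to test):

```python
import time
from collections import deque

# Conceptual sliding-window RPM limiter -- illustration only.
class RpmLimiter:
    def __init__(self, rpm, clock=time.monotonic):
        self.rpm = rpm
        self.clock = clock     # injectable for testing
        self.window = deque()  # timestamps of recently admitted requests

    def allow(self):
        now = self.clock()
        # Drop timestamps that have aged out of the 60-second window.
        while self.window and now - self.window[0] >= 60:
            self.window.popleft()
        if len(self.window) < self.rpm:
            self.window.append(now)
            return True
        return False

limiter = RpmLimiter(rpm=2, clock=lambda: 0.0)
print([limiter.allow() for _ in range(3)])  # [True, True, False]
```

The real proxy layers several of these on top of each other (per key, per team, per deployment, global), rejecting a request with HTTP 429 when any layer is exhausted.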
10. Guardrails: PII Masking and Content Filtering
10.1 PII (Personally Identifiable Information) Masking
litellm_settings:
  guardrails:
    - guardrail_name: pii-masking
      litellm_params:
        guardrail: presidio
        mode: during_call       # Mask PII before sending to LLM
        output_parse_pii: true  # Also mask PII in responses

# Request containing PII
response = client.chat.completions.create(
    model="gpt-4o",
    messages=[{
        "role": "user",
        "content": "My email is john@example.com and phone is 555-123-4567"
    }],
    extra_body={
        "metadata": {
            "guardrails": ["pii-masking"]
        }
    }
)
# LLM receives: "My email is [EMAIL] and phone is [PHONE]"
10.2 Content Filtering (Lakera Guard)
litellm_settings:
  guardrails:
    - guardrail_name: content-filter
      litellm_params:
        guardrail: lakera
        mode: pre_call  # Filter before sending request
        api_key: os.environ/LAKERA_API_KEY
10.3 Prompt Injection Defense
litellm_settings:
  guardrails:
    - guardrail_name: prompt-injection
      litellm_params:
        guardrail: lakera
        mode: pre_call
        prompt_injection: true
10.4 Custom Guardrails
# custom_guardrail.py
from litellm.proxy.guardrails.guardrail_hooks import GuardrailHook

class CustomContentFilter(GuardrailHook):
    def __init__(self):
        self.blocked_words = ["harmful", "dangerous", "illegal"]

    async def pre_call_hook(self, data, call_type):
        # Guard against requests with no messages
        messages = data.get("messages") or []
        user_message = messages[-1].get("content", "") if messages else ""
        for word in self.blocked_words:
            if word.lower() in user_message.lower():
                raise ValueError(f"Blocked content detected: {word}")
        return data

    async def post_call_hook(self, data, response, call_type):
        return response
11. Logging and Monitoring
11.1 Langfuse Integration
litellm_settings:
  success_callback: ["langfuse"]
  failure_callback: ["langfuse"]

environment_variables:
  LANGFUSE_PUBLIC_KEY: os.environ/LANGFUSE_PUBLIC_KEY
  LANGFUSE_SECRET_KEY: os.environ/LANGFUSE_SECRET_KEY
  LANGFUSE_HOST: https://cloud.langfuse.com
11.2 LangSmith Integration
litellm_settings:
  success_callback: ["langsmith"]

environment_variables:
  LANGCHAIN_API_KEY: os.environ/LANGCHAIN_API_KEY
  LANGCHAIN_PROJECT: my-litellm-project
11.3 Custom Callbacks
# custom_callback.py
from litellm.integrations.custom_logger import CustomLogger
class MyCustomLogger(CustomLogger):
async def async_log_success_event(self, kwargs, response_obj, start_time, end_time):
print(f"Model: {kwargs.get('model')}")
print(f"Cost: ${kwargs.get('response_cost', 0):.6f}")
print(f"Latency: {(end_time - start_time).total_seconds():.2f}s")
print(f"Tokens: {response_obj.usage.total_tokens}")
async def async_log_failure_event(self, kwargs, response_obj, start_time, end_time):
print(f"FAILED - Model: {kwargs.get('model')}")
print(f"Error: {kwargs.get('exception')}")
11.4 Prometheus Metrics
litellm_settings:
  success_callback: ["prometheus"]
  failure_callback: ["prometheus"]
Key metrics:
# Request counts
litellm_requests_total
litellm_requests_failed_total
# Latency
litellm_request_duration_seconds_bucket
litellm_request_duration_seconds_sum
# Cost
litellm_spend_total
# Tokens
litellm_tokens_total
litellm_input_tokens_total
litellm_output_tokens_total
12. Caching
12.1 Redis Caching
litellm_settings:
  cache: true
  cache_params:
    type: redis
    host: redis-host
    port: 6379
    password: os.environ/REDIS_PASSWORD
    ttl: 3600  # 1-hour cache
    # Which call types to cache
    supported_call_types:
      - acompletion
      - completion
12.2 In-Memory Caching
litellm_settings:
  cache: true
  cache_params:
    type: local
    ttl: 600  # 10-minute cache
12.3 Cache Control
# Use cache
response = client.chat.completions.create(
model="gpt-4o",
messages=[{"role": "user", "content": "What is 2+2?"}],
extra_body={
"metadata": {
"cache": {
"no-cache": False, # Use cache
"ttl": 3600 # TTL for this request
}
}
}
)
# Skip cache
response = client.chat.completions.create(
model="gpt-4o",
messages=[{"role": "user", "content": "What is the latest news?"}],
extra_body={
"metadata": {
"cache": {
"no-cache": True # Always get fresh response
}
}
}
)
13. Production Deployment
13.1 Kubernetes Deployment
# litellm-deployment.yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: litellm-proxy
  labels:
    app: litellm
spec:
  replicas: 3
  selector:
    matchLabels:
      app: litellm
  template:
    metadata:
      labels:
        app: litellm
    spec:
      containers:
        - name: litellm
          image: ghcr.io/berriai/litellm:main-latest
          ports:
            - containerPort: 4000
          args: ["--config", "/app/config.yaml", "--port", "4000"]
          volumeMounts:
            - name: config
              mountPath: /app/config.yaml
              subPath: config.yaml
          env:
            - name: OPENAI_API_KEY
              valueFrom:
                secretKeyRef:
                  name: litellm-secrets
                  key: openai-api-key
            - name: ANTHROPIC_API_KEY
              valueFrom:
                secretKeyRef:
                  name: litellm-secrets
                  key: anthropic-api-key
            - name: DATABASE_URL
              valueFrom:
                secretKeyRef:
                  name: litellm-secrets
                  key: database-url
          resources:
            requests:
              cpu: "500m"
              memory: "512Mi"
            limits:
              cpu: "2000m"
              memory: "2Gi"
          livenessProbe:
            httpGet:
              path: /health/liveliness
              port: 4000
            initialDelaySeconds: 30
            periodSeconds: 10
          readinessProbe:
            httpGet:
              path: /health/readiness
              port: 4000
            initialDelaySeconds: 10
            periodSeconds: 5
      volumes:
        - name: config
          configMap:
            name: litellm-config
---
apiVersion: v1
kind: Service
metadata:
  name: litellm-service
spec:
  selector:
    app: litellm
  ports:
    - port: 80
      targetPort: 4000
  type: ClusterIP
---
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: litellm-hpa
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: litellm-proxy
  minReplicas: 2
  maxReplicas: 10
  metrics:
    - type: Resource
      resource:
        name: cpu
        target:
          type: Utilization
          averageUtilization: 70
13.2 Using Helm Chart
# Install with Helm
helm repo add litellm https://litellm.github.io/litellm-helm
helm repo update

helm install litellm litellm/litellm-helm \
  --set masterKey=sk-your-master-key \
  --set replicaCount=3 \
  --set database.useExisting=true \
  --set database.url=postgresql://user:pass@host:5432/litellm \
  --namespace litellm \
  --create-namespace
13.3 Production Checklist
Production Deployment Checklist
================================
[ ] PostgreSQL DB configured and connectivity verified
[ ] Master Key set to a strong random value
[ ] All API keys managed via env vars / secrets
[ ] HTTPS/TLS configured (Ingress or LoadBalancer)
[ ] 2+ replicas configured
[ ] HPA (Horizontal Pod Autoscaler) configured
[ ] Liveness/Readiness Probes configured
[ ] Prometheus + Grafana monitoring set up
[ ] Slack/PagerDuty alerting configured
[ ] Redis cache set up (optional)
[ ] Budget and rate limits configured
[ ] Guardrails configured (if needed)
[ ] Logging (Langfuse/LangSmith) set up
[ ] Backup and recovery strategy documented
[ ] Load testing completed
14. LiteLLM vs Alternatives
14.1 LiteLLM vs OpenRouter
| Feature | LiteLLM | OpenRouter |
|---|---|---|
| Hosting | Self-hosted | Cloud service |
| Data Privacy | Full control | Third-party servers |
| Cost | Open-source free | API margin added |
| Team Management | Virtual keys/teams/budgets | Limited |
| Customization | Fully customizable | Limited |
| Setup Difficulty | Medium (Docker/K8s) | Easy (API key only) |
| Local Models | Supported (Ollama/vLLM) | Not supported |
14.2 LiteLLM vs Portkey
| Feature | LiteLLM | Portkey |
|---|---|---|
| Open Source | Yes (Apache 2.0) | Partial (Gateway only) |
| Proxy Server | Included | Included |
| Virtual Keys | Included | Included |
| AI Gateway | Basic | Advanced (Prompt mgmt) |
| Pricing | Free | Premium plans available |
| Community | Active (12K+ GitHub Stars) | Growing |
14.3 When to Choose What
Decision Guide
================================
"I want to get started quickly"
-> OpenRouter (sign up and use immediately)
"Data security is critical"
-> LiteLLM (self-hosted, full control)
"I need team/org management"
-> LiteLLM or Portkey
"I also use local models"
-> LiteLLM (native Ollama/vLLM support)
"I need enterprise features"
-> LiteLLM Enterprise or Portkey Enterprise
15. Real-World Example: Complete Production Config
Below is a complete config.yaml suitable for production use.
# production-config.yaml
model_list:
  # === Premium Tier ===
  - model_name: premium-chat
    litellm_params:
      model: openai/gpt-4o
      api_key: os.environ/OPENAI_API_KEY
      rpm: 200
      tpm: 400000
    model_info:
      access_groups: ["premium"]
  - model_name: premium-chat
    litellm_params:
      model: anthropic/claude-sonnet-4-20250514
      api_key: os.environ/ANTHROPIC_API_KEY
      rpm: 150
    model_info:
      access_groups: ["premium"]

  # === Basic Tier ===
  - model_name: basic-chat
    litellm_params:
      model: openai/gpt-4o-mini
      api_key: os.environ/OPENAI_API_KEY
      rpm: 500
    model_info:
      access_groups: ["basic", "premium"]
  - model_name: basic-chat
    litellm_params:
      model: gemini/gemini-2.0-flash
      api_key: os.environ/GEMINI_API_KEY
      rpm: 1000
    model_info:
      access_groups: ["basic", "premium"]

  # === Coding Dedicated ===
  - model_name: code-model
    litellm_params:
      model: anthropic/claude-sonnet-4-20250514
      api_key: os.environ/ANTHROPIC_API_KEY
    model_info:
      access_groups: ["premium"]

  # === Embedding ===
  - model_name: embedding
    litellm_params:
      model: openai/text-embedding-3-small
      api_key: os.environ/OPENAI_API_KEY

# === Router Settings ===
router_settings:
  routing_strategy: latency-based-routing
  allowed_fails: 3
  cooldown_time: 60
  num_retries: 3
  timeout: 120
  retry_after: 5

# === LiteLLM Settings ===
litellm_settings:
  drop_params: true
  set_verbose: false
  num_retries: 3
  request_timeout: 120

  # Fallbacks
  fallbacks: [
    {"premium-chat": ["basic-chat"]},
    {"basic-chat": ["premium-chat"]}
  ]
  context_window_fallbacks: [
    {"basic-chat": ["premium-chat"]}
  ]

  # Caching
  cache: true
  cache_params:
    type: redis
    host: os.environ/REDIS_HOST
    port: 6379
    password: os.environ/REDIS_PASSWORD
    ttl: 3600

  # Callbacks
  success_callback: ["langfuse", "prometheus"]
  failure_callback: ["langfuse", "prometheus"]

  # Guardrails
  guardrails:
    - guardrail_name: pii-masking
      litellm_params:
        guardrail: presidio
        mode: during_call

# === General Settings ===
general_settings:
  master_key: os.environ/LITELLM_MASTER_KEY
  database_url: os.environ/DATABASE_URL
  max_budget: 10000
  budget_duration: 30d
  alerting:
    - slack
  alerting_threshold: 300
  global_max_parallel_requests: 200

# === Environment Variables ===
environment_variables:
  LANGFUSE_PUBLIC_KEY: os.environ/LANGFUSE_PUBLIC_KEY
  LANGFUSE_SECRET_KEY: os.environ/LANGFUSE_SECRET_KEY
  SLACK_WEBHOOK_URL: os.environ/SLACK_WEBHOOK_URL
Quiz
Test your understanding of LiteLLM concepts.
Q1. What are the two modes of using LiteLLM, and when is each appropriate?
A1. LiteLLM has SDK Mode and Proxy Mode.
- SDK Mode: Import the Python package directly. Best for personal projects and prototypes.
- Proxy Mode: Run as a standalone HTTP server. Best for teams/organizations, multi-language clients, virtual key management, budget management, and production deployments.
The key difference is that Proxy Mode provides enterprise features like virtual keys, team management, budget management, and logging.
Q2. What is Model Fallback, and what types of fallback does LiteLLM support?
A2. Model fallback is a mechanism that automatically switches to an alternative model when the primary model fails.
LiteLLM supports three types of fallback:
- General fallbacks: Switch to alternative models on API failures (429, 500 errors)
- Content policy fallbacks: Switch when content policy is violated
- Context window fallbacks: Switch to a model with a larger context window when input exceeds limits
Q3. How does cost-based routing work in LiteLLM?
A3. When you register multiple provider models under the same model_name and set routing_strategy to cost-based-routing, LiteLLM compares the input/output token prices for each model and automatically routes to the cheapest one.
For example, if GPT-4o-mini, Claude Haiku, and Gemini Flash are all registered under the same name, the model with the lowest price per token is automatically selected. This optimizes cost while maintaining quality.
Q4. What are Virtual Keys and what are their main capabilities?
A4. Virtual Keys are API keys generated by the LiteLLM proxy. They avoid exposing actual LLM provider API keys while providing extensive control features.
Key capabilities:
- Budget limits: Set maximum budget per key (max_budget)
- Model access control: Restrict which models a key can access
- Rate limiting: Set RPM, TPM, and concurrent request limits
- Team association: Assign keys to teams for team-level management
- Duration control: Set key expiration dates
- Usage tracking: Track cost, tokens, and request counts per key
Q5. What are the top 5 must-consider items when deploying LiteLLM to production?
A5. Critical production deployment considerations:
- Database: PostgreSQL connection required (stores virtual keys, cost tracking, team management data)
- Security: Strong random Master Key, all API keys in secrets, HTTPS/TLS enabled
- High Availability: 2+ replicas, HPA configured, Liveness/Readiness Probes set up
- Monitoring: Prometheus metrics collection, Grafana dashboards, Slack/PagerDuty alerts configured
- Cost Control: Global budget, per-team/per-key budgets, soft budget alerts, rate limits configured
Additionally recommended: Redis caching, logging (Langfuse), Guardrails, and backup strategy.
References
- LiteLLM Official Documentation - https://docs.litellm.ai/
- LiteLLM GitHub Repository - https://github.com/BerriAI/litellm
- LiteLLM Proxy Server Quick Start - https://docs.litellm.ai/docs/proxy/quick_start
- LiteLLM Supported Providers - https://docs.litellm.ai/docs/providers
- LiteLLM Virtual Keys Guide - https://docs.litellm.ai/docs/proxy/virtual_keys
- LiteLLM Routing Strategies - https://docs.litellm.ai/docs/routing
- LiteLLM Budget Management - https://docs.litellm.ai/docs/proxy/users
- LiteLLM Guardrails - https://docs.litellm.ai/docs/proxy/guardrails
- LiteLLM Kubernetes Deployment - https://docs.litellm.ai/docs/proxy/deploy
- OpenRouter Official Site - https://openrouter.ai/
- Portkey AI Official Site - https://portkey.ai/
- Langfuse LLM Observability - https://langfuse.com/
- LiteLLM vs Alternatives - https://docs.litellm.ai/docs/proxy/enterprise
- Prometheus Monitoring - https://prometheus.io/