Authors
- Youngju Kim (@fjvbn20031)
Overview
As services leveraging LLMs (Large Language Models) proliferate, efficiently managing models from multiple providers has become critical. Because each provider (OpenAI, Anthropic, Azure OpenAI, AWS Bedrock) ships its own SDK and API format, code complexity grows quickly.
LiteLLM is an open-source tool that solves this problem by providing an OpenAI SDK-compatible interface to unify 100+ LLMs into a single API. This post covers everything from LiteLLM SDK usage to Proxy Server setup, cost tracking, and production deployment.
1. What is LiteLLM
1.1 Core Value
LiteLLM is an open-source project by BerriAI that provides two core components.
1. Python SDK: Call 100+ LLMs through a unified interface
litellm.completion()
|
+-- model="gpt-4o" --> OpenAI API
+-- model="claude-sonnet-4-20250514" --> Anthropic API
+-- model="azure/gpt-4o" --> Azure OpenAI
+-- model="bedrock/claude-3" --> AWS Bedrock
+-- model="vertex_ai/gemini" --> Google Vertex AI
+-- model="ollama/llama3" --> Local Ollama
2. Proxy Server (AI Gateway): OpenAI-compatible REST API server
Any OpenAI Client --> LiteLLM Proxy --> Multiple LLM Providers
(Rate Limiting, Cost Tracking, Load Balancing)
1.2 Why LiteLLM
| Problem | LiteLLM Solution |
|---|---|
| Different SDK per provider | Unified completion() function |
| Scattered API key management | Centralized management via Proxy |
| Difficulty tracking costs | Automatic cost calculation and tracking |
| Provider outages | Automatic fallback support |
| Rate limit management | Built-in rate limiting |
| Model switching costs | Switch models without code changes |
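The last row deserves emphasis: because every model is addressed by a plain string, switching providers becomes a configuration change rather than a refactor. A minimal sketch of the idea (pure Python, no API calls; the alias names and mapping are made up for illustration):

```python
# Hypothetical helper: call sites reference stable aliases; the mapping to a
# concrete LiteLLM model string lives in one place and can change freely.
MODEL_ALIASES = {
    "default": "gpt-4o",
    "cheap": "gpt-4o-mini",
    "fallback": "claude-sonnet-4-20250514",
}

def resolve_model(alias: str) -> str:
    """Translate an application-level alias into a LiteLLM model string."""
    return MODEL_ALIASES[alias]

# Call sites only ever mention the alias:
print(resolve_model("cheap"))  # -> gpt-4o-mini
```

Swapping `gpt-4o-mini` for, say, an Ollama model is then a one-line edit to the map, with no changes at any call site. The Proxy's model aliases (section 8.2) apply the same idea at the gateway level.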
1.3 Supported Providers
Commercial Providers:
- OpenAI (GPT-4o, GPT-4o-mini, o1, o3, etc.)
- Anthropic (Claude Sonnet, Claude Haiku, Claude Opus)
- Azure OpenAI
- AWS Bedrock (Claude, Titan, Llama)
- Google Vertex AI (Gemini)
- Google AI Studio
- Cohere (Command R+)
- Mistral AI
- Together AI
- Groq
- Fireworks AI
- Perplexity
- DeepSeek
Self-Hosted / Local:
- Ollama
- vLLM
- Hugging Face TGI
- NVIDIA NIM
- OpenAI-compatible endpoints
2. LiteLLM SDK Usage
2.1 Installation
pip install litellm
2.2 Basic Usage: completion()
import litellm

# OpenAI
response = litellm.completion(
    model="gpt-4o",
    messages=[
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": "What is Kubernetes?"},
    ],
    temperature=0.7,
    max_tokens=1000,
)
print(response.choices[0].message.content)

# Anthropic Claude
response = litellm.completion(
    model="claude-sonnet-4-20250514",
    messages=[
        {"role": "user", "content": "Explain microservices architecture."},
    ],
    max_tokens=1000,
)
print(response.choices[0].message.content)

# Azure OpenAI
response = litellm.completion(
    model="azure/gpt-4o-deployment",
    messages=[
        {"role": "user", "content": "Hello!"},
    ],
    api_base="https://my-resource.openai.azure.com",
    api_version="2024-02-15-preview",
    api_key="your-azure-key",
)

# AWS Bedrock
response = litellm.completion(
    model="bedrock/anthropic.claude-3-sonnet-20240229-v1:0",
    messages=[
        {"role": "user", "content": "Summarize this text."},
    ],
)

# Ollama (local)
response = litellm.completion(
    model="ollama/llama3",
    messages=[
        {"role": "user", "content": "Write a Python function."},
    ],
    api_base="http://localhost:11434",
)
2.3 Streaming
import litellm

# Synchronous streaming
response = litellm.completion(
    model="gpt-4o",
    messages=[{"role": "user", "content": "Write a long essay."}],
    stream=True,
)
for chunk in response:
    content = chunk.choices[0].delta.content
    if content:
        print(content, end="", flush=True)
2.4 Async Calls
import asyncio

import litellm

async def main():
    # Single async call
    response = await litellm.acompletion(
        model="gpt-4o",
        messages=[{"role": "user", "content": "Hello!"}],
    )
    print(response.choices[0].message.content)

    # Async streaming
    response = await litellm.acompletion(
        model="claude-sonnet-4-20250514",
        messages=[{"role": "user", "content": "Explain Docker."}],
        stream=True,
    )
    async for chunk in response:
        content = chunk.choices[0].delta.content
        if content:
            print(content, end="", flush=True)

    # Parallel calls
    tasks = [
        litellm.acompletion(
            model="gpt-4o-mini",
            messages=[{"role": "user", "content": f"Question {i}"}],
        )
        for i in range(5)
    ]
    responses = await asyncio.gather(*tasks)
    for resp in responses:
        print(resp.choices[0].message.content[:50])

asyncio.run(main())
2.5 Function Calling (Tool Use)
import json

import litellm

tools = [
    {
        "type": "function",
        "function": {
            "name": "get_weather",
            "description": "Get the current weather for a location",
            "parameters": {
                "type": "object",
                "properties": {
                    "location": {
                        "type": "string",
                        "description": "City name",
                    },
                    "unit": {
                        "type": "string",
                        "enum": ["celsius", "fahrenheit"],
                    },
                },
                "required": ["location"],
            },
        },
    }
]

# Same interface for both OpenAI and Claude
for model in ["gpt-4o", "claude-sonnet-4-20250514"]:
    response = litellm.completion(
        model=model,
        messages=[
            {"role": "user", "content": "What's the weather in Seoul?"}
        ],
        tools=tools,
        tool_choice="auto",
    )
    if response.choices[0].message.tool_calls:
        tool_call = response.choices[0].message.tool_calls[0]
        print(f"Model: {model}")
        print(f"Function: {tool_call.function.name}")
        # arguments arrives as a JSON string; parse it with json.loads if needed
        print(f"Arguments: {tool_call.function.arguments}")
2.6 Embedding
import litellm

# OpenAI embedding
response = litellm.embedding(
    model="text-embedding-3-small",
    input=["Hello world", "How are you?"],
)
print(f"Embedding dimension: {len(response.data[0]['embedding'])}")

# Cohere embedding
response = litellm.embedding(
    model="cohere/embed-english-v3.0",
    input=["Search query text"],
    input_type="search_query",
)

# Bedrock embedding
response = litellm.embedding(
    model="bedrock/amazon.titan-embed-text-v2:0",
    input=["Document text for embedding"],
)
2.7 Image/Vision Models
import litellm

# GPT-4o Vision
response = litellm.completion(
    model="gpt-4o",
    messages=[
        {
            "role": "user",
            "content": [
                {"type": "text", "text": "What is in this image?"},
                {
                    "type": "image_url",
                    "image_url": {
                        "url": "https://example.com/image.png",
                    },
                },
            ],
        }
    ],
)

# Claude Vision
response = litellm.completion(
    model="claude-sonnet-4-20250514",
    messages=[
        {
            "role": "user",
            "content": [
                {"type": "text", "text": "Describe this architecture diagram."},
                {
                    "type": "image_url",
                    "image_url": {
                        "url": "data:image/png;base64,iVBORw0KGgo...",
                    },
                },
            ],
        }
    ],
)
3. LiteLLM Proxy Server (AI Gateway)
3.1 What is the Proxy
LiteLLM Proxy is a self-hostable OpenAI-compatible API Gateway. Any existing client using the OpenAI SDK can connect to the Proxy without code changes.
+-------------------+
|  Your Application |
|   (OpenAI SDK)    |
+---------+---------+
          |
          v
+---------+---------+
|   LiteLLM Proxy   |
|  - Rate Limiting  |
|  - Cost Tracking  |
|  - Load Balancing |
|  - Fallback       |
|  - Key Management |
+---------+---------+
          |
   +------+------+------+------+
   |      |      |      |      |
   v      v      v      v      v
OpenAI  Azure  Anthropic  Bedrock  Ollama
3.2 Installation and Running
# pip install
pip install 'litellm[proxy]'
# Basic run
litellm --model gpt-4o --port 4000
# Run with config file
litellm --config config.yaml --port 4000
# Docker run
docker run -d \
--name litellm-proxy \
-p 4000:4000 \
-v ./config.yaml:/app/config.yaml \
-e OPENAI_API_KEY=sk-xxx \
-e ANTHROPIC_API_KEY=sk-ant-xxx \
ghcr.io/berriai/litellm:main-latest \
--config /app/config.yaml
3.3 config.yaml Configuration
# config.yaml
model_list:
  # OpenAI models
  - model_name: gpt-4o
    litellm_params:
      model: openai/gpt-4o
      api_key: os.environ/OPENAI_API_KEY
      rpm: 500      # Rate limit: requests per minute
      tpm: 100000   # Rate limit: tokens per minute

  # Claude models (load balancing across deployments)
  - model_name: claude-sonnet
    litellm_params:
      model: anthropic/claude-sonnet-4-20250514
      api_key: os.environ/ANTHROPIC_API_KEY
      rpm: 200
      tpm: 80000

  # Azure OpenAI
  - model_name: gpt-4o
    litellm_params:
      model: azure/gpt-4o-deployment
      api_base: https://my-resource.openai.azure.com
      api_version: '2024-02-15-preview'
      api_key: os.environ/AZURE_API_KEY
      rpm: 300

  # AWS Bedrock Claude
  - model_name: bedrock-claude
    litellm_params:
      model: bedrock/anthropic.claude-3-sonnet-20240229-v1:0
      aws_access_key_id: os.environ/AWS_ACCESS_KEY_ID
      aws_secret_access_key: os.environ/AWS_SECRET_ACCESS_KEY
      aws_region_name: us-east-1

  # Local Ollama
  - model_name: local-llama
    litellm_params:
      model: ollama/llama3
      api_base: http://ollama:11434

# Router settings
router_settings:
  routing_strategy: 'latency-based-routing'
  num_retries: 3
  timeout: 60
  allowed_fails: 2
  cooldown_time: 30

# General settings
general_settings:
  master_key: sk-master-key-1234
  database_url: os.environ/DATABASE_URL
  store_model_in_db: true

litellm_settings:
  drop_params: true
  set_verbose: false
  cache: true
  cache_params:
    type: redis
    host: redis
    port: 6379
3.4 Model Routing and Load Balancing
# Registering multiple deployments under the same model_name enables automatic load balancing
model_list:
  # gpt-4o group: 3 deployments
  - model_name: gpt-4o
    litellm_params:
      model: openai/gpt-4o
      api_key: os.environ/OPENAI_API_KEY_1
  - model_name: gpt-4o
    litellm_params:
      model: azure/gpt-4o-east
      api_base: https://east.openai.azure.com
      api_key: os.environ/AZURE_KEY_EAST
  - model_name: gpt-4o
    litellm_params:
      model: azure/gpt-4o-west
      api_base: https://west.openai.azure.com
      api_key: os.environ/AZURE_KEY_WEST

router_settings:
  # Routing strategy
  routing_strategy: 'latency-based-routing'
  # Options:
  #   simple-shuffle: Random selection
  #   least-busy: Fewest in-progress requests
  #   usage-based-routing: Based on TPM/RPM usage
  #   latency-based-routing: Based on response time (recommended)
Routing Strategy Comparison:
| Strategy | Description | Best For |
|---|---|---|
| simple-shuffle | Random distribution | All deployments have similar performance |
| least-busy | Based on in-progress request count | Varying request processing times |
| usage-based-routing | Based on RPM/TPM usage | Approaching rate limits |
| latency-based-routing | Based on response time | Latency optimization is critical |
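To build intuition for what latency-based routing does, here is a toy sketch of the idea: track an exponential moving average of response latency per deployment and send the next request to the fastest one. This illustrates the concept only; it is not LiteLLM's actual implementation, and the deployment names are examples:

```python
# Toy latency-based router: EMA of observed latency per deployment,
# next request goes to the deployment with the lowest average.
class LatencyRouter:
    def __init__(self, deployments, alpha=0.3):
        # 0.0 means "no observation yet"
        self.ema = {d: 0.0 for d in deployments}
        self.alpha = alpha

    def record(self, deployment, latency_s):
        """Fold a new latency observation into the moving average."""
        prev = self.ema[deployment]
        self.ema[deployment] = latency_s if prev == 0.0 else (
            self.alpha * latency_s + (1 - self.alpha) * prev)

    def pick(self):
        """Choose the deployment with the lowest average latency."""
        return min(self.ema, key=self.ema.get)

router = LatencyRouter(["openai/gpt-4o", "azure/gpt-4o-east", "azure/gpt-4o-west"])
router.record("openai/gpt-4o", 1.2)
router.record("azure/gpt-4o-east", 0.4)
router.record("azure/gpt-4o-west", 0.9)
print(router.pick())  # -> azure/gpt-4o-east
```

The real router adds cooldowns for failing deployments and respects per-deployment RPM/TPM limits on top of this basic selection.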
3.5 Fallback Configuration
model_list:
  - model_name: primary-model
    litellm_params:
      model: openai/gpt-4o
      api_key: os.environ/OPENAI_API_KEY
  - model_name: fallback-model
    litellm_params:
      model: anthropic/claude-sonnet-4-20250514
      api_key: os.environ/ANTHROPIC_API_KEY

router_settings:
  num_retries: 2
  timeout: 30
  fallbacks: [{'primary-model': ['fallback-model']}]
  # Retry only on specific errors
  retry_policy:
    RateLimitError: 3              # Retry 3 times on 429 errors
    ContentPolicyViolationError: 0 # No retry on content policy violations
    AuthenticationError: 0         # No retry on auth errors
3.6 API Key Management (Virtual Keys)
# Generate virtual key with master key
curl -X POST http://localhost:4000/key/generate \
-H "Authorization: Bearer sk-master-key-1234" \
-H "Content-Type: application/json" \
-d '{
"models": ["gpt-4o", "claude-sonnet"],
"max_budget": 100.0,
"budget_duration": "monthly",
"metadata": {
"team": "backend",
"user": "developer-1"
},
"tpm_limit": 50000,
"rpm_limit": 100
}'
Response:
{
  "key": "sk-generated-key-abc123",
  "expires": "2026-04-20T00:00:00Z",
  "max_budget": 100.0,
  "models": ["gpt-4o", "claude-sonnet"]
}
# Make API calls with the generated key
from openai import OpenAI

client = OpenAI(
    api_key="sk-generated-key-abc123",
    base_url="http://localhost:4000",
)
response = client.chat.completions.create(
    model="gpt-4o",
    messages=[{"role": "user", "content": "Hello!"}],
)
3.7 Rate Limiting Configuration
# Rate limiting in config.yaml
model_list:
  - model_name: gpt-4o
    litellm_params:
      model: openai/gpt-4o
      api_key: os.environ/OPENAI_API_KEY
      rpm: 500      # Deployment-level RPM
      tpm: 100000   # Deployment-level TPM

general_settings:
  master_key: sk-master-key-1234
  database_url: os.environ/DATABASE_URL
# Per-key rate limiting
curl -X POST http://localhost:4000/key/generate \
-H "Authorization: Bearer sk-master-key-1234" \
-H "Content-Type: application/json" \
-d '{
"rpm_limit": 50,
"tpm_limit": 20000,
"max_budget": 10.0,
"budget_duration": "daily"
}'
# Per-team rate limiting
curl -X POST http://localhost:4000/team/new \
-H "Authorization: Bearer sk-master-key-1234" \
-H "Content-Type: application/json" \
-d '{
"team_alias": "backend-team",
"rpm_limit": 200,
"tpm_limit": 80000,
"max_budget": 500.0,
"budget_duration": "monthly"
}'
3.8 Budget Management
# Set per-user budget
curl -X POST http://localhost:4000/user/new \
-H "Authorization: Bearer sk-master-key-1234" \
-H "Content-Type: application/json" \
-d '{
"user_id": "user-123",
"max_budget": 50.0,
"budget_duration": "monthly",
"models": ["gpt-4o-mini", "claude-sonnet"]
}'
# Check budget usage
curl http://localhost:4000/user/info?user_id=user-123 \
-H "Authorization: Bearer sk-master-key-1234"
3.9 Caching
# config.yaml
litellm_settings:
  cache: true
  cache_params:
    type: redis
    host: redis
    port: 6379
    ttl: 3600  # 1-hour cache
# Cache control from the client
from openai import OpenAI

client = OpenAI(
    api_key="sk-key",
    base_url="http://localhost:4000",
)

# Use cache (default)
response = client.chat.completions.create(
    model="gpt-4o",
    messages=[{"role": "user", "content": "What is Python?"}],
)

# Skip cache
response = client.chat.completions.create(
    model="gpt-4o",
    messages=[{"role": "user", "content": "What is Python?"}],
    extra_body={"cache": {"no-cache": True}},
)
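Conceptually, a response cache like this keys each entry on the request itself: identical model and messages hit the same entry. A rough sketch of one way such a key can be derived (illustrative only; LiteLLM's internal key scheme may differ):

```python
# Illustrative cache-key derivation: hash the model plus the normalized
# message list, so identical requests map to the same Redis key.
import hashlib
import json

def cache_key(model: str, messages: list) -> str:
    payload = json.dumps({"model": model, "messages": messages}, sort_keys=True)
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()

k1 = cache_key("gpt-4o", [{"role": "user", "content": "What is Python?"}])
k2 = cache_key("gpt-4o", [{"role": "user", "content": "What is Python?"}])
k3 = cache_key("gpt-4o", [{"role": "user", "content": "What is Go?"}])
print(k1 == k2, k1 == k3)  # -> True False
```

This also explains why exact-match caching only helps for repeated identical prompts; even a one-character difference produces a different key.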
4. Cost Tracking
4.1 Automatic Cost Calculation
LiteLLM automatically calculates the cost of each request.
import litellm
from litellm import completion_cost

response = litellm.completion(
    model="gpt-4o",
    messages=[{"role": "user", "content": "Hello!"}],
)

# Token usage
print(f"Input tokens: {response.usage.prompt_tokens}")
print(f"Output tokens: {response.usage.completion_tokens}")

# LiteLLM cost calculation
cost = completion_cost(completion_response=response)
print(f"Cost: ${cost:.6f}")
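Under the hood this is just token counts multiplied by per-token prices, which LiteLLM looks up from its bundled model pricing map. A self-contained sketch of the arithmetic (the rates below are placeholders for illustration, not authoritative prices):

```python
# Hypothetical pricing table: USD per 1M tokens as (input, output).
# Real prices come from LiteLLM's model cost map and change over time.
PRICES_PER_1M = {
    "gpt-4o": (2.50, 10.00),
}

def estimate_cost(model: str, prompt_tokens: int, completion_tokens: int) -> float:
    """Cost = input tokens * input rate + output tokens * output rate."""
    inp, out = PRICES_PER_1M[model]
    return (prompt_tokens * inp + completion_tokens * out) / 1_000_000

# 1000 input tokens + 500 output tokens:
print(f"${estimate_cost('gpt-4o', 1000, 500):.6f}")  # -> $0.007500
```

The proxy runs the same calculation per request and persists the result to its database, which is what powers the spend endpoints in the next section.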
4.2 Cost Queries via Proxy
# Total spend
curl http://localhost:4000/global/spend \
-H "Authorization: Bearer sk-master-key-1234"
# Per-key spend
curl "http://localhost:4000/global/spend?api_key=sk-key-abc" \
-H "Authorization: Bearer sk-master-key-1234"
# Per-model spend
curl "http://localhost:4000/global/spend?model=gpt-4o" \
-H "Authorization: Bearer sk-master-key-1234"
# Per-team spend
curl "http://localhost:4000/team/info?team_id=team-backend" \
-H "Authorization: Bearer sk-master-key-1234"
# Spend by date range
curl "http://localhost:4000/global/spend/logs?start_date=2026-03-01&end_date=2026-03-20" \
-H "Authorization: Bearer sk-master-key-1234"
4.3 Budget Alert Configuration
# config.yaml
general_settings:
  alerting:
    - slack
  alerting_threshold: 300  # Alert if no response within 300 seconds
  alert_types:
    - budget_alerts    # On budget exceeded
    - spend_reports    # Weekly/monthly cost reports
    - failed_tracking  # Failed request tracking

environment_variables:
  SLACK_WEBHOOK_URL: os.environ/SLACK_WEBHOOK_URL
5. Production Deployment
5.1 Docker Compose
# docker-compose.yml
version: '3.8'

services:
  litellm:
    image: ghcr.io/berriai/litellm:main-latest
    container_name: litellm-proxy
    ports:
      - '4000:4000'
    volumes:
      - ./config.yaml:/app/config.yaml
    environment:
      - OPENAI_API_KEY=sk-xxx
      - ANTHROPIC_API_KEY=sk-ant-xxx
      - AZURE_API_KEY=xxx
      - AWS_ACCESS_KEY_ID=xxx
      - AWS_SECRET_ACCESS_KEY=xxx
      - DATABASE_URL=postgresql://litellm:password@postgres:5432/litellm
      - REDIS_HOST=redis
      - REDIS_PORT=6379
    command: --config /app/config.yaml --port 4000
    depends_on:
      postgres:
        condition: service_healthy
      redis:
        condition: service_started
    healthcheck:
      test: ['CMD', 'curl', '-f', 'http://localhost:4000/health']
      interval: 30s
      timeout: 10s
      retries: 3

  postgres:
    image: postgres:16-alpine
    container_name: litellm-postgres
    environment:
      POSTGRES_DB: litellm
      POSTGRES_USER: litellm
      POSTGRES_PASSWORD: password
    volumes:
      - postgres_data:/var/lib/postgresql/data
    healthcheck:
      test: ['CMD-SHELL', 'pg_isready -U litellm']
      interval: 5s
      timeout: 5s
      retries: 5

  redis:
    image: redis:7-alpine
    container_name: litellm-redis
    ports:
      - '6379:6379'
    volumes:
      - redis_data:/data

volumes:
  postgres_data:
  redis_data:
5.2 Kubernetes Helm Chart
# Add Helm repository
helm repo add litellm https://berriai.github.io/litellm/
helm repo update
# Install
helm install litellm litellm/litellm-helm \
--namespace litellm \
--create-namespace \
--values values.yaml
# values.yaml
replicaCount: 3

image:
  repository: ghcr.io/berriai/litellm
  tag: main-latest

service:
  type: ClusterIP
  port: 4000

ingress:
  enabled: true
  className: nginx
  hosts:
    - host: litellm.internal.company.com
      paths:
        - path: /
          pathType: Prefix

resources:
  requests:
    cpu: 500m
    memory: 512Mi
  limits:
    cpu: 2000m
    memory: 2Gi

env:
  - name: OPENAI_API_KEY
    valueFrom:
      secretKeyRef:
        name: litellm-secrets
        key: openai-api-key
  - name: ANTHROPIC_API_KEY
    valueFrom:
      secretKeyRef:
        name: litellm-secrets
        key: anthropic-api-key
  - name: DATABASE_URL
    valueFrom:
      secretKeyRef:
        name: litellm-secrets
        key: database-url

postgresql:
  enabled: true
  auth:
    database: litellm
    username: litellm

redis:
  enabled: true
5.3 Health Check and Metrics
# Health Check
curl http://localhost:4000/health
# Prometheus Metrics
curl http://localhost:4000/metrics
Key Prometheus Metrics:
- litellm_requests_total: Total request count
- litellm_request_duration_seconds: Request processing time
- litellm_tokens_total: Total token usage
- litellm_spend_total: Total spend
- litellm_errors_total: Error count
- litellm_cache_hits_total: Cache hit count
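The /metrics endpoint serves the standard Prometheus text exposition format, so any Prometheus-compatible scraper can consume it. A small sketch of reading it without extra dependencies (the sample lines are illustrative, not verbatim proxy output):

```python
# Minimal parser for the Prometheus text exposition format:
# skip comments, split each sample on its last space into name+labels and value.
def parse_metrics(text: str) -> dict:
    out = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        name, _, value = line.rpartition(" ")
        out[name] = float(value)
    return out

sample = """\
# HELP litellm_requests_total Total request count
litellm_requests_total{model="gpt-4o"} 1523
litellm_spend_total{model="gpt-4o"} 12.75
"""
metrics = parse_metrics(sample)
print(metrics['litellm_requests_total{model="gpt-4o"}'])  # -> 1523.0
```

In practice you would point a Prometheus scrape job at the endpoint rather than parse it by hand; the sketch just shows what the scraped data looks like.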
5.4 Logging Integration
# config.yaml - external logging service integration
litellm_settings:
  success_callback: ['langfuse']
  failure_callback: ['langfuse']

environment_variables:
  LANGFUSE_PUBLIC_KEY: os.environ/LANGFUSE_PUBLIC_KEY
  LANGFUSE_SECRET_KEY: os.environ/LANGFUSE_SECRET_KEY
  LANGFUSE_HOST: https://cloud.langfuse.com
Supported Logging Services:
| Service | Purpose |
|---|---|
| Langfuse | LLM observability, prompt management |
| Helicone | Request logging, cost analysis |
| Lunary | LLM monitoring |
| Custom Callback | Integration with custom logging systems |
# Custom callback example
import litellm
from litellm import completion_cost

def my_custom_callback(kwargs, completion_response, start_time, end_time):
    # Called for every successful request
    model = kwargs.get("model")
    messages = kwargs.get("messages")
    cost = completion_cost(completion_response=completion_response)
    # Custom logic (DB storage, alerts, etc.); log_to_database is your own function
    log_to_database(
        model=model,
        cost=cost,
        latency=(end_time - start_time).total_seconds(),
        tokens=completion_response.usage.total_tokens,
    )

litellm.success_callback = [my_custom_callback]
6. Real-World Use Cases
6.1 Enterprise AI Gateway
Centralize all LLM calls across the organization through LiteLLM Proxy.
+------------------+
| Frontend App     |----+
+------------------+    |
+------------------+    |     +----------------+      +-----------+
| Backend Service  |----+---->| LiteLLM Proxy  |----->| OpenAI    |
+------------------+    |     | - Auth         |      +-----------+
+------------------+    |     | - Rate Limit   |      +-----------+
| Data Pipeline    |----+     | - Cost Track   |----->| Anthropic |
+------------------+    |     | - Audit Log    |      +-----------+
+------------------+    |     +--------+-------+      +-----------+
| Internal Tools   |----+              |              | Azure     |
+------------------+                   v              +-----------+
                              +----------------+
                              | PostgreSQL     |
                              | (spend logs)   |
                              +----------------+
6.2 A/B Testing
from openai import OpenAI
import random

client = OpenAI(
    api_key="sk-proxy-key",
    base_url="http://litellm-proxy:4000",
)

def get_completion_with_ab_test(prompt: str, test_name: str):
    # 50/50 A/B test
    model = random.choice(["gpt-4o", "claude-sonnet"])
    response = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": prompt}],
        extra_body={
            "metadata": {
                "test_name": test_name,
                "variant": model,
            }
        },
    )
    return {
        "model": model,
        "content": response.choices[0].message.content,
        "tokens": response.usage.total_tokens,
    }
6.3 Cost-Optimized Routing
def smart_route(prompt: str, complexity: str = "auto"):
    """Select an appropriate model based on prompt complexity."""
    if complexity == "auto":
        # Simple heuristic based on word count and keywords
        word_count = len(prompt.split())
        if word_count < 50:
            complexity = "simple"
        elif any(kw in prompt.lower() for kw in
                 ["analyze", "compare", "complex", "detailed"]):
            complexity = "complex"
        else:
            complexity = "medium"

    model_map = {
        "simple": "gpt-4o-mini",    # Cheap model
        "medium": "claude-sonnet",  # Mid performance/price
        "complex": "gpt-4o",        # High-performance model
    }
    model = model_map[complexity]
    response = client.chat.completions.create(  # client from section 6.2
        model=model,
        messages=[{"role": "user", "content": prompt}],
    )
    return response
6.4 Disaster Recovery (Automatic Failover)
# config.yaml - Multi-provider failover
model_list:
# Primary: OpenAI
- model_name: main-model
litellm_params:
model: openai/gpt-4o
api_key: os.environ/OPENAI_API_KEY
# Secondary: Azure OpenAI (different region)
- model_name: main-model-fallback-1
litellm_params:
model: azure/gpt-4o
api_base: https://eastus.openai.azure.com
api_key: os.environ/AZURE_KEY
# Tertiary: Anthropic Claude
- model_name: main-model-fallback-2
litellm_params:
model: anthropic/claude-sonnet-4-20250514
api_key: os.environ/ANTHROPIC_API_KEY
router_settings:
fallbacks: [{ 'main-model': ['main-model-fallback-1', 'main-model-fallback-2'] }]
num_retries: 2
timeout: 30
allowed_fails: 3
cooldown_time: 60 # 60 second cooldown for failed models
7. Comparison: LiteLLM vs Alternatives
7.1 Tool Comparison
| Feature | LiteLLM | LangChain | OpenRouter | Portkey |
|---|---|---|---|---|
| Type | Gateway + SDK | Framework | Hosted API | Hosted Gateway |
| Hosting | Self-hosted | N/A (library) | Cloud | Cloud + Self |
| Model Count | 100+ | Various | 200+ | 250+ |
| Cost Tracking | Built-in | Requires custom impl | Yes | Yes |
| Rate Limiting | Built-in | None | Yes | Yes |
| Load Balancing | Built-in | None | Yes | Yes |
| Fallback | Built-in | Manual implementation | Yes | Yes |
| API Key Mgmt | Virtual Keys | None | None | Yes |
| Pricing | Free (OSS) | Free (OSS) | Markup | Free + Enterprise |
| Data Privacy | Full control | Full control | Third-party | Third-party |
7.2 When to Choose Which Tool
Choose LiteLLM when:
- Data privacy is important (finance, healthcare, government)
- Must operate on own infrastructure
- Cost tracking and rate limiting are needed
- Already using multiple providers
Choose LangChain when:
- Building complex LLM pipelines (RAG, Agents)
- Need prompt chaining, memory management
- (LiteLLM and LangChain can be used together)
Choose OpenRouter when:
- Rapid prototyping
- Don't want to manage infrastructure
- Single API key for all models
Choose Portkey when:
- Enterprise-level management UI needed
- Advanced features like guardrails, A/B testing needed
- Prefer managed services
8. Practical Tips
8.1 Environment Variable Management
# .env file (never commit to Git)
OPENAI_API_KEY=sk-xxx
ANTHROPIC_API_KEY=sk-ant-xxx
AZURE_API_KEY=xxx
AWS_ACCESS_KEY_ID=xxx
AWS_SECRET_ACCESS_KEY=xxx
DATABASE_URL=postgresql://litellm:password@localhost:5432/litellm
LITELLM_MASTER_KEY=sk-master-key-change-me
8.2 Model Alias Configuration
# config.yaml
model_list:
  - model_name: fast
    litellm_params:
      model: gpt-4o-mini
      api_key: os.environ/OPENAI_API_KEY
  - model_name: smart
    litellm_params:
      model: gpt-4o
      api_key: os.environ/OPENAI_API_KEY
  - model_name: creative
    litellm_params:
      model: anthropic/claude-sonnet-4-20250514
      api_key: os.environ/ANTHROPIC_API_KEY
# Call with meaningful names
response = client.chat.completions.create(
    model="fast",  # gpt-4o-mini
    messages=[{"role": "user", "content": "Quick question"}],
)
response = client.chat.completions.create(
    model="smart",  # gpt-4o
    messages=[{"role": "user", "content": "Complex analysis"}],
)
8.3 Error Handling Patterns
import time

from openai import OpenAI, APIError, APITimeoutError, RateLimitError

client = OpenAI(
    api_key="sk-proxy-key",
    base_url="http://litellm-proxy:4000",
)

def safe_completion(messages, model="gpt-4o", max_retries=3):
    for attempt in range(max_retries):
        try:
            return client.chat.completions.create(
                model=model,
                messages=messages,
                timeout=30,
            )
        except RateLimitError:
            # The proxy enforces rate limits, but the client should back off too
            wait = 2 ** attempt
            print(f"Rate limited, waiting {wait}s...")
            time.sleep(wait)
        except APITimeoutError:
            print(f"Timeout on attempt {attempt + 1}")
            if attempt == max_retries - 1:
                raise
        except APIError as e:
            print(f"API error: {e}")
            raise
    return None
9. Conclusion
LiteLLM serves as an essential AI Gateway in multi-LLM environments.
Key Takeaways:
- Unified SDK: Call 100+ LLMs through a single completion() function
- Proxy Server: Centralized management via OpenAI-compatible API Gateway
- Cost Control: Automatic cost tracking, budget management, alerts
- Reliability: Built-in load balancing, fallback, rate limiting
- Production: Docker/Kubernetes deployment, Prometheus monitoring, external logging
Especially in enterprise environments that use multiple LLM providers, deploying LiteLLM Proxy brings API key management, cost tracking, and failure recovery under consistent, centralized control.
References
- LiteLLM Official Docs: https://docs.litellm.ai/
- LiteLLM GitHub: https://github.com/BerriAI/litellm
- LiteLLM Proxy Config Guide: https://docs.litellm.ai/docs/proxy/configs
- LiteLLM Docker Deployment: https://docs.litellm.ai/docs/proxy/deploy