- Authors
  - Youngju Kim (@fjvbn20031)
- Introduction: Why LiteLLM?
- 1. LiteLLM Architecture: SDK vs Proxy
- 2. SDK Usage: Unified LLM Calls
- 3. Proxy Server Configuration
- 4. Supported Providers and Models
- 5. Model Routing Strategies
- 6. Load Balancing and Fallback
- 7. Virtual Keys and Team Management
- 8. Budget Management and Cost Tracking
- 9. Rate Limiting
- 10. Guardrails: PII Masking and Content Filtering
- 11. Logging and Monitoring
- 12. Caching
- 13. Production Deployment
- 14. LiteLLM vs Alternatives
- 15. Real-World Example: Complete Production Config
- Quiz
- References
Introduction: Why LiteLLM?
In 2025, the AI landscape is evolving faster than ever. New models launch every month: OpenAI's GPT-4o, Anthropic's Claude Opus 4, Google's Gemini 2.0, Meta's Llama 3.1, Mistral's Large, and more. Each provider has its own API format, authentication scheme, and pricing model.
Organizations face three critical challenges in this environment.
1. Vendor Lock-In
When you build against one provider's API, switching to a better model incurs significant refactoring costs.
# Problem: Each provider has a different API format
# OpenAI
import openai
response = openai.ChatCompletion.create(model="gpt-4", messages=[...])
# Anthropic
import anthropic
response = anthropic.messages.create(model="claude-3-opus", messages=[...])
# Google
import google.generativeai as genai
response = genai.GenerativeModel("gemini-pro").generate_content(...)
2. Cost Management Complexity
With multiple models in play, tracking costs across providers, each with its own pricing structure, quickly becomes a nightmare.
3. Availability and Reliability
Depending on a single provider means a service outage takes your entire system down.
LiteLLM solves all three problems with a single solution.
┌──────────────────────────────────────────────────┐
│ Your Application │
│ (OpenAI SDK Format) │
└─────────────────────┬────────────────────────────┘
│
▼
┌──────────────────────────────────────────────────┐
│ LiteLLM Proxy Server │
│ ┌──────────┐ ┌──────────┐ ┌──────────────────┐ │
│ │ Cost │ │ Load │ │ Virtual Keys │ │
│ │ Tracking│ │ Balance │ │ + Budget Mgmt │ │
│ └──────────┘ └──────────┘ └──────────────────┘ │
└───┬──────────────┬──────────────┬────────────────┘
│ │ │
▼ ▼ ▼
┌────────┐ ┌──────────┐ ┌───────────┐
│ OpenAI │ │ Anthropic│ │ Google │
│ GPT-4o │ │ Claude │ │ Gemini │
└────────┘ └──────────┘ └───────────┘
▲ ▲ ▲
│ │ │
┌────────┐ ┌──────────┐ ┌───────────┐
│ AWS │ │ Azure │ │ Ollama │
│Bedrock │ │ OpenAI │ │ (Local) │
└────────┘ └──────────┘ └───────────┘
1. LiteLLM Architecture: SDK vs Proxy
LiteLLM operates in two modes.
1.1 SDK Mode (Python Package)
Import directly into your application. Best for simple projects and prototypes.
# pip install litellm
from litellm import completion

# Call OpenAI
response = completion(
    model="gpt-4o",
    messages=[{"role": "user", "content": "Hello!"}]
)

# Call Anthropic (same format!)
response = completion(
    model="anthropic/claude-sonnet-4-20250514",
    messages=[{"role": "user", "content": "Hello!"}]
)

# Call Google Gemini (same format!)
response = completion(
    model="gemini/gemini-2.0-flash",
    messages=[{"role": "user", "content": "Hello!"}]
)
1.2 Proxy Mode (Server)
Run as a standalone server so all applications use a single endpoint. Best for teams and organizations.
# Start proxy server
litellm --config config.yaml --port 4000

# Any language can call using OpenAI SDK format
import openai

client = openai.OpenAI(
    api_key="sk-litellm-your-virtual-key",
    base_url="http://localhost:4000"
)

response = client.chat.completions.create(
    model="gpt-4o",
    messages=[{"role": "user", "content": "Hello!"}]
)
1.3 SDK vs Proxy Comparison
| Feature | SDK Mode | Proxy Mode |
|---|---|---|
| Usage | Python package import | HTTP Server (REST API) |
| Language Support | Python only | All languages (OpenAI SDK compatible) |
| Team Management | Not available | Virtual keys with team management |
| Budget Management | Basic | Advanced (per-team/per-key) |
| Load Balancing | Manual in code | Automatic via config |
| Best For | Personal projects, prototypes | Teams, organizations, production |
2. SDK Usage: Unified LLM Calls
2.1 Chat Completion
The core SDK feature. Call 100+ models with an identical interface.
from litellm import completion
import os

# Set API keys via environment variables
os.environ["OPENAI_API_KEY"] = "sk-..."
os.environ["ANTHROPIC_API_KEY"] = "sk-ant-..."
os.environ["GEMINI_API_KEY"] = "AI..."

# Same function, different models
models = [
    "gpt-4o",
    "anthropic/claude-sonnet-4-20250514",
    "gemini/gemini-2.0-flash",
    "groq/llama-3.1-70b-versatile",
    "mistral/mistral-large-latest",
]

for model in models:
    response = completion(
        model=model,
        messages=[
            {"role": "system", "content": "You are a helpful assistant."},
            {"role": "user", "content": "Explain quantum computing in 2 sentences."}
        ],
        temperature=0.7,
        max_tokens=200,
    )
    print(f"[{model}]: {response.choices[0].message.content}")
2.2 Streaming Responses
from litellm import completion

response = completion(
    model="anthropic/claude-sonnet-4-20250514",
    messages=[{"role": "user", "content": "Write a short poem about coding."}],
    stream=True,
)

for chunk in response:
    if chunk.choices[0].delta.content:
        print(chunk.choices[0].delta.content, end="", flush=True)
2.3 Embedding Calls
from litellm import embedding

# OpenAI Embedding
response = embedding(
    model="text-embedding-3-small",
    input=["Hello world", "How are you?"]
)

# Cohere Embedding (same format!)
response = embedding(
    model="cohere/embed-english-v3.0",
    input=["Hello world", "How are you?"]
)

print(f"Embedding dimension: {len(response.data[0]['embedding'])}")
2.4 Image Generation
from litellm import image_generation

response = image_generation(
    model="dall-e-3",
    prompt="A futuristic city powered by AI, digital art style",
    n=1,
    size="1024x1024",
)

print(f"Image URL: {response.data[0]['url']}")
2.5 Async Calls
import asyncio
from litellm import acompletion

async def generate_responses():
    tasks = [
        acompletion(
            model="gpt-4o-mini",
            messages=[{"role": "user", "content": f"Tell me fact #{i} about space."}]
        )
        for i in range(5)
    ]
    responses = await asyncio.gather(*tasks)
    for i, resp in enumerate(responses):
        print(f"Fact #{i}: {resp.choices[0].message.content[:100]}")

asyncio.run(generate_responses())
3. Proxy Server Configuration
3.1 Basic Configuration File (config.yaml)
model_list:
  - model_name: gpt-4o
    litellm_params:
      model: openai/gpt-4o
      api_key: os.environ/OPENAI_API_KEY

  - model_name: claude-sonnet
    litellm_params:
      model: anthropic/claude-sonnet-4-20250514
      api_key: os.environ/ANTHROPIC_API_KEY

  - model_name: gemini-flash
    litellm_params:
      model: gemini/gemini-2.0-flash
      api_key: os.environ/GEMINI_API_KEY

  - model_name: llama-local
    litellm_params:
      model: ollama/llama3.1:70b
      api_base: http://localhost:11434

litellm_settings:
  drop_params: true
  set_verbose: false
  num_retries: 3
  request_timeout: 120

general_settings:
  master_key: sk-master-key-1234
  database_url: os.environ/DATABASE_URL
3.2 Getting Started with Docker
# Docker Compose file
cat > docker-compose.yaml << 'EOF'
version: "3.9"
services:
  litellm:
    image: ghcr.io/berriai/litellm:main-latest
    ports:
      - "4000:4000"
    volumes:
      - ./config.yaml:/app/config.yaml
    command: ["--config", "/app/config.yaml", "--port", "4000"]
    environment:
      - OPENAI_API_KEY
      - ANTHROPIC_API_KEY
      - GEMINI_API_KEY
      - DATABASE_URL=postgresql://user:pass@db:5432/litellm
    depends_on:
      - db
  db:
    image: postgres:16
    environment:
      POSTGRES_USER: user
      POSTGRES_PASSWORD: pass
      POSTGRES_DB: litellm
    volumes:
      - pgdata:/var/lib/postgresql/data
volumes:
  pgdata:
EOF

# Launch
docker compose up -d
3.3 Testing the Proxy Server
# Health check
curl http://localhost:4000/health

# Chat Completion call
curl -X POST http://localhost:4000/v1/chat/completions \
  -H "Authorization: Bearer sk-master-key-1234" \
  -H "Content-Type: application/json" \
  -d '{
    "model": "gpt-4o",
    "messages": [{"role": "user", "content": "Hello!"}]
  }'

# List models
curl http://localhost:4000/v1/models \
  -H "Authorization: Bearer sk-master-key-1234"
4. Supported Providers and Models
LiteLLM supports 100+ LLM providers. Here are the major ones.
4.1 Major Providers
┌────────────────────┬──────────────────────────────────────┐
│ Provider │ Supported Models (examples) │
├────────────────────┼──────────────────────────────────────┤
│ OpenAI │ gpt-4o, gpt-4o-mini, o1, o3-mini │
│ Anthropic │ claude-opus-4, claude-sonnet, haiku │
│ Google (Gemini) │ gemini-2.0-flash, gemini-1.5-pro │
│ AWS Bedrock │ Claude, Titan, Llama via Bedrock │
│ Azure OpenAI │ gpt-4o (Azure hosted) │
│ Mistral AI │ mistral-large, mistral-small │
│ Groq │ llama-3.1-70b, mixtral-8x7b │
│ Together AI │ llama-3.1, CodeLlama, Qwen │
│ Ollama (Local) │ llama3.1, codellama, mistral │
│ vLLM (Self-hosted) │ Any HuggingFace model │
│ Cohere │ command-r-plus, embed models │
│ Deepseek │ deepseek-chat, deepseek-coder │
│ Fireworks AI │ llama, mixtral, etc. │
│ Perplexity │ pplx-70b-online, etc. │
└────────────────────┴──────────────────────────────────────┘
4.2 AWS Bedrock Configuration
model_list:
  - model_name: bedrock-claude
    litellm_params:
      model: bedrock/anthropic.claude-3-sonnet-20240229-v1:0
      aws_access_key_id: os.environ/AWS_ACCESS_KEY_ID
      aws_secret_access_key: os.environ/AWS_SECRET_ACCESS_KEY
      aws_region_name: us-east-1

  - model_name: bedrock-llama
    litellm_params:
      model: bedrock/meta.llama3-1-70b-instruct-v1:0
      aws_region_name: us-west-2
4.3 Azure OpenAI Configuration
model_list:
  - model_name: azure-gpt4
    litellm_params:
      model: azure/gpt-4o-deployment-name
      api_base: https://your-resource.openai.azure.com/
      api_key: os.environ/AZURE_API_KEY
      api_version: "2024-06-01"
4.4 Ollama (Local LLM) Configuration
model_list:
  - model_name: local-llama
    litellm_params:
      model: ollama/llama3.1:70b
      api_base: http://localhost:11434
      stream: true

  - model_name: local-codellama
    litellm_params:
      model: ollama/codellama:34b
      api_base: http://localhost:11434
5. Model Routing Strategies
5.1 Cost-Based Routing (Lowest Cost)
Automatically selects the cheapest model.
router_settings:
  routing_strategy: cost-based-routing

model_list:
  - model_name: general-chat
    litellm_params:
      model: gpt-4o-mini
    # Input: $0.15/1M tokens, Output: $0.60/1M tokens
  - model_name: general-chat
    litellm_params:
      model: anthropic/claude-3-haiku-20240307
    # Input: $0.25/1M tokens, Output: $1.25/1M tokens
  - model_name: general-chat
    litellm_params:
      model: gemini/gemini-2.0-flash
    # Input: $0.075/1M tokens, Output: $0.30/1M tokens
With this setup, calling general-chat will automatically select Gemini Flash as the cheapest option.
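The comparison the router performs can be sketched in plain Python. This is a toy illustration of cost-based selection, not LiteLLM internals; the prices come from the comments above and should be checked against current provider pricing.

```python
# Toy sketch of the comparison behind cost-based routing.
# Prices are per 1M tokens as (input, output), taken from the comments above.
pricing = {
    "gpt-4o-mini": (0.15, 0.60),
    "claude-3-haiku": (0.25, 1.25),
    "gemini-2.0-flash": (0.075, 0.30),
}

def cheapest(pricing, input_tokens=1000, output_tokens=500):
    """Return the deployment with the lowest estimated cost for a typical request."""
    def cost(name):
        inp, out = pricing[name]
        return (input_tokens * inp + output_tokens * out) / 1_000_000
    return min(pricing, key=cost)

print(cheapest(pricing))  # gemini-2.0-flash
```

For the request shape above (1,000 input / 500 output tokens), Gemini Flash wins at roughly $0.000225 per request.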
5.2 Latency-Based Routing (Lowest Latency)
Selects the model with the fastest response time.
router_settings:
  routing_strategy: latency-based-routing
  routing_strategy_args:
    ttl: 60  # Re-measure latency every 60 seconds
5.3 Usage-Based Routing
Routes to the model with the most available capacity based on RPM (requests per minute).
router_settings:
  routing_strategy: usage-based-routing

model_list:
  - model_name: fast-model
    litellm_params:
      model: gpt-4o-mini
      rpm: 500   # Max 500 requests per minute
  - model_name: fast-model
    litellm_params:
      model: gemini/gemini-2.0-flash
      rpm: 1000  # Max 1000 requests per minute
6. Load Balancing and Fallback
6.1 Load Balancing Configuration
Registering multiple deployments under the same model_name automatically enables load balancing.
model_list:
  # Distribute across multiple API keys for the same model
  - model_name: gpt-4o
    litellm_params:
      model: openai/gpt-4o
      api_key: sk-key-1
      rpm: 100
  - model_name: gpt-4o
    litellm_params:
      model: openai/gpt-4o
      api_key: sk-key-2
      rpm: 100

  # Distribute across different providers
  - model_name: fast-model
    litellm_params:
      model: openai/gpt-4o-mini
      rpm: 500
  - model_name: fast-model
    litellm_params:
      model: anthropic/claude-3-haiku-20240307
      rpm: 300

router_settings:
  routing_strategy: simple-shuffle  # Round-robin style
  allowed_fails: 3   # Deactivate deployment after 3 failures
  cooldown_time: 60  # Reactivate after 60 seconds
6.2 Fallback Chain Configuration
Automatically switch to an alternative model when the primary fails.
litellm_settings:
  # Fallback on API failures (429 rate limits, 5xx errors)
  fallbacks: [
    {"gpt-4o": ["anthropic/claude-sonnet-4-20250514", "gemini/gemini-1.5-pro"]},
    {"claude-sonnet": ["gpt-4o", "gemini/gemini-1.5-pro"]}
  ]

  # Fallback on content policy violations
  content_policy_fallbacks: [
    {"gpt-4o": ["anthropic/claude-sonnet-4-20250514"]}
  ]

  # Fallback when context window is exceeded
  context_window_fallbacks: [
    {"gpt-4o-mini": ["gpt-4o", "anthropic/claude-sonnet-4-20250514"]}
  ]
6.3 Fallback Flow
User Request: Chat Completion with "gpt-4o"
|
v
[1] Try gpt-4o
|
+-- Success --> Return response
|
+-- Failure (429 Rate Limit / 500 Error)
|
v
[2] Try claude-sonnet-4 (1st fallback)
|
+-- Success --> Return response
|
+-- Failure
|
v
[3] Try gemini-1.5-pro (2nd fallback)
|
+-- Success --> Return response
|
+-- Failure --> Return error
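Conceptually, the flow above is an ordered try-loop that the proxy automates for you. A minimal, provider-agnostic sketch (the `call_with_fallbacks` helper and `send` callable are hypothetical stand-ins for any completion call, not LiteLLM's API):

```python
# Conceptual sketch of a fallback chain -- not LiteLLM's implementation.
def call_with_fallbacks(models, send):
    """Try each model in order; return (model, response) for the first success."""
    errors = []
    for model in models:
        try:
            return model, send(model)
        except Exception as e:  # e.g. rate limit (429) or server error (500)
            errors.append((model, e))
    raise RuntimeError(f"All fallbacks exhausted: {errors}")

# Usage with a fake sender that fails for the first two models:
def fake_send(model):
    if model != "gemini-1.5-pro":
        raise ConnectionError("503 from " + model)
    return {"content": "ok"}

model, resp = call_with_fallbacks(
    ["gpt-4o", "claude-sonnet-4", "gemini-1.5-pro"], fake_send
)
print(model)  # gemini-1.5-pro
```

The proxy adds what this sketch omits: per-deployment cooldowns, retry budgets, and separate chains for content policy and context window failures.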
7. Virtual Keys and Team Management
7.1 Creating Virtual Keys
# Generate a new virtual key with the master key
curl -X POST http://localhost:4000/key/generate \
  -H "Authorization: Bearer sk-master-key-1234" \
  -H "Content-Type: application/json" \
  -d '{
    "key_alias": "frontend-team-key",
    "duration": "30d",
    "max_budget": 100.0,
    "models": ["gpt-4o", "claude-sonnet"],
    "max_parallel_requests": 10,
    "tpm_limit": 100000,
    "rpm_limit": 100,
    "metadata": {
      "team": "frontend",
      "environment": "production"
    }
  }'
7.2 Team Creation and Management
# Create a team
curl -X POST http://localhost:4000/team/new \
-H "Authorization: Bearer sk-master-key-1234" \
-H "Content-Type: application/json" \
-d '{
"team_alias": "ml-engineering",
"max_budget": 500.0,
"models": ["gpt-4o", "claude-sonnet", "gemini-flash"],
"members_with_roles": [
{"role": "admin", "user_id": "user-alice@company.com"},
{"role": "user", "user_id": "user-bob@company.com"}
],
"metadata": {
"department": "engineering",
"cost_center": "ENG-001"
}
}'
# Assign keys to a team
curl -X POST http://localhost:4000/key/generate \
-H "Authorization: Bearer sk-master-key-1234" \
-H "Content-Type: application/json" \
-d '{
"team_id": "team-ml-engineering-id",
"key_alias": "ml-team-prod-key",
"max_budget": 200.0
}'
7.3 Model Access Control per Key
# Model access groups in config.yaml
model_list:
- model_name: premium-model
litellm_params:
model: openai/gpt-4o
model_info:
access_groups: ["premium-tier"]
- model_name: basic-model
litellm_params:
model: openai/gpt-4o-mini
model_info:
access_groups: ["basic-tier", "premium-tier"]
8. Budget Management and Cost Tracking
8.1 Global Budget Settings
general_settings:
  max_budget: 10000.0    # Total monthly budget: $10,000
  budget_duration: 30d   # Reset every 30 days (monthly)
8.2 Per-Key/Per-Team Budget
# Update key budget
curl -X POST http://localhost:4000/key/update \
-H "Authorization: Bearer sk-master-key-1234" \
-H "Content-Type: application/json" \
-d '{
"key": "sk-virtual-key-xxx",
"max_budget": 50.0,
"budget_duration": "1m",
"soft_budget": 40.0
}'
When soft_budget is reached, an alert is sent. When max_budget is reached, requests are blocked.
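The soft/hard budget behavior just described boils down to a small decision rule. A conceptual sketch (the `budget_action` function is illustrative, not LiteLLM internals):

```python
# Conceptual sketch of soft vs hard budget enforcement.
def budget_action(spend, soft_budget, max_budget):
    """Return what the proxy does at a given spend level for a key."""
    if spend >= max_budget:
        return "block"  # hard budget reached: requests are rejected
    if spend >= soft_budget:
        return "alert"  # soft budget reached: requests pass, an alert fires
    return "allow"

print(budget_action(30.0, soft_budget=40.0, max_budget=50.0))  # allow
print(budget_action(45.0, soft_budget=40.0, max_budget=50.0))  # alert
print(budget_action(50.0, soft_budget=40.0, max_budget=50.0))  # block
```

Setting `soft_budget` comfortably below `max_budget` gives your team time to react before requests start failing.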
8.3 Cost Tracking API
# Query total spend
curl http://localhost:4000/spend/logs \
-H "Authorization: Bearer sk-master-key-1234" \
-G \
-d "start_date=2025-01-01" \
-d "end_date=2025-01-31"
# Query spend by model
curl http://localhost:4000/spend/tags \
-H "Authorization: Bearer sk-master-key-1234" \
-G \
-d "group_by=model"
# Query spend by team
curl http://localhost:4000/spend/tags \
-H "Authorization: Bearer sk-master-key-1234" \
-G \
-d "group_by=team"
8.4 Budget Alert Webhooks
general_settings:
  alerting:
    - slack
  alerting_threshold: 300  # Alert if a request hangs for more than 300 seconds
  alert_types:
    - budget_alerts
    - spend_reports
    - failed_tracking

environment_variables:
  SLACK_WEBHOOK_URL: os.environ/SLACK_WEBHOOK_URL
9. Rate Limiting
9.1 Key-Level Rate Limiting
curl -X POST http://localhost:4000/key/generate \
  -H "Authorization: Bearer sk-master-key-1234" \
  -H "Content-Type: application/json" \
  -d '{
    "key_alias": "rate-limited-key",
    "rpm_limit": 60,
    "tpm_limit": 100000,
    "max_parallel_requests": 5
  }'
9.2 Model-Level Rate Limiting
model_list:
  - model_name: gpt-4o
    litellm_params:
      model: openai/gpt-4o
      api_key: os.environ/OPENAI_API_KEY
      rpm: 200     # Max requests per minute for this deployment
      tpm: 400000  # Max tokens per minute for this deployment
9.3 Global Rate Limiting
general_settings:
  global_max_parallel_requests: 100  # Global concurrent request limit
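An rpm limit is conceptually a sliding-window counter per key or deployment. A minimal sketch of the idea (illustration only, not LiteLLM's actual implementation; the `RpmLimiter` class is hypothetical, with an injectable clock so the behavior is easy to test):

```python
import time
from collections import deque

# Conceptual sliding-window RPM limiter -- illustration only.
class RpmLimiter:
    def __init__(self, rpm, clock=time.monotonic):
        self.rpm = rpm
        self.clock = clock     # injectable for testing
        self.window = deque()  # timestamps of recently admitted requests

    def allow(self):
        now = self.clock()
        # Drop timestamps that have aged out of the 60-second window.
        while self.window and now - self.window[0] >= 60:
            self.window.popleft()
        if len(self.window) < self.rpm:
            self.window.append(now)
            return True
        return False

limiter = RpmLimiter(rpm=2, clock=lambda: 0.0)
print([limiter.allow() for _ in range(3)])  # [True, True, False]
```

The real proxy layers several of these on top of each other (per key, per team, per deployment, global), rejecting a request with HTTP 429 when any layer is exhausted.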
10. Guardrails: PII Masking and Content Filtering
10.1 PII (Personally Identifiable Information) Masking
litellm_settings:
  guardrails:
    - guardrail_name: pii-masking
      litellm_params:
        guardrail: presidio
        mode: during_call       # Mask PII before sending to LLM
        output_parse_pii: true  # Also mask PII in responses

# Request containing PII
response = client.chat.completions.create(
    model="gpt-4o",
    messages=[{
        "role": "user",
        "content": "My email is john@example.com and phone is 555-123-4567"
    }],
    extra_body={
        "metadata": {
            "guardrails": ["pii-masking"]
        }
    }
)
# LLM receives: "My email is [EMAIL] and phone is [PHONE]"
10.2 Content Filtering (Lakera Guard)
litellm_settings:
  guardrails:
    - guardrail_name: content-filter
      litellm_params:
        guardrail: lakera
        mode: pre_call  # Filter before sending request
        api_key: os.environ/LAKERA_API_KEY
10.3 Prompt Injection Defense
litellm_settings:
  guardrails:
    - guardrail_name: prompt-injection
      litellm_params:
        guardrail: lakera
        mode: pre_call
        prompt_injection: true
10.4 Custom Guardrails
# custom_guardrail.py
from litellm.proxy.guardrails.guardrail_hooks import GuardrailHook

class CustomContentFilter(GuardrailHook):
    def __init__(self):
        self.blocked_words = ["harmful", "dangerous", "illegal"]

    async def pre_call_hook(self, data, call_type):
        # Guard against requests with no messages
        messages = data.get("messages") or []
        user_message = messages[-1].get("content", "") if messages else ""
        for word in self.blocked_words:
            if word.lower() in user_message.lower():
                raise ValueError(f"Blocked content detected: {word}")
        return data

    async def post_call_hook(self, data, response, call_type):
        return response
11. Logging and Monitoring
11.1 Langfuse Integration
litellm_settings:
  success_callback: ["langfuse"]
  failure_callback: ["langfuse"]

environment_variables:
  LANGFUSE_PUBLIC_KEY: os.environ/LANGFUSE_PUBLIC_KEY
  LANGFUSE_SECRET_KEY: os.environ/LANGFUSE_SECRET_KEY
  LANGFUSE_HOST: https://cloud.langfuse.com
11.2 LangSmith Integration
litellm_settings:
  success_callback: ["langsmith"]

environment_variables:
  LANGCHAIN_API_KEY: os.environ/LANGCHAIN_API_KEY
  LANGCHAIN_PROJECT: my-litellm-project
11.3 Custom Callbacks
# custom_callback.py
from litellm.integrations.custom_logger import CustomLogger
class MyCustomLogger(CustomLogger):
async def async_log_success_event(self, kwargs, response_obj, start_time, end_time):
print(f"Model: {kwargs.get('model')}")
print(f"Cost: ${kwargs.get('response_cost', 0):.6f}")
print(f"Latency: {(end_time - start_time).total_seconds():.2f}s")
print(f"Tokens: {response_obj.usage.total_tokens}")
async def async_log_failure_event(self, kwargs, response_obj, start_time, end_time):
print(f"FAILED - Model: {kwargs.get('model')}")
print(f"Error: {kwargs.get('exception')}")
11.4 Prometheus Metrics
litellm_settings:
  success_callback: ["prometheus"]
  failure_callback: ["prometheus"]
Key metrics:
# Request counts
litellm_requests_total
litellm_requests_failed_total
# Latency
litellm_request_duration_seconds_bucket
litellm_request_duration_seconds_sum
# Cost
litellm_spend_total
# Tokens
litellm_tokens_total
litellm_input_tokens_total
litellm_output_tokens_total
12. Caching
12.1 Redis Caching
litellm_settings:
  cache: true
  cache_params:
    type: redis
    host: redis-host
    port: 6379
    password: os.environ/REDIS_PASSWORD
    ttl: 3600  # 1-hour cache
    # Which call types to cache
    supported_call_types:
      - acompletion
      - completion
12.2 In-Memory Caching
litellm_settings:
  cache: true
  cache_params:
    type: local
    ttl: 600  # 10-minute cache
12.3 Cache Control
# Use cache
response = client.chat.completions.create(
model="gpt-4o",
messages=[{"role": "user", "content": "What is 2+2?"}],
extra_body={
"metadata": {
"cache": {
"no-cache": False, # Use cache
"ttl": 3600 # TTL for this request
}
}
}
)
# Skip cache
response = client.chat.completions.create(
model="gpt-4o",
messages=[{"role": "user", "content": "What is the latest news?"}],
extra_body={
"metadata": {
"cache": {
"no-cache": True # Always get fresh response
}
}
}
)
13. Production Deployment
13.1 Kubernetes Deployment
# litellm-deployment.yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: litellm-proxy
  labels:
    app: litellm
spec:
  replicas: 3
  selector:
    matchLabels:
      app: litellm
  template:
    metadata:
      labels:
        app: litellm
    spec:
      containers:
        - name: litellm
          image: ghcr.io/berriai/litellm:main-latest
          ports:
            - containerPort: 4000
          args: ["--config", "/app/config.yaml", "--port", "4000"]
          volumeMounts:
            - name: config
              mountPath: /app/config.yaml
              subPath: config.yaml
          env:
            - name: OPENAI_API_KEY
              valueFrom:
                secretKeyRef:
                  name: litellm-secrets
                  key: openai-api-key
            - name: ANTHROPIC_API_KEY
              valueFrom:
                secretKeyRef:
                  name: litellm-secrets
                  key: anthropic-api-key
            - name: DATABASE_URL
              valueFrom:
                secretKeyRef:
                  name: litellm-secrets
                  key: database-url
          resources:
            requests:
              cpu: "500m"
              memory: "512Mi"
            limits:
              cpu: "2000m"
              memory: "2Gi"
          livenessProbe:
            httpGet:
              path: /health/liveliness
              port: 4000
            initialDelaySeconds: 30
            periodSeconds: 10
          readinessProbe:
            httpGet:
              path: /health/readiness
              port: 4000
            initialDelaySeconds: 10
            periodSeconds: 5
      volumes:
        - name: config
          configMap:
            name: litellm-config
---
apiVersion: v1
kind: Service
metadata:
  name: litellm-service
spec:
  selector:
    app: litellm
  ports:
    - port: 80
      targetPort: 4000
  type: ClusterIP
---
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: litellm-hpa
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: litellm-proxy
  minReplicas: 2
  maxReplicas: 10
  metrics:
    - type: Resource
      resource:
        name: cpu
        target:
          type: Utilization
          averageUtilization: 70
13.2 Using Helm Chart
# Install with Helm
helm repo add litellm https://litellm.github.io/litellm-helm
helm repo update

helm install litellm litellm/litellm-helm \
  --set masterKey=sk-your-master-key \
  --set replicaCount=3 \
  --set database.useExisting=true \
  --set database.url=postgresql://user:pass@host:5432/litellm \
  --namespace litellm \
  --create-namespace
13.3 Production Checklist
Production Deployment Checklist
================================
[ ] PostgreSQL DB configured and connectivity verified
[ ] Master Key set to a strong random value
[ ] All API keys managed via env vars / secrets
[ ] HTTPS/TLS configured (Ingress or LoadBalancer)
[ ] 2+ replicas configured
[ ] HPA (Horizontal Pod Autoscaler) configured
[ ] Liveness/Readiness Probes configured
[ ] Prometheus + Grafana monitoring set up
[ ] Slack/PagerDuty alerting configured
[ ] Redis cache set up (optional)
[ ] Budget and rate limits configured
[ ] Guardrails configured (if needed)
[ ] Logging (Langfuse/LangSmith) set up
[ ] Backup and recovery strategy documented
[ ] Load testing completed
14. LiteLLM vs Alternatives
14.1 LiteLLM vs OpenRouter
| Feature | LiteLLM | OpenRouter |
|---|---|---|
| Hosting | Self-hosted | Cloud service |
| Data Privacy | Full control | Third-party servers |
| Cost | Open-source free | API margin added |
| Team Management | Virtual keys/teams/budgets | Limited |
| Customization | Fully customizable | Limited |
| Setup Difficulty | Medium (Docker/K8s) | Easy (API key only) |
| Local Models | Supported (Ollama/vLLM) | Not supported |
14.2 LiteLLM vs Portkey
| Feature | LiteLLM | Portkey |
|---|---|---|
| Open Source | Yes (Apache 2.0) | Partial (Gateway only) |
| Proxy Server | Included | Included |
| Virtual Keys | Included | Included |
| AI Gateway | Basic | Advanced (Prompt mgmt) |
| Pricing | Free | Premium plans available |
| Community | Active (12K+ GitHub Stars) | Growing |
14.3 When to Choose What
Decision Guide
================================
"I want to get started quickly"
-> OpenRouter (sign up and use immediately)
"Data security is critical"
-> LiteLLM (self-hosted, full control)
"I need team/org management"
-> LiteLLM or Portkey
"I also use local models"
-> LiteLLM (native Ollama/vLLM support)
"I need enterprise features"
-> LiteLLM Enterprise or Portkey Enterprise
15. Real-World Example: Complete Production Config
Below is a complete config.yaml suitable for production use.
# production-config.yaml
model_list:
  # === Premium Tier ===
  - model_name: premium-chat
    litellm_params:
      model: openai/gpt-4o
      api_key: os.environ/OPENAI_API_KEY
      rpm: 200
      tpm: 400000
    model_info:
      access_groups: ["premium"]
  - model_name: premium-chat
    litellm_params:
      model: anthropic/claude-sonnet-4-20250514
      api_key: os.environ/ANTHROPIC_API_KEY
      rpm: 150
    model_info:
      access_groups: ["premium"]

  # === Basic Tier ===
  - model_name: basic-chat
    litellm_params:
      model: openai/gpt-4o-mini
      api_key: os.environ/OPENAI_API_KEY
      rpm: 500
    model_info:
      access_groups: ["basic", "premium"]
  - model_name: basic-chat
    litellm_params:
      model: gemini/gemini-2.0-flash
      api_key: os.environ/GEMINI_API_KEY
      rpm: 1000
    model_info:
      access_groups: ["basic", "premium"]

  # === Coding Dedicated ===
  - model_name: code-model
    litellm_params:
      model: anthropic/claude-sonnet-4-20250514
      api_key: os.environ/ANTHROPIC_API_KEY
    model_info:
      access_groups: ["premium"]

  # === Embedding ===
  - model_name: embedding
    litellm_params:
      model: openai/text-embedding-3-small
      api_key: os.environ/OPENAI_API_KEY

# === Router Settings ===
router_settings:
  routing_strategy: latency-based-routing
  allowed_fails: 3
  cooldown_time: 60
  num_retries: 3
  timeout: 120
  retry_after: 5

# === LiteLLM Settings ===
litellm_settings:
  drop_params: true
  set_verbose: false
  num_retries: 3
  request_timeout: 120

  # Fallbacks
  fallbacks: [
    {"premium-chat": ["basic-chat"]},
    {"basic-chat": ["premium-chat"]}
  ]
  context_window_fallbacks: [
    {"basic-chat": ["premium-chat"]}
  ]

  # Caching
  cache: true
  cache_params:
    type: redis
    host: os.environ/REDIS_HOST
    port: 6379
    password: os.environ/REDIS_PASSWORD
    ttl: 3600

  # Callbacks
  success_callback: ["langfuse", "prometheus"]
  failure_callback: ["langfuse", "prometheus"]

  # Guardrails
  guardrails:
    - guardrail_name: pii-masking
      litellm_params:
        guardrail: presidio
        mode: during_call

# === General Settings ===
general_settings:
  master_key: os.environ/LITELLM_MASTER_KEY
  database_url: os.environ/DATABASE_URL
  max_budget: 10000
  budget_duration: 30d
  alerting:
    - slack
  alerting_threshold: 300
  global_max_parallel_requests: 200

# === Environment Variables ===
environment_variables:
  LANGFUSE_PUBLIC_KEY: os.environ/LANGFUSE_PUBLIC_KEY
  LANGFUSE_SECRET_KEY: os.environ/LANGFUSE_SECRET_KEY
  SLACK_WEBHOOK_URL: os.environ/SLACK_WEBHOOK_URL
Quiz
Test your understanding of LiteLLM concepts.
Q1. What are the two modes of using LiteLLM, and when is each appropriate?
A1. LiteLLM has SDK Mode and Proxy Mode.
- SDK Mode: Import the Python package directly. Best for personal projects and prototypes.
- Proxy Mode: Run as a standalone HTTP server. Best for teams/organizations, multi-language clients, virtual key management, budget management, and production deployments.
The key difference is that Proxy Mode provides enterprise features like virtual keys, team management, budget management, and logging.
Q2. What is Model Fallback, and what types of fallback does LiteLLM support?
A2. Model fallback is a mechanism that automatically switches to an alternative model when the primary model fails.
LiteLLM supports three types of fallback:
- General fallbacks: Switch to alternative models on API failures (429, 500 errors)
- Content policy fallbacks: Switch when content policy is violated
- Context window fallbacks: Switch to a model with a larger context window when input exceeds limits
Q3. How does cost-based routing work in LiteLLM?
A3. When you register multiple provider models under the same model_name and set routing_strategy to cost-based-routing, LiteLLM compares the input/output token prices for each model and automatically routes to the cheapest one.
For example, if GPT-4o-mini, Claude Haiku, and Gemini Flash are all registered under the same name, the model with the lowest price per token is automatically selected. This optimizes cost while maintaining quality.
Q4. What are Virtual Keys and what are their main capabilities?
A4. Virtual Keys are API keys generated by the LiteLLM proxy. They avoid exposing actual LLM provider API keys while providing extensive control features.
Key capabilities:
- Budget limits: Set maximum budget per key (max_budget)
- Model access control: Restrict which models a key can access
- Rate limiting: Set RPM, TPM, and concurrent request limits
- Team association: Assign keys to teams for team-level management
- Duration control: Set key expiration dates
- Usage tracking: Track cost, tokens, and request counts per key
Q5. What are the top 5 must-consider items when deploying LiteLLM to production?
A5. Critical production deployment considerations:
- Database: PostgreSQL connection required (stores virtual keys, cost tracking, team management data)
- Security: Strong random Master Key, all API keys in secrets, HTTPS/TLS enabled
- High Availability: 2+ replicas, HPA configured, Liveness/Readiness Probes set up
- Monitoring: Prometheus metrics collection, Grafana dashboards, Slack/PagerDuty alerts configured
- Cost Control: Global budget, per-team/per-key budgets, soft budget alerts, rate limits configured
Additionally recommended: Redis caching, logging (Langfuse), Guardrails, and backup strategy.
References
- LiteLLM Official Documentation - https://docs.litellm.ai/
- LiteLLM GitHub Repository - https://github.com/BerriAI/litellm
- LiteLLM Proxy Server Quick Start - https://docs.litellm.ai/docs/proxy/quick_start
- LiteLLM Supported Providers - https://docs.litellm.ai/docs/providers
- LiteLLM Virtual Keys Guide - https://docs.litellm.ai/docs/proxy/virtual_keys
- LiteLLM Routing Strategies - https://docs.litellm.ai/docs/routing
- LiteLLM Budget Management - https://docs.litellm.ai/docs/proxy/users
- LiteLLM Guardrails - https://docs.litellm.ai/docs/proxy/guardrails
- LiteLLM Kubernetes Deployment - https://docs.litellm.ai/docs/proxy/deploy
- OpenRouter Official Site - https://openrouter.ai/
- Portkey AI Official Site - https://portkey.ai/
- Langfuse LLM Observability - https://langfuse.com/
- LiteLLM vs Alternatives - https://docs.litellm.ai/docs/proxy/enterprise
- Prometheus Monitoring - https://prometheus.io/