[Architecture] Complete Guide to LiteLLM: Unified Serving of 100+ LLMs

Overview

As services leveraging LLMs (Large Language Models) proliferate, efficiently managing models from multiple providers has become critical. When each provider -- OpenAI, Anthropic, Azure OpenAI, AWS Bedrock -- uses different SDKs and API formats, code complexity increases rapidly.

LiteLLM is an open-source tool that solves this problem by providing an OpenAI SDK-compatible interface to unify 100+ LLMs into a single API. This post covers everything from LiteLLM SDK usage to Proxy Server setup, cost tracking, and production deployment.


1. What is LiteLLM

1.1 Core Value

LiteLLM is an open-source project by BerriAI that provides two core components.

1. Python SDK: Call 100+ LLMs through a unified interface

litellm.completion()
  |
  +-- model="gpt-4o"           --> OpenAI API
  +-- model="claude-sonnet-4-20250514"  --> Anthropic API
  +-- model="azure/gpt-4o"     --> Azure OpenAI
  +-- model="bedrock/claude-3" --> AWS Bedrock
  +-- model="vertex_ai/gemini" --> Google Vertex AI
  +-- model="ollama/llama3"    --> Local Ollama

2. Proxy Server (AI Gateway): OpenAI-compatible REST API server

Any OpenAI Client --> LiteLLM Proxy --> Multiple LLM Providers
                      (Rate Limiting, Cost Tracking, Load Balancing)

1.2 Why LiteLLM

Problem                      | LiteLLM Solution
---------------------------- | ----------------------------------------
Different SDK per provider   | Unified completion() function
Scattered API key management | Centralized management via Proxy
Difficulty tracking costs    | Automatic cost calculation and tracking
Provider outages             | Automatic fallback support
Rate limit management        | Built-in rate limiting
Model switching costs        | Switch models without code changes

1.3 Supported Providers

Commercial Providers:
  - OpenAI (GPT-4o, GPT-4o-mini, o1, o3, etc.)
  - Anthropic (Claude Sonnet, Claude Haiku, Claude Opus)
  - Azure OpenAI
  - AWS Bedrock (Claude, Titan, Llama)
  - Google Vertex AI (Gemini)
  - Google AI Studio
  - Cohere (Command R+)
  - Mistral AI
  - Together AI
  - Groq
  - Fireworks AI
  - Perplexity
  - DeepSeek

Self-Hosted / Local:
  - Ollama
  - vLLM
  - Hugging Face TGI
  - NVIDIA NIM
  - OpenAI-compatible endpoints

2. LiteLLM SDK Usage

2.1 Installation

pip install litellm

2.2 Basic Usage: completion()

import litellm

# OpenAI
response = litellm.completion(
    model="gpt-4o",
    messages=[
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": "What is Kubernetes?"},
    ],
    temperature=0.7,
    max_tokens=1000,
)
print(response.choices[0].message.content)

# Anthropic Claude
response = litellm.completion(
    model="claude-sonnet-4-20250514",
    messages=[
        {"role": "user", "content": "Explain microservices architecture."},
    ],
    max_tokens=1000,
)
print(response.choices[0].message.content)

# Azure OpenAI
response = litellm.completion(
    model="azure/gpt-4o-deployment",
    messages=[
        {"role": "user", "content": "Hello!"},
    ],
    api_base="https://my-resource.openai.azure.com",
    api_version="2024-02-15-preview",
    api_key="your-azure-key",
)

# AWS Bedrock
response = litellm.completion(
    model="bedrock/anthropic.claude-3-sonnet-20240229-v1:0",
    messages=[
        {"role": "user", "content": "Summarize this text."},
    ],
)

# Ollama (local)
response = litellm.completion(
    model="ollama/llama3",
    messages=[
        {"role": "user", "content": "Write a Python function."},
    ],
    api_base="http://localhost:11434",
)

2.3 Streaming

import litellm

# Synchronous streaming
response = litellm.completion(
    model="gpt-4o",
    messages=[{"role": "user", "content": "Write a long essay."}],
    stream=True,
)

for chunk in response:
    content = chunk.choices[0].delta.content
    if content:
        print(content, end="", flush=True)

2.4 Async Calls

import asyncio
import litellm

async def main():
    # Single async call
    response = await litellm.acompletion(
        model="gpt-4o",
        messages=[{"role": "user", "content": "Hello!"}],
    )
    print(response.choices[0].message.content)

    # Async streaming
    response = await litellm.acompletion(
        model="claude-sonnet-4-20250514",
        messages=[{"role": "user", "content": "Explain Docker."}],
        stream=True,
    )
    async for chunk in response:
        content = chunk.choices[0].delta.content
        if content:
            print(content, end="", flush=True)

    # Parallel calls
    tasks = [
        litellm.acompletion(
            model="gpt-4o-mini",
            messages=[{"role": "user", "content": f"Question {i}"}],
        )
        for i in range(5)
    ]
    responses = await asyncio.gather(*tasks)
    for resp in responses:
        print(resp.choices[0].message.content[:50])

asyncio.run(main())

2.5 Function Calling (Tool Use)

import litellm
import json

tools = [
    {
        "type": "function",
        "function": {
            "name": "get_weather",
            "description": "Get the current weather for a location",
            "parameters": {
                "type": "object",
                "properties": {
                    "location": {
                        "type": "string",
                        "description": "City name",
                    },
                    "unit": {
                        "type": "string",
                        "enum": ["celsius", "fahrenheit"],
                    },
                },
                "required": ["location"],
            },
        },
    }
]

# Same interface for both OpenAI and Claude
for model in ["gpt-4o", "claude-sonnet-4-20250514"]:
    response = litellm.completion(
        model=model,
        messages=[
            {"role": "user", "content": "What's the weather in Seoul?"}
        ],
        tools=tools,
        tool_choice="auto",
    )

    if response.choices[0].message.tool_calls:
        tool_call = response.choices[0].message.tool_calls[0]
        print(f"Model: {model}")
        print(f"Function: {tool_call.function.name}")
        print(f"Arguments: {tool_call.function.arguments}")

2.6 Embedding

import litellm

# OpenAI Embedding
response = litellm.embedding(
    model="text-embedding-3-small",
    input=["Hello world", "How are you?"],
)
print(f"Embedding dimension: {len(response.data[0]['embedding'])}")

# Cohere Embedding
response = litellm.embedding(
    model="cohere/embed-english-v3.0",
    input=["Search query text"],
    input_type="search_query",
)

# Bedrock Embedding
response = litellm.embedding(
    model="bedrock/amazon.titan-embed-text-v2:0",
    input=["Document text for embedding"],
)

2.7 Image/Vision Models

import litellm

# GPT-4o Vision
response = litellm.completion(
    model="gpt-4o",
    messages=[
        {
            "role": "user",
            "content": [
                {"type": "text", "text": "What is in this image?"},
                {
                    "type": "image_url",
                    "image_url": {
                        "url": "https://example.com/image.png",
                    },
                },
            ],
        }
    ],
)

# Claude Vision
response = litellm.completion(
    model="claude-sonnet-4-20250514",
    messages=[
        {
            "role": "user",
            "content": [
                {"type": "text", "text": "Describe this architecture diagram."},
                {
                    "type": "image_url",
                    "image_url": {
                        "url": "data:image/png;base64,iVBORw0KGgo...",
                    },
                },
            ],
        }
    ],
)

3. LiteLLM Proxy Server (AI Gateway)

3.1 What is the Proxy

LiteLLM Proxy is a self-hostable OpenAI-compatible API Gateway. Any existing client using the OpenAI SDK can connect to the Proxy without code changes.

+-------------------+
| Your Application  |
| (OpenAI SDK)      |
+--------+----------+
         |
         v
+--------+----------+
| LiteLLM Proxy     |
| - Rate Limiting   |
| - Cost Tracking   |
| - Load Balancing  |
| - Fallback        |
| - Key Management  |
+--------+----------+
         |
    +----+----+----+----+
    |    |    |    |    |
    v    v    v    v    v
  OpenAI Azure Anthropic Bedrock Ollama
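Because the Proxy speaks the OpenAI wire format, pointing any OpenAI SDK at it is a one-line base_url change. As a sketch of what actually travels over the wire (the address and virtual key below are placeholders for your environment), the request can be assembled with nothing but the standard library:

```python
import json
import urllib.request

PROXY_BASE = "http://localhost:4000"     # your proxy address
VIRTUAL_KEY = "sk-generated-key-abc123"  # a LiteLLM virtual key

def chat_request(model: str, messages: list[dict]) -> urllib.request.Request:
    """Build an OpenAI-compatible chat completion request for the proxy."""
    body = json.dumps({"model": model, "messages": messages}).encode()
    return urllib.request.Request(
        url=f"{PROXY_BASE}/v1/chat/completions",
        data=body,
        headers={
            "Authorization": f"Bearer {VIRTUAL_KEY}",
            "Content-Type": "application/json",
        },
        method="POST",
    )

req = chat_request("gpt-4o", [{"role": "user", "content": "Hello!"}])
print(req.full_url)  # http://localhost:4000/v1/chat/completions
# urllib.request.urlopen(req) would send it to a running proxy
```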

3.2 Installation and Running

# pip install
pip install 'litellm[proxy]'

# Basic run
litellm --model gpt-4o --port 4000

# Run with config file
litellm --config config.yaml --port 4000

# Docker run
docker run -d \
  --name litellm-proxy \
  -p 4000:4000 \
  -v ./config.yaml:/app/config.yaml \
  -e OPENAI_API_KEY=sk-xxx \
  -e ANTHROPIC_API_KEY=sk-ant-xxx \
  ghcr.io/berriai/litellm:main-latest \
  --config /app/config.yaml

3.3 config.yaml Configuration

# config.yaml
model_list:
  # OpenAI models
  - model_name: gpt-4o
    litellm_params:
      model: openai/gpt-4o
      api_key: os.environ/OPENAI_API_KEY
      rpm: 500 # Rate limit: requests per minute
      tpm: 100000 # Rate limit: tokens per minute

  # Claude models (load balancing across deployments)
  - model_name: claude-sonnet
    litellm_params:
      model: anthropic/claude-sonnet-4-20250514
      api_key: os.environ/ANTHROPIC_API_KEY
      rpm: 200
      tpm: 80000

  # Azure OpenAI
  - model_name: gpt-4o
    litellm_params:
      model: azure/gpt-4o-deployment
      api_base: https://my-resource.openai.azure.com
      api_version: '2024-02-15-preview'
      api_key: os.environ/AZURE_API_KEY
      rpm: 300

  # AWS Bedrock Claude
  - model_name: bedrock-claude
    litellm_params:
      model: bedrock/anthropic.claude-3-sonnet-20240229-v1:0
      aws_access_key_id: os.environ/AWS_ACCESS_KEY_ID
      aws_secret_access_key: os.environ/AWS_SECRET_ACCESS_KEY
      aws_region_name: us-east-1

  # Local Ollama
  - model_name: local-llama
    litellm_params:
      model: ollama/llama3
      api_base: http://ollama:11434

# Router settings
router_settings:
  routing_strategy: 'latency-based-routing'
  num_retries: 3
  timeout: 60
  allowed_fails: 2
  cooldown_time: 30

# General settings
general_settings:
  master_key: sk-master-key-1234
  database_url: os.environ/DATABASE_URL
  store_model_in_db: true

litellm_settings:
  drop_params: true
  set_verbose: false
  cache: true
  cache_params:
    type: redis
    host: redis
    port: 6379

3.4 Model Routing and Load Balancing

# Registering multiple deployments with the same model_name enables auto load balancing
model_list:
  # gpt-4o group: 3 deployments
  - model_name: gpt-4o
    litellm_params:
      model: openai/gpt-4o
      api_key: os.environ/OPENAI_API_KEY_1
  - model_name: gpt-4o
    litellm_params:
      model: azure/gpt-4o-east
      api_base: https://east.openai.azure.com
      api_key: os.environ/AZURE_KEY_EAST
  - model_name: gpt-4o
    litellm_params:
      model: azure/gpt-4o-west
      api_base: https://west.openai.azure.com
      api_key: os.environ/AZURE_KEY_WEST

router_settings:
  # Routing strategy
  routing_strategy: 'latency-based-routing'
  # Options:
  #   simple-shuffle: Random selection
  #   least-busy: Fewest in-progress requests
  #   usage-based-routing: Based on TPM/RPM usage
  #   latency-based-routing: Based on response time (recommended)

Routing Strategy Comparison:

Strategy              | Description                        | Best For
--------------------- | ---------------------------------- | ----------------------------------------
simple-shuffle        | Random distribution                | All deployments have similar performance
least-busy            | Based on in-progress request count | Varying request processing times
usage-based-routing   | Based on RPM/TPM usage             | Approaching rate limits
latency-based-routing | Based on response time             | Latency optimization is critical
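All of these strategies boil down to "pick a deployment by some score". A stripped-down sketch of the latency-based idea (illustrative only, not LiteLLM's actual router): keep an exponential moving average of observed latency per deployment and route to the fastest.

```python
from collections import defaultdict

class LatencyRouter:
    """Toy latency-based router: pick the deployment with the lowest
    exponential moving average (EMA) of observed request latency."""

    def __init__(self, deployments: list[str], alpha: float = 0.3):
        self.deployments = deployments
        self.alpha = alpha
        self.ema = defaultdict(float)  # seconds; 0.0 means "no data yet"

    def pick(self) -> str:
        # Untried deployments (EMA 0.0) are tried first
        return min(self.deployments, key=lambda d: self.ema[d])

    def record(self, deployment: str, latency_s: float):
        prev = self.ema[deployment]
        self.ema[deployment] = latency_s if prev == 0.0 else (
            self.alpha * latency_s + (1 - self.alpha) * prev
        )

router = LatencyRouter(["openai/gpt-4o", "azure/gpt-4o-east", "azure/gpt-4o-west"])
router.record("openai/gpt-4o", 2.1)
router.record("azure/gpt-4o-east", 0.8)
router.record("azure/gpt-4o-west", 1.4)
print(router.pick())  # azure/gpt-4o-east
```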

3.5 Fallback Configuration

model_list:
  - model_name: primary-model
    litellm_params:
      model: openai/gpt-4o
      api_key: os.environ/OPENAI_API_KEY

  - model_name: fallback-model
    litellm_params:
      model: anthropic/claude-sonnet-4-20250514
      api_key: os.environ/ANTHROPIC_API_KEY

router_settings:
  num_retries: 2
  timeout: 30
  fallbacks: [{ 'primary-model': ['fallback-model'] }]
  # Fallback only on specific errors
  retry_policy:
    RateLimitError: 3 # Retry 3 times on 429 errors
    ContentPolicyViolationError: 0 # No retry on content policy violations
    AuthenticationError: 0 # No retry on auth errors
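Conceptually the fallback chain behaves like this (a simplified sketch, not the proxy's internals): try the primary deployment, and on a retryable error walk the fallback list in order, re-raising only when every deployment fails.

```python
def call_with_fallbacks(call, primary, fallbacks, retryable=(TimeoutError,)):
    """Try call(model) on the primary, then each fallback in order;
    re-raise the last error if every deployment fails."""
    last_err = None
    for model in [primary, *fallbacks]:
        try:
            return model, call(model)
        except retryable as err:
            last_err = err  # this deployment failed; try the next one
    raise last_err

# Simulated provider: the primary times out, the fallback answers
def fake_call(model):
    if model == "primary-model":
        raise TimeoutError("provider outage")
    return f"response from {model}"

used, result = call_with_fallbacks(fake_call, "primary-model", ["fallback-model"])
print(used, "->", result)  # fallback-model -> response from fallback-model
```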

3.6 API Key Management (Virtual Keys)

# Generate virtual key with master key
curl -X POST http://localhost:4000/key/generate \
  -H "Authorization: Bearer sk-master-key-1234" \
  -H "Content-Type: application/json" \
  -d '{
    "models": ["gpt-4o", "claude-sonnet"],
    "max_budget": 100.0,
    "budget_duration": "monthly",
    "metadata": {
      "team": "backend",
      "user": "developer-1"
    },
    "tpm_limit": 50000,
    "rpm_limit": 100
  }'

Response:

{
  "key": "sk-generated-key-abc123",
  "expires": "2026-04-20T00:00:00Z",
  "max_budget": 100.0,
  "models": ["gpt-4o", "claude-sonnet"]
}

# Make API calls with the generated key
from openai import OpenAI

client = OpenAI(
    api_key="sk-generated-key-abc123",
    base_url="http://localhost:4000",
)

response = client.chat.completions.create(
    model="gpt-4o",
    messages=[{"role": "user", "content": "Hello!"}],
)

3.7 Rate Limiting Configuration

# Rate Limiting in config.yaml
model_list:
  - model_name: gpt-4o
    litellm_params:
      model: openai/gpt-4o
      api_key: os.environ/OPENAI_API_KEY
      rpm: 500 # Model deployment level RPM
      tpm: 100000 # Model deployment level TPM

general_settings:
  master_key: sk-master-key-1234
  database_url: os.environ/DATABASE_URL

# Per-key rate limiting
curl -X POST http://localhost:4000/key/generate \
  -H "Authorization: Bearer sk-master-key-1234" \
  -H "Content-Type: application/json" \
  -d '{
    "rpm_limit": 50,
    "tpm_limit": 20000,
    "max_budget": 10.0,
    "budget_duration": "daily"
  }'

# Per-team rate limiting
curl -X POST http://localhost:4000/team/new \
  -H "Authorization: Bearer sk-master-key-1234" \
  -H "Content-Type: application/json" \
  -d '{
    "team_alias": "backend-team",
    "rpm_limit": 200,
    "tpm_limit": 80000,
    "max_budget": 500.0,
    "budget_duration": "monthly"
  }'

3.8 Budget Management

# Set per-user budget
curl -X POST http://localhost:4000/user/new \
  -H "Authorization: Bearer sk-master-key-1234" \
  -H "Content-Type: application/json" \
  -d '{
    "user_id": "user-123",
    "max_budget": 50.0,
    "budget_duration": "monthly",
    "models": ["gpt-4o-mini", "claude-sonnet"]
  }'

# Check budget usage
curl http://localhost:4000/user/info?user_id=user-123 \
  -H "Authorization: Bearer sk-master-key-1234"

3.9 Caching

# config.yaml
litellm_settings:
  cache: true
  cache_params:
    type: redis
    host: redis
    port: 6379
    ttl: 3600 # 1 hour cache

# Cache control from client
from openai import OpenAI

client = OpenAI(
    api_key="sk-key",
    base_url="http://localhost:4000",
)

# Use cache (default)
response = client.chat.completions.create(
    model="gpt-4o",
    messages=[{"role": "user", "content": "What is Python?"}],
)

# Skip cache
response = client.chat.completions.create(
    model="gpt-4o",
    messages=[{"role": "user", "content": "What is Python?"}],
    extra_body={"cache": {"no-cache": True}},
)

4. Cost Tracking

4.1 Automatic Cost Calculation

LiteLLM automatically calculates the cost of each request.

import litellm

response = litellm.completion(
    model="gpt-4o",
    messages=[{"role": "user", "content": "Hello!"}],
)

# Cost information
print(f"Input tokens: {response.usage.prompt_tokens}")
print(f"Output tokens: {response.usage.completion_tokens}")

# LiteLLM cost calculation
from litellm import completion_cost

cost = completion_cost(completion_response=response)
print(f"Cost: ${cost:.6f}")
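Under the hood the calculation is just token counts multiplied by per-token prices. A sketch with an illustrative price table (the numbers below are placeholders, not current rates; LiteLLM ships a real, regularly updated price map):

```python
# Illustrative USD prices per 1M tokens -- placeholders, not current rates
PRICES = {
    "gpt-4o": {"input": 2.50, "output": 10.00},
    "gpt-4o-mini": {"input": 0.15, "output": 0.60},
}

def estimate_cost(model: str, prompt_tokens: int, completion_tokens: int) -> float:
    """Cost = input tokens * input price + output tokens * output price."""
    p = PRICES[model]
    return (prompt_tokens * p["input"] + completion_tokens * p["output"]) / 1_000_000

cost = estimate_cost("gpt-4o", prompt_tokens=1_000, completion_tokens=500)
print(f"${cost:.6f}")  # $0.007500
```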

4.2 Cost Queries via Proxy

# Total spend
curl http://localhost:4000/global/spend \
  -H "Authorization: Bearer sk-master-key-1234"

# Per-key spend
curl "http://localhost:4000/global/spend?api_key=sk-key-abc" \
  -H "Authorization: Bearer sk-master-key-1234"

# Per-model spend
curl "http://localhost:4000/global/spend?model=gpt-4o" \
  -H "Authorization: Bearer sk-master-key-1234"

# Per-team spend
curl "http://localhost:4000/team/info?team_id=team-backend" \
  -H "Authorization: Bearer sk-master-key-1234"

# Spend by date range
curl "http://localhost:4000/global/spend/logs?start_date=2026-03-01&end_date=2026-03-20" \
  -H "Authorization: Bearer sk-master-key-1234"

4.3 Budget Alert Configuration

# config.yaml
general_settings:
  alerting:
    - slack
  alerting_threshold: 300 # Alert if no response within 300 seconds
  alert_types:
    - budget_alerts # On budget exceeded
    - spend_reports # Weekly/monthly cost reports
    - failed_tracking # Failed request tracking

environment_variables:
  SLACK_WEBHOOK_URL: os.environ/SLACK_WEBHOOK_URL

5. Production Deployment

5.1 Docker Compose

# docker-compose.yml
version: '3.8'

services:
  litellm:
    image: ghcr.io/berriai/litellm:main-latest
    container_name: litellm-proxy
    ports:
      - '4000:4000'
    volumes:
      - ./config.yaml:/app/config.yaml
    environment:
      - OPENAI_API_KEY=sk-xxx
      - ANTHROPIC_API_KEY=sk-ant-xxx
      - AZURE_API_KEY=xxx
      - AWS_ACCESS_KEY_ID=xxx
      - AWS_SECRET_ACCESS_KEY=xxx
      - DATABASE_URL=postgresql://litellm:password@postgres:5432/litellm
      - REDIS_HOST=redis
      - REDIS_PORT=6379
    command: --config /app/config.yaml --port 4000
    depends_on:
      postgres:
        condition: service_healthy
      redis:
        condition: service_started
    healthcheck:
      test: ['CMD', 'curl', '-f', 'http://localhost:4000/health']
      interval: 30s
      timeout: 10s
      retries: 3

  postgres:
    image: postgres:16-alpine
    container_name: litellm-postgres
    environment:
      POSTGRES_DB: litellm
      POSTGRES_USER: litellm
      POSTGRES_PASSWORD: password
    volumes:
      - postgres_data:/var/lib/postgresql/data
    healthcheck:
      test: ['CMD-SHELL', 'pg_isready -U litellm']
      interval: 5s
      timeout: 5s
      retries: 5

  redis:
    image: redis:7-alpine
    container_name: litellm-redis
    ports:
      - '6379:6379'
    volumes:
      - redis_data:/data

volumes:
  postgres_data:
  redis_data:

5.2 Kubernetes Helm Chart

# Add Helm repository
helm repo add litellm https://berriai.github.io/litellm/
helm repo update

# Install
helm install litellm litellm/litellm-helm \
  --namespace litellm \
  --create-namespace \
  --values values.yaml

# values.yaml
replicaCount: 3

image:
  repository: ghcr.io/berriai/litellm
  tag: main-latest

service:
  type: ClusterIP
  port: 4000

ingress:
  enabled: true
  className: nginx
  hosts:
    - host: litellm.internal.company.com
      paths:
        - path: /
          pathType: Prefix

resources:
  requests:
    cpu: 500m
    memory: 512Mi
  limits:
    cpu: 2000m
    memory: 2Gi

env:
  - name: OPENAI_API_KEY
    valueFrom:
      secretKeyRef:
        name: litellm-secrets
        key: openai-api-key
  - name: ANTHROPIC_API_KEY
    valueFrom:
      secretKeyRef:
        name: litellm-secrets
        key: anthropic-api-key
  - name: DATABASE_URL
    valueFrom:
      secretKeyRef:
        name: litellm-secrets
        key: database-url

postgresql:
  enabled: true
  auth:
    database: litellm
    username: litellm

redis:
  enabled: true

5.3 Health Check and Metrics

# Health Check
curl http://localhost:4000/health

# Prometheus Metrics
curl http://localhost:4000/metrics

Key Prometheus Metrics:

litellm_requests_total: Total request count
litellm_request_duration_seconds: Request processing time
litellm_tokens_total: Total token usage
litellm_spend_total: Total spend
litellm_errors_total: Error count
litellm_cache_hits_total: Cache hit count
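To collect these, a minimal Prometheus scrape config can point at the proxy's /metrics endpoint (the job name and target address are assumptions for your environment; add an authorization section if your deployment protects /metrics):

```yaml
# prometheus.yml -- scrape the proxy's /metrics endpoint
scrape_configs:
  - job_name: litellm-proxy
    metrics_path: /metrics
    static_configs:
      - targets: ['litellm-proxy:4000']
```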

5.4 Logging Integration

# config.yaml - External logging service integration
litellm_settings:
  success_callback: ['langfuse']
  failure_callback: ['langfuse']

environment_variables:
  LANGFUSE_PUBLIC_KEY: os.environ/LANGFUSE_PUBLIC_KEY
  LANGFUSE_SECRET_KEY: os.environ/LANGFUSE_SECRET_KEY
  LANGFUSE_HOST: https://cloud.langfuse.com

Supported Logging Services:

Service         | Purpose
--------------- | ---------------------------------------
Langfuse        | LLM observability, prompt management
Helicone        | Request logging, cost analysis
Lunary          | LLM monitoring
Custom Callback | Integration with custom logging systems

# Custom Callback example
import litellm
from litellm import completion_cost

def my_custom_callback(kwargs, completion_response, start_time, end_time):
    # Called after every successful request
    model = kwargs.get("model")
    messages = kwargs.get("messages")
    cost = completion_cost(completion_response=completion_response)

    # Custom logic (DB storage, alerts, etc.) -- log_to_database is your own function
    log_to_database(
        model=model,
        cost=cost,
        latency=(end_time - start_time).total_seconds(),
        tokens=completion_response.usage.total_tokens,
    )

litellm.success_callback = [my_custom_callback]

6. Real-World Use Cases

6.1 Enterprise AI Gateway

Centralize all LLM calls across the organization through LiteLLM Proxy.

+------------------+
| Frontend App     |----+
+------------------+    |
                        |     +----------------+
+------------------+    +---->|                |     +----------+
| Backend Service  |----+     | LiteLLM Proxy  |---->| OpenAI   |
+------------------+    |     |                |     +----------+
                        |     | - Auth         |
+------------------+    |     | - Rate Limit   |     +----------+
| Data Pipeline    |----+     | - Cost Track   |---->| Anthropic|
+------------------+    |     | - Audit Log    |     +----------+
                        |     |                |
+------------------+    |     +--------+-------+     +----------+
| Internal Tools   |----+              |             | Azure    |
+------------------+                   v             +----------+
                              +--------+-------+
                              | PostgreSQL     |
                              | (spend logs)   |
                              +----------------+

6.2 A/B Testing

from openai import OpenAI
import random

client = OpenAI(
    api_key="sk-proxy-key",
    base_url="http://litellm-proxy:4000",
)

def get_completion_with_ab_test(prompt: str, test_name: str):
    # 50/50 A/B test
    model = random.choice(["gpt-4o", "claude-sonnet"])

    response = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": prompt}],
        extra_body={
            "metadata": {
                "test_name": test_name,
                "variant": model,
            }
        },
    )

    return {
        "model": model,
        "content": response.choices[0].message.content,
        "tokens": response.usage.total_tokens,
    }

6.3 Cost-Optimized Routing

# Reuses the OpenAI `client` pointed at the proxy from section 6.2
def smart_route(prompt: str, complexity: str = "auto"):
    """Select an appropriate model based on prompt complexity"""

    if complexity == "auto":
        # Simple heuristic: based on token count and keywords
        word_count = len(prompt.split())
        if word_count < 50:
            complexity = "simple"
        elif any(kw in prompt.lower() for kw in
                 ["analyze", "compare", "complex", "detailed"]):
            complexity = "complex"
        else:
            complexity = "medium"

    model_map = {
        "simple": "gpt-4o-mini",   # Cheap model
        "medium": "claude-sonnet", # Mid performance/price
        "complex": "gpt-4o",       # High performance model
    }

    model = model_map[complexity]

    response = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": prompt}],
    )

    return response

6.4 Disaster Recovery (Automatic Failover)

# config.yaml - Multi-provider failover
model_list:
  # Primary: OpenAI
  - model_name: main-model
    litellm_params:
      model: openai/gpt-4o
      api_key: os.environ/OPENAI_API_KEY

  # Secondary: Azure OpenAI (different region)
  - model_name: main-model-fallback-1
    litellm_params:
      model: azure/gpt-4o
      api_base: https://eastus.openai.azure.com
      api_key: os.environ/AZURE_KEY

  # Tertiary: Anthropic Claude
  - model_name: main-model-fallback-2
    litellm_params:
      model: anthropic/claude-sonnet-4-20250514
      api_key: os.environ/ANTHROPIC_API_KEY

router_settings:
  fallbacks: [{ 'main-model': ['main-model-fallback-1', 'main-model-fallback-2'] }]
  num_retries: 2
  timeout: 30
  allowed_fails: 3
  cooldown_time: 60 # 60 second cooldown for failed models

7. Comparison: LiteLLM vs Alternatives

7.1 Tool Comparison

Feature        | LiteLLM       | LangChain             | OpenRouter  | Portkey
-------------- | ------------- | --------------------- | ----------- | ------------------
Type           | Gateway + SDK | Framework             | Hosted API  | Hosted Gateway
Hosting        | Self-hosted   | N/A (library)         | Cloud       | Cloud + Self
Model Count    | 100+          | Various               | 200+        | 250+
Cost Tracking  | Built-in      | Requires custom impl  | Yes         | Yes
Rate Limiting  | Built-in      | None                  | Yes         | Yes
Load Balancing | Built-in      | None                  | Yes         | Yes
Fallback       | Built-in      | Manual implementation | Yes         | Yes
API Key Mgmt   | Virtual Keys  | None                  | None        | Yes
Pricing        | Free (OSS)    | Free (OSS)            | Markup      | Free + Enterprise
Data Privacy   | Full control  | Full control          | Third-party | Third-party

7.2 When to Choose Which Tool

Choose LiteLLM when:
  - Data privacy is important (finance, healthcare, government)
  - Must operate on own infrastructure
  - Cost tracking and rate limiting are needed
  - Already using multiple providers

Choose LangChain when:
  - Building complex LLM pipelines (RAG, Agents)
  - Need prompt chaining, memory management
  - (LiteLLM and LangChain can be used together)

Choose OpenRouter when:
  - Rapid prototyping
  - Don't want to manage infrastructure
  - Single API key for all models

Choose Portkey when:
  - Enterprise-level management UI needed
  - Advanced features like guardrails, A/B testing needed
  - Prefer managed services

8. Practical Tips

8.1 Environment Variable Management

# .env file (never commit to Git)
OPENAI_API_KEY=sk-xxx
ANTHROPIC_API_KEY=sk-ant-xxx
AZURE_API_KEY=xxx
AWS_ACCESS_KEY_ID=xxx
AWS_SECRET_ACCESS_KEY=xxx
DATABASE_URL=postgresql://litellm:password@localhost:5432/litellm
LITELLM_MASTER_KEY=sk-master-key-change-me

8.2 Model Alias Configuration

# config.yaml
model_list:
  - model_name: fast
    litellm_params:
      model: gpt-4o-mini
      api_key: os.environ/OPENAI_API_KEY
  - model_name: smart
    litellm_params:
      model: gpt-4o
      api_key: os.environ/OPENAI_API_KEY
  - model_name: creative
    litellm_params:
      model: anthropic/claude-sonnet-4-20250514
      api_key: os.environ/ANTHROPIC_API_KEY
# Call with meaningful names
response = client.chat.completions.create(
    model="fast",  # gpt-4o-mini
    messages=[{"role": "user", "content": "Quick question"}],
)

response = client.chat.completions.create(
    model="smart",  # gpt-4o
    messages=[{"role": "user", "content": "Complex analysis"}],
)

8.3 Error Handling Patterns

from openai import OpenAI, APIError, RateLimitError, APITimeoutError

client = OpenAI(
    api_key="sk-proxy-key",
    base_url="http://litellm-proxy:4000",
)

def safe_completion(messages, model="gpt-4o", max_retries=3):
    for attempt in range(max_retries):
        try:
            response = client.chat.completions.create(
                model=model,
                messages=messages,
                timeout=30,
            )
            return response
        except RateLimitError:
            # LiteLLM Proxy handles rate limits, but client should too
            import time
            wait = 2 ** attempt
            print(f"Rate limited, waiting {wait}s...")
            time.sleep(wait)
        except APITimeoutError:
            print(f"Timeout on attempt {attempt + 1}")
            if attempt == max_retries - 1:
                raise
        except APIError as e:
            print(f"API error: {e}")
            raise

    return None

9. Conclusion

LiteLLM serves as an essential AI Gateway in multi-LLM environments.

Key Takeaways:

  1. Unified SDK: Call 100+ LLMs through a single completion() function
  2. Proxy Server: Centralized management via OpenAI-compatible API Gateway
  3. Cost Control: Automatic cost tracking, budget management, alerts
  4. Reliability: Built-in load balancing, fallback, rate limiting
  5. Production: Docker/Kubernetes deployment, Prometheus monitoring, external logging

Especially in enterprise environments using multiple LLM providers, deploying LiteLLM Proxy enables consistent centralized handling of API key management, cost tracking, and failure recovery.

