[Architecture] Complete Guide to LiteLLM: Unified Serving of 100+ LLMs and Cost Management

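Overview

As more services build on LLMs (Large Language Models), managing models from multiple providers efficiently has become important. When each provider (OpenAI, Anthropic, Azure OpenAI, AWS Bedrock, and others) uses a different SDK and API format, code complexity grows quickly.

LiteLLM is an open-source tool that addresses this problem by exposing an OpenAI SDK-compatible interface that unifies 100+ LLMs behind a single API. This guide covers LiteLLM SDK usage, Proxy Server setup, cost tracking, and production deployment.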

1. What Is LiteLLM

1.1 Core Value

LiteLLM is an open-source project from BerriAI that provides two core components.

1. Python SDK: call 100+ LLMs through a unified interface

litellm.completion()
  |
  +-- model="gpt-4o"           --> OpenAI API
  +-- model="claude-sonnet-4-20250514"  --> Anthropic API
  +-- model="azure/gpt-4o"     --> Azure OpenAI
  +-- model="bedrock/claude-3" --> AWS Bedrock
  +-- model="vertex_ai/gemini" --> Google Vertex AI
  +-- model="ollama/llama3"    --> Local Ollama

2. Proxy Server (AI Gateway): an OpenAI-compatible REST API server

Any OpenAI Client --> LiteLLM Proxy --> Multiple LLM Providers
                      (Rate Limiting, Cost Tracking, Load Balancing)
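The SDK diagram above relies on a naming convention: an optional provider prefix, a slash, then the provider's own model name. A minimal sketch of that convention follows. The helper `split_model_string` and the "unprefixed" label are hypothetical and only illustrate the string format; LiteLLM resolves unprefixed names such as "gpt-4o" through its own model registry, which is not reproduced here.

```python
def split_model_string(model: str) -> tuple[str, str]:
    """Split 'provider/model-name' into (provider, model-name).

    Hypothetical helper: strings without a prefix are labeled 'unprefixed'
    here; LiteLLM itself resolves those from its own model registry.
    """
    provider, sep, name = model.partition("/")
    if not sep:
        return ("unprefixed", model)
    return (provider, name)


for m in ["azure/gpt-4o", "bedrock/claude-3", "ollama/llama3", "gpt-4o"]:
    print(m, "->", split_model_string(m))
```

Because only the first slash is treated as the separator, provider-side names that contain dots or colons pass through unchanged.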

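1.2 Why LiteLLM

Problem	LiteLLM Solution
Different SDK per provider	Unified completion() function
Scattered API key management	Centralized management in the Proxy
Hard-to-track costs	Automatic cost calculation and tracking
Provider outages	Automatic fallback support
Rate limit management	Built-in rate limiting
Cost of switching models	Swap models without code changes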

1.3 Supported Providers

Commercial Providers:
  - OpenAI (GPT-4o, GPT-4o-mini, o1, o3, etc.)
  - Anthropic (Claude Sonnet, Claude Haiku, Claude Opus)
  - Azure OpenAI
  - AWS Bedrock (Claude, Titan, Llama)
  - Google Vertex AI (Gemini)
  - Google AI Studio
  - Cohere (Command R+)
  - Mistral AI
  - Together AI
  - Groq
  - Fireworks AI
  - Perplexity
  - DeepSeek

Self-Hosted / Local:
  - Ollama
  - vLLM
  - Hugging Face TGI
  - NVIDIA NIM
  - OpenAI-compatible endpoints

2. LiteLLM SDK Usage

2.1 Installation

pip install litellm

2.2 Basic Usage: completion()

import litellm

# OpenAI
response = litellm.completion(
    model="gpt-4o",
    messages=[
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": "What is Kubernetes?"},
    ],
    temperature=0.7,
    max_tokens=1000,
)
print(response.choices[0].message.content)

# Anthropic Claude
response = litellm.completion(
    model="claude-sonnet-4-20250514",
    messages=[
        {"role": "user", "content": "Explain microservices architecture."},
    ],
    max_tokens=1000,
)
print(response.choices[0].message.content)

# Azure OpenAI
response = litellm.completion(
    model="azure/gpt-4o-deployment",
    messages=[
        {"role": "user", "content": "Hello!"},
    ],
    api_base="https://my-resource.openai.azure.com",
    api_version="2024-02-15-preview",
    api_key="your-azure-key",
)

# AWS Bedrock
response = litellm.completion(
    model="bedrock/anthropic.claude-3-sonnet-20240229-v1:0",
    messages=[
        {"role": "user", "content": "Summarize this text."},
    ],
)

# Ollama (local)
response = litellm.completion(
    model="ollama/llama3",
    messages=[
        {"role": "user", "content": "Write a Python function."},
    ],
    api_base="http://localhost:11434",
)

2.3 Streaming

import litellm

# Synchronous streaming
response = litellm.completion(
    model="gpt-4o",
    messages=[{"role": "user", "content": "Write a long essay."}],
    stream=True,
)

for chunk in response:
    content = chunk.choices[0].delta.content
    if content:
        print(content, end="", flush=True)
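Streaming responses arrive as small deltas, so code that needs the full text has to join them itself. The sketch below shows that accumulation step offline. The `fake_chunk` objects are stand-ins built with `SimpleNamespace` that only mimic the `choices[0].delta.content` attribute path used above; `collect_stream` and `fake_chunk` are hypothetical names, not LiteLLM functions.

```python
from types import SimpleNamespace


def collect_stream(chunks) -> str:
    """Join the non-empty delta.content pieces of a chunk stream."""
    parts = []
    for chunk in chunks:
        content = chunk.choices[0].delta.content
        if content:  # skip chunks whose content is None or empty
            parts.append(content)
    return "".join(parts)


def fake_chunk(text):
    # Stand-in with the same attribute path as a streaming chunk.
    delta = SimpleNamespace(content=text)
    return SimpleNamespace(choices=[SimpleNamespace(delta=delta)])


demo = [fake_chunk("Hello"), fake_chunk(", "), fake_chunk("world"), fake_chunk(None)]
print(collect_stream(demo))
```

The same function should work on a real stream as long as each chunk exposes `choices[0].delta.content`, which is the shape the loop above already assumes.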

2.4 Async Calls

import asyncio
import litellm

async def main():
    # Single async call
    response = await litellm.acompletion(
        model="gpt-4o",
        messages=[{"role": "user", "content": "Hello!"}],
    )
    print(response.choices[0].message.content)

    # Async streaming
    response = await litellm.acompletion(
        model="claude-sonnet-4-20250514",
        messages=[{"role": "user", "content": "Explain Docker."}],
        stream=True,
    )
    async for chunk in response:
        content = chunk.choices[0].delta.content
        if content:
            print(content, end="", flush=True)

    # Parallel calls
    tasks = [
        litellm.acompletion(
            model="gpt-4o-mini",
            messages=[{"role": "user", "content": f"Question {i}"}],
        )
        for i in range(5)
    ]
    responses = await asyncio.gather(*tasks)
    for resp in responses:
        print(resp.choices[0].message.content[:50])

asyncio.run(main())

2.5 Function Calling (Tool Use)

import litellm
import json

tools = [
    {
        "type": "function",
        "function": {
            "name": "get_weather",
            "description": "Get the current weather for a location",
            "parameters": {
                "type": "object",
                "properties": {
                    "location": {
                        "type": "string",
                        "description": "City name",
                    },
                    "unit": {
                        "type": "string",
                        "enum": ["celsius", "fahrenheit"],
                    },
                },
                "required": ["location"],
            },
        },
    }
]

# Same interface for both OpenAI and Claude
for model in ["gpt-4o", "claude-sonnet-4-20250514"]:
    response = litellm.completion(
        model=model,
        messages=[
            {"role": "user", "content": "What's the weather in Seoul?"}
        ],
        tools=tools,
        tool_choice="auto",
    )

    if response.choices[0].message.tool_calls:
        tool_call = response.choices[0].message.tool_calls[0]
        print(f"Model: {model}")
        print(f"Function: {tool_call.function.name}")
        print(f"Arguments: {tool_call.function.arguments}")
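The model only returns the name of the function and a JSON string of arguments; running the function is the caller's job. A minimal sketch of that dispatch step is shown below. The local `get_weather` implementation, the `TOOL_REGISTRY` mapping, and the `fake_call` object are all hypothetical stand-ins, and the fixed temperature is made-up demo data.

```python
import json
from types import SimpleNamespace


def get_weather(location: str, unit: str = "celsius") -> dict:
    # Hypothetical local implementation; a real one would call a weather API.
    return {"location": location, "unit": unit, "temperature": 21}


TOOL_REGISTRY = {"get_weather": get_weather}


def run_tool_call(tool_call) -> str:
    """Decode the JSON argument string and invoke the matching function."""
    func = TOOL_REGISTRY[tool_call.function.name]
    args = json.loads(tool_call.function.arguments)
    return json.dumps(func(**args))


# Stand-in with the same attribute path as tool_calls[0] above.
fake_call = SimpleNamespace(
    function=SimpleNamespace(name="get_weather", arguments='{"location": "Seoul"}')
)
result = run_tool_call(fake_call)
print(result)
```

In a full loop, the returned string would be sent back to the model in a follow-up message with role "tool" so it can produce the final answer.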

2.6 Embedding

import litellm

# OpenAI Embedding
response = litellm.embedding(
    model="text-embedding-3-small",
    input=["Hello world", "How are you?"],
)
print(f"Embedding dimension: {len(response.data[0]['embedding'])}")

# Cohere Embedding
response = litellm.embedding(
    model="cohere/embed-english-v3.0",
    input=["Search query text"],
    input_type="search_query",
)

# Bedrock Embedding
response = litellm.embedding(
    model="bedrock/amazon.titan-embed-text-v2:0",
    input=["Document text for embedding"],
)

2.7 Image/Vision Models

import litellm

# GPT-4o Vision
response = litellm.completion(
    model="gpt-4o",
    messages=[
        {
            "role": "user",
            "content": [
                {"type": "text", "text": "What is in this image?"},
                {
                    "type": "image_url",
                    "image_url": {
                        "url": "https://example.com/image.png",
                    },
                },
            ],
        }
    ],
)

# Claude Vision
response = litellm.completion(
    model="claude-sonnet-4-20250514",
    messages=[
        {
            "role": "user",
            "content": [
                {"type": "text", "text": "Describe this architecture diagram."},
                {
                    "type": "image_url",
                    "image_url": {
                        "url": "data:image/png;base64,iVBORw0KGgo...",
                    },
                },
            ],
        }
    ],
)

3. LiteLLM Proxy Server (AI Gateway)

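3.1 What Is the Proxy

LiteLLM Proxy is a self-hostable, OpenAI-compatible API gateway. Any client that already uses the OpenAI SDK can connect to it without changing application logic; only the base URL and API key need to point at the Proxy.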

+-------------------+
| Your Application  |
| (OpenAI SDK)      |
+--------+----------+
         |
         v
+--------+----------+
| LiteLLM Proxy     |
| - Rate Limiting   |
| - Cost Tracking   |
| - Load Balancing  |
| - Fallback        |
| - Key Management  |
+--------+----------+
         |
    +----+----+----+----+
    |    |    |    |    |
    v    v    v    v    v
  OpenAI Azure Anthropic Bedrock Ollama

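3.2 Installation and Startup

# Install with pip
pip install 'litellm[proxy]'

# Basic startup
litellm --model gpt-4o --port 4000

# Start with a config file
litellm --config config.yaml --port 4000

# Run with Docker (an absolute host path keeps the bind mount portable across Docker versions)
docker run -d \
  --name litellm-proxy \
  -p 4000:4000 \
  -v "$(pwd)/config.yaml:/app/config.yaml" \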
  -e OPENAI_API_KEY=sk-xxx \
  -e ANTHROPIC_API_KEY=sk-ant-xxx \
  ghcr.io/berriai/litellm:main-latest \
  --config /app/config.yaml

3.3 config.yaml Settings

# config.yaml
model_list:
  # OpenAI model
  - model_name: gpt-4o
    litellm_params:
      model: openai/gpt-4o
      api_key: os.environ/OPENAI_API_KEY
      rpm: 500 # Rate limit: requests per minute
      tpm: 100000 # Rate limit: tokens per minute

  # Claude model (single deployment)
  - model_name: claude-sonnet
    litellm_params:
      model: anthropic/claude-sonnet-4-20250514
      api_key: os.environ/ANTHROPIC_API_KEY
      rpm: 200
      tpm: 80000

  # Azure OpenAI (shares model_name gpt-4o with the OpenAI entry above, so the two are load-balanced)
  - model_name: gpt-4o
    litellm_params:
      model: azure/gpt-4o-deployment
      api_base: https://my-resource.openai.azure.com
      api_version: '2024-02-15-preview'
      api_key: os.environ/AZURE_API_KEY
      rpm: 300

  # AWS Bedrock Claude
  - model_name: bedrock-claude
    litellm_params:
      model: bedrock/anthropic.claude-3-sonnet-20240229-v1:0
      aws_access_key_id: os.environ/AWS_ACCESS_KEY_ID
      aws_secret_access_key: os.environ/AWS_SECRET_ACCESS_KEY
      aws_region_name: us-east-1

  # Local Ollama
  - model_name: local-llama
    litellm_params:
      model: ollama/llama3
      api_base: http://ollama:11434

# Router settings
router_settings:
  routing_strategy: 'latency-based-routing'
  num_retries: 3
  timeout: 60
  allowed_fails: 2
  cooldown_time: 30

# General settings
general_settings:
  master_key: sk-master-key-1234
  database_url: os.environ/DATABASE_URL
  store_model_in_db: true

litellm_settings:
  drop_params: true
  set_verbose: false
  cache: true
  cache_params:
    type: redis
    host: redis
    port: 6379

3.4 Model Routing and Load Balancing

# Registering several deployments under the same model_name enables automatic load balancing
model_list:
  # gpt-4o group: 3 deployments
  - model_name: gpt-4o
    litellm_params:
      model: openai/gpt-4o
      api_key: os.environ/OPENAI_API_KEY_1
  - model_name: gpt-4o
    litellm_params:
      model: azure/gpt-4o-east
      api_base: https://east.openai.azure.com
      api_key: os.environ/AZURE_KEY_EAST
  - model_name: gpt-4o
    litellm_params:
      model: azure/gpt-4o-west
      api_base: https://west.openai.azure.com
      api_key: os.environ/AZURE_KEY_WEST

router_settings:
  # Routing strategy
  routing_strategy: 'latency-based-routing'
  # Options:
  #   simple-shuffle: random selection
  #   least-busy: fewest in-flight requests
  #   usage-based-routing: based on TPM/RPM usage
  #   latency-based-routing: based on response time (recommended)

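Routing strategy comparison:

Strategy	Description	Best suited for
simple-shuffle	Random distribution	Deployments with similar performance
least-busy	Number of in-flight requests	Workloads with widely varying request durations
usage-based-routing	RPM/TPM usage	Deployments running close to their rate limits
latency-based-routing	Response time	Workloads where latency matters most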
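To make the difference between the strategies concrete, here is a small illustrative selector. It is a sketch under stated assumptions, not LiteLLM's router: the `Deployment` dataclass, the `pick` function, and the sample numbers are hypothetical, and usage-based routing is left out because it would need TPM/RPM accounting.

```python
import random
from dataclasses import dataclass


@dataclass
class Deployment:
    name: str
    in_flight: int = 0          # requests currently being served
    avg_latency_s: float = 0.0  # rolling average response time


def pick(deployments, strategy, rng=random):
    """Choose one deployment according to the named strategy."""
    if strategy == "simple-shuffle":
        return rng.choice(deployments)
    if strategy == "least-busy":
        return min(deployments, key=lambda d: d.in_flight)
    if strategy == "latency-based-routing":
        return min(deployments, key=lambda d: d.avg_latency_s)
    raise ValueError(f"unknown strategy: {strategy}")


pool = [
    Deployment("openai", in_flight=4, avg_latency_s=0.9),
    Deployment("azure-east", in_flight=1, avg_latency_s=1.4),
    Deployment("azure-west", in_flight=3, avg_latency_s=0.6),
]
print(pick(pool, "least-busy").name)
print(pick(pool, "latency-based-routing").name)
```

With these sample numbers the two strategies disagree: the least-loaded deployment is not the fastest one, which is exactly the trade-off the table describes.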

3.5 Fallback Settings

model_list:
  - model_name: primary-model
    litellm_params:
      model: openai/gpt-4o
      api_key: os.environ/OPENAI_API_KEY

  - model_name: fallback-model
    litellm_params:
      model: anthropic/claude-sonnet-4-20250514
      api_key: os.environ/ANTHROPIC_API_KEY

router_settings:
  num_retries: 2
  timeout: 30
  fallbacks: [{ 'primary-model': ['fallback-model'] }]
  # Per-error-type retry counts
  retry_policy:
    RateLimitErrorRetries: 3 # retry 429 errors up to 3 times
    ContentPolicyViolationErrorRetries: 0 # do not retry content-policy violations
    AuthenticationErrorRetries: 0 # do not retry authentication errors
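The interaction between `num_retries` and `fallbacks` can be sketched as two nested loops: retry the current model group first, then move down the fallback list. The code below is an illustration only. `call_with_fallbacks`, `RateLimited`, and `flaky_call` are hypothetical names, and the real router also applies timeouts, cooldowns, and per-error retry counts that are omitted here.

```python
class RateLimited(Exception):
    """Stand-in for a retryable provider error."""


def call_with_fallbacks(call, model, fallbacks, num_retries=2):
    """Try `model` up to 1 + num_retries times, then each fallback in order."""
    last_error = None
    for candidate in [model, *fallbacks.get(model, [])]:
        for _ in range(1 + num_retries):
            try:
                return candidate, call(candidate)
            except RateLimited as err:
                last_error = err
    raise last_error


attempts = []


def flaky_call(model_name):
    # Simulated provider: the primary always fails, the fallback succeeds.
    attempts.append(model_name)
    if model_name == "primary-model":
        raise RateLimited("429")
    return "ok"


used, result = call_with_fallbacks(
    flaky_call, "primary-model", {"primary-model": ["fallback-model"]}
)
print(used, result, attempts)
```

The `attempts` list makes the order visible: the primary is tried once plus `num_retries` times before the fallback receives any traffic.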

3.6 API Key Management (Virtual Keys)

# Create a virtual key with the master key
curl -X POST http://localhost:4000/key/generate \
  -H "Authorization: Bearer sk-master-key-1234" \
  -H "Content-Type: application/json" \
  -d '{
    "models": ["gpt-4o", "claude-sonnet"],
    "max_budget": 100.0,
    "budget_duration": "30d",
    "metadata": {
      "team": "backend",
      "user": "developer-1"
    },
    "tpm_limit": 50000,
    "rpm_limit": 100
  }'

Response:

{
  "key": "sk-generated-key-abc123",
  "expires": "2026-04-20T00:00:00Z",
  "max_budget": 100.0,
  "models": ["gpt-4o", "claude-sonnet"]
}

# Call the API with the generated key
from openai import OpenAI

client = OpenAI(
    api_key="sk-generated-key-abc123",
    base_url="http://localhost:4000",
)

response = client.chat.completions.create(
    model="gpt-4o",
    messages=[{"role": "user", "content": "Hello!"}],
)

3.7 Rate Limiting Settings

# Rate limiting in config.yaml
model_list:
  - model_name: gpt-4o
    litellm_params:
      model: openai/gpt-4o
      api_key: os.environ/OPENAI_API_KEY
      rpm: 500 # RPM at the model-deployment level
      tpm: 100000 # TPM at the model-deployment level

general_settings:
  master_key: sk-master-key-1234
  database_url: os.environ/DATABASE_URL

# Per-key rate limiting
curl -X POST http://localhost:4000/key/generate \
  -H "Authorization: Bearer sk-master-key-1234" \
  -H "Content-Type: application/json" \
  -d '{
    "rpm_limit": 50,
    "tpm_limit": 20000,
    "max_budget": 10.0,
    "budget_duration": "1d"
  }'

# Per-team rate limiting
curl -X POST http://localhost:4000/team/new \
  -H "Authorization: Bearer sk-master-key-1234" \
  -H "Content-Type: application/json" \
  -d '{
    "team_alias": "backend-team",
    "rpm_limit": 200,
    "tpm_limit": 80000,
    "max_budget": 500.0,
    "budget_duration": "30d"
  }'
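What an `rpm_limit` means in practice can be shown with a rolling-window counter. This is a minimal sketch, assuming a 60-second sliding window; the `RpmLimiter` class is hypothetical and is not how the Proxy is implemented (a multi-replica deployment would need shared state rather than in-process memory).

```python
from collections import deque


class RpmLimiter:
    """Allow at most `rpm_limit` requests in any rolling 60-second window."""

    def __init__(self, rpm_limit: int):
        self.rpm_limit = rpm_limit
        self.timestamps = deque()

    def allow(self, now: float) -> bool:
        # Drop timestamps that have left the 60-second window.
        while self.timestamps and now - self.timestamps[0] >= 60.0:
            self.timestamps.popleft()
        if len(self.timestamps) >= self.rpm_limit:
            return False  # the caller would answer with HTTP 429
        self.timestamps.append(now)
        return True


limiter = RpmLimiter(rpm_limit=2)
decisions = [limiter.allow(t) for t in (0.0, 1.0, 2.0, 61.0)]
print(decisions)
```

Passing `now` explicitly keeps the example deterministic; production code would read a monotonic clock instead.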

3.8 Budget Management

# Set a per-user budget
curl -X POST http://localhost:4000/user/new \
  -H "Authorization: Bearer sk-master-key-1234" \
  -H "Content-Type: application/json" \
  -d '{
    "user_id": "user-123",
    "max_budget": 50.0,
    "budget_duration": "30d",
    "models": ["gpt-4o-mini", "claude-sonnet"]
  }'

# Check budget usage (the URL is quoted so the shell does not interpret "?")
curl "http://localhost:4000/user/info?user_id=user-123" \
  -H "Authorization: Bearer sk-master-key-1234"

3.9 Caching

# config.yaml
litellm_settings:
  cache: true
  cache_params:
    type: redis
    host: redis
    port: 6379
    ttl: 3600 # cache entries for 1 hour

    # Semantic caching (cache hits for similar questions)
    # supported_call_types: ["acompletion", "completion"]
    # similarity_threshold: 0.8

# Controlling the cache from the client
from openai import OpenAI

client = OpenAI(
    api_key="sk-key",
    base_url="http://localhost:4000",
)

# Use the cache (default)
response = client.chat.completions.create(
    model="gpt-4o",
    messages=[{"role": "user", "content": "What is Python?"}],
)

# Skip the cache
response = client.chat.completions.create(
    model="gpt-4o",
    messages=[{"role": "user", "content": "What is Python?"}],
    extra_body={"cache": {"no-cache": True}},
)
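Exact-match caching works because identical requests can be reduced to the same key. The sketch below illustrates the idea with a hash of the model and messages plus a TTL check. It is an assumption-laden illustration: `cache_key` and `TtlCache` are hypothetical, an in-memory dict stands in for Redis, and LiteLLM's real key derivation may include more request fields.

```python
import hashlib
import json


def cache_key(model: str, messages: list) -> str:
    """Stable hash of the request fields that determine the answer."""
    payload = json.dumps({"model": model, "messages": messages}, sort_keys=True)
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()


class TtlCache:
    """In-memory stand-in for a Redis cache with per-entry expiry."""

    def __init__(self, ttl_s: float):
        self.ttl_s = ttl_s
        self.store = {}  # key -> (expires_at, value)

    def get(self, key, now):
        entry = self.store.get(key)
        if entry is None or now >= entry[0]:
            return None  # missing or expired
        return entry[1]

    def set(self, key, value, now):
        self.store[key] = (now + self.ttl_s, value)


cache = TtlCache(ttl_s=3600)
key = cache_key("gpt-4o", [{"role": "user", "content": "What is Python?"}])
cache.set(key, "cached answer", now=0)
print(cache.get(key, now=10), cache.get(key, now=3600))
```

Changing a single character of the prompt produces a different key, which is why exact-match caching misses on paraphrased questions and semantic caching exists as a separate option.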

4. Cost Tracking

4.1 Automatic Cost Calculation

LiteLLM automatically calculates the cost of each request.

import litellm

response = litellm.completion(
    model="gpt-4o",
    messages=[{"role": "user", "content": "Hello!"}],
)

# Token usage
print(f"Input tokens: {response.usage.prompt_tokens}")
print(f"Output tokens: {response.usage.completion_tokens}")

# Cost calculation with litellm
from litellm import completion_cost

cost = completion_cost(completion_response=response)
print(f"Cost: ${cost:.6f}")
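The arithmetic behind a per-request cost is simple: token counts multiplied by per-token prices, summed over input and output. The sketch below works through it with deliberately made-up prices. `estimate_cost` is a hypothetical helper, and the price values are placeholders rather than real pricing for any model; `completion_cost` looks up actual prices from LiteLLM's own price table.

```python
def estimate_cost(prompt_tokens, completion_tokens,
                  input_price_per_1m, output_price_per_1m):
    """Cost in USD: tokens / 1,000,000 * price per million, for both sides."""
    return (
        prompt_tokens / 1_000_000 * input_price_per_1m
        + completion_tokens / 1_000_000 * output_price_per_1m
    )


# Hypothetical prices, NOT real pricing for any model.
cost = estimate_cost(1200, 300, input_price_per_1m=2.0, output_price_per_1m=8.0)
print(f"${cost:.6f}")
```

Because output tokens are usually priced higher than input tokens, a short prompt with a long answer can cost more than the reverse.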

4.2 Querying Spend from the Proxy

# Total spend
curl http://localhost:4000/global/spend \
  -H "Authorization: Bearer sk-master-key-1234"

# Spend per key
curl "http://localhost:4000/global/spend?api_key=sk-key-abc" \
  -H "Authorization: Bearer sk-master-key-1234"

# Spend per model
curl "http://localhost:4000/global/spend?model=gpt-4o" \
  -H "Authorization: Bearer sk-master-key-1234"

# Spend per team
curl "http://localhost:4000/team/info?team_id=team-backend" \
  -H "Authorization: Bearer sk-master-key-1234"

# Spend for a date range
curl "http://localhost:4000/global/spend/logs?start_date=2026-03-01&end_date=2026-03-20" \
  -H "Authorization: Bearer sk-master-key-1234"

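4.3 Spend Alert Settings

# config.yaml
general_settings:
  alerting:
    - slack
  alerting_threshold: 300 # alert if a request gets no response within 300 seconds
  alert_types:
    - budget_alerts # when a budget is exceeded
    - spend_reports # weekly/monthly spend reports
    - failed_tracking_spend # requests whose spend could not be tracked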

environment_variables:
  SLACK_WEBHOOK_URL: os.environ/SLACK_WEBHOOK_URL

5. Production Deployment

5.1 Docker Compose

# docker-compose.yml
version: '3.8'

services:
  litellm:
    image: ghcr.io/berriai/litellm:main-latest
    container_name: litellm-proxy
    ports:
      - '4000:4000'
    volumes:
      - ./config.yaml:/app/config.yaml
    environment:
      - OPENAI_API_KEY=sk-xxx
      - ANTHROPIC_API_KEY=sk-ant-xxx
      - AZURE_API_KEY=xxx
      - AWS_ACCESS_KEY_ID=xxx
      - AWS_SECRET_ACCESS_KEY=xxx
      - DATABASE_URL=postgresql://litellm:password@postgres:5432/litellm
      - REDIS_HOST=redis
      - REDIS_PORT=6379
    command: --config /app/config.yaml --port 4000
    depends_on:
      postgres:
        condition: service_healthy
      redis:
        condition: service_started
    healthcheck:
      # /health/liveliness answers without an API key; /health requires one
      test: ['CMD', 'curl', '-f', 'http://localhost:4000/health/liveliness']
      interval: 30s
      timeout: 10s
      retries: 3

  postgres:
    image: postgres:16-alpine
    container_name: litellm-postgres
    environment:
      POSTGRES_DB: litellm
      POSTGRES_USER: litellm
      POSTGRES_PASSWORD: password
    volumes:
      - postgres_data:/var/lib/postgresql/data
    healthcheck:
      test: ['CMD-SHELL', 'pg_isready -U litellm']
      interval: 5s
      timeout: 5s
      retries: 5

  redis:
    image: redis:7-alpine
    container_name: litellm-redis
    ports:
      - '6379:6379'
    volumes:
      - redis_data:/data

volumes:
  postgres_data:
  redis_data:

5.2 Kubernetes Helm Chart

# Add the Helm repository
helm repo add litellm https://berriai.github.io/litellm/
helm repo update

# Install
helm install litellm litellm/litellm-helm \
  --namespace litellm \
  --create-namespace \
  --values values.yaml

# values.yaml
replicaCount: 3

image:
  repository: ghcr.io/berriai/litellm
  tag: main-latest

service:
  type: ClusterIP
  port: 4000

ingress:
  enabled: true
  className: nginx
  hosts:
    - host: litellm.internal.company.com
      paths:
        - path: /
          pathType: Prefix

resources:
  requests:
    cpu: 500m
    memory: 512Mi
  limits:
    cpu: 2000m
    memory: 2Gi

env:
  - name: OPENAI_API_KEY
    valueFrom:
      secretKeyRef:
        name: litellm-secrets
        key: openai-api-key
  - name: ANTHROPIC_API_KEY
    valueFrom:
      secretKeyRef:
        name: litellm-secrets
        key: anthropic-api-key
  - name: DATABASE_URL
    valueFrom:
      secretKeyRef:
        name: litellm-secrets
        key: database-url

configMap:
  config.yaml: |
    model_list:
      - model_name: gpt-4o
        litellm_params:
          model: openai/gpt-4o
          api_key: os.environ/OPENAI_API_KEY
      - model_name: claude-sonnet
        litellm_params:
          model: anthropic/claude-sonnet-4-20250514
          api_key: os.environ/ANTHROPIC_API_KEY
    router_settings:
      routing_strategy: latency-based-routing
      num_retries: 3

postgresql:
  enabled: true
  auth:
    database: litellm
    username: litellm

redis:
  enabled: true

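5.3 Health Checks and Metrics

# Health check (/health calls each configured model and requires a key)
curl http://localhost:4000/health \
  -H "Authorization: Bearer sk-master-key-1234"

# Prometheus metrics
curl http://localhost:4000/metrics

# Prometheus scrape config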
scrape_configs:
  - job_name: 'litellm'
    static_configs:
      - targets: ['litellm:4000']
    metrics_path: /metrics
    scrape_interval: 15s

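Key Prometheus metrics (exact names vary by LiteLLM version, so confirm them against the /metrics output of your deployment):

litellm_requests_total: total number of requests
litellm_request_duration_seconds: request processing time
litellm_tokens_total: total token usage
litellm_spend_total: total spend
litellm_errors_total: number of errors
litellm_cache_hits_total: number of cache hits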

5.4 Logging Integration

# config.yaml - integration with external logging services
litellm_settings:
  success_callback: ['langfuse']
  failure_callback: ['langfuse']

environment_variables:
  LANGFUSE_PUBLIC_KEY: os.environ/LANGFUSE_PUBLIC_KEY
  LANGFUSE_SECRET_KEY: os.environ/LANGFUSE_SECRET_KEY
  LANGFUSE_HOST: https://cloud.langfuse.com

Supported logging services:

Service	Purpose
Langfuse	LLM observability, prompt management
Helicone	Request logging, cost analysis
Lunary	LLM monitoring
Custom Callback	Integration with in-house logging systems

# Custom callback example
import litellm
from litellm import completion_cost

def log_to_database(**record):
    # Placeholder sink; replace with your own storage or alerting logic
    print(record)

def my_custom_callback(kwargs, completion_response, start_time, end_time):
    # Called for every successful request
    model = kwargs.get("model")
    cost = completion_cost(completion_response=completion_response)

    # Custom logic (store in a DB, send alerts, and so on)
    log_to_database(
        model=model,
        cost=cost,
        latency=(end_time - start_time).total_seconds(),
        tokens=completion_response.usage.total_tokens,
    )

litellm.success_callback = [my_custom_callback]

6. Practical Use Cases

6.1 Enterprise AI Gateway

Centralize every in-house LLM call through the LiteLLM Proxy.

+------------------+
| Frontend App     |----+
+------------------+    |
                        |     +----------------+
+------------------+    +---->|                |     +----------+
| Backend Service  |----+     | LiteLLM Proxy  |---->| OpenAI   |
+------------------+    |     |                |     +----------+
                        |     | - Auth         |
+------------------+    |     | - Rate Limit   |     +----------+
| Data Pipeline    |----+     | - Cost Track   |---->| Anthropic|
+------------------+    |     | - Audit Log    |     +----------+
                        |     |                |
+------------------+    |     +--------+-------+     +----------+
| Internal Tools   |----+              |             | Azure    |
+------------------+                   v             +----------+
                              +--------+-------+
                              | PostgreSQL     |
                              | (spend logs)   |
                              +----------------+

6.2 A/B Testing

from openai import OpenAI
import random

client = OpenAI(
    api_key="sk-proxy-key",
    base_url="http://litellm-proxy:4000",
)

def get_completion_with_ab_test(prompt: str, test_name: str):
    # 50/50 A/B test
    model = random.choice(["gpt-4o", "claude-sonnet"])

    response = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": prompt}],
        extra_body={
            "metadata": {
                "test_name": test_name,
                "variant": model,
            }
        },
    )

    return {
        "model": model,
        "content": response.choices[0].message.content,
        "tokens": response.usage.total_tokens,
    }

6.3 Cost-Optimized Routing

# Reuses the `client` created in section 6.2
def smart_route(prompt: str, complexity: str = "auto"):
    """Pick a model that matches the prompt's complexity."""

    if complexity == "auto":
        # Simple heuristic based on word count and keywords
        word_count = len(prompt.split())
        if word_count < 50:
            complexity = "simple"
        elif any(kw in prompt.lower() for kw in
                 ["analyze", "compare", "complex", "detailed"]):
            complexity = "complex"
        else:
            complexity = "medium"

    model_map = {
        "simple": "gpt-4o-mini",      # inexpensive model
        "medium": "claude-sonnet",      # mid-range performance and price
        "complex": "gpt-4o",           # high-performance model
    }

    model = model_map[complexity]

    response = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": prompt}],
    )

    return response

6.4 Disaster Recovery (Automatic Failover)

# config.yaml - multi-provider failover
model_list:
  # Primary: OpenAI
  - model_name: main-model
    litellm_params:
      model: openai/gpt-4o
      api_key: os.environ/OPENAI_API_KEY

  # Secondary: Azure OpenAI (different region)
  - model_name: main-model-fallback-1
    litellm_params:
      model: azure/gpt-4o
      api_base: https://eastus.openai.azure.com
      api_key: os.environ/AZURE_KEY

  # Tertiary: Anthropic Claude
  - model_name: main-model-fallback-2
    litellm_params:
      model: anthropic/claude-sonnet-4-20250514
      api_key: os.environ/ANTHROPIC_API_KEY

router_settings:
  fallbacks: [{ 'main-model': ['main-model-fallback-1', 'main-model-fallback-2'] }]
  num_retries: 2
  timeout: 30
  allowed_fails: 3
  cooldown_time: 60 # cool down a failed model for 60 seconds
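The `allowed_fails` and `cooldown_time` settings together act like a simple circuit breaker: once a deployment fails more often than allowed, it is skipped for a while. The sketch below is an illustration under simplifying assumptions. `CooldownTracker` is a hypothetical class, and it counts failures without the time window the real router applies.

```python
class CooldownTracker:
    """Take a deployment out of rotation after too many failures."""

    def __init__(self, allowed_fails: int, cooldown_time: float):
        self.allowed_fails = allowed_fails
        self.cooldown_time = cooldown_time
        self.fail_counts = {}
        self.cooldown_until = {}

    def record_failure(self, name: str, now: float) -> None:
        self.fail_counts[name] = self.fail_counts.get(name, 0) + 1
        if self.fail_counts[name] > self.allowed_fails:
            # Start the cooldown and reset the counter for the next cycle.
            self.cooldown_until[name] = now + self.cooldown_time
            self.fail_counts[name] = 0

    def is_available(self, name: str, now: float) -> bool:
        return now >= self.cooldown_until.get(name, 0.0)


tracker = CooldownTracker(allowed_fails=3, cooldown_time=60)
for t in range(4):  # four failures exceed allowed_fails=3
    tracker.record_failure("main-model", now=float(t))
print(tracker.is_available("main-model", now=10.0))
print(tracker.is_available("main-model", now=63.0))
```

While a deployment is cooling down, requests for it go to the configured fallbacks, which is what keeps a failing provider from absorbing retries.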

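7. Comparison: LiteLLM vs. Alternatives

7.1 Tool Comparison

Feature	LiteLLM	LangChain	OpenRouter	Portkey
Type	Gateway + SDK	Framework	Hosted API	Hosted Gateway
Hosting	Self-hosted	N/A (library)	Cloud	Cloud + Self
Model count	100+	Varies	200+	250+
Cost tracking	Built in	Must be implemented separately	Yes	Yes
Rate limiting	Built in	No	Yes	Yes
Load balancing	Built in	No	Yes	Yes
Fallback	Built in	Manual implementation	Yes	Yes
API key management	Virtual Keys	No	No	Yes
Price	Free (OSS)	Free (OSS)	Markup	Free + Enterprise
Data privacy	Full control	Full control	Routed through a third party	Routed through a third party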

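7.2 When to Choose Which Tool

Choose LiteLLM when:
  - Data privacy matters (finance, healthcare, government)
  - You must run on your own infrastructure
  - You need cost tracking and rate limiting
  - You already use several providers

Choose LangChain when:
  - You are building complex LLM pipelines such as RAG or agents
  - You need prompt chaining, memory management, and similar features
  - (LiteLLM and LangChain can be used together)

Choose OpenRouter when:
  - You want fast prototyping
  - You do not want to manage infrastructure
  - You want access to every model through a single API key

Choose Portkey when:
  - You need an enterprise-grade management UI
  - You need advanced features such as guardrails and A/B testing
  - You prefer a managed service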

8. Practical Tips

8.1 Managing Environment Variables

# .env file (never commit it to Git)
OPENAI_API_KEY=sk-xxx
ANTHROPIC_API_KEY=sk-ant-xxx
AZURE_API_KEY=xxx
AWS_ACCESS_KEY_ID=xxx
AWS_SECRET_ACCESS_KEY=xxx
DATABASE_URL=postgresql://litellm:password@localhost:5432/litellm
LITELLM_MASTER_KEY=sk-master-key-change-me

8.2 Model Aliases

# config.yaml
model_list:
  - model_name: fast
    litellm_params:
      model: gpt-4o-mini
      api_key: os.environ/OPENAI_API_KEY
  - model_name: smart
    litellm_params:
      model: gpt-4o
      api_key: os.environ/OPENAI_API_KEY
  - model_name: creative
    litellm_params:
      model: anthropic/claude-sonnet-4-20250514
      api_key: os.environ/ANTHROPIC_API_KEY

# Call models by meaningful names
response = client.chat.completions.create(
    model="fast",  # gpt-4o-mini
    messages=[{"role": "user", "content": "Quick question"}],
)

response = client.chat.completions.create(
    model="smart",  # gpt-4o
    messages=[{"role": "user", "content": "Complex analysis"}],
)

8.3 Error-Handling Pattern

import time

from openai import OpenAI, APIError, RateLimitError, APITimeoutError

client = OpenAI(
    api_key="sk-proxy-key",
    base_url="http://litellm-proxy:4000",
)

def safe_completion(messages, model="gpt-4o", max_retries=3):
    for attempt in range(max_retries):
        try:
            return client.chat.completions.create(
                model=model,
                messages=messages,
                timeout=30,
            )
        except RateLimitError:
            # The Proxy manages rate limits, but the client should still back off
            if attempt == max_retries - 1:
                raise
            wait = 2 ** attempt
            print(f"Rate limited, waiting {wait}s...")
            time.sleep(wait)
        except APITimeoutError:
            print(f"Timeout on attempt {attempt + 1}")
            if attempt == max_retries - 1:
                raise
        except APIError as e:
            # Other API errors are not retried
            print(f"API error: {e}")
            raise

    return None

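9. Conclusion

LiteLLM serves as the AI gateway that a multi-LLM environment needs.

Key takeaways:

Unified SDK: call 100+ LLMs through a single completion() function
Proxy Server: centralized management through an OpenAI-compatible API gateway
Cost control: automatic cost tracking, budget management, and alerts
Reliability: built-in load balancing, fallback, and rate limiting
Production: Docker/Kubernetes deployment, Prometheus monitoring, and external logging integrations

In enterprise environments that use several LLM providers, adopting LiteLLM Proxy makes it possible to handle API key management, cost tracking, and incident response consistently from one place.

References

LiteLLM official documentation: https://docs.litellm.ai/
LiteLLM GitHub: https://github.com/BerriAI/litellm
LiteLLM Proxy configuration guide: https://docs.litellm.ai/docs/proxy/configs
LiteLLM Docker deployment: https://docs.litellm.ai/docs/proxy/deploy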

[Architecture] Complete Guide to LiteLLM: Unified Serving of 100+ LLMs

Overview

As services leveraging LLMs (Large Language Models) proliferate, efficiently managing models from multiple providers has become critical. When each provider -- OpenAI, Anthropic, Azure OpenAI, AWS Bedrock -- uses different SDKs and API formats, code complexity increases rapidly.

LiteLLM is an open-source tool that solves this problem by providing an OpenAI SDK-compatible interface to unify 100+ LLMs into a single API. This post covers everything from LiteLLM SDK usage to Proxy Server setup, cost tracking, and production deployment.

1. What is LiteLLM

1.1 Core Value

LiteLLM is an open-source project by BerriAI that provides two core components.

1. Python SDK: Call 100+ LLMs through a unified interface

litellm.completion()
  |
  +-- model="gpt-4o"           --> OpenAI API
  +-- model="claude-sonnet-4-20250514"  --> Anthropic API
  +-- model="azure/gpt-4o"     --> Azure OpenAI
  +-- model="bedrock/claude-3" --> AWS Bedrock
  +-- model="vertex_ai/gemini" --> Google Vertex AI
  +-- model="ollama/llama3"    --> Local Ollama

2. Proxy Server (AI Gateway): OpenAI-compatible REST API server

Any OpenAI Client --> LiteLLM Proxy --> Multiple LLM Providers
                      (Rate Limiting, Cost Tracking, Load Balancing)

1.2 Why LiteLLM

Problem	LiteLLM Solution
Different SDK per provider	Unified completion() function
Scattered API key management	Centralized management via Proxy
Difficulty tracking costs	Automatic cost calculation and tracking
Provider outages	Automatic fallback support
Rate limit management	Built-in rate limiting
Model switching costs	Switch models without code changes

1.3 Supported Providers

Commercial Providers:
  - OpenAI (GPT-4o, GPT-4o-mini, o1, o3, etc.)
  - Anthropic (Claude Sonnet, Claude Haiku, Claude Opus)
  - Azure OpenAI
  - AWS Bedrock (Claude, Titan, Llama)
  - Google Vertex AI (Gemini)
  - Google AI Studio
  - Cohere (Command R+)
  - Mistral AI
  - Together AI
  - Groq
  - Fireworks AI
  - Perplexity
  - DeepSeek

Self-Hosted / Local:
  - Ollama
  - vLLM
  - Hugging Face TGI
  - NVIDIA NIM
  - OpenAI-compatible endpoints

2. LiteLLM SDK Usage

2.1 Installation

pip install litellm

2.2 Basic Usage: completion()

import litellm

# OpenAI
response = litellm.completion(
    model="gpt-4o",
    messages=[
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": "What is Kubernetes?"},
    ],
    temperature=0.7,
    max_tokens=1000,
)
print(response.choices[0].message.content)

# Anthropic Claude
response = litellm.completion(
    model="claude-sonnet-4-20250514",
    messages=[
        {"role": "user", "content": "Explain microservices architecture."},
    ],
    max_tokens=1000,
)
print(response.choices[0].message.content)

# Azure OpenAI
response = litellm.completion(
    model="azure/gpt-4o-deployment",
    messages=[
        {"role": "user", "content": "Hello!"},
    ],
    api_base="https://my-resource.openai.azure.com",
    api_version="2024-02-15-preview",
    api_key="your-azure-key",
)

# AWS Bedrock
response = litellm.completion(
    model="bedrock/anthropic.claude-3-sonnet-20240229-v1:0",
    messages=[
        {"role": "user", "content": "Summarize this text."},
    ],
)

# Ollama (local)
response = litellm.completion(
    model="ollama/llama3",
    messages=[
        {"role": "user", "content": "Write a Python function."},
    ],
    api_base="http://localhost:11434",
)

2.3 Streaming

import litellm

# Synchronous streaming
response = litellm.completion(
    model="gpt-4o",
    messages=[{"role": "user", "content": "Write a long essay."}],
    stream=True,
)

for chunk in response:
    content = chunk.choices[0].delta.content
    if content:
        print(content, end="", flush=True)

2.4 Async Calls

import asyncio
import litellm

async def main():
    # Single async call
    response = await litellm.acompletion(
        model="gpt-4o",
        messages=[{"role": "user", "content": "Hello!"}],
    )
    print(response.choices[0].message.content)

    # Async streaming
    response = await litellm.acompletion(
        model="claude-sonnet-4-20250514",
        messages=[{"role": "user", "content": "Explain Docker."}],
        stream=True,
    )
    async for chunk in response:
        content = chunk.choices[0].delta.content
        if content:
            print(content, end="", flush=True)

    # Parallel calls
    tasks = [
        litellm.acompletion(
            model="gpt-4o-mini",
            messages=[{"role": "user", "content": f"Question {i}"}],
        )
        for i in range(5)
    ]
    responses = await asyncio.gather(*tasks)
    for resp in responses:
        print(resp.choices[0].message.content[:50])

asyncio.run(main())

2.5 Function Calling (Tool Use)

import litellm
import json

tools = [
    {
        "type": "function",
        "function": {
            "name": "get_weather",
            "description": "Get the current weather for a location",
            "parameters": {
                "type": "object",
                "properties": {
                    "location": {
                        "type": "string",
                        "description": "City name",
                    },
                    "unit": {
                        "type": "string",
                        "enum": ["celsius", "fahrenheit"],
                    },
                },
                "required": ["location"],
            },
        },
    }
]

# Same interface for both OpenAI and Claude
for model in ["gpt-4o", "claude-sonnet-4-20250514"]:
    response = litellm.completion(
        model=model,
        messages=[
            {"role": "user", "content": "What's the weather in Seoul?"}
        ],
        tools=tools,
        tool_choice="auto",
    )

    if response.choices[0].message.tool_calls:
        tool_call = response.choices[0].message.tool_calls[0]
        print(f"Model: {model}")
        print(f"Function: {tool_call.function.name}")
        # Arguments arrive as a JSON string; parse them before use
        print(f"Arguments: {json.loads(tool_call.function.arguments)}")
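
The example above stops at printing the tool call. The next step is usually to execute the function locally and send the result back as a `tool` message. A minimal sketch follows, assuming a hypothetical local `get_weather` implementation and a hypothetical `TOOL_REGISTRY`; the fake tool call only imitates the object shape used above.

```python
import json
from types import SimpleNamespace

def get_weather(location, unit="celsius"):
    """Hypothetical local implementation; a real one would call a weather API."""
    return {"location": location, "temperature": 21, "unit": unit}

TOOL_REGISTRY = {"get_weather": get_weather}

def run_tool_call(tool_call):
    """Execute one tool call and build the `tool` message to send back."""
    func = TOOL_REGISTRY[tool_call.function.name]
    args = json.loads(tool_call.function.arguments)  # JSON string -> dict
    result = func(**args)
    return {
        "role": "tool",
        "tool_call_id": tool_call.id,
        "content": json.dumps(result),
    }

fake_tool_call = SimpleNamespace(
    id="call_1",
    function=SimpleNamespace(name="get_weather", arguments='{"location": "Seoul"}'),
)
tool_message = run_tool_call(fake_tool_call)
print(tool_message)
```

The returned message would be appended to `messages`, after the assistant message that contained the tool call, before calling `litellm.completion()` again.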

2.6 Embedding

import litellm

# OpenAI Embedding
response = litellm.embedding(
    model="text-embedding-3-small",
    input=["Hello world", "How are you?"],
)
print(f"Embedding dimension: {len(response.data[0]['embedding'])}")

# Cohere Embedding
response = litellm.embedding(
    model="cohere/embed-english-v3.0",
    input=["Search query text"],
    input_type="search_query",
)

# Bedrock Embedding
response = litellm.embedding(
    model="bedrock/amazon.titan-embed-text-v2:0",
    input=["Document text for embedding"],
)
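
Embeddings are typically compared with cosine similarity. The sketch below uses short hand-written vectors in place of real embeddings, so it runs offline; `cosine_similarity` is an illustrative helper, not part of LiteLLM.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # treat zero vectors as dissimilar to everything
    return dot / (norm_a * norm_b)

same = cosine_similarity([1.0, 0.0, 0.0], [1.0, 0.0, 0.0])
orthogonal = cosine_similarity([1.0, 0.0, 0.0], [0.0, 1.0, 0.0])
print(same, orthogonal)
```

With real data, `a` and `b` would be `response.data[i]['embedding']` values from the calls above.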

2.7 Image/Vision Models

import litellm

# GPT-4o Vision
response = litellm.completion(
    model="gpt-4o",
    messages=[
        {
            "role": "user",
            "content": [
                {"type": "text", "text": "What is in this image?"},
                {
                    "type": "image_url",
                    "image_url": {
                        "url": "https://example.com/image.png",
                    },
                },
            ],
        }
    ],
)

# Claude Vision
response = litellm.completion(
    model="claude-sonnet-4-20250514",
    messages=[
        {
            "role": "user",
            "content": [
                {"type": "text", "text": "Describe this architecture diagram."},
                {
                    "type": "image_url",
                    "image_url": {
                        "url": "data:image/png;base64,iVBORw0KGgo...",
                    },
                },
            ],
        }
    ],
)
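
The base64 `data:` URL in the Claude example can be built from raw image bytes. A small sketch, where `to_data_url` is a hypothetical helper and the eight-byte PNG signature stands in for a real image file:

```python
import base64

def to_data_url(image_bytes, mime_type="image/png"):
    """Encode raw image bytes as a data URL usable in an image_url field."""
    encoded = base64.b64encode(image_bytes).decode("ascii")
    return f"data:{mime_type};base64,{encoded}"

# The PNG file signature; a real call would read the bytes from an image file
png_signature = b"\x89PNG\r\n\x1a\n"
url = to_data_url(png_signature)
print(url)
```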

3. LiteLLM Proxy Server (AI Gateway)

3.1 What is the Proxy

LiteLLM Proxy is a self-hostable, OpenAI-compatible API gateway. Any client built on the OpenAI SDK can use it by pointing base_url at the Proxy and supplying a Proxy-issued key; no other code changes are needed.

+-------------------+
| Your Application  |
| (OpenAI SDK)      |
+--------+----------+
         |
         v
+--------+----------+
| LiteLLM Proxy     |
| - Rate Limiting   |
| - Cost Tracking   |
| - Load Balancing  |
| - Fallback        |
| - Key Management  |
+--------+----------+
         |
    +----+----+----+----+
    |    |    |    |    |
    v    v    v    v    v
  OpenAI Azure Anthropic Bedrock Ollama

3.2 Installation and Running

# pip install
pip install 'litellm[proxy]'

# Basic run
litellm --model gpt-4o --port 4000

# Run with config file
litellm --config config.yaml --port 4000

# Docker run
docker run -d \
  --name litellm-proxy \
  -p 4000:4000 \
  -v $(pwd)/config.yaml:/app/config.yaml \
  -e OPENAI_API_KEY=sk-xxx \
  -e ANTHROPIC_API_KEY=sk-ant-xxx \
  ghcr.io/berriai/litellm:main-latest \
  --config /app/config.yaml

3.3 config.yaml Configuration

# config.yaml
model_list:
  # OpenAI models
  - model_name: gpt-4o
    litellm_params:
      model: openai/gpt-4o
      api_key: os.environ/OPENAI_API_KEY
      rpm: 500 # Rate limit: requests per minute
      tpm: 100000 # Rate limit: tokens per minute

  # Claude model (single deployment)
  - model_name: claude-sonnet
    litellm_params:
      model: anthropic/claude-sonnet-4-20250514
      api_key: os.environ/ANTHROPIC_API_KEY
      rpm: 200
      tpm: 80000

  # Azure OpenAI (shares model_name gpt-4o with the OpenAI entry above, so the two are load balanced)
  - model_name: gpt-4o
    litellm_params:
      model: azure/gpt-4o-deployment
      api_base: https://my-resource.openai.azure.com
      api_version: '2024-02-15-preview'
      api_key: os.environ/AZURE_API_KEY
      rpm: 300

  # AWS Bedrock Claude
  - model_name: bedrock-claude
    litellm_params:
      model: bedrock/anthropic.claude-3-sonnet-20240229-v1:0
      aws_access_key_id: os.environ/AWS_ACCESS_KEY_ID
      aws_secret_access_key: os.environ/AWS_SECRET_ACCESS_KEY
      aws_region_name: us-east-1

  # Local Ollama
  - model_name: local-llama
    litellm_params:
      model: ollama/llama3
      api_base: http://ollama:11434

# Router settings
router_settings:
  routing_strategy: 'latency-based-routing'
  num_retries: 3
  timeout: 60
  allowed_fails: 2
  cooldown_time: 30

# General settings
general_settings:
  master_key: sk-master-key-1234
  database_url: os.environ/DATABASE_URL
  store_model_in_db: true

litellm_settings:
  drop_params: true
  set_verbose: false
  cache: true
  cache_params:
    type: redis
    host: redis
    port: 6379

3.4 Model Routing and Load Balancing

# Registering multiple deployments with the same model_name enables auto load balancing
model_list:
  # gpt-4o group: 3 deployments
  - model_name: gpt-4o
    litellm_params:
      model: openai/gpt-4o
      api_key: os.environ/OPENAI_API_KEY_1
  - model_name: gpt-4o
    litellm_params:
      model: azure/gpt-4o-east
      api_base: https://east.openai.azure.com
      api_key: os.environ/AZURE_KEY_EAST
  - model_name: gpt-4o
    litellm_params:
      model: azure/gpt-4o-west
      api_base: https://west.openai.azure.com
      api_key: os.environ/AZURE_KEY_WEST

router_settings:
  # Routing strategy
  routing_strategy: 'latency-based-routing'
  # Options:
  #   simple-shuffle: Random selection
  #   least-busy: Fewest in-progress requests
  #   usage-based-routing: Based on TPM/RPM usage
  #   latency-based-routing: Based on response time (recommended)

Routing Strategy Comparison:

Strategy	Description	Best For
simple-shuffle	Random distribution	All deployments have similar performance
least-busy	Based on in-progress request count	Varying request processing times
usage-based-routing	Based on RPM/TPM usage	Approaching rate limits
latency-based-routing	Based on response time	Latency optimization is critical
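
To make the table concrete, here is a toy sketch of how two of the strategies might pick a deployment. This is not LiteLLM's router code: the deployment names, the sample numbers, and the helpers `pick_least_busy` and `pick_lowest_latency` are assumptions for illustration.

```python
def pick_least_busy(in_flight):
    """Choose the deployment with the fewest in-progress requests."""
    return min(in_flight, key=in_flight.get)

def pick_lowest_latency(latencies):
    """Choose the deployment with the lowest average observed latency."""
    averages = {
        name: sum(samples) / len(samples)
        for name, samples in latencies.items()
        if samples  # ignore deployments with no measurements yet
    }
    return min(averages, key=averages.get)

# Hypothetical state for the three gpt-4o deployments
in_flight = {"openai": 4, "azure-east": 1, "azure-west": 3}
latencies = {"openai": [0.8, 0.9], "azure-east": [1.4, 1.6], "azure-west": [0.6, 0.7]}

print(pick_least_busy(in_flight))
print(pick_lowest_latency(latencies))
```

The two strategies can disagree, as they do here: the least busy deployment is not necessarily the fastest one.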

3.5 Fallback Configuration

model_list:
  - model_name: primary-model
    litellm_params:
      model: openai/gpt-4o
      api_key: os.environ/OPENAI_API_KEY

  - model_name: fallback-model
    litellm_params:
      model: anthropic/claude-sonnet-4-20250514
      api_key: os.environ/ANTHROPIC_API_KEY

router_settings:
  num_retries: 2
  timeout: 30
  fallbacks: [{ 'primary-model': ['fallback-model'] }]
  # Per-error retry counts (these control retries, not which errors trigger a fallback)
  retry_policy:
    RateLimitErrorRetries: 3 # Retry 3 times on 429 errors
    ContentPolicyViolationErrorRetries: 0 # No retry on content policy violations
    AuthenticationErrorRetries: 0 # No retry on auth errors
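
The interplay of `num_retries` and `fallbacks` can be pictured as "retry the requested model, then walk the fallback list". The sketch below is a simplified model of that idea, not LiteLLM's actual router; `call_with_fallbacks` and `flaky_call` are hypothetical, and `RuntimeError` stands in for a provider error.

```python
def call_with_fallbacks(call, model, fallbacks, num_retries=2):
    """Try `model` up to num_retries + 1 times, then each fallback in order."""
    last_error = None
    for candidate in [model, *fallbacks.get(model, [])]:
        for _ in range(num_retries + 1):
            try:
                return candidate, call(candidate)
            except RuntimeError as exc:  # stand-in for a provider error
                last_error = exc
    raise last_error

attempts = []

def flaky_call(model):
    """Simulated provider: the primary model is down, the fallback works."""
    attempts.append(model)
    if model == "primary-model":
        raise RuntimeError("simulated provider outage")
    return "ok"

used_model, result = call_with_fallbacks(
    flaky_call,
    "primary-model",
    {"primary-model": ["fallback-model"]},
    num_retries=2,
)
print(used_model, result, attempts)
```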

3.6 API Key Management (Virtual Keys)

# Generate virtual key with master key
curl -X POST http://localhost:4000/key/generate \
  -H "Authorization: Bearer sk-master-key-1234" \
  -H "Content-Type: application/json" \
  -d '{
    "models": ["gpt-4o", "claude-sonnet"],
    "max_budget": 100.0,
    "budget_duration": "30d",
    "metadata": {
      "team": "backend",
      "user": "developer-1"
    },
    "tpm_limit": 50000,
    "rpm_limit": 100
  }'

Response:

{
  "key": "sk-generated-key-abc123",
  "expires": "2026-04-20T00:00:00Z",
  "max_budget": 100.0,
  "models": ["gpt-4o", "claude-sonnet"]
}

# Make API calls with the generated key
from openai import OpenAI

client = OpenAI(
    api_key="sk-generated-key-abc123",
    base_url="http://localhost:4000",
)

response = client.chat.completions.create(
    model="gpt-4o",
    messages=[{"role": "user", "content": "Hello!"}],
)

3.7 Rate Limiting Configuration

# Rate Limiting in config.yaml
model_list:
  - model_name: gpt-4o
    litellm_params:
      model: openai/gpt-4o
      api_key: os.environ/OPENAI_API_KEY
      rpm: 500 # Model deployment level RPM
      tpm: 100000 # Model deployment level TPM

general_settings:
  master_key: sk-master-key-1234
  database_url: os.environ/DATABASE_URL

# Per-key rate limiting
curl -X POST http://localhost:4000/key/generate \
  -H "Authorization: Bearer sk-master-key-1234" \
  -H "Content-Type: application/json" \
  -d '{
    "rpm_limit": 50,
    "tpm_limit": 20000,
    "max_budget": 10.0,
    "budget_duration": "1d"
  }'

# Per-team rate limiting
curl -X POST http://localhost:4000/team/new \
  -H "Authorization: Bearer sk-master-key-1234" \
  -H "Content-Type: application/json" \
  -d '{
    "team_alias": "backend-team",
    "rpm_limit": 200,
    "tpm_limit": 80000,
    "max_budget": 500.0,
    "budget_duration": "30d"
  }'
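
An `rpm_limit` can be understood as "at most N requests in any 60-second window". One common way to enforce that is a sliding-window counter, sketched below. This illustrates the concept only and does not claim to reproduce the Proxy's internal algorithm; `SlidingWindowLimiter` is a hypothetical class, and timestamps are passed in explicitly so the example is deterministic.

```python
from collections import deque

class SlidingWindowLimiter:
    """Allow at most `rpm_limit` requests per `window_seconds`."""

    def __init__(self, rpm_limit, window_seconds=60.0):
        self.rpm_limit = rpm_limit
        self.window_seconds = window_seconds
        self._timestamps = deque()

    def allow(self, now):
        # Drop requests that have aged out of the window
        while self._timestamps and now - self._timestamps[0] >= self.window_seconds:
            self._timestamps.popleft()
        if len(self._timestamps) >= self.rpm_limit:
            return False  # the caller would answer with HTTP 429
        self._timestamps.append(now)
        return True

limiter = SlidingWindowLimiter(rpm_limit=2)
decisions = [limiter.allow(t) for t in (0.0, 1.0, 2.0, 61.0)]
print(decisions)
```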

3.8 Budget Management

# Set per-user budget
curl -X POST http://localhost:4000/user/new \
  -H "Authorization: Bearer sk-master-key-1234" \
  -H "Content-Type: application/json" \
  -d '{
    "user_id": "user-123",
    "max_budget": 50.0,
    "budget_duration": "30d",
    "models": ["gpt-4o-mini", "claude-sonnet"]
  }'

# Check budget usage
curl "http://localhost:4000/user/info?user_id=user-123" \
  -H "Authorization: Bearer sk-master-key-1234"

3.9 Caching

# config.yaml
litellm_settings:
  cache: true
  cache_params:
    type: redis
    host: redis
    port: 6379
    ttl: 3600 # 1 hour cache

# Cache control from client
from openai import OpenAI

client = OpenAI(
    api_key="sk-key",
    base_url="http://localhost:4000",
)

# Use cache (default)
response = client.chat.completions.create(
    model="gpt-4o",
    messages=[{"role": "user", "content": "What is Python?"}],
)

# Skip cache
response = client.chat.completions.create(
    model="gpt-4o",
    messages=[{"role": "user", "content": "What is Python?"}],
    extra_body={"cache": {"no-cache": True}},
)
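
Response caching works by deriving a key from the request and storing the response under it for the TTL. The sketch below shows the idea with an in-memory store; it is not LiteLLM's cache implementation, and `cache_key` and `TTLCache` are hypothetical names. Time is passed in explicitly to keep the example deterministic.

```python
import hashlib
import json

def cache_key(model, messages, **params):
    """Derive a stable key from the parts of a request that affect the answer."""
    payload = json.dumps(
        {"model": model, "messages": messages, "params": params},
        sort_keys=True,
    )
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()

class TTLCache:
    """Minimal in-memory cache whose entries expire after `ttl_seconds`."""

    def __init__(self, ttl_seconds):
        self.ttl_seconds = ttl_seconds
        self._store = {}  # key -> (expires_at, value)

    def get(self, key, now):
        entry = self._store.get(key)
        if entry is None or now >= entry[0]:
            self._store.pop(key, None)  # evict an expired entry
            return None
        return entry[1]

    def set(self, key, value, now):
        self._store[key] = (now + self.ttl_seconds, value)

messages = [{"role": "user", "content": "What is Python?"}]
key = cache_key("gpt-4o", messages, temperature=0)
cache = TTLCache(ttl_seconds=3600)
cache.set(key, "cached answer", now=0)
hit = cache.get(key, now=1800)
expired = cache.get(key, now=3600)
print(hit, expired)
```

Because the key covers the model, messages, and parameters, changing any of them produces a cache miss.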

4. Cost Tracking

4.1 Automatic Cost Calculation

LiteLLM automatically calculates the cost of each request from the response's token usage and its built-in per-model pricing table.

import litellm

response = litellm.completion(
    model="gpt-4o",
    messages=[{"role": "user", "content": "Hello!"}],
)

# Cost information
print(f"Input tokens: {response.usage.prompt_tokens}")
print(f"Output tokens: {response.usage.completion_tokens}")

# LiteLLM cost calculation
from litellm import completion_cost

cost = completion_cost(completion_response=response)
print(f"Cost: ${cost:.6f}")
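
The arithmetic behind a per-request cost is simple: tokens multiplied by the per-token price, summed over input and output. The sketch below works through one example. The prices are hypothetical placeholders, not current list prices, and `estimate_cost` is an illustrative helper rather than a LiteLLM function.

```python
def estimate_cost(prompt_tokens, completion_tokens,
                  input_price_per_million, output_price_per_million):
    """Cost in USD given token counts and prices per one million tokens."""
    total = (prompt_tokens * input_price_per_million
             + completion_tokens * output_price_per_million)
    return total / 1_000_000

# Hypothetical prices: $2.50 per 1M input tokens, $10.00 per 1M output tokens
cost = estimate_cost(
    prompt_tokens=1200,
    completion_tokens=300,
    input_price_per_million=2.50,
    output_price_per_million=10.00,
)
print(f"Cost: ${cost:.6f}")
```

Here the input side contributes 1200 x 2.50 / 1,000,000 = $0.003 and the output side 300 x 10.00 / 1,000,000 = $0.003, for a total of $0.006.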

4.2 Cost Queries via Proxy

# Total spend
curl http://localhost:4000/global/spend \
  -H "Authorization: Bearer sk-master-key-1234"

# Per-key spend
curl "http://localhost:4000/global/spend?api_key=sk-key-abc" \
  -H "Authorization: Bearer sk-master-key-1234"

# Per-model spend
curl "http://localhost:4000/global/spend?model=gpt-4o" \
  -H "Authorization: Bearer sk-master-key-1234"

# Per-team spend
curl "http://localhost:4000/team/info?team_id=team-backend" \
  -H "Authorization: Bearer sk-master-key-1234"

# Spend by date range
curl "http://localhost:4000/global/spend/logs?start_date=2026-03-01&end_date=2026-03-20" \
  -H "Authorization: Bearer sk-master-key-1234"

4.3 Budget Alert Configuration

# config.yaml
general_settings:
  alerting:
    - slack
  alerting_threshold: 300 # Alert if no response within 300 seconds
  alert_types:
    - budget_alerts # On budget exceeded
    - spend_reports # Weekly/monthly cost reports
    - failed_tracking_spend # Failed spend tracking

environment_variables:
  SLACK_WEBHOOK_URL: os.environ/SLACK_WEBHOOK_URL

5. Production Deployment

5.1 Docker Compose

# docker-compose.yml
version: '3.8'

services:
  litellm:
    image: ghcr.io/berriai/litellm:main-latest
    container_name: litellm-proxy
    ports:
      - '4000:4000'
    volumes:
      - ./config.yaml:/app/config.yaml
    environment:
      - OPENAI_API_KEY=sk-xxx
      - ANTHROPIC_API_KEY=sk-ant-xxx
      - AZURE_API_KEY=xxx
      - AWS_ACCESS_KEY_ID=xxx
      - AWS_SECRET_ACCESS_KEY=xxx
      - DATABASE_URL=postgresql://litellm:password@postgres:5432/litellm
      - REDIS_HOST=redis
      - REDIS_PORT=6379
    command: --config /app/config.yaml --port 4000
    depends_on:
      postgres:
        condition: service_healthy
      redis:
        condition: service_started
    healthcheck:
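      # /health/liveliness needs no auth; /health requires a key and calls every configured model
      test: ['CMD', 'curl', '-f', 'http://localhost:4000/health/liveliness']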
      interval: 30s
      timeout: 10s
      retries: 3

  postgres:
    image: postgres:16-alpine
    container_name: litellm-postgres
    environment:
      POSTGRES_DB: litellm
      POSTGRES_USER: litellm
      POSTGRES_PASSWORD: password
    volumes:
      - postgres_data:/var/lib/postgresql/data
    healthcheck:
      test: ['CMD-SHELL', 'pg_isready -U litellm']
      interval: 5s
      timeout: 5s
      retries: 5

  redis:
    image: redis:7-alpine
    container_name: litellm-redis
    ports:
      - '6379:6379'
    volumes:
      - redis_data:/data

volumes:
  postgres_data:
  redis_data:

5.2 Kubernetes Helm Chart

# Add Helm repository
helm repo add litellm https://berriai.github.io/litellm/
helm repo update

# Install
helm install litellm litellm/litellm-helm \
  --namespace litellm \
  --create-namespace \
  --values values.yaml

# values.yaml
replicaCount: 3

image:
  repository: ghcr.io/berriai/litellm
  tag: main-latest

service:
  type: ClusterIP
  port: 4000

ingress:
  enabled: true
  className: nginx
  hosts:
    - host: litellm.internal.company.com
      paths:
        - path: /
          pathType: Prefix

resources:
  requests:
    cpu: 500m
    memory: 512Mi
  limits:
    cpu: 2000m
    memory: 2Gi

env:
  - name: OPENAI_API_KEY
    valueFrom:
      secretKeyRef:
        name: litellm-secrets
        key: openai-api-key
  - name: ANTHROPIC_API_KEY
    valueFrom:
      secretKeyRef:
        name: litellm-secrets
        key: anthropic-api-key
  - name: DATABASE_URL
    valueFrom:
      secretKeyRef:
        name: litellm-secrets
        key: database-url

postgresql:
  enabled: true
  auth:
    database: litellm
    username: litellm

redis:
  enabled: true

5.3 Health Check and Metrics

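# Health check (calls each configured model; needs a key when master_key is set)
curl http://localhost:4000/health \
  -H "Authorization: Bearer sk-master-key-1234"

# Lightweight liveness probe (no auth required)
curl http://localhost:4000/health/liveliness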

# Prometheus Metrics
curl http://localhost:4000/metrics

Key Prometheus Metrics (exact names vary by LiteLLM version; confirm against your /metrics output):

litellm_requests_total: Total request count
litellm_request_duration_seconds: Request processing time
litellm_tokens_total: Total token usage
litellm_spend_total: Total spend
litellm_errors_total: Error count
litellm_cache_hits_total: Cache hit count

5.4 Logging Integration

# config.yaml - External logging service integration
litellm_settings:
  success_callback: ['langfuse']
  failure_callback: ['langfuse']

environment_variables:
  LANGFUSE_PUBLIC_KEY: os.environ/LANGFUSE_PUBLIC_KEY
  LANGFUSE_SECRET_KEY: os.environ/LANGFUSE_SECRET_KEY
  LANGFUSE_HOST: https://cloud.langfuse.com

Supported Logging Services:

Service	Purpose
Langfuse	LLM observability, prompt management
Helicone	Request logging, cost analysis
Lunary	LLM monitoring
Custom Callback	Integration with custom logging systems

# Custom Callback example
import litellm

def log_to_database(**fields):
    # Placeholder sink; replace with your own storage or alerting logic
    print(fields)

def my_custom_callback(kwargs, completion_response, start_time, end_time):
    # Called after every successful request
    model = kwargs.get("model")
    cost = litellm.completion_cost(completion_response=completion_response)

    # Custom logic (DB storage, alerts, etc.)
    log_to_database(
        model=model,
        cost=cost,
        latency=(end_time - start_time).total_seconds(),
        tokens=completion_response.usage.total_tokens,
    )

litellm.success_callback = [my_custom_callback]

6. Real-World Use Cases

6.1 Enterprise AI Gateway

Centralize all LLM calls across the organization through LiteLLM Proxy.

+------------------+
| Frontend App     |----+
+------------------+    |
                        |     +----------------+
+------------------+    +---->|                |     +----------+
| Backend Service  |----+     | LiteLLM Proxy  |---->| OpenAI   |
+------------------+    |     |                |     +----------+
                        |     | - Auth         |
+------------------+    |     | - Rate Limit   |     +----------+
| Data Pipeline    |----+     | - Cost Track   |---->| Anthropic|
+------------------+    |     | - Audit Log    |     +----------+
                        |     |                |
+------------------+    |     +--------+-------+     +----------+
| Internal Tools   |----+              |             | Azure    |
+------------------+                   v             +----------+
                              +--------+-------+
                              | PostgreSQL     |
                              | (spend logs)   |
                              +----------------+

6.2 A/B Testing

from openai import OpenAI
import random

client = OpenAI(
    api_key="sk-proxy-key",
    base_url="http://litellm-proxy:4000",
)

def get_completion_with_ab_test(prompt: str, test_name: str):
    # 50/50 A/B test
    model = random.choice(["gpt-4o", "claude-sonnet"])

    response = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": prompt}],
        extra_body={
            "metadata": {
                "test_name": test_name,
                "variant": model,
            }
        },
    )

    return {
        "model": model,
        "content": response.choices[0].message.content,
        "tokens": response.usage.total_tokens,
    }
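
`random.choice` can give the same user a different variant on every request. If each user should stay in one arm of the test, the variant can be derived from a hash of the user ID instead. A minimal sketch, where `assign_variant` and the user IDs are hypothetical:

```python
import hashlib

def assign_variant(user_id, test_name, variants=("gpt-4o", "claude-sonnet")):
    """Deterministically map a user to one variant of a named test."""
    digest = hashlib.sha256(f"{test_name}:{user_id}".encode("utf-8")).digest()
    bucket = int.from_bytes(digest[:8], "big") % len(variants)
    return variants[bucket]

first = assign_variant("user-123", "summary-quality")
second = assign_variant("user-123", "summary-quality")
print(first, first == second)
```

Including `test_name` in the hash input means the same user can land in different arms across different tests.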

6.3 Cost-Optimized Routing

# Reuses the `client` defined in section 6.2
def smart_route(prompt: str, complexity: str = "auto"):
    """Select appropriate model based on complexity"""

    if complexity == "auto":
        # Simple heuristic: word count first, then keywords
        word_count = len(prompt.split())
        if word_count < 50:
            complexity = "simple"
        elif any(kw in prompt.lower() for kw in
                 ["analyze", "compare", "complex", "detailed"]):
            complexity = "complex"
        else:
            complexity = "medium"

    model_map = {
        "simple": "gpt-4o-mini",      # Cheap model
        "medium": "claude-sonnet",      # Mid performance/price
        "complex": "gpt-4o",           # High performance model
    }

    model = model_map[complexity]

    response = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": prompt}],
    )

    return response
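
The routing heuristic is easier to test when it is separated from the API call. The sketch below mirrors the same rules as a pure function; `classify_complexity` is a hypothetical name. One consequence of the rule order is that a short prompt is classified as simple even when it contains a keyword such as "analyze", because the length check runs first.

```python
KEYWORDS = ("analyze", "compare", "complex", "detailed")

def classify_complexity(prompt):
    """Same rules as smart_route: word count first, then keywords."""
    word_count = len(prompt.split())
    if word_count < 50:
        return "simple"
    if any(kw in prompt.lower() for kw in KEYWORDS):
        return "complex"
    return "medium"

short_prompt = "Analyze this."
long_plain = " ".join(["word"] * 60)
long_analytic = "Please analyze " + long_plain

labels = [classify_complexity(p) for p in (short_prompt, long_plain, long_analytic)]
print(labels)
```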

6.4 Disaster Recovery (Automatic Failover)

# config.yaml - Multi-provider failover
model_list:
  # Primary: OpenAI
  - model_name: main-model
    litellm_params:
      model: openai/gpt-4o
      api_key: os.environ/OPENAI_API_KEY

  # Secondary: Azure OpenAI (different region)
  - model_name: main-model-fallback-1
    litellm_params:
      model: azure/gpt-4o
      api_base: https://eastus.openai.azure.com
      api_key: os.environ/AZURE_KEY

  # Tertiary: Anthropic Claude
  - model_name: main-model-fallback-2
    litellm_params:
      model: anthropic/claude-sonnet-4-20250514
      api_key: os.environ/ANTHROPIC_API_KEY

router_settings:
  fallbacks: [{ 'main-model': ['main-model-fallback-1', 'main-model-fallback-2'] }]
  num_retries: 2
  timeout: 30
  allowed_fails: 3
  cooldown_time: 60 # 60 second cooldown for failed models
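
The `allowed_fails` and `cooldown_time` settings describe a simple circuit-breaker idea: once a deployment fails more often than allowed, it is skipped for a while. The sketch below models that behavior in simplified form. It is not LiteLLM's implementation (for example, it ignores any time window on the failure count), and `CooldownTracker` is a hypothetical class.

```python
class CooldownTracker:
    """Take a deployment out of rotation after too many failures."""

    def __init__(self, allowed_fails=3, cooldown_time=60.0):
        self.allowed_fails = allowed_fails
        self.cooldown_time = cooldown_time
        self._fails = {}           # deployment -> failure count
        self._cooldown_until = {}  # deployment -> time it becomes usable again

    def record_failure(self, deployment, now):
        self._fails[deployment] = self._fails.get(deployment, 0) + 1
        if self._fails[deployment] > self.allowed_fails:
            self._cooldown_until[deployment] = now + self.cooldown_time
            self._fails[deployment] = 0  # start counting afresh after cooldown

    def is_available(self, deployment, now):
        return now >= self._cooldown_until.get(deployment, 0.0)

tracker = CooldownTracker(allowed_fails=3, cooldown_time=60.0)
for t in (0.0, 1.0, 2.0):
    tracker.record_failure("main-model", now=t)
available_after_three = tracker.is_available("main-model", now=2.0)

tracker.record_failure("main-model", now=3.0)  # fourth failure exceeds the limit
available_after_four = tracker.is_available("main-model", now=3.0)
available_later = tracker.is_available("main-model", now=63.0)
print(available_after_three, available_after_four, available_later)
```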

7. Comparison: LiteLLM vs Alternatives

7.1 Tool Comparison

Feature	LiteLLM	LangChain	OpenRouter	Portkey
Type	Gateway + SDK	Framework	Hosted API	Hosted Gateway
Hosting	Self-hosted	N/A (library)	Cloud	Cloud + Self
Model Count	100+	Various	200+	250+
Cost Tracking	Built-in	Requires custom impl	Yes	Yes
Rate Limiting	Built-in	None	Yes	Yes
Load Balancing	Built-in	None	Yes	Yes
Fallback	Built-in	Manual implementation	Yes	Yes
API Key Mgmt	Virtual Keys	None	None	Yes
Pricing	Free (OSS)	Free (OSS)	Markup	Free + Enterprise
Data Privacy	Full control	Full control	Third-party	Third-party

7.2 When to Choose Which Tool

Choose LiteLLM when:
  - Data privacy is important (finance, healthcare, government)
  - Must operate on own infrastructure
  - Cost tracking and rate limiting are needed
  - Already using multiple providers

Choose LangChain when:
  - Building complex LLM pipelines (RAG, Agents)
  - Need prompt chaining, memory management
  - (LiteLLM and LangChain can be used together)

Choose OpenRouter when:
  - Rapid prototyping
  - Don't want to manage infrastructure
  - Single API key for all models

Choose Portkey when:
  - Enterprise-level management UI needed
  - Advanced features like guardrails, A/B testing needed
  - Prefer managed services

8. Practical Tips

8.1 Environment Variable Management

# .env file (never commit to Git)
OPENAI_API_KEY=sk-xxx
ANTHROPIC_API_KEY=sk-ant-xxx
AZURE_API_KEY=xxx
AWS_ACCESS_KEY_ID=xxx
AWS_SECRET_ACCESS_KEY=xxx
DATABASE_URL=postgresql://litellm:password@localhost:5432/litellm
LITELLM_MASTER_KEY=sk-master-key-change-me

8.2 Model Alias Configuration

# config.yaml
model_list:
  - model_name: fast
    litellm_params:
      model: gpt-4o-mini
      api_key: os.environ/OPENAI_API_KEY
  - model_name: smart
    litellm_params:
      model: gpt-4o
      api_key: os.environ/OPENAI_API_KEY
  - model_name: creative
    litellm_params:
      model: anthropic/claude-sonnet-4-20250514
      api_key: os.environ/ANTHROPIC_API_KEY

# Call with meaningful names
response = client.chat.completions.create(
    model="fast",  # gpt-4o-mini
    messages=[{"role": "user", "content": "Quick question"}],
)

response = client.chat.completions.create(
    model="smart",  # gpt-4o
    messages=[{"role": "user", "content": "Complex analysis"}],
)

8.3 Error Handling Patterns

import time

from openai import OpenAI, APIError, RateLimitError, APITimeoutError

client = OpenAI(
    api_key="sk-proxy-key",
    base_url="http://litellm-proxy:4000",
)

def safe_completion(messages, model="gpt-4o", max_retries=3):
    for attempt in range(max_retries):
        try:
            return client.chat.completions.create(
                model=model,
                messages=messages,
                timeout=30,
            )
        except RateLimitError:
            # LiteLLM Proxy handles rate limits, but the client should back off too
            if attempt == max_retries - 1:
                raise
            wait = 2 ** attempt
            print(f"Rate limited, waiting {wait}s...")
            time.sleep(wait)
        except APITimeoutError:
            print(f"Timeout on attempt {attempt + 1}")
            if attempt == max_retries - 1:
                raise
        except APIError as e:
            # Other API errors are not retried
            print(f"API error: {e}")
            raise
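
When many clients back off on the same `2 ** attempt` schedule, they tend to retry at the same instants. Adding random jitter spreads the retries out. The sketch below computes "full jitter" delays, where each delay is drawn uniformly between zero and the capped exponential ceiling; `backoff_delays` is a hypothetical helper, and the `base` and `cap` defaults are arbitrary.

```python
import random

def backoff_delays(max_retries, base=1.0, cap=30.0, rng=None):
    """Return one randomized delay per retry using capped exponential backoff."""
    rng = rng or random.Random()
    delays = []
    for attempt in range(max_retries):
        ceiling = min(cap, base * 2 ** attempt)  # 1, 2, 4, ... up to cap
        delays.append(rng.uniform(0, ceiling))
    return delays

# A seeded generator keeps the example reproducible
delays = backoff_delays(4, rng=random.Random(0))
print([round(d, 2) for d in delays])
```

In the retry loop, `time.sleep(delays[attempt])` would replace the fixed `2 ** attempt` wait.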

9. Conclusion

LiteLLM puts a single OpenAI-compatible gateway in front of multiple LLM providers.

Key Takeaways:

Unified SDK: Call 100+ LLMs through a single completion() function
Proxy Server: Centralized management via OpenAI-compatible API Gateway
Cost Control: Automatic cost tracking, budget management, alerts
Reliability: Built-in load balancing, fallback, rate limiting
Production: Docker/Kubernetes deployment, Prometheus monitoring, external logging

For enterprises that use several LLM providers, LiteLLM Proxy centralizes API key management, cost tracking, and failure recovery in one place.

References

LiteLLM Official Docs: https://docs.litellm.ai/
LiteLLM GitHub: https://github.com/BerriAI/litellm
LiteLLM Proxy Config Guide: https://docs.litellm.ai/docs/proxy/configs
LiteLLM Docker Deployment: https://docs.litellm.ai/docs/proxy/deploy