LiteLLM Complete Guide 2025: Unify 100+ LLMs Behind a Single API with a Proxy Server

Introduction: Why LiteLLM?

In 2025, the AI landscape is evolving faster than ever. New models launch every month: OpenAI's GPT-4o, Anthropic's Claude Opus 4, Google's Gemini 2.0, Meta's Llama 3.1, Mistral's Large, and more. Each provider has its own API format, authentication scheme, and pricing model.

Organizations face three critical challenges in this environment.

1. Vendor Lock-In

When you build against one provider's API, switching to a better model incurs significant refactoring costs.

# Problem: each provider exposes a different API format
# OpenAI (legacy pre-v1 SDK style)
import openai
response = openai.ChatCompletion.create(model="gpt-4", messages=[...])

# Anthropic
import anthropic
response = anthropic.messages.create(model="claude-3-opus", messages=[...])

# Google
import google.generativeai as genai
response = genai.GenerativeModel("gemini-pro").generate_content(...)

2. Cost Management Complexity

Using multiple models makes it complicated to track spend and manage budgets across providers.

3. Availability and Reliability

Depending on a single provider means an outage there brings your entire system down.

LiteLLM solves all three problems with a single solution.

┌──────────────────────────────────────────────────┐
│  Your Application          (OpenAI SDK Format)   │
└─────────────────────┬────────────────────────────┘
                      ▼
┌──────────────────────────────────────────────────┐
│               LiteLLM Proxy Server               │
│  ┌──────────┐ ┌──────────┐ ┌──────────────────┐ │
│  │  Cost    │ │  Load    │ │  Virtual Keys    │ │
│  │ Tracking │ │ Balance  │ │  + Budget Mgmt   │ │
│  └──────────┘ └──────────┘ └──────────────────┘ │
└───┬──────────────┬──────────────┬────────────────┘
    │              │              │
    ▼              ▼              ▼
┌────────┐  ┌──────────┐  ┌───────────┐
│ OpenAI │  │ Anthropic│  │  Google   │
│ GPT-4o │  │  Claude  │  │  Gemini   │
└────────┘  └──────────┘  └───────────┘
    ▲              ▲              ▲
    │              │              │
┌────────┐  ┌──────────┐  ┌───────────┐
│  AWS   │  │  Azure   │  │  Ollama   │
│Bedrock │  │  OpenAI  │  │  (Local)  │
└────────┘  └──────────┘  └───────────┘

1. LiteLLM Architecture: SDK vs Proxy

LiteLLM can be used in two modes.

1.1 SDK Mode (Python Package)

Import it directly into your application. Best for simple projects and prototypes.

# pip install litellm
from litellm import completion

# Call OpenAI
response = completion(
    model="gpt-4o",
    messages=[{"role": "user", "content": "Hello!"}]
)

# Call Anthropic (same format!)
response = completion(
    model="anthropic/claude-sonnet-4-20250514",
    messages=[{"role": "user", "content": "Hello!"}]
)

# Call Google Gemini (same format!)
response = completion(
    model="gemini/gemini-2.0-flash",
    messages=[{"role": "user", "content": "Hello!"}]
)

1.2 Proxy Mode (Server)

Run it as a standalone server so every application shares a single endpoint. Best for teams and organizations.

# Start the proxy server
litellm --config config.yaml --port 4000

# Call it from any language using the OpenAI SDK format
import openai

client = openai.OpenAI(
    api_key="sk-litellm-your-virtual-key",
    base_url="http://localhost:4000"
)

response = client.chat.completions.create(
    model="gpt-4o",
    messages=[{"role": "user", "content": "Hello!"}]
)

1.3 SDK vs Proxy Comparison

┌───────────────────┬───────────────────────────────┬──────────────────────────────────────┐
│ Feature           │ SDK Mode                      │ Proxy Mode                           │
├───────────────────┼───────────────────────────────┼──────────────────────────────────────┤
│ Usage             │ Python package import         │ HTTP server (REST API)               │
│ Language support  │ Python only                   │ Any language (OpenAI SDK compatible) │
│ Team management   │ Not available                 │ Per-team via virtual keys            │
│ Budget management │ Basic                         │ Advanced (per-team / per-key)        │
│ Load balancing    │ In application code           │ Automatic via config file            │
│ Best for          │ Personal projects, prototypes │ Teams/orgs, production               │
└───────────────────┴───────────────────────────────┴──────────────────────────────────────┘

2. Using the SDK: Calling Many LLMs Uniformly

2.1 Chat Completion

The core feature of the LiteLLM SDK: call 100+ models through one interface.

from litellm import completion
import os

# Set API keys via environment variables
os.environ["OPENAI_API_KEY"] = "sk-..."
os.environ["ANTHROPIC_API_KEY"] = "sk-ant-..."
os.environ["GEMINI_API_KEY"] = "AI..."

# Same function, different models
models = [
    "gpt-4o",
    "anthropic/claude-sonnet-4-20250514",
    "gemini/gemini-2.0-flash",
    "groq/llama-3.1-70b-versatile",
    "mistral/mistral-large-latest",
]

for model in models:
    response = completion(
        model=model,
        messages=[
            {"role": "system", "content": "You are a helpful assistant."},
            {"role": "user", "content": "Explain quantum computing in 2 sentences."}
        ],
        temperature=0.7,
        max_tokens=200,
    )
    print(f"[{model}]: {response.choices[0].message.content}")

2.2 Streaming Responses

from litellm import completion

response = completion(
    model="anthropic/claude-sonnet-4-20250514",
    messages=[{"role": "user", "content": "Write a short poem about coding."}],
    stream=True,
)

for chunk in response:
    if chunk.choices[0].delta.content:
        print(chunk.choices[0].delta.content, end="", flush=True)

2.3 Embeddings

from litellm import embedding

# OpenAI Embedding
response = embedding(
    model="text-embedding-3-small",
    input=["Hello world", "How are you?"]
)

# Cohere embedding (same format!)
response = embedding(
    model="cohere/embed-english-v3.0",
    input=["Hello world", "How are you?"]
)

print(f"Embedding dimension: {len(response.data[0]['embedding'])}")
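
Embeddings are usually compared with cosine similarity. Below is a minimal helper, demonstrated on hardcoded toy vectors (not real API output) so it runs standalone; with a real response you would pass `response.data[0]['embedding']` for each text:

```python
import math

def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

# Toy vectors to illustrate the math:
print(cosine_similarity([1.0, 2.0], [2.0, 4.0]))  # parallel vectors → ~1.0
print(cosine_similarity([1.0, 0.0], [0.0, 1.0]))  # orthogonal vectors → 0.0
```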

2.4 Image Generation

from litellm import image_generation

response = image_generation(
    model="dall-e-3",
    prompt="A futuristic city powered by AI, digital art style",
    n=1,
    size="1024x1024",
)

print(f"Image URL: {response.data[0]['url']}")

2.5 Async Calls

import asyncio
from litellm import acompletion

async def generate_responses():
    tasks = [
        acompletion(
            model="gpt-4o-mini",
            messages=[{"role": "user", "content": f"Tell me fact #{i} about space."}]
        )
        for i in range(5)
    ]
    responses = await asyncio.gather(*tasks)
    for i, resp in enumerate(responses):
        print(f"Fact #{i}: {resp.choices[0].message.content[:100]}")

asyncio.run(generate_responses())

3. Proxy Server Configuration

3.1 Base Configuration File (config.yaml)

model_list:
  - model_name: gpt-4o
    litellm_params:
      model: openai/gpt-4o
      api_key: os.environ/OPENAI_API_KEY

  - model_name: claude-sonnet
    litellm_params:
      model: anthropic/claude-sonnet-4-20250514
      api_key: os.environ/ANTHROPIC_API_KEY

  - model_name: gemini-flash
    litellm_params:
      model: gemini/gemini-2.0-flash
      api_key: os.environ/GEMINI_API_KEY

  - model_name: llama-local
    litellm_params:
      model: ollama/llama3.1:70b
      api_base: http://localhost:11434

litellm_settings:
  drop_params: true
  set_verbose: false
  num_retries: 3
  request_timeout: 120

general_settings:
  master_key: sk-master-key-1234
  database_url: os.environ/DATABASE_URL

3.2 Getting Started with Docker

# Write the Docker Compose file
cat > docker-compose.yaml << 'EOF'
version: "3.9"
services:
  litellm:
    image: ghcr.io/berriai/litellm:main-latest
    ports:
      - "4000:4000"
    volumes:
      - ./config.yaml:/app/config.yaml
    command: ["--config", "/app/config.yaml", "--port", "4000"]
    environment:
      - OPENAI_API_KEY
      - ANTHROPIC_API_KEY
      - GEMINI_API_KEY
      - DATABASE_URL=postgresql://user:pass@db:5432/litellm
    depends_on:
      - db

  db:
    image: postgres:16
    environment:
      POSTGRES_USER: user
      POSTGRES_PASSWORD: pass
      POSTGRES_DB: litellm
    volumes:
      - pgdata:/var/lib/postgresql/data

volumes:
  pgdata:
EOF

# Start the stack
docker compose up -d

3.3 Testing the Proxy Server

# Health check
curl http://localhost:4000/health

# Call chat completions
curl -X POST http://localhost:4000/v1/chat/completions \
  -H "Authorization: Bearer sk-master-key-1234" \
  -H "Content-Type: application/json" \
  -d '{
    "model": "gpt-4o",
    "messages": [{"role": "user", "content": "Hello!"}]
  }'

# List available models
curl http://localhost:4000/v1/models \
  -H "Authorization: Bearer sk-master-key-1234"

4. Supported Providers and Models

LiteLLM supports 100+ LLM providers. The major ones are summarized below.

4.1 Major Providers

┌────────────────────┬──────────────────────────────────────┐
│ Provider           │ Example Models                       │
├────────────────────┼──────────────────────────────────────┤
│ OpenAI             │ gpt-4o, gpt-4o-mini, o1, o3-mini     │
│ Anthropic          │ claude-opus-4, claude-sonnet, haiku  │
│ Google (Gemini)    │ gemini-2.0-flash, gemini-1.5-pro     │
│ AWS Bedrock        │ Claude, Titan, Llama via Bedrock     │
│ Azure OpenAI       │ gpt-4o (Azure-hosted)                │
│ Mistral AI         │ mistral-large, mistral-small         │
│ Groq               │ llama-3.1-70b, mixtral-8x7b          │
│ Together AI        │ llama-3.1, CodeLlama, Qwen           │
│ Ollama (local)     │ llama3.1, codellama, mistral, etc.   │
│ vLLM (self-hosted) │ Any HuggingFace model                │
│ Cohere             │ command-r-plus, embed models         │
│ Deepseek           │ deepseek-chat, deepseek-coder        │
│ Fireworks AI       │ llama, mixtral, etc.                 │
│ Perplexity         │ pplx-70b-online, etc.                │
└────────────────────┴──────────────────────────────────────┘
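
LiteLLM infers the provider from the `provider/model` prefix in these model strings. A simplified sketch of that convention — the real resolver also consults known model-name lists (which is why bare names like `gpt-4o` work), so treat this as illustrative only:

```python
def parse_model_string(model: str) -> tuple[str, str]:
    """Split 'provider/model' strings; bare names default to 'openai'.

    Simplification of LiteLLM's routing: the actual logic also checks
    known model-name lists, not just the prefix.
    """
    if "/" in model:
        provider, name = model.split("/", 1)
        return provider, name
    return "openai", model

print(parse_model_string("anthropic/claude-sonnet-4-20250514"))
print(parse_model_string("gpt-4o"))             # ('openai', 'gpt-4o')
print(parse_model_string("ollama/llama3.1:70b"))  # ('ollama', 'llama3.1:70b')
```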

4.2 AWS Bedrock Configuration

model_list:
  - model_name: bedrock-claude
    litellm_params:
      model: bedrock/anthropic.claude-3-sonnet-20240229-v1:0
      aws_access_key_id: os.environ/AWS_ACCESS_KEY_ID
      aws_secret_access_key: os.environ/AWS_SECRET_ACCESS_KEY
      aws_region_name: us-east-1

  - model_name: bedrock-llama
    litellm_params:
      model: bedrock/meta.llama3-1-70b-instruct-v1:0
      aws_region_name: us-west-2

4.3 Azure OpenAI Configuration

model_list:
  - model_name: azure-gpt4
    litellm_params:
      model: azure/gpt-4o-deployment-name
      api_base: https://your-resource.openai.azure.com/
      api_key: os.environ/AZURE_API_KEY
      api_version: "2024-06-01"

4.4 Ollama (Local LLM) Configuration

model_list:
  - model_name: local-llama
    litellm_params:
      model: ollama/llama3.1:70b
      api_base: http://localhost:11434
      stream: true

  - model_name: local-codellama
    litellm_params:
      model: ollama/codellama:34b
      api_base: http://localhost:11434

5. Model Routing Strategies

5.1 Cost-Based Routing (Lowest Cost)

Automatically selects the cheapest model.

router_settings:
  routing_strategy: cost-based-routing

model_list:
  - model_name: general-chat
    litellm_params:
      model: gpt-4o-mini
      # Input: $0.15/1M tokens, Output: $0.60/1M tokens

  - model_name: general-chat
    litellm_params:
      model: anthropic/claude-3-haiku-20240307
      # Input: $0.25/1M tokens, Output: $1.25/1M tokens

  - model_name: general-chat
    litellm_params:
      model: gemini/gemini-2.0-flash
      # Input: $0.075/1M tokens, Output: $0.30/1M tokens

With this configuration, calls to general-chat are automatically routed to Gemini Flash, the cheapest of the three deployments.
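
The routing decision can be sanity-checked with simple arithmetic over the list prices quoted in the comments above (prices change, so treat the numbers as illustrative):

```python
# $ per 1M tokens (input, output), taken from the config comments above
PRICING = {
    "gpt-4o-mini": (0.15, 0.60),
    "claude-3-haiku": (0.25, 1.25),
    "gemini-2.0-flash": (0.075, 0.30),
}

def request_cost(model: str, input_tokens: int, output_tokens: int) -> float:
    """Cost in dollars for one request at the listed per-1M-token rates."""
    in_price, out_price = PRICING[model]
    return (input_tokens * in_price + output_tokens * out_price) / 1_000_000

# A typical request: 1,000 input tokens, 500 output tokens
costs = {m: request_cost(m, 1_000, 500) for m in PRICING}
cheapest = min(costs, key=costs.get)
print(cheapest)  # gemini-2.0-flash
```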

5.2 Latency-Based Routing (Lowest Latency)

Selects the model with the fastest response time.

router_settings:
  routing_strategy: latency-based-routing
  routing_strategy_args:
    ttl: 60  # re-measure latency every 60 seconds

5.3 Usage-Based Routing

Routes requests to whichever deployment has spare capacity, measured in RPM (requests per minute).

router_settings:
  routing_strategy: usage-based-routing

model_list:
  - model_name: fast-model
    litellm_params:
      model: gpt-4o-mini
      rpm: 500  # max 500 requests per minute

  - model_name: fast-model
    litellm_params:
      model: gemini/gemini-2.0-flash
      rpm: 1000  # max 1000 requests per minute

5.4 Custom Tag-Based Routing

# Control routing via request metadata
response = client.chat.completions.create(
    model="general-chat",
    messages=[{"role": "user", "content": "Complex analysis..."}],
    extra_body={
        "metadata": {
            "tags": ["high-quality", "long-context"]
        }
    }
)

6. Load Balancing and Fallbacks

6.1 Load Balancing Configuration

Register multiple deployments under the same model_name and LiteLLM load-balances across them automatically.

model_list:
  # Spread one model across multiple API keys
  - model_name: gpt-4o
    litellm_params:
      model: openai/gpt-4o
      api_key: sk-key-1
      rpm: 100

  - model_name: gpt-4o
    litellm_params:
      model: openai/gpt-4o
      api_key: sk-key-2
      rpm: 100

  # Spread across different providers
  - model_name: fast-model
    litellm_params:
      model: openai/gpt-4o-mini
      rpm: 500

  - model_name: fast-model
    litellm_params:
      model: anthropic/claude-3-haiku-20240307
      rpm: 300

router_settings:
  routing_strategy: simple-shuffle  # round-robin-style distribution
  allowed_fails: 3  # disable a deployment after 3 failures
  cooldown_time: 60  # re-enable after 60 seconds
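
Conceptually, simple-shuffle picks one deployment per request at random, weighted by capacity. A sketch of that idea using the rpm values as weights — this is not LiteLLM's actual implementation, just the shape of it:

```python
import random

# Deployments registered under one model_name, with their rpm capacity
deployments = [
    {"api_key": "sk-key-1", "rpm": 100},
    {"api_key": "sk-key-2", "rpm": 300},
]

def pick_deployment(deps: list[dict], rng: random.Random) -> dict:
    """Weighted random choice: higher-rpm deployments are picked more often."""
    weights = [d["rpm"] for d in deps]
    return rng.choices(deps, weights=weights, k=1)[0]

rng = random.Random(42)  # seeded for reproducibility
picks = [pick_deployment(deployments, rng)["api_key"] for _ in range(1000)]
# sk-key-2 has 3x the capacity, so it should receive roughly 3x the traffic
print(picks.count("sk-key-2") > picks.count("sk-key-1"))
```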

6.2 Fallback Chain Configuration

When the primary model fails, requests automatically switch to a backup model.

litellm_settings:
  fallbacks:
    - model_name: gpt-4o
      fallback_models:
        - anthropic/claude-sonnet-4-20250514
        - gemini/gemini-1.5-pro

    - model_name: claude-sonnet
      fallback_models:
        - gpt-4o
        - gemini/gemini-1.5-pro

  # On content policy violations, try another model
  content_policy_fallbacks:
    - model_name: gpt-4o
      fallback_models:
        - anthropic/claude-sonnet-4-20250514

  # On context window overflow, try a larger model
  context_window_fallbacks:
    - model_name: gpt-4o-mini
      fallback_models:
        - gpt-4o
        - anthropic/claude-sonnet-4-20250514

6.3 Fallback Flow

User request: "gpt-4o" chat completion

[1] Try gpt-4o
    ├── Success → return response
    └── Failure (429 Rate Limit / 500 Error)
[2] Try claude-sonnet-4 (1st fallback)
    ├── Success → return response
    └── Failure
[3] Try gemini-1.5-pro (2nd fallback)
    ├── Success → return response
    └── Failure → return error
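
The flow above boils down to trying each model in order until one succeeds. A minimal sketch with a stubbed call function — the proxy does this for you once fallbacks are configured, so this only illustrates the logic:

```python
def complete_with_fallbacks(models, call):
    """Try each model in order; return the first successful response.

    `call` is any function that takes a model name and either returns a
    response or raises (e.g. on a 429/500 from the provider).
    """
    errors = []
    for model in models:
        try:
            return call(model)
        except Exception as exc:
            errors.append(f"{model}: {exc}")
    raise RuntimeError("All models failed: " + "; ".join(errors))

# Stub: the first two "providers" are down, the third succeeds
def fake_call(model):
    if model != "gemini-1.5-pro":
        raise ConnectionError("429 rate limited")
    return f"response from {model}"

chain = ["gpt-4o", "claude-sonnet-4", "gemini-1.5-pro"]
print(complete_with_fallbacks(chain, fake_call))  # response from gemini-1.5-pro
```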

7. Virtual Keys and Team Management

7.1 Creating Virtual Keys

# Create a new virtual key with the master key
curl -X POST http://localhost:4000/key/generate \
  -H "Authorization: Bearer sk-master-key-1234" \
  -H "Content-Type: application/json" \
  -d '{
    "key_alias": "frontend-team-key",
    "duration": "30d",
    "max_budget": 100.0,
    "models": ["gpt-4o", "claude-sonnet"],
    "max_parallel_requests": 10,
    "tpm_limit": 100000,
    "rpm_limit": 100,
    "metadata": {
      "team": "frontend",
      "environment": "production"
    }
  }'

7.2 Creating and Managing Teams

# Create a team
curl -X POST http://localhost:4000/team/new \
  -H "Authorization: Bearer sk-master-key-1234" \
  -H "Content-Type: application/json" \
  -d '{
    "team_alias": "ml-engineering",
    "max_budget": 500.0,
    "models": ["gpt-4o", "claude-sonnet", "gemini-flash"],
    "members_with_roles": [
      {"role": "admin", "user_id": "user-alice@company.com"},
      {"role": "user", "user_id": "user-bob@company.com"}
    ],
    "metadata": {
      "department": "engineering",
      "cost_center": "ENG-001"
    }
  }'

# Assign a key to the team
curl -X POST http://localhost:4000/key/generate \
  -H "Authorization: Bearer sk-master-key-1234" \
  -H "Content-Type: application/json" \
  -d '{
    "team_id": "team-ml-engineering-id",
    "key_alias": "ml-team-prod-key",
    "max_budget": 200.0
  }'

7.3 Per-Key Model Access Control

# Define model access groups in config.yaml
model_list:
  - model_name: premium-model
    litellm_params:
      model: openai/gpt-4o
    model_info:
      access_groups: ["premium-tier"]

  - model_name: basic-model
    litellm_params:
      model: openai/gpt-4o-mini
    model_info:
      access_groups: ["basic-tier", "premium-tier"]
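
Access-group checks reduce to set intersection between a key's groups and a model's access_groups. A sketch of the rule, mirroring the config above (illustrative, not LiteLLM's code):

```python
def can_access(key_groups: set[str], model_access_groups: set[str]) -> bool:
    """A key may call a model if it shares at least one access group."""
    return bool(key_groups & model_access_groups)

# Mirrors the config.yaml above
models = {
    "premium-model": {"premium-tier"},
    "basic-model": {"basic-tier", "premium-tier"},
}

basic_key = {"basic-tier"}
allowed = [name for name, groups in models.items() if can_access(basic_key, groups)]
print(allowed)  # ['basic-model']
```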

8. Budget Management and Cost Tracking

8.1 Global Budget

general_settings:
  max_budget: 10000.0  # $10,000 total monthly budget
  budget_duration: 1m   # reset monthly

8.2 Per-Key / Per-Team Budgets

# Update a key's budget
curl -X POST http://localhost:4000/key/update \
  -H "Authorization: Bearer sk-master-key-1234" \
  -H "Content-Type: application/json" \
  -d '{
    "key": "sk-virtual-key-xxx",
    "max_budget": 50.0,
    "budget_duration": "1m",
    "soft_budget": 40.0
  }'

When spend reaches soft_budget an alert is sent; when it reaches max_budget, requests are blocked.
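
The two thresholds yield three states per key. A sketch of the check the proxy effectively performs on each request (illustrative):

```python
def budget_state(spend: float, soft_budget: float, max_budget: float) -> str:
    """Classify a key's spend: 'ok', 'alert' (soft budget hit), or 'blocked'."""
    if spend >= max_budget:
        return "blocked"   # requests with this key are rejected
    if spend >= soft_budget:
        return "alert"     # an alert is sent, requests still succeed
    return "ok"

# Using the values from the curl example above
for spend in (10.0, 45.0, 50.0):
    print(spend, budget_state(spend, soft_budget=40.0, max_budget=50.0))
```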

8.3 Cost Tracking API

# Query total spend
curl http://localhost:4000/spend/logs \
  -H "Authorization: Bearer sk-master-key-1234" \
  -G \
  -d "start_date=2025-01-01" \
  -d "end_date=2025-01-31"

# Query spend by model
curl http://localhost:4000/spend/tags \
  -H "Authorization: Bearer sk-master-key-1234" \
  -G \
  -d "group_by=model"

# Query spend by team
curl http://localhost:4000/spend/tags \
  -H "Authorization: Bearer sk-master-key-1234" \
  -G \
  -d "group_by=team"

8.4 Budget Alert Webhooks

general_settings:
  alerting:
    - slack
  alerting_threshold: 300  # alert when a request hangs longer than 300 seconds
  alert_types:
    - budget_alerts
    - spend_reports
    - failed_tracking

environment_variables:
  SLACK_WEBHOOK_URL: os.environ/SLACK_WEBHOOK_URL

8.5 Example Cost Dashboard

┌─────────────────────────────────────────────────────┐
│  LiteLLM Cost Dashboard (January 2025)              │
├─────────────────────────────────────────────────────┤
│                                                     │
│  Total Spend: $3,245.67 / $10,000.00 budget         │
│  ████████████████░░░░░░░░░░ 32.5%                   │
│                                                     │
│  By Model:                                          │
│  ├─ gpt-4o:          $1,823.45 (56.2%)              │
│  ├─ claude-sonnet:   $891.23  (27.5%)               │
│  ├─ gemini-flash:    $234.56  (7.2%)                │
│  └─ gpt-4o-mini:     $296.43  (9.1%)                │
│                                                     │
│  By Team:                                           │
│  ├─ ML Engineering:  $1,456.78                      │
│  ├─ Frontend:        $987.65                        │
│  └─ Data Science:    $801.24                        │
│                                                     │
│  Daily Trend: ▁▂▃▅▆▇█▇▆▅▃▂▁▂▃▅▆                     │
└─────────────────────────────────────────────────────┘

9. Rate Limiting

9.1 Key-Level Rate Limits

curl -X POST http://localhost:4000/key/generate \
  -H "Authorization: Bearer sk-master-key-1234" \
  -H "Content-Type: application/json" \
  -d '{
    "key_alias": "rate-limited-key",
    "rpm_limit": 60,
    "tpm_limit": 100000,
    "max_parallel_requests": 5
  }'

9.2 Model-Level Rate Limits

model_list:
  - model_name: gpt-4o
    litellm_params:
      model: openai/gpt-4o
      api_key: os.environ/OPENAI_API_KEY
      rpm: 200    # max requests per minute for this deployment
      tpm: 400000 # max tokens per minute for this deployment

9.3 Global Rate Limits

general_settings:
  global_max_parallel_requests: 100  # cap on concurrent requests across the proxy
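
An RPM limit is commonly enforced with a sliding window over request timestamps. A self-contained sketch of the idea — LiteLLM's implementation differs and also enforces TPM:

```python
from collections import deque

class SlidingWindowLimiter:
    """Allow at most `rpm` requests in any 60-second window."""

    def __init__(self, rpm: int):
        self.rpm = rpm
        self.timestamps: deque = deque()

    def allow(self, now: float) -> bool:
        # Drop timestamps that fell out of the 60-second window
        while self.timestamps and now - self.timestamps[0] >= 60.0:
            self.timestamps.popleft()
        if len(self.timestamps) >= self.rpm:
            return False  # over the limit → caller should return HTTP 429
        self.timestamps.append(now)
        return True

limiter = SlidingWindowLimiter(rpm=3)
results = [limiter.allow(t) for t in (0.0, 1.0, 2.0, 3.0, 61.0)]
print(results)  # [True, True, True, False, True]
```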

10. Guardrails: PII Masking and Content Filtering

10.1 PII Masking

litellm_settings:
  guardrails:
    - guardrail_name: pii-masking
      litellm_params:
        guardrail: presidio
        mode: during_call  # mask PII while the request is processed
        output_parse_pii: true  # also mask PII in responses

# A request containing PII
response = client.chat.completions.create(
    model="gpt-4o",
    messages=[{
        "role": "user",
        "content": "My email is john@example.com and phone is 010-1234-5678"
    }],
    extra_body={
        "metadata": {
            "guardrails": ["pii-masking"]
        }
    }
)
# The LLM receives "My email is [EMAIL] and phone is [PHONE]"
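
Under the hood, a PII guardrail detects entities and replaces them with placeholder tokens. A toy regex version of that idea — Presidio uses proper entity recognizers, so this is only illustrative:

```python
import re

# Minimal stand-in for what a PII guardrail like Presidio does
PII_PATTERNS = [
    (re.compile(r"[\w.+-]+@[\w-]+\.[\w.]+"), "[EMAIL]"),
    (re.compile(r"\b\d{2,3}-\d{3,4}-\d{4}\b"), "[PHONE]"),
]

def mask_pii(text: str) -> str:
    """Replace detected PII spans with placeholder tokens."""
    for pattern, token in PII_PATTERNS:
        text = pattern.sub(token, text)
    return text

print(mask_pii("My email is john@example.com and phone is 010-1234-5678"))
# My email is [EMAIL] and phone is [PHONE]
```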

10.2 Content Filtering (Lakera Guard)

litellm_settings:
  guardrails:
    - guardrail_name: content-filter
      litellm_params:
        guardrail: lakera
        mode: pre_call  # filter before the call reaches the LLM
        api_key: os.environ/LAKERA_API_KEY

10.3 Prompt Injection Defense

litellm_settings:
  guardrails:
    - guardrail_name: prompt-injection
      litellm_params:
        guardrail: lakera
        mode: pre_call
        prompt_injection: true

10.4 Custom Guardrails

# custom_guardrail.py
# Note: the base-class import path and hook signatures follow LiteLLM's
# custom guardrail interface; verify them against your installed version.
from litellm.proxy.guardrails.guardrail_hooks import GuardrailHook

class CustomContentFilter(GuardrailHook):
    def __init__(self):
        self.blocked_words = ["harmful", "dangerous", "illegal"]

    async def pre_call_hook(self, data, call_type):
        messages = data.get("messages") or []
        user_message = messages[-1].get("content", "") if messages else ""
        for word in self.blocked_words:
            if word.lower() in user_message.lower():
                raise ValueError(f"Blocked content detected: {word}")
        return data

    async def post_call_hook(self, data, response, call_type):
        # Post-process the response here
        return response

11. Logging and Monitoring

11.1 Langfuse Integration

litellm_settings:
  success_callback: ["langfuse"]
  failure_callback: ["langfuse"]

environment_variables:
  LANGFUSE_PUBLIC_KEY: os.environ/LANGFUSE_PUBLIC_KEY
  LANGFUSE_SECRET_KEY: os.environ/LANGFUSE_SECRET_KEY
  LANGFUSE_HOST: https://cloud.langfuse.com

11.2 LangSmith Integration

litellm_settings:
  success_callback: ["langsmith"]

environment_variables:
  LANGCHAIN_API_KEY: os.environ/LANGCHAIN_API_KEY
  LANGCHAIN_PROJECT: my-litellm-project

11.3 Custom Callbacks

# custom_callback.py
from litellm.integrations.custom_logger import CustomLogger

class MyCustomLogger(CustomLogger):
    async def async_log_success_event(self, kwargs, response_obj, start_time, end_time):
        print(f"Model: {kwargs.get('model')}")
        print(f"Cost: ${kwargs.get('response_cost', 0):.6f}")
        print(f"Latency: {(end_time - start_time).total_seconds():.2f}s")
        print(f"Tokens: {response_obj.usage.total_tokens}")

    async def async_log_failure_event(self, kwargs, response_obj, start_time, end_time):
        print(f"FAILED - Model: {kwargs.get('model')}")
        print(f"Error: {kwargs.get('exception')}")

11.4 Prometheus Metrics

litellm_settings:
  success_callback: ["prometheus"]
  failure_callback: ["prometheus"]

Key metrics:

# Request counts
litellm_requests_total
litellm_requests_failed_total

# Latency
litellm_request_duration_seconds_bucket
litellm_request_duration_seconds_sum

# Spend
litellm_spend_total

# Tokens
litellm_tokens_total
litellm_input_tokens_total
litellm_output_tokens_total

12. Caching

12.1 Redis Caching

litellm_settings:
  cache: true
  cache_params:
    type: redis
    host: redis-host
    port: 6379
    password: os.environ/REDIS_PASSWORD
    ttl: 3600  # cache for 1 hour

    # Call types that go through the cache
    supported_call_types:
      - acompletion
      - completion
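
Exact-match caching works by deriving a deterministic key from the request, so identical requests map to the same cache entry. A sketch of the idea — LiteLLM's actual key format differs:

```python
import hashlib
import json

def cache_key(model: str, messages: list, **params) -> str:
    """Deterministic key: same model + messages + params → same key."""
    payload = json.dumps(
        {"model": model, "messages": messages, "params": params},
        sort_keys=True,  # key is independent of parameter order
    )
    return hashlib.sha256(payload.encode()).hexdigest()

msgs = [{"role": "user", "content": "What is 2+2?"}]
k1 = cache_key("gpt-4o", msgs, temperature=0.0)
k2 = cache_key("gpt-4o", msgs, temperature=0.0)
k3 = cache_key("gpt-4o", msgs, temperature=0.7)
print(k1 == k2, k1 == k3)  # True False
```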

12.2 In-Memory Caching

litellm_settings:
  cache: true
  cache_params:
    type: local
    ttl: 600  # cache for 10 minutes

12.3 Cache Control

# Use the cache
response = client.chat.completions.create(
    model="gpt-4o",
    messages=[{"role": "user", "content": "What is 2+2?"}],
    extra_body={
        "metadata": {
            "cache": {
                "no-cache": False,  # serve from the cache when possible
                "ttl": 3600         # TTL for this request
            }
        }
    }
)

# Bypass the cache
response = client.chat.completions.create(
    model="gpt-4o",
    messages=[{"role": "user", "content": "What is the latest news?"}],
    extra_body={
        "metadata": {
            "cache": {
                "no-cache": True  # always fetch a fresh response
            }
        }
    }
)

13. Production Deployment

13.1 Kubernetes Deployment

# litellm-deployment.yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: litellm-proxy
  labels:
    app: litellm
spec:
  replicas: 3
  selector:
    matchLabels:
      app: litellm
  template:
    metadata:
      labels:
        app: litellm
    spec:
      containers:
        - name: litellm
          image: ghcr.io/berriai/litellm:main-latest
          ports:
            - containerPort: 4000
          args: ["--config", "/app/config.yaml", "--port", "4000"]
          volumeMounts:
            - name: config
              mountPath: /app/config.yaml
              subPath: config.yaml
          env:
            - name: OPENAI_API_KEY
              valueFrom:
                secretKeyRef:
                  name: litellm-secrets
                  key: openai-api-key
            - name: ANTHROPIC_API_KEY
              valueFrom:
                secretKeyRef:
                  name: litellm-secrets
                  key: anthropic-api-key
            - name: DATABASE_URL
              valueFrom:
                secretKeyRef:
                  name: litellm-secrets
                  key: database-url
          resources:
            requests:
              cpu: "500m"
              memory: "512Mi"
            limits:
              cpu: "2000m"
              memory: "2Gi"
          livenessProbe:
            httpGet:
              path: /health/liveliness
              port: 4000
            initialDelaySeconds: 30
            periodSeconds: 10
          readinessProbe:
            httpGet:
              path: /health/readiness
              port: 4000
            initialDelaySeconds: 10
            periodSeconds: 5
      volumes:
        - name: config
          configMap:
            name: litellm-config
---
apiVersion: v1
kind: Service
metadata:
  name: litellm-service
spec:
  selector:
    app: litellm
  ports:
    - port: 80
      targetPort: 4000
  type: ClusterIP
---
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: litellm-hpa
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: litellm-proxy
  minReplicas: 2
  maxReplicas: 10
  metrics:
    - type: Resource
      resource:
        name: cpu
        target:
          type: Utilization
          averageUtilization: 70

13.2 Using the Helm Chart

# Install via Helm
helm repo add litellm https://litellm.github.io/litellm-helm
helm repo update

helm install litellm litellm/litellm-helm \
  --set masterKey=sk-your-master-key \
  --set replicaCount=3 \
  --set database.useExisting=true \
  --set database.url=postgresql://user:pass@host:5432/litellm \
  --namespace litellm \
  --create-namespace

13.3 Production Checklist

Production Deployment Checklist
================================
[ ] Set up PostgreSQL and verify the connection
[ ] Set the master key to a strong random value
[ ] Manage all API keys via environment variables / secrets
[ ] Configure HTTPS/TLS (Ingress or LoadBalancer)
[ ] Run at least 2 replicas
[ ] Configure an HPA (Horizontal Pod Autoscaler)
[ ] Configure liveness/readiness probes
[ ] Set up Prometheus + Grafana monitoring
[ ] Set up Slack/PagerDuty alerts
[ ] Set up a Redis cache (optional)
[ ] Configure budgets and rate limits
[ ] Configure guardrails (if needed)
[ ] Set up logging (Langfuse/LangSmith)
[ ] Establish a backup and recovery strategy
[ ] Complete load testing

14. LiteLLM vs Alternatives

14.1 LiteLLM vs OpenRouter

┌──────────────────┬────────────────────────────┬────────────────────────┐
│ Feature          │ LiteLLM                    │ OpenRouter             │
├──────────────────┼────────────────────────────┼────────────────────────┤
│ Hosting          │ Self-hosted                │ Cloud service          │
│ Data privacy     │ Full control               │ Routed via third party │
│ Cost             │ Open source, free          │ API margin added       │
│ Team management  │ Virtual keys/teams/budgets │ Limited                │
│ Customization    │ Fully customizable         │ Limited                │
│ Setup difficulty │ Medium (Docker/K8s needed) │ Easy (just an API key) │
│ Local models     │ Supported (Ollama/vLLM)    │ Not supported          │
└──────────────────┴────────────────────────────┴────────────────────────┘

14.2 LiteLLM vs Portkey

┌──────────────┬────────────────────────────┬─────────────────────────────┐
│ Feature      │ LiteLLM                    │ Portkey                     │
├──────────────┼────────────────────────────┼─────────────────────────────┤
│ Open source  │ Yes (Apache 2.0)           │ Partial (Gateway only)      │
│ Proxy server │ Included                   │ Included                    │
│ Virtual keys │ Included                   │ Included                    │
│ AI gateway   │ Basic                      │ Advanced (prompt mgmt etc.) │
│ Pricing      │ Free                       │ Premium plans available     │
│ Community    │ Active (12K+ GitHub stars) │ Growing                     │
└──────────────┴────────────────────────────┴─────────────────────────────┘

14.3 Which One Should You Choose?

┌─────────────────────────────────────────────────┐
│ Recommended solution by scenario                │
├─────────────────────────────────────────────────┤
│                                                 │
│ "I want to start fast"                          │
│   → OpenRouter (sign up and go)                 │
│                                                 │
│ "Data security matters"                         │
│   → LiteLLM (self-hosted, full control)         │
│                                                 │
│ "I need team/org management"                    │
│   → LiteLLM or Portkey                          │
│                                                 │
│ "I also use local models"                       │
│   → LiteLLM (native Ollama/vLLM support)        │
│                                                 │
│ "I need enterprise features"                    │
│   → LiteLLM Enterprise or Portkey Enterprise    │
│                                                 │
└─────────────────────────────────────────────────┘

15. In Practice: A Complete Production Configuration

Below is a complete config.yaml example you can adapt for real production use.

# production-config.yaml
model_list:
  # === Premium tier ===
  - model_name: premium-chat
    litellm_params:
      model: openai/gpt-4o
      api_key: os.environ/OPENAI_API_KEY
      rpm: 200
      tpm: 400000
    model_info:
      access_groups: ["premium"]

  - model_name: premium-chat
    litellm_params:
      model: anthropic/claude-sonnet-4-20250514
      api_key: os.environ/ANTHROPIC_API_KEY
      rpm: 150
    model_info:
      access_groups: ["premium"]

  # === Basic tier ===
  - model_name: basic-chat
    litellm_params:
      model: openai/gpt-4o-mini
      api_key: os.environ/OPENAI_API_KEY
      rpm: 500
    model_info:
      access_groups: ["basic", "premium"]

  - model_name: basic-chat
    litellm_params:
      model: gemini/gemini-2.0-flash
      api_key: os.environ/GEMINI_API_KEY
      rpm: 1000
    model_info:
      access_groups: ["basic", "premium"]

  # === Coding-only ===
  - model_name: code-model
    litellm_params:
      model: anthropic/claude-sonnet-4-20250514
      api_key: os.environ/ANTHROPIC_API_KEY
    model_info:
      access_groups: ["premium"]

  # === Embeddings ===
  - model_name: embedding
    litellm_params:
      model: openai/text-embedding-3-small
      api_key: os.environ/OPENAI_API_KEY

# === Router settings ===
router_settings:
  routing_strategy: latency-based-routing
  allowed_fails: 3
  cooldown_time: 60
  num_retries: 3
  timeout: 120
  retry_after: 5

# === LiteLLM settings ===
litellm_settings:
  drop_params: true
  set_verbose: false
  num_retries: 3
  request_timeout: 120

  # Fallbacks
  fallbacks:
    - model_name: premium-chat
      fallback_models: ["basic-chat"]
    - model_name: basic-chat
      fallback_models: ["premium-chat"]

  context_window_fallbacks:
    - model_name: basic-chat
      fallback_models: ["premium-chat"]

  # Caching
  cache: true
  cache_params:
    type: redis
    host: os.environ/REDIS_HOST
    port: 6379
    password: os.environ/REDIS_PASSWORD
    ttl: 3600

  # Callbacks
  success_callback: ["langfuse", "prometheus"]
  failure_callback: ["langfuse", "prometheus"]

  # Guardrails
  guardrails:
    - guardrail_name: pii-masking
      litellm_params:
        guardrail: presidio
        mode: during_call

# === General settings ===
general_settings:
  master_key: os.environ/LITELLM_MASTER_KEY
  database_url: os.environ/DATABASE_URL
  max_budget: 10000
  budget_duration: 1m

  alerting:
    - slack
  alerting_threshold: 300

  global_max_parallel_requests: 200

# === Environment variables ===
environment_variables:
  LANGFUSE_PUBLIC_KEY: os.environ/LANGFUSE_PUBLIC_KEY
  LANGFUSE_SECRET_KEY: os.environ/LANGFUSE_SECRET_KEY
  SLACK_WEBHOOK_URL: os.environ/SLACK_WEBHOOK_URL

Quiz

Let's review what we've covered.

Q1. What are LiteLLM's two usage modes, and when is each appropriate?

A1. SDK mode and Proxy mode.

  • SDK mode: import the Python package directly. Suited to personal projects and prototypes.
  • Proxy mode: run as a standalone HTTP server. Suited to team/org environments, multi-language clients, and production setups that need virtual keys and budget management.

The key difference is that Proxy mode provides enterprise features: virtual keys, team management, budgets, and logging.

Q2. What is a model fallback, and which kinds does LiteLLM support?

A2. A model fallback automatically switches to a backup model when the primary model fails.

LiteLLM supports three kinds of fallbacks:

  • General fallbacks (fallbacks): switch to a backup model on call failures (429, 500 errors, etc.)
  • Content policy fallbacks (content_policy_fallbacks): switch models on content policy violations
  • Context window fallbacks (context_window_fallbacks): switch to a larger model when the input exceeds the model's context window

Q3. How does LiteLLM's cost-based routing work?

A3. Register models from multiple providers under the same model_name and set routing_strategy to cost-based-routing; LiteLLM then compares each model's input/output token prices and automatically routes to the cheapest one.

For example, registering GPT-4o-mini, Claude Haiku, and Gemini Flash under one name means the lowest-priced model is selected automatically, optimizing cost while maintaining quality.

Q4. What do virtual keys do, and what are their main features?

A4. Virtual keys are API keys issued by the LiteLLM proxy. They provide fine-grained control without exposing the underlying provider API keys.

Main features:

  • Budget limits: per-key maximum budget (max_budget)
  • Model access control: restrict which models a key may call
  • Rate limits: RPM, TPM, and concurrent request caps
  • Team binding: assign keys to teams for team-level management
  • Expiry: set a key's validity period
  • Usage tracking: per-key cost, token, and request counts

Q5. What five things must you consider when deploying LiteLLM to production?

A5. Essential considerations for production deployment:

  1. Database: a PostgreSQL connection is required (stores virtual keys, spend tracking, and team data)
  2. Security: strong random master key, all API keys managed as secrets, HTTPS/TLS enabled
  3. High availability: at least 2 replicas, HPA, liveness/readiness probes
  4. Monitoring: Prometheus metrics, Grafana dashboards, Slack/PagerDuty alerts
  5. Cost control: global budget, per-team/per-key budgets, soft-budget alerts, rate limits

Additionally, Redis caching, logging (Langfuse), guardrails, and a backup strategy are recommended.


References

  1. LiteLLM Official Documentation - https://docs.litellm.ai/
  2. LiteLLM GitHub - https://github.com/BerriAI/litellm
  3. LiteLLM Proxy Server Guide - https://docs.litellm.ai/docs/proxy/quick_start
  4. LiteLLM Supported Providers - https://docs.litellm.ai/docs/providers
  5. LiteLLM Virtual Key Management - https://docs.litellm.ai/docs/proxy/virtual_keys
  6. LiteLLM Routing Strategies - https://docs.litellm.ai/docs/routing
  7. LiteLLM Budget Management - https://docs.litellm.ai/docs/proxy/users
  8. LiteLLM Guardrails - https://docs.litellm.ai/docs/proxy/guardrails
  9. LiteLLM Kubernetes Deployment - https://docs.litellm.ai/docs/proxy/deploy
  10. OpenRouter Official Site - https://openrouter.ai/
  11. Portkey AI Official Site - https://portkey.ai/
  12. Langfuse LLM Observability - https://langfuse.com/
  13. LiteLLM vs Alternatives Comparison - https://docs.litellm.ai/docs/proxy/enterprise
  14. Prometheus + Grafana Monitoring - https://prometheus.io/

LiteLLM Complete Guide 2025: Unify 100+ LLMs with a Single API Proxy Server

Introduction: Why LiteLLM?

In 2025, the AI landscape is evolving faster than ever. New models launch every month: OpenAI's GPT-4o, Anthropic's Claude Opus 4, Google's Gemini 2.0, Meta's Llama 3.1, Mistral's Large, and more. Each provider has its own API format, authentication scheme, and pricing model.

Organizations face three critical challenges in this environment.

1. Vendor Lock-In

When you build against one provider's API, switching to a better model incurs significant refactoring costs.

# Problem: Each provider has a different API format
# OpenAI
import openai
response = openai.ChatCompletion.create(model="gpt-4", messages=[...])

# Anthropic
import anthropic
response = anthropic.messages.create(model="claude-3-opus", messages=[...])

# Google
import google.generativeai as genai
response = genai.GenerativeModel("gemini-pro").generate_content(...)

2. Cost Management Complexity

Using multiple models means tracking costs across different providers with different pricing structures becomes a nightmare.

3. Availability and Reliability

Depending on a single provider means a service outage takes your entire system down.

LiteLLM solves all three problems with a single solution.

┌──────────────────────────────────────────────────┐
│ Your Application             (OpenAI SDK Format) │
└─────────────────────┬────────────────────────────┘
                      │
                      ▼
┌──────────────────────────────────────────────────┐
│               LiteLLM Proxy Server               │
│  ┌──────────┐ ┌──────────┐ ┌──────────────────┐ │
│  │  Cost    │ │  Load    │ │  Virtual Keys    │ │
│  │ Tracking │ │ Balance  │ │  + Budget Mgmt   │ │
│  └──────────┘ └──────────┘ └──────────────────┘ │
└───┬──────────────┬──────────────┬────────────────┘
    │              │              │
    ▼              ▼              ▼
┌────────┐  ┌──────────┐  ┌───────────┐
│ OpenAI │  │ Anthropic│  │  Google   │
│ GPT-4o │  │  Claude  │  │  Gemini   │
└────────┘  └──────────┘  └───────────┘
    ▲              ▲              ▲
    │              │              │
┌────────┐  ┌──────────┐  ┌───────────┐
│  AWS   │  │  Azure   │  │  Ollama   │
│Bedrock │  │  OpenAI  │  │  (Local)  │
└────────┘  └──────────┘  └───────────┘

1. LiteLLM Architecture: SDK vs Proxy

LiteLLM operates in two modes.

1.1 SDK Mode (Python Package)

Import directly into your application. Best for simple projects and prototypes.

# pip install litellm
from litellm import completion

# Call OpenAI
response = completion(
    model="gpt-4o",
    messages=[{"role": "user", "content": "Hello!"}]
)

# Call Anthropic (same format!)
response = completion(
    model="anthropic/claude-sonnet-4-20250514",
    messages=[{"role": "user", "content": "Hello!"}]
)

# Call Google Gemini (same format!)
response = completion(
    model="gemini/gemini-2.0-flash",
    messages=[{"role": "user", "content": "Hello!"}]
)

1.2 Proxy Mode (Server)

Run as a standalone server so all applications use a single endpoint. Best for teams and organizations.

# Start proxy server
litellm --config config.yaml --port 4000

# Any language can call using OpenAI SDK format
import openai

client = openai.OpenAI(
    api_key="sk-litellm-your-virtual-key",
    base_url="http://localhost:4000"
)

response = client.chat.completions.create(
    model="gpt-4o",
    messages=[{"role": "user", "content": "Hello!"}]
)

1.3 SDK vs Proxy Comparison

Feature           | SDK Mode                       | Proxy Mode
------------------+--------------------------------+---------------------------------------
Usage             | Python package import          | HTTP server (REST API)
Language Support  | Python only                    | All languages (OpenAI SDK compatible)
Team Management   | Not available                  | Virtual keys with team management
Budget Management | Basic                          | Advanced (per-team / per-key)
Load Balancing    | Manual in code                 | Automatic via config
Best For          | Personal projects, prototypes  | Teams, organizations, production

2. SDK Usage: Unified LLM Calls

2.1 Chat Completion

The core SDK feature. Call 100+ models with an identical interface.

from litellm import completion
import os

# Set API keys via environment variables
os.environ["OPENAI_API_KEY"] = "sk-..."
os.environ["ANTHROPIC_API_KEY"] = "sk-ant-..."
os.environ["GEMINI_API_KEY"] = "AI..."

# Same function, different models
models = [
    "gpt-4o",
    "anthropic/claude-sonnet-4-20250514",
    "gemini/gemini-2.0-flash",
    "groq/llama-3.1-70b-versatile",
    "mistral/mistral-large-latest",
]

for model in models:
    response = completion(
        model=model,
        messages=[
            {"role": "system", "content": "You are a helpful assistant."},
            {"role": "user", "content": "Explain quantum computing in 2 sentences."}
        ],
        temperature=0.7,
        max_tokens=200,
    )
    print(f"[{model}]: {response.choices[0].message.content}")

2.2 Streaming Responses

from litellm import completion

response = completion(
    model="anthropic/claude-sonnet-4-20250514",
    messages=[{"role": "user", "content": "Write a short poem about coding."}],
    stream=True,
)

for chunk in response:
    if chunk.choices[0].delta.content:
        print(chunk.choices[0].delta.content, end="", flush=True)

2.3 Embedding Calls

from litellm import embedding

# OpenAI Embedding
response = embedding(
    model="text-embedding-3-small",
    input=["Hello world", "How are you?"]
)

# Cohere Embedding (same format!)
response = embedding(
    model="cohere/embed-english-v3.0",
    input=["Hello world", "How are you?"]
)

print(f"Embedding dimension: {len(response.data[0]['embedding'])}")

2.4 Image Generation

from litellm import image_generation

response = image_generation(
    model="dall-e-3",
    prompt="A futuristic city powered by AI, digital art style",
    n=1,
    size="1024x1024",
)

print(f"Image URL: {response.data[0]['url']}")

2.5 Async Calls

import asyncio
from litellm import acompletion

async def generate_responses():
    tasks = [
        acompletion(
            model="gpt-4o-mini",
            messages=[{"role": "user", "content": f"Tell me fact #{i} about space."}]
        )
        for i in range(5)
    ]
    responses = await asyncio.gather(*tasks)
    for i, resp in enumerate(responses):
        print(f"Fact #{i}: {resp.choices[0].message.content[:100]}")

asyncio.run(generate_responses())

3. Proxy Server Configuration

3.1 Basic Configuration File (config.yaml)

model_list:
  - model_name: gpt-4o
    litellm_params:
      model: openai/gpt-4o
      api_key: os.environ/OPENAI_API_KEY

  - model_name: claude-sonnet
    litellm_params:
      model: anthropic/claude-sonnet-4-20250514
      api_key: os.environ/ANTHROPIC_API_KEY

  - model_name: gemini-flash
    litellm_params:
      model: gemini/gemini-2.0-flash
      api_key: os.environ/GEMINI_API_KEY

  - model_name: llama-local
    litellm_params:
      model: ollama/llama3.1:70b
      api_base: http://localhost:11434

litellm_settings:
  drop_params: true
  set_verbose: false
  num_retries: 3
  request_timeout: 120

general_settings:
  master_key: sk-master-key-1234
  database_url: os.environ/DATABASE_URL

3.2 Getting Started with Docker

# Docker Compose file
cat > docker-compose.yaml << 'EOF'
version: "3.9"
services:
  litellm:
    image: ghcr.io/berriai/litellm:main-latest
    ports:
      - "4000:4000"
    volumes:
      - ./config.yaml:/app/config.yaml
    command: ["--config", "/app/config.yaml", "--port", "4000"]
    environment:
      - OPENAI_API_KEY
      - ANTHROPIC_API_KEY
      - GEMINI_API_KEY
      - DATABASE_URL=postgresql://user:pass@db:5432/litellm
    depends_on:
      - db

  db:
    image: postgres:16
    environment:
      POSTGRES_USER: user
      POSTGRES_PASSWORD: pass
      POSTGRES_DB: litellm
    volumes:
      - pgdata:/var/lib/postgresql/data

volumes:
  pgdata:
EOF

# Launch
docker compose up -d

3.3 Testing the Proxy Server

# Health check
curl http://localhost:4000/health

# Chat Completion call
curl -X POST http://localhost:4000/v1/chat/completions \
  -H "Authorization: Bearer sk-master-key-1234" \
  -H "Content-Type: application/json" \
  -d '{
    "model": "gpt-4o",
    "messages": [{"role": "user", "content": "Hello!"}]
  }'

# List models
curl http://localhost:4000/v1/models \
  -H "Authorization: Bearer sk-master-key-1234"

4. Supported Providers and Models

LiteLLM supports 100+ LLM providers. Here are the major ones.

4.1 Major Providers

┌────────────────────┬──────────────────────────────────────┐
│ Provider           │ Supported Models (examples)          │
├────────────────────┼──────────────────────────────────────┤
│ OpenAI             │ gpt-4o, gpt-4o-mini, o1, o3-mini     │
│ Anthropic          │ claude-opus-4, claude-sonnet, haiku  │
│ Google (Gemini)    │ gemini-2.0-flash, gemini-1.5-pro     │
│ AWS Bedrock        │ Claude, Titan, Llama via Bedrock     │
│ Azure OpenAI       │ gpt-4o (Azure hosted)                │
│ Mistral AI         │ mistral-large, mistral-small         │
│ Groq               │ llama-3.1-70b, mixtral-8x7b          │
│ Together AI        │ llama-3.1, CodeLlama, Qwen           │
│ Ollama (Local)     │ llama3.1, codellama, mistral         │
│ vLLM (Self-hosted) │ Any HuggingFace model                │
│ Cohere             │ command-r-plus, embed models         │
│ Deepseek           │ deepseek-chat, deepseek-coder        │
│ Fireworks AI       │ llama, mixtral, etc.                 │
│ Perplexity         │ pplx-70b-online, etc.                │
└────────────────────┴──────────────────────────────────────┘

4.2 AWS Bedrock Configuration

model_list:
  - model_name: bedrock-claude
    litellm_params:
      model: bedrock/anthropic.claude-3-sonnet-20240229-v1:0
      aws_access_key_id: os.environ/AWS_ACCESS_KEY_ID
      aws_secret_access_key: os.environ/AWS_SECRET_ACCESS_KEY
      aws_region_name: us-east-1

  - model_name: bedrock-llama
    litellm_params:
      model: bedrock/meta.llama3-1-70b-instruct-v1:0
      aws_region_name: us-west-2

4.3 Azure OpenAI Configuration

model_list:
  - model_name: azure-gpt4
    litellm_params:
      model: azure/gpt-4o-deployment-name
      api_base: https://your-resource.openai.azure.com/
      api_key: os.environ/AZURE_API_KEY
      api_version: "2024-06-01"

4.4 Ollama (Local LLM) Configuration

model_list:
  - model_name: local-llama
    litellm_params:
      model: ollama/llama3.1:70b
      api_base: http://localhost:11434
      stream: true

  - model_name: local-codellama
    litellm_params:
      model: ollama/codellama:34b
      api_base: http://localhost:11434

5. Model Routing Strategies

5.1 Cost-Based Routing (Lowest Cost)

Automatically selects the cheapest model.

router_settings:
  routing_strategy: cost-based-routing

model_list:
  - model_name: general-chat
    litellm_params:
      model: gpt-4o-mini
      # Input: $0.15/1M tokens, Output: $0.60/1M tokens

  - model_name: general-chat
    litellm_params:
      model: anthropic/claude-3-haiku-20240307
      # Input: $0.25/1M tokens, Output: $1.25/1M tokens

  - model_name: general-chat
    litellm_params:
      model: gemini/gemini-2.0-flash
      # Input: $0.075/1M tokens, Output: $0.30/1M tokens

With this setup, calling general-chat will automatically select Gemini Flash as the cheapest option.
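The routing decision itself is simple to reason about. A minimal, illustrative sketch of the idea (not LiteLLM's actual implementation), using the per-token prices from the config comments above:

```python
# Pick the deployment with the lowest estimated cost for a request.
# Prices mirror the comments in the config above ($ per 1M tokens).
deployments = {
    "gpt-4o-mini":                       {"input": 0.15,  "output": 0.60},
    "anthropic/claude-3-haiku-20240307": {"input": 0.25,  "output": 1.25},
    "gemini/gemini-2.0-flash":           {"input": 0.075, "output": 0.30},
}

def cheapest(deployments, input_tokens=1000, output_tokens=1000):
    """Return the deployment name with the lowest estimated request cost."""
    def estimate(prices):
        return (input_tokens * prices["input"]
                + output_tokens * prices["output"]) / 1_000_000
    return min(deployments, key=lambda name: estimate(deployments[name]))

print(cheapest(deployments))  # gemini/gemini-2.0-flash
```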

5.2 Latency-Based Routing (Lowest Latency)

Selects the model with the fastest response time.

router_settings:
  routing_strategy: latency-based-routing
  routing_strategy_args:
    ttl: 60  # Re-measure latency every 60 seconds

5.3 Usage-Based Routing

Routes to the model with the most available capacity based on RPM (requests per minute).

router_settings:
  routing_strategy: usage-based-routing

model_list:
  - model_name: fast-model
    litellm_params:
      model: gpt-4o-mini
      rpm: 500  # Max 500 requests per minute

  - model_name: fast-model
    litellm_params:
      model: gemini/gemini-2.0-flash
      rpm: 1000  # Max 1000 requests per minute

6. Load Balancing and Fallback

6.1 Load Balancing Configuration

Registering multiple deployments under the same model_name automatically enables load balancing.

model_list:
  # Distribute across multiple API keys for the same model
  - model_name: gpt-4o
    litellm_params:
      model: openai/gpt-4o
      api_key: sk-key-1
      rpm: 100

  - model_name: gpt-4o
    litellm_params:
      model: openai/gpt-4o
      api_key: sk-key-2
      rpm: 100

  # Distribute across different providers
  - model_name: fast-model
    litellm_params:
      model: openai/gpt-4o-mini
      rpm: 500

  - model_name: fast-model
    litellm_params:
      model: anthropic/claude-3-haiku-20240307
      rpm: 300

router_settings:
  routing_strategy: simple-shuffle  # Weighted random selection across deployments
  allowed_fails: 3  # Deactivate deployment after 3 failures
  cooldown_time: 60  # Reactivate after 60 seconds
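To make the rpm weights concrete, here is an illustrative sketch of rpm-weighted shuffle selection (our own toy code, not LiteLLM's internals): deployments with higher rpm capacity are picked proportionally more often.

```python
import random

deployments = [
    {"model": "openai/gpt-4o-mini", "rpm": 500},
    {"model": "anthropic/claude-3-haiku-20240307", "rpm": 300},
]

def pick_deployment(deployments):
    """Randomly pick a deployment, weighted by its rpm capacity."""
    weights = [d["rpm"] for d in deployments]
    return random.choices(deployments, weights=weights, k=1)[0]

counts = {d["model"]: 0 for d in deployments}
for _ in range(1000):
    counts[pick_deployment(deployments)["model"]] += 1
print(counts)  # roughly a 5:3 split between the two deployments
```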

6.2 Fallback Chain Configuration

Automatically switch to an alternative model when the primary fails.

litellm_settings:
  fallbacks:
    - model_name: gpt-4o
      fallback_models:
        - anthropic/claude-sonnet-4-20250514
        - gemini/gemini-1.5-pro

    - model_name: claude-sonnet
      fallback_models:
        - gpt-4o
        - gemini/gemini-1.5-pro

  # Fallback on content policy violations
  content_policy_fallbacks:
    - model_name: gpt-4o
      fallback_models:
        - anthropic/claude-sonnet-4-20250514

  # Fallback when context window is exceeded
  context_window_fallbacks:
    - model_name: gpt-4o-mini
      fallback_models:
        - gpt-4o
        - anthropic/claude-sonnet-4-20250514

6.3 Fallback Flow

User Request: Chat Completion with "gpt-4o"
    |
    v
[1] Try gpt-4o
    |
    +-- Success --> Return response
    |
    +-- Failure (429 Rate Limit / 500 Error)
        |
        v
[2] Try claude-sonnet-4 (1st fallback)
    |
    +-- Success --> Return response
    |
    +-- Failure
        |
        v
[3] Try gemini-1.5-pro (2nd fallback)
    |
    +-- Success --> Return response
    |
    +-- Failure --> Return error
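The flow above can also be sketched as plain client-side logic. `try_model` here is a stand-in for an actual completion call through the proxy; the proxy does this for you, but the control flow is the same.

```python
def complete_with_fallbacks(prompt, chain, try_model):
    """Try each model in order; return the first success, raise if all fail."""
    last_error = None
    for model in chain:
        try:
            return try_model(model, prompt)   # success: return immediately
        except Exception as exc:              # 429 / 500 / timeout, etc.
            last_error = exc
    raise RuntimeError(f"All models failed: {last_error}")

# Demo with a fake backend where the first two models are "down"
def fake_call(model, prompt):
    if model in ("gpt-4o", "claude-sonnet-4"):
        raise ConnectionError(f"{model} unavailable")
    return f"{model}: response to {prompt!r}"

chain = ["gpt-4o", "claude-sonnet-4", "gemini-1.5-pro"]
print(complete_with_fallbacks("hello", chain, fake_call))
```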

7. Virtual Keys and Team Management

7.1 Creating Virtual Keys

# Generate a new virtual key with the master key
curl -X POST http://localhost:4000/key/generate \
  -H "Authorization: Bearer sk-master-key-1234" \
  -H "Content-Type: application/json" \
  -d '{
    "key_alias": "frontend-team-key",
    "duration": "30d",
    "max_budget": 100.0,
    "models": ["gpt-4o", "claude-sonnet"],
    "max_parallel_requests": 10,
    "tpm_limit": 100000,
    "rpm_limit": 100,
    "metadata": {
      "team": "frontend",
      "environment": "production"
    }
  }'
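From Python, the same request body can be built programmatically before posting it to /key/generate. The helper below is hypothetical (not an official client API); the field names simply mirror the curl example above.

```python
import json

def key_generate_payload(alias, budget_usd, models, team=None, duration="30d"):
    """Build the JSON body for LiteLLM's /key/generate endpoint."""
    payload = {
        "key_alias": alias,
        "duration": duration,
        "max_budget": budget_usd,
        "models": models,
    }
    if team:
        payload["metadata"] = {"team": team}
    return json.dumps(payload)

body = key_generate_payload(
    "frontend-team-key", 100.0, ["gpt-4o", "claude-sonnet"], team="frontend"
)
print(body)
```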

7.2 Team Creation and Management

# Create a team
curl -X POST http://localhost:4000/team/new \
  -H "Authorization: Bearer sk-master-key-1234" \
  -H "Content-Type: application/json" \
  -d '{
    "team_alias": "ml-engineering",
    "max_budget": 500.0,
    "models": ["gpt-4o", "claude-sonnet", "gemini-flash"],
    "members_with_roles": [
      {"role": "admin", "user_id": "user-alice@company.com"},
      {"role": "user", "user_id": "user-bob@company.com"}
    ],
    "metadata": {
      "department": "engineering",
      "cost_center": "ENG-001"
    }
  }'

# Assign keys to a team
curl -X POST http://localhost:4000/key/generate \
  -H "Authorization: Bearer sk-master-key-1234" \
  -H "Content-Type: application/json" \
  -d '{
    "team_id": "team-ml-engineering-id",
    "key_alias": "ml-team-prod-key",
    "max_budget": 200.0
  }'

7.3 Model Access Control per Key

# Model access groups in config.yaml
model_list:
  - model_name: premium-model
    litellm_params:
      model: openai/gpt-4o
    model_info:
      access_groups: ["premium-tier"]

  - model_name: basic-model
    litellm_params:
      model: openai/gpt-4o-mini
    model_info:
      access_groups: ["basic-tier", "premium-tier"]

8. Budget Management and Cost Tracking

8.1 Global Budget Settings

general_settings:
  max_budget: 10000.0  # Total monthly budget $10,000
  budget_duration: 30d  # Reset every 30 days ("1m" would mean 1 minute)

8.2 Per-Key/Per-Team Budget

# Update key budget
curl -X POST http://localhost:4000/key/update \
  -H "Authorization: Bearer sk-master-key-1234" \
  -H "Content-Type: application/json" \
  -d '{
    "key": "sk-virtual-key-xxx",
    "max_budget": 50.0,
    "budget_duration": "1m",
    "soft_budget": 40.0
  }'

When soft_budget is reached, an alert is sent. When max_budget is reached, requests are blocked.
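The soft/max budget semantics can be captured in a few lines. A minimal sketch of the decision described above (our own illustration, not LiteLLM code):

```python
def budget_status(spend, soft_budget, max_budget):
    """Alert past soft_budget, block past max_budget, otherwise ok."""
    if spend >= max_budget:
        return "blocked"   # requests are rejected
    if spend >= soft_budget:
        return "alert"     # an alert is sent, requests still go through
    return "ok"

print(budget_status(30.0, 40.0, 50.0))  # ok
print(budget_status(42.0, 40.0, 50.0))  # alert
print(budget_status(55.0, 40.0, 50.0))  # blocked
```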

8.3 Cost Tracking API

# Query total spend
curl http://localhost:4000/spend/logs \
  -H "Authorization: Bearer sk-master-key-1234" \
  -G \
  -d "start_date=2025-01-01" \
  -d "end_date=2025-01-31"

# Query spend by model
curl http://localhost:4000/spend/tags \
  -H "Authorization: Bearer sk-master-key-1234" \
  -G \
  -d "group_by=model"

# Query spend by team
curl http://localhost:4000/spend/tags \
  -H "Authorization: Bearer sk-master-key-1234" \
  -G \
  -d "group_by=team"

8.4 Budget Alert Webhooks

general_settings:
  alerting:
    - slack
  alerting_threshold: 300  # Seconds before a slow/hanging request triggers an alert
  alert_types:
    - budget_alerts
    - spend_reports
    - failed_tracking

environment_variables:
  SLACK_WEBHOOK_URL: os.environ/SLACK_WEBHOOK_URL

9. Rate Limiting

9.1 Key-Level Rate Limiting

curl -X POST http://localhost:4000/key/generate \
  -H "Authorization: Bearer sk-master-key-1234" \
  -H "Content-Type: application/json" \
  -d '{
    "key_alias": "rate-limited-key",
    "rpm_limit": 60,
    "tpm_limit": 100000,
    "max_parallel_requests": 5
  }'

9.2 Model-Level Rate Limiting

model_list:
  - model_name: gpt-4o
    litellm_params:
      model: openai/gpt-4o
      api_key: os.environ/OPENAI_API_KEY
      rpm: 200    # Max requests per minute for this deployment
      tpm: 400000 # Max tokens per minute for this deployment

9.3 Global Rate Limiting

general_settings:
  global_max_parallel_requests: 100  # Global concurrent request limit
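A concurrent-request cap like global_max_parallel_requests is essentially a semaphore around the upstream call. An illustrative asyncio sketch (not LiteLLM's implementation):

```python
import asyncio

async def handle(sem, peak):
    async with sem:                        # at most `limit` requests in flight
        peak["active"] += 1
        peak["max"] = max(peak["max"], peak["active"])
        await asyncio.sleep(0.01)          # stand-in for the upstream LLM call
        peak["active"] -= 1

async def main(limit=5, total=20):
    sem = asyncio.Semaphore(limit)
    peak = {"active": 0, "max": 0}
    await asyncio.gather(*(handle(sem, peak) for _ in range(total)))
    return peak["max"]

print(asyncio.run(main()))  # peak concurrency never exceeds 5
```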

10. Guardrails: PII Masking and Content Filtering

10.1 PII (Personally Identifiable Information) Masking

litellm_settings:
  guardrails:
    - guardrail_name: pii-masking
      litellm_params:
        guardrail: presidio
        mode: during_call  # Mask PII before sending to LLM
        output_parse_pii: true  # Also mask PII in responses

# Request containing PII
response = client.chat.completions.create(
    model="gpt-4o",
    messages=[{
        "role": "user",
        "content": "My email is john@example.com and phone is 555-123-4567"
    }],
    extra_body={
        "metadata": {
            "guardrails": ["pii-masking"]
        }
    }
)
# LLM receives: "My email is [EMAIL] and phone is [PHONE]"

10.2 Content Filtering (Lakera Guard)

litellm_settings:
  guardrails:
    - guardrail_name: content-filter
      litellm_params:
        guardrail: lakera
        mode: pre_call  # Filter before sending request
        api_key: os.environ/LAKERA_API_KEY

10.3 Prompt Injection Defense

litellm_settings:
  guardrails:
    - guardrail_name: prompt-injection
      litellm_params:
        guardrail: lakera
        mode: pre_call
        prompt_injection: true

10.4 Custom Guardrails

# custom_guardrail.py
# Note: the base class and hook signatures below follow the custom-guardrail
# pattern in the LiteLLM docs; verify the exact import path for your version.
from litellm.integrations.custom_guardrail import CustomGuardrail

class CustomContentFilter(CustomGuardrail):
    def __init__(self, **kwargs):
        self.blocked_words = ["harmful", "dangerous", "illegal"]
        super().__init__(**kwargs)

    async def async_pre_call_hook(self, user_api_key_dict, cache, data, call_type):
        messages = data.get("messages") or []
        user_message = messages[-1].get("content", "") if messages else ""
        for word in self.blocked_words:
            if word.lower() in user_message.lower():
                raise ValueError(f"Blocked content detected: {word}")
        return data

    async def async_post_call_success_hook(self, data, user_api_key_dict, response):
        return response

11. Logging and Monitoring

11.1 Langfuse Integration

litellm_settings:
  success_callback: ["langfuse"]
  failure_callback: ["langfuse"]

environment_variables:
  LANGFUSE_PUBLIC_KEY: os.environ/LANGFUSE_PUBLIC_KEY
  LANGFUSE_SECRET_KEY: os.environ/LANGFUSE_SECRET_KEY
  LANGFUSE_HOST: https://cloud.langfuse.com

11.2 LangSmith Integration

litellm_settings:
  success_callback: ["langsmith"]

environment_variables:
  LANGCHAIN_API_KEY: os.environ/LANGCHAIN_API_KEY
  LANGCHAIN_PROJECT: my-litellm-project

11.3 Custom Callbacks

# custom_callback.py
from litellm.integrations.custom_logger import CustomLogger

class MyCustomLogger(CustomLogger):
    async def async_log_success_event(self, kwargs, response_obj, start_time, end_time):
        print(f"Model: {kwargs.get('model')}")
        print(f"Cost: ${kwargs.get('response_cost', 0):.6f}")
        print(f"Latency: {(end_time - start_time).total_seconds():.2f}s")
        print(f"Tokens: {response_obj.usage.total_tokens}")

    async def async_log_failure_event(self, kwargs, response_obj, start_time, end_time):
        print(f"FAILED - Model: {kwargs.get('model')}")
        print(f"Error: {kwargs.get('exception')}")

11.4 Prometheus Metrics

litellm_settings:
  success_callback: ["prometheus"]
  failure_callback: ["prometheus"]

Key metrics:

# Request counts
litellm_requests_total
litellm_requests_failed_total

# Latency
litellm_request_duration_seconds_bucket
litellm_request_duration_seconds_sum

# Cost
litellm_spend_total

# Tokens
litellm_tokens_total
litellm_input_tokens_total
litellm_output_tokens_total

12. Caching

12.1 Redis Caching

litellm_settings:
  cache: true
  cache_params:
    type: redis
    host: redis-host
    port: 6379
    password: os.environ/REDIS_PASSWORD
    ttl: 3600  # 1-hour cache

    # Which call types are cached (semantic caching is a separate cache type)
    supported_call_types:
      - acompletion
      - completion

12.2 In-Memory Caching

litellm_settings:
  cache: true
  cache_params:
    type: local
    ttl: 600  # 10-minute cache

12.3 Cache Control

# Use cache
response = client.chat.completions.create(
    model="gpt-4o",
    messages=[{"role": "user", "content": "What is 2+2?"}],
    extra_body={
        "metadata": {
            "cache": {
                "no-cache": False,  # Use cache
                "ttl": 3600         # TTL for this request
            }
        }
    }
)

# Skip cache
response = client.chat.completions.create(
    model="gpt-4o",
    messages=[{"role": "user", "content": "What is the latest news?"}],
    extra_body={
        "metadata": {
            "cache": {
                "no-cache": True  # Always get fresh response
            }
        }
    }
)
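Under the hood, a TTL cache keyed by the request is the core idea. A minimal sketch keyed by (model, messages), purely illustrative and not LiteLLM's implementation:

```python
import time

class TTLCache:
    """Tiny TTL cache: entries expire `ttl` seconds after being set."""
    def __init__(self):
        self._store = {}

    def get(self, key):
        entry = self._store.get(key)
        if entry is None:
            return None
        value, expires_at = entry
        if time.monotonic() > expires_at:   # expired: drop and report a miss
            del self._store[key]
            return None
        return value

    def set(self, key, value, ttl):
        self._store[key] = (value, time.monotonic() + ttl)

cache = TTLCache()
key = ("gpt-4o", "What is 2+2?")
cache.set(key, "4", ttl=3600)
print(cache.get(key))  # 4 (cache hit within TTL)
```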

13. Production Deployment

13.1 Kubernetes Deployment

# litellm-deployment.yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: litellm-proxy
  labels:
    app: litellm
spec:
  replicas: 3
  selector:
    matchLabels:
      app: litellm
  template:
    metadata:
      labels:
        app: litellm
    spec:
      containers:
        - name: litellm
          image: ghcr.io/berriai/litellm:main-latest
          ports:
            - containerPort: 4000
          args: ["--config", "/app/config.yaml", "--port", "4000"]
          volumeMounts:
            - name: config
              mountPath: /app/config.yaml
              subPath: config.yaml
          env:
            - name: OPENAI_API_KEY
              valueFrom:
                secretKeyRef:
                  name: litellm-secrets
                  key: openai-api-key
            - name: ANTHROPIC_API_KEY
              valueFrom:
                secretKeyRef:
                  name: litellm-secrets
                  key: anthropic-api-key
            - name: DATABASE_URL
              valueFrom:
                secretKeyRef:
                  name: litellm-secrets
                  key: database-url
          resources:
            requests:
              cpu: "500m"
              memory: "512Mi"
            limits:
              cpu: "2000m"
              memory: "2Gi"
          livenessProbe:
            httpGet:
              path: /health/liveliness
              port: 4000
            initialDelaySeconds: 30
            periodSeconds: 10
          readinessProbe:
            httpGet:
              path: /health/readiness
              port: 4000
            initialDelaySeconds: 10
            periodSeconds: 5
      volumes:
        - name: config
          configMap:
            name: litellm-config
---
apiVersion: v1
kind: Service
metadata:
  name: litellm-service
spec:
  selector:
    app: litellm
  ports:
    - port: 80
      targetPort: 4000
  type: ClusterIP
---
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: litellm-hpa
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: litellm-proxy
  minReplicas: 2
  maxReplicas: 10
  metrics:
    - type: Resource
      resource:
        name: cpu
        target:
          type: Utilization
          averageUtilization: 70

13.2 Using Helm Chart

# Install with Helm
helm repo add litellm https://litellm.github.io/litellm-helm
helm repo update

helm install litellm litellm/litellm-helm \
  --set masterKey=sk-your-master-key \
  --set replicaCount=3 \
  --set database.useExisting=true \
  --set database.url=postgresql://user:pass@host:5432/litellm \
  --namespace litellm \
  --create-namespace

13.3 Production Checklist

Production Deployment Checklist
================================
[ ] PostgreSQL DB configured and connectivity verified
[ ] Master Key set to a strong random value
[ ] All API keys managed via env vars / secrets
[ ] HTTPS/TLS configured (Ingress or LoadBalancer)
[ ] 2+ replicas configured
[ ] HPA (Horizontal Pod Autoscaler) configured
[ ] Liveness/Readiness Probes configured
[ ] Prometheus + Grafana monitoring set up
[ ] Slack/PagerDuty alerting configured
[ ] Redis cache set up (optional)
[ ] Budget and rate limits configured
[ ] Guardrails configured (if needed)
[ ] Logging (Langfuse/LangSmith) set up
[ ] Backup and recovery strategy documented
[ ] Load testing completed

14. LiteLLM vs Alternatives

14.1 LiteLLM vs OpenRouter

Feature          | LiteLLM                    | OpenRouter
-----------------+----------------------------+----------------------
Hosting          | Self-hosted                | Cloud service
Data Privacy     | Full control               | Third-party servers
Cost             | Open source, free          | API margin added
Team Management  | Virtual keys/teams/budgets | Limited
Customization    | Fully customizable         | Limited
Setup Difficulty | Medium (Docker/K8s)        | Easy (API key only)
Local Models     | Supported (Ollama/vLLM)    | Not supported

14.2 LiteLLM vs Portkey

Feature       | LiteLLM                     | Portkey
--------------+-----------------------------+--------------------------
Open Source   | Yes (MIT)                   | Partial (Gateway only)
Proxy Server  | Included                    | Included
Virtual Keys  | Included                    | Included
AI Gateway    | Basic                       | Advanced (prompt mgmt)
Pricing       | Free                        | Premium plans available
Community     | Active (12K+ GitHub stars)  | Growing

14.3 When to Choose What

Decision Guide
================================

"I want to get started quickly"
  -> OpenRouter (sign up and use immediately)

"Data security is critical"
  -> LiteLLM (self-hosted, full control)

"I need team/org management"
  -> LiteLLM or Portkey

"I also use local models"
  -> LiteLLM (native Ollama/vLLM support)

"I need enterprise features"
  -> LiteLLM Enterprise or Portkey Enterprise

15. Real-World Example: Complete Production Config

Below is a complete config.yaml suitable for production use.

# production-config.yaml
model_list:
  # === Premium Tier ===
  - model_name: premium-chat
    litellm_params:
      model: openai/gpt-4o
      api_key: os.environ/OPENAI_API_KEY
      rpm: 200
      tpm: 400000
    model_info:
      access_groups: ["premium"]

  - model_name: premium-chat
    litellm_params:
      model: anthropic/claude-sonnet-4-20250514
      api_key: os.environ/ANTHROPIC_API_KEY
      rpm: 150
    model_info:
      access_groups: ["premium"]

  # === Basic Tier ===
  - model_name: basic-chat
    litellm_params:
      model: openai/gpt-4o-mini
      api_key: os.environ/OPENAI_API_KEY
      rpm: 500
    model_info:
      access_groups: ["basic", "premium"]

  - model_name: basic-chat
    litellm_params:
      model: gemini/gemini-2.0-flash
      api_key: os.environ/GEMINI_API_KEY
      rpm: 1000
    model_info:
      access_groups: ["basic", "premium"]

  # === Coding Dedicated ===
  - model_name: code-model
    litellm_params:
      model: anthropic/claude-sonnet-4-20250514
      api_key: os.environ/ANTHROPIC_API_KEY
    model_info:
      access_groups: ["premium"]

  # === Embedding ===
  - model_name: embedding
    litellm_params:
      model: openai/text-embedding-3-small
      api_key: os.environ/OPENAI_API_KEY

# === Router Settings ===
router_settings:
  routing_strategy: latency-based-routing
  allowed_fails: 3
  cooldown_time: 60
  num_retries: 3
  timeout: 120
  retry_after: 5

# === LiteLLM Settings ===
litellm_settings:
  drop_params: true
  set_verbose: false
  num_retries: 3
  request_timeout: 120

  # Fallbacks
  fallbacks:
    - model_name: premium-chat
      fallback_models: ["basic-chat"]
    - model_name: basic-chat
      fallback_models: ["premium-chat"]

  context_window_fallbacks:
    - model_name: basic-chat
      fallback_models: ["premium-chat"]

  # Caching
  cache: true
  cache_params:
    type: redis
    host: os.environ/REDIS_HOST
    port: 6379
    password: os.environ/REDIS_PASSWORD
    ttl: 3600

  # Callbacks
  success_callback: ["langfuse", "prometheus"]
  failure_callback: ["langfuse", "prometheus"]

  # Guardrails
  guardrails:
    - guardrail_name: pii-masking
      litellm_params:
        guardrail: presidio
        mode: during_call

# === General Settings ===
general_settings:
  master_key: os.environ/LITELLM_MASTER_KEY
  database_url: os.environ/DATABASE_URL
  max_budget: 10000
  budget_duration: 30d  # Reset every 30 days

  alerting:
    - slack
  alerting_threshold: 300  # Seconds before slow-request alerts fire

  global_max_parallel_requests: 200

# === Environment Variables ===
environment_variables:
  LANGFUSE_PUBLIC_KEY: os.environ/LANGFUSE_PUBLIC_KEY
  LANGFUSE_SECRET_KEY: os.environ/LANGFUSE_SECRET_KEY
  SLACK_WEBHOOK_URL: os.environ/SLACK_WEBHOOK_URL

Quiz

Test your understanding of LiteLLM concepts.

Q1. What are the two modes of using LiteLLM, and when is each appropriate?

A1. LiteLLM has SDK Mode and Proxy Mode.

  • SDK Mode: Import the Python package directly. Best for personal projects and prototypes.
  • Proxy Mode: Run as a standalone HTTP server. Best for teams/organizations, multi-language clients, virtual key management, budget management, and production deployments.

The key difference is that Proxy Mode provides enterprise features like virtual keys, team management, budget management, and logging.

Q2. What is Model Fallback, and what types of fallback does LiteLLM support?

A2. Model fallback is a mechanism that automatically switches to an alternative model when the primary model fails.

LiteLLM supports three types of fallback:

  • General fallbacks: Switch to alternative models on API failures (429, 500 errors)
  • Content policy fallbacks: Switch when content policy is violated
  • Context window fallbacks: Switch to a model with a larger context window when input exceeds limits
Q3. How does cost-based routing work in LiteLLM?

A3. When you register multiple provider models under the same model_name and set routing_strategy to cost-based-routing, LiteLLM compares the input/output token prices for each model and automatically routes to the cheapest one.

For example, if GPT-4o-mini, Claude Haiku, and Gemini Flash are all registered under the same name, the model with the lowest price per token is automatically selected. This optimizes cost while maintaining quality.

Q4. What are Virtual Keys and what are their main capabilities?

A4. Virtual Keys are API keys generated by the LiteLLM proxy. They avoid exposing actual LLM provider API keys while providing extensive control features.

Key capabilities:

  • Budget limits: Set maximum budget per key (max_budget)
  • Model access control: Restrict which models a key can access
  • Rate limiting: Set RPM, TPM, and concurrent request limits
  • Team association: Assign keys to teams for team-level management
  • Duration control: Set key expiration dates
  • Usage tracking: Track cost, tokens, and request counts per key
Q5. What are the top 5 must-consider items when deploying LiteLLM to production?

A5. Critical production deployment considerations:

  1. Database: PostgreSQL connection required (stores virtual keys, cost tracking, team management data)
  2. Security: Strong random Master Key, all API keys in secrets, HTTPS/TLS enabled
  3. High Availability: 2+ replicas, HPA configured, Liveness/Readiness Probes set up
  4. Monitoring: Prometheus metrics collection, Grafana dashboards, Slack/PagerDuty alerts configured
  5. Cost Control: Global budget, per-team/per-key budgets, soft budget alerts, rate limits configured

Additionally recommended: Redis caching, logging (Langfuse), Guardrails, and backup strategy.


References

  1. LiteLLM Official Documentation - https://docs.litellm.ai/
  2. LiteLLM GitHub Repository - https://github.com/BerriAI/litellm
  3. LiteLLM Proxy Server Quick Start - https://docs.litellm.ai/docs/proxy/quick_start
  4. LiteLLM Supported Providers - https://docs.litellm.ai/docs/providers
  5. LiteLLM Virtual Keys Guide - https://docs.litellm.ai/docs/proxy/virtual_keys
  6. LiteLLM Routing Strategies - https://docs.litellm.ai/docs/routing
  7. LiteLLM Budget Management - https://docs.litellm.ai/docs/proxy/users
  8. LiteLLM Guardrails - https://docs.litellm.ai/docs/proxy/guardrails
  9. LiteLLM Kubernetes Deployment - https://docs.litellm.ai/docs/proxy/deploy
  10. OpenRouter Official Site - https://openrouter.ai/
  11. Portkey AI Official Site - https://portkey.ai/
  12. Langfuse LLM Observability - https://langfuse.com/
  13. LiteLLM vs Alternatives - https://docs.litellm.ai/docs/proxy/enterprise
  14. Prometheus Monitoring - https://prometheus.io/