[Architecture] Complete Guide to LiteLLM: Unified Serving and Cost Management for 100+ LLMs
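Overview
As more services build on large language models (LLMs), managing models from multiple providers efficiently has become important. OpenAI, Anthropic, Azure OpenAI, AWS Bedrock, and others each ship their own SDK and API format, so code complexity grows quickly once an application talks to more than one of them.
LiteLLM is an open-source tool that addresses this by exposing 100+ LLMs through a single OpenAI SDK-compatible interface. This guide covers the LiteLLM SDK, Proxy Server setup, cost tracking, and production deployment.
1. What is LiteLLM
1.1 Core Value
LiteLLM is an open-source project developed by BerriAI that provides two core components.
1. Python SDK: call 100+ LLMs through a unified interface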
litellm.completion()
|
+-- model="gpt-4o" --> OpenAI API
+-- model="claude-sonnet-4-20250514" --> Anthropic API
+-- model="azure/gpt-4o" --> Azure OpenAI
+-- model="bedrock/claude-3" --> AWS Bedrock
+-- model="vertex_ai/gemini" --> Google Vertex AI
+-- model="ollama/llama3" --> Local Ollama
2. Proxy Server (AI Gateway): an OpenAI-compatible REST API server
Any OpenAI Client --> LiteLLM Proxy --> Multiple LLM Providers
(Rate Limiting, Cost Tracking, Load Balancing)
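The `provider/model` naming shown in the SDK diagram above can be illustrated with a small sketch. This is not LiteLLM's actual resolution logic; `split_model` and the `BARE_NAME_PREFIXES` table are hypothetical names used only to show how a prefix can select a provider.

```python
# Hypothetical lookup for bare model names that carry no "provider/" prefix.
BARE_NAME_PREFIXES = {"gpt-": "openai", "claude-": "anthropic"}


def split_model(model: str) -> tuple[str, str]:
    """Return (provider, model_name) for a model string."""
    provider, sep, name = model.partition("/")
    if sep:
        # An explicit prefix such as "azure/" or "bedrock/" names the provider.
        return provider, name
    for prefix, inferred in BARE_NAME_PREFIXES.items():
        if model.startswith(prefix):
            return inferred, model
    raise ValueError(f"cannot infer provider for {model!r}")
```

Only the first `/` is treated as the separator, so a Bedrock model ID that contains dots and colons stays intact as the model name.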
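1.2 Why LiteLLM
| Problem | How LiteLLM addresses it |
|---|---|
| A different SDK per provider | Unified completion() function |
| API keys scattered across services | Centralized management in the Proxy |
| Costs are hard to track | Automatic cost calculation and tracking |
| Provider outages | Automatic fallback support |
| Rate limit management | Built-in rate limiting |
| Cost of switching models | Swap models without code changes |
1.3 Supported Providers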
Commercial Providers:
- OpenAI (GPT-4o, GPT-4o-mini, o1, o3, etc.)
- Anthropic (Claude Sonnet, Claude Haiku, Claude Opus)
- Azure OpenAI
- AWS Bedrock (Claude, Titan, Llama)
- Google Vertex AI (Gemini)
- Google AI Studio
- Cohere (Command R+)
- Mistral AI
- Together AI
- Groq
- Fireworks AI
- Perplexity
- DeepSeek
Self-Hosted / Local:
- Ollama
- vLLM
- Hugging Face TGI
- NVIDIA NIM
- OpenAI-compatible endpoints
2. LiteLLM SDK Usage
2.1 Installation
pip install litellm
2.2 Basic Usage: completion()
import litellm
# OpenAI
response = litellm.completion(
model="gpt-4o",
messages=[
{"role": "system", "content": "You are a helpful assistant."},
{"role": "user", "content": "What is Kubernetes?"},
],
temperature=0.7,
max_tokens=1000,
)
print(response.choices[0].message.content)
# Anthropic Claude
response = litellm.completion(
model="claude-sonnet-4-20250514",
messages=[
{"role": "user", "content": "Explain microservices architecture."},
],
max_tokens=1000,
)
print(response.choices[0].message.content)
# Azure OpenAI
response = litellm.completion(
model="azure/gpt-4o-deployment",
messages=[
{"role": "user", "content": "Hello!"},
],
api_base="https://my-resource.openai.azure.com",
api_version="2024-02-15-preview",
api_key="your-azure-key",
)
# AWS Bedrock
response = litellm.completion(
model="bedrock/anthropic.claude-3-sonnet-20240229-v1:0",
messages=[
{"role": "user", "content": "Summarize this text."},
],
)
# Ollama (local)
response = litellm.completion(
model="ollama/llama3",
messages=[
{"role": "user", "content": "Write a Python function."},
],
api_base="http://localhost:11434",
)
2.3 Streaming
import litellm
# Synchronous streaming
response = litellm.completion(
    model="gpt-4o",
    messages=[{"role": "user", "content": "Write a long essay."}],
    stream=True,
)
for chunk in response:
    content = chunk.choices[0].delta.content
    if content:
        print(content, end="", flush=True)
2.4 Async Calls
import asyncio
import litellm
async def main():
    # Single async call
    response = await litellm.acompletion(
        model="gpt-4o",
        messages=[{"role": "user", "content": "Hello!"}],
    )
    print(response.choices[0].message.content)
    # Async streaming
    response = await litellm.acompletion(
        model="claude-sonnet-4-20250514",
        messages=[{"role": "user", "content": "Explain Docker."}],
        stream=True,
    )
    async for chunk in response:
        content = chunk.choices[0].delta.content
        if content:
            print(content, end="", flush=True)
    # Parallel calls
    tasks = [
        litellm.acompletion(
            model="gpt-4o-mini",
            messages=[{"role": "user", "content": f"Question {i}"}],
        )
        for i in range(5)
    ]
    responses = await asyncio.gather(*tasks)
    for resp in responses:
        print(resp.choices[0].message.content[:50])
asyncio.run(main())
2.5 Function Calling (Tool Use)
import litellm
import json
tools = [
{
"type": "function",
"function": {
"name": "get_weather",
"description": "Get the current weather for a location",
"parameters": {
"type": "object",
"properties": {
"location": {
"type": "string",
"description": "City name",
},
"unit": {
"type": "string",
"enum": ["celsius", "fahrenheit"],
},
},
"required": ["location"],
},
},
}
]
# Same interface for both OpenAI and Claude
for model in ["gpt-4o", "claude-sonnet-4-20250514"]:
    response = litellm.completion(
        model=model,
        messages=[
            {"role": "user", "content": "What's the weather in Seoul?"}
        ],
        tools=tools,
        tool_choice="auto",
    )
    if response.choices[0].message.tool_calls:
        tool_call = response.choices[0].message.tool_calls[0]
        print(f"Model: {model}")
        print(f"Function: {tool_call.function.name}")
        # Arguments arrive as a JSON string, so decode them before use
        print(f"Arguments: {json.loads(tool_call.function.arguments)}")
2.6 Embedding
import litellm
# OpenAI Embedding
response = litellm.embedding(
model="text-embedding-3-small",
input=["Hello world", "How are you?"],
)
print(f"Embedding dimension: {len(response.data[0]['embedding'])}")
# Cohere Embedding
response = litellm.embedding(
model="cohere/embed-english-v3.0",
input=["Search query text"],
input_type="search_query",
)
# Bedrock Embedding
response = litellm.embedding(
model="bedrock/amazon.titan-embed-text-v2:0",
input=["Document text for embedding"],
)
2.7 Image/Vision Models
import litellm
# GPT-4o Vision
response = litellm.completion(
model="gpt-4o",
messages=[
{
"role": "user",
"content": [
{"type": "text", "text": "What is in this image?"},
{
"type": "image_url",
"image_url": {
"url": "https://example.com/image.png",
},
},
],
}
],
)
# Claude Vision
response = litellm.completion(
model="claude-sonnet-4-20250514",
messages=[
{
"role": "user",
"content": [
{"type": "text", "text": "Describe this architecture diagram."},
{
"type": "image_url",
"image_url": {
"url": "data:image/png;base64,iVBORw0KGgo...",
},
},
],
}
],
)
3. LiteLLM Proxy Server (AI Gateway)
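3.1 What is the Proxy
LiteLLM Proxy is a self-hostable, OpenAI-compatible API gateway. Any client that already uses the OpenAI SDK can connect to it by pointing its base URL and API key at the Proxy, with no other code changes.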
+-------------------+
| Your Application |
| (OpenAI SDK) |
+--------+----------+
|
v
+--------+----------+
| LiteLLM Proxy |
| - Rate Limiting |
| - Cost Tracking |
| - Load Balancing |
| - Fallback |
| - Key Management |
+--------+----------+
|
+----+----+----+----+
| | | | |
v v v v v
OpenAI Azure Anthropic Bedrock Ollama
3.2 Installation and Running
# Install with pip
pip install 'litellm[proxy]'
# Basic run
litellm --model gpt-4o --port 4000
# Run with a config file
litellm --config config.yaml --port 4000
# Run with Docker
docker run -d \
--name litellm-proxy \
-p 4000:4000 \
-v ./config.yaml:/app/config.yaml \
-e OPENAI_API_KEY=sk-xxx \
-e ANTHROPIC_API_KEY=sk-ant-xxx \
ghcr.io/berriai/litellm:main-latest \
--config /app/config.yaml
3.3 config.yaml Configuration
# config.yaml
model_list:
  # OpenAI model
  - model_name: gpt-4o
    litellm_params:
      model: openai/gpt-4o
      api_key: os.environ/OPENAI_API_KEY
      rpm: 500      # Rate limit: requests per minute
      tpm: 100000   # Rate limit: tokens per minute
  # Claude model
  - model_name: claude-sonnet
    litellm_params:
      model: anthropic/claude-sonnet-4-20250514
      api_key: os.environ/ANTHROPIC_API_KEY
      rpm: 200
      tpm: 80000
  # Azure OpenAI (shares the gpt-4o model_name, so it is load-balanced with the OpenAI deployment above)
  - model_name: gpt-4o
    litellm_params:
      model: azure/gpt-4o-deployment
      api_base: https://my-resource.openai.azure.com
      api_version: '2024-02-15-preview'
      api_key: os.environ/AZURE_API_KEY
      rpm: 300
  # AWS Bedrock Claude
  - model_name: bedrock-claude
    litellm_params:
      model: bedrock/anthropic.claude-3-sonnet-20240229-v1:0
      aws_access_key_id: os.environ/AWS_ACCESS_KEY_ID
      aws_secret_access_key: os.environ/AWS_SECRET_ACCESS_KEY
      aws_region_name: us-east-1
  # Local Ollama
  - model_name: local-llama
    litellm_params:
      model: ollama/llama3
      api_base: http://ollama:11434
# Router settings
router_settings:
  routing_strategy: 'latency-based-routing'
  num_retries: 3
  timeout: 60
  allowed_fails: 2
  cooldown_time: 30
# General settings
general_settings:
  master_key: sk-master-key-1234
  database_url: os.environ/DATABASE_URL
  store_model_in_db: true
litellm_settings:
  drop_params: true
  set_verbose: false
  cache: true
  cache_params:
    type: redis
    host: redis
    port: 6379
3.4 Model Routing and Load Balancing
# Registering several deployments under the same model_name load-balances across them automatically
model_list:
  # gpt-4o group: 3 deployments
  - model_name: gpt-4o
    litellm_params:
      model: openai/gpt-4o
      api_key: os.environ/OPENAI_API_KEY_1
  - model_name: gpt-4o
    litellm_params:
      model: azure/gpt-4o-east
      api_base: https://east.openai.azure.com
      api_key: os.environ/AZURE_KEY_EAST
  - model_name: gpt-4o
    litellm_params:
      model: azure/gpt-4o-west
      api_base: https://west.openai.azure.com
      api_key: os.environ/AZURE_KEY_WEST
router_settings:
  # Routing strategy
  routing_strategy: 'latency-based-routing'
  # Options:
  #   simple-shuffle: random selection
  #   least-busy: fewest in-flight requests
  #   usage-based-routing: based on TPM/RPM usage
  #   latency-based-routing: based on response time (recommended)
Routing strategy comparison:
| Strategy | Description | Best for |
|---|---|---|
| simple-shuffle | Random distribution | Deployments with similar performance |
| least-busy | Number of in-flight requests | Request processing times that vary widely |
| usage-based-routing | RPM/TPM usage | Traffic that runs close to rate limits |
| latency-based-routing | Response time | Workloads where latency matters most |
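To make the latency-based strategy concrete, here is a minimal sketch of the idea: keep a rolling window of recent latencies per deployment and pick the one with the lowest average. It is an illustration only, not LiteLLM's router; the `LatencyRouter` class and its window size are assumptions.

```python
from collections import deque
from statistics import mean


class LatencyRouter:
    """Pick the deployment with the lowest recent average latency."""

    def __init__(self, deployments, window: int = 10):
        self.deployments = list(deployments)
        # Rolling window of the most recent latencies per deployment.
        self.samples = {d: deque(maxlen=window) for d in self.deployments}

    def record(self, deployment: str, seconds: float) -> None:
        self.samples[deployment].append(seconds)

    def pick(self) -> str:
        # Try unsampled deployments first so every one gets measured.
        for d in self.deployments:
            if not self.samples[d]:
                return d
        return min(self.deployments, key=lambda d: mean(self.samples[d]))
```

A rolling window lets a deployment that was slow a while ago win traffic back once its recent latencies improve.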
3.5 Fallback Configuration
model_list:
  - model_name: primary-model
    litellm_params:
      model: openai/gpt-4o
      api_key: os.environ/OPENAI_API_KEY
  - model_name: fallback-model
    litellm_params:
      model: anthropic/claude-sonnet-4-20250514
      api_key: os.environ/ANTHROPIC_API_KEY
router_settings:
  num_retries: 2
  timeout: 30
  fallbacks: [{ 'primary-model': ['fallback-model'] }]
  # Retry counts per error type
  retry_policy:
    RateLimitErrorRetries: 3               # Retry 3 times on 429 errors
    ContentPolicyViolationErrorRetries: 0  # Do not retry content policy violations
    AuthenticationErrorRetries: 0          # Do not retry authentication errors
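The interaction between per-error retries and fallbacks can be sketched as a loop: retry the current model group up to the count allowed for that error type, then move on to the next group. This is a simplified model of the behaviour described above, not LiteLLM's code; the exception classes and `call_with_fallback` are hypothetical.

```python
class RateLimited(Exception):
    """Stand-in for a 429 response."""


class AuthFailed(Exception):
    """Stand-in for an authentication failure."""


# Assumed retry budget per error type, mirroring the retry policy idea.
DEFAULT_RETRIES = {RateLimited: 3, AuthFailed: 0}


def call_with_fallback(groups, call, retries=None):
    """Try each model group in order and return (group, result)."""
    retries = DEFAULT_RETRIES if retries is None else retries
    last_error = None
    for group in groups:
        attempts = 0
        while True:
            try:
                return group, call(group)
            except Exception as exc:
                last_error = exc
                if attempts >= retries.get(type(exc), 0):
                    break  # retry budget used up: fall back to the next group
                attempts += 1
    if last_error is None:
        raise ValueError("no model groups given")
    raise last_error
```

With a retry budget of 3, a rate-limited primary is called four times in total (one initial attempt plus three retries) before the fallback group is tried.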
3.6 API Key Management (Virtual Keys)
# Create a virtual key with the master key
curl -X POST http://localhost:4000/key/generate \
  -H "Authorization: Bearer sk-master-key-1234" \
  -H "Content-Type: application/json" \
  -d '{
    "models": ["gpt-4o", "claude-sonnet"],
    "max_budget": 100.0,
    "budget_duration": "30d",
    "metadata": {
      "team": "backend",
      "user": "developer-1"
    },
    "tpm_limit": 50000,
    "rpm_limit": 100
  }'
Response:
{
  "key": "sk-generated-key-abc123",
  "expires": "2026-04-20T00:00:00Z",
  "max_budget": 100.0,
  "models": ["gpt-4o", "claude-sonnet"]
}
# Call the API with the generated key
from openai import OpenAI
client = OpenAI(
api_key="sk-generated-key-abc123",
base_url="http://localhost:4000",
)
response = client.chat.completions.create(
model="gpt-4o",
messages=[{"role": "user", "content": "Hello!"}],
)
3.7 Rate Limiting Configuration
# Rate limiting in config.yaml
model_list:
  - model_name: gpt-4o
    litellm_params:
      model: openai/gpt-4o
      api_key: os.environ/OPENAI_API_KEY
      rpm: 500      # RPM at the model deployment level
      tpm: 100000   # TPM at the model deployment level
general_settings:
  master_key: sk-master-key-1234
  database_url: os.environ/DATABASE_URL
# Per-key rate limiting
curl -X POST http://localhost:4000/key/generate \
  -H "Authorization: Bearer sk-master-key-1234" \
  -H "Content-Type: application/json" \
  -d '{
    "rpm_limit": 50,
    "tpm_limit": 20000,
    "max_budget": 10.0,
    "budget_duration": "1d"
  }'
# Per-team rate limiting
curl -X POST http://localhost:4000/team/new \
  -H "Authorization: Bearer sk-master-key-1234" \
  -H "Content-Type: application/json" \
  -d '{
    "team_alias": "backend-team",
    "rpm_limit": 200,
    "tpm_limit": 80000,
    "max_budget": 500.0,
    "budget_duration": "30d"
  }'
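Limits at the deployment, key, and team level stack: a request goes through only when every layer still has capacity. The sketch below illustrates that with a sliding-window RPM counter. It is an assumption-laden illustration (the Proxy's real accounting may differ); `RpmLimiter` and `allow_request` are hypothetical names, and time is passed in explicitly to keep the example deterministic.

```python
from collections import deque


class RpmLimiter:
    """Sliding-window requests-per-minute counter."""

    def __init__(self, rpm_limit: int, window: float = 60.0):
        self.rpm_limit = rpm_limit
        self.window = window
        self.stamps = deque()

    def has_capacity(self, now: float) -> bool:
        # Drop timestamps that have aged out of the window.
        while self.stamps and now - self.stamps[0] >= self.window:
            self.stamps.popleft()
        return len(self.stamps) < self.rpm_limit

    def record(self, now: float) -> None:
        self.stamps.append(now)


def allow_request(layers, now: float) -> bool:
    """Admit a request only if every layer (key, team, ...) has capacity."""
    if not all(layer.has_capacity(now) for layer in layers):
        return False
    for layer in layers:
        layer.record(now)
    return True
```

Checking all layers before recording anything means a request rejected by the key limit does not consume the team's quota.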
3.8 Budget Management
# Set a per-user budget
curl -X POST http://localhost:4000/user/new \
  -H "Authorization: Bearer sk-master-key-1234" \
  -H "Content-Type: application/json" \
  -d '{
    "user_id": "user-123",
    "max_budget": 50.0,
    "budget_duration": "30d",
    "models": ["gpt-4o-mini", "claude-sonnet"]
  }'
# Check budget usage (the URL is quoted so the shell does not interpret "?")
curl "http://localhost:4000/user/info?user_id=user-123" \
  -H "Authorization: Bearer sk-master-key-1234"
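How a budget with a reset period behaves can be sketched in a few lines: spend accumulates until the period ends, then resets. This is a simplified model under stated assumptions, not the Proxy's implementation; duration strings of the form `30d` or `12h`, and the names `parse_duration` and `Budget`, are assumptions made for the example.

```python
from datetime import datetime, timedelta

# Assumed duration suffixes: seconds, minutes, hours, days.
UNITS = {"s": "seconds", "m": "minutes", "h": "hours", "d": "days"}


def parse_duration(text: str) -> timedelta:
    """Turn a string such as "30d" into a timedelta."""
    return timedelta(**{UNITS[text[-1]]: int(text[:-1])})


class Budget:
    def __init__(self, max_budget: float, duration: str, start: datetime):
        self.max_budget = max_budget
        self.period = parse_duration(duration)
        self.reset_at = start + self.period
        self.spend = 0.0

    def charge(self, cost: float, now: datetime) -> bool:
        """Record the cost if it fits in the budget; return False otherwise."""
        while now >= self.reset_at:
            # The period has elapsed: start a fresh budget window.
            self.spend = 0.0
            self.reset_at += self.period
        if self.spend + cost > self.max_budget:
            return False
        self.spend += cost
        return True
```

For example, a 50.0 budget over `30d` accepts a 30.0 charge, rejects a further 25.0 charge in the same period, and accepts it again after the reset.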
3.9 Caching
# config.yaml
litellm_settings:
  cache: true
  cache_params:
    type: redis
    host: redis
    port: 6379
    ttl: 3600  # Cache for 1 hour
    # Semantic caching (cache hits for similar questions)
    # supported_call_types: ["acompletion", "completion"]
    # similarity_threshold: 0.8
# Controlling the cache from the client
from openai import OpenAI
client = OpenAI(
api_key="sk-key",
base_url="http://localhost:4000",
)
# Use the cache (default)
response = client.chat.completions.create(
model="gpt-4o",
messages=[{"role": "user", "content": "What is Python?"}],
)
# Skip the cache
response = client.chat.completions.create(
model="gpt-4o",
messages=[{"role": "user", "content": "What is Python?"}],
extra_body={"cache": {"no-cache": True}},
)
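Exact-match response caching boils down to two pieces: a deterministic key derived from the request, and an expiry time per entry. The sketch below shows both with an in-memory dictionary standing in for Redis. It is an illustration only; the key format and the `cache_key` and `TtlCache` names are assumptions, not LiteLLM's scheme.

```python
import hashlib
import json


def cache_key(model: str, messages: list, **params) -> str:
    """Derive a stable key from the parts of a request that affect the answer."""
    payload = json.dumps(
        {"model": model, "messages": messages, "params": params},
        sort_keys=True,
    )
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()


class TtlCache:
    """In-memory stand-in for a cache with a per-entry time to live."""

    def __init__(self, ttl: float):
        self.ttl = ttl
        self.store = {}

    def set(self, key: str, value, now: float) -> None:
        self.store[key] = (value, now + self.ttl)

    def get(self, key: str, now: float):
        entry = self.store.get(key)
        if entry is None:
            return None
        value, expires_at = entry
        if now >= expires_at:
            del self.store[key]  # expired: treat as a miss
            return None
        return value
```

Because the key covers the model and the full message list, changing either produces a cache miss, which is why semantic caching needs a similarity search instead of a hash.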
4. Cost Tracking
4.1 Automatic Cost Calculation
LiteLLM automatically calculates the cost of each request.
import litellm
from litellm import completion_cost
response = litellm.completion(
    model="gpt-4o",
    messages=[{"role": "user", "content": "Hello!"}],
)
# Token usage
print(f"Input tokens: {response.usage.prompt_tokens}")
print(f"Output tokens: {response.usage.completion_tokens}")
# Cost as calculated by LiteLLM
cost = completion_cost(completion_response=response)
print(f"Cost: ${cost:.6f}")
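The arithmetic behind a per-request cost is token counts multiplied by per-token prices, with input and output priced separately. The sketch below works through that formula. The price table is hypothetical and does not reflect any real model's pricing; `estimate_cost` is an illustrative helper, not a LiteLLM function.

```python
# Hypothetical prices in USD per one million tokens (not real pricing).
PRICES_PER_MILLION = {
    "example-model": {"input": 2.50, "output": 10.00},
}


def estimate_cost(model: str, prompt_tokens: int, completion_tokens: int) -> float:
    """Cost = input tokens * input price + output tokens * output price."""
    price = PRICES_PER_MILLION[model]
    total = prompt_tokens * price["input"] + completion_tokens * price["output"]
    return total / 1_000_000
```

With these assumed prices, 1,000 input tokens and 500 output tokens cost (1000 × 2.50 + 500 × 10.00) / 1,000,000 = 0.0075 USD.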
4.2 Querying Spend from the Proxy
# Total spend
curl http://localhost:4000/global/spend \
  -H "Authorization: Bearer sk-master-key-1234"
# Spend per key
curl "http://localhost:4000/global/spend?api_key=sk-key-abc" \
  -H "Authorization: Bearer sk-master-key-1234"
# Spend per model
curl "http://localhost:4000/global/spend?model=gpt-4o" \
  -H "Authorization: Bearer sk-master-key-1234"
# Spend per team
curl "http://localhost:4000/team/info?team_id=team-backend" \
  -H "Authorization: Bearer sk-master-key-1234"
# Spend over a date range
curl "http://localhost:4000/global/spend/logs?start_date=2026-03-01&end_date=2026-03-20" \
  -H "Authorization: Bearer sk-master-key-1234"
4.3 Spend Alerts
# config.yaml
general_settings:
  alerting:
    - slack
  alerting_threshold: 300  # Alert when a request gets no response within 300 seconds
  alert_types:
    - budget_alerts    # Budget exceeded
    - spend_reports    # Weekly/monthly spend reports
    - failed_tracking  # Tracking of failed requests
environment_variables:
  SLACK_WEBHOOK_URL: os.environ/SLACK_WEBHOOK_URL
5. Production Deployment
5.1 Docker Compose
# docker-compose.yml
version: '3.8'
services:
  litellm:
    image: ghcr.io/berriai/litellm:main-latest
    container_name: litellm-proxy
    ports:
      - '4000:4000'
    volumes:
      - ./config.yaml:/app/config.yaml
    environment:
      - OPENAI_API_KEY=sk-xxx
      - ANTHROPIC_API_KEY=sk-ant-xxx
      - AZURE_API_KEY=xxx
      - AWS_ACCESS_KEY_ID=xxx
      - AWS_SECRET_ACCESS_KEY=xxx
      - DATABASE_URL=postgresql://litellm:password@postgres:5432/litellm
      - REDIS_HOST=redis
      - REDIS_PORT=6379
    command: --config /app/config.yaml --port 4000
    depends_on:
      postgres:
        condition: service_healthy
      redis:
        condition: service_started
    healthcheck:
      # Liveness probe; unlike /health it needs no API key
      test: ['CMD', 'curl', '-f', 'http://localhost:4000/health/liveliness']
      interval: 30s
      timeout: 10s
      retries: 3
  postgres:
    image: postgres:16-alpine
    container_name: litellm-postgres
    environment:
      POSTGRES_DB: litellm
      POSTGRES_USER: litellm
      POSTGRES_PASSWORD: password
    volumes:
      - postgres_data:/var/lib/postgresql/data
    healthcheck:
      test: ['CMD-SHELL', 'pg_isready -U litellm']
      interval: 5s
      timeout: 5s
      retries: 5
  redis:
    image: redis:7-alpine
    container_name: litellm-redis
    ports:
      - '6379:6379'
    volumes:
      - redis_data:/data
volumes:
  postgres_data:
  redis_data:
5.2 Kubernetes Helm Chart
# Add the Helm repository
helm repo add litellm https://berriai.github.io/litellm/
helm repo update
# Install
helm install litellm litellm/litellm-helm \
--namespace litellm \
--create-namespace \
--values values.yaml
# values.yaml
replicaCount: 3
image:
  repository: ghcr.io/berriai/litellm
  tag: main-latest
service:
  type: ClusterIP
  port: 4000
ingress:
  enabled: true
  className: nginx
  hosts:
    - host: litellm.internal.company.com
      paths:
        - path: /
          pathType: Prefix
resources:
  requests:
    cpu: 500m
    memory: 512Mi
  limits:
    cpu: 2000m
    memory: 2Gi
env:
  - name: OPENAI_API_KEY
    valueFrom:
      secretKeyRef:
        name: litellm-secrets
        key: openai-api-key
  - name: ANTHROPIC_API_KEY
    valueFrom:
      secretKeyRef:
        name: litellm-secrets
        key: anthropic-api-key
  - name: DATABASE_URL
    valueFrom:
      secretKeyRef:
        name: litellm-secrets
        key: database-url
configMap:
  config.yaml: |
    model_list:
      - model_name: gpt-4o
        litellm_params:
          model: openai/gpt-4o
          api_key: os.environ/OPENAI_API_KEY
      - model_name: claude-sonnet
        litellm_params:
          model: anthropic/claude-sonnet-4-20250514
          api_key: os.environ/ANTHROPIC_API_KEY
    router_settings:
      routing_strategy: latency-based-routing
      num_retries: 3
postgresql:
  enabled: true
  auth:
    database: litellm
    username: litellm
redis:
  enabled: true
5.3 Health Check and Metrics
# Health Check
curl http://localhost:4000/health
# Prometheus Metrics
curl http://localhost:4000/metrics
# Prometheus scrape configuration
scrape_configs:
  - job_name: 'litellm'
    static_configs:
      - targets: ['litellm:4000']
    metrics_path: /metrics
    scrape_interval: 15s
Key Prometheus metrics:
litellm_requests_total: total number of requests
litellm_request_duration_seconds: request processing time
litellm_tokens_total: total token usage
litellm_spend_total: total spend
litellm_errors_total: number of errors
litellm_cache_hits_total: number of cache hits
5.4 Logging Integrations
# config.yaml - integrating an external logging service
litellm_settings:
  success_callback: ['langfuse']
  failure_callback: ['langfuse']
environment_variables:
  LANGFUSE_PUBLIC_KEY: os.environ/LANGFUSE_PUBLIC_KEY
  LANGFUSE_SECRET_KEY: os.environ/LANGFUSE_SECRET_KEY
  LANGFUSE_HOST: https://cloud.langfuse.com
Supported logging services:
| Service | Purpose |
|---|---|
| Langfuse | LLM observability, prompt management |
| Helicone | Request logging, cost analysis |
| Lunary | LLM monitoring |
| Custom Callback | Integration with an in-house logging system |
# Custom callback example
import litellm
from litellm import completion_cost
def log_to_database(**fields):
    # Placeholder sink: replace with a real database write
    print(fields)
def my_custom_callback(kwargs, completion_response, start_time, end_time):
    # Called after every successful request
    model = kwargs.get("model")
    messages = kwargs.get("messages")
    cost = completion_cost(completion_response=completion_response)
    # Custom logic (store in a database, send alerts, etc.)
    log_to_database(
        model=model,
        cost=cost,
        latency=(end_time - start_time).total_seconds(),
        tokens=completion_response.usage.total_tokens,
    )
litellm.success_callback = [my_custom_callback]
6. Practical Use Cases
6.1 Enterprise AI Gateway
Route every LLM call inside the company through a central LiteLLM Proxy.
+------------------+
| Frontend App |----+
+------------------+ |
| +----------------+
+------------------+ +---->| | +----------+
| Backend Service |----+ | LiteLLM Proxy |---->| OpenAI |
+------------------+ | | | +----------+
| | - Auth |
+------------------+ | | - Rate Limit | +----------+
| Data Pipeline |----+ | - Cost Track |---->| Anthropic|
+------------------+ | | - Audit Log | +----------+
| | |
+------------------+ | +--------+-------+ +----------+
| Internal Tools |----+ | | Azure |
+------------------+ v +----------+
+--------+-------+
| PostgreSQL |
| (spend logs) |
+----------------+
6.2 A/B Testing
from openai import OpenAI
import random
client = OpenAI(
    api_key="sk-proxy-key",
    base_url="http://litellm-proxy:4000",
)
def get_completion_with_ab_test(prompt: str, test_name: str):
    # 50/50 A/B test
    model = random.choice(["gpt-4o", "claude-sonnet"])
    response = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": prompt}],
        extra_body={
            "metadata": {
                "test_name": test_name,
                "variant": model,
            }
        },
    )
    return {
        "model": model,
        "content": response.choices[0].message.content,
        "tokens": response.usage.total_tokens,
    }
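A random choice per request means the same user can see different variants on consecutive calls. One common alternative, swapped in here as a sketch and not part of the original example, is to hash a stable identifier so that assignment is sticky. The `assign_variant` helper and the `user_id` parameter are hypothetical.

```python
import hashlib


def assign_variant(user_id: str, test_name: str,
                   variants=("gpt-4o", "claude-sonnet")) -> str:
    """Deterministically map a user to one variant of a named test."""
    digest = hashlib.sha256(f"{test_name}:{user_id}".encode("utf-8")).digest()
    # Use the first 8 bytes as an integer bucket index.
    bucket = int.from_bytes(digest[:8], "big") % len(variants)
    return variants[bucket]
```

Including the test name in the hash keeps assignments independent across tests, so a user who lands in one variant of one experiment is not systematically placed in the same slot of another.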
6.3 Cost-Optimized Routing
# Reuses the `client` created in section 6.2
def smart_route(prompt: str, complexity: str = "auto"):
    """Pick a model that matches the complexity of the prompt."""
    if complexity == "auto":
        # Simple heuristic based on word count and keywords
        word_count = len(prompt.split())
        if word_count < 50:
            complexity = "simple"
        elif any(kw in prompt.lower() for kw in
                 ["analyze", "compare", "complex", "detailed"]):
            complexity = "complex"
        else:
            complexity = "medium"
    model_map = {
        "simple": "gpt-4o-mini",    # Inexpensive model
        "medium": "claude-sonnet",  # Mid-range performance and price
        "complex": "gpt-4o",        # High-performance model
    }
    model = model_map[complexity]
    response = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": prompt}],
    )
    return response
6.4 Disaster Recovery (Automatic Failover)
# config.yaml - failover across multiple providers
model_list:
  # Primary: OpenAI
  - model_name: main-model
    litellm_params:
      model: openai/gpt-4o
      api_key: os.environ/OPENAI_API_KEY
  # Secondary: Azure OpenAI (different region)
  - model_name: main-model-fallback-1
    litellm_params:
      model: azure/gpt-4o
      api_base: https://eastus.openai.azure.com
      api_key: os.environ/AZURE_KEY
  # Tertiary: Anthropic Claude
  - model_name: main-model-fallback-2
    litellm_params:
      model: anthropic/claude-sonnet-4-20250514
      api_key: os.environ/ANTHROPIC_API_KEY
router_settings:
  fallbacks: [{ 'main-model': ['main-model-fallback-1', 'main-model-fallback-2'] }]
  num_retries: 2
  timeout: 30
  allowed_fails: 3
  cooldown_time: 60  # Cool down a failed model for 60 seconds
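The `allowed_fails` and `cooldown_time` settings describe a simple circuit-breaker idea: tolerate a few failures, then take the deployment out of rotation for a while. The sketch below models that behaviour in a simplified form. It is not the router's actual bookkeeping, and the `CooldownTracker` class is hypothetical.

```python
class CooldownTracker:
    """Take a deployment out of rotation after too many failures."""

    def __init__(self, allowed_fails: int = 3, cooldown_time: float = 60.0):
        self.allowed_fails = allowed_fails
        self.cooldown_time = cooldown_time
        self.fails = {}
        self.cooldown_until = {}

    def record_failure(self, deployment: str, now: float) -> None:
        count = self.fails.get(deployment, 0) + 1
        if count > self.allowed_fails:
            # Exceeded the allowance: start the cooldown and reset the counter.
            self.cooldown_until[deployment] = now + self.cooldown_time
            count = 0
        self.fails[deployment] = count

    def record_success(self, deployment: str) -> None:
        self.fails[deployment] = 0

    def available(self, deployments, now: float) -> list:
        return [d for d in deployments
                if now >= self.cooldown_until.get(d, 0.0)]
```

In this simplified model, the fourth failure with `allowed_fails=3` starts the cooldown, and the deployment becomes selectable again once `cooldown_time` seconds have passed.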
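7. Comparison: LiteLLM vs Alternatives
7.1 Tool Comparison
| Feature | LiteLLM | LangChain | OpenRouter | Portkey |
|---|---|---|---|---|
| Type | Gateway + SDK | Framework | Hosted API | Hosted Gateway |
| Hosting | Self-hosted | N/A (library) | Cloud | Cloud + Self |
| Model count | 100+ | Varies | 200+ | 250+ |
| Cost tracking | Built in | Must be implemented separately | Yes | Yes |
| Rate Limiting | Built in | None | Yes | Yes |
| Load Balancing | Built in | None | Yes | Yes |
| Fallback | Built in | Manual implementation | Yes | Yes |
| API key management | Virtual Keys | None | None | Yes |
| Pricing | Free (OSS) | Free (OSS) | Markup | Free + Enterprise |
| Data privacy | Full control | Full control | Passes through a third party | Passes through a third party |
7.2 When to Choose Which Tool
Choose LiteLLM when:
- Data privacy matters (finance, healthcare, government)
- You need to run on your own infrastructure
- You need cost tracking and rate limiting
- You already use several providers
Choose LangChain when:
- You are building complex LLM pipelines such as RAG or agents
- You need prompt chaining, memory management, and similar features
- (LiteLLM and LangChain can be used together)
Choose OpenRouter when:
- You want fast prototyping
- You do not want to manage infrastructure
- You want access to every model with a single API key
Choose Portkey when:
- You need an enterprise-grade management UI
- You need advanced features such as guardrails and A/B testing
- You prefer a managed service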
8. Practical Tips
8.1 Managing Environment Variables
# .env file (never commit this to Git)
OPENAI_API_KEY=sk-xxx
ANTHROPIC_API_KEY=sk-ant-xxx
AZURE_API_KEY=xxx
AWS_ACCESS_KEY_ID=xxx
AWS_SECRET_ACCESS_KEY=xxx
DATABASE_URL=postgresql://litellm:password@localhost:5432/litellm
LITELLM_MASTER_KEY=sk-master-key-change-me
8.2 Model Aliases
# config.yaml
model_list:
  - model_name: fast
    litellm_params:
      model: gpt-4o-mini
      api_key: os.environ/OPENAI_API_KEY
  - model_name: smart
    litellm_params:
      model: gpt-4o
      api_key: os.environ/OPENAI_API_KEY
  - model_name: creative
    litellm_params:
      model: anthropic/claude-sonnet-4-20250514
      api_key: os.environ/ANTHROPIC_API_KEY
# Call models by their meaningful alias
response = client.chat.completions.create(
model="fast", # gpt-4o-mini
messages=[{"role": "user", "content": "Quick question"}],
)
response = client.chat.completions.create(
model="smart", # gpt-4o
messages=[{"role": "user", "content": "Complex analysis"}],
)
8.3 Error Handling Pattern
import time
from openai import OpenAI, APIError, RateLimitError, APITimeoutError
client = OpenAI(
    api_key="sk-proxy-key",
    base_url="http://litellm-proxy:4000",
)
def safe_completion(messages, model="gpt-4o", max_retries=3):
    for attempt in range(max_retries):
        try:
            response = client.chat.completions.create(
                model=model,
                messages=messages,
                timeout=30,
            )
            return response
        except RateLimitError:
            # The Proxy manages rate limits, but the client should still back off
            if attempt == max_retries - 1:
                raise
            wait = 2 ** attempt
            print(f"Rate limited, waiting {wait}s...")
            time.sleep(wait)
        except APITimeoutError:
            print(f"Timeout on attempt {attempt + 1}")
            if attempt == max_retries - 1:
                raise
        except APIError as e:
            print(f"API error: {e}")
            raise
    return None
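9. Conclusion
LiteLLM serves as an AI gateway for environments that use multiple LLMs.
Key takeaways:
- Unified SDK: call 100+ LLMs through a single completion() function
- Proxy Server: centralized management through an OpenAI-compatible API gateway
- Cost control: automatic cost tracking, budget management, and alerts
- Reliability: built-in load balancing, fallback, and rate limiting
- Production: Docker/Kubernetes deployment, Prometheus monitoring, and external logging integrations
In enterprise environments that use several LLM providers in particular, adopting LiteLLM Proxy lets you handle API key management, cost tracking, and incident response consistently from one place.
References
- LiteLLM official documentation: https://docs.litellm.ai/
- LiteLLM GitHub: https://github.com/BerriAI/litellm
- LiteLLM Proxy configuration guide: https://docs.litellm.ai/docs/proxy/configs
- LiteLLM Docker deployment: https://docs.litellm.ai/docs/proxy/deploy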
[Architecture] Complete Guide to LiteLLM: Unified Serving of 100+ LLMs
Overview
As services leveraging LLMs (Large Language Models) proliferate, efficiently managing models from multiple providers has become critical. When each provider -- OpenAI, Anthropic, Azure OpenAI, AWS Bedrock -- uses different SDKs and API formats, code complexity increases rapidly.
LiteLLM is an open-source tool that solves this problem by providing an OpenAI SDK-compatible interface to unify 100+ LLMs into a single API. This post covers everything from LiteLLM SDK usage to Proxy Server setup, cost tracking, and production deployment.
1. What is LiteLLM
1.1 Core Value
LiteLLM is an open-source project by BerriAI that provides two core components.
1. Python SDK: Call 100+ LLMs through a unified interface
litellm.completion()
|
+-- model="gpt-4o" --> OpenAI API
+-- model="claude-sonnet-4-20250514" --> Anthropic API
+-- model="azure/gpt-4o" --> Azure OpenAI
+-- model="bedrock/claude-3" --> AWS Bedrock
+-- model="vertex_ai/gemini" --> Google Vertex AI
+-- model="ollama/llama3" --> Local Ollama
2. Proxy Server (AI Gateway): OpenAI-compatible REST API server
Any OpenAI Client --> LiteLLM Proxy --> Multiple LLM Providers
(Rate Limiting, Cost Tracking, Load Balancing)
1.2 Why LiteLLM
| Problem | LiteLLM Solution |
|---|---|
| Different SDK per provider | Unified completion() function |
| Scattered API key management | Centralized management via Proxy |
| Difficulty tracking costs | Automatic cost calculation and tracking |
| Provider outages | Automatic fallback support |
| Rate limit management | Built-in rate limiting |
| Model switching costs | Switch models without code changes |
1.3 Supported Providers
Commercial Providers:
- OpenAI (GPT-4o, GPT-4o-mini, o1, o3, etc.)
- Anthropic (Claude Sonnet, Claude Haiku, Claude Opus)
- Azure OpenAI
- AWS Bedrock (Claude, Titan, Llama)
- Google Vertex AI (Gemini)
- Google AI Studio
- Cohere (Command R+)
- Mistral AI
- Together AI
- Groq
- Fireworks AI
- Perplexity
- DeepSeek
Self-Hosted / Local:
- Ollama
- vLLM
- Hugging Face TGI
- NVIDIA NIM
- OpenAI-compatible endpoints
2. LiteLLM SDK Usage
2.1 Installation
pip install litellm
2.2 Basic Usage: completion()
import litellm
# OpenAI
response = litellm.completion(
model="gpt-4o",
messages=[
{"role": "system", "content": "You are a helpful assistant."},
{"role": "user", "content": "What is Kubernetes?"},
],
temperature=0.7,
max_tokens=1000,
)
print(response.choices[0].message.content)
# Anthropic Claude
response = litellm.completion(
model="claude-sonnet-4-20250514",
messages=[
{"role": "user", "content": "Explain microservices architecture."},
],
max_tokens=1000,
)
print(response.choices[0].message.content)
# Azure OpenAI
response = litellm.completion(
model="azure/gpt-4o-deployment",
messages=[
{"role": "user", "content": "Hello!"},
],
api_base="https://my-resource.openai.azure.com",
api_version="2024-02-15-preview",
api_key="your-azure-key",
)
# AWS Bedrock
response = litellm.completion(
model="bedrock/anthropic.claude-3-sonnet-20240229-v1:0",
messages=[
{"role": "user", "content": "Summarize this text."},
],
)
# Ollama (local)
response = litellm.completion(
model="ollama/llama3",
messages=[
{"role": "user", "content": "Write a Python function."},
],
api_base="http://localhost:11434",
)
2.3 Streaming
import litellm
# Synchronous streaming
response = litellm.completion(
model="gpt-4o",
messages=[{"role": "user", "content": "Write a long essay."}],
stream=True,
)
for chunk in response:
content = chunk.choices[0].delta.content
if content:
print(content, end="", flush=True)
2.4 Async Calls
import asyncio
import litellm
async def main():
# Single async call
response = await litellm.acompletion(
model="gpt-4o",
messages=[{"role": "user", "content": "Hello!"}],
)
print(response.choices[0].message.content)
# Async streaming
response = await litellm.acompletion(
model="claude-sonnet-4-20250514",
messages=[{"role": "user", "content": "Explain Docker."}],
stream=True,
)
async for chunk in response:
content = chunk.choices[0].delta.content
if content:
print(content, end="", flush=True)
# Parallel calls
tasks = [
litellm.acompletion(
model="gpt-4o-mini",
messages=[{"role": "user", "content": f"Question {i}"}],
)
for i in range(5)
]
responses = await asyncio.gather(*tasks)
for resp in responses:
print(resp.choices[0].message.content[:50])
asyncio.run(main())
2.5 Function Calling (Tool Use)
import litellm
import json
tools = [
{
"type": "function",
"function": {
"name": "get_weather",
"description": "Get the current weather for a location",
"parameters": {
"type": "object",
"properties": {
"location": {
"type": "string",
"description": "City name",
},
"unit": {
"type": "string",
"enum": ["celsius", "fahrenheit"],
},
},
"required": ["location"],
},
},
}
]
# Same interface for both OpenAI and Claude
for model in ["gpt-4o", "claude-sonnet-4-20250514"]:
response = litellm.completion(
model=model,
messages=[
{"role": "user", "content": "What's the weather in Seoul?"}
],
tools=tools,
tool_choice="auto",
)
if response.choices[0].message.tool_calls:
tool_call = response.choices[0].message.tool_calls[0]
print(f"Model: {model}")
print(f"Function: {tool_call.function.name}")
print(f"Arguments: {tool_call.function.arguments}")
2.6 Embedding
import litellm
# OpenAI Embedding
response = litellm.embedding(
model="text-embedding-3-small",
input=["Hello world", "How are you?"],
)
print(f"Embedding dimension: {len(response.data[0]['embedding'])}")
# Cohere Embedding
response = litellm.embedding(
model="cohere/embed-english-v3.0",
input=["Search query text"],
input_type="search_query",
)
# Bedrock Embedding
response = litellm.embedding(
model="bedrock/amazon.titan-embed-text-v2:0",
input=["Document text for embedding"],
)
2.7 Image/Vision Models
import litellm
# GPT-4o Vision
response = litellm.completion(
model="gpt-4o",
messages=[
{
"role": "user",
"content": [
{"type": "text", "text": "What is in this image?"},
{
"type": "image_url",
"image_url": {
"url": "https://example.com/image.png",
},
},
],
}
],
)
# Claude Vision
response = litellm.completion(
model="claude-sonnet-4-20250514",
messages=[
{
"role": "user",
"content": [
{"type": "text", "text": "Describe this architecture diagram."},
{
"type": "image_url",
"image_url": {
"url": "data:image/png;base64,iVBORw0KGgo...",
},
},
],
}
],
)
3. LiteLLM Proxy Server (AI Gateway)
3.1 What is the Proxy
LiteLLM Proxy is a self-hostable OpenAI-compatible API Gateway. Any existing client using the OpenAI SDK can connect to the Proxy without code changes.
+-------------------+
| Your Application |
| (OpenAI SDK) |
+--------+----------+
|
v
+--------+----------+
| LiteLLM Proxy |
| - Rate Limiting |
| - Cost Tracking |
| - Load Balancing |
| - Fallback |
| - Key Management |
+--------+----------+
|
+----+----+----+----+
| | | | |
v v v v v
OpenAI Azure Anthropic Bedrock Ollama
3.2 Installation and Running
# pip install
pip install 'litellm[proxy]'
# Basic run
litellm --model gpt-4o --port 4000
# Run with config file
litellm --config config.yaml --port 4000
# Docker run
docker run -d \
--name litellm-proxy \
-p 4000:4000 \
-v ./config.yaml:/app/config.yaml \
-e OPENAI_API_KEY=sk-xxx \
-e ANTHROPIC_API_KEY=sk-ant-xxx \
ghcr.io/berriai/litellm:main-latest \
--config /app/config.yaml
3.3 config.yaml Configuration
# config.yaml
model_list:
# OpenAI models
- model_name: gpt-4o
litellm_params:
model: openai/gpt-4o
api_key: os.environ/OPENAI_API_KEY
rpm: 500 # Rate limit: requests per minute
tpm: 100000 # Rate limit: tokens per minute
# Claude models (load balancing across deployments)
- model_name: claude-sonnet
litellm_params:
model: anthropic/claude-sonnet-4-20250514
api_key: os.environ/ANTHROPIC_API_KEY
rpm: 200
tpm: 80000
# Azure OpenAI
- model_name: gpt-4o
litellm_params:
model: azure/gpt-4o-deployment
api_base: https://my-resource.openai.azure.com
api_version: '2024-02-15-preview'
api_key: os.environ/AZURE_API_KEY
rpm: 300
# AWS Bedrock Claude
- model_name: bedrock-claude
litellm_params:
model: bedrock/anthropic.claude-3-sonnet-20240229-v1:0
aws_access_key_id: os.environ/AWS_ACCESS_KEY_ID
aws_secret_access_key: os.environ/AWS_SECRET_ACCESS_KEY
aws_region_name: us-east-1
# Local Ollama
- model_name: local-llama
litellm_params:
model: ollama/llama3
api_base: http://ollama:11434
# Router settings
router_settings:
routing_strategy: 'latency-based-routing'
num_retries: 3
timeout: 60
allowed_fails: 2
cooldown_time: 30
# General settings
general_settings:
master_key: sk-master-key-1234
database_url: os.environ/DATABASE_URL
store_model_in_db: true
litellm_settings:
drop_params: true
set_verbose: false
cache: true
cache_params:
type: redis
host: redis
port: 6379
3.4 Model Routing and Load Balancing
# Registering multiple deployments with the same model_name enables automatic load balancing
model_list:
  # gpt-4o group: 3 deployments
  - model_name: gpt-4o
    litellm_params:
      model: openai/gpt-4o
      api_key: os.environ/OPENAI_API_KEY_1
  - model_name: gpt-4o
    litellm_params:
      model: azure/gpt-4o-east
      api_base: https://east.openai.azure.com
      api_key: os.environ/AZURE_KEY_EAST
  - model_name: gpt-4o
    litellm_params:
      model: azure/gpt-4o-west
      api_base: https://west.openai.azure.com
      api_key: os.environ/AZURE_KEY_WEST
router_settings:
  # Routing strategy
  routing_strategy: 'latency-based-routing'
  # Options:
  # simple-shuffle: Random selection (LiteLLM's default)
  # least-busy: Fewest in-progress requests
  # usage-based-routing: Based on TPM/RPM usage
  # latency-based-routing: Based on response time
Routing Strategy Comparison:
| Strategy | Description | Best For |
|---|---|---|
| simple-shuffle | Random distribution | All deployments have similar performance |
| least-busy | Based on in-progress request count | Varying request processing times |
| usage-based-routing | Based on RPM/TPM usage | Approaching rate limits |
| latency-based-routing | Based on response time | Latency optimization is critical |
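To make the differences between the four strategies concrete, here is a minimal sketch of how a router could pick a deployment under each one. This is not LiteLLM's implementation: the `Deployment` class, its fields, and `pick_deployment` are hypothetical names made up for illustration, and the real router also accounts for weights, cooldowns, and time windows.

```python
import random
from dataclasses import dataclass


@dataclass
class Deployment:
    """Hypothetical stand-in for one entry in model_list."""
    name: str
    in_flight: int = 0          # requests currently being processed
    avg_latency_s: float = 0.0  # rolling average response time
    rpm_used: int = 0           # requests served in the current minute


def pick_deployment(deployments, strategy, rng=random):
    """Return the deployment a router might choose under each strategy."""
    if strategy == "simple-shuffle":
        return rng.choice(deployments)
    if strategy == "least-busy":
        return min(deployments, key=lambda d: d.in_flight)
    if strategy == "usage-based-routing":
        return min(deployments, key=lambda d: d.rpm_used)
    if strategy == "latency-based-routing":
        return min(deployments, key=lambda d: d.avg_latency_s)
    raise ValueError(f"unknown strategy: {strategy}")


group = [
    Deployment("openai/gpt-4o", in_flight=4, avg_latency_s=0.9, rpm_used=120),
    Deployment("azure/gpt-4o-east", in_flight=1, avg_latency_s=1.4, rpm_used=300),
    Deployment("azure/gpt-4o-west", in_flight=2, avg_latency_s=0.6, rpm_used=40),
]

print(pick_deployment(group, "least-busy").name)             # azure/gpt-4o-east
print(pick_deployment(group, "latency-based-routing").name)  # azure/gpt-4o-west
```

Note that the same group yields different winners depending on the strategy, which is why the choice should follow the bottleneck you actually have (queueing, rate limits, or latency).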
3.5 Fallback Configuration
model_list:
  - model_name: primary-model
    litellm_params:
      model: openai/gpt-4o
      api_key: os.environ/OPENAI_API_KEY
  - model_name: fallback-model
    litellm_params:
      model: anthropic/claude-sonnet-4-20250514
      api_key: os.environ/ANTHROPIC_API_KEY
router_settings:
  num_retries: 2
  timeout: 30
  fallbacks: [{ 'primary-model': ['fallback-model'] }]
  # Per-error retry counts (override num_retries for specific error types)
  retry_policy:
    RateLimitErrorRetries: 3 # Retry 3 times on 429 errors
    ContentPolicyViolationErrorRetries: 0 # No retry on content policy violations
    AuthenticationErrorRetries: 0 # No retry on auth errors
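The interaction between `num_retries` and `fallbacks` can be sketched as a small loop: retry the requested model, then move down its fallback list. This is a simplified illustration under stated assumptions, not LiteLLM's router code. `call_with_fallbacks` and `flaky_call` are hypothetical helpers, and the sketch retries every error type the same way, whereas a real router filters by error class as in `retry_policy`.

```python
def call_with_fallbacks(call, model, fallbacks, num_retries=2):
    """Retry `model`, then walk its fallback list; return (result, attempts)."""
    attempts = []
    last_error = None
    for candidate in [model, *fallbacks.get(model, [])]:
        # One initial try plus num_retries retries per candidate
        for _ in range(num_retries + 1):
            attempts.append(candidate)
            try:
                return call(candidate), attempts
            except Exception as exc:  # a real router would filter by error type
                last_error = exc
    raise last_error


def flaky_call(model):
    # Hypothetical backend: the primary is down, the fallback answers.
    if model == "primary-model":
        raise TimeoutError("primary unavailable")
    return f"response from {model}"


result, attempts = call_with_fallbacks(
    flaky_call, "primary-model", {"primary-model": ["fallback-model"]}
)
print(result)    # response from fallback-model
print(attempts)  # ['primary-model', 'primary-model', 'primary-model', 'fallback-model']
```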
3.6 API Key Management (Virtual Keys)
# Generate virtual key with master key
curl -X POST http://localhost:4000/key/generate \
-H "Authorization: Bearer sk-master-key-1234" \
-H "Content-Type: application/json" \
-d '{
"models": ["gpt-4o", "claude-sonnet"],
"max_budget": 100.0,
"budget_duration": "30d",
"metadata": {
"team": "backend",
"user": "developer-1"
},
"tpm_limit": 50000,
"rpm_limit": 100
}'
Response:
{
"key": "sk-generated-key-abc123",
"expires": "2026-04-20T00:00:00Z",
"max_budget": 100.0,
"models": ["gpt-4o", "claude-sonnet"]
}
# Make API calls with the generated key
from openai import OpenAI
client = OpenAI(
api_key="sk-generated-key-abc123",
base_url="http://localhost:4000",
)
response = client.chat.completions.create(
model="gpt-4o",
messages=[{"role": "user", "content": "Hello!"}],
)
3.7 Rate Limiting Configuration
# Rate Limiting in config.yaml
model_list:
  - model_name: gpt-4o
    litellm_params:
      model: openai/gpt-4o
      api_key: os.environ/OPENAI_API_KEY
      rpm: 500 # Model deployment level RPM
      tpm: 100000 # Model deployment level TPM
general_settings:
  master_key: sk-master-key-1234
  database_url: os.environ/DATABASE_URL
# Per-key rate limiting
curl -X POST http://localhost:4000/key/generate \
-H "Authorization: Bearer sk-master-key-1234" \
-H "Content-Type: application/json" \
-d '{
"rpm_limit": 50,
"tpm_limit": 20000,
"max_budget": 10.0,
"budget_duration": "1d"
}'
# Per-team rate limiting
curl -X POST http://localhost:4000/team/new \
-H "Authorization: Bearer sk-master-key-1234" \
-H "Content-Type: application/json" \
-d '{
"team_alias": "backend-team",
"rpm_limit": 200,
"tpm_limit": 80000,
"max_budget": 500.0,
"budget_duration": "30d"
}'
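Key-level and team-level limits stack: a request has to fit under both. The following is a minimal sketch of that idea using a fixed one-minute window. It is an illustration only; `FixedWindowLimiter` and `allow_request` are hypothetical names, and the proxy's real limiter (shared state across instances, token counting) is more involved.

```python
import time


class FixedWindowLimiter:
    """Allow at most `limit` requests per `window_s` seconds."""

    def __init__(self, limit, window_s=60.0, clock=time.monotonic):
        self.limit = limit
        self.window_s = window_s
        self.clock = clock
        self.window_start = clock()
        self.count = 0

    def allow(self):
        now = self.clock()
        if now - self.window_start >= self.window_s:
            # A new window has started: reset the counter.
            self.window_start = now
            self.count = 0
        if self.count >= self.limit:
            return False
        self.count += 1
        return True


def allow_request(key_limiter, team_limiter):
    """Admit a request only if both the key and its team have capacity."""
    # `and` short-circuits, so a throttled key never consumes team quota.
    return key_limiter.allow() and team_limiter.allow()


fake_now = [0.0]  # controllable clock so the demo is deterministic
key = FixedWindowLimiter(limit=2, clock=lambda: fake_now[0])
team = FixedWindowLimiter(limit=3, clock=lambda: fake_now[0])

print([allow_request(key, team) for _ in range(3)])  # [True, True, False]
fake_now[0] = 61.0                                   # move into the next window
print(allow_request(key, team))                      # True
```

One simplification to be aware of: if the key admits a request but the team rejects it, the key's counter has already been incremented. A production limiter would check both before committing either.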
3.8 Budget Management
# Set per-user budget
curl -X POST http://localhost:4000/user/new \
-H "Authorization: Bearer sk-master-key-1234" \
-H "Content-Type: application/json" \
-d '{
"user_id": "user-123",
"max_budget": 50.0,
"budget_duration": "30d",
"models": ["gpt-4o-mini", "claude-sonnet"]
}'
# Check budget usage
curl "http://localhost:4000/user/info?user_id=user-123" \
-H "Authorization: Bearer sk-master-key-1234"
3.9 Caching
# config.yaml
litellm_settings:
  cache: true
  cache_params:
    type: redis
    host: redis
    port: 6379
    ttl: 3600 # 1 hour cache
# Cache control from client
from openai import OpenAI
client = OpenAI(
api_key="sk-key",
base_url="http://localhost:4000",
)
# Use cache (default)
response = client.chat.completions.create(
model="gpt-4o",
messages=[{"role": "user", "content": "What is Python?"}],
)
# Skip cache
response = client.chat.completions.create(
model="gpt-4o",
messages=[{"role": "user", "content": "What is Python?"}],
extra_body={"cache": {"no-cache": True}},
)
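Conceptually, response caching boils down to two pieces: a key derived from the request and an expiry time. Here is a minimal in-memory sketch of both. The key scheme (SHA-256 over the JSON-serialized request) and the `TTLCache` class are assumptions chosen for illustration, not LiteLLM's actual cache key format or Redis layout.

```python
import hashlib
import json
import time


def cache_key(model, messages, **params):
    """Derive a stable key: identical requests map to the same hash."""
    payload = json.dumps(
        {"model": model, "messages": messages, "params": params},
        sort_keys=True,
    )
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()


class TTLCache:
    """Tiny in-memory cache whose entries expire after `ttl_s` seconds."""

    def __init__(self, ttl_s=3600, clock=time.monotonic):
        self._ttl_s = ttl_s
        self._clock = clock
        self._store = {}

    def set(self, key, value):
        self._store[key] = (value, self._clock() + self._ttl_s)

    def get(self, key):
        entry = self._store.get(key)
        if entry is None:
            return None
        value, expires_at = entry
        if self._clock() >= expires_at:
            del self._store[key]  # expired: evict and report a miss
            return None
        return value


cache = TTLCache(ttl_s=3600)
messages = [{"role": "user", "content": "What is Python?"}]
key = cache_key("gpt-4o", messages, temperature=0.7)

cache.set(key, "cached answer")
print(cache.get(key))  # cached answer
print(cache.get(cache_key("gpt-4o", messages, temperature=0.2)))  # None
```

Because sampling parameters are part of the key in this sketch, changing `temperature` produces a cache miss even when the messages are identical.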
4. Cost Tracking
4.1 Automatic Cost Calculation
LiteLLM automatically calculates the cost of each request.
import litellm
response = litellm.completion(
model="gpt-4o",
messages=[{"role": "user", "content": "Hello!"}],
)
# Cost information
print(f"Input tokens: {response.usage.prompt_tokens}")
print(f"Output tokens: {response.usage.completion_tokens}")
# LiteLLM cost calculation
from litellm import completion_cost
cost = completion_cost(completion_response=response)
print(f"Cost: ${cost:.6f}")
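The arithmetic behind a per-request cost is simple: multiply input and output token counts by their respective unit prices and add the results. The sketch below works through one example. The model name and prices are placeholders invented for illustration, not real provider pricing; LiteLLM maintains its own price table internally.

```python
from decimal import Decimal

# Hypothetical prices in USD per 1M tokens: (input, output).
# These numbers are placeholders, not real provider pricing.
PRICES_PER_MILLION = {
    "example-model": (Decimal("2.50"), Decimal("10.00")),
}


def estimate_cost(model, prompt_tokens, completion_tokens):
    """cost = (prompt_tokens * input_price + completion_tokens * output_price) / 1M"""
    input_price, output_price = PRICES_PER_MILLION[model]
    total = prompt_tokens * input_price + completion_tokens * output_price
    return total / Decimal(1_000_000)


cost = estimate_cost("example-model", prompt_tokens=1200, completion_tokens=300)
print(f"Cost: ${cost:.6f}")  # Cost: $0.006000
```

`Decimal` is used instead of `float` so that many small per-request amounts can be summed without accumulating rounding error.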
4.2 Cost Queries via Proxy
# Total spend
curl http://localhost:4000/global/spend \
-H "Authorization: Bearer sk-master-key-1234"
# Per-key spend (the key info response includes the key's accumulated spend)
curl "http://localhost:4000/key/info?key=sk-key-abc" \
-H "Authorization: Bearer sk-master-key-1234"
# Per-model spend
curl "http://localhost:4000/global/spend?model=gpt-4o" \
-H "Authorization: Bearer sk-master-key-1234"
# Per-team spend
curl "http://localhost:4000/team/info?team_id=team-backend" \
-H "Authorization: Bearer sk-master-key-1234"
# Spend by date range
curl "http://localhost:4000/global/spend/logs?start_date=2026-03-01&end_date=2026-03-20" \
-H "Authorization: Bearer sk-master-key-1234"
4.3 Budget Alert Configuration
# config.yaml
general_settings:
  alerting:
    - slack
  alerting_threshold: 300 # Alert if no response within 300 seconds
  alert_types:
    - budget_alerts # On budget exceeded
    - spend_reports # Weekly/monthly cost reports
    - failed_tracking_spend # Requests whose spend could not be tracked
environment_variables:
  SLACK_WEBHOOK_URL: os.environ/SLACK_WEBHOOK_URL
5. Production Deployment
5.1 Docker Compose
# docker-compose.yml
version: '3.8'
services:
  litellm:
    image: ghcr.io/berriai/litellm:main-latest
    container_name: litellm-proxy
    ports:
      - '4000:4000'
    volumes:
      - ./config.yaml:/app/config.yaml
    environment:
      - OPENAI_API_KEY=sk-xxx
      - ANTHROPIC_API_KEY=sk-ant-xxx
      - AZURE_API_KEY=xxx
      - AWS_ACCESS_KEY_ID=xxx
      - AWS_SECRET_ACCESS_KEY=xxx
      - DATABASE_URL=postgresql://litellm:password@postgres:5432/litellm
      - REDIS_HOST=redis
      - REDIS_PORT=6379
    command: --config /app/config.yaml --port 4000
    depends_on:
      postgres:
        condition: service_healthy
      redis:
        condition: service_started
    healthcheck:
      # /health/liveliness needs no auth; /health requires the master key
      test: ['CMD', 'curl', '-f', 'http://localhost:4000/health/liveliness']
      interval: 30s
      timeout: 10s
      retries: 3
  postgres:
    image: postgres:16-alpine
    container_name: litellm-postgres
    environment:
      POSTGRES_DB: litellm
      POSTGRES_USER: litellm
      POSTGRES_PASSWORD: password
    volumes:
      - postgres_data:/var/lib/postgresql/data
    healthcheck:
      test: ['CMD-SHELL', 'pg_isready -U litellm']
      interval: 5s
      timeout: 5s
      retries: 5
  redis:
    image: redis:7-alpine
    container_name: litellm-redis
    ports:
      - '6379:6379'
    volumes:
      - redis_data:/data
volumes:
  postgres_data:
  redis_data:
5.2 Kubernetes Helm Chart
# Install from the OCI registry (the chart is published as an OCI artifact)
helm install litellm oci://ghcr.io/berriai/litellm-helm \
--namespace litellm \
--create-namespace \
--values values.yaml
# values.yaml
replicaCount: 3
image:
  repository: ghcr.io/berriai/litellm
  tag: main-latest
service:
  type: ClusterIP
  port: 4000
ingress:
  enabled: true
  className: nginx
  hosts:
    - host: litellm.internal.company.com
      paths:
        - path: /
          pathType: Prefix
resources:
  requests:
    cpu: 500m
    memory: 512Mi
  limits:
    cpu: 2000m
    memory: 2Gi
env:
  - name: OPENAI_API_KEY
    valueFrom:
      secretKeyRef:
        name: litellm-secrets
        key: openai-api-key
  - name: ANTHROPIC_API_KEY
    valueFrom:
      secretKeyRef:
        name: litellm-secrets
        key: anthropic-api-key
  - name: DATABASE_URL
    valueFrom:
      secretKeyRef:
        name: litellm-secrets
        key: database-url
postgresql:
  enabled: true
  auth:
    database: litellm
    username: litellm
redis:
  enabled: true
5.3 Health Check and Metrics
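# Health Check (calls the configured models; requires the master key when one is set)
curl http://localhost:4000/health \
-H "Authorization: Bearer sk-master-key-1234"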
# Prometheus Metrics
curl http://localhost:4000/metrics
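Key Prometheus Metrics (exact metric names vary by LiteLLM version; confirm them against the /metrics output of your deployment):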
litellm_requests_total: Total request count
litellm_request_duration_seconds: Request processing time
litellm_tokens_total: Total token usage
litellm_spend_total: Total spend
litellm_errors_total: Error count
litellm_cache_hits_total: Cache hit count
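For quick checks without a full Prometheus setup, the text returned by `/metrics` can be aggregated with a few lines of Python. The sketch below sums sample values per metric name. It assumes samples carry no trailing timestamp, and the metric names in `SAMPLE` are hypothetical placeholders rather than names LiteLLM is guaranteed to emit.

```python
def sum_by_metric(exposition_text):
    """Sum sample values per metric name, ignoring labels and comments."""
    totals = {}
    for line in exposition_text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blanks and HELP/TYPE comment lines
        # The value is the last space-separated field (no timestamps assumed)
        series, _, value = line.rpartition(" ")
        name = series.split("{", 1)[0]
        totals[name] = totals.get(name, 0.0) + float(value)
    return totals


# Hypothetical sample in Prometheus text exposition format
SAMPLE = """\
# HELP example_requests_total Hypothetical request counter
# TYPE example_requests_total counter
example_requests_total{model="gpt-4o"} 120
example_requests_total{model="claude-sonnet"} 30
example_spend_total{model="gpt-4o"} 1.5
"""

print(sum_by_metric(SAMPLE))
# {'example_requests_total': 150.0, 'example_spend_total': 1.5}
```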
5.4 Logging Integration
# config.yaml - External logging service integration
litellm_settings:
  success_callback: ['langfuse']
  failure_callback: ['langfuse']
environment_variables:
  LANGFUSE_PUBLIC_KEY: os.environ/LANGFUSE_PUBLIC_KEY
  LANGFUSE_SECRET_KEY: os.environ/LANGFUSE_SECRET_KEY
  LANGFUSE_HOST: https://cloud.langfuse.com
Supported Logging Services:
| Service | Purpose |
|---|---|
| Langfuse | LLM observability, prompt management |
| Helicone | Request logging, cost analysis |
| Lunary | LLM monitoring |
| Custom Callback | Integration with custom logging systems |
# Custom Callback example
import litellm
from litellm import completion_cost

def log_to_database(**record):
    # Placeholder sink: replace with your own storage or alerting logic
    print(record)

def my_custom_callback(kwargs, completion_response, start_time, end_time):
    # Called after every successful request
    model = kwargs.get("model")
    messages = kwargs.get("messages")
    cost = completion_cost(completion_response=completion_response)
    # Custom logic (DB storage, alerts, etc.)
    log_to_database(
        model=model,
        cost=cost,
        latency=(end_time - start_time).total_seconds(),
        tokens=completion_response.usage.total_tokens,
    )

litellm.success_callback = [my_custom_callback]
6. Real-World Use Cases
6.1 Enterprise AI Gateway
Centralize all LLM calls across the organization through LiteLLM Proxy.
+------------------+
| Frontend App |----+
+------------------+ |
| +----------------+
+------------------+ +---->| | +----------+
| Backend Service |----+ | LiteLLM Proxy |---->| OpenAI |
+------------------+ | | | +----------+
| | - Auth |
+------------------+ | | - Rate Limit | +----------+
| Data Pipeline |----+ | - Cost Track |---->| Anthropic|
+------------------+ | | - Audit Log | +----------+
| | |
+------------------+ | +--------+-------+ +----------+
| Internal Tools |----+ | | Azure |
+------------------+ v +----------+
+--------+-------+
| PostgreSQL |
| (spend logs) |
+----------------+
6.2 A/B Testing
from openai import OpenAI
import random

client = OpenAI(
    api_key="sk-proxy-key",
    base_url="http://litellm-proxy:4000",
)

def get_completion_with_ab_test(prompt: str, test_name: str):
    # 50/50 A/B test
    model = random.choice(["gpt-4o", "claude-sonnet"])
    response = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": prompt}],
        extra_body={
            "metadata": {
                "test_name": test_name,
                "variant": model,
            }
        },
    )
    return {
        "model": model,
        "content": response.choices[0].message.content,
        "tokens": response.usage.total_tokens,
    }
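`random.choice` assigns a variant per request, so the same user can bounce between models across calls. If the experiment needs each user to stay on one variant, a common alternative is to hash a stable identifier. The sketch below shows that approach; `assign_variant` and the `user_id` parameter are hypothetical additions for illustration and are not part of the example above.

```python
import hashlib
from collections import Counter

VARIANTS = ("gpt-4o", "claude-sonnet")


def assign_variant(user_id, test_name, variants=VARIANTS):
    """Map (test_name, user_id) to a variant; the same input always wins the same variant."""
    digest = hashlib.sha256(f"{test_name}:{user_id}".encode("utf-8")).digest()
    bucket = int.from_bytes(digest[:8], "big") % len(variants)
    return variants[bucket]


# The split is roughly even; exact counts depend on the hash values
counts = Counter(assign_variant(f"user-{i}", "summary-quality") for i in range(1000))
print(counts)
```

Including `test_name` in the hashed string keeps assignments independent across experiments, so a user in variant A of one test is not systematically in variant A of the next.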
6.3 Cost-Optimized Routing
def smart_route(prompt: str, complexity: str = "auto"):
    """Select an appropriate model based on complexity"""
    # Reuses the `client` created in section 6.2
    if complexity == "auto":
        # Simple heuristic: keywords first, then word count,
        # so a short "analyze ..." prompt is not routed to the cheap model
        word_count = len(prompt.split())
        if any(kw in prompt.lower() for kw in
               ["analyze", "compare", "complex", "detailed"]):
            complexity = "complex"
        elif word_count < 50:
            complexity = "simple"
        else:
            complexity = "medium"
    model_map = {
        "simple": "gpt-4o-mini",    # Cheap model
        "medium": "claude-sonnet",  # Mid performance/price
        "complex": "gpt-4o",        # High performance model
    }
    model = model_map[complexity]
    response = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": prompt}],
    )
    return response
6.4 Disaster Recovery (Automatic Failover)
# config.yaml - Multi-provider failover
model_list:
  # Primary: OpenAI
  - model_name: main-model
    litellm_params:
      model: openai/gpt-4o
      api_key: os.environ/OPENAI_API_KEY
  # Secondary: Azure OpenAI (different region)
  - model_name: main-model-fallback-1
    litellm_params:
      model: azure/gpt-4o
      api_base: https://eastus.openai.azure.com
      api_key: os.environ/AZURE_KEY
  # Tertiary: Anthropic Claude
  - model_name: main-model-fallback-2
    litellm_params:
      model: anthropic/claude-sonnet-4-20250514
      api_key: os.environ/ANTHROPIC_API_KEY
router_settings:
  fallbacks: [{ 'main-model': ['main-model-fallback-1', 'main-model-fallback-2'] }]
  num_retries: 2
  timeout: 30
  allowed_fails: 3
  cooldown_time: 60 # 60 second cooldown for failed models
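The `allowed_fails` and `cooldown_time` settings describe a simple circuit-breaker pattern: after too many failures, a model is skipped for a while and traffic goes to the next entry in the chain. A minimal sketch of that bookkeeping follows. `CooldownTracker` is a hypothetical class for illustration, and it assumes a model cools down once its failure count exceeds `allowed_fails`, without the per-minute window a real router would apply.

```python
import time


class CooldownTracker:
    """Track failures per model and skip models that are cooling down."""

    def __init__(self, allowed_fails=3, cooldown_time=60.0, clock=time.monotonic):
        self.allowed_fails = allowed_fails
        self.cooldown_time = cooldown_time
        self.clock = clock
        self.fails = {}
        self.cooling_until = {}

    def record_failure(self, model):
        self.fails[model] = self.fails.get(model, 0) + 1
        if self.fails[model] > self.allowed_fails:
            # Too many failures: take the model out of rotation
            self.cooling_until[model] = self.clock() + self.cooldown_time
            self.fails[model] = 0

    def record_success(self, model):
        self.fails[model] = 0

    def is_available(self, model):
        return self.clock() >= self.cooling_until.get(model, float("-inf"))

    def first_available(self, models):
        return next((m for m in models if self.is_available(m)), None)


fake_now = [0.0]  # controllable clock so the demo is deterministic
tracker = CooldownTracker(allowed_fails=3, cooldown_time=60.0, clock=lambda: fake_now[0])
chain = ["main-model", "main-model-fallback-1", "main-model-fallback-2"]

for _ in range(4):  # the 4th failure exceeds allowed_fails=3
    tracker.record_failure("main-model")

print(tracker.first_available(chain))  # main-model-fallback-1
fake_now[0] = 60.0                     # cooldown has elapsed
print(tracker.first_available(chain))  # main-model
```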
7. Comparison: LiteLLM vs Alternatives
7.1 Tool Comparison
| Feature | LiteLLM | LangChain | OpenRouter | Portkey |
|---|---|---|---|---|
| Type | Gateway + SDK | Framework | Hosted API | Hosted Gateway |
| Hosting | Self-hosted | N/A (library) | Cloud | Cloud + Self |
| Model Count | 100+ | Various | 200+ | 250+ |
| Cost Tracking | Built-in | Requires custom impl | Yes | Yes |
| Rate Limiting | Built-in | None | Yes | Yes |
| Load Balancing | Built-in | None | Yes | Yes |
| Fallback | Built-in | Manual implementation | Yes | Yes |
| API Key Mgmt | Virtual Keys | None | None | Yes |
| Pricing | Free (OSS) | Free (OSS) | Markup | Free + Enterprise |
| Data Privacy | Full control | Full control | Third-party | Third-party |
7.2 When to Choose Which Tool
Choose LiteLLM when:
- Data privacy is important (finance, healthcare, government)
- Must operate on own infrastructure
- Cost tracking and rate limiting are needed
- Already using multiple providers
Choose LangChain when:
- Building complex LLM pipelines (RAG, Agents)
- Need prompt chaining, memory management
- (LiteLLM and LangChain can be used together)
Choose OpenRouter when:
- Rapid prototyping
- Don't want to manage infrastructure
- Single API key for all models
Choose Portkey when:
- Enterprise-level management UI needed
- Advanced features like guardrails, A/B testing needed
- Prefer managed services
8. Practical Tips
8.1 Environment Variable Management
# .env file (never commit to Git)
OPENAI_API_KEY=sk-xxx
ANTHROPIC_API_KEY=sk-ant-xxx
AZURE_API_KEY=xxx
AWS_ACCESS_KEY_ID=xxx
AWS_SECRET_ACCESS_KEY=xxx
DATABASE_URL=postgresql://litellm:password@localhost:5432/litellm
LITELLM_MASTER_KEY=sk-master-key-change-me
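A missing variable usually surfaces only when the first request to that provider fails, so it can help to check the environment at startup. The following is a minimal sketch of such a check. `missing_env_vars` is a hypothetical helper, and the list of required names is an assumption based on the `.env` file above; trim it to the providers you actually use.

```python
import os

# Assumed to mirror the .env file above; adjust to your providers
REQUIRED_VARS = (
    "OPENAI_API_KEY",
    "ANTHROPIC_API_KEY",
    "DATABASE_URL",
    "LITELLM_MASTER_KEY",
)


def missing_env_vars(required=REQUIRED_VARS, environ=None):
    """Return the names that are unset or empty in the environment."""
    env = os.environ if environ is None else environ
    return [name for name in required if not env.get(name)]


missing = missing_env_vars()
print("Missing:", ", ".join(missing) or "none")
```

In a deployment script you would typically exit with a non-zero status when the list is non-empty, rather than just printing it.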
8.2 Model Alias Configuration
# config.yaml
model_list:
  - model_name: fast
    litellm_params:
      model: gpt-4o-mini
      api_key: os.environ/OPENAI_API_KEY
  - model_name: smart
    litellm_params:
      model: gpt-4o
      api_key: os.environ/OPENAI_API_KEY
  - model_name: creative
    litellm_params:
      model: anthropic/claude-sonnet-4-20250514
      api_key: os.environ/ANTHROPIC_API_KEY
# Call with meaningful names
response = client.chat.completions.create(
model="fast", # gpt-4o-mini
messages=[{"role": "user", "content": "Quick question"}],
)
response = client.chat.completions.create(
model="smart", # gpt-4o
messages=[{"role": "user", "content": "Complex analysis"}],
)
8.3 Error Handling Patterns
import time
from openai import OpenAI, APIError, RateLimitError, APITimeoutError

client = OpenAI(
    api_key="sk-proxy-key",
    base_url="http://litellm-proxy:4000",
)

def safe_completion(messages, model="gpt-4o", max_retries=3):
    for attempt in range(max_retries):
        try:
            response = client.chat.completions.create(
                model=model,
                messages=messages,
                timeout=30,
            )
            return response
        except RateLimitError:
            # LiteLLM Proxy handles rate limits, but the client should back off too
            if attempt == max_retries - 1:
                raise  # Out of retries: surface the error instead of returning None
            wait = 2 ** attempt
            print(f"Rate limited, waiting {wait}s...")
            time.sleep(wait)
        except APITimeoutError:
            print(f"Timeout on attempt {attempt + 1}")
            if attempt == max_retries - 1:
                raise
        except APIError as e:
            # Other API errors are not retried
            print(f"API error: {e}")
            raise
    return None
9. Conclusion
LiteLLM acts as an AI gateway for multi-LLM environments.
Key Takeaways:
- Unified SDK: Call 100+ LLMs through a single completion() function
- Proxy Server: Centralized management via OpenAI-compatible API Gateway
- Cost Control: Automatic cost tracking, budget management, alerts
- Reliability: Built-in load balancing, fallback, rate limiting
- Production: Docker/Kubernetes deployment, Prometheus monitoring, external logging
In enterprise environments that use multiple LLM providers, deploying LiteLLM Proxy puts API key management, cost tracking, and failure recovery behind a single, consistent control point.
References
- LiteLLM Official Docs: https://docs.litellm.ai/
- LiteLLM GitHub: https://github.com/BerriAI/litellm
- LiteLLM Proxy Config Guide: https://docs.litellm.ai/docs/proxy/configs
- LiteLLM Docker Deployment: https://docs.litellm.ai/docs/proxy/deploy