[Architecture] Complete Guide to LiteLLM: Unified Serving of 100+ LLMs

Overview

As services leveraging LLMs (Large Language Models) proliferate, efficiently managing models from multiple providers has become critical.

Each provider (OpenAI, Anthropic, Azure OpenAI, AWS Bedrock) ships its own SDK and API format, so code complexity grows quickly with every provider you add.

**LiteLLM** is an open-source tool that solves this problem by providing an OpenAI SDK-compatible interface to unify 100+ LLMs into a single API.

This post covers everything from LiteLLM SDK usage to Proxy Server setup, cost tracking, and production deployment.

1. What is LiteLLM

1.1 Core Value

LiteLLM is an open-source project by BerriAI that provides two core components.

**1. Python SDK**: Call 100+ LLMs through a unified interface

```
litellm.completion()
    |
    +-- model="gpt-4o"                    --> OpenAI API
    +-- model="claude-sonnet-4-20250514"  --> Anthropic API
    +-- model="azure/gpt-4o"              --> Azure OpenAI
    +-- model="bedrock/claude-3"          --> AWS Bedrock
    +-- model="vertex_ai/gemini"          --> Google Vertex AI
    +-- model="ollama/llama3"             --> Local Ollama
```

**2. Proxy Server (AI Gateway)**: OpenAI-compatible REST API server

```
Any OpenAI Client --> LiteLLM Proxy --> Multiple LLM Providers
                      (Rate Limiting, Cost Tracking, Load Balancing)
```

1.2 Why LiteLLM

| Problem                      | LiteLLM Solution                        |
| ---------------------------- | --------------------------------------- |
| Different SDK per provider   | Unified completion() function           |
| Scattered API key management | Centralized management via Proxy        |
| Difficulty tracking costs    | Automatic cost calculation and tracking |
| Provider outages             | Automatic fallback support              |
| Rate limit management        | Built-in rate limiting                  |
| Model switching costs        | Switch models without code changes      |

1.3 Supported Providers

Commercial Providers:

- OpenAI (GPT-4o, GPT-4o-mini, o1, o3, etc.)

- Anthropic (Claude Sonnet, Claude Haiku, Claude Opus)

- Azure OpenAI

- AWS Bedrock (Claude, Titan, Llama)

- Google Vertex AI (Gemini)

- Google AI Studio

- Cohere (Command R+)

- Mistral AI

- Together AI

- Groq

- Fireworks AI

- Perplexity

- DeepSeek

Self-Hosted / Local:

- Ollama

- vLLM

- Hugging Face TGI

- NVIDIA NIM

- OpenAI-compatible endpoints

2. LiteLLM SDK Usage

2.1 Installation

```bash
pip install litellm
```

2.2 Basic Usage: completion()

```python
import litellm

# OpenAI
response = litellm.completion(
    model="gpt-4o",
    messages=[
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": "What is Kubernetes?"},
    ],
    temperature=0.7,
    max_tokens=1000,
)
print(response.choices[0].message.content)

# Anthropic Claude
response = litellm.completion(
    model="claude-sonnet-4-20250514",
    messages=[
        {"role": "user", "content": "Explain microservices architecture."},
    ],
    max_tokens=1000,
)
print(response.choices[0].message.content)

# Azure OpenAI
response = litellm.completion(
    model="azure/gpt-4o-deployment",
    messages=[
        {"role": "user", "content": "Hello!"},
    ],
    api_base="https://my-resource.openai.azure.com",
    api_version="2024-02-15-preview",
    api_key="your-azure-key",
)

# AWS Bedrock
response = litellm.completion(
    model="bedrock/anthropic.claude-3-sonnet-20240229-v1:0",
    messages=[
        {"role": "user", "content": "Summarize this text."},
    ],
)

# Ollama (local)
response = litellm.completion(
    model="ollama/llama3",
    messages=[
        {"role": "user", "content": "Write a Python function."},
    ],
    api_base="http://localhost:11434",
)
```

2.3 Streaming

```python
import litellm

# Synchronous streaming
response = litellm.completion(
    model="gpt-4o",
    messages=[{"role": "user", "content": "Write a long essay."}],
    stream=True,
)

for chunk in response:
    content = chunk.choices[0].delta.content
    if content:
        print(content, end="", flush=True)
```
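When the full text is needed after streaming (for logging or caching, say), the deltas can be accumulated rather than only printed. The sketch below assumes chunks shaped like the ones in the loop above (`chunk.choices[0].delta.content`, which may be `None`). `collect_stream` and `fake_chunk` are hypothetical helpers, not part of LiteLLM, and the fake stream stands in for a real response so the example runs offline.

```python
from types import SimpleNamespace


def collect_stream(chunks):
    """Join the non-empty delta.content pieces of a chunk iterator."""
    parts = []
    for chunk in chunks:
        content = chunk.choices[0].delta.content
        if content:  # skip None / empty deltas
            parts.append(content)
    return "".join(parts)


def fake_chunk(text):
    # Hypothetical stand-in that mimics the chunk shape used above
    delta = SimpleNamespace(content=text)
    return SimpleNamespace(choices=[SimpleNamespace(delta=delta)])


fake_stream = [fake_chunk("Hello"), fake_chunk(None), fake_chunk(", world")]
full_text = collect_stream(fake_stream)
print(full_text)
```

With a real call, the same function could be given the iterator returned by `litellm.completion(..., stream=True)`.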

2.4 Async Calls

```python
import asyncio

import litellm


async def main():
    # Single async call
    response = await litellm.acompletion(
        model="gpt-4o",
        messages=[{"role": "user", "content": "Hello!"}],
    )
    print(response.choices[0].message.content)

    # Async streaming
    response = await litellm.acompletion(
        model="claude-sonnet-4-20250514",
        messages=[{"role": "user", "content": "Explain Docker."}],
        stream=True,
    )
    async for chunk in response:
        content = chunk.choices[0].delta.content
        if content:
            print(content, end="", flush=True)

    # Parallel calls
    tasks = [
        litellm.acompletion(
            model="gpt-4o-mini",
            messages=[{"role": "user", "content": f"Question {i}"}],
        )
        for i in range(5)
    ]
    responses = await asyncio.gather(*tasks)
    for resp in responses:
        print(resp.choices[0].message.content[:50])


asyncio.run(main())
```

2.5 Function Calling (Tool Use)

```python
import litellm

tools = [
    {
        "type": "function",
        "function": {
            "name": "get_weather",
            "description": "Get the current weather for a location",
            "parameters": {
                "type": "object",
                "properties": {
                    "location": {
                        "type": "string",
                        "description": "City name",
                    },
                    "unit": {
                        "type": "string",
                        "enum": ["celsius", "fahrenheit"],
                    },
                },
                "required": ["location"],
            },
        },
    }
]

# Same interface for both OpenAI and Claude
for model in ["gpt-4o", "claude-sonnet-4-20250514"]:
    response = litellm.completion(
        model=model,
        messages=[
            {"role": "user", "content": "What's the weather in Seoul?"}
        ],
        tools=tools,
        tool_choice="auto",
    )

    if response.choices[0].message.tool_calls:
        tool_call = response.choices[0].message.tool_calls[0]
        print(f"Model: {model}")
        print(f"Function: {tool_call.function.name}")
        print(f"Arguments: {tool_call.function.arguments}")
```
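The model only returns the tool name and a JSON string of arguments; the application still has to run the function itself. A minimal sketch of that dispatch step follows, assuming the arguments arrive as a JSON string as in the OpenAI tool-call format. `get_weather`, `TOOL_REGISTRY`, and `dispatch_tool_call` are hypothetical names chosen for illustration, and the weather values are made up.

```python
import json


def get_weather(location, unit="celsius"):
    # Hypothetical local implementation; a real one would query a weather API
    return {"location": location, "unit": unit, "temperature": 21}


# Map tool names (as declared in `tools`) to local Python callables
TOOL_REGISTRY = {"get_weather": get_weather}


def dispatch_tool_call(name, arguments_json):
    """Look up the named tool and call it with the JSON-decoded arguments."""
    if name not in TOOL_REGISTRY:
        raise ValueError(f"Unknown tool: {name}")
    kwargs = json.loads(arguments_json)
    return TOOL_REGISTRY[name](**kwargs)


result = dispatch_tool_call("get_weather", '{"location": "Seoul"}')
print(result)
```

In a real loop, `name` and `arguments_json` would come from `tool_call.function.name` and `tool_call.function.arguments`, and the result would be sent back to the model as a `tool` role message.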

2.6 Embedding

```python
import litellm

# OpenAI Embedding
response = litellm.embedding(
    model="text-embedding-3-small",
    input=["Hello world", "How are you?"],
)
print(f"Embedding dimension: {len(response.data[0]['embedding'])}")

# Cohere Embedding
response = litellm.embedding(
    model="cohere/embed-english-v3.0",
    input=["Search query text"],
    input_type="search_query",
)

# Bedrock Embedding
response = litellm.embedding(
    model="bedrock/amazon.titan-embed-text-v2:0",
    input=["Document text for embedding"],
)
```

2.7 Image/Vision Models

```python
import litellm

# GPT-4o Vision
response = litellm.completion(
    model="gpt-4o",
    messages=[
        {
            "role": "user",
            "content": [
                {"type": "text", "text": "What is in this image?"},
                {
                    "type": "image_url",
                    "image_url": {
                        "url": "https://example.com/image.png",
                    },
                },
            ],
        }
    ],
)

# Claude Vision
response = litellm.completion(
    model="claude-sonnet-4-20250514",
    messages=[
        {
            "role": "user",
            "content": [
                {"type": "text", "text": "Describe this architecture diagram."},
                {
                    "type": "image_url",
                    "image_url": {
                        "url": "data:image/png;base64,iVBORw0KGgo...",
                    },
                },
            ],
        }
    ],
)
```

3. LiteLLM Proxy Server (AI Gateway)

3.1 What is the Proxy

LiteLLM Proxy is a self-hostable OpenAI-compatible API Gateway.

Any existing client built on the OpenAI SDK can connect to the Proxy by changing only its base URL and API key; the rest of the code stays the same.

```
+-------------------+
| Your Application  |
|   (OpenAI SDK)    |
+--------+----------+
         |
         v
+--------+----------+
|  LiteLLM Proxy    |
|  - Rate Limiting  |
|  - Cost Tracking  |
|  - Load Balancing |
|  - Fallback       |
|  - Key Management |
+--------+----------+
         |
    +----+----+----+----+
    |    |    |    |    |
    v    v    v    v    v
 OpenAI Azure Anthropic Bedrock Ollama
```

3.2 Installation and Running

```bash
# pip install
pip install 'litellm[proxy]'

# Basic run
litellm --model gpt-4o --port 4000

# Run with config file
litellm --config config.yaml --port 4000

# Docker run (absolute host path for the bind mount)
docker run -d \
  --name litellm-proxy \
  -p 4000:4000 \
  -v "$(pwd)/config.yaml:/app/config.yaml" \
  -e OPENAI_API_KEY=sk-xxx \
  -e ANTHROPIC_API_KEY=sk-ant-xxx \
  ghcr.io/berriai/litellm:main-latest \
  --config /app/config.yaml
```

3.3 config.yaml Configuration

```yaml
# config.yaml
model_list:
  # OpenAI models
  - model_name: gpt-4o
    litellm_params:
      model: openai/gpt-4o
      api_key: os.environ/OPENAI_API_KEY
      rpm: 500      # Rate limit: requests per minute
      tpm: 100000   # Rate limit: tokens per minute

  # Claude models
  - model_name: claude-sonnet
    litellm_params:
      model: anthropic/claude-sonnet-4-20250514
      api_key: os.environ/ANTHROPIC_API_KEY
      rpm: 200
      tpm: 80000

  # Azure OpenAI (same model_name as the OpenAI entry, so the two are load balanced)
  - model_name: gpt-4o
    litellm_params:
      model: azure/gpt-4o-deployment
      api_base: https://my-resource.openai.azure.com
      api_version: '2024-02-15-preview'
      api_key: os.environ/AZURE_API_KEY
      rpm: 300

  # AWS Bedrock Claude
  - model_name: bedrock-claude
    litellm_params:
      model: bedrock/anthropic.claude-3-sonnet-20240229-v1:0
      aws_access_key_id: os.environ/AWS_ACCESS_KEY_ID
      aws_secret_access_key: os.environ/AWS_SECRET_ACCESS_KEY
      aws_region_name: us-east-1

  # Local Ollama
  - model_name: local-llama
    litellm_params:
      model: ollama/llama3
      api_base: http://ollama:11434

# Router settings
router_settings:
  routing_strategy: 'latency-based-routing'
  num_retries: 3
  timeout: 60
  allowed_fails: 2
  cooldown_time: 30

# General settings
general_settings:
  master_key: sk-master-key-1234
  database_url: os.environ/DATABASE_URL
  store_model_in_db: true

litellm_settings:
  drop_params: true
  set_verbose: false
  cache: true
  cache_params:
    type: redis
    host: redis
    port: 6379
```

3.4 Model Routing and Load Balancing

```yaml
# Registering multiple deployments under the same model_name enables automatic load balancing
model_list:
  # gpt-4o group: 3 deployments
  - model_name: gpt-4o
    litellm_params:
      model: openai/gpt-4o
      api_key: os.environ/OPENAI_API_KEY_1

  - model_name: gpt-4o
    litellm_params:
      model: azure/gpt-4o-east
      api_base: https://east.openai.azure.com
      api_key: os.environ/AZURE_KEY_EAST

  - model_name: gpt-4o
    litellm_params:
      model: azure/gpt-4o-west
      api_base: https://west.openai.azure.com
      api_key: os.environ/AZURE_KEY_WEST

router_settings:
  # Routing strategy
  routing_strategy: 'latency-based-routing'
  # Options:
  #   simple-shuffle: Random selection (the default)
  #   least-busy: Fewest in-progress requests
  #   usage-based-routing: Based on TPM/RPM usage
  #   latency-based-routing: Based on response time (used in this example)
```

**Routing Strategy Comparison:**

| Strategy              | Description                        | Best For                                 |
| --------------------- | ---------------------------------- | ---------------------------------------- |
| simple-shuffle        | Random distribution                | All deployments have similar performance |
| least-busy            | Based on in-progress request count | Varying request processing times         |
| usage-based-routing   | Based on RPM/TPM usage             | Approaching rate limits                  |
| latency-based-routing | Based on response time             | Latency optimization is critical         |
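To make the difference between the strategies concrete, here is a toy selection function over three deployments. It is a sketch of the ideas in the table, not LiteLLM's router code. The `in_flight` and `avg_latency` fields, the deployment names, and the `pick` function are all assumptions made up for the example.

```python
import random

# Hypothetical snapshot of three deployments in one model group
deployments = [
    {"name": "openai", "in_flight": 4, "avg_latency": 0.9},
    {"name": "azure-east", "in_flight": 1, "avg_latency": 1.4},
    {"name": "azure-west", "in_flight": 2, "avg_latency": 0.6},
]


def pick(strategy, candidates, rng=random):
    """Choose one deployment according to the named strategy."""
    if strategy == "simple-shuffle":
        return rng.choice(candidates)  # uniform random pick
    if strategy == "least-busy":
        return min(candidates, key=lambda d: d["in_flight"])
    if strategy == "latency-based-routing":
        return min(candidates, key=lambda d: d["avg_latency"])
    raise ValueError(f"Unsupported strategy in this sketch: {strategy}")


for strategy in ("simple-shuffle", "least-busy", "latency-based-routing"):
    print(strategy, "->", pick(strategy, deployments)["name"])
```

Note how `least-busy` and `latency-based-routing` can disagree: the deployment with the fewest in-flight requests is not necessarily the fastest one.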

3.5 Fallback Configuration

```yaml
model_list:
  - model_name: primary-model
    litellm_params:
      model: openai/gpt-4o
      api_key: os.environ/OPENAI_API_KEY

  - model_name: fallback-model
    litellm_params:
      model: anthropic/claude-sonnet-4-20250514
      api_key: os.environ/ANTHROPIC_API_KEY

router_settings:
  num_retries: 2
  timeout: 30
  fallbacks: [{ 'primary-model': ['fallback-model'] }]

  # Per-error retry policy
  retry_policy:
    RateLimitErrorRetries: 3               # Retry 3 times on 429 errors
    ContentPolicyViolationErrorRetries: 0  # No retry on content policy violations
    AuthenticationErrorRetries: 0          # No retry on auth errors
```
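The interaction between retries and fallbacks can be sketched in a few lines: retry the current model a bounded number of times, then move on to the next model in the fallback list. This is a simplified illustration, not the proxy's actual implementation. `call_with_fallbacks` and `fake_call` are hypothetical, and the sketch retries on any exception, whereas a real router distinguishes error types.

```python
def call_with_fallbacks(call, primary, fallbacks, num_retries=2):
    """Try each model in order; each model gets 1 + num_retries attempts."""
    last_error = None
    for model in [primary, *fallbacks]:
        for _ in range(1 + num_retries):
            try:
                return model, call(model)
            except Exception as exc:  # a real router would filter by error type
                last_error = exc
    raise RuntimeError("All models failed") from last_error


attempt_log = []


def fake_call(model):
    # Hypothetical provider call: the primary model is simulated as being down
    attempt_log.append(model)
    if model == "primary-model":
        raise TimeoutError("simulated outage")
    return f"answer from {model}"


used_model, answer = call_with_fallbacks(
    fake_call, "primary-model", ["fallback-model"], num_retries=2
)
print(used_model, attempt_log)
```

Here the primary model is attempted three times (one call plus two retries) before the fallback answers on its first attempt.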

3.6 API Key Management (Virtual Keys)

```bash
# Generate virtual key with master key
curl -X POST http://localhost:4000/key/generate \
  -H "Authorization: Bearer sk-master-key-1234" \
  -H "Content-Type: application/json" \
  -d '{
    "models": ["gpt-4o", "claude-sonnet"],
    "max_budget": 100.0,
    "budget_duration": "30d",
    "metadata": {
      "team": "backend",
      "user": "developer-1"
    },
    "tpm_limit": 50000,
    "rpm_limit": 100
  }'
```

Response:

```json
{
  "key": "sk-generated-key-abc123",
  "expires": "2026-04-20T00:00:00Z",
  "max_budget": 100.0,
  "models": ["gpt-4o", "claude-sonnet"]
}
```

```python
# Make API calls with the generated key
from openai import OpenAI

client = OpenAI(
    api_key="sk-generated-key-abc123",
    base_url="http://localhost:4000",
)

response = client.chat.completions.create(
    model="gpt-4o",
    messages=[{"role": "user", "content": "Hello!"}],
)
```

3.7 Rate Limiting Configuration

```yaml
# Rate Limiting in config.yaml
model_list:
  - model_name: gpt-4o
    litellm_params:
      model: openai/gpt-4o
      api_key: os.environ/OPENAI_API_KEY
      rpm: 500      # Model deployment level RPM
      tpm: 100000   # Model deployment level TPM

general_settings:
  master_key: sk-master-key-1234
  database_url: os.environ/DATABASE_URL
```

```bash
# Per-key rate limiting
curl -X POST http://localhost:4000/key/generate \
  -H "Authorization: Bearer sk-master-key-1234" \
  -H "Content-Type: application/json" \
  -d '{
    "rpm_limit": 50,
    "tpm_limit": 20000,
    "max_budget": 10.0,
    "budget_duration": "1d"
  }'

# Per-team rate limiting
curl -X POST http://localhost:4000/team/new \
  -H "Authorization: Bearer sk-master-key-1234" \
  -H "Content-Type: application/json" \
  -d '{
    "team_alias": "backend-team",
    "rpm_limit": 200,
    "tpm_limit": 80000,
    "max_budget": 500.0,
    "budget_duration": "30d"
  }'
```
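An `rpm_limit` is, conceptually, a counter that resets every minute. The fixed-window limiter below is a minimal sketch of that idea, assuming a single process and caller-supplied timestamps in seconds. `RpmLimiter` is a hypothetical class written for this example; it is not how the proxy stores its counters.

```python
class RpmLimiter:
    """Fixed-window requests-per-minute limiter (single process, in memory)."""

    def __init__(self, rpm_limit):
        self.rpm_limit = rpm_limit
        self.window_start = None
        self.count = 0

    def allow(self, now):
        """Return True if a request at time `now` (seconds) fits the limit."""
        # Open a new 60-second window on first use or when the old one expires
        if self.window_start is None or now - self.window_start >= 60:
            self.window_start = now
            self.count = 0
        if self.count >= self.rpm_limit:
            return False
        self.count += 1
        return True


limiter = RpmLimiter(rpm_limit=3)
decisions = [limiter.allow(t) for t in (0, 1, 2, 3, 61)]
print(decisions)
```

The fourth request inside the first minute is rejected, and the request at second 61 is accepted because it opens a new window. A TPM limit follows the same pattern with token counts instead of request counts.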

3.8 Budget Management

```bash
# Set per-user budget
curl -X POST http://localhost:4000/user/new \
  -H "Authorization: Bearer sk-master-key-1234" \
  -H "Content-Type: application/json" \
  -d '{
    "user_id": "user-123",
    "max_budget": 50.0,
    "budget_duration": "30d",
    "models": ["gpt-4o-mini", "claude-sonnet"]
  }'

# Check budget usage (URL quoted so the shell does not interpret "?")
curl "http://localhost:4000/user/info?user_id=user-123" \
  -H "Authorization: Bearer sk-master-key-1234"
```
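The combination of `max_budget` and `budget_duration` amounts to a spend counter that blocks requests once the cap is reached and resets when the period ends. The sketch below illustrates that bookkeeping under simplifying assumptions: time is passed in as seconds, and the check happens before each request because the cost is only known afterwards. `Budget` is a hypothetical class, not a LiteLLM API.

```python
class Budget:
    """Spend cap that resets every `duration_seconds`."""

    def __init__(self, max_budget, duration_seconds, now):
        self.max_budget = max_budget
        self.duration_seconds = duration_seconds
        self.reset_at = now + duration_seconds
        self.spend = 0.0

    def charge(self, cost, now):
        """Record `cost` if budget remains; return False once the cap is hit."""
        if now >= self.reset_at:  # period over: start a fresh budget window
            self.spend = 0.0
            self.reset_at = now + self.duration_seconds
        if self.spend >= self.max_budget:
            return False
        self.spend += cost
        return True


one_day = 24 * 60 * 60
budget = Budget(max_budget=1.0, duration_seconds=one_day, now=0)
results = [
    budget.charge(0.6, now=10),        # accepted, spend is 0.6
    budget.charge(0.6, now=20),        # accepted, spend passes the cap
    budget.charge(0.1, now=30),        # rejected, cap already reached
    budget.charge(0.1, now=one_day),   # accepted after the reset
]
print(results)
```

Because the check runs before the cost is known, the final request in a period can push spend slightly over the cap, which is the usual trade-off for this style of enforcement.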

3.9 Caching

```yaml
# config.yaml
litellm_settings:
  cache: true
  cache_params:
    type: redis
    host: redis
    port: 6379
    ttl: 3600   # 1 hour cache
```

```python
# Cache control from client
from openai import OpenAI

client = OpenAI(
    api_key="sk-key",
    base_url="http://localhost:4000",
)

# Use cache (default)
response = client.chat.completions.create(
    model="gpt-4o",
    messages=[{"role": "user", "content": "What is Python?"}],
)

# Skip cache
response = client.chat.completions.create(
    model="gpt-4o",
    messages=[{"role": "user", "content": "What is Python?"}],
    extra_body={"cache": {"no-cache": True}},
)
```
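Response caching rests on two pieces: a deterministic key derived from the request and an expiry time. The following in-memory sketch shows both, assuming the key depends only on the model and messages (a real cache would typically fold in more request parameters). `cache_key` and `TtlCache` are hypothetical helpers, and the dictionary stands in for Redis.

```python
import hashlib
import json


def cache_key(model, messages):
    """Derive a stable key from the request contents."""
    payload = json.dumps({"model": model, "messages": messages}, sort_keys=True)
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()


class TtlCache:
    """Tiny in-memory cache whose entries expire after `ttl` seconds."""

    def __init__(self, ttl):
        self.ttl = ttl
        self.store = {}

    def set(self, key, value, now):
        self.store[key] = (value, now)

    def get(self, key, now):
        entry = self.store.get(key)
        if entry is None:
            return None
        value, stored_at = entry
        if now - stored_at >= self.ttl:  # expired: drop and report a miss
            del self.store[key]
            return None
        return value


messages = [{"role": "user", "content": "What is Python?"}]
key = cache_key("gpt-4o", messages)

cache = TtlCache(ttl=3600)
cache.set(key, "cached answer", now=0)
hit = cache.get(key, now=10)
expired = cache.get(key, now=3600)
print(hit, expired)
```

An identical request maps to the same key and is served from the cache until the TTL elapses; changing the model or any message produces a different key.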

4. Cost Tracking

4.1 Automatic Cost Calculation

LiteLLM automatically calculates the cost of each request.

```python
import litellm
from litellm import completion_cost

response = litellm.completion(
    model="gpt-4o",
    messages=[{"role": "user", "content": "Hello!"}],
)

# Token usage
print(f"Input tokens: {response.usage.prompt_tokens}")
print(f"Output tokens: {response.usage.completion_tokens}")

# LiteLLM cost calculation
cost = completion_cost(completion_response=response)
print(f"Cost: ${cost:.6f}")
```
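The arithmetic behind a per-request cost is simply token counts multiplied by per-token prices. The sketch below spells that out with made-up prices; the figures of 2.50 and 10.00 USD per million tokens are placeholders chosen for the example, not quoted rates, and `estimate_cost` is a hypothetical helper rather than a LiteLLM function.

```python
def estimate_cost(prompt_tokens, completion_tokens,
                  input_price_per_1m, output_price_per_1m):
    """Cost in USD given token counts and prices per one million tokens."""
    input_cost = prompt_tokens * input_price_per_1m
    output_cost = completion_tokens * output_price_per_1m
    return (input_cost + output_cost) / 1_000_000


# Placeholder prices: 2.50 USD per 1M input tokens, 10.00 USD per 1M output tokens
estimated = estimate_cost(
    prompt_tokens=1200,
    completion_tokens=300,
    input_price_per_1m=2.50,
    output_price_per_1m=10.00,
)
print(f"Estimated cost: ${estimated:.6f}")
```

With these placeholder prices, 1,200 input tokens and 300 output tokens each contribute 0.003 USD, for a total of 0.006 USD.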

4.2 Cost Queries via Proxy

```bash
# Total spend
curl http://localhost:4000/global/spend \
  -H "Authorization: Bearer sk-master-key-1234"

# Per-key spend
curl "http://localhost:4000/global/spend?api_key=sk-key-abc" \
  -H "Authorization: Bearer sk-master-key-1234"

# Per-model spend
curl "http://localhost:4000/global/spend?model=gpt-4o" \
  -H "Authorization: Bearer sk-master-key-1234"

# Per-team spend
curl "http://localhost:4000/team/info?team_id=team-backend" \
  -H "Authorization: Bearer sk-master-key-1234"

# Spend by date range
curl "http://localhost:4000/global/spend/logs?start_date=2026-03-01&end_date=2026-03-20" \
  -H "Authorization: Bearer sk-master-key-1234"
```

4.3 Budget Alert Configuration

```yaml
# config.yaml
general_settings:
  alerting:
    - slack
  alerting_threshold: 300   # Alert if no response within 300 seconds
  alert_types:
    - budget_alerts          # On budget exceeded
    - spend_reports          # Weekly/monthly cost reports
    - failed_tracking_spend  # Spend tracking failures

environment_variables:
  SLACK_WEBHOOK_URL: os.environ/SLACK_WEBHOOK_URL
```

5. Production Deployment

5.1 Docker Compose

```yaml
# docker-compose.yml
version: '3.8'

services:
  litellm:
    image: ghcr.io/berriai/litellm:main-latest
    container_name: litellm-proxy
    ports:
      - '4000:4000'
    volumes:
      - ./config.yaml:/app/config.yaml
    environment:
      - OPENAI_API_KEY=sk-xxx
      - ANTHROPIC_API_KEY=sk-ant-xxx
      - AZURE_API_KEY=xxx
      - AWS_ACCESS_KEY_ID=xxx
      - AWS_SECRET_ACCESS_KEY=xxx
      - DATABASE_URL=postgresql://litellm:password@postgres:5432/litellm
      - REDIS_HOST=redis
      - REDIS_PORT=6379
    command: --config /app/config.yaml --port 4000
    depends_on:
      postgres:
        condition: service_healthy
      redis:
        condition: service_started
    healthcheck:
      # Liveness endpoint: needs no API key and triggers no model calls
      test: ['CMD', 'curl', '-f', 'http://localhost:4000/health/liveliness']
      interval: 30s
      timeout: 10s
      retries: 3

  postgres:
    image: postgres:16-alpine
    container_name: litellm-postgres
    environment:
      POSTGRES_DB: litellm
      POSTGRES_USER: litellm
      POSTGRES_PASSWORD: password
    volumes:
      - postgres_data:/var/lib/postgresql/data
    healthcheck:
      test: ['CMD-SHELL', 'pg_isready -U litellm']
      interval: 5s
      timeout: 5s
      retries: 5

  redis:
    image: redis:7-alpine
    container_name: litellm-redis
    ports:
      - '6379:6379'
    volumes:
      - redis_data:/data

volumes:
  postgres_data:
  redis_data:
```

5.2 Kubernetes Helm Chart

```bash
# Add Helm repository
helm repo add litellm https://berriai.github.io/litellm/
helm repo update

# Install
helm install litellm litellm/litellm-helm \
  --namespace litellm \
  --create-namespace \
  --values values.yaml
```

```yaml
# values.yaml
replicaCount: 3

image:
  repository: ghcr.io/berriai/litellm
  tag: main-latest

service:
  type: ClusterIP
  port: 4000

ingress:
  enabled: true
  className: nginx
  hosts:
    - host: litellm.internal.company.com
      paths:
        - path: /
          pathType: Prefix

resources:
  requests:
    cpu: 500m
    memory: 512Mi
  limits:
    cpu: 2000m
    memory: 2Gi

env:
  - name: OPENAI_API_KEY
    valueFrom:
      secretKeyRef:
        name: litellm-secrets
        key: openai-api-key
  - name: ANTHROPIC_API_KEY
    valueFrom:
      secretKeyRef:
        name: litellm-secrets
        key: anthropic-api-key
  - name: DATABASE_URL
    valueFrom:
      secretKeyRef:
        name: litellm-secrets
        key: database-url

postgresql:
  enabled: true
  auth:
    database: litellm
    username: litellm

redis:
  enabled: true
```

5.3 Health Check and Metrics

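```bash
# Health check (calls the configured models; requires the master key)
curl http://localhost:4000/health \
  -H "Authorization: Bearer sk-master-key-1234"

# Liveness probe (no model calls)
curl http://localhost:4000/health/liveliness

# Prometheus Metrics
curl http://localhost:4000/metrics
```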

**Key Prometheus Metrics** (exact metric names vary by LiteLLM version, so confirm them against the `/metrics` output):

- litellm_requests_total: Total request count
- litellm_request_duration_seconds: Request processing time
- litellm_tokens_total: Total token usage
- litellm_spend_total: Total spend
- litellm_errors_total: Error count
- litellm_cache_hits_total: Cache hit count

5.4 Logging Integration

```yaml
# config.yaml - External logging service integration
litellm_settings:
  success_callback: ['langfuse']
  failure_callback: ['langfuse']

environment_variables:
  LANGFUSE_PUBLIC_KEY: os.environ/LANGFUSE_PUBLIC_KEY
  LANGFUSE_SECRET_KEY: os.environ/LANGFUSE_SECRET_KEY
  LANGFUSE_HOST: https://cloud.langfuse.com
```

**Supported Logging Services:**

| Service         | Purpose                                 |
| --------------- | --------------------------------------- |
| Langfuse        | LLM observability, prompt management    |
| Helicone        | Request logging, cost analysis          |
| Lunary          | LLM monitoring                          |
| Custom Callback | Integration with custom logging systems |

```python
# Custom Callback example
import litellm
from litellm import completion_cost


def my_custom_callback(kwargs, completion_response, start_time, end_time):
    # Called for every successful request
    model = kwargs.get("model")
    messages = kwargs.get("messages")
    cost = completion_cost(completion_response=completion_response)

    # Custom logic (DB storage, alerts, etc.)
    # log_to_database is a placeholder for your own persistence function
    log_to_database(
        model=model,
        cost=cost,
        latency=(end_time - start_time).total_seconds(),
        tokens=completion_response.usage.total_tokens,
    )


litellm.success_callback = [my_custom_callback]
```

6. Real-World Use Cases

6.1 Enterprise AI Gateway

Centralize all LLM calls across the organization through LiteLLM Proxy.

```
+------------------+
|  Frontend App    |----+
+------------------+    |
                        |     +----------------+
+------------------+    +---->|                |     +----------+
| Backend Service  |----+     | LiteLLM Proxy  |---->|  OpenAI  |
+------------------+    |     |                |     +----------+
                        |     |  - Auth        |
+------------------+    |     |  - Rate Limit  |     +----------+
|  Data Pipeline   |----+     |  - Cost Track  |---->| Anthropic|
+------------------+    |     |  - Audit Log   |     +----------+
                        |     |                |
+------------------+    |     +--------+-------+     +----------+
| Internal Tools   |----+              |             |  Azure   |
+------------------+                   v             +----------+
                              +--------+-------+
                              |   PostgreSQL   |
                              |  (spend logs)  |
                              +----------------+
```

6.2 A/B Testing

```python
import random

from openai import OpenAI

client = OpenAI(
    api_key="sk-proxy-key",
    base_url="http://litellm-proxy:4000",
)


def get_completion_with_ab_test(prompt: str, test_name: str):
    # 50/50 A/B test
    model = random.choice(["gpt-4o", "claude-sonnet"])

    response = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": prompt}],
        extra_body={
            "metadata": {
                "test_name": test_name,
                "variant": model,
            }
        },
    )

    return {
        "model": model,
        "content": response.choices[0].message.content,
        "tokens": response.usage.total_tokens,
    }
```
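A purely random split can assign the same user to different variants on successive requests. If a stable assignment per user matters, one common alternative is to hash a user identifier into a bucket. The sketch below shows that variation; `assign_variant` and the `user_id` argument are assumptions introduced here and are not part of the example above.

```python
import hashlib


def assign_variant(user_id, test_name, variants=("gpt-4o", "claude-sonnet")):
    """Map a user deterministically to one variant of the named test."""
    # Including the test name gives each test an independent split
    digest = hashlib.sha256(f"{test_name}:{user_id}".encode("utf-8")).digest()
    bucket = int.from_bytes(digest[:8], "big") % len(variants)
    return variants[bucket]


variant = assign_variant("user-123", "prompt-test")
print(variant)
```

The same user and test name always yield the same model, so results can be compared per user without storing assignments anywhere.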

6.3 Cost-Optimized Routing

```python
# Uses the OpenAI `client` pointed at the LiteLLM Proxy, as created in section 6.2
def smart_route(prompt: str, complexity: str = "auto"):
    """Select appropriate model based on complexity"""
    if complexity == "auto":
        # Simple heuristic: based on word count and keywords
        word_count = len(prompt.split())
        if word_count < 50:
            complexity = "simple"
        elif any(kw in prompt.lower() for kw in
                 ["analyze", "compare", "complex", "detailed"]):
            complexity = "complex"
        else:
            complexity = "medium"

    model_map = {
        "simple": "gpt-4o-mini",     # Cheap model
        "medium": "claude-sonnet",   # Mid performance/price
        "complex": "gpt-4o",         # High performance model
    }

    model = model_map[complexity]

    response = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": prompt}],
    )
    return response
```

6.4 Disaster Recovery (Automatic Failover)

```yaml
# config.yaml - Multi-provider failover
model_list:
  # Primary: OpenAI
  - model_name: main-model
    litellm_params:
      model: openai/gpt-4o
      api_key: os.environ/OPENAI_API_KEY

  # Secondary: Azure OpenAI (different region)
  - model_name: main-model-fallback-1
    litellm_params:
      model: azure/gpt-4o
      api_base: https://eastus.openai.azure.com
      api_key: os.environ/AZURE_KEY

  # Tertiary: Anthropic Claude
  - model_name: main-model-fallback-2
    litellm_params:
      model: anthropic/claude-sonnet-4-20250514
      api_key: os.environ/ANTHROPIC_API_KEY

router_settings:
  fallbacks: [{ 'main-model': ['main-model-fallback-1', 'main-model-fallback-2'] }]
  num_retries: 2
  timeout: 30
  allowed_fails: 3
  cooldown_time: 60   # 60 second cooldown for failed models
```
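The `allowed_fails` and `cooldown_time` settings describe a small state machine: count failures per deployment, and once the count exceeds the allowance, take the deployment out of rotation for a while. The sketch below models that behavior under simplifying assumptions (timestamps are passed in as seconds, and the failure count is not windowed per minute). `CooldownTracker` is a hypothetical class, not the router's implementation.

```python
class CooldownTracker:
    """Put a deployment on cooldown after more than `allowed_fails` failures."""

    def __init__(self, allowed_fails, cooldown_time):
        self.allowed_fails = allowed_fails
        self.cooldown_time = cooldown_time
        self.fail_counts = {}
        self.cooldown_until = {}

    def record_failure(self, deployment, now):
        count = self.fail_counts.get(deployment, 0) + 1
        if count > self.allowed_fails:
            # Too many failures: sideline the deployment and reset its counter
            self.cooldown_until[deployment] = now + self.cooldown_time
            count = 0
        self.fail_counts[deployment] = count

    def is_available(self, deployment, now):
        return now >= self.cooldown_until.get(deployment, 0)


tracker = CooldownTracker(allowed_fails=3, cooldown_time=60)
for t in (0, 1, 2):
    tracker.record_failure("main-model", now=t)
still_available = tracker.is_available("main-model", now=2)

tracker.record_failure("main-model", now=3)  # fourth failure triggers cooldown
during_cooldown = tracker.is_available("main-model", now=10)
after_cooldown = tracker.is_available("main-model", now=63)
print(still_available, during_cooldown, after_cooldown)
```

While a deployment is cooling down, requests would be routed to the fallbacks; once the cooldown elapses it becomes eligible again.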

7. Comparison: LiteLLM vs Alternatives

7.1 Tool Comparison

| Feature            | LiteLLM       | LangChain             | OpenRouter  | Portkey           |
| ------------------ | ------------- | --------------------- | ----------- | ----------------- |
| **Type**           | Gateway + SDK | Framework             | Hosted API  | Hosted Gateway    |
| **Hosting**        | Self-hosted   | N/A (library)         | Cloud       | Cloud + Self      |
| **Model Count**    | 100+          | Various               | 200+        | 250+              |
| **Cost Tracking**  | Built-in      | Requires custom impl  | Yes         | Yes               |
| **Rate Limiting**  | Built-in      | None                  | Yes         | Yes               |
| **Load Balancing** | Built-in      | None                  | Yes         | Yes               |
| **Fallback**       | Built-in      | Manual implementation | Yes         | Yes               |
| **API Key Mgmt**   | Virtual Keys  | None                  | None        | Yes               |
| **Pricing**        | Free (OSS)    | Free (OSS)            | Markup      | Free + Enterprise |
| **Data Privacy**   | Full control  | Full control          | Third-party | Third-party       |

7.2 When to Choose Which Tool

Choose LiteLLM when:

- Data privacy is important (finance, healthcare, government)

- Must operate on own infrastructure

- Cost tracking and rate limiting are needed

- Already using multiple providers

Choose LangChain when:

- Building complex LLM pipelines (RAG, Agents)

- Need prompt chaining, memory management

- (LiteLLM and LangChain can be used together)

Choose OpenRouter when:

- Rapid prototyping

- Don't want to manage infrastructure

- Single API key for all models

Choose Portkey when:

- Enterprise-level management UI needed

- Advanced features like guardrails, A/B testing needed

- Prefer managed services

8. Practical Tips

8.1 Environment Variable Management

```bash
# .env file (never commit to Git)
OPENAI_API_KEY=sk-xxx
ANTHROPIC_API_KEY=sk-ant-xxx
AZURE_API_KEY=xxx
AWS_ACCESS_KEY_ID=xxx
AWS_SECRET_ACCESS_KEY=xxx
DATABASE_URL=postgresql://litellm:password@localhost:5432/litellm
LITELLM_MASTER_KEY=sk-master-key-change-me
```

8.2 Model Alias Configuration

```yaml
# config.yaml
model_list:
  - model_name: fast
    litellm_params:
      model: gpt-4o-mini
      api_key: os.environ/OPENAI_API_KEY

  - model_name: smart
    litellm_params:
      model: gpt-4o
      api_key: os.environ/OPENAI_API_KEY

  - model_name: creative
    litellm_params:
      model: anthropic/claude-sonnet-4-20250514
      api_key: os.environ/ANTHROPIC_API_KEY
```

```python
# Call with meaningful names
response = client.chat.completions.create(
    model="fast",   # gpt-4o-mini
    messages=[{"role": "user", "content": "Quick question"}],
)

response = client.chat.completions.create(
    model="smart",  # gpt-4o
    messages=[{"role": "user", "content": "Complex analysis"}],
)
```

8.3 Error Handling Patterns

```python
import time

from openai import OpenAI, APIError, RateLimitError, APITimeoutError

client = OpenAI(
    api_key="sk-proxy-key",
    base_url="http://litellm-proxy:4000",
)


def safe_completion(messages, model="gpt-4o", max_retries=3):
    for attempt in range(max_retries):
        try:
            response = client.chat.completions.create(
                model=model,
                messages=messages,
                timeout=30,
            )
            return response
        except RateLimitError:
            # LiteLLM Proxy handles rate limits, but client should too
            wait = 2 ** attempt
            print(f"Rate limited, waiting {wait}s...")
            time.sleep(wait)
        except APITimeoutError:
            print(f"Timeout on attempt {attempt + 1}")
            if attempt == max_retries - 1:
                raise
        except APIError as e:
            # Any other API error is not retried
            print(f"API error: {e}")
            raise
    # All retries were used up on rate limits
    return None
```
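When many clients back off on the same `2 ** attempt` schedule, they tend to retry at the same moments. Adding random jitter spreads the retries out. The sketch below computes a "full jitter" schedule, where each delay is drawn uniformly between zero and a capped exponential ceiling. `backoff_delays` and its `base` and `cap` parameters are hypothetical and chosen for illustration.

```python
import random


def backoff_delays(max_retries, base=1.0, cap=30.0, rng=None):
    """Return one randomized delay (seconds) per retry attempt."""
    rng = rng or random.Random()
    delays = []
    for attempt in range(max_retries):
        ceiling = min(cap, base * 2 ** attempt)  # exponential growth, capped
        delays.append(rng.uniform(0, ceiling))   # full jitter within the ceiling
    return delays


# A seeded generator makes the schedule reproducible for testing
delays = backoff_delays(5, rng=random.Random(0))
for attempt, delay in enumerate(delays):
    print(f"attempt {attempt + 1}: sleep up to {delay:.2f}s")
```

A retry loop would then call `time.sleep(delay)` with the delay for the current attempt instead of sleeping for a fixed `2 ** attempt` seconds.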

9. Conclusion

LiteLLM serves as an essential AI Gateway in multi-LLM environments.

**Key Takeaways:**

1. **Unified SDK**: Call 100+ LLMs through a single completion() function

2. **Proxy Server**: Centralized management via OpenAI-compatible API Gateway

3. **Cost Control**: Automatic cost tracking, budget management, alerts

4. **Reliability**: Built-in load balancing, fallback, rate limiting

5. **Production**: Docker/Kubernetes deployment, Prometheus monitoring, external logging

Especially in enterprise environments using multiple LLM providers, deploying LiteLLM Proxy enables consistent centralized handling of API key management, cost tracking, and failure recovery.

References

- LiteLLM Official Docs: https://docs.litellm.ai/

- LiteLLM GitHub: https://github.com/BerriAI/litellm

- LiteLLM Proxy Config Guide: https://docs.litellm.ai/docs/proxy/configs

- LiteLLM Docker Deployment: https://docs.litellm.ai/docs/proxy/deploy
