[Architecture] LiteLLM 완전 가이드: 100+ LLM 통합 서빙과 비용 관리

개요

LLM(Large Language Model)을 활용하는 서비스가 늘어나면서, 여러 Provider의 모델을 효율적으로 관리하는 것이 중요해졌습니다. OpenAI, Anthropic, Azure OpenAI, AWS Bedrock 등 각 Provider마다 서로 다른 SDK와 API 형식을 사용하면 코드 복잡도가 급격히 증가합니다.

LiteLLM은 이 문제를 해결하는 오픈소스 도구로, OpenAI SDK 호환 인터페이스를 통해 100개 이상의 LLM을 하나의 API로 통합합니다. 이 글에서는 LiteLLM SDK 사용법부터 Proxy Server 구축, 비용 추적, 프로덕션 배포까지 총정리합니다.

1. LiteLLM이란

1.1 핵심 가치

LiteLLM은 BerriAI가 개발한 오픈소스 프로젝트로, 두 가지 핵심 컴포넌트를 제공합니다.

1. Python SDK: 통합 인터페이스로 100+ LLM 호출

litellm.completion()
  |
  +-- model="gpt-4o"           --> OpenAI API
  +-- model="claude-sonnet-4-20250514"  --> Anthropic API
  +-- model="azure/gpt-4o"     --> Azure OpenAI
  +-- model="bedrock/claude-3" --> AWS Bedrock
  +-- model="vertex_ai/gemini" --> Google Vertex AI
  +-- model="ollama/llama3"    --> Local Ollama

2. Proxy Server (AI Gateway): OpenAI 호환 REST API 서버

Any OpenAI Client --> LiteLLM Proxy --> Multiple LLM Providers
                      (Rate Limiting, Cost Tracking, Load Balancing)

1.2 왜 LiteLLM인가

문제	LiteLLM 해결 방식
Provider별 다른 SDK	통합 completion() 함수
API 키 분산 관리	Proxy에서 중앙 관리
비용 추적 어려움	자동 비용 계산 및 추적
Provider 장애	자동 Fallback 지원
Rate Limit 관리	내장 Rate Limiting
모델 전환 비용	코드 변경 없이 모델 교체

1.3 지원 Provider 목록

Commercial Providers:
  - OpenAI (GPT-4o, GPT-4o-mini, o1, o3, etc.)
  - Anthropic (Claude Sonnet, Claude Haiku, Claude Opus)
  - Azure OpenAI
  - AWS Bedrock (Claude, Titan, Llama)
  - Google Vertex AI (Gemini)
  - Google AI Studio
  - Cohere (Command R+)
  - Mistral AI
  - Together AI
  - Groq
  - Fireworks AI
  - Perplexity
  - DeepSeek

Self-Hosted / Local:
  - Ollama
  - vLLM
  - Hugging Face TGI
  - NVIDIA NIM
  - OpenAI-compatible endpoints

2. LiteLLM SDK 사용법

2.1 설치

pip install litellm

2.2 기본 사용: completion()

import litellm

# OpenAI
response = litellm.completion(
    model="gpt-4o",
    messages=[
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": "What is Kubernetes?"},
    ],
    temperature=0.7,
    max_tokens=1000,
)
print(response.choices[0].message.content)

# Anthropic Claude
response = litellm.completion(
    model="claude-sonnet-4-20250514",
    messages=[
        {"role": "user", "content": "Explain microservices architecture."},
    ],
    max_tokens=1000,
)
print(response.choices[0].message.content)

# Azure OpenAI
response = litellm.completion(
    model="azure/gpt-4o-deployment",
    messages=[
        {"role": "user", "content": "Hello!"},
    ],
    api_base="https://my-resource.openai.azure.com",
    api_version="2024-02-15-preview",
    api_key="your-azure-key",
)

# AWS Bedrock
response = litellm.completion(
    model="bedrock/anthropic.claude-3-sonnet-20240229-v1:0",
    messages=[
        {"role": "user", "content": "Summarize this text."},
    ],
)

# Ollama (로컬)
response = litellm.completion(
    model="ollama/llama3",
    messages=[
        {"role": "user", "content": "Write a Python function."},
    ],
    api_base="http://localhost:11434",
)

2.3 Streaming

import litellm

# 동기 스트리밍
response = litellm.completion(
    model="gpt-4o",
    messages=[{"role": "user", "content": "Write a long essay."}],
    stream=True,
)

for chunk in response:
    content = chunk.choices[0].delta.content
    if content:
        print(content, end="", flush=True)

2.4 Async 호출

import asyncio
import litellm

async def main():
    # 단일 비동기 호출
    response = await litellm.acompletion(
        model="gpt-4o",
        messages=[{"role": "user", "content": "Hello!"}],
    )
    print(response.choices[0].message.content)

    # 비동기 스트리밍
    response = await litellm.acompletion(
        model="claude-sonnet-4-20250514",
        messages=[{"role": "user", "content": "Explain Docker."}],
        stream=True,
    )
    async for chunk in response:
        content = chunk.choices[0].delta.content
        if content:
            print(content, end="", flush=True)

    # 병렬 호출
    tasks = [
        litellm.acompletion(
            model="gpt-4o-mini",
            messages=[{"role": "user", "content": f"Question {i}"}],
        )
        for i in range(5)
    ]
    responses = await asyncio.gather(*tasks)
    for resp in responses:
        print(resp.choices[0].message.content[:50])

asyncio.run(main())

2.5 Function Calling (Tool Use)

import litellm
import json

tools = [
    {
        "type": "function",
        "function": {
            "name": "get_weather",
            "description": "Get the current weather for a location",
            "parameters": {
                "type": "object",
                "properties": {
                    "location": {
                        "type": "string",
                        "description": "City name",
                    },
                    "unit": {
                        "type": "string",
                        "enum": ["celsius", "fahrenheit"],
                    },
                },
                "required": ["location"],
            },
        },
    }
]

# OpenAI와 Claude 모두 동일한 인터페이스
for model in ["gpt-4o", "claude-sonnet-4-20250514"]:
    response = litellm.completion(
        model=model,
        messages=[
            {"role": "user", "content": "What's the weather in Seoul?"}
        ],
        tools=tools,
        tool_choice="auto",
    )

    if response.choices[0].message.tool_calls:
        tool_call = response.choices[0].message.tool_calls[0]
        print(f"Model: {model}")
        print(f"Function: {tool_call.function.name}")
        print(f"Arguments: {tool_call.function.arguments}")

2.6 Embedding

import litellm

# OpenAI Embedding
response = litellm.embedding(
    model="text-embedding-3-small",
    input=["Hello world", "How are you?"],
)
print(f"Embedding dimension: {len(response.data[0]['embedding'])}")

# Cohere Embedding
response = litellm.embedding(
    model="cohere/embed-english-v3.0",
    input=["Search query text"],
    input_type="search_query",
)

# Bedrock Embedding
response = litellm.embedding(
    model="bedrock/amazon.titan-embed-text-v2:0",
    input=["Document text for embedding"],
)

2.7 Image/Vision 모델

import litellm

# GPT-4o Vision
response = litellm.completion(
    model="gpt-4o",
    messages=[
        {
            "role": "user",
            "content": [
                {"type": "text", "text": "What is in this image?"},
                {
                    "type": "image_url",
                    "image_url": {
                        "url": "https://example.com/image.png",
                    },
                },
            ],
        }
    ],
)

# Claude Vision
response = litellm.completion(
    model="claude-sonnet-4-20250514",
    messages=[
        {
            "role": "user",
            "content": [
                {"type": "text", "text": "Describe this architecture diagram."},
                {
                    "type": "image_url",
                    "image_url": {
                        "url": "data:image/png;base64,iVBORw0KGgo...",
                    },
                },
            ],
        }
    ],
)

3. LiteLLM Proxy Server (AI Gateway)

3.1 Proxy란

LiteLLM Proxy는 자체 호스팅 가능한 OpenAI 호환 API Gateway입니다. 기존 OpenAI SDK를 사용하는 모든 클라이언트가 코드 변경 없이 Proxy에 연결할 수 있습니다.

+-------------------+
| Your Application  |
| (OpenAI SDK)      |
+--------+----------+
         |
         v
+--------+----------+
| LiteLLM Proxy     |
| - Rate Limiting   |
| - Cost Tracking   |
| - Load Balancing  |
| - Fallback        |
| - Key Management  |
+--------+----------+
         |
    +----+----+----+----+
    |    |    |    |    |
    v    v    v    v    v
  OpenAI Azure Anthropic Bedrock Ollama

3.2 설치 및 실행

# pip 설치
pip install 'litellm[proxy]'

# 기본 실행
litellm --model gpt-4o --port 4000

# 설정 파일로 실행
litellm --config config.yaml --port 4000

# Docker 실행
docker run -d \
  --name litellm-proxy \
  -p 4000:4000 \
  -v ./config.yaml:/app/config.yaml \
  -e OPENAI_API_KEY=sk-xxx \
  -e ANTHROPIC_API_KEY=sk-ant-xxx \
  ghcr.io/berriai/litellm:main-latest \
  --config /app/config.yaml

3.3 config.yaml 설정

# config.yaml
model_list:
  # OpenAI 모델
  - model_name: gpt-4o
    litellm_params:
      model: openai/gpt-4o
      api_key: os.environ/OPENAI_API_KEY
      rpm: 500 # Rate limit: requests per minute
      tpm: 100000 # Rate limit: tokens per minute

  # Claude 모델 (여러 배포로 로드밸런싱)
  - model_name: claude-sonnet
    litellm_params:
      model: anthropic/claude-sonnet-4-20250514
      api_key: os.environ/ANTHROPIC_API_KEY
      rpm: 200
      tpm: 80000

  # Azure OpenAI
  - model_name: gpt-4o
    litellm_params:
      model: azure/gpt-4o-deployment
      api_base: https://my-resource.openai.azure.com
      api_version: '2024-02-15-preview'
      api_key: os.environ/AZURE_API_KEY
      rpm: 300

  # AWS Bedrock Claude
  - model_name: bedrock-claude
    litellm_params:
      model: bedrock/anthropic.claude-3-sonnet-20240229-v1:0
      aws_access_key_id: os.environ/AWS_ACCESS_KEY_ID
      aws_secret_access_key: os.environ/AWS_SECRET_ACCESS_KEY
      aws_region_name: us-east-1

  # 로컬 Ollama
  - model_name: local-llama
    litellm_params:
      model: ollama/llama3
      api_base: http://ollama:11434

# 라우터 설정
router_settings:
  routing_strategy: 'latency-based-routing'
  num_retries: 3
  timeout: 60
  allowed_fails: 2
  cooldown_time: 30

# 일반 설정
general_settings:
  master_key: sk-master-key-1234
  database_url: os.environ/DATABASE_URL
  store_model_in_db: true

litellm_settings:
  drop_params: true
  set_verbose: false
  cache: true
  cache_params:
    type: redis
    host: redis
    port: 6379

3.4 모델 라우팅과 Load Balancing

# 같은 model_name으로 여러 배포를 등록하면 자동 로드밸런싱
model_list:
  # gpt-4o 그룹: 3개 배포
  - model_name: gpt-4o
    litellm_params:
      model: openai/gpt-4o
      api_key: os.environ/OPENAI_API_KEY_1
  - model_name: gpt-4o
    litellm_params:
      model: azure/gpt-4o-east
      api_base: https://east.openai.azure.com
      api_key: os.environ/AZURE_KEY_EAST
  - model_name: gpt-4o
    litellm_params:
      model: azure/gpt-4o-west
      api_base: https://west.openai.azure.com
      api_key: os.environ/AZURE_KEY_WEST

router_settings:
  # 라우팅 전략
  routing_strategy: 'latency-based-routing'
  # 옵션:
  #   simple-shuffle: 랜덤 선택
  #   least-busy: 가장 적은 진행 중 요청
  #   usage-based-routing: TPM/RPM 사용량 기반
  #   latency-based-routing: 응답 시간 기반 (권장)

라우팅 전략 비교:

전략	설명	적합한 경우
simple-shuffle	랜덤 분배	모든 배포 성능이 유사한 경우
least-busy	진행 중 요청 수 기준	요청 처리 시간이 다양한 경우
usage-based-routing	RPM/TPM 사용량 기준	Rate Limit에 근접한 경우
latency-based-routing	응답 시간 기준	지연시간 최적화가 중요한 경우

3.5 Fallback 설정

model_list:
  - model_name: primary-model
    litellm_params:
      model: openai/gpt-4o
      api_key: os.environ/OPENAI_API_KEY

  - model_name: fallback-model
    litellm_params:
      model: anthropic/claude-sonnet-4-20250514
      api_key: os.environ/ANTHROPIC_API_KEY

router_settings:
  num_retries: 2
  timeout: 30
  fallbacks: [{ 'primary-model': ['fallback-model'] }]
  # 특정 에러에만 fallback
  retry_policy:
    RateLimitError: 3 # 429 에러 시 3번 재시도
    ContentPolicyViolationError: 0 # 콘텐츠 정책 위반은 재시도 안 함
    AuthenticationError: 0 # 인증 에러는 재시도 안 함

3.6 API Key 관리 (Virtual Keys)

# 마스터 키로 가상 키 생성
curl -X POST http://localhost:4000/key/generate \
  -H "Authorization: Bearer sk-master-key-1234" \
  -H "Content-Type: application/json" \
  -d '{
    "models": ["gpt-4o", "claude-sonnet"],
    "max_budget": 100.0,
    "budget_duration": "monthly",
    "metadata": {
      "team": "backend",
      "user": "developer-1"
    },
    "tpm_limit": 50000,
    "rpm_limit": 100
  }'

응답:

{
  "key": "sk-generated-key-abc123",
  "expires": "2026-04-20T00:00:00Z",
  "max_budget": 100.0,
  "models": ["gpt-4o", "claude-sonnet"]
}

# 생성된 키로 API 호출
from openai import OpenAI

client = OpenAI(
    api_key="sk-generated-key-abc123",
    base_url="http://localhost:4000",
)

response = client.chat.completions.create(
    model="gpt-4o",
    messages=[{"role": "user", "content": "Hello!"}],
)

3.7 Rate Limiting 설정

# config.yaml에서 Rate Limiting
model_list:
  - model_name: gpt-4o
    litellm_params:
      model: openai/gpt-4o
      api_key: os.environ/OPENAI_API_KEY
      rpm: 500 # 모델 배포 수준 RPM
      tpm: 100000 # 모델 배포 수준 TPM

general_settings:
  master_key: sk-master-key-1234
  database_url: os.environ/DATABASE_URL

# 키별 Rate Limiting 설정
curl -X POST http://localhost:4000/key/generate \
  -H "Authorization: Bearer sk-master-key-1234" \
  -H "Content-Type: application/json" \
  -d '{
    "rpm_limit": 50,
    "tpm_limit": 20000,
    "max_budget": 10.0,
    "budget_duration": "daily"
  }'

# 팀별 Rate Limiting
curl -X POST http://localhost:4000/team/new \
  -H "Authorization: Bearer sk-master-key-1234" \
  -H "Content-Type: application/json" \
  -d '{
    "team_alias": "backend-team",
    "rpm_limit": 200,
    "tpm_limit": 80000,
    "max_budget": 500.0,
    "budget_duration": "monthly"
  }'

3.8 Budget Management

# 사용자별 예산 설정
curl -X POST http://localhost:4000/user/new \
  -H "Authorization: Bearer sk-master-key-1234" \
  -H "Content-Type: application/json" \
  -d '{
    "user_id": "user-123",
    "max_budget": 50.0,
    "budget_duration": "monthly",
    "models": ["gpt-4o-mini", "claude-sonnet"]
  }'

# 예산 사용량 확인
curl http://localhost:4000/user/info?user_id=user-123 \
  -H "Authorization: Bearer sk-master-key-1234"

3.9 Caching

# config.yaml
litellm_settings:
  cache: true
  cache_params:
    type: redis
    host: redis
    port: 6379
    ttl: 3600 # 1시간 캐시


    # Semantic Caching (유사한 질문에 대해 캐시 히트)
    # supported_call_types: ["acompletion", "completion"]
    # similarity_threshold: 0.8

# 클라이언트에서 캐시 제어
from openai import OpenAI

client = OpenAI(
    api_key="sk-key",
    base_url="http://localhost:4000",
)

# 캐시 사용 (기본)
response = client.chat.completions.create(
    model="gpt-4o",
    messages=[{"role": "user", "content": "What is Python?"}],
)

# 캐시 건너뛰기
response = client.chat.completions.create(
    model="gpt-4o",
    messages=[{"role": "user", "content": "What is Python?"}],
    extra_body={"cache": {"no-cache": True}},
)

4. 비용 추적 (Cost Tracking)

4.1 자동 비용 계산

LiteLLM은 각 요청의 비용을 자동으로 계산합니다.

import litellm

response = litellm.completion(
    model="gpt-4o",
    messages=[{"role": "user", "content": "Hello!"}],
)

# 비용 정보
print(f"Input tokens: {response.usage.prompt_tokens}")
print(f"Output tokens: {response.usage.completion_tokens}")

# litellm의 비용 계산
from litellm import completion_cost

cost = completion_cost(completion_response=response)
print(f"Cost: ${cost:.6f}")

4.2 Proxy에서 비용 조회

# 전체 비용 조회
curl http://localhost:4000/global/spend \
  -H "Authorization: Bearer sk-master-key-1234"

# 키별 비용 조회
curl "http://localhost:4000/global/spend?api_key=sk-key-abc" \
  -H "Authorization: Bearer sk-master-key-1234"

# 모델별 비용 조회
curl "http://localhost:4000/global/spend?model=gpt-4o" \
  -H "Authorization: Bearer sk-master-key-1234"

# 팀별 비용 조회
curl "http://localhost:4000/team/info?team_id=team-backend" \
  -H "Authorization: Bearer sk-master-key-1234"

# 기간별 비용 조회
curl "http://localhost:4000/global/spend/logs?start_date=2026-03-01&end_date=2026-03-20" \
  -H "Authorization: Bearer sk-master-key-1234"

4.3 비용 알림 설정

# config.yaml
general_settings:
  alerting:
    - slack
  alerting_threshold: 300 # 300초 내 응답 없으면 알림
  alert_types:
    - budget_alerts # 예산 초과 시
    - spend_reports # 주간/월간 비용 리포트
    - failed_tracking # 실패한 요청 추적

environment_variables:
  SLACK_WEBHOOK_URL: os.environ/SLACK_WEBHOOK_URL

5. 프로덕션 배포

5.1 Docker Compose

# docker-compose.yml
version: '3.8'

services:
  litellm:
    image: ghcr.io/berriai/litellm:main-latest
    container_name: litellm-proxy
    ports:
      - '4000:4000'
    volumes:
      - ./config.yaml:/app/config.yaml
    environment:
      - OPENAI_API_KEY=sk-xxx
      - ANTHROPIC_API_KEY=sk-ant-xxx
      - AZURE_API_KEY=xxx
      - AWS_ACCESS_KEY_ID=xxx
      - AWS_SECRET_ACCESS_KEY=xxx
      - DATABASE_URL=postgresql://litellm:password@postgres:5432/litellm
      - REDIS_HOST=redis
      - REDIS_PORT=6379
    command: --config /app/config.yaml --port 4000
    depends_on:
      postgres:
        condition: service_healthy
      redis:
        condition: service_started
    healthcheck:
      test: ['CMD', 'curl', '-f', 'http://localhost:4000/health']
      interval: 30s
      timeout: 10s
      retries: 3

  postgres:
    image: postgres:16-alpine
    container_name: litellm-postgres
    environment:
      POSTGRES_DB: litellm
      POSTGRES_USER: litellm
      POSTGRES_PASSWORD: password
    volumes:
      - postgres_data:/var/lib/postgresql/data
    healthcheck:
      test: ['CMD-SHELL', 'pg_isready -U litellm']
      interval: 5s
      timeout: 5s
      retries: 5

  redis:
    image: redis:7-alpine
    container_name: litellm-redis
    ports:
      - '6379:6379'
    volumes:
      - redis_data:/data

volumes:
  postgres_data:
  redis_data:

5.2 Kubernetes Helm Chart

# Helm repository 추가
helm repo add litellm https://berriai.github.io/litellm/
helm repo update

# 설치
helm install litellm litellm/litellm-helm \
  --namespace litellm \
  --create-namespace \
  --values values.yaml

# values.yaml
replicaCount: 3

image:
  repository: ghcr.io/berriai/litellm
  tag: main-latest

service:
  type: ClusterIP
  port: 4000

ingress:
  enabled: true
  className: nginx
  hosts:
    - host: litellm.internal.company.com
      paths:
        - path: /
          pathType: Prefix

resources:
  requests:
    cpu: 500m
    memory: 512Mi
  limits:
    cpu: 2000m
    memory: 2Gi

env:
  - name: OPENAI_API_KEY
    valueFrom:
      secretKeyRef:
        name: litellm-secrets
        key: openai-api-key
  - name: ANTHROPIC_API_KEY
    valueFrom:
      secretKeyRef:
        name: litellm-secrets
        key: anthropic-api-key
  - name: DATABASE_URL
    valueFrom:
      secretKeyRef:
        name: litellm-secrets
        key: database-url

configMap:
  config.yaml: |
    model_list:
      - model_name: gpt-4o
        litellm_params:
          model: openai/gpt-4o
          api_key: os.environ/OPENAI_API_KEY
      - model_name: claude-sonnet
        litellm_params:
          model: anthropic/claude-sonnet-4-20250514
          api_key: os.environ/ANTHROPIC_API_KEY
    router_settings:
      routing_strategy: latency-based-routing
      num_retries: 3

postgresql:
  enabled: true
  auth:
    database: litellm
    username: litellm

redis:
  enabled: true

5.3 Health Check와 Metrics

# Health Check
curl http://localhost:4000/health

# Prometheus Metrics
curl http://localhost:4000/metrics

# Prometheus scrape 설정
scrape_configs:
  - job_name: 'litellm'
    static_configs:
      - targets: ['litellm:4000']
    metrics_path: /metrics
    scrape_interval: 15s

주요 Prometheus 메트릭:

litellm_requests_total: 전체 요청 수
litellm_request_duration_seconds: 요청 처리 시간
litellm_tokens_total: 전체 토큰 사용량
litellm_spend_total: 전체 비용
litellm_errors_total: 에러 수
litellm_cache_hits_total: 캐시 히트 수

5.4 Logging 연동

# config.yaml - 외부 로깅 서비스 연동
litellm_settings:
  success_callback: ['langfuse']
  failure_callback: ['langfuse']

environment_variables:
  LANGFUSE_PUBLIC_KEY: os.environ/LANGFUSE_PUBLIC_KEY
  LANGFUSE_SECRET_KEY: os.environ/LANGFUSE_SECRET_KEY
  LANGFUSE_HOST: https://cloud.langfuse.com

지원하는 로깅 서비스:

서비스	용도
Langfuse	LLM 관찰성, 프롬프트 관리
Helicone	요청 로깅, 비용 분석
Lunary	LLM 모니터링
Custom Callback	자체 로깅 시스템 연동

# Custom Callback 예시
import litellm

def my_custom_callback(kwargs, completion_response, start_time, end_time):
    # 모든 요청에 대해 호출됨
    model = kwargs.get("model")
    messages = kwargs.get("messages")
    cost = completion_cost(completion_response=completion_response)

    # 커스텀 로직 (DB 저장, 알림 등)
    log_to_database(
        model=model,
        cost=cost,
        latency=(end_time - start_time).total_seconds(),
        tokens=completion_response.usage.total_tokens,
    )

litellm.success_callback = [my_custom_callback]

6. 실전 사용 사례

6.1 기업 AI Gateway

사내 모든 LLM 호출을 LiteLLM Proxy로 중앙화합니다.

+------------------+
| Frontend App     |----+
+------------------+    |
                        |     +----------------+
+------------------+    +---->|                |     +----------+
| Backend Service  |----+     | LiteLLM Proxy  |---->| OpenAI   |
+------------------+    |     |                |     +----------+
                        |     | - Auth         |
+------------------+    |     | - Rate Limit   |     +----------+
| Data Pipeline    |----+     | - Cost Track   |---->| Anthropic|
+------------------+    |     | - Audit Log    |     +----------+
                        |     |                |
+------------------+    |     +--------+-------+     +----------+
| Internal Tools   |----+              |             | Azure    |
+------------------+                   v             +----------+
                              +--------+-------+
                              | PostgreSQL     |
                              | (spend logs)   |
                              +----------------+

6.2 A/B 테스트

from openai import OpenAI
import random

client = OpenAI(
    api_key="sk-proxy-key",
    base_url="http://litellm-proxy:4000",
)

def get_completion_with_ab_test(prompt: str, test_name: str):
    # 50/50 A/B 테스트
    model = random.choice(["gpt-4o", "claude-sonnet"])

    response = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": prompt}],
        extra_body={
            "metadata": {
                "test_name": test_name,
                "variant": model,
            }
        },
    )

    return {
        "model": model,
        "content": response.choices[0].message.content,
        "tokens": response.usage.total_tokens,
    }

6.3 비용 최적화 라우팅

def smart_route(prompt: str, complexity: str = "auto"):
    """복잡도에 따라 적절한 모델 선택"""

    if complexity == "auto":
        # 간단한 휴리스틱: 토큰 수와 키워드 기반
        word_count = len(prompt.split())
        if word_count < 50:
            complexity = "simple"
        elif any(kw in prompt.lower() for kw in
                 ["analyze", "compare", "complex", "detailed"]):
            complexity = "complex"
        else:
            complexity = "medium"

    model_map = {
        "simple": "gpt-4o-mini",      # 저렴한 모델
        "medium": "claude-sonnet",      # 중간 성능/가격
        "complex": "gpt-4o",           # 고성능 모델
    }

    model = model_map[complexity]

    response = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": prompt}],
    )

    return response

6.4 Disaster Recovery (자동 Failover)

# config.yaml - 다중 Provider Failover
model_list:
  # Primary: OpenAI
  - model_name: main-model
    litellm_params:
      model: openai/gpt-4o
      api_key: os.environ/OPENAI_API_KEY

  # Secondary: Azure OpenAI (다른 리전)
  - model_name: main-model-fallback-1
    litellm_params:
      model: azure/gpt-4o
      api_base: https://eastus.openai.azure.com
      api_key: os.environ/AZURE_KEY

  # Tertiary: Anthropic Claude
  - model_name: main-model-fallback-2
    litellm_params:
      model: anthropic/claude-sonnet-4-20250514
      api_key: os.environ/ANTHROPIC_API_KEY

router_settings:
  fallbacks: [{ 'main-model': ['main-model-fallback-1', 'main-model-fallback-2'] }]
  num_retries: 2
  timeout: 30
  allowed_fails: 3
  cooldown_time: 60 # 실패한 모델 60초 쿨다운

7. 비교: LiteLLM vs 대안

7.1 주요 도구 비교

기능	LiteLLM	LangChain	OpenRouter	Portkey
타입	Gateway + SDK	Framework	Hosted API	Hosted Gateway
호스팅	Self-hosted	N/A (라이브러리)	Cloud	Cloud + Self
모델 수	100+	다양	200+	250+
비용 추적	내장	별도 구현 필요	있음	있음
Rate Limiting	내장	없음	있음	있음
Load Balancing	내장	없음	있음	있음
Fallback	내장	수동 구현	있음	있음
API Key 관리	Virtual Keys	없음	없음	있음
가격	무료 (OSS)	무료 (OSS)	마크업	무료 + Enterprise
데이터 프라이버시	완전 제어	완전 제어	제3자 경유	제3자 경유

7.2 언제 어떤 도구를 선택할까

LiteLLM을 선택해야 하는 경우:
  - 데이터 프라이버시가 중요 (금융, 의료, 정부)
  - 자체 인프라에서 운영해야 하는 경우
  - 비용 추적과 Rate Limiting이 필요한 경우
  - 여러 Provider를 이미 사용 중인 경우

LangChain을 선택해야 하는 경우:
  - RAG, Agent 등 복잡한 LLM 파이프라인 구축
  - 프롬프트 체이닝, 메모리 관리 등이 필요한 경우
  - (LiteLLM과 LangChain은 함께 사용 가능)

OpenRouter를 선택해야 하는 경우:
  - 빠른 프로토타이핑
  - 인프라 관리를 원하지 않는 경우
  - 단일 API 키로 모든 모델 접근

Portkey를 선택해야 하는 경우:
  - 엔터프라이즈 수준의 관리 UI 필요
  - Guardrails, A/B 테스트 등 고급 기능 필요
  - 매니지드 서비스 선호

8. 실전 팁

8.1 환경 변수 관리

# .env 파일 (절대 Git에 커밋하지 마세요)
OPENAI_API_KEY=sk-xxx
ANTHROPIC_API_KEY=sk-ant-xxx
AZURE_API_KEY=xxx
AWS_ACCESS_KEY_ID=xxx
AWS_SECRET_ACCESS_KEY=xxx
DATABASE_URL=postgresql://litellm:password@localhost:5432/litellm
LITELLM_MASTER_KEY=sk-master-key-change-me

8.2 모델 별칭(Alias) 설정

# config.yaml
model_list:
  - model_name: fast
    litellm_params:
      model: gpt-4o-mini
      api_key: os.environ/OPENAI_API_KEY
  - model_name: smart
    litellm_params:
      model: gpt-4o
      api_key: os.environ/OPENAI_API_KEY
  - model_name: creative
    litellm_params:
      model: anthropic/claude-sonnet-4-20250514
      api_key: os.environ/ANTHROPIC_API_KEY

# 사용 시 의미 있는 이름으로 호출
response = client.chat.completions.create(
    model="fast",  # gpt-4o-mini
    messages=[{"role": "user", "content": "Quick question"}],
)

response = client.chat.completions.create(
    model="smart",  # gpt-4o
    messages=[{"role": "user", "content": "Complex analysis"}],
)

8.3 에러 처리 패턴

from openai import OpenAI, APIError, RateLimitError, APITimeoutError

client = OpenAI(
    api_key="sk-proxy-key",
    base_url="http://litellm-proxy:4000",
)

def safe_completion(messages, model="gpt-4o", max_retries=3):
    for attempt in range(max_retries):
        try:
            response = client.chat.completions.create(
                model=model,
                messages=messages,
                timeout=30,
            )
            return response
        except RateLimitError:
            # LiteLLM Proxy가 Rate Limit 관리하지만 클라이언트도 처리
            import time
            wait = 2 ** attempt
            print(f"Rate limited, waiting {wait}s...")
            time.sleep(wait)
        except APITimeoutError:
            print(f"Timeout on attempt {attempt + 1}")
            if attempt == max_retries - 1:
                raise
        except APIError as e:
            print(f"API error: {e}")
            raise

    return None

9. 마무리

LiteLLM은 멀티 LLM 환경에서 필수적인 AI Gateway 역할을 합니다.

핵심 정리:

통합 SDK: 100+ LLM을 하나의 completion() 함수로 호출
Proxy Server: OpenAI 호환 API Gateway로 중앙 관리
비용 제어: 자동 비용 추적, Budget 관리, 알림
안정성: Load Balancing, Fallback, Rate Limiting 내장
프로덕션: Docker/Kubernetes 배포, Prometheus 모니터링, 외부 로깅 연동

특히 여러 LLM Provider를 사용하는 기업 환경에서, LiteLLM Proxy를 도입하면 API 키 관리, 비용 추적, 장애 대응을 중앙에서 일관되게 처리할 수 있습니다.

참고 자료

LiteLLM 공식 문서: https://docs.litellm.ai/
LiteLLM GitHub: https://github.com/BerriAI/litellm
LiteLLM Proxy 설정 가이드: https://docs.litellm.ai/docs/proxy/configs
LiteLLM Docker 배포: https://docs.litellm.ai/docs/proxy/deploy