[Architecture] The Complete LiteLLM Guide: Serving 100+ LLMs Through One Interface

Overview

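As more services build on LLMs (Large Language Models), managing models from multiple providers efficiently has become important. OpenAI, Anthropic, Azure OpenAI, AWS Bedrock, and others each ship their own SDK and API format, so integrating several of them directly makes code complexity grow quickly.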

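LiteLLM is an open-source tool that addresses this problem: it exposes 100+ LLMs behind a single API through an OpenAI SDK-compatible interface. This article covers the LiteLLM SDK, setting up the Proxy server, cost tracking, and production deployment.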

1. What Is LiteLLM?

1.1 Core Value

LiteLLM is an open-source project developed by BerriAI. It provides two core components.

1. Python SDK: call 100+ LLMs through a unified interface

litellm.completion()
  |
  +-- model="gpt-4o"           --> OpenAI API
  +-- model="claude-sonnet-4-20250514"  --> Anthropic API
  +-- model="azure/gpt-4o"     --> Azure OpenAI
  +-- model="bedrock/claude-3" --> AWS Bedrock
  +-- model="vertex_ai/gemini" --> Google Vertex AI
  +-- model="ollama/llama3"    --> Local Ollama

2. Proxy Server (AI Gateway): an OpenAI-compatible REST API server

Any OpenAI Client --> LiteLLM Proxy --> Multiple LLM Providers
                      (Rate Limiting, Cost Tracking, Load Balancing)
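The SDK diagram above relies on a naming convention: an optional provider prefix, a slash, then the provider's own model identifier. The following is a minimal sketch of that convention only. `split_model_name` is a hypothetical helper, not LiteLLM's internal resolver, and it makes no attempt to guess the provider for unprefixed names such as "gpt-4o", which LiteLLM resolves from its own model list.

```python
from typing import Optional, Tuple


def split_model_name(model: str) -> Tuple[Optional[str], str]:
    """Split 'provider/model' into (provider, model).

    Hypothetical illustration of the naming convention; returns
    (None, model) when there is no provider prefix.
    """
    provider, sep, name = model.partition("/")
    if not sep:
        # No prefix: the provider has to be looked up elsewhere.
        return None, model
    return provider, name


for example in ["azure/gpt-4o", "ollama/llama3", "gpt-4o"]:
    print(example, "->", split_model_name(example))
```

Only the first slash is treated as the separator, so identifiers that contain further punctuation (for example Bedrock model IDs with dots and colons) pass through unchanged.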

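1.2 Why LiteLLM?

Problem	How LiteLLM addresses it
A different SDK per provider	A unified completion() function
API keys scattered across services	Central management in the Proxy
Difficulty tracking cost	Automatic cost calculation and tracking
Provider outages	Automatic fallback support
Rate limit management	Built-in rate limiting
Cost of switching models	Swap models without changing application code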

1.3 Supported Providers

Commercial Providers:
  - OpenAI (GPT-4o, GPT-4o-mini, o1, o3, etc.)
  - Anthropic (Claude Sonnet, Claude Haiku, Claude Opus)
  - Azure OpenAI
  - AWS Bedrock (Claude, Titan, Llama)
  - Google Vertex AI (Gemini)
  - Google AI Studio
  - Cohere (Command R+)
  - Mistral AI
  - Together AI
  - Groq
  - Fireworks AI
  - Perplexity
  - DeepSeek

Self-Hosted / Local:
  - Ollama
  - vLLM
  - Hugging Face TGI
  - NVIDIA NIM
  - OpenAI-compatible endpoints

2. Using the LiteLLM SDK

2.1 Installation

pip install litellm

2.2 Basic Usage: completion()

import litellm

# OpenAI
response = litellm.completion(
    model="gpt-4o",
    messages=[
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": "What is Kubernetes?"},
    ],
    temperature=0.7,
    max_tokens=1000,
)
print(response.choices[0].message.content)

# Anthropic Claude
response = litellm.completion(
    model="claude-sonnet-4-20250514",
    messages=[
        {"role": "user", "content": "Explain microservices architecture."},
    ],
    max_tokens=1000,
)
print(response.choices[0].message.content)

# Azure OpenAI
response = litellm.completion(
    model="azure/gpt-4o-deployment",
    messages=[
        {"role": "user", "content": "Hello!"},
    ],
    api_base="https://my-resource.openai.azure.com",
    api_version="2024-02-15-preview",
    api_key="your-azure-key",
)

# AWS Bedrock
response = litellm.completion(
    model="bedrock/anthropic.claude-3-sonnet-20240229-v1:0",
    messages=[
        {"role": "user", "content": "Summarize this text."},
    ],
)

# Ollama (local)
response = litellm.completion(
    model="ollama/llama3",
    messages=[
        {"role": "user", "content": "Write a Python function."},
    ],
    api_base="http://localhost:11434",
)

2.3 Streaming

import litellm

# Synchronous streaming
response = litellm.completion(
    model="gpt-4o",
    messages=[{"role": "user", "content": "Write a long essay."}],
    stream=True,
)

for chunk in response:
    content = chunk.choices[0].delta.content
    if content:
        print(content, end="", flush=True)
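Printing each delta is enough for a console demo, but applications usually also need the complete text once the stream ends. A minimal sketch of that accumulation follows. It uses `SimpleNamespace` stand-ins with the same attribute path as the chunks above instead of a live API call; `collect_stream` and `fake_chunk` are hypothetical names introduced for illustration.

```python
from types import SimpleNamespace


def collect_stream(chunks) -> str:
    """Join the text deltas of an OpenAI-style chunk iterator into one string."""
    parts = []
    for chunk in chunks:
        if not chunk.choices:
            continue
        delta = chunk.choices[0].delta.content
        if delta:  # some chunks carry no text, so content can be None
            parts.append(delta)
    return "".join(parts)


def fake_chunk(text):
    # Stand-in object mirroring chunk.choices[0].delta.content.
    delta = SimpleNamespace(content=text)
    return SimpleNamespace(choices=[SimpleNamespace(delta=delta)])


fake_stream = [fake_chunk("Hello"), fake_chunk(None), fake_chunk(", world")]
full_text = collect_stream(fake_stream)
print(full_text)
```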

2.4 Async Calls

import asyncio
import litellm

async def main():
    # Single async call
    response = await litellm.acompletion(
        model="gpt-4o",
        messages=[{"role": "user", "content": "Hello!"}],
    )
    print(response.choices[0].message.content)

    # Async streaming
    response = await litellm.acompletion(
        model="claude-sonnet-4-20250514",
        messages=[{"role": "user", "content": "Explain Docker."}],
        stream=True,
    )
    async for chunk in response:
        content = chunk.choices[0].delta.content
        if content:
            print(content, end="", flush=True)

    # Parallel calls
    tasks = [
        litellm.acompletion(
            model="gpt-4o-mini",
            messages=[{"role": "user", "content": f"Question {i}"}],
        )
        for i in range(5)
    ]
    responses = await asyncio.gather(*tasks)
    for resp in responses:
        print(resp.choices[0].message.content[:50])

asyncio.run(main())
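Launching every request at once with `asyncio.gather` can run into provider rate limits when the batch is large. One common way to cap concurrency is an `asyncio.Semaphore`. The sketch below assumes a stub coroutine, `fake_acompletion`, in place of `litellm.acompletion` so that it runs without network access; `gather_bounded` is a hypothetical helper name.

```python
import asyncio


async def fake_acompletion(prompt: str) -> str:
    # Stand-in for `await litellm.acompletion(...)`; sleeps instead of calling an API.
    await asyncio.sleep(0.01)
    return f"answer to {prompt}"


async def gather_bounded(prompts, limit: int = 2):
    """Run all prompts concurrently, with at most `limit` in flight at a time."""
    semaphore = asyncio.Semaphore(limit)

    async def worker(prompt):
        async with semaphore:
            return await fake_acompletion(prompt)

    # gather() returns results in the same order as the inputs.
    return await asyncio.gather(*(worker(p) for p in prompts))


results = asyncio.run(gather_bounded([f"Question {i}" for i in range(5)]))
print(results)
```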

2.5 Function Calling（Tool Use）

import litellm
import json

tools = [
    {
        "type": "function",
        "function": {
            "name": "get_weather",
            "description": "Get the current weather for a location",
            "parameters": {
                "type": "object",
                "properties": {
                    "location": {
                        "type": "string",
                        "description": "City name",
                    },
                    "unit": {
                        "type": "string",
                        "enum": ["celsius", "fahrenheit"],
                    },
                },
                "required": ["location"],
            },
        },
    }
]

# Same interface for both OpenAI and Claude
for model in ["gpt-4o", "claude-sonnet-4-20250514"]:
    response = litellm.completion(
        model=model,
        messages=[
            {"role": "user", "content": "What's the weather in Seoul?"}
        ],
        tools=tools,
        tool_choice="auto",
    )

    if response.choices[0].message.tool_calls:
        tool_call = response.choices[0].message.tool_calls[0]
        print(f"Model: {model}")
        print(f"Function: {tool_call.function.name}")
        print(f"Arguments: {tool_call.function.arguments}")
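The loop above stops once the model has asked for a tool. The usual next step is to run the requested function locally and send its result back as a `tool` message. The following sketch shows that dispatch step in isolation, assuming the OpenAI-style message format; `get_weather`, `TOOL_REGISTRY`, and `run_tool_call` are hypothetical, and the weather data is canned.

```python
import json


def get_weather(location: str, unit: str = "celsius") -> dict:
    # Hypothetical local implementation that returns canned data.
    return {"location": location, "temperature": 21, "unit": unit}


TOOL_REGISTRY = {"get_weather": get_weather}


def run_tool_call(call_id: str, name: str, arguments_json: str) -> dict:
    """Execute one tool call and build the follow-up `tool` message."""
    func = TOOL_REGISTRY[name]
    kwargs = json.loads(arguments_json)  # arguments arrive as a JSON string
    result = func(**kwargs)
    return {
        "role": "tool",
        "tool_call_id": call_id,
        "content": json.dumps(result),
    }


tool_message = run_tool_call("call_1", "get_weather", '{"location": "Seoul"}')
print(tool_message)
```

In a real exchange the values would come from `tool_call.id`, `tool_call.function.name`, and `tool_call.function.arguments`, and the resulting message would be appended to `messages` before calling `completion()` again.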

2.6 Embedding

import litellm

# OpenAI Embedding
response = litellm.embedding(
    model="text-embedding-3-small",
    input=["Hello world", "How are you?"],
)
print(f"Embedding dimension: {len(response.data[0]['embedding'])}")

# Cohere Embedding
response = litellm.embedding(
    model="cohere/embed-english-v3.0",
    input=["Search query text"],
    input_type="search_query",
)

# Bedrock Embedding
response = litellm.embedding(
    model="bedrock/amazon.titan-embed-text-v2:0",
    input=["Document text for embedding"],
)
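Embeddings are typically compared with cosine similarity, for example to rank documents against a query. A minimal, dependency-free sketch is shown below. The short vectors are made-up stand-ins for the `response.data[i]['embedding']` lists returned above.

```python
import math


def cosine_similarity(a, b) -> float:
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # define similarity with a zero vector as 0
    return dot / (norm_a * norm_b)


query = [1.0, 0.0, 1.0]          # illustrative values, not real embeddings
doc_similar = [2.0, 0.0, 2.0]
doc_unrelated = [0.0, 1.0, 0.0]

print(cosine_similarity(query, doc_similar))
print(cosine_similarity(query, doc_unrelated))
```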

2.7 Image/Vision Models

import litellm

# GPT-4o Vision
response = litellm.completion(
    model="gpt-4o",
    messages=[
        {
            "role": "user",
            "content": [
                {"type": "text", "text": "What is in this image?"},
                {
                    "type": "image_url",
                    "image_url": {
                        "url": "https://example.com/image.png",
                    },
                },
            ],
        }
    ],
)

# Claude Vision
response = litellm.completion(
    model="claude-sonnet-4-20250514",
    messages=[
        {
            "role": "user",
            "content": [
                {"type": "text", "text": "Describe this architecture diagram."},
                {
                    "type": "image_url",
                    "image_url": {
                        "url": "data:image/png;base64,iVBORw0KGgo...",
                    },
                },
            ],
        }
    ],
)
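The Claude example passes the image inline as a base64 data URL. A small sketch of how such a URL can be built from raw bytes is given below; `to_data_url` is a hypothetical helper, and the 8-byte PNG signature stands in for a real image file.

```python
import base64


def to_data_url(image_bytes: bytes, mime_type: str = "image/png") -> str:
    """Encode raw image bytes as a data URL usable in an image_url field."""
    encoded = base64.b64encode(image_bytes).decode("ascii")
    return f"data:{mime_type};base64,{encoded}"


# The fixed 8-byte signature that starts every PNG file.
PNG_SIGNATURE = b"\x89PNG\r\n\x1a\n"
url = to_data_url(PNG_SIGNATURE)
print(url)
```

For an actual file, the bytes would come from something like `open(path, "rb").read()`.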

3. LiteLLM Proxy Server（AI Gateway）

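3.1 What Is the Proxy?

The LiteLLM Proxy is a self-hostable, OpenAI-compatible API gateway. Any client that already uses the OpenAI SDK can connect to it with no code changes beyond pointing the base URL and API key at the Proxy.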

+-------------------+
| Your Application  |
| (OpenAI SDK)      |
+--------+----------+
         |
         v
+--------+----------+
| LiteLLM Proxy     |
| - Rate Limiting   |
| - Cost Tracking   |
| - Load Balancing  |
| - Fallback        |
| - Key Management  |
+--------+----------+
         |
    +----+----+----+----+
    |    |    |    |    |
    v    v    v    v    v
  OpenAI Azure Anthropic Bedrock Ollama

3.2 Installation and Startup

# Install with pip
pip install 'litellm[proxy]'

# Basic startup
litellm --model gpt-4o --port 4000

# Start with a config file
litellm --config config.yaml --port 4000

# Run with Docker
docker run -d \
  --name litellm-proxy \
  -p 4000:4000 \
  -v "$(pwd)/config.yaml:/app/config.yaml" \
  -e OPENAI_API_KEY=sk-xxx \
  -e ANTHROPIC_API_KEY=sk-ant-xxx \
  ghcr.io/berriai/litellm:main-latest \
  --config /app/config.yaml

3.3 config.yaml Configuration

# config.yaml
model_list:
  # OpenAI model
  - model_name: gpt-4o
    litellm_params:
      model: openai/gpt-4o
      api_key: os.environ/OPENAI_API_KEY
      rpm: 500 # Rate limit: requests per minute
      tpm: 100000 # Rate limit: tokens per minute

  # Claude model
  - model_name: claude-sonnet
    litellm_params:
      model: anthropic/claude-sonnet-4-20250514
      api_key: os.environ/ANTHROPIC_API_KEY
      rpm: 200
      tpm: 80000

  # Azure OpenAI (shares model_name gpt-4o with the OpenAI entry, so the two are load balanced)
  - model_name: gpt-4o
    litellm_params:
      model: azure/gpt-4o-deployment
      api_base: https://my-resource.openai.azure.com
      api_version: '2024-02-15-preview'
      api_key: os.environ/AZURE_API_KEY
      rpm: 300

  # AWS Bedrock Claude
  - model_name: bedrock-claude
    litellm_params:
      model: bedrock/anthropic.claude-3-sonnet-20240229-v1:0
      aws_access_key_id: os.environ/AWS_ACCESS_KEY_ID
      aws_secret_access_key: os.environ/AWS_SECRET_ACCESS_KEY
      aws_region_name: us-east-1

  # Local Ollama
  - model_name: local-llama
    litellm_params:
      model: ollama/llama3
      api_base: http://ollama:11434

# Router settings
router_settings:
  routing_strategy: 'latency-based-routing'
  num_retries: 3
  timeout: 60
  allowed_fails: 2
  cooldown_time: 30

# General settings
general_settings:
  master_key: sk-master-key-1234
  database_url: os.environ/DATABASE_URL
  store_model_in_db: true

litellm_settings:
  drop_params: true
  set_verbose: false
  cache: true
  cache_params:
    type: redis
    host: redis
    port: 6379

3.4 Model Routing and Load Balancing

# Registering several deployments under the same model_name enables automatic load balancing
model_list:
  # gpt-4o group: three deployments
  - model_name: gpt-4o
    litellm_params:
      model: openai/gpt-4o
      api_key: os.environ/OPENAI_API_KEY_1
  - model_name: gpt-4o
    litellm_params:
      model: azure/gpt-4o-east
      api_base: https://east.openai.azure.com
      api_key: os.environ/AZURE_KEY_EAST
  - model_name: gpt-4o
    litellm_params:
      model: azure/gpt-4o-west
      api_base: https://west.openai.azure.com
      api_key: os.environ/AZURE_KEY_WEST

router_settings:
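  # Routing strategy
  routing_strategy: 'latency-based-routing'
  # Options:
  #   simple-shuffle: random selection
  #   least-busy: fewest in-flight requests
  #   usage-based-routing: based on TPM/RPM usage
  #   latency-based-routing: based on response time (recommended in this article)

Routing strategy comparison:

Strategy	Description	Good fit when
simple-shuffle	Random distribution	All deployments perform similarly
least-busy	Based on the number of in-flight requests	Request processing times vary widely
usage-based-routing	Based on RPM/TPM usage	Deployments are approaching their rate limits
latency-based-routing	Based on response time	Latency optimization matters most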
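To make the four strategies concrete, here is a simplified sketch of how each one might choose among deployments. The `Deployment` fields and `pick_deployment` function are hypothetical, and the selection rules are reduced to their core idea; the real router tracks these statistics itself and may weight or filter deployments differently.

```python
import random
from dataclasses import dataclass


@dataclass
class Deployment:
    name: str
    in_flight: int      # requests currently being processed
    tpm_used: int       # tokens consumed in the current minute
    avg_latency: float  # recent average response time in seconds


def pick_deployment(deployments, strategy: str, rng=random) -> Deployment:
    """Choose one deployment according to a simplified routing strategy."""
    if strategy == "simple-shuffle":
        return rng.choice(deployments)
    if strategy == "least-busy":
        return min(deployments, key=lambda d: d.in_flight)
    if strategy == "usage-based-routing":
        return min(deployments, key=lambda d: d.tpm_used)
    if strategy == "latency-based-routing":
        return min(deployments, key=lambda d: d.avg_latency)
    raise ValueError(f"unknown strategy: {strategy}")


pool = [
    Deployment("openai", in_flight=4, tpm_used=60000, avg_latency=0.9),
    Deployment("azure-east", in_flight=1, tpm_used=90000, avg_latency=1.4),
    Deployment("azure-west", in_flight=3, tpm_used=20000, avg_latency=0.6),
]

for strategy in ["least-busy", "usage-based-routing", "latency-based-routing"]:
    print(strategy, "->", pick_deployment(pool, strategy).name)
```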

3.5 Fallback Configuration

model_list:
  - model_name: primary-model
    litellm_params:
      model: openai/gpt-4o
      api_key: os.environ/OPENAI_API_KEY

  - model_name: fallback-model
    litellm_params:
      model: anthropic/claude-sonnet-4-20250514
      api_key: os.environ/ANTHROPIC_API_KEY

router_settings:
  num_retries: 2
  timeout: 30
  fallbacks: [{ 'primary-model': ['fallback-model'] }]
  # Per-error retry counts
  retry_policy:
    RateLimitErrorRetries: 3 # retry 429 errors up to 3 times
    ContentPolicyViolationErrorRetries: 0 # no retries for content policy violations
    AuthenticationErrorRetries: 0 # no retries for authentication errors
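The interplay of `num_retries` and `fallbacks` can be pictured as a loop over a chain of models. The sketch below is a simplified model of that behavior and not the router's implementation: it assumes each model gets one initial attempt plus `num_retries` retries before the next fallback is tried. `ProviderError`, `call_with_fallbacks`, and `flaky_call` are hypothetical names.

```python
class ProviderError(Exception):
    """Stand-in for any retryable provider failure."""


def call_with_fallbacks(call, model, fallbacks, num_retries=2):
    """Try `model` (1 + num_retries attempts), then each fallback in order."""
    chain = [model] + fallbacks.get(model, [])
    last_error = None
    for candidate in chain:
        for _ in range(1 + num_retries):
            try:
                return candidate, call(candidate)
            except ProviderError as exc:
                last_error = exc
    raise last_error  # every model in the chain failed


attempts = []


def flaky_call(model):
    # Simulated backend: the primary model is always down.
    attempts.append(model)
    if model == "primary-model":
        raise ProviderError("primary is down")
    return f"response from {model}"


used, reply = call_with_fallbacks(
    flaky_call,
    "primary-model",
    {"primary-model": ["fallback-model"]},
    num_retries=2,
)
print(used, reply, attempts)
```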

3.6 API Key Management (Virtual Keys)

# Generate a virtual key with the master key
curl -X POST http://localhost:4000/key/generate \
  -H "Authorization: Bearer sk-master-key-1234" \
  -H "Content-Type: application/json" \
  -d '{
    "models": ["gpt-4o", "claude-sonnet"],
    "max_budget": 100.0,
    "budget_duration": "30d",
    "metadata": {
      "team": "backend",
      "user": "developer-1"
    },
    "tpm_limit": 50000,
    "rpm_limit": 100
  }'

Response:

{
  "key": "sk-generated-key-abc123",
  "expires": "2026-04-20T00:00:00Z",
  "max_budget": 100.0,
  "models": ["gpt-4o", "claude-sonnet"]
}

# Call the API with the generated key
from openai import OpenAI

client = OpenAI(
    api_key="sk-generated-key-abc123",
    base_url="http://localhost:4000",
)

response = client.chat.completions.create(
    model="gpt-4o",
    messages=[{"role": "user", "content": "Hello!"}],
)

3.7 Rate Limiting Configuration

# Rate limiting in config.yaml
model_list:
  - model_name: gpt-4o
    litellm_params:
      model: openai/gpt-4o
      api_key: os.environ/OPENAI_API_KEY
      rpm: 500 # RPM at the model deployment level
      tpm: 100000 # TPM at the model deployment level

general_settings:
  master_key: sk-master-key-1234
  database_url: os.environ/DATABASE_URL

# Per-key rate limiting
curl -X POST http://localhost:4000/key/generate \
  -H "Authorization: Bearer sk-master-key-1234" \
  -H "Content-Type: application/json" \
  -d '{
    "rpm_limit": 50,
    "tpm_limit": 20000,
    "max_budget": 10.0,
    "budget_duration": "1d"
  }'

# Per-team rate limiting
curl -X POST http://localhost:4000/team/new \
  -H "Authorization: Bearer sk-master-key-1234" \
  -H "Content-Type: application/json" \
  -d '{
    "team_alias": "backend-team",
    "rpm_limit": 200,
    "tpm_limit": 80000,
    "max_budget": 500.0,
    "budget_duration": "30d"
  }'
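An `rpm_limit` is easiest to reason about as a sliding 60-second window: a request is admitted only if fewer than the limit were admitted in the preceding minute. The following sketch implements that idea in memory with explicit timestamps so the behavior is reproducible. `SlidingWindowLimiter` is a hypothetical class, and the Proxy's own accounting may differ.

```python
from collections import deque


class SlidingWindowLimiter:
    """Admit at most `rpm_limit` requests per rolling window."""

    def __init__(self, rpm_limit: int, window_seconds: float = 60.0):
        self.rpm_limit = rpm_limit
        self.window = window_seconds
        self.timestamps = deque()  # admission times, oldest first

    def allow(self, now: float) -> bool:
        # Drop admissions that have left the window.
        while self.timestamps and now - self.timestamps[0] >= self.window:
            self.timestamps.popleft()
        if len(self.timestamps) >= self.rpm_limit:
            return False  # would map to an HTTP 429 in a gateway
        self.timestamps.append(now)
        return True


limiter = SlidingWindowLimiter(rpm_limit=2)
decisions = [limiter.allow(t) for t in (0.0, 1.0, 2.0, 61.0)]
print(decisions)
```

Passing `now` in explicitly, rather than calling `time.time()` inside the class, keeps the limiter deterministic and easy to test.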

3.8 Budget Management

# Set a per-user budget
curl -X POST http://localhost:4000/user/new \
  -H "Authorization: Bearer sk-master-key-1234" \
  -H "Content-Type: application/json" \
  -d '{
    "user_id": "user-123",
    "max_budget": 50.0,
    "budget_duration": "30d",
    "models": ["gpt-4o-mini", "claude-sonnet"]
  }'

# Check budget usage
curl "http://localhost:4000/user/info?user_id=user-123" \
  -H "Authorization: Bearer sk-master-key-1234"
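Conceptually, `max_budget` and `budget_duration` describe a spend counter that resets after a fixed period. The sketch below models that with a hypothetical `BudgetTracker` class under the assumption that the budget is checked before a request and the cost is recorded afterwards, so the final request in a period can overshoot the limit slightly.

```python
class BudgetTracker:
    """Track spend against a budget that resets every `duration_seconds`."""

    def __init__(self, max_budget: float, duration_seconds: float):
        self.max_budget = max_budget
        self.duration = duration_seconds
        self.spend = 0.0
        self.window_start = 0.0

    def _maybe_reset(self, now: float) -> None:
        # Start a fresh budget period once the current one has elapsed.
        if now - self.window_start >= self.duration:
            self.spend = 0.0
            self.window_start = now

    def can_spend(self, now: float) -> bool:
        self._maybe_reset(now)
        return self.spend < self.max_budget

    def record(self, cost: float, now: float) -> None:
        self._maybe_reset(now)
        self.spend += cost


tracker = BudgetTracker(max_budget=10.0, duration_seconds=86400.0)  # 1 day
tracker.record(6.0, now=0.0)
before_limit = tracker.can_spend(now=100.0)
tracker.record(5.0, now=100.0)
over_limit = tracker.can_spend(now=200.0)
after_reset = tracker.can_spend(now=86400.0)
print(before_limit, over_limit, after_reset)
```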

3.9 Caching

# config.yaml
litellm_settings:
  cache: true
  cache_params:
    type: redis
    host: redis
    port: 6379
    ttl: 3600 # cache for 1 hour

# Control the cache from the client
from openai import OpenAI

client = OpenAI(
    api_key="sk-key",
    base_url="http://localhost:4000",
)

# Use the cache (default)
response = client.chat.completions.create(
    model="gpt-4o",
    messages=[{"role": "user", "content": "What is Python?"}],
)

# Skip the cache
response = client.chat.completions.create(
    model="gpt-4o",
    messages=[{"role": "user", "content": "What is Python?"}],
    extra_body={"cache": {"no-cache": True}},
)
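Response caching of this kind generally works by hashing the request into a key and storing the response under it until a TTL expires. The sketch below shows that mechanism with an in-memory dictionary standing in for Redis. `cache_key` and `TTLCache` are hypothetical, and the exact fields LiteLLM includes in its key are not specified here.

```python
import hashlib
import json


def cache_key(model: str, messages: list, **params) -> str:
    """Derive a deterministic key from the request contents."""
    payload = json.dumps(
        {"model": model, "messages": messages, "params": params},
        sort_keys=True,  # same request -> same key regardless of dict order
    )
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()


class TTLCache:
    """In-memory stand-in for a Redis cache with a fixed TTL in seconds."""

    def __init__(self, ttl: float):
        self.ttl = ttl
        self.store = {}

    def get(self, key: str, now: float):
        entry = self.store.get(key)
        if entry is None:
            return None
        value, stored_at = entry
        if now - stored_at >= self.ttl:
            del self.store[key]  # expired
            return None
        return value

    def set(self, key: str, value, now: float) -> None:
        self.store[key] = (value, now)


messages = [{"role": "user", "content": "What is Python?"}]
key = cache_key("gpt-4o", messages, temperature=0.0)

cache = TTLCache(ttl=3600)
cache.set(key, "cached answer", now=0.0)
hit = cache.get(key, now=10.0)
expired = cache.get(key, now=3600.0)
print(hit, expired)
```

Because the sampling parameters are part of the key, the same prompt sent with a different temperature is treated as a different request.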

4. Cost Tracking

4.1 Automatic Cost Calculation

LiteLLM automatically calculates the cost of each request.

import litellm

response = litellm.completion(
    model="gpt-4o",
    messages=[{"role": "user", "content": "Hello!"}],
)

# Token usage
print(f"Input tokens: {response.usage.prompt_tokens}")
print(f"Output tokens: {response.usage.completion_tokens}")

# Cost calculated by litellm
from litellm import completion_cost

cost = completion_cost(completion_response=response)
print(f"Cost: ${cost:.6f}")
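The arithmetic behind a per-request cost is a weighted sum of input and output tokens. The sketch below spells it out with a hypothetical `estimate_cost` function. The prices are placeholder values chosen for easy arithmetic, not actual rates for any model.

```python
def estimate_cost(
    prompt_tokens: int,
    completion_tokens: int,
    input_price_per_million: float,
    output_price_per_million: float,
) -> float:
    """Cost in USD given token counts and per-million-token prices."""
    total = (
        prompt_tokens * input_price_per_million
        + completion_tokens * output_price_per_million
    )
    return total / 1_000_000


# Placeholder prices (USD per million tokens), assumed for illustration only.
cost = estimate_cost(
    prompt_tokens=1200,
    completion_tokens=300,
    input_price_per_million=2.50,
    output_price_per_million=10.00,
)
print(f"Cost: ${cost:.6f}")
```

With these numbers the input side contributes 1200 × 2.50 = 3000 and the output side 300 × 10.00 = 3000, for a total of 6000 / 1,000,000 = 0.006 USD.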

4.2 Querying Cost Through the Proxy

# Total spend
curl http://localhost:4000/global/spend \
  -H "Authorization: Bearer sk-master-key-1234"

# Spend per key
curl "http://localhost:4000/global/spend?api_key=sk-key-abc" \
  -H "Authorization: Bearer sk-master-key-1234"

# Spend per model
curl "http://localhost:4000/global/spend?model=gpt-4o" \
  -H "Authorization: Bearer sk-master-key-1234"

# Spend per team
curl "http://localhost:4000/team/info?team_id=team-backend" \
  -H "Authorization: Bearer sk-master-key-1234"

# Spend over a date range
curl "http://localhost:4000/global/spend/logs?start_date=2026-03-01&end_date=2026-03-20" \
  -H "Authorization: Bearer sk-master-key-1234"

4.3 Cost Alert Configuration

# config.yaml
general_settings:
  alerting:
    - slack
  alerting_threshold: 300
  alert_types:
    - budget_alerts # when a budget is exceeded
    - spend_reports # weekly/monthly cost reports
    - failed_tracking # tracking of failed requests

environment_variables:
  SLACK_WEBHOOK_URL: os.environ/SLACK_WEBHOOK_URL

5. Production Deployment

5.1 Docker Compose

# docker-compose.yml
version: '3.8'

services:
  litellm:
    image: ghcr.io/berriai/litellm:main-latest
    container_name: litellm-proxy
    ports:
      - '4000:4000'
    volumes:
      - ./config.yaml:/app/config.yaml
    environment:
      - OPENAI_API_KEY=sk-xxx
      - ANTHROPIC_API_KEY=sk-ant-xxx
      - AZURE_API_KEY=xxx
      - AWS_ACCESS_KEY_ID=xxx
      - AWS_SECRET_ACCESS_KEY=xxx
      - DATABASE_URL=postgresql://litellm:password@postgres:5432/litellm
      - REDIS_HOST=redis
      - REDIS_PORT=6379
    command: --config /app/config.yaml --port 4000
    depends_on:
      postgres:
        condition: service_healthy
      redis:
        condition: service_started
    healthcheck:
      test: ['CMD', 'curl', '-f', 'http://localhost:4000/health/liveliness']
      interval: 30s
      timeout: 10s
      retries: 3

  postgres:
    image: postgres:16-alpine
    container_name: litellm-postgres
    environment:
      POSTGRES_DB: litellm
      POSTGRES_USER: litellm
      POSTGRES_PASSWORD: password
    volumes:
      - postgres_data:/var/lib/postgresql/data
    healthcheck:
      test: ['CMD-SHELL', 'pg_isready -U litellm']
      interval: 5s
      timeout: 5s
      retries: 5

  redis:
    image: redis:7-alpine
    container_name: litellm-redis
    ports:
      - '6379:6379'
    volumes:
      - redis_data:/data

volumes:
  postgres_data:
  redis_data:

5.2 Kubernetes Helm Chart

# Add the Helm repository
helm repo add litellm https://berriai.github.io/litellm/
helm repo update

# Install
helm install litellm litellm/litellm-helm \
  --namespace litellm \
  --create-namespace \
  --values values.yaml

# values.yaml
replicaCount: 3

image:
  repository: ghcr.io/berriai/litellm
  tag: main-latest

service:
  type: ClusterIP
  port: 4000

ingress:
  enabled: true
  className: nginx
  hosts:
    - host: litellm.internal.company.com
      paths:
        - path: /
          pathType: Prefix

resources:
  requests:
    cpu: 500m
    memory: 512Mi
  limits:
    cpu: 2000m
    memory: 2Gi

env:
  - name: OPENAI_API_KEY
    valueFrom:
      secretKeyRef:
        name: litellm-secrets
        key: openai-api-key
  - name: ANTHROPIC_API_KEY
    valueFrom:
      secretKeyRef:
        name: litellm-secrets
        key: anthropic-api-key
  - name: DATABASE_URL
    valueFrom:
      secretKeyRef:
        name: litellm-secrets
        key: database-url

postgresql:
  enabled: true
  auth:
    database: litellm
    username: litellm

redis:
  enabled: true

5.3 Health Checks and Metrics

# Health Check
curl http://localhost:4000/health

# Prometheus Metrics
curl http://localhost:4000/metrics

Key Prometheus metrics:

litellm_requests_total: total number of requests
litellm_request_duration_seconds: request processing time
litellm_tokens_total: total token usage
litellm_spend_total: total spend
litellm_errors_total: number of errors
litellm_cache_hits_total: number of cache hits

5.4 Logging Integrations

# config.yaml - integrate an external logging service
litellm_settings:
  success_callback: ['langfuse']
  failure_callback: ['langfuse']

environment_variables:
  LANGFUSE_PUBLIC_KEY: os.environ/LANGFUSE_PUBLIC_KEY
  LANGFUSE_SECRET_KEY: os.environ/LANGFUSE_SECRET_KEY
  LANGFUSE_HOST: https://cloud.langfuse.com

Supported logging services:

Service	Purpose
Langfuse	LLM observability, prompt management
Helicone	Request logging, cost analysis
Lunary	LLM monitoring
Custom Callback	Integration with an in-house logging system

# Custom callback example
import litellm

def log_to_database(**fields):
    # Placeholder sink: replace with a write to your own logging system.
    print(fields)

def my_custom_callback(kwargs, completion_response, start_time, end_time):
    model = kwargs.get("model")
    cost = litellm.completion_cost(completion_response=completion_response)

    log_to_database(
        model=model,
        cost=cost,
        latency=(end_time - start_time).total_seconds(),
        tokens=completion_response.usage.total_tokens,
    )

litellm.success_callback = [my_custom_callback]

6. Practical Use Cases

6.1 Enterprise AI Gateway

Route every LLM call in the organization through the LiteLLM Proxy.

+------------------+
| Frontend App     |----+
+------------------+    |
                        |     +----------------+
+------------------+    +---->|                |     +----------+
| Backend Service  |----+     | LiteLLM Proxy  |---->| OpenAI   |
+------------------+    |     |                |     +----------+
                        |     | - Auth         |
+------------------+    |     | - Rate Limit   |     +----------+
| Data Pipeline    |----+     | - Cost Track   |---->| Anthropic|
+------------------+    |     | - Audit Log    |     +----------+
                        |     |                |
+------------------+    |     +--------+-------+     +----------+
| Internal Tools   |----+              |             | Azure    |
+------------------+                   v             +----------+
                              +--------+-------+
                              | PostgreSQL     |
                              | (spend logs)   |
                              +----------------+

6.2 A/B Testing

from openai import OpenAI
import random

client = OpenAI(
    api_key="sk-proxy-key",
    base_url="http://litellm-proxy:4000",
)

def get_completion_with_ab_test(prompt: str, test_name: str):
    # 50/50 A/B test
    model = random.choice(["gpt-4o", "claude-sonnet"])

    response = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": prompt}],
        extra_body={
            "metadata": {
                "test_name": test_name,
                "variant": model,
            }
        },
    )

    return {
        "model": model,
        "content": response.choices[0].message.content,
        "tokens": response.usage.total_tokens,
    }
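`random.choice` assigns a variant per request, so the same user can see both models within one test. When per-user consistency matters, a common alternative is to hash a stable identifier into a bucket. The sketch below shows that approach; `assign_variant` and the `user_id` parameter are hypothetical additions that do not appear in the function above.

```python
import hashlib


def assign_variant(user_id: str, test_name: str, variants: list) -> str:
    """Deterministically map a user to one variant of a named test."""
    digest = hashlib.sha256(f"{test_name}:{user_id}".encode("utf-8")).digest()
    # Interpret the first 8 bytes as an integer and bucket by variant count.
    bucket = int.from_bytes(digest[:8], "big") % len(variants)
    return variants[bucket]


variants = ["gpt-4o", "claude-sonnet"]
first = assign_variant("user-123", "summary-quality", variants)
second = assign_variant("user-123", "summary-quality", variants)
print(first, second)
```

Including the test name in the hash input keeps assignments independent across tests, so a user who lands in one variant of a test is not systematically placed in the same position in the next.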

6.3 Cost-Optimized Routing

# Reuses the `client` created in section 6.2 (OpenAI SDK pointed at the Proxy)
def smart_route(prompt: str, complexity: str = "auto"):
    """Pick an appropriate model based on prompt complexity."""

    if complexity == "auto":
        word_count = len(prompt.split())
        if word_count < 50:
            complexity = "simple"
        elif any(kw in prompt.lower() for kw in
                 ["analyze", "compare", "complex", "detailed"]):
            complexity = "complex"
        else:
            complexity = "medium"

    model_map = {
        "simple": "gpt-4o-mini",      # inexpensive model
        "medium": "claude-sonnet",    # mid-range performance/price
        "complex": "gpt-4o",          # high-performance model
    }

    model = model_map[complexity]

    response = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": prompt}],
    )

    return response

6.4 Disaster Recovery (Automatic Failover)

# config.yaml - failover across multiple providers
model_list:
  # Primary: OpenAI
  - model_name: main-model
    litellm_params:
      model: openai/gpt-4o
      api_key: os.environ/OPENAI_API_KEY

  # Secondary: Azure OpenAI (different region)
  - model_name: main-model-fallback-1
    litellm_params:
      model: azure/gpt-4o
      api_base: https://eastus.openai.azure.com
      api_key: os.environ/AZURE_KEY

  # Tertiary: Anthropic Claude
  - model_name: main-model-fallback-2
    litellm_params:
      model: anthropic/claude-sonnet-4-20250514
      api_key: os.environ/ANTHROPIC_API_KEY

router_settings:
  fallbacks: [{ 'main-model': ['main-model-fallback-1', 'main-model-fallback-2'] }]
  num_retries: 2
  timeout: 30
  allowed_fails: 3
  cooldown_time: 60 # cool a failed model down for 60 seconds
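The `allowed_fails` and `cooldown_time` settings describe a simple circuit-breaker: after too many failures a deployment is taken out of rotation for a while. The sketch below is a simplified model with a hypothetical `CooldownTracker` class. It assumes a deployment is cooled down once its failure count exceeds `allowed_fails`, and it omits any time window over which failures are counted.

```python
class CooldownTracker:
    """Take a deployment out of rotation after too many failures."""

    def __init__(self, allowed_fails: int, cooldown_time: float):
        self.allowed_fails = allowed_fails
        self.cooldown_time = cooldown_time
        self.fails = {}           # deployment name -> failure count
        self.cooldown_until = {}  # deployment name -> time it becomes usable

    def record_failure(self, name: str, now: float) -> None:
        self.fails[name] = self.fails.get(name, 0) + 1
        if self.fails[name] > self.allowed_fails:
            self.cooldown_until[name] = now + self.cooldown_time
            self.fails[name] = 0  # start counting afresh after the cooldown

    def is_available(self, name: str, now: float) -> bool:
        return now >= self.cooldown_until.get(name, 0.0)


tracker = CooldownTracker(allowed_fails=3, cooldown_time=60.0)
for t in (0.0, 1.0, 2.0):
    tracker.record_failure("main-model", now=t)
still_up = tracker.is_available("main-model", now=2.0)

tracker.record_failure("main-model", now=3.0)  # fourth failure exceeds the limit
cooling = tracker.is_available("main-model", now=10.0)
recovered = tracker.is_available("main-model", now=63.0)
print(still_up, cooling, recovered)
```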

7. Comparison: LiteLLM vs. Alternatives

7.1 Tool Comparison

Feature	LiteLLM	LangChain	OpenRouter	Portkey
Type	Gateway + SDK	Framework	Hosted API	Hosted Gateway
Hosting	Self-hosted	N/A (library)	Cloud	Cloud + Self
Models	100+	Varies	200+	250+
Cost tracking	Built in	Must be implemented separately	Yes	Yes
Rate Limiting	Built in	No	Yes	Yes
Load Balancing	Built in	No	Yes	Yes
Fallback	Built in	Manual implementation	Yes	Yes
API key management	Virtual Keys	No	No	Yes
Pricing	Free (OSS)	Free (OSS)	Markup	Free + Enterprise
Data privacy	Full control	Full control	Via third party	Via third party

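7.2 Which Tool Should You Choose?

Choose LiteLLM when:
  - Data privacy is critical (finance, healthcare, government)
  - You need to run on your own infrastructure
  - You need cost tracking and rate limiting
  - You already use several providers

Choose LangChain when:
  - You are building complex LLM pipelines such as RAG or agents
  - You need prompt chaining, memory management, and similar features
  - (LiteLLM and LangChain can be used together)

Choose OpenRouter when:
  - You want fast prototyping
  - You do not want to manage infrastructure
  - You want access to every model with a single API key

Choose Portkey when:
  - You need an enterprise-grade management UI
  - You need advanced features such as guardrails and A/B testing
  - You prefer a managed service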

8. Practical Tips

8.1 Managing Environment Variables

# .env file (never commit this to Git)
OPENAI_API_KEY=sk-xxx
ANTHROPIC_API_KEY=sk-ant-xxx
AZURE_API_KEY=xxx
AWS_ACCESS_KEY_ID=xxx
AWS_SECRET_ACCESS_KEY=xxx
DATABASE_URL=postgresql://litellm:password@localhost:5432/litellm
LITELLM_MASTER_KEY=sk-master-key-change-me

8.2 Model Aliases

# config.yaml
model_list:
  - model_name: fast
    litellm_params:
      model: gpt-4o-mini
      api_key: os.environ/OPENAI_API_KEY
  - model_name: smart
    litellm_params:
      model: gpt-4o
      api_key: os.environ/OPENAI_API_KEY
  - model_name: creative
    litellm_params:
      model: anthropic/claude-sonnet-4-20250514
      api_key: os.environ/ANTHROPIC_API_KEY

# Call models by meaningful names
response = client.chat.completions.create(
    model="fast",  # gpt-4o-mini
    messages=[{"role": "user", "content": "Quick question"}],
)

response = client.chat.completions.create(
    model="smart",  # gpt-4o
    messages=[{"role": "user", "content": "Complex analysis"}],
)

8.3 Error Handling Patterns

import time

from openai import OpenAI, APIError, RateLimitError, APITimeoutError

client = OpenAI(
    api_key="sk-proxy-key",
    base_url="http://litellm-proxy:4000",
)

def safe_completion(messages, model="gpt-4o", max_retries=3):
    for attempt in range(max_retries):
        try:
            response = client.chat.completions.create(
                model=model,
                messages=messages,
                timeout=30,
            )
            return response
        except RateLimitError:
            # Exponential backoff: 1s, 2s, 4s, ...
            wait = 2 ** attempt
            print(f"Rate limited, waiting {wait}s...")
            time.sleep(wait)
        except APITimeoutError:
            print(f"Timeout on attempt {attempt + 1}")
            if attempt == max_retries - 1:
                raise
        except APIError as e:
            print(f"API error: {e}")
            raise

    return None
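The fixed `2 ** attempt` wait above makes every client that was rate limited at the same moment retry at the same moment too. Adding random jitter spreads those retries out. The sketch below shows one common variant, often called full jitter, with a hypothetical `backoff_delay` helper; the `base` and `cap` defaults are illustrative choices.

```python
import random


def backoff_delay(attempt: int, base: float = 1.0, cap: float = 30.0, rng=random) -> float:
    """Full-jitter backoff: uniform in [0, min(cap, base * 2**attempt)]."""
    upper = min(cap, base * (2 ** attempt))
    return rng.uniform(0.0, upper)


for attempt in range(5):
    print(f"attempt {attempt}: wait up to {min(30.0, 2 ** attempt)}s, "
          f"chose {backoff_delay(attempt):.2f}s")
```

In `safe_completion`, the result could replace `wait` in the `RateLimitError` branch.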

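9. Summary

LiteLLM fills the role of the AI gateway that multi-LLM environments need.

Key points:

Unified SDK: call 100+ LLMs through a single completion() function
Proxy Server: central management through an OpenAI-compatible API gateway
Cost control: automatic cost tracking, budget management, alerts
Reliability: built-in load balancing, fallback, and rate limiting
Production operations: Docker/Kubernetes deployment, Prometheus monitoring, external logging integrations

Especially in enterprise environments that use several LLM providers, adopting the LiteLLM Proxy lets you handle API key management, cost tracking, and incident response consistently in one place.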

References

LiteLLM official documentation: https://docs.litellm.ai/
LiteLLM GitHub: https://github.com/BerriAI/litellm
LiteLLM Proxy configuration guide: https://docs.litellm.ai/docs/proxy/configs
LiteLLM Docker deployment: https://docs.litellm.ai/docs/proxy/deploy