LiteLLM in Practice — 100+ LLMs Behind One Interface
By Youngju Kim (@fjvbn20031)
Overview
Build an LLM app for long enough and you hit the same wall: every provider ships a different SDK, different request parameters, and a different response shape. Start with OpenAI code, then try to move to Anthropic, and you end up rewriting everything from client setup to message parsing. LiteLLM removes exactly that friction.
LiteLLM (by BerriAI) comes in two pieces. One is a Python SDK; the other is a Proxy Server (an AI Gateway). Both share a single goal: call 100+ LLM APIs in the OpenAI request/response format. On top of that uniform surface, LiteLLM layers operational features — cost tracking, retries, load balancing, fallbacks, guardrails, and logging.
This post is a practical quickstart meant to get you productive fast. Install it, make your first call, swap providers, then move through streaming, the Router, and the Proxy along the shortest path. A deeper LiteLLM complete guide already lives on this blog and covers internals and production configuration in depth, so reach for that when you need the mechanics. This post stays deliberately light.
One point of framing before we start. LiteLLM is not "yet another provider." It is a single unified layer that sits on top of many providers. Underneath it you can even place a router like OpenRouter, which itself bundles hundreds of models. That means one LiteLLM setup can call both your directly connected providers and providers reached through OpenRouter using the same code.
1. Install and First Call
Installing is a one-liner. It makes no difference whether you use pip or uv.
pip install litellm
# with uv
uv add litellm
The smallest possible call looks like this. You hand the single completion function a model and messages, and the response comes back in OpenAI shape.
from litellm import completion
import os
os.environ["OPENAI_API_KEY"] = "sk-..."
resp = completion(
    model="openai/gpt-4o",
    messages=[{"role": "user", "content": "Hello"}],
)
print(resp.choices[0].message.content)
Note the access pattern. resp.choices[0].message.content is the exact structure you already know from the OpenAI SDK. Because that shape holds no matter which provider you call, you only write your response-handling code once.
You can hand simple reliability options straight to the SDK too. Set num_retries for the retry count and timeout for how long to wait, and a single call becomes a bit more resilient to transient errors. These options work on a plain call without any Router.
from litellm import completion
resp = completion(
    model="openai/gpt-4o",
    messages=[{"role": "user", "content": "Hello"}],
    num_retries=2,
    timeout=30,
)
print(resp.choices[0].message.content)
2. Calling Many Providers with the Same Code
Switching providers takes just two steps. Change the prefix of the model string, and set that provider's environment variable. That is the whole change. The message structure and the response parsing stay identical.
| Provider | Example model string | Environment variable |
|---|---|---|
| OpenAI | openai/gpt-4o | OPENAI_API_KEY |
| Anthropic | anthropic/claude-3-5-sonnet-20241022 | ANTHROPIC_API_KEY |
| Gemini (AI Studio) | gemini/gemini-1.5-pro | GEMINI_API_KEY |
| Vertex AI | vertex_ai/gemini-1.5-pro | GCP credentials |
| AWS Bedrock | bedrock/... | AWS credentials |
| Azure OpenAI | azure/... | Azure config |
| Ollama (local) | ollama/llama3 | none (runs locally) |
Moving to Anthropic, for example, looks like this.
from litellm import completion
import os
os.environ["ANTHROPIC_API_KEY"] = "sk-ant-..."
resp = completion(
    model="anthropic/claude-3-5-sonnet-20241022",
    messages=[{"role": "user", "content": "Hello"}],
)
print(resp.choices[0].message.content)
To experiment with a local model, start Ollama and change the prefix to ollama/. No API key required.
from litellm import completion
resp = completion(
    model="ollama/llama3",
    messages=[{"role": "user", "content": "Hello"}],
)
print(resp.choices[0].message.content)
Thanks to this structure, transitions like "local Ollama in development, Gemini in staging, Bedrock in production" become a matter of environment variables and model strings — no code changes at all.
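One way to wire that up is a small lookup from deployment environment to model string. This is a minimal sketch under assumptions: the APP_ENV variable, the environment names, and the helper names model_for_env and ask are all hypothetical, and the production entry uses OpenAI only because the Bedrock model id is elided in the table above.

```python
import os

# Hypothetical mapping for illustration. The environment names and the
# choice of model per environment are assumptions, not LiteLLM defaults.
MODEL_BY_ENV = {
    "development": "ollama/llama3",
    "staging": "gemini/gemini-1.5-pro",
    "production": "openai/gpt-4o",
}


def model_for_env(env_name: str) -> str:
    """Return the model string configured for a deployment environment."""
    try:
        return MODEL_BY_ENV[env_name]
    except KeyError:
        raise ValueError(f"unknown environment: {env_name!r}") from None


def ask(prompt: str) -> str:
    """Send one user message to whichever model APP_ENV selects."""
    # Imported lazily so the mapping above can be used without litellm installed.
    from litellm import completion

    model = model_for_env(os.environ.get("APP_ENV", "development"))
    resp = completion(model=model, messages=[{"role": "user", "content": prompt}])
    return resp.choices[0].message.content
```

The application only ever calls ask; which provider answers is decided by APP_ENV and the matching credentials in the environment.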
3. Using OpenRouter through LiteLLM
This is the combination this post most wants to highlight. OpenRouter is, on its own, a service that bundles 300+ models behind a single API. Put LiteLLM on top of it and you get OpenRouter's broad model coverage together with LiteLLM's uniform interface and operational features (retries, fallbacks, cost tracking).
The method is the same pattern as before. Set OPENROUTER_API_KEY, then start the model string with openrouter/ followed by the OpenRouter model id.
from litellm import completion
import os
os.environ["OPENROUTER_API_KEY"] = "sk-or-..."
resp = completion(
    model="openrouter/openai/gpt-4o",
    messages=[{"role": "user", "content": "Hello"}],
)
print(resp.choices[0].message.content)
Break the prefix apart and it makes sense. The leading openrouter/ tells LiteLLM "route this through OpenRouter," and the trailing openai/gpt-4o is the model id that OpenRouter recognizes. In other words, after openrouter/ you can put any model id OpenRouter supports.
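The split described above happens at the first slash only. The helper below is an illustrative sketch of that reading, not LiteLLM's own parsing code (which handles many more cases); the name split_model_string is hypothetical.

```python
def split_model_string(model: str) -> tuple[str, str]:
    """Split 'provider/model-id' at the first slash only.

    Everything after the first slash is kept intact, so an OpenRouter
    model id that itself contains a slash survives unchanged.
    """
    provider, sep, model_id = model.partition("/")
    if not sep or not provider or not model_id:
        raise ValueError(f"expected 'provider/model-id', got {model!r}")
    return provider, model_id


print(split_model_string("openrouter/openai/gpt-4o"))
print(split_model_string("openai/gpt-4o"))
```

For the OpenRouter string, the first element is the routing prefix and the second is the id passed on to OpenRouter.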
The practical payoff is clear. Wire your few contracted providers directly into LiteLLM, attach the more experimental or varied models through OpenRouter, and keep all your application code on the same completion call. The integration cost of adding another model effectively drops to zero.
4. Streaming and Async
To stream tokens in real time, add stream=True and iterate over the return value. You pull each fragment from the delta and concatenate.
from litellm import completion
resp = completion(
    model="openai/gpt-4o",
    messages=[{"role": "user", "content": "Write a haiku about the sea"}],
    stream=True,
)
for chunk in resp:
    print(chunk.choices[0].delta.content or "", end="")
Some chunks arrive with delta.content set to None, so guarding with or "" is the convention. Without the guard, print would emit the literal text None, and concatenating the fragment onto a string would raise a TypeError.
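If you want the full text rather than printed fragments, the same guard fits inside a small collector. This sketch uses hand-built stand-in chunks (via SimpleNamespace) so it runs without an API key; collect_text and fake_chunk are hypothetical names, and a real stream from completion(..., stream=True) would be passed in the same way.

```python
from types import SimpleNamespace


def collect_text(chunks) -> str:
    """Join the text fragments of an OpenAI-shaped stream into one string."""
    parts = []
    for chunk in chunks:
        # delta.content can be None (for example on the final chunk).
        parts.append(chunk.choices[0].delta.content or "")
    return "".join(parts)


def fake_chunk(text):
    """Build a stand-in object with the same attribute path as a real chunk."""
    delta = SimpleNamespace(content=text)
    return SimpleNamespace(choices=[SimpleNamespace(delta=delta)])


fake_stream = [fake_chunk("Blue "), fake_chunk(None), fake_chunk("waves")]
print(collect_text(fake_stream))  # prints: Blue waves
```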
In async code, use acompletion. The signature matches completion; you just add await.
import asyncio
from litellm import acompletion
async def main():
    resp = await acompletion(
        model="anthropic/claude-3-5-sonnet-20241022",
        messages=[{"role": "user", "content": "Hello"}],
    )
    print(resp.choices[0].message.content)
asyncio.run(main())
Streaming and async compose. When you stream tokens as server-sent events from an async web framework like FastAPI, you end up using both together.
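A rough sketch of that combination: an async generator that turns a stream of chunks into plain text fragments, which a web handler could then forward as events. The stream here is a fake async generator so the example runs offline; stream_text and fake_stream are hypothetical names.

```python
import asyncio
from types import SimpleNamespace


async def stream_text(chunks):
    """Yield non-empty text fragments from an async stream of OpenAI-shaped chunks."""
    async for chunk in chunks:
        text = chunk.choices[0].delta.content
        if text:  # skip None and empty fragments
            yield text


async def fake_stream():
    """Stand-in for a real async stream; yields three chunks, one without text."""
    for text in ("Hel", None, "lo"):
        delta = SimpleNamespace(content=text)
        yield SimpleNamespace(choices=[SimpleNamespace(delta=delta)])


async def demo() -> str:
    return "".join([text async for text in stream_text(fake_stream())])


result = asyncio.run(demo())
print(result)  # prints: Hello
```

With LiteLLM, the object returned by await acompletion(..., stream=True) would take the place of fake_stream().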
5. Router — Load Balancing and Fallbacks
Once call volume grows and reliability matters, the SDK's Router class comes into play. The Router groups several deployments under one alias, spreads requests across them, and absorbs failures automatically.
First define a model_list. Each entry has a public-facing model_name and the real call parameters in litellm_params. Attach the same model_name to multiple deployments and the Router will distribute requests sent to that alias across those deployments.
from litellm import Router
model_list = [
    {
        "model_name": "my-gpt",
        "litellm_params": {
            "model": "openai/gpt-4o",
            "api_key": "sk-...",
        },
    },
    {
        "model_name": "my-gpt",
        "litellm_params": {
            "model": "azure/gpt-4o-deployment",
            "api_key": "...",
            "api_base": "https://your-resource.openai.azure.com",
        },
    },
]
router = Router(model_list=model_list)
resp = router.completion(
    model="my-gpt",
    messages=[{"role": "user", "content": "Hello"}],
)
print(resp.choices[0].message.content)
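To see which deployments sit behind each alias, you can group a model_list by model_name yourself. This is only an inspection helper for your own config, not how the Router stores things internally; deployments_by_alias and example_model_list are hypothetical names.

```python
from collections import defaultdict


def deployments_by_alias(model_list):
    """Map each public alias (model_name) to the underlying model strings."""
    groups = defaultdict(list)
    for entry in model_list:
        groups[entry["model_name"]].append(entry["litellm_params"]["model"])
    return dict(groups)


# Same shape as the Router config above, with credentials left out.
example_model_list = [
    {"model_name": "my-gpt", "litellm_params": {"model": "openai/gpt-4o"}},
    {"model_name": "my-gpt", "litellm_params": {"model": "azure/gpt-4o-deployment"}},
]

print(deployments_by_alias(example_model_list))
```

Any alias that maps to more than one model string is one the Router can spread traffic across.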
The Router's real value is in fallbacks. You can configure three kinds, distinguished by the type of failure.
| Fallback kind | Trigger | Note |
|---|---|---|
| fallbacks | 429/500-class errors | routes to another model |
| context_window_fallbacks | context exceeded | needs enable_pre_call_checks=True |
| content_policy_fallbacks | content-policy refusal | reroutes to another model |
Here is an example wiring all three. The only thing to watch is that context-window fallbacks require the pre-call check to be on.
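from litellm import Router
# Reuses model_list from the previous example. The fallback targets
# ("claude", "claude-long", "safe-model") are aliases, so each one also
# needs its own model_name entry in model_list.
router = Router(
    model_list=model_list,
    fallbacks=[{"my-gpt": ["claude"]}],
    context_window_fallbacks=[{"my-gpt": ["claude-long"]}],
    content_policy_fallbacks=[{"my-gpt": ["safe-model"]}],
    enable_pre_call_checks=True,
)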
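Because fallback targets are aliases, a typo in one is easy to miss until a failure actually happens. The sketch below checks a config up front, assuming (as the Router examples suggest) that every name used in a fallback mapping should appear as a model_name in model_list. The helper undefined_fallback_targets is hypothetical and not part of LiteLLM.

```python
def undefined_fallback_targets(model_list, *fallback_configs):
    """Return fallback aliases that have no model_name entry in model_list.

    Each fallback config uses the Router's shape: a list of single-key
    dicts mapping a source alias to a list of target aliases.
    """
    defined = {entry["model_name"] for entry in model_list}
    missing = set()
    for config in fallback_configs:
        for mapping in config:
            for source, targets in mapping.items():
                for name in [source, *targets]:
                    if name not in defined:
                        missing.add(name)
    return sorted(missing)


only_gpt = [{"model_name": "my-gpt", "litellm_params": {"model": "openai/gpt-4o"}}]
print(
    undefined_fallback_targets(
        only_gpt,
        [{"my-gpt": ["claude"]}],
        [{"my-gpt": ["claude-long"]}],
        [{"my-gpt": ["safe-model"]}],
    )
)  # prints: ['claude', 'claude-long', 'safe-model']
```

An empty result means every alias named in the fallback settings has at least one deployment behind it.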
Beyond that, the Router also manages retries, cooldowns, and timeouts. If one deployment keeps failing, it goes onto a cooldown list so no traffic is sent to it for a while, then returns after a set interval. Your application code is still a single line: router.completion(model="my-gpt", ...).
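The cooldown idea can be pictured with a toy tracker. This is a conceptual sketch only, not LiteLLM's implementation: the class name, the failure threshold, and the 60-second default are all assumptions chosen for illustration, and the clock is injectable so the behavior can be checked without waiting.

```python
import time


class CooldownTracker:
    """Toy model of a cooldown list: bench a deployment after repeated failures."""

    def __init__(self, allowed_fails=3, cooldown_seconds=60.0, clock=time.monotonic):
        self.allowed_fails = allowed_fails
        self.cooldown_seconds = cooldown_seconds
        self._clock = clock
        self._fail_counts = {}
        self._cooling_until = {}

    def record_failure(self, deployment):
        count = self._fail_counts.get(deployment, 0) + 1
        self._fail_counts[deployment] = count
        if count > self.allowed_fails:
            # Too many failures: stop sending traffic until the deadline passes.
            self._cooling_until[deployment] = self._clock() + self.cooldown_seconds
            self._fail_counts[deployment] = 0

    def record_success(self, deployment):
        self._fail_counts[deployment] = 0

    def is_available(self, deployment):
        until = self._cooling_until.get(deployment)
        if until is None:
            return True
        if self._clock() >= until:
            # Cooldown elapsed: the deployment rejoins the pool.
            del self._cooling_until[deployment]
            return True
        return False
```

A load balancer built on this would simply skip any deployment for which is_available returns False. The real Router has its own settings for the failure threshold and cooldown duration; check the Router documentation for your version rather than relying on the names used here.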
6. Proxy Server (AI Gateway)
When multiple teams and apps need to share one gateway, the Proxy Server is the answer. The Proxy runs as its own process and acts as a gateway that accepts any OpenAI-compatible client.
First write a model_list into config.yaml. It is safest to have API keys reference environment variables via os.environ/.
model_list:
  - model_name: gpt-4o
    litellm_params:
      model: openai/gpt-4o
      api_key: os.environ/OPENAI_API_KEY
  - model_name: claude
    litellm_params:
      model: anthropic/claude-3-5-sonnet-20241022
      api_key: os.environ/ANTHROPIC_API_KEY
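Then start the proxy, pointing it at the config file. The proxy's server dependencies ship as an optional extra, so install it with pip install 'litellm[proxy]' first. By default it listens on http://0.0.0.0:4000.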
litellm --config config.yaml
Now point any OpenAI-compatible client's base URL at the proxy and pass a placeholder key, and it just works. Once you set up virtual keys on the proxy, the client sends one of those instead. The real provider keys stay in the proxy's config, so they are never exposed to the client.
from openai import OpenAI
client = OpenAI(
    base_url="http://0.0.0.0:4000",
    api_key="anything",  # virtual/dummy key managed by the proxy
)
resp = client.chat.completions.create(
    model="gpt-4o",
    messages=[{"role": "user", "content": "Hello"}],
)
print(resp.choices[0].message.content)
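"OpenAI-compatible" means the proxy speaks plain HTTP with the OpenAI request shape, so a client does not strictly need the openai package. The sketch below builds such a request with the standard library only. It assumes the proxy serves the chat route at /chat/completions under the base URL used above, and build_chat_request is a hypothetical helper; the request is constructed but not sent, so the example runs without a proxy.

```python
import json
import urllib.request


def build_chat_request(base_url, api_key, model, prompt):
    """Build (but do not send) an OpenAI-style chat completion request."""
    body = json.dumps(
        {"model": model, "messages": [{"role": "user", "content": prompt}]}
    ).encode("utf-8")
    return urllib.request.Request(
        url=base_url.rstrip("/") + "/chat/completions",
        data=body,
        headers={
            "Content-Type": "application/json",
            "Authorization": f"Bearer {api_key}",
        },
        method="POST",
    )


req = build_chat_request("http://0.0.0.0:4000", "anything", "gpt-4o", "Hello")
print(req.get_method(), req.full_url)
```

To actually send it against a running proxy, pass req to urllib.request.urlopen and parse the JSON body of the response; the text sits at choices[0].message.content, as with the SDK.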
What the Proxy layer adds goes beyond plain proxying. Per-team virtual keys, budget limits, a cost-tracking dashboard, and request logging are all managed centrally at the gateway. Applications never need the real provider keys, and your operations team sees who spent what in one place.
7. SDK vs Proxy, and Practical Tips
Which one you use comes down to scale.
- Use the SDK when: it is a single app or script. Import the library and call it directly inside your code — no separate infrastructure needed. Good for personal projects, batch jobs, and single services.
- Use the Proxy when: multiple teams or apps must share one gateway. If you need centralized key management, budgets, and cost dashboards, the Proxy is the right call. Building an org-wide LLM platform naturally trends this direction.
A few things that pay off in practice.
- Write response code once: no matter which provider you call, the resp.choices[0].message.content shape holds, so you never rewrite response parsing per provider.
- Keys via environment variables: whether SDK or proxy config, inject keys through environment variables instead of hardcoding them. In the proxy, use the os.environ/ reference.
- The model string is the switch: switching providers is just prefix replacement plus setting an env var. The strategy of changing only the model string per deployment environment works well.
- Keep OpenRouter alongside: wire your frequent providers directly and attach experimental or varied models with the openrouter/ prefix, widening your options at no integration cost.
- Start small, then grow: begin with the SDK, and when the team grows and central management is needed, carry the same model_list concept straight over into a Proxy config.
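As a sketch of that last tip, the helper below takes an SDK-style model_list and produces the structure a proxy config expects, swapping literal keys for os.environ/ references along the way. The function name to_proxy_config and the provider-to-variable mapping are assumptions for illustration; the result is a plain dict that you would still need to write out as YAML yourself.

```python
def to_proxy_config(model_list, key_env_vars):
    """Convert an SDK model_list into a proxy-style config dict.

    key_env_vars maps a provider prefix (the part of the model string
    before the first slash) to the environment variable holding its key.
    """
    entries = []
    for entry in model_list:
        params = dict(entry["litellm_params"])  # copy so the input is not mutated
        provider = params["model"].split("/", 1)[0]
        if provider in key_env_vars:
            params["api_key"] = f"os.environ/{key_env_vars[provider]}"
        entries.append({"model_name": entry["model_name"], "litellm_params": params})
    return {"model_list": entries}


sdk_model_list = [
    {
        "model_name": "my-gpt",
        "litellm_params": {"model": "openai/gpt-4o", "api_key": "sk-..."},
    },
]
config = to_proxy_config(sdk_model_list, {"openai": "OPENAI_API_KEY"})
print(config["model_list"][0]["litellm_params"]["api_key"])
# prints: os.environ/OPENAI_API_KEY
```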
To sum up, LiteLLM pushes provider fragmentation out of your code. The first call takes a few lines, and the moment you need it you can add reliability with the Router and org-level control with the Proxy. For deeper internals and production configuration, continue with the LiteLLM complete guide.