Author: Youngju Kim (@fjvbn20031)

- Why LLM Costs Are Scarier Than You Think
- Strategy 1: Prompt Caching — Instant 90% Reduction
- Strategy 2: Model Routing — 70% Cost Reduction
- Strategy 3: Semantic Caching — 100% Off for Repeated Queries
- Strategy 4: Batch API — 50% Off for Non-Real-Time Work
- Strategy 5: Output Token Optimization
- Cost Monitoring Dashboard
- Model Cost Reference (2025)
- Combined Savings Scenario
- Conclusion
Why LLM Costs Are Scarier Than You Think
$50,000/month. This isn't an exaggeration.
Scenario: B2B SaaS, 5,000 daily active users
Usage pattern:
- 1 user × 10 conversations per day
- 1 conversation = 200 input tokens + 300 output tokens
Daily token usage:
5,000 users × 10 conversations × 500 tokens = 25,000,000 tokens/day
Monthly:
25,000,000 × 30 = 750,000,000 tokens/month
Monthly cost comparison (300M input + 450M output tokens):
GPT-4o ($2.50/1M input + $10/1M output):
→ Input $750 + Output $4,500 = $5,250/month
GPT-4o-mini ($0.15/1M input + $0.60/1M output):
→ Input $45 + Output $270 = $315/month
Self-hosted Llama: ~$500-2,000/month
Switching from GPT-4o to GPT-4o-mini saves about $4,935 per month (94%) in this single example, and the gap grows linearly with traffic: at ten times this usage it is roughly $49,000 per month. This is why cost optimization belongs at the top of your engineering priorities.
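The scenario arithmetic can be reproduced with a few lines of Python. This is a minimal sketch: the function name `monthly_cost` and its parameters are illustrative, and the prices are the per-1M-token list prices quoted above, which may have changed since.

```python
def monthly_cost(users, convs_per_day, in_tokens, out_tokens,
                 in_price, out_price, days=30):
    """Return (input_cost, output_cost) in dollars; prices are $ per 1M tokens."""
    conversations = users * convs_per_day * days
    input_cost = conversations * in_tokens * in_price / 1_000_000
    output_cost = conversations * out_tokens * out_price / 1_000_000
    return input_cost, output_cost


# Scenario from the text: 5,000 users, 10 conversations/day, 200 in + 300 out tokens
gpt4o = monthly_cost(5_000, 10, 200, 300, in_price=2.50, out_price=10.00)
mini = monthly_cost(5_000, 10, 200, 300, in_price=0.15, out_price=0.60)

print(f"GPT-4o:      ${sum(gpt4o):,.0f}/month")   # $5,250/month
print(f"GPT-4o-mini: ${sum(mini):,.0f}/month")    # $315/month
```

Splitting input and output cost matters because output tokens dominate the bill here even though each conversation has a similar number of input and output tokens.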
Strategy 1: Prompt Caching — Instant 90% Reduction
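The most powerful and most overlooked optimization. When your system prompt or long context is cached, subsequent requests that reuse the same prefix pay about 10% of the normal input price for the cached tokens (on Anthropic, where the first request also pays a 1.25x cache-write premium). The discount applies only to cached input tokens, not to output tokens.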
Anthropic Prompt Caching
import anthropic

client = anthropic.Anthropic()

# A long system prompt that gets sent with every single request
COMPANY_KNOWLEDGE_BASE = """
[Thousands of tokens of company documentation, product info, policies...]
...sending this on every request without caching is burning money.
"""


def chat_with_caching(user_message: str) -> dict:
    response = client.messages.create(
        model="claude-3-5-sonnet-20241022",
        max_tokens=1024,
        system=[
            {
                "type": "text",
                "text": COMPANY_KNOWLEDGE_BASE,
                "cache_control": {"type": "ephemeral"}  # Cache this block!
            },
            {
                "type": "text",
                "text": "You are a customer support specialist. Answer based on the company info above."
                # Not cached: short, may change
            }
        ],
        messages=[{"role": "user", "content": user_message}]
    )
    usage = response.usage
    print(f"Cache write tokens: {usage.cache_creation_input_tokens} (1.25x cost)")
    print(f"Cache read tokens: {usage.cache_read_input_tokens} (0.1x cost, 90% off)")
    print(f"Normal input tokens: {usage.input_tokens} (1x cost)")
    return {
        "content": response.content[0].text,
        "cache_hit": usage.cache_read_input_tokens > 0
    }


# First call: cache creation (costs 1.25x, written to cache)
result1 = chat_with_caching("What's your refund policy?")

# Subsequent calls while the cache entry is alive: cache hit (costs 0.1x, 90% savings)
result2 = chat_with_caching("How long does shipping take?")
result3 = chat_with_caching("What's the warranty period?")
Cost math for this setup:
- System prompt: 5,000 tokens
- Daily requests: 10,000
- Without caching: 10,000 × 5,000 = 50M input tokens/day
- With 95% cache hit rate: 2.5M × 1.25 (misses, billed as cache writes) + 47.5M × 0.1 (hits) = 3.125M + 4.75M = 7.875M effective tokens/day
- Savings: about 84% on system prompt tokens
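The cost math above can be checked with a short calculation. This is a sketch under stated assumptions: every miss is billed as a cache write at 1.25x, every hit as a cache read at 0.1x, and the function name `effective_cached_tokens` is illustrative rather than part of any SDK.

```python
def effective_cached_tokens(prompt_tokens, requests, hit_rate,
                            write_multiplier=1.25, read_multiplier=0.10):
    """Return (raw_tokens, effective_tokens, savings_fraction) for a cached prompt."""
    raw = prompt_tokens * requests
    hits = raw * hit_rate
    misses = raw - hits
    # Misses rewrite the cache (1.25x); hits read from it (0.1x)
    effective = misses * write_multiplier + hits * read_multiplier
    return raw, effective, 1 - effective / raw


raw, effective, savings = effective_cached_tokens(5_000, 10_000, hit_rate=0.95)
print(f"Raw: {raw / 1e6:.0f}M, effective: {effective / 1e6:.3f}M, savings: {savings:.0%}")
```

Lowering `hit_rate` shows how quickly the benefit erodes: because a miss costs more than an uncached request, a cache that rarely hits can cost more than no cache at all.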
OpenAI Automatic Prompt Caching
from openai import OpenAI

client = OpenAI()

# Placeholder: replace with your real, stable system prompt
LONG_SYSTEM_PROMPT = "[Your long system prompt: policies, product docs, instructions...]"

# OpenAI automatically caches prompts over 1,024 tokens
# 50% discount applied automatically, no configuration needed
response = client.chat.completions.create(
    model="gpt-4o",
    messages=[
        {
            "role": "system",
            # Longer than 1,024 tokens → auto-cached at 50% discount
            "content": LONG_SYSTEM_PROMPT  # 2,000+ tokens recommended
        },
        {"role": "user", "content": "user question here"}
    ]
)

# Check if cache was hit (prompt_tokens_details may be missing or None)
usage = response.usage
details = getattr(usage, "prompt_tokens_details", None)
if details is not None:
    cached = details.cached_tokens or 0
    total_input = usage.prompt_tokens
    # Cached tokens are billed at half price: reduction = cached share × 50%
    savings_pct = (cached / total_input * 50) if total_input > 0 else 0
    print(f"Cached: {cached}/{total_input} tokens ({savings_pct:.1f}% input cost reduction)")
Strategy 2: Model Routing — 70% Cost Reduction
Not every request needs the same capability. Routing simple questions to cheap models while reserving expensive models for complex tasks is one of the highest-leverage optimizations available.
import re

from openai import OpenAI

client = OpenAI()


class ModelRouter:
    """Route requests to the cheapest model that can handle them"""

    # Word boundaries (\b) keep "architecture" from matching "architect"
    # and "this item" from matching "is it"
    SIMPLE_PATTERNS = [
        r"^(what is|what are|define|who is|when was|where is)\b",
        r"^(translate|how do you say|what does .* (mean|stand for))\b",
        r"\b(yes or no|true or false|is it)\b",
    ]
    COMPLEX_PATTERNS = [
        r"\b(analyze|compare|design|architect|evaluate)\b",
        r"\b(step.by.step|detailed|comprehensive|in.depth)\b",
        r"\b(implement|code|build|create a system|write a program)\b",
        r"\b(explain why|what causes|pros and cons|trade.?offs)\b",
        r"\b(review|critique|refactor|optimize)\b",
    ]

    def classify(self, query: str) -> str:
        lower = query.lower()
        if any(re.search(p, lower) for p in self.COMPLEX_PATTERNS):
            return "complex"
        if (len(query.split()) < 20 and
                any(re.search(p, lower) for p in self.SIMPLE_PATTERNS)):
            return "simple"
        if len(query) > 500:
            return "complex"
        return "medium"

    def complete(self, query: str, system: str = "") -> dict:
        complexity = self.classify(query)
        messages = []
        if system:
            messages.append({"role": "system", "content": system})
        messages.append({"role": "user", "content": query})

        if complexity in ("simple", "medium"):
            # GPT-4o-mini: ~17x cheaper than GPT-4o ($0.15 vs $2.50 input, $0.60 vs $10 output)
            response = client.chat.completions.create(
                model="gpt-4o-mini",
                messages=messages,
                max_tokens=500 if complexity == "medium" else 200
            )
            model = "gpt-4o-mini"
        else:
            # Complex reasoning: use capable model
            response = client.chat.completions.create(
                model="gpt-4o",
                messages=messages
            )
            model = "gpt-4o"

        return {
            "answer": response.choices[0].message.content,
            "model": model,
            "complexity": complexity,
            "tokens": response.usage.total_tokens
        }


# Illustrative cost impact (price per 1M input + 1M output tokens):
# Assume traffic distribution: 60% simple, 25% medium, 15% complex
# Cost with all GPT-4o: 100% × $12.50 = $12.50
# Cost with routing: 85% × $0.75 + 15% × $12.50 = $0.64 + $1.88 = $2.51
# → ~80% cost reduction

router = ModelRouter()
examples = [
    "What is a transformer architecture?",        # simple → mini
    "Implement an LRU cache in Python",           # complex → GPT-4o
    "What does HNSW stand for?",                  # simple → mini
    "Design a distributed rate limiting system",  # complex → GPT-4o
]
for q in examples:
    result = router.complete(q)
    print(f"[{result['complexity'].upper()}] → {result['model']} | {q[:50]}")
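How much routing saves depends almost entirely on the share of traffic that still reaches the expensive model. The sketch below makes that explicit; `PRICES` and `blended_cost` are illustrative names, prices are the list prices used in this article, and the example assumes an equal volume of input and output tokens.

```python
# $ per 1M tokens: (input, output); list prices as quoted in this article
PRICES = {"gpt-4o": (2.50, 10.00), "gpt-4o-mini": (0.15, 0.60)}


def blended_cost(traffic_share, in_tokens=1_000_000, out_tokens=1_000_000):
    """Cost in dollars for the given token volume, split across models by traffic share."""
    total = 0.0
    for model, share in traffic_share.items():
        in_price, out_price = PRICES[model]
        total += share * (in_tokens * in_price + out_tokens * out_price) / 1_000_000
    return total


baseline = blended_cost({"gpt-4o": 1.0})
routed = blended_cost({"gpt-4o-mini": 0.85, "gpt-4o": 0.15})
print(f"All GPT-4o: ${baseline:.2f}, routed: ${routed:.2f}, "
      f"reduction: {1 - routed / baseline:.0%}")
```

If the classifier sends 30% of traffic to GPT-4o instead of 15%, the reduction drops to roughly two thirds, so it is worth measuring the real traffic split before quoting a savings figure.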
Strategy 3: Semantic Caching — 100% Off for Repeated Queries
When users ask similar questions repeatedly — which they always do in any sufficiently large user base — you can return cached responses instead of making API calls.
import hashlib
from datetime import datetime, timedelta

import numpy as np
from openai import OpenAI

client = OpenAI()


class SemanticCache:
    """
    Cache responses by semantic meaning, not exact text.
    "What is RAG?" and "Explain RAG to me" can return the same cached result.
    """

    def __init__(self, similarity_threshold: float = 0.92, ttl_hours: int = 24):
        self.cache: dict = {}
        self.threshold = similarity_threshold
        self.ttl = timedelta(hours=ttl_hours)
        self.stats = {"hits": 0, "misses": 0}

    def _embed(self, text: str) -> list:
        return client.embeddings.create(
            input=text,
            model="text-embedding-3-small",
            dimensions=256  # Small dims for fast cache lookup
        ).data[0].embedding

    def _similarity(self, a: list, b: list) -> float:
        a_np, b_np = np.array(a), np.array(b)
        return float(np.dot(a_np, b_np) / (np.linalg.norm(a_np) * np.linalg.norm(b_np)))

    def lookup(self, query: str, query_emb=None) -> tuple:
        """Returns (cached_response, similarity) or (None, best_similarity)"""
        if query_emb is None:
            query_emb = self._embed(query)
        best_score, best_response = 0.0, None
        for entry in self.cache.values():
            if datetime.now() - entry["ts"] > self.ttl:
                continue  # Expired entry
            score = self._similarity(query_emb, entry["emb"])
            if score > best_score:
                best_score, best_response = score, entry["response"]
        if best_score >= self.threshold:
            self.stats["hits"] += 1
            return best_response, best_score
        self.stats["misses"] += 1
        return None, best_score

    def store(self, query: str, response: str, emb=None) -> None:
        key = hashlib.md5(query.encode()).hexdigest()
        self.cache[key] = {
            "emb": emb if emb is not None else self._embed(query),
            "response": response,
            "ts": datetime.now()
        }

    def ask(self, query: str, model: str = "gpt-4o-mini") -> dict:
        query_emb = self._embed(query)  # Embed once, reuse for lookup and store
        cached, score = self.lookup(query, query_emb)
        if cached is not None:
            return {"response": cached, "source": "cache", "similarity": score}
        response = client.chat.completions.create(
            model=model,
            messages=[{"role": "user", "content": query}]
        ).choices[0].message.content
        self.store(query, response, query_emb)
        return {"response": response, "source": "api", "similarity": 0}

    @property
    def hit_rate(self) -> float:
        total = self.stats["hits"] + self.stats["misses"]
        return self.stats["hits"] / total if total else 0


# Pre-warm cache with FAQ
cache = SemanticCache(similarity_threshold=0.92)
faqs = [
    "What is your refund policy?",
    "How long does shipping take?",
    "How do I reset my password?",
    "What payment methods do you accept?",
]
for q in faqs:
    cache.ask(q)  # Stores in cache (each counts as a miss)

# Paraphrases hit the cache only if their similarity clears the threshold;
# tune the threshold on your own queries.
r1 = cache.ask("Can I get a refund?")      # Related to "refund policy"
r2 = cache.ask("How fast is delivery?")    # Related to "shipping take"
r3 = cache.ask("I forgot my password")     # Related to "reset password"

# Includes the 4 pre-warm misses, so the best case here is 3/7 ≈ 43%
print(f"Cache hit rate: {cache.hit_rate:.1%}")
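Expected impact on customer support workloads: if 60-70% of queries are semantically similar to previously answered questions, semantic caching with a 0.90-0.95 threshold can reach hit rates of roughly 40-65%. Treat these figures as rough targets and measure the hit rate on your own traffic. A threshold set too low returns cached answers to questions that only look similar.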
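The effect of the similarity threshold is easiest to see without any API calls. The following sketch uses made-up 3-dimensional vectors in place of real embeddings; `cosine` and `best_match` are illustrative helpers that mirror the lookup logic above rather than any library API.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def best_match(query_vec, entries, threshold):
    """Return (response, score) for the closest entry, or (None, score) if below threshold."""
    best_score, best_response = 0.0, None
    for vec, response in entries:
        score = cosine(query_vec, vec)
        if score > best_score:
            best_score, best_response = score, response
    if best_score >= threshold:
        return best_response, best_score
    return None, best_score


# Made-up vectors standing in for real embeddings
entries = [
    ([1.0, 0.0, 0.0], "refund answer"),
    ([0.0, 1.0, 0.0], "shipping answer"),
]
near = [0.95, 0.05, 0.0]  # Almost parallel to the refund entry
far = [0.7, 0.7, 0.1]     # Halfway between both entries

print(best_match(near, entries, threshold=0.92))  # Hit: similarity is about 0.999
print(best_match(far, entries, threshold=0.92))   # Miss: similarity is about 0.70
print(best_match(far, entries, threshold=0.70))   # Hit on an ambiguous query
```

The last call shows the risk of a permissive threshold: a query that sits between two topics is served one topic's cached answer.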
Strategy 4: Batch API — 50% Off for Non-Real-Time Work
For tasks that don't need instant responses, OpenAI's Batch API delivers 50% savings.
import json
import tempfile

from openai import OpenAI

client = OpenAI()


def submit_batch_job(texts: list, task_description: str) -> str:
    """
    Process large volumes of text at 50% cost reduction.
    Completed within 24 hours.
    Ideal for: sentiment analysis, classification, summarization, translation
    """
    requests = [
        {
            "custom_id": f"item-{i}",
            "method": "POST",
            "url": "/v1/chat/completions",
            "body": {
                "model": "gpt-4o-mini",
                "messages": [
                    {"role": "system", "content": task_description},
                    {"role": "user", "content": text}
                ],
                "max_tokens": 100,
                "response_format": {"type": "json_object"}
            }
        }
        for i, text in enumerate(texts)
    ]

    # Write JSONL file (one request per line)
    with tempfile.NamedTemporaryFile(mode='w', suffix='.jsonl', delete=False) as f:
        for req in requests:
            f.write(json.dumps(req) + '\n')
        tmp_path = f.name

    # Upload and submit
    with open(tmp_path, 'rb') as f:
        batch_file = client.files.create(file=f, purpose="batch")
    batch = client.batches.create(
        input_file_id=batch_file.id,
        endpoint="/v1/chat/completions",
        completion_window="24h"
    )
    print(f"Batch submitted: {batch.id}")
    print("Cost: 50% less than synchronous API")
    print("Expected completion: within 24 hours")
    return batch.id


def retrieve_batch_results(batch_id: str) -> list:
    batch = client.batches.retrieve(batch_id)
    if batch.status != "completed":
        print(f"Status: {batch.status}, not ready yet")
        return []

    content = client.files.content(batch.output_file_id)
    results = []
    for line in content.text.strip().split('\n'):
        item = json.loads(line)
        response = item.get("response")
        if not response or response.get("status_code") != 200:
            # Failed requests carry an error instead of a completion
            results.append({"id": item["custom_id"], "error": item.get("error")})
            continue
        answer = response["body"]["choices"][0]["message"]["content"]
        results.append({
            "id": item["custom_id"],
            "result": json.loads(answer) if answer.startswith('{') else answer
        })
    return results


# Example: classify ~10,000 support tickets overnight (3 × 3,334 = 10,002)
tickets = ["My order hasn't arrived", "I love this product!", "Wrong item received"] * 3334
batch_id = submit_batch_job(
    texts=tickets,
    task_description='Classify the support ticket. Respond in JSON: {"category": "shipping|product|billing|other", "priority": "high|medium|low", "sentiment": "positive|negative|neutral"}'
)

# Run this the next day
results = retrieve_batch_results(batch_id)
Best batch use cases:
- Bulk classification/summarization/translation of existing data
- Nightly report generation
- Offline customer feedback analysis
- Content moderation (non-real-time)
- Embedding generation for large document sets
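Large backfills can exceed what a single batch file accepts, so the input usually has to be split before upload. The sketch below assumes limits of 50,000 requests and 200 MB per input file; verify both against the current Batch API documentation. `chunk_requests` is a hypothetical helper, not part of the OpenAI SDK.

```python
import json


def chunk_requests(requests, max_requests=50_000, max_bytes=200 * 1024 * 1024):
    """Yield lists of JSONL lines, each list staying within both assumed limits."""
    chunk, size = [], 0
    for req in requests:
        line = json.dumps(req) + "\n"
        line_bytes = len(line.encode("utf-8"))
        # Start a new chunk when adding this line would break either limit
        if chunk and (len(chunk) >= max_requests or size + line_bytes > max_bytes):
            yield chunk
            chunk, size = [], 0
        chunk.append(line)
        size += line_bytes
    if chunk:
        yield chunk


# Tiny demonstration with an artificially low request limit
demo_requests = [{"custom_id": f"item-{i}"} for i in range(7)]
demo_chunks = list(chunk_requests(demo_requests, max_requests=3))
print([len(c) for c in demo_chunks])  # [3, 3, 1]
```

Each yielded chunk can be written to its own JSONL file and submitted as a separate batch, with `custom_id` values kept unique across chunks so results can be merged afterwards.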
Strategy 5: Output Token Optimization
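Output tokens cost 4-5x more than input tokens for the models covered here (4x for GPT-4o and GPT-4o-mini, 5x for the Claude models). For classification-style tasks, structured outputs can reduce response length by 70-90%.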
import json
from typing import Literal

from openai import OpenAI
from pydantic import BaseModel

client = OpenAI()


# Bad: verbose prose response (wastes tokens)
def analyze_verbose(text: str) -> str:
    response = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[{"role": "user", "content": f"Analyze the sentiment of: {text}"}]
    )
    # Returns: "The sentiment of this text is overwhelmingly positive. The author expresses..."
    # Typical: 80-200 output tokens
    return response.choices[0].message.content


# Good: structured minimal output
class SentimentResult(BaseModel):
    sentiment: Literal["positive", "negative", "neutral"]
    confidence: float
    reason: str  # Max 5 words, requested in the prompt (not enforced by the schema)


def analyze_structured(text: str) -> SentimentResult:
    response = client.beta.chat.completions.parse(
        model="gpt-4o-mini",
        messages=[
            {
                "role": "system",
                "content": "Classify sentiment. Reason must be 5 words max."
            },
            {"role": "user", "content": text}
        ],
        response_format=SentimentResult
    )
    # Returns: {"sentiment": "positive", "confidence": 0.94, "reason": "enthusiastic praise"}
    # Typical: 20-30 output tokens → 85% savings
    return response.choices[0].message.parsed


# Token-aware prompting
def summarize_concisely(text: str, target_words: int = 50) -> str:
    response = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[
            {
                "role": "system",
                "content": f"Summarize in {target_words} words or fewer. Be direct. No preamble."
            },
            {"role": "user", "content": text}
        ],
        max_tokens=target_words * 2  # Hard limit as safety net
    )
    return response.choices[0].message.content


# For JSON tasks, avoid asking the model to "explain" anything
EXTRACTION_PROMPT = """Extract the requested data. Return ONLY valid JSON. No explanation.
If a field is missing, use null."""


def extract_structured_data(document: str, schema: dict) -> dict:
    response = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[
            {"role": "system", "content": EXTRACTION_PROMPT},
            {"role": "user", "content": f"Schema: {json.dumps(schema)}\n\nDocument: {document}"}
        ],
        response_format={"type": "json_object"}
    )
    return json.loads(response.choices[0].message.content)
Cost Monitoring Dashboard
You can't optimize what you don't measure.
from collections import defaultdict
from datetime import date

from openai import OpenAI

client = OpenAI()


class CostTracker:
    """Real-time LLM API cost tracking with budget alerts"""

    # 2025 pricing ($/1M tokens)
    PRICING = {
        "gpt-4o": {"input": 2.50, "output": 10.00},
        "gpt-4o-mini": {"input": 0.15, "output": 0.60},
        "claude-3-5-sonnet-20241022": {"input": 3.00, "output": 15.00},
        "claude-3-haiku-20240307": {"input": 0.25, "output": 1.25},
    }

    def __init__(self, daily_budget: float = 100.0):
        self.daily_budget = daily_budget
        self.by_model = defaultdict(lambda: {"in": 0, "out": 0, "cost": 0.0})
        self.by_day = defaultdict(float)

    def record(self, model: str, input_tokens: int, output_tokens: int) -> float:
        if model not in self.PRICING:
            # Unknown models would otherwise be tracked silently at $0
            print(f"WARNING: no pricing for {model}; recording cost as $0")
        p = self.PRICING.get(model, {"input": 0, "output": 0})
        cost = (input_tokens * p["input"] + output_tokens * p["output"]) / 1_000_000
        today = date.today().isoformat()

        self.by_model[model]["in"] += input_tokens
        self.by_model[model]["out"] += output_tokens
        self.by_model[model]["cost"] += cost
        self.by_day[today] += cost

        daily_so_far = self.by_day[today]
        if daily_so_far > self.daily_budget:
            raise RuntimeError(f"Daily budget exceeded! ${daily_so_far:.2f} > ${self.daily_budget:.2f}")
        if daily_so_far > self.daily_budget * 0.8:
            print(f"WARNING: 80% of daily budget used (${daily_so_far:.2f} / ${self.daily_budget:.2f})")
        return cost

    def report(self) -> None:
        total = sum(d["cost"] for d in self.by_model.values())
        today = date.today().isoformat()
        print(f"\n{'='*50}")
        print(f"Today's spend: ${self.by_day[today]:.4f} / ${self.daily_budget:.2f} budget")
        print(f"Total tracked: ${total:.4f}")
        print("\nBy model:")
        for model, d in sorted(self.by_model.items(), key=lambda x: -x[1]["cost"]):
            pct = d["cost"] / total * 100 if total else 0
            print(f"  {model:<45} ${d['cost']:.4f} ({pct:.1f}%)")
            print(f"    {d['in']:>12,} input | {d['out']:>12,} output tokens")


tracker = CostTracker(daily_budget=50.0)


def tracked_call(model: str, messages: list, **kwargs) -> str:
    """Drop-in replacement for chat completions with cost tracking"""
    response = client.chat.completions.create(model=model, messages=messages, **kwargs)
    tracker.record(model, response.usage.prompt_tokens, response.usage.completion_tokens)
    return response.choices[0].message.content
Model Cost Reference (2025)
| Model | Input ($/1M tokens) | Output ($/1M tokens) | Best For |
|---|---|---|---|
| GPT-4o | $2.50 | $10.00 | Complex reasoning, multimodal |
| GPT-4o-mini | $0.15 | $0.60 | Most tasks |
| Claude 3.5 Sonnet | $3.00 | $15.00 | Coding, analysis, long docs |
| Claude 3 Haiku | $0.25 | $1.25 | Fast, simple tasks |
| Llama 3.1 70B (self-hosted) | ~$0.05-0.15 | ~$0.05-0.15 | High-volume, private data |
Combined Savings Scenario
Applying all five strategies to a realistic workload:
Baseline: 100,000 requests/month, GPT-4o, avg 1,000 input + 1,000 output tokens/request
Before optimization:
100,000 × 1,000 × ($2.50 + $10.00) / 1,000,000 = $1,250/month ($250 input + $1,000 output)
After all strategies, each applied to the spend left by the previous one:
1. Prompt caching (80% of requests, 85% reduction, input tokens only): input $250 → $80, saves ~$170
2. Model routing (70% to mini, at ~6% of GPT-4o prices): saves ~$711 additional
3. Semantic caching (40% cache hit rate): saves ~$148 additional
4. Batch API (20% of requests at 50% off): saves ~$22 additional
5. Output optimization (30% shorter outputs): saves ~$55 additional
Optimized monthly cost: ~$144/month
Savings: ~88% (in this ideal scenario; 60-80% is more realistic)
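One way to sanity-check a stacked estimate like this is to apply each strategy to the spend that remains after the previous one, rather than to the baseline. The sketch below does that under explicit assumptions: 1,000 input plus 1,000 output tokens per request, prompt caching affecting input tokens only, and GPT-4o-mini priced at 6% of GPT-4o (from the pricing table). The function name `stacked_cost` and the step order are illustrative.

```python
def stacked_cost(input_cost, output_cost):
    """Apply each strategy to the remaining spend; return (final_cost, savings_log)."""
    log = []

    def step(name, in_factor, out_factor):
        nonlocal input_cost, output_cost
        before = input_cost + output_cost
        input_cost *= in_factor
        output_cost *= out_factor
        log.append((name, before - (input_cost + output_cost)))

    step("prompt caching", 1 - 0.80 * 0.85, 1.0)   # 80% of requests, 85% off input
    routing = 0.30 + 0.70 * 0.06                    # 70% of traffic at 6% of the price
    step("model routing", routing, routing)
    step("semantic caching", 0.60, 0.60)            # 40% of requests never hit the API
    step("batch api", 0.90, 0.90)                   # 20% of requests at 50% off
    step("output optimization", 1.0, 0.70)          # 30% fewer output tokens
    return input_cost + output_cost, log


baseline = 250.0 + 1000.0  # 100,000 requests × 1,000 tokens at GPT-4o prices
final, log = stacked_cost(250.0, 1000.0)
for name, saved in log:
    print(f"{name:<20} saves ${saved:,.0f}")
print(f"Optimized: ${final:,.0f}/month, savings: {1 - final / baseline:.0%}")
```

Because every step works on a smaller remaining bill, the individual savings shrink as strategies stack, and the order in which they are applied changes the per-step numbers but not the final total.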
Conclusion
LLM cost optimization is architecture work. It's far easier to build with these patterns from day one than to retrofit them when your invoice becomes alarming.
Prioritized action plan:
- This hour: Add prompt caching to your system prompt (3 lines of code, up to 90% savings on that portion).
- This week: Implement model routing — even a simple length-based heuristic cuts costs significantly.
- This month: Add semantic caching for FAQ-heavy workloads, set up cost monitoring with budget alerts.
- This quarter: Move batch-compatible workloads to Batch API, establish structured output patterns as the default.
Cost optimization doesn't have to degrade user experience. Done right, it funds better features with the money you save.