Practical Optimization Strategies for Cutting LLM API Costs by 90%

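Introduction: Why LLM Costs Are Scarier Than Expected
Strategy 1: Prompt Caching - An Immediate 90% Reduction
- Anthropic Prompt Caching
- OpenAI Automatic Prompt Caching
Strategy 2: Model Routing - 70% Cost Reduction
Strategy 3: Semantic Caching - 100% Savings on Repeated Queries
Strategy 4: Batch API - 50% Savings on Non-Real-Time Work
Strategy 5: Output Token Optimization
Building a Cost Monitoring Dashboard
Model Cost Comparison Table (2025)
Combined Savings Scenario
Closing Thoughts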

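Introduction: Why LLM Costs Are Scarier Than Expected

An API bill of $50 a month during development can suddenly become $50,000 a month as users arrive. This is not an exaggeration.

Let's run the actual numbers:

Scenario: small B2B SaaS, 5,000 daily active users

Usage pattern:
- 1 user × 10 conversations per day
- 1 conversation = 200 input tokens + 300 output tokens

Daily token usage:
5,000 users × 10 conversations × 500 tokens = 25,000,000 tokens/day (25 million tokens)

Monthly:
25,000,000 × 30 = 750,000,000 tokens/month (750 million tokens)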

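Monthly cost comparison (300M input tokens + 450M output tokens):
- GPT-4o:          $2.50/1M input + $10/1M output
  → input $750 + output $4,500 = $5,250/month
- GPT-4o-mini:     $0.15/1M input + $0.60/1M output
  → input $45 + output $270 = $315/month
- Self-hosted Llama: server cost ~$500-2,000/month

Handling the same workload with GPT-4o-mini instead of GPT-4o saves about $4,935 a month, a 94% reduction. This is why cost optimization belongs near the top of the engineering priority list.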

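Strategy 1: Prompt Caching - An Immediate 90% Reduction

This is the most powerful and also the most overlooked optimization. Caching the system prompt or a long context sharply reduces the token cost every time that content is reused.

Anthropic Prompt Caching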

import anthropic

client = anthropic.Anthropic()

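# Company-wide RAG context or a long system prompt
COMPANY_KNOWLEDGE_BASE = """
[Thousands of tokens of company documents, product information, policies, etc.]
...resending this content on every request makes costs explode.
"""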

def chat_with_caching(user_message: str) -> dict:
    response = client.messages.create(
        model="claude-3-5-sonnet-20241022",
        max_tokens=1024,
        system=[
            {
                "type": "text",
                "text": COMPANY_KNOWLEDGE_BASE,
                "cache_control": {"type": "ephemeral"}  # cache this block
            },
            {
                "type": "text",
                "text": "You are a customer support specialist. Answer based on the company information above."
                # Not cached: short and likely to change
            }
        ],
        messages=[{"role": "user", "content": user_message}]
    )

    # Track usage
    usage = response.usage
    print(f"Cache write: {usage.cache_creation_input_tokens} tokens (1.25x cost)")
    print(f"Cache read: {usage.cache_read_input_tokens} tokens (0.1x cost, 90% savings)")
    print(f"Regular input: {usage.input_tokens} tokens (1x cost)")

    return {
        "content": response.content[0].text,
        "cache_saved": usage.cache_read_input_tokens > 0
    }

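# First call: creates the cache (billed at 1.25x)
result1 = chat_with_caching("What is your refund policy?")

# Later calls while the cache entry is still alive: cache hit (billed at 0.1x)
result2 = chat_with_caching("How long does shipping take?")
result3 = chat_with_caching("What is the warranty period?")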

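Cost savings calculation:

System prompt: 5,000 tokens
10,000 requests per day
Without caching: 10,000 × 5,000 = 50M tokens/day
With a 95% cache hit rate: 10,000 × 5,000 × 0.05 + 10,000 × 5,000 × 0.95 × 0.1 = 2.5M + 4.75M = 7.25M token-equivalents/day
Savings: about 85% (slightly less once the 1.25x cache-write rate on misses is counted)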

OpenAI Automatic Prompt Caching

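from openai import OpenAI
client = OpenAI()

# Placeholder: substitute your real system prompt here
LONG_SYSTEM_PROMPT = "..."

# OpenAI automatically caches prompts of 1,024 tokens or more
# Cached input tokens are discounted 50% with no extra configuration
response = client.chat.completions.create(
    model="gpt-4o",
    messages=[
        {
            "role": "system",
            # Cached automatically at 1,024+ tokens (50% discount)
            "content": LONG_SYSTEM_PROMPT  # 2,000+ tokens recommended
        },
        {"role": "user", "content": "Here is my question"}
    ]
)

# Check for a cache hit
usage = response.usage
details = getattr(usage, "prompt_tokens_details", None)
if details is not None:
    cached = details.cached_tokens
    print(f"Cached tokens: {cached} (50% discount applied)")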

Strategy 2: Model Routing - 70% Cost Reduction

Not every request needs the same processing power. Using GPT-4o for a simple question is like using a power drill to tighten a bolt.

from openai import OpenAI
from anthropic import Anthropic
import re

openai_client = OpenAI()
anthropic_client = Anthropic()

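class ModelRouter:
    """Route each request to the best-fit model based on its complexity"""

    # Patterns for simple requests (Korean keywords kept so Korean queries still match)
    SIMPLE_PATTERNS = [
        r"^(what is|what are|define|who is|when was|where is)",
        r"^(번역|translate|어떻게 말해)",
        r"^(yes/no|맞나요|맞아요\?)",
    ]

    # Patterns that call for heavier processing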
    COMPLEX_PATTERNS = [
        r"(analyze|분석|compare|비교|design|설계)",
        r"(step.by.step|단계별|detailed|자세히)",
        r"(code|코드|implement|구현|architecture|아키텍처)",
        r"(explain why|왜|reason|이유|pros.cons|장단점)",
    ]

    def classify_query(self, query: str) -> str:
        """Classify query complexity: 'simple', 'medium', or 'complex'"""
        query_lower = query.lower()

        # Complex requests
        if any(re.search(p, query_lower) for p in self.COMPLEX_PATTERNS):
            return "complex"

        # Simple requests
        if (len(query.split()) < 15 and
                any(re.search(p, query_lower) for p in self.SIMPLE_PATTERNS)):
            return "simple"

        # Additional length-based check
        if len(query) > 500:
            return "complex"

        return "medium"

    def route(self, query: str, context: str = "") -> dict:
        """Route the request to an appropriate model and run it"""

        complexity = self.classify_query(query)

        if complexity == "simple":
            # GPT-4o-mini: a good fit for simple requests, roughly 17x cheaper
            # than GPT-4o at the prices listed in this article
            response = openai_client.chat.completions.create(
                model="gpt-4o-mini",
                messages=[{"role": "user", "content": query}],
                max_tokens=200
            )
            model_used = "gpt-4o-mini"
            cost_multiplier = 1  # baseline

        elif complexity == "medium":
            response = openai_client.chat.completions.create(
                model="gpt-4o-mini",
                messages=[{"role": "user", "content": query}],
                max_tokens=500
            )
            model_used = "gpt-4o-mini"
            cost_multiplier = 1

        else:  # complex
            # Complex reasoning: use GPT-4o (or Claude Sonnet)
            response = openai_client.chat.completions.create(
                model="gpt-4o",
                messages=[{"role": "user", "content": query}]
            )
            model_used = "gpt-4o"
            cost_multiplier = 17  # roughly 17x GPT-4o-mini ($2.50 vs $0.15 input, $10 vs $0.60 output)

        return {
            "answer": response.choices[0].message.content,
            "model": model_used,
            "complexity": complexity,
            "cost_multiplier": cost_multiplier
        }


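# Example usage and cost analysis
router = ModelRouter()

queries = [
    "What is Python?",                              # simple
    "Design a microservice architecture",           # complex
    "How is the weather today?",                    # medium (no pattern matches)
    "Compare RESTful APIs and GraphQL in detail",   # complex
    "When was Python 3.11 released?",               # simple
]

# Assuming 80% of traffic is simple/medium and 20% is complex,
# measured in units of one GPT-4o-mini request (GPT-4o is about 17x):
# Everything on GPT-4o = 100% × 17 = 17
# With routing         = 80% × 1 + 20% × 17 = 0.8 + 3.4 = 4.2
# → about 75% cost reduction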

Tuning on real data: adjusting the routing thresholds through A/B tests can push the savings to as much as 80% without a measurable loss in quality.

Strategy 3: Semantic Caching - 100% Savings on Repeated Queries

When identical or very similar queries repeat, there is no need to call the API every time.

import hashlib
import json
import numpy as np
from openai import OpenAI
from datetime import datetime, timedelta

client = OpenAI()

class SemanticCache:
    """
    Cache responses for semantically similar queries
    "What is RAG?" and "Explain RAG to me" return the same cache entry
    """

    def __init__(self, similarity_threshold: float = 0.95, ttl_hours: int = 24):
        self.cache = {}  # {query_hash: {embedding, response, created_at}}
        self.threshold = similarity_threshold
        self.ttl = timedelta(hours=ttl_hours)
        self.stats = {"hits": 0, "misses": 0, "saved_tokens": 0}

    def _get_embedding(self, text: str) -> list:
        response = client.embeddings.create(
            input=text,
            model="text-embedding-3-small",
            dimensions=256  # small dimension for fast cache lookup
        )
        return response.data[0].embedding

    def _cosine_similarity(self, a: list, b: list) -> float:
        a_np, b_np = np.array(a), np.array(b)
        return float(np.dot(a_np, b_np) / (np.linalg.norm(a_np) * np.linalg.norm(b_np)))

    def get(self, query: str) -> tuple:
        """Cache lookup. Returns: (cached_response or None, similarity_score)"""
        query_embedding = self._get_embedding(query)

        best_similarity = 0
        best_response = None

        for key, entry in self.cache.items():
            # Skip entries past their TTL
            if datetime.now() - entry["created_at"] > self.ttl:
                continue

            similarity = self._cosine_similarity(query_embedding, entry["embedding"])
            if similarity > best_similarity:
                best_similarity = similarity
                best_response = entry["response"]

        if best_similarity >= self.threshold:
            self.stats["hits"] += 1
            return best_response, best_similarity

        self.stats["misses"] += 1
        return None, best_similarity

    def set(self, query: str, response: str, tokens_used: int) -> None:
        """Store an entry in the cache"""
        embedding = self._get_embedding(query)
        key = hashlib.md5(query.encode()).hexdigest()
        self.cache[key] = {
            "embedding": embedding,
            "response": response,
            "created_at": datetime.now(),
            "tokens": tokens_used
        }

    def chat(self, query: str) -> dict:
        """Chat with a cache lookup first"""
        cached_response, similarity = self.get(query)

        if cached_response is not None:
            return {
                "response": cached_response,
                "cache_hit": True,
                "similarity": similarity,
                "api_cost": 0  # cache hit: no chat completion call (the embedding lookup is still billed)
            }

        # Cache miss: call the API
        response = client.chat.completions.create(
            model="gpt-4o-mini",
            messages=[{"role": "user", "content": query}]
        )
        answer = response.choices[0].message.content
        tokens = response.usage.total_tokens

        # Store in the cache
        self.set(query, answer, tokens)

        return {
            "response": answer,
            "cache_hit": False,
            "similarity": 0,
            "api_cost": tokens * 0.000001  # rough cost estimate
        }


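# Example usage
cache = SemanticCache(similarity_threshold=0.92)

# Pre-warm with FAQs (cache frequently asked questions ahead of time)
faq_questions = [
    "What is your refund policy?",
    "How long does shipping take?",
    "How do I sign up?",
]

for q in faq_questions:
    cache.chat(q)  # first call: hits the API, then caches the answer

# Similar questions are served from the cache when their similarity clears the threshold
result = cache.chat("How do I return an item?")  # close to "refund policy"; a hit is likely, not guaranteed
result2 = cache.chat("How long will delivery take?")  # paraphrase of "shipping take"; a hit is likely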

Effect in a customer support scenario: 60-70% of repetitive FAQ questions match a semantically similar earlier question. That alone can remove more than half of all API calls.

Strategy 4: Batch API - 50% Savings on Non-Real-Time Work

For tasks that do not need a real-time response, the Batch API gives a 50% discount.

from openai import OpenAI
import json
import tempfile

client = OpenAI()

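def run_batch_analysis(texts: list) -> str:
    """
    Process large volumes of text analysis at 50% lower cost
    - Completed within 24 hours (not real time)
    - 50% discount applied automatically
    Suitable tasks: sentiment analysis, classification, summarization, translation
    Returns the batch ID to poll later.
    """

    # Prepare the batch requests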
    requests = []
    for i, text in enumerate(texts):
        requests.append({
            "custom_id": f"request-{i}",
            "method": "POST",
            "url": "/v1/chat/completions",
            "body": {
                "model": "gpt-4o-mini",
                "messages": [
                    {
                        "role": "system",
                        "content": "Analyze the sentiment of the following text. Respond in JSON: {\"sentiment\": \"positive/negative/neutral\", \"confidence\": 0.0-1.0}"
                    },
                    {"role": "user", "content": text}
                ],
                "max_tokens": 50
            }
        })

    # Write the batch requests to a temporary JSONL file
    with tempfile.NamedTemporaryFile(mode='w', suffix='.jsonl', delete=False, encoding='utf-8') as f:
        for req in requests:
            f.write(json.dumps(req, ensure_ascii=False) + '\n')
        temp_path = f.name

    # Upload the file
    with open(temp_path, 'rb') as f:
        batch_file = client.files.create(file=f, purpose="batch")

    # Submit the batch (processed within 24 hours, 50% discount)
    batch = client.batches.create(
        input_file_id=batch_file.id,
        endpoint="/v1/chat/completions",
        completion_window="24h"
    )

    print(f"Batch ID: {batch.id}")
    print(f"Status: {batch.status}")
    print("Cost: 50% less than real-time requests")
    return batch.id

def get_batch_results(batch_id: str) -> list:
    """Fetch the results once the batch has completed"""
    batch = client.batches.retrieve(batch_id)

    if batch.status != "completed":
        print(f"Still processing: {batch.status}")
        return []

    # Download the results file
    result_file = client.files.content(batch.output_file_id)
    results = []
    for line in result_file.text.strip().split('\n'):
        result = json.loads(line)
        response_body = result["response"]["body"]
        content = response_body["choices"][0]["message"]["content"]
        try:
            parsed = json.loads(content)
        except json.JSONDecodeError:
            parsed = content  # the model did not return valid JSON; keep the raw text
        results.append({
            "id": result["custom_id"],
            "result": parsed
        })

    return results

# Example: sentiment analysis of 10,000 product reviews
reviews = ["Really great product!", "Delivery was far too slow"] * 5000  # 10,000 items
batch_id = run_batch_analysis(reviews)
# Once the batch has completed (within 24 hours):
results = get_batch_results(batch_id)

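Tasks well suited to batch processing:

Bulk classification, summarization, or translation of existing data
Nightly report generation
Offline customer feedback analysis
Content moderation (non-real-time)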

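Strategy 5: Output Token Optimization

Output tokens cost roughly 4-5x as much as input tokens for the models priced in this article. Shorter outputs cut the bill substantially.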

from openai import OpenAI
from pydantic import BaseModel
from typing import Literal

client = OpenAI()

# Bad: asks for a long prose answer
def analyze_sentiment_verbose(text: str) -> str:
    """Inefficient: produces needlessly long output"""
    response = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[{
            "role": "user",
            "content": f"Please analyze the sentiment of the following text: {text}"
        }]
    )
    # Example output: "This text expresses positive sentiment. The author is clearly satisfied with the product, and..."
    # Typically 100-200 output tokens
    return response.choices[0].message.content


# Good: concise structured output
class SentimentResult(BaseModel):
    sentiment: Literal["positive", "negative", "neutral"]
    confidence: float
    key_phrase: str  # at most 5 words

def analyze_sentiment_structured(text: str) -> SentimentResult:
    """Efficient: structured output with only the needed fields"""
    response = client.beta.chat.completions.parse(
        model="gpt-4o-mini",
        messages=[{
            "role": "user",
            "content": f"Analyze sentiment: {text}"
        }],
        response_format=SentimentResult
    )
    # Example output: {"sentiment": "positive", "confidence": 0.92, "key_phrase": "really great product"}
    # Typically 20-30 output tokens, about 80% fewer
    return response.choices[0].message.parsed


# Limit output length with max_tokens
def summarize_with_limit(text: str, max_words: int = 50) -> str:
    """Explicitly cap the output length"""
    response = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[{
            "role": "system",
            "content": f"Write the summary in {max_words} words or fewer."
        }, {
            "role": "user",
            "content": f"Summarize the following: {text}"
        }],
        max_tokens=max_words * 2  # headroom for the token-to-word ratio; Korean text may need more tokens per word
    )
    return response.choices[0].message.content
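
To see why trimming output pays off, it can help to split a request's cost into its input and output parts. The following is a minimal sketch, not a billing tool: the `PRICES` table copies the per-1M-token prices quoted in this article, and the 100-token prompt plus the 150 and 25 output-token figures are hypothetical values picked from the ranges mentioned in the comments above.

```python
# USD per 1M tokens, as listed in this article
PRICES = {
    "gpt-4o": {"input": 2.50, "output": 10.00},
    "gpt-4o-mini": {"input": 0.15, "output": 0.60},
}

def cost_split(model: str, input_tokens: int, output_tokens: int) -> dict:
    """Per-request input cost, output cost, total, and the output share of the total."""
    price = PRICES[model]
    input_cost = input_tokens * price["input"] / 1_000_000
    output_cost = output_tokens * price["output"] / 1_000_000
    total = input_cost + output_cost
    return {
        "input": input_cost,
        "output": output_cost,
        "total": total,
        "output_share": output_cost / total if total else 0.0,
    }

# Hypothetical request: 100 input tokens, prose answer vs. structured answer
verbose = cost_split("gpt-4o-mini", input_tokens=100, output_tokens=150)
structured = cost_split("gpt-4o-mini", input_tokens=100, output_tokens=25)
reduction = 1 - structured["total"] / verbose["total"]

print(f"Output share of the verbose request: {verbose['output_share']:.0%}")
print(f"Total cost reduction from shorter output: {reduction:.0%}")
```

The total saving is smaller than the output-token saving because the input cost stays the same, which is worth keeping in mind when a prompt is long and the answer is short.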

Building a Cost Monitoring Dashboard

Without monitoring, optimization is impossible.

import time
from collections import defaultdict
from datetime import datetime, date
from openai import OpenAI

client = OpenAI()

class CostTracker:
    """Track LLM API costs in real time"""

    # 2025 prices (USD per 1M tokens)
    PRICING = {
        "gpt-4o": {"input": 2.50, "output": 10.00},
        "gpt-4o-mini": {"input": 0.15, "output": 0.60},
        "claude-3-5-sonnet-20241022": {"input": 3.00, "output": 15.00},
        "claude-3-haiku-20240307": {"input": 0.25, "output": 1.25},
    }

    def __init__(self, daily_budget_usd: float = 100.0):
        self.daily_budget = daily_budget_usd
        self.usage = defaultdict(lambda: {"input_tokens": 0, "output_tokens": 0, "cost": 0})
        self.daily_cost = defaultdict(float)

    def track(self, model: str, input_tokens: int, output_tokens: int) -> float:
        """Record usage and compute the cost"""
        if model not in self.PRICING:
            print(f"Warning: no pricing for model '{model}'; usage not tracked")
            return 0.0

        pricing = self.PRICING[model]
        cost = (input_tokens * pricing["input"] + output_tokens * pricing["output"]) / 1_000_000

        today = date.today().isoformat()
        self.usage[model]["input_tokens"] += input_tokens
        self.usage[model]["output_tokens"] += output_tokens
        self.usage[model]["cost"] += cost
        self.daily_cost[today] += cost

        # Budget threshold warning
        if self.daily_cost[today] > self.daily_budget * 0.8:
            print(f"Warning: 80% of the daily budget reached! (${self.daily_cost[today]:.2f} / ${self.daily_budget:.2f})")

        if self.daily_cost[today] > self.daily_budget:
            raise RuntimeError(f"Daily budget exceeded! ${self.daily_cost[today]:.2f} > ${self.daily_budget:.2f}")

        return cost

    def report(self) -> None:
        """Print a cost report"""
        print("\n=== LLM API Cost Report ===")
        total_cost = 0
        for model, data in self.usage.items():
            print(f"\nModel: {model}")
            print(f"  Input tokens: {data['input_tokens']:,}")
            print(f"  Output tokens: {data['output_tokens']:,}")
            print(f"  Cost: ${data['cost']:.4f}")
            total_cost += data['cost']
        print(f"\nTotal cost: ${total_cost:.4f}")

        today = date.today().isoformat()
        print(f"Cost today: ${self.daily_cost[today]:.4f} / ${self.daily_budget:.2f} (budget)")


# Global tracker
tracker = CostTracker(daily_budget_usd=50.0)

def tracked_completion(model: str, messages: list, **kwargs) -> str:
    """API call wrapper with cost tracking"""
    response = client.chat.completions.create(
        model=model,
        messages=messages,
        **kwargs
    )
    tracker.track(
        model=model,
        input_tokens=response.usage.prompt_tokens,
        output_tokens=response.usage.completion_tokens
    )
    return response.choices[0].message.content
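
The tracker above records cost after a call has already been billed, so the budget check can only fire once the money is spent. One complementary approach is a pre-flight check that estimates the worst case before sending. The sketch below is an illustration under stated assumptions: `worst_case_cost` and `can_afford` are hypothetical helpers, the prices are the ones quoted in this article, and the 4-characters-per-token ratio is only a rough heuristic for English text (Korean and code will differ).

```python
# USD per 1M tokens, as listed in this article
PRICING = {
    "gpt-4o": {"input": 2.50, "output": 10.00},
    "gpt-4o-mini": {"input": 0.15, "output": 0.60},
}

def worst_case_cost(model: str, prompt_chars: int, max_tokens: int,
                    chars_per_token: float = 4.0) -> float:
    """Upper-bound cost of one call: estimated prompt tokens plus a full max_tokens reply."""
    estimated_input = prompt_chars / chars_per_token  # heuristic, not a real tokenizer
    price = PRICING[model]
    return (estimated_input * price["input"] + max_tokens * price["output"]) / 1_000_000

def can_afford(spent_today: float, daily_budget: float, model: str,
               prompt_chars: int, max_tokens: int) -> bool:
    """True if the call fits in the remaining daily budget even in the worst case."""
    return spent_today + worst_case_cost(model, prompt_chars, max_tokens) <= daily_budget

# Hypothetical numbers: $50 budget, a 4,000-character prompt, up to 500 reply tokens
print(can_afford(49.99, 50.0, "gpt-4o", prompt_chars=4_000, max_tokens=500))
print(can_afford(49.995, 50.0, "gpt-4o", prompt_chars=4_000, max_tokens=500))
```

A check like this only works when `max_tokens` is set on every call; without it the reply length, and therefore the worst case, is unbounded from the caller's point of view.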

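Model Cost Comparison Table (2025)

Model	Input (per 1M tokens)	Output (per 1M tokens)	Characteristics	Main uses
GPT-4o	$2.50	$10.00	Top performance	Complex reasoning, multimodal
GPT-4o-mini	$0.15	$0.60	Balance of performance and cost	Most tasks
Claude 3.5 Sonnet	$3.00	$15.00	Strongest at coding	Coding, analysis, long documents
Claude 3 Haiku	$0.25	$1.25	Fast and cheap	Simple tasks
Llama 3.1 70B (self-hosted)	~$0.05-0.15	~$0.05-0.15	Large savings at scale	High-volume self-operated workloads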

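Combined Savings Scenario

An estimate of the savings when all five strategies are applied:

Baseline: 100,000 requests per month, GPT-4o, an average of 1,000 input tokens and 1,000 output tokens per request

Monthly cost before optimization:
100,000 × 1,000 tokens × ($2.50 + $10.00) / 1,000,000 = $1,250/month

After applying the strategies:
1. Prompt caching (applied to 80% of requests, 85% savings): -$850
2. Model routing (70% to mini): a further -$175
3. Semantic caching (40% hit rate): a further -$45
4. Batch API (20% moved to batch): a further -$25
5. Output optimization (30% reduction): a further -$75

Monthly cost after optimization: about $80/month
Savings: 94%

This is an idealized scenario, though. Item 1 in particular applies the caching discount to the whole bill, whereas prompt caching discounts only input tokens ($250 of the $1,250 here). In practice, savings of 50-80% are typical.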
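
One way to sanity-check a combined scenario like this is to apply each strategy to what the previous ones left over, rather than subtracting separate dollar amounts from the original bill. The sketch below does that. It is a hedged illustration, not the article's own accounting: `monthly_cost` and all of its parameters are hypothetical names, the rates passed in are the ones quoted in the scenario, and it assumes the caching discount applies to input tokens only and that the batch discount applies uniformly to the traffic moved to batch.

```python
# USD per 1M tokens (input, output), as listed in this article
PRICES = {"gpt-4o": (2.50, 10.00), "gpt-4o-mini": (0.15, 0.60)}

def monthly_cost(
    requests: int = 100_000,
    input_tokens: int = 1_000,
    output_tokens: int = 1_000,
    semantic_hit_rate: float = 0.0,      # share of requests answered from the semantic cache
    mini_share: float = 0.0,             # share of remaining requests routed to gpt-4o-mini
    output_cut: float = 0.0,             # fractional reduction in output tokens
    cached_input_share: float = 0.0,     # share of input tokens served from the prompt cache
    cached_input_discount: float = 0.0,  # discount on those cached input tokens
    batch_share: float = 0.0,            # share of remaining traffic moved to the Batch API
    batch_discount: float = 0.5,
) -> float:
    """Monthly cost in USD with the strategies applied on top of each other."""
    live = requests * (1 - semantic_hit_rate)  # requests that still reach the API
    input_factor = 1 - cached_input_share * cached_input_discount
    total = 0.0
    for model, share in (("gpt-4o-mini", mini_share), ("gpt-4o", 1 - mini_share)):
        price_in, price_out = PRICES[model]
        n = live * share
        total += n * input_tokens * input_factor * price_in / 1_000_000
        total += n * output_tokens * (1 - output_cut) * price_out / 1_000_000
    return total * (1 - batch_share * batch_discount)

baseline = monthly_cost()
optimized = monthly_cost(
    semantic_hit_rate=0.40, mini_share=0.70, output_cut=0.30,
    cached_input_share=0.80, cached_input_discount=0.85, batch_share=0.20,
)
print(f"Baseline:  ${baseline:,.2f}/month")
print(f"Optimized: ${optimized:,.2f}/month ({1 - optimized / baseline:.0%} lower)")
```

Because each strategy acts on a shrinking remainder, the combined saving is less than the sum of the individual savings, and changing one rate shows how sensitive the total is to that assumption.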

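Closing Thoughts

LLM API cost optimization is part of architecture design. Designing with these strategies in mind from the start is far easier than retrofitting them after discovering that the bill is too high.

Order of action:

Immediately: apply prompt caching (three lines of code, up to 90% off cached input tokens)
This week: implement model routing (complexity-based classification)
This month: build semantic caching and cost monitoring
Long term: move non-real-time tasks to the Batch API

Done carefully, cost optimization does not have to hurt the user experience. It can cut unnecessary latency, and the money saved can fund better features.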

Practical LLM API Cost Optimization: How to Cut Costs by 90%

Why LLM Costs Are Scarier Than You Think
Strategy 1: Prompt Caching — Instant 90% Reduction
- Anthropic Prompt Caching
- OpenAI Automatic Prompt Caching
Strategy 2: Model Routing — 70% Cost Reduction
Strategy 3: Semantic Caching — 100% Off for Repeated Queries
Strategy 4: Batch API — 50% Off for Non-Real-Time Work
Strategy 5: Output Token Optimization
Cost Monitoring Dashboard
Model Cost Reference (2025)
Combined Savings Scenario
Conclusion

Why LLM Costs Are Scarier Than You Think

$50/month in development. Then users arrive, and it becomes $50,000/month. This isn't an exaggeration.

Scenario: B2B SaaS, 5,000 daily active users

Usage pattern:
- 1 user × 10 conversations per day
- 1 conversation = 200 input tokens + 300 output tokens

Daily token usage:
5,000 users × 10 conversations × 500 tokens = 25,000,000 tokens/day

Monthly:
25,000,000 × 30 = 750,000,000 tokens/month

Monthly cost comparison (300M input tokens + 450M output tokens):
GPT-4o ($2.50/1M input + $10/1M output):
  → Input $750 + Output $4,500 = $5,250/month

GPT-4o-mini ($0.15/1M input + $0.60/1M output):
  → Input $45 + Output $270 = $315/month

Self-hosted Llama: ~$500-2,000/month

Switching from GPT-4o to GPT-4o-mini saves about $4,935 per month (94%) for this single example. This is why cost optimization belongs at the top of your engineering priorities.
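
The arithmetic above is easy to get wrong by a factor of ten, so it is worth scripting. Here is a minimal sketch, assuming the uniform usage pattern from the scenario; `monthly_cost_usd` is a hypothetical helper name, and the prices are the per-1M-token figures quoted above.

```python
def monthly_cost_usd(
    daily_users: int,
    conversations_per_user: int,
    input_tokens: int,
    output_tokens: int,
    input_price_per_m: float,
    output_price_per_m: float,
    days: int = 30,
) -> float:
    """Monthly API cost in USD when every conversation uses the same token counts."""
    conversations = daily_users * conversations_per_user * days
    input_cost = conversations * input_tokens * input_price_per_m / 1_000_000
    output_cost = conversations * output_tokens * output_price_per_m / 1_000_000
    return input_cost + output_cost

# Scenario from the text: 5,000 users, 10 conversations/day, 200 input + 300 output tokens
gpt_4o = monthly_cost_usd(5_000, 10, 200, 300, 2.50, 10.00)
gpt_4o_mini = monthly_cost_usd(5_000, 10, 200, 300, 0.15, 0.60)

print(f"GPT-4o:      ${gpt_4o:,.2f}/month")
print(f"GPT-4o-mini: ${gpt_4o_mini:,.2f}/month")
```

Real traffic is rarely uniform, so feeding the function averages from production logs gives a more useful estimate than round numbers.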

Strategy 1: Prompt Caching — Instant 90% Reduction

The most powerful and most overlooked optimization. When your system prompt or long context is cached, subsequent requests that reuse the same prefix pay a fraction of the normal input price for the cached portion: 10% with Anthropic, 50% with OpenAI.

Anthropic Prompt Caching

import anthropic

client = anthropic.Anthropic()

# A long system prompt that gets sent with every single request
COMPANY_KNOWLEDGE_BASE = """
[Thousands of tokens of company documentation, product info, policies...]
...sending this on every request without caching is burning money.
"""

def chat_with_caching(user_message: str) -> dict:
    response = client.messages.create(
        model="claude-3-5-sonnet-20241022",
        max_tokens=1024,
        system=[
            {
                "type": "text",
                "text": COMPANY_KNOWLEDGE_BASE,
                "cache_control": {"type": "ephemeral"}  # Cache this block!
            },
            {
                "type": "text",
                "text": "You are a customer support specialist. Answer based on the company info above."
                # Not cached: short, may change
            }
        ],
        messages=[{"role": "user", "content": user_message}]
    )

    usage = response.usage
    print(f"Cache write tokens: {usage.cache_creation_input_tokens} (1.25x cost)")
    print(f"Cache read tokens:  {usage.cache_read_input_tokens} (0.1x cost — 90% off!)")
    print(f"Normal input tokens: {usage.input_tokens} (1x cost)")

    return {
        "content": response.content[0].text,
        "cache_hit": usage.cache_read_input_tokens > 0
    }

# First call: cache creation (costs 1.25x, written to cache)
result1 = chat_with_caching("What's your refund policy?")

# All subsequent calls: cache hit (costs 0.1x — 90% savings!)
result2 = chat_with_caching("How long does shipping take?")
result3 = chat_with_caching("What's the warranty period?")

Cost math for this setup:

System prompt: 5,000 tokens
Daily requests: 10,000
Without caching: 10,000 × 5,000 = 50M input tokens/day
With 95% cache hit rate: 2.5M (misses) + 47.5M × 0.1 (hits) = 7.25M effective tokens/day
Savings: about 85% on system prompt tokens (about 84% if misses are billed at the 1.25x cache-write rate)
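
The same calculation can be written as a small function so that different hit rates are easy to try. This is a sketch under stated assumptions: `effective_input_tokens` is a hypothetical helper, it treats every cache miss as a cache write billed at 1.25x and every hit as a read billed at 0.1x (the multipliers shown in the code comments above), and it ignores the uncached part of each request.

```python
def effective_input_tokens(
    prompt_tokens: int,
    requests: int,
    hit_rate: float,
    write_multiplier: float = 1.25,
    read_multiplier: float = 0.10,
) -> float:
    """Billable token-equivalents for a cached prefix that is sent `requests` times."""
    hits = requests * hit_rate
    misses = requests - hits
    return prompt_tokens * (misses * write_multiplier + hits * read_multiplier)

baseline = 5_000 * 10_000  # no caching: every request pays for the full prefix
cached = effective_input_tokens(5_000, 10_000, hit_rate=0.95)
savings = 1 - cached / baseline

print(f"Without caching: {baseline:,} tokens/day")
print(f"With caching:    {cached:,.0f} token-equivalents/day")
print(f"Savings:         {savings:.1%}")
```

The hit rate is the input that matters most here. Cache entries expire after a short idle period, so low-traffic endpoints will see a lower hit rate than busy ones.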

OpenAI Automatic Prompt Caching

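from openai import OpenAI
client = OpenAI()

# Placeholder: substitute your real system prompt here
LONG_SYSTEM_PROMPT = "..."

# OpenAI automatically caches prompts of 1,024 tokens or more
# Cached input tokens get a 50% discount automatically, no configuration needed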
response = client.chat.completions.create(
    model="gpt-4o",
    messages=[
        {
            "role": "system",
            # Longer than 1,024 tokens → auto-cached at 50% discount
            "content": LONG_SYSTEM_PROMPT   # 2,000+ tokens recommended
        },
        {"role": "user", "content": "user question here"}
    ]
)

# Check if cache was hit
usage = response.usage
if hasattr(usage, 'prompt_tokens_details'):
    cached = usage.prompt_tokens_details.cached_tokens
    total_input = usage.prompt_tokens
    savings_pct = (cached / total_input * 50) if total_input > 0 else 0
    print(f"Cached: {cached}/{total_input} tokens ({savings_pct:.1f}% cost reduction)")

Strategy 2: Model Routing — 70% Cost Reduction

Not every request needs the same capability. Routing simple questions to cheap models while reserving expensive models for complex tasks is one of the highest-leverage optimizations available.

from openai import OpenAI
import re

client = OpenAI()

class ModelRouter:
    """Route requests to the cheapest model that can handle them"""

    SIMPLE_PATTERNS = [
        r"^(what is|what are|define|who is|when was|where is)",
        r"^(translate|how do you say|what does .* mean)",
        r"(yes or no|true or false|is it)",
    ]

    COMPLEX_PATTERNS = [
        r"(analyze|compare|design|architect|evaluate)",
        r"(step.by.step|detailed|comprehensive|in.depth)",
        r"(implement|code|build|create a system|write a program)",
        r"(explain why|what causes|pros and cons|trade.?offs)",
        r"(review|critique|refactor|optimize)",
    ]

    def classify(self, query: str) -> str:
        lower = query.lower()

        if any(re.search(p, lower) for p in self.COMPLEX_PATTERNS):
            return "complex"

        if (len(query.split()) < 20 and
                any(re.search(p, lower) for p in self.SIMPLE_PATTERNS)):
            return "simple"

        if len(query) > 500:
            return "complex"

        return "medium"

    def complete(self, query: str, system: str = "") -> dict:
        complexity = self.classify(query)

        messages = []
        if system:
            messages.append({"role": "system", "content": system})
        messages.append({"role": "user", "content": query})

        if complexity in ("simple", "medium"):
            # GPT-4o-mini: ~17x cheaper than GPT-4o at the prices quoted in this article
            response = client.chat.completions.create(
                model="gpt-4o-mini",
                messages=messages,
                max_tokens=500 if complexity == "medium" else 200
            )
            model = "gpt-4o-mini"
        else:
            # Complex reasoning: use capable model
            response = client.chat.completions.create(
                model="gpt-4o",
                messages=messages
            )
            model = "gpt-4o"

        return {
            "answer": response.choices[0].message.content,
            "model": model,
            "complexity": complexity,
            "tokens": response.usage.total_tokens
        }


# Real-world cost impact:
# Assume traffic distribution: 60% simple, 25% medium, 15% complex
# Unit: cost of 1M input tokens + 1M output tokens
# Cost with all GPT-4o:    100% × $12.50 = $12.50
# Cost with routing:       85% × $0.75 + 15% × $12.50 = $0.64 + $1.88 = $2.51
# → ~80% cost reduction

router = ModelRouter()

examples = [
    "What is a transformer architecture?",     # complex → GPT-4o ("architect" matches a complex pattern)
    "Implement an LRU cache in Python",         # complex → GPT-4o
    "What does HNSW stand for?",                # medium → mini (no simple pattern matches)
    "Design a distributed rate limiting system", # complex → GPT-4o
]

for q in examples:
    result = router.complete(q)
    print(f"[{result['complexity'].upper()}] → {result['model']} | {q[:50]}")
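
To estimate what a given traffic mix costs before building the router, a blended per-request cost is enough. The following is a rough sketch: `request_cost` and `blended_cost` are hypothetical helpers, the prices are the ones quoted in this article, and the 200-input/300-output request size is borrowed from the opening scenario. It assumes every request has the same size regardless of which model handles it.

```python
# USD per 1M tokens, as listed in this article
PRICES = {
    "gpt-4o": {"input": 2.50, "output": 10.00},
    "gpt-4o-mini": {"input": 0.15, "output": 0.60},
}

def request_cost(model: str, input_tokens: int, output_tokens: int) -> float:
    """Cost in USD of a single request on the given model."""
    price = PRICES[model]
    return (input_tokens * price["input"] + output_tokens * price["output"]) / 1_000_000

def blended_cost(traffic_mix: dict, input_tokens: int, output_tokens: int) -> float:
    """Average cost per request for a mix of {model: share}; shares should sum to 1."""
    return sum(
        share * request_cost(model, input_tokens, output_tokens)
        for model, share in traffic_mix.items()
    )

all_gpt4o = blended_cost({"gpt-4o": 1.0}, 200, 300)
routed = blended_cost({"gpt-4o-mini": 0.85, "gpt-4o": 0.15}, 200, 300)
reduction = 1 - routed / all_gpt4o

print(f"All GPT-4o: ${all_gpt4o:.6f} per request")
print(f"Routed:     ${routed:.6f} per request ({reduction:.1%} lower)")
```

In practice complex requests tend to be longer than simple ones, so the real saving is usually somewhat lower than a uniform-size estimate suggests.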

Strategy 3: Semantic Caching — 100% Off for Repeated Queries

When users ask similar questions repeatedly — which they always do in any sufficiently large user base — you can return cached responses instead of making API calls.

import hashlib
import numpy as np
from openai import OpenAI
from datetime import datetime, timedelta

client = OpenAI()

class SemanticCache:
    """
    Cache responses by semantic meaning, not exact text.
    "What is RAG?" and "Explain RAG to me" return the same cached result.
    """

    def __init__(self, similarity_threshold: float = 0.92, ttl_hours: int = 24):
        self.cache: dict = {}
        self.threshold = similarity_threshold
        self.ttl = timedelta(hours=ttl_hours)
        self.stats = {"hits": 0, "misses": 0}

    def _embed(self, text: str) -> list:
        return client.embeddings.create(
            input=text,
            model="text-embedding-3-small",
            dimensions=256   # Small dims for fast cache lookup
        ).data[0].embedding

    def _similarity(self, a: list, b: list) -> float:
        a_np, b_np = np.array(a), np.array(b)
        return float(np.dot(a_np, b_np) / (np.linalg.norm(a_np) * np.linalg.norm(b_np)))

    def lookup(self, query: str) -> tuple:
        """Returns (cached_response, similarity) or (None, 0)"""
        query_emb = self._embed(query)
        best_score, best_response = 0.0, None

        for entry in self.cache.values():
            if datetime.now() - entry["ts"] > self.ttl:
                continue
            score = self._similarity(query_emb, entry["emb"])
            if score > best_score:
                best_score, best_response = score, entry["response"]

        if best_score >= self.threshold:
            self.stats["hits"] += 1
            return best_response, best_score

        self.stats["misses"] += 1
        return None, best_score

    def store(self, query: str, response: str) -> None:
        key = hashlib.md5(query.encode()).hexdigest()
        self.cache[key] = {
            "emb": self._embed(query),
            "response": response,
            "ts": datetime.now()
        }

    def ask(self, query: str, model: str = "gpt-4o-mini") -> dict:
        cached, score = self.lookup(query)

        if cached is not None:
            return {"response": cached, "source": "cache", "similarity": score}

        response = client.chat.completions.create(
            model=model,
            messages=[{"role": "user", "content": query}]
        ).choices[0].message.content

        self.store(query, response)
        return {"response": response, "source": "api", "similarity": 0}

    @property
    def hit_rate(self) -> float:
        total = self.stats["hits"] + self.stats["misses"]
        return self.stats["hits"] / total if total else 0


# Pre-warm cache with FAQ
cache = SemanticCache(similarity_threshold=0.92)
faqs = [
    "What is your refund policy?",
    "How long does shipping take?",
    "How do I reset my password?",
    "What payment methods do you accept?",
]
for q in faqs:
    cache.ask(q)  # Stores in cache

# These semantically similar queries hit the cache when their similarity clears the threshold:
r1 = cache.ask("Can I get a refund?")         # Similar to "refund policy"
r2 = cache.ask("How fast is delivery?")        # Similar to "shipping take"
r3 = cache.ask("I forgot my password")         # Similar to "reset password"

print(f"Cache hit rate: {cache.hit_rate:.1%}")  # 42.9% if all three hit (3 hits / 7 lookups, warm-up misses included)

Expected impact on customer support workloads: 60-70% of queries are semantically similar to previously answered questions. Semantic caching with a 0.90-0.95 threshold typically achieves 40-65% hit rates in production.
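
Choosing the threshold is a trade-off: a lower value serves more queries from the cache but also returns more answers to questions that only look similar. One way to pick it is to label a sample of (query, nearest cached query) pairs by hand and measure both rates. The sketch below assumes such a sample exists; `threshold_report` and the `sample` data are hypothetical and only illustrate the bookkeeping.

```python
def threshold_report(pairs, thresholds):
    """For each threshold, report the hit rate and the share of hits that are wrong.

    pairs: iterable of (similarity, same_intent) tuples from a hand-labeled sample.
    """
    pairs = list(pairs)
    rows = []
    for t in thresholds:
        served = [same for sim, same in pairs if sim >= t]  # queries answered from cache
        wrong = sum(1 for same in served if not same)       # cached answer did not fit
        rows.append({
            "threshold": t,
            "hit_rate": len(served) / len(pairs) if pairs else 0.0,
            "false_hit_rate": wrong / len(served) if served else 0.0,
        })
    return rows

# Hypothetical labeled sample: (cosine similarity, does the cached answer actually fit?)
sample = [
    (0.97, True), (0.95, True), (0.93, False),
    (0.91, True), (0.88, False), (0.80, False),
]

report = threshold_report(sample, [0.90, 0.92, 0.95])
for row in report:
    print(f"threshold={row['threshold']:.2f}  "
          f"hit_rate={row['hit_rate']:.0%}  false_hit_rate={row['false_hit_rate']:.0%}")
```

For customer-facing answers a wrong cache hit is usually more costly than a missed one, so it is reasonable to start with a high threshold and lower it only while the false-hit rate stays acceptable.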

Strategy 4: Batch API — 50% Off for Non-Real-Time Work

For tasks that don't need instant responses, OpenAI's Batch API delivers 50% savings.

from openai import OpenAI
import json
import tempfile

client = OpenAI()

def submit_batch_job(texts: list, task_description: str) -> str:
    """
    Process large volumes of text at 50% cost reduction.
    Completed within 24 hours.
    Ideal for: sentiment analysis, classification, summarization, translation
    """

    requests = [
        {
            "custom_id": f"item-{i}",
            "method": "POST",
            "url": "/v1/chat/completions",
            "body": {
                "model": "gpt-4o-mini",
                "messages": [
                    {"role": "system", "content": task_description},
                    {"role": "user", "content": text}
                ],
                "max_tokens": 100,
                "response_format": {"type": "json_object"}
            }
        }
        for i, text in enumerate(texts)
    ]

    # Write JSONL file
    with tempfile.NamedTemporaryFile(mode='w', suffix='.jsonl', delete=False) as f:
        for req in requests:
            f.write(json.dumps(req) + '\n')
        tmp_path = f.name

    # Upload and submit
    with open(tmp_path, 'rb') as f:
        batch_file = client.files.create(file=f, purpose="batch")

    batch = client.batches.create(
        input_file_id=batch_file.id,
        endpoint="/v1/chat/completions",
        completion_window="24h"
    )

    print(f"Batch submitted: {batch.id}")
    print(f"Cost: 50% less than synchronous API")
    print(f"Expected completion: within 24 hours")
    return batch.id


def retrieve_batch_results(batch_id: str) -> list:
    batch = client.batches.retrieve(batch_id)

    if batch.status != "completed":
        print(f"Status: {batch.status} — not ready yet")
        return []

    content = client.files.content(batch.output_file_id)
    results = []
    for line in content.text.strip().split('\n'):
        item = json.loads(line)
        body = item["response"]["body"]
        answer = body["choices"][0]["message"]["content"]
        results.append({
            "id": item["custom_id"],
            "result": json.loads(answer) if answer.startswith('{') else answer
        })

    return results


# Example: Classify roughly 10,000 support tickets overnight
tickets = ["My order hasn't arrived", "I love this product!", "Wrong item received"] * 3334  # 10,002 tickets

batch_id = submit_batch_job(
    texts=tickets,
    task_description='Classify the support ticket. Respond in JSON: {"category": "shipping|product|billing|other", "priority": "high|medium|low", "sentiment": "positive|negative|neutral"}'
)

# Run this the next day
results = retrieve_batch_results(batch_id)

Best batch use cases:

- Bulk classification/summarization/translation of existing data
- Nightly report generation
- Offline customer feedback analysis
- Content moderation (non-real-time)
- Embedding generation for large document sets
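
Because a batch can finish at any point inside the completion window, scheduled jobs usually poll for a terminal status instead of assuming "the next day". The following is a minimal sketch, not part of the OpenAI SDK: `wait_for_batch` is a hypothetical helper that accepts any callable returning the current status string, and the polling interval and timeout are illustrative defaults.

```python
import time
from typing import Callable

# Terminal states reported by the Batch API; anything else means "keep waiting".
TERMINAL_STATUSES = {"completed", "failed", "expired", "cancelled"}

def wait_for_batch(
    get_status: Callable[[], str],
    poll_seconds: float = 60.0,
    timeout_seconds: float = 24 * 3600,
    sleep: Callable[[float], None] = time.sleep,
) -> str:
    """Poll get_status() until the batch reaches a terminal state (hypothetical helper)."""
    waited = 0.0
    while True:
        status = get_status()
        if status in TERMINAL_STATUSES:
            return status
        if waited >= timeout_seconds:
            raise TimeoutError(f"Batch still '{status}' after {waited:.0f}s")
        sleep(poll_seconds)
        waited += poll_seconds
```

With the client from the example above, a call could look like `wait_for_batch(lambda: client.batches.retrieve(batch_id).status)`. Passing `sleep` as a parameter keeps the helper testable without real delays.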

Strategy 5: Output Token Optimization

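At the prices in the reference table below, output tokens cost 4-5x more than input tokens, so trimming responses pays off faster than trimming prompts. For classification and extraction tasks, structured outputs can cut response length by roughly 70-90% (for example, from 80-200 tokens of prose to a 20-30 token JSON object).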

import json
from openai import OpenAI
from pydantic import BaseModel
from typing import Literal

client = OpenAI()

# Bad: verbose prose response (wastes tokens)
def analyze_verbose(text: str) -> str:
    response = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[{"role": "user", "content": f"Analyze the sentiment of: {text}"}]
    )
    # Returns: "The sentiment of this text is overwhelmingly positive. The author expresses..."
    # Typical: 80-200 output tokens
    return response.choices[0].message.content


# Good: structured minimal output
class SentimentResult(BaseModel):
    sentiment: Literal["positive", "negative", "neutral"]
    confidence: float
    reason: str   # Kept to 5 words by the system prompt (the schema does not enforce it)

def analyze_structured(text: str) -> SentimentResult:
    response = client.beta.chat.completions.parse(
        model="gpt-4o-mini",
        messages=[
            {
                "role": "system",
                "content": "Classify sentiment. Reason must be 5 words max."
            },
            {"role": "user", "content": text}
        ],
        response_format=SentimentResult
    )
    # Returns: {"sentiment": "positive", "confidence": 0.94, "reason": "enthusiastic praise"}
    # Typical: 20-30 output tokens → 85% savings
    return response.choices[0].message.parsed


# Token-aware prompting
def summarize_concisely(text: str, target_words: int = 50) -> str:
    response = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[
            {
                "role": "system",
                "content": f"Summarize in {target_words} words or fewer. Be direct. No preamble."
            },
            {"role": "user", "content": text}
        ],
        max_tokens=target_words * 2  # Hard limit as safety net
    )
    return response.choices[0].message.content


# For JSON tasks, avoid asking the model to "explain" anything
EXTRACTION_PROMPT = """Extract the requested data. Return ONLY valid JSON. No explanation.
If a field is missing, use null."""

def extract_structured_data(document: str, schema: dict) -> dict:
    response = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[
            {"role": "system", "content": EXTRACTION_PROMPT},
            {"role": "user", "content": f"Schema: {json.dumps(schema)}\n\nDocument: {document}"}
        ],
        response_format={"type": "json_object"}
    )
    return json.loads(response.choices[0].message.content)
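
To see what the gap between the verbose and structured variants means in dollars, here is a rough worked calculation. It assumes the GPT-4o-mini output price from this article's pricing table, the midpoints of the token ranges noted in the code comments above (140 and 25 tokens), and a hypothetical volume of one million requests per month.

```python
def monthly_output_cost(requests: int, output_tokens: int, price_per_million: float) -> float:
    """Output-side cost in dollars for a month of requests."""
    return requests * output_tokens * price_per_million / 1_000_000

GPT_4O_MINI_OUTPUT = 0.60       # $/1M output tokens, from this article's pricing table
REQUESTS_PER_MONTH = 1_000_000  # hypothetical volume

# 140 tokens: midpoint of the 80-200 range for prose answers
verbose = monthly_output_cost(REQUESTS_PER_MONTH, 140, GPT_4O_MINI_OUTPUT)
# 25 tokens: midpoint of the 20-30 range for structured answers
structured = monthly_output_cost(REQUESTS_PER_MONTH, 25, GPT_4O_MINI_OUTPUT)
savings = 1 - structured / verbose

print(f"verbose: ${verbose:.2f}, structured: ${structured:.2f}, savings: {savings:.0%}")
```

Under these assumptions the output bill drops from about $84 to about $15 per month, roughly 82%. Input tokens are unaffected, so the saving on the total bill is smaller when prompts are long.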

Cost Monitoring Dashboard

You can't optimize what you don't measure.

from collections import defaultdict
from datetime import date
from openai import OpenAI

client = OpenAI()

class CostTracker:
    """Real-time LLM API cost tracking with budget alerts"""

    # 2025 pricing ($/1M tokens)
    PRICING = {
        "gpt-4o":           {"input": 2.50,  "output": 10.00},
        "gpt-4o-mini":      {"input": 0.15,  "output": 0.60},
        "claude-3-5-sonnet-20241022": {"input": 3.00, "output": 15.00},
        "claude-3-haiku-20240307":    {"input": 0.25, "output": 1.25},
    }

    def __init__(self, daily_budget: float = 100.0):
        self.daily_budget = daily_budget
        self.by_model = defaultdict(lambda: {"in": 0, "out": 0, "cost": 0.0})
        self.by_day = defaultdict(float)

    def record(self, model: str, input_tokens: int, output_tokens: int) -> float:
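        p = self.PRICING.get(model)
        if p is None:
            # Unknown models would otherwise be recorded as free and hide real spend
            print(f"WARNING: no pricing configured for {model}; recording $0")
            p = {"input": 0, "output": 0}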
        cost = (input_tokens * p["input"] + output_tokens * p["output"]) / 1_000_000
        today = date.today().isoformat()

        self.by_model[model]["in"] += input_tokens
        self.by_model[model]["out"] += output_tokens
        self.by_model[model]["cost"] += cost
        self.by_day[today] += cost

        daily_so_far = self.by_day[today]
        if daily_so_far > self.daily_budget * 0.8:
            print(f"WARNING: 80% of daily budget used (${daily_so_far:.2f} / ${self.daily_budget:.2f})")
        if daily_so_far > self.daily_budget:
            raise RuntimeError(f"Daily budget exceeded! ${daily_so_far:.2f} > ${self.daily_budget:.2f}")

        return cost

    def report(self) -> None:
        total = sum(d["cost"] for d in self.by_model.values())
        today = date.today().isoformat()
        print(f"\n{'='*50}")
        print(f"Today's spend: ${self.by_day[today]:.4f} / ${self.daily_budget:.2f} budget")
        print(f"Total tracked: ${total:.4f}")
        print(f"\nBy model:")
        for model, d in sorted(self.by_model.items(), key=lambda x: -x[1]["cost"]):
            pct = d["cost"] / total * 100 if total else 0
            print(f"  {model:<45} ${d['cost']:.4f} ({pct:.1f}%)")
            print(f"    {d['in']:>12,} input | {d['out']:>12,} output tokens")


tracker = CostTracker(daily_budget=50.0)

def tracked_call(model: str, messages: list, **kwargs) -> str:
    """Drop-in replacement for chat completions with cost tracking"""
    response = client.chat.completions.create(model=model, messages=messages, **kwargs)
    tracker.record(model, response.usage.prompt_tokens, response.usage.completion_tokens)
    return response.choices[0].message.content
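
The tracker above records cost only after a response arrives, so a budget overrun is detected once it has already been paid for. One possible complement is a pre-call estimate. The sketch below is an illustration under stated assumptions: the helper names are hypothetical, and the 4-characters-per-token ratio is a rough heuristic for English text (a real tokenizer would be more accurate).

```python
def estimate_tokens(text: str, chars_per_token: float = 4.0) -> int:
    """Rough token estimate; the chars-per-token ratio is an assumption."""
    return max(1, round(len(text) / chars_per_token))

def estimate_call_cost(prompt: str, max_output_tokens: int,
                       input_price: float, output_price: float) -> float:
    """Worst-case dollar cost of one call; prices are in $/1M tokens."""
    input_tokens = estimate_tokens(prompt)
    return (input_tokens * input_price + max_output_tokens * output_price) / 1_000_000

def would_exceed_budget(spent_today: float, daily_budget: float,
                        estimated_cost: float) -> bool:
    """True if making the call could push today's spend past the budget."""
    return spent_today + estimated_cost > daily_budget

# Example: a 4,000-character prompt to GPT-4o with up to 500 output tokens
cost = estimate_call_cost("x" * 4000, max_output_tokens=500,
                          input_price=2.50, output_price=10.00)
print(f"estimated worst-case cost: ${cost:.4f}")
```

Checking `would_exceed_budget(...)` before the request lets a caller skip the call or downgrade to a cheaper model instead of raising after the money is spent.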

Model Cost Reference (2025)

Model	Input (per 1M tokens)	Output (per 1M tokens)	Best For
GPT-4o	$2.50	$10.00	Complex reasoning, multimodal
GPT-4o-mini	$0.15	$0.60	Most tasks
Claude 3.5 Sonnet	$3.00	$15.00	Coding, analysis, long docs
Claude 3 Haiku	$0.25	$1.25	Fast, simple tasks
Llama 3.1 70B (self-hosted)	~$0.05-0.15	~$0.05-0.15	High-volume, private data
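
Per-token prices are easier to compare once they are applied to a concrete workload. The sketch below copies the hosted-model prices from the table above and ranks them for a hypothetical profile of 100,000 requests per month with 1,000 input and 1,000 output tokens each. The self-hosted Llama row is left out because its cost is driven by servers rather than tokens.

```python
# Prices in $/1M tokens (input, output), copied from the table above.
PRICES = {
    "GPT-4o": (2.50, 10.00),
    "GPT-4o-mini": (0.15, 0.60),
    "Claude 3.5 Sonnet": (3.00, 15.00),
    "Claude 3 Haiku": (0.25, 1.25),
}

def monthly_cost(model: str, requests: int, input_tokens: int, output_tokens: int) -> float:
    """Monthly dollar cost for a fixed per-request token profile."""
    input_price, output_price = PRICES[model]
    per_request = (input_tokens * input_price + output_tokens * output_price) / 1_000_000
    return requests * per_request

def rank_models(requests: int, input_tokens: int, output_tokens: int) -> list[tuple[str, float]]:
    """Models sorted from cheapest to most expensive for the given workload."""
    costs = [(m, monthly_cost(m, requests, input_tokens, output_tokens)) for m in PRICES]
    return sorted(costs, key=lambda pair: pair[1])

for model, cost in rank_models(requests=100_000, input_tokens=1_000, output_tokens=1_000):
    print(f"{model:<20} ${cost:,.2f}")
```

For this profile the monthly totals come to about $75 (GPT-4o-mini), $150 (Claude 3 Haiku), $1,250 (GPT-4o), and $1,800 (Claude 3.5 Sonnet).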

Combined Savings Scenario

Applying all five strategies to a realistic workload:

Baseline: 100,000 requests/month, GPT-4o, avg 1,000 input + 1,000 output tokens/request

Before optimization:
Input:  100,000 × 1,000 × $2.50 / 1,000,000  = $250
Output: 100,000 × 1,000 × $10.00 / 1,000,000 = $1,000
Total: $1,250/month

After all strategies (applied in order, each to the cost that remains):
1. Prompt Caching (80% of requests, 85% off input):    saves ~$170
2. Model routing (70% to mini):                        saves ~$711 additional
3. Semantic caching (40% cache hit rate):              saves ~$148 additional
4. Batch API (20% of requests, 50% off):               saves ~$22 additional
5. Output optimization (30% shorter outputs):          saves ~$55 additional

Optimized monthly cost: ~$144/month
Savings: ~88% (in ideal scenario, 60-80% realistic)
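
The strategy percentages in a scenario like this do not simply add up: each one only discounts the spend left over by the previous ones, and prompt caching touches input tokens only. A minimal sketch of that composition follows. It assumes the scenario's own rates, a split of 1,000 input and 1,000 output tokens per request, and that the strategies are independent of each other (providers differ on whether batch and caching discounts stack).

```python
def combined_cost(input_cost: float, output_cost: float) -> float:
    """Apply the five strategies in order, each to whatever cost remains."""
    # 1. Prompt caching: 80% of requests get 85% off, input tokens only
    input_cost *= 1 - 0.80 * 0.85
    # 2. Model routing: 70% of traffic moves to a model priced at 6% of GPT-4o
    #    (GPT-4o-mini: $0.15 vs $2.50 input, $0.60 vs $10.00 output)
    routing = 1 - 0.70 * (1 - 0.06)
    input_cost *= routing
    output_cost *= routing
    # 3. Semantic caching: 40% of requests never reach the API
    input_cost *= 1 - 0.40
    output_cost *= 1 - 0.40
    # 4. Batch API: 20% of requests at a 50% discount
    input_cost *= 1 - 0.20 * 0.50
    output_cost *= 1 - 0.20 * 0.50
    # 5. Output optimization: 30% fewer output tokens
    output_cost *= 1 - 0.30
    return input_cost + output_cost

# Baseline assumption: 100,000 GPT-4o requests, 1,000 input + 1,000 output tokens each
baseline_input = 100_000 * 1_000 * 2.50 / 1_000_000    # $250
baseline_output = 100_000 * 1_000 * 10.00 / 1_000_000  # $1,000
optimized = combined_cost(baseline_input, baseline_output)
savings = 1 - optimized / (baseline_input + baseline_output)
print(f"optimized: ${optimized:.2f}/month, savings: {savings:.1%}")
```

Under these assumptions the result is roughly $144/month, about 88% below the $1,250 baseline. Changing the token split or the hit rates shifts the number considerably, which is why measured figures from the cost tracker matter more than any projection.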

Conclusion

LLM cost optimization is architecture work. It's far easier to build with these patterns from day one than to retrofit them when your invoice becomes alarming.

Prioritized action plan:

- This hour: Add prompt caching to your system prompt (3 lines of code, up to 90% savings on that portion).
- This week: Implement model routing; even a simple length-based heuristic cuts costs significantly.
- This month: Add semantic caching for FAQ-heavy workloads, set up cost monitoring with budget alerts.
- This quarter: Move batch-compatible workloads to Batch API, establish structured output patterns as the default.

Cost optimization doesn't have to degrade user experience. Done right, it funds better features with the money you save.