LLM Application Development Guide: From Prototype to Production

Table of Contents

  1. Overview of LLM Application Development
  2. Prompt Engineering Fundamentals
  3. LLM APIs and SDKs
  4. Retrieval-Augmented Generation (RAG)
  5. Tool Use and Function Calling
  6. Streaming and Async Patterns
  7. Evaluation and Testing
  8. Cost Optimization
  9. Production Deployment
  10. Observability and Monitoring

1. Overview of LLM Application Development

1.1 What Is an LLM Application?

An LLM application is any software system that uses a large language model as a core component to process natural language, generate content, reason over information, or take actions. Unlike traditional software where every behavior is explicitly programmed, LLM applications delegate significant portions of logic to a pre-trained model.

Common LLM application categories:

| Category | Examples | Key Challenges |
| --- | --- | --- |
| Chatbots and assistants | Customer support, personal assistants | Context management, tone consistency |
| Document QA | Contract review, internal search | Retrieval accuracy, hallucination |
| Code generation | Autocomplete, PR review, test writing | Correctness, security |
| Content generation | Marketing copy, summarization | Quality control, brand voice |
| Data extraction | Form parsing, structured output | Schema adherence, robustness |
| Autonomous agents | Research agents, task automation | Reliability, cost control |

1.2 The Development Stack

A modern LLM application typically has these layers:

User Interface             (Web, Mobile, API, Slack, CLI)
        │
Application Logic          (Orchestration, Business Rules)
        │
LLM Orchestration Layer    (LangChain, LlamaIndex, raw SDK)
        │
LLM Provider(s)            (OpenAI, Anthropic, Google, local)
        │
Supporting Services        (Vector DB, cache, search, tools)

1.3 Key Principles

1. Start simple, add complexity only when needed. A direct API call with a well-crafted prompt often outperforms elaborate orchestration frameworks. Add abstractions when you have a proven use case for them.

2. Treat prompts as code. Version-control your prompts, write tests for them, and track changes carefully. Prompt regressions are as damaging as code regressions.

3. Evaluate before you ship. LLM outputs are non-deterministic. Without systematic evaluation you cannot know whether your changes improved or harmed quality.

4. Design for failure. LLMs hallucinate, timeout, and return unexpected formats. Build retry logic, fallbacks, and validation from the start.
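As a minimal sketch of the "design for failure" principle, the helper below validates a model's supposed JSON output and degrades to a safe default instead of crashing (the function name and fallback shape are illustrative, not from any particular library):

```python
import json

def parse_with_fallback(raw: str, fallback: dict) -> dict:
    """Validate an LLM's JSON output; fall back to a safe default on failure.

    Never trust the model's output format without validation: it may
    return prose, malformed JSON, or the wrong top-level type.
    """
    try:
        data = json.loads(raw)
        if not isinstance(data, dict):
            raise ValueError("expected a JSON object")
        return data
    except (json.JSONDecodeError, ValueError):
        return fallback

# Well-formed output is parsed; malformed output degrades gracefully.
print(parse_with_fallback('{"status": "ok"}', {"status": "unknown"}))       # → {'status': 'ok'}
print(parse_with_fallback("Sorry, I can't help.", {"status": "unknown"}))   # → {'status': 'unknown'}
```

In production you would typically log the failure and possibly retry before falling back, but the validate-or-default shape stays the same.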


2. Prompt Engineering Fundamentals

2.1 Anatomy of a Prompt

A production prompt typically combines up to four sections; all but the user message are optional:

[System instructions]
You are a helpful customer support agent for Acme Corp.
Respond in the same language the user writes in.
Always be polite and concise. Never discuss competitors.

[Context / Retrieved documents]
Order #12345 placed on 2026-03-10. Status: shipped.
Tracking number: 1Z999AA10123456784

[Examples (few-shot)]
User: Where is my order?
Assistant: Your order #99999 shipped on March 5 and is in transit.
Expected delivery: March 12.

[User message]
I ordered something last week and haven't received it yet.
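These sections map naturally onto the chat-message format. A minimal sketch of assembling them follows; replaying few-shot examples as prior user/assistant turns is one common convention (an assumption here, not the only valid layout):

```python
def build_messages(system: str, context: str,
                   examples: list, user: str) -> list:
    """Assemble the four prompt sections into a chat messages list.

    examples: list of (question, answer) pairs replayed as prior turns.
    """
    messages = [{"role": "system", "content": system}]
    for question, answer in examples:  # few-shot pairs as prior turns
        messages.append({"role": "user", "content": question})
        messages.append({"role": "assistant", "content": answer})
    # Retrieved context travels with the live user message
    messages.append({"role": "user", "content": f"Context:\n{context}\n\n{user}"})
    return messages

msgs = build_messages(
    system="You are a helpful customer support agent for Acme Corp.",
    context="Order #12345 placed on 2026-03-10. Status: shipped.",
    examples=[("Where is my order?", "Your order shipped and is in transit.")],
    user="I ordered something last week and haven't received it yet.",
)
print([m["role"] for m in msgs])  # → ['system', 'user', 'assistant', 'user']
```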

2.2 System Instructions Best Practices

Write system instructions that are:

  • Role-specific: Define exactly who the model is and what its purpose is.
  • Constraint-explicit: State what the model should and should not do.
  • Format-specified: Describe the expected output format when it matters.
  • Tone-defined: Specify formality, language, length expectations.
SYSTEM_PROMPT = """You are a senior Python code reviewer at a fintech company.

Your responsibilities:
- Review code for correctness, security vulnerabilities, and performance issues
- Suggest specific improvements with code examples
- Flag any PII handling that violates GDPR/CCPA

Output format:
- Start with a one-sentence overall assessment
- List issues with severity: [CRITICAL], [WARNING], [SUGGESTION]
- End with a revised code block if changes are needed

You do not generate new features. Only review what is given to you."""

2.3 Few-Shot Prompting

Few-shot examples show the model the expected input-output pattern. They are especially effective for:

  • Custom output formats
  • Domain-specific tone or terminology
  • Classification with unusual labels
FEW_SHOT_EXAMPLES = """
Extract the action items from the meeting note below.
Output as JSON array.

Meeting: John will update the deployment guide by Friday.
Sarah needs to review the Q1 budget before the board meeting.
Action items: [
  {"owner": "John", "task": "Update deployment guide", "due": "Friday"},
  {"owner": "Sarah", "task": "Review Q1 budget", "due": "Before board meeting"}
]

Meeting: We agreed that the API team will add rate limiting this sprint.
No owner was assigned for the documentation update.
Action items: [
  {"owner": "API team", "task": "Add rate limiting", "due": "This sprint"},
  {"owner": null, "task": "Documentation update", "due": null}
]

Meeting: {meeting_text}
Action items:"""

2.4 Chain-of-Thought (CoT)

For complex reasoning tasks, ask the model to show its work before giving the final answer.

COT_PROMPT = """Solve the following problem step by step.
Show your reasoning at each step, then give the final answer.

Problem: A customer has a $500 credit. They place an order for $320,
then return one $80 item. What is their remaining credit?

Let's think step by step:"""

Zero-shot CoT trigger: append "Let's think step by step." to your prompt without examples. This simple addition significantly improves multi-step reasoning on many models.

2.5 Structured Output

Force the model to produce parseable output using JSON mode or schema constraints.

from openai import OpenAI
import json

client = OpenAI()

response = client.chat.completions.create(
    model="gpt-4o",
    response_format={"type": "json_object"},
    messages=[
        {"role": "system", "content": "Extract entities. Output valid JSON only."},
        {"role": "user", "content": "Apple announced the iPhone 16 in Cupertino on September 9, 2024."}
    ]
)

data = json.loads(response.choices[0].message.content)
# {"company": "Apple", "product": "iPhone 16", "location": "Cupertino", "date": "2024-09-09"}

With Pydantic and the OpenAI SDK's structured output feature:

from pydantic import BaseModel
from openai import OpenAI

class NewsEvent(BaseModel):
    company: str
    product: str
    location: str
    date: str

client = OpenAI()
response = client.beta.chat.completions.parse(
    model="gpt-4o-2024-08-06",
    messages=[
        {"role": "system", "content": "Extract the event details."},
        {"role": "user", "content": "Apple announced the iPhone 16 in Cupertino on September 9, 2024."}
    ],
    response_format=NewsEvent,
)
event = response.choices[0].message.parsed
print(event.company)  # Apple

3. LLM APIs and SDKs

3.1 OpenAI SDK

from openai import OpenAI

client = OpenAI(api_key="sk-...")  # or set OPENAI_API_KEY env var

# Basic chat completion
response = client.chat.completions.create(
    model="gpt-4o",
    messages=[
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": "Summarize the Transformer architecture in 3 sentences."}
    ],
    temperature=0.7,
    max_tokens=200,
)

print(response.choices[0].message.content)
print(f"Tokens used: {response.usage.total_tokens}")

3.2 Anthropic SDK

import anthropic

client = anthropic.Anthropic(api_key="sk-ant-...")

message = client.messages.create(
    model="claude-opus-4-5",
    max_tokens=1024,
    system="You are a helpful assistant.",
    messages=[
        {"role": "user", "content": "Explain attention mechanisms in transformers."}
    ]
)

print(message.content[0].text)
print(f"Input tokens: {message.usage.input_tokens}")
print(f"Output tokens: {message.usage.output_tokens}")

3.3 Unified Interface with LiteLLM

LiteLLM provides a single interface across 100+ LLM providers:

from litellm import completion

# OpenAI
response = completion(
    model="gpt-4o",
    messages=[{"role": "user", "content": "Hello"}]
)

# Anthropic (same interface)
response = completion(
    model="anthropic/claude-opus-4-5",
    messages=[{"role": "user", "content": "Hello"}]
)

# Local Ollama model (same interface)
response = completion(
    model="ollama/llama3",
    messages=[{"role": "user", "content": "Hello"}]
)

print(response.choices[0].message.content)

3.4 Managing Conversation History

from openai import OpenAI

class ConversationManager:
    def __init__(self, system_prompt: str, max_history: int = 20):
        self.system_prompt = system_prompt
        self.max_history = max_history
        self.history: list[dict] = []
        self.client = OpenAI()

    def chat(self, user_message: str) -> str:
        self.history.append({"role": "user", "content": user_message})

        # Trim history to avoid exceeding context window
        if len(self.history) > self.max_history:
            self.history = self.history[-self.max_history:]

        response = self.client.chat.completions.create(
            model="gpt-4o",
            messages=[
                {"role": "system", "content": self.system_prompt},
                *self.history
            ]
        )

        assistant_message = response.choices[0].message.content
        self.history.append({"role": "assistant", "content": assistant_message})
        return assistant_message

4. Retrieval-Augmented Generation (RAG)

4.1 Why RAG?

LLMs have two fundamental limitations that RAG addresses:

  1. Knowledge cutoff: The model only knows what was in its training data.
  2. Context window limit: The model cannot "know" all your documents at once.

RAG solves both by retrieving the relevant pieces of information at inference time and injecting them into the prompt.

User Query
    │
    ▼
[Embed query] ──► [Vector Search] ──► Top-K relevant chunks
    │
    ▼
[Build augmented prompt]
    System: You are a helpful assistant.
    Context: {retrieved chunks}
    User: {original query}
    │
    ▼
[LLM generates answer]

4.2 Document Ingestion Pipeline

from langchain_community.document_loaders import PyPDFLoader, DirectoryLoader
from langchain_text_splitters import RecursiveCharacterTextSplitter
from langchain_openai import OpenAIEmbeddings
from langchain_community.vectorstores import Chroma

# 1. Load documents
loader = DirectoryLoader("./docs", glob="**/*.pdf", loader_cls=PyPDFLoader)
documents = loader.load()

# 2. Split into chunks
splitter = RecursiveCharacterTextSplitter(
    chunk_size=1000,
    chunk_overlap=200,
    separators=["\n\n", "\n", " ", ""]
)
chunks = splitter.split_documents(documents)

# 3. Embed and store
embeddings = OpenAIEmbeddings(model="text-embedding-3-small")
vectorstore = Chroma.from_documents(
    documents=chunks,
    embedding=embeddings,
    persist_directory="./chroma_db"
)
print(f"Indexed {len(chunks)} chunks from {len(documents)} documents")

4.3 Retrieval and Generation

from langchain_openai import ChatOpenAI
from langchain.chains import RetrievalQA
from langchain.prompts import PromptTemplate

# Load existing vectorstore
vectorstore = Chroma(
    persist_directory="./chroma_db",
    embedding_function=OpenAIEmbeddings(model="text-embedding-3-small")
)

# Create retriever
retriever = vectorstore.as_retriever(
    search_type="mmr",          # Maximal Marginal Relevance for diversity
    search_kwargs={"k": 5, "fetch_k": 20}
)

# Custom prompt
QA_PROMPT = PromptTemplate(
    template="""Use the following context to answer the question.
If the answer is not in the context, say "I don't have information about that."
Do not make up information.

Context:
{context}

Question: {question}

Answer:""",
    input_variables=["context", "question"]
)

# Chain
llm = ChatOpenAI(model="gpt-4o", temperature=0)
qa_chain = RetrievalQA.from_chain_type(
    llm=llm,
    chain_type="stuff",
    retriever=retriever,
    chain_type_kwargs={"prompt": QA_PROMPT},
    return_source_documents=True
)

result = qa_chain.invoke({"query": "What is the refund policy?"})
print(result["result"])
for doc in result["source_documents"]:
    print(f"Source: {doc.metadata['source']}, page {doc.metadata.get('page', 'N/A')}")

4.4 Improving Retrieval Quality

Hybrid search combines dense (semantic) and sparse (keyword) retrieval:

from langchain_community.retrievers import BM25Retriever
from langchain.retrievers import EnsembleRetriever

# Dense retriever (semantic)
dense_retriever = vectorstore.as_retriever(search_kwargs={"k": 5})

# Sparse retriever (BM25 keyword)
bm25_retriever = BM25Retriever.from_documents(chunks)
bm25_retriever.k = 5

# Ensemble with equal weight
ensemble_retriever = EnsembleRetriever(
    retrievers=[dense_retriever, bm25_retriever],
    weights=[0.6, 0.4]
)

Reranking with a cross-encoder after initial retrieval:

from sentence_transformers import CrossEncoder

reranker = CrossEncoder("cross-encoder/ms-marco-MiniLM-L-6-v2")

def rerank(query: str, docs: list, top_k: int = 3) -> list:
    pairs = [(query, doc.page_content) for doc in docs]
    scores = reranker.predict(pairs)
    ranked = sorted(zip(scores, docs), key=lambda x: x[0], reverse=True)
    return [doc for _, doc in ranked[:top_k]]
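The two stages compose into a retrieve-then-rerank pipeline: fetch a broad candidate set cheaply, then keep only the best-scoring few. A sketch with hypothetical `retrieve_fn` and `score_fn` callables standing in for the vector retriever and cross-encoder above:

```python
def two_stage_retrieve(query, retrieve_fn, score_fn, fetch_k=20, top_k=3):
    """Broad first-stage retrieval followed by precise reranking.

    retrieve_fn(query, k) returns candidate documents;
    score_fn(query, doc) returns a relevance score (higher is better).
    """
    candidates = retrieve_fn(query, fetch_k)
    ranked = sorted(candidates, key=lambda d: score_fn(query, d), reverse=True)
    return ranked[:top_k]

# Toy demonstration: score by word overlap with the query.
docs = ["refund policy details", "shipping times", "refund request form", "careers"]
overlap = lambda q, d: len(set(q.split()) & set(d.split()))
print(two_stage_retrieve("refund policy", lambda q, k: docs[:k], overlap, top_k=2))
# → ['refund policy details', 'refund request form']
```

The key design point is the asymmetry: `fetch_k` is large because recall is cheap at stage one, while the expensive cross-encoder only scores the small candidate set.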

5. Tool Use and Function Calling

5.1 Defining Tools

Tools let the LLM call external APIs, search databases, or execute code:

import json
import requests
from openai import OpenAI

client = OpenAI()

# Define tools
tools = [
    {
        "type": "function",
        "function": {
            "name": "get_weather",
            "description": "Get current weather for a city",
            "parameters": {
                "type": "object",
                "properties": {
                    "city": {
                        "type": "string",
                        "description": "City name, e.g. 'San Francisco'"
                    },
                    "unit": {
                        "type": "string",
                        "enum": ["celsius", "fahrenheit"],
                        "description": "Temperature unit"
                    }
                },
                "required": ["city"]
            }
        }
    },
    {
        "type": "function",
        "function": {
            "name": "search_web",
            "description": "Search the web for current information",
            "parameters": {
                "type": "object",
                "properties": {
                    "query": {"type": "string", "description": "Search query"}
                },
                "required": ["query"]
            }
        }
    }
]

5.2 Handling Tool Calls

def get_weather(city: str, unit: str = "celsius") -> dict:
    # Real implementation would call a weather API
    return {"city": city, "temp": 18, "unit": unit, "condition": "sunny"}

def search_web(query: str) -> str:
    # Real implementation would call a search API
    return f"Search results for: {query}"

TOOL_MAP = {
    "get_weather": get_weather,
    "search_web": search_web,
}

def run_agent(user_message: str) -> str:
    messages = [{"role": "user", "content": user_message}]

    while True:
        response = client.chat.completions.create(
            model="gpt-4o",
            messages=messages,
            tools=tools,
            tool_choice="auto"
        )

        message = response.choices[0].message

        # No tool call → final answer
        if not message.tool_calls:
            return message.content

        # Add assistant's response with tool calls to history
        messages.append(message)

        # Execute each tool call
        for tool_call in message.tool_calls:
            func_name = tool_call.function.name
            func_args = json.loads(tool_call.function.arguments)

            result = TOOL_MAP[func_name](**func_args)

            messages.append({
                "role": "tool",
                "tool_call_id": tool_call.id,
                "content": json.dumps(result)
            })

answer = run_agent("What's the weather like in Tokyo and should I bring an umbrella?")
print(answer)

5.3 Parallel Tool Calls

GPT-4o and Claude 3+ support parallel tool calls: the model can request several tool invocations in a single turn, which greatly reduces latency when the operations are independent:

# The model may call multiple tools at once
# Your loop above already handles this — message.tool_calls is a list
# Both weather and search can be called in a single model turn

response = client.chat.completions.create(
    model="gpt-4o",
    messages=[
        {"role": "user", "content": "Compare weather in Tokyo and Paris, and search for travel tips."}
    ],
    tools=tools,
    parallel_tool_calls=True  # default True for GPT-4o
)

# response may have 3 tool calls at once: weather(Tokyo), weather(Paris), search(travel tips)
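When the model does return several independent tool calls, you can execute them concurrently instead of looping sequentially. A sketch using a thread pool; the `SimpleNamespace` stand-ins below only mimic the shape of the SDK's tool-call objects for illustration:

```python
import json
from concurrent.futures import ThreadPoolExecutor
from types import SimpleNamespace

def execute_tool_calls(tool_calls, tool_map):
    """Run independent tool calls in parallel threads and return the
    'tool' messages to append to the conversation history."""
    if not tool_calls:
        return []

    def run_one(tc):
        func = tool_map[tc.function.name]
        args = json.loads(tc.function.arguments)
        return {
            "role": "tool",
            "tool_call_id": tc.id,
            "content": json.dumps(func(**args)),
        }

    with ThreadPoolExecutor(max_workers=len(tool_calls)) as pool:
        return list(pool.map(run_one, tool_calls))  # preserves call order

# Stand-in tool calls shaped like the SDK objects:
calls = [
    SimpleNamespace(id="1", function=SimpleNamespace(
        name="get_weather", arguments='{"city": "Tokyo"}')),
    SimpleNamespace(id="2", function=SimpleNamespace(
        name="get_weather", arguments='{"city": "Paris"}')),
]
tool_map = {"get_weather": lambda city: {"city": city, "temp": 18}}
results = execute_tool_calls(calls, tool_map)
print([json.loads(r["content"])["city"] for r in results])  # → ['Tokyo', 'Paris']
```

Threads are fine here because the work is I/O-bound (API calls); only use this for tools that are safe to run concurrently.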

6. Streaming and Async Patterns

6.1 Streaming Responses

Streaming dramatically improves perceived latency for users because text appears as it is generated:

from openai import OpenAI

client = OpenAI()

# Synchronous streaming
stream = client.chat.completions.create(
    model="gpt-4o",
    messages=[{"role": "user", "content": "Write a short story about a robot."}],
    stream=True,
)
for chunk in stream:
    delta = chunk.choices[0].delta.content
    if delta:
        print(delta, end="", flush=True)

6.2 Async Streaming with FastAPI

from fastapi import FastAPI
from fastapi.responses import StreamingResponse
from openai import AsyncOpenAI

app = FastAPI()
client = AsyncOpenAI()

@app.post("/chat")
async def chat(body: dict):
    async def generate():
        stream = await client.chat.completions.create(
            model="gpt-4o",
            messages=body["messages"],
            stream=True,
        )
        async for chunk in stream:
            delta = chunk.choices[0].delta.content
            if delta:
                yield f"data: {delta}\n\n"
        yield "data: [DONE]\n\n"

    return StreamingResponse(generate(), media_type="text/event-stream")

6.3 Async Batch Processing

When processing many items, async concurrency provides large throughput gains:

import asyncio
from openai import AsyncOpenAI

client = AsyncOpenAI()

async def classify_one(text: str, semaphore: asyncio.Semaphore) -> str:
    async with semaphore:
        response = await client.chat.completions.create(
            model="gpt-4o-mini",
            messages=[
                {"role": "system", "content": "Classify as POSITIVE, NEGATIVE, or NEUTRAL."},
                {"role": "user", "content": text}
            ],
            max_tokens=10
        )
        return response.choices[0].message.content.strip()

async def classify_batch(texts: list[str], max_concurrent: int = 20) -> list[str]:
    semaphore = asyncio.Semaphore(max_concurrent)
    tasks = [classify_one(text, semaphore) for text in texts]
    return await asyncio.gather(*tasks)

# Usage
texts = ["I love this product!", "Terrible experience.", "It was okay."] * 100
results = asyncio.run(classify_batch(texts))

7. Evaluation and Testing

7.1 Why LLM Evaluation Is Hard

Traditional software testing uses deterministic assertions:

assert add(2, 3) == 5  # Always passes or fails

LLM outputs are non-deterministic and require:

  • Semantic equivalence checks (not string equality)
  • Rubric-based grading
  • Reference-free quality assessment
  • Statistical sampling (one run is not enough)
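Statistical sampling can be as simple as repeating a check and reporting a pass rate. A minimal sketch, where `check_fn` is a hypothetical callable that runs your pipeline once and returns pass/fail:

```python
def pass_rate(check_fn, n: int = 10) -> float:
    """Run a non-deterministic evaluation n times; report the pass fraction.

    One run tells you almost nothing about a stochastic system; the
    rate across repeated runs is the actual quality signal.
    """
    passes = sum(1 for _ in range(n) if check_fn())
    return passes / n

# Toy demonstration with a fixed cycle of outcomes:
outcomes = iter([True, True, False, True, True] * 2)
print(pass_rate(lambda: next(outcomes), n=10))  # → 0.8
```

In practice you would also track the variance across runs and alert when the rate drops below a baseline, not just report the mean.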

7.2 LLM-as-Judge

Use a capable LLM to evaluate another LLM's outputs:

from openai import OpenAI

client = OpenAI()

JUDGE_PROMPT = """You are evaluating an AI assistant's response.
Rate the response on the following criteria (1-5 each):
- Accuracy: Is the information correct?
- Helpfulness: Does it fully address the question?
- Conciseness: Is it appropriately brief?

Question: {question}
Response: {response}
Reference answer: {reference}

Output as JSON: {{"accuracy": X, "helpfulness": X, "conciseness": X, "reasoning": "..."}}"""

def evaluate(question: str, response: str, reference: str) -> dict:
    import json
    result = client.chat.completions.create(
        model="gpt-4o",
        response_format={"type": "json_object"},
        messages=[{
            "role": "user",
            "content": JUDGE_PROMPT.format(
                question=question,
                response=response,
                reference=reference
            )
        }]
    )
    return json.loads(result.choices[0].message.content)

7.3 Evaluation Frameworks

DeepEval provides comprehensive LLM evaluation metrics:

from deepeval import evaluate
from deepeval.metrics import (
    AnswerRelevancyMetric,
    FaithfulnessMetric,
    ContextualRecallMetric,
)
from deepeval.test_case import LLMTestCase

test_case = LLMTestCase(
    input="What is the capital of France?",
    actual_output="The capital of France is Paris.",
    expected_output="Paris",
    retrieval_context=["France is a country in Western Europe. Its capital is Paris."]
)

metrics = [
    AnswerRelevancyMetric(threshold=0.8),
    FaithfulnessMetric(threshold=0.9),
    ContextualRecallMetric(threshold=0.8),
]

evaluate([test_case], metrics)

7.4 Regression Testing with Promptfoo

Promptfoo lets you define test cases in YAML and run them across model versions:

# promptfooconfig.yaml
prompts:
  - 'Summarize the following text in 2 sentences: {{text}}'

providers:
  - openai:gpt-4o
  - openai:gpt-4o-mini

tests:
  - vars:
      text: "The Eiffel Tower was built in 1889 for the World's Fair..."
    assert:
      - type: llm-rubric
        value: "The summary should mention the year 1889 and the World's Fair"
      - type: javascript
        value: "output.split('.').length <= 3" # at most 2 sentences (split leaves a trailing empty string)

8. Cost Optimization

8.1 Token Counting and Budgeting

import tiktoken

def count_tokens(text: str, model: str = "gpt-4o") -> int:
    encoding = tiktoken.encoding_for_model(model)
    return len(encoding.encode(text))

def estimate_cost(
    input_tokens: int,
    output_tokens: int,
    model: str = "gpt-4o"
) -> float:
    # Prices per 1M tokens (March 2026 approximate)
    PRICING = {
        "gpt-4o": {"input": 2.50, "output": 10.00},
        "gpt-4o-mini": {"input": 0.15, "output": 0.60},
        "claude-opus-4-5": {"input": 15.00, "output": 75.00},
        "claude-haiku-3-5": {"input": 0.80, "output": 4.00},
    }
    p = PRICING.get(model, {"input": 5.0, "output": 15.0})
    return (input_tokens * p["input"] + output_tokens * p["output"]) / 1_000_000

8.2 Prompt Caching

Anthropic and OpenAI both offer prompt caching for repeated system prompts or large shared context. OpenAI applies caching automatically to long prompt prefixes, while Anthropic requires explicit cache_control markers:

# Anthropic prompt caching
import anthropic

client = anthropic.Anthropic()

response = client.messages.create(
    model="claude-opus-4-5",
    max_tokens=1024,
    system=[
        {
            "type": "text",
            "text": very_long_system_prompt,  # e.g. 50k token policy document
            "cache_control": {"type": "ephemeral"}  # Cache this prefix
        }
    ],
    messages=[{"role": "user", "content": user_question}]
)
# First call: full price. Subsequent calls: ~90% discount on cached tokens.

8.3 Model Routing

Route tasks to the cheapest model that can handle them:

def route_to_model(task: str, complexity: str) -> str:
    """Route to appropriate model based on task complexity."""
    if complexity == "simple":
        return "gpt-4o-mini"          # Simple classification, extraction
    elif complexity == "medium":
        return "gpt-4o"               # Summarization, Q&A
    else:
        return "claude-opus-4-5"      # Complex reasoning, code review

# Example: classify complexity before routing
def smart_complete(messages: list, task_description: str) -> str:
    from openai import OpenAI
    client = OpenAI()

    # Cheap classification step
    complexity = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[{
            "role": "user",
            "content": f"Rate the complexity of this task as 'simple', 'medium', or 'complex': {task_description}"
        }],
        max_tokens=5
    ).choices[0].message.content.strip().lower()

    model = route_to_model(task_description, complexity)

    return client.chat.completions.create(
        model=model,
        messages=messages
    ).choices[0].message.content

9. Production Deployment

9.1 FastAPI Backend

from fastapi import FastAPI, HTTPException
from pydantic import BaseModel
from openai import AsyncOpenAI
import asyncio

app = FastAPI(title="LLM API")
client = AsyncOpenAI()

class ChatRequest(BaseModel):
    messages: list[dict]
    model: str = "gpt-4o"
    temperature: float = 0.7
    max_tokens: int = 1000

class ChatResponse(BaseModel):
    content: str
    usage: dict

@app.post("/chat", response_model=ChatResponse)
async def chat(request: ChatRequest):
    try:
        response = await client.chat.completions.create(
            model=request.model,
            messages=request.messages,
            temperature=request.temperature,
            max_tokens=request.max_tokens,
        )
        return ChatResponse(
            content=response.choices[0].message.content,
            usage={
                "input_tokens": response.usage.prompt_tokens,
                "output_tokens": response.usage.completion_tokens
            }
        )
    except Exception as e:
        raise HTTPException(status_code=500, detail=str(e))

9.2 Rate Limiting and Retry

import asyncio
import random
from functools import wraps

def with_retry(max_attempts: int = 3, base_delay: float = 1.0):
    def decorator(func):
        @wraps(func)
        async def wrapper(*args, **kwargs):
            for attempt in range(max_attempts):
                try:
                    return await func(*args, **kwargs)
                except Exception:  # in production, retry only transient errors (rate limits, timeouts)
                    if attempt == max_attempts - 1:
                        raise
                    # Exponential backoff with jitter
                    delay = base_delay * (2 ** attempt) + random.uniform(0, 1)
                    await asyncio.sleep(delay)
        return wrapper
    return decorator

@with_retry(max_attempts=3)
async def robust_completion(messages: list) -> str:
    client = AsyncOpenAI()
    response = await client.chat.completions.create(
        model="gpt-4o",
        messages=messages
    )
    return response.choices[0].message.content

9.3 Caching with Redis

import hashlib
import json
import redis.asyncio as redis

redis_client = redis.from_url("redis://localhost:6379")
CACHE_TTL = 3600  # 1 hour

def cache_key(messages: list, model: str) -> str:
    payload = json.dumps({"messages": messages, "model": model}, sort_keys=True)
    return f"llm:{hashlib.md5(payload.encode()).hexdigest()}"

async def cached_completion(messages: list, model: str = "gpt-4o") -> str:
    key = cache_key(messages, model)

    # Check cache
    cached = await redis_client.get(key)
    if cached:
        return cached.decode()

    # Generate
    from openai import AsyncOpenAI
    client = AsyncOpenAI()
    response = await client.chat.completions.create(model=model, messages=messages)
    result = response.choices[0].message.content

    # Store with TTL
    await redis_client.setex(key, CACHE_TTL, result)
    return result

10. Observability and Monitoring

10.1 Key Metrics to Track

| Metric | Why It Matters | Alert Threshold |
| --- | --- | --- |
| Latency (p50, p95, p99) | User experience | p95 > 5s for streaming |
| Token usage | Cost | Budget deviation > 20% |
| Error rate | Reliability | > 1% of requests |
| Cache hit rate | Cost efficiency | < 30% (investigate) |
| Evaluation scores | Quality | Drop > 5% from baseline |
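The latency percentiles in the table can be computed from raw request logs with the standard library; a minimal sketch:

```python
import statistics

def latency_percentiles(latencies_ms: list) -> dict:
    """Compute p50/p95/p99 latencies from raw per-request samples."""
    qs = statistics.quantiles(latencies_ms, n=100)  # 99 cut points
    return {"p50": qs[49], "p95": qs[94], "p99": qs[98]}

samples = list(range(1, 101))  # pretend latencies of 1..100 ms
print(latency_percentiles(samples))
```

For high-traffic services you would typically use a streaming estimator or your metrics backend's histogram support rather than sorting every sample, but the percentile definitions are the same.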

10.2 LangSmith Tracing

import os
from langchain_openai import ChatOpenAI

os.environ["LANGCHAIN_API_KEY"] = "ls__..."
os.environ["LANGCHAIN_TRACING_V2"] = "true"
os.environ["LANGCHAIN_PROJECT"] = "my-llm-app"

# All LangChain calls are automatically traced
llm = ChatOpenAI(model="gpt-4o")
response = llm.invoke("What is RAG?")
# Full trace (prompt, response, latency, tokens) visible in LangSmith UI

10.3 Custom Logging

import time
import logging
from dataclasses import dataclass, field, asdict

logger = logging.getLogger(__name__)

@dataclass
class LLMCallLog:
    model: str
    input_tokens: int
    output_tokens: int
    latency_ms: float
    success: bool
    error: str = ""
    metadata: dict = field(default_factory=dict)

async def traced_completion(messages: list, model: str = "gpt-4o", **metadata) -> str:
    from openai import AsyncOpenAI
    client = AsyncOpenAI()

    start = time.perf_counter()
    success = True
    error = ""
    input_tokens = output_tokens = 0

    try:
        response = await client.chat.completions.create(model=model, messages=messages)
        result = response.choices[0].message.content
        input_tokens = response.usage.prompt_tokens
        output_tokens = response.usage.completion_tokens
        return result
    except Exception as e:
        success = False
        error = str(e)
        raise
    finally:
        log = LLMCallLog(
            model=model,
            input_tokens=input_tokens,
            output_tokens=output_tokens,
            latency_ms=(time.perf_counter() - start) * 1000,
            success=success,
            error=error,
            metadata=metadata
        )
        logger.info("llm_call", extra=asdict(log))

10.4 Guardrails and Safety

from guardrails import Guard
from guardrails.hub import ToxicLanguage, DetectPII

guard = Guard().use_many(
    ToxicLanguage(threshold=0.5, on_fail="exception"),
    DetectPII(pii_entities=["EMAIL_ADDRESS", "PHONE_NUMBER"], on_fail="fix"),
)

def safe_completion(user_input: str) -> str:
    from openai import OpenAI
    client = OpenAI()

    # Validate input
    guard.validate(user_input)

    response = client.chat.completions.create(
        model="gpt-4o",
        messages=[{"role": "user", "content": user_input}]
    )
    output = response.choices[0].message.content

    # Validate and fix output
    validated = guard.validate(output)
    return validated.validated_output

Summary

Building production-grade LLM applications requires mastery across multiple dimensions:

| Area | Key Takeaway |
| --- | --- |
| Prompt engineering | Treat prompts as code; version, test, and iterate them |
| RAG | Hybrid search + reranking dramatically improves retrieval |
| Tool use | Parallel tool calls reduce latency for multi-step tasks |
| Streaming | Essential for interactive UX; use SSE with FastAPI |
| Evaluation | LLM-as-judge + automated test suites catch regressions |
| Cost | Caching, routing, and prompt caching can cut costs by 80%+ |
| Monitoring | Track latency, tokens, and quality metrics from day one |

The field evolves rapidly, but these fundamentals will serve you well regardless of which models or frameworks dominate next year. Start simple, measure everything, and iterate based on real usage data.

Knowledge Check Quiz

Q1. What does RAG stand for, and why is it useful?

RAG stands for Retrieval-Augmented Generation. It is useful because it lets an LLM answer questions about documents it was never trained on by retrieving relevant text at inference time and injecting it into the prompt. This addresses both the knowledge cutoff problem and the context window limitation.

Q2. What is the main advantage of parallel tool calls in an LLM agent?

Parallel tool calls allow the model to invoke multiple tools simultaneously in a single turn rather than sequentially. This reduces total latency for multi-step tasks where the tool calls are independent of each other.

Q3. Why is LLM-as-judge evaluation preferred over simple string matching?

LLM outputs are semantically equivalent in many different phrasings, so string matching produces false negatives. An LLM judge can assess semantic correctness, helpfulness, and quality using a rubric, providing much more accurate quality signals than deterministic comparison.

Q4. Name two techniques for reducing LLM API costs without degrading quality.

  1. Prompt caching: Cache repeated large prefixes (system prompts, reference documents) so they are only charged at full price once.
  2. Model routing: Direct simple tasks (classification, extraction) to cheap small models, reserving expensive large models only for complex reasoning tasks.