Authors
- Youngju Kim (@fjvbn20031)
Table of Contents
- Overview of LLM Application Development
- Prompt Engineering Fundamentals
- LLM APIs and SDKs
- Retrieval-Augmented Generation (RAG)
- Tool Use and Function Calling
- Streaming and Async Patterns
- Evaluation and Testing
- Cost Optimization
- Production Deployment
- Observability and Monitoring
1. Overview of LLM Application Development
1.1 What Is an LLM Application?
An LLM application is any software system that uses a large language model as a core component to process natural language, generate content, reason over information, or take actions. Unlike traditional software where every behavior is explicitly programmed, LLM applications delegate significant portions of logic to a pre-trained model.
Common LLM application categories:
| Category | Examples | Key Challenges |
|---|---|---|
| Chatbots and assistants | Customer support, personal assistants | Context management, tone consistency |
| Document QA | Contract review, internal search | Retrieval accuracy, hallucination |
| Code generation | Autocomplete, PR review, test writing | Correctness, security |
| Content generation | Marketing copy, summarization | Quality control, brand voice |
| Data extraction | Form parsing, structured output | Schema adherence, robustness |
| Autonomous agents | Research agents, task automation | Reliability, cost control |
1.2 The Development Stack
A modern LLM application typically has these layers:
┌─────────────────────────────────────────┐
│             User Interface              │
│     (Web, Mobile, API, Slack, CLI)      │
├─────────────────────────────────────────┤
│            Application Logic            │
│     (Orchestration, Business Rules)     │
├─────────────────────────────────────────┤
│         LLM Orchestration Layer         │
│    (LangChain, LlamaIndex, raw SDK)     │
├─────────────────────────────────────────┤
│             LLM Provider(s)             │
│   (OpenAI, Anthropic, Google, local)    │
├─────────────────────────────────────────┤
│           Supporting Services           │
│    (Vector DB, cache, search, tools)    │
└─────────────────────────────────────────┘
1.3 Key Principles
1. Start simple, add complexity only when needed. A direct API call with a well-crafted prompt often outperforms elaborate orchestration frameworks. Add abstractions when you have a proven use case for them.
2. Treat prompts as code. Version-control your prompts, write tests for them, and track changes carefully. Prompt regressions are as damaging as code regressions.
3. Evaluate before you ship. LLM outputs are non-deterministic. Without systematic evaluation you cannot know whether your changes improved or harmed quality.
4. Design for failure. LLMs hallucinate, timeout, and return unexpected formats. Build retry logic, fallbacks, and validation from the start.
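The "design for failure" principle can start as simply as validating model output before trusting it. A minimal sketch (the `parse_or_fallback` helper is hypothetical, not from any SDK): the caller always gets a usable dict, even when the model returns prose instead of JSON.

```python
import json

def parse_or_fallback(raw_output: str, fallback: dict) -> dict:
    """Validate LLM output as JSON; return a safe default on failure."""
    try:
        parsed = json.loads(raw_output)
        if not isinstance(parsed, dict):
            raise ValueError("expected a JSON object")
        return parsed
    except (json.JSONDecodeError, ValueError):
        # Malformed model output: degrade gracefully instead of crashing
        return fallback

print(parse_or_fallback('{"intent": "refund"}', {"intent": "unknown"}))  # {'intent': 'refund'}
print(parse_or_fallback("Sure! Here you go:", {"intent": "unknown"}))    # {'intent': 'unknown'}
```

In production you would typically pair this with a retry (re-prompting the model with the parse error) before falling back.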
2. Prompt Engineering Fundamentals
2.1 Anatomy of a Prompt
A production prompt typically has four sections, of which only the user message is strictly required:
[System instructions]
You are a helpful customer support agent for Acme Corp.
Respond in the same language the user writes in.
Always be polite and concise. Never discuss competitors.
[Context / Retrieved documents]
Order #12345 placed on 2026-03-10. Status: shipped.
Tracking number: 1Z999AA10123456784
[Examples (few-shot)]
User: Where is my order?
Assistant: Your order #99999 shipped on March 5 and is in transit.
Expected delivery: March 12.
[User message]
I ordered something last week and haven't received it yet.
2.2 System Instructions Best Practices
Write system instructions that are:
- Role-specific: Define exactly who the model is and what its purpose is.
- Constraint-explicit: State what the model should and should not do.
- Format-specified: Describe the expected output format when it matters.
- Tone-defined: Specify formality, language, length expectations.
SYSTEM_PROMPT = """You are a senior Python code reviewer at a fintech company.
Your responsibilities:
- Review code for correctness, security vulnerabilities, and performance issues
- Suggest specific improvements with code examples
- Flag any PII handling that violates GDPR/CCPA
Output format:
- Start with a one-sentence overall assessment
- List issues with severity: [CRITICAL], [WARNING], [SUGGESTION]
- End with a revised code block if changes are needed
You do not generate new features. Only review what is given to you."""
2.3 Few-Shot Prompting
Few-shot examples show the model the expected input-output pattern. They are especially effective for:
- Custom output formats
- Domain-specific tone or terminology
- Classification with unusual labels
FEW_SHOT_EXAMPLES = """
Extract the action items from the meeting note below.
Output as JSON array.
Meeting: John will update the deployment guide by Friday.
Sarah needs to review the Q1 budget before the board meeting.
Action items: [
{"owner": "John", "task": "Update deployment guide", "due": "Friday"},
{"owner": "Sarah", "task": "Review Q1 budget", "due": "Before board meeting"}
]
Meeting: We agreed that the API team will add rate limiting this sprint.
No owner was assigned for the documentation update.
Action items: [
{"owner": "API team", "task": "Add rate limiting", "due": "This sprint"},
{"owner": null, "task": "Documentation update", "due": null}
]
Meeting: {meeting_text}
Action items:"""
2.4 Chain-of-Thought (CoT)
For complex reasoning tasks, ask the model to show its work before giving the final answer.
COT_PROMPT = """Solve the following problem step by step.
Show your reasoning at each step, then give the final answer.
Problem: A customer has a $500 credit. They place an order for $320,
then return one $80 item. What is their remaining credit?
Let's think step by step:"""
Zero-shot CoT trigger: append "Let's think step by step." to your prompt without examples. This simple addition significantly improves multi-step reasoning on many models.
2.5 Structured Output
Force the model to produce parseable output using JSON mode or schema constraints.
from openai import OpenAI
import json

client = OpenAI()
response = client.chat.completions.create(
    model="gpt-4o",
    response_format={"type": "json_object"},
    messages=[
        {"role": "system", "content": "Extract entities. Output valid JSON only."},
        {"role": "user", "content": "Apple announced the iPhone 16 in Cupertino on September 9, 2024."}
    ]
)
data = json.loads(response.choices[0].message.content)
# {"company": "Apple", "product": "iPhone 16", "location": "Cupertino", "date": "2024-09-09"}
With Pydantic and the OpenAI SDK's structured output feature:
from pydantic import BaseModel
from openai import OpenAI

class NewsEvent(BaseModel):
    company: str
    product: str
    location: str
    date: str

client = OpenAI()
response = client.beta.chat.completions.parse(
    model="gpt-4o-2024-08-06",
    messages=[
        {"role": "system", "content": "Extract the event details."},
        {"role": "user", "content": "Apple announced the iPhone 16 in Cupertino on September 9, 2024."}
    ],
    response_format=NewsEvent,
)
event = response.choices[0].message.parsed
print(event.company)  # Apple
3. LLM APIs and SDKs
3.1 OpenAI SDK
from openai import OpenAI

client = OpenAI(api_key="sk-...")  # or set OPENAI_API_KEY env var

# Basic chat completion
response = client.chat.completions.create(
    model="gpt-4o",
    messages=[
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": "Summarize the Transformer architecture in 3 sentences."}
    ],
    temperature=0.7,
    max_tokens=200,
)
print(response.choices[0].message.content)
print(f"Tokens used: {response.usage.total_tokens}")
3.2 Anthropic SDK
import anthropic

client = anthropic.Anthropic(api_key="sk-ant-...")
message = client.messages.create(
    model="claude-opus-4-5",
    max_tokens=1024,
    system="You are a helpful assistant.",
    messages=[
        {"role": "user", "content": "Explain attention mechanisms in transformers."}
    ]
)
print(message.content[0].text)
print(f"Input tokens: {message.usage.input_tokens}")
print(f"Output tokens: {message.usage.output_tokens}")
3.3 Unified Interface with LiteLLM
LiteLLM provides a single interface across 100+ LLM providers:
from litellm import completion

# OpenAI
response = completion(
    model="gpt-4o",
    messages=[{"role": "user", "content": "Hello"}]
)

# Anthropic (same interface)
response = completion(
    model="anthropic/claude-opus-4-5",
    messages=[{"role": "user", "content": "Hello"}]
)

# Local Ollama model (same interface)
response = completion(
    model="ollama/llama3",
    messages=[{"role": "user", "content": "Hello"}]
)
print(response.choices[0].message.content)
3.4 Managing Conversation History
from openai import OpenAI

class ConversationManager:
    def __init__(self, system_prompt: str, max_history: int = 20):
        self.system_prompt = system_prompt
        self.max_history = max_history
        self.history: list[dict] = []
        self.client = OpenAI()

    def chat(self, user_message: str) -> str:
        self.history.append({"role": "user", "content": user_message})
        # Trim history to avoid exceeding the context window
        if len(self.history) > self.max_history:
            self.history = self.history[-self.max_history:]
        response = self.client.chat.completions.create(
            model="gpt-4o",
            messages=[
                {"role": "system", "content": self.system_prompt},
                *self.history
            ]
        )
        assistant_message = response.choices[0].message.content
        self.history.append({"role": "assistant", "content": assistant_message})
        return assistant_message
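Trimming by message count is crude: a single long document pasted into one turn can still blow the context window. A token-budget variant is sketched below, using character count as a rough proxy for real token counting (swap in tiktoken for accurate budgeting):

```python
def trim_to_budget(history: list[dict], max_chars: int = 8000) -> list[dict]:
    """Keep the most recent messages whose combined length fits the budget.

    Character count is a crude stand-in for token count; use tiktoken
    in production for accurate budgeting.
    """
    kept: list[dict] = []
    total = 0
    for message in reversed(history):
        total += len(message["content"])
        if total > max_chars:
            break
        kept.append(message)
    return list(reversed(kept))

history = [{"role": "user", "content": "x" * 5000},
           {"role": "assistant", "content": "y" * 3000},
           {"role": "user", "content": "z" * 4000}]
print(len(trim_to_budget(history)))  # 2: the oldest message no longer fits
```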
4. Retrieval-Augmented Generation (RAG)
4.1 Why RAG?
LLMs have two fundamental limitations that RAG addresses:
- Knowledge cutoff: The model only knows what was in its training data.
- Context window limit: The model cannot "know" all your documents at once.
RAG solves both by retrieving the relevant pieces of information at inference time and injecting them into the prompt.
User Query
    │
    ▼
[Embed query] ──► [Vector Search] ──► Top-K relevant chunks
    │
    ▼
[Build augmented prompt]
    System: You are a helpful assistant.
    Context: {retrieved chunks}
    User: {original query}
    │
    ▼
[LLM generates answer]
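Conceptually, the vector-search step above is just nearest-neighbor ranking by cosine similarity. A toy sketch, with hand-made 2-D vectors standing in for real embeddings:

```python
import math

def cosine(a: list[float], b: list[float]) -> float:
    """Cosine similarity between two vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm

def top_k(query_vec: list[float], chunks: list[tuple[str, list[float]]], k: int = 2) -> list[str]:
    """Return the k chunk texts most similar to the query embedding."""
    scored = sorted(chunks, key=lambda c: cosine(query_vec, c[1]), reverse=True)
    return [text for text, _ in scored[:k]]

# Dummy corpus: (chunk text, embedding) pairs
chunks = [("refund policy", [1.0, 0.0]),
          ("shipping info", [0.0, 1.0]),
          ("returns",       [0.9, 0.1])]
print(top_k([1.0, 0.0], chunks, k=2))  # ['refund policy', 'returns']
```

Real vector databases replace the linear scan with approximate nearest-neighbor indexes (HNSW, IVF) so the same ranking scales to millions of chunks.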
4.2 Document Ingestion Pipeline
from langchain_community.document_loaders import PyPDFLoader, DirectoryLoader
from langchain_text_splitters import RecursiveCharacterTextSplitter
from langchain_openai import OpenAIEmbeddings
from langchain_community.vectorstores import Chroma

# 1. Load documents
loader = DirectoryLoader("./docs", glob="**/*.pdf", loader_cls=PyPDFLoader)
documents = loader.load()

# 2. Split into chunks
splitter = RecursiveCharacterTextSplitter(
    chunk_size=1000,
    chunk_overlap=200,
    separators=["\n\n", "\n", " ", ""]
)
chunks = splitter.split_documents(documents)

# 3. Embed and store
embeddings = OpenAIEmbeddings(model="text-embedding-3-small")
vectorstore = Chroma.from_documents(
    documents=chunks,
    embedding=embeddings,
    persist_directory="./chroma_db"
)
print(f"Indexed {len(chunks)} chunks from {len(documents)} documents")
4.3 Retrieval and Generation
from langchain_openai import ChatOpenAI, OpenAIEmbeddings
from langchain_community.vectorstores import Chroma
from langchain.chains import RetrievalQA
from langchain.prompts import PromptTemplate

# Load existing vectorstore
vectorstore = Chroma(
    persist_directory="./chroma_db",
    embedding_function=OpenAIEmbeddings(model="text-embedding-3-small")
)

# Create retriever
retriever = vectorstore.as_retriever(
    search_type="mmr",  # Maximal Marginal Relevance for diversity
    search_kwargs={"k": 5, "fetch_k": 20}
)

# Custom prompt
QA_PROMPT = PromptTemplate(
    template="""Use the following context to answer the question.
If the answer is not in the context, say "I don't have information about that."
Do not make up information.
Context:
{context}
Question: {question}
Answer:""",
    input_variables=["context", "question"]
)

# Chain
llm = ChatOpenAI(model="gpt-4o", temperature=0)
qa_chain = RetrievalQA.from_chain_type(
    llm=llm,
    chain_type="stuff",
    retriever=retriever,
    chain_type_kwargs={"prompt": QA_PROMPT},
    return_source_documents=True
)

result = qa_chain.invoke({"query": "What is the refund policy?"})
print(result["result"])
for doc in result["source_documents"]:
    print(f"Source: {doc.metadata['source']}, page {doc.metadata.get('page', 'N/A')}")
4.4 Improving Retrieval Quality
Hybrid search combines dense (semantic) and sparse (keyword) retrieval:
from langchain_community.retrievers import BM25Retriever
from langchain.retrievers import EnsembleRetriever

# Dense retriever (semantic)
dense_retriever = vectorstore.as_retriever(search_kwargs={"k": 5})

# Sparse retriever (BM25 keyword)
bm25_retriever = BM25Retriever.from_documents(chunks)
bm25_retriever.k = 5

# Weighted ensemble: 0.6 dense, 0.4 sparse
ensemble_retriever = EnsembleRetriever(
    retrievers=[dense_retriever, bm25_retriever],
    weights=[0.6, 0.4]
)
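Under the hood, ensemble retrievers typically merge the two ranked lists with weighted Reciprocal Rank Fusion. A minimal sketch of the idea (the doc IDs and the conventional k=60 constant are illustrative):

```python
def rrf_merge(rankings: list[list[str]], weights: list[float], k: int = 60) -> list[str]:
    """Weighted Reciprocal Rank Fusion: score(doc) = sum(weight / (k + rank))."""
    scores: dict[str, float] = {}
    for ranking, weight in zip(rankings, weights):
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + weight / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)

dense = ["d1", "d2", "d3"]   # ranking from the semantic retriever
sparse = ["d3", "d1", "d4"]  # ranking from BM25
print(rrf_merge([dense, sparse], weights=[0.6, 0.4]))  # ['d1', 'd3', 'd2', 'd4']
```

Documents that appear high in both rankings (here d1 and d3) win, which is exactly why hybrid search is robust to the failure modes of either retriever alone.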
Reranking with a cross-encoder after initial retrieval:
from sentence_transformers import CrossEncoder

reranker = CrossEncoder("cross-encoder/ms-marco-MiniLM-L-6-v2")

def rerank(query: str, docs: list, top_k: int = 3) -> list:
    pairs = [(query, doc.page_content) for doc in docs]
    scores = reranker.predict(pairs)
    ranked = sorted(zip(scores, docs), key=lambda x: x[0], reverse=True)
    return [doc for _, doc in ranked[:top_k]]
5. Tool Use and Function Calling
5.1 Defining Tools
Tools let the LLM call external APIs, search databases, or execute code:
import json
import requests
from openai import OpenAI

client = OpenAI()

# Define tools
tools = [
    {
        "type": "function",
        "function": {
            "name": "get_weather",
            "description": "Get current weather for a city",
            "parameters": {
                "type": "object",
                "properties": {
                    "city": {
                        "type": "string",
                        "description": "City name, e.g. 'San Francisco'"
                    },
                    "unit": {
                        "type": "string",
                        "enum": ["celsius", "fahrenheit"],
                        "description": "Temperature unit"
                    }
                },
                "required": ["city"]
            }
        }
    },
    {
        "type": "function",
        "function": {
            "name": "search_web",
            "description": "Search the web for current information",
            "parameters": {
                "type": "object",
                "properties": {
                    "query": {"type": "string", "description": "Search query"}
                },
                "required": ["query"]
            }
        }
    }
]
5.2 Handling Tool Calls
def get_weather(city: str, unit: str = "celsius") -> dict:
    # Real implementation would call a weather API
    return {"city": city, "temp": 18, "unit": unit, "condition": "sunny"}

def search_web(query: str) -> str:
    # Real implementation would call a search API
    return f"Search results for: {query}"

TOOL_MAP = {
    "get_weather": get_weather,
    "search_web": search_web,
}

def run_agent(user_message: str) -> str:
    messages = [{"role": "user", "content": user_message}]
    while True:
        response = client.chat.completions.create(
            model="gpt-4o",
            messages=messages,
            tools=tools,
            tool_choice="auto"
        )
        message = response.choices[0].message
        # No tool call → final answer
        if not message.tool_calls:
            return message.content
        # Add the assistant's response with tool calls to history
        messages.append(message)
        # Execute each tool call
        for tool_call in message.tool_calls:
            func_name = tool_call.function.name
            func_args = json.loads(tool_call.function.arguments)
            result = TOOL_MAP[func_name](**func_args)
            messages.append({
                "role": "tool",
                "tool_call_id": tool_call.id,
                "content": json.dumps(result)
            })
answer = run_agent("What's the weather like in Tokyo and should I bring an umbrella?")
print(answer)
5.3 Parallel Tool Calls
GPT-4o and Claude 3+ support parallel tool calls, greatly reducing latency for independent operations:
# The model may call multiple tools at once
# Your loop above already handles this — message.tool_calls is a list
# Both weather and search can be called in a single model turn
response = client.chat.completions.create(
    model="gpt-4o",
    messages=[
        {"role": "user", "content": "Compare weather in Tokyo and Paris, and search for travel tips."}
    ],
    tools=tools,
    parallel_tool_calls=True  # default True for GPT-4o
)
# response may contain 3 tool calls at once: weather(Tokyo), weather(Paris), search(travel tips)
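On the client side, the tool-execution loop from section 5.2 still runs the returned calls one after another; independent calls can instead be executed concurrently. A sketch using `asyncio.to_thread`, with a dummy `get_weather` redefined here so the example is self-contained (a real tool would hit an external API):

```python
import asyncio
import json

def get_weather(city: str) -> dict:
    # Dummy stand-in for a real weather API call
    return {"city": city, "temp": 18}

TOOL_MAP = {"get_weather": get_weather}

async def execute_concurrently(tool_calls: list[dict]) -> list[dict]:
    """Run independent tool calls in parallel threads and build tool messages."""
    async def run_one(call: dict) -> dict:
        args = json.loads(call["arguments"])
        result = await asyncio.to_thread(TOOL_MAP[call["name"]], **args)
        return {"role": "tool", "tool_call_id": call["id"], "content": json.dumps(result)}
    # gather preserves input order, so tool messages line up with tool calls
    return list(await asyncio.gather(*(run_one(c) for c in tool_calls)))

calls = [{"id": "1", "name": "get_weather", "arguments": '{"city": "Tokyo"}'},
         {"id": "2", "name": "get_weather", "arguments": '{"city": "Paris"}'}]
results = asyncio.run(execute_concurrently(calls))
print([json.loads(r["content"])["city"] for r in results])  # ['Tokyo', 'Paris']
```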
6. Streaming and Async Patterns
6.1 Streaming Responses
Streaming dramatically improves perceived latency for users because text appears as it is generated:
from openai import OpenAI

client = OpenAI()

# Synchronous streaming: pass stream=True and print deltas as they arrive
stream = client.chat.completions.create(
    model="gpt-4o",
    messages=[{"role": "user", "content": "Write a short story about a robot."}],
    stream=True,
)
for chunk in stream:
    if chunk.choices and chunk.choices[0].delta.content:
        print(chunk.choices[0].delta.content, end="", flush=True)
6.2 Async Streaming with FastAPI
from fastapi import FastAPI
from fastapi.responses import StreamingResponse
from openai import AsyncOpenAI

app = FastAPI()
client = AsyncOpenAI()

@app.post("/chat")
async def chat(body: dict):
    async def generate():
        stream = await client.chat.completions.create(
            model="gpt-4o",
            messages=body["messages"],
            stream=True,
        )
        async for chunk in stream:
            if chunk.choices and chunk.choices[0].delta.content:
                yield f"data: {chunk.choices[0].delta.content}\n\n"
        yield "data: [DONE]\n\n"
    return StreamingResponse(generate(), media_type="text/event-stream")
6.3 Async Batch Processing
When processing many items, async concurrency provides large throughput gains:
import asyncio
from openai import AsyncOpenAI

client = AsyncOpenAI()

async def classify_one(text: str, semaphore: asyncio.Semaphore) -> str:
    async with semaphore:
        response = await client.chat.completions.create(
            model="gpt-4o-mini",
            messages=[
                {"role": "system", "content": "Classify as POSITIVE, NEGATIVE, or NEUTRAL."},
                {"role": "user", "content": text}
            ],
            max_tokens=10
        )
        return response.choices[0].message.content.strip()

async def classify_batch(texts: list[str], max_concurrent: int = 20) -> list[str]:
    semaphore = asyncio.Semaphore(max_concurrent)
    tasks = [classify_one(text, semaphore) for text in texts]
    return await asyncio.gather(*tasks)

# Usage
texts = ["I love this product!", "Terrible experience.", "It was okay."] * 100
results = asyncio.run(classify_batch(texts))
7. Evaluation and Testing
7.1 Why LLM Evaluation Is Hard
Traditional software testing uses deterministic assertions:
assert add(2, 3) == 5 # Always passes or fails
LLM outputs are non-deterministic and require:
- Semantic equivalence checks (not string equality)
- Rubric-based grading
- Reference-free quality assessment
- Statistical sampling (one run is not enough)
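A cheap illustration of why string equality fails: even a crude word-overlap score (a stand-in for real embedding-based similarity) recognizes a reordered but equivalent answer:

```python
def token_overlap(a: str, b: str) -> float:
    """Jaccard overlap of lowercase word sets -- a crude proxy for semantic similarity."""
    sa, sb = set(a.lower().split()), set(b.lower().split())
    return len(sa & sb) / len(sa | sb) if sa | sb else 1.0

ref = "The capital of France is Paris"
out = "Paris is the capital of France"
print(ref == out)               # False: exact match fails
print(token_overlap(ref, out))  # 1.0: same word set, clearly equivalent
```

Real evaluation pipelines use embedding similarity or an LLM judge (next section) instead of word overlap, but the point stands: the comparison must be semantic.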
7.2 LLM-as-Judge
Use a capable LLM to evaluate another LLM's outputs:
import json
from openai import OpenAI

client = OpenAI()

JUDGE_PROMPT = """You are evaluating an AI assistant's response.
Rate the response on the following criteria (1-5 each):
- Accuracy: Is the information correct?
- Helpfulness: Does it fully address the question?
- Conciseness: Is it appropriately brief?
Question: {question}
Response: {response}
Reference answer: {reference}
Output as JSON: {{"accuracy": X, "helpfulness": X, "conciseness": X, "reasoning": "..."}}"""

def evaluate(question: str, response: str, reference: str) -> dict:
    result = client.chat.completions.create(
        model="gpt-4o",
        response_format={"type": "json_object"},
        messages=[{
            "role": "user",
            "content": JUDGE_PROMPT.format(
                question=question,
                response=response,
                reference=reference
            )
        }]
    )
    return json.loads(result.choices[0].message.content)
7.3 Evaluation Frameworks
DeepEval provides comprehensive LLM evaluation metrics:
from deepeval import evaluate
from deepeval.metrics import (
    AnswerRelevancyMetric,
    FaithfulnessMetric,
    ContextualRecallMetric,
)
from deepeval.test_case import LLMTestCase

test_case = LLMTestCase(
    input="What is the capital of France?",
    actual_output="The capital of France is Paris.",
    expected_output="Paris",
    retrieval_context=["France is a country in Western Europe. Its capital is Paris."]
)

metrics = [
    AnswerRelevancyMetric(threshold=0.8),
    FaithfulnessMetric(threshold=0.9),
    ContextualRecallMetric(threshold=0.8),
]
evaluate([test_case], metrics)
7.4 Regression Testing with Promptfoo
Promptfoo lets you define test cases in YAML and run them across model versions:
# promptfooconfig.yaml
prompts:
  - 'Summarize the following text in 2 sentences: {{text}}'
providers:
  - openai:gpt-4o
  - openai:gpt-4o-mini
tests:
  - vars:
      text: "The Eiffel Tower was built in 1889 for the World's Fair..."
    assert:
      - type: llm-rubric
        value: "The summary should mention the year 1889 and the World's Fair"
      - type: javascript
        value: "output.split('.').length <= 3" # max 3 sentences
8. Cost Optimization
8.1 Token Counting and Budgeting
import tiktoken

def count_tokens(text: str, model: str = "gpt-4o") -> int:
    encoding = tiktoken.encoding_for_model(model)
    return len(encoding.encode(text))

def estimate_cost(
    input_tokens: int,
    output_tokens: int,
    model: str = "gpt-4o"
) -> float:
    # Prices per 1M tokens (March 2026 approximate)
    PRICING = {
        "gpt-4o": {"input": 2.50, "output": 10.00},
        "gpt-4o-mini": {"input": 0.15, "output": 0.60},
        "claude-opus-4-5": {"input": 15.00, "output": 75.00},
        "claude-haiku-3-5": {"input": 0.80, "output": 4.00},
    }
    p = PRICING.get(model, {"input": 5.0, "output": 15.0})
    return (input_tokens * p["input"] + output_tokens * p["output"]) / 1_000_000
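As a worked example of the arithmetic: classifying 10,000 reviews at roughly 200 input and 10 output tokens each on gpt-4o-mini, using the same per-1M-token prices as the PRICING table, costs well under a dollar:

```python
# Same gpt-4o-mini prices per 1M tokens as in the PRICING table
INPUT_PRICE, OUTPUT_PRICE = 0.15, 0.60

n_docs, in_toks, out_toks = 10_000, 200, 10
cost = (n_docs * in_toks * INPUT_PRICE + n_docs * out_toks * OUTPUT_PRICE) / 1_000_000
print(f"${cost:.2f}")  # $0.36
```

Running the same job on a frontier model at 100x the per-token price is the difference between cents and tens of dollars, which is why the routing strategies below matter.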
8.2 Prompt Caching
Anthropic and OpenAI both offer prompt caching for repeated system prompts or large context:
# Anthropic prompt caching
import anthropic

client = anthropic.Anthropic()
response = client.messages.create(
    model="claude-opus-4-5",
    max_tokens=1024,
    system=[
        {
            "type": "text",
            "text": very_long_system_prompt,  # e.g. 50k-token policy document
            "cache_control": {"type": "ephemeral"}  # Cache this prefix
        }
    ],
    messages=[{"role": "user", "content": user_question}]
)
# First call: full price. Subsequent calls: ~90% discount on cached tokens.
8.3 Model Routing
Route tasks to the cheapest model that can handle them:
def route_to_model(task: str, complexity: str) -> str:
    """Route to an appropriate model based on task complexity."""
    if complexity == "simple":
        return "gpt-4o-mini"      # Simple classification, extraction
    elif complexity == "medium":
        return "gpt-4o"           # Summarization, Q&A
    else:
        return "claude-opus-4-5"  # Complex reasoning, code review

# Example: classify complexity before routing
def smart_complete(messages: list, task_description: str) -> str:
    from openai import OpenAI
    client = OpenAI()
    # Cheap classification step
    complexity = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[{
            "role": "user",
            "content": f"Rate the complexity of this task as 'simple', 'medium', or 'complex': {task_description}"
        }],
        max_tokens=5
    ).choices[0].message.content.strip().lower()
    model = route_to_model(task_description, complexity)
    return client.chat.completions.create(
        model=model,
        messages=messages
    ).choices[0].message.content
9. Production Deployment
9.1 FastAPI Backend
from fastapi import FastAPI, HTTPException
from pydantic import BaseModel
from openai import AsyncOpenAI

app = FastAPI(title="LLM API")
client = AsyncOpenAI()

class ChatRequest(BaseModel):
    messages: list[dict]
    model: str = "gpt-4o"
    temperature: float = 0.7
    max_tokens: int = 1000

class ChatResponse(BaseModel):
    content: str
    usage: dict

@app.post("/chat", response_model=ChatResponse)
async def chat(request: ChatRequest):
    try:
        response = await client.chat.completions.create(
            model=request.model,
            messages=request.messages,
            temperature=request.temperature,
            max_tokens=request.max_tokens,
        )
        return ChatResponse(
            content=response.choices[0].message.content,
            usage={
                "input_tokens": response.usage.prompt_tokens,
                "output_tokens": response.usage.completion_tokens
            }
        )
    except Exception as e:
        raise HTTPException(status_code=500, detail=str(e))
9.2 Rate Limiting and Retry
import asyncio
import random
from functools import wraps
from openai import AsyncOpenAI

def with_retry(max_attempts: int = 3, base_delay: float = 1.0):
    def decorator(func):
        @wraps(func)
        async def wrapper(*args, **kwargs):
            for attempt in range(max_attempts):
                try:
                    return await func(*args, **kwargs)
                except Exception:
                    if attempt == max_attempts - 1:
                        raise
                    # Exponential backoff with jitter
                    delay = base_delay * (2 ** attempt) + random.uniform(0, 1)
                    await asyncio.sleep(delay)
        return wrapper
    return decorator

@with_retry(max_attempts=3)
async def robust_completion(messages: list) -> str:
    client = AsyncOpenAI()
    response = await client.chat.completions.create(
        model="gpt-4o",
        messages=messages
    )
    return response.choices[0].message.content
9.3 Caching with Redis
import hashlib
import json
import redis.asyncio as redis

redis_client = redis.from_url("redis://localhost:6379")
CACHE_TTL = 3600  # 1 hour

def cache_key(messages: list, model: str) -> str:
    payload = json.dumps({"messages": messages, "model": model}, sort_keys=True)
    return f"llm:{hashlib.md5(payload.encode()).hexdigest()}"

async def cached_completion(messages: list, model: str = "gpt-4o") -> str:
    key = cache_key(messages, model)
    # Check cache
    cached = await redis_client.get(key)
    if cached:
        return cached.decode()
    # Generate
    from openai import AsyncOpenAI
    client = AsyncOpenAI()
    response = await client.chat.completions.create(model=model, messages=messages)
    result = response.choices[0].message.content
    # Store with TTL
    await redis_client.setex(key, CACHE_TTL, result)
    return result
10. Observability and Monitoring
10.1 Key Metrics to Track
| Metric | Why It Matters | Alert Threshold |
|---|---|---|
| Latency (p50, p95, p99) | User experience | p95 > 5s for streaming |
| Token usage | Cost | Budget deviation > 20% |
| Error rate | Reliability | > 1% of requests |
| Cache hit rate | Cost efficiency | < 30% (investigate) |
| Evaluation scores | Quality | Drop > 5% from baseline |
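The latency percentiles in the table can be computed from raw request samples without extra dependencies; a minimal nearest-rank sketch:

```python
import math

def percentile(samples: list[float], pct: float) -> float:
    """Nearest-rank percentile (pct in 0-100)."""
    ordered = sorted(samples)
    index = max(0, math.ceil(pct / 100 * len(ordered)) - 1)
    return ordered[index]

latencies_ms = [120, 340, 95, 800, 4100, 210, 150, 5600, 180, 260]
print(f"p95 latency: {percentile(latencies_ms, 95)} ms")  # p95 latency: 5600 ms
```

In practice a metrics backend (Prometheus, Datadog) does this with histograms, but the nearest-rank definition is what the p50/p95/p99 alert thresholds above refer to.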
10.2 LangSmith Tracing
import os
from langchain_openai import ChatOpenAI

os.environ["LANGCHAIN_API_KEY"] = "ls__..."
os.environ["LANGCHAIN_TRACING_V2"] = "true"
os.environ["LANGCHAIN_PROJECT"] = "my-llm-app"

# All LangChain calls are automatically traced
llm = ChatOpenAI(model="gpt-4o")
response = llm.invoke("What is RAG?")
# Full trace (prompt, response, latency, tokens) visible in the LangSmith UI
10.3 Custom Logging
import time
import logging
from dataclasses import dataclass, field, asdict
from openai import AsyncOpenAI

logger = logging.getLogger(__name__)

@dataclass
class LLMCallLog:
    model: str
    input_tokens: int
    output_tokens: int
    latency_ms: float
    success: bool
    error: str = ""
    metadata: dict = field(default_factory=dict)

async def traced_completion(messages: list, model: str = "gpt-4o", **metadata) -> str:
    client = AsyncOpenAI()
    start = time.perf_counter()
    success = True
    error = ""
    input_tokens = output_tokens = 0
    try:
        response = await client.chat.completions.create(model=model, messages=messages)
        result = response.choices[0].message.content
        input_tokens = response.usage.prompt_tokens
        output_tokens = response.usage.completion_tokens
        return result
    except Exception as e:
        success = False
        error = str(e)
        raise
    finally:
        log = LLMCallLog(
            model=model,
            input_tokens=input_tokens,
            output_tokens=output_tokens,
            latency_ms=(time.perf_counter() - start) * 1000,
            success=success,
            error=error,
            metadata=metadata
        )
        logger.info("llm_call", extra=asdict(log))
10.4 Guardrails and Safety
from guardrails import Guard
from guardrails.hub import ToxicLanguage, DetectPII
from openai import OpenAI

guard = Guard().use_many(
    ToxicLanguage(threshold=0.5, on_fail="exception"),
    DetectPII(pii_entities=["EMAIL_ADDRESS", "PHONE_NUMBER"], on_fail="fix"),
)

def safe_completion(user_input: str) -> str:
    client = OpenAI()
    # Validate input
    guard.validate(user_input)
    response = client.chat.completions.create(
        model="gpt-4o",
        messages=[{"role": "user", "content": user_input}]
    )
    output = response.choices[0].message.content
    # Validate and fix output
    validated = guard.validate(output)
    return validated.validated_output
Summary
Building production-grade LLM applications requires mastery across multiple dimensions:
| Area | Key Takeaway |
|---|---|
| Prompt engineering | Treat prompts as code; version, test, and iterate them |
| RAG | Hybrid search + reranking dramatically improves retrieval |
| Tool use | Parallel tool calls reduce latency for multi-step tasks |
| Streaming | Essential for interactive UX; use SSE with FastAPI |
| Evaluation | LLM-as-judge + automated test suites catch regressions |
| Cost | Caching, routing, and prompt caching can cut costs by 80%+ |
| Monitoring | Track latency, tokens, and quality metrics from day one |
The field evolves rapidly, but these fundamentals will serve you well regardless of which models or frameworks dominate next year. Start simple, measure everything, and iterate based on real usage data.
Knowledge Check Quiz
Q1. What does RAG stand for, and why is it useful?
RAG stands for Retrieval-Augmented Generation. It is useful because it lets an LLM answer questions about documents it was never trained on by retrieving relevant text at inference time and injecting it into the prompt. This addresses both the knowledge cutoff problem and the context window limitation.
Q2. What is the main advantage of parallel tool calls in an LLM agent?
Parallel tool calls allow the model to invoke multiple tools simultaneously in a single turn rather than sequentially. This reduces total latency for multi-step tasks where the tool calls are independent of each other.
Q3. Why is LLM-as-judge evaluation preferred over simple string matching?
LLM outputs are semantically equivalent in many different phrasings, so string matching produces false negatives. An LLM judge can assess semantic correctness, helpfulness, and quality using a rubric, providing much more accurate quality signals than deterministic comparison.
Q4. Name two techniques for reducing LLM API costs without degrading quality.
- Prompt caching: Cache repeated large prefixes (system prompts, reference documents) so they are only charged at full price once.
- Model routing: Direct simple tasks (classification, extraction) to cheap small models, reserving expensive large models only for complex reasoning tasks.