1. [Overview of LLM Application Development](#1-overview)

2. [Prompt Engineering Fundamentals](#2-prompt-engineering)

3. [LLM APIs and SDKs](#3-apis-and-sdks)

4. [Retrieval-Augmented Generation (RAG)](#4-rag)

5. [Tool Use and Function Calling](#5-tool-use)

6. [Streaming and Async Patterns](#6-streaming)

7. [Evaluation and Testing](#7-evaluation)

8. [Cost Optimization](#8-cost-optimization)

9. [Production Deployment](#9-production-deployment)

10. [Observability and Monitoring](#10-observability)

1. Overview of LLM Application Development

1.1 What Is an LLM Application?

An LLM application is any software system that uses a large language model as a core component to process natural language, generate content, reason over information, or take actions. Unlike traditional software where every behavior is explicitly programmed, LLM applications delegate significant portions of logic to a pre-trained model.

Common LLM application categories:

| Category                | Examples                              | Key Challenges                       |
| ----------------------- | ------------------------------------- | ------------------------------------ |
| Chatbots and assistants | Customer support, personal assistants | Context management, tone consistency |
| Document QA             | Contract review, internal search      | Retrieval accuracy, hallucination    |
| Code generation         | Autocomplete, PR review, test writing | Correctness, security                |
| Content generation      | Marketing copy, summarization         | Quality control, brand voice         |
| Data extraction         | Form parsing, structured output       | Schema adherence, robustness         |
| Autonomous agents       | Research agents, task automation      | Reliability, cost control            |

1.2 The Development Stack

A modern LLM application typically has these layers:

┌─────────────────────────────────────────┐
│             User Interface              │
│     (Web, Mobile, API, Slack, CLI)      │
├─────────────────────────────────────────┤
│            Application Logic            │
│     (Orchestration, Business Rules)     │
├─────────────────────────────────────────┤
│         LLM Orchestration Layer         │
│    (LangChain, LlamaIndex, raw SDK)     │
├─────────────────────────────────────────┤
│             LLM Provider(s)             │
│   (OpenAI, Anthropic, Google, local)    │
├─────────────────────────────────────────┤
│           Supporting Services           │
│    (Vector DB, cache, search, tools)    │
└─────────────────────────────────────────┘

1.3 Key Principles

**1. Start simple, add complexity only when needed.**

A direct API call with a well-crafted prompt often outperforms elaborate orchestration frameworks. Add abstractions when you have a proven use case for them.

**2. Treat prompts as code.**

Version-control your prompts, write tests for them, and track changes carefully. Prompt regressions are as damaging as code regressions.

**3. Evaluate before you ship.**

LLM outputs are non-deterministic. Without systematic evaluation you cannot know whether your changes improved or harmed quality.

**4. Design for failure.**

LLMs hallucinate, time out, and return unexpected formats. Build retry logic, fallbacks, and validation from the start.
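
The "design for failure" principle can be sketched as a small wrapper that validates the model's output, retries on malformed responses, and falls back to a safe default. This is a minimal illustration, not a definitive implementation: `complete_with_validation`, `call_model`, and `required_keys` are hypothetical names, and `call_model` stands in for whatever SDK call the application actually makes.

```python
import json
from typing import Callable, Optional, Set


def complete_with_validation(
    call_model: Callable[[str], str],
    prompt: str,
    required_keys: Set[str],
    max_attempts: int = 3,
    fallback: Optional[dict] = None,
) -> dict:
    """Call the model, validate its JSON output, retry on failure, then fall back."""
    for _ in range(max_attempts):
        try:
            data = json.loads(call_model(prompt))
        except (TimeoutError, json.JSONDecodeError):
            continue  # timeout or malformed output: try again
        # Accept only a JSON object that contains every required key
        if isinstance(data, dict) and required_keys.issubset(data):
            return data
    return fallback if fallback is not None else {}


# Usage with a stub that returns bad output twice before succeeding
_responses = iter(["not json", '{"wrong": 1}', '{"label": "POSITIVE"}'])
result = complete_with_validation(lambda p: next(_responses), "Classify: great!", {"label"})
print(result)  # {'label': 'POSITIVE'}
```

Passing the model call in as a function keeps the retry and validation logic testable without network access.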

2. Prompt Engineering Fundamentals

2.1 Anatomy of a Prompt

A production prompt has four optional sections:

[System instructions]

You are a helpful customer support agent for Acme Corp.

Respond in the same language the user writes in.

Always be polite and concise. Never discuss competitors.

[Context / Retrieved documents]

Order #12345 placed on 2026-03-10. Status: shipped.

Tracking number: 1Z999AA10123456784

[Examples (few-shot)]

User: Where is my order?

Assistant: Your order #99999 shipped on March 5 and is in transit.

Expected delivery: March 12.

[User message]

I ordered something last week and haven't received it yet.
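
One way to turn these sections into a chat-style request is a small helper that assembles a message list. The sketch below assumes the common `role`/`content` message format; `build_messages` is a hypothetical helper, and placing the retrieved context inside the system message is one design choice among several.

```python
def build_messages(system, user_message, context=None, examples=None):
    """Assemble chat messages from the prompt sections; context and examples may be omitted."""
    messages = []
    system_parts = [system] if system else []
    if context:
        system_parts.append("Context:\n" + context)
    if system_parts:
        messages.append({"role": "system", "content": "\n\n".join(system_parts)})
    # Few-shot examples become alternating user/assistant turns
    for example_user, example_assistant in examples or []:
        messages.append({"role": "user", "content": example_user})
        messages.append({"role": "assistant", "content": example_assistant})
    messages.append({"role": "user", "content": user_message})
    return messages


messages = build_messages(
    system="You are a helpful customer support agent for Acme Corp.",
    context="Order #12345 placed on 2026-03-10. Status: shipped.",
    examples=[("Where is my order?", "Your order #99999 shipped on March 5 and is in transit.")],
    user_message="I ordered something last week and haven't received it yet.",
)
print([m["role"] for m in messages])  # ['system', 'user', 'assistant', 'user']
```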

2.2 System Instructions Best Practices

Write system instructions that are:

- **Role-specific**: Define exactly who the model is and what its purpose is.

- **Constraint-explicit**: State what the model should and should not do.

- **Format-specified**: Describe the expected output format when it matters.

- **Tone-defined**: Specify formality, language, length expectations.

SYSTEM_PROMPT = """You are a senior Python code reviewer at a fintech company.

Your responsibilities:

- Review code for correctness, security vulnerabilities, and performance issues

- Suggest specific improvements with code examples

- Flag any PII handling that violates GDPR/CCPA

Output format:

- Start with a one-sentence overall assessment

- List issues with severity: [CRITICAL], [WARNING], [SUGGESTION]

- End with a revised code block if changes are needed

You do not generate new features. Only review what is given to you."""

2.3 Few-Shot Prompting

Few-shot examples show the model the expected input-output pattern. They are especially effective for:

- Custom output formats

- Domain-specific tone or terminology

- Classification with unusual labels

FEW_SHOT_EXAMPLES = """

Extract the action items from the meeting note below.

Output as JSON array.

Meeting: John will update the deployment guide by Friday.

Sarah needs to review the Q1 budget before the board meeting.

Action items: [

{"owner": "John", "task": "Update deployment guide", "due": "Friday"},

{"owner": "Sarah", "task": "Review Q1 budget", "due": "Before board meeting"}

]

Meeting: We agreed that the API team will add rate limiting this sprint.

No owner was assigned for the documentation update.

Action items: [

{"owner": "API team", "task": "Add rate limiting", "due": "This sprint"},

{"owner": null, "task": "Documentation update", "due": null}

]

Meeting: {meeting_text}

Action items:"""

2.4 Chain-of-Thought (CoT)

For complex reasoning tasks, ask the model to show its work before giving the final answer.

COT_PROMPT = """Solve the following problem step by step.

Show your reasoning at each step, then give the final answer.

Problem: A customer has a $500 credit. They place an order for $320,

then return one $80 item. What is their remaining credit?

Let's think step by step:"""

Zero-shot CoT trigger: append "Let's think step by step." to your prompt without examples. This simple addition significantly improves multi-step reasoning on many models.
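
When the model reasons step by step, the application usually needs only the final answer. A minimal sketch, assuming the prompt also instructs the model to end with a line starting with "Final answer:" (that marker is an assumption, not part of the prompt above). The sample text is hand-written for illustration, not real model output.

```python
import re


def extract_final_answer(cot_output: str):
    """Return the text after the last 'Final answer:' marker, or None if it is absent."""
    matches = re.findall(r"final answer\s*:\s*(.+)", cot_output, flags=re.IGNORECASE)
    return matches[-1].strip() if matches else None


sample = """Step 1: Start with $500 credit.
Step 2: Subtract the $320 order: 500 - 320 = 180.
Step 3: Add back the $80 return: 180 + 80 = 260.
Final answer: $260"""

print(extract_final_answer(sample))  # $260
```

Returning `None` when the marker is missing lets the caller retry or flag the response instead of parsing free-form reasoning.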

2.5 Structured Output

Force the model to produce parseable output using JSON mode or schema constraints.

import json

from openai import OpenAI

client = OpenAI()

response = client.chat.completions.create(
    model="gpt-4o",
    response_format={"type": "json_object"},
    messages=[
        {"role": "system", "content": "Extract entities. Output valid JSON only."},
        {"role": "user", "content": "Apple announced the iPhone 16 in Cupertino on September 9, 2024."},
    ],
)

data = json.loads(response.choices[0].message.content)
# {"company": "Apple", "product": "iPhone 16", "location": "Cupertino", "date": "2024-09-09"}

With Pydantic and the OpenAI SDK's structured output feature:

from pydantic import BaseModel
from openai import OpenAI

class NewsEvent(BaseModel):
    company: str
    product: str
    location: str
    date: str

client = OpenAI()

response = client.beta.chat.completions.parse(
    model="gpt-4o-2024-08-06",
    messages=[
        {"role": "system", "content": "Extract the event details."},
        {"role": "user", "content": "Apple announced the iPhone 16 in Cupertino on September 9, 2024."},
    ],
    response_format=NewsEvent,
)

event = response.choices[0].message.parsed
print(event.company)  # Apple

3. LLM APIs and SDKs

3.1 OpenAI SDK

from openai import OpenAI

client = OpenAI(api_key="sk-...")  # or set OPENAI_API_KEY env var

# Basic chat completion
response = client.chat.completions.create(
    model="gpt-4o",
    messages=[
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": "Summarize the Transformer architecture in 3 sentences."},
    ],
    temperature=0.7,
    max_tokens=200,
)

print(response.choices[0].message.content)
print(f"Tokens used: {response.usage.total_tokens}")

3.2 Anthropic SDK

import anthropic

client = anthropic.Anthropic(api_key="sk-ant-...")  # or set ANTHROPIC_API_KEY env var

message = client.messages.create(
    model="claude-opus-4-5",
    max_tokens=1024,
    system="You are a helpful assistant.",
    messages=[
        {"role": "user", "content": "Explain attention mechanisms in transformers."},
    ],
)

print(message.content[0].text)
print(f"Input tokens: {message.usage.input_tokens}")
print(f"Output tokens: {message.usage.output_tokens}")

3.3 Unified Interface with LiteLLM

LiteLLM provides a single interface across 100+ LLM providers:

from litellm import completion

# OpenAI
response = completion(
    model="gpt-4o",
    messages=[{"role": "user", "content": "Hello"}],
)

# Anthropic (same interface)
response = completion(
    model="anthropic/claude-opus-4-5",
    messages=[{"role": "user", "content": "Hello"}],
)

# Local Ollama model (same interface)
response = completion(
    model="ollama/llama3",
    messages=[{"role": "user", "content": "Hello"}],
)

print(response.choices[0].message.content)

3.4 Managing Conversation History

from openai import OpenAI

class ConversationManager:
    def __init__(self, system_prompt: str, max_history: int = 20):
        self.system_prompt = system_prompt
        self.max_history = max_history
        self.history: list[dict] = []
        self.client = OpenAI()

    def chat(self, user_message: str) -> str:
        self.history.append({"role": "user", "content": user_message})

        # Trim history to avoid exceeding context window
        if len(self.history) > self.max_history:
            self.history = self.history[-self.max_history:]

        response = self.client.chat.completions.create(
            model="gpt-4o",
            messages=[
                {"role": "system", "content": self.system_prompt},
                *self.history,
            ],
        )

        assistant_message = response.choices[0].message.content
        self.history.append({"role": "assistant", "content": assistant_message})
        return assistant_message
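
Trimming by message count ignores how long each message is. A token budget is a closer proxy for the context window. The sketch below is one possible approach, not the method used above: `estimate_tokens` uses a rough four-characters-per-token heuristic (an assumption; a real tokenizer would be more accurate), and `trim_to_budget` is a hypothetical helper.

```python
def estimate_tokens(text: str) -> int:
    """Rough heuristic (about 4 characters per token); swap in a real tokenizer for accuracy."""
    return max(1, len(text) // 4)


def trim_to_budget(history: list, max_tokens: int) -> list:
    """Keep the most recent messages whose estimated token total fits in max_tokens."""
    kept, used = [], 0
    for message in reversed(history):  # walk from newest to oldest
        cost = estimate_tokens(message["content"])
        if used + cost > max_tokens:
            break
        kept.append(message)
        used += cost
    return list(reversed(kept))  # restore chronological order


history = [
    {"role": "user", "content": "a" * 400},       # ~100 tokens
    {"role": "assistant", "content": "b" * 200},  # ~50 tokens
    {"role": "user", "content": "c" * 80},        # ~20 tokens
]
print(len(trim_to_budget(history, max_tokens=80)))  # 2
```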

4. Retrieval-Augmented Generation (RAG)

4.1 Why RAG?

LLMs have two fundamental limitations that RAG addresses:

1. **Knowledge cutoff**: The model only knows what was in its training data.

2. **Context window limit**: The model cannot "know" all your documents at once.

RAG solves both by retrieving the relevant pieces of information at inference time and injecting them into the prompt.

User Query
    │
    ▼
[Embed query] ──► [Vector Search] ──► Top-K relevant chunks
    │
    ▼
[Build augmented prompt]
    System: You are a helpful assistant.
    Context: {retrieved chunks}
    User: {original query}
    │
    ▼
[LLM generates answer]
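
The flow above can be sketched end to end without any external service. In this toy version, `embed` is a bag-of-words counter standing in for a real embedding model, and the chunk texts are made up for illustration. Only the retrieve-then-augment structure is the point.

```python
import math
import re
from collections import Counter


def embed(text):
    """Toy 'embedding': a bag-of-words count vector (a real system would use an embedding model)."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


def retrieve(query, chunks, k=2):
    """Return the k chunks most similar to the query."""
    q = embed(query)
    return sorted(chunks, key=lambda c: cosine(q, embed(c)), reverse=True)[:k]


def build_prompt(query, chunks, k=2):
    """Inject the retrieved chunks into the prompt ahead of the question."""
    context = "\n".join(f"- {c}" for c in retrieve(query, chunks, k))
    return f"Context:\n{context}\n\nQuestion: {query}\nAnswer:"


chunks = [
    "A refund is available within 30 days of purchase.",
    "Shipping takes 3 to 5 business days.",
    "Support is open Monday through Friday.",
]
query = "Can I get a refund after purchase?"
print(build_prompt(query, chunks, k=1))
```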

4.2 Document Ingestion Pipeline

from langchain_community.document_loaders import PyPDFLoader, DirectoryLoader
from langchain_text_splitters import RecursiveCharacterTextSplitter
from langchain_openai import OpenAIEmbeddings
from langchain_community.vectorstores import Chroma

# 1. Load documents
loader = DirectoryLoader("./docs", glob="**/*.pdf", loader_cls=PyPDFLoader)
documents = loader.load()

# 2. Split into chunks
splitter = RecursiveCharacterTextSplitter(
    chunk_size=1000,
    chunk_overlap=200,
    separators=["\n\n", "\n", " ", ""],
)
chunks = splitter.split_documents(documents)

# 3. Embed and store
embeddings = OpenAIEmbeddings(model="text-embedding-3-small")
vectorstore = Chroma.from_documents(
    documents=chunks,
    embedding=embeddings,
    persist_directory="./chroma_db",
)

print(f"Indexed {len(chunks)} chunks from {len(documents)} documents")

4.3 Retrieval and Generation

from langchain_openai import ChatOpenAI, OpenAIEmbeddings
from langchain_community.vectorstores import Chroma
from langchain.chains import RetrievalQA
from langchain.prompts import PromptTemplate

# Load existing vectorstore
vectorstore = Chroma(
    persist_directory="./chroma_db",
    embedding_function=OpenAIEmbeddings(model="text-embedding-3-small"),
)

# Create retriever
retriever = vectorstore.as_retriever(
    search_type="mmr",  # Maximal Marginal Relevance for diversity
    search_kwargs={"k": 5, "fetch_k": 20},
)

# Custom prompt
QA_PROMPT = PromptTemplate(
    template="""Use the following context to answer the question.
If the answer is not in the context, say "I don't have information about that."
Do not make up information.

Context:
{context}

Question: {question}

Answer:""",
    input_variables=["context", "question"],
)

# Chain
llm = ChatOpenAI(model="gpt-4o", temperature=0)
qa_chain = RetrievalQA.from_chain_type(
    llm=llm,
    chain_type="stuff",
    retriever=retriever,
    chain_type_kwargs={"prompt": QA_PROMPT},
    return_source_documents=True,
)

result = qa_chain.invoke({"query": "What is the refund policy?"})
print(result["result"])
for doc in result["source_documents"]:
    print(f"Source: {doc.metadata['source']}, page {doc.metadata.get('page', 'N/A')}")

4.4 Improving Retrieval Quality

**Hybrid search** combines dense (semantic) and sparse (keyword) retrieval:

from langchain_community.retrievers import BM25Retriever
from langchain.retrievers import EnsembleRetriever

# Dense retriever (semantic); vectorstore and chunks come from the ingestion step
dense_retriever = vectorstore.as_retriever(search_kwargs={"k": 5})

# Sparse retriever (BM25 keyword)
bm25_retriever = BM25Retriever.from_documents(chunks)
bm25_retriever.k = 5

# Weighted ensemble: 60% dense, 40% sparse
ensemble_retriever = EnsembleRetriever(
    retrievers=[dense_retriever, bm25_retriever],
    weights=[0.6, 0.4],
)
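
To make the fusion step concrete, here is a minimal sketch of weighted Reciprocal Rank Fusion (RRF), a common way to merge ranked lists from dense and sparse retrievers. It is shown as an illustration of the general technique; the function name, the document IDs, and the constant `k=60` are assumptions for this example, and no claim is made that any particular library uses exactly this formula.

```python
def reciprocal_rank_fusion(rankings, weights=None, k=60):
    """Fuse ranked lists of document IDs; each list contributes weight / (k + rank)."""
    weights = weights or [1.0] * len(rankings)
    scores = {}
    for ranking, weight in zip(rankings, weights):
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + weight / (k + rank)
    # Highest fused score first
    return sorted(scores, key=scores.get, reverse=True)


dense = ["doc_a", "doc_b", "doc_c"]   # hypothetical semantic ranking
sparse = ["doc_c", "doc_a", "doc_d"]  # hypothetical keyword ranking
fused = reciprocal_rank_fusion([dense, sparse], weights=[0.6, 0.4])
print(fused)  # ['doc_a', 'doc_c', 'doc_b', 'doc_d']
```

Documents that appear near the top of both lists rise above documents that only one retriever found, which is the behavior hybrid search relies on.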

**Reranking** with a cross-encoder after initial retrieval:

from sentence_transformers import CrossEncoder

reranker = CrossEncoder("cross-encoder/ms-marco-MiniLM-L-6-v2")

def rerank(query: str, docs: list, top_k: int = 3) -> list:
    # Score every (query, document) pair, then keep the top_k highest-scoring documents
    pairs = [(query, doc.page_content) for doc in docs]
    scores = reranker.predict(pairs)
    ranked = sorted(zip(scores, docs), key=lambda x: x[0], reverse=True)
    return [doc for _, doc in ranked[:top_k]]

5. Tool Use and Function Calling

5.1 Defining Tools

Tools let the LLM call external APIs, search databases, or execute code:

from openai import OpenAI

client = OpenAI()

# Define tools
tools = [
    {
        "type": "function",
        "function": {
            "name": "get_weather",
            "description": "Get current weather for a city",
            "parameters": {
                "type": "object",
                "properties": {
                    "city": {
                        "type": "string",
                        "description": "City name, e.g. 'San Francisco'",
                    },
                    "unit": {
                        "type": "string",
                        "enum": ["celsius", "fahrenheit"],
                        "description": "Temperature unit",
                    },
                },
                "required": ["city"],
            },
        },
    },
    {
        "type": "function",
        "function": {
            "name": "search_web",
            "description": "Search the web for current information",
            "parameters": {
                "type": "object",
                "properties": {
                    "query": {"type": "string", "description": "Search query"},
                },
                "required": ["query"],
            },
        },
    },
]

5.2 Handling Tool Calls

import json

def get_weather(city: str, unit: str = "celsius") -> dict:
    # Real implementation would call a weather API
    return {"city": city, "temp": 18, "unit": unit, "condition": "sunny"}

def search_web(query: str) -> str:
    # Real implementation would call a search API
    return f"Search results for: {query}"

TOOL_MAP = {
    "get_weather": get_weather,
    "search_web": search_web,
}

def run_agent(user_message: str) -> str:
    messages = [{"role": "user", "content": user_message}]

    while True:
        response = client.chat.completions.create(
            model="gpt-4o",
            messages=messages,
            tools=tools,
            tool_choice="auto",
        )
        message = response.choices[0].message

        # No tool call → final answer
        if not message.tool_calls:
            return message.content

        # Add assistant's response with tool calls to history
        messages.append(message)

        # Execute each tool call
        for tool_call in message.tool_calls:
            func_name = tool_call.function.name
            func_args = json.loads(tool_call.function.arguments)
            result = TOOL_MAP[func_name](**func_args)
            messages.append({
                "role": "tool",
                "tool_call_id": tool_call.id,
                "content": json.dumps(result),
            })

answer = run_agent("What's the weather like in Tokyo and should I bring an umbrella?")
print(answer)

5.3 Parallel Tool Calls

GPT-4o and Claude 3+ support parallel tool calls, greatly reducing latency for independent operations:

# The model may call multiple tools at once.
# Your loop above already handles this: message.tool_calls is a list.
# Both weather and search can be called in a single model turn.
response = client.chat.completions.create(
    model="gpt-4o",
    messages=[
        {"role": "user", "content": "Compare weather in Tokyo and Paris, and search for travel tips."},
    ],
    tools=tools,
    parallel_tool_calls=True,  # default True for GPT-4o
)

# response may have 3 tool calls at once: weather(Tokyo), weather(Paris), search(travel tips)
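
Receiving several tool calls in one turn saves model round trips, but the application still decides how to execute them. When the calls are independent and I/O-bound, they can be run concurrently. The sketch below uses stub async tools and plain dictionaries in place of SDK objects; `ASYNC_TOOLS` and `run_tool_calls` are hypothetical names, and the sleeps stand in for network latency.

```python
import asyncio
import json


async def get_weather(city: str) -> dict:
    await asyncio.sleep(0.05)  # stands in for a weather API call
    return {"city": city, "temp": 18}


async def search_web(query: str) -> str:
    await asyncio.sleep(0.05)  # stands in for a search API call
    return f"Search results for: {query}"


ASYNC_TOOLS = {"get_weather": get_weather, "search_web": search_web}


async def run_tool_calls(tool_calls: list) -> list:
    """Run independent tool calls concurrently; results come back in request order."""
    results = await asyncio.gather(
        *(ASYNC_TOOLS[c["name"]](**c["arguments"]) for c in tool_calls)
    )
    return [
        {"role": "tool", "tool_call_id": c["id"], "content": json.dumps(r)}
        for c, r in zip(tool_calls, results)
    ]


calls = [
    {"id": "call_1", "name": "get_weather", "arguments": {"city": "Tokyo"}},
    {"id": "call_2", "name": "get_weather", "arguments": {"city": "Paris"}},
    {"id": "call_3", "name": "search_web", "arguments": {"query": "travel tips"}},
]
tool_messages = asyncio.run(run_tool_calls(calls))
print([m["tool_call_id"] for m in tool_messages])  # ['call_1', 'call_2', 'call_3']
```

Because `asyncio.gather` preserves argument order, each result can be matched back to its `tool_call_id` by position.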

6. Streaming and Async Patterns

6.1 Streaming Responses

Streaming dramatically improves perceived latency for users because text appears as it is generated:

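from openai import OpenAI

client = OpenAI()

# Synchronous streaming
stream = client.chat.completions.create(
    model="gpt-4o",
    messages=[{"role": "user", "content": "Write a short story about a robot."}],
    stream=True,
)

for chunk in stream:
    # Some chunks carry no text (for example the final one), so check before printing
    if chunk.choices and chunk.choices[0].delta.content:
        print(chunk.choices[0].delta.content, end="", flush=True)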

6.2 Async Streaming with FastAPI

from fastapi import FastAPI
from fastapi.responses import StreamingResponse
from openai import AsyncOpenAI

app = FastAPI()
client = AsyncOpenAI()

@app.post("/chat")
async def chat(body: dict):
    async def generate():
        stream = await client.chat.completions.create(
            model="gpt-4o",
            messages=body["messages"],
            stream=True,
        )
        async for chunk in stream:
            # Skip chunks that carry no text
            if chunk.choices and chunk.choices[0].delta.content:
                yield f"data: {chunk.choices[0].delta.content}\n\n"
        yield "data: [DONE]\n\n"

    return StreamingResponse(generate(), media_type="text/event-stream")
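
One detail worth handling in a Server-Sent Events endpoint: a blank line ends an SSE event, so a text delta that itself contains newlines can break the framing if it is written raw after `data:`. A minimal sketch of one way around this is to JSON-encode each delta. `format_sse` and `parse_sse` are hypothetical helpers, and the `{"text": ...}` payload shape is an assumption for this example.

```python
import json


def format_sse(text: str) -> str:
    """Encode one text delta as an SSE event; JSON-encoding keeps newlines on a single data line."""
    payload = json.dumps({"text": text})
    return f"data: {payload}\n\n"


def parse_sse(raw: str) -> list:
    """Decode a buffered SSE body back into text deltas, stopping at the [DONE] sentinel."""
    deltas = []
    for event in raw.split("\n\n"):
        if not event.startswith("data: "):
            continue
        payload = event[len("data: "):]
        if payload == "[DONE]":
            break
        deltas.append(json.loads(payload)["text"])
    return deltas


body = format_sse("Hello,") + format_sse(" world\n") + "data: [DONE]\n\n"
print(repr("".join(parse_sse(body))))  # 'Hello, world\n'
```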

6.3 Async Batch Processing

When processing many items, async concurrency provides large throughput gains:

import asyncio

from openai import AsyncOpenAI

client = AsyncOpenAI()

async def classify_one(text: str, semaphore: asyncio.Semaphore) -> str:
    # The semaphore caps how many requests are in flight at once
    async with semaphore:
        response = await client.chat.completions.create(
            model="gpt-4o-mini",
            messages=[
                {"role": "system", "content": "Classify as POSITIVE, NEGATIVE, or NEUTRAL."},
                {"role": "user", "content": text},
            ],
            max_tokens=10,
        )
        return response.choices[0].message.content.strip()

async def classify_batch(texts: list[str], max_concurrent: int = 20) -> list[str]:
    semaphore = asyncio.Semaphore(max_concurrent)
    tasks = [classify_one(text, semaphore) for text in texts]
    return await asyncio.gather(*tasks)

# Usage
texts = ["I love this product!", "Terrible experience.", "It was okay."] * 100
results = asyncio.run(classify_batch(texts))

7. Evaluation and Testing

7.1 Why LLM Evaluation Is Hard

Traditional software testing uses deterministic assertions:

assert add(2, 3) == 5 # Always passes or fails

LLM outputs are non-deterministic and require:

- **Semantic equivalence checks** (not string equality)

- **Rubric-based grading**

- **Reference-free quality assessment**

- **Statistical sampling** (one run is not enough)
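
The statistical-sampling point can be illustrated with a pass-rate estimate: run the same input several times and report the fraction of outputs that pass a check. In this sketch, `flaky_model` is a seeded stub standing in for a real model call, and `is_correct` is a deliberately simple normalizing check. Both are hypothetical.

```python
import random


def pass_rate(run_once, check, n_samples=20):
    """Fraction of n_samples independent runs whose output passes the check."""
    passed = sum(1 for _ in range(n_samples) if check(run_once()))
    return passed / n_samples


rng = random.Random(0)  # seeded so the stub is reproducible


def flaky_model():
    # Stub for a non-deterministic model: mostly right, sometimes oddly formatted or wrong
    return rng.choice(["Paris", "Paris", "Paris", "paris.", "Lyon"])


def is_correct(output):
    # Normalize case and trailing punctuation instead of comparing raw strings
    return output.strip().rstrip(".").lower() == "paris"


rate = pass_rate(flaky_model, is_correct, n_samples=50)
print(f"pass rate: {rate:.0%}")
```

A single run would report either 100% or 0%; sampling gives an estimate that can be compared across prompt versions.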

7.2 LLM-as-Judge

Use a capable LLM to evaluate another LLM's outputs:

import json

from openai import OpenAI

client = OpenAI()

JUDGE_PROMPT = """You are evaluating an AI assistant's response.

Rate the response on the following criteria (1-5 each):
- Accuracy: Is the information correct?
- Helpfulness: Does it fully address the question?
- Conciseness: Is it appropriately brief?

Question: {question}
Response: {response}
Reference answer: {reference}

Output as JSON: {{"accuracy": X, "helpfulness": X, "conciseness": X, "reasoning": "..."}}"""

def evaluate(question: str, response: str, reference: str) -> dict:
    result = client.chat.completions.create(
        model="gpt-4o",
        response_format={"type": "json_object"},
        messages=[{
            "role": "user",
            "content": JUDGE_PROMPT.format(
                question=question,
                response=response,
                reference=reference,
            ),
        }],
    )
    return json.loads(result.choices[0].message.content)

7.3 Evaluation Frameworks

**DeepEval** provides comprehensive LLM evaluation metrics:

from deepeval import evaluate
from deepeval.metrics import (
    AnswerRelevancyMetric,
    FaithfulnessMetric,
    ContextualRecallMetric,
)
from deepeval.test_case import LLMTestCase

test_case = LLMTestCase(
    input="What is the capital of France?",
    actual_output="The capital of France is Paris.",
    expected_output="Paris",
    retrieval_context=["France is a country in Western Europe. Its capital is Paris."],
)

metrics = [
    AnswerRelevancyMetric(threshold=0.8),
    FaithfulnessMetric(threshold=0.9),
    ContextualRecallMetric(threshold=0.8),
]

evaluate([test_case], metrics)

7.4 Regression Testing with Promptfoo

Promptfoo lets you define test cases in YAML and run them across model versions:

# promptfooconfig.yaml
prompts:
  - 'Summarize the following text in 2 sentences: {{text}}'

providers:
  - openai:gpt-4o
  - openai:gpt-4o-mini

tests:
  - vars:
      text: "The Eiffel Tower was built in 1889 for the World's Fair..."
    assert:
      - type: llm-rubric
        value: "The summary should mention the year 1889 and the World's Fair"
      - type: javascript
        value: "output.split('.').length <= 3" # at most 2 period-terminated sentences

8. Cost Optimization

8.1 Token Counting and Budgeting

import tiktoken

def count_tokens(text: str, model: str = "gpt-4o") -> int:
    encoding = tiktoken.encoding_for_model(model)
    return len(encoding.encode(text))

def estimate_cost(
    input_tokens: int,
    output_tokens: int,
    model: str = "gpt-4o",
) -> float:
    # Prices per 1M tokens (March 2026 approximate)
    PRICING = {
        "gpt-4o": {"input": 2.50, "output": 10.00},
        "gpt-4o-mini": {"input": 0.15, "output": 0.60},
        "claude-opus-4-5": {"input": 15.00, "output": 75.00},
        "claude-haiku-3-5": {"input": 0.80, "output": 4.00},
    }
    # Unknown models fall back to a conservative default price
    p = PRICING.get(model, {"input": 5.0, "output": 15.0})
    return (input_tokens * p["input"] + output_tokens * p["output"]) / 1_000_000

8.2 Prompt Caching

Anthropic and OpenAI both offer prompt caching for repeated system prompts or large context:

import anthropic

# Anthropic prompt caching
# very_long_system_prompt and user_question are assumed to be defined elsewhere
client = anthropic.Anthropic()

response = client.messages.create(
    model="claude-opus-4-5",
    max_tokens=1024,
    system=[
        {
            "type": "text",
            "text": very_long_system_prompt,  # e.g. 50k token policy document
            "cache_control": {"type": "ephemeral"},  # Cache this prefix
        }
    ],
    messages=[{"role": "user", "content": user_question}],
)

# First call pays to write the cache (at or slightly above the normal input price).
# Subsequent calls: ~90% discount on cached tokens.
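
A short worked calculation shows why caching a large shared prefix matters. The sketch below assumes the first call is billed at the normal input rate and later calls read the cached prefix at 10% of that rate, matching the approximate discount described above. The multipliers are parameters because providers price cache writes and reads differently; check current pricing before relying on the numbers. It also assumes the cache stays warm across all calls.

```python
def cached_input_cost(prefix_tokens, calls, price_per_mtok,
                      write_multiplier=1.0, read_multiplier=0.10):
    """Input cost in dollars for a shared prefix sent `calls` times with caching.

    The first call writes the cache; every later call reads it at a discount.
    """
    if calls <= 0:
        return 0.0
    per_call = prefix_tokens * price_per_mtok / 1_000_000
    return per_call * write_multiplier + per_call * read_multiplier * (calls - 1)


def uncached_input_cost(prefix_tokens, calls, price_per_mtok):
    """Input cost in dollars when the full prefix is billed on every call."""
    return prefix_tokens * price_per_mtok / 1_000_000 * calls


# 50k-token prefix, 100 calls, $15 per 1M input tokens
with_cache = cached_input_cost(50_000, calls=100, price_per_mtok=15.00)
without_cache = uncached_input_cost(50_000, calls=100, price_per_mtok=15.00)
print(f"without cache: ${without_cache:.2f}, with cache: ${with_cache:.2f}")
```

Under these assumptions the prefix costs about $75 uncached versus roughly $8 cached; passing a higher `write_multiplier` models providers that charge a premium for cache writes.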

8.3 Model Routing

Route tasks to the cheapest model that can handle them:

from openai import OpenAI

def route_to_model(task: str, complexity: str) -> str:
    """Route to appropriate model based on task complexity."""
    if complexity == "simple":
        return "gpt-4o-mini"  # Simple classification, extraction
    elif complexity == "medium":
        return "gpt-4o"  # Summarization, Q&A
    else:
        return "claude-opus-4-5"  # Complex reasoning, code review

# Example: classify complexity before routing
def smart_complete(messages: list, task_description: str) -> str:
    client = OpenAI()

    # Cheap classification step
    complexity = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[{
            "role": "user",
            "content": f"Rate the complexity of this task as 'simple', 'medium', or 'complex': {task_description}",
        }],
        max_tokens=5,
    ).choices[0].message.content.strip().lower()

    model = route_to_model(task_description, complexity)

    return client.chat.completions.create(
        model=model,
        messages=messages,
    ).choices[0].message.content

9. Production Deployment

9.1 FastAPI Backend

from fastapi import FastAPI, HTTPException
from pydantic import BaseModel
from openai import AsyncOpenAI

app = FastAPI(title="LLM API")
client = AsyncOpenAI()

class ChatRequest(BaseModel):
    messages: list[dict]
    model: str = "gpt-4o"
    temperature: float = 0.7
    max_tokens: int = 1000

class ChatResponse(BaseModel):
    content: str
    usage: dict

@app.post("/chat", response_model=ChatResponse)
async def chat(request: ChatRequest):
    try:
        response = await client.chat.completions.create(
            model=request.model,
            messages=request.messages,
            temperature=request.temperature,
            max_tokens=request.max_tokens,
        )
        return ChatResponse(
            content=response.choices[0].message.content,
            usage={
                "input_tokens": response.usage.prompt_tokens,
                "output_tokens": response.usage.completion_tokens,
            },
        )
    except Exception as e:
        raise HTTPException(status_code=500, detail=str(e))

9.2 Rate Limiting and Retry

import asyncio
import random
from functools import wraps

from openai import AsyncOpenAI

def with_retry(max_attempts: int = 3, base_delay: float = 1.0):
    def decorator(func):
        @wraps(func)
        async def wrapper(*args, **kwargs):
            for attempt in range(max_attempts):
                try:
                    return await func(*args, **kwargs)
                except Exception:
                    # Out of attempts: re-raise the last error
                    if attempt == max_attempts - 1:
                        raise
                    # Exponential backoff with jitter
                    delay = base_delay * (2 ** attempt) + random.uniform(0, 1)
                    await asyncio.sleep(delay)
        return wrapper
    return decorator

@with_retry(max_attempts=3)
async def robust_completion(messages: list) -> str:
    client = AsyncOpenAI()
    response = await client.chat.completions.create(
        model="gpt-4o",
        messages=messages,
    )
    return response.choices[0].message.content
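
To see what exponential backoff with jitter produces in practice, the delay schedule can be computed on its own. This sketch adds a `max_delay` cap, which is an assumption not present in the decorator above, and `backoff_delays` is a hypothetical helper. Setting `jitter=0.0` makes the schedule deterministic for inspection.

```python
import random


def backoff_delays(max_attempts=3, base_delay=1.0, max_delay=30.0, jitter=1.0, rng=None):
    """Delays (in seconds) to wait after each failed attempt except the last."""
    rng = rng or random.Random()
    delays = []
    for attempt in range(max_attempts - 1):
        # Double the delay each attempt, but never exceed the cap
        delay = min(base_delay * (2 ** attempt), max_delay)
        delays.append(delay + rng.uniform(0, jitter))
    return delays


print(backoff_delays(max_attempts=5, jitter=0.0))  # [1.0, 2.0, 4.0, 8.0]
```

Jitter spreads retries from many clients over time so they do not all hit the provider at the same instant after an outage.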

9.3 Caching with Redis

import hashlib
import json

import redis.asyncio as redis  # async client, so the awaits below work
from openai import AsyncOpenAI

redis_client = redis.from_url("redis://localhost:6379")
CACHE_TTL = 3600  # 1 hour

def cache_key(messages: list, model: str) -> str:
    payload = json.dumps({"messages": messages, "model": model}, sort_keys=True)
    return f"llm:{hashlib.md5(payload.encode()).hexdigest()}"

async def cached_completion(messages: list, model: str = "gpt-4o") -> str:
    key = cache_key(messages, model)

    # Check cache
    cached = await redis_client.get(key)
    if cached:
        return cached.decode()

    # Generate
    client = AsyncOpenAI()
    response = await client.chat.completions.create(model=model, messages=messages)
    result = response.choices[0].message.content

    # Store with TTL
    await redis_client.setex(key, CACHE_TTL, result)
    return result

10. Observability and Monitoring

10.1 Key Metrics to Track

| Metric                  | Why It Matters  | Alert Threshold         |
| ----------------------- | --------------- | ----------------------- |
| Latency (p50, p95, p99) | User experience | p95 > 5s for streaming  |
| Token usage             | Cost            | Budget deviation > 20%  |
| Error rate              | Reliability     | > 1% of requests        |
| Cache hit rate          | Cost efficiency | < 30% (investigate)     |
| Evaluation scores       | Quality         | Drop > 5% from baseline |
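
Latency percentiles are easy to compute from raw samples. A minimal sketch using the nearest-rank definition follows; the latency values are made up for illustration, and production systems typically rely on their metrics backend rather than computing this by hand.

```python
import math


def percentile(samples, pct):
    """Nearest-rank percentile: the smallest sample with at least pct% of values at or below it."""
    if not samples:
        raise ValueError("no samples")
    ordered = sorted(samples)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]


latencies_ms = [120, 150, 180, 200, 220, 250, 300, 400, 900, 5200]  # hypothetical samples
for pct in (50, 95, 99):
    print(f"p{pct}: {percentile(latencies_ms, pct)} ms")
# p50: 220 ms
# p95: 5200 ms
# p99: 5200 ms
```

The median looks healthy here while p95 already crosses a 5-second threshold, which is why tail percentiles are tracked alongside the median.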

10.2 LangSmith Tracing

import os

from langchain_openai import ChatOpenAI

os.environ["LANGCHAIN_API_KEY"] = "ls__..."
os.environ["LANGCHAIN_TRACING_V2"] = "true"
os.environ["LANGCHAIN_PROJECT"] = "my-llm-app"

# All LangChain calls are automatically traced
llm = ChatOpenAI(model="gpt-4o")
response = llm.invoke("What is RAG?")
# Full trace (prompt, response, latency, tokens) visible in LangSmith UI

10.3 Custom Logging

import logging
import time
from dataclasses import dataclass, field, asdict

from openai import AsyncOpenAI

logger = logging.getLogger(__name__)

@dataclass
class LLMCallLog:
    model: str
    input_tokens: int
    output_tokens: int
    latency_ms: float
    success: bool
    error: str = ""
    metadata: dict = field(default_factory=dict)

async def traced_completion(messages: list, model: str = "gpt-4o", **metadata) -> str:
    client = AsyncOpenAI()
    start = time.perf_counter()
    success = True
    error = ""
    input_tokens = output_tokens = 0

    try:
        response = await client.chat.completions.create(model=model, messages=messages)
        result = response.choices[0].message.content
        input_tokens = response.usage.prompt_tokens
        output_tokens = response.usage.completion_tokens
        return result
    except Exception as e:
        success = False
        error = str(e)
        raise
    finally:
        # Runs on both success and failure, so every call is logged
        log = LLMCallLog(
            model=model,
            input_tokens=input_tokens,
            output_tokens=output_tokens,
            latency_ms=(time.perf_counter() - start) * 1000,
            success=success,
            error=error,
            metadata=metadata,
        )
        logger.info("llm_call", extra=asdict(log))

10.4 Guardrails and Safety

from guardrails import Guard
from guardrails.hub import ToxicLanguage, DetectPII
from openai import OpenAI

guard = Guard().use_many(
    ToxicLanguage(threshold=0.5, on_fail="exception"),
    DetectPII(pii_entities=["EMAIL_ADDRESS", "PHONE_NUMBER"], on_fail="fix"),
)

def safe_completion(user_input: str) -> str:
    client = OpenAI()

    # Validate input
    guard.validate(user_input)

    response = client.chat.completions.create(
        model="gpt-4o",
        messages=[{"role": "user", "content": user_input}],
    )
    output = response.choices[0].message.content

    # Validate and fix output
    validated = guard.validate(output)
    return validated.validated_output

Summary

Building production-grade LLM applications requires mastery across multiple dimensions:

| Area               | Key Takeaway                                               |
| ------------------ | ---------------------------------------------------------- |
| Prompt engineering | Treat prompts as code; version, test, and iterate them     |
| RAG                | Hybrid search + reranking dramatically improves retrieval  |
| Tool use           | Parallel tool calls reduce latency for multi-step tasks    |
| Streaming          | Essential for interactive UX; use SSE with FastAPI         |
| Evaluation         | LLM-as-judge + automated test suites catch regressions     |
| Cost               | Caching, routing, and prompt caching can cut costs by 80%+ |
| Monitoring         | Track latency, tokens, and quality metrics from day one    |

The field evolves rapidly, but these fundamentals will serve you well regardless of which models or frameworks dominate next year. Start simple, measure everything, and iterate based on real usage data.

**Q1. What does RAG stand for, and why is it useful?**

RAG stands for Retrieval-Augmented Generation. It is useful because it lets an LLM answer questions about documents it was never trained on by retrieving relevant text at inference time and injecting it into the prompt. This addresses both the knowledge cutoff problem and the context window limitation.

**Q2. What is the main advantage of parallel tool calls in an LLM agent?**

Parallel tool calls allow the model to invoke multiple tools simultaneously in a single turn rather than sequentially. This reduces total latency for multi-step tasks where the tool calls are independent of each other.

**Q3. Why is LLM-as-judge evaluation preferred over simple string matching?**

The same meaning can be phrased in many different ways, so string matching produces false negatives on correct answers. An LLM judge can assess semantic correctness, helpfulness, and quality against a rubric, which gives a more accurate quality signal than exact comparison.

**Q4. Name two techniques for reducing LLM API costs without degrading quality.**

1. Prompt caching: Cache repeated large prefixes (system prompts, reference documents) so they are only charged at full price once.

2. Model routing: Direct simple tasks (classification, extraction) to cheap small models, reserving expensive large models only for complex reasoning tasks.
