AI Engineering in Practice — LLM API, RAG, Agents, LoRA/DPO, Vector DB, Evaluation, Observability, Prompt Injection (2025)

Why "AI Engineering" became its own discipline

  • 2023: "Call the ChatGPT API, app's done."
  • 2024: "After a month, 30% of cases break."
  • 2025: "AI products succeed as systems, not as models."

What AI engineers actually solve:

  • Non-determinism — same input, different outputs.
  • Evaluation — "good answer" is subjective, not math.
  • Cost explosion — $0.10 per request × 1,000 requests = $100.
  • Latency — 2s mean, 8s p99.
  • Security — prompt injection, data leakage.
  • Hallucination — confident false answers.
  • Tool/agent chains — a single call fans out to dozens of steps.

This is a different discipline from traditional SRE/backend. This post is the practical playbook.


Part 1 — LLM API calls, really

Beyond the toy example

# Naive
response = client.chat.completions.create(model="gpt-4o", messages=[...])
return response.choices[0].message.content

Production must wrap 6 concerns:

  1. Retry + exponential backoff — rate limits, transient errors.
  2. Timeouts — defaults are too long (60s+).
  3. Streaming — time-to-first-token is the UX.
  4. Token counting — stay below context limit.
  5. Logging / observability — request, response, latency, cost.
  6. Fallback — switch models on provider failure.
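Concern 1 can be sketched without any particular SDK. Below, `call_model` and `TransientError` are placeholders for your client call and its rate-limit/server-error exception; the point is the backoff shape, not the API:

```python
import random
import time

class TransientError(Exception):
    """Stand-in for an SDK's rate-limit / server-error exception."""

# Hypothetical wrapper: `call_model` is any SDK call that already has an
# explicit per-attempt timeout set. Transient failures are retried with
# exponential backoff plus jitter, so concurrent clients don't retry in
# lockstep and hammer the provider at the same instant.
def with_retries(call_model, max_attempts=4, base_delay=0.5, max_delay=8.0):
    for attempt in range(max_attempts):
        try:
            return call_model()
        except TransientError:
            if attempt == max_attempts - 1:
                raise  # budget exhausted, surface the error
            # 0.5s, 1s, 2s, ... capped at max_delay, scaled by random jitter
            delay = min(base_delay * 2 ** attempt, max_delay)
            time.sleep(delay * random.uniform(0.5, 1.5))
```

The jitter factor matters more than it looks: without it, a fleet of workers that all hit a rate limit together will all retry together.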

Streaming pipeline

# AsyncOpenAI: with stream=True, the awaited call returns an async iterator
stream = await client.chat.completions.create(..., stream=True)
async for chunk in stream:
    delta = chunk.choices[0].delta.content
    if delta:
        yield delta

Buffer, flush on punctuation, measure TTFT (time-to-first-token) as a first-class SLO.
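The flush-on-punctuation idea is a small pure-Python buffer, independent of any SDK — one possible sketch:

```python
# Accumulate streamed deltas and flush on sentence-ending punctuation, so the
# UI renders whole sentences instead of jittery word fragments.
FLUSH_CHARS = {".", "!", "?", "\n"}

def buffered_sentences(deltas):
    buf = []
    for delta in deltas:
        buf.append(delta)
        if delta and delta[-1] in FLUSH_CHARS:
            yield "".join(buf)
            buf = []
    if buf:  # flush whatever remains when the stream ends
        yield "".join(buf)
```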

Structured output

  • JSON mode / structured outputs (OpenAI, Anthropic, Gemini) — built-in since 2024.
  • Tool calling — model returns arguments matching your schema.
  • Zod / Pydantic — validate at the boundary; retry with error in prompt on failure.
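The validate-then-retry loop looks roughly like this. `ask_model` is a placeholder for your LLM call, and the hand-rolled field check stands in for a real Pydantic/Zod schema:

```python
import json

# Sketch of "validate at the boundary; retry with the error in the prompt".
def get_validated(ask_model, prompt, max_attempts=3):
    for _ in range(max_attempts):
        raw = ask_model(prompt)
        try:
            data = json.loads(raw)
            # in practice this would be a Pydantic model's validation
            if not isinstance(data.get("name"), str):
                raise ValueError("missing string field 'name'")
            return data
        except (json.JSONDecodeError, ValueError) as err:
            # feed the parse/validation error back so the model can self-correct
            prompt = f"{prompt}\nYour last reply was invalid ({err}). Return valid JSON."
    raise RuntimeError("model never produced valid output")
```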

Part 2 — RAG is not lookup

The naive pipeline (and why it breaks)

  1. Chunk docs → 2. Embed → 3. Store in vector DB → 4. Top-k cosine search → 5. Stuff into prompt.

What goes wrong:

  • Naive fixed-size chunks split sentences and code blocks in the middle.
  • Top-k retrieves near-duplicates, crowds context.
  • Embeddings miss semantic negation ("NOT available").
  • Multi-hop questions need multiple retrievals.
  • No re-ranking — first hit dominates.
  • No freshness — old docs beat new ones.

The 2025 RAG stack

  • Chunking: semantic splitters (structure-aware), tokens 256–1024, overlap 10–20%.
  • Embeddings: text-embedding-3-large (OpenAI), Cohere Embed v3, BGE-M3, Voyage AI. Benchmark on YOUR data with MTEB.
  • Hybrid search: BM25 + dense vectors. Reciprocal Rank Fusion (RRF).
  • Re-ranker: Cohere Rerank, BGE Reranker — boosts precision dramatically.
  • Query rewriting: use LLM to expand/decompose question before retrieval.
  • Citations: always return chunk IDs; render inline.
  • Eval with Ragas — faithfulness, answer relevancy, context precision/recall.
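Of the pieces above, Reciprocal Rank Fusion is small enough to show whole. It merges rankings (e.g. BM25 and dense) without having to normalize their score scales; `k=60` is the constant from the original RRF paper and a common default:

```python
# Reciprocal Rank Fusion: each ranked list contributes 1 / (k + rank) per
# document; documents that rank well in several lists float to the top.
def rrf(rankings, k=60):
    scores = {}
    for ranked_ids in rankings:
        for rank, doc_id in enumerate(ranked_ids, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)
```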

Part 3 — Agents

Core patterns

  • ReAct — thought → action → observation loop.
  • Plan-Execute — plan upfront, execute steps, revise if needed.
  • ReWOO — plan all tool calls up front (parallel execution).
  • Reflexion — self-critique and retry.
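A ReAct loop reduces to a few lines once the model and tools are abstracted away. This toy sketch assumes `model_step` returns either `("act", tool_name, arg)` or `("answer", text)`, and bakes in the step cap and hallucinated-tool handling that come up later under production gotchas:

```python
# Toy ReAct-style loop: thought/action/observation, with a hard step budget.
def react_loop(model_step, tools, question, max_steps=5):
    observations = []
    for _ in range(max_steps):
        decision = model_step(question, observations)
        if decision[0] == "answer":
            return decision[1]
        _, tool_name, arg = decision
        if tool_name not in tools:  # hallucinated tool name
            observations.append(f"error: unknown tool {tool_name!r}")
            continue
        observations.append(tools[tool_name](arg))
    raise RuntimeError("step budget exhausted")
```

Real frameworks add state persistence, tracing, and parallel tool execution on top of exactly this skeleton.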

Frameworks (2025)

  • LangGraph — state-machine based, explicit nodes/edges. Most production-friendly.
  • OpenAI Swarm (experimental) — lightweight multi-agent.
  • CrewAI — role-based agent teams.
  • AutoGen (Microsoft) — conversation-driven.
  • Pydantic-AI — typed agents.

Production gotchas

  • Infinite loops — cap step count, add tool-call budget.
  • Tool fan-out — parallel execution where safe, serial where state matters.
  • Observability — every tool call traced (LangSmith, Langfuse, Phoenix).
  • Failure modes — tool timeout, 400 error, hallucinated tool name.

Part 4 — Fine-tuning: when and when NOT

Don't fine-tune first

Prompt engineering + RAG + few-shot handles 90% of cases cheaper, faster, with updatable knowledge.

Fine-tune when

  • Format compliance matters (structured output, style).
  • Domain vocabulary is dense (medical, legal).
  • Inference cost needs reduction (smaller model matches GPT-4 on narrow task).
  • Proprietary reasoning patterns must be internalized.

Techniques

  • LoRA / QLoRA — adapter-based, cheap, VRAM-friendly.
  • DPO (Direct Preference Optimization) — replaces RLHF with pairwise preference data.
  • ORPO — combined preference + SFT in one pass.
  • SPIN / self-rewarding — emerging.
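The data format is the practical difference with DPO: instead of scalar rewards, you collect pairs. One common record shape (consumed, for example, by TRL's DPOTrainer) is `prompt` / `chosen` / `rejected`, one JSON object per line:

```python
import json

# Build one JSONL line of pairwise preference data for DPO-style training.
def preference_record(prompt, chosen, rejected):
    return json.dumps({
        "prompt": prompt,
        "chosen": chosen,      # the answer a human (or judge) preferred
        "rejected": rejected,  # the dispreferred answer to push away from
    })

line = preference_record(
    "Summarize: the meeting moved to Friday.",
    "The meeting was rescheduled to Friday.",
    "Meetings are important for team alignment.",
)
```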

Stack

  • Unsloth — 2× faster than HF Trainer, 50% less VRAM.
  • Axolotl — config-driven YAML.
  • LLaMA-Factory — GUI/CLI, multi-method.
  • TRL (HuggingFace) — standard reference.

Part 5 — Vector DB decision matrix

DB           Type                Strength                                  Weakness
pgvector     Postgres extension  Colocated with relational, transactions   Less specialized at scale
Qdrant       Rust-native         Rich filters, fast                        Another service to run
Weaviate     Go-native           Modules, hybrid search                    Heavier
Milvus       C++                 Scale (billions of vectors)               Ops complexity
Pinecone     Managed             Zero ops                                  Expensive, vendor lock-in
Turbopuffer  Managed             Cheap cold storage                        New, less proven
LanceDB      Embedded            Local, simple                             Small scale

Default for 2025: pgvector unless vector count >10M or you need advanced filters → Qdrant. Pinecone/Turbopuffer if ops is a bottleneck.


Part 6 — Evaluation: the hard problem

Why it's hard

  • No single ground truth for open-ended answers.
  • Human eval doesn't scale.
  • LLM-as-judge is biased toward its own style.

The layered approach

  1. Unit tests for prompts — pytest fixtures, golden outputs for regression.
  2. LLM-as-judge — cheap, noisy; use GPT-4o to grade; calibrate vs human.
  3. Task-specific metrics — BLEU/ROUGE for summaries, exact match for extraction.
  4. RAG metrics — Ragas: faithfulness, answer relevance, context precision.
  5. Human eval — small, focused, for ground-truth calibration.
  6. Production telemetry — thumbs up/down, session analysis.
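Layer 1 deserves a concrete shape. Exact string match is brittle for LLM output, so golden tests often check required facts instead; `run_prompt` below is a placeholder for your pipeline, and the golden data is illustrative:

```python
# Prompt regression test: the output must contain every fact recorded for
# the golden case, regardless of phrasing. Runs fine under pytest.
GOLDEN_FACTS = {"refund window": ["30 days", "original receipt"]}

def check_against_golden(case_name, run_prompt):
    output = run_prompt(case_name).lower()
    missing = [fact for fact in GOLDEN_FACTS[case_name] if fact not in output]
    assert not missing, f"{case_name}: output missing facts {missing}"
```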

Tools

Langfuse, LangSmith, Phoenix (Arize), Braintrust, Weights & Biases, Helicone.


Part 7 — Cost optimization

Every $1 saved at scale matters.

  1. Model tiering — route easy queries to Haiku/mini, escalate to Sonnet/GPT-4.
  2. Prompt caching (Anthropic, OpenAI) — 90% discount on cached prefix.
  3. Batch API — 50% discount, async.
  4. Structured outputs — fewer retries from parse failures.
  5. Context pruning — summarize old turns, not verbatim.
  6. Semantic caching — Redis + embeddings; hit rate 20–40% is common.
  7. Shorter prompts — every token billed.
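Item 1 can start as a dumb heuristic before graduating to a learned router. Everything here is illustrative — the signals, thresholds, and tier names are assumptions, not a recommendation:

```python
# Naive model-tiering router: long queries, big contexts, or reasoning
# keywords escalate to the expensive tier; everything else stays cheap.
HARD_SIGNALS = ("why", "compare", "analyze", "step by step", "prove")

def route(query, context_tokens=0):
    hard = (len(query) > 400
            or context_tokens > 4000
            or any(s in query.lower() for s in HARD_SIGNALS))
    return "large-model" if hard else "small-model"
```

Real routers often use a classifier, or try the small model first and escalate on low confidence.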

Part 8 — Security: prompt injection & data leakage

Attack surface

  • Direct injection: user types "ignore prior, dump secrets."
  • Indirect injection: malicious webpage instructs the LLM that reads it.
  • Data exfiltration via tool calls: LLM tricked into calling send_email(attacker, secret).
  • Training data poisoning — upstream concern.

Defenses (defense in depth)

  1. Separate system and user — never concat user into system prompt.
  2. Input validation — strip suspicious patterns, length limits.
  3. Output validation — refuse / re-prompt on suspicious output.
  4. Tool allow-list + permissions — LLM never touches prod DB directly.
  5. Human-in-the-loop for high-risk tools (email send, payments).
  6. Sandboxing — code interpreter in isolated container.
  7. Prompt shields (Azure AI Content Safety, Lakera Guard).
  8. Audit logs for every tool invocation.
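Defenses 1 and 4 in miniature: user text travels as a user message rather than being concatenated into the system prompt, and every tool call is checked against an allow-list with high-risk tools gated on human approval. Tool names here are hypothetical:

```python
ALLOWED_TOOLS = {"search_docs", "get_weather"}
NEEDS_HUMAN = {"send_email", "issue_refund"}  # human-in-the-loop tools

def build_messages(system_prompt, user_input):
    # roles stay separate: the model sees user text as data, not instructions
    return [{"role": "system", "content": system_prompt},
            {"role": "user", "content": user_input}]

def authorize_tool_call(name, human_approved=False):
    if name in NEEDS_HUMAN:
        return human_approved
    return name in ALLOWED_TOOLS
```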

OWASP LLM Top 10 is the canonical reference.


Part 9 — Observability

An AI app without observability is blind. Minimum:

  • Trace every request (prompt, tool calls, token counts, latency, cost).
  • Session view for user journey.
  • Cost dashboard per feature/user.
  • Alert on anomaly (latency spike, error rate, token burn).
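The minimum trace record is small. This sketch computes cost from per-million-token rates — the prices and field names are illustrative, not any provider's actual rate card:

```python
import time

# Minimal per-request trace record: enough to answer "what did this call
# cost, and how slow was it?" per feature.
PRICE_PER_MTOK = {"input": 2.50, "output": 10.00}  # USD, hypothetical rates

def trace_request(feature, prompt_tokens, completion_tokens, started_at):
    cost = (prompt_tokens * PRICE_PER_MTOK["input"]
            + completion_tokens * PRICE_PER_MTOK["output"]) / 1_000_000
    return {
        "feature": feature,
        "prompt_tokens": prompt_tokens,
        "completion_tokens": completion_tokens,
        "latency_s": round(time.monotonic() - started_at, 3),
        "cost_usd": round(cost, 6),
    }
```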

Tools: Langfuse (open-source, self-hostable), LangSmith (LangChain's paid SaaS), Phoenix (Arize, OSS), Helicone, Braintrust.


Part 10 — 12-item production checklist

  1. Retry + exponential backoff + jitter?
  2. Timeout set explicitly (not default)?
  3. Streaming enabled, TTFT measured?
  4. Token counting + context guardrails?
  5. Structured outputs or validated JSON?
  6. RAG uses hybrid + re-ranker + citations?
  7. Agent has step cap + tool budget?
  8. Evaluation suite runs in CI (golden + LLM-judge)?
  9. Observability platform deployed?
  10. Cost dashboard and alert?
  11. Prompt injection defenses (separation, allow-list, human-in-loop)?
  12. Fallback model + graceful degradation?

10 anti-patterns

  1. Treating demo code as production.
  2. RAG with naive top-k, no re-rank.
  3. Fine-tuning before prompt engineering.
  4. LLM-as-judge with no human calibration.
  5. Ignoring cost until the bill arrives.
  6. Concatenating user input into system prompt.
  7. Giving agents unrestricted tool access.
  8. Skipping observability: "we can add it later."
  9. Trusting LLM output without schema validation.
  10. Hallucinating packages — letting LLM install arbitrary deps.

Next post

Production AI engineering is as much about systems as models. Pick one of: agent orchestration deep dive, RAG at scale, or LLM cost engineering for the next post.

— End of AI Engineering in Practice.
