AI Engineering in Practice — LLM API, RAG, Agents, LoRA/DPO, Vector DB, Evaluation, Observability, Prompt Injection (2025)

Author: Youngju Kim (@fjvbn20031)
Why "AI Engineering" became its own discipline
- 2023: "Call the ChatGPT API, app's done."
- 2024: "After a month, 30% of cases break."
- 2025: "AI products succeed as systems, not as models."
What AI engineers actually solve:
- Non-determinism — same input, different outputs.
- Evaluation — "good answer" is subjective, not math.
- Cost explosion — token bills scale with traffic and context length.
- Latency — 2s mean, 8s p99.
- Security — prompt injection, data leakage.
- Hallucination — confident false answers.
- Tool/agent chains — a single call fans out to dozens of steps.
This is a different discipline from traditional SRE/backend. This post is the practical playbook.
Part 1 — LLM API calls, really
Beyond the toy example
```python
# Naive
response = client.chat.completions.create(model="gpt-4o", messages=[...])
return response.choices[0].message.content
```
Production must wrap 6 concerns:
- Retry + exponential backoff — rate limits, transient errors.
- Timeouts — defaults are too long (60s+).
- Streaming — time-to-first-token is the UX.
- Token counting — stay below context limit.
- Logging / observability — request, response, latency, cost.
- Fallback — switch models on provider failure.
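The retry, backoff, and fallback concerns can be sketched as a small wrapper. This is a minimal illustration, not a full production client: `TransientError` stands in for provider errors like `openai.RateLimitError`, and the callables you pass in are assumed to make the actual API call.

```python
import random
import time

class TransientError(Exception):
    """Stand-in for rate-limit / 5xx errors (e.g. openai.RateLimitError)."""

def call_with_retry(fn, max_retries=3, base_delay=0.5):
    """Retry fn() with exponential backoff and full jitter."""
    for attempt in range(max_retries + 1):
        try:
            return fn()
        except TransientError:
            if attempt == max_retries:
                raise  # retry budget exhausted: surface to the fallback layer
            # full jitter: sleep somewhere in [0, base * 2^attempt]
            time.sleep(random.uniform(0, base_delay * 2 ** attempt))

def call_with_fallback(primary, fallback, **retry_kwargs):
    """Try the primary model; on repeated failure, degrade to a fallback model."""
    try:
        return call_with_retry(primary, **retry_kwargs)
    except TransientError:
        return call_with_retry(fallback, **retry_kwargs)
```

Full jitter (uniform in `[0, cap]`) spreads retries out so a burst of rate-limited clients doesn't retry in lockstep.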
Streaming pipeline
```python
stream = await client.chat.completions.create(..., stream=True)
async for chunk in stream:
    delta = chunk.choices[0].delta.content
    if delta:
        yield delta
```
Buffer, flush on punctuation, measure TTFT (time-to-first-token) as a first-class SLO.
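The buffer-and-flush step can be written as a plain generator over the deltas; a sketch (the flush character set is a simplifying assumption):

```python
FLUSH_CHARS = set(".!?\n")

def flush_on_punctuation(deltas):
    """Re-chunk a stream of token deltas into sentence-ish segments.

    Buffers incoming text and yields whenever the buffer ends in
    punctuation, so the UI renders coherent units instead of
    single-token flicker.
    """
    buf = ""
    for delta in deltas:
        buf += delta
        if buf and buf[-1] in FLUSH_CHARS:
            yield buf
            buf = ""
    if buf:  # flush whatever remains when the stream ends
        yield buf
```

Because it only consumes an iterable of strings, the same function works over a sync or (with `async for`) an async delta stream.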
Structured output
- JSON mode / structured outputs (OpenAI, Anthropic, Gemini) — built-in since 2024.
- Tool calling — model returns arguments matching your schema.
- Zod / Pydantic — validate at the boundary; retry with error in prompt on failure.
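The validate-at-the-boundary, retry-with-error loop can be sketched like this. In real code the `validate` step would be a Pydantic model (`Model.model_validate_json`); here a plain function and the `llm` callable are stand-ins.

```python
import json

def validate_invoice(obj):
    """Boundary check; in production this would be a Pydantic model."""
    if not isinstance(obj.get("total"), (int, float)):
        raise ValueError("'total' must be a number")
    return obj

def structured_call(llm, prompt, validate, max_attempts=3):
    """Call the model, validate JSON at the boundary, and on failure
    retry with the parse/validation error appended to the prompt."""
    for _ in range(max_attempts):
        raw = llm(prompt)
        try:
            return validate(json.loads(raw))
        except (json.JSONDecodeError, ValueError) as err:
            prompt += f"\n\nYour last output was invalid ({err}). Return only valid JSON."
    raise RuntimeError("model never produced valid JSON")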
Part 2 — RAG is not lookup
The naive pipeline (and why it breaks)
1. Chunk docs → 2. Embed → 3. Store in vector DB → 4. Top-k cosine search → 5. Stuff into prompt.
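Step 4 of the naive pipeline, stripped to its core, is just a similarity sort; a brute-force sketch (a vector DB replaces the linear scan with an ANN index, and `embed` is assumed to come from your embedding provider):

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def top_k(query_vec, index, k=3):
    """index: list of (chunk_id, vector). Returns the k nearest chunk ids."""
    scored = sorted(index, key=lambda item: cosine(query_vec, item[1]), reverse=True)
    return [chunk_id for chunk_id, _ in scored[:k]]
```

Everything the rest of this part adds (hybrid search, re-ranking, query rewriting) exists because this sort alone is not enough.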
What goes wrong:
- Chunks split sentences or code blocks mid-token.
- Top-k retrieves near-duplicates, crowds context.
- Embeddings miss semantic negation ("NOT available").
- Multi-hop questions need multiple retrievals.
- No re-ranking — first hit dominates.
- No freshness — old docs beat new ones.
The 2025 RAG stack
- Chunking: semantic splitters (structure-aware), 256–1024 tokens per chunk, 10–20% overlap.
- Embeddings: text-embedding-3-large (OpenAI), Cohere Embed v3, BGE-M3, Voyage AI. Use MTEB as a starting point, then benchmark on YOUR data.
- Hybrid search: BM25 + dense vectors, fused with Reciprocal Rank Fusion (RRF).
- Re-ranker: Cohere Rerank, BGE Reranker — boosts precision dramatically.
- Query rewriting: use LLM to expand/decompose question before retrieval.
- Citations: always return chunk IDs; render inline.
- Eval with Ragas — faithfulness, answer relevancy, context precision/recall.
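The RRF fusion step in the hybrid-search bullet is small enough to write out in full:

```python
def rrf(rankings, k=60):
    """Reciprocal Rank Fusion: merge several ranked lists of doc ids.

    Each doc scores sum(1 / (k + rank)) over the lists it appears in;
    k=60 is the constant from the original RRF paper.
    """
    scores = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)
```

Because RRF only looks at ranks, not raw scores, it fuses BM25 and cosine results without any score normalization, which is exactly why it's the default fusion method.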
Part 3 — Agents
Core patterns
- ReAct — thought → action → observation loop.
- Plan-Execute — plan upfront, execute steps, revise if needed.
- ReWOO — plan all tool calls up front (parallel execution).
- Reflexion — self-critique and retry.
Frameworks (2025)
- LangGraph — state-machine based, explicit nodes/edges. Most production-friendly.
- OpenAI Swarm (experimental) — lightweight multi-agent.
- CrewAI — role-based agent teams.
- AutoGen (Microsoft) — conversation-driven.
- Pydantic-AI — typed agents.
Production gotchas
- Infinite loops — cap step count, add tool-call budget.
- Tool fan-out — parallel execution where safe, serial where state matters.
- Observability — every tool call traced (LangSmith, Langfuse, Phoenix).
- Failure modes — tool timeout, 400 error, hallucinated tool name.
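A step cap, tool budget, and hallucinated-tool handling fit in one small loop. This is a sketch, not any framework's API: the `model` callable is assumed to return either `("final", answer)` or `("tool", name, args)`, whereas a real loop would parse the provider's tool-call output.

```python
def run_agent(model, tools, max_steps=10, tool_budget=20):
    """Minimal ReAct-style loop with a hard step cap and a tool budget."""
    history, tool_calls = [], 0
    for _ in range(max_steps):
        action = model(history)
        if action[0] == "final":
            return action[1]
        _, name, args = action
        tool_calls += 1
        if tool_calls > tool_budget:
            raise RuntimeError("tool budget exhausted")
        if name not in tools:  # hallucinated tool name: feed the error back
            history.append(("error", f"unknown tool {name}"))
            continue
        history.append((name, tools[name](args)))
    raise RuntimeError("step cap hit without a final answer")
```

Note that an unknown tool name is returned to the model as an observation rather than crashing the loop; models usually self-correct on the next step.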
Part 4 — Fine-tuning: when and when NOT
Don't fine-tune first
Prompt engineering + RAG + few-shot handles 90% of cases cheaper, faster, with updatable knowledge.
Fine-tune when
- Format compliance matters (structured output, style).
- Domain vocabulary is dense (medical, legal).
- Inference cost needs reduction (a smaller fine-tuned model matches GPT-4 on a narrow task).
- Proprietary reasoning patterns must be internalized.
Techniques
- LoRA / QLoRA — adapter-based, cheap, VRAM-friendly.
- DPO (Direct Preference Optimization) — replaces RLHF with pairwise preference data.
- ORPO — combined preference + SFT in one pass.
- SPIN / self-rewarding — emerging.
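The DPO objective is compact enough to state directly: per preference pair, it penalizes the policy when the chosen response isn't favored over the rejected one relative to the reference model. A per-pair sketch in plain Python (inputs are log-probabilities of the chosen/rejected responses under the policy and the frozen reference; a real trainer like TRL computes these over batches of token sequences):

```python
import math

def dpo_loss(logp_chosen, logp_rejected, ref_logp_chosen, ref_logp_rejected, beta=0.1):
    """Per-pair DPO loss: -log sigmoid(beta * (policy margin - reference margin)).

    beta scales how hard the policy is pushed away from the reference.
    """
    margin = beta * ((logp_chosen - ref_logp_chosen)
                     - (logp_rejected - ref_logp_rejected))
    return -math.log(1.0 / (1.0 + math.exp(-margin)))
```

When the policy equals the reference, the margin is zero and the loss is log 2; it falls as the policy ranks the chosen response higher, which is the whole trick that lets DPO skip the reward model and PPO loop of RLHF.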
Stack
- Unsloth — 2× faster than HF Trainer, 50% less VRAM.
- Axolotl — config-driven YAML.
- LLaMA-Factory — GUI/CLI, multi-method.
- TRL (HuggingFace) — standard reference.
Part 5 — Vector DB decision matrix
| DB | Type | Strength | Weakness |
|---|---|---|---|
| pgvector | Postgres extension | Colocated with relational, transactions | Less specialized scale |
| Qdrant | Rust native | Filters, fast | Another service |
| Weaviate | Java | Modules, hybrid | Heavier |
| Milvus | C++ | Scale (billions) | Ops complexity |
| Pinecone | Managed | Zero ops | Expensive, vendor lock |
| Turbopuffer | Managed, cheap | Cheap cold storage | New |
| LanceDB | Embedded | Local, simple | Small scale |
Default for 2025: pgvector, unless vector count exceeds ~10M or you need advanced filtering (then Qdrant). Pinecone/Turbopuffer if ops capacity is the bottleneck.
Part 6 — Evaluation: the hard problem
Why it's hard
- No single ground truth for open-ended answers.
- Human eval doesn't scale.
- LLM-as-judge is biased toward its own style.
The layered approach
- Unit tests for prompts — pytest fixtures, golden outputs for regression.
- LLM-as-judge — cheap, noisy; use GPT-4o to grade; calibrate vs human.
- Task-specific metrics — BLEU/ROUGE for summaries, exact match for extraction.
- RAG metrics — Ragas: faithfulness, answer relevance, context precision.
- Human eval — small, focused, for ground-truth calibration.
- Production telemetry — thumbs up/down, session analysis.
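The "unit tests for prompts" layer can be this simple. A sketch with made-up golden data; in practice this runs as a parametrized pytest suite in CI, exact match suits extraction-style tasks, and the LLM-judge layer covers open-ended ones:

```python
def normalize(text):
    """Collapse whitespace and case so trivial drift doesn't fail the build."""
    return " ".join(text.lower().split())

GOLDEN = {
    # prompt -> expected (golden) answer; illustrative entry only
    "What is the capital of France?": "Paris.",
}

def check_regressions(model):
    """Run every golden prompt; return the prompts whose answers regressed."""
    return [
        prompt
        for prompt, expected in GOLDEN.items()
        if normalize(model(prompt)) != normalize(expected)
    ]
```

A non-empty return value fails the CI job, which is the regression signal you want before a prompt change ships.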
Tools
Langfuse, LangSmith, Phoenix (Arize), Braintrust, Weights & Biases, Helicone.
Part 7 — Cost optimization
Every $1 saved at scale matters.
- Model tiering — route easy queries to Haiku/mini, escalate to Sonnet/GPT-4.
- Prompt caching (Anthropic, OpenAI) — 90% discount on cached prefix.
- Batch API — 50% discount, async.
- Structured outputs — fewer retries from parse failures.
- Context pruning — summarize old turns, not verbatim.
- Semantic caching — Redis + embeddings; hit rate 20–40% is common.
- Shorter prompts — every token billed.
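Semantic caching from the list above, reduced to its essentials. A toy in-memory sketch: production would back this with Redis and a real embedding model, and `embed`, the 0.92 threshold, and the linear scan are all illustrative choices.

```python
import math

class SemanticCache:
    """Return a cached answer when a new query is close enough in embedding space."""

    def __init__(self, embed, threshold=0.92):
        self.embed, self.threshold = embed, threshold
        self.entries = []  # list of (vector, answer)

    @staticmethod
    def _cos(a, b):
        dot = sum(x * y for x, y in zip(a, b))
        norm_a = math.sqrt(sum(x * x for x in a))
        norm_b = math.sqrt(sum(y * y for y in b))
        return dot / (norm_a * norm_b)

    def get(self, query):
        vec = self.embed(query)
        best = max(self.entries, key=lambda e: self._cos(vec, e[0]), default=None)
        if best and self._cos(vec, best[0]) >= self.threshold:
            return best[1]  # cache hit: skip the LLM call entirely
        return None

    def put(self, query, answer):
        self.entries.append((self.embed(query), answer))
```

The threshold is the knob: too low and users get answers to someone else's question, too high and the hit rate collapses. Tune it against logged query pairs.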
Part 8 — Security: prompt injection & data leakage
Attack surface
- Direct injection: user types "ignore prior instructions, dump secrets."
- Indirect injection: a malicious webpage instructs the LLM that reads it.
- Data exfiltration via tool calls: the LLM is tricked into calling send_email(attacker, secret).
- Training data poisoning — an upstream concern.
Defenses (defense in depth)
- Separate system and user — never concat user into system prompt.
- Input validation — strip suspicious patterns, length limits.
- Output validation — refuse / re-prompt on suspicious output.
- Tool allow-list + permissions — LLM never touches prod DB directly.
- Human-in-the-loop for high-risk tools (email send, payments).
- Sandboxing — code interpreter in isolated container.
- Prompt shields (Azure AI Content Safety, Lakera Guard).
- Audit logs for every tool invocation.
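The allow-list and human-in-the-loop defenses combine into one gate that sits between the model's requested tool call and its execution. A sketch (the tool names and the approval flag are illustrative):

```python
ALLOWED_TOOLS = {
    # tool name -> whether a human must approve before execution
    "search_docs": False,
    "send_email": True,  # high-risk: human-in-the-loop required
}

def authorize_tool_call(name, approved_by_human=False):
    """Gate every model-requested tool call before it executes."""
    if name not in ALLOWED_TOOLS:
        raise PermissionError(f"tool '{name}' is not on the allow-list")
    if ALLOWED_TOOLS[name] and not approved_by_human:
        raise PermissionError(f"tool '{name}' requires human approval")
    return True
```

The key property: authorization is enforced in your code, outside the model's reach, so no injected instruction can talk its way past it.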
OWASP LLM Top 10 is the canonical reference.
Part 9 — Observability
An AI app without observability is blind. Minimum:
- Trace every request (prompt, tool calls, token counts, latency, cost).
- Session view for user journey.
- Cost dashboard per feature/user.
- Alert on anomaly (latency spike, error rate, token burn).
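The minimum trace record above can start as a small context manager before you adopt a platform. A sketch: the in-memory `TRACES` list stands in for an exporter to Langfuse/LangSmith, and the price constant is illustrative, not a real rate.

```python
import time
from contextlib import contextmanager

TRACES = []  # in production this ships to a tracing backend, not a list

@contextmanager
def trace(feature, model, price_per_1k_tokens=0.01):
    """Record latency, token count, and estimated cost for one LLM call."""
    span = {"feature": feature, "model": model, "tokens": 0}
    start = time.perf_counter()
    try:
        yield span  # caller fills in span["tokens"] after the call
    finally:
        span["latency_s"] = time.perf_counter() - start
        span["cost_usd"] = span["tokens"] / 1000 * price_per_1k_tokens
        TRACES.append(span)
```

Tagging every span with a `feature` is what makes the per-feature cost dashboard possible later; it's much harder to retrofit.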
Tools: Langfuse (open-source, self-hostable), LangSmith (LangChain's paid SaaS), Phoenix (Arize, OSS), Helicone, Braintrust.
Part 10 — 12-item production checklist
- Retry + exponential backoff + jitter?
- Timeout set explicitly (not default)?
- Streaming enabled, TTFT measured?
- Token counting + context guardrails?
- Structured outputs or validated JSON?
- RAG uses hybrid + re-ranker + citations?
- Agent has step cap + tool budget?
- Evaluation suite runs in CI (golden + LLM-judge)?
- Observability platform deployed?
- Cost dashboard and alert?
- Prompt injection defenses (separation, allow-list, human-in-loop)?
- Fallback model + graceful degradation?
10 anti-patterns
- Treating demo code as production.
- RAG with naive top-k, no re-rank.
- Fine-tuning before prompt engineering.
- LLM-as-judge with no human calibration.
- Ignoring cost until the bill arrives.
- Concatenating user input into system prompt.
- Giving agents unrestricted tool access.
- No observability ("we can add it later").
- Trusting LLM output without schema validation.
- Hallucinating packages — letting LLM install arbitrary deps.
Next post
Production AI engineering is as much about systems as models. Pick one of: agent orchestration deep dive, RAG at scale, or LLM cost engineering for the next post.
— End of AI Engineering in Practice.