LLM Observability & Prompt Tools 2026 — Helicone / LangSmith / Langfuse / Braintrust / Athina / Comet Opik / Portkey Deep Dive
Author: Youngju Kim (@fjvbn20031)
Prologue — "Shipping an LLM is easy now. Running one is hard."
Up to 2024, putting an LLM into production was still a novel act. By May 2026, it is routine. OpenAI · Anthropic · Google · Mistral · DeepSeek · Korean HyperCLOVA X · Japanese Sakana · NTT Tsuzumi — any of them is one API call away. The hard part starts after that.
- The model answered a question well yesterday and gave a strange answer to the same question today. Why?
- A user asked the same question five times and got five different answers. How do you reproduce that and turn it into a regression test?
- Token spend went from 3,000 USD a month to 15,000 USD overnight. Who used what where?
- You changed one line in a prompt and 7 of 100 test cases broke. Which 7? Are the other 93 still good?
- Can you automatically measure how close a RAG system's answer is to ground truth, and what its faithfulness score is?
Those five questions are the whole job of LLM ops in 2026. Over the last two years, tools have emerged to answer each one, dozens of them: Helicone · LangSmith · Langfuse · W&B Weave · Arize Phoenix · Braintrust · Athina · Comet Opik · Vellum · PromptHub · Portkey · TruLens · Ragas · DeepEval · Galileo · Patronus AI · OpenAI Evals · Bedrock Evals · Vertex AI Evaluation Service. That list includes every tool in the title.
This piece lays out the LLM ops landscape as of May 2026. We group it into four areas (observability · evaluation · prompt management · gateway), call out each tool's strengths, weaknesses, pricing model, and real-world deployments, and end with concrete picks for four personas: solo developer, startup, enterprise, and RAG-first team.
1. The 2026 LLM ops map — four areas
The big picture first.
Four areas — Observability / Evaluation / Prompt management / Gateway
The tools overlap, but the cleanest taxonomy is by their primary value proposition.
| Area | What it does | Representative tools |
|---|---|---|
| Observability | Trace every LLM call. Monitor tokens, latency, cost, errors. Debug. | Helicone, LangSmith, Langfuse, W&B Weave, Arize Phoenix, Comet Opik |
| Evaluation | Score model output quality automatically using datasets, metrics, and LLM-as-judge | Braintrust, Athina, Ragas, TruLens, DeepEval, Galileo, Patronus AI |
| Prompt management | Version control, A/B testing, non-engineer collaboration, deployment for prompts | Vellum, PromptHub, LangSmith Prompts, Langfuse Prompts |
| Gateway | Multi-provider routing across OpenAI / Anthropic / Bedrock etc., caching, rate limit, fallback | Portkey, LiteLLM, Cloudflare AI Gateway |
Most tools straddle multiple areas. LangSmith does observability, evaluation, and prompts. Langfuse does the same. Portkey is a gateway by trade but ships observability too. That overlap is what makes comparison hard.
What changed from 2024 to 2026
In early 2024, LangSmith was effectively the only choice. The market fragmented at terrifying speed over the next two years.
- 2023~2024 first wave — Helicone (YC), Langfuse, Braintrust, Athina, TruLens, Ragas all launched. LangChain shipped LangSmith GA.
- Late 2024 — Comet entered the LLM space, and Arize spun out Phoenix as open source. Portkey and LiteLLM established themselves as gateways.
- March 2025 — Comet launched Opik as a formal open-source product. Langfuse closed Series A.
- Late 2025 to early 2026 — The three hyperscalers entered: Bedrock Evaluations · Vertex AI Evaluation Service · Azure AI Studio Evaluations. OpenAI strengthened its Evals dashboard.
- 2026 today — There are 30+ tools. The biggest question is now "which one do I pick".
The OpenTelemetry shift — GenAI semantic conventions
The decisive shift came in late 2025. OpenTelemetry's GenAI semantic conventions became the de facto standard, and Langfuse · Phoenix · Helicone · Portkey · LangSmith all began shipping OTel-based SDKs. In other words, the SDK is once-and-done; the backend is swappable. That is the most important change in LLM ops for the next five years.
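To make "instrument once, swap the backend" concrete, here is a minimal stdlib-only sketch. The `gen_ai.*` attribute keys follow the naming of the OpenTelemetry GenAI semantic conventions as I understand them; `LLMSpan`, `make_chat_span`, and `export` are hypothetical names for illustration, not part of any OTel SDK, and the two lists stand in for real backends.

```python
from dataclasses import dataclass, field


@dataclass
class LLMSpan:
    """Backend-neutral record of one LLM call (illustrative; not an OTel SDK class)."""
    name: str
    attributes: dict = field(default_factory=dict)


def make_chat_span(model: str, input_tokens: int, output_tokens: int) -> LLMSpan:
    # Keys follow the gen_ai.* naming of the OTel GenAI semantic conventions.
    attributes = {
        "gen_ai.operation.name": "chat",
        "gen_ai.request.model": model,
        "gen_ai.usage.input_tokens": input_tokens,
        "gen_ai.usage.output_tokens": output_tokens,
    }
    return LLMSpan(name=f"chat {model}", attributes=attributes)


def export(span: LLMSpan, backends: list) -> None:
    # Every backend receives the same span, so swapping one out needs no
    # change to the instrumentation above.
    for send in backends:
        send(span)


# Two in-memory lists stand in for two different observability backends.
backend_a, backend_b = [], []
span = make_chat_span("gpt-4o", input_tokens=12, output_tokens=40)
export(span, [backend_a.append, backend_b.append])
```

In a real setup the OTel SDK and an exporter would replace `LLMSpan` and `export`; the point of the sketch is only that the attribute vocabulary is shared, so the receiving side can change without touching call sites.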
2. Helicone — Y Combinator open source observability
Start here if you want to ship fastest.
One-line definition
A Y Combinator W23 graduate, open-source LLM observability. Change one line — the base URL — and you're done. The lowest barrier to entry in the field.
How it works
Helicone's signature is proxy mode. Point the OpenAI SDK's base_url at https://oai.helicone.ai/v1 and every call gets logged automatically. One line.
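import os
from openai import OpenAI

# Route OpenAI traffic through Helicone's proxy. The OpenAI key itself is
# still read from OPENAI_API_KEY by the SDK.
client = OpenAI(
    base_url="https://oai.helicone.ai/v1",
    default_headers={"Helicone-Auth": f"Bearer {os.getenv('HELICONE_API_KEY')}"},
)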
That one line captures, automatically:
- Request and response bodies.
- Latency and time-to-first-token.
- Input and output tokens and cost.
- User ID, session, custom properties (passed via headers like Helicone-User-Id).
If the proxy is too risky, there is also an async logging SDK that sends in the background.
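The header-based metadata mentioned above can be wrapped in a small helper. This is a sketch, not Helicone's SDK: `helicone_headers` is a hypothetical function, and while `Helicone-User-Id` comes from the text above, the `Helicone-Session-Id` and `Helicone-Property-<Name>` header names reflect my understanding of Helicone's conventions and should be checked against its docs.

```python
import os


def helicone_headers(user_id, session_id=None, properties=None, api_key=None):
    """Build per-request headers that tag a call with user, session, and custom properties."""
    key = api_key or os.getenv("HELICONE_API_KEY", "")
    headers = {
        "Helicone-Auth": f"Bearer {key}",
        "Helicone-User-Id": user_id,
    }
    if session_id:
        # Assumed header name for grouping calls into one session.
        headers["Helicone-Session-Id"] = session_id
    for name, value in (properties or {}).items():
        # Assumed naming scheme: one header per custom property.
        headers[f"Helicone-Property-{name}"] = str(value)
    return headers


example = helicone_headers(
    "user-42",
    session_id="sess-1",
    properties={"Feature": "search", "Experiment": "B"},
    api_key="demo-key",
)
```

With the OpenAI SDK, a dict like `example` would typically be passed per call through `extra_headers`, so each request carries its own user and experiment tags.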
Strengths
- Zero onboarding cost — one line.
- Open source — Apache 2.0. Self-hostable.
- Provider-agnostic — OpenAI, Anthropic, Together, Anyscale, Bedrock.
- Custom properties — slice by user, feature flag, experiment group.
- Generous free tier — 100k requests/month free.
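Slicing by custom property is what answers the "who used what where" cost question from the prologue. A rough sketch of that rollup, assuming you have exported request logs as plain dicts: the `PRICES` table is hypothetical (real per-token prices differ by model and change over time), and `cost_usd` and `spend_by` are illustrative names, not any tool's API.

```python
from collections import defaultdict

# Hypothetical prices in USD per 1M tokens, for illustration only.
PRICES = {
    "model-a": {"input": 2.0, "output": 8.0},
}


def cost_usd(model: str, input_tokens: int, output_tokens: int) -> float:
    price = PRICES[model]
    return (input_tokens * price["input"] + output_tokens * price["output"]) / 1_000_000


def spend_by(logs: list, key: str) -> dict:
    """Sum request cost per value of `key` (for example per user or per feature)."""
    totals = defaultdict(float)
    for row in logs:
        totals[row[key]] += cost_usd(row["model"], row["input_tokens"], row["output_tokens"])
    return dict(totals)


logs = [
    {"user": "alice", "feature": "search", "model": "model-a",
     "input_tokens": 1_000_000, "output_tokens": 0},
    {"user": "bob", "feature": "chat", "model": "model-a",
     "input_tokens": 0, "output_tokens": 500_000},
    {"user": "alice", "feature": "chat", "model": "model-a",
     "input_tokens": 500_000, "output_tokens": 0},
]
per_user = spend_by(logs, "user")
per_feature = spend_by(logs, "feature")
```

The same function answers both "which user" and "which feature" because the grouping key is just another property on the logged request.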
Weaknesses
- The proxy sits in the critical path — adds one network hop (typically under 10ms in practice).
- Evaluation is light — nothing like LangSmith or Braintrust's dataset and experiment capability.
- Prompt management is minimal — far less serious than Vellum or PromptHub.
Who uses it
Mostly startups and indie devs. The default first install in the "we need production tracing right now and don't want code churn" scenario. A number of Korean LLM startups install Helicone first during PoC.
3. LangSmith — LangChain's flagship
The most famous tool in the space.
One-line definition
The all-in-one LLM ops platform built by LangChain. Observability, evaluation, prompts, datasets in one place. Both SaaS and self-hosted (Enterprise).
How it works
If you use LangChain or LangGraph, two env vars give you automatic tracing.
export LANGSMITH_TRACING=true
export LANGSMITH_API_KEY=ls_...
If you don't use LangChain, the @traceable decorator lets you trace any function.
from langsmith import traceable

@traceable(run_type="llm")
def call_model(prompt: str) -> str:
    # call any model
    ...
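For intuition about what a tracing decorator does, here is a stdlib-only sketch. It is not LangSmith's implementation: `traced` and `TRACE_LOG` are hypothetical names, and a real tracer would ship records to a backend instead of appending to a list.

```python
import functools
import time

TRACE_LOG = []  # in-memory stand-in for a tracing backend


def traced(run_type: str):
    """Record name, inputs, output or error, and latency for every call."""
    def decorator(fn):
        @functools.wraps(fn)
        def wrapper(*args, **kwargs):
            start = time.perf_counter()
            record = {
                "name": fn.__name__,
                "run_type": run_type,
                "inputs": {"args": args, "kwargs": kwargs},
            }
            try:
                result = fn(*args, **kwargs)
                record["output"] = result
                return result
            except Exception as exc:
                record["error"] = repr(exc)
                raise
            finally:
                # Runs on success and on failure, so every call leaves a record.
                record["latency_s"] = time.perf_counter() - start
                TRACE_LOG.append(record)
        return wrapper
    return decorator


@traced(run_type="llm")
def call_model(prompt: str) -> str:
    return prompt.upper()  # placeholder for a real model call


reply = call_model("hello")
```

Nested decorated functions would each append their own record, which is roughly how intermediate steps of an agentic workflow end up as separate entries in a trace.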
Strengths
- LangChain and LangGraph integration is unmatched — no other tool comes close. Tracing intermediate steps in agentic workflows is natural.
- Evaluation is strong — datasets, LLM-as-judge, pairwise comparison, regression tests all in one place.
- Prompts Hub — version control and sharing.
- Production-grade — some Fortune 500 customers run it self-hosted.
Weaknesses
- Expensive — free for individuals, Plus is 39 USD/seat/month, Enterprise is quote-based.
- LangChain family lock-in — moving off is non-trivial.
- Heavy UI — overkill for small projects.
Who uses it
The default for every team running LangChain or LangGraph in production. Most Korean and Japanese RAG-chatbot vendors who picked the LangChain stack also pay for LangSmith.
4. Langfuse — open source, Series A
The most powerful open-source alternative to LangSmith.
One-line definition
MIT-licensed open-source LLM ops. Self-hosting is actually easy. Closed a Series A in 2025 and is one of the fastest-growing OSS projects in the space.
How it works
docker compose up gives you a self-hosted instance. SDKs cover Python, TypeScript, OpenAI auto-tracing, LlamaIndex, and LangChain.
from langfuse.openai import openai  # OpenAI drop-in

response = openai.chat.completions.create(
    model="gpt-4o",
    messages=[{"role": "user", "content": "hi"}],
)
You get a trace for free. You can also create spans manually.
Strengths
- Actually open source — MIT, with almost no enterprise gating on core features.
- Self-hosting is genuinely easy — docker compose gets you to production-grade.
- Observability + evaluation + prompts + datasets in one package.
- Prompt management is surprisingly strong — Langfuse Prompts is a lightweight Vellum alternative.
- OpenTelemetry compatible — native support for OTel GenAI conventions since 2025.
Weaknesses
- UI polish trails LangSmith and Braintrust — closing fast.
- Agentic-workflow visualization is still a step behind LangSmith.
- Cloud is hosted in the EU — if you are a US company sensitive to latency, prefer self-host.
Who uses it
Every team that prefers open source. Some Korean fintech and healthcare companies pick self-hosted Langfuse for data sovereignty. Same in Japan.
5. W&B Weave — Weights & Biases for LLMs
The natural choice if your team already uses W&B for ML.
One-line definition
LLM observability and evaluation from Weights & Biases. Integrates with W&B's existing experiment tracking.
How it works
Call weave.init("project") once and wrap functions with @weave.op for automatic tracing.
import weave

weave.init("my-rag-app")

@weave.op()
def answer(query: str) -> str:
    # retrieve() and generate() stand for your own retrieval and generation functions
    docs = retrieve(query)
    return generate(query, docs)
LLM traces show up inside the existing W&B UI.
Strengths
- Same umbrella as W&B's ML experiment tracking — fine-tuning, evaluation, and serving in one place.
- Evaluations are strong — weave.Evaluation runs combinations of dataset, scorer, and model fast.
- Enterprise trust — existing W&B customers (OpenAI, NVIDIA, Toyota) just adopt it.
Weaknesses
- Learning curve for non-W&B users — you need to know W&B concepts (project, run).
- Free tier is less generous than LangSmith or Helicone.
- Overkill for pure-LLM teams — best when you also do ML.
Who uses it
ML teams that already paid for W&B. Many big-corp AI labs in Korea and Japan training their own models adopt Weave.
6. Arize Phoenix — open source
The open-source LLM offering from Arize, a name well known in ML observability.
One-line definition
Open-source LLM observability and evaluation from Arize AI. Same tool from notebook to production.
How it works
import phoenix as px
from phoenix.otel import register
tracer_provider = register(project_name="my-rag", auto_instrument=True)
# OpenAI, LangChain, LlamaIndex calls are now all traced automatically
Phoenix's calling card is that it runs from a notebook. px.launch_app() brings up the UI locally.
Strengths
- Notebook-friendly — the lightest start for experimentation.
- Native OpenTelemetry GenAI conventions.
- Strong embedding and RAG visualization — UMAP clustering of embeddings is hard to find elsewhere.
- Smooth on-ramp to Arize's production tier — Phoenix for PoC, Arize for production.
Weaknesses
- UI tilts more toward ML culture than LangSmith or Braintrust — backend devs hit a wall.
- Prompt management is basic.
Who uses it
Data-science-trained ML engineers. Teams that need RAG debugging — visualizing which chunk got mis-retrieved.
7. Braintrust — evaluation-first
Top pick if evaluation is the most important thing for your team.
One-line definition
An LLM ops platform that puts evals first. Used by Stripe, Notion, and Vercel. Raised a big round in 2024.
How it works
The core abstraction is the Eval — a combination of dataset, task, and scorer.
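import { Eval } from "braintrust";
import { Factuality, AnswerRelevancy } from "autoevals"; // scorers come from the autoevals package

Eval("MyRagApp", {
  data: () => [
    { input: "What is the capital of France?", expected: "Paris" },
  ],
  // myRagPipeline is your own function: question in, answer out
  task: async (input) => myRagPipeline(input),
  scores: [Factuality, AnswerRelevancy],
});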
Running braintrust eval accumulates scores over time, so the impact of any model or prompt change shows up immediately.
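The dataset + task + scorer idea, and the "which cases broke" question it answers, can be sketched without any SDK. This is an illustration of the pattern, not Braintrust's API: `run_eval`, `regressions`, and the two lookup-table "pipelines" are hypothetical stand-ins.

```python
def exact_match(output: str, expected: str) -> float:
    """Toy scorer: 1.0 when the answer matches the expected string, else 0.0."""
    return 1.0 if output.strip().lower() == expected.strip().lower() else 0.0


def run_eval(data: list, task, scorer) -> dict:
    """Run the task on every case and return a score per input."""
    return {case["input"]: scorer(task(case["input"]), case["expected"]) for case in data}


def regressions(baseline: dict, candidate: dict) -> list:
    """Inputs whose score dropped between the baseline run and the candidate run."""
    return sorted(k for k in baseline if candidate.get(k, 0.0) < baseline[k])


data = [
    {"input": "What is the capital of France?", "expected": "Paris"},
    {"input": "What is the capital of Japan?", "expected": "Tokyo"},
]

# Stand-ins for the pipeline before and after a prompt change.
old_answers = {"What is the capital of France?": "Paris",
               "What is the capital of Japan?": "Tokyo"}
new_answers = {"What is the capital of France?": "Paris",
               "What is the capital of Japan?": "Kyoto"}

baseline = run_eval(data, old_answers.get, exact_match)
candidate = run_eval(data, new_answers.get, exact_match)
broken = regressions(baseline, candidate)
```

Keeping per-case scores, not just an average, is what lets you say which 7 of 100 cases a prompt change broke.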
Strengths
- Eval-first mindset — best embodiment of "prompts are code, and code has tests".
- Excellent Playground — fast comparison across prompts, models, and datasets.
- Loop (auto-tuning of LLM-as-judge) — automates judge calibration.
- First-class TypeScript and Python SDKs.
Weaknesses
- Paid-first — there is a free tier, but real use requires paying.
- A bit much if all you want is observability.
Who uses it
US product companies like Stripe, Notion, Vercel, Airtable. Teams that have institutionalized "no prompt ships without per-PR automatic evals".
8. Athina — fast-growing
Bundles evaluation, observability, and datasets into one fast-growing tool.
One-line definition
LLM ops with a clean dashboard and 50+ pre-built evaluators. Easy to start.
How it works
from athina.loaders import Loader
from athina.evals import Faithfulness
data = Loader().load_csv("eval_data.csv")
Faithfulness(model="gpt-4o").run_batch(data=data).to_df()
Or send production traces with the SDK and let the dashboard run evaluators automatically.
Strengths
- Many pre-built evaluators — Faithfulness, Context Precision, Toxicity, PII Detection, etc.
- Non-engineer-friendly dashboard — PMs can come in and build datasets and labels themselves.
- YAML-driven configuration — declare eval pipelines in YAML.
Weaknesses
- OSS contribution is partial — the core is SaaS.
- Deep tracing of agentic workflows is still LangSmith territory.
Who uses it
Mid-size startups where product and engineering both own LLM quality. Growing fast in English-speaking markets.
9. Comet Opik (released March 2025) — open source
The newest entrant in the open-source camp.
One-line definition
Open-source LLM observability + evaluation from Comet ML, released March 2025. Apache 2.0.
How it works
import opik
from opik import track

opik.configure(use_local=True)

@track
def answer(query: str) -> str:
    # llm_call() stands for your own model-calling function
    return llm_call(query)
use_local=True sends to a self-hosted instance. Or send to Comet cloud.
Strengths
- The newest UX patterns — being late means absorbing what the others do well.
- Same umbrella as Comet's ML experiment tracking — similar positioning to W&B Weave.
- Generous free SaaS tier.
- Apache 2.0 — genuine open source.
Weaknesses
- The smallest ecosystem so far — late mover.
- Fewer plugins and integrations than Langfuse or LangSmith.
Who uses it
Existing Comet ML customers, and new projects that want "newest, open source, fastest start" all at once.
10. Vellum / PromptHub — prompt management proper
These tools live to separate prompts from code.
Vellum — enterprise prompt management
GitHub for prompts. Versions, environments, deployments, A/B tests, and datasets in one place. Optimized for workflows where PMs, CS, and QA edit prompts directly.
- Git-style diff and PR review for prompts.
- Workflow editor (visual chain builder).
- Canary new prompts to a percentage of production traffic.
- Many enterprise customers in healthcare and legal.
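Canarying a prompt to a percentage of traffic usually comes down to deterministic bucketing. A minimal sketch of that idea follows, assuming users are identified by a stable string ID; `bucket` and `pick_prompt_version` are hypothetical names and this is not how Vellum itself is implemented.

```python
import hashlib


def bucket(user_id: str, buckets: int = 100) -> int:
    """Map a user ID to a stable bucket in [0, buckets)."""
    digest = hashlib.sha256(user_id.encode("utf-8")).hexdigest()
    return int(digest, 16) % buckets


def pick_prompt_version(user_id: str, canary_percent: int,
                        stable: str = "v1", canary: str = "v2") -> str:
    """Serve the canary prompt to roughly canary_percent of users, the stable one to the rest."""
    return canary if bucket(user_id) < canary_percent else stable


users = [f"user-{i}" for i in range(1000)]
assignments = {u: pick_prompt_version(u, canary_percent=10) for u in users}
canary_share = sum(v == "v2" for v in assignments.values()) / len(users)
```

Hashing the user ID, instead of drawing a random number per request, keeps each user on the same prompt version across requests, which makes canary metrics easier to interpret.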
PromptHub — lighter collaboration
Lighter and cheaper than Vellum. For small teams that want git-like prompt management without the weight.
- Prompt library — share and search.
- A/B tests.
- Cross-provider comparison (send the same prompt to OpenAI, Anthropic, Bedrock in parallel).
When you actually need a separate prompt tool
For most small teams, LangSmith or Langfuse's built-in prompt features are enough. You hit the wall when:
- Non-engineers edit prompts directly (PMs and CS tune prompts weekly).
- Promotion across environments (dev, staging, prod) needs more than git.
- Same prompt across multiple models with side-by-side comparison.
All three → Vellum. One or two → PromptHub. None of them → LangSmith/Langfuse built-ins.
11. Portkey — AI Gateway + observability
The headline name in the gateway category.
One-line definition
An AI gateway unifying OpenAI / Anthropic / Bedrock / Google / Azure / Together / 200+ providers. Observability, caching, fallback, rate limit, cost guard included.
How it works
Point the OpenAI SDK's base_url at Portkey and pass routing rules in headers.
import os
from openai import OpenAI

client = OpenAI(
    base_url="https://api.portkey.ai/v1",
    default_headers={
        "x-portkey-api-key": os.getenv("PORTKEY_API_KEY"),
        "x-portkey-config": "your-config-id",  # routing, caching, retry rules
    },
)
Inside the config, declare policies like "primary is GPT-4o, fallback to Claude Sonnet 4.5 on failure, cache same input for one hour".
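The policy described above (primary, fallback on failure, one-hour cache) can be sketched in plain Python to show the control flow a gateway applies on your behalf. This is not Portkey's config format or implementation: `TTLCache`, `call_with_policy`, and the stand-in provider functions are hypothetical.

```python
import time


class TTLCache:
    """Exact-match cache whose entries expire after ttl_seconds."""

    def __init__(self, ttl_seconds: float, clock=time.monotonic):
        self.ttl_seconds = ttl_seconds
        self.clock = clock
        self._store = {}  # key -> (expires_at, value)

    def get(self, key):
        entry = self._store.get(key)
        if entry is None or entry[0] <= self.clock():
            return None
        return entry[1]

    def set(self, key, value):
        self._store[key] = (self.clock() + self.ttl_seconds, value)


def call_with_policy(prompt, primary, fallback, cache):
    """Try the cache, then the primary provider, then the fallback."""
    cached = cache.get(prompt)
    if cached is not None:
        return cached, "cache"
    try:
        answer, source = primary(prompt), "primary"
    except Exception:
        answer, source = fallback(prompt), "fallback"
    cache.set(prompt, answer)
    return answer, source


# Demo with stand-in providers and a fake clock so expiry is deterministic.
now = [0.0]
cache = TTLCache(ttl_seconds=3600, clock=lambda: now[0])


def failing_primary(prompt):
    raise RuntimeError("primary provider is down")


def working_fallback(prompt):
    return f"fallback answer to: {prompt}"


first = call_with_policy("hi", failing_primary, working_fallback, cache)
second = call_with_policy("hi", failing_primary, working_fallback, cache)
now[0] = 3601.0  # move past the one-hour TTL
third = call_with_policy("hi", failing_primary, working_fallback, cache)
```

The value of a gateway is that this logic lives in one declared policy instead of being reimplemented in every service that calls a model.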
Strengths
- Multi-provider integration — 200+.
- Fallback / load balancing / canary are native.
- Semantic cache — semantically equivalent questions hit the cache.
- Observability comes along — no separate tool needed.
- Built-in prompt management.
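A semantic cache differs from an exact-match cache in that it looks up by similarity, not by string equality. The sketch below shows the mechanism only: the bag-of-words `embed` function is a toy stand-in for a real embedding model, the 0.8 threshold is an arbitrary assumption, and none of this reflects Portkey's actual implementation.

```python
import math
import re
from collections import Counter


def embed(text: str) -> Counter:
    """Toy embedding: lowercase word counts. A real system would call an embedding model."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(a: Counter, b: Counter) -> float:
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


class SemanticCache:
    def __init__(self, threshold: float = 0.8):
        self.threshold = threshold
        self.entries = []  # list of (embedding, answer)

    def set(self, query: str, answer: str) -> None:
        self.entries.append((embed(query), answer))

    def get(self, query: str):
        """Return the cached answer of the most similar stored query, if similar enough."""
        q = embed(query)
        best = max(self.entries, key=lambda e: cosine(q, e[0]), default=None)
        if best is not None and cosine(q, best[0]) >= self.threshold:
            return best[1]
        return None


sem_cache = SemanticCache()
sem_cache.set("What is the capital of France?", "Paris")
hit = sem_cache.get("what is the capital of france")
miss = sem_cache.get("How do I reset my password?")
```

The threshold is the key tuning knob: too low and users get answers to questions they did not ask, too high and the cache rarely hits.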
Weaknesses
- Gateway sits in the critical path — the inherent proxy problem. Region selection isn't always granular, so latency can creep up.
- Observability depth is not at Helicone/Langfuse level — sufficient, but not specialist.
Compared to LiteLLM
LiteLLM (open-source SDK / proxy) plays in similar territory. Differences:
- LiteLLM — started as a Python library, also has a self-host gateway. 100% open-source core. Lighter and more hackable.
- Portkey — SaaS-first. UI, policy management, and collaboration are central. Self-host enterprise tier exists.
Startups and indies lean LiteLLM. Mid-size and up lean Portkey.
12. TruLens / Ragas — the two pillars of RAG evaluation
If your system has RAG, you almost always use one of these.
Ragas — the de-facto standard for RAG metrics
Open source. The standard RAG metrics as a library. The most-cited RAG evaluation framework.
- Faithfulness — does the answer actually ground in retrieved context?
- Answer Relevancy — does the answer actually answer the question?
- Context Precision / Recall — is retrieval correct?
- Context Entity Recall — are the answer's entities in the retrieved context?
from ragas import evaluate
from ragas.metrics import faithfulness, answer_relevancy, context_precision

result = evaluate(
    dataset=eval_dataset,
    metrics=[faithfulness, answer_relevancy, context_precision],
)
print(result)
LangSmith, Langfuse, Athina, and nearly every other observability tool ships Ragas metrics as built-in evaluators.
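To see what two of these metrics measure, here is a simplified arithmetic sketch. It assumes the claims and relevance judgments have already been produced (in Ragas an LLM does that step), and the formulas follow my reading of the documented definitions: faithfulness as the share of answer claims supported by the context, and context precision as the mean of precision@k over the ranks where a relevant chunk appears. The function names are hypothetical, not Ragas APIs.

```python
def faithfulness_score(claim_supported: list) -> float:
    """Share of the answer's claims that the retrieved context supports."""
    if not claim_supported:
        return 0.0
    return sum(bool(c) for c in claim_supported) / len(claim_supported)


def context_precision_score(relevance: list) -> float:
    """Mean precision@k over the ranks that hold a relevant chunk.

    `relevance` lists, in retrieval order, whether each chunk was relevant.
    """
    hits = 0
    total = 0.0
    for k, is_relevant in enumerate(relevance, start=1):
        if is_relevant:
            hits += 1
            total += hits / k  # precision@k at this rank
    return total / hits if hits else 0.0


# Three of four claims are grounded in the context.
faith = faithfulness_score([True, True, False, True])

# Relevant chunks ranked first and third versus only second.
good_ranking = context_precision_score([True, False, True])
poor_ranking = context_precision_score([False, True])
```

Context precision rewards putting relevant chunks early: the same relevant chunk scores lower the further down the ranking it appears.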
TruLens — broader evaluation + tracing
Open source from TruEra (now Snowflake). If Ragas is a metrics library, TruLens is metrics + tracing + dashboard.
- The RAG Triad — Context Relevance, Groundedness, Answer Relevance.
- Tracing and evaluation in the same tool.
- Notebook-friendly.
Picking between Ragas and TruLens
- Already using another observability tool (LangSmith, Langfuse, Athina) and only need metrics → Ragas.
- Want to spin up RAG evaluation by itself without committing to an observability tool → TruLens.
- Mixing them is common — call Ragas metrics inside TruLens.
13. Galileo / Patronus AI / DeepEval — enterprise evaluation
For organizations where compliance, security, and SLA matter.
Galileo — Generative AI Studio
Production-grade hallucination, safety, and drift monitoring. Fortune 500, government, finance.
- Galileo Evaluate — pre-production evaluation.
- Galileo Observe — production tracing and monitoring.
- Galileo Protect — real-time guardrails (block PII, jailbreak, hallucination).
Patronus AI — automated evaluation + safety
Specializes in automated LLM evaluation. Ships its own evaluation models — Lynx (hallucination detector), Glider, FinanceBench.
- Define custom evaluators in plain English.
- Pre-built benchmarks for finance and legal domains.
DeepEval (Confident AI) — pytest-style LLM tests
Pytest for LLMs. The most developer-familiar API.
from deepeval import assert_test
from deepeval.test_case import LLMTestCase
from deepeval.metrics import AnswerRelevancyMetric

def test_answer_relevancy():
    test_case = LLMTestCase(
        input="What is the capital of France?",
        actual_output="The capital is Paris.",
    )
    metric = AnswerRelevancyMetric(threshold=0.7)
    assert_test(test_case, [metric])
You run it in CI like pytest. Confident AI is the SaaS dashboard that aggregates results over time.
Picking among the three
- Finance or legal domain + need pre-built benchmarks → Patronus AI.
- Want production guardrails in one package → Galileo.
- Want developers to test LLMs like unit tests → DeepEval.
14. Cloud-native — Bedrock Evals / Vertex AI Evaluation / OpenAI Evals
The three hyperscalers entered seriously in late 2025.
AWS Bedrock Evaluations
A managed service inside Bedrock for evaluating models, prompts, and RAG.
- Model Evaluation — compare multiple Bedrock models on the same dataset.
- RAG Evaluation — integrates with Bedrock Knowledge Base. Evaluates retrieval and generation together.
- LLM-as-judge plus human evaluation (Amazon Mechanical Turk integration) — both supported.
- Pairs with Bedrock Guardrails so evaluation flows directly into guardrail policy.
The default choice for teams already on AWS.
Vertex AI Evaluation Service (Google)
Gen AI Eval Service. Evaluates Gemini and third-party models inside Vertex AI.
- Pointwise, pairwise, and rubric-based metrics.
- Autoraters (LLM-as-judge) plus custom metrics.
- Integrates with Vertex AI Pipelines — evals run as a CI step.
Default for companies running Gemini or PaLM in production.
OpenAI Evals (dashboard)
The Evals tab in OpenAI Platform. The openai/evals OSS project, first released in 2023, has been folded into a SaaS dashboard.
- Evaluation backed by Stored Completions — a fraction of production traffic auto-converts into eval datasets.
- Model-graded eval as default.
- Hooks naturally into OpenAI Fine-tuning and Distillation.
Azure AI Studio Evaluations
Azure OpenAI's evaluation feature. Integrates with PromptFlow. Default for Azure-committed enterprises.
Pros and cons of cloud-native
- Pros — data stays inside the same cloud (compliance, security). Integrates naturally with IAM, VPC, and logging. No separate SaaS contract.
- Cons — multi-cloud and multi-model comparison is hard (Bedrock Evals can't evaluate OpenAI models). Specialist tools go deeper. Vendor lock-in risk.
15. Korea / Japan — Toss, NAVER, Sakana, NTT Tsuzumi
Local practice deserves a section of its own.
Korea
- Toss — homegrown LLM ops
- Built its own LLM gateway (a lightweight Portkey of sorts) and its own prompt registry, and self-hosts Langfuse for tracing.
- Financial regulation in Korea (mandatory network separation) effectively bans SaaS LLM ops, so the de-facto standard is self-hosting OSS in a private network.
- Wraps Ragas metrics in an internal library to measure quality of internal RAG chatbots (HR, legal, customer support).
- NAVER HCX monitoring
- NAVER, operating its own HyperCLOVA X model, runs its own monitoring stack integrated with internal NSML and CLOVA Studio.
- Provides prompt management and evaluation to BizPlatform and CLOVA for Biz customers.
- Kakao / Coupang / LINE — all hybridize homegrown plus open source (Langfuse, Phoenix).
- Korean LLM startups — Upstage, Wrtn, and others mix LangSmith, Langfuse, and Helicone case by case.
Japan
- Sakana AI — own models, own ops
- Uses W&B and MLflow side by side for training and evaluating its models (EvoLLM, evo-ukiyoe, others). Production observability is Langfuse or a homegrown tool.
- NTT Tsuzumi — telco-grade ops
- NTT's own LLM. Telco compliance pushes them to a homegrown monitoring stack plus open-source Ragas and Langfuse.
- Mercari / CyberAgent / LINE Yahoo — self-hosted LangSmith or Langfuse. CyberAgent leans heavily on W&B because of its in-house training.
- Megabanks (MUFG, SMBC, Mizuho) — external SaaS is hard. AWS Bedrock + Bedrock Evals or Azure OpenAI + Azure AI Studio is the de facto standard.
Common patterns across Korea and Japan
- In finance, telco, and public sector, SaaS LLM ops is essentially banned, so self-hosting OSS (Langfuse, Phoenix, Opik, Helicone) is the standard.
- B2C startups use LangSmith, Helicone, or Langfuse SaaS directly.
- Data residency is increasingly the first question — does Tokyo region exist? Does Seoul region exist?
16. Who should pick what — four personas
The decision guide.
Persona 1 · Solo developer / indie hack
Setup — solo, building a side project. Cost must be near zero.
- Observability — Helicone (most generous free tier) or Langfuse Cloud (free tier up to 50k traces/month).
- Evaluation — Ragas as a library, only when needed.
- Prompt management — docstrings in code. LangSmith Prompts is free.
- Gateway — LiteLLM (Python library only, free).
Persona 2 · Seed/Series A startup (5~50 people)
Setup — production traffic exists. Iteration must be fast. Cost still matters.
- Observability — Langfuse SaaS (open source, sensible pricing) or LangSmith Plus.
- Evaluation — Braintrust (if eval-first culture matters) or Athina (PM-friendly UI).
- Prompt management — start with built-in LangSmith Prompts / Langfuse Prompts.
- Gateway — Portkey or LiteLLM once fallback and caching matter.
- RAG eval — register Ragas metrics as evaluators inside the above.
Persona 3 · Series B+ / enterprise
Setup — large scale. Compliance, SOC2, ISO 27001 required. SLA is revenue.
- Observability — LangSmith Enterprise or self-hosted Langfuse (data sovereignty). Galileo if you want production guardrails too.
- Evaluation — Braintrust Enterprise plus Patronus AI (domain-specific).
- Prompt management — Vellum (PM, CS, QA all editing).
- Gateway — Portkey Enterprise self-host or homegrown.
- Cloud-native — add Bedrock Evals on AWS, Vertex AI Evaluation on GCP.
Persona 4 · RAG-first
Setup — RAG is the core product. Retrieval quality equals business quality.
- Observability — Arize Phoenix (embedding visualization) or Langfuse.
- Evaluation — run both Ragas metrics and TruLens RAG Triad. Manage datasets and experiments in Braintrust.
- Prompt management — Vellum's workflow editor fits multi-step RAG chains.
- Gateway — Portkey's semantic cache is decisive for RAG cost.
The five questions you should ask before picking
Ask yourself these five before evaluating any tool.
- Data sovereignty — which region must our data live in? (Korea / Japan / EU / US?)
- Open source vs SaaS — do we have the headcount to self-host?
- Do we run agentic workflows? — yes → LangSmith or Langfuse wins. No → Helicone or Athina is enough.
- Do PMs and CS edit prompts directly? — yes → Vellum or LangSmith Prompts UI is decisive.
- Do we run automated LLM regression tests in CI? — yes → Braintrust or DeepEval wins.
17. Closing — "Running an LLM" finally has a name
In 2024, "LLM ops" still sounded awkward as a phrase. As of May 2026, it is a proper branch of SRE. Thirty-plus tools compete. OpenTelemetry GenAI conventions are the standard. The three hyperscalers ship evaluation services of their own.
The five questions we opened with — why is the answer strange, how do I reproduce it, who burned the tokens, which tests broke, can I measure quality automatically — all have answers in the tool layer now. The problem is choosing which tool.
- Fastest start → Helicone.
- LangChain family → LangSmith.
- Open-source / self-host required → Langfuse, Phoenix, or Opik.
- Evaluation as the core → Braintrust + Ragas.
- Multi-provider traffic management → Portkey or LiteLLM.
- Enterprise guardrails included → Galileo + Patronus.
- Cloud native → Bedrock Evals / Vertex AI Evaluation.
There is no longer an excuse not to use a tool. "Prompts are code. Code needs monitoring and tests." That is the 2026 baseline. When the next model arrives (GPT-5.5, Claude Opus 5, Gemini 3 Ultra, Llama 5), the same infrastructure carries over. Models change. The principles of ops do not.
References
Observability — all-in-one
- Helicone — https://www.helicone.ai/
- LangSmith — https://www.langchain.com/langsmith
- Langfuse — https://langfuse.com/
- W&B Weave — https://wandb.ai/site/weave
- Arize Phoenix — https://phoenix.arize.com/
- Comet Opik — https://www.comet.com/site/products/opik/
Evaluation specialists
- Braintrust — https://www.braintrust.dev/
- Athina — https://athina.ai/
- Ragas — https://docs.ragas.io/
- TruLens — https://www.trulens.org/
- DeepEval (Confident AI) — https://docs.confident-ai.com/
- Galileo — https://www.galileo.ai/
- Patronus AI — https://www.patronus.ai/
Prompt management
- Vellum — https://www.vellum.ai/
- PromptHub — https://www.prompthub.us/
- LangGraph Studio — https://www.langchain.com/langgraph-studio
Gateway
- Portkey — https://portkey.ai/
- LiteLLM — https://www.litellm.ai/
- Cloudflare AI Gateway — https://developers.cloudflare.com/ai-gateway/
Cloud-native eval
- AWS Bedrock Evaluations — https://docs.aws.amazon.com/bedrock/latest/userguide/model-evaluation.html
- Vertex AI Generative AI Evaluation Service — https://cloud.google.com/vertex-ai/generative-ai/docs/models/evaluation-overview
- OpenAI Evals (open source) — https://github.com/openai/evals
- Azure AI Studio Evaluations — https://learn.microsoft.com/en-us/azure/ai-studio/concepts/evaluation-approach-gen-ai
Standards / specs
- OpenTelemetry GenAI semantic conventions — https://opentelemetry.io/docs/specs/semconv/gen-ai/
Korea / Japan
- Toss Tech blog — https://toss.tech/
- NAVER HyperCLOVA X — https://clova.ai/hyperclova
- Sakana AI — https://sakana.ai/
- NTT Tsuzumi — https://www.rd.ntt/e/research/JN20231101_h.html
- CyberAgent AI Lab — https://research.cyberagent.ai/