LLMOps Platform Architecture Guide: Model Deployment, Monitoring, and A/B Testing
- Introduction
- LLMOps vs MLOps: What Is Different
- Model Serving Architecture
- Monitoring Strategy
- Prompt Version Management
- A/B Testing Framework
- Guardrail Integration
- Cost Optimization
- Failure Cases and Lessons Learned
- Operational Checklist
- References

Introduction
As LLMs (Large Language Models) rapidly move into production environments, new operational challenges have emerged that traditional MLOps alone cannot address. Prompt engineering has become more central than model training, evaluating open-ended generation quality now matters more than tracking a single accuracy number, and per-token cost management and guardrails against hallucination have become essential.
According to Gartner, over 50% of enterprise generative AI deployments are expected to fail by 2026 due to operational immaturity. This is not a problem with LLM technology itself, but rather stems from the absence of platform architecture needed to operate LLMs reliably.
This post covers the complete architecture of an LLMOps platform: vLLM/TGI-based model serving, token usage monitoring, prompt version management, an A/B testing framework, NeMo Guardrails integration, and cost optimization, with working code for everything needed to run LLMs in production.
LLMOps vs MLOps: What Is Different
While traditional MLOps focuses on automating the "train-deploy-monitor" pipeline, LLMOps starts from a fundamentally different paradigm. MLOps is a system for repeatable predictions, while LLMOps is a system for probabilistic generation.
| Aspect | MLOps | LLMOps |
|---|---|---|
| Core Activity | Model training/retraining | Prompt engineering/fine-tuning |
| Cost Structure | Training cost dominant | Inference cost dominant (per-token billing) |
| Evaluation | Accuracy, F1, RMSE | BLEU, ROUGE, LLM-as-Judge |
| Data Pipeline | Feature stores, ETL | RAG, vector DB, chunking pipelines |
| Versioning | Model artifacts | Prompt templates + model + parameters |
| Monitoring | Data drift, performance metrics | Token usage, latency, quality, hallucination |
| Deployment Cycle | Weekly/monthly retraining | Prompt changes possible in minutes |
| Safety Measures | Input validation | Guardrails, content filtering, PII detection |
LLMOps architecture requires more components than traditional MLOps. An application gateway sits in front of the model server, orchestrating prompt routing, vector DB search, tool calling, and caching layers.
User Request
|
v
+-------------------------------------------+
| Application Gateway |
| +----------+ +--------+ +-----------+ |
| | Prompt | | Vector | | Guardrail | |
| | Router | | DB | | Engine | |
| | | | (RAG) | | | |
| +----------+ +--------+ +-----------+ |
| +----------+ +--------+ +-----------+ |
| | Cache | | A/B | | Token | |
| | Layer | | Router | | Metering | |
| +----------+ +--------+ +-----------+ |
+-------------------+-----------------------+
|
+---------------+---------------+
v v v
+--------+ +--------+ +------------+
| vLLM | | TGI | | TensorRT- |
| Server | | Server | | LLM Server |
+--------+ +--------+ +------------+
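The flow through this gateway can be sketched end to end. The following is a toy pipeline with stub components; every function name here is illustrative, not a real API, and each stub stands in for one of the boxes in the diagram (guardrails, cache, router, metering):

```python
import hashlib

# Toy stand-ins for the gateway components in the diagram above. A real
# deployment would back these with NeMo Guardrails, a vector DB, a
# semantic cache, and the model servers shown below.
_cache: dict[str, str] = {}

def passes_rails(text: str) -> bool:
    return "forbidden" not in text.lower()

def route_model(user_id: str) -> str:
    # Deterministic user bucketing, as in the A/B router later in this post.
    bucket = int(hashlib.md5(user_id.encode()).hexdigest(), 16) % 2
    return "vllm-70b" if bucket else "vllm-8b"

def call_model(model: str, prompt: str) -> str:
    return f"[{model}] answer to: {prompt}"  # placeholder for an HTTP call

def handle_request(user_id: str, query: str) -> str:
    if not passes_rails(query):              # input guardrails
        return "Request blocked by policy."
    if query in _cache:                      # exact-match cache for brevity
        return _cache[query]
    model = route_model(user_id)             # A/B or cost-based routing
    response = call_model(model, query)
    if not passes_rails(response):           # output guardrails
        return "Response withheld by policy."
    _cache[query] = response                 # token metering would go here
    return response
```

The key design point is ordering: guardrails run before the cache lookup so that blocked queries are never served from cache, and the cache runs before routing so that hits cost no GPU time at all.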
Model Serving Architecture
LLM Serving Framework Comparison
A comparison of the major frameworks specialized for LLM serving:
| Framework | Core Technology | GPU Requirements | Throughput | Latency | Model Compatibility | Operational Complexity |
|---|---|---|---|---|---|---|
| vLLM | PagedAttention | CUDA GPU | High | Medium | Full HuggingFace | Low |
| TGI | Flash Attention | CUDA GPU | High | Low (v3) | Full HuggingFace | Low |
| TensorRT-LLM | CUDA Graph Optimization | NVIDIA only | Highest | Lowest | Conversion required | High |
| Triton + vLLM | Ensemble Pipeline | CUDA GPU | High | Medium | Multi-model | Medium |
vLLM manages KV cache in page units like virtual memory through PagedAttention, minimizing GPU memory fragmentation. It can handle more concurrent requests on the same VRAM, showing 10-30% higher throughput than TGI on mixed-length workloads.
TGI v3 demonstrates up to 13x faster performance than vLLM on long prompts (200K+ tokens) and has strong integration with the Hugging Face ecosystem.
TensorRT-LLM achieves 20-40% higher raw throughput than vLLM/TGI on H100 hardware through CUDA graph optimization and fused kernels, but requires model conversion and NVIDIA hardware lock-in.
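Whichever framework you pick, vLLM and TGI both expose an OpenAI-compatible /v1/chat/completions endpoint, so clients stay portable across backends. A minimal standard-library sketch of building such a request against the in-cluster vLLM Service (the service DNS name assumes the Kubernetes manifests shown below):

```python
import json
import urllib.request

# Assumed in-cluster service address; matches the Service manifest below.
BASE_URL = "http://vllm-llama3-70b-svc.llm-serving.svc.cluster.local:8000"

def build_chat_request(prompt: str, max_tokens: int = 128) -> urllib.request.Request:
    payload = {
        "model": "meta-llama/Llama-3.3-70B-Instruct",
        "messages": [{"role": "user", "content": prompt}],
        "max_tokens": max_tokens,
        "temperature": 0.3,
    }
    return urllib.request.Request(
        f"{BASE_URL}/v1/chat/completions",
        data=json.dumps(payload).encode(),
        headers={"Content-Type": "application/json"},
    )

# Inside the cluster:
# with urllib.request.urlopen(build_chat_request("Hello")) as r:
#     print(json.load(r)["choices"][0]["message"]["content"])
```

Any OpenAI-style SDK works the same way by pointing its base URL at the service, which makes swapping serving backends a configuration change rather than a code change.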
vLLM Deployment Configuration
A production deployment configuration for vLLM in a Kubernetes environment:
apiVersion: apps/v1
kind: Deployment
metadata:
  name: vllm-llama3-70b
  namespace: llm-serving
spec:
  replicas: 2
  selector:
    matchLabels:
      app: vllm-llama3-70b
  template:
    metadata:
      labels:
        app: vllm-llama3-70b
    spec:
      containers:
        - name: vllm
          image: vllm/vllm-openai:v0.7.3
          args:
            - '--model'
            - 'meta-llama/Llama-3.3-70B-Instruct'
            - '--tensor-parallel-size'
            - '4'
            - '--max-model-len'
            - '8192'
            - '--gpu-memory-utilization'
            - '0.90'
            - '--enable-chunked-prefill'
            - '--max-num-batched-tokens'
            - '32768'
            - '--port'
            - '8000'
          ports:
            - containerPort: 8000
          resources:
            limits:
              nvidia.com/gpu: 4
            requests:
              nvidia.com/gpu: 4
              memory: '64Gi'
              cpu: '16'
          env:
            - name: HUGGING_FACE_HUB_TOKEN
              valueFrom:
                secretKeyRef:
                  name: hf-secret
                  key: token
          readinessProbe:
            httpGet:
              path: /health
              port: 8000
            initialDelaySeconds: 120
            periodSeconds: 10
          livenessProbe:
            httpGet:
              path: /health
              port: 8000
            initialDelaySeconds: 180
            periodSeconds: 30
---
apiVersion: v1
kind: Service
metadata:
  name: vllm-llama3-70b-svc
  namespace: llm-serving
spec:
  selector:
    app: vllm-llama3-70b
  ports:
    - port: 8000
      targetPort: 8000
  type: ClusterIP
Key parameter explanations:
- tensor-parallel-size: 4: Distributes the 70B model across 4 GPUs for inference.
- gpu-memory-utilization: 0.90: Lets vLLM use up to 90% of each GPU's memory for model weights plus KV cache; the memory left over after loading weights becomes KV cache, which determines how many concurrent requests fit.
- enable-chunked-prefill: Interleaves prefill and decoding to reduce TTFT (Time To First Token).
- max-num-batched-tokens: 32768: Maximum tokens per batch, balancing throughput and latency.
Monitoring Strategy
Unlike traditional ML monitoring, LLM monitoring must track three dimensions simultaneously: Performance, Cost, and Quality.
Core Metrics Framework
Performance metrics:
- TTFT (Time To First Token)
- TPOT (Time Per Output Token)
- Total generation time
- Request throughput (RPS)
- GPU utilization
- Queue wait time

Cost metrics:
- Input token count
- Output token count
- Cost per request
- Cost comparison by model
- Cache hit rate
- Daily/monthly cost trends

Quality metrics:
- Response relevance score
- Hallucination rate
- Guardrail violation rate
- User feedback (thumbs up/down)
- LLM-as-Judge score
Prometheus Metrics Collection Implementation
A Python middleware example that collects metrics exposed by vLLM via Prometheus and adds custom business metrics:
import tiktoken
from prometheus_client import Counter, Histogram, start_http_server

# Performance metrics
REQUEST_LATENCY = Histogram(
    "llm_request_latency_seconds",
    "LLM request latency",
    ["model", "endpoint"],
    buckets=[0.1, 0.5, 1.0, 2.0, 5.0, 10.0, 30.0, 60.0],
)
TTFT_LATENCY = Histogram(
    "llm_ttft_seconds",
    "Time To First Token",
    ["model"],
    buckets=[0.05, 0.1, 0.2, 0.5, 1.0, 2.0, 5.0],
)

# Cost metrics
TOKEN_COUNTER = Counter(
    "llm_tokens_total",
    "Total token usage",
    ["model", "direction"],  # direction: input/output
)
REQUEST_COST = Counter(
    "llm_request_cost_dollars",
    "Per-request cost (USD)",
    ["model"],
)

# Quality metrics
QUALITY_SCORE = Histogram(
    "llm_quality_score",
    "LLM response quality score",
    ["model", "evaluator"],
    buckets=[0.0, 0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9, 1.0],
)
GUARDRAIL_VIOLATIONS = Counter(
    "llm_guardrail_violations_total",
    "Guardrail violation count",
    ["model", "violation_type"],
)

# Token pricing per model (USD per 1K tokens)
PRICING = {
    "llama-3.3-70b": {"input": 0.00059, "output": 0.00079},
    "gpt-4o": {"input": 0.0025, "output": 0.01},
    "claude-sonnet": {"input": 0.003, "output": 0.015},
}


class LLMMetricsCollector:
    def __init__(self, model_name: str):
        self.model_name = model_name
        self.encoder = tiktoken.get_encoding("cl100k_base")

    def record_request(self, prompt: str, response: str,
                       latency: float, ttft: float):
        input_tokens = len(self.encoder.encode(prompt))
        output_tokens = len(self.encoder.encode(response))

        # Record performance
        REQUEST_LATENCY.labels(
            model=self.model_name, endpoint="/v1/chat/completions"
        ).observe(latency)
        TTFT_LATENCY.labels(model=self.model_name).observe(ttft)

        # Record token usage
        TOKEN_COUNTER.labels(
            model=self.model_name, direction="input"
        ).inc(input_tokens)
        TOKEN_COUNTER.labels(
            model=self.model_name, direction="output"
        ).inc(output_tokens)

        # Calculate and record cost
        pricing = PRICING.get(self.model_name, PRICING["llama-3.3-70b"])
        cost = (
            input_tokens / 1000 * pricing["input"]
            + output_tokens / 1000 * pricing["output"]
        )
        REQUEST_COST.labels(model=self.model_name).inc(cost)

    def record_quality(self, score: float, evaluator: str = "auto"):
        QUALITY_SCORE.labels(
            model=self.model_name, evaluator=evaluator
        ).observe(score)

    def record_guardrail_violation(self, violation_type: str):
        GUARDRAIL_VIOLATIONS.labels(
            model=self.model_name, violation_type=violation_type
        ).inc()


if __name__ == "__main__":
    start_http_server(9090)  # expose /metrics for Prometheus scraping
    collector = LLMMetricsCollector("llama-3.3-70b")
Grafana Dashboard Key Panels
Core PromQL queries for a Grafana dashboard built on Prometheus metrics:
# P99 latency (5-minute window)
histogram_quantile(0.99, rate(llm_request_latency_seconds_bucket[5m]))
# Token consumption per minute
rate(llm_tokens_total[1m])
# Hourly cost trend (prometheus_client appends _total to Counter names)
rate(llm_request_cost_dollars_total[1h]) * 3600
# Guardrail violation rate
rate(llm_guardrail_violations_total[5m]) / rate(llm_request_latency_seconds_count[5m])
# Median quality score
histogram_quantile(0.5, rate(llm_quality_score_bucket[1h]))
Prompt Version Management
In LLMOps, prompts are the core asset equivalent to model artifacts in traditional MLOps. Prompt templates, model versions, and generation parameters (temperature, top_p, etc.) must be versioned together to enable reproducible deployments and granular rollbacks.
import hashlib
from datetime import datetime, timezone
from dataclasses import dataclass, field, asdict
from typing import Optional


@dataclass
class PromptVersion:
    name: str
    template: str
    model: str
    temperature: float = 0.7
    top_p: float = 0.9
    max_tokens: int = 2048
    system_prompt: str = ""
    version: str = ""
    created_at: str = field(
        default_factory=lambda: datetime.now(timezone.utc).isoformat()
    )

    def __post_init__(self):
        if not self.version:
            # Content-addressed version: identical template/model/temperature
            # always produce the same 8-character id.
            content = f"{self.template}{self.model}{self.temperature}"
            self.version = hashlib.sha256(content.encode()).hexdigest()[:8]

    def to_dict(self) -> dict:
        return asdict(self)


class PromptRegistry:
    """Prompt version management registry (in-memory; back with Redis or a
    database in production)."""

    def __init__(self, storage_backend="redis"):
        self.storage_backend = storage_backend
        self.prompts: dict[str, list[PromptVersion]] = {}

    def register(self, prompt: PromptVersion) -> str:
        self.prompts.setdefault(prompt.name, []).append(prompt)
        return prompt.version

    def get_latest(self, name: str) -> Optional[PromptVersion]:
        versions = self.prompts.get(name, [])
        return versions[-1] if versions else None

    def get_version(self, name: str, version: str) -> Optional[PromptVersion]:
        for v in self.prompts.get(name, []):
            if v.version == version:
                return v
        return None

    def rollback(self, name: str, version: str) -> bool:
        """Re-register an old version so it becomes the latest."""
        target = self.get_version(name, version)
        if target is None:
            return False
        self.prompts[name].append(
            PromptVersion(
                name=target.name,
                template=target.template,
                model=target.model,
                temperature=target.temperature,
                top_p=target.top_p,
                max_tokens=target.max_tokens,
                system_prompt=target.system_prompt,
            )
        )
        return True


# Usage example
registry = PromptRegistry()
v1 = PromptVersion(
    name="customer-support",
    template="Please respond kindly to customer inquiries.\n\nInquiry: {query}",
    model="llama-3.3-70b",
    temperature=0.3,
    system_prompt="You are a professional customer service agent.",
)
registry.register(v1)
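The content-hash versioning idea is easy to exercise in isolation. The sketch below re-implements it as a pair of free functions so the determinism is visible; logging the version id with every request means any response can be traced back to the exact template/model/parameter combination that produced it:

```python
import hashlib

def prompt_version_id(template: str, model: str, temperature: float) -> str:
    """Same scheme as PromptVersion.__post_init__: identical inputs
    always yield the same 8-character id."""
    content = f"{template}{model}{temperature}"
    return hashlib.sha256(content.encode()).hexdigest()[:8]

def render(template: str, **variables: str) -> str:
    return template.format(**variables)

template = "Please respond kindly to customer inquiries.\n\nInquiry: {query}"
vid = prompt_version_id(template, "llama-3.3-70b", 0.3)
prompt = render(template, query="Where is my order?")
# Attach vid to the request log so responses are reproducible end to end.
```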
A/B Testing Framework
A/B testing for LLMs differs fundamentally from traditional web A/B testing. Instead of a single metric like click-through rate, it requires multi-dimensional quality evaluation, and because outputs are probabilistic, the same input can produce different responses, which demands larger sample sizes.
A/B Test Router Implementation
import hashlib
from dataclasses import dataclass, field


@dataclass
class ABVariant:
    name: str
    prompt_version: str
    model: str
    weight: float  # Traffic ratio (0.0 ~ 1.0)
    parameters: dict = field(default_factory=dict)


class LLMABRouter:
    """LLM A/B test traffic router"""

    def __init__(self, experiment_name: str):
        self.experiment_name = experiment_name
        self.variants: list[ABVariant] = []

    def add_variant(self, variant: ABVariant):
        self.variants.append(variant)

    def route(self, user_id: str) -> ABVariant:
        """Deterministic routing based on user ID
        (the same user always gets the same variant)."""
        hash_input = f"{self.experiment_name}:{user_id}"
        hash_value = int(hashlib.md5(hash_input.encode()).hexdigest(), 16)
        normalized = (hash_value % 10000) / 10000.0

        cumulative = 0.0
        for variant in self.variants:
            cumulative += variant.weight
            if normalized < cumulative:
                return variant
        return self.variants[-1]

    def validate_weights(self) -> bool:
        total = sum(v.weight for v in self.variants)
        return abs(total - 1.0) < 0.001


# Experiment setup example
experiment = LLMABRouter("customer-support-v2-test")
experiment.add_variant(ABVariant(
    name="control",
    prompt_version="v1-abc123",
    model="llama-3.3-70b",
    weight=0.7,
    parameters={"temperature": 0.3},
))
experiment.add_variant(ABVariant(
    name="treatment",
    prompt_version="v2-def456",
    model="llama-3.3-70b",
    weight=0.3,
    parameters={"temperature": 0.5},
))

# Per-user routing
variant = experiment.route(user_id="user-12345")
print(f"Assigned variant: {variant.name}")
Statistical Significance Testing
Core logic for determining statistical significance in LLM A/B tests:
import numpy as np
from scipy import stats


def calculate_ab_significance(
    control_scores: list[float],
    treatment_scores: list[float],
    alpha: float = 0.05,
    min_samples: int = 100,
) -> dict:
    """Determine statistical significance of A/B test results."""
    if (len(control_scores) < min_samples
            or len(treatment_scores) < min_samples):
        return {
            "status": "insufficient_samples",
            "control_n": len(control_scores),
            "treatment_n": len(treatment_scores),
            "min_required": min_samples,
        }

    control_mean = np.mean(control_scores)
    treatment_mean = np.mean(treatment_scores)
    lift = (treatment_mean - control_mean) / control_mean

    # Welch's t-test (no equal-variance assumption)
    t_stat, p_value = stats.ttest_ind(
        control_scores, treatment_scores, equal_var=False
    )

    # Effect size (Cohen's d, using sample standard deviations)
    pooled_std = np.sqrt(
        (np.std(control_scores, ddof=1) ** 2
         + np.std(treatment_scores, ddof=1) ** 2) / 2
    )
    cohens_d = (
        (treatment_mean - control_mean) / pooled_std
        if pooled_std > 0 else 0.0
    )

    return {
        "status": "significant" if p_value < alpha else "not_significant",
        "control_mean": round(float(control_mean), 4),
        "treatment_mean": round(float(treatment_mean), 4),
        "lift": round(float(lift) * 100, 2),
        "p_value": round(float(p_value), 6),
        "cohens_d": round(float(cohens_d), 4),
        "recommendation": (
            "DEPLOY treatment"
            if p_value < alpha and lift > 0
            else "KEEP control"
        ),
    }
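The same statistics can be run in reverse to size an experiment before launch. Using the standard normal approximation for a two-sided two-sample test, the required per-variant sample size is n ≈ 2 * ((z_(1-alpha/2) + z_(1-beta)) / d)^2, where d is the smallest effect size (Cohen's d) worth detecting:

```python
from math import ceil
from statistics import NormalDist

def required_samples_per_variant(
    effect_size: float, alpha: float = 0.05, power: float = 0.8
) -> int:
    """Normal-approximation sample size for a two-sided two-sample test."""
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)  # ~1.96 for alpha=0.05
    z_beta = NormalDist().inv_cdf(power)           # ~0.84 for power=0.8
    return ceil(2 * ((z_alpha + z_beta) / effect_size) ** 2)

# A "small" effect (d = 0.2) already needs ~393 samples per variant:
print(required_samples_per_variant(0.2))  # 393
```

This is why the 500-sample experiment in the failure cases below was underpowered: 500 total samples across two variants cannot reliably detect a small quality shift in high-variance LLM outputs.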
Guardrail Integration
Guardrails are not optional but essential for production LLMs. Using NVIDIA NeMo Guardrails, you can declaratively configure input/output filtering, topic drift prevention, PII detection, and hallucination checks.
NeMo Guardrails Configuration
# config.yml - NeMo Guardrails configuration
models:
  - type: main
    engine: vllm
    parameters:
      base_url: 'http://vllm-llama3-70b-svc:8000/v1'
      model_name: 'meta-llama/Llama-3.3-70B-Instruct'

rails:
  input:
    flows:
      - self check input        # Input toxicity check
      - check jailbreak         # Jailbreak attempt detection
      - mask pii                # PII masking
  output:
    flows:
      - self check output       # Output toxicity check
      - check hallucination     # Hallucination detection
      - check topic relevance   # Topic relevance verification

config:
  enable_multi_step_generation: true
  lowest_temperature: 0.1
  enable_rails_exceptions: true

instructions:
  - type: general
    content: |
      You must follow these guidelines:
      1. Clearly state when facts are unverified speculation
      2. Never include personal information in responses
      3. Recommend professional consultation for medical/legal/financial advice
      4. Never generate violent or harmful content

sample_conversation: |
  user "Hello, I need help."
    express greeting
  bot express greeting and offer help
    "Hello! How can I help you?"
Guardrail Middleware Integration
from nemoguardrails import RailsConfig, LLMRails


class GuardrailMiddleware:
    """LLM guardrail middleware"""

    def __init__(self, config_path: str):
        config = RailsConfig.from_path(config_path)
        self.rails = LLMRails(config)

    async def process(self, user_message: str) -> dict:
        try:
            response = await self.rails.generate_async(
                messages=[{"role": "user", "content": user_message}]
            )
            return {
                "status": "success",
                "response": response["content"],
                "guardrail_actions": response.get("log", {}).get(
                    "activated_rails", []
                ),
            }
        except Exception as e:
            # With enable_rails_exceptions, blocked requests surface here.
            return {
                "status": "blocked",
                "reason": str(e),
                "response": "Unable to process your request. "
                            "Please try a different question.",
            }
Cost Optimization
Since LLM operational costs are proportional to token usage, a systematic cost optimization strategy is essential.
Cost Reduction Strategies
Semantic Caching: Cache responses for similar questions based on vector similarity to prevent redundant inference. Typically achieves 20-40% cost reduction.
Prompt Compression: Remove unnecessary tokens and convey only essential information to reduce input tokens. Tools like LLMLingua can compress prompts by over 50%.
Model Routing: Automatically route between lightweight models (7B) and large models (70B) based on query complexity. Handling simple queries with lightweight models can reduce costs by over 80%.
KV Cache Optimization: Leverage vLLM prefix caching to reuse KV caches for system prompts and common contexts.
# Model routing example: automatic routing based on query complexity
class ModelRouter:
    def __init__(self):
        self.complexity_threshold = 0.6
        self.models = {
            "simple": {
                "name": "llama-3.1-8b",
                "endpoint": "http://vllm-8b:8000/v1",
                "cost_per_1k": 0.00010,
            },
            "complex": {
                "name": "llama-3.3-70b",
                "endpoint": "http://vllm-70b:8000/v1",
                "cost_per_1k": 0.00079,
            },
        }

    def classify_complexity(self, query: str) -> float:
        """Evaluate query complexity as a score between 0 and 1."""
        indicators = [
            len(query) > 500,            # Long question
            "compare" in query.lower(),  # Comparison request
            "analyze" in query.lower(),  # Analysis request
            "code" in query.lower(),     # Code generation
            query.count("?") > 2,        # Multiple questions
        ]
        return sum(indicators) / len(indicators)

    def route(self, query: str) -> dict:
        complexity = self.classify_complexity(query)
        if complexity >= self.complexity_threshold:
            return self.models["complex"]
        return self.models["simple"]
Failure Cases and Lessons Learned
Case 1: Model Serving OOM (Out of Memory)
Deployed a 70B model on 4x A100 80GB, but setting max-model-len to 32768 caused OOM when concurrent requests increased.
Root Cause: KV cache consumes memory proportional to sequence length. At 32K length with 50 concurrent requests, KV cache alone requires 300GB+ memory.
Solution: Reduced max-model-len to 8192, set gpu-memory-utilization to 0.90, and routed long inputs to a dedicated instance.
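The numbers in this case fall straight out of the KV-cache formula: per token, the cache stores keys and values for every layer, so bytes_per_token = 2 * layers * kv_heads * head_dim * dtype_bytes. Plugging in Llama-3.3-70B's published architecture (80 layers, 8 KV heads via GQA, head dim 128, fp16):

```python
def kv_cache_gb(layers: int, kv_heads: int, head_dim: int,
                seq_len: int, concurrent: int, dtype_bytes: int = 2) -> float:
    """Worst-case KV-cache footprint in GiB for `concurrent` full-length
    sequences. The factor 2 covers keys and values."""
    bytes_per_token = 2 * layers * kv_heads * head_dim * dtype_bytes
    return bytes_per_token * seq_len * concurrent / 1024**3

# Llama-3.3-70B: 32K context, 50 concurrent requests
print(round(kv_cache_gb(80, 8, 128, 32768, 50)))  # 500 (GiB)
# Same load at 8K context
print(round(kv_cache_gb(80, 8, 128, 8192, 50)))   # 125 (GiB)
```

At 32K context the worst case is 500 GiB of KV cache, far beyond what 4x A100 80GB (320 GB total, minus roughly 140 GB of fp16 weights) can hold; at 8K it drops to 125 GiB, which is why shrinking max-model-len resolved the OOM.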
Case 2: Prompt Regression
Modified a customer support prompt to be "friendlier," which caused a 30% drop in technical support accuracy.
Root Cause: The prompt change was deployed to all traffic without A/B testing. Emphasizing "friendliness" had the side effect of suppressing accurate technical terminology.
Solution: Established a policy requiring all prompt changes to be canary-deployed with 10% traffic, passing accuracy, relevance, and helpfulness quality metrics before full deployment.
Case 3: A/B Test Statistical Errors
Concluded an A/B test with only 500 samples and decided to deploy the treatment, but performance subsequently dropped below the control.
Root Cause: LLMs have high variance due to probabilistic characteristics, and 500 samples lacked sufficient statistical power. Additionally, weekend/weekday traffic pattern differences were not accounted for.
Solution: Set minimum sample size to 1000+, run experiments for at least one week to offset time-of-day effects, and verify effect size using Cohen's d.
Operational Checklist
Pre-Deployment
- Verify model serving framework healthcheck endpoint responds correctly
- Confirm GPU memory utilization does not exceed 95%
- Ensure prompt version is registered in the registry
- Verify guardrail configuration reflects the latest policies
- Confirm rollback prompt version is clearly specified
During Deployment
- Start canary traffic ratio at 10% and increase incrementally
- Monitor TTFT, TPOT, and total latency in real-time
- Verify guardrail violation rate does not spike compared to baseline
- Confirm token usage and cost remain within budget
Post-Deployment
- Confirm statistical significance of A/B test results before full deployment
- Review quality metrics (relevance, accuracy, helpfulness) dashboard weekly
- Generate monthly cost reports tracking actual vs. budget performance
- Collect user feedback data to inform prompt improvements
- Analyze guardrail logs to add new risk patterns to policies
References
- vLLM vs TensorRT-LLM vs HF TGI vs LMDeploy - Production LLM Inference Comparison
- LLMOps vs MLOps: Key Differences and Architecture - Codebridge
- LLM Monitoring: Quality, Cost, Latency, and Drift in Production - LangWatch
- A/B Testing of LLM Prompts - Langfuse
- NVIDIA NeMo Guardrails - GitHub
- LLM Observability and Monitoring - Langfuse
- A/B Testing LLM Prompts: A Practical Guide - Braintrust