LLMOps Platform Architecture Guide: Model Deployment, Monitoring, and A/B Testing


Introduction

As LLMs (Large Language Models) move rapidly into production environments, operational challenges have emerged that traditional MLOps alone cannot solve. Prompt engineering has become more central than model training, qualitative evaluation of generated text now matters as much as traditional quantitative metrics, and per-token cost controls and guardrails against hallucination have become essential.

According to Gartner, over 50% of enterprise generative AI deployments are expected to fail by 2026 due to operational immaturity. This is not a problem with LLM technology itself, but rather stems from the absence of platform architecture needed to operate LLMs reliably.

This post covers the complete architecture of an LLMOps platform. We build everything needed for production LLM operations with code -- from vLLM/TGI-based model serving and token usage monitoring to prompt version management, A/B testing frameworks, NeMo Guardrails integration, and cost optimization.

LLMOps vs. MLOps: What's Different

While traditional MLOps focuses on automating the "train-deploy-monitor" pipeline, LLMOps starts from a fundamentally different paradigm. MLOps is a system for repeatable predictions, while LLMOps is a system for probabilistic generation.

Aspect           | MLOps                            | LLMOps
-----------------+----------------------------------+----------------------------------------------
Core Activity    | Model training/retraining        | Prompt engineering/fine-tuning
Cost Structure   | Training cost dominant           | Inference cost dominant (per-token billing)
Evaluation       | Accuracy, F1, RMSE               | BLEU, ROUGE, LLM-as-Judge
Data Pipeline    | Feature stores, ETL              | RAG, vector DB, chunking pipelines
Versioning       | Model artifacts                  | Prompt templates + model + parameters
Monitoring       | Data drift, performance metrics  | Token usage, latency, quality, hallucination
Deployment Cycle | Weekly/monthly retraining        | Prompt changes possible in minutes
Safety Measures  | Input validation                 | Guardrails, content filtering, PII detection

LLMOps architecture requires more components than traditional MLOps. An application gateway sits in front of the model server, orchestrating prompt routing, vector DB search, tool calling, and caching layers.

User Request
    |
    v
+-------------------------------------------+
|         Application Gateway               |
|  +----------+ +--------+ +-----------+    |
|  | Prompt   | | Vector | | Guardrail |    |
|  | Router   | | DB     | | Engine    |    |
|  |          | | (RAG)  | |           |    |
|  +----------+ +--------+ +-----------+    |
|  +----------+ +--------+ +-----------+    |
|  | Cache    | | A/B    | | Token     |    |
|  | Layer    | | Router | | Metering  |    |
|  +----------+ +--------+ +-----------+    |
+-------------------+-----------------------+
                    |
    +---------------+---------------+
    v               v               v
+--------+    +--------+    +------------+
| vLLM   |    |  TGI   |    | TensorRT-  |
| Server |    | Server |    | LLM Server |
+--------+    +--------+    +------------+

Model Serving Architecture

LLM Serving Framework Comparison

A comparison of the major frameworks specialized for LLM serving:

Framework     | Core Technology         | GPU Requirements | Throughput | Latency  | Model Compatibility | Operational Complexity
--------------+-------------------------+------------------+------------+----------+---------------------+-----------------------
vLLM          | PagedAttention          | CUDA GPU         | High       | Medium   | Full HuggingFace    | Low
TGI           | Flash Attention         | CUDA GPU         | High       | Low (v3) | Full HuggingFace    | Low
TensorRT-LLM  | CUDA Graph Optimization | NVIDIA only      | Highest    | Lowest   | Conversion required | High
Triton + vLLM | Ensemble Pipeline       | CUDA GPU         | High       | Medium   | Multi-model         | Medium

vLLM manages KV cache in page units like virtual memory through PagedAttention, minimizing GPU memory fragmentation. It can handle more concurrent requests on the same VRAM, showing 10-30% higher throughput than TGI on mixed-length workloads.

TGI v3 demonstrates up to 13x faster performance than vLLM on long prompts (200K+ tokens) and has strong integration with the Hugging Face ecosystem.

TensorRT-LLM achieves 20-40% higher raw throughput than vLLM/TGI on H100 hardware through CUDA graph optimization and fused kernels, but requires model conversion and NVIDIA hardware lock-in.

vLLM Deployment Configuration

A production deployment configuration for vLLM in a Kubernetes environment:

apiVersion: apps/v1
kind: Deployment
metadata:
  name: vllm-llama3-70b
  namespace: llm-serving
spec:
  replicas: 2
  selector:
    matchLabels:
      app: vllm-llama3-70b
  template:
    metadata:
      labels:
        app: vllm-llama3-70b
    spec:
      containers:
        - name: vllm
          image: vllm/vllm-openai:v0.7.3
          args:
            - '--model'
            - 'meta-llama/Llama-3.3-70B-Instruct'
            - '--tensor-parallel-size'
            - '4'
            - '--max-model-len'
            - '8192'
            - '--gpu-memory-utilization'
            - '0.90'
            - '--enable-chunked-prefill'
            - '--max-num-batched-tokens'
            - '32768'
            - '--port'
            - '8000'
          ports:
            - containerPort: 8000
          resources:
            limits:
              nvidia.com/gpu: 4
            requests:
              nvidia.com/gpu: 4
              memory: '64Gi'
              cpu: '16'
          env:
            - name: HUGGING_FACE_HUB_TOKEN
              valueFrom:
                secretKeyRef:
                  name: hf-secret
                  key: token
          readinessProbe:
            httpGet:
              path: /health
              port: 8000
            initialDelaySeconds: 120
            periodSeconds: 10
          livenessProbe:
            httpGet:
              path: /health
              port: 8000
            initialDelaySeconds: 180
            periodSeconds: 30
---
apiVersion: v1
kind: Service
metadata:
  name: vllm-llama3-70b-svc
  namespace: llm-serving
spec:
  selector:
    app: vllm-llama3-70b
  ports:
    - port: 8000
      targetPort: 8000
  type: ClusterIP

Key parameter explanations:

  • tensor-parallel-size: 4: Distributes the 70B model across 4 GPUs for inference.
  • gpu-memory-utilization: 0.90: Lets vLLM use 90% of total GPU memory for weights, activations, and KV cache; the larger the KV cache budget, the more concurrent requests can be served.
  • enable-chunked-prefill: Interleaves prefill and decoding to reduce TTFT (Time To First Token).
  • max-num-batched-tokens: 32768: Maximum tokens per batch, balancing throughput and latency.
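
Once deployed, the Service exposes vLLM's OpenAI-compatible API. A minimal sketch that builds a chat-completion request against the in-cluster service name from the manifest above; actually sending it requires network access to the cluster, so only the request is constructed here:

```python
import json
import urllib.request

# In-cluster DNS name from the Service manifest above
VLLM_URL = "http://vllm-llama3-70b-svc.llm-serving:8000/v1/chat/completions"


def build_chat_request(prompt: str, max_tokens: int = 512) -> urllib.request.Request:
    """Build an OpenAI-compatible chat completion request for vLLM."""
    payload = {
        "model": "meta-llama/Llama-3.3-70B-Instruct",
        "messages": [{"role": "user", "content": prompt}],
        "max_tokens": max_tokens,
        "temperature": 0.3,
    }
    return urllib.request.Request(
        VLLM_URL,
        data=json.dumps(payload).encode(),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


req = build_chat_request("Summarize our refund policy.")
# urllib.request.urlopen(req) would send it from inside the cluster
```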

Monitoring Strategy

Unlike traditional ML monitoring, LLM monitoring must track three dimensions simultaneously: Performance, Cost, and Quality.

Core Metrics Framework

Performance Metrics          Cost Metrics              Quality Metrics
+-- TTFT (Time To            +-- Input token count     +-- Response relevance score
|   First Token)             +-- Output token count    +-- Hallucination rate
+-- TPOT (Time Per           +-- Cost per request      +-- Guardrail violation rate
|   Output Token)            +-- Cost comparison       +-- User feedback
+-- Total generation time    |   by model              |   (thumbs up/down)
+-- Request throughput       +-- Cache hit rate        +-- LLM-as-Judge score
|   (RPS)                    +-- Daily/monthly
+-- GPU utilization          |   cost trends
+-- Queue wait time
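
TTFT and TPOT are measured client-side from the token stream. A minimal sketch, with a simulated stream standing in for a real vLLM streaming response:

```python
import time


def measure_streaming_metrics(token_stream):
    """Measure TTFT and TPOT from an iterable of generated tokens."""
    start = time.perf_counter()
    ttft = None
    count = 0
    for _ in token_stream:
        if ttft is None:
            ttft = time.perf_counter() - start  # Time To First Token
        count += 1
    total = time.perf_counter() - start
    # Time Per Output Token: average inter-token latency after the first
    tpot = (total - ttft) / (count - 1) if count > 1 else 0.0
    return {"ttft": ttft, "tpot": tpot, "tokens": count, "total": total}


def fake_stream(n=5, delay=0.01):
    # Stand-in for a real streaming response (e.g. SSE chunks from vLLM)
    for _ in range(n):
        time.sleep(delay)
        yield "tok"


metrics = measure_streaming_metrics(fake_stream())
```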

Prometheus Metrics Collection Implementation

A Python middleware example that collects metrics exposed by vLLM via Prometheus and adds custom business metrics:

import time
import tiktoken
from prometheus_client import (
    Counter, Histogram, Gauge, start_http_server
)
from functools import wraps

# Performance metrics
REQUEST_LATENCY = Histogram(
    "llm_request_latency_seconds",
    "LLM request latency",
    ["model", "endpoint"],
    buckets=[0.1, 0.5, 1.0, 2.0, 5.0, 10.0, 30.0, 60.0],
)
TTFT_LATENCY = Histogram(
    "llm_ttft_seconds",
    "Time To First Token",
    ["model"],
    buckets=[0.05, 0.1, 0.2, 0.5, 1.0, 2.0, 5.0],
)

# Cost metrics
TOKEN_COUNTER = Counter(
    "llm_tokens_total",
    "Total token usage",
    ["model", "direction"],  # direction: input/output
)
REQUEST_COST = Counter(
    "llm_request_cost_dollars",
    "Per-request cost (USD)",
    ["model"],
)

# Quality metrics
QUALITY_SCORE = Histogram(
    "llm_quality_score",
    "LLM response quality score",
    ["model", "evaluator"],
    buckets=[0.0, 0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9, 1.0],
)
GUARDRAIL_VIOLATIONS = Counter(
    "llm_guardrail_violations_total",
    "Guardrail violation count",
    ["model", "violation_type"],
)

# Token pricing per model (USD per 1K tokens)
PRICING = {
    "llama-3.3-70b": {"input": 0.00059, "output": 0.00079},
    "gpt-4o": {"input": 0.0025, "output": 0.01},
    "claude-sonnet": {"input": 0.003, "output": 0.015},
}


class LLMMetricsCollector:
    def __init__(self, model_name: str):
        self.model_name = model_name
        # cl100k_base approximates Llama token counts; use the model's
        # own tokenizer when exact counts are required
        self.encoder = tiktoken.get_encoding("cl100k_base")

    def record_request(self, prompt: str, response: str,
                       latency: float, ttft: float):
        input_tokens = len(self.encoder.encode(prompt))
        output_tokens = len(self.encoder.encode(response))

        # Record performance
        REQUEST_LATENCY.labels(
            model=self.model_name, endpoint="/v1/chat/completions"
        ).observe(latency)
        TTFT_LATENCY.labels(model=self.model_name).observe(ttft)

        # Record token usage
        TOKEN_COUNTER.labels(
            model=self.model_name, direction="input"
        ).inc(input_tokens)
        TOKEN_COUNTER.labels(
            model=self.model_name, direction="output"
        ).inc(output_tokens)

        # Calculate and record cost
        pricing = PRICING.get(self.model_name, PRICING["llama-3.3-70b"])
        cost = (
            input_tokens / 1000 * pricing["input"]
            + output_tokens / 1000 * pricing["output"]
        )
        REQUEST_COST.labels(model=self.model_name).inc(cost)

    def record_quality(self, score: float, evaluator: str = "auto"):
        QUALITY_SCORE.labels(
            model=self.model_name, evaluator=evaluator
        ).observe(score)

    def record_guardrail_violation(self, violation_type: str):
        GUARDRAIL_VIOLATIONS.labels(
            model=self.model_name, violation_type=violation_type
        ).inc()


if __name__ == "__main__":
    # Expose /metrics on :9090. In practice this runs inside the
    # long-lived gateway process so the exporter thread stays alive.
    start_http_server(9090)
    collector = LLMMetricsCollector("llama-3.3-70b")

Grafana Dashboard Key Panels

Core PromQL queries for a Grafana dashboard built on Prometheus metrics:

# P99 latency (5-minute window)
histogram_quantile(0.99, rate(llm_request_latency_seconds_bucket[5m]))

# Token consumption per minute
rate(llm_tokens_total[1m])

# Hourly cost trend
rate(llm_request_cost_dollars[1h]) * 3600

# Guardrail violation rate
rate(llm_guardrail_violations_total[5m]) / rate(llm_request_latency_seconds_count[5m])

# Median quality score
histogram_quantile(0.5, rate(llm_quality_score_bucket[1h]))

Prompt Version Management

In LLMOps, prompts are the core asset equivalent to model artifacts in traditional MLOps. Prompt templates, model versions, and generation parameters (temperature, top_p, etc.) must be versioned together to enable reproducible deployments and granular rollbacks.

import json
import hashlib
from datetime import datetime
from dataclasses import dataclass, field, asdict
from typing import Optional


@dataclass
class PromptVersion:
    name: str
    template: str
    model: str
    temperature: float = 0.7
    top_p: float = 0.9
    max_tokens: int = 2048
    system_prompt: str = ""
    version: str = ""
    created_at: str = field(
        default_factory=lambda: datetime.utcnow().isoformat()
    )

    def __post_init__(self):
        if not self.version:
            content = f"{self.template}{self.model}{self.temperature}"
            self.version = hashlib.sha256(
                content.encode()
            ).hexdigest()[:8]

    def to_dict(self) -> dict:
        return asdict(self)


class PromptRegistry:
    """Prompt version management registry"""

    def __init__(self, storage_backend="redis"):
        # In-memory store for illustration; production deployments
        # would persist versions to the configured backend (e.g. Redis)
        self.storage_backend = storage_backend
        self.prompts: dict[str, list[PromptVersion]] = {}

    def register(self, prompt: PromptVersion) -> str:
        if prompt.name not in self.prompts:
            self.prompts[prompt.name] = []
        self.prompts[prompt.name].append(prompt)
        return prompt.version

    def get_latest(self, name: str) -> Optional[PromptVersion]:
        versions = self.prompts.get(name, [])
        return versions[-1] if versions else None

    def get_version(self, name: str, version: str
                    ) -> Optional[PromptVersion]:
        versions = self.prompts.get(name, [])
        for v in versions:
            if v.version == version:
                return v
        return None

    def rollback(self, name: str, version: str) -> bool:
        target = self.get_version(name, version)
        if target:
            self.prompts[name].append(
                PromptVersion(
                    name=target.name,
                    template=target.template,
                    model=target.model,
                    temperature=target.temperature,
                    top_p=target.top_p,
                    max_tokens=target.max_tokens,
                    system_prompt=target.system_prompt,
                )
            )
            return True
        return False


# Usage example
registry = PromptRegistry()
v1 = PromptVersion(
    name="customer-support",
    template="Please respond kindly to customer inquiries.\n\nInquiry: {query}",
    model="llama-3.3-70b",
    temperature=0.3,
    system_prompt="You are a professional customer service agent.",
)
registry.register(v1)
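
The content-hash versioning in __post_init__ can be exercised in isolation. A minimal sketch showing that changing any of template, model, or temperature yields a new eight-character version id:

```python
import hashlib


def prompt_version_id(template: str, model: str, temperature: float) -> str:
    # Mirrors PromptVersion.__post_init__: short SHA-256 over the
    # fields that define a prompt's behavior
    content = f"{template}{model}{temperature}"
    return hashlib.sha256(content.encode()).hexdigest()[:8]


v_a = prompt_version_id("Inquiry: {query}", "llama-3.3-70b", 0.3)
v_b = prompt_version_id("Inquiry: {query}", "llama-3.3-70b", 0.5)  # only temperature changed
```

Because the id is derived from content, re-registering an identical prompt reproduces the same version, while any behavioral change (even a temperature tweak) is a distinct, rollback-able version.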

A/B Testing Framework

A/B testing for LLMs is fundamentally different from traditional web A/B testing. Instead of simple metrics like click-through rates, multi-dimensional quality evaluation is required. Due to probabilistic output characteristics, the same input can produce different responses, requiring larger sample sizes.

A/B Test Router Implementation

import random
import hashlib
from dataclasses import dataclass
from typing import Any


@dataclass
class ABVariant:
    name: str
    prompt_version: str
    model: str
    weight: float  # Traffic ratio (0.0 ~ 1.0)
    parameters: dict | None = None


class LLMABRouter:
    """LLM A/B test traffic router"""

    def __init__(self, experiment_name: str):
        self.experiment_name = experiment_name
        self.variants: list[ABVariant] = []

    def add_variant(self, variant: ABVariant):
        self.variants.append(variant)

    def route(self, user_id: str) -> ABVariant:
        """Deterministic routing based on user ID
        (same user always gets same variant)"""
        hash_input = f"{self.experiment_name}:{user_id}"
        hash_value = int(
            hashlib.md5(hash_input.encode()).hexdigest(), 16
        )
        normalized = (hash_value % 10000) / 10000.0

        cumulative = 0.0
        for variant in self.variants:
            cumulative += variant.weight
            if normalized < cumulative:
                return variant
        return self.variants[-1]

    def validate_weights(self) -> bool:
        total = sum(v.weight for v in self.variants)
        return abs(total - 1.0) < 0.001


# Experiment setup example
experiment = LLMABRouter("customer-support-v2-test")
experiment.add_variant(ABVariant(
    name="control",
    prompt_version="v1-abc123",
    model="llama-3.3-70b",
    weight=0.7,
    parameters={"temperature": 0.3},
))
experiment.add_variant(ABVariant(
    name="treatment",
    prompt_version="v2-def456",
    model="llama-3.3-70b",
    weight=0.3,
    parameters={"temperature": 0.5},
))

# Per-user routing
variant = experiment.route(user_id="user-12345")
print(f"Assigned variant: {variant.name}")

Statistical Significance Testing

Core logic for determining statistical significance in LLM A/B tests:

import numpy as np
from scipy import stats


def calculate_ab_significance(
    control_scores: list[float],
    treatment_scores: list[float],
    alpha: float = 0.05,
    min_samples: int = 100,
) -> dict:
    """Determine statistical significance of A/B test results"""

    if (len(control_scores) < min_samples
            or len(treatment_scores) < min_samples):
        return {
            "status": "insufficient_samples",
            "control_n": len(control_scores),
            "treatment_n": len(treatment_scores),
            "min_required": min_samples,
        }

    control_mean = np.mean(control_scores)
    treatment_mean = np.mean(treatment_scores)
    lift = (treatment_mean - control_mean) / control_mean

    # Welch's t-test (no equal variance assumption)
    t_stat, p_value = stats.ttest_ind(
        control_scores, treatment_scores, equal_var=False
    )

    # Effect size (Cohen's d)
    pooled_std = np.sqrt(
        (np.std(control_scores, ddof=1) ** 2
         + np.std(treatment_scores, ddof=1) ** 2)
        / 2
    )
    cohens_d = (
        (treatment_mean - control_mean) / pooled_std
        if pooled_std > 0 else 0
    )

    return {
        "status": "significant" if p_value < alpha else "not_significant",
        "control_mean": round(control_mean, 4),
        "treatment_mean": round(treatment_mean, 4),
        "lift": round(lift * 100, 2),
        "p_value": round(p_value, 6),
        "cohens_d": round(cohens_d, 4),
        "recommendation": (
            "DEPLOY treatment"
            if p_value < alpha and lift > 0
            else "KEEP control"
        ),
    }
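
The min_samples floor above can be grounded in a power analysis: for a target effect size, the normal-approximation formula gives the required sample size per variant. A sketch using only the standard library:

```python
import math
from statistics import NormalDist


def min_samples_per_variant(effect_size: float,
                            alpha: float = 0.05,
                            power: float = 0.8) -> int:
    """Required n per variant for a two-sample test (normal approximation)."""
    nd = NormalDist()
    z_alpha = nd.inv_cdf(1 - alpha / 2)  # two-sided significance
    z_beta = nd.inv_cdf(power)           # statistical power
    return math.ceil(2 * ((z_alpha + z_beta) / effect_size) ** 2)


n_small = min_samples_per_variant(0.2)   # small effect (Cohen's d = 0.2)
n_medium = min_samples_per_variant(0.5)  # medium effect
```

Detecting a small effect (d = 0.2) requires roughly 400 samples per arm, which is why the 100-sample minimum above should be treated as a floor rather than a target.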

Guardrail Integration

Guardrails are not optional but essential for production LLMs. Using NVIDIA NeMo Guardrails, you can declaratively configure input/output filtering, topic drift prevention, PII detection, and hallucination checks.

NeMo Guardrails Configuration

# config.yml - NeMo Guardrails configuration
models:
  - type: main
    engine: vllm
    parameters:
      base_url: 'http://vllm-llama3-70b-svc:8000/v1'
      model_name: 'meta-llama/Llama-3.3-70B-Instruct'

rails:
  input:
    flows:
      - self check input # Input toxicity check
      - check jailbreak # Jailbreak attempt detection
      - mask pii # PII masking

  output:
    flows:
      - self check output # Output toxicity check
      - check hallucination # Hallucination detection
      - check topic relevance # Topic relevance verification

  config:
    enable_multi_step_generation: true
    lowest_temperature: 0.1
    enable_rails_exceptions: true

instructions:
  - type: general
    content: |
      You must follow these guidelines:
      1. Clearly state when facts are unverified speculation
      2. Never include personal information in responses
      3. Recommend professional consultation for medical/legal/financial advice
      4. Never generate violent or harmful content

sample_conversation: |
  user "Hello, I need help."
    express greeting
  bot express greeting and offer help
    "Hello! How can I help you?"

Guardrail Middleware Integration

from nemoguardrails import RailsConfig, LLMRails


class GuardrailMiddleware:
    """LLM guardrail middleware"""

    def __init__(self, config_path: str):
        config = RailsConfig.from_path(config_path)
        self.rails = LLMRails(config)

    async def process(self, user_message: str,
                      context: dict | None = None) -> dict:
        try:
            response = await self.rails.generate_async(
                messages=[{"role": "user", "content": user_message}]
            )
            return {
                "status": "success",
                "response": response["content"],
                "guardrail_actions": response.get(
                    "log", {}
                ).get("activated_rails", []),
            }
        except Exception as e:
            return {
                "status": "blocked",
                "reason": str(e),
                "response": "Unable to process your request. "
                            "Please try a different question.",
            }

Cost Optimization

Since LLM operational costs are proportional to token usage, a systematic cost optimization strategy is essential.

Cost Reduction Strategies

  1. Semantic Caching: Cache responses for similar questions based on vector similarity to prevent redundant inference. Typically achieves 20-40% cost reduction.

  2. Prompt Compression: Remove unnecessary tokens and convey only essential information to reduce input tokens. Tools like LLMLingua can compress prompts by over 50%.

  3. Model Routing: Automatically route between lightweight models (8B) and large models (70B) based on query complexity. Handling simple queries with lightweight models can reduce costs by over 80%.

  4. KV Cache Optimization: Leverage vLLM prefix caching to reuse KV caches for system prompts and common contexts.

# Model routing example: automatic routing based on query complexity
class ModelRouter:
    def __init__(self):
        self.complexity_threshold = 0.6
        self.models = {
            "simple": {
                "name": "llama-3.2-8b",
                "endpoint": "http://vllm-8b:8000/v1",
                "cost_per_1k": 0.00010,
            },
            "complex": {
                "name": "llama-3.3-70b",
                "endpoint": "http://vllm-70b:8000/v1",
                "cost_per_1k": 0.00079,
            },
        }

    def classify_complexity(self, query: str) -> float:
        """Evaluate query complexity as a score between 0 and 1"""
        indicators = [
            len(query) > 500,               # Long question
            "compare" in query.lower(),      # Comparison request
            "analyze" in query.lower(),      # Analysis request
            "code" in query.lower(),         # Code generation
            query.count("?") > 2,            # Multiple questions
        ]
        return sum(indicators) / len(indicators)

    def route(self, query: str) -> dict:
        complexity = self.classify_complexity(query)
        if complexity >= self.complexity_threshold:
            return self.models["complex"]
        return self.models["simple"]

Failure Cases and Lessons Learned

Case 1: Model Serving OOM (Out of Memory)

Deployed a 70B model on 4x A100 80GB, but setting max-model-len to 32768 caused OOM when concurrent requests increased.

Root Cause: KV cache consumes memory proportional to sequence length. At 32K length with 50 concurrent requests, KV cache alone requires 300GB+ memory.

Solution: Reduced max-model-len to 8192, set gpu-memory-utilization to 0.90, and routed long inputs to a dedicated instance.
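
The memory arithmetic behind this incident is easy to reproduce. A back-of-envelope sketch assuming the public Llama-3-70B architecture (80 transformer layers, 8 KV heads via grouped-query attention, head dimension 128, fp16):

```python
def kv_cache_bytes_per_token(num_layers: int, num_kv_heads: int,
                             head_dim: int, dtype_bytes: int = 2) -> int:
    # Factor of 2 covers the separate K and V tensors per layer
    return 2 * num_layers * num_kv_heads * head_dim * dtype_bytes


# Llama-3-70B-class model: 80 layers, 8 KV heads (GQA), head_dim 128, fp16
per_token = kv_cache_bytes_per_token(80, 8, 128)  # 327,680 bytes (~320 KiB)
per_request_gib = per_token * 32768 / 1024**3     # 10 GiB at 32K context
fleet_gib = per_request_gib * 50                  # 500 GiB for 50 requests
# 4x A100 80GB is 320 GiB total, much of it already holding weights
```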

Case 2: Prompt Regression

Modified a customer support prompt to be "friendlier," which caused a 30% drop in technical support accuracy.

Root Cause: The prompt change was deployed to all traffic without A/B testing. Emphasizing "friendliness" had the side effect of suppressing accurate technical terminology.

Solution: Established a policy requiring all prompt changes to be canary-deployed with 10% traffic, passing accuracy, relevance, and helpfulness quality metrics before full deployment.

Case 3: A/B Test Statistical Errors

Concluded an A/B test with only 500 samples and decided to deploy the treatment, but performance subsequently dropped below the control.

Root Cause: LLMs have high variance due to probabilistic characteristics, and 500 samples lacked sufficient statistical power. Additionally, weekend/weekday traffic pattern differences were not accounted for.

Solution: Set minimum sample size to 1000+, run experiments for at least one week to offset time-of-day effects, and verify effect size using Cohen's d.

Operational Checklist

Pre-Deployment

  • Verify model serving framework healthcheck endpoint responds correctly
  • Confirm GPU memory utilization does not exceed 95%
  • Ensure prompt version is registered in the registry
  • Verify guardrail configuration reflects the latest policies
  • Confirm rollback prompt version is clearly specified

During Deployment

  • Start canary traffic ratio at 10% and increase incrementally
  • Monitor TTFT, TPOT, and total latency in real-time
  • Verify guardrail violation rate does not spike compared to baseline
  • Confirm token usage and cost remain within budget

Post-Deployment

  • Confirm statistical significance of A/B test results before full deployment
  • Review quality metrics (relevance, accuracy, helpfulness) dashboard weekly
  • Generate monthly cost reports tracking actual vs. budget performance
  • Collect user feedback data to inform prompt improvements
  • Analyze guardrail logs to add new risk patterns to policies
