- Introduction
- Why structured output is needed
- Constrained Decoding Principle: FSM and CFG

## Introduction
The most common issue when integrating an LLM into a production pipeline is uncertainty in the output format. Simply telling the model to "respond with JSON" in the prompt causes a variety of problems: schema mismatches, missing fields, incorrect types, and incomplete JSON. In fact, there are reports that parsing fails on roughly 5-15% of requests when JSON output is requested via the prompt alone.
The technology that fundamentally solves this problem is constrained decoding. Rather than a prompt-level "request", it guarantees 100% schema compliance by imposing grammatical constraints on the token generation process itself.
In this article, we start with the core principles of constrained decoding (FSM/CFG), compare the major engines (Outlines, XGrammar, llguidance), and then look at native API support and how to apply the technique on open-source serving stacks.
## Why structured output is needed
### Limitations of free-form output
Programmatically processing an LLM's free-form text output requires writing regular expressions or custom parsers. This approach is inherently fragile:
- Parse failures: the JSON is truncated, or commas or quotes are missing
- Schema mismatches: the field comes back as `名前` instead of `name`, or `rating` is returned as a string instead of a number
- Hallucinated fields: unrequested fields are added, or required fields are missing
- Format mixing: JSON and natural-language explanations are interleaved
### What structured output solves
Structured output ensures:
- Format Guarantee: Output must be valid JSON/XML/YAML
- Schema Compliance: 100% conforms to the specified JSON Schema
- Type Safety: Strings are not allowed in numeric fields.
- No retry required: Retry costs due to parsing failure are eliminated.
There are two main ways to achieve this: native API support (OpenAI Structured Outputs, Anthropic Tool Use) and constrained decoding engines (Outlines, XGrammar, llguidance).
## Constrained Decoding Principle: FSM and CFG
### Imposing constraints on the token creation process
An LLM produces a probability distribution over the entire vocabulary at every step. Constrained decoding masks the logits of grammatically invalid tokens to `-inf` in this distribution, ensuring that only valid tokens can be selected.
For example, at the start of a JSON object only the `{` token is allowed, and immediately after a key string ends only the `:` token is allowed.
### Finite State Machine (FSM)-based approach
Regular expressions and simple grammars can be expressed as an FSM. FSM-based constrained decoding works as follows:
- Convert the JSON Schema to a regular expression
- Compile the regular expression into a DFA (Deterministic Finite Automaton)
- At each decoding step, accept only tokens that transition out of the current FSM state
- Update the FSM state according to the selected token

The Outlines library is the representative implementation of this approach. It is relatively simple to implement and state tracking is fast, but it struggles to handle recursive structures (nested JSON objects, variable-length arrays, etc.) naturally.
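As a toy illustration of these steps (a hand-built DFA, not the actual Outlines implementation), consider the regex `"[a-z]+"` for a quoted lowercase key: a token is admitted only if the DFA can consume every one of its characters starting from the current state.

```python
# Toy DFA for the regex  "[a-z]+"  (a quoted lowercase string).
# States: 0 = expect opening quote, 1 = inside string, 2 = accepted.
# Hand-built for illustration only; Outlines compiles this automatically.

def step(state: int, ch: str):
    """Return the next DFA state, or None if the character is rejected."""
    if state == 0:
        return 1 if ch == '"' else None
    if state == 1:
        if ch == '"':
            return 2          # closing quote -> accept
        return 1 if ch.islower() else None
    return None               # state 2 is terminal

def allowed_tokens(state: int, vocab: list) -> set:
    """A token is allowed if the DFA can consume all of its characters."""
    allowed = set()
    for tok in vocab:
        s = state
        for ch in tok:
            s = step(s, ch)
            if s is None:
                break
        else:
            allowed.add(tok)
    return allowed

vocab = ['"na', 'me"', '{', ':', '42', 'ab']
print(allowed_tokens(0, vocab))  # only '"na' can start a quoted key
print(allowed_tokens(1, vocab))  # 'me"' and 'ab' can continue the string
```

At each decoding step, the engine intersects this allowed set with the vocabulary and masks out everything else, exactly as in the logit-masking code later in this article.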
### CFG (Context-Free Grammar)-based approach
CFG is needed to accurately express the entire grammar of JSON (including nested objects, arrays, and recursive structures). The CFG-based approach works as follows.
- Convert JSON Schema to EBNF (Extended Backus-Naur Form) grammar
- Track PDA (Pushdown Automaton) or parser state at every decoding step.
- Select only tokens that are allowed in the current grammar state
- Manage nesting depth based on stack
XGrammar and llguidance adopt the CFG-based approach. It handles recursive structures naturally, but its state-management overhead is larger than an FSM's.
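As a minimal sketch of the stack discipline involved (bracket matching and depth bounding only, not a full grammar engine like XGrammar or llguidance):

```python
# Minimal sketch of the stack a JSON PDA maintains: opening brackets
# push the expected closer, and a closing token is only valid when it
# matches the top of the stack. Real engines track full grammar state.

OPEN = {'{': '}', '[': ']'}

def check_structural(tokens: list, max_depth: int = 64) -> bool:
    stack = []
    for tok in tokens:
        if tok in OPEN:
            stack.append(OPEN[tok])      # push the expected closer
            if len(stack) > max_depth:   # bound the nesting depth
                return False
        elif tok in ('}', ']'):
            if not stack or stack[-1] != tok:
                return False             # closer doesn't match the top
            stack.pop()
    return not stack                     # all brackets must be closed

assert check_structural(['{', '[', '{', '}', ']', '}'])
assert not check_structural(['{', ']'])  # ']' while expecting '}'
```

During constrained decoding this check runs incrementally: a token that would fail the stack discipline is simply masked out before sampling.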
### Core mechanism of token masking
Both approaches ultimately perform the same operation: logit masking. During decoding, the logits of disallowed tokens are set to `-inf`, so their probability after softmax becomes 0.

```python
import torch

def apply_grammar_mask(logits: torch.Tensor,
                       allowed_token_ids: set,
                       vocab_size: int) -> torch.Tensor:
    """Core of constrained decoding: keep only grammatically allowed
    tokens and mask every other token's logit to -inf."""
    mask = torch.full((vocab_size,), float('-inf'),
                      dtype=logits.dtype, device=logits.device)
    allowed_ids = torch.tensor(list(allowed_token_ids),
                               dtype=torch.long, device=logits.device)
    mask[allowed_ids] = 0.0
    return logits + mask

# Example: right after a JSON object opens, only key strings are allowed
vocab_size = 32000
logits = torch.randn(vocab_size)  # raw logits from the model

# Tokens allowed in the current FSM state: keys starting with a quote (")
allowed_tokens = {1234, 5678}  # e.g. tokenizer IDs for '"name', '"age'
masked = apply_grammar_mask(logits, allowed_tokens, vocab_size)

# After softmax, only allowed tokens have positive probability
probs = torch.softmax(masked, dim=-1)
assert probs[0].item() == 0.0  # disallowed tokens get probability 0
```
## Comparison of major engines

### Comparison table
| Item | Outlines | XGrammar | llguidance | Native API (OpenAI) |
| -------------------- | ------------------------------- | ------------------------------- | ------------------------------- | --------------------- |
| **Grammar Model** | FSM (regular expression + JSON Schema) | CFG (EBNF) | CFG (EBNF) + byte level | Internal implementation (private) |
| **Application Formats** | JSON, regular expressions, CFG | JSON, EBNF, regular expressions | JSON, EBNF, regular expressions, Substrs | JSON Schema |
| **Framework Integration** | vLLM, TGI, SGLang | vLLM, SGLang, MLC-LLM | Guidance, Azure AI | OpenAI API only |
| **Preprocessing time** | Normal (FSM index build) | Fast (adaptive token mask) | Fast (parser based) | None (server side) |
| **Runtime overhead** | Low (~3% increased latency) | Very low (~1% latency increase) | Low (~2% increased latency) | None (transparent) |
| **Batch decoding support** | Support | High efficiency support (bitmask reuse) | Support | Not applicable |
| **Handling nested structures** | Limited (performance deterioration during deep recursion) | Excellent (Pushdown Automata) | Excellent | Excellent |
| **License** | Apache 2.0 | Apache 2.0 | MIT | Commercial API |
| **Maturity** | High (2023~) | Middle (2024~) | Middle (2024~) | High |
### Outlines
Outlines is a pioneering library in the constrained decoding space. It uses a two-stage pipeline: JSON Schema is converted to a regular expression, and the regular expression is compiled to an FSM. It was adopted as vLLM's default guided decoding backend, and although vLLM has since switched its default to XGrammar, Outlines is still widely used.
### XGrammar

XGrammar is a CFG-based engine developed by the MLC-AI team, released in late 2024. Its key innovation is **Adaptive Token Mask Caching**: grammar state is split into a context-independent part, for which precomputed bitmasks are reused, and a context-dependent part computed at runtime. This makes it especially efficient for batch decoding.
### llguidance
llguidance is a Rust-based parser engine derived from Microsoft's Guidance project. It operates at the byte level, is independent of the tokenizer, and supports unique constraint types such as substring matching (Substrs). It is used in Azure AI Inference.
## OpenAI/Anthropic API native method
### OpenAI Structured Outputs
OpenAI has officially supported Structured Outputs since August 2024. Constrained decoding is applied internally to guarantee 100% JSON Schema compliance.

```python
from openai import OpenAI
from pydantic import BaseModel, Field
from typing import List, Optional
from enum import Enum
client = OpenAI()
# Define the output schema with a Pydantic model
class Severity(str, Enum):
LOW = "low"
MEDIUM = "medium"
HIGH = "high"
CRITICAL = "critical"
class SecurityFinding(BaseModel):
vulnerability_type: str = Field(description="취약점 유형 (예: SQL Injection, XSS)")
severity: Severity = Field(description="심각도")
affected_component: str = Field(description="영향받는 컴포넌트")
description: str = Field(description="취약점 상세 설명")
remediation: str = Field(description="권장 조치")
cvss_score: Optional[float] = Field(default=None, ge=0.0, le=10.0, description="CVSS 점수")
class SecurityReport(BaseModel):
findings: List[SecurityFinding] = Field(description="발견된 취약점 목록")
overall_risk: Severity = Field(description="전체 위험 수준")
summary: str = Field(description="보안 분석 요약")
scan_timestamp: str = Field(description="스캔 시각 (ISO 8601)")
# Call with Structured Outputs - schema 100% guaranteed
response = client.beta.chat.completions.parse(
model="gpt-4o-2024-08-06",
messages=[
{
"role": "system",
"content": "당신은 보안 분석 전문가입니다. 주어진 코드를 분석하여 보안 취약점을 보고하세요."
},
{
"role": "user",
"content": """다음 Flask 코드의 보안 취약점을 분석해주세요:
@app.route('/search')
def search():
query = request.args.get('q')
result = db.execute(f"SELECT * FROM users WHERE name = '{query}'")
return render_template_string(f"<h1>Results for {query}</h1>")
"""
}
],
response_format=SecurityReport
)
report = response.choices[0].message.parsed
print(f"전체 위험 수준: {report.overall_risk.value}")
for finding in report.findings:
print(f" [{finding.severity.value}] {finding.vulnerability_type}: {finding.description}")
```

### Anthropic: Structured output based on Tool Use
Anthropic uses the Tool Use (function calling) mechanism instead of a separate Structured Outputs API. Define a single tool and force it with `tool_choice` to obtain structured output.

```python
import anthropic
import json
client = anthropic.Anthropic()
# Structured output via Tool Use
response = client.messages.create(
model="claude-sonnet-4-20250514",
max_tokens=2048,
tools=[
{
"name": "analyze_code_security",
"description": "코드의 보안 취약점을 구조화된 형식으로 분석합니다",
"input_schema": {
"type": "object",
"properties": {
"findings": {
"type": "array",
"items": {
"type": "object",
"properties": {
"vulnerability_type": {"type": "string"},
"severity": {
"type": "string",
"enum": ["low", "medium", "high", "critical"]
},
"affected_component": {"type": "string"},
"description": {"type": "string"},
"remediation": {"type": "string"}
},
"required": ["vulnerability_type", "severity", "description", "remediation"]
}
},
"overall_risk": {
"type": "string",
"enum": ["low", "medium", "high", "critical"]
},
"summary": {"type": "string"}
},
"required": ["findings", "overall_risk", "summary"]
}
}
],
tool_choice={"type": "tool", "name": "analyze_code_security"},
messages=[
{
"role": "user",
"content": "다음 코드의 보안 취약점을 분석해주세요:\n\n@app.route('/login', methods=['POST'])\ndef login():\n username = request.form['username']\n password = request.form['password']\n query = f\"SELECT * FROM users WHERE username='{username}' AND password='{password}'\"\n user = db.execute(query).fetchone()\n if user:\n session['user'] = username\n return redirect('/dashboard')"
}
]
)
# Extract structured data from the Tool Use result
tool_use_block = next(block for block in response.content if block.type == "tool_use")
report = tool_use_block.input
print(f"전체 위험: {report['overall_risk']}")
print(f"발견 항목: {len(report['findings'])}건")
```

## Applying open-source models: vLLM and TGI
### Guided Decoding of vLLM
vLLM has supported guided decoding since v0.5.0, with Outlines and XGrammar available as backends; from v0.7.0, XGrammar is the default.

```python
from vllm import LLM, SamplingParams
from pydantic import BaseModel, Field
from typing import List, Optional
import json
# Define the output schema with a Pydantic model
class ExtractedEntity(BaseModel):
name: str = Field(description="엔티티 이름")
entity_type: str = Field(description="엔티티 유형: PERSON, ORG, LOCATION, DATE, PRODUCT")
confidence: float = Field(ge=0.0, le=1.0, description="신뢰도")
context: Optional[str] = Field(default=None, description="원문에서의 관련 문맥")
class EntityExtractionResult(BaseModel):
entities: List[ExtractedEntity] = Field(description="추출된 엔티티 목록")
source_language: str = Field(description="원문 언어")
# Initialize the vLLM engine
llm = LLM(
model="meta-llama/Llama-3.1-8B-Instruct",
max_model_len=4096,
gpu_memory_utilization=0.85,
    guided_decoding_backend="xgrammar",  # or "outlines"
)
# SamplingParams using the JSON Schema as a guide
# (GuidedDecodingParams is the vLLM >= 0.6 API)
from vllm.sampling_params import GuidedDecodingParams

sampling_params = SamplingParams(
    temperature=0.1,
    top_p=0.95,
    max_tokens=1024,
    guided_decoding=GuidedDecodingParams(
        json=EntityExtractionResult.model_json_schema()
    ),
)
# Batch inference
texts = [
"삼성전자 이재용 회장이 2026년 3월 서울에서 AI 반도체 전략을 발표했다.",
"Apple CEO Tim Cook announced the new M5 chip at WWDC 2026 in Cupertino.",
"소프트뱅크 손정의 회장이 도쿄에서 ARM 기반 AI 인프라 투자를 발표했다.",
]
prompts = [
f"다음 텍스트에서 엔티티(사람, 조직, 장소, 날짜, 제품)를 추출하세요:\n\n{text}"
for text in texts
]
outputs = llm.generate(prompts, sampling_params)
for output in outputs:
result = json.loads(output.outputs[0].text)
parsed = EntityExtractionResult(**result)
print(f"언어: {parsed.source_language}")
for entity in parsed.entities:
print(f" [{entity.entity_type}] {entity.name} (신뢰도: {entity.confidence})")
print("---")
```

### Using vLLM's OpenAI-compatible server
vLLM's OpenAI-compatible server lets existing OpenAI SDK code work against open-source models with virtually no changes.

```bash
# Run the vLLM server (guided decoding enabled)
vllm serve meta-llama/Llama-3.1-8B-Instruct \
--guided-decoding-backend xgrammar \
--max-model-len 4096 \
  --port 8000
```

```python
from openai import OpenAI

# Connect to the vLLM server
client = OpenAI(base_url="http://localhost:8000/v1", api_key="dummy")

# Pass guided decoding parameters via extra_body
response = client.chat.completions.create(
model="meta-llama/Llama-3.1-8B-Instruct",
messages=[
{"role": "system", "content": "텍스트에서 엔티티를 추출하세요."},
{"role": "user", "content": "Google의 Sundar Pichai CEO가 Mountain View에서 Gemini 3.0을 공개했다."}
],
extra_body={
"guided_json": {
"type": "object",
"properties": {
"entities": {
"type": "array",
"items": {
"type": "object",
"properties": {
"name": {"type": "string"},
"type": {"type": "string", "enum": ["PERSON", "ORG", "LOCATION", "PRODUCT"]},
"confidence": {"type": "number"}
},
"required": ["name", "type", "confidence"]
}
}
},
"required": ["entities"]
}
}
)
import json
result = json.loads(response.choices[0].message.content)
print(json.dumps(result, indent=2, ensure_ascii=False))
```

### Using TGI (Text Generation Inference)
Hugging Face TGI also supports grammar-based guided generation; a JSON Schema is passed via the `grammar` parameter.

```bash
# Run the TGI server
docker run --gpus all -p 8080:80 \
ghcr.io/huggingface/text-generation-inference:latest \
--model-id meta-llama/Llama-3.1-8B-Instruct \
--max-input-tokens 2048 \
  --max-total-tokens 4096
```

```python
import requests
import json

# Pass the JSON Schema via TGI's grammar parameter
response = requests.post(
"http://localhost:8080/generate",
json={
"inputs": "서울의 유명 관광지 3곳을 추천해주세요.",
"parameters": {
"max_new_tokens": 512,
"temperature": 0.7,
"grammar": {
"type": "json",
"value": {
"type": "object",
"properties": {
"places": {
"type": "array",
"items": {
"type": "object",
"properties": {
"name": {"type": "string"},
"category": {"type": "string"},
"description": {"type": "string"}
},
"required": ["name", "category", "description"]
},
"minItems": 3,
"maxItems": 3
}
},
"required": ["places"]
}
}
}
}
)
result = json.loads(response.json()["generated_text"])
for place in result["places"]:
print(f"{place['name']} ({place['category']}): {place['description']}")
```

## Function Calling Integration
### Relationship between structured output and function calling
Function calling is essentially a special form of structured output: the LLM emits "which function to call, with which arguments" as structured JSON. Combining the two lets you build powerful agent systems.
### LangChain with_structured_output
LangChain's `with_structured_output` method provides structured output through a unified interface, regardless of provider.

```python
from langchain_openai import ChatOpenAI
from langchain_anthropic import ChatAnthropic
from pydantic import BaseModel, Field
from typing import List, Literal
# Define the output schema
class DatabaseQuery(BaseModel):
"""자연어를 SQL 쿼리로 변환한 결과"""
sql: str = Field(description="생성된 SQL 쿼리")
tables_used: List[str] = Field(description="사용된 테이블 목록")
query_type: Literal["SELECT", "INSERT", "UPDATE", "DELETE"] = Field(description="쿼리 타입")
explanation: str = Field(description="쿼리 설명")
estimated_complexity: Literal["low", "medium", "high"] = Field(description="쿼리 복잡도")
# Apply structured output to an OpenAI model
openai_llm = ChatOpenAI(model="gpt-4o", temperature=0)
openai_structured = openai_llm.with_structured_output(DatabaseQuery)
# Apply the same to an Anthropic model
anthropic_llm = ChatAnthropic(model="claude-sonnet-4-20250514", temperature=0)
anthropic_structured = anthropic_llm.with_structured_output(DatabaseQuery)
# Invoke through the same interface
question = "지난 30일 동안 가장 매출이 높은 상위 10개 제품과 해당 카테고리를 보여줘"
result_openai = openai_structured.invoke(question)
result_anthropic = anthropic_structured.invoke(question)
print(f"[OpenAI] SQL: {result_openai.sql}")
print(f"[OpenAI] 테이블: {result_openai.tables_used}")
print(f"[OpenAI] 복잡도: {result_openai.estimated_complexity}")
print()
print(f"[Anthropic] SQL: {result_anthropic.sql}")
print(f"[Anthropic] 테이블: {result_anthropic.tables_used}")
print(f"[Anthropic] 복잡도: {result_anthropic.estimated_complexity}")
```

### Combining Function Calling and Structured Output
This is a pattern in which an agent calls tools but returns its final response in a structured format.

```python
from openai import OpenAI
from pydantic import BaseModel, Field
from typing import List, Optional
import json
client = OpenAI()
# Step 1: gather information via Function Calling
tools = [
{
"type": "function",
"function": {
"name": "search_database",
"description": "데이터베이스에서 제품 정보를 검색합니다",
"parameters": {
"type": "object",
"properties": {
"query": {"type": "string", "description": "검색 쿼리"},
"category": {"type": "string", "enum": ["electronics", "clothing", "food"]},
"limit": {"type": "integer", "default": 10}
},
"required": ["query"]
}
}
},
{
"type": "function",
"function": {
"name": "get_price_history",
"description": "제품의 가격 이력을 조회합니다",
"parameters": {
"type": "object",
"properties": {
"product_id": {"type": "string"},
"days": {"type": "integer", "default": 30}
},
"required": ["product_id"]
}
}
}
]
# Step 2: return the final response via Structured Output
class ProductAnalysis(BaseModel):
product_name: str
current_price: float
price_trend: str # "rising", "falling", "stable"
recommendation: str
confidence: float = Field(ge=0.0, le=1.0)
class AnalysisReport(BaseModel):
analyses: List[ProductAnalysis]
market_summary: str
generated_at: str
# Multi-turn flow: Function Calling, then Structured Output
messages = [
{"role": "system", "content": "제품 시장 분석 전문가입니다. 도구를 사용하여 정보를 수집하고 분석 보고서를 작성하세요."},
{"role": "user", "content": "최신 노트북 시장 동향을 분석해주세요."}
]
# Function Calling step
response = client.chat.completions.create(
model="gpt-4o",
messages=messages,
tools=tools,
tool_choice="auto"
)
# After processing the tool call results, request the final Structured Output
# (tool execution and result feedback logic omitted)
# Generate the final analysis report as Structured Output
final_response = client.beta.chat.completions.parse(
model="gpt-4o-2024-08-06",
    messages=messages,  # includes the full conversation history
response_format=AnalysisReport
)
report = final_response.choices[0].message.parsed
print(f"시장 요약: {report.market_summary}")
for analysis in report.analyses:
print(f" {analysis.product_name}: {analysis.price_trend} (신뢰도: {analysis.confidence})")
```

## Production application strategy
### Hierarchical fallback strategy
In a production environment, configure hierarchical fallbacks rather than relying on a single method.

```
Tier 1: Constrained decoding (100% schema guarantee)
  ↓ on failure
Tier 2: JSON mode + Pydantic validation
  ↓ on failure
Tier 3: Free text + LLM-based re-parsing
  ↓ on failure
Tier 4: Return defaults + alert
```

### Schema design principles
To maximize the effect of Constrained Decoding, care is needed in schema design.
1. **Minimize the number of fields**: The more fields there are, the larger the FSM/CFG state space and the slower the decoding speed. Include only key fields.
2. **Limit nesting depth**: Nesting more than 3 levels may affect performance. Flatten it if possible.
3. **Use enum actively**: Enums are much more efficient than free text fields. Predefining acceptable values narrows the token mask and improves quality.
4. **Be careful with Optional fields**: when many fields are optional, the model may select `null` excessively. Prefer required fields.
5. **Use descriptions**: JSON Schema `description` fields give the model hints. Write clear descriptions.
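A hypothetical schema following these principles (the field names are illustrative): flat structure, few fields, an enum instead of free text, every field required, and descriptions as hints.

```python
# A compact JSON Schema applying the design principles above.
# Field names are illustrative, not from any specific project.
review_schema = {
    "type": "object",
    "additionalProperties": False,
    "properties": {
        # principle 3: enum instead of a free-text field
        "sentiment": {
            "type": "string",
            "enum": ["positive", "neutral", "negative"],
            "description": "Overall sentiment of the review",  # principle 5
        },
        "topic": {
            "type": "string",
            "maxLength": 80,
            "description": "Main topic, e.g. 'battery life'",
        },
        "confidence": {"type": "number", "minimum": 0.0, "maximum": 1.0},
    },
    # principles 1, 2, 4: few fields, no nesting, everything required
    "required": ["sentiment", "topic", "confidence"],
}

# sanity check: every declared property is required
assert set(review_schema["required"]) == set(review_schema["properties"])
```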
### Caching Strategy
Constrained decoding engines need preprocessing time to compile a schema into an FSM/CFG. When the same schema is used repeatedly, caching the compiled result reduces this overhead.
- **Outlines**: pre-build and reuse `RegexGuide` indexes
- **XGrammar**: cache compiled grammars with `GrammarCompiler`
- **vLLM**: manages a guided decoding cache at the server level
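A generic caching sketch, independent of any specific engine: key the expensive compilation step on a stable hash of the schema. `compile_grammar` here is a placeholder for building e.g. an Outlines `RegexGuide` or an XGrammar compiled grammar (the real APIs differ).

```python
import hashlib
import json
from functools import lru_cache

def schema_key(schema: dict) -> str:
    """Stable content hash of a JSON Schema (key order normalized)."""
    canon = json.dumps(schema, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canon.encode()).hexdigest()

@lru_cache(maxsize=128)
def compile_grammar(key: str) -> str:
    # Placeholder for the expensive FSM/CFG build step; in a real
    # system this would construct and return the compiled guide.
    return f"compiled:{key[:8]}"

def get_grammar(schema: dict) -> str:
    return compile_grammar(schema_key(schema))

schema = {"type": "object", "properties": {"a": {"type": "string"}}}
first = get_grammar(schema)
second = get_grammar(schema)        # served from the cache
assert first == second
assert compile_grammar.cache_info().hits == 1
```

Keying on a canonicalized hash (rather than the dict object) means two structurally identical schemas share one compiled artifact, even across requests.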
## Performance implications and trade-offs
### Decoding speed impact
Constrained decoding requires additional computation for each generated token. Representative benchmark figures:
| Scenario | No restrictions (tok/s) | Outlines (tok/s) | XGrammar (tok/s) | overhead |
| ------------------------ | ----------------- | ---------------- | ---------------- | -------- |
| Simple JSON (5 fields) | 520 | 505 | 515 | 1~3% |
| Complex JSON (20 fields, nested) | 520 | 470 | 500 | 4~10% |
| Regular expressions (email) | 520 | 510 | 518 | 0.5~2% |
| large enum (100 values) | 520 | 490 | 510 | 2~6% |
In most cases, the overhead is negligible. However, in very complex schemas (deep nesting, large enums, long regular expressions), performance degradation of more than 10% may occur.
### Quality Impact
Constrained Decoding limits the degree of freedom of the model, so it may have a minor effect on semantic quality. Particular caution is required in the following situations:
- **Very limited enum**: If there is no enum value that is closest to the meaning the model wants to express, an incorrect value may be selected.
- **Excessive number of fields**: If there are more than 30 fields, the quality of the later fields tends to deteriorate.
- **short max_tokens**: structured output uses more tokens than free text, so you should leave plenty of room for it.
## Troubleshooting
### Frequently encountered problems and solutions
**1. The first request is slow with guided decoding in vLLM**
Building the FSM/CFG index takes time. Send a warmup request at server startup, or switch to the XGrammar backend to reduce preprocessing time.
**2. Infinite loops in nested arrays**
If `maxItems` is not set, the model can keep generating array elements indefinitely. Always specify `maxItems` in the JSON Schema.
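For example, an illustrative array schema fragment with explicit bounds:

```python
# Without maxItems, the grammar permits unbounded repetition and the
# model may keep emitting elements until max_tokens is exhausted.
items_schema = {
    "type": "array",
    "items": {"type": "string"},
    "minItems": 1,
    "maxItems": 20,  # hard upper bound enforced by the grammar
}
assert 1 <= items_schema["minItems"] <= items_schema["maxItems"]
```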
**3. When Unicode characters are broken**
Some constrained decoding engines handle multibyte Unicode tokens poorly. llguidance sidesteps the problem by operating at the byte level. With Outlines, explicitly include the needed Unicode ranges in the regular expression.
**4. The model fills schema fields with empty strings**
Set `minLength: 1`, or give field-specific instructions in the prompt. Constrained decoding guarantees only the format, not semantic quality.
**5. When refusal is returned in OpenAI Structured Outputs**
If the model refuses to respond due to safety policy, `response.choices[0].message.refusal` is set and `parsed` will be `None`. Always check for this.

```python
response = client.beta.chat.completions.parse(
model="gpt-4o-2024-08-06",
messages=messages,
response_format=MySchema
)
if response.choices[0].message.refusal:
print(f"응답 거부: {response.choices[0].message.refusal}")
elif response.choices[0].message.parsed:
result = response.choices[0].message.parsed
    # normal processing
```

## Precautions during operation
### JSON Schema restrictions
OpenAI Structured Outputs does not support every JSON Schema feature. The main constraints:
- `additionalProperties` must be set to `false`
- All properties must be listed in `required` (optional fields are expressed by adding `null` to the type union)
- `$ref` is supported, but recursion depth is limited
- `patternProperties` and `if/then/else` are not supported
- Nesting depth is limited to 5 levels
### Schema versioning
When changing a schema in production, backward compatibility must be maintained: add new fields as optional first, and give removed fields a deprecation period.
### Cost Considerations
Structured output uses more tokens than free text because of JSON's structural overhead (`{`, `}`, `"`, `:`, `,`). On average this adds 15-30% token overhead. In cost-sensitive environments, minimizing the number of fields and using short key names helps.
### Monitoring indicators
Key metrics to track in production include:
- **Parse success rate**: JSON parsing success rate of structured output (target: 99.9% or higher)
- **Schema verification pass rate**: Rate of passing Pydantic verification
- **Fallback trigger rate**: How often the fallback strategy works (if it is high, the schema or prompt needs to be improved)
- **Token efficiency**: ratio of effective information tokens to total output tokens
- **Latency distribution**: P50, P95, P99 latency tracking
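A minimal sketch of how these counters could be tracked in application code (the names are illustrative, not tied to a specific monitoring library):

```python
from dataclasses import dataclass, field

@dataclass
class StructuredOutputMetrics:
    """Illustrative in-process counters for the metrics listed above."""
    total: int = 0
    parse_ok: int = 0
    schema_ok: int = 0
    fallbacks: int = 0
    latencies_ms: list = field(default_factory=list)

    def record(self, parsed: bool, valid: bool,
               fell_back: bool, ms: float) -> None:
        self.total += 1
        self.parse_ok += parsed      # bool counts as 0/1
        self.schema_ok += valid
        self.fallbacks += fell_back
        self.latencies_ms.append(ms)

    def parse_success_rate(self) -> float:
        return self.parse_ok / self.total if self.total else 0.0

m = StructuredOutputMetrics()
m.record(parsed=True, valid=True, fell_back=False, ms=120.0)
m.record(parsed=False, valid=False, fell_back=True, ms=80.0)
assert m.parse_success_rate() == 0.5
assert m.fallbacks == 1
```

In production these counters would typically be exported to a metrics backend; the latency list is what you would feed into P50/P95/P99 percentile calculations.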
## Failure cases and recovery
### Case 1: Timeout due to schema overcomplexity
When using a schema with more than 30 fields, 4 levels of nesting, and complex regular expression patterns, building the FSM index in the Outlines backend took more than 60 seconds and resulted in a timeout.
**Fix**: split the schema into two simpler schemas and switched to a two-step call: the first call extracts the key fields, the second extracts the details.
### Case 2: Conflict between enum values and model knowledge
We restricted a country-code field to ISO 3166-1 alpha-2, but the model tried to map "South Korea" to `KO` (a non-existent code) instead of `KR`, and the closest valid value, `KP` (North Korea), was sometimes chosen.
**Fix**: included descriptions of the enum values in the prompt, and explicitly gave examples of frequently confused mappings.
### Case 3: Memory explosion due to unbounded array size
When the prompt asked to "list all possible items" without `maxItems` set, the model generated hundreds of array elements, continuing until max_tokens was exhausted.
**Fix**: set `maxItems` on every array field, and stated "at most N" in the prompt.
## Checklist
Before applying structured output to production, check the following:
- Are size limits (`maxItems`, `maxLength`) set in the JSON Schema?
- Are all mandatory fields included in `required`?
- Is the enum value a semantically clear value that the model can understand?
- Is the nesting depth within 3 levels (2 levels if possible)?
- Is a fallback strategy implemented?
- Is there a retry logic in case of parsing failure (up to 3 times)?
- Is `max_tokens` set to at least 1.5x the expected output size?
- Is response rejection processing implemented?
- Is there a backward compatibility policy for schema changes?
- Does the monitoring dashboard include parsing success rate, fallback rate, and latency?
- In the load test, is the overhead of guided decoding within the allowable range?
- Has it been verified that it operates normally in Unicode/multilingual text?
## References
1. [OpenAI Structured Outputs Guide](https://platform.openai.com/docs/guides/structured-outputs)
2. [Anthropic Tool Use Documentation](https://docs.anthropic.com/en/docs/build-with-claude/tool-use)
3. [Outlines - Structured Text Generation](https://github.com/dottxt-ai/outlines)
4. [XGrammar - Flexible and Efficient Grammar-Guided Generation](https://github.com/mlc-ai/xgrammar)
5. [XGrammar: Flexible and Efficient Structured Generation Engine (arXiv 2501.10868)](https://arxiv.org/abs/2501.10868)
6. [vLLM Guided Decoding Documentation](https://docs.vllm.ai/en/latest/features/structured_outputs.html)
7. [llguidance - Rust-based Parser Engine](https://github.com/microsoft/llguidance)
8. [LangChain Structured Output Guide](https://python.langchain.com/docs/concepts/structured_outputs/)