Parsing model output with a regex and a prayer doesn't survive contact with traffic. The validation layers that keep structured LLM output reliable — constrained decoding, schema validation, and the repair loop.
The demo works: you ask the model for JSON, it returns JSON, you JSON.parse it. Then real traffic arrives and 2% of responses come back with a trailing comment, a markdown code fence, a hallucinated extra field, or a number where you expected a string. At a million requests a day, 2% is twenty thousand broken responses. Here's the layered approach that got our structured-output reliability from "mostly" to "we stopped getting paged."
The single biggest win was switching from "please return JSON" in the prompt to schema-constrained generation at the API layer. Most providers now support forcing the model to emit output matching a JSON schema — the decoder is masked so only tokens that keep the output valid against the schema are sampleable.
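import anthropic

client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment

schema = {
    "type": "object",
    "properties": {
        "intent": {"type": "string", "enum": ["refund", "question", "complaint"]},
        "confidence": {"type": "number", "minimum": 0, "maximum": 1},
        "entities": {"type": "array", "items": {"type": "string"}},
    },
    "required": ["intent", "confidence"],
    "additionalProperties": False,
}

# Force the reply through the `classify` tool so it arrives as structured
# arguments rather than free text. `text` is the user message to classify.
response = client.messages.create(
    model="claude-...",
    max_tokens=1024,  # required by the Messages API; the value here is arbitrary
    tools=[{"name": "classify", "input_schema": schema}],
    tool_choice={"type": "tool", "name": "classify"},
    messages=[{"role": "user", "content": text}],
)
result = response.content[0].input  # already schema-valid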
Forcing a tool call with a strict input_schema means the structural classes of failure — fences, prose preamble, wrong types, missing required fields — largely disappear because the model literally cannot emit them. This is the difference between validating output and making invalid output unrepresentable.
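To make the masking idea concrete, here is a toy sketch of a single decoding step. The vocabulary, the logit values, and the `allowed` set are invented for illustration; a real provider would derive the allowed set from the schema at every step, and nothing here reflects any specific provider's implementation.

```python
import math
import random


def masked_sample(logits, allowed, rng):
    """Sample one token, giving zero probability to anything outside `allowed`."""
    # Disallowed tokens get -inf, so exp() turns them into exactly 0 weight.
    masked = {tok: (score if tok in allowed else -math.inf)
              for tok, score in logits.items()}
    peak = max(masked.values())
    weights = {tok: math.exp(score - peak) for tok, score in masked.items()}
    tokens = list(weights)
    return rng.choices(tokens, weights=[weights[t] for t in tokens], k=1)[0]


# Hypothetical state: the output so far is `{"intent": ` and the schema's enum
# means only these three string literals keep the output valid.
logits = {'"refund"': 1.2, '"question"': 0.4, '"complaint"': 0.1,
          "```": 3.0, "Sure": 2.5, '"other"': 1.9}
allowed = {'"refund"', '"question"', '"complaint"'}

rng = random.Random(0)
samples = {masked_sample(logits, allowed, rng) for _ in range(200)}
print(samples <= allowed)  # True: the fence, the preamble, and "other" never appear
```

Note that the unconstrained favorites here are the code fence and the chatty preamble; the mask removes them without any change to the prompt.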
Constrained decoding guarantees the output is shaped right. It does not guarantee it's correct. confidence: 0.99 on a wrong classification is still schema-valid. So the second layer is semantic validation in your own code:
from pydantic import BaseModel, ValidationInfo, field_validator

class Classification(BaseModel):
    intent: str
    confidence: float
    entities: list[str] = []

    @field_validator("entities")
    @classmethod
    def entities_must_appear_in_source(cls, v: list[str], info: ValidationInfo) -> list[str]:
        # reject hallucinated entities not present in the input;
        # expects model_validate(..., context={"source_text": ...})
        source = info.context["source_text"].lower()
        return [e for e in v if e.lower() in source]
We drop entities the model invented that don't appear in the source text. Schema validation can't catch that — it's a business rule, and business rules live in your code, not the model's prompt.
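Stripped of the Pydantic plumbing, the rule is a filter over the entity list. A minimal standalone sketch with a made-up input (the function name and the sample text are illustrative):

```python
def keep_grounded_entities(entities, source_text):
    """Keep only entities that literally occur in the source text (case-insensitive)."""
    source = source_text.lower()
    return [e for e in entities if e.lower() in source]


source_text = "I ordered the Acme blender last week and want my money back."
model_entities = ["Acme blender", "order #4821"]  # the second was never mentioned

print(keep_grounded_entities(model_entities, source_text))
# ['Acme blender']
```

Substring matching is deliberately blunt: a short entity such as "me" would match inside "Acme". Whether that is tight enough depends on what your entities look like.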
When validation still fails — usually a semantic constraint, occasionally a provider that doesn't support strict schemas — feed the error back and retry once or twice, not forever:
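import json

from pydantic import ValidationError

class OutputValidationError(Exception):
    """Raised when output still fails validation after all repair attempts."""

def generate_validated(text, max_attempts=2):
    messages = [{"role": "user", "content": text}]
    for attempt in range(max_attempts):
        raw = call_model(messages)  # your wrapper around the API call; returns a dict
        try:
            return Classification.model_validate(raw, context={"source_text": text})
        except ValidationError as e:
            # Show the model its own output (as JSON, not a Python repr) and
            # the specific error, then retry.
            messages.append({"role": "assistant", "content": json.dumps(raw)})
            messages.append({"role": "user",
                             "content": f"That failed validation: {e}. Fix and resend."})
    raise OutputValidationError("exhausted repair attempts")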
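To watch the loop do its job without calling an API, the sketch below swaps in a scripted stand-in for the model and a hand-written validator. `make_fake_model`, `validate`, and `repair_loop` are hypothetical stand-ins for `call_model`, `Classification.model_validate`, and `generate_validated`; only the control flow is the point.

```python
import json


def validate(payload):
    """Stand-in validator: raise ValueError with a specific, actionable message."""
    confidence = payload.get("confidence")
    if not isinstance(confidence, (int, float)) or not 0 <= confidence <= 1:
        raise ValueError(f"confidence was {confidence}, must be between 0 and 1")
    return payload


def make_fake_model(scripted_replies):
    """Return a callable that replays canned replies and records what it was sent."""
    replies = iter(scripted_replies)
    seen = []

    def fake_model(messages):
        seen.append(list(messages))  # snapshot the conversation at call time
        return next(replies)

    return fake_model, seen


def repair_loop(text, call_model, max_attempts=2):
    messages = [{"role": "user", "content": text}]
    for _ in range(max_attempts):
        raw = call_model(messages)
        try:
            return validate(raw)
        except ValueError as e:
            messages.append({"role": "assistant", "content": json.dumps(raw)})
            messages.append({"role": "user",
                             "content": f"That failed validation: {e}. Fix and resend."})
    raise RuntimeError("exhausted repair attempts")


fake_model, seen = make_fake_model([
    {"intent": "refund", "confidence": 1.5},  # first reply is out of range
    {"intent": "refund", "confidence": 0.9},  # second reply is fixed
])
result = repair_loop("I want my money back", fake_model)
print(result)
print(seen[1][-1]["content"])  # the feedback the second call received
```

The second call sees the failed output and the exact constraint it broke, which is what gives the retry a chance of succeeding.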
Two rules learned the hard way:
confidence was 1.5, must be ≤ 1" gets you a fix.
When repair is exhausted, you need a defined behavior that isn't a 500. Depending on the call site:
degraded: true flag
The worst outcome is an exception bubbling to the user because nobody decided what "the model couldn't comply" should do.
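One way to make that decision explicit is to catch the exhausted-repair error at the call site and return a default the caller can recognize. A minimal sketch, assuming a `degraded` flag on the result and a `generate` callable that raises on exhaustion; the default values and the names `classify_or_degrade` and `always_fails` are illustrative choices.

```python
class OutputValidationError(Exception):
    """Raised when the model output still fails validation after repair."""


def classify_or_degrade(text, generate):
    """Return a classification, or a flagged default instead of raising."""
    try:
        result = dict(generate(text))
        result["degraded"] = False
        return result
    except OutputValidationError:
        # Deliberate fallback: no guess at the intent, zero confidence, and a
        # flag so downstream code can tell this apart from a real answer.
        return {"intent": None, "confidence": 0.0, "entities": [], "degraded": True}


def always_fails(text):
    """Stand-in for a generator whose repair attempts ran out."""
    raise OutputValidationError("exhausted repair attempts")


print(classify_or_degrade("???", always_fails))
```

Only the validation failure is caught here; other exceptions still propagate, so an outage is not silently reported as a degraded answer.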
Treat model output exactly like user input: untrusted until validated. Never let it flow directly into a SQL query, a shell command, a file path, or an API call without passing your schema and semantic checks first. Constrained decoding makes the happy path clean; the validation layer is what keeps a bad day from becoming an incident.
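As one concrete instance, here is a sketch of model output reaching a SQL query safely: the value is checked against an allow-list first and then passed as a bound parameter, never interpolated into the statement. The `tickets` table and the `count_tickets` helper are hypothetical, and SQLite stands in for whatever database you use.

```python
import sqlite3

ALLOWED_INTENTS = {"refund", "question", "complaint"}


def count_tickets(conn, model_intent):
    """Count tickets for an intent that came from model output."""
    # Validate first: anything outside the allow-list never reaches the database.
    if model_intent not in ALLOWED_INTENTS:
        raise ValueError(f"unexpected intent: {model_intent!r}")
    # Bind as a parameter; do not build the SQL string from model output.
    row = conn.execute(
        "SELECT COUNT(*) FROM tickets WHERE intent = ?", (model_intent,)
    ).fetchone()
    return row[0]


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE tickets (id INTEGER PRIMARY KEY, intent TEXT)")
conn.executemany("INSERT INTO tickets (intent) VALUES (?)",
                 [("refund",), ("refund",), ("question",)])

print(count_tickets(conn, "refund"))  # 2
```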
Reliable structured output isn't one trick. It's constraining what the model can emit, validating what it means, repairing what's fixable, and having a plan for what isn't.