Parsing model output with a regex and a prayer doesn't survive contact with traffic. These are the validation layers that keep structured LLM output reliable: constrained decoding, schema validation, and the repair loop.

LLM Output Validation — Schema-Constrained Generation in Production

The demo works: you ask the model for JSON, it returns JSON, you JSON.parse it. Then real traffic arrives and 2% of responses come back with a trailing comment, a markdown code fence, a hallucinated extra field, or a number where you expected a string. At a million requests a day, 2% is twenty thousand broken responses. Here's the layered approach that got our structured-output reliability from "mostly" to "we stopped getting paged."
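
To make those failure modes concrete, here is a minimal sketch using only the standard library. The sample strings are hypothetical stand-ins for real responses, and `parse_status` is an illustrative helper, not part of any library. It shows that fences and trailing comments fail loudly, while a wrong type parses without complaint.

```python
import json

FENCE = "`" * 3  # built indirectly so this example does not nest code fences

# Hypothetical raw responses, one per failure mode described above.
samples = {
    "clean": '{"intent": "refund", "confidence": 0.9}',
    "fenced": FENCE + 'json\n{"intent": "refund", "confidence": 0.9}\n' + FENCE,
    "trailing_comment": '{"intent": "refund", "confidence": 0.9} // high confidence',
    "wrong_type": '{"intent": "refund", "confidence": "0.9"}',
}


def parse_status(raw: str) -> str:
    """Classify what a naive json.loads-based parser would do with `raw`."""
    try:
        data = json.loads(raw)
    except json.JSONDecodeError:
        return "unparseable"
    # json.loads succeeds on a string where a number was expected.
    if not isinstance(data.get("confidence"), (int, float)):
        return "parsed, wrong type"
    return "ok"


statuses = {name: parse_status(raw) for name, raw in samples.items()}
```

The wrong-type case is the dangerous one: nothing raises, so the bad value travels downstream unless something checks it.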

Stop asking nicely; constrain the output

The single biggest win was switching from "please return JSON" in the prompt to schema-constrained generation at the API layer. Most providers now support forcing the model to emit output matching a JSON schema — the decoder is masked so only tokens that keep the output valid against the schema are sampleable.

python.python

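import anthropic

client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment

schema = {
    "type": "object",
    "properties": {
        "intent": {"type": "string", "enum": ["refund", "question", "complaint"]},
        "confidence": {"type": "number", "minimum": 0, "maximum": 1},
        "entities": {"type": "array", "items": {"type": "string"}},
    },
    "required": ["intent", "confidence"],
    "additionalProperties": False,
}

response = client.messages.create(
    model="claude-...",
    max_tokens=1024,  # required by the Messages API
    tools=[{"name": "classify", "input_schema": schema}],
    tool_choice={"type": "tool", "name": "classify"},
    messages=[{"role": "user", "content": text}],  # text: the message to classify
)
# Forced tool use returns a tool_use block; select it by type, not by position.
tool_use = next(block for block in response.content if block.type == "tool_use")
result = tool_use.input  # a dict shaped by the schema; still validate before use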

Forcing a tool call with a strict input_schema means the structural classes of failure — fences, prose preamble, wrong types, missing required fields — largely disappear because the model literally cannot emit them. This is the difference between validating output and making invalid output unrepresentable.
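
The masking idea can be sketched with a toy decoder. This is an illustration under heavy assumptions, not how any provider implements it: real systems mask over model tokens using a grammar or state machine compiled from the schema, whereas this sketch works character by character over the three enum values, with random numbers standing in for model logits. All function names are hypothetical.

```python
import math
import random

ALLOWED = ["refund", "question", "complaint"]  # the enum from the schema above


def allowed_next_chars(prefix: str) -> set[str]:
    """Characters that keep `prefix` a prefix of at least one allowed value."""
    return {
        value[len(prefix)]
        for value in ALLOWED
        if value.startswith(prefix) and len(value) > len(prefix)
    }


def constrained_sample(logits: dict[str, float], prefix: str, rng: random.Random) -> str:
    """Sample one character after masking everything the constraint forbids."""
    valid = allowed_next_chars(prefix)
    # Forbidden characters get -inf, which becomes probability zero after exp().
    masked = {ch: (score if ch in valid else -math.inf) for ch, score in logits.items()}
    weights = [math.exp(score) for score in masked.values()]
    return rng.choices(list(masked), weights=weights, k=1)[0]


def generate(rng: random.Random) -> str:
    """Decode until the output is exactly one of the allowed values."""
    out = ""
    while out not in ALLOWED:
        # Toy "model": random logits over a-z, so unconstrained output would be noise.
        logits = {chr(c): rng.uniform(-1, 1) for c in range(ord("a"), ord("z") + 1)}
        out += constrained_sample(logits, out, rng)
    return out
```

Even though the toy model has no idea what it is doing, every run ends on a valid enum value, because invalid continuations are never sampleable.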

Validate anyway — constraints aren't semantics

Constrained decoding guarantees the output is shaped right. It does not guarantee it's correct. confidence: 0.99 on a wrong classification is still schema-valid. So the second layer is semantic validation in your own code:

python.python

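from typing import Literal

from pydantic import BaseModel, Field, ValidationInfo, field_validator


class Classification(BaseModel):
    intent: Literal["refund", "question", "complaint"]  # mirrors the schema enum
    confidence: float = Field(ge=0, le=1)  # mirrors the schema minimum/maximum
    entities: list[str] = []

    @field_validator("entities")
    @classmethod
    def entities_must_appear_in_source(cls, v: list[str], info: ValidationInfo) -> list[str]:
        # Drop hallucinated entities that do not occur in the input text.
        # Requires model_validate(..., context={"source_text": ...}).
        source = info.context["source_text"].lower()
        return [e for e in v if e.lower() in source]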

We drop entities the model invented that don't appear in the source text. Schema validation can't catch that — it's a business rule, and business rules live in your code, not the model's prompt.
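
One caveat with a plain substring check is that it accepts "art" when the source only contains "party". A slightly stricter variant, sketched here with the standard library, matches whole words or phrases instead. The helper name `grounded_entities` and the sample text are hypothetical, and whether word-boundary matching suits your data is an assumption to test against real inputs.

```python
import re


def grounded_entities(entities: list[str], source_text: str) -> list[str]:
    """Keep only entities that occur in the source as whole words or phrases."""
    kept = []
    for entity in entities:
        if not entity.strip():
            continue  # an empty entity can never be grounded
        # Lookarounds require a non-word character (or the string edge) on both sides.
        pattern = r"(?<!\w)" + re.escape(entity) + r"(?!\w)"
        if re.search(pattern, source_text, flags=re.IGNORECASE):
            kept.append(entity)
    return kept


source = "I want a refund for order 1234 from the party shop"
kept = grounded_entities(["order 1234", "art", "Party Shop", "invoice", ""], source)
```

Here "art" and "invoice" are dropped while "order 1234" and "Party Shop" survive; the same function could back the validator above in place of the `in` test.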

The repair loop, bounded

When validation still fails — usually a semantic constraint, occasionally a provider that doesn't support strict schemas — feed the error back and retry once or twice, not forever:

python.python

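import json

from pydantic import ValidationError


class OutputValidationError(Exception):
    """Raised when the output still fails validation after all attempts."""


def generate_validated(text, max_attempts=2):
    messages = [{"role": "user", "content": text}]
    for attempt in range(max_attempts):
        raw = call_model(messages)  # your provider wrapper; assumed to return a dict
        try:
            return Classification.model_validate(raw, context={"source_text": text})
        except ValidationError as e:
            # Echo the bad output as JSON (not a Python repr) plus the specific error.
            messages.append({"role": "assistant", "content": json.dumps(raw)})
            messages.append({"role": "user",
                             "content": f"That failed validation: {e}. Fix and resend."})
    raise OutputValidationError("exhausted repair attempts")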

Two rules learned the hard way:

Bound the retries. An unbounded repair loop on a request the model fundamentally can't satisfy is a cost and latency bomb. Cap it, then fall back.
Feed the specific error. "That was invalid" gets you another invalid response. "Field confidence was 1.5, must be ≤ 1" gets you a fix.
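
One way to keep the feedback specific and compact is to format the individual errors rather than pasting the whole exception, whose string form in pydantic v2 also carries documentation links. The sketch below assumes error dicts shaped like the ones `ValidationError.errors()` returns (`loc`, `msg`, `input` keys); the function name is hypothetical and the example input is handwritten, not captured from a real run.

```python
def format_validation_feedback(errors: list[dict]) -> str:
    """Turn a list of validation error dicts into one short line per problem."""
    lines = []
    for err in errors:
        # loc is a tuple path such as ("entities", 0); join it into "entities.0".
        field = ".".join(str(part) for part in err.get("loc", ())) or "<root>"
        line = f"Field {field}: {err.get('msg', 'invalid')}"
        if "input" in err:
            line += f" (got {err['input']!r})"
        lines.append(line)
    return "\n".join(lines)


feedback = format_validation_feedback([
    {"loc": ("confidence",), "msg": "Input should be less than or equal to 1", "input": 1.5},
    {"loc": ("entities", 0), "msg": "Input should be a valid string", "input": 42},
])
```

In the repair loop, `format_validation_feedback(e.errors())` could replace the raw `{e}` in the retry message.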

Define the fallback before you need it

When repair is exhausted, you need a defined behavior that isn't a 500. Depending on the call site:

Route to a human queue (classification, moderation)
Return a safe default with a degraded: true flag
Use the last partially-valid response with missing fields nulled

The worst outcome is an exception bubbling to the user because nobody decided what "the model couldn't comply" should do.
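
A minimal sketch of that decision made explicit, combining the human-queue and degraded-default options. Everything here is illustrative: `ClassificationResult`, `classify_with_fallback`, and the `send_to_review` callback are hypothetical names, and returning nulled fields with `degraded=True` is one possible default, not a recommendation for every call site.

```python
from dataclasses import dataclass, field
from typing import Callable, Optional


class OutputValidationError(Exception):
    """Stand-in for the error raised when repair attempts are exhausted."""


@dataclass
class ClassificationResult:
    intent: Optional[str]
    confidence: Optional[float]
    entities: list[str] = field(default_factory=list)
    degraded: bool = False


def classify_with_fallback(
    text: str,
    generate: Callable[[str], ClassificationResult],
    send_to_review: Callable[[str], None],
) -> ClassificationResult:
    """Return a validated result, or a flagged default instead of an exception."""
    try:
        return generate(text)
    except OutputValidationError:
        send_to_review(text)  # e.g. enqueue for a human to classify
        return ClassificationResult(intent=None, confidence=None, degraded=True)
```

Callers then branch on `degraded` rather than catching exceptions, which keeps the "model couldn't comply" path a normal, testable code path.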

Where to draw the trust boundary

Treat model output exactly like user input: untrusted until validated. Never let it flow directly into a SQL query, a shell command, a file path, or an API call without passing your schema and semantic checks first. Constrained decoding makes the happy path clean; the validation layer is what keeps a bad day from becoming an incident.
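
For the SQL case, the boundary can look like the sketch below: an allow-list check followed by a parameterized query, so model-supplied values never become part of the SQL text. It uses the standard library's `sqlite3` with an in-memory database; the table, column names, and `record_classification` helper are assumptions made for illustration.

```python
import sqlite3

ALLOWED_INTENTS = {"refund", "question", "complaint"}


def record_classification(conn: sqlite3.Connection, ticket_id: int,
                          intent: str, confidence: float) -> None:
    """Store a classification only after it passes the same checks as user input."""
    if intent not in ALLOWED_INTENTS:
        raise ValueError(f"unexpected intent: {intent!r}")
    if not 0.0 <= confidence <= 1.0:
        raise ValueError(f"confidence out of range: {confidence!r}")
    # Placeholders keep the values out of the SQL string itself.
    conn.execute(
        "INSERT INTO classifications (ticket_id, intent, confidence) VALUES (?, ?, ?)",
        (ticket_id, intent, confidence),
    )


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE classifications (ticket_id INTEGER, intent TEXT, confidence REAL)")
record_classification(conn, 1, "refund", 0.92)
```

The same pattern applies to shell commands and file paths: check against an allow-list first, then pass values as arguments rather than interpolating them into a command or path string.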

What moved the needle

Strict schema-forced tool calls: structural failures ~2% → ~0.05%
Semantic validators: caught hallucinated entities the schema couldn't
Bounded repair loop: recovered ~70% of the remaining failures within one retry
Defined fallbacks: zero unhandled exceptions reaching users
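
As a rough worked example, these rates can be applied to the million-requests-a-day volume from the introduction. Two assumptions are baked in: that volume is illustrative rather than a measured figure, and the ~70% recovery rate is applied to the structural remainder only, which simplifies "remaining failures".

```python
requests_per_day = 1_000_000

# Rates expressed in basis points (1 bp = 0.01%) to keep the arithmetic in integers.
structural_before_bp = 200  # ~2%
structural_after_bp = 5     # ~0.05%

failures_before = requests_per_day * structural_before_bp // 10_000
failures_after_constraints = requests_per_day * structural_after_bp // 10_000

# ~70% recovered within one retry, so ~30% go on to the fallback path.
reaching_fallback = failures_after_constraints * 30 // 100
```

Under those assumptions the daily count falls from 20,000 broken responses to 500 after constrained decoding, with about 150 left for the fallback path after one retry.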

Reliable structured output isn't one trick. It's constraining what the model can emit, validating what it means, repairing what's fixable, and having a plan for what isn't.

About Kiril Urbonas
