June 17, 2026 · 9 min read

'Respond Only With Valid JSON' Fails 1 in 10 Times

You have 1.3 million users. Your LLM call returns JSON, usually. But 1% of the time it returns "Here's the JSON you asked for: {"result": ...}", or a markdown code fence, or an object with a trailing comma, or a truncated payload that cuts off mid-string. Your parser throws. At 1.3M users making one call a day each, that's 13,000 failures a day. Some silently corrupt downstream state. Some surface as 500s. All of them are avoidable.

Wrong conclusion #1: add the magic sentence

The first thing teams try is "respond only with valid JSON. Do not include any other text." This helps, a little. The problem is it's a hint, not a constraint. You're negotiating with a stochastic process.

Here are the actual failure modes that survive the magic sentence:

Markdown fences. The model has been trained on billions of tokens of GitHub content where JSON is routinely wrapped in a ```json fence. That training signal doesn't disappear because you asked nicely.

Preamble text. Models trained with RLHF tend toward helpfulness. "Sure! Here's the result:" is helpful in conversational context. Your system prompt fights that tendency; sometimes it loses.

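Trailing commas. The model generates a comma after the last key-value pair because, in most of the JSON it was trained on, a key-value pair is followed by a comma (it is rarely the last one in a larger object), and JavaScript and Python literals tolerate trailing commas. JSON.parse rejects it.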

Truncated objects. Token limits hit mid-output. You get {"user": "alice", "score: and the parser crashes. No schema awareness, so the model doesn't know to stop early.

Wrong field names. You asked for "status" but got "state". Both valid JSON. Your .get("status") returns None.
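To make these failure modes concrete, here is a minimal sketch that feeds one sample of each to a strict parser. It uses Python's standard json module; the sample strings and the classify helper are made up for illustration and are not captured model output.

```python
import json

FENCE = "`" * 3  # a markdown code fence, built this way to keep the example readable

# One illustrative sample per failure mode (hypothetical strings)
samples = {
    "markdown fence": FENCE + 'json\n{"status": "active"}\n' + FENCE,
    "preamble": 'Sure! Here\'s the result: {"status": "active"}',
    "trailing comma": '{"status": "active",}',
    "truncated": '{"user": "alice", "score',
    "wrong field name": '{"state": "active"}',
}

def classify(raw):
    """Return how a naive consumer would experience this output."""
    try:
        data = json.loads(raw)
    except json.JSONDecodeError:
        return "parse error"
    if "status" not in data:
        # Valid JSON, but the key the caller depends on is absent
        return "parsed, missing status"
    return "ok"

results = {name: classify(raw) for name, raw in samples.items()}
for name, outcome in results.items():
    print(f"{name}: {outcome}")
```

The first four samples never get past the parser. The last one is the quiet one: it parses fine and only misbehaves later, when something reads the missing key.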

The failure rate for naive JSON prompting is 5–15% in production, depending on schema complexity. On complex nested schemas, it climbs higher.

Bar chart comparing schema violation rates across prompting approaches

A bit better: JSON mode

JSON mode (available on OpenAI since 2023) tells the model to produce valid JSON syntax — it won't give you trailing commas or markdown fences. This cuts failure rate to roughly 3–5%.

What it doesn't do: enforce your schema. JSON mode has no concept of field names, types, or required properties. You can get valid JSON that's {} — perfectly parseable, completely useless. You can get {"result": null} when you needed {"result": "active"}. JSON mode is a syntax guarantee. It's not a contract.
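The gap between syntax and contract is easy to show with a small check. This is a sketch only: REQUIRED and contract_violations are hypothetical names, and the contract (a string field called "result") is assumed from the example above.

```python
import json

# Hypothetical contract: "result" must be present and must be a string
REQUIRED = {"result": str}

def contract_violations(raw):
    """Parse raw JSON, then list every way it breaks the contract."""
    data = json.loads(raw)  # succeeds for any syntactically valid JSON
    problems = []
    for field, expected_type in REQUIRED.items():
        if field not in data:
            problems.append(f"missing field: {field}")
        elif not isinstance(data[field], expected_type):
            problems.append(f"wrong type for {field}: {type(data[field]).__name__}")
    return problems

print(contract_violations("{}"))
print(contract_violations('{"result": null}'))
print(contract_violations('{"result": "active"}'))
```

All three inputs are valid JSON, so JSON mode would accept each of them. Only the last one satisfies the contract.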

The real fix: constrained decoding

The fundamental problem with prompting-based approaches is that the constraint is applied as text, not as mechanics. The model reads your instruction, weights it against its training distribution, and usually complies — but the probability of non-compliance is never zero.

Constrained decoding moves the enforcement to inference time. Here's exactly what happens:

  1. Your JSON Schema is compiled into a state machine — a pushdown automaton or an Earley parser table.
  2. At each decode step, the model produces its usual logit distribution over the full vocabulary (50K–100K tokens).
  3. Before sampling, the constraint engine computes which tokens are valid given the current parser state. Everything else gets its logit set to −∞.
  4. The model samples from the masked distribution. Invalid tokens have probability zero: not low probability, zero.

Diagram showing logit masking: valid tokens survive, invalid tokens set to minus infinity

The model cannot produce a markdown fence because the token "```" is not valid at the start of a JSON value. It cannot produce preamble text because "H" is not valid at position 0 of a JSON object. It cannot stop early of its own accord because the end-of-sequence token stays masked until the JSON is closed; only a hard token limit can still cut the output off.
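The masking step can be sketched in a few lines. This is a toy, not how XGrammar or llguidance are implemented: the four-token vocabulary, the logit values, and the allowed set are all invented, and a real engine would derive the allowed set from the parser state at every step.

```python
import math
import random

FENCE = "`" * 3  # the markdown fence token from the example above

def mask_logits(logits, allowed):
    """Set the logit of every token the grammar forbids to minus infinity."""
    return {tok: (score if tok in allowed else -math.inf) for tok, score in logits.items()}

def softmax(logits):
    """Turn logits into probabilities; assumes at least one token is allowed."""
    peak = max(logits.values())
    weights = {tok: math.exp(score - peak) for tok, score in logits.items()}
    total = sum(weights.values())
    return {tok: weight / total for tok, weight in weights.items()}

# Toy logits at position 0: the model "wants" to write a preamble or a fence
logits = {"Here": 3.1, FENCE: 2.7, "{": 2.2, "[": 0.4}

# For a schema whose root is an object, only "{" is a legal first token
allowed = {"{"}

probs = softmax(mask_logits(logits, allowed))
next_token = random.choices(list(probs), weights=list(probs.values()))[0]
print(probs, next_token)
```

Even though "Here" had the highest raw logit, exp(−∞) is exactly 0, so it can never be sampled.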

Syntactic and schema violation rates with constrained decoding: under 0.1%. The 0.1% is noise — refusals, context length limits, model API errors.

The underlying engines powering this across the industry: XGrammar (default backend for vLLM, SGLang, TensorRT-LLM as of 2026 — under 40 microseconds overhead per token) and llguidance from Microsoft (Rust-based Earley parser, ~50µs overhead).

Implementations across providers

Every major provider ships constrained decoding now. Here's how to actually use it.

Provider support matrix showing OpenAI, Anthropic, Google Gemini, and vLLM structured output capabilities

OpenAI — strict mode

from openai import OpenAI
import json

client = OpenAI()

response = client.chat.completions.create(
    model="gpt-4o",
    messages=[{"role": "user", "content": "Extract the order details from: Order #4821, status active, total $142.50"}],
    response_format={
        "type": "json_schema",
        "json_schema": {
            "name": "order",
            "strict": True,
            "schema": {
                "type": "object",
                "properties": {
                    "order_id": {"type": "string"},
                    "status": {"type": "string", "enum": ["active", "pending", "done", "cancelled"]},
                    "total_usd": {"type": "number"}
                },
                "required": ["order_id", "status", "total_usd"],
                "additionalProperties": False
            }
        }
    }
)

order = json.loads(response.choices[0].message.content)
# order["status"] is guaranteed to be one of: active, pending, done, cancelled
# order["total_usd"] is guaranteed to be a number
# Parsing only fails if the model refused or the output was cut off at the token limit

"strict": True is the key. Without it, you're back to JSON mode. With it, OpenAI runs constrained decoding and your schema is enforced at the token level.
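Strict mode still leaves two cases worth handling before you call json.loads: a refusal and a response cut off at the token limit. Below is a minimal sketch, assuming the Chat Completions response shape (a refusal field on the message and a finish_reason of "length" on truncation). parse_choice is a hypothetical helper, and the SimpleNamespace objects are offline stand-ins for response.choices[0].

```python
import json
from types import SimpleNamespace

def parse_choice(choice):
    """Parse a structured-output choice, failing loudly on the known gaps."""
    message = choice.message
    if getattr(message, "refusal", None):
        raise ValueError(f"model refused: {message.refusal}")
    if choice.finish_reason == "length":
        raise ValueError("output was cut off at the token limit")
    return json.loads(message.content)

# Offline stand-ins shaped like response.choices[0] (not real API output)
ok = SimpleNamespace(
    finish_reason="stop",
    message=SimpleNamespace(
        refusal=None,
        content='{"order_id": "4821", "status": "active", "total_usd": 142.5}',
    ),
)
cut = SimpleNamespace(
    finish_reason="length",
    message=SimpleNamespace(refusal=None, content='{"order_id": "48'),
)

order = parse_choice(ok)
try:
    parse_choice(cut)
except ValueError as exc:
    cut_error = str(exc)
print(order, cut_error)
```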

Anthropic — tool-use pattern for structured output

Anthropic's structured output approach uses tool use with tool_choice set to force the model to call a specific tool. Define your schema as the tool's input_schema, set tool_choice={"type": "tool", "name": "extract_order"}, and the response is always a validated tool call; the model cannot return freeform text:

import anthropic
import json

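client = anthropic.Anthropic()

# Example input (hypothetical); in production this comes from your application
user_query = "Order #4821 has shipped. The total was $142.50 and it is still inside the return window."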

response = client.messages.create(
    model="claude-sonnet-4-6",
    max_tokens=1024,
    tools=[{
        "name": "extract_order",
        "description": "Extract structured order information",
        "input_schema": {
            "type": "object",
            "properties": {
                "order_id": {"type": "string"},
                "status": {"type": "string", "enum": ["pending", "shipped", "delivered", "cancelled"]},
                "amount": {"type": "number"},
                "refund_eligible": {"type": "boolean"}
            },
            "required": ["order_id", "status", "amount", "refund_eligible"]
        }
    }],
    tool_choice={"type": "tool", "name": "extract_order"},
    messages=[{"role": "user", "content": user_query}]
)

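# The SDK already parses tool arguments: .input is a dict, so json.loads would raise a TypeError
tool_call = next(block for block in response.content if block.type == "tool_use")
result = tool_call.input
print(json.dumps(result, indent=2))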

This pattern predates Anthropic's native structured output beta and achieves near-identical reliability — the schema is enforced through the tool input_schema, and Claude has been trained to fill tool arguments precisely. tool_choice with "type": "tool" forces the model to always call the named tool, so you always get a structured result, never freeform text.

Google Gemini — response_schema

import os
import google.generativeai as genai
import json

genai.configure(api_key=os.environ["GOOGLE_API_KEY"])
model = genai.GenerativeModel("gemini-2.0-flash")

response = model.generate_content(
    "Extract order details: Order #4821, status active, total $142.50",
    generation_config=genai.GenerationConfig(
        response_mime_type="application/json",
        response_schema={
            "type": "object",
            "properties": {
                "order_id": {"type": "string"},
                "status": {"type": "string", "enum": ["active", "pending", "done", "cancelled"]},
                "total_usd": {"type": "number"}
            },
            "required": ["order_id", "status", "total_usd"]
        }
    )
)

order = json.loads(response.text)

Gemini 2.5 added anyOf, $ref for recursive schemas, minimum/maximum for numeric constraints, and prefixItems for tuple validation. The API will reject very large or deeply nested schemas — keep nesting under 4 levels as a rule.

Instructor — one abstraction, every provider

If you're working across providers or want Pydantic models instead of raw JSON Schema dicts, Instructor is the right abstraction. Over 3 million monthly downloads. It wraps OpenAI, Anthropic, Gemini, Ollama, and 15+ other providers.

import instructor
from anthropic import Anthropic
from pydantic import BaseModel
from typing import Literal

class Order(BaseModel):
    order_id: str
    status: Literal["active", "pending", "done", "cancelled"]
    total_usd: float

client = instructor.from_anthropic(Anthropic())

order = client.messages.create(
    model="claude-sonnet-4-5",
    max_tokens=1024,
    messages=[{"role": "user", "content": "Extract order details: Order #4821, status active, total $142.50"}],
    response_model=Order,
)

# order is a validated Pydantic instance — not a dict, a typed object
assert order.status in ["active", "pending", "done", "cancelled"]

Instructor generates the JSON Schema from your Pydantic model, calls the appropriate provider API with constrained decoding, and deserialises the response back into your model. If validation fails (which is rare with constrained decoding but can happen on semantic grounds), Instructor retries with the validation error fed back to the model.
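The validate-and-retry loop is worth seeing in miniature. The sketch below is not Instructor's implementation: it uses a hand-written validate function instead of Pydantic, and call_model is a hypothetical callable standing in for a provider call so the example runs offline.

```python
import json

ALLOWED_STATUS = {"active", "pending", "done", "cancelled"}

def validate(data):
    """Return a list of validation errors (empty means valid)."""
    errors = []
    if data.get("status") not in ALLOWED_STATUS:
        errors.append(f"status must be one of {sorted(ALLOWED_STATUS)}")
    if not isinstance(data.get("total_usd"), (int, float)):
        errors.append("total_usd must be a number")
    return errors

def create_with_retries(call_model, prompt, max_retries=2):
    """Call the model, validate, and feed errors back until the output passes."""
    messages = [{"role": "user", "content": prompt}]
    for _ in range(max_retries + 1):
        raw = call_model(messages)
        data = json.loads(raw)
        errors = validate(data)
        if not errors:
            return data
        # Show the model its own answer plus what was wrong with it
        messages.append({"role": "assistant", "content": raw})
        messages.append({"role": "user", "content": "Validation failed: " + "; ".join(errors)})
    raise ValueError("validation still failing after retries: " + "; ".join(errors))

# Fake model: wrong enum value first, corrected answer second
replies = iter([
    '{"status": "open", "total_usd": 142.5}',
    '{"status": "active", "total_usd": 142.5}',
])
history_lengths = []

def fake_model(messages):
    history_lengths.append(len(messages))
    return next(replies)

order = create_with_retries(fake_model, "Extract order details: Order #4821")
print(order, history_lengths)
```

The second call sees three messages: the original prompt, the rejected answer, and the validation error.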

Self-hosted models — Outlines or vLLM

For open-source models (Llama, Mistral, Qwen), Outlines provides grammar-constrained generation using an FSM approach. It compiles your JSON Schema or regex into a finite state machine that guides token sampling:

from outlines import models, generate
from pydantic import BaseModel
from typing import Literal

class Order(BaseModel):
    order_id: str
    status: Literal["active", "pending", "done", "cancelled"]
    total_usd: float

model = models.transformers("meta-llama/Meta-Llama-3-8B-Instruct")
generator = generate.json(model, Order)

order = generator("Extract order details: Order #4821, status active, total $142.50")
# Returns a validated Order instance

If you're running vLLM in production, XGrammar is already the default backend — pass guided_json in your completion request and it handles the constraint engine.
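As a rough sketch of what that request body can look like, assuming vLLM's OpenAI-compatible chat endpoint and a vLLM version that accepts guided_json as an extra field in the body (the parameter name may differ across releases, so check the docs for yours). The model name is a placeholder for whatever your server was started with, and the code only builds the payload; it does not send it.

```python
import json

order_schema = {
    "type": "object",
    "properties": {
        "order_id": {"type": "string"},
        "status": {"type": "string", "enum": ["active", "pending", "done", "cancelled"]},
        "total_usd": {"type": "number"},
    },
    "required": ["order_id", "status", "total_usd"],
}

payload = {
    # Placeholder: use the model your vLLM server is serving
    "model": "meta-llama/Meta-Llama-3-8B-Instruct",
    "messages": [
        {"role": "user", "content": "Extract order details: Order #4821, status active, total $142.50"}
    ],
    # Assumption: this vLLM version reads the schema from a top-level guided_json field
    "guided_json": order_schema,
}

body = json.dumps(payload)  # what you would POST to /v1/chat/completions
print(body)
```

With the openai Python client pointed at a vLLM server, the same field is typically passed through extra_body rather than as a named argument.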

Wrong conclusion #2: constrain everything

Once you've seen constrained decoding work, the temptation is to wrap your entire API response in a single structured output call. This is where it starts fighting you.

Constrained decoding adds overhead at each token step: the constraint engine must compute the valid next-token set against the parser state. For small, typed schemas this is negligible — XGrammar achieves under 40 microseconds per token. For very large schemas with deep nesting, or for long prose fields inside a structured response, the overhead compounds.

More importantly: constrained decoding actively degrades quality on open-ended fields. In practice, forcing a structured format on fields meant for prose — summaries, explanations, recommendations — tends to reduce response quality and make the model fight the schema. The model is spending capacity navigating the schema constraint rather than generating the best content.

The pattern that breaks is this: you have a schema with {"summary": {"type": "string"}, "status": {"type": "string", "enum": [...]}}. You constrain the whole thing. The status field is perfect. The summary field is worse than if you'd asked for it without constraints, because the model is partially distracted by maintaining JSON structure through a potentially long string value.

The right split

The correct approach is to constrain what benefits from constraints, and leave prose free:

| Field type | Approach | Reason |
|---|---|---|
| IDs, booleans, numbers | constrained | Type correctness is load-bearing; any drift breaks downstream code |
| Enum values (status, intent, action) | constrained | Finite valid set; model can still pick the right one under constraint |
| Timestamps, ISO dates | constrained | Format correctness matters; regex or format constraint handles it |
| Short labels (< 20 chars) | constrained | Low generation length; overhead is minimal |
| Summaries, explanations, reasoning | unconstrained | Constraining prose degrades output quality; generate separately |
| Creative text, step-by-step plans | unconstrained | Format constraints actively fight the model's reasoning pathway |

In practice, this means two calls for responses that mix typed metadata with prose: one constrained call to extract the structured fields, one free call for the prose fields. Or: constrain the schema but keep prose fields as {"type": "string"} without length or pattern constraints — you get the schema wrapper but don't fight the model on content.
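The two-call version is mostly plumbing. Here is a minimal sketch with the model calls injected as functions; build_response, fake_extract, and fake_summary are hypothetical names, and the fakes stand in for one constrained call and one free-form call so the example runs offline.

```python
def build_response(source_text, extract_fields, write_summary):
    """Combine one constrained call (typed fields) with one free call (prose)."""
    fields = extract_fields(source_text)   # constrained: IDs, enums, numbers
    summary = write_summary(source_text)   # unconstrained: plain text, no schema
    return {**fields, "summary": summary}

# Stand-ins for the two model calls
def fake_extract(text):
    return {"order_id": "4821", "status": "active"}

def fake_summary(text):
    return "Order 4821 is active and awaiting shipment."

response = build_response("Order #4821, status active", fake_extract, fake_summary)
print(response)
```

The two calls are independent, so they can run concurrently if latency matters more than cost.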

Edge cases that still bite you

Constrained decoding guarantees structural validity. It doesn't guarantee semantic correctness. Things that still go wrong:

Optional fields vs. required + null. Optional fields are a footgun. The model might not include them at all, and your downstream code silently gets None where it expected a value. Prefer "required": ["field_name"] with "type": ["string", "null"] — the field is always present, and you know explicitly when the model couldn't determine a value.
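As a sketch of the required-plus-nullable pattern: tracking_number is a hypothetical field, and tracking_status is an illustrative consumer showing how the explicit null reads downstream.

```python
# Every field is required; "unknown" is expressed as an explicit null
schema = {
    "type": "object",
    "properties": {
        "order_id": {"type": "string"},
        "tracking_number": {"type": ["string", "null"]},  # hypothetical field
    },
    "required": ["order_id", "tracking_number"],
    "additionalProperties": False,
}

def tracking_status(order):
    """The key is always present, so a null is a deliberate 'could not determine'."""
    value = order["tracking_number"]  # a KeyError here means the schema was not enforced
    if value is None:
        return "model could not determine a tracking number"
    return f"tracking number {value}"

print(tracking_status({"order_id": "4821", "tracking_number": None}))
print(tracking_status({"order_id": "4821", "tracking_number": "1Z999"}))
```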

Deeply nested schemas (> 3–4 levels). Complex nesting increases schema violation rates to 1–2% even with constrained decoding. The constraint engine overhead also spikes. Flatten where you can.

Recursive schemas with $ref. Supported by Gemini and llguidance, but can cause compilation times that blow up with Outlines/FSM-based engines. Use $ref carefully with self-hosted setups.

Enums on ambiguous fields. Constrained decoding forces the model to pick one enum value. If your enum is ["confirmed", "tentative", "cancelled"] and the source text is ambiguous, the model will pick one — it has to. It cannot express uncertainty. Add a "confidence" float field if ambiguity matters.
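One way to use that confidence field is to route low-confidence results to review. This is a sketch: the 0.7 threshold and the route helper are arbitrary illustrations, and a self-reported confidence is itself a model output, so treat it as a triage signal rather than a calibrated probability.

```python
schema = {
    "type": "object",
    "properties": {
        "status": {"type": "string", "enum": ["confirmed", "tentative", "cancelled"]},
        "confidence": {"type": "number"},  # expected range 0.0 to 1.0, checked in code below
    },
    "required": ["status", "confidence"],
    "additionalProperties": False,
}

def route(result, threshold=0.7):
    """Send low-confidence or out-of-range results to a human instead of trusting the enum."""
    confidence = result["confidence"]
    if not 0.0 <= confidence <= 1.0:
        return "human_review"
    return "auto" if confidence >= threshold else "human_review"

print(route({"status": "confirmed", "confidence": 0.93}))
print(route({"status": "tentative", "confidence": 0.41}))
```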

Streaming. Every provider's streaming with structured output is partial. Chunks arrive, but schema validation only applies to the complete response. You can parse partial JSON for progressive UI updates, but don't trust intermediate chunks as valid schema instances.
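A small sketch of why intermediate chunks can't be trusted. The chunk boundaries are invented, and stream_parse is a hypothetical helper that simply retries a strict parse as the buffer grows.

```python
import json

def stream_parse(chunks):
    """Accumulate chunks and record whether the buffer is valid JSON after each one."""
    buffer = ""
    states = []
    for chunk in chunks:
        buffer += chunk
        try:
            json.loads(buffer)
            states.append("complete")
        except json.JSONDecodeError:
            states.append("partial")
    return buffer, states

# Invented chunk boundaries; real streams split wherever the tokenizer does
chunks = ['{"order_id": "48', '21", "status": "act', 'ive", "total_usd": 142.5}']

buffer, states = stream_parse(chunks)
order = json.loads(buffer)
print(states, order)
```

Only the final buffer is a valid instance. After the second chunk, a lenient partial parser might report a status of "act", which is not one of the enum values.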

Constrained decoding fixes the syntax layer. You still need evaluation for the semantic layer — the model can fill every required field with a syntactically valid but semantically wrong value. Run output validation in your pipeline, not just at the schema level.
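What a semantic check looks like depends on your domain. As one illustration, with rules and a semantic_errors helper that are hypothetical, an extraction pipeline can verify that extracted values actually appear in the source text:

```python
def semantic_errors(order, source_text):
    """Checks that a JSON Schema cannot express: does the output agree with the input?"""
    errors = []
    if order["order_id"] not in source_text:
        errors.append("order_id does not appear in the source text")
    if order["total_usd"] < 0:
        errors.append("total_usd is negative")
    return errors

source = "Order #4821, status active, total $142.50"

# Both objects satisfy the schema; only one matches the source
good = {"order_id": "4821", "status": "active", "total_usd": 142.5}
bad = {"order_id": "4812", "status": "active", "total_usd": 142.5}

print(semantic_errors(good, source))
print(semantic_errors(bad, source))
```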

Production checklist

What to actually ship:

  • Use "strict": True (OpenAI) or the beta structured-outputs header (Anthropic) — not JSON mode, not a prompt instruction.
  • Define your schema with "additionalProperties": false to catch field hallucinations.
  • Prefer required fields + nullable types over optional fields. "type": ["string", "null"] with the field listed in "required" is safer than leaving the field optional.
  • Keep nesting under 4 levels. Flatter schemas compile faster, fail less, and are easier to debug.
  • Use enums for finite-set fields. status, intent, action, source, category — all enums.
  • Don't constrain prose fields with pattern or length. Let string fields be strings.
  • Separate prose from metadata. If you need a 200-word summary plus a status enum, consider two calls rather than one giant constrained schema.
  • Log raw model output before parsing. When the rare failure happens, you need to see what the model actually returned.
  • Test your schema on edge inputs. Long source texts that hit token limits, ambiguous inputs that force enum choices, multilingual text. Constrained decoding doesn't make the model smarter — it makes its output structurally valid.
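For the logging item in the checklist, a minimal sketch using Python's standard logging module. The logger name, the request_id parameter, and parse_logged are illustrative choices, not a prescribed format.

```python
import json
import logging

logger = logging.getLogger("llm.output")

def parse_logged(raw, request_id):
    """Log the raw model output first, then parse, so failures stay debuggable."""
    logger.info("raw model output request_id=%s body=%r", request_id, raw)
    try:
        return json.loads(raw)
    except json.JSONDecodeError:
        # The raw body is already in the log; record the traceback and re-raise
        logger.exception("unparseable model output request_id=%s", request_id)
        raise

parsed = parse_logged('{"status": "active"}', request_id="req-1")
print(parsed)
```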

The ecosystem has converged. OpenAI, Anthropic, Google, vLLM, Ollama — all ship constrained decoding. The "respond only with valid JSON" instruction had its moment. That moment is over.

Tanmay Bohra
Full Stack Engineer at Grant Thornton Bharat. Building high-concurrency systems in Go and TypeScript.