Getting Reliable Structured Output from LLMs

The moment you move an AI feature from "chat" to "part of your app," you need the model to return structured data — usually JSON your code can parse — not a friendly paragraph. This is where a lot of AI features quietly break, because language models are built to produce text, and text is messy. The difference between a demo and a production feature is almost entirely in how you handle the 2% of responses that don't come back the way you asked.

Here's the full defensive stack, in the order I'd reach for each layer.

Why this is hard

An LLM predicts likely text. Ask for JSON and it will usually produce valid JSON — but "usually" is a nightmare in production. The classic failure modes, all of which I've seen on real traffic:

The JSON is wrapped in a markdown code fence, because the model has seen a million tutorials formatted that way.
A chatty preamble sneaks in: "Sure! Here's the JSON you asked for:".
A trailing comma, single quotes, or an unescaped newline inside a string.
The response is cut off mid-object because it hit the output-length limit.
The JSON is valid but wrong — a required field missing, a string where you wanted a number, a value outside your enum.

Any of the first four breaks a naive JSON.parse(); the fifth sails through parsing and corrupts state three screens later. You need defences for both.
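To make the list concrete, here is a small sketch that feeds one hand-written sample of each failure mode to Python's json.loads. The sample strings are invented for illustration, not captured from a real model.

```python
import json

FENCE = "`" * 3  # markdown code fence, built here to keep the sample readable

# Hand-written samples of each failure mode (illustrative, not real model output).
SAMPLES = {
    "fenced": FENCE + 'json\n{"category": "billing"}\n' + FENCE,
    "preamble": 'Sure! Here\'s the JSON you asked for: {"category": "billing"}',
    "trailing_comma": '{"category": "billing",}',
    "truncated": '{"category": "billing", "tags": ["refund", "inv',
    "valid_but_wrong": '{"category": "billing", "confidence": "high"}',
}

def parses(text: str) -> bool:
    """Return True if a naive json.loads call accepts the text."""
    try:
        json.loads(text)
        return True
    except json.JSONDecodeError:
        return False

results = {name: parses(text) for name, text in SAMPLES.items()}
for name, ok in results.items():
    print(f"{name}: {'parses' if ok else 'fails to parse'}")
```

Only the last sample parses, and it is the one carrying a string where a number was wanted, which is why parsing alone is not enough.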

Technique 1: Use native structured-output features

The best fix, when available, is to not fight the problem at all. The major model APIs now support structured outputs — you provide a JSON Schema and the API constrains generation so the response conforms to it. Both Anthropic and OpenAI offer this, and self-hosted stacks can get the same guarantee with grammar-constrained decoding (e.g. via llama.cpp grammars or libraries like Outlines).

If your provider offers this, use it — it eliminates malformed JSON at the source, which is most of the problem. Two caveats keep it from being the whole answer: schema conformance doesn't guarantee the content is right (a syntactically perfect answer can still be nonsense), and a response truncated by output limits can still be incomplete. So keep the validation layer below even when the API guarantees the syntax.
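As a rough sketch, a schema for a small classification result might look like the dict below. The enum values are hypothetical, and how the schema is passed to the API differs by provider, so that call is left out. The helper only illustrates the truncation caveat, assuming the provider reports a stop reason such as OpenAI's finish_reason of "length" or Anthropic's stop_reason of "max_tokens".

```python
from typing import Optional

# JSON Schema for a small classification result. The enum values are made up.
CLASSIFICATION_SCHEMA = {
    "type": "object",
    "properties": {
        "category": {"type": "string", "enum": ["billing", "bug", "other"]},
        "confidence": {"type": "number"},
        "tags": {"type": "array", "items": {"type": "string"}},
    },
    "required": ["category", "confidence", "tags"],
    "additionalProperties": False,
}

# Stop reasons that signal the output limit was hit:
# "length" (OpenAI finish_reason) and "max_tokens" (Anthropic stop_reason).
TRUNCATION_REASONS = {"length", "max_tokens"}

def was_truncated(stop_reason: Optional[str]) -> bool:
    """Return True if the reported stop reason means the response was cut off."""
    return stop_reason in TRUNCATION_REASONS
```

Checking the stop reason before parsing lets you treat a cut-off response as a distinct failure instead of a confusing parse error.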

A related option is function/tool calling, where you describe a function signature and the model returns arguments for it. Under the hood it's the same idea — schema-constrained output — and if your use case looks like "the model decides what to call with which parameters," designing good tool definitions is the cleaner framing.
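For illustration, a tool definition is mostly a name, a description, and a JSON Schema for the arguments. The exact wrapper keys differ between providers, so the dict below uses generic field names, and both the create_ticket tool and the dispatch helper are hypothetical.

```python
from typing import Any, Callable

# Hypothetical tool definition; real APIs wrap these fields in their own envelope.
CREATE_TICKET_TOOL = {
    "name": "create_ticket",
    "description": "Open a support ticket for the customer's issue.",
    "parameters": {
        "type": "object",
        "properties": {
            "title": {"type": "string", "description": "One-line summary"},
            "priority": {"type": "string", "enum": ["low", "normal", "urgent"]},
        },
        "required": ["title", "priority"],
    },
}

def create_ticket(title: str, priority: str) -> str:
    """Stand-in implementation that just formats the ticket."""
    return f"[{priority}] {title}"

HANDLERS: dict[str, Callable[..., Any]] = {"create_ticket": create_ticket}

def dispatch(tool_name: str, arguments: dict[str, Any]) -> Any:
    """Route the model's chosen tool call to local code, rejecting unknown names."""
    handler = HANDLERS.get(tool_name)
    if handler is None:
        raise ValueError(f"unknown tool: {tool_name}")
    return handler(**arguments)
```

The arguments still deserve the same validation as any other model output before they reach a handler with side effects.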

Technique 2: Ask precisely, with an example

When you're prompting directly — older APIs, local models, or providers without schema support — be explicit and show the shape you want:

Return ONLY valid JSON matching this structure, with no explanation
and no markdown fences:

{"category": "string", "confidence": 0.0, "tags": ["string"]}

Two things matter here: telling it to return only JSON with no extra prose, and giving a concrete example of the structure. Models follow examples far better than they follow descriptions. If a field is an enum, list the allowed values; if a field can be absent, show an example where it's absent. Every ambiguity you leave in the spec is a place the model will eventually improvise.

Keep the temperature low for extraction tasks, too. You want the most probable reading of the input, not a creative one.
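One way to keep the prompt and the spec in sync is to build the prompt from code. The sketch below is a minimal version of that idea; build_prompt, the sample category names, and the wording of the extra rules are all assumptions made for the example.

```python
import json

def build_prompt(text: str, categories: list[str]) -> str:
    """Build an extraction prompt that shows the shape and lists the enum values."""
    # A concrete example of the structure, using a real allowed category.
    example = {"category": categories[0], "confidence": 0.9, "tags": ["example-tag"]}
    allowed = ", ".join(categories)
    return (
        "Return ONLY valid JSON matching this structure, with no explanation\n"
        "and no markdown fences:\n\n"
        f"{json.dumps(example)}\n\n"
        f'"category" must be one of: {allowed}.\n'
        '"confidence" is a number between 0 and 1.\n'
        '"tags" may be an empty list.\n\n'
        f"Input:\n{text}"
    )

prompt = build_prompt("My invoice was charged twice.", ["billing", "bug", "other"])
print(prompt)
```

Because the allowed values come from the same list your validator can use, the prompt and the check are less likely to drift apart.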

Technique 3: Always parse defensively

Never trust the raw response. Extract and validate before you use it:

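import json
import re

def parse_json(text: str) -> dict:
    """Extract and parse the JSON object embedded in a model response."""
    # Greedy match from the first "{" to the last "}", which drops
    # markdown fences and any prose before or after the object.
    match = re.search(r"\{.*\}", text, re.DOTALL)
    if not match:
        raise ValueError("no JSON object found")
    # json.loads raises json.JSONDecodeError (a ValueError) if the span is malformed.
    return json.loads(match.group(0))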

Then validate the shape against a schema — Pydantic in Python, Zod in TypeScript, or whatever your stack's equivalent is. This is where you enforce the things JSON syntax can't: required fields present, numbers in range, enums actually from the enum, arrays not absurdly long. A response that parses but is missing a required field is still a bug — catch it here, not three screens later.

The schema-validation layer earns its keep even with native structured outputs, because it's also your defence against semantic failures: it's the natural place to check that confidence is between 0 and 1 and that the category the model picked actually exists in your database.
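Here is a minimal sketch of that validation step using only the standard library, so it runs without extra dependencies. In practice a library such as Pydantic would replace most of it. The Classification type, the allowed categories, and the tag cap are assumptions made for the example.

```python
from dataclasses import dataclass

ALLOWED_CATEGORIES = {"billing", "bug", "other"}  # hypothetical enum
MAX_TAGS = 10  # arbitrary cap for illustration

@dataclass(frozen=True)
class Classification:
    category: str
    confidence: float
    tags: list[str]

def validate(data: object) -> Classification:
    """Check shape, types, ranges, and enum membership; raise ValueError on failure."""
    if not isinstance(data, dict):
        raise ValueError("expected a JSON object")
    missing = {"category", "confidence", "tags"}.difference(data)
    if missing:
        raise ValueError(f"missing fields: {sorted(missing)}")
    category, confidence, tags = data["category"], data["confidence"], data["tags"]
    if not isinstance(category, str) or category not in ALLOWED_CATEGORIES:
        raise ValueError(f"unknown category: {category!r}")
    # bool is a subclass of int in Python, so exclude it explicitly.
    if isinstance(confidence, bool) or not isinstance(confidence, (int, float)):
        raise ValueError("confidence must be a number")
    if not 0 <= confidence <= 1:
        raise ValueError("confidence must be between 0 and 1")
    if not isinstance(tags, list) or not all(isinstance(t, str) for t in tags):
        raise ValueError("tags must be a list of strings")
    if len(tags) > MAX_TAGS:
        raise ValueError("too many tags")
    return Classification(category, float(confidence), tags)
```

Raising a single exception type with a specific message keeps the caller simple and gives a retry step something precise to send back to the model.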

Technique 4: Retry with the error

When parsing or validation fails, you often don't need to fail the whole request. Send the model its own broken output plus the error and ask it to fix it:

Your previous response was not valid JSON. Error: <message>.
Return corrected JSON only.

A single retry often resolves a malformed response, because models are good at repairing their own output when shown the specific error. Cap it at one or two attempts so a stubborn failure can't loop forever, and log every retry. A rising retry rate is an early-warning signal that a prompt change or model update has shifted behaviour, and it's much nicer to see that in a dashboard than in user reports.
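The retry loop can be sketched as follows. call_model is a hypothetical stand-in for whatever client function you use (prompt in, raw text out), and the logger name is arbitrary. Only JSON syntax is checked here; a schema-validation step would sit in the same try block.

```python
import json
import logging
from typing import Any, Callable

logger = logging.getLogger("llm.retries")

def call_with_repair(
    call_model: Callable[[str], str],  # hypothetical: prompt in, raw text out
    prompt: str,
    max_retries: int = 1,
) -> Any:
    """Call the model, and on a parse failure ask it to repair its own output."""
    current_prompt = prompt
    for attempt in range(max_retries + 1):
        raw = call_model(current_prompt)
        try:
            return json.loads(raw)
        except json.JSONDecodeError as err:
            # Log every failure so a rising retry rate shows up in monitoring.
            logger.warning("attempt %d failed: %s", attempt + 1, err)
            # Show the model its own output plus the specific error.
            current_prompt = (
                f"{prompt}\n\nYour previous response was:\n{raw}\n\n"
                f"Your previous response was not valid JSON. Error: {err}.\n"
                "Return corrected JSON only."
            )
    raise ValueError(f"no valid JSON after {max_retries + 1} attempts")
```

The hard cap means the worst case is a bounded number of calls followed by one clear exception, which is what the fallback layer needs.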

Technique 5: Have a fallback

Sometimes the model just won't cooperate, or the provider is down. Decide in advance what happens then: a safe default value, a visible "couldn't process this" state, or a queue that retries later. Which one is right depends on the feature — a failed auto-tag can silently default to "untagged," while a failed data extraction the user is waiting on should say so honestly.

What you must not do is let an unparsed response crash the feature or, worse, write half-parsed garbage into user data. The blast radius of a bad parse should be one request, never one record.
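For the auto-tagging case, a fallback can be as small as the sketch below. TagResult, auto_tag, and the call_model parameter are hypothetical names. The degraded flag is one possible way to let the caller tell a default from a real answer.

```python
import json
from dataclasses import dataclass
from typing import Callable

@dataclass(frozen=True)
class TagResult:
    tags: tuple[str, ...]
    degraded: bool  # True when the fallback was used

UNTAGGED = TagResult(tags=("untagged",), degraded=True)

def auto_tag(call_model: Callable[[str], str], text: str) -> TagResult:
    """Return parsed tags, or the safe default if anything goes wrong."""
    try:
        data = json.loads(call_model(text))
        tags = data["tags"]
        if not isinstance(tags, list) or not all(isinstance(t, str) for t in tags):
            raise ValueError("tags must be a list of strings")
    except Exception:
        # Provider outage, bad JSON, or wrong shape: contain it to this request.
        # The broad catch is deliberate, since client errors vary by provider.
        return UNTAGGED
    return TagResult(tags=tuple(tags), degraded=False)
```

Nothing is written to storage inside the try block, so a failure at any point leaves user data untouched.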

Keep the schema small

The more fields and nesting you demand, the more ways the model can drift — and the worse it gets at each individual field, because its attention is spread across all of them. Ask for the minimum structure you need. If you genuinely need something complex, consider breaking it into smaller, separate calls — each with a simple schema — rather than one giant request. Two reliable calls beat one flaky one, and they're easier to test and retry independently.

Field names matter more than you'd think, too: delivery_date gets better results than dd, because the name itself is instruction. Self-describing schemas are both documentation for you and guidance for the model.
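As a sketch of both points, the example below splits one order extraction into two small calls with self-describing field names. Every field name, the prompt wording, and the call_model parameter are assumptions for illustration.

```python
import json
from typing import Any, Callable

# Two small, self-describing field groups instead of one large nested schema.
# All field names here are hypothetical.
SHIPPING_FIELDS = ("delivery_date", "carrier_name")
BILLING_FIELDS = ("invoice_total", "currency_code")

def extract(
    call_model: Callable[[str], str], text: str, fields: tuple[str, ...]
) -> dict[str, Any]:
    """Ask for one small group of fields and check that all of them came back."""
    shape = json.dumps({name: "..." for name in fields})
    raw = call_model(f"Return ONLY JSON shaped like {shape} for this text:\n{text}")
    data = json.loads(raw)
    if not isinstance(data, dict):
        raise ValueError("expected a JSON object")
    missing = [name for name in fields if name not in data]
    if missing:
        raise ValueError(f"missing fields: {missing}")
    return {name: data[name] for name in fields}

def extract_order(call_model: Callable[[str], str], text: str) -> dict[str, Any]:
    """Run the two extractions separately so each can be tested and retried alone."""
    shipping = extract(call_model, text, SHIPPING_FIELDS)
    billing = extract(call_model, text, BILLING_FIELDS)
    return {**shipping, **billing}
```

If the billing call fails, the shipping result is still usable, and only the failed half needs a retry.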

Summary

Getting structured output from an LLM is a solved problem if you layer your defences: prefer native structured-output/schema modes, prompt with an explicit example when you can't, parse and schema-validate defensively regardless, retry once or twice with the error message on failure, and always have a fallback. Treat the model as a helpful but unreliable narrator, and build the guardrails that turn "usually valid" into "always safe."