On this page
- TL;DR
- Why this is still the hardest problem
- Where hallucinations come from
- Technique 1: grounding (the single biggest lever)
- RAG done well vs RAG done badly
- Technique 2: structured output
- Technique 3: citations and traceability
- Inline citations
- Spans / structured citations
- The fallback prompt
- Technique 4: invite "I don't know"
- Technique 5: temperature and decoding settings
- Technique 6: choose the right model
- Technique 7: verify with a second pass
- Technique 8: evals (the boring one that matters most)
- Technique 9: prompt patterns that help
- Technique 10: post-processing and guardrails
- What does not work (despite the marketing)
- The recommended stack (mid-2026)
- A worked example: a financial Q&A assistant
- The summary
TL;DR
- Hallucinations aren't going to zero, but in 2026 you can drive them well below the threshold where they break trust — if you do the work.
- The five techniques that actually matter: grounding (give it the answer), structured output (constrain the form), citations (force traceability), evals (measure regression), and temperature/decoding (turn down the dice).
- RAG is necessary but not sufficient. Bad retrieval makes hallucinations worse than no retrieval. Build the eval first.
- Reasoning models (o3, GPT-5 with thinking, Claude Opus 4.7 extended) hallucinate less than fast models on hard questions, but they still confidently invent things. Don't trust the marketing.
- The most underused technique: let the model say "I don't know." It rarely will unless you explicitly invite it.
Why this is still the hardest problem
Frontier models in 2026 are dramatically better than 2023. They're also dramatically more confident. The fluent, well-structured wrong answer is the worst kind: it slides past human review because it looks right.
Every reliability technique below is in service of one outcome: making the model's output either correct or visibly uncertain, never confidently wrong.
We're not going to pretend the problem is solved. We're going to give you the techniques that actually move the needle.
Where hallucinations come from
A useful taxonomy before we get to solutions:
- Knowledge gaps. The model doesn't know the answer and makes one up rather than admit it.
- Stale knowledge. The model's training cutoff is months or years old. Recent facts are fabricated.
- Long-tail facts. Specific names, dates, identifiers, version numbers. These live at the edge of the model's distribution and tend to drift.
- Compositional errors. Each fact is right; the combination is wrong (mixing two people's bios, attributing a quote to the wrong author).
- Format-induced errors. "Give me a JSON object with these fields" → the model invents plausible-looking values for fields it doesn't have data for.
- Sycophancy. The model agrees with the user's framing even when the framing is false.
- Reasoning errors. Multi-step deductions go wrong, even when each step looks plausible.
Different sources need different fixes. Don't reach for RAG when the problem is sycophancy.
Technique 1: grounding (the single biggest lever)
If you give the model the information it needs to answer in the prompt, it doesn't have to guess. This is the single highest-leverage technique by far.
Grounding modes:
- Inline context. Paste the relevant document, code, or data directly. Works for short, focused tasks.
- RAG. Retrieve relevant chunks from a knowledge base and include them.
- Tool use. Let the model call a search tool, a database, or an API to fetch facts on demand.
- Browse / web search. Let the model query the live web for fresh information.
The pattern that matters: never ask a model a question whose answer requires information you didn't give it. When you do, you're asking it to invent.
RAG done well vs RAG done badly
RAG is the most common grounding pattern. It's also the most commonly broken.
Bad RAG looks like:
- Embed the user's raw question, retrieve top 5 chunks, stuff into prompt, ship.
- No re-ranking. No hybrid search. No chunking strategy.
- No measurement of whether retrieval is bringing back the right chunks.
When retrieval brings back the wrong chunks, the model uses them as facts. The hallucination rate goes up, not down, because the model now has confident but irrelevant context to confabulate from.
Good RAG looks like:
- Smart chunking. Semantic chunks, not arbitrary character splits. Respect document structure.
- Hybrid retrieval. BM25 + vector + sometimes a graph layer. Different queries need different strategies.
- Re-ranking. A cross-encoder reorders the top 50 candidates to find the best 5. This single step is one of the highest-ROI changes you can make.
- Query rewriting. Have a model rewrite the user's question into a better retrieval query.
- Retrieval evaluation. Measure the precision and recall of retrieval as separate metrics from end-to-end answer quality.
- Fall-through behavior. When retrieval returns nothing relevant, the prompt instructs the model to say so, not improvise.
If you're doing RAG and skipping re-ranking and retrieval evals, fix those before you do anything else.
Technique 2: structured output
Free-form text gives the model maximum room to wander. Structured output constrains it.
Modern providers all support structured output at the API level: JSON schema, function calling, regex grammars. Use them.
Why structure helps:
- The model can't add a fictional field if the schema doesn't allow it.
- Required fields force the model to either produce a value or fail explicitly (which you can detect).
- Enums constrain free-form choices to known options.
- Type checks catch obvious errors at the parser, not in production.
Pattern: never parse free-form text when a structured output API is available. The amount of "model said 'maybe' when I expected 'yes'" pain you'll save is enormous.
For tasks like extraction, classification, and routing, structured output is essentially mandatory.
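To make the "schema as constraint" point concrete, here is a stdlib-only sketch of the checks a schema buys you: required fields, no invented fields, enums, and types. In production you'd lean on the provider's JSON-schema enforcement or a library like Pydantic; the schema shape below is an illustrative assumption, not a standard.

```python
import json

# Hypothetical schema for a yes/no/unknown classifier's output.
SCHEMA = {
    "required": {"label", "confidence"},
    "enums": {"label": {"yes", "no", "unknown"}},  # no room for "maybe"
    "types": {"confidence": float},
}

def validate(raw: str, schema=SCHEMA) -> dict:
    """Parse model output and reject anything outside the schema."""
    obj = json.loads(raw)
    missing = schema["required"] - obj.keys()
    if missing:
        raise ValueError(f"missing required fields: {missing}")
    extra = obj.keys() - schema["required"]
    if extra:
        raise ValueError(f"invented fields not in schema: {extra}")
    for field, allowed in schema["enums"].items():
        if obj[field] not in allowed:
            raise ValueError(f"{field}={obj[field]!r} not in {allowed}")
    for field, typ in schema["types"].items():
        if not isinstance(obj[field], typ):
            raise ValueError(f"{field} must be {typ.__name__}")
    return obj
```

Every `ValueError` here is a hallucination caught at the parser instead of shipped to a user.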
Technique 3: citations and traceability
Make the model show its work.
Specifically: every claim should point back to the chunk of source material it came from. Two flavors:
Inline citations
The model produces output like: "The quarterly revenue grew by 14% [source: report.pdf, page 3]." You can audit every claim.
This is most useful when the user reads the output. They develop trust in the parts that cite real sources and skepticism about parts that don't.
Spans / structured citations
The model returns the answer plus a list of source spans (document ID + character range or chunk ID). You can render the citation differently in your UI, and you can programmatically verify that the cited span actually contains the claimed information.
The verification step is the magic: a second pass model can check "does this cited chunk actually support this claim?" and downgrade unverifiable claims.
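The programmatic half of that verification can be plain string checking before any model gets involved: does the cited character range even exist, and does it contain the claimed value? A minimal sketch, assuming citations arrive as document-relative character ranges:

```python
def span_supports(document: str, start: int, end: int, claimed_value: str) -> bool:
    """Check that a cited character range actually contains the claimed value."""
    if not (0 <= start < end <= len(document)):
        return False  # citation points outside the document: auto-downgrade
    return claimed_value in document[start:end]

def audit(document: str, claims: list[dict]) -> list[dict]:
    """Flag each claim as verified or unverifiable against its cited span."""
    return [
        {**c, "verified": span_supports(document, c["start"], c["end"], c["value"])}
        for c in claims
    ]
```

Claims that fail this cheap check can be dropped or flagged without spending a single extra model call; the second-pass model then only has to judge the subtler cases.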
The fallback prompt
Even without RAG, you can ask a model to mark uncertainty: "For each claim, mark [confident] or [uncertain]. Mark [unverified] if you cannot point to a source." It will, mostly, comply. The output becomes self-flagging.
Technique 4: invite "I don't know"
Models hallucinate partly because they're trained to be helpful. "I don't know" feels unhelpful. They'd rather guess.
Counter this in the system prompt:
- "If the provided context does not contain the answer, respond exactly with: 'I don't have enough information to answer.'"
- "Do not speculate. If you would need to guess, say so explicitly."
- "Better to say 'I don't know' than to give a plausible but unverified answer."
This sounds obvious. Most production prompts don't include it. It is one of the cheapest, highest-impact changes you can make.
For RAG specifically, also add: "If the retrieved context does not directly support an answer, say so. Do not synthesize beyond what is in the context."
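One practical detail: if you ask for an exact refusal sentence, your application can detect it and render a distinct "no answer" state instead of prose. A small sketch; the sentinel wording is from the prompt above, the stripping logic is an assumption about how models tend to quote it:

```python
REFUSAL = "I don't have enough information to answer."

SYSTEM_PROMPT = (
    "If the provided context does not contain the answer, respond exactly with: "
    f"'{REFUSAL}' Do not speculate. If you would need to guess, say so explicitly."
)

def is_refusal(answer: str) -> bool:
    """Exact-match the sentinel so the UI can show 'no answer' instead of prose."""
    return answer.strip().strip("'\"") == REFUSAL
```

An exact sentinel beats asking the model to "indicate uncertainty" in free text, which you then have to classify with yet another model call.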
Technique 5: temperature and decoding settings
Temperature controls the randomness of token selection. Higher = more creative, more variance, more hallucinations. Lower = more deterministic, less creative.
Practical defaults:
- Factual extraction, classification, routing: temperature 0 (or as low as the API allows).
- Code generation: temperature 0.0-0.3.
- Analysis and summarization of provided context: temperature 0.0-0.5.
- Open conversation, brainstorming: temperature 0.7-1.0.
- Creative writing: temperature 0.7-1.2.
Many production AI features ship with the default temperature (often 0.7-1.0) on tasks that should be near-zero. This single setting change can meaningfully reduce hallucinations on factual tasks.
Other knobs:
- top_p (nucleus sampling): keep below 1.0 for factual tasks. 0.9 is a reasonable default.
- top_k: similar effect; constrain to most-likely candidates.
- JSON mode / structured output: enforces shape, indirectly reduces drift.
For some providers, seed parameters give you reproducible outputs across runs. Use them in eval pipelines.
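The defaults above fit in a small lookup table. Parameter names here (`temperature`, `top_p`, `seed`) follow the common OpenAI-style chat request shape, but every provider differs, so treat this as a sketch and check your provider's docs:

```python
# Decoding presets per task type, following the defaults above.
DECODING_PRESETS = {
    "extraction":    {"temperature": 0.0, "top_p": 0.9},
    "code":          {"temperature": 0.2, "top_p": 0.9},
    "summarization": {"temperature": 0.3, "top_p": 0.9},
    "chat":          {"temperature": 0.8, "top_p": 1.0},
    "creative":      {"temperature": 1.0, "top_p": 1.0},
}

def request_params(task: str, reproducible: bool = False) -> dict:
    """Build decoding parameters; pin a seed for eval-pipeline reproducibility."""
    params = dict(DECODING_PRESETS[task])
    if reproducible:
        params["seed"] = 42  # same seed + same params -> repeatable runs (where supported)
    return params
```

Centralizing this in one place also means "someone shipped temperature 1.0 on the extraction endpoint" becomes a code-review catch rather than a production incident.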
Technique 6: choose the right model
Some models hallucinate less than others on certain tasks. In 2026:
- Reasoning models (o3, GPT-5 thinking, Opus 4.7 extended) hallucinate less on math, code, and multi-step problems. They check their work internally.
- Long-context models (Gemini 2.5 Pro) are better at "needle in haystack" retrieval from provided context.
- Fast models (Haiku 4.5, GPT-5-mini, Flash) are more prone to confabulation on hard questions but excellent at well-defined tasks.
- Small models (Phi-4, Qwen 2.5) hallucinate more than frontier on broad knowledge tasks but can be surprisingly accurate on narrow fine-tuned tasks.
Don't reach for o3 by default — it's slow and expensive. But for any task where reasoning depth maps to correctness, the reasoning models really do hallucinate less.
For more on this trade-off, see Fast AI vs smart AI and Choosing the right AI model.
Technique 7: verify with a second pass
A surprisingly effective pattern: run the answer through a second model whose only job is to fact-check.
The verifier prompt looks like: "Here is a question, the source material, and an answer. For each claim in the answer, judge whether the source material supports it. Output: VERIFIED, UNSUPPORTED, or CONTRADICTED, with the relevant source span."
Use a fast model for this — Haiku 4.5 or GPT-5-mini. The cost is small. The reduction in shipped hallucinations is significant.
Patterns:
- Hard verify: drop unsupported claims before showing the user.
- Soft verify: show the user, but flag unsupported claims with a UI marker.
- Re-ask on failure: if a claim is unsupported, send it back to the original model with the verifier's feedback and try again.
The verifier is the closest thing we have to a hallucination "alarm." Many production systems have one. Most prototypes don't.
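The verifier's output has to be machine-readable before you can gate on it. A sketch of the plumbing around the verifier prompt above; the one-verdict-per-line response format is an assumption you'd pin down in the verifier's own instructions:

```python
import re

VERDICTS = {"VERIFIED", "UNSUPPORTED", "CONTRADICTED"}

def parse_verdicts(verifier_output: str) -> list[dict]:
    """Parse lines like 'CLAIM 1: VERIFIED (page 4)' from the verifier model."""
    results = []
    for line in verifier_output.splitlines():
        m = re.match(r"CLAIM (\d+):\s*(\w+)(?:\s*\((.*)\))?", line.strip())
        if m and m.group(2) in VERDICTS:
            results.append({"claim": int(m.group(1)),
                            "verdict": m.group(2),
                            "span": m.group(3)})
    return results

def hard_verify(claims: list[str], verdicts: list[dict]) -> list[str]:
    """Hard-verify policy: drop any claim the verifier did not mark VERIFIED."""
    ok = {v["claim"] for v in verdicts if v["verdict"] == "VERIFIED"}
    return [c for i, c in enumerate(claims, start=1) if i in ok]
```

The soft-verify and re-ask variants reuse the same parsed verdicts; only the policy applied to `UNSUPPORTED` changes.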
Technique 8: evals (the boring one that matters most)
Without evals you cannot say "we reduced hallucinations." You can only say "vibes are better."
A real eval pipeline:
- Golden dataset. 50-500 representative inputs with known correct outputs (or known-supported source material).
- Automated grader. Often an LLM grading whether outputs match expectations. Sometimes string match, sometimes structured comparison.
- Hallucination-specific metrics: unsupported-claim rate, fabrication rate, citation-accuracy rate.
- Regression alerts. When a prompt or model change degrades a metric, you know before users do.
This is the unglamorous infrastructure that separates "we hope it works" from "we shipped it because we measured it."
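The hallucination-specific metric and the regression gate are a few lines once a grader (human or LLM) has labeled each run. A minimal sketch; the per-run record shape and the one-point tolerance are assumptions:

```python
def unsupported_claim_rate(graded_runs: list[dict]) -> float:
    """Fraction of all claims the grader marked unsupported, across the golden set."""
    total = sum(r["claims"] for r in graded_runs)
    unsupported = sum(r["unsupported"] for r in graded_runs)
    return unsupported / total if total else 0.0

def regression_alert(baseline: float, candidate: float,
                     tolerance: float = 0.01) -> bool:
    """True when a prompt/model change is measurably worse than the baseline."""
    return candidate > baseline + tolerance
```

Run this on every prompt and model change in CI, and "did we make hallucinations worse?" stops being a debate and becomes a failing check.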
Technique 9: prompt patterns that help
A grab bag of prompt-engineering patterns specifically for reliability:
- State the source of truth explicitly: "Use only the information in the CONTEXT block below. If the context does not answer the question, say so."
- Forbid speculation: "Do not make assumptions. Do not infer beyond what is stated."
- Demand citations: "For every fact, include a source citation. If you cannot cite a source, omit the fact."
- Chain-of-thought (carefully): "Think step by step. Show your reasoning. State your final answer last." Improves multi-step accuracy on non-reasoning models. Less needed on reasoning models, which think internally.
- Role calibration: "You are a careful research analyst. Prefer 'unknown' over a confident wrong answer."
- Double-check: "Before responding, review your answer for any claims that are not supported by the context. Remove or flag them."
Some of these are old. They still work. Use them.
Technique 10: post-processing and guardrails
After the model speaks, before the user reads:
- Regex / parser checks for required formats.
- Schema validation for structured outputs.
- Number sanity checks — does the cited number exist in the source? Is it in a sensible range?
- PII scanners on outputs (especially if the model might echo private data).
- Toxicity filters for user-facing surfaces.
- Citation verification — does the cited source actually exist? Does it contain the claim?
Guardrails are not a substitute for the techniques above. They're a final safety net. Use both layers.
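The number sanity check above is one of the cheapest guardrails to implement: extract every number from the answer and confirm each appears in the source. A regex sketch; real systems would also normalize formats (commas, percent signs, rounding):

```python
import re

NUM = re.compile(r"-?\d+(?:\.\d+)?")

def ungrounded_numbers(answer: str, source: str) -> list[str]:
    """Return numbers in the answer that never appear in the source material."""
    source_nums = set(NUM.findall(source))
    return [n for n in NUM.findall(answer) if n not in source_nums]
```

A non-empty result doesn't prove a hallucination (the model may have legitimately computed a derived figure), but it's exactly the kind of claim a verifier pass or a UI flag should scrutinize.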
What does not work (despite the marketing)
A few honest negatives:
- "Just use a bigger model." Bigger frontier models hallucinate less than fast models on hard tasks, but they hallucinate more confidently. Quality goes up, but the residual errors are harder to spot. Bigger is not free.
- "Just turn off the temperature." Helps for factual tasks. Does nothing for grounding gaps.
- "Just add more context." Beyond a point, more context dilutes attention and can degrade quality. Curate, don't dump.
- "Just use the latest model." Newer is usually better, but each model has new failure modes. Re-run your evals on every model upgrade.
- "Just enable JSON mode." Helpful, not a panacea. JSON mode constrains shape, not truth.
The recommended stack (mid-2026)
If you're building a feature where reliability matters, here's the stack:
- Grounding. Always. RAG, tool use, browse, or inline.
- A reasoning-tier model (Sonnet 4.6 minimum, Opus 4.7 / GPT-5 / o3 for hard cases).
- Low temperature (0.0-0.3) for factual tasks.
- Structured output for any non-conversational shape.
- Citations in the response, with optional verification pass.
- Explicit "I don't know" instruction in the system prompt.
- A verifier pass by a fast model on the output.
- A real eval suite running on every prompt or model change.
Each layer reduces hallucinations a little. Stacked, they get you to a quality bar where users actually trust the system.
A worked example: a financial Q&A assistant
User asks: "What was Acme Corp's revenue growth in Q3 2026?"
Naive: GPT-5 prompt with no grounding. Model says "Acme Corp's Q3 2026 revenue grew by approximately 12%." Confident. Probably wrong. Definitely unverifiable.
Reliable:
- Retrieve Acme's Q3 2026 earnings release from a vector index, with re-ranking.
- Verify retrieval — is this actually the Q3 release? Skip if not, return "no source available."
- Prompt with the document inlined: "Using only the provided document, answer the question. Cite the section. If the document does not contain the answer, say so."
- Model (Sonnet 4.6, temperature 0.1, structured output).
- Output: { "answer": "14.2%", "citation": "MD&A section, page 4", "confidence": "high" }
- Verifier (Haiku 4.5): "Does the cited page actually state 14.2% revenue growth?" → confirms.
- Render to user with the citation linkable.
Cost: pennies. Latency: a couple of seconds. Reliability: high. Hallucination risk: minimal.
This is the shape of every reliable AI feature in 2026. It's not magic. It's layers.
The summary
- Hallucinations are an architecture problem, not a model problem.
- Ground the model so it has the answer before you ask.
- Constrain the output with structure, low temperature, and explicit "I don't know" permission.
- Verify with a second pass and citation checks.
- Measure with evals or you don't actually know if you're improving.
- The right model helps; the right system matters more.
Build the layers. Ship answers users can trust.
NovaKit lets you wire grounding, RAG, citations, and verifier passes into any conversation — across every major model, with your own keys.