On this page
- TL;DR
- Setting the stage
- V1: The Tutorial Special
- What I built
- What went wrong
- What I should have known
- V2: The Rebuild That Almost Worked
- What I built
- What went wrong (this time)
- What I should have known
- V3: The Rebuild That Stuck
- Building the eval first
- What I built
- The lessons that stuck
- What I'd do differently from day one
- The thing I'm still learning
- The summary
TL;DR
- I shipped a RAG chatbot, watched it fail, rebuilt it, watched it fail differently, and rebuilt it again. Each rebuild taught me something I should have known the first time.
- V1 lesson: the demo lies. A working demo on 20 documents tells you nothing about how the system behaves on 20,000.
- V2 lesson: evaluation is not optional. Without an eval harness you are flying blind and your "improvements" are guesses.
- V3 lesson: the model and the retrieval are not the bottleneck. The data is. Always the data.
- The honest meta-lesson: I would have shipped a better system faster by reading other people's postmortems instead of being clever.
Setting the stage
I'm writing this in April 2026, eighteen months after I started building what should have been a simple internal "ask our docs" chatbot. The corpus was 12,000 documents — engineering wikis, product specs, runbooks, policy docs, the usual mid-size company sprawl. The brief was: "make it so engineers can ask questions and get answers from our docs."
Sounds easy. It's a tutorial-shaped problem. I had built proof-of-concept RAG systems before. I figured a couple of weeks.
Three rebuilds later, here we are.
V1: The Tutorial Special
What I built
The first version was the canonical RAG tutorial, slightly polished:
- LangChain pipeline.
- PDFs and HTML in. PyPDF2 + a simple HTML parser. Split into 1000-character chunks with 200-character overlap.
- OpenAI embeddings (text-embedding-3-small at the time).
- Pinecone for the vector store.
- GPT-4o for generation, with a prompt that said "answer using the provided context, cite sources."
- A Slack bot frontend.
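That fixed-width split can be sketched in a few lines. The character counts are the ones above; everything else is illustrative:

```python
def chunk_text(text: str, size: int = 1000, overlap: int = 200) -> list[str]:
    """Naive fixed-width chunking with overlap -- the V1 approach.

    Slices the raw text every (size - overlap) characters, so each
    chunk shares its last `overlap` characters with the next one.
    """
    step = size - overlap
    return [text[i:i + size] for i in range(0, max(len(text) - overlap, 1), step)]
```

This is exactly the kind of splitter that mangles tables and mid-sentence context, which is why V2 moved to structure-aware chunking.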
Two weeks of work. The demo was great. Six engineers tried it during the launch lunch. It answered all six of their questions.
I shipped it.
What went wrong
Within a week, people stopped using it.
The complaints came in trickles, never in dramatic bug reports. "It told me the wrong on-call rotation." "It said our SOC 2 report covers stuff that it doesn't." "It just makes things up about the deploy process."
I dug in. The failures were everywhere:
- Stale answers. The chatbot was confidently citing a 2022 architecture doc that had been replaced in 2024. The old doc was still in the corpus.
- Wrong chunks. A question about "the staging environment" pulled chunks about a different staging environment used by a deprecated team.
- Conflicting answers. Two policy documents said different things. The bot would pick one, seemingly at random, and not flag the conflict.
- Mangled tables. Anything tabular came out garbled. SLA tables, on-call rotations, pricing matrices — all unreadable.
- Dead OCR. A bunch of our policy docs were scanned PDFs. PyPDF2 returned empty strings. The chatbot didn't know those documents existed.
The demo had worked because the six demo questions had been against six clean, recent, prose-heavy documents. The real corpus was a swamp.
What I should have known
The corpus is the product. Before you write any pipeline code, audit the corpus. What's in it? What's stale? What's tabular? What's scanned? What's contradictory? You will spend more time on data hygiene than on retrieval, and that's correct.
I had spent two weeks on the pipeline and zero on the data. I deserved the result I got.
V2: The Rebuild That Almost Worked
What I built
For V2, I took the criticism seriously. I rebuilt the ingestion:
- Real PDF parsing with Unstructured for layout-aware extraction.
- OCR fallback for scanned documents (Tesseract was fine; for the worst ones I ran them through GPT-4 vision).
- Structure-aware chunking — split on headings, keep tables intact as their own chunks, never split mid-paragraph.
- Metadata on every chunk — source URL, last-modified date, document type, owning team.
- Deduplication. Identical and near-identical chunks (90% similarity) collapsed.
- Stale filtering. Anything not modified in 18+ months got a "potentially stale" flag.
I switched to hybrid search — BM25 plus vectors, fused with RRF. Added a Cohere reranker in front of the LLM. Bumped the model to Claude Sonnet 4 (this was 2025).
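The RRF fusion step itself is tiny. A minimal sketch (the doc IDs and result lists here are made up, and k=60 is the conventional default, not necessarily what I used):

```python
from collections import defaultdict

def rrf_fuse(rankings: list[list[str]], k: int = 60) -> list[str]:
    """Reciprocal Rank Fusion: score each doc by the sum of 1/(k + rank)
    over every ranked list it appears in, then sort by fused score."""
    scores: dict[str, float] = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)

# Hypothetical result lists from BM25 and vector search:
bm25_results = ["doc_a", "doc_b", "doc_c"]
vector_results = ["doc_a", "doc_d", "doc_b"]
fused = rrf_fuse([bm25_results, vector_results])
```

The appeal of RRF is that it needs no score normalization: it only looks at ranks, so BM25 scores and cosine similarities never have to be made comparable.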
Six weeks of work. The bot was visibly better. I shipped V2.
What went wrong (this time)
People started using it again. For about three weeks. Then a different set of complaints showed up.
- It was confidently wrong on edge cases. Not random questions — specific important questions. "What's the rollback procedure for service X?" got a plausible answer that didn't match the actual runbook.
- It would refuse to answer questions I knew it could answer. Some hedge in the prompt was making it overly cautious.
- It got worse over time. People kept adding documents. Some of those documents conflicted with older ones. The bot would pick the wrong one.
- A senior engineer told me, point blank, "I don't trust it."
Here's the thing that broke me: I had no way to measure any of this. Was V2 better than V1? I thought so. By how much? I had no idea. Did my last prompt change improve things or make them worse? I had vibes, not numbers.
I had built a system with no eval harness. I was flying blind.
What I should have known
Eval first. Then build. The eval set is not the last thing you build before shipping. It is the first thing you build, before you write a single chunking rule. Without it, every "improvement" is a guess.
What an eval set looks like, in retrospect:
- 100-200 real questions, sourced from Slack history of people asking the docs questions.
- For each, the document(s) that should be retrieved.
- For each, an expected answer (or at least an expected set of facts the answer should contain).
- A scoring function — exact match where possible, LLM-as-judge for fuzzier cases.
- A regression dashboard so you can see retrieval@K and answer correctness over time.
Six hours of work to assemble. Saves you weeks of rework.
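A minimal harness along those lines might look like this. `retrieve` and `answer` stand in for your actual pipeline, the fact check is a crude substring match rather than an LLM judge, and all names are hypothetical:

```python
from dataclasses import dataclass

@dataclass
class EvalCase:
    question: str
    relevant_docs: set[str]    # doc IDs that should be retrieved
    expected_facts: list[str]  # substrings the answer should contain

def score(cases, retrieve, answer, k=5):
    """Compute retrieval@k and a crude fact-coverage score.

    retrieve(question) -> ranked list of doc IDs
    answer(question)   -> answer text
    """
    hits, coverage = 0, 0.0
    for case in cases:
        retrieved = set(retrieve(case.question)[:k])
        if retrieved & case.relevant_docs:
            hits += 1  # at least one relevant doc in the top k
        text = answer(case.question).lower()
        found = sum(fact.lower() in text for fact in case.expected_facts)
        coverage += found / len(case.expected_facts)
    n = len(cases)
    return {"retrieval@k": hits / n, "fact_coverage": coverage / n}
```

Run it before every change, diff the two dicts, and "I think it's better" becomes "retrieval@5 went from 73% to 91%."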
I didn't build one for V2 because I'd convinced myself the bot was "obviously better." It was, but only on the questions I happened to remember to test, and not nearly as much as I assumed.
V3: The Rebuild That Stuck
Building the eval first
Before writing any V3 code, I built the eval harness. I scraped six months of #ask-docs Slack history. I categorized questions: factual lookups, multi-document synthesis, table queries, current-state questions, historical questions. I wrote down what the right answer was for each.
I ran V2 against this eval set. The numbers: 61% answer correctness, 73% retrieval@5, 22% citation accuracy.
That last number was the killer. More than three quarters of the time, the bot's "citations" weren't actually where the claim came from.
Now I had something to optimize against.
What I built
V3 changes, in order of impact:
- Fixed citations first. Made the LLM emit chunk IDs from the retrieved set, then reconstruct human-readable citations server-side. Before: model made up page numbers. After: citations matched the actual chunk every time.
- Conflict detection. When retrieval returned chunks from multiple documents that disagreed, the bot now surfaced the conflict explicitly: "Document A says X (last updated 2023). Document B says Y (last updated 2025). The more recent says Y." This single change moved trust dramatically.
- Per-team document filters. A platform engineer asking about "the deploy process" wants the platform team's deploy docs, not the data team's. Adding metadata-based filtering by team affiliation cut wrong-team retrievals by 80%.
- Routing by query type. Simple factual lookups went to Haiku 4.5 (cheap, fast). Multi-document synthesis went to Opus 4.7 (slow, expensive, careful). I used a tiny classifier to decide. See multi-model routing.
- Freshness scoring. Newer documents got a small retrieval boost. Documents flagged as deprecated got a strong penalty. The bot stopped citing 2022 docs as authoritative.
- A "show me the chunks" debug mode for me. Every answer had a hidden link that showed exactly what chunks were retrieved, in what order, after rerank. Debugging went from hours to minutes.
- An ongoing feedback loop. Each Slack response had a thumbs up/down. The thumbs-down questions went into a queue I reviewed weekly and added to the eval set.
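The citation fix is mostly string plumbing. A sketch of the server-side reconstruction, with a made-up marker format and chunk store:

```python
import re

# Hypothetical chunk store: chunk ID -> (source title, url)
CHUNKS = {
    "c17": ("Deploy Runbook", "https://wiki.example.com/deploy#rollback"),
    "c42": ("SOC 2 Overview", "https://wiki.example.com/soc2"),
}

def render_citations(answer: str, retrieved_ids: set[str]) -> str:
    """Replace model-emitted [chunk:ID] markers with real citations.

    Any ID not in the actually-retrieved set is dropped rather than
    rendered -- the model can't cite a source it never saw.
    """
    def replace(match: re.Match) -> str:
        chunk_id = match.group(1)
        if chunk_id in retrieved_ids and chunk_id in CHUNKS:
            title, url = CHUNKS[chunk_id]
            return f"[{title}]({url})"
        return ""  # hallucinated ID: drop it instead of faking a source
    return re.sub(r"\[chunk:(\w+)\]", replace, answer)
```

Because the citation text is built from the chunk store rather than generated, it cannot drift from the actual source, which is what moved citation accuracy from 22% to 96%.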
After all this: 84% answer correctness, 91% retrieval@5, 96% citation accuracy.
The senior engineer who told me he didn't trust it now uses it daily.
The lessons that stuck
The data is the system. I kept thinking the model or the embeddings were the bottleneck. They never were. The bottleneck was always: bad chunks, missing metadata, stale documents, conflicting documents. Fix the data and the rest gets dramatically easier.
Eval is a forcing function. Without it, every change feels like progress. With it, you find out which changes actually were progress, and which were lateral moves dressed up as improvements.
Conflicts must be surfaced. Real corpora contradict themselves. A bot that picks one answer at random destroys trust. A bot that says "two sources disagree, here's both" earns trust.
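One way to sketch that "two sources disagree, here's both" behavior; the claim shape and wording are illustrative, not the production logic:

```python
from datetime import date

def surface_conflict(claims: list[dict]) -> str:
    """Given retrieved claims that disagree, name every source and lean
    on recency instead of silently picking one.

    Each claim is {"doc": str, "text": str, "updated": date} -- a
    made-up shape for this sketch.
    """
    ordered = sorted(claims, key=lambda c: c["updated"], reverse=True)
    lines = [
        f'{c["doc"]} (last updated {c["updated"].year}) says: {c["text"]}'
        for c in ordered
    ]
    lines.append(f'The more recent source is {ordered[0]["doc"]}.')
    return "\n".join(lines)
```

The hard part in practice is detecting that two chunks disagree at all; an LLM comparison pass over the retrieved set is one way, but once detected, the rendering really is this simple.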
Citations must be real. Before V3 I underestimated how much "the bot is making things up" was actually "the bot's citations don't match its claims." Fixing citation accuracy fixed the perception of quality more than fixing answer accuracy did.
Debug tooling for yourself, not just users. The "show me the chunks" mode probably saved me 100 hours of guessing. Build the introspection tools you'd want as the developer.
Stop reinventing. I made every mistake in this post about RAG failures. I should have read it first. Read other people's postmortems. Your problem is not unique.
What I'd do differently from day one
If I were starting this project today:
- Spend week one on the corpus, not the pipeline. Audit it. Find the scanned PDFs. Find the duplicates. Find the stale stuff. Have a plan for each.
- Build the eval harness in week two. 100 real questions, expected answers, a scoring script. Before any retrieval code.
- Skip vector-only search. Go straight to hybrid + reranker. The "we'll add reranking later" detour is wasted time.
- Treat citations as load-bearing, not as decoration. Make them verifiable from day one.
- Surface conflicts. Build the conflict-detection logic before you ship.
- Pick the model last. The model matters less than the data and the retrieval. With good chunks and good retrieval, even cheaper models do fine.
- Build the debug view for yourself. Future-you will thank present-you when something inevitably breaks at 11pm.
- Consider whether you even need RAG. For our 12k-document corpus, RAG was the right call. For a smaller corpus (say, a 200-page handbook), long context might be simpler. Don't reach for the complex tool first.
The thing I'm still learning
RAG systems are never "done." The corpus changes. New documents land. Old ones get deprecated. New question patterns emerge. New models come out that change what's possible.
V3 is in production. It's good. It's not the last version. There's already a V4 in my head — better handling of multi-turn conversations, integration with our internal tools via MCP, proactive suggestions when someone asks a question that has a known better answer elsewhere.
The mental shift, after three rebuilds, is from shipping a project to operating a system. RAG isn't a thing you build and walk away from. It's a thing you tend to.
The summary
I rebuilt the same RAG system three times because I kept skipping the unsexy work — auditing the corpus, building the eval, fixing the data. Every time I rebuilt it, the lesson was the same: the model is not the bottleneck. The data is.
If you're about to build a RAG system in 2026, do the eval first. Audit the corpus. Use hybrid search and a reranker. Surface conflicts. Make citations real. Build the debug view. And expect to operate the thing forever, not ship it once.
I would have saved nine months by reading this post in early 2025. So I wrote it now, in case it saves yours.
NovaKit is built on the lessons in this post — structure-aware ingestion, hybrid retrieval, conflict surfacing, and per-question model routing. BYOK, local-first, no markup.