Guides · March 13, 2026 · 11 min read

How to Build an AI Knowledge Base from Your PDFs, Notes, and Docs (2026 Guide)

Stop re-uploading the same files into ChatGPT. A personal AI knowledge base lets you chat with every document you own — PDFs, Markdown notes, Notion exports, Kindle highlights — privately and locally. Here's exactly how to build one.

TL;DR

  • A personal AI knowledge base = your documents + embeddings + a chat interface that can answer questions grounded in your own files.
  • For under 1M tokens of text, you can skip the complicated stack — just paste everything into a long-context model.
  • For serious amounts of content, you need chunking, embeddings, a vector store, and a retrieval layer.
  • You can build this in a weekend with open tools, or use a local-first BYOK client like NovaKit's knowledge base that handles it all without sending your files to a third party.
  • Privacy-minded version: keep the entire pipeline local except for the model API call.

Why a personal knowledge base is worth building

If you've ever:

  • Re-uploaded the same PDF to ChatGPT for the third time
  • Searched your Obsidian vault for "that one thing I wrote about X" and failed
  • Wished you could ask Claude "what did I save about [topic] last year?"
  • Lost a Kindle highlight because there's no good way to search across books

...then you want a knowledge base.

Think of it as "a search engine for everything you've ever read or written, with AI reading comprehension on top."

The three kinds of knowledge bases

Level 1: The "just paste it" knowledge base

You have less than 1M tokens of content (a handful of documents, or a single big PDF).

What to do: Use a long-context model. Paste everything in and ask questions. Gemini 2.5 Pro handles 2M tokens; Claude's 200k-token window is plenty for moderate use.

Pros: No infra, no embedding pipeline, no chunking decisions. Cons: Cost scales with every query, and you have to re-paste each session unless prompt caching is available.

Best for: A single big document you're analyzing; a small personal reference library.
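Before committing to any pipeline, it helps to check whether your corpus actually fits the paste-it path. A rough heuristic (about 4 characters per token for English; use a real tokenizer such as tiktoken when you need an exact count) is enough for this decision:

```python
def estimate_tokens(text: str) -> int:
    """Rough token estimate: ~4 characters per token for English text.

    This is a heuristic, not a tokenizer; use tiktoken or your provider's
    own counter when you need an exact number.
    """
    return max(1, len(text) // 4)


def fits_in_context(text: str, context_window: int = 1_000_000) -> bool:
    """Check whether a blob of text plausibly fits the 'just paste it' path."""
    return estimate_tokens(text) < context_window
```

If `fits_in_context` says no, move on to Level 2.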

Level 2: The full retrieval pipeline

You have thousands of documents, updated frequently, and you want fast per-query retrieval.

What to build:

  1. Ingestion: Convert every doc to text (PDF → text, Notion → Markdown, Apple Notes → text).
  2. Chunking: Split into ~500-1000 token pieces, with some overlap.
  3. Embeddings: Turn each chunk into a vector using an embedding model (OpenAI text-embedding-3-small, or local models like nomic-embed-text).
  4. Vector store: Index the vectors (SQLite + sqlite-vec, Qdrant, LanceDB, or cloud options).
  5. Retrieval: On each query, find top-K similar chunks.
  6. Prompt: Pass chunks + query to the LLM; generate grounded answer.

Pros: Scales to huge libraries, cheap per query, fresh. Cons: Real engineering. Chunk tuning, embedding choice, retrieval evaluation.
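The six steps above can be sketched end to end in plain Python. The `embed` function below is a toy hash-based stand-in for a real embedding model (in practice you would call text-embedding-3-small or nomic-embed-text instead), but the flow it illustrates (chunk, embed, index, retrieve, prompt) is the same:

```python
import hashlib
import math


def embed(text: str, dim: int = 256) -> list[float]:
    """Toy stand-in for a real embedding model: hash word n-grams into a
    fixed-size vector, then L2-normalize. Swap in a real API call here."""
    vec = [0.0] * dim
    words = text.lower().split()
    for i in range(len(words)):
        for n in (1, 2, 3):
            gram = " ".join(words[i:i + n])
            bucket = int(hashlib.md5(gram.encode()).hexdigest(), 16) % dim
            vec[bucket] += 1.0
    norm = math.sqrt(sum(v * v for v in vec)) or 1.0
    return [v / norm for v in vec]


def chunk(text: str, size: int = 800, overlap: int = 200) -> list[str]:
    """Fixed-size word chunks with overlap (step = size - overlap)."""
    words = text.split()
    return [" ".join(words[i:i + size])
            for i in range(0, len(words), size - overlap)]


def cosine(a: list[float], b: list[float]) -> float:
    return sum(x * y for x, y in zip(a, b))  # vectors are already unit-norm


class VectorStore:
    """In-memory index; a real build would use sqlite-vec, Qdrant, etc."""

    def __init__(self):
        self.items: list[tuple[str, list[float]]] = []

    def add(self, chunks: list[str]) -> None:
        self.items += [(c, embed(c)) for c in chunks]

    def top_k(self, query: str, k: int = 3) -> list[str]:
        qv = embed(query)
        ranked = sorted(self.items, key=lambda it: cosine(qv, it[1]),
                        reverse=True)
        return [text for text, _ in ranked[:k]]


def build_prompt(query: str, chunks: list[str]) -> str:
    context = "\n---\n".join(chunks)
    return f"Answer using only this context:\n{context}\n\nQuestion: {query}"
```

The retrieval and prompting logic barely changes when you swap in a real embedding model and vector store; only `embed` and `VectorStore` get replaced.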

Level 3: Hybrid with full-document context

The pattern most production systems end up using (see our RAG vs long-context post):

  1. Retrieve top-20 chunks (not top-3).
  2. Then pass the full surrounding section or file, not just the chunks.
  3. Use a long-context model to reason over the retrieved material.

This gets you scale and quality. Most "best of both worlds" KB implementations converge here.
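The expand step can be sketched in a few lines, assuming each indexed chunk carries a `section_id` pointing back at its parent section (the field names here are illustrative, not a fixed schema):

```python
from collections import OrderedDict


def expand_to_sections(ranked_chunks: list[dict],
                       sections: dict[str, str]) -> list[str]:
    """Given retrieval hits (best match first), return the full parent
    sections, deduplicated and kept in rank order, instead of bare chunks.

    ranked_chunks: dicts like {"section_id": ..., "text": ...}.
    sections:      mapping of section_id -> full section text.
    """
    seen: OrderedDict[str, str] = OrderedDict()
    for hit in ranked_chunks:
        seen.setdefault(hit["section_id"], sections[hit["section_id"]])
    return list(seen.values())
```

The long-context model then sees whole sections in relevance order, not isolated fragments.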

What to put in your knowledge base

Good raw material:

  • Your own writing: blog drafts, journal entries, meeting notes, design docs.
  • PDFs you've actually read: papers, ebooks, reports.
  • Kindle highlights (export via Your Clippings.txt or Readwise).
  • Notion / Obsidian / Bear exports.
  • Email newsletters you saved.
  • Web clippings (Instapaper, Pocket, Matter exports).
  • Important Slack / Discord messages (use export tools cautiously for privacy).

Don't dump random files hoping for the best. Signal-to-noise matters. 500 curated documents beat 50,000 junk docs.

Privacy: the whole point of building your own

If you care about data privacy, the reason to build your own knowledge base (instead of using a hosted service like NotebookLM, Mem, or Reflect) is that your files stay yours.

Privacy-maximal architecture:

  • Documents live on your own disk or local database.
  • Embeddings are generated locally (Ollama, LM Studio, or Python scripts with a local model).
  • Vector store is local (SQLite + sqlite-vec, or LanceDB).
  • Only the retrieved snippets + your question get sent to an LLM API — and only when you ask a question.
  • No "sync to our cloud for faster search" step ever.

Privacy-pragmatic architecture:

  • Everything local as above, except you use hosted embeddings (OpenAI's text-embedding-3-small is $0.02/M tokens — basically free).
  • Your raw documents never leave your machine; only the question and retrieved chunks go to the LLM API.

The difference: in the maximal version, zero content touches any external API except during a live chat query. In the pragmatic version, embeddings are computed once using a hosted model but your docs still don't sit on anyone else's server.

The step-by-step build

Option A: Use a BYOK client with built-in KB

The zero-infrastructure path:

  1. Use NovaKit or a similar client that has a built-in knowledge base feature.
  2. Drag-and-drop your PDFs, Markdown, and text files in.
  3. The client chunks, embeds (using your own BYOK key for the embedding provider), and stores everything locally in your browser's IndexedDB.
  4. Start chatting. Retrieval happens behind the scenes.

Time to set up: 15 minutes. Good for: Most personal use cases.

Option B: DIY with open tools

If you want full control and open-source everything:

  1. Ingestion: Use pdfplumber or pymupdf for PDFs. Point at your Obsidian vault for Markdown.
  2. Chunking: Use langchain's recursive text splitter or write your own — 800 tokens with 200 overlap is a reasonable default.
  3. Embeddings: Either text-embedding-3-small (OpenAI, $0.02/M), or self-host nomic-embed-text via Ollama (free, truly local).
  4. Vector store: LanceDB or Chroma locally. Both are dead simple.
  5. Retrieval: Cosine similarity top-K from your vector store.
  6. Chat layer: Any LLM API, pass retrieved chunks in the system prompt.

You can have this running end-to-end in a weekend.
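As a sketch of the ingestion step, here is a hypothetical walker over an Obsidian-style vault (the function name and document shape are illustrative); note that it keeps the filename metadata that later makes citations possible:

```python
from pathlib import Path


def ingest_vault(root: str, exts: tuple[str, ...] = (".md", ".txt")) -> list[dict]:
    """Walk a notes folder and return documents with the metadata
    (id, title, text) a retrieval layer needs for filtering and citations."""
    docs = []
    for path in sorted(Path(root).rglob("*")):
        if path.is_file() and path.suffix.lower() in exts:
            docs.append({
                "id": str(path.relative_to(root)),
                "title": path.stem,
                "text": path.read_text(encoding="utf-8", errors="replace"),
            })
    return docs
```

PDFs would go through pdfplumber or pymupdf first to produce the text, then join the same list.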

Option C: The "I just want to chat with my Notion" path

  • Export your Notion workspace as Markdown.
  • Feed the .zip into NovaKit's knowledge base or NotebookLM.
  • Done.

Chunking: the choice people mess up

Chunking is surprisingly important. Bad chunking ruins retrieval.

Bad chunking: fixed-size splits that cut sentences in half, separate headings from their content, lose context.

Good chunking:

  • Respect document structure. Split by sections (H1, H2) or by semantic units, not by raw token count.
  • Keep headings with their content. If you have to split a section, repeat the heading at the top of each child chunk.
  • 800-1200 tokens is a good starting range for most content.
  • 150-200 tokens of overlap prevents information loss across chunk boundaries.
  • Include metadata: filename, section, page number. This makes retrieval citations possible.
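A rough heading-aware chunker along those lines, using word counts as a stand-in for token counts (a real version would count tokens with your embedding model's tokenizer), might look like this:

```python
import re


def chunk_markdown(text: str, filename: str,
                   max_words: int = 800, overlap: int = 150) -> list[dict]:
    """Split Markdown by headings, keep each heading with its body, and
    repeat the heading at the top of every child chunk. Each chunk carries
    the metadata (source file, section) needed for citations."""
    chunks = []
    # Split into (heading, body) pairs; text before the first heading
    # gets an empty heading.
    parts = re.split(r"^(#{1,6} .+)$", text, flags=re.MULTILINE)
    sections = []
    if parts[0].strip():
        sections.append(("", parts[0]))
    for i in range(1, len(parts), 2):
        sections.append((parts[i], parts[i + 1] if i + 1 < len(parts) else ""))
    for heading, body in sections:
        words = body.split()
        step = max_words - overlap
        for start in range(0, max(len(words), 1), step):
            piece = " ".join(words[start:start + max_words])
            chunks.append({
                "text": (heading + "\n" + piece).strip(),
                "source": filename,
                "section": heading.lstrip("# ").strip(),
            })
    return chunks
```

Every child chunk of a long section repeats its heading, so a retrieved fragment never arrives without the context of what it was under.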

Evaluation: how do you know it's working?

A common trap: you build a knowledge base, ask one or two questions, get decent answers, and declare victory. Then three weeks later you hit a question it whiffs on, and you don't know why.

Build a "golden question set" — 20-50 questions whose correct answers you know — and re-run them whenever you change chunking, embedding model, or retrieval parameters. Without this, you're tuning blind.

Example golden questions to include:

  • Specific facts: "What was my conclusion about X in the Q3 design doc?"
  • Cross-document: "Which of my reading notes reference Y?"
  • Summarization: "Summarize my Kindle highlights about Z into 5 themes."
  • Temporal: "What did I write about [topic] before March 2026?"
  • Adversarial: "Does anything in my notes contradict the idea that X?"

If your KB fails on any of these, you have a measurable thing to fix.
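One lightweight way to run a golden set is to score recall@k over retrieval: each question records which source document a correct answer must come from, and the harness reports the failures worth investigating. This is a sketch (the function names and record shape are illustrative), not a full answer-quality evaluation:

```python
from typing import Callable


def evaluate(golden_set: list[dict],
             retrieve: Callable[[str, int], list[str]],
             k: int = 5) -> dict:
    """Score a retriever against a golden question set.

    golden_set: dicts like {"question": ..., "must_retrieve": source_id},
                where source_id is the document a correct answer needs.
    retrieve:   callable (question, k) -> list of source_ids, best first.
    Returns recall@k plus the failed questions to dig into.
    """
    failures = []
    for item in golden_set:
        hits = retrieve(item["question"], k)
        if item["must_retrieve"] not in hits:
            failures.append(item["question"])
    recall = 1 - len(failures) / len(golden_set)
    return {"recall_at_k": recall, "failures": failures}
```

Re-run this after every chunking or embedding change; a drop in recall@k tells you the change hurt before any user-facing answer does.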

Common mistakes

  1. Dumping everything. Curation > volume. Your KB is only as good as what's in it.
  2. Skipping chunk overlap. Single-sentence retrievals lose context.
  3. Using the default embedding model blindly. OpenAI's text-embedding-3-small is great for English. For multilingual or domain-specific, check alternatives.
  4. No metadata. Without filename/date/source, retrieval can't do filtering and citations become impossible.
  5. Never re-evaluating. Content drifts; add a monthly "does it still work?" ritual.

What to do with a working KB

Once you have one, the use cases explode:

  • Research assistant: "What have I read about [topic] in the last year?"
  • Writing companion: "I'm drafting a post about X — pull relevant context from my notes."
  • Decision memory: "What did I decide about Y in the design review two months ago?"
  • Meeting prep: "What do I know about [person/company] from my saved articles?"
  • Book concordance: "Find every time anyone I follow has written about Z."

The best personal productivity upgrade of 2026 isn't a new app. It's having all your content indexed and queryable by a good model.

The summary

  • Under 1M tokens? Skip the infra, just paste.
  • More than that? Build (or use) a proper retrieval pipeline.
  • Privacy matters? Keep the pipeline local; only the question and retrieved snippets leave your machine.
  • Evaluation beats vibes. Build a golden question set.
  • The ROI compounds: every new document adds to every future answer.

NovaKit has a local-first knowledge base built in. Drop your PDFs and Markdown in, chat across all of them, keep your files on your machine. BYOK for the model of your choice.


Stop reading about AI tools. Use the one you own.

NovaKit is a BYOK AI workspace — chat across providers, compare model costs live, and keep conversations on your device. No markup on tokens, no lock-in.

  • Bring your own keys
  • Private by default
  • All models, one workspace
