KBrain Concepts
How AI context retrieval works: indexing vs. copy-paste vs. web search
How indexing, copy-pasted context, and live LLM web search compare on hallucination, latency, and token cost, with the chunking, embedding, and retrieval mechanics behind each.
A knowledge base can reach a model in three common ways. Paste the raw content into the prompt. Let the model search the open web on its own at query time. Or retrieve the right slice from a pre-built index, which is what KBrain does. Each has a different cost profile on hallucination, latency, and tokens, and the differences come from mechanics, not marketing. This article walks through the architecture behind each one.
Most of what people call "AI context" is just paste. A document dropped into a chat window. A whole wiki exported into one prompt. It works, until the knowledge base grows past what fits, or the model starts drowning the right fact in a hundred irrelevant ones. Indexing is the fix, and it happens before the model ever sees a token.
What indexing actually is
Indexing is not storage. Storage is where documents sit. Indexing is what makes them searchable in milliseconds, at any scale, so the model gets a handful of relevant passages instead of a warehouse.
Three things happen before a knowledge base is ready to query.
- Chunking. Source content (documents, wikis, connected apps) gets split into passages. This is the highest-leverage decision in the pipeline. Split badly and even a perfect search step retrieves the wrong slice.
- Embedding. Each chunk becomes a vector, a numeric representation placed so that similar meaning sits close in space. Comparing vectors is how a search over meaning becomes possible at all.
- Indexing. The vectors are organized into a structure built for fast similarity search, so a query does not have to compare against every chunk one by one. This step runs once per change to the knowledge base. It does not run per question.
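The three steps above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not KBrain's implementation: `chunk_text`, `embed`, and `build_index` are hypothetical names, the embedding is a toy word-hashing vector rather than a learned model, and the "index" is a flat list rather than a structure built for fast similarity search.

```python
import hashlib
import math
import re


def chunk_text(text: str, max_words: int = 40, overlap: int = 10) -> list[str]:
    """Split text into overlapping word windows (one simple chunking strategy)."""
    if overlap >= max_words:
        raise ValueError("overlap must be smaller than max_words")
    words = text.split()
    step = max_words - overlap
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + max_words]))
        if start + max_words >= len(words):
            break  # the last window already reached the end of the text
    return chunks


def embed(text: str, dims: int = 64) -> list[float]:
    """Toy embedding: hash each word into a bucket, then L2-normalize.

    Assumption: a stand-in for a learned embedding model. It only captures
    shared words, whereas a real model places similar meanings close together.
    """
    vec = [0.0] * dims
    for word in re.findall(r"[a-z0-9]+", text.lower()):
        bucket = int(hashlib.md5(word.encode("utf-8")).hexdigest(), 16) % dims
        vec[bucket] += 1.0
    norm = math.sqrt(sum(v * v for v in vec)) or 1.0
    return [v / norm for v in vec]


def build_index(documents: dict[str, str]) -> list[tuple[str, str, list[float]]]:
    """Offline step: chunk every document and store (doc_id, chunk, vector) rows."""
    index = []
    for doc_id, text in documents.items():
        for chunk in chunk_text(text):
            index.append((doc_id, chunk, embed(chunk)))
    return index
```

The overlap between windows is one common way to reduce the chance that a sentence is cut in half at a chunk boundary. A production index would typically replace the flat list with an approximate nearest-neighbor structure so that a query avoids the one-by-one comparison.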
Where indexing sits
Indexing is offline work. It happens when content is added or updated, and produces something persistent: an index the retrieval step reads from. Retrieval is online work. It happens on every single question, and it is the part that touches MCP.
Keeping these separate matters. Indexing can be slow and thorough, because it happens once. Retrieval has to be fast, because it happens every time someone asks a question.
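One way to keep the offline step tied to changes rather than questions is to compare content hashes and re-index only what differs. The sketch below assumes that approach for illustration; `content_hash` and `plan_reindex` are hypothetical names, and nothing here describes how KBrain itself detects changes.

```python
import hashlib


def content_hash(text: str) -> str:
    """Stable fingerprint of a document's content."""
    return hashlib.sha256(text.encode("utf-8")).hexdigest()


def plan_reindex(
    documents: dict[str, str],
    indexed_hashes: dict[str, str],
) -> tuple[list[str], list[str]]:
    """Decide what the offline step has to do after a change.

    documents: current doc_id -> text
    indexed_hashes: doc_id -> hash recorded when the doc was last indexed
    Returns (doc_ids to index or re-index, doc_ids to drop from the index).
    """
    changed = [
        doc_id
        for doc_id, text in documents.items()
        if indexed_hashes.get(doc_id) != content_hash(text)
    ]
    removed = [doc_id for doc_id in indexed_hashes if doc_id not in documents]
    return changed, removed
```

With a plan like this, an unchanged knowledge base costs nothing to keep indexed, and the per-question path never touches chunking or embedding of documents.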
What happens at query time
A prompt arrives through KBrain's MCP connection. The question is converted into the same kind of vector used for the chunks, then compared against the index to find the closest matches. The best candidates are rescored for precision, and only the top few are passed to the model as context.
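The query-time flow described above (embed the question, shortlist by similarity, rescore, keep the top few) can be sketched as follows. This is a toy illustration under stated assumptions: the embedding is a bag-of-words count, the rescoring function is simple term overlap standing in for a real reranking model, and `retrieve`, `candidates`, and `top_k` are hypothetical names rather than KBrain's API.

```python
import math
import re
from collections import Counter


def tokenize(text: str) -> list[str]:
    return re.findall(r"[a-z0-9]+", text.lower())


def embed(text: str) -> Counter:
    """Toy sparse vector: word counts. Assumption: stands in for a learned embedding."""
    return Counter(tokenize(text))


def cosine(a: Counter, b: Counter) -> float:
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def rerank_score(query: str, chunk: str) -> float:
    """Fraction of distinct query terms found in the chunk (stand-in for a reranker)."""
    q_terms = set(tokenize(query))
    c_terms = set(tokenize(chunk))
    return len(q_terms & c_terms) / len(q_terms) if q_terms else 0.0


def retrieve(query: str, index: list[tuple[str, Counter]],
             candidates: int = 10, top_k: int = 3) -> list[str]:
    """Two-stage retrieval: similarity shortlist, then rescoring, then top_k."""
    query_vec = embed(query)
    # Stage 1: shortlist the closest chunks by vector similarity.
    shortlist = sorted(index, key=lambda row: cosine(query_vec, row[1]),
                       reverse=True)[:candidates]
    # Stage 2: rescore only the shortlist, then keep the best few.
    reranked = sorted(shortlist, key=lambda row: rerank_score(query, row[0]),
                      reverse=True)
    return [chunk for chunk, _ in reranked[:top_k]]


chunks = [
    "Rotate API keys every ninety days from the security settings page.",
    "The cafeteria menu changes every Monday.",
    "Expense reports are due at the end of each month.",
    "API rate limits reset every hour.",
]
index = [(chunk, embed(chunk)) for chunk in chunks]
context = retrieve("How do I rotate API keys?", index, candidates=3, top_k=2)
```

However many chunks the index holds, `retrieve` returns at most `top_k` of them, which is what keeps the context sent to the model small.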
Everything downstream of the index is bounded. A knowledge base with ten documents and a knowledge base with ten thousand documents send roughly the same amount of context to the model, because retrieval selects, it does not forward.
Why this beats pasting everything into the prompt
Fewer hallucinations. Without a selection step, irrelevant chunks compete with relevant ones for the model's attention, and the model has to find the signal itself. Models are also measurably worse at using facts buried in the middle of a long context than facts near the start or end (Liu et al., "Lost in the Middle," 2023, arXiv:2307.03172). Indexing removes most of the irrelevant material before the model ever sees it. This does not make hallucination zero. A retrieval step can still miss the right chunk. What it removes is the specific failure mode caused by drowning the model in unfiltered content.
Lower latency. A model has to process every token of context before it can start answering. A prompt carrying a handful of relevant chunks processes faster than one carrying an entire knowledge base, on every single call, not just the first one. The retrieval and reranking steps add their own small cost, but that cost stays roughly flat as the knowledge base grows, because the index avoids comparing the query against every chunk and the rescoring step only sees a fixed number of candidates. The cost of unfiltered context grows with every document added.
Fewer tokens. This is the most direct saving. Retrieving five relevant chunks instead of pasting fifty documents cuts the input token count by an amount that scales with the size of the knowledge base. The bigger the knowledge base gets, the bigger this gap gets. It also means the size of a knowledge base a person can usefully connect to a model is no longer capped by the model's context window. It is capped by retrieval quality instead, which is a problem indexing can keep improving on. The mechanics of that saving are covered in more depth in how KBrain reduces the tokens needed for a grounded answer.
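The arithmetic behind that gap is simple enough to write down. The figures below are illustrative assumptions only (2,000 tokens per document, 300 tokens per chunk, five chunks per answer), not measurements of KBrain or of any particular model.

```python
# Illustrative assumptions, not measurements.
TOKENS_PER_DOC = 2_000
TOKENS_PER_CHUNK = 300
TOP_K = 5


def pasted_tokens(num_docs: int, tokens_per_doc: int = TOKENS_PER_DOC) -> int:
    """Input tokens when the whole knowledge base is pasted into the prompt."""
    return num_docs * tokens_per_doc


def retrieved_tokens(top_k: int = TOP_K, tokens_per_chunk: int = TOKENS_PER_CHUNK) -> int:
    """Input tokens when only the top_k retrieved chunks are sent."""
    return top_k * tokens_per_chunk


for num_docs in (10, 50, 10_000):
    pasted = pasted_tokens(num_docs)
    retrieved = retrieved_tokens()
    print(f"{num_docs:>6} docs: pasted={pasted:>12,}  retrieved={retrieved:,}")
```

Under these assumptions the pasted prompt grows linearly with the number of documents while the retrieved context stays at 1,500 tokens, which is why the saving widens as the knowledge base grows.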
The third option: letting the model build its own context from the web
There is a second common alternative to indexing, one that is not copy-paste and not KBrain either. It is letting the model search the web itself at query time: issue a search query, fetch the top pages, read them, decide whether to search again, and repeat until it has enough to answer. This is sometimes called agentic retrieval, and it follows the same reasoning-and-acting loop described in the original ReAct pattern, where a model interleaves reasoning steps with tool calls instead of retrieving once and stopping (Yao et al., "ReAct: Synergizing Reasoning and Acting in Language Models," 2022, arXiv:2210.03629).
It solves a real problem: public information the model was not trained on, or that has changed since training. It also introduces a different set of costs than either copy-paste or indexed retrieval, worth naming precisely rather than lumping in with "context stuffing."
Latency compounds with each hop. Copy-paste pays for one large prompt per question. Indexed retrieval pays for a single lookup. Agentic web search pays per hop, so a question that needs two or three searches pays that cost two or three times, sequentially, because each search depends on the result of the last.
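A back-of-the-envelope model makes the difference in shape visible. Every number below is a made-up placeholder chosen for easy arithmetic, and the three functions are hypothetical names; the point is only which terms multiply and which do not.

```python
def paste_latency(prompt_tokens: int, prefill_tokens_per_s: float) -> float:
    """One large prompt: time is dominated by reading every pasted token."""
    return prompt_tokens / prefill_tokens_per_s


def indexed_latency(lookup_s: float, context_tokens: int,
                    prefill_tokens_per_s: float) -> float:
    """One index lookup plus reading a small, bounded context."""
    return lookup_s + context_tokens / prefill_tokens_per_s


def agentic_latency(hops: int, search_s: float, fetch_s: float, read_s: float) -> float:
    """Sequential hops: each hop waits for the previous one, so the costs add up."""
    return hops * (search_s + fetch_s + read_s)


# Placeholder inputs, for illustration only.
paste = paste_latency(prompt_tokens=100_000, prefill_tokens_per_s=5_000)
indexed = indexed_latency(lookup_s=0.25, context_tokens=1_500, prefill_tokens_per_s=5_000)
one_hop = agentic_latency(hops=1, search_s=0.5, fetch_s=1.0, read_s=1.5)
three_hops = agentic_latency(hops=3, search_s=0.5, fetch_s=1.0, read_s=1.5)
```

In this model the paste cost scales with prompt size, the indexed cost is a fixed lookup plus a small read, and the agentic cost is multiplied by the number of hops.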
Token cost is unbounded and hard to predict. Fetched web pages arrive with navigation menus, ads, related-article widgets, and cookie notices attached. None of that was chunked or filtered ahead of time, so the model either burns tokens reading it or burns a separate pass summarizing it down. Where indexed retrieval sends a fixed, small number of chunks regardless of corpus size, web search has no equivalent ceiling.
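To see how much of a fetched page can be page furniture rather than content, the sketch below extracts text from a small HTML sample twice: once keeping everything, once skipping a few tags that often hold boilerplate. It uses Python's standard `html.parser`; the tag list and the sample page are assumptions for illustration, and real pages are far messier than this.

```python
from html.parser import HTMLParser

# Assumption: tags that frequently wrap navigation, widgets, and notices.
BOILERPLATE_TAGS = frozenset({"nav", "header", "footer", "aside", "script", "style"})


class TextExtractor(HTMLParser):
    """Collect visible text, ignoring anything nested inside skip_tags."""

    def __init__(self, skip_tags: frozenset = frozenset()):
        super().__init__()
        self.skip_tags = skip_tags
        self.skip_depth = 0
        self.parts: list[str] = []

    def handle_starttag(self, tag, attrs):
        if tag in self.skip_tags:
            self.skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.skip_tags and self.skip_depth > 0:
            self.skip_depth -= 1

    def handle_data(self, data):
        if self.skip_depth == 0 and data.strip():
            self.parts.append(data.strip())


def extract_text(html: str, skip_tags: frozenset = frozenset()) -> str:
    parser = TextExtractor(skip_tags)
    parser.feed(html)
    parser.close()
    return " ".join(parser.parts)


page = (
    "<html><body><nav>Home Products Pricing Blog</nav>"
    "<main><p>Rotate API keys every ninety days.</p></main>"
    "<aside>Related: ten more articles</aside>"
    "<footer>We use cookies. Accept all.</footer></body></html>"
)
everything = extract_text(page)
main_only = extract_text(page, BOILERPLATE_TAGS)
```

In this sample most of the words in `everything` come from the navigation, sidebar, and footer. An indexed pipeline can do this kind of cleanup once, offline, whereas a model fetching pages at query time pays for it on every fetch.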
Results are not reproducible. A search engine's ranking changes over time and across regions, so the same question asked twice can retrieve different pages. That is fine for a casual query. It is a problem for anything that needs to be audited, compared, or debugged: without a fixed corpus, there is no stable ground truth against which to compute a metric such as recall@k.
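For reference, recall@k is the fraction of the known-relevant passages that appear in the top k retrieved results. A minimal sketch follows, assuming a hand-labeled evaluation set over a fixed corpus; the chunk IDs and function names are hypothetical.

```python
def recall_at_k(retrieved: list[str], relevant: set[str], k: int) -> float:
    """Share of the relevant chunk IDs that appear in the first k retrieved IDs."""
    if not relevant:
        raise ValueError("relevant must contain at least one chunk ID")
    hits = len(set(retrieved[:k]) & relevant)
    return hits / len(relevant)


def mean_recall_at_k(runs: list[tuple[list[str], set[str]]], k: int) -> float:
    """Average recall@k over (retrieved, relevant) pairs, one pair per test question."""
    return sum(recall_at_k(retrieved, relevant, k) for retrieved, relevant in runs) / len(runs)


# Hypothetical evaluation data: ranked results and labeled relevant chunks.
runs = [
    (["c1", "c7", "c3"], {"c3", "c9"}),  # finds one of two relevant chunks
    (["c4", "c2", "c8"], {"c4"}),        # finds the only relevant chunk
]
score = mean_recall_at_k(runs, k=3)
```

The metric only means something if the same question can be rerun against the same corpus, which is what a fixed index provides and a live search engine does not.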
Fetched content is untrusted by default. Content pulled from the open web can contain text deliberately crafted to redirect a model's behavior once it lands in the context window, a class of attack known as indirect prompt injection (Greshake et al., "Not What You've Signed Up For," 2023, arXiv:2302.12173). A pre-built index over a known, curated knowledge base does not eliminate this risk category entirely, but it is working from vetted sources chosen ahead of time rather than whatever a search engine ranked highest today.
The scope is different, and that is the real tradeoff. Web search reaches the public internet. Indexing reaches whatever knowledge base was connected to it, which is usually private: internal docs, a team's own wikis, a company's own data. These are not interchangeable. A question about a private codebase or an internal process has no web page to fetch in the first place. The honest comparison is not "which is better" in general. It is which failure mode a given question is exposed to.
What indexing does not fix
Indexing is not a hallucination guarantee. A stale index, a bad chunking strategy, or a search that misses the right passage can still send the model down the wrong path. The honest claim is narrower and more useful: indexing removes the failure mode caused by unfiltered, oversized context, and it does so while cutting latency and token cost at the same time. Whether it also improves the answer depends on the quality of what got indexed and how well it gets retrieved, which is exactly the part KBrain is built to keep tuning.
The practical takeaway
Pasting a knowledge base into a prompt works until it does not. Indexing is what makes a knowledge base usable at any size, answered from a handful of the right passages instead of everything at once, every time, for a fraction of the cost.
Indexing is one half of the story: it is what makes retrieval possible. The retrieval side, served over MCP, is what an assistant actually calls at query time.
Build your first knowledge brain
Subscribe to KBrain, create a brain from your expertise or your data, and make it available to Claude, ChatGPT, or any MCP compatible assistant.
Frequently asked questions
Is indexing the same as storage?
No. Storage is where the documents sit. Indexing organizes their vector representations so a search can find the relevant ones in milliseconds instead of scanning everything.
When does indexing happen?
Offline, whenever the knowledge base changes. Retrieval, the part that reads from the index, happens online, on every question.
Does indexing eliminate hallucination?
No. It removes the failure mode caused by unfiltered, oversized context. A retrieval step can still miss the right passage, so answer quality still depends on what got indexed and how well it gets retrieved.
Why does indexing save tokens?
Because retrieval sends only the relevant chunks instead of the whole knowledge base. The larger the knowledge base, the larger the saving, and the size of what a person can connect stops being capped by the model's context window.
Isn't letting the model search the web the same as retrieval?
It is a form of retrieval, but without a pre-built index. Each question triggers a fresh, sequential round of searching and fetching, which is slower and less predictable than a single lookup against an index, and it only reaches public pages, not a private knowledge base.