Beyond the Demo: Building Production-Grade RAG Systems

A RAG demo takes an afternoon. You pip-install a vector store, dump some PDFs into it, wire up an embedding model and a chat endpoint, ask it three questions you already know the answers to, and it nails all three. Everyone in the room nods. Ship it.

Then it meets real users with real questions over a real corpus, and the cracks show immediately: it confidently cites a document that says the opposite of its answer, it can’t find a policy that’s plainly sitting in the knowledge base, and it invents a refund window that has never existed. The demo wasn’t a lie exactly. It was a controlled experiment that happened to flatter you. The gap between that experiment and a system you’d put your name on is where most of the actual engineering lives, and almost none of it is about the language model.

Why demos lie

Demos lie by construction, not by malice. The questions are cherry-picked by the person who built the index, so they map cleanly onto the chunks that exist. The corpus is small enough that even mediocre retrieval surfaces the right passage by luck. And nobody asks the adversarial questions — the ones with no answer in the corpus, the ones where two documents disagree, the ones phrased the way an actual frustrated user phrases them at 11pm.

Scale changes the physics. At 50 documents, top-k retrieval is forgiving because there isn’t much to confuse it with. At 50,000, near-duplicates, outdated versions, and superficially-similar-but-wrong passages crowd the top of every result list. The model is only ever as good as what you hand it, and what you hand it is decided long before the LLM is invoked.

The model is not the system. The retrieval pipeline is the system, and the model is the part that writes it up nicely at the end.

Retrieval quality is the whole game

If you only invest in one thing, invest here. An average model with excellent retrieval beats a frontier model fed garbage context, every time.

Chunking is a design decision, not a default

The single most common mistake we see is treating chunking as a config value — chunk_size=512, move on. Chunk boundaries decide what can ever be retrieved together. Split a table from its header, or a clause from the sentence that scopes it, and the right answer becomes physically unretrievable no matter how good your embeddings are. Respect document structure: chunk on headings, keep list items with their stems, and use modest overlap so a concept that straddles a boundary survives. Then keep a bit of surrounding context (parent section, neighboring chunks) available at generation time so a retrieved fragment isn’t read out of context.
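As a rough illustration of structure-aware chunking, here is a minimal sketch. It assumes the documents use markdown-style `#` headings and that a word count is an acceptable stand-in for tokens; `Chunk`, `chunk_by_heading`, `max_words`, and `overlap` are hypothetical names, not part of any library.

```python
import re
from dataclasses import dataclass


@dataclass
class Chunk:
    heading: str  # nearest heading above the text, kept as context for generation
    text: str


def chunk_by_heading(doc: str, max_words: int = 120, overlap: int = 20) -> list[Chunk]:
    """Split on headings first, then window long sections with word overlap."""
    if overlap >= max_words:
        raise ValueError("overlap must be smaller than max_words")
    chunks: list[Chunk] = []
    heading, body = "", []

    def flush() -> None:
        # Emit the section collected so far as one or more overlapping windows.
        words = " ".join(body).split()
        step = max_words - overlap
        for start in range(0, len(words), step):
            chunks.append(Chunk(heading, " ".join(words[start:start + max_words])))
            if start + max_words >= len(words):
                break  # the last window already reached the end of the section

    for line in doc.splitlines():
        if re.match(r"^#{1,6}\s", line):  # assumption: markdown-style headings
            flush()
            heading, body = line.lstrip("#").strip(), []
        else:
            body.append(line)
    flush()
    return chunks
```

No window ever crosses a heading, and each chunk keeps its section title, so a retrieved fragment can be shown to the model with its parent context. Tables and lists would need their own handling on top of this.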

Metadata is free precision

Every chunk should carry structured metadata — source, section, document date, version, access scope. This lets you filter before you rank (only the current handbook, only docs this user may see) and it lets you cite precisely later. Pure semantic similarity has no concept of “newest” or “authoritative.” Metadata does.
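A small sketch of "filter before you rank", using the metadata fields listed above. The `ChunkMeta` type, the `eligible` rule (user may see it, and it is the current version of its source), and the shape of the inputs are assumptions made for illustration.

```python
from dataclasses import dataclass
from datetime import date


@dataclass(frozen=True)
class ChunkMeta:
    source: str
    section: str
    doc_date: date
    version: str
    access_scope: str


def eligible(meta: ChunkMeta, user_scopes: set[str], current_versions: dict[str, str]) -> bool:
    """True if the user may see the chunk and it is the current version of its source."""
    return (meta.access_scope in user_scopes
            and current_versions.get(meta.source) == meta.version)


def filter_then_rank(scored, user_scopes, current_versions):
    """scored: list of (ChunkMeta, similarity). Drop ineligible chunks, then sort."""
    kept = [(m, s) for m, s in scored if eligible(m, user_scopes, current_versions)]
    return sorted(kept, key=lambda pair: pair[1], reverse=True)
```

With this rule, an outdated handbook version is dropped even when its similarity score is higher than the current one, which is exactly the case pure semantic similarity gets wrong.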

Hybrid search, then re-rank

Dense vector search is great at meaning and bad at exact tokens — product SKUs, error codes, surnames, legal references. Keyword search (BM25, or Postgres full-text) is the opposite. Production systems run both and fuse the results; the combination reliably beats either alone. Then take the fused top ~50 candidates and pass them through a cross-encoder re-ranker that scores each against the query directly. The first stage optimizes recall (don’t lose the right chunk); the re-ranker optimizes precision (put it at position one). That two-stage shape — cheap-and-wide, then expensive-and-narrow — is the backbone of every serious retrieval pipeline.

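# Pseudocode: helper names are illustrative, not a real API
# Stage 1, cheap and wide: union of both retrievers, de-duplicated by chunk id (recall)
candidates = dedupe(vector_search(query, k=40) + keyword_search(query, k=40))
candidates = filter_by_metadata(candidates, metadata)  # scope, recency, perms
# Stage 2, expensive and narrow: score each candidate against the query (precision)
ranked     = cross_encoder.rerank(query, candidates)
context    = ranked[:6]                                 # fit the budget, with citations
answer     = llm.generate(query, context, "answer only from context; else say you don't know")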
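The text above says to fuse the two result lists without naming a method. One common choice is reciprocal rank fusion (RRF), sketched here as an assumption rather than as the method this article's systems use. It needs only the rank positions from each retriever, so dense and keyword scores never have to be put on a common scale. The constant `k=60` is a conventional default, not a tuned value.

```python
from collections import defaultdict


def reciprocal_rank_fusion(rankings: list[list[str]], k: int = 60) -> list[str]:
    """Fuse several ranked lists of chunk ids into one list, best first.

    Each list contributes 1 / (k + rank) for every id it contains, so an id
    ranked well by both retrievers outscores one ranked well by only one.
    """
    scores: dict[str, float] = defaultdict(float)
    for ranking in rankings:
        for rank, chunk_id in enumerate(ranking, start=1):
            scores[chunk_id] += 1.0 / (k + rank)
    # Sort by fused score; break ties by id so the output is deterministic.
    return sorted(scores, key=lambda cid: (-scores[cid], cid))


dense_hits = ["a", "b", "c"]    # hypothetical ids from vector search
keyword_hits = ["c", "a", "d"]  # hypothetical ids from keyword search
fused = reciprocal_rank_fusion([dense_hits, keyword_hits])
# fused == ["a", "c", "b", "d"]: ids found by both retrievers rise to the top
```

The fused list is what would then be handed to the re-ranker.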

Embeddings: choose deliberately, then measure

Embedding choice is a real trade-off, not a leaderboard-chasing exercise. Larger models capture more nuance, but they cost more per token, need more storage per vector, and add latency. Smaller models are cheaper and often perfectly adequate for a focused domain. Dimensionality directly drives your index size and memory footprint at scale.

Two things matter more than picking the “best” model. First, your domain may not match the model’s training distribution — embeddings tuned on general web text can be mediocre on dense legal, medical, or Turkish-language corpora. Second, changing the embedding model means re-embedding everything; it’s a migration, not a flag flip. So pin the model and its version in your metadata, and decide with your own eval set rather than someone else’s benchmark.
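To make the dimensionality point concrete, here is a back-of-the-envelope calculation. It assumes vectors are stored as 32-bit floats and counts only the raw vectors, not index structures or metadata; the chunk count and dimensions are hypothetical.

```python
def raw_vector_bytes(num_chunks: int, dims: int, bytes_per_value: int = 4) -> int:
    """Bytes needed for the raw vectors alone (float32 by default)."""
    return num_chunks * dims * bytes_per_value


def to_gib(num_bytes: int) -> float:
    return num_bytes / 2**30


# One million chunks at two hypothetical embedding sizes.
small = raw_vector_bytes(1_000_000, 768)    # 3,072,000,000 bytes, about 2.86 GiB
large = raw_vector_bytes(1_000_000, 1536)   # 6,144,000,000 bytes, about 5.72 GiB
```

Doubling the dimension doubles the footprint before any index overhead, which is worth knowing before an eval set tells you whether the larger model actually retrieves better on your corpus.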

The missing discipline: evaluation

Here is the line between a hobby project and a product. A hobby project is evaluated by vibes — the founder asks a few questions and feels good. A product is evaluated by a repeatable eval set you can run on every change.

Build it by hand if you must. Collect 50 to 200 real questions, label the chunks that should be retrieved and the answer that’s actually correct, and include the awkward cases: questions with no answer, questions spanning multiple documents, near-duplicate-but-wrong traps. Then measure two distinct things, because they fail independently:

Retrieval quality — precision and recall over your labeled chunks. Did the right passage make the top-k at all (recall), and how much noise came with it (precision)? Most “the LLM is dumb” complaints are retrieval-recall failures in disguise.
Answer faithfulness — is the generated answer actually supported by the retrieved context, or did the model embroider? This is separate from whether the answer is “good.” An answer can be fluent, helpful, and entirely unsupported.
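The retrieval half of that harness can be sketched in a few lines. This assumes each eval question is labeled with a set of chunk ids that should be retrieved; `precision_recall_at_k`, `evaluate`, and the `retrieve` callable are hypothetical names. Faithfulness needs a separate judgment step and is not covered here.

```python
from typing import Callable


def precision_recall_at_k(retrieved: list[str], relevant: set[str], k: int) -> tuple[float, float]:
    """Precision and recall of the top-k retrieved chunk ids against the labels."""
    if not relevant:
        raise ValueError("no labeled chunks; score no-answer questions on abstention instead")
    top_k = retrieved[:k]
    hits = sum(1 for chunk_id in top_k if chunk_id in relevant)
    precision = hits / len(top_k) if top_k else 0.0
    recall = hits / len(relevant)
    return precision, recall


def evaluate(eval_set: list[tuple[str, set[str]]],
             retrieve: Callable[[str], list[str]],
             k: int = 6) -> tuple[float, float]:
    """Mean precision@k and recall@k over questions that have labeled chunks."""
    rows = [precision_recall_at_k(retrieve(q), rel, k) for q, rel in eval_set if rel]
    if not rows:
        return 0.0, 0.0
    n = len(rows)
    return sum(p for p, _ in rows) / n, sum(r for _, r in rows) / n
```

Run `evaluate` before and after every chunking, re-ranker, or model change and compare the two numbers; a prompt tweak should not move them at all.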

Without this harness you are flying blind. You’ll “improve” the prompt, feel better, and have no idea you regressed retrieval for a quarter of your users. With it, every chunking tweak, every re-ranker swap, every model upgrade becomes a measurable bet instead of a vibe.

Grounding, citations, and the dignity of “I don’t know”

A production RAG answer should be traceable. Every claim should map back to a retrieved chunk, and the UI should surface those citations so a user, or an auditor, can verify them. This isn’t decoration; citations are among the cheapest hallucination defenses you have, because a model asked to ground each statement in provided sources tends to hallucinate less than one asked simply to “answer the question.”

The harder discipline is teaching the system to decline. If retrieval comes back with nothing above a relevance threshold, the correct output is “I don’t have that in the knowledge base,” not a confident guess assembled from the model’s parametric memory. Most teams skip this because demos never hit it. Production hits it constantly. A guardrail that gates generation on retrieval confidence — and refuses below it — turns your scariest failure mode (confident fabrication) into your most trustworthy one (honest abstention).
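A minimal sketch of such a gate, assuming the re-ranker returns a relevance score per chunk where higher is better. The threshold value, the `generate` callable, and the function name are hypothetical; the threshold in particular should come from your eval set, not from this example.

```python
from typing import Callable

ABSTAIN = "I don't have that in the knowledge base."


def answer_or_abstain(query: str,
                      ranked: list[tuple[str, float]],
                      generate: Callable[[str, list[str]], str],
                      min_score: float = 0.5) -> str:
    """ranked: (chunk_text, relevance_score) pairs, best first.

    The model is only called when at least one chunk clears the threshold,
    and it only ever sees the chunks that cleared it.
    """
    supported = [text for text, score in ranked if score >= min_score]
    if not supported:
        return ABSTAIN  # refuse instead of guessing from parametric memory
    return generate(query, supported)
```

Because the refusal happens before the generation call, an unanswerable question also costs nothing in model tokens.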

Latency, cost, and caching reality

Every stage you added for quality costs milliseconds and money. Embedding the query, dual retrieval, cross-encoder re-ranking, and a long-context generation call stack up fast, and re-ranking in particular is not free. Budget your latency before users do it for you.

Caching is the highest-leverage optimization here. Cache query embeddings (the same questions recur far more than you’d expect), cache retrieval results for hot queries, and cache full answers where freshness allows. Trim the context you send — six well-ranked chunks beat twenty mediocre ones on cost, latency, and faithfulness, because a smaller, cleaner context is harder to get lost in. Cost discipline and answer quality point the same direction more often than people assume.
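The first of those caches, for query embeddings, might look like the sketch below. It assumes an `embed` callable you supply and that lowercasing and collapsing whitespace is a safe normalization for your embedding model, which is not always true. `EmbeddingCache` is a hypothetical name, and the plain dict never evicts, so a real deployment would add a size bound or TTL.

```python
import hashlib
from typing import Callable


class EmbeddingCache:
    """In-memory cache of query embeddings, keyed by model version and normalized text."""

    def __init__(self, embed: Callable[[str], list[float]], model_version: str):
        self._embed = embed
        self._model_version = model_version
        self._store: dict[str, list[float]] = {}
        self.hits = 0
        self.misses = 0

    def _key(self, query: str) -> str:
        # Assumption: case and extra whitespace do not change the meaning of a query.
        normalized = " ".join(query.lower().split())
        raw = f"{self._model_version}\x00{normalized}"
        return hashlib.sha256(raw.encode("utf-8")).hexdigest()

    def get(self, query: str) -> list[float]:
        key = self._key(query)
        if key in self._store:
            self.hits += 1
        else:
            self.misses += 1
            self._store[key] = self._embed(query)
        return self._store[key]
```

Including the model version in the key means a model change cannot serve stale vectors from the old embedding space.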

Privacy, and the local-first option

For a lot of teams the most valuable knowledge — contracts, patient records, internal financials, source code — is exactly the data they can’t legally or comfortably ship to a third-party API. That constraint is why we’ve built local-first RAG engines, and they’re a clean illustration of the principles above.

One of ours is a Go application that ingests a directory of documents, embeds and stores them in PostgreSQL with pgvector, and serves grounded answers through a chat endpoint — with embeddings and generation both running on local models via Ollama, so no data leaves your infrastructure. The architectural choices are deliberate. Postgres means your vectors and your metadata live in one transactional store, so filtering by scope and recency is a WHERE clause next to the similarity search, not a second system to keep consistent. Go’s concurrency keeps ingestion fast over large corpora, where embedding throughput is usually the bottleneck. And local models turn the privacy question from a policy negotiation into an architectural guarantee — the trade-off being that you own the latency and the GPU budget instead of renting them. Local-first isn’t automatically better; it’s better when your data can’t leave, and then it’s the only option that ships.
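To show what "a WHERE clause next to the similarity search" can look like, here is a sketch that builds such a query. It is written in Python for illustration and is not the Go application's code. The `chunks` table and its columns are hypothetical; `<=>` is pgvector's cosine distance operator, and the `%s` placeholders follow the style used by Python PostgreSQL drivers such as psycopg.

```python
def build_search_query(query_vector: list[float], scope: str,
                       min_doc_date: str, limit: int = 40) -> tuple[str, tuple]:
    """Return (sql, params) for a scoped, recency-filtered similarity search."""
    sql = (
        "SELECT id, source, section, content, "
        "embedding <=> %s::vector AS distance "   # cosine distance, smaller is closer
        "FROM chunks "                            # hypothetical table name
        "WHERE access_scope = %s AND doc_date >= %s "
        "ORDER BY distance "
        "LIMIT %s"
    )
    # pgvector accepts a vector as a text literal of the form [x1,x2,...].
    vector_literal = "[" + ",".join(str(x) for x in query_vector) + "]"
    return sql, (vector_literal, scope, min_doc_date, limit)
```

The filter and the similarity ranking run in one statement against one store, so there is no second system whose permissions or versions can drift out of sync.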

What this means for you

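If you’re past the prototype and staring at production, the work is mostly unglamorous and entirely worth it. Before you ship, you should be able to check these off:

Chunking follows document structure, with modest overlap and surrounding context available at generation time.
Every chunk carries metadata (source, section, document date, version, access scope), and you filter on it before you rank.
Retrieval is hybrid, dense plus keyword, with a cross-encoder re-ranker on top.
The embedding model and its version are pinned, and the choice was made against your own eval set.
A repeatable eval set measures retrieval quality and answer faithfulness on every change.
Answers cite their sources, and the system declines when retrieval confidence is too low.
Latency and cost are budgeted, with caching wherever freshness allows.
Data that can’t leave your infrastructure doesn’t.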

None of this requires a frontier model or a research team. It requires treating retrieval as the product, evaluation as non-negotiable, and “I don’t know” as a feature. The demo earns you the meeting. This earns you the trust.