The Augmented Work.
An editorial on AI & the future of work
Article № 11 · RAG

Build a RAG, they said. It’s going to be easy, they said.

You followed the tutorial, embedded your docs, and now it answers questions. Here's why that isn't a production system yet — and how to measure how far off you actually are. For engineers and tech leads about to call a working demo “done.”

Issue May 2026
Read time 10 minutes
Filed under RAG · AI Engineering · Production Systems
Length 2,550 words
In brief

A retrieval-augmented generation demo and a production RAG system are not the same system at two different sizes. They are different systems. The demo you built from a tutorial — load some PDFs, embed them, do a similarity search, stuff the results into a prompt — works because everything about the demo was friendly: clean documents, one patient user, questions you picked yourself. Production is none of those things, and the gap between the two isn’t a matter of scaling up. It’s a matter of building the parts the tutorial quietly left out.

There are a lot of those parts. This piece walks through three of them — how your data gets in, how a query actually runs, and where the whole thing is allowed to live — because they’re the three that most reliably ambush a team that thought they were nearly done. They are not the whole list. They’re enough to make the point.

And the point is this:

You cannot fix what you cannot see. So before you argue with any of the above, do the one thing this article is actually asking you to do — pull 50 real user queries and score the answers honestly. If you can’t state that score, you don’t have a production system. You have a demo with users.

Think of it like a restaurant whose Yelp photos are stunning and whose food is mid. The photos got you in the door. The food is why nobody comes back.

§ 01 This is for you if you’re about to call a demo “production”

You built a RAG that works. It came out of a course, a tutorial, a weekend hackathon, or a quick internal v0, and it answers questions in the demo convincingly enough that someone — maybe you — is ready to put it in front of real users.

This is written to talk you out of that, or at least to make you measure first. If you’re still deciding whether to use RAG at all, this isn’t for you yet. And if you’re looking for the confident “we have RAG” line for the next roadmap review, you should especially keep reading, because that line is the thing this article is about.

§ 02 The gap is categorical, not incremental

Here is the tutorial version of RAG, more or less complete: parse the documents, split them into chunks, embed the chunks into vectors, store the vectors, retrieve the closest ones to the query, and paste them into the prompt. It’s about forty lines of code, and it’s a genuinely useful way to understand RAG.
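
For orientation, here is roughly what that tutorial looks like, as a toy sketch rather than anyone's reference implementation. Everything in it is an assumption made for illustration: `embed` is a word-count stand-in for a real embedding model, `build_prompt` stops short of calling an LLM, and the two sample documents are made up.

```python
import math
import re
from collections import Counter

def chunk(text: str, size: int = 40) -> list[str]:
    """Split text into fixed-size word windows (the naive tutorial approach)."""
    words = text.split()
    return [" ".join(words[i:i + size]) for i in range(0, len(words), size)]

def embed(text: str) -> Counter:
    """Toy stand-in for an embedding model: lowercase word counts."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))

def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse word-count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

def build_index(docs: list[str]) -> list[tuple[str, Counter]]:
    """Chunk every document and store each chunk next to its vector."""
    return [(c, embed(c)) for doc in docs for c in chunk(doc)]

def retrieve(index: list[tuple[str, Counter]], query: str, k: int = 2) -> list[str]:
    """Return the k chunks most similar to the query."""
    q = embed(query)
    ranked = sorted(index, key=lambda item: cosine(q, item[1]), reverse=True)
    return [text for text, _ in ranked[:k]]

def build_prompt(query: str, contexts: list[str]) -> str:
    """Paste the retrieved chunks into the prompt; a real demo sends this to an LLM."""
    joined = "\n---\n".join(contexts)
    return f"Answer using only this context:\n{joined}\n\nQuestion: {query}"

# Made-up sample documents, friendly by construction.
docs = [
    "Refunds are issued within 14 days of a return request.",
    "Support is available on weekdays from 9am to 5pm.",
]
index = build_index(docs)
top = retrieve(index, "How many days do refunds take?", k=1)
prompt = build_prompt("How many days do refunds take?", top)
```

Note what is absent: no parsing of real files, no persistence, no evaluation, no concurrency. One process, one user, one shot.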

It is not a smaller version of a production system. It’s a different kind of object: an engineering system with whole categories of complexity that standard developer guides omit. Production RAG is better understood as an architecture of roughly a dozen decoupled subsystems (ingestion, chunking, embedding, indexing, query transformation, retrieval, reranking, context assembly, generation, grounding, evaluation, observability), each with its own failure mode, and each capable of quietly dragging the whole system down. These subsystems act as a chain: the weakest link dictates the quality of the output.

[Figure: architecture, side by side. The tutorial (~40 lines of code): Parse → Chunk → Embed → Retrieve → Generate; one process · one user · one shot. Production (~12 decoupled subsystems): Exhibit 01 · Ingestion (offline, runs on documents); an async boundary (message queue); Exhibit 02 · Query-time (background workers, seconds to minutes); Exhibit 03 · Deployment perimeter.]
Fig. 01 The tutorial pipeline vs. the production architecture it’s mistaken for.

You don’t need all twelve to feel the difference. Three are enough — and I’ve picked them deliberately, because each one is invisible from inside the tutorial and each one alone pushes the project past “beginner.” Two are surprises the tutorial hid from you. One is a constraint that’s decided before you write a line of code.

Exhibit 01 · Ingestion: the front door the tutorial fakes

The tutorial hands you five clean PDFs and a one-line loader. Production hands you a folder nobody has curated: scanned faxes, decade-old reports, spreadsheets with merged cells, three slightly different versions of the same document, and PDFs that are really just photographs of text.

Two questions decide how hard your ingestion problem is, and the tutorial asks neither. First: what is the data, really? Clean text is easy; everything else is not. Standard parsers read multi-column layouts straight across the page, mixing sentences from separate columns and destroying the meaning — and because embedding models depend on coherent local context, that jumbled text produces distorted vectors, retrieval fails, and the model is left to synthesize answers from irrelevant context, which is a primary driver of hallucination in production. Tables are worse: flattening a table into a string of numbers loses the row-and-column associations that gave the numbers meaning in the first place. Scanned pages need OCR the tutorial never mentioned.
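
The multi-column failure is easy to reproduce in miniature. The sketch below assumes a parser that hands you text lines with page coordinates; the `Span` type, the `gutter_x` cut-off, and the sample page are all hypothetical, and real layout analysis has to detect the columns rather than be told where they are.

```python
from typing import NamedTuple

class Span(NamedTuple):
    x: float   # left edge of the text line on the page
    y: float   # top edge; larger means lower on the page
    text: str

def read_across(spans: list[Span]) -> str:
    """Naive order: top-to-bottom, then left-to-right, ignoring columns."""
    return " ".join(s.text for s in sorted(spans, key=lambda s: (s.y, s.x)))

def read_by_column(spans: list[Span], gutter_x: float) -> str:
    """Column-aware order: finish the left column before starting the right."""
    left = sorted((s for s in spans if s.x < gutter_x), key=lambda s: s.y)
    right = sorted((s for s in spans if s.x >= gutter_x), key=lambda s: s.y)
    return " ".join(s.text for s in left + right)

# A made-up two-column page: two sentences, each broken over two lines.
page = [
    Span(50, 100, "Refunds are issued"),
    Span(320, 100, "Support is open"),
    Span(50, 115, "within 14 days."),
    Span(320, 115, "on weekdays only."),
]

jumbled = read_across(page)
# "Refunds are issued Support is open within 14 days. on weekdays only."
coherent = read_by_column(page, gutter_x=200)
# "Refunds are issued within 14 days. Support is open on weekdays only."
```

Embed the jumbled string and the vector describes a sentence nobody wrote. Nothing downstream can tell.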

Second: do you control the source, or is a user uploading whatever they want? A predictable corpus you own — known formats, known schema, a known update cadence — is a hard but tractable pipeline. Arbitrary user uploads mean you have to handle every format’s parsing, cleaning, and chunking before a single embedding exists. Get this layer wrong and nothing downstream can save you. It is the most literal garbage-in, garbage-out in the system — except the garbage comes back out wearing a confident tone.

Exhibit 02 · Execution: your query is a long-running job, not a function call

This is the one that breaks the mental model the tutorial gave you. In the tutorial, a query is a function call: question in, answer out, you wait a beat, it returns. That model is wrong in production, in two directions at once.

A real query is a long-running task. It can take seconds — or, increasingly, minutes. The moment you add the techniques that make answers good, latency climbs: where standard RAG completes in one to two seconds, agentic multi-hop retrieval routinely takes 15 to 20 seconds and occasionally several minutes, because of network round-trips, sequential tool execution, cross-encoder reranking, and multiple sequential LLM calls. Minutes change the category of system you’re building. The result has to outlive the user’s session, so it must be persisted and fetchable by a job ID after they’ve closed the tab. You owe them visible progress, because a four-minute spinner reads as “broken.” And long tasks fail often enough — timeouts, transient errors, a model falling over mid-run — that checkpointing and resumable retries stop being polish and become the difference between a working product and one that restarts from zero at minute three.
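
A minimal sketch of what checkpointing and resumable retries can mean in practice, assuming each pipeline stage is a function from state to state and that a JSON file per job is an acceptable stand-in for a real durable store. The function name, the file layout, and the `Step` alias are illustrative, not a prescribed design.

```python
import json
from pathlib import Path
from typing import Callable

# A step is a (name, function) pair; the function takes and returns the job's data.
Step = tuple[str, Callable[[dict], dict]]

def run_with_checkpoints(job_id: str, steps: list[Step], store: Path) -> dict:
    """Run steps in order, persisting state after each so a retry can resume."""
    path = store / f"{job_id}.json"
    if path.exists():
        state = json.loads(path.read_text())   # resume an earlier attempt
    else:
        state = {"done": [], "data": {}}
    for name, fn in steps:
        if name in state["done"]:
            continue                            # finished earlier; do not redo it
        state["data"] = fn(state["data"])       # may raise; the checkpoint survives
        state["done"].append(name)
        path.write_text(json.dumps(state))      # checkpoint, and a progress record
    return state
```

The same file doubles as the progress signal: `len(state["done"])` out of `len(steps)` is something you can show a user instead of a four-minute spinner. If a step raises, call the function again with the same `job_id` and it picks up at the failed step.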

[Figure: wall-clock comparison, “How long is a query, really?” Tutorial · blocking call: completes in ~1.5s. Production · agentic multi-hop: ~18s, sometimes minutes. Production stages: queued 0.4s · query transformation 1.2s · hybrid retrieval 2.6s · cross-encoder rerank 3.5s · context assembly 1.1s · agentic generation 7.5s · grounding + citation 1.7s.]
Fig. 02 The demo finishes before you can blink; the production job is a job. A tutorial query and a production agentic query, run on the same wall clock.

This is why production serving looks nothing like the tutorial’s blocking call. The established pattern decouples submission from execution: the frontend validates the request, writes it to a message queue, and immediately returns an HTTP 202 with a task ID, while background workers pull the job, run it, and persist its state from pending to working to completed or failed. The client then polls or waits for a webhook. Streaming the answer back token-by-token — the thing you probably think of as “the hard part of the UI” — is real work too (it means committing to Server-Sent Events or WebSockets and their reconnection, ordering, and proxy-buffering quirks), but it’s only the visible tip of this iceberg.
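
The submit-then-poll shape can be sketched with nothing but the standard library. This is a sketch under loud assumptions: an in-memory `queue.Queue` stands in for a real message queue, a dict stands in for a durable job store, `answer` is a placeholder for the actual pipeline, and `submit` and `poll` return status codes as plain integers rather than living behind a web framework.

```python
import queue
import threading
import uuid

jobs: dict[str, dict] = {}        # task_id -> {"status": ..., "result": ...}
job_queue: queue.Queue = queue.Queue()

def submit(question: str) -> tuple[int, dict]:
    """Frontend path: validate, enqueue, return 202 and a task ID immediately."""
    if not question.strip():
        return 400, {"error": "empty question"}
    task_id = uuid.uuid4().hex
    jobs[task_id] = {"status": "pending", "result": None}
    job_queue.put((task_id, question))
    return 202, {"task_id": task_id}

def poll(task_id: str) -> tuple[int, dict]:
    """Client path: fetch the current state of a task by its ID."""
    job = jobs.get(task_id)
    return (200, dict(job)) if job else (404, {"error": "unknown task"})

def answer(question: str) -> str:
    """Placeholder for the real multi-step RAG pipeline."""
    return f"(answer to: {question})"

def worker() -> None:
    """Background path: pull a job, run it, record pending -> working -> done."""
    while True:
        task_id, question = job_queue.get()
        jobs[task_id]["status"] = "working"
        try:
            jobs[task_id].update(status="completed", result=answer(question))
        except Exception as exc:
            jobs[task_id].update(status="failed", result=str(exc))
        finally:
            job_queue.task_done()

threading.Thread(target=worker, daemon=True).start()
```

The point of the shape is that `submit` never waits on `answer`. Swap the dict for a database and the result also outlives the process, not just the request.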

And you run many of these at once. Production isn’t one patient user; it’s dozens firing concurrent queries, each spawning its own multi-step job, all contending for the same finite GPU capacity. Because LLMs generate output one token at a time, with a fresh forward pass for each token, serving engines hit scheduling and memory bottlenecks that don’t exist in ordinary web apps. When GPU memory is exhausted, requests stall in the queue for tens of seconds unless the scheduler actively prioritizes shorter jobs or applies backpressure at the gateway. Keeping throughput sane under load takes real machinery, including continuous batching and virtual-memory-style paging of the KV cache, none of which the single blocking call in your tutorial ever hinted at.
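
To make "prioritize shorter jobs or apply backpressure" concrete, here is a gateway-side sketch. It is not how any particular serving engine schedules work; the `AdmissionQueue` class, the `est_tokens` cost estimate, and the choice of 429 as the shed response are all assumptions for illustration.

```python
import heapq
import itertools
from typing import Optional

class AdmissionQueue:
    """Bounded priority queue: cheapest estimated job first, reject when full."""

    def __init__(self, max_pending: int) -> None:
        self.max_pending = max_pending
        self._heap: list[tuple[int, int, str]] = []
        self._order = itertools.count()   # tie-breaker keeps FIFO among equals

    def submit(self, job_id: str, est_tokens: int) -> int:
        """Return an HTTP-style status: 202 accepted, 429 shed as backpressure."""
        if len(self._heap) >= self.max_pending:
            return 429
        heapq.heappush(self._heap, (est_tokens, next(self._order), job_id))
        return 202

    def next_job(self) -> Optional[str]:
        """Pop the job with the smallest estimated cost, or None if idle."""
        return heapq.heappop(self._heap)[2] if self._heap else None
```

Rejecting at the door is the honest version of a stall: the client learns in milliseconds that it should retry, rather than in tens of seconds that it timed out. Pure shortest-first can starve long jobs, so a real scheduler would also age waiting work.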

Exhibit 03 · Deployment: “it has to run on our servers” rewrites everything

The tutorial calls the OpenAI API. Three lines, and the hardest infrastructure problem in your system belongs to someone else. Then you win a client in government, finance, or healthcare, and the requirement arrives in a single sentence: the data cannot leave our infrastructure.

That sentence rewrites the architecture. On-premise means there is no managed embedding API, no managed vector database, no managed LLM. You now self-host the embedding model, the vector store, and an open-weights LLM — a Llama, Mistral, or Qwen — along with the GPUs and the ops to keep all three alive. The constraint usually arrives with a second clause, too: be LLM-agnostic, so you’re not betting the system on one provider’s quirks and can swap the model underneath without rewriting your prompts and parsing. That’s an abstraction layer the tutorial never needed, because the tutorial was the hardcoded version. Three API calls become three pieces of infrastructure you own — and this requirement is increasingly common well outside the usual US and EU framing, including across regulated sectors in Africa and the Middle East.
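
The LLM-agnostic clause usually comes down to one narrow interface that the rest of the system depends on. A minimal sketch, assuming a single `complete` method is enough for your use case; `ChatModel`, `EchoModel`, and `RagAnswerer` are hypothetical names, and `EchoModel` is a test double, not a real backend.

```python
from dataclasses import dataclass
from typing import Protocol

class ChatModel(Protocol):
    """The only surface the RAG code is allowed to know about."""
    def complete(self, system: str, user: str) -> str: ...

@dataclass
class EchoModel:
    """Hypothetical stand-in backend; a real one would call a self-hosted model."""
    name: str

    def complete(self, system: str, user: str) -> str:
        return f"[{self.name}] {user}"

@dataclass
class RagAnswerer:
    model: ChatModel

    def answer(self, question: str, contexts: list[str]) -> str:
        # Prompt assembly lives here, once, independent of which model runs it.
        system = "Answer only from the provided context."
        user = "Context:\n" + "\n".join(contexts) + f"\n\nQuestion: {question}"
        return self.model.complete(system, user)

rag = RagAnswerer(model=EchoModel("local-a"))
first = rag.answer("What is the refund window?", ["Refunds: 14 days."])
rag.model = EchoModel("local-b")   # swap the backend; no caller changes
second = rag.answer("What is the refund window?", ["Refunds: 14 days."])
```

The interface is the cheap part. Each real backend still brings its own chat template, context limit, and output quirks, which is where the abstraction earns its keep or leaks.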

§ 03 So run the honest audit: you can’t fix what you can’t see

Everything above is abstract until you measure your own system. So here is the one thing to actually do, and it takes an afternoon.

  1. Pull 50 real user queries. Not the questions you used in the demo — those are friendly by construction. Real ones: what users actually typed, or what they would type, in all their vague, misspelled, multi-part glory. If you’re not live yet, get domain users to write them; do not write them yourself.
  2. Run them through the system as it stands. No fixing on the fly, no “well, if I rephrase it.” Capture what a real user would get.
  3. Score each answer blind, on a three-point scale: useful (a user would be satisfied), partly (some truth, but wrong, incomplete, or misleading enough to need a human), or wrong (incorrect, hallucinated, or empty). Score the output, not your knowledge of how hard the question was.
  4. Read the number. The share of useful answers is your real baseline — the rest is the gap between your demo and your production system.
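
If you keep the scores in a spreadsheet or a dict, the arithmetic in step 4 is a few lines. A small sketch, assuming scores are stored as a mapping from query ID to one of the three labels; the function name and the sample split in the usage note are made up.

```python
from collections import Counter

LABELS = ("useful", "partly", "wrong")

def summarize(scores: dict[str, str]) -> dict[str, float]:
    """Return the share of each label across all scored queries."""
    unknown = set(scores.values()) - set(LABELS)
    if unknown:
        raise ValueError(f"unexpected labels: {sorted(unknown)}")
    total = len(scores)
    if total == 0:
        return {}
    counts = Counter(scores.values())
    return {label: counts[label] / total for label in LABELS}
```

For example, 24 useful, 16 partly, and 10 wrong out of 50 gives shares of 0.48, 0.32, and 0.20. The 0.48 is the baseline; the other 0.52 is the gap.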
[Interactive scorecard: the honest audit, 50 queries, with a running count of answers scored and the useful share. Score at least a few before reading the percentage; small samples lie.]
Fig. 03 Score 50 real answers: your baseline, not anyone else’s.

A word on what to expect, stated plainly as experience rather than a study: there is no clean industry benchmark for “what percentage of production RAG answers are useful,” and you should distrust anyone who quotes you one. In my own experience auditing these systems, a first honest score commonly lands somewhere in the 40–60% useful range — far below the impression the demo gave. That number is not a law; it’s a warning about the size of the gap the demo hides. Your job is to find your number, not to trust mine.

§ 04 The five silent failure modes eating your score

Once you have a bad number, the next question is why. Production RAG systems rarely fail loudly — they fail quietly, in a handful of recurring ways, and most struggling systems have several at once. Here are the five that do the most damage:

The five quiet failures · diagnostic reference

01 of 05 · Stale embeddings
  What it looks like in your audit: Answers cite the old policy, the old price, the old org chart. The source was updated; the index wasn’t.
  Why the demo hid it: The demo ran once, on data that never changed.

02 of 05 · Naive chunking
  What it looks like in your audit: Answers are confidently half-right: the relevant fact got split across two chunks, or buried with unrelated text.
  Why the demo hid it: Clean tutorial docs chunk cleanly; messy real ones don’t.

03 of 05 · Missing or unused metadata
  What it looks like in your audit: The system can’t filter by date, source, department, or permission, so it retrieves plausible-but-wrong context.
  Why the demo hid it: Five demo files need no metadata; ten thousand real ones do.

04 of 05 · Retrieval-grounded hallucination
  What it looks like in your audit: The right documents were retrieved and the answer is still wrong: the model ignored or misread the context.
  Why the demo hid it: Friendly questions rarely expose unfaithful generation.

05 of 05 · No evaluation (meta)
  What it looks like in your audit: You can’t actually tell which of the above is happening, or whether a change helped or hurt.
  Why the demo hid it: The demo’s success metric was “it answered at all.”

Fig. 04 The five quiet failures, and the one that hides the other four.
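
For the stale-embeddings failure, one plausible way to confirm the suspicion is to compare a hash of each source document today with the hash recorded when it was embedded. This is an assumption about how you might check, not a prescribed procedure; it presumes your index stored a content hash per document, and `find_stale` is a hypothetical helper.

```python
import hashlib

def content_hash(text: str) -> str:
    """Stable fingerprint of a document's current text."""
    return hashlib.sha256(text.encode("utf-8")).hexdigest()

def find_stale(sources: dict[str, str], indexed_hashes: dict[str, str]) -> list[str]:
    """Return IDs whose current text no longer matches what was embedded.

    Documents that were never indexed at all are reported as stale too.
    """
    return sorted(
        doc_id
        for doc_id, text in sources.items()
        if indexed_hashes.get(doc_id) != content_hash(text)
    )
```

If this returns anything, the answers citing the old policy are not a model problem. They are a pipeline that ran once.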

That last row is the one that matters most, because it’s the one that hides the other four. Evaluation runs offline and online scoring loops to track retrieval precision, recall, and answer faithfulness — without it, every change you make is a guess, and you have no way to know whether today’s “fix” quietly broke yesterday’s working answers. The audit you just ran is the manual, one-time version of this. Mature systems run it automatically, on every change.
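
The retrieval half of that loop is simple enough to sketch. Assuming you have hand-labeled which document IDs are relevant for each audit query, and that retrieved IDs are unique, precision and recall at k look like this; the function name is illustrative, and answer faithfulness needs a separate judgment that this does not cover.

```python
def precision_recall_at_k(
    retrieved: list[str], relevant: set[str], k: int
) -> tuple[float, float]:
    """Precision and recall over the top-k retrieved document IDs."""
    top_k = retrieved[:k]
    hits = sum(1 for doc_id in top_k if doc_id in relevant)
    precision = hits / len(top_k) if top_k else 0.0   # how much of top-k was relevant
    recall = hits / len(relevant) if relevant else 0.0  # how much relevant was found
    return precision, recall
```

Run it over the same fixed query set before and after every change. A fix that raises the score on one query and lowers it on five is the kind of regression you otherwise find out about from users.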

§ 05 Maturity and scale are different axes; don’t confuse them

Here is the trap that lets good teams call a v1 “production.” It feels like progress to go from works for me to works for the whole team — and it is. But it’s progress along the wrong axis for the problem this article is about.

There are two independent axes, and the demo makes them look like one:

[Figure: Scale × Maturity, two independent axes, four quadrants. Demo: low scale · low maturity. Mature but small: low scale · high maturity. Broadly wrong (the danger zone): high scale · low maturity; most “we have RAG” systems are here. Production (the goal): high scale · high maturity.]
Fig. 05 Scaling moves you right. Only evaluation moves you up.

You can move a long way along one and not move at all along the other. The dangerous place to be — and where a lot of “we have RAG” systems actually live — is high scale, low maturity: serving the whole company answers that nobody has measured. You can solve every concurrency problem flawlessly and still be confidently wrong, just faster and for more people. Scaling a system whose answers you’ve never scored doesn’t make it mature. It makes it broadly wrong.

This article is about the maturity axis. Most teams climbed the scale axis, saw real movement, and mistook it for the same thing.

§ 06 Why nobody says this out loud

If production RAG is this much harder than the demo, why does every roadmap review still get the cheerful “we have RAG” answer?

Because the demo worked. The launch slide promised. The early data was friendly, the stakeholders were impressed, and walking that back later is expensive — politically more than technically. “We have RAG” is a much easier sentence to say in a status meeting than “we have a v1 that works for one user on questions we picked ourselves.” The incentive is to keep the first sentence alive as long as possible, and the system’s quiet failures make that easy, because nobody scored them.

The audit is how you replace the comfortable sentence with a true one. A true sentence is less fun to say once and far less expensive to live with.

§ 07 Where to go from here

This piece had one job: to convince you the gap is real and to hand you the one tool — the blind audit — that measures your version of it. That’s deliberately where it stops. The fixes are each their own article, and the good news is dull: the fix is almost never a bigger model. It’s engineering.

The boring, effective moves, as signposts rather than instructions:

And remember these three exhibits were a sample, not the catalogue. Security, access control, PII handling, cost ceilings, prompt-injection defense, and the day your embedding model is deprecated are all waiting behind them. The point was never the specific list. It’s that the list is long, and the tutorial showed you none of it.

So before the next roadmap review: pull your 50 queries and get your number. The restaurant’s photos will always look great. Whether anyone comes back depends entirely on the food.

References & further reading
  1. The production RAG architecture: on RAG as a ~dozen-subsystem architecture distinct from tutorial pipelines, the chain-of-weakest-link failure model, and per-subsystem failure modes including evaluation. (Compiled research, 2026.)
  2. Document parsing for RAG: on standard parsers breaking multi-column layouts and table structure, the resulting vector distortion, and parsing quality as the bound on downstream quality. Omdena; Elastic Search Labs; IBM Docling. (2026.)
  3. Asynchronous serving and task-queue patterns for LLM/RAG: on decoupled submission/execution, HTTP 202 + task ID + polling/webhook, and duration-to-pattern thresholds. (Continuous-batching analysis, 2026.)
  4. Agentic multi-hop retrieval and latency: standard RAG at 1–2s vs. agentic at 15–20s and occasionally minutes, and the causes of latency accumulation. (Compiled research, 2026.)
  5. Compute contention and continuous batching: GPU-bound inference, head-of-line blocking and queue stalls under memory exhaustion, and backpressure. ORCA (OSDI 2022); vLLM. (2022–2026.)
  6. SSE vs. WebSockets for token streaming: transport choices and proxy-buffering gotchas for streaming LLM output to a browser. (Engineering write-ups, 2026.)
  7. On-premise AI in regulated, non-US/EU markets: data-residency requirements driving self-hosted deployment. Checkmarble, “On-prem in Africa.” (2026.)