Production RAG That Says "I Don't Know" Instead of Hallucinating

2026-06-13

RAG

AI Engineering

LLM

pgvector

Reliability

PostgreSQL

Backend

Evaluation

Written by Shailesh Chaudhari

Full-stack engineer with a backend focus

TL;DR: A RAG demo proves it can answer. Production is about what it does when it can't — and whether you'd notice if it got worse. Grounded does four things demos skip: a retrieval-score guardrail that refuses instead of guessing, content-hash idempotent ingestion (stop paying to re-embed), retry-with-backoff that fails fast on 4xx, and a CI eval harness. It runs offline with no API key. Repo at the end.

Why most RAG breaks in production

Hello everyone! I'm Shailesh Chaudhari, a backend engineer. Retrieval-augmented generation is easy to demo and hard to ship. The demo answers your three test questions beautifully. Then real users arrive and it confidently makes things up, your embedding bill balloons because every deploy re-embeds the whole corpus, and a prompt tweak silently makes retrieval worse with no alarm. I built Grounded — a small, readable RAG starter — to get the boring, reliable parts right. Here are the four that matter most.

1. The "I don't know" guardrail

The single most important reliability feature is knowing when not to answer. The naive pipeline retrieves some chunks and stuffs them into the prompt no matter how irrelevant they are — so an off-topic question gets a confident, wrong answer built from unrelated context.

Grounded gates on the retrieval score. In the ask pipeline: embed the question, query the store for the top-k chunks, and look at the best similarity score:

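// `retrieved` is ordered best match first, so [0] holds the top score
const top = retrieved[0];
if (!top || top.score < minScore) {
  // nothing relevant enough: refuse without calling the model
  return { answer: REFUSAL, citations: [], grounded: false };
}
// only now do we call the LLM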

If nothing clears the threshold, it returns a fixed refusal and never even calls the model. The system prompt also says "answer only from context, say you don't know otherwise", but I don't rely on the prompt alone. The retrieval-score gate is the hard guarantee; the prompt is defense-in-depth. A model can be talked out of a prompt instruction; it can't make anything up if it is never called.
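Here is a minimal sketch of how that gate can sit inside a full ask path. It is not Grounded's actual code: `embed`, `store.query`, `llm.complete`, the refusal wording, and the 0.75 default threshold are all assumptions made for illustration, and `store.query` is assumed to return chunks sorted by descending score.

```javascript
// Sketch only: names, shapes, and defaults are assumptions, not Grounded's actual API.
const REFUSAL = "I don't know based on the documents I have.";

async function ask(question, { embed, store, llm, minScore = 0.75, topK = 4 }) {
  const queryVector = await embed(question);
  // Assumed contract: results come back sorted by descending similarity score.
  const retrieved = await store.query(queryVector, topK);

  const top = retrieved[0];
  if (!top || top.score < minScore) {
    // Hard gate: the model is never called, so it has nothing to improvise from.
    return { answer: REFUSAL, citations: [], grounded: false };
  }

  // Only a question with at least one strong match reaches the LLM.
  const answer = await llm.complete({ question, context: retrieved });
  const citations = retrieved.map(({ source, id, score }) => ({ source, chunkId: id, score }));
  return { answer, citations, grounded: true };
}
```

Because the refusal path returns before `llm.complete` is reached, a test can assert that the model was called zero times for an off-topic question, which is a stronger check than inspecting the answer text.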

2. Idempotent ingestion (stop re-embedding everything)

Embeddings cost money and time per call. Naive ingestion re-embeds the entire corpus on every run, so every deploy burns API spend re-creating vectors that didn't change. Grounded hashes each chunk's content and only embeds what's new:

const chunks = chunkDoc(doc);
const existing = await store.existingHashes(doc.id);
const fresh = chunks.filter((c) => !existing.has(c.contentHash));
// embed only `fresh`; prune chunks that no longer exist

Re-ingesting unchanged docs becomes a near-no-op (embedded: 0). Chunks that were deleted from a document get pruned. This is the same idempotency discipline you'd apply to payments or a reservation system — applied to embedding spend.
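A fuller sketch of the same idea, including the hashing and pruning steps. Everything here beyond `chunkDoc` and `existingHashes` is hypothetical: the `upsert` and `deleteByHash` store methods, the returned counters, and the choice of SHA-256 with line-ending normalization are assumptions, and `chunkDoc` is assumed to return plain strings in this version.

```javascript
const { createHash } = require("node:crypto");

// Sketch only: store methods and chunk shape are hypothetical, not Grounded's actual API.
function contentHash(text) {
  // Normalizing line endings means a CRLF/LF change alone does not force a re-embed.
  return createHash("sha256").update(text.replace(/\r\n/g, "\n")).digest("hex");
}

async function ingest(doc, { chunkDoc, embed, store }) {
  const chunks = chunkDoc(doc).map((text) => ({ text, contentHash: contentHash(text) }));
  const existing = await store.existingHashes(doc.id); // assumed to resolve to a Set of hashes
  const current = new Set(chunks.map((c) => c.contentHash));

  const fresh = chunks.filter((c) => !existing.has(c.contentHash));
  const stale = [...existing].filter((hash) => !current.has(hash));

  // Only new or changed chunks cost an embedding call.
  for (const chunk of fresh) {
    chunk.vector = await embed(chunk.text);
  }
  if (fresh.length > 0) await store.upsert(doc.id, fresh);
  // Chunks whose content no longer appears in the document are removed.
  if (stale.length > 0) await store.deleteByHash(doc.id, stale);

  return { embedded: fresh.length, pruned: stale.length, skipped: chunks.length - fresh.length };
}
```

Hashing the chunk content rather than the whole document means editing one paragraph re-embeds only the chunks that paragraph touches, provided the chunker produces stable boundaries for the unchanged text.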

3. Resilience: retry, but fail fast on 4xx

LLM and embedding APIs are flaky. Grounded wraps calls in exponential backoff with jitter, but the important decision is what's retryable:

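function isRetryable(err) {
  // connection resets and timeouts are transient
  if (err.code === "ECONNRESET" || err.code === "ETIMEDOUT") return true;
  const s = err.status;
  // with an HTTP status, retry only 429 and 5xx; with no status, treat it as a network-level failure
  return typeof s === "number" ? (s === 429 || s >= 500) : true;
}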

Network blips and 429/5xx get retried. Any other 4xx (bad request, auth failure) fails immediately, because retrying a 400 just hammers the API and delays the inevitable. Jitter spreads the retries out so clients don't all retry in lockstep and cause a thundering herd. This is unglamorous and it's exactly what keeps a feature up under real load.
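For completeness, here is one way the backoff wrapper around that predicate could look. This is a sketch, not Grounded's implementation: the `withRetry` name, the option names, and the defaults are assumptions, and the jitter strategy shown is the common "full jitter" variant (a random wait between zero and the exponential ceiling).

```javascript
// Sketch only: function name, option names, and defaults are illustrative assumptions.
const sleep = (ms) => new Promise((resolve) => setTimeout(resolve, ms));

// Same rule as the article's isRetryable, repeated so the sketch is self-contained.
function isRetryable(err) {
  if (err.code === "ECONNRESET" || err.code === "ETIMEDOUT") return true;
  const s = err.status;
  return typeof s === "number" ? s === 429 || s >= 500 : true;
}

async function withRetry(fn, { retries = 4, baseMs = 200, maxMs = 5000 } = {}) {
  for (let attempt = 0; ; attempt++) {
    try {
      return await fn();
    } catch (err) {
      // Give up when the retry budget is spent or the error is not worth retrying.
      if (attempt >= retries || !isRetryable(err)) throw err;
      // Full jitter: wait a random time in [0, min(maxMs, baseMs * 2^attempt)).
      const ceiling = Math.min(maxMs, baseMs * 2 ** attempt);
      await sleep(Math.random() * ceiling);
    }
  }
}
```

With `retries = 4` the function makes at most five attempts in total. The cap (`maxMs`) matters as much as the exponent: without it, a long outage pushes the wait into minutes and a user-facing request would be better off failing.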

4. An eval harness so you catch regressions

"Did my prompt change make things better or worse?" is unanswerable without measurement, and most teams ship blind. Grounded ships a labelled Q&A set that is run through the pipeline and scored on retrieval, answer quality, and, crucially, the refusal cases (off-topic questions must return grounded: false). It runs in CI, so a change that quietly breaks retrieval fails the build instead of reaching users. This is the regression test suite for your LLM feature.
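To show the shape of such a harness, here is a small sketch. The case format (`expectRefusal`, `expectedSource`, `mustInclude`) and the pass criteria are assumptions made for illustration, not Grounded's actual eval schema; `ask` is any function that takes a question and resolves to `{ answer, citations, grounded }`.

```javascript
// Sketch only: the case format and pass criteria are assumptions, not Grounded's eval schema.
// A case is either { question, expectRefusal: true }
// or { question, expectedSource, mustInclude: [terms the answer should contain] }.
async function runEval(cases, ask) {
  const failures = [];
  for (const c of cases) {
    const result = await ask(c.question);

    if (c.expectRefusal) {
      // Off-topic questions must be refused, not answered.
      if (result.grounded !== false) failures.push(`${c.question}: expected a refusal`);
      continue;
    }
    if (!result.grounded) {
      failures.push(`${c.question}: refused an answerable question`);
      continue;
    }
    // Retrieval check: the expected document must appear among the citations.
    const sources = result.citations.map((citation) => citation.source);
    if (!sources.includes(c.expectedSource)) {
      failures.push(`${c.question}: expected a citation from ${c.expectedSource}`);
    }
    // Answer check: a crude keyword match, cheap enough to run on every commit.
    const answer = result.answer.toLowerCase();
    const missing = (c.mustInclude ?? []).filter((term) => !answer.includes(term.toLowerCase()));
    if (missing.length > 0) failures.push(`${c.question}: answer is missing ${missing.join(", ")}`);
  }
  return { total: cases.length, failures, passed: failures.length === 0 };
}
```

In CI, a small script can call `runEval`, print `failures`, and set `process.exitCode = 1` when `passed` is false so the build goes red.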

Cited answers, and an honest design choice

Every answer carries citations built from the chunks that were actually retrieved (source, chunk id, score, snippet) — not whatever the model typed — so there's a real audit trail. And the whole thing is offline-testable: the embedder, vector store, and LLM are interfaces chosen by env. The default is a deterministic hash embedder + an in-memory store, so the tests and the eval run with no API key and no database. Production swaps in OpenAI embeddings + pgvector on Postgres.
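As an illustration of both points, here is a sketch of an in-memory store that ranks by cosine similarity, plus a helper that builds citations strictly from what the store returned. The `InMemoryStore` class, its `add` and `query` methods, the `buildCitations` helper, and the field names are assumptions for this example, not Grounded's actual interfaces.

```javascript
// Sketch only: class, method, and field names are assumptions, not Grounded's interfaces.
function cosine(a, b) {
  let dot = 0;
  let normA = 0;
  let normB = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    normA += a[i] * a[i];
    normB += b[i] * b[i];
  }
  const denom = Math.sqrt(normA) * Math.sqrt(normB);
  return denom === 0 ? 0 : dot / denom;
}

class InMemoryStore {
  constructor() {
    this.chunks = [];
  }
  async add(chunks) {
    this.chunks.push(...chunks);
  }
  // Returns the topK chunks, each with a `score`, sorted best match first.
  async query(vector, topK) {
    return this.chunks
      .map((chunk) => ({ ...chunk, score: cosine(vector, chunk.vector) }))
      .sort((a, b) => b.score - a.score)
      .slice(0, topK);
  }
}

// Citations come from the retrieved chunks, never from the model's output.
function buildCitations(retrieved, snippetLength = 160) {
  return retrieved.map((chunk) => ({
    source: chunk.source,
    chunkId: chunk.id,
    score: Number(chunk.score.toFixed(3)),
    snippet: chunk.text.length > snippetLength ? `${chunk.text.slice(0, snippetLength)}...` : chunk.text,
  }));
}
```

A pgvector-backed store would expose the same `query` contract and push the ranking into SQL, which is what lets the rest of the pipeline stay unaware of which backend is in use.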

A bug worth sharing: the offline guardrail once had a false positive. With a small 256-dim hash embedder, hash collisions let an off-topic question score above the threshold, so it answered when it should have refused. The fix was to make the offline embedder more discriminative (4096 dims + stopword filtering). The lesson: a guardrail is only as good as the signal it gates on.

I also hit the classic pgvector gotcha: the native binary is tied to a Postgres major version (built for pg17, wouldn't load on pg16). That is exactly why retrieval sits behind a VectorStore interface: dev and CI use the in-memory store, production uses pgvector, and an infra mismatch never blocks development.
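To make the offline-embedder fix concrete, here is a sketch of a deterministic hash embedder with stopword filtering and a configurable dimension. It is not Grounded's embedder: the tokenizer, the short stopword list, and the use of FNV-1a as the hash are assumptions chosen to keep the example small.

```javascript
// Sketch only: tokenizer, stopword list, and hash choice are illustrative assumptions.
const STOPWORDS = new Set([
  "a", "an", "the", "is", "are", "of", "to", "in", "and", "or", "what", "how", "do", "i",
]);

// 32-bit FNV-1a over the token's characters (tokens are ASCII after tokenizing).
function fnv1a(str) {
  let h = 0x811c9dc5;
  for (let i = 0; i < str.length; i++) {
    h ^= str.charCodeAt(i);
    h = Math.imul(h, 0x01000193);
  }
  return h >>> 0;
}

function hashEmbed(text, dims = 4096) {
  const vector = new Array(dims).fill(0);
  const tokens = text.toLowerCase().match(/[a-z0-9]+/g) ?? [];
  for (const token of tokens) {
    // Stopwords appear in almost every question, so they only add false similarity.
    if (STOPWORDS.has(token)) continue;
    // Each remaining token increments one bucket; fewer buckets means more collisions.
    vector[fnv1a(token) % dims] += 1;
  }
  // L2-normalize so a dot product between two embeddings is their cosine similarity.
  const norm = Math.sqrt(vector.reduce((sum, v) => sum + v * v, 0));
  return norm === 0 ? vector : vector.map((v) => v / norm);
}
```

The two knobs map onto the two halves of the fix. Raising `dims` lowers the chance that two unrelated words land in the same bucket, and dropping stopwords removes the overlap that every pair of English questions shares regardless of topic.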

Try it

It runs in 30 seconds with no setup:

npm install && npm start   # offline: in-memory store + extractive answers

Source: github.com/Shailesh93602/grounded

If you're adding AI to a product, the question that actually matters isn't "can it answer" — it's "what does it do when it shouldn't, and will you know if it regresses." Build those in from the start. Thanks for reading!