
We have audited 18 RAG systems in the last 18 months, and the pattern is invariant: every failing RAG is failing in one of three ways. The team usually does not know which one. They assume 'the model is bad' and start swapping models. That is rarely the issue.

The three modes are · (1) retrieval miss · the right chunk was never returned · (2) retrieval-yes-answer-no · the right chunk was returned but the model ignored, misread, or hallucinated around it · (3) stale truth · the chunk was returned, the answer used it, but the chunk itself is out of date. Each has a different fix and a different cost.

Mode 1 · Retrieval miss

The right chunk never made it into the top-k. The vector search ranked it 47th, you took the top 5. Symptoms: the model produces a plausible-sounding wrong answer, or 'I don't know'. The chunk exists in the corpus · retrieval just never surfaces it.

Causes, in order of frequency we see them:

  • Embedding model mismatch · query embedded with one model, corpus with another (yes, we have seen this in production).
  • Chunking strategy too coarse · 2000 token chunks bury the answer in noise, similarity falls.
  • No hybrid search · pure dense retrieval misses lexical matches like product SKUs, error codes, version numbers.
  • Query is too short · 'tax' returns nothing, 'how does VAT apply to imports' returns the right chunk.

Fastest fix: turn on hybrid search (BM25 + dense) and re-chunk to 400-800 tokens with 50-token overlap · a sketch follows. In our audits this alone resolved about 60% of mode 1 cases.
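
What 'hybrid' looks like in code · a minimal sketch assuming `rank_bm25` and `sentence-transformers` are installed, with the two rankings merged by reciprocal rank fusion (RRF); your vector DB may ship its own fusion, in which case use that.

```python
# Hybrid retrieval sketch: BM25 + dense, merged with reciprocal rank fusion (RRF).
# `chunks` is a list of chunk strings. In production, embed the corpus once at
# ingest, not per query.
import numpy as np
from rank_bm25 import BM25Okapi
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("all-MiniLM-L6-v2")  # MUST match the ingest-time model

def hybrid_search(query: str, chunks: list[str], top_k: int = 5, k_rrf: int = 60) -> list[str]:
    # Lexical ranking: catches SKUs, error codes, version numbers.
    bm25 = BM25Okapi([c.split() for c in chunks])
    lex_order = np.argsort(bm25.get_scores(query.split()))[::-1]

    # Dense ranking: catches paraphrases the lexical side misses.
    q = model.encode([query], normalize_embeddings=True)
    d = model.encode(chunks, normalize_embeddings=True)
    dense_order = np.argsort((d @ q.T).ravel())[::-1]

    # RRF: score(chunk) = sum over rankings of 1 / (k_rrf + rank).
    scores: dict[int, float] = {}
    for order in (lex_order, dense_order):
        for rank, idx in enumerate(order):
            scores[idx] = scores.get(idx, 0.0) + 1.0 / (k_rrf + rank)
    best = sorted(scores, key=scores.get, reverse=True)[:top_k]
    return [chunks[i] for i in best]
```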

Mode 2 · Retrieval hit, answer wrong

The right chunk is in the top 5, but the model still produces the wrong answer. Symptoms: hallucinated citations, mixing facts from multiple chunks, contradicting the chunk it cites. This is the failure mode that looks like 'the LLM is hallucinating', because it is, but the cause is upstream of the model.

  • Prompt does not instruct grounding · 'answer the question' instead of 'answer using only the provided context, cite chunk IDs'.
  • `top_k` is too high · 20 chunks of mixed quality dilute the signal, and the model picks the wrong fact.
  • No chunk separator in prompt · the model cannot tell where one chunk ends and the next begins.
  • Conflicting chunks (same fact, different version) confuse the model. It picks one, often wrong.

Fastest fix: `top_k = 5`, explicit chunk separators, and a prompt template that says 'answer only from these chunks, cite by ID, say I don't know if not sure'. Most teams already have this · but their `top_k = 20` and missing chunk IDs in the prompt undo it. The sketch below shows the shape we mean.
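
A sketch of the grounding prompt · the exact wording is illustrative, not a canonical template; adapt it to your model.

```python
# Grounding prompt sketch: explicit separators, chunk IDs, an escape hatch.
# `chunks` is a list of (chunk_id, text) pairs from the retriever.
def build_prompt(question: str, chunks: list[tuple[str, str]]) -> str:
    context = "\n\n".join(f"--- chunk {cid} ---\n{text}" for cid, text in chunks)
    return (
        "Answer the question using ONLY the context below. Cite the chunk ID "
        "for every claim, e.g. [chunk 12]. If the context does not contain "
        "the answer, say \"I don't know\" instead of guessing.\n\n"
        f"CONTEXT:\n{context}\n\n"
        f"QUESTION: {question}"
    )
```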

Mode 3 · Stale truth

The chunk is returned, the model answers from it, the answer is wrong because the chunk is six months old. The product changed. The price changed. The policy changed. The corpus did not.

  • No re-ingestion pipeline · the corpus was a one-time load, never refreshed.
  • Re-ingestion exists but does not catch updates in the source (e.g. only watches one folder, the policy team writes in another).
  • Older chunks are not removed · two versions of the same fact exist, retrieval picks the older one.
  • Date-aware retrieval is missing · query about 'current pricing' should bias to recent chunks.

Fastest fix: schedule a daily re-ingestion. Add a `last_modified` field to chunks and expose it as a retrieval bias for time-sensitive queries. Periodically dedupe by source path. A sketch of the latter two guards follows.
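
A sketch of the recency bias and the dedupe, assuming each chunk carries `last_modified` and `source_path` metadata · the field names are ours, map them to your store's schema.

```python
# Staleness-guard sketch: recency bias for time-sensitive queries, plus dedupe
# that keeps only the newest chunk per source. `last_modified` must be a
# timezone-aware datetime; `score` is the retriever's similarity score.
from datetime import datetime, timezone

def recency_weight(chunk: dict, half_life_days: float = 90.0) -> float:
    # Exponential decay: a 90-day-old chunk weighs half as much as a fresh one.
    age_days = (datetime.now(timezone.utc) - chunk["last_modified"]).days
    return 0.5 ** (age_days / half_life_days)

def rerank_time_sensitive(hits: list[dict]) -> list[dict]:
    # Apply only when the query is time-sensitive ('current pricing' etc.).
    return sorted(hits, key=lambda h: h["score"] * recency_weight(h), reverse=True)

def dedupe_by_source(hits: list[dict]) -> list[dict]:
    # Keep the newest hit per source path. Simplified to one chunk per source;
    # in practice key on (source_path, chunk_index) to keep sibling chunks.
    newest: dict[str, dict] = {}
    for h in hits:
        k = h["source_path"]
        if k not in newest or h["last_modified"] > newest[k]["last_modified"]:
            newest[k] = h
    return list(newest.values())
```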

The diagnostic table

When we sit down with a client, this is the literal table on the whiteboard. We pick 10 known-bad cases and walk down the table for each.

| symptom | retrieval rank of right chunk | answer cites chunk? | source up to date? | mode | fastest fix |
|---|---|---|---|---|---|
| plausible wrong answer | not in top 20 | no | yes | 1 | hybrid search + re-chunk |
| 'I don't know' | not in top 20 | no | yes | 1 | hybrid search + re-chunk |
| confident wrong | top 5 | yes | yes | 2 | grounding prompt + lower top_k |
| confident wrong | top 5 | hallucinated cite | yes | 2 | chunk IDs + 'cite by ID' rule |
| outdated price/policy | top 5 | yes | NO | 3 | re-ingest + date bias |
| 'two answers' | top 5 (2 versions) | yes | partial | 3 | dedupe by source path |
| right chunk, but partial | top 5 | yes | yes | 2 | larger answer budget, fewer chunks |
| user phrase mismatch | not in top 20 | no | yes | 1 | query rewriting / HyDE |

How to run the diagnosis in 30 minutes

  1. Pick 10 cases the client says are 'broken'. Real user queries with the right answer known.
  2. For each, log the top 20 chunks the retriever returns. Note the rank of the correct chunk.
  3. For each, look at the full prompt sent to the LLM. Check chunk separators, top k, instructions.
  4. For each, ask the source-of-truth owner · 'is this chunk still correct as of today'.
  5. Walk the table. Each case lands in mode 1, 2 or 3. Sometimes a case is mode 1 + 3.
  6. Sum the modes. The mode count tells you the fix order.
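
The same walk as code, if you prefer a script to a whiteboard · `chunk_id`, `gold_id`, `source_is_current`, and `cases` are placeholders for your own stack, not from any library.

```python
# Diagnosis sketch for one known-bad case. `hits` is the retriever's top 20
# (each carrying a "chunk_id"), `gold_id` is the known-correct chunk, and
# `source_is_current` is the source-of-truth owner's answer from step 4.
from collections import Counter

def diagnose(hits: list[dict], gold_id: str, source_is_current: bool) -> set[int]:
    modes = set()
    if gold_id not in [h["chunk_id"] for h in hits]:
        modes.add(1)          # retrieval miss: right chunk not in top 20
    if not source_is_current:
        modes.add(3)          # stale truth
    if not modes:
        modes.add(2)          # retrieved and current, yet the answer is wrong
    return modes              # a case can be mode 1 + 3

# Step 6: tally across the 10 cases; the counts set the fix order.
# `cases` is your list of dicts with keys matching diagnose()'s parameters.
tally = Counter(m for case in cases for m in diagnose(**case))
print(tally.most_common())
```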

Why this matters

Most failed RAG projects we are called in on have the same root cause: the team never separated 'is this a retrieval problem or a generation problem'. They throw a bigger model at it (a mode 2 fix) when the chunk is stale (mode 3) or was never retrieved (mode 1). Six weeks of work later, the bill has grown and the answer is still wrong.

If you want to skip an audit cycle, run the table on your own RAG today. 10 cases, 30 minutes, then you know which mode you are fighting.

By Dezso Mezo · Founder, DField Solutions

I've shipped production products from fintech to creator tooling · for startups and enterprises, from Budapest to San Francisco.


Would rather build together?

Let's talk about your project. 30 minutes, no strings.