
We have audited 18 RAG systems in the last 18 months, and the pattern is invariant: every failing RAG is failing in one of three ways. The team usually does not know which one. They assume 'the model is bad' and start swapping models. That is rarely the issue.

The three modes are · (1) retrieval miss · the right chunk was never returned · (2) retrieval-yes-answer-no · the right chunk was returned but the model ignored, misread, or hallucinated around it · (3) stale truth · the chunk was returned, the answer used it, but the chunk itself is out of date. Each has a different fix and a different cost.

Mode 1 · Retrieval miss

The right chunk never made it into the top-k. The vector search ranked it 47th, you took the top 5. Symptoms: the model produces a plausible-sounding wrong answer, or 'I don't know'. The chunk exists in the corpus · retrieval just never surfaces it.

Causes, in order of frequency we see them:

  • Embedding model mismatch · query embedded with one model, corpus with another (yes, we have seen this in production).
  • Chunking strategy too coarse · 2000 token chunks bury the answer in noise, similarity falls.
  • No hybrid search · pure dense retrieval misses lexical matches like product SKUs, error codes, version numbers.
  • Query is too short · 'tax' returns nothing, 'how does VAT apply to imports' returns the right chunk.

Fastest fix: turn on hybrid search (BM25 + dense) and re-chunk to 400-800 tokens with 50-token overlap · a sketch follows. In our audits this alone resolved about 60% of mode 1 cases.
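
What 'hybrid' looks like in code · a minimal sketch assuming `rank_bm25` and `sentence-transformers` are installed, with the two rankings merged by reciprocal rank fusion (RRF); your vector DB may ship its own fusion, in which case use that.

```python
# Hybrid retrieval sketch: BM25 + dense, merged with reciprocal rank fusion (RRF).
# `chunks` is a list of chunk strings. In production, embed the corpus once at
# ingest, not per query.
import numpy as np
from rank_bm25 import BM25Okapi
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("all-MiniLM-L6-v2")  # MUST match the ingest-time model

def hybrid_search(query: str, chunks: list[str], top_k: int = 5, k_rrf: int = 60) -> list[str]:
    # Lexical ranking: catches SKUs, error codes, version numbers.
    bm25 = BM25Okapi([c.split() for c in chunks])
    lex_order = np.argsort(bm25.get_scores(query.split()))[::-1]

    # Dense ranking: catches paraphrases the lexical side misses.
    q = model.encode([query], normalize_embeddings=True)
    d = model.encode(chunks, normalize_embeddings=True)
    dense_order = np.argsort((d @ q.T).ravel())[::-1]

    # RRF: score(chunk) = sum over rankings of 1 / (k_rrf + rank).
    scores: dict[int, float] = {}
    for order in (lex_order, dense_order):
        for rank, idx in enumerate(order):
            scores[idx] = scores.get(idx, 0.0) + 1.0 / (k_rrf + rank)
    best = sorted(scores, key=scores.get, reverse=True)[:top_k]
    return [chunks[i] for i in best]
```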

Mode 2 · Retrieval hit, answer wrong

The right chunk is in the top 5, but the model still produces the wrong answer. Symptoms: hallucinated citations, mixing facts from multiple chunks, contradicting the chunk it cites. This is the failure mode that looks like 'the LLM is hallucinating', because it is, but the cause is upstream of the model.

  • Prompt does not instruct grounding · 'answer the question' instead of 'answer using only the provided context, cite chunk IDs'.
  • `top_k` is too high · 20 chunks of mixed quality dilute the signal, and the model picks the wrong fact.
  • No chunk separator in prompt · the model cannot tell where one chunk ends and the next begins.
  • Conflicting chunks (same fact, different version) confuse the model. It picks one, often wrong.

Fastest fix: `top_k = 5`, explicit chunk separators, and a prompt template that says 'answer only from these chunks, cite by ID, say I don't know if not sure'. Most teams already have this · but their `top_k = 20` and missing chunk IDs in the prompt undo it. The sketch below shows the shape we mean.
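
A sketch of the grounding prompt · the exact wording is illustrative, not a canonical template; adapt it to your model.

```python
# Grounding prompt sketch: explicit separators, chunk IDs, an escape hatch.
# `chunks` is a list of (chunk_id, text) pairs from the retriever.
def build_prompt(question: str, chunks: list[tuple[str, str]]) -> str:
    context = "\n\n".join(f"--- chunk {cid} ---\n{text}" for cid, text in chunks)
    return (
        "Answer the question using ONLY the context below. Cite the chunk ID "
        "for every claim, e.g. [chunk 12]. If the context does not contain "
        "the answer, say \"I don't know\" instead of guessing.\n\n"
        f"CONTEXT:\n{context}\n\n"
        f"QUESTION: {question}"
    )
```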

Mode 3 · Stale truth

The chunk is returned, the model answers from it, the answer is wrong because the chunk is six months old. The product changed. The price changed. The policy changed. The corpus did not.

  • No re-ingestion pipeline · the corpus was a one-time load, never refreshed.
  • Re-ingestion exists but does not catch updates in the source (e.g. only watches one folder, the policy team writes in another).
  • Older chunks are not removed · two versions of the same fact exist, retrieval picks the older one.
  • Date-aware retrieval is missing · query about 'current pricing' should bias to recent chunks.

Fastest fix: schedule a daily re-ingestion. Add a `last_modified` field to chunks and expose it as a retrieval bias for time-sensitive queries. Periodically dedupe by source path. A sketch of the latter two guards follows.
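
A sketch of the recency bias and the dedupe, assuming each chunk carries `last_modified` and `source_path` metadata · the field names are ours, map them to your store's schema.

```python
# Staleness-guard sketch: recency bias for time-sensitive queries, plus dedupe
# that keeps only the newest chunk per source. `last_modified` must be a
# timezone-aware datetime; `score` is the retriever's similarity score.
from datetime import datetime, timezone

def recency_weight(chunk: dict, half_life_days: float = 90.0) -> float:
    # Exponential decay: a 90-day-old chunk weighs half as much as a fresh one.
    age_days = (datetime.now(timezone.utc) - chunk["last_modified"]).days
    return 0.5 ** (age_days / half_life_days)

def rerank_time_sensitive(hits: list[dict]) -> list[dict]:
    # Apply only when the query is time-sensitive ('current pricing' etc.).
    return sorted(hits, key=lambda h: h["score"] * recency_weight(h), reverse=True)

def dedupe_by_source(hits: list[dict]) -> list[dict]:
    # Keep the newest hit per source path. Simplified to one chunk per source;
    # in practice key on (source_path, chunk_index) to keep sibling chunks.
    newest: dict[str, dict] = {}
    for h in hits:
        k = h["source_path"]
        if k not in newest or h["last_modified"] > newest[k]["last_modified"]:
            newest[k] = h
    return list(newest.values())
```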

The diagnostic table

When we sit down with a client, this is the literal table on the whiteboard. We pick 10 known-bad cases and walk down the table for each.

| symptom | retrieval rank of right chunk | answer cites chunk? | source up to date? | mode | fastest fix |
|---|---|---|---|---|---|
| plausible wrong answer | not in top 20 | no | yes | 1 | hybrid search + re-chunk |
| 'I don't know' | not in top 20 | no | yes | 1 | hybrid search + re-chunk |
| confident wrong | top 5 | yes | yes | 2 | grounding prompt + lower top_k |
| confident wrong | top 5 | hallucinated cite | yes | 2 | chunk IDs + 'cite by ID' rule |
| outdated price/policy | top 5 | yes | NO | 3 | re-ingest + date bias |
| 'two answers' | top 5 (2 versions) | yes | partial | 3 | dedupe by source path |
| right chunk, but partial | top 5 | yes | yes | 2 | larger answer budget, fewer chunks |
| user phrase mismatch | not in top 20 | no | yes | 1 | query rewriting / HyDE |

How to run the diagnosis in 30 minutes

  1. Pick 10 cases the client says are 'broken'. Real user queries with the right answer known.
  2. For each, log the top 20 chunks the retriever returns. Note the rank of the correct chunk.
  3. For each, look at the full prompt sent to the LLM. Check chunk separators, top k, instructions.
  4. For each, ask the source-of-truth owner · 'is this chunk still correct as of today'.
  5. Walk the table. Each case lands in mode 1, 2 or 3. Sometimes a case is mode 1 + 3.
  6. Sum the modes. The mode count tells you the fix order.
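
The same walk as code, if you prefer a script to a whiteboard · `chunk_id`, `gold_id`, `source_is_current`, and `cases` are placeholders for your own stack, not from any library.

```python
# Diagnosis sketch for one known-bad case. `hits` is the retriever's top 20
# (each carrying a "chunk_id"), `gold_id` is the known-correct chunk, and
# `source_is_current` is the source-of-truth owner's answer from step 4.
from collections import Counter

def diagnose(hits: list[dict], gold_id: str, source_is_current: bool) -> set[int]:
    modes = set()
    if gold_id not in [h["chunk_id"] for h in hits]:
        modes.add(1)          # retrieval miss: right chunk not in top 20
    if not source_is_current:
        modes.add(3)          # stale truth
    if not modes:
        modes.add(2)          # retrieved and current, yet the answer is wrong
    return modes              # a case can be mode 1 + 3

# Step 6: tally across the 10 cases; the counts set the fix order.
# `cases` is your list of dicts with keys matching diagnose()'s parameters.
tally = Counter(m for case in cases for m in diagnose(**case))
print(tally.most_common())
```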

Why this matters

Most failed RAG projects we are called in on have the same root cause: the team never separated 'is this a retrieval problem or a generation problem'. They throw a bigger model at it (a mode 2 fix) when the chunk is stale (mode 3) or was never retrieved (mode 1). Six weeks of work later, the bill has grown and the answer is still wrong.

If you want to skip an audit cycle, run the table on your own RAG today. 10 cases, 30 minutes, then you know which mode you are fighting.

By Dezso Mezo · Founder, DField Solutions

I've shipped production products from fintech to creator tooling · for startups and enterprises, from Budapest to San Francisco.


Would rather build together?

Let's talk about your project. 30 minutes, no strings.