June 5, 2026 · 9 min read

Why RAG Systems Fail in Production (And How to Fix Them)

I have built several RAG systems. The first looked great in demos. It fell apart within a week of real usage. The failure modes are predictable - and fixable. Here is what goes wrong.

RAG · LLM Systems · Production ML

Sehastrajit Selvachandran

AI/ML Engineer · M.S. CS @ ASU

I have built several RAG systems. The first one looked great in the demo. It fell apart in production within a week of real usage. This is the most common trajectory for RAG systems, and the failure modes are predictable.

Here is what goes wrong and how to fix it.

Failure Mode 1: Semantic Similarity Is Not Relevance

The most fundamental mistake in RAG is treating vector similarity as a proxy for relevance. Vector search retrieves documents that are semantically close to the query in embedding space. That is not the same as retrieving documents that answer the question.

A question like "what is the retention rate for enterprise customers?" will retrieve documents that mention customers, enterprise, retention, and rates. The top result might be a document about onboarding that happens to use the word "retention." It is semantically similar. It does not answer the question.

This failure mode is subtle because it is hard to see in testing. When you build a RAG system and test it with the queries you designed it for, the retrieved documents look reasonable. In production, users ask queries that no one anticipated, and the retrieval breaks in ways that are hard to predict from the test set.

Fix: Hybrid retrieval. Combine dense vector search with sparse keyword search (BM25). Keyword search retrieves documents that contain the exact terms in the query - which is precisely what vector search misses for specific terminology, proper nouns, and exact phrase matching. In my experience, fusing the rankings from both methods consistently outperforms either method on its own in production. The implementation cost is low relative to the accuracy gain.
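One way to fuse the two rankings is Reciprocal Rank Fusion (RRF). The fix above does not name a fusion method, so RRF is an assumption here, as are the document IDs and the conventional constant k=60. A minimal sketch that only needs the ranked ID lists from each retriever:

```python
from collections import defaultdict

def reciprocal_rank_fusion(rankings, k=60):
    """Fuse several ranked lists of document IDs into one ranking.

    Each document scores sum(1 / (k + rank)) over the lists it appears in,
    with rank starting at 1. k=60 is a conventional default, not a tuned value.
    """
    scores = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    # Highest fused score first; ties broken by doc ID for determinism.
    return sorted(scores, key=lambda d: (-scores[d], d))

# Hypothetical results for the enterprise retention query (illustrative IDs).
dense_hits = ["onboarding_guide", "retention_report", "pricing_faq"]
sparse_hits = ["retention_report", "churn_analysis", "onboarding_guide"]

fused = reciprocal_rank_fusion([dense_hits, sparse_hits])
print(fused)
```

RRF works on ranks rather than raw scores, so BM25 scores and cosine similarities never need to be normalized onto a common scale. Here the report that both retrievers rank highly moves ahead of the onboarding guide that only the dense retriever put first.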

Failure Mode 2: Chunks That Break Context

Naive chunking splits documents at fixed character or token limits. This produces chunks that cut mid-sentence, separate a question from its answer, and strip the surrounding context that makes a paragraph interpretable.

The model receives a chunk that says "this increased by 23% compared to the baseline." Compared to which baseline? The chunk before it said so - but that chunk was not retrieved.

This failure mode compounds as the knowledge base grows. More documents mean more chunks, and more chunks mean more retrieval decisions, each of which can surface a decontextualized fragment.

Fix: Semantic chunking with overlap. Instead of splitting at fixed lengths, split at natural boundaries - paragraph breaks, section headers, sentence endings. Add overlap so each chunk includes the last N tokens of the previous chunk and the first N tokens of the next. This preserves context across chunk boundaries without doubling storage requirements. For technical documentation, also consider parent-child chunking: store small chunks for retrieval precision but retrieve the parent paragraph or section when a small chunk matches.
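A rough sketch of boundary-aware chunking with overlap, under a few simplifying assumptions: paragraphs are separated by blank lines, whitespace-separated words stand in for tokens, and the function name and default sizes are illustrative. A paragraph longer than max_words is kept whole rather than split.

```python
def chunk_with_overlap(text, max_words=120, overlap_words=20):
    # Split on blank lines so a chunk never cuts a paragraph in half.
    paragraphs = [p.split() for p in text.split("\n\n") if p.strip()]

    # Greedily pack whole paragraphs into cores of at most max_words.
    cores, current = [], []
    for words in paragraphs:
        if current and len(current) + len(words) > max_words:
            cores.append(current)
            current = []
        current = current + words
    if current:
        cores.append(current)

    chunks = []
    for i, core in enumerate(cores):
        # Tail of the previous core and head of the next one provide context.
        has_prev = i > 0 and overlap_words > 0
        before = cores[i - 1][-overlap_words:] if has_prev else []
        after = cores[i + 1][:overlap_words] if i + 1 < len(cores) else []
        chunks.append(" ".join(before + core + after))
    return chunks

doc = (
    "Baseline latency was measured on the old cluster.\n\n"
    "This increased by 23% compared to the baseline."
)
chunks = chunk_with_overlap(doc, max_words=8, overlap_words=4)
for c in chunks:
    print(repr(c))
```

With overlap, the second chunk starts with the tail of the first, so the "compared to the baseline" sentence carries at least a hint of which baseline it means. Storage grows by roughly two overlaps per chunk, not by a full copy.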

Failure Mode 3: Context Window Assembly

Even when retrieval works well, the final assembled context can be incoherent. The top-3 retrieved chunks might be from three different sections of a document, or three different documents that were written with different assumptions about what the reader already knows.

The model tries to attend to all three chunks equally and produces an answer that hedges because the chunks partially contradict each other, or that misses the connection between them because they were written to be read sequentially.

Fix: Reranking, then assembly order. Add a cross-encoder reranking step after initial retrieval. Cross-encoders are slower than embedding-based retrieval but dramatically more accurate at judging whether a specific chunk answers a specific question - because they process the query and the chunk together, not as separate embeddings. After reranking, assemble chunks in logical order (by document section if possible, or by relevance score if from different documents) rather than by retrieval rank.
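The two steps can be sketched as follows. The scorer is deliberately pluggable: score_pair is a hypothetical callable that takes the query and the chunk text together, which is the role a cross-encoder model would play. The word-overlap scorer below is only a toy stand-in so the example runs without a model, and the Chunk fields are assumptions.

```python
from dataclasses import dataclass

@dataclass
class Chunk:
    doc_id: str
    position: int  # index of the chunk within its source document
    text: str

def rerank_and_assemble(query, chunks, score_pair, top_n=3):
    # Score each (query, chunk) pair jointly, as a cross-encoder would.
    scored = [(score_pair(query, c.text), c) for c in chunks]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    kept = scored[:top_n]

    # The best chunk score per document decides which document comes first.
    best = {}
    for score, c in kept:
        best[c.doc_id] = max(score, best.get(c.doc_id, float("-inf")))

    # Documents by descending relevance; within a document, reading order.
    ordered = sorted(
        kept,
        key=lambda pair: (-best[pair[1].doc_id], pair[1].doc_id, pair[1].position),
    )
    return [c for _, c in ordered]

def overlap_score(query, text):
    # Toy stand-in for a cross-encoder: count of shared lowercase words.
    return len(set(query.lower().split()) & set(text.lower().split()))

candidates = [
    Chunk("report", 3, "this retention rate applies to enterprise accounts only"),
    Chunk("report", 1, "retention rate is measured per contract year"),
    Chunk("guide", 0, "onboarding improves retention"),
    Chunk("faq", 5, "pricing tiers and discounts"),
]
context = rerank_and_assemble("enterprise retention rate", candidates, overlap_score)
print([(c.doc_id, c.position) for c in context])
```

The chunk at position 3 scores highest, but it is placed after position 1 from the same document so the model reads the definition before the qualification.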

Maximal Marginal Relevance (MMR) is useful here too: instead of taking the top-k by score, pick chunks that are both relevant and diverse. This prevents five chunks from the same paragraph drowning out other relevant documents.
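A small, dependency-free sketch of MMR selection. The toy 2-D vectors and the lam weighting value are illustrative assumptions; in a real pipeline the vectors would come from the embedding model and lam would be tuned on the evaluation set.

```python
import math

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0

def mmr(query_vec, doc_vecs, k=3, lam=0.7):
    """Return indices of up to k documents chosen by Maximal Marginal Relevance.

    lam=1.0 is plain top-k by relevance; lower values penalize candidates
    that are similar to something already selected.
    """
    selected = []
    remaining = list(range(len(doc_vecs)))
    while remaining and len(selected) < k:
        def mmr_score(i):
            relevance = cosine(query_vec, doc_vecs[i])
            redundancy = max(
                (cosine(doc_vecs[i], doc_vecs[j]) for j in selected),
                default=0.0,
            )
            return lam * relevance - (1 - lam) * redundancy
        best = max(remaining, key=mmr_score)
        selected.append(best)
        remaining.remove(best)
    return selected

query = [1.0, 0.0]
docs = [
    [1.0, 0.10],   # highly relevant
    [1.0, 0.12],   # near-duplicate of the first
    [0.8, -0.60],  # less relevant, but covers different ground
]
print(mmr(query, docs, k=2, lam=0.5))
print(mmr(query, docs, k=2, lam=1.0))
```

With lam=0.5 the near-duplicate is skipped in favor of the third vector; with lam=1.0 the selection collapses back to the two most similar vectors.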

Failure Mode 4: No Evaluation Pipeline

The most consequential failure mode is not a technical one - it is building a RAG system without a way to measure whether it is working.

Retrieval recall (what fraction of relevant documents are retrieved?) is a useful intermediate metric. It is not sufficient. The end metric is: given a realistic distribution of user queries, what fraction of answers are correct? This is hard to measure because "correct" requires ground truth, and ground truth requires human judgment.

The practical result is that teams ship RAG systems, eyeball a few outputs, declare that they work, and discover in production that they do not. Evaluating RAG with retrieval recall alone is like evaluating an ML model on training accuracy - the number can look good while the system fails on real inputs.

Fix: Build evaluation from the start. Before building the RAG pipeline, curate a set of 50-100 questions with known correct answers drawn from your knowledge base. This evaluation set is your source of truth. Every architectural change - chunking strategy, retrieval method, reranker - is evaluated against this fixed set before deployment. RAGAS is a useful framework for automated evaluation that can scale human judgment on answer faithfulness and context precision.
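A bare-bones version of such a harness might look like the sketch below. Everything in it is illustrative: the pipeline interface (a callable returning retrieved IDs and an answer), the field names in the evaluation set, and the substring check, which is a crude stand-in for human judgment or an automated faithfulness metric.

```python
def evaluate(pipeline, eval_set, k=5):
    """Run every eval question through `pipeline` and report two metrics.

    pipeline(question) must return (retrieved_doc_ids, answer_text).
    Each case needs a non-empty list of relevant document IDs.
    """
    recalls, correct = [], 0
    for case in eval_set:
        retrieved, answer = pipeline(case["question"])
        relevant = set(case["relevant_docs"])
        hits = relevant & set(retrieved[:k])
        recalls.append(len(hits) / len(relevant))
        # Crude correctness check: the expected string appears in the answer.
        if case["expected"].lower() in answer.lower():
            correct += 1
    n = len(eval_set)
    return {"recall_at_k": sum(recalls) / n, "answer_accuracy": correct / n}

# Toy evaluation set and a stub pipeline with canned outputs.
eval_set = [
    {"question": "How long is the trial?", "relevant_docs": ["d1"], "expected": "30 days"},
    {"question": "How is billing scheduled?", "relevant_docs": ["d3"], "expected": "annual"},
]
canned = {
    "How long is the trial?": (["d1", "d2"], "The trial lasts 30 days."),
    "How is billing scheduled?": (["d3"], "I am not sure."),
}

def stub_pipeline(question):
    return canned[question]

report = evaluate(stub_pipeline, eval_set)
print(report)
```

The stub retrieves the right document for both questions but answers only one correctly, so recall is perfect while answer accuracy is 50%. That gap is why recall alone is not the end metric.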

What Good RAG Looks Like

A production RAG system has:

Hybrid retrieval (dense + sparse, fused rankings)
Semantic chunking with overlap, or parent-child chunking
Cross-encoder reranking with MMR for diversity
Context assembly that respects document structure and reading order
An evaluation pipeline with human-verified ground truth
Monitoring that tracks answer quality over time, not just retrieval latency

The evaluation pipeline is the most important part. Without it, you are flying blind - and the failure modes above will surface in production rather than in testing, where they are cheap to fix.

The good news: RAG systems that fail in the ways described above are fixable without architectural rewrites. Each failure mode has a targeted fix. The fixes compound - hybrid retrieval plus reranking plus semantic chunking produces substantially better results than any single improvement alone.

Start with the evaluation pipeline. Everything else follows from having a way to measure whether your changes helped.

Sehastrajit Selvachandran

AI/ML Engineer. M.S. CS at Arizona State University (GPA 3.94). Creator of LUNA. 4 peer-reviewed publications. Building LLM systems, CV pipelines, and production ML infrastructure.
