The full pipeline, before we optimise anything
Before talking about what goes wrong, it's worth being precise about what a RAG pipeline actually is. There are two distinct phases: ingestion (processing your documents) and retrieval (answering queries). Most people focus on retrieval. Most failures happen in ingestion.
Every step in that pipeline has failure modes that silently degrade your system's quality. The LLM at the end is almost never the problem. If your answers are wrong, the cause is almost always upstream — bad chunking, weak embeddings, or poor retrieval that fetches loosely related rather than directly relevant context.
Chunking: where most RAG systems fail first
Chunking is the process of splitting your source documents into smaller pieces before embedding them. This is the decision that most affects retrieval quality, and it's almost always treated as an afterthought — "just split every 500 tokens with a 50-token overlap" — a default setting that performs adequately on clean, well-structured text and catastrophically on the messy documents real organisations actually have.
The naive approach and why it breaks
Fixed-size token splitting cuts your text every N tokens regardless of semantic boundaries. A sentence gets cut in half. A step in a process is separated from the step before it that provided context. A table header ends up in one chunk while the table data is in another. When these fragmented chunks get embedded and retrieved, the retrieved context doesn't contain enough information to answer the question — and the LLM either hallucinates what's missing or hedges into uselessness.
Semantic fragmentation
Document: "The refund process takes 3–5 business days. Refunds are processed to the original payment method. International orders may take longer due to currency conversion." Fixed-size splitting might put the first sentence in chunk 47 and the second in chunk 48. A query about "how long do refunds take" retrieves chunk 47 — which mentions 3–5 days but not the international exception. User gets incomplete information. LLM can't fill in what it wasn't given.
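To make the failure concrete, here's a toy sketch of what fixed-width splitting does to that refund passage (character-based rather than token-based for simplicity; the cut points are illustrative):

```python
text = (
    "The refund process takes 3-5 business days. "
    "Refunds are processed to the original payment method. "
    "International orders may take longer due to currency conversion."
)

# Naive fixed-size splitting: cut every 60 characters, no overlap,
# no respect for sentence boundaries.
chunk_size = 60
chunks = [text[i:i + chunk_size] for i in range(0, len(text), chunk_size)]

for c in chunks:
    print(repr(c))
# Sentences get sliced mid-word, so a query matching one fragment
# retrieves it without the neighbouring fragments it depends on.
```

Run it and you'll see the first chunk end mid-word inside the second sentence: the "3–5 days" fact and the international-orders exception land in different chunks.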
Semantic chunking with meaningful overlap
Use a recursive character text splitter that respects paragraph boundaries, then headings, then sentences — only falling back to character-level splitting when necessary. Set overlap to 15–20% of chunk size so context from the previous chunk bleeds into the next. For structured documents (policies, manuals, FAQs), chunk by section, not by token count.
from langchain.text_splitter import RecursiveCharacterTextSplitter

splitter = RecursiveCharacterTextSplitter(
    chunk_size=800,      # characters, not tokens
    chunk_overlap=120,   # ~15% overlap
    separators=[
        "\n\n",  # paragraph breaks first
        "\n",    # then line breaks
        ". ",    # then sentence endings
        ", ",    # then clause boundaries
        " ",     # then words
        "",      # last resort: characters
    ],
    length_function=len,
)

chunks = splitter.split_text(document_text)

# Add metadata to each chunk for better retrieval
enriched_chunks = []
for i, chunk in enumerate(chunks):
    enriched_chunks.append({
        "content": chunk,
        "chunk_index": i,
        "source": document_path,
        "section": extract_section_heading(chunk),
        "word_count": len(chunk.split()),
    })
Choosing the right embedding model
The embedding model converts text into a vector of numbers that represents its meaning. Two pieces of text with similar meaning end up with vectors that are close together in the vector space. The quality of your embeddings directly determines the quality of your retrieval.
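"Close together" here usually means high cosine similarity. A minimal sketch of the comparison, using made-up 4-dimensional vectors standing in for real embeddings (which have 1024+ dimensions and come from the model, not from you):

```python
import math

def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine of the angle between two vectors: 1.0 means same direction."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

# Hypothetical embeddings, invented for illustration only.
refund_policy = [0.8, 0.1, 0.3, 0.0]
refund_query  = [0.7, 0.2, 0.3, 0.1]
shipping_faq  = [0.1, 0.9, 0.0, 0.4]

# The refund query should land much closer to the refund chunk
# than to the shipping chunk.
print(cosine_similarity(refund_query, refund_policy))
print(cosine_similarity(refund_query, shipping_faq))
```

Retrieval is then just "compute this similarity between the query vector and every chunk vector, return the top K" (with an approximate-nearest-neighbour index doing the heavy lifting at scale).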
The most common mistake is defaulting to OpenAI's text-embedding-ada-002 out of habit. It's not bad, but it was the default in 2022. In 2025, OpenAI's text-embedding-3-large and Cohere's embed-v3 substantially outperform it on most benchmark tasks, and for many use cases the open-source models from Hugging Face (particularly the E5 and BGE families) are close enough in quality that the cost savings of self-hosting are worth it.
Hosted: OpenAI text-embedding-3-large (3072 dims) or Cohere embed-v3. Strong quality, reliable, pay per token. Best choice when you don't want to manage infrastructure.
Self-hosted: BAAI/bge-large-en-v1.5 or intfloat/e5-large-v2 via HuggingFace, run on a small GPU instance. Fixed cost, no data leaving your infrastructure, surprisingly competitive quality.
The critical thing: use the same embedding model for both ingestion and retrieval. If you embed your documents with model A and query with model B, the vector spaces are different and retrieval will be nonsense. This sounds obvious, and it causes production bugs roughly twice a year in every RAG codebase I've reviewed.
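One cheap way to make that bug impossible is to record the model name alongside the index at ingestion time and assert it at query time. A minimal sketch, where the index is a plain dict and embed_stub is a stand-in for a real embedding call (none of these names come from a library):

```python
EMBEDDING_MODEL = "text-embedding-3-large"  # single source of truth

def embed_stub(text: str) -> list[float]:
    # Stand-in for a real embedding API call.
    return [float(len(text)), 0.0]

def build_index(chunks: list[str]) -> dict:
    """Ingestion: stamp the index with the model that produced its vectors."""
    return {
        "embedding_model": EMBEDDING_MODEL,
        "vectors": {i: embed_stub(c) for i, c in enumerate(chunks)},
    }

def query_index(index: dict, query: str,
                model_name: str = EMBEDDING_MODEL) -> list[float]:
    """Retrieval: refuse to query with a different model than was used to ingest."""
    if index["embedding_model"] != model_name:
        raise ValueError(
            f"Index built with {index['embedding_model']!r}, "
            f"queried with {model_name!r}: vector spaces don't match."
        )
    return embed_stub(query)
```

In practice you'd keep the model name in your vector database's index metadata rather than a dict, but the principle is the same: one constant, checked on both sides.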
Retrieval: beyond pure vector search
Semantic vector search is good, but it has a specific weakness: exact keyword matching. If a user asks about "clause 7.3", a product SKU, or a proper noun the embedding model rarely saw during training, pure vector search may retrieve semantically related content that completely misses the specific item being referenced.
The solution is hybrid retrieval: combining vector similarity search with BM25 keyword search, then fusing the results. This is now supported natively by Pinecone (sparse-dense retrieval), Weaviate, and most other modern vector databases.
from pinecone import Pinecone
from cohere import Client as CohereClient

def hybrid_retrieve(
    query: str,
    top_k: int = 20,
    rerank_top_n: int = 5,
) -> list[dict]:
    # 1. Embed the query
    query_embedding = embedding_model.embed(query)

    # 2. Hybrid search: dense + sparse
    results = index.query(
        vector=query_embedding,
        sparse_vector=bm25_encode(query),
        top_k=top_k,
        include_metadata=True,
        alpha=0.7,  # 0.7 dense, 0.3 sparse weight
    )

    # 3. Rerank for precision
    # Raw vector similarity ≠ "most relevant for answering this question"
    reranked = cohere.rerank(
        query=query,
        documents=[r.metadata["content"] for r in results.matches],
        model="rerank-english-v3.0",
        top_n=rerank_top_n,
    )
    return [
        results.matches[r.index].metadata
        for r in reranked.results
    ]
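The alpha parameter is a convex weighting between the two signals. As a toy illustration of the idea, here's the fusion done by hand on made-up per-chunk scores (real hybrid databases apply the weighting internally, and the exact mechanics vary by vendor):

```python
def fuse_scores(dense: dict[str, float], sparse: dict[str, float],
                alpha: float = 0.7) -> list[tuple[str, float]]:
    """Weighted combination: alpha * dense score + (1 - alpha) * sparse score."""
    ids = set(dense) | set(sparse)
    fused = {i: alpha * dense.get(i, 0.0) + (1 - alpha) * sparse.get(i, 0.0)
             for i in ids}
    return sorted(fused.items(), key=lambda kv: kv[1], reverse=True)

# Made-up scores: chunk "c2" is a weak semantic match but an exact
# keyword hit (think "clause 7.3"), so sparse search rates it highly.
dense_scores  = {"c1": 0.9, "c2": 0.3, "c3": 0.7}
sparse_scores = {"c1": 0.1, "c2": 0.95, "c3": 0.2}

ranking = fuse_scores(dense_scores, sparse_scores, alpha=0.7)
# At alpha=0.7 the semantic match "c1" wins; lower alpha toward 0.4
# and the keyword hit "c2" takes the top spot instead.
```

This is why alpha is worth tuning on your own queries: legal and technical corpora full of clause numbers and SKUs usually want more sparse weight than conversational FAQs do.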
The reranking step is the one most people skip — and it's what separates RAG systems that produce useful answers from ones that produce plausible-sounding answers. A reranker (Cohere Rerank, cross-encoder models from HuggingFace) takes your top-20 retrieved candidates and scores each one specifically against your query. The top-20 by vector similarity are not the same as the top-5 most relevant for answering the question. The reranker finds the difference.
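The shape of the stage itself is simple: over-retrieve, rescore every (query, candidate) pair, keep the best few. A sketch with a crude term-overlap scorer standing in for a real cross-encoder (the scorer is deliberately dumb; in production you'd call Cohere Rerank or a HuggingFace cross-encoder here):

```python
def rerank(query: str, candidates: list[str], score_pair, top_n: int = 5) -> list[str]:
    """Score every (query, candidate) pair and keep the top_n candidates."""
    scored = [(score_pair(query, c), c) for c in candidates]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [c for _, c in scored[:top_n]]

def score_pair(query: str, doc: str) -> float:
    # Crude stand-in: fraction of query terms appearing in the candidate.
    q = set(query.lower().split())
    d = set(doc.lower().split())
    return len(q & d) / max(len(q), 1)

candidates = [
    "Refunds are processed to the original payment method.",
    "The refund process takes 3-5 business days.",
    "Shipping times vary by region.",
]
top = rerank("how long do refunds take", candidates, score_pair, top_n=1)
```

The key property is that the scorer sees the query and the candidate together, which a precomputed embedding never can.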
The generation step: context assembly matters
You have your top-5 reranked chunks. Now you need to pass them to the LLM. The order you pass them in matters. The instruction you give the model matters. Whether you pass them as a single block of text or with clear separators matters.
Put the most relevant chunk first. LLMs exhibit a known "lost in the middle" effect — they attend most strongly to content at the beginning and end of the context window, and underweight content in the middle. Your best chunk should be at the top, not buried after four others.
Give the model an explicit instruction about what to do when the retrieved context doesn't contain the answer. "If the provided context does not contain enough information to answer the question, say so explicitly rather than guessing" dramatically reduces hallucination. Models without this instruction will often fill in plausible-sounding gaps from their training data, defeating the entire purpose of RAG.
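Put together, a minimal context-assembly sketch (the separator format and instruction wording here are my choices, not a standard):

```python
def build_prompt(question: str, chunks: list[dict]) -> str:
    """Assemble the prompt: most relevant chunk first, clear separators,
    and an explicit instruction for the no-answer case."""
    context_blocks = [
        f"[Source {i + 1}: {c['source']}]\n{c['content']}"
        for i, c in enumerate(chunks)  # chunks arrive ranked, best first
    ]
    context = "\n\n---\n\n".join(context_blocks)
    return (
        "Answer the question using only the context below.\n"
        "If the provided context does not contain enough information "
        "to answer the question, say so explicitly rather than guessing.\n\n"
        f"Context:\n{context}\n\n"
        f"Question: {question}"
    )

prompt = build_prompt(
    "How long do refunds take?",
    [{"source": "refunds.md", "content": "Refunds take 3-5 business days."}],
)
```

Labelling each block with its source also makes it easy to ask the model for citations, which users trust far more than unattributed answers.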
Evaluating your RAG system
You cannot improve what you don't measure. RAG evaluation is genuinely hard — the answers are open-ended, there's often no single correct answer, and ground truth is expensive to create. But there are three metrics that give you most of the signal you need.
Retrieval recall: for your test set of questions, what percentage of the time does the correct source document appear in your top-K retrieved chunks? If you're retrieving the wrong documents, the LLM can't possibly give correct answers.
Answer faithfulness: does the generated answer make claims that are supported by the retrieved context? Use an LLM as a judge to score this automatically — ask it to check each claim in the answer against the source chunks. Frameworks like RAGAS automate this evaluation pattern.
Answer relevance: does the answer actually address what the user asked? An answer can be perfectly faithful to the retrieved context and still be irrelevant if the retrieval was off-topic.
Build a test set of 50–100 question/answer pairs from your domain. Run your RAG system against this set after every significant change. Track your metrics over time. This is the only way to know if you're making things better or worse.
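Retrieval recall is the easiest of the three to automate. A sketch, assuming each test question is labelled with the source document that should be retrieved (the retrieve function and labels are placeholders for your own):

```python
def recall_at_k(test_set: list[dict], retrieve, k: int = 5) -> float:
    """Fraction of questions whose labelled source appears in the top-k results."""
    hits = 0
    for item in test_set:
        retrieved = {r["source"] for r in retrieve(item["question"])[:k]}
        if item["expected_source"] in retrieved:
            hits += 1
    return hits / len(test_set)

# Toy retriever over a fixed mapping, standing in for real hybrid retrieval.
def toy_retrieve(question: str) -> list[dict]:
    if "refund" in question:
        return [{"source": "shipping.md"}, {"source": "refund-policy.md"}]
    return [{"source": "shipping.md"}]

test_set = [
    {"question": "how long do refunds take", "expected_source": "refund-policy.md"},
    {"question": "when will my order ship", "expected_source": "shipping.md"},
    {"question": "can I get a refund by cheque", "expected_source": "refund-policy.md"},
]

score = recall_at_k(test_set, toy_retrieve, k=5)
# Note how the score drops at k=1: the right document is retrieved,
# just not ranked first -- a ranking problem, not a recall problem.
```

Tracking recall at several values of K tells you whether a failure is a retrieval problem (document never appears) or a ranking problem (it appears, but too low), which point at different fixes.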
Build the ingestion pipeline first
When I build RAG systems for clients, I spend the first 40% of the project on ingestion: document loading, cleaning, chunking strategy, metadata extraction, and embedding pipeline. Most clients want to jump straight to the "AI answers questions" part. I've learned to resist that. A solid ingestion pipeline produces dramatically better answers than a clever prompt on top of garbage retrieval.
If you want to go deeper on building AI tools in general — the infrastructure, the error handling, the cost management — my post on building production LLM tools in Python covers the scaffolding that supports a RAG system.
And if you need a RAG system built for your organisation — whether that's a knowledge base Q&A tool, a document search system, or a customer support assistant that answers from your own content — that's exactly what my AI automation service covers. I build the ingestion pipeline, the retrieval system, the evaluation framework, and the application layer — not just the "paste your documents into a chatbot" version.