
Building a RAG System From Scratch: What Nobody Tells You

Retrieval-Augmented Generation sounds, in theory, like a solved problem. Embed your documents, store the embeddings, retrieve the relevant ones when a user asks a question, pass them to an LLM, get a grounded answer back. Four steps. Dozens of tutorials. And yet most RAG implementations I've reviewed either hallucinate constantly, retrieve the wrong documents, or produce answers so hedged they're useless. The gap between "RAG that works in a demo" and "RAG that works for real users on real documents" is almost entirely about the details nobody writes about. This post covers those details.

The full pipeline, before we optimise anything

Before talking about what goes wrong, it's worth being precise about what a RAG pipeline actually is. There are two distinct phases: ingestion (processing your documents) and retrieval (answering queries). Most people focus on retrieval. Most failures happen in ingestion.

Phase 1: Load documents
Phase 2: Chunk text
Phase 3: Embed chunks
Phase 4: Store vectors
Query time: Retrieve + generate

Every step in that pipeline has failure modes that silently degrade your system's quality. The LLM at the end is almost never the problem. If your answers are wrong, the cause is almost always upstream — bad chunking, weak embeddings, or poor retrieval that fetches loosely related rather than directly relevant context.

Chunking: where most RAG systems fail first

Chunking is the process of splitting your source documents into smaller pieces before embedding them. This is the decision that most affects retrieval quality, and it's almost always treated as an afterthought — "just split every 500 tokens with a 50-token overlap" — a default setting that performs adequately on clean, well-structured text and catastrophically on the messy documents real organisations actually have.

The naive approach and why it breaks

Fixed-size token splitting cuts your text every N tokens regardless of semantic boundaries. A sentence gets cut in half. A step in a process is separated from the step before it that provided context. A table header ends up in one chunk while the table data is in another. When these fragmented chunks get embedded and retrieved, the retrieved context doesn't contain enough information to answer the question — and the LLM either hallucinates what's missing or hedges into uselessness.

Failure mode: semantic fragmentation

Document: "The refund process takes 3–5 business days. Refunds are processed to the original payment method. International orders may take longer due to currency conversion." Fixed-size splitting might put the first sentence in chunk 47 and the second in chunk 48. A query about "how long do refunds take" retrieves chunk 47 — which mentions 3–5 days but not the international exception. User gets incomplete information. LLM can't fill in what it wasn't given.
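The fragmentation is easy to demonstrate. A minimal sketch, using a naive fixed-size splitter on the refund-policy text above (the chunk size is deliberately small here so the shearing is visible):

```python
# The refund-policy text from the example above.
text = (
    "The refund process takes 3-5 business days. "
    "Refunds are processed to the original payment method. "
    "International orders may take longer due to currency conversion."
)

def fixed_size_split(text: str, size: int) -> list[str]:
    """Split text every `size` characters, ignoring semantic boundaries."""
    return [text[i:i + size] for i in range(0, len(text), size)]

chunks = fixed_size_split(text, 60)
for c in chunks:
    print(repr(c))
```

The chunk that mentions "3-5 business days" does not contain the international exception, so a query about refund timing retrieves an answer that silently omits the caveat.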

Fix: semantic chunking with meaningful overlap

Use a recursive character text splitter that respects paragraph boundaries, then headings, then sentences — only falling back to character-level splitting when necessary. Set overlap to 15–20% of chunk size so context from the previous chunk bleeds into the next. For structured documents (policies, manuals, FAQs), chunk by section, not by token count.

python — semantic chunking with LangChain
from langchain.text_splitter import RecursiveCharacterTextSplitter

splitter = RecursiveCharacterTextSplitter(
    chunk_size=800,          # characters, not tokens
    chunk_overlap=120,        # ~15% overlap
    separators=[
        "\n\n",              # paragraph breaks first
        "\n",                # then line breaks
        ". ",                # then sentence endings
        ", ",                # then clause boundaries
        " ",                 # then words
        ""                   # last resort: characters
    ],
    length_function=len,
)

chunks = splitter.split_text(document_text)

# Add metadata to each chunk for better retrieval.
# document_text and document_path come from your loading step;
# extract_section_heading is your own helper (e.g. a regex over headings).
enriched_chunks = []
for i, chunk in enumerate(chunks):
    enriched_chunks.append({
        "content": chunk,
        "chunk_index": i,
        "source": document_path,
        "section": extract_section_heading(chunk),
        "word_count": len(chunk.split())
    })
Add metadata to every chunk at ingestion time — the source document, the section heading, the chunk index, the page number if relevant. This metadata becomes searchable and filterable at retrieval time, and it's the foundation of any explainability in your answers ("based on section 4.2 of your returns policy…").

Choosing the right embedding model

The embedding model converts text into a vector of numbers that represents its meaning. Two pieces of text with similar meaning end up with vectors that are close together in the vector space. The quality of your embeddings directly determines the quality of your retrieval.
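"Close together" here usually means cosine similarity. A minimal sketch of the geometry, using hand-made toy vectors as stand-ins for real embeddings (real ones have hundreds or thousands of dimensions):

```python
import math

def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine of the angle between two vectors: 1.0 = same direction."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

refund_query   = [0.9, 0.1, 0.2]   # toy vector: "how long do refunds take"
refund_chunk   = [0.8, 0.2, 0.1]   # toy vector: chunk about refund timing
shipping_chunk = [0.1, 0.9, 0.3]   # toy vector: chunk about shipping rates

sim_relevant = cosine_similarity(refund_query, refund_chunk)
sim_offtopic = cosine_similarity(refund_query, shipping_chunk)
# The relevant chunk scores higher, so it ranks first at retrieval time.
```

Vector databases compute exactly this comparison (or an approximation of it) across millions of stored chunks.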

The most common mistake is defaulting to OpenAI's text-embedding-ada-002 out of habit. It's not bad. But it was the default in 2022. In 2025, OpenAI's text-embedding-3-large and Cohere's embed-v3 substantially outperform it on most benchmark tasks, and for many use cases the open-source models from Hugging Face (particularly the E5 and BGE families) are close enough in quality that the cost savings of self-hosting are worth it.

For most production use cases

OpenAI text-embedding-3-large (3072 dims) or Cohere embed-v3. Strong quality, hosted, reliable. Pay per token. Best choice when you don't want to manage infrastructure.

For cost-sensitive or private data

BAAI/bge-large-en-v1.5 or intfloat/e5-large-v2 via HuggingFace. Self-hosted on a small GPU instance. Fixed cost, no data leaving your infrastructure, surprisingly competitive quality.

The critical thing: use the same embedding model for both ingestion and retrieval. If you embed your documents with model A and query with model B, the vector spaces are different and retrieval will be nonsense. This sounds obvious, and it causes production bugs roughly twice a year in every RAG codebase I've reviewed.
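One cheap guard is to record the embedding model's identifier alongside the index at ingestion time and assert it at query time. A sketch, where INDEX_META is a stand-in for wherever your index configuration actually lives (a config table, the vector database's own metadata):

```python
# Stored once at ingestion time, next to the index itself.
INDEX_META = {"embedding_model": "text-embedding-3-large", "dimensions": 3072}

def check_embedding_model(query_model: str) -> None:
    """Fail loudly if the query-time model differs from the indexed one."""
    indexed = INDEX_META["embedding_model"]
    if query_model != indexed:
        raise ValueError(
            f"Query embeds with {query_model!r} but the index was built "
            f"with {indexed!r}; retrieval results would be meaningless."
        )

check_embedding_model("text-embedding-3-large")  # passes silently
```

A mismatched model name then fails at the first query instead of silently returning garbage for weeks.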

Retrieval: beyond pure vector search

Semantic vector search is good. But it has a specific weakness: keyword matching. If a user asks about "clause 7.3" or uses a product SKU or a proper noun that doesn't appear in the embedding model's training data, pure vector search may retrieve semantically related content that completely misses the specific item being referenced.

The solution is hybrid retrieval: combining vector similarity search with BM25 keyword search, then fusing the results. This is now supported natively by Pinecone (sparse-dense retrieval), Weaviate, and most other modern vector databases.

python — hybrid retrieval with reranking
from pinecone import Pinecone
from cohere import Client as CohereClient

pc = Pinecone(api_key="...")   # your Pinecone API key
index = pc.Index("docs")       # an index created with sparse-dense support
co = CohereClient("...")       # your Cohere API key

def hybrid_scale(dense: list[float], sparse: dict, alpha: float):
    # Pinecone's query() has no alpha parameter -- you weight the
    # dense and sparse vectors yourself before querying.
    scaled_dense = [v * alpha for v in dense]
    scaled_sparse = {
        "indices": sparse["indices"],
        "values": [v * (1 - alpha) for v in sparse["values"]],
    }
    return scaled_dense, scaled_sparse

def hybrid_retrieve(
    query: str,
    top_k: int = 20,
    rerank_top_n: int = 5
) -> list[dict]:
    # 1. Embed the query -- embedding_model and bm25_encode must be the
    #    same components used at ingestion time
    dense, sparse = hybrid_scale(
        embedding_model.embed(query),
        bm25_encode(query),
        alpha=0.7  # 0.7 dense, 0.3 sparse weight
    )

    # 2. Hybrid search: dense + sparse in one query
    results = index.query(
        vector=dense,
        sparse_vector=sparse,
        top_k=top_k,
        include_metadata=True
    )

    # 3. Rerank for precision
    # Raw vector similarity ≠ "most relevant for answering this question"
    reranked = co.rerank(
        query=query,
        documents=[m.metadata["content"] for m in results.matches],
        model="rerank-english-v3.0",
        top_n=rerank_top_n
    )

    return [
        results.matches[r.index].metadata
        for r in reranked.results
    ]

The reranking step is the one most people skip — and it's what separates RAG systems that produce useful answers from ones that produce plausible-sounding answers. A reranker (Cohere Rerank, cross-encoder models from HuggingFace) takes your top-20 retrieved candidates and scores each one specifically against your query. The top-20 by vector similarity are not the same as the top-5 most relevant for answering the question. The reranker finds the difference.

The generation step: context assembly matters

You have your top-5 reranked chunks. Now you need to pass them to the LLM. The order you pass them in matters. The instruction you give the model matters. Whether you pass them as a single block of text or with clear separators matters.

Put the most relevant chunk first. LLMs exhibit a known "lost in the middle" effect — they attend most strongly to content at the beginning and end of the context window, and underweight content in the middle. Your best chunk should be at the top, not buried after four others.

Give the model an explicit instruction about what to do when the retrieved context doesn't contain the answer. "If the provided context does not contain enough information to answer the question, say so explicitly rather than guessing" dramatically reduces hallucination. Models without this instruction will often fill in plausible-sounding gaps from their training data, defeating the entire purpose of RAG.
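Both rules can be baked into a single assembly function. A minimal sketch, assuming the chunks arrive already ordered by rerank score (the separators and instruction wording here are illustrative, not canonical):

```python
def build_prompt(question: str, reranked_chunks: list[dict]) -> str:
    """Assemble context with the best chunk first, clear separators,
    and an explicit instruction for the no-answer case."""
    context_blocks = []
    for i, chunk in enumerate(reranked_chunks, start=1):
        context_blocks.append(
            f"[Source {i}: {chunk['source']}, section {chunk['section']}]\n"
            f"{chunk['content']}"
        )
    context = "\n\n---\n\n".join(context_blocks)
    return (
        "Answer the question using ONLY the context below.\n"
        "If the provided context does not contain enough information "
        "to answer, say so explicitly rather than guessing.\n\n"
        f"Context:\n{context}\n\n"
        f"Question: {question}\n"
    )
```

Labelling each block with its source also gives the model something concrete to cite, which feeds directly into the explainability mentioned earlier.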

Evaluating your RAG system

You cannot improve what you don't measure. RAG evaluation is genuinely hard — the answers are open-ended, there's often no single correct answer, and ground truth is expensive to create. But there are three metrics that give you most of the signal you need.

Retrieval recall: for your test set of questions, what percentage of the time does the correct source document appear in your top-K retrieved chunks? If you're retrieving the wrong documents, the LLM can't possibly give correct answers.
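Recall@K is simple enough to compute by hand. A sketch, assuming each test case pairs a question with the source document that should surface, and `retrieve` is whatever retrieval function your pipeline exposes (names here are illustrative):

```python
def recall_at_k(test_set: list[dict], retrieve, k: int = 5) -> float:
    """Fraction of test questions whose expected source document
    appears among the top-k retrieved chunks."""
    hits = 0
    for case in test_set:
        retrieved = retrieve(case["question"])[:k]
        sources = {chunk["source"] for chunk in retrieved}
        if case["expected_source"] in sources:
            hits += 1
    return hits / len(test_set)
```

Run it after every chunking or retrieval change; a drop in recall@K tells you the problem is upstream of the LLM before you waste time on prompts.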

Answer faithfulness: does the generated answer make claims that are supported by the retrieved context? Use an LLM as a judge to score this automatically — ask it to check each claim in the answer against the source chunks. Frameworks like RAGAS automate this evaluation pattern.

Answer relevance: does the answer actually address what the user asked? An answer can be perfectly faithful to the retrieved context and still be irrelevant if the retrieval was off-topic.

Build a test set of 50–100 question/answer pairs from your domain. Run your RAG system against this set after every significant change. Track your metrics over time. This is the only way to know if you're making things better or worse.

Build the ingestion pipeline first

When I build RAG systems for clients, I spend the first 40% of the project on ingestion: document loading, cleaning, chunking strategy, metadata extraction, and embedding pipeline. Most clients want to jump straight to the "AI answers questions" part. I've learned to resist that. A solid ingestion pipeline produces dramatically better answers than a clever prompt on top of garbage retrieval.

If you want to go deeper on building AI tools in general — the infrastructure, the error handling, the cost management — my post on building production LLM tools in Python covers the scaffolding that supports a RAG system.

And if you need a RAG system built for your organisation — whether that's a knowledge base Q&A tool, a document search system, or a customer support assistant that answers from your own content — that's exactly what my AI automation service covers. I build the ingestion pipeline, the retrieval system, the evaluation framework, and the application layer — not just the "paste your documents into a chatbot" version.

Need a RAG system that actually works?

I build production RAG pipelines — from document ingestion through evaluation and deployment. Systems that retrieve correctly and hallucinate far less than the defaults.

See my AI service →

FAQs

What AI topics do you write about?

Production LLM use: prompts, evaluation, RAG architecture, cost/latency trade-offs, and tooling that helps teams ship safely—not slide-deck AI hype.

Do you recommend a specific model vendor?

Recommendations are task-dependent. Posts discuss interfaces and guardrails that port across providers; your contract, privacy, and latency needs should drive vendor choice.

How do you think about hallucinations?

Grounding, citations, retrieval quality, structured outputs, and human review for high-risk domains. No single trick eliminates risk; systems need measurement.

Are the free AI-related tools connected to the blog?

Yes. Prompt structuring, readability, and content brief helpers complement several articles in this category.

Can you help build an internal AI assistant?

That is a common engagement. Expect discovery on documents, access control, evaluation sets, and rollout—not a two-day chatbot demo without criteria.