Jeroen Herczeg
A Retrieval-Augmented Generation (RAG) system combines information retrieval with Large Language Models (LLMs) to improve the quality and relevance of generated text. This allows LLMs to access up-to-date or private information and provide factual answers with verifiable sources.
When you first learn about RAG, it can come across as a simple way to improve the accuracy of an LLM. But once you start implementing one, you realize it is quite involved and requires a solid grasp of both retrieval and generation techniques.
In this book, we will start by examining how large language models work, as well as their limitations and challenges. Next, we'll take a close look at the RAG architecture and how it can improve the performance of a language model. Finally, we'll discuss the most frequent difficulties encountered when developing a RAG application.
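The retrieve-augment-generate loop at the heart of that architecture can be sketched in a few lines. This is a toy illustration, not code from the book: the keyword-overlap retriever and the `retrieve`/`build_prompt` helpers are assumptions standing in for a real embedding-based retriever and an actual LLM call.

```python
# Toy sketch of a RAG loop. The documents, the keyword-overlap scoring,
# and the helper names are illustrative assumptions, not a real system.

DOCUMENTS = [
    "The parental leave policy grants 16 weeks of paid leave.",
    "VPN access requires enrolling a device in the MDM system.",
    "Expense reports must be submitted within 30 days.",
]

def retrieve(query: str, docs: list, k: int = 2) -> list:
    """Rank documents by naive keyword overlap with the query (retrieval step)."""
    q_words = set(query.lower().split())
    scored = sorted(
        docs,
        key=lambda d: len(q_words & set(d.lower().split())),
        reverse=True,
    )
    return scored[:k]

def build_prompt(query: str, context: list) -> str:
    """Augment the query with retrieved context so the answer stays grounded."""
    sources = "\n".join(f"[{i + 1}] {c}" for i, c in enumerate(context))
    return f"Answer using only these sources:\n{sources}\n\nQuestion: {query}"

query = "How many weeks of parental leave do we get?"
context = retrieve(query, DOCUMENTS)
prompt = build_prompt(query, context)
# `prompt` would then be sent to an LLM for the generate step.
```

In a production pipeline, the keyword overlap would be replaced by vector similarity search over embeddings, and the assembled prompt would be passed to an LLM.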
Every chapter builds on a single running example with real code you can clone and run. By the end, you will have built, evaluated, hardened, and stress-tested a production RAG system from scratch.
What LLMs actually are (and aren't)
The three failure modes that matter
The RAG pipeline end-to-end
RAG vs. fine-tuning vs. long context
The seven failure points
Bare LLM vs. RAG demo
From words to vectors
Bi-encoder architecture
Sentence-transformers
Cosine similarity
Visualizing embedding space with UMAP
Similarity search from scratch
The chunk size tradeoff
Fixed-size, recursive, semantic, document-structure-aware, and contextual chunking
Comparing strategies with a retrieval test
Exact vs. approximate nearest neighbor
How HNSW works
FAISS, pgvector, and Qdrant
Tuning index parameters
The comparison benchmark
Parsing real-world documents
Text cleaning and normalization
The full pipeline: parse, clean, chunk, embed, store
Metadata extraction
Idempotent re-ingestion
Sparse vs. dense retrieval
BM25
Semantic search
Keywords vs. semantics side-by-side
Building a retrieval evaluation harness
Retrieve, augment, generate
Hallucination in RAG
Prompt engineering for grounded answers
Context window management
Cataloging your RAG failures
The score fusion problem
Reciprocal Rank Fusion from scratch
Weighted score fusion
Metadata filtering
Measuring hybrid search improvement
Bi-encoder vs. cross-encoder
Adding a reranker to the pipeline
Tuning K and N
Latency budgets
The query-document asymmetry
Multi-query generation
Sub-question decomposition
HyDE
Query routing
Measuring the impact
Recall@K, Precision@K, MRR, nDCG
Faithfulness and Answer Relevancy via RAGAS
Building evaluation datasets
Ablation testing
Observability and tracing
Semantic caching
Citation and provenance
Embedding staleness
Cost optimization
Load testing
The production readiness report
The “more complex isn't always better” principle
Knowledge graphs and GraphRAG
Corrective RAG
Self-RAG
Measuring whether complexity pays off
From pipelines to agents
The 90% failure rate
Function-calling RAG
Multi-step retrieval with planning
Query routing across multiple knowledge bases
Building guardrails
Stress-testing agentic RAG
The running example throughout is the Acme Corp knowledge base: a fictional 500-employee SaaS company with 110 internal documents across HR, IT, operations, compliance, product, and engineering. The corpus is engineered to surface every failure mode the book teaches.
Enter your email address and I’ll send you the first chapter from the book for free.