Retrieval-Augmented Generation, book cover

Now available

Build RAG systems
that survive real users.

A practical engineering book for building retrieval-augmented generation systems with your own data, from the first prototype to production patterns like evaluation, observability, and agentic RAG.

Get it on Leanpub | Read a free chapter

Most teams trying to ship a RAG system stall at the prototype stage. The notebook works and the demo wins the meeting, but the system never reaches users at scale. The gap between “this works on my laptop” and “this runs reliably in production” is wide and full of engineering challenges that don't make the demo reel. This book is about that gap.

The book

For engineers who need to ship something real

This book is written for engineers, not researchers writing benchmarks or managers picking vendors. It is for the person at the keyboard who needs to make decisions about chunking strategy, vector store choice, evaluation methodology, and production operations, and who's tired of vendor-shaped blog posts and tutorials that stop at the prototype.

Each chapter pairs concept with implementation: real code on a real corpus, runnable end to end. The seven failure points of a RAG pipeline are introduced in chapter 1 and traced through every subsequent chapter, so you learn to recognize where things break, not just patch them when they do.

By the end you'll be able to


Table of contents

13 chapters. 268 pages. From foundations to production to agentic RAG.

Every chapter builds on a single running example, the Acme Corp knowledge base, with real code you can clone and run.

  1. The Problem RAG Solves

    What an LLM can and cannot do
    Limitations of a standalone LLM
    The RAG mental model
    The RAG pipeline end-to-end
    RAG vs. fine-tuning vs. long-context prompting
    The seven failure points
    Standalone LLM vs. RAG demo

  2. Embeddings

    From words to vectors
    The bi-encoder architecture
    Generating embeddings locally and via API
    Cosine similarity and distance metrics
    Visualizing embedding space with UMAP
    Choosing an embedding model
    Similarity search from scratch

  3. Chunking Strategies

    The chunk size tradeoff
    Fixed-size chunking
    Recursive character splitting
    Semantic chunking
    Document-structure-aware chunking
    Contextual chunking
    Comparing strategies: a retrieval test

  4. Vector Storage and Indexing

    Exact vs. approximate nearest neighbor
    The speed-accuracy-memory tradeoff
    How HNSW works
    FAISS, pgvector, and Qdrant
    Tuning index parameters
    The comparison benchmark

  5. Building the Ingestion Pipeline

    The ingestion flow
    Parsing real-world documents
    Text cleaning and normalization
    The full pipeline: parse, clean, chunk, embed, store
    Metadata extraction and storage
    Idempotent re-ingestion

  6. Hybrid Retrieval

    Keyword retrieval and the BM25 mental model
    Adding a search vector to the chunks table
    Where each retriever fails the other's queries
    Hybrid retrieval as candidate generation
    Filters as candidate-set scoping

  7. Your First RAG Pipeline

    Selecting context from the candidate pool
    Building the prompt
    The complete pipeline
    Where the pipeline succeeds and fails
    The failure catalog

  8. Reranking

    Why first-stage retrieval optimizes for recall
    Bi-encoder versus cross-encoder
    Adding a local reranker (bge-reranker-v2-m3)
    Choosing K and N
    Latency and the cost of cross-encoders
    When reranking is not worth it

  9. Query Transformation

    Where query transformation belongs
    Query rewriting
    HyDE: search with a hypothetical answer
    Multi-query expansion
    Decomposition
    When transformation hurts

  10. Evaluating RAG Systems

    Two evaluation surfaces
    Building an evaluation set
    Retrieval metrics
    Generation metrics
    The ablation table
    Regression tracking

  11. Hardening for Production

    Stage-level observability
    Tracing across stages
    Failure modes and graceful degradation
    Configuration and secrets
    Model versioning and the silent-rebuild trap
    Security boundaries in RAG systems
    Deploying changes safely

  12. Advanced Retrieval Patterns

    Parent-document retrieval
    Contextual retrieval
    Graph-based retrieval
    ColBERT and late interaction
    The complexity test

  13. Agentic RAG

    Retrieval as a tool call
    Multi-step reasoning loops
    Bounding agentic loops
    Observability for agents
    When agentic RAG is worth it

Running example

One corpus. 13 chapters. Real code you can run.

Every chapter operates on a single running example: the Acme Corp knowledge base, a fictional 500-employee SaaS company with 110 internal documents across HR, IT, operations, compliance, product, and engineering. The corpus is engineered to surface every failure mode the book teaches.

Stop shipping RAG demos
that break in production.

Learn how to build retrieval systems you can test, debug, measure, and improve.


About the author

Jeroen Herczeg

Jeroen Herczeg is a senior software engineer who builds AI systems for production. He has twenty years of engineering experience, now applied to AI agent orchestration and retrieval-augmented generation.

Most recently he built the AI agents demo for Google and the BBC, which won the Broadcast Tech Innovation Award. He started studying AI seriously in 2017 with Udacity's Artificial Intelligence Nanodegree, well before the current wave of large language models.

He writes about practical AI engineering at herczeg.be/blog and lives in Belgium. This book exists because most of what he learned shipping production AI is locked in private codebases, and someone should write it down.