
Retrieval-Augmented Generation

An Engineer's Guide to Building RAG Systems with Your Own Data

Twenty years of engineering experience, now applied to AI agent orchestration and retrieval-augmented generation.

Get it on Leanpub
Read the first chapter free

Why this book exists

Most teams trying to ship a RAG system stall at the prototype stage. The notebook works, the demo wins the meeting, the system never reaches users at scale. The gap between “this works on my laptop” and “this runs reliably in production” is wide, and full of engineering challenges that don't make the demo reel. This book is about that gap.

The book

For engineers who need to ship something real

This book is written for engineers, not for researchers writing benchmarks or managers picking vendors. It's for the person at the keyboard who needs to make decisions about chunking strategy, vector store choice, evaluation methodology, and production operations, and who's tired of vendor-shaped blog posts and tutorials that stop at the prototype.

Each chapter pairs concept with implementation. Real code on a real corpus, runnable end to end. The seven failure points of a RAG pipeline are introduced in chapter 1 and traced through every subsequent chapter, so you learn to recognize where things break, not just patch them when they do.
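To give a flavor of that pairing, here is a minimal, dependency-free sketch of the retrieve-augment-generate loop the chapters build out. The function names and toy corpus are illustrative only, not the book's actual API, and the bag-of-words "embedding" stands in for the sentence-transformers models used in the book.

```python
# Minimal retrieve-augment-generate sketch. Illustrative only:
# the book's pipeline uses real embedding models and a vector store.
from collections import Counter
import math

def embed(text: str) -> Counter:
    # Stand-in embedding: a bag-of-words term-count vector.
    return Counter(text.lower().split())

def cosine(a: Counter, b: Counter) -> float:
    # Cosine similarity between two sparse count vectors.
    dot = sum(a[t] * b[t] for t in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def retrieve(query: str, corpus: list[str], k: int = 2) -> list[str]:
    # Rank documents by similarity to the query; keep the top k.
    q = embed(query)
    return sorted(corpus, key=lambda d: cosine(q, embed(d)), reverse=True)[:k]

def augment(query: str, docs: list[str]) -> str:
    # Build the prompt an LLM would receive: retrieved context plus question.
    context = "\n".join(f"- {d}" for d in docs)
    return f"Answer using only this context:\n{context}\n\nQuestion: {query}"

corpus = [
    "Employees accrue 20 vacation days per year.",
    "The VPN requires two-factor authentication.",
    "Expense reports are due by the 5th of each month.",
]
question = "How many vacation days do I get?"
prompt = augment(question, retrieve(question, corpus))
```

The real chapters replace each stand-in with production machinery: embedding models, ANN indexes, rerankers, and an LLM call for the generate step.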

By the end, you'll be able to take a RAG system from notebook prototype to production. Every chapter builds on a single running example with real code you can clone and run.

Early access

The book is in active development.

Chapters 1 to 4 are polished and ready (the Foundations sequence). New chapters ship every 2 to 3 weeks until the full book is complete (target: end of 2026). Buying now gets you all current chapters plus every future update at no additional cost.

Buy on Leanpub

Table of contents

14 chapters. 358 pages. From foundations to production to agentic RAG.

Every chapter builds on a single running example, the Acme Corp knowledge base, with real code you can clone and run.

Part I. Foundations

  1. The Problem RAG Solves (p. 2)

    What LLMs actually are (and aren't)
    The three failure modes that matter
    The RAG pipeline end-to-end
    RAG vs. fine-tuning vs. long context
    The seven failure points
    Bare LLM vs. RAG demo

  2. Embeddings from First Principles (p. 18)

    From words to vectors
    Bi-encoder architecture
    Sentence-transformers
    Cosine similarity
    Visualizing embedding space with UMAP
    Similarity search from scratch

  3. Chunking Strategies (p. 36)

    The chunk size tradeoff
    Fixed-size, recursive, semantic, document-structure-aware, and contextual chunking
    Comparing strategies with a retrieval test

  4. Vector Storage and Indexing (p. 58)

    Exact vs. approximate nearest neighbor
    How HNSW works
    FAISS, pgvector, and Qdrant
    Tuning index parameters
    The comparison benchmark

  5. Building the Ingestion Pipeline (p. 85)

    Parsing real-world documents
    Text cleaning and normalization
    The full pipeline: parse, clean, chunk, embed, store
    Metadata extraction
    Idempotent re-ingestion

  6. Retrieval: From Keywords to Semantics (p. 115)

    Sparse vs. dense retrieval
    BM25
    Semantic search
    Keywords vs. semantics side-by-side
    Building a retrieval evaluation harness

Part II. Building and Improving

  7. Your First RAG Pipeline (p. 139)

    Retrieve, augment, generate
    Hallucination in RAG
    Prompt engineering for grounded answers
    Context window management
    Cataloging your RAG failures

  8. Hybrid Search and Score Fusion (p. 163)

    The score fusion problem
    Reciprocal Rank Fusion from scratch
    Weighted score fusion
    Metadata filtering
    Measuring hybrid search improvement

  9. Reranking (p. 188)

    Bi-encoder vs. cross-encoder
    Adding a reranker to the pipeline
    Tuning K and N
    Latency budgets

  10. Query Transformation (p. 211)

    The query-document asymmetry
    Multi-query generation
    Sub-question decomposition
    HyDE
    Query routing
    Measuring the impact

  11. Evaluating RAG Systems (p. 238)

    Recall@K, Precision@K, MRR, nDCG
    Faithfulness and Answer Relevancy via RAGAS
    Building evaluation datasets
    Ablation testing

Part III. Production and Beyond

  12. Hardening the Pipeline for Production (p. 271)

    Observability and tracing
    Semantic caching
    Citation and provenance
    Embedding staleness
    Cost optimization
    Load testing
    The production readiness report

  13. Advanced Retrieval Patterns (p. 304)

    The “more complex isn't always better” principle
    Knowledge graphs and GraphRAG
    Corrective RAG
    Self-RAG
    Measuring whether complexity pays off

  14. Agentic RAG (p. 332)

    From pipelines to agents
    The 90% failure rate
    Function-calling RAG
    Multi-step retrieval with planning
    Query routing across multiple knowledge bases
    Building guardrails
    Stress-testing agentic RAG
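
As a taste of the "from scratch" style the table of contents promises, Reciprocal Rank Fusion (the core of the Hybrid Search and Score Fusion chapter) fits in a few lines. This is a hedged sketch under the standard RRF formula, not the book's full implementation; the document IDs and k=60 default are illustrative.

```python
# Reciprocal Rank Fusion: fuse several ranked lists by scoring each
# document as the sum of 1 / (k + rank) over every list it appears in.
def rrf(rankings: list[list[str]], k: int = 60) -> list[str]:
    scores: dict[str, float] = {}
    for ranking in rankings:
        for rank, doc in enumerate(ranking, start=1):
            scores[doc] = scores.get(doc, 0.0) + 1.0 / (k + rank)
    # Highest fused score first.
    return sorted(scores, key=scores.get, reverse=True)

# Example: a keyword (BM25) ranking and a dense (embedding) ranking disagree.
bm25_ranking = ["doc_a", "doc_b", "doc_c"]
dense_ranking = ["doc_b", "doc_c", "doc_a"]
fused = rrf([bm25_ranking, dense_ranking])
```

Because RRF works on ranks rather than raw scores, it sidesteps the score-normalization problem that makes naive score averaging unreliable across retrievers.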

Running example

One corpus. 14 chapters. Real code you can run.

Every chapter operates on a single running example: the Acme Corp knowledge base, a fictional 500-employee SaaS company with 110 internal documents across HR, IT, operations, compliance, product, and engineering. The corpus is engineered to surface every failure mode the book teaches.

About the author
Jeroen Herczeg

Jeroen Herczeg is a senior software engineer who builds AI systems for production, with twenty years of engineering experience now applied to AI agent orchestration and retrieval-augmented generation.

Most recently he built the AI agents demo for Google and BBC, which won the Broadcast Tech Innovation Award. He started studying AI seriously in 2017 with Udacity's Artificial Intelligence Nanodegree, well before the current wave of large language models.

He writes about practical AI engineering at herczeg.be/blog and lives in Belgium. This book exists because most of what he learned shipping production AI is locked in private codebases, and someone should write it down.

Pre-order

Become an early reader.

Get the first chapter free. No spam, just book updates.