Hands-On Retrieval-Augmented Generation

Jeroen Herczeg

A hands-on guide to grounding LLMs with your own data.


What is Retrieval-Augmented Generation (RAG)?

A Retrieval-Augmented Generation (RAG) system combines information retrieval with Large Language Models (LLMs) to improve the quality and relevance of generated text. This allows LLMs to access up-to-date or private information and provide factual answers with verifiable sources.
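That definition can be sketched in a few lines of Python. This is an illustrative toy, not code from the book: the documents, the term-overlap "embedding", and the prompt template are stand-ins, and a real system would use learned embeddings and an actual LLM call for the generation step.

```python
# Toy RAG sketch: retrieve the most relevant document, then ground the prompt.
from collections import Counter
import math

DOCS = [
    "The vacation policy grants 25 paid days off per year.",
    "VPN access requires a hardware token issued by IT.",
]

def embed(text: str) -> Counter:
    # Toy "embedding": a bag-of-words term-frequency vector.
    return Counter(text.lower().split())

def cosine(a: Counter, b: Counter) -> float:
    # Cosine similarity between two sparse term-frequency vectors.
    dot = sum(a[t] * b[t] for t in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def retrieve(query: str, k: int = 1) -> list[str]:
    # Rank documents by similarity to the query and keep the top k.
    q = embed(query)
    return sorted(DOCS, key=lambda d: cosine(q, embed(d)), reverse=True)[:k]

def build_prompt(query: str) -> str:
    # Augment: the retrieved context grounds the model's answer.
    context = "\n".join(retrieve(query))
    return f"Answer using only this context:\n{context}\n\nQuestion: {query}"

print(build_prompt("How many vacation days do I get?"))
```

The "generate" step would pass this prompt to an LLM; because the answer must come from the supplied context, the response can cite a verifiable source instead of relying on the model's training data.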

This book is designed to teach you practical, hands-on methods for implementing Retrieval-Augmented Generation (RAG).

When you first learn about RAG, it can come across as a simple add-on for improving the accuracy of a Large Language Model. But once you start implementing one, you realize it is surprisingly involved and requires a solid grasp of both retrieval and generation techniques.

In this book, we will start by examining how large language models work, as well as their limitations and challenges. Next, we'll take a close look at the RAG architecture and how it can improve the performance of a language model. Finally, we'll discuss the most frequent difficulties encountered when developing a RAG application.

Every chapter builds on a single running example with real code you can clone and run. By the end, you will have built, evaluated, hardened, and stress-tested a production RAG system from scratch.

Get the first chapter for free, straight to your inbox

01 Table of contents

14 chapters. 358 pages. From foundations to production to agentic RAG.

Every chapter builds on a single running example — the Acme Corp knowledge base — with real code you can clone and run.

Part I — Foundations

  1. p. 2 The Problem RAG Solves

    What LLMs actually are (and aren't)
    The three failure modes that matter
    The RAG pipeline end-to-end
    RAG vs. fine-tuning vs. long context
    The seven failure points
    Bare LLM vs. RAG demo

  2. p. 18 Embeddings from First Principles

    From words to vectors
    Bi-encoder architecture
    Sentence-transformers
    Cosine similarity
    Visualizing embedding space with UMAP
    Similarity search from scratch

  3. p. 36 Chunking Strategies

    The chunk size tradeoff
    Fixed-size, recursive, semantic, document-structure-aware, and contextual chunking
    Comparing strategies with a retrieval test

  4. p. 58 Vector Storage and Indexing

    Exact vs. approximate nearest neighbor
    How HNSW works
    FAISS, pgvector, and Qdrant
    Tuning index parameters
    The comparison benchmark

  5. p. 85 Building the Ingestion Pipeline

    Parsing real-world documents
    Text cleaning and normalization
    The full pipeline: parse, clean, chunk, embed, store
    Metadata extraction
    Idempotent re-ingestion

  6. p. 115 Retrieval: From Keywords to Semantics

    Sparse vs. dense retrieval
    BM25
    Semantic search
    Keywords vs. semantics side-by-side
    Building a retrieval evaluation harness

Part II — Building and Improving

  1. p. 139 Your First RAG Pipeline

    Retrieve, augment, generate
    Hallucination in RAG
    Prompt engineering for grounded answers
    Context window management
    Cataloging your RAG failures

  2. p. 163 Hybrid Search and Score Fusion

    The score fusion problem
    Reciprocal Rank Fusion from scratch
    Weighted score fusion
    Metadata filtering
    Measuring hybrid search improvement

  3. p. 188 Reranking

    Bi-encoder vs. cross-encoder
    Adding a reranker to the pipeline
    Tuning K and N
    Latency budgets

  4. p. 211 Query Transformation

    The query-document asymmetry
    Multi-query generation
    Sub-question decomposition
    HyDE
    Query routing
    Measuring the impact

  5. p. 238 Evaluating RAG Systems

    Recall@K, Precision@K, MRR, nDCG
    Faithfulness and Answer Relevancy via RAGAS
    Building evaluation datasets
    Ablation testing

Part III — Production and Beyond

  1. p. 271 Hardening the Pipeline for Production

    Observability and tracing
    Semantic caching
    Citation and provenance
    Embedding staleness
    Cost optimization
    Load testing
    The production readiness report

  2. p. 304 Advanced Retrieval Patterns

    The “more complex isn't always better” principle
    Knowledge graphs and GraphRAG
    Corrective RAG
    Self-RAG
    Measuring whether complexity pays off

  3. p. 332 Agentic RAG

    From pipelines to agents
    The 90% failure rate
    Function-calling RAG
    Multi-step retrieval with planning
    Query routing across multiple knowledge bases
    Building guardrails
    Stress-testing agentic RAG

02 Running example

One corpus. 14 chapters. Real code you can run.

Every chapter operates on a single running example — the Acme Corp knowledge base, a fictional 500-employee SaaS company with 110 internal documents across HR, IT, operations, compliance, product, and engineering. The corpus is engineered to surface every failure mode the book teaches.

03 Pre-order

Become an early reader.

Enter your email address and I’ll send you the first chapter from the book for free.

“We are currently living in a remarkable time. Artificial intelligence is advancing at an unprecedented rate. With access to the most advanced AI models, we are now able to develop software features that were previously difficult or even impossible to create. The future of AI is not just about algorithms and data. It's about the people who harness these models to solve real-world problems.”

04 Author
Jeroen Herczeg


Hey there, I’m the author.

I have worked in software engineering for over two decades, specializing in building and maintaining efficient, reliable, and scalable systems. In 2015, I discovered a passion for artificial intelligence and have been exploring the field and how to apply it practically ever since. I also speak regularly at meetups, where I share what I learn with others.