Retrieval-Augmented Generation: An Engineer's Guide to Building RAG Systems with Your Own Data

The book

For engineers who need to ship something real

This book is written for engineers, not researchers writing benchmarks or managers picking vendors. It is for the person at the keyboard who needs to make decisions about chunking strategy, vector store choice, evaluation methodology, and production operations, and who's tired of vendor-shaped blog posts and tutorials that stop at the prototype.

Each chapter pairs concept with implementation: real code on a real corpus, runnable end to end. The seven failure points of a RAG pipeline are introduced in chapter 1 and traced through every subsequent chapter, so you learn to recognize where things break, not just patch them when they do.

By the end you'll be able to

Choose a chunking strategy on retrieval evidence, not intuition.
Pick FAISS, pgvector, or Qdrant based on your actual constraints.
Build a RAG pipeline that handles real PDFs with OCR artifacts, encoding issues, and dirty markdown.
Evaluate retrieval quality separately from generation quality, and prove your changes help.
Add reranking, hybrid search, and query transformation when (and only when) they earn it.
Catch the seven failure points before they reach production.
Scale, monitor, and cost-optimize a RAG system that survives a deploy.

Table of contents

14 chapters. 358 pages. From foundations to production to agentic RAG.

Every chapter builds on a single running example, the Acme Corp knowledge base, with real code you can clone and run.

Part I. Foundations

p. 2 The Problem RAG Solves
What LLMs actually are (and aren't)
The three failure modes that matter
The RAG pipeline end-to-end
RAG vs. fine-tuning vs. long context
The seven failure points
Bare LLM vs. RAG demo
p. 18 Embeddings from First Principles
From words to vectors
Bi-encoder architecture
Sentence-transformers
Cosine similarity
Visualizing embedding space with UMAP
Similarity search from scratch
p. 36 Chunking Strategies
The chunk size tradeoff
Fixed-size, recursive, semantic, document-structure-aware, and contextual chunking
Comparing strategies with a retrieval test
p. 58 Vector Storage and Indexing
Exact vs. approximate nearest neighbor
How HNSW works
FAISS, pgvector, and Qdrant
Tuning index parameters
The comparison benchmark
p. 85 Building the Ingestion Pipeline
Parsing real-world documents
Text cleaning and normalization
The full pipeline: parse, clean, chunk, embed, store
Metadata extraction
Idempotent re-ingestion
p. 115 Retrieval: From Keywords to Semantics
Sparse vs. dense retrieval
BM25
Semantic search
Keywords vs. semantics side-by-side
Building a retrieval evaluation harness

Part II. Building and Improving

p. 139 Your First RAG Pipeline
Retrieve, augment, generate
Hallucination in RAG
Prompt engineering for grounded answers
Context window management
Cataloging your RAG failures
p. 163 Hybrid Search and Score Fusion
The score fusion problem
Reciprocal Rank Fusion from scratch
Weighted score fusion
Metadata filtering
Measuring hybrid search improvement
p. 188 Reranking
Bi-encoder vs. cross-encoder
Adding a reranker to the pipeline
Tuning K and N
Latency budgets
p. 211 Query Transformation
The query-document asymmetry
Multi-query generation
Sub-question decomposition
HyDE
Query routing
Measuring the impact
p. 238 Evaluating RAG Systems
Recall@K, Precision@K, MRR, nDCG
Faithfulness and Answer Relevancy via RAGAS
Building evaluation datasets
Ablation testing

Part III. Production and Beyond

p. 271 Hardening the Pipeline for Production
Observability and tracing
Semantic caching
Citation and provenance
Embedding staleness
Cost optimization
Load testing
The production readiness report
p. 304 Advanced Retrieval Patterns
The “more complex isn't always better” principle
Knowledge graphs and GraphRAG
Corrective RAG
Self-RAG
Measuring whether complexity pays off
p. 332 Agentic RAG
From pipelines to agents
The 90% failure rate
Function-calling RAG
Multi-step retrieval with planning
Query routing across multiple knowledge bases
Building guardrails
Stress-testing agentic RAG

About the author

Jeroen Herczeg

Jeroen Herczeg is a senior software engineer who builds AI systems for production. He brings twenty years of engineering experience to AI agent orchestration and retrieval-augmented generation.

Most recently he built the AI agents demo for Google and BBC, which won the Broadcast Tech Innovation Award. He started studying AI seriously in 2017 with Udacity's Artificial Intelligence Nanodegree, well before the current wave of large language models.

He writes about practical AI engineering at herczeg.be/blog and lives in Belgium. This book exists because most of what he learned shipping production AI is locked in private codebases, and someone should write it down.
