Now available
A practical engineering book for building retrieval-augmented generation systems with your own data, from the first prototype to production patterns like evaluation, observability, and agentic RAG.
Most teams trying to ship a RAG system stall at the prototype stage. The notebook works, the demo wins the meeting, but the system never reaches users at scale. The gap between “this works on my laptop” and “this runs reliably in production” is wide and full of engineering challenges that don't make the demo reel. This book is about that gap.
This book is written for engineers, not researchers writing benchmarks or managers picking vendors. It is for the person at the keyboard who needs to make decisions about chunking strategy, vector store choice, evaluation methodology, and production operations, and who is tired of vendor-shaped blog posts and tutorials that stop at the prototype.
Each chapter pairs concept with implementation. Real code on a real corpus, runnable end to end. The seven failure points of a RAG pipeline are introduced in chapter 1 and traced through every subsequent chapter, so you learn to recognize where things break, not just patch them when they do.
Every chapter builds on a single running example, the Acme Corp knowledge base, with real code you can clone and run.
What an LLM can and cannot do
Limitations of a standalone LLM
The RAG mental model
The RAG pipeline end-to-end
RAG vs. fine-tuning vs. long-context prompting
The seven failure points
Standalone LLM vs. RAG demo
From words to vectors
The bi-encoder architecture
Generating embeddings locally and via API
Cosine similarity and distance metrics
Visualizing embedding space with UMAP
Choosing an embedding model
Similarity search from scratch
The chunk size tradeoff
Fixed-size chunking
Recursive character splitting
Semantic chunking
Document-structure-aware chunking
Contextual chunking
Comparing strategies: a retrieval test
Exact vs. approximate nearest neighbor
The speed-accuracy-memory tradeoff
How HNSW works
FAISS, pgvector, and Qdrant
Tuning index parameters
The comparison benchmark
The ingestion flow
Parsing real-world documents
Text cleaning and normalization
The full pipeline: parse, clean, chunk, embed, store
Metadata extraction and storage
Idempotent re-ingestion
Keyword retrieval and the BM25 mental model
Adding a search vector to the chunks table
Where each retriever fails the other's queries
Hybrid retrieval as candidate generation
Filters as candidate-set scoping
Selecting context from the candidate pool
Building the prompt
The complete pipeline
Where the pipeline succeeds and fails
The failure catalog
Why first-stage retrieval optimizes for recall
Bi-encoder versus cross-encoder
Adding a local reranker (bge-reranker-v2-m3)
Choosing K and N
Latency and the cost of cross-encoders
When reranking is not worth it
Where query transformation belongs
Query rewriting
HyDE: search with a hypothetical answer
Multi-query expansion
Decomposition
When transformation hurts
Two evaluation surfaces
Building an evaluation set
Retrieval metrics
Generation metrics
The ablation table
Regression tracking
Stage-level observability
Tracing across stages
Failure modes and graceful degradation
Configuration and secrets
Model versioning and the silent-rebuild trap
Security boundaries in RAG systems
Deploying changes safely
Parent-document retrieval
Contextual retrieval
Graph-based retrieval
ColBERT and late interaction
The complexity test
Retrieval as a tool call
Multi-step reasoning loops
Bounding agentic loops
Observability for agents
When agentic RAG is worth it
Every chapter operates on a single running example: the Acme Corp knowledge base, a fictional 500-employee SaaS company with 110 internal documents across HR, IT, operations, compliance, product, and engineering. The corpus is engineered to surface every failure mode the book teaches.
Learn how to build retrieval systems you can test, debug, measure, and improve.