Twenty years of engineering experience, now applied to AI agent orchestration and retrieval-augmented generation.
Most teams trying to ship a RAG system stall at the prototype stage. The notebook works, the demo wins the meeting, the system never reaches users at scale. The gap between “this works on my laptop” and “this runs reliably in production” is wide, and full of engineering challenges that don't make the demo reel. This book is about that gap.
This book is written for engineers, not for researchers compiling benchmarks or managers picking vendors. It's for the person at the keyboard who has to make decisions about chunking strategy, vector store choice, evaluation methodology, and production operations, and who's tired of vendor-shaped blog posts and tutorials that stop at the prototype.
Each chapter pairs concept with implementation: real code on a real corpus, runnable end to end. The seven failure points of a RAG pipeline are introduced in Chapter 1 and traced through every subsequent chapter, so you learn to recognize where things break, not just patch them after the fact.
Chapters 1 to 4 are polished and ready (the Foundations sequence). New chapters ship every 2 to 3 weeks until the full book is complete (target: end of 2026). Buying now gets you all current chapters plus every future update at no additional cost.
Every chapter builds on a single running example, the Acme Corp knowledge base, with real code you can clone and run.
What LLMs actually are (and aren't)
The three failure modes that matter
The RAG pipeline end-to-end
RAG vs. fine-tuning vs. long context
The seven failure points
Bare LLM vs. RAG demo
From words to vectors
Bi-encoder architecture
Sentence-transformers
Cosine similarity
Visualizing embedding space with UMAP
Similarity search from scratch
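A taste of the level the code works at: cosine similarity in plain NumPy. (The three-dimensional vectors here are toys for illustration; real sentence-transformers models produce hundreds of dimensions.)

```python
import numpy as np

def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    """Cosine of the angle between two vectors: dot product over the product of norms."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# Toy 3-d "embeddings"; real models produce 384-1536 dimensions.
query = np.array([1.0, 0.0, 1.0])
doc_a = np.array([0.9, 0.1, 0.8])   # points in nearly the same direction
doc_b = np.array([-1.0, 0.5, 0.0])  # points elsewhere

print(cosine_similarity(query, doc_a))  # close to 1.0
print(cosine_similarity(query, doc_b))  # much lower
```

Everything else in the pipeline — indexing, retrieval, reranking — is built on top of this one number.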
The chunk size tradeoff
Fixed-size, recursive, semantic, document-structure-aware, and contextual chunking
Comparing strategies with a retrieval test
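The simplest of the strategies above, sketched: a fixed-size chunker with overlap. (Function name and defaults are illustrative, not the book's exact listing.)

```python
def chunk_fixed(text: str, size: int = 200, overlap: int = 50) -> list[str]:
    """Split text into fixed-size character windows with overlap.

    Overlap keeps a sentence that straddles a boundary retrievable
    from at least one chunk; the cost is duplicated storage.
    """
    if overlap >= size:
        raise ValueError("overlap must be smaller than chunk size")
    chunks = []
    step = size - overlap
    for start in range(0, len(text), step):
        chunk = text[start:start + size]
        if chunk:
            chunks.append(chunk)
        if start + size >= len(text):
            break
    return chunks

doc = "a" * 500
print(len(chunk_fixed(doc)))  # 3 overlapping 200-char chunks
```

The interesting part isn't the code, it's the tradeoff: every strategy is a different answer to "what should one retrievable unit contain?"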
Exact vs. approximate nearest neighbor
How HNSW works
FAISS, pgvector, and Qdrant
Tuning index parameters
The comparison benchmark
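To make "exact vs. approximate" concrete, here is brute-force exact search in NumPy (a sketch, not the book's listing). This is the O(N)-per-query baseline that HNSW and the other approximate indexes trade a little recall to beat.

```python
import numpy as np

def exact_top_k(query: np.ndarray, index: np.ndarray, k: int = 3) -> np.ndarray:
    """Brute-force exact nearest neighbor: score every vector, sort, take k."""
    # Normalize so the dot product equals cosine similarity.
    q = query / np.linalg.norm(query)
    idx = index / np.linalg.norm(index, axis=1, keepdims=True)
    scores = idx @ q
    return np.argsort(-scores)[:k]

rng = np.random.default_rng(0)
corpus = rng.normal(size=(10_000, 64))          # 10k toy 64-d vectors
q = corpus[42] + rng.normal(scale=0.01, size=64)  # near-duplicate of row 42
print(exact_top_k(q, corpus, k=3))  # row 42 ranks first
```

At 10k vectors this runs in milliseconds; the approximate indexes earn their complexity only once the corpus gets large.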
Parsing real-world documents
Text cleaning and normalization
The full pipeline: parse, clean, chunk, embed, store
Metadata extraction
Idempotent re-ingestion
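The idea behind idempotent re-ingestion fits in a few lines: key each chunk by a hash of its own content, so re-running the pipeline writes nothing new. (The dict stands in for a real vector store's upsert; names are mine.)

```python
import hashlib

def content_id(chunk: str) -> str:
    """Deterministic ID derived from the chunk text itself."""
    return hashlib.sha256(chunk.encode("utf-8")).hexdigest()[:16]

def upsert(store: dict[str, str], chunks: list[str]) -> int:
    """Insert chunks keyed by content hash; returns how many were actually written."""
    written = 0
    for chunk in chunks:
        key = content_id(chunk)
        if key not in store:
            store[key] = chunk
            written += 1
    return written

store: dict[str, str] = {}
docs = ["PTO policy v1", "VPN setup guide"]
print(upsert(store, docs))  # 2 -- first ingestion writes everything
print(upsert(store, docs))  # 0 -- re-running is a no-op
```

Change one character in a document and only its chunks get rewritten; everything else is untouched.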
Sparse vs. dense retrieval
BM25
Semantic search
Keywords vs. semantics side-by-side
Building a retrieval evaluation harness
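BM25 looks intimidating in formula form but is short in code. A from-scratch sketch (tokenization is naive whitespace splitting here; the chapter does better):

```python
import math
from collections import Counter

def bm25_scores(query: str, docs: list[str], k1: float = 1.5, b: float = 0.75) -> list[float]:
    """Score each doc against the query with Okapi BM25.

    k1 controls term-frequency saturation; b penalizes long documents.
    """
    tokenized = [d.lower().split() for d in docs]
    N = len(tokenized)
    avgdl = sum(len(d) for d in tokenized) / N
    df = Counter()                      # document frequency per term
    for d in tokenized:
        df.update(set(d))
    scores = []
    for d in tokenized:
        tf = Counter(d)
        score = 0.0
        for term in query.lower().split():
            if term not in tf:
                continue
            idf = math.log((N - df[term] + 0.5) / (df[term] + 0.5) + 1)
            score += idf * tf[term] * (k1 + 1) / (tf[term] + k1 * (1 - b + b * len(d) / avgdl))
        scores.append(score)
    return scores

docs = [
    "how to reset your vpn password",
    "expense report submission deadline",
    "vpn troubleshooting for remote access",
]
scores = bm25_scores("vpn password reset", docs)
print(max(range(len(docs)), key=scores.__getitem__))  # doc 0 wins on keyword overlap
```

Sparse retrieval like this nails exact terms ("error 1603", "form W-4") that embeddings routinely fumble — which is exactly the side-by-side comparison the chapter runs.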
Retrieve, augment, generate
Hallucination in RAG
Prompt engineering for grounded answers
Context window management
Cataloging your RAG failures
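Grounded prompting is mostly careful string assembly: inject the retrieved chunks with source tags and tell the model to stay inside them. A sketch (the exact wording here is illustrative; the chapter iterates on it):

```python
def build_grounded_prompt(question: str, chunks: list[tuple[str, str]]) -> str:
    """Assemble a prompt that asks the model to answer only from retrieved
    context, to cite sources, and to refuse when the context doesn't cover it.

    `chunks` is a list of (source_id, text) pairs from the retriever.
    """
    context = "\n\n".join(f"[{src}]\n{text}" for src, text in chunks)
    return (
        "Answer the question using ONLY the context below. "
        "Cite the [source] tags you used. If the context does not "
        "contain the answer, say \"I don't know.\"\n\n"
        f"Context:\n{context}\n\nQuestion: {question}\nAnswer:"
    )

prompt = build_grounded_prompt(
    "How many PTO days do new hires get?",
    [("hr/pto-policy.md", "New hires accrue 15 PTO days per year.")],
)
print(prompt)
```

The explicit "I don't know" escape hatch matters: without it, the model fills retrieval gaps with plausible fiction.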
The score fusion problem
Reciprocal Rank Fusion from scratch
Weighted score fusion
Metadata filtering
Measuring hybrid search improvement
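Reciprocal Rank Fusion really is this small. It fuses ranked lists using only ranks, which sidesteps the core problem: BM25 scores and cosine similarities live on incomparable scales.

```python
def reciprocal_rank_fusion(rankings: list[list[str]], k: int = 60) -> list[str]:
    """Fuse several ranked lists: each doc earns 1/(k + rank) per list it appears in.

    k=60 is the constant from the original RRF paper; it damps the
    advantage of rank-1 results so no single list dominates.
    """
    scores: dict[str, float] = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)

bm25_ranking = ["doc_a", "doc_c", "doc_b"]
dense_ranking = ["doc_b", "doc_a", "doc_d"]
print(reciprocal_rank_fusion([bm25_ranking, dense_ranking]))
```

doc_a ranks first because it placed well in both lists — agreement between retrievers is the signal RRF rewards.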
Bi-encoder vs. cross-encoder
Adding a reranker to the pipeline
Tuning K and N
Latency budgets
The query-document asymmetry
Multi-query generation
Sub-question decomposition
HyDE
Query routing
Measuring the impact
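Query routing, minus the model: a keyword heuristic stands in here for the LLM-based classifier so the control flow is visible. (Route names and keyword lists are invented for illustration.)

```python
import re

def route_query(query: str) -> str:
    """Route a query to a knowledge base.

    A production router would ask an LLM to classify the query; this
    word-match heuristic is a stand-in that shows the shape of the logic.
    """
    tokens = set(re.findall(r"[a-z]+", query.lower()))
    routes = {
        "hr": ["pto", "benefits", "payroll", "leave"],
        "it": ["vpn", "laptop", "password", "wifi"],
        "compliance": ["gdpr", "audit", "retention"],
    }
    for kb, keywords in routes.items():
        if tokens & set(keywords):
            return kb
    return "general"  # fallback when nothing matches

print(route_query("How do I reset my VPN password?"))  # it
print(route_query("book a meeting room"))              # general
```

Swap the heuristic for a model call and the surrounding pipeline doesn't change — which is the point of isolating routing as its own step.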
Recall@K, Precision@K, MRR, nDCG
Faithfulness and Answer Relevancy via RAGAS
Building evaluation datasets
Ablation testing
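Two of the metrics above, implemented directly — each is a few lines once you have a ranked list of retrieved IDs and a set of known-relevant ones:

```python
def recall_at_k(retrieved: list[str], relevant: set[str], k: int) -> float:
    """Fraction of the relevant docs that appear in the top-k retrieved."""
    hits = sum(1 for doc in retrieved[:k] if doc in relevant)
    return hits / len(relevant)

def mrr(retrieved: list[str], relevant: set[str]) -> float:
    """Reciprocal rank of the first relevant doc (0.0 if none retrieved)."""
    for rank, doc in enumerate(retrieved, start=1):
        if doc in relevant:
            return 1.0 / rank
    return 0.0

retrieved = ["d3", "d1", "d7", "d2"]
relevant = {"d1", "d2"}
print(recall_at_k(retrieved, relevant, k=2))  # 0.5 -- only d1 in the top 2
print(mrr(retrieved, relevant))               # 0.5 -- first hit at rank 2
```

The hard part isn't the arithmetic; it's building the labeled dataset these functions consume — which is why that gets its own section.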
Observability and tracing
Semantic caching
Citation and provenance
Embedding staleness
Cost optimization
Load testing
The production readiness report
The “more complex isn't always better” principle
Knowledge graphs and GraphRAG
Corrective RAG
Self-RAG
Measuring whether complexity pays off
From pipelines to agents
The 90% failure rate
Function-calling RAG
Multi-step retrieval with planning
Query routing across multiple knowledge bases
Building guardrails
Stress-testing agentic RAG
Every chapter operates on a single running example: the Acme Corp knowledge base, a fictional 500-employee SaaS company with 110 internal documents across HR, IT, operations, compliance, product, and engineering. The corpus is engineered to surface every failure mode the book teaches.
Get the first chapter free. No spam, just the book updates.