Building RAG Pipelines
From chunking strategies and embedding models to vector databases, hybrid search, re-ranking, and evaluation — the complete RAG engineering guide.
RAG Engineering
Building a production RAG system goes far beyond “put documents in a vector store.” Each component requires careful engineering decisions.
Document Processing
Loading
Handle diverse formats: PDF (use unstructured.io or PyMuPDF), HTML (BeautifulSoup), DOCX (python-docx), spreadsheets, emails. Preserve structure — headings, tables, and lists carry semantic meaning.
Chunking Strategies
| Strategy | How It Works | Best For |
|---|---|---|
| Fixed-size | Split every N tokens with M overlap | Simple documents |
| Recursive | Split by headings → paragraphs → sentences | Structured content |
| Semantic | Group by embedding similarity | Mixed-topic documents |
| Parent-child | Small chunks for retrieval, return parent chunk for context | Precision + context |
Recommended starting point: Recursive splitting at 400 tokens with 50-token overlap.
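The recursive strategy can be sketched in a few lines. This is a simplified, word-based version (a production splitter would count real tokens with a tokenizer such as tiktoken); the function name and separator list are illustrative choices, not a standard API:

```python
def recursive_split(text, max_words=400, overlap=50,
                    separators=("\n\n", "\n", ". ")):
    """Split recursively: try coarse separators first (paragraphs,
    lines, sentences), fall back to fixed-size word windows with
    overlap. Words stand in for tokens in this sketch."""
    words = text.split()
    if len(words) <= max_words:
        return [text]
    for sep in separators:
        parts = [p for p in text.split(sep) if p.strip()]
        if len(parts) > 1:
            chunks = []
            for part in parts:
                chunks.extend(recursive_split(part, max_words, overlap, separators))
            return chunks
    # No separator produced a split: slide a fixed-size window with overlap.
    chunks = []
    step = max_words - overlap
    for i in range(0, len(words), step):
        chunks.append(" ".join(words[i:i + max_words]))
        if i + max_words >= len(words):
            break
    return chunks
```

Each chunk stays under the size limit, and consecutive windowed chunks share `overlap` words so no sentence is cut off without context.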
Metadata Enrichment
Attach metadata to each chunk: source document, page number, section heading, date, author. This enables filtered retrieval and citation.
Embeddings
Convert text chunks to vectors that capture semantic meaning.
| Model | Dimensions | Quality | Speed / Hosting |
|---|---|---|---|
| OpenAI text-embedding-3-large | 3072 | Excellent | Fast (API) |
| Cohere embed-v4 | 1024 | Excellent | Fast (API) |
| bge-large-en-v1.5 | 1024 | Very good | Self-hosted |
| all-MiniLM-L6-v2 | 384 | Good | Very fast |
Choose based on your constraints: API models are convenient but add per-request cost and send your data to a third party; open-source models run locally at the cost of provisioning and maintaining inference hardware.
Vector Databases
| Database | Type | Best For |
|---|---|---|
| Pinecone | Managed cloud | Production, no ops overhead |
| Qdrant | Self-hosted or cloud | Flexible, great filtering |
| ChromaDB | Embedded | Prototyping, small datasets |
| pgvector | PostgreSQL extension | Already using Postgres |
| Weaviate | Self-hosted or cloud | Multi-modal search |
Retrieval Pipeline
Basic: Vector Search
Query embedding → find top-K nearest chunks → pass to LLM.
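The "find top-K nearest chunks" step reduces to a similarity search. A brute-force sketch using cosine similarity is shown below; real vector databases replace this linear scan with approximate nearest-neighbor indexes (e.g. HNSW):

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb)

def top_k(query_vec, chunk_vecs, k=3):
    """Return indices of the k chunks most similar to the query.
    Brute force: O(N) per query — fine for a sketch, not for scale."""
    scored = [(cosine(query_vec, v), i) for i, v in enumerate(chunk_vecs)]
    scored.sort(reverse=True)
    return [i for _, i in scored[:k]]
```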
Better: Hybrid Search
Combine vector search with BM25 keyword search using Reciprocal Rank Fusion (RRF):
$$RRF(d) = \sum_{r \in R} \frac{1}{k + r(d)}$$
Where $R$ is the set of result lists, $r(d)$ is the rank of document $d$ in result list $r$, and $k$ is a smoothing constant, typically 60.
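The formula translates directly to code. A minimal sketch, assuming each result list is already ordered best-first and documents are identified by id:

```python
def rrf(result_lists, k=60):
    """Reciprocal Rank Fusion: each document scores
    sum(1 / (k + rank)) over the result lists it appears in.
    Ranks are 1-based; k=60 is the commonly used default."""
    scores = {}
    for results in result_lists:
        for rank, doc_id in enumerate(results, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)

vector_hits = ["d3", "d1", "d7"]   # from embedding search
bm25_hits   = ["d1", "d9", "d3"]   # from keyword search
fused = rrf([vector_hits, bm25_hits])
```

Note that `d1` wins even though it topped neither list: appearing near the top of both beats a single first place, which is exactly the behavior that makes RRF a robust fusion method.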
Best: Hybrid Search + Re-ranking
After hybrid retrieval, use a cross-encoder re-ranker (e.g., Cohere Rerank, bge-reranker-v2) to re-score the top 20-50 results and keep the top 5-10.
This dramatically improves precision at a small latency cost.
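Structurally, re-ranking is a thin layer over the retrieval output. In this sketch the `score` callable is a placeholder for a real cross-encoder (Cohere Rerank, bge-reranker-v2, etc.); the toy word-overlap scorer exists only to make the example runnable:

```python
def rerank(query, candidates, score, keep=5):
    """Re-score a retrieved candidate pool and keep the best few.
    `score(query, text)` stands in for a learned cross-encoder."""
    scored = sorted(candidates, key=lambda c: score(query, c), reverse=True)
    return scored[:keep]

def overlap_score(query, text):
    """Toy scorer: fraction of query words present in the text."""
    q, t = set(query.lower().split()), set(text.lower().split())
    return len(q & t) / max(len(q), 1)

top = rerank(
    "refund policy",
    ["shipping times", "refund policy details", "refund window"],
    overlap_score,
    keep=2,
)
```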
Generation
Structure your generation prompt carefully:
    Answer the question based ONLY on the provided context.
    If the context doesn't contain enough information, say so.

    Context:
    {retrieved_chunks}

    Question: {user_query}
Key techniques:
- Instruct faithfulness — tell the model to only use provided context
- Include citations — ask the model to reference which chunks it used
- Handle uncertainty — instruct it to say “I don’t know” rather than hallucinate
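The techniques above can be combined in a small prompt builder. A sketch, assuming chunks arrive as plain strings; numbering them lets the model cite sources as [1], [2], and so on:

```python
def build_prompt(chunks, question):
    """Assemble a grounded generation prompt with numbered chunks
    for citation and an explicit fallback instruction."""
    context = "\n\n".join(f"[{i}] {c}" for i, c in enumerate(chunks, 1))
    return (
        "Answer the question based ONLY on the provided context.\n"
        "Cite the chunk numbers you used, e.g. [1].\n"
        "If the context doesn't contain enough information, "
        "say \"I don't know.\"\n\n"
        f"Context:\n{context}\n\n"
        f"Question: {question}"
    )
```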
Evaluation with RAGAS
Evaluate your RAG pipeline on four dimensions:
- Faithfulness — Is the answer supported by the retrieved context?
- Context Relevancy — Are the retrieved chunks actually relevant?
- Answer Relevancy — Does the answer address the question?
- Answer Correctness — Is the answer factually correct? (requires ground truth)
Build an evaluation dataset of 50-100 question/answer pairs from your documents and run regular evaluations as you tune the pipeline.
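Alongside full RAGAS runs, a cheap retrieval-only metric is useful for fast iteration. A sketch of a hit-rate harness, assuming each evaluation item records the id of the gold chunk that answers it and `retrieve(question, k)` returns chunk ids (both are assumptions about your pipeline, not a RAGAS API):

```python
def retrieval_hit_rate(eval_set, retrieve, k=5):
    """Fraction of questions whose gold chunk appears in the
    top-k retrieved results — a fast retrieval-quality proxy
    to run between full RAGAS evaluations."""
    hits = sum(
        1 for item in eval_set
        if item["gold_chunk_id"] in retrieve(item["question"], k)
    )
    return hits / len(eval_set)
```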
Common Failure Modes
- Wrong chunks retrieved → improve chunking or add re-ranking
- Answer not faithful to context → strengthen the generation prompt
- Missing information → check if documents are properly indexed
- Contradictory chunks → add metadata filtering and recency weighting
Key takeaway: RAG engineering is iterative. Start simple (basic vector search), measure with RAGAS, then add complexity (hybrid search, re-ranking, metadata filtering) where evaluation shows gaps.