Building RAG Pipelines
From chunking strategies and embedding models to vector databases, hybrid search, re-ranking, and evaluation — the complete RAG engineering guide.
RAG Engineering
Building a production RAG system goes far beyond “put documents in a vector store.” Each component requires careful engineering decisions.
Document Processing
Loading
Handle diverse formats: PDF (use unstructured.io or PyMuPDF), HTML (BeautifulSoup), DOCX (python-docx), spreadsheets, emails. Preserve structure — headings, tables, and lists carry semantic meaning.
Chunking Strategies
| Strategy | How It Works | Best For |
|---|---|---|
| Fixed-size | Split every N tokens with M overlap | Simple documents |
| Recursive | Split by headings → paragraphs → sentences | Structured content |
| Semantic | Group by embedding similarity | Mixed-topic documents |
| Parent-child | Small chunks for retrieval, return parent chunk for context | Precision + context |
Recommended starting point: Recursive splitting at 400 tokens with 50-token overlap.
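The recursive strategy can be sketched in a few lines. This is a simplified, word-based version (a production splitter would count real tokens with a tokenizer such as tiktoken); the function name and separator list are illustrative choices, not a standard API:

```python
def recursive_split(text, max_words=400, overlap=50,
                    separators=("\n\n", "\n", ". ")):
    """Split recursively: try coarse separators first (paragraphs,
    lines, sentences), fall back to fixed-size word windows with
    overlap. Words stand in for tokens in this sketch."""
    words = text.split()
    if len(words) <= max_words:
        return [text]
    for sep in separators:
        parts = [p for p in text.split(sep) if p.strip()]
        if len(parts) > 1:
            chunks = []
            for part in parts:
                chunks.extend(recursive_split(part, max_words, overlap, separators))
            return chunks
    # No separator produced a split: slide a fixed-size window with overlap.
    chunks = []
    step = max_words - overlap
    for i in range(0, len(words), step):
        chunks.append(" ".join(words[i:i + max_words]))
        if i + max_words >= len(words):
            break
    return chunks
```

Each chunk stays under the size limit, and consecutive windowed chunks share `overlap` words so no sentence is cut off without context.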
Metadata Enrichment
Attach metadata to each chunk: source document, page number, section heading, date, author. This enables filtered retrieval and citation.
Embeddings
Convert text chunks to vectors that capture semantic meaning.
| Model | Dimensions | Quality | Speed / Hosting |
|---|---|---|---|
| OpenAI text-embedding-3-large | 3072 | Excellent | Fast (API) |
| Cohere embed-v4 | 1024 | Excellent | Fast (API) |
| bge-large-en-v1.5 | 1024 | Very good | Self-hosted |
| all-MiniLM-L6-v2 | 384 | Good | Very fast |
Choose based on your constraints: API models are convenient but add per-request cost and send your data to a third party; open-source models run locally at the cost of provisioning and maintaining inference hardware.
Vector Databases
| Database | Type | Best For |
|---|---|---|
| Pinecone | Managed cloud | Production, no ops overhead |
| Qdrant | Self-hosted or cloud | Flexible, great filtering |
| ChromaDB | Embedded | Prototyping, small datasets |
| pgvector | PostgreSQL extension | Already using Postgres |
| Weaviate | Self-hosted or cloud | Multi-modal search |
Retrieval Pipeline
Basic: Vector Search
Query embedding → find top-K nearest chunks → pass to LLM.
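The "find top-K nearest chunks" step reduces to a similarity search. A brute-force sketch using cosine similarity is shown below; real vector databases replace this linear scan with approximate nearest-neighbor indexes (e.g. HNSW):

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb)

def top_k(query_vec, chunk_vecs, k=3):
    """Return indices of the k chunks most similar to the query.
    Brute force: O(N) per query — fine for a sketch, not for scale."""
    scored = [(cosine(query_vec, v), i) for i, v in enumerate(chunk_vecs)]
    scored.sort(reverse=True)
    return [i for _, i in scored[:k]]
```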
Better: Hybrid Search
Combine vector search with BM25 keyword search using Reciprocal Rank Fusion (RRF):
$$RRF(d) = \sum_{r \in R} \frac{1}{k + r(d)}$$
Where $R$ is the set of result lists, $r(d)$ is the rank of document $d$ in result list $r$, and $k$ is a smoothing constant, typically 60.
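The formula translates directly to code. A minimal sketch, assuming each result list is already ordered best-first and documents are identified by id:

```python
def rrf(result_lists, k=60):
    """Reciprocal Rank Fusion: each document scores
    sum(1 / (k + rank)) over the result lists it appears in.
    Ranks are 1-based; k=60 is the commonly used default."""
    scores = {}
    for results in result_lists:
        for rank, doc_id in enumerate(results, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)

vector_hits = ["d3", "d1", "d7"]   # from embedding search
bm25_hits   = ["d1", "d9", "d3"]   # from keyword search
fused = rrf([vector_hits, bm25_hits])
```

Note that `d1` wins even though it topped neither list: appearing near the top of both beats a single first place, which is exactly the behavior that makes RRF a robust fusion method.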
Best: Hybrid Search + Re-ranking
After hybrid retrieval, use a cross-encoder re-ranker (e.g., Cohere Rerank, bge-reranker-v2) to re-score the top 20-50 results and keep the top 5-10.
This dramatically improves precision at a small latency cost.
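Structurally, re-ranking is a thin layer over the retrieval output. In this sketch the `score` callable is a placeholder for a real cross-encoder (Cohere Rerank, bge-reranker-v2, etc.); the toy word-overlap scorer exists only to make the example runnable:

```python
def rerank(query, candidates, score, keep=5):
    """Re-score a retrieved candidate pool and keep the best few.
    `score(query, text)` stands in for a learned cross-encoder."""
    scored = sorted(candidates, key=lambda c: score(query, c), reverse=True)
    return scored[:keep]

def overlap_score(query, text):
    """Toy scorer: fraction of query words present in the text."""
    q, t = set(query.lower().split()), set(text.lower().split())
    return len(q & t) / max(len(q), 1)

top = rerank(
    "refund policy",
    ["shipping times", "refund policy details", "refund window"],
    overlap_score,
    keep=2,
)
```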
Generation
Structure your generation prompt carefully:
    Answer the question based ONLY on the provided context.
    If the context doesn't contain enough information, say so.

    Context:
    {retrieved_chunks}

    Question: {user_query}
Key techniques:
- Instruct faithfulness — tell the model to only use provided context
- Include citations — ask the model to reference which chunks it used
- Handle uncertainty — instruct it to say “I don’t know” rather than hallucinate
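The techniques above can be combined in a small prompt builder. A sketch, assuming chunks arrive as plain strings; numbering them lets the model cite sources as [1], [2], and so on:

```python
def build_prompt(chunks, question):
    """Assemble a grounded generation prompt with numbered chunks
    for citation and an explicit fallback instruction."""
    context = "\n\n".join(f"[{i}] {c}" for i, c in enumerate(chunks, 1))
    return (
        "Answer the question based ONLY on the provided context.\n"
        "Cite the chunk numbers you used, e.g. [1].\n"
        "If the context doesn't contain enough information, "
        "say \"I don't know.\"\n\n"
        f"Context:\n{context}\n\n"
        f"Question: {question}"
    )
```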
Evaluation with RAGAS
Evaluate your RAG pipeline on four dimensions:
- Faithfulness — Is the answer supported by the retrieved context?
- Context Relevancy — Are the retrieved chunks actually relevant?
- Answer Relevancy — Does the answer address the question?
- Answer Correctness — Is the answer factually correct? (requires ground truth)
Build an evaluation dataset of 50-100 question/answer pairs from your documents and run regular evaluations as you tune the pipeline.
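Alongside full RAGAS runs, a cheap retrieval-only metric is useful for fast iteration. A sketch of a hit-rate harness, assuming each evaluation item records the id of the gold chunk that answers it and `retrieve(question, k)` returns chunk ids (both are assumptions about your pipeline, not a RAGAS API):

```python
def retrieval_hit_rate(eval_set, retrieve, k=5):
    """Fraction of questions whose gold chunk appears in the
    top-k retrieved results — a fast retrieval-quality proxy
    to run between full RAGAS evaluations."""
    hits = sum(
        1 for item in eval_set
        if item["gold_chunk_id"] in retrieve(item["question"], k)
    )
    return hits / len(eval_set)
```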
Common Failure Modes
- Wrong chunks retrieved → improve chunking or add re-ranking
- Answer not faithful to context → strengthen the generation prompt
- Missing information → check if documents are properly indexed
- Contradictory chunks → add metadata filtering and recency weighting
Key takeaway: RAG engineering is iterative. Start simple (basic vector search), measure with RAGAS, then add complexity (hybrid search, re-ranking, metadata filtering) where evaluation shows gaps.