
Why We Run Our Own RAG

Off-the-shelf retrieval-augmented generation (RAG) pipelines optimise for demos. Production is different.

The Demo vs Production Gap

Most RAG implementations you see online are optimised for demos. They chunk documents naively, embed everything into a single vector store, and retrieve the top-k results by cosine similarity. This works well when your documents are clean PDFs about a single topic. It falls apart in production.
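To make the contrast concrete, here is roughly what that demo pattern looks like. This is a minimal sketch, not code from our pipeline: the `embed` stub stands in for any sentence-embedding model (it returns placeholder random unit vectors), and the fixed-width chunker ignores sentence and section boundaries by design.

```python
# Minimal sketch of the "demo" RAG pattern: fixed-size chunks, one vector
# store, top-k cosine similarity. All names here are illustrative.
import numpy as np

def embed(texts: list[str]) -> np.ndarray:
    """Placeholder for a sentence-embedding model; returns unit vectors."""
    rng = np.random.default_rng(0)  # random embeddings, for illustration only
    vecs = rng.normal(size=(len(texts), 384))
    return vecs / np.linalg.norm(vecs, axis=1, keepdims=True)

def naive_chunk(doc: str, size: int = 500) -> list[str]:
    # Fixed-width character windows: ignores sentences, headings, pages.
    return [doc[i:i + size] for i in range(0, len(doc), size)]

def naive_retrieve(query: str, chunks: list[str], k: int = 5) -> list[str]:
    chunk_vecs = embed(chunks)
    q = embed([query])[0]
    scores = chunk_vecs @ q                  # cosine similarity (unit vectors)
    top = np.argsort(scores)[::-1][:k]       # top-k by similarity
    return [chunks[i] for i in top]
```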

At TheTechFinn we run RAG for three client products: a legal document assistant, a point-of-sale (POS) troubleshooting bot, and an internal knowledge base. All three needed custom retrieval logic that off-the-shelf solutions could not provide.

The Two-Stage Architecture

Our pipeline uses a two-stage approach: a fast approximate nearest-neighbour (ANN) search narrows candidates to 50–100 chunks, then a cross-encoder re-ranker scores those candidates and returns the top 5. This costs roughly 40 ms more than single-stage retrieval but cuts hallucination rate by ~60% on our eval sets.
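The sketch below shows the pattern, not our exact production stack: FAISS stands in for the ANN index, a sentence-transformers `CrossEncoder` for the re-ranker, and the model name in the usage comment is illustrative.

```python
# Two-stage retrieval: ANN candidate generation, then cross-encoder re-ranking.
# Library and model choices here are illustrative assumptions.
import faiss
import numpy as np
from sentence_transformers import CrossEncoder

def two_stage_retrieve(
    query_vec: np.ndarray,      # query embedding, shape (d,), float32 for FAISS
    query_text: str,
    index: faiss.Index,         # prebuilt ANN index over chunk embeddings
    chunks: list[str],
    reranker: CrossEncoder,
    n_candidates: int = 100,    # stage 1: cast a wide net
    top_k: int = 5,             # stage 2: what the LLM actually sees
) -> list[str]:
    # Stage 1: fast approximate search narrows the corpus to 50-100 chunks.
    _, ids = index.search(query_vec.reshape(1, -1), n_candidates)
    candidates = [chunks[i] for i in ids[0] if i != -1]

    # Stage 2: the cross-encoder scores each (query, chunk) pair jointly,
    # which is slower than cosine similarity but far more precise.
    scores = reranker.predict([(query_text, c) for c in candidates])
    ranked = sorted(zip(scores, candidates), reverse=True)
    return [c for _, c in ranked[:top_k]]

# Usage (illustrative model name):
# reranker = CrossEncoder("cross-encoder/ms-marco-MiniLM-L-6-v2")
```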

Chunking Strategy

We use sentence-level chunks with a 30-token overlap. Each chunk stores metadata: document ID, section heading, page number, and creation date. This lets the re-ranker use structural signals, not just semantic similarity.
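A minimal sketch of that chunker follows, assuming sentences are already split and using whitespace word counts as a crude stand-in for a real tokenizer; the `Chunk` dataclass and helper names are ours for illustration.

```python
# Sentence-level chunking with token overlap and structural metadata.
from dataclasses import dataclass

@dataclass
class Chunk:
    text: str
    doc_id: str
    section: str    # section heading, a structural signal for the re-ranker
    page: int
    created: str    # creation date, ISO format

def chunk_sentences(
    sentences: list[str],   # pre-split sentences from one section
    doc_id: str, section: str, page: int, created: str,
    max_tokens: int = 200, overlap: int = 30,
) -> list[Chunk]:
    chunks, window, count = [], [], 0
    for sent in sentences:
        n = len(sent.split())   # crude count; swap in a real tokenizer
        if window and count + n > max_tokens:
            chunks.append(Chunk(" ".join(window), doc_id, section, page, created))
            # Carry roughly `overlap` tokens into the next chunk.
            carry, kept = 0, []
            for s in reversed(window):
                kept.insert(0, s)
                carry += len(s.split())
                if carry >= overlap:
                    break
            window, count = kept, carry
        window.append(sent)
        count += n
    if window:
        chunks.append(Chunk(" ".join(window), doc_id, section, page, created))
    return chunks
```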

Cost Analysis

Running our own pipeline on a single GPU instance costs roughly $180/month. Azure AI Search at equivalent query volume would run $600+. The engineering investment paid for itself in under four months.

When to Use Off-the-Shelf

If you are prototyping, use LlamaIndex or LangChain. If you are serving 50+ QPS with domain-specific documents, consider owning the retrieval stack. The control is worth it.
