
Why We Run Our Own RAG

Off-the-shelf retrieval-augmented generation (RAG) pipelines optimise for demos. Production is different.

The Demo vs Production Gap

Most RAG implementations you see online are optimised for demos. They chunk documents naively, embed everything into a single vector store, and retrieve the top-k results by cosine similarity. This works well when your documents are clean PDFs about a single topic. It falls apart in production.
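To make the contrast concrete, here is roughly what that demo pattern looks like. This is a minimal sketch, not code from our pipeline: the `embed` stub stands in for any sentence-embedding model (it returns placeholder random unit vectors), and the fixed-width chunker ignores sentence and section boundaries by design.

```python
# Minimal sketch of the "demo" RAG pattern: fixed-size chunks, one vector
# store, top-k cosine similarity. All names here are illustrative.
import numpy as np

def embed(texts: list[str]) -> np.ndarray:
    """Placeholder for a sentence-embedding model; returns unit vectors."""
    rng = np.random.default_rng(0)  # random embeddings, for illustration only
    vecs = rng.normal(size=(len(texts), 384))
    return vecs / np.linalg.norm(vecs, axis=1, keepdims=True)

def naive_chunk(doc: str, size: int = 500) -> list[str]:
    # Fixed-width character windows: ignores sentences, headings, pages.
    return [doc[i:i + size] for i in range(0, len(doc), size)]

def naive_retrieve(query: str, chunks: list[str], k: int = 5) -> list[str]:
    chunk_vecs = embed(chunks)
    q = embed([query])[0]
    scores = chunk_vecs @ q                  # cosine similarity (unit vectors)
    top = np.argsort(scores)[::-1][:k]       # top-k by similarity
    return [chunks[i] for i in top]
```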

At TheTechFinn we run RAG for three client products: a legal document assistant, a point-of-sale (POS) troubleshooting bot, and an internal knowledge base. All three needed custom retrieval logic that off-the-shelf solutions could not provide.

The Two-Stage Architecture

Our pipeline uses a two-stage approach: a fast approximate nearest-neighbour (ANN) search narrows candidates to 50–100 chunks, then a cross-encoder re-ranker scores those candidates and returns the top 5. This costs roughly 40 ms more than single-stage retrieval but cuts hallucination rate by ~60% on our eval sets.
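The sketch below shows the pattern, not our exact production stack: FAISS stands in for the ANN index, a sentence-transformers `CrossEncoder` for the re-ranker, and the model name in the usage comment is illustrative.

```python
# Two-stage retrieval: ANN candidate generation, then cross-encoder re-ranking.
# Library and model choices here are illustrative assumptions.
import faiss
import numpy as np
from sentence_transformers import CrossEncoder

def two_stage_retrieve(
    query_vec: np.ndarray,      # query embedding, shape (d,), float32 for FAISS
    query_text: str,
    index: faiss.Index,         # prebuilt ANN index over chunk embeddings
    chunks: list[str],
    reranker: CrossEncoder,
    n_candidates: int = 100,    # stage 1: cast a wide net
    top_k: int = 5,             # stage 2: what the LLM actually sees
) -> list[str]:
    # Stage 1: fast approximate search narrows the corpus to 50-100 chunks.
    _, ids = index.search(query_vec.reshape(1, -1), n_candidates)
    candidates = [chunks[i] for i in ids[0] if i != -1]

    # Stage 2: the cross-encoder scores each (query, chunk) pair jointly,
    # which is slower than cosine similarity but far more precise.
    scores = reranker.predict([(query_text, c) for c in candidates])
    ranked = sorted(zip(scores, candidates), reverse=True)
    return [c for _, c in ranked[:top_k]]

# Usage (illustrative model name):
# reranker = CrossEncoder("cross-encoder/ms-marco-MiniLM-L-6-v2")
```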

Chunking Strategy

We use sentence-level chunks with a 30-token overlap. Each chunk stores metadata: document ID, section heading, page number, and creation date. This lets the re-ranker use structural signals, not just semantic similarity.
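A minimal sketch of that chunker follows, assuming sentences are already split and using whitespace word counts as a crude stand-in for a real tokenizer; the `Chunk` dataclass and helper names are ours for illustration.

```python
# Sentence-level chunking with token overlap and structural metadata.
from dataclasses import dataclass

@dataclass
class Chunk:
    text: str
    doc_id: str
    section: str    # section heading, a structural signal for the re-ranker
    page: int
    created: str    # creation date, ISO format

def chunk_sentences(
    sentences: list[str],   # pre-split sentences from one section
    doc_id: str, section: str, page: int, created: str,
    max_tokens: int = 200, overlap: int = 30,
) -> list[Chunk]:
    chunks, window, count = [], [], 0
    for sent in sentences:
        n = len(sent.split())   # crude count; swap in a real tokenizer
        if window and count + n > max_tokens:
            chunks.append(Chunk(" ".join(window), doc_id, section, page, created))
            # Carry roughly `overlap` tokens into the next chunk.
            carry, kept = 0, []
            for s in reversed(window):
                kept.insert(0, s)
                carry += len(s.split())
                if carry >= overlap:
                    break
            window, count = kept, carry
        window.append(sent)
        count += n
    if window:
        chunks.append(Chunk(" ".join(window), doc_id, section, page, created))
    return chunks
```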

Cost Analysis

Running our own pipeline on a single GPU instance costs roughly $180/month. Azure AI Search at equivalent query volume would run $600+. The engineering investment paid for itself in under four months.

When to Use Off-the-Shelf

If you are prototyping, use LlamaIndex or LangChain. If you are serving 50+ QPS with domain-specific documents, consider owning the retrieval stack. The control is worth it.
