rag pipelines: from prototype to production

the prototype always works. you embed a few docs, run a similarity search, pass the results to gpt-4, and it gives back a great answer. then you take it to production with thousands of documents and real users and the whole thing falls apart.

here’s what i learned rebuilding the rag pipeline at spritle from a demo into something that could actually run.

chunking: the part everyone gets wrong first

naive chunking — split every 500 tokens, done — works fine in demos. in production it breaks in two ways:

semantic fragmentation: a paragraph that answers a question gets split across two chunks. retrieval returns the tail of one and the head of the next. neither is useful on its own.

size mismatch: a 500-token chunk retrieved alongside a 50-token chunk confuses the reranker. normalise chunk sizes within a reasonable range (300–600 tokens is my working default).
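
one way to normalise sizes is to merge undersized neighbours after splitting. this is a minimal sketch, not the pipeline's actual code: merge_small_chunks is a hypothetical helper, and it assumes a whitespace word count is a good enough stand-in for a real tokenizer.

```python
def merge_small_chunks(chunks, min_tokens=300, max_tokens=600, count_tokens=None):
    """greedily merge adjacent chunks up to min_tokens, never merging past max_tokens."""
    # assumption: whitespace word count approximates the token count
    count = count_tokens or (lambda text: len(text.split()))
    merged = []
    buffer = ""
    for chunk in chunks:
        candidate = f"{buffer}\n{chunk}" if buffer else chunk
        if buffer and count(candidate) > max_tokens:
            # adding this chunk would overshoot, so flush what we have
            merged.append(buffer)
            buffer = chunk
        else:
            buffer = candidate
        if count(buffer) >= min_tokens:
            merged.append(buffer)
            buffer = ""
    if buffer:
        merged.append(buffer)  # the trailing chunk may still be undersized
    return merged
```

a single chunk that is already over max_tokens passes through unchanged, so this only fixes the small end of the range.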

what actually works: recursive character splitting with overlap, or sentence-boundary splitting for documents with clear paragraph structure. for code, split by function/class, not by character count.

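# newer langchain releases expose this as langchain_text_splitters
from langchain.text_splitter import RecursiveCharacterTextSplitter

# chunk_size and chunk_overlap count characters by default, not tokens;
# build the splitter with RecursiveCharacterTextSplitter.from_tiktoken_encoder(...)
# to measure in tokens instead
splitter = RecursiveCharacterTextSplitter(
    chunk_size=512,
    chunk_overlap=64,
    separators=["\n\n", "\n", ". ", " ", ""],  # tried in order, coarsest first
)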
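
for the code case, splitting by function/class could look roughly like this for python sources. it is a sketch using the standard library's ast module; split_python_source is a hypothetical name, and other languages would need their own parser.

```python
import ast

def split_python_source(source: str) -> list[str]:
    """one chunk per top-level function or class, plus one chunk for the remaining module-level code."""
    tree = ast.parse(source)
    lines = source.splitlines()
    chunks, leftover = [], []
    for node in tree.body:
        # decorators sit above node.lineno, so start from the first decorator if there is one
        start = min([node.lineno] + [d.lineno for d in getattr(node, "decorator_list", [])])
        text = "\n".join(lines[start - 1 : node.end_lineno])
        if isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef, ast.ClassDef)):
            chunks.append(text)
        else:
            leftover.append(text)
    if leftover:
        # imports, constants and other top-level statements go into a single leading chunk
        chunks.insert(0, "\n".join(leftover))
    return chunks
```

comments and blank lines between top-level definitions are dropped here, which is usually acceptable for retrieval but worth knowing about.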

embedding model selection

text-embedding-ada-002 is the default everyone reaches for. it’s fine but not optimal for domain-specific retrieval. for technical documentation i got measurably better recall with BAAI/bge-base-en-v1.5 — smaller, faster, and trained on more diverse retrieval tasks.

run your own benchmark on a sample of your documents before committing to a model. what matters is recall@k on your actual queries, not the mteb leaderboard ranking.
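
a benchmark like that can be very small. the sketch below assumes you have a hand-labelled set of (query, relevant chunk ids) pairs and a search function that returns ranked chunk ids; both names are placeholders for whatever your stack provides.

```python
def recall_at_k(retrieved_ids, relevant_ids, k):
    """fraction of the relevant chunks that appear in the top-k retrieved results."""
    if not relevant_ids:
        return 0.0
    top_k = set(retrieved_ids[:k])
    return len(top_k & set(relevant_ids)) / len(relevant_ids)

def mean_recall_at_k(search, labelled_queries, k=5):
    """average recall@k over (query, relevant_ids) pairs; `search` returns ranked chunk ids."""
    scores = [recall_at_k(search(query), relevant, k) for query, relevant in labelled_queries]
    return sum(scores) / len(scores) if scores else 0.0
```

run mean_recall_at_k once per candidate embedding model over the same labelled queries and compare the numbers.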

retrieval: beyond naive similarity search

pure vector similarity search has a recall ceiling. a query phrased differently from the source document can miss it even when the semantic match is strong. i use a two-stage retrieval approach: two retrievers collect candidates, then a reranker picks the survivors.

vector search: top-20 candidates from chromadb using cosine similarity
bm25 search: top-20 candidates via keyword overlap (handles exact match cases)
reranking: merge both lists (up to 40 candidates, fewer once duplicates are dropped), score them with a cross-encoder (cross-encoder/ms-marco-MiniLM-L-6-v2), and take the top 5

the reranker is the biggest quality improvement i’ve made to any rag system. it’s cheap to run (small model, short sequences) and meaningfully changes which results survive.
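
the merge-then-rerank step can be sketched without tying it to a specific library. everything here is illustrative: vector_search, keyword_search and score_pairs are hypothetical callables you would back with chromadb, a bm25 index and the cross-encoder.

```python
def hybrid_retrieve(query, vector_search, keyword_search, score_pairs, n_candidates=20, top_k=5):
    """merge candidates from two retrievers, drop duplicates, rerank, return the top_k (id, text) pairs."""
    candidates = {}
    for search in (vector_search, keyword_search):
        for chunk_id, text in search(query, n_candidates):
            # the same chunk can come back from both retrievers; keep it once
            candidates.setdefault(chunk_id, text)
    if not candidates:
        return []
    ids = list(candidates)
    # score_pairs takes [(query, text), ...] and returns one relevance score per pair
    scores = score_pairs([(query, candidates[chunk_id]) for chunk_id in ids])
    ranked = sorted(zip(ids, scores), key=lambda pair: pair[1], reverse=True)
    return [(chunk_id, candidates[chunk_id]) for chunk_id, _ in ranked[:top_k]]
```

with sentence-transformers, the predict method of a CrossEncoder loaded from cross-encoder/ms-marco-MiniLM-L-6-v2 has the shape score_pairs expects: it takes a list of (query, passage) pairs and returns one score per pair.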

async embedding in production

embedding at query time is fine: it is one short text per request. embedding at ingestion time is the bottleneck. for a corpus of 50,000 document chunks, synchronous embedding can take hours.

the fix: a worker queue. document uploads go into a queue, an async worker picks them up, embeds in batches of 64, and writes to the vector store. fastapi’s BackgroundTasks works for light loads, but its tasks run inside the web process and are lost if that process restarts; for anything heavier use celery or an actual message queue.

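from fastapi import BackgroundTasks, FastAPI, UploadFile

app = FastAPI()

# store_raw and embed_and_index are the pipeline's own helpers, defined elsewhere
@app.post("/documents")
async def upload_document(file: UploadFile, background_tasks: BackgroundTasks):
    content = await file.read()
    doc_id = await store_raw(content)
    # runs after the response is sent, so the upload returns immediately
    background_tasks.add_task(embed_and_index, doc_id, content)
    return {"id": doc_id, "status": "processing"}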
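
the batching loop inside a worker like embed_and_index might look something like the sketch below. embed_chunks, embed_batch and write_batch are hypothetical names: embed_batch stands for a blocking call into your embedding model, write_batch for an async write to the vector store.

```python
import asyncio

def batched(items, size=64):
    """yield consecutive slices of at most `size` items."""
    for start in range(0, len(items), size):
        yield items[start : start + size]

async def embed_chunks(chunks, embed_batch, write_batch, batch_size=64):
    """embed chunks batch by batch, write each batch to the store, return the number written."""
    written = 0
    for batch in batched(chunks, batch_size):
        # run the blocking embedding call in a thread so the event loop stays responsive
        vectors = await asyncio.to_thread(embed_batch, batch)
        await write_batch(batch, vectors)
        written += len(batch)
    return written
```

writing after every batch rather than once at the end means a crash halfway through loses at most one batch of work.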

what i’d change if starting over

add query expansion from the start. rewrite the user query into 3–4 variations before retrieval. the recall improvement is worth the extra llm call.
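
a rough sketch of what that could look like, with the llm call and the retriever passed in as plain callables. rewrite and retrieve are hypothetical: rewrite(query, n) would ask the llm for n rephrasings, retrieve(query) would return (chunk_id, text) pairs.

```python
def retrieve_with_expansion(query, rewrite, retrieve, n_variations=3):
    """retrieve for the original query plus its rewrites, deduplicating by chunk id."""
    queries = [query] + list(rewrite(query, n_variations))
    seen, merged = set(), []
    for q in queries:
        for chunk_id, text in retrieve(q):
            if chunk_id not in seen:
                # first-seen order is kept, so results for the original query come first
                seen.add(chunk_id)
                merged.append((chunk_id, text))
    return merged
```

the merged list can then go through the same reranking step as any other candidate set.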

instrument retrieval quality. log every query, the chunks retrieved, and whether the user found the answer useful. without this data you’re flying blind on whether your chunking strategy is working.
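
the logging itself does not need to be elaborate. a minimal sketch, assuming one json line per event written to any file-like stream; the field names are made up for illustration.

```python
import json
import time
import uuid

def log_retrieval(stream, query, chunk_ids, scores):
    """append one json line describing a retrieval and return its id."""
    event_id = str(uuid.uuid4())
    record = {
        "id": event_id,
        "ts": time.time(),
        "type": "retrieval",
        "query": query,
        "chunk_ids": list(chunk_ids),
        "scores": [float(s) for s in scores],
    }
    stream.write(json.dumps(record) + "\n")
    return event_id

def log_feedback(stream, event_id, useful):
    """append the user's verdict for an earlier retrieval, linked by its id."""
    record = {"id": event_id, "ts": time.time(), "type": "feedback", "useful": bool(useful)}
    stream.write(json.dumps(record) + "\n")
```

joining the two record types on id gives you, per query, which chunks were shown and whether they helped.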

version your embeddings. when you swap embedding models (and you will), you need to re-embed everything, because vectors from different models are not comparable with each other. if you’re not tracking which model embedded which chunks you’ll have silent recall degradation.
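
one lightweight way to do this is to store the model name next to every vector. the sketch below is an assumption about how that could be laid out, not a description of the production schema; the helper names and metadata keys are hypothetical.

```python
EMBEDDING_MODEL = "BAAI/bge-base-en-v1.5"

def chunk_metadata(doc_id, chunk_index, model=EMBEDDING_MODEL):
    """metadata to store next to every vector so its origin is traceable."""
    return {"doc_id": doc_id, "chunk_index": chunk_index, "embedding_model": model}

def stale_chunk_ids(records, current_model=EMBEDDING_MODEL):
    """ids of chunks embedded with anything other than the current model."""
    return [
        chunk_id
        for chunk_id, meta in records
        if meta.get("embedding_model") != current_model
    ]
```

after a model swap, stale_chunk_ids gives you the re-embedding backlog, and an empty result tells you the migration is done.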

the pipeline described here powers the rag layer in smartle. the chromadb + fastapi stack has handled up to ~70k chunks in testing without significant performance issues.