
rag-architect

Design and implement production-grade Retrieval-Augmented Generation (RAG) systems. Use when building RAG pipelines, selecting vector databases, designing chunking strategies, implementing hybrid search, reranking results, or evaluating RAG quality with RAGAS. Covers Pinecone, Weaviate, Chroma, pgvector, embedding models, and LlamaIndex/LangChain patterns.

RAG Architecture: Production-Grade Retrieval-Augmented Generation

What Is RAG and When to Use It

RAG grounds LLM responses in your own data — documents, databases, knowledge bases — reducing hallucination and keeping answers current without fine-tuning.

Use RAG when:

  • Answers require proprietary or frequently-updated data

  • Users need source citations

  • The LLM's training cutoff is a problem

  • Fine-tuning is too expensive or slow for your update cadence


Don't use RAG when:
  • Data fits in the context window (use direct injection)

  • You need to change model behavior (use fine-tuning instead)

  • Latency is critical and you cannot absorb the ~100-500 ms overhead of the retrieval step



Architecture Layers

Query → [Pre-Processing] → [Retrieval] → [Augmentation] → [Generation] → [Post-Processing]
          Query rewrite     Vector search   Context window    LLM call      Faithfulness check
          HyDE              BM25 hybrid     Reranking         Streaming     Citation extraction
          Decomposition     Metadata filter Token limit mgmt  Grounding     Hallucination check
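
A minimal skeleton of that flow, where every helper is a hypothetical placeholder rather than a specific library call, looks roughly like this:

def answer(query: str) -> str:
    # Pre-processing: rewrite, expand, or decompose the query
    rewritten = rewrite_query(query)                 # hypothetical helper

    # Retrieval: dense + sparse search with metadata filters
    candidates = retrieve(rewritten, top_k=20)       # hypothetical helper

    # Augmentation: rerank, then pack the winners into a token budget
    context = pack_context(rerank(rewritten, candidates, top_n=5))

    # Generation: call the LLM with the grounded prompt
    draft = generate(context, query)                 # hypothetical helper

    # Post-processing: check grounding, attach citations
    return add_citations(verify_faithfulness(draft, context))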

Step 1: Document Processing Pipeline

Chunking Strategy

Chunking is the single most impactful RAG decision. Bad chunking = bad retrieval.

Fixed-size chunking (fast, baseline):

from langchain.text_splitter import RecursiveCharacterTextSplitter

splitter = RecursiveCharacterTextSplitter(
    chunk_size=1000,          # Characters, not tokens
    chunk_overlap=200,         # 20% overlap prevents context cuts
    separators=["\n\n", "\n", ". ", " ", ""],  # Tries each in order
)
chunks = splitter.create_documents([text], metadatas=[{"source": "doc.pdf"}])

Semantic chunking (better coherence, slower):

from langchain_experimental.text_splitter import SemanticChunker
from langchain_openai import OpenAIEmbeddings

splitter = SemanticChunker(
    OpenAIEmbeddings(),
    breakpoint_threshold_type="percentile",  # or "standard_deviation", "interquartile"
    breakpoint_threshold_amount=95,
)

Hierarchical / parent-child chunking (best for long documents):

# Parent: 2000 char chunks for context
# Child: 400 char chunks for precision retrieval
# Retrieve child, return parent context

from langchain.retrievers import ParentDocumentRetriever
from langchain.storage import InMemoryStore

parent_splitter = RecursiveCharacterTextSplitter(chunk_size=2000)
child_splitter = RecursiveCharacterTextSplitter(chunk_size=400)

retriever = ParentDocumentRetriever(
    vectorstore=vectorstore,
    docstore=InMemoryStore(),
    child_splitter=child_splitter,
    parent_splitter=parent_splitter,
)
retriever.add_documents(docs)  # splits into parents/children, indexes the children

Chunking rules of thumb:

  • Q&A systems: 256-512 tokens per chunk

  • Summarization: 512-1024 tokens

  • Technical docs: Use semantic chunking + 100-200 token overlap

  • Code: Chunk by function/class, not by character count (see the sketch after this list)

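For code specifically, LangChain ships a language-aware splitter that breaks on function and class boundaries; a minimal sketch (chunk sizes are illustrative):

from langchain.text_splitter import Language, RecursiveCharacterTextSplitter

# source_code: the file contents you want to index (assumed already loaded)
code_splitter = RecursiveCharacterTextSplitter.from_language(
    language=Language.PYTHON,
    chunk_size=1500,     # characters; tune per codebase
    chunk_overlap=0,     # overlap is rarely useful for code
)
code_chunks = code_splitter.create_documents([source_code])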

Metadata Enrichment

Always attach metadata — it enables powerful filtered retrieval:

{
    "source": "user_manual_v2.pdf",
    "page": 42,
    "section": "Installation",
    "doc_type": "manual",
    "created_at": "2024-01-15",
    "last_modified": "2024-03-01",
    "language": "en",
    "product": "ProductX",
    "version": "2.0",
    "chunk_index": 5,
    "total_chunks": 23,
}
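
Attaching that metadata can happen at split time, with per-chunk positional fields filled in afterwards; a small sketch using the splitter from above (field values are illustrative):

docs = splitter.create_documents(
    [text],
    metadatas=[{"source": "user_manual_v2.pdf", "doc_type": "manual"}],
)

# Add per-chunk positional fields after splitting
for i, doc in enumerate(docs):
    doc.metadata["chunk_index"] = i
    doc.metadata["total_chunks"] = len(docs)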

Step 2: Embedding Models

| Model | Provider | Dimensions | Best For |
| --- | --- | --- | --- |
| text-embedding-3-large | OpenAI | 3072 | General, English, accuracy |
| text-embedding-3-small | OpenAI | 1536 | Cost-sensitive, fast |
| embed-english-v3.0 | Cohere | 1024 | English, reranking |
| embed-multilingual-v3.0 | Cohere | 1024 | Multi-language |
| nomic-embed-text | Nomic/Ollama | 768 | Open source, local |
| bge-m3 | BAAI | 1024 | Open source, best open-source quality |

Critical: Use the same embedding model for ingestion AND retrieval. Model changes require re-embedding everything.

from openai import OpenAI

client = OpenAI()

def embed_texts(texts: list[str], batch_size: int = 100) -> list[list[float]]:
    """Embed texts in batches to respect rate limits."""
    all_embeddings = []
    for i in range(0, len(texts), batch_size):
        batch = texts[i:i + batch_size]
        response = client.embeddings.create(
            model="text-embedding-3-small",
            input=batch,
            encoding_format="float",  # or "base64" for storage efficiency
        )
        all_embeddings.extend([item.embedding for item in response.data])
    return all_embeddings

Step 3: Vector Database Selection

| Database | Type | Best For | Hosted |
| --- | --- | --- | --- |
| Pinecone | Managed | Production, large scale, teams | Yes |
| Weaviate | Open source | Hybrid search, complex filtering | Both |
| Chroma | Open source | Development, local, small scale | No |
| pgvector | PostgreSQL ext | Already using Postgres, small-medium | No |
| Qdrant | Open source | Performance, on-prem, Rust-based | Both |
| Milvus | Open source | Large scale, self-hosted | Both |

pgvector (Best for existing Postgres users)

-- Enable extension
CREATE EXTENSION IF NOT EXISTS vector;

-- Create table with embedding column
CREATE TABLE documents (
    id UUID PRIMARY KEY DEFAULT gen_random_uuid(),
    content TEXT NOT NULL,
    embedding vector(1536),  -- Matches your embedding model dimensions
    metadata JSONB,
    created_at TIMESTAMPTZ DEFAULT NOW()
);

-- Create IVFFlat index (faster queries, slight accuracy tradeoff)
CREATE INDEX ON documents USING ivfflat (embedding vector_cosine_ops)
WITH (lists = 100);  -- lists = sqrt(num_rows) is a good starting point

-- Create HNSW index (better accuracy, more memory)
CREATE INDEX ON documents USING hnsw (embedding vector_cosine_ops)
WITH (m = 16, ef_construction = 64);

-- Similarity search
SELECT id, content, metadata,
       1 - (embedding <=> $1::vector) AS similarity
FROM documents
WHERE metadata->>'doc_type' = 'manual'  -- Metadata filter
ORDER BY embedding <=> $1::vector
LIMIT 10;
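
From application code, the same query can run through psycopg 3 with the pgvector adapter; a sketch assuming the schema above and a hypothetical connection string:

import numpy as np
import psycopg
from pgvector.psycopg import register_vector

conn = psycopg.connect("postgresql://localhost/ragdb")  # hypothetical DSN
register_vector(conn)  # lets psycopg send/receive vector columns

# Reuse the embed_texts helper from Step 2 for the query embedding
query_embedding = np.array(embed_texts(["installation steps"])[0])

rows = conn.execute(
    """
    SELECT id, content, metadata, 1 - (embedding <=> %s) AS similarity
    FROM documents
    WHERE metadata->>'doc_type' = %s
    ORDER BY embedding <=> %s
    LIMIT 10
    """,
    (query_embedding, "manual", query_embedding),
).fetchall()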

Chroma (Development/Prototyping)

import os

import chromadb
from chromadb.utils import embedding_functions

client = chromadb.PersistentClient(path="./chroma_db")

# Use OpenAI embeddings
ef = embedding_functions.OpenAIEmbeddingFunction(
    api_key=os.environ["OPENAI_API_KEY"],
    model_name="text-embedding-3-small",
)

collection = client.get_or_create_collection(
    name="documents",
    embedding_function=ef,
    metadata={"hnsw:space": "cosine"},
)

# Add documents
collection.add(
    documents=["chunk text 1", "chunk text 2"],
    metadatas=[{"source": "doc1"}, {"source": "doc2"}],
    ids=["id1", "id2"],
)

# Query
results = collection.query(
    query_texts=["what is the installation process?"],
    n_results=5,
    where={"source": "user_manual_v2.pdf"},  # Metadata filter
    include=["documents", "metadatas", "distances"],
)

Pinecone (Production Scale)

import os

from pinecone import Pinecone, ServerlessSpec

pc = Pinecone(api_key=os.environ["PINECONE_API_KEY"])

# Create index
pc.create_index(
    name="documents",
    dimension=1536,
    metric="cosine",
    spec=ServerlessSpec(cloud="aws", region="us-east-1"),
)

index = pc.Index("documents")

# Upsert vectors
vectors = [
    {
        "id": chunk_id,
        "values": embedding,
        "metadata": {
            "text": chunk_text,  # Store text in metadata for retrieval
            "source": source,
            "page": page_num,
        }
    }
    for chunk_id, embedding, chunk_text, source, page_num in chunks
]

index.upsert(vectors=vectors, namespace="production")

# Query with metadata filter
results = index.query(
    vector=query_embedding,
    top_k=10,
    filter={"doc_type": {"$eq": "manual"}, "version": {"$gte": "2.0"}},
    include_metadata=True,
    namespace="production",
)

Step 4: Hybrid Search (Dense + Sparse)

Hybrid search combines semantic (dense vector) and keyword (sparse BM25) retrieval. It almost always outperforms pure vector search.

from langchain.retrievers import EnsembleRetriever
from langchain_community.retrievers import BM25Retriever
from langchain_community.vectorstores import Chroma

# Dense retriever (semantic)
vectorstore = Chroma.from_documents(docs, embedding=embeddings)
dense_retriever = vectorstore.as_retriever(search_kwargs={"k": 10})

# Sparse retriever (keyword)
bm25_retriever = BM25Retriever.from_documents(docs)
bm25_retriever.k = 10

# Combine with RRF (Reciprocal Rank Fusion)
ensemble_retriever = EnsembleRetriever(
    retrievers=[bm25_retriever, dense_retriever],
    weights=[0.4, 0.6],  # BM25 gets 40%, vector gets 60%
)

results = ensemble_retriever.invoke("installation steps")

When hybrid search wins most:

  • Technical queries with specific terms/codes

  • Queries with proper nouns, product names, version numbers

  • Mixed query styles: some users search semantically, others type exact keywords



Step 5: Reranking

After initial retrieval (top-20), rerank to get the best top-5. This dramatically improves precision.

import os

import cohere

co = cohere.Client(os.environ["COHERE_API_KEY"])

# Initial retrieval: get 20 candidates (configure k=20 on the retriever itself)
initial_results = retriever.invoke(query)

# Rerank with Cohere
reranked = co.rerank(
    query=query,
    documents=[doc.page_content for doc in initial_results],
    top_n=5,
    model="rerank-english-v3.0",
    return_documents=True,
)

# Get reranked documents
final_docs = [initial_results[r.index] for r in reranked.results]

Cross-encoder rerankers to consider:

  • Cohere rerank-english-v3.0 — Best quality, API-based

  • BAAI/bge-reranker-v2-m3 — Open source, strong multilingual

  • ms-marco-MiniLM-L-6-v2 — Lightweight, local
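
For the local options, a sentence-transformers cross-encoder is a drop-in alternative to the Cohere call; a minimal sketch:

from sentence_transformers import CrossEncoder

reranker = CrossEncoder("cross-encoder/ms-marco-MiniLM-L-6-v2")

# Score every (query, chunk) pair, then keep the top 5
pairs = [(query, doc.page_content) for doc in initial_results]
scores = reranker.predict(pairs)
ranked = sorted(zip(initial_results, scores), key=lambda x: x[1], reverse=True)
final_docs = [doc for doc, _ in ranked[:5]]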



Step 6: Query Transformation

Don't just embed the raw user query. Transform it first.

HyDE (Hypothetical Document Embeddings)

Generate a hypothetical answer and embed that instead of the query; it often lands closer to the relevant chunks:

def hyde_query(query: str, llm) -> str:
    """Generate hypothetical document to improve retrieval."""
    prompt = f"""Write a short document that would answer this question:
    
Question: {query}

Document:"""
    hypothetical_doc = llm.invoke(prompt).content
    return hypothetical_doc  # Embed this instead of the raw query
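
Downstream, the swap is a single line: search with the hypothetical document in place of the raw query (retriever and llm are whatever you already have configured; the question string is just an example):

# Retrieve using the hypothetical document instead of the user's query
docs = retriever.invoke(hyde_query("How do I reset the device to factory settings?", llm))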

Query Decomposition

Break complex queries into sub-queries, retrieve for each:

import json

def decompose_query(query: str, llm) -> list[str]:
    prompt = f"""Break this complex question into 2-4 simpler sub-questions.
Return as a JSON array of strings.

Question: {query}
Sub-questions:"""
    result = llm.invoke(prompt).content
    return json.loads(result)
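
Retrieval then runs once per sub-question, and the results are merged and de-duplicated before reranking; a sketch built on the function above:

def retrieve_decomposed(query: str, retriever, llm) -> list:
    """Retrieve for each sub-question and merge unique chunks."""
    seen, merged = set(), []
    for sub_q in decompose_query(query, llm):
        for doc in retriever.invoke(sub_q):
            key = doc.page_content
            if key not in seen:
                seen.add(key)
                merged.append(doc)
    return merged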

Contextual Compression

Use an LLM to extract only the relevant part of each retrieved chunk:

from langchain.retrievers import ContextualCompressionRetriever
from langchain.retrievers.document_compressors import LLMChainExtractor

compressor = LLMChainExtractor.from_llm(llm)
compression_retriever = ContextualCompressionRetriever(
    base_compressor=compressor,
    base_retriever=retriever,
)
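
The compression retriever then drops into the pipeline wherever the base retriever was used (the query string here is just an example):

# Returns only the extracted, relevant spans of each retrieved chunk
compressed_docs = compression_retriever.invoke("what are the supported installation methods?")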

Step 7: Generation with Context

Prompt Template

RAG_PROMPT = """You are a helpful assistant. Answer the question using ONLY the provided context.
If the context doesn't contain the answer, say "I don't have information about that."
Always cite your sources using [Source: document_name, page X].

Context:
{context}

Question: {question}

Answer:"""

Context Assembly with Token Budget

import tiktoken

def assemble_context(docs: list, max_tokens: int = 6000) -> str:
    """Fit as many docs as possible within token budget."""
    enc = tiktoken.encoding_for_model("gpt-4o")
    context_parts = []
    token_count = 0
    
    for doc in docs:
        text = f"[Source: {doc.metadata.get('source', 'unknown')}, Page {doc.metadata.get('page', '?')}]\n{doc.page_content}\n\n"
        tokens = len(enc.encode(text))
        
        if token_count + tokens > max_tokens:
            break
        
        context_parts.append(text)
        token_count += tokens
    
    return "".join(context_parts)

Step 8: RAG Evaluation with RAGAS

Evaluate your RAG pipeline before shipping it.

from ragas import evaluate
from ragas.metrics import (
    faithfulness,        # Are answers grounded in context?
    answer_relevancy,   # Is the answer relevant to the question?
    context_precision,  # Is retrieved context useful?
    context_recall,     # Does context contain enough info?
)
from datasets import Dataset
from langchain_openai import ChatOpenAI, OpenAIEmbeddings

# Build evaluation dataset
eval_data = {
    "question": ["What is the return policy?"],
    "answer": ["Items can be returned within 30 days."],
    "contexts": [["Our return policy allows 30-day returns for unused items."]],
    "ground_truth": ["Returns accepted within 30 days."],
}

dataset = Dataset.from_dict(eval_data)

results = evaluate(
    dataset=dataset,
    metrics=[faithfulness, answer_relevancy, context_precision, context_recall],
    llm=ChatOpenAI(model="gpt-4o"),
    embeddings=OpenAIEmbeddings(),
)

print(results)
# faithfulness: 0.95 (are answers grounded in retrieved docs?)
# answer_relevancy: 0.88 (does answer address the question?)
# context_precision: 0.82 (is retrieved context useful?)
# context_recall: 0.90 (does context cover the answer?)

Target scores: All metrics > 0.80 before production. Below 0.70 = broken retrieval.


Common RAG Failure Modes & Fixes

| Problem | Symptom | Fix |
| --- | --- | --- |
| Low faithfulness | LLM adds info not in context | Stronger system prompt, lower temperature |
| Low context recall | Right answer not in top-K | Increase K, fix chunking, improve embeddings |
| Low precision | Retrieved chunks are irrelevant | Add metadata filters, use hybrid search |
| Slow retrieval | >500 ms per query | HNSW index, fewer dimensions, cache popular queries |
| Stale embeddings | Old info retrieved | Implement document versioning, re-embed on update |
| Context window exceeded | Truncation errors | Parent-child chunking, contextual compression |
| Poor multilingual | Bad non-English recall | Use multilingual embedding model (bge-m3) |
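
For the "cache popular queries" fix, even a small in-process cache keyed on the normalized query avoids repeated embedding and retrieval for hot questions; a minimal sketch:

import hashlib

_cache: dict[str, list] = {}

def cached_retrieve(query: str, retriever) -> list:
    """Return cached results for repeated queries, retrieve otherwise."""
    key = hashlib.sha256(query.strip().lower().encode()).hexdigest()
    if key not in _cache:
        _cache[key] = retriever.invoke(query)
    return _cache[key]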

Production Checklist

  • [ ] Chunking tested against real queries
  • [ ] Embeddings batch-ingested with error recovery
  • [ ] Metadata schema designed and documented
  • [ ] Hybrid search enabled (dense + BM25)
  • [ ] Reranker in place for top-K selection
  • [ ] RAGAS scores > 0.80 across all metrics
  • [ ] Token budget enforced in context assembly
  • [ ] Citation extraction working
  • [ ] Incremental update pipeline for new documents
  • [ ] Embedding model version locked (changes require re-embedding)
  • [ ] Query caching for popular questions
  • [ ] Monitoring: retrieval latency, answer quality drift
