rag-architect
Production RAG (Retrieval-Augmented Generation) system design. Vector databases (Pinecone, Weaviate, pgvector), chunking strategies, hybrid search, reranking, and RAGAS evaluation. The complete playbook for building RAG that actually works.
Installation
npx clawhub@latest install rag-architect
View the full skill documentation and source below.
Documentation
RAG Architecture: Production-Grade Retrieval-Augmented Generation
What Is RAG and When to Use It
RAG grounds LLM responses in your own data — documents, databases, knowledge bases — reducing hallucination and keeping answers current without fine-tuning.
Use RAG when:
- Answers require proprietary or frequently-updated data
- Users need source citations
- The LLM's training cutoff is a problem
- Fine-tuning is too expensive or slow for your update cadence
Don't use RAG when:
- Data fits in the context window (use direct injection)
- You need to change model behavior (use fine-tuning instead)
- Latency is critical and you cannot absorb the retrieval overhead (roughly 100-500 ms per query)
Architecture Layers
Query → [Pre-Processing] → [Retrieval] → [Augmentation] → [Generation] → [Post-Processing]
- Pre-Processing: query rewrite, HyDE, decomposition
- Retrieval: vector search, BM25 hybrid, metadata filtering
- Augmentation: context window assembly, reranking, token limit management
- Generation: LLM call, streaming, grounding
- Post-Processing: faithfulness check, citation extraction, hallucination check
Step 1: Document Processing Pipeline
Chunking Strategy
Chunking is the single most impactful RAG decision. Bad chunking = bad retrieval.
Fixed-size chunking (fast, baseline):
from langchain.text_splitter import RecursiveCharacterTextSplitter
splitter = RecursiveCharacterTextSplitter(
chunk_size=1000, # Characters, not tokens
chunk_overlap=200, # 20% overlap prevents context cuts
separators=["\n\n", "\n", ". ", " ", ""], # Tries each in order
)
chunks = splitter.create_documents([text], metadatas=[{"source": "doc.pdf"}])
Semantic chunking (better coherence, slower):
from langchain_experimental.text_splitter import SemanticChunker
from langchain_openai import OpenAIEmbeddings
splitter = SemanticChunker(
OpenAIEmbeddings(),
breakpoint_threshold_type="percentile", # or "standard_deviation", "interquartile"
breakpoint_threshold_amount=95,
)
Hierarchical / parent-child chunking (best for long documents):
# Parent: 2000 char chunks for context
# Child: 400 char chunks for precision retrieval
# Retrieve child, return parent context
from langchain.retrievers import ParentDocumentRetriever
from langchain.storage import InMemoryStore
parent_splitter = RecursiveCharacterTextSplitter(chunk_size=2000)
child_splitter = RecursiveCharacterTextSplitter(chunk_size=400)
retriever = ParentDocumentRetriever(
    vectorstore=vectorstore,   # any LangChain vector store; it holds the small child chunks
    docstore=InMemoryStore(),  # holds the full parent chunks
    child_splitter=child_splitter,
    parent_splitter=parent_splitter,
)
retriever.add_documents(docs)  # splits docs, embeds the children, stores the parents
Chunking rules of thumb:
- Q&A systems: 256-512 tokens per chunk
- Summarization: 512-1024 tokens
- Technical docs: Use semantic chunking + 100-200 token overlap
- Code: Chunk by function/class, not by character count
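For the code case, LangChain ships language-aware splitters that break on function and class boundaries before falling back to blank lines; a minimal sketch (file name and chunk size are illustrative):
from langchain.text_splitter import Language, RecursiveCharacterTextSplitter
# source_code: the file contents as a string
code_splitter = RecursiveCharacterTextSplitter.from_language(
    language=Language.PYTHON,  # other languages (JS, TS, Go, Java, ...) are also supported
    chunk_size=1500,           # characters; large enough to keep most functions whole
    chunk_overlap=0,
)
code_chunks = code_splitter.create_documents([source_code], metadatas=[{"source": "utils.py"}])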
Metadata Enrichment
Always attach metadata — it enables powerful filtered retrieval:
{
"source": "user_manual_v2.pdf",
"page": 42,
"section": "Installation",
"doc_type": "manual",
"created_at": "2024-01-15",
"last_modified": "2024-03-01",
"language": "en",
"product": "ProductX",
"version": "2.0",
"chunk_index": 5,
"total_chunks": 23,
}
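A small sketch of attaching these fields at ingestion time, building on the chunks produced by the splitter earlier (the doc_type and language values are illustrative):
total = len(chunks)
for i, chunk in enumerate(chunks):
    chunk.metadata.update({
        "doc_type": "manual",   # in practice, derive these values from the source files
        "language": "en",
        "chunk_index": i,
        "total_chunks": total,
    })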
Step 2: Embedding Models
| Model | Provider | Dimensions | Best For |
|---|---|---|---|
| text-embedding-3-large | OpenAI | 3072 | General English use, highest accuracy |
| text-embedding-3-small | OpenAI | 1536 | Cost-sensitive, fast |
| embed-english-v3.0 | Cohere | 1024 | English, pairs well with Cohere reranking |
| embed-multilingual-v3.0 | Cohere | 1024 | Multi-language |
| nomic-embed-text | Nomic/Ollama | 768 | Open source, local |
| bge-m3 | BAAI | 1024 | Strongest open-source option, multilingual |
from openai import OpenAI
client = OpenAI()
def embed_texts(texts: list[str], batch_size: int = 100) -> list[list[float]]:
    """Embed texts in batches to respect rate limits."""
    all_embeddings = []
    for i in range(0, len(texts), batch_size):
        batch = texts[i:i + batch_size]
        response = client.embeddings.create(
            model="text-embedding-3-small",
            input=batch,
            encoding_format="float",  # or "base64" for storage efficiency
        )
        all_embeddings.extend([item.embedding for item in response.data])
    return all_embeddings
Step 3: Vector Database Selection
| Database | Type | Best For | Hosted |
|---|---|---|---|
| Pinecone | Managed | Production, large scale, teams | Yes |
| Weaviate | Open source | Hybrid search, complex filtering | Both |
| Chroma | Open source | Development, local, small scale | No |
| pgvector | PostgreSQL ext | Already using Postgres, small-medium | No |
| Qdrant | Open source | Performance, on-prem, Rust-based | Both |
| Milvus | Open source | Large scale, self-hosted | Both |
pgvector (Best for existing Postgres users)
-- Enable extension
CREATE EXTENSION IF NOT EXISTS vector;
-- Create table with embedding column
CREATE TABLE documents (
id UUID PRIMARY KEY DEFAULT gen_random_uuid(),
content TEXT NOT NULL,
embedding vector(1536), -- Matches your embedding model dimensions
metadata JSONB,
created_at TIMESTAMPTZ DEFAULT NOW()
);
-- Create IVFFlat index (faster queries, slight accuracy tradeoff)
CREATE INDEX ON documents USING ivfflat (embedding vector_cosine_ops)
WITH (lists = 100); -- lists = sqrt(num_rows) is a good starting point
-- Alternative: HNSW index (better accuracy, more memory); use one index type per column
CREATE INDEX ON documents USING hnsw (embedding vector_cosine_ops)
WITH (m = 16, ef_construction = 64);
-- Similarity search
SELECT id, content, metadata,
1 - (embedding <=> $1::vector) AS similarity
FROM documents
WHERE metadata->>'doc_type' = 'manual' -- Metadata filter
ORDER BY embedding <=> $1::vector
LIMIT 10;
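A minimal Python sketch of running that query with psycopg; it passes the embedding as a vector literal string, which pgvector accepts (the pgvector Python package's adapter is an alternative):
import psycopg
def search_documents(conn: psycopg.Connection, query_embedding: list[float], k: int = 10):
    vec = "[" + ",".join(str(x) for x in query_embedding) + "]"  # pgvector literal syntax
    with conn.cursor() as cur:
        cur.execute(
            """
            SELECT id, content, metadata, 1 - (embedding <=> %s::vector) AS similarity
            FROM documents
            WHERE metadata->>'doc_type' = 'manual'
            ORDER BY embedding <=> %s::vector
            LIMIT %s
            """,
            (vec, vec, k),
        )
        return cur.fetchall()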
Chroma (Development/Prototyping)
import os
import chromadb
from chromadb.utils import embedding_functions
client = chromadb.PersistentClient(path="./chroma_db")
# Use OpenAI embeddings
ef = embedding_functions.OpenAIEmbeddingFunction(
api_key=os.environ["OPENAI_API_KEY"],
model_name="text-embedding-3-small",
)
collection = client.get_or_create_collection(
name="documents",
embedding_function=ef,
metadata={"hnsw:space": "cosine"},
)
# Add documents
collection.add(
documents=["chunk text 1", "chunk text 2"],
metadatas=[{"source": "doc1"}, {"source": "doc2"}],
ids=["id1", "id2"],
)
# Query
results = collection.query(
query_texts=["what is the installation process?"],
n_results=5,
where={"source": "user_manual_v2.pdf"}, # Metadata filter
include=["documents", "metadatas", "distances"],
)
Pinecone (Production Scale)
import os
from pinecone import Pinecone, ServerlessSpec
pc = Pinecone(api_key=os.environ["PINECONE_API_KEY"])
# Create index
pc.create_index(
name="documents",
dimension=1536,
metric="cosine",
spec=ServerlessSpec(cloud="aws", region="us-east-1"),
)
index = pc.Index("documents")
# Upsert vectors
vectors = [
{
"id": chunk_id,
"values": embedding,
"metadata": {
"text": chunk_text, # Store text in metadata for retrieval
"source": source,
"page": page_num,
}
}
for chunk_id, embedding, chunk_text, source, page_num in chunks
]
index.upsert(vectors=vectors, namespace="production")
# Query with metadata filter
results = index.query(
vector=query_embedding,
top_k=10,
filter={"doc_type": {"$eq": "manual"}, "version": {"$gte": "2.0"}},
include_metadata=True,
namespace="production",
)
Step 4: Hybrid Search (Dense + Sparse)
Hybrid search combines semantic (dense vector) and keyword (sparse BM25) retrieval. Almost always outperforms pure vector search.
from langchain.retrievers import EnsembleRetriever
from langchain_community.retrievers import BM25Retriever
from langchain_community.vectorstores import Chroma
# Dense retriever (semantic)
vectorstore = Chroma.from_documents(docs, embedding=embeddings)
dense_retriever = vectorstore.as_retriever(search_kwargs={"k": 10})
# Sparse retriever (keyword)
bm25_retriever = BM25Retriever.from_documents(docs)
bm25_retriever.k = 10
# Combine with RRF (Reciprocal Rank Fusion)
ensemble_retriever = EnsembleRetriever(
retrievers=[bm25_retriever, dense_retriever],
weights=[0.4, 0.6], # BM25 gets 40%, vector gets 60%
)
results = ensemble_retriever.invoke("installation steps")
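Under the hood, reciprocal rank fusion gives each document weight / (rank + k) in every list it appears in and sums the contributions. A standalone sketch, in case you ever need to fuse rankings yourself (bm25_ids and vector_ids are illustrative lists of document IDs):
def reciprocal_rank_fusion(rankings: list[list[str]], weights: list[float], k: int = 60) -> list[str]:
    """Fuse several ranked lists of document IDs into one ranking (best first)."""
    scores: dict[str, float] = {}
    for ranking, weight in zip(rankings, weights):
        for rank, doc_id in enumerate(ranking):
            scores[doc_id] = scores.get(doc_id, 0.0) + weight / (k + rank + 1)
    return sorted(scores, key=scores.get, reverse=True)
# e.g. fuse BM25 and vector rankings with the 0.4 / 0.6 weights used above
fused_ids = reciprocal_rank_fusion([bm25_ids, vector_ids], weights=[0.4, 0.6])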
When hybrid search helps most:
- Technical queries containing specific terms, error codes, or identifiers
- Queries with proper nouns, product names, or version numbers
- Mixed user behavior: some users phrase queries semantically, others type exact keywords
Step 5: Reranking
After initial retrieval (top-20), rerank to get the best top-5. This dramatically improves precision.
import os
import cohere
co = cohere.Client(os.environ["COHERE_API_KEY"])
# Initial retrieval: get ~20 candidates (configure the retriever's k accordingly)
initial_results = retriever.invoke(query)
# Rerank with Cohere
reranked = co.rerank(
query=query,
documents=[doc.page_content for doc in initial_results],
top_n=5,
model="rerank-english-v3.0",
return_documents=True,
)
# Get reranked documents
final_docs = [initial_results[r.index] for r in reranked.results]
Cross-encoder rerankers to consider:
- Cohere rerank-english-v3.0: best quality, API-based
- BAAI/bge-reranker-v2-m3: open source, strong multilingual
- ms-marco-MiniLM-L-6-v2: lightweight, runs locally
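If you would rather avoid an API call, the lightweight option runs locally via sentence-transformers; a sketch (model ID as published on Hugging Face):
from sentence_transformers import CrossEncoder
reranker = CrossEncoder("cross-encoder/ms-marco-MiniLM-L-6-v2")
def rerank_local(query: str, docs: list[str], top_n: int = 5) -> list[str]:
    scores = reranker.predict([(query, doc) for doc in docs])  # one relevance score per pair
    ranked = sorted(zip(docs, scores), key=lambda pair: pair[1], reverse=True)
    return [doc for doc, _ in ranked[:top_n]]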
Step 6: Query Transformation
Don't just embed the raw user query. Transform it first.
HyDE (Hypothetical Document Embeddings)
Generate a hypothetical answer and embed that instead of the raw query; it often retrieves better chunks:
def hyde_query(query: str, llm) -> str:
    """Generate a hypothetical document to improve retrieval."""
    prompt = f"""Write a short document that would answer this question:
Question: {query}
Document:"""
    hypothetical_doc = llm.invoke(prompt).content
    return hypothetical_doc  # Embed this instead of the raw query
Query Decomposition
Break complex queries into sub-queries and retrieve for each:
import json
def decompose_query(query: str, llm) -> list[str]:
    prompt = f"""Break this complex question into 2-4 simpler sub-questions.
Return as a JSON array of strings.
Question: {query}
Sub-questions:"""
    result = llm.invoke(prompt).content
    return json.loads(result)
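Retrieval then runs once per sub-question, and the results are merged with simple deduplication before generation; a sketch using the retriever and decompose_query above:
def retrieve_for_subqueries(query: str, llm, retriever, per_query_k: int = 5) -> list:
    docs, seen = [], set()
    for sub_q in decompose_query(query, llm):
        for doc in retriever.invoke(sub_q)[:per_query_k]:
            if doc.page_content not in seen:  # drop chunks already found for another sub-question
                seen.add(doc.page_content)
                docs.append(doc)
    return docs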
Contextual Compression
Use an LLM to extract only the relevant part of each retrieved chunk:
from langchain.retrievers import ContextualCompressionRetriever
from langchain.retrievers.document_compressors import LLMChainExtractor
compressor = LLMChainExtractor.from_llm(llm)
compression_retriever = ContextualCompressionRetriever(
base_compressor=compressor,
base_retriever=retriever,
)
Step 7: Generation with Context
Prompt Template
RAG_PROMPT = """You are a helpful assistant. Answer the question using ONLY the provided context.
If the context doesn't contain the answer, say "I don't have information about that."
Always cite your sources using [Source: document_name, page X].
Context:
{context}
Question: {question}
Answer:"""
Context Assembly with Token Budget
import tiktoken
def assemble_context(docs: list, max_tokens: int = 6000) -> str:
    """Fit as many docs as possible within the token budget."""
    enc = tiktoken.encoding_for_model("gpt-4o")
    context_parts = []
    token_count = 0
    for doc in docs:
        text = f"[Source: {doc.metadata.get('source', 'unknown')}, Page {doc.metadata.get('page', '?')}]\n{doc.page_content}\n\n"
        tokens = len(enc.encode(text))
        if token_count + tokens > max_tokens:
            break
        context_parts.append(text)
        token_count += tokens
    return "".join(context_parts)
Step 8: RAG Evaluation with RAGAS
Evaluate your RAG pipeline before shipping it.
from ragas import evaluate
from ragas.metrics import (
faithfulness, # Are answers grounded in context?
answer_relevancy, # Is the answer relevant to the question?
context_precision, # Is retrieved context useful?
context_recall, # Does context contain enough info?
)
from langchain_openai import ChatOpenAI, OpenAIEmbeddings
from datasets import Dataset
# Build evaluation dataset
eval_data = {
"question": ["What is the return policy?"],
"answer": ["Items can be returned within 30 days."],
"contexts": [["Our return policy allows 30-day returns for unused items."]],
"ground_truth": ["Returns accepted within 30 days."],
}
dataset = Dataset.from_dict(eval_data)
results = evaluate(
dataset=dataset,
metrics=[faithfulness, answer_relevancy, context_precision, context_recall],
llm=ChatOpenAI(model="gpt-4o"),
embeddings=OpenAIEmbeddings(),
)
print(results)
# faithfulness: 0.95 (are answers grounded in retrieved docs?)
# answer_relevancy: 0.88 (does answer address the question?)
# context_precision: 0.82 (is retrieved context useful?)
# context_recall: 0.90 (does context cover the answer?)
Target scores: aim for every metric above 0.80 before production; anything below 0.70 usually means a broken stage in the pipeline.
Common RAG Failure Modes & Fixes
| Problem | Symptom | Fix |
|---|---|---|
| Low faithfulness | LLM adds info not in context | Stronger system prompt, lower temperature |
| Low context recall | Right answer not in top-K | Increase K, fix chunking, improve embeddings |
| Low precision | Retrieved chunks are irrelevant | Add metadata filters, use hybrid search |
| Slow retrieval | >500ms per query | HNSW index, fewer dimensions, cache popular queries (see the sketch below) |
| Stale embeddings | Old info retrieved | Implement document versioning, re-embed on update |
| Context window exceeded | Truncation errors | Parent-child chunking, contextual compression |
| Poor multilingual | Bad non-English recall | Use multilingual embedding model (bge-m3) |
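For the cache-popular-queries fix in the table, even a small in-process TTL cache keyed on the normalized query text avoids repeated embedding and vector-search calls; a minimal sketch:
import time
_query_cache: dict[str, tuple[float, list]] = {}
def cached_retrieve(query: str, retriever, ttl_seconds: int = 300) -> list:
    key = query.strip().lower()
    hit = _query_cache.get(key)
    if hit and time.time() - hit[0] < ttl_seconds:
        return hit[1]  # cache hit: skip embedding and search entirely
    docs = retriever.invoke(query)
    _query_cache[key] = (time.time(), docs)
    return docs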
Production Checklist
- [ ] Chunking tested against real queries
- [ ] Embeddings batch-ingested with error recovery
- [ ] Metadata schema designed and documented
- [ ] Hybrid search enabled (dense + BM25)
- [ ] Reranker in place for top-K selection
- [ ] RAGAS scores > 0.80 across all metrics
- [ ] Token budget enforced in context assembly
- [ ] Citation extraction working
- [ ] Incremental update pipeline for new documents (see the sketch after this list)
- [ ] Embedding model version locked (changes require re-embedding)
- [ ] Query caching for popular questions
- [ ] Monitoring: retrieval latency, answer quality drift
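For the incremental-update and version-lock items, one common pattern is to key every chunk on a hash of its text plus the embedding model name, and only call the embedding API when that key is new; a sketch (the store.upsert interface stands in for whichever vector database you chose):
import hashlib
EMBEDDING_MODEL = "text-embedding-3-small"  # locked; changing it changes every key below
def chunk_key(chunk_text: str) -> str:
    # A new model name or edited chunk text yields a new key, forcing re-embedding
    return hashlib.sha256(f"{EMBEDDING_MODEL}:{chunk_text}".encode()).hexdigest()
def upsert_if_changed(chunk_text: str, existing_keys: set[str], embed_fn, store) -> None:
    key = chunk_key(chunk_text)
    if key in existing_keys:
        return  # unchanged chunk: no embedding call, no write
    store.upsert(id=key, vector=embed_fn(chunk_text), text=chunk_text)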