rag-architect
Design and implement production-grade Retrieval-Augmented Generation (RAG) systems. Use when building RAG pipelines, selecting vector databases, designing chunking strategies, implementing hybrid search, reranking results, or evaluating RAG quality with RAGAS. Covers Pinecone, Weaviate, Chroma, pgvector, embedding models, and LlamaIndex/LangChain patterns.
RAG Architecture: Production-Grade Retrieval-Augmented Generation
What Is RAG and When to Use It
RAG grounds LLM responses in your own data — documents, databases, knowledge bases — reducing hallucination and keeping answers current without fine-tuning.
Use RAG when:
- Answers require proprietary or frequently-updated data
- Users need source citations
- The LLM's training cutoff is a problem
- Fine-tuning is too expensive or slow for your update cadence
Don't use RAG when:
- Data fits in the context window (use direct injection)
- You need to change model behavior (use fine-tuning instead)
- Latency is critical and you cannot absorb the retrieval overhead (typically ~100-500 ms per query)
Architecture Layers
Query → [Pre-Processing] → [Retrieval] → [Augmentation] → [Generation] → [Post-Processing]

| Stage | Techniques |
| Pre-Processing | Query rewriting, HyDE, query decomposition |
| Retrieval | Vector search, BM25 hybrid, metadata filtering |
| Augmentation | Context window assembly, reranking, token limit management |
| Generation | LLM call, streaming, grounding |
| Post-Processing | Faithfulness check, citation extraction, hallucination check |
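A minimal sketch of how these layers compose at query time; the helper names (rewrite_query, hybrid_search, rerank_docs, generate_answer) are placeholders for the components built in the steps below:

def answer(query: str) -> dict:
    # Pre-processing: rewrite or expand the query (Step 6)
    search_query = rewrite_query(query)
    # Retrieval: hybrid dense + BM25 search (Steps 3-4)
    candidates = hybrid_search(search_query, top_k=20)
    # Augmentation: rerank, then assemble a token-budgeted context (Steps 5 and 7)
    top_docs = rerank_docs(query, candidates, top_n=5)
    context = assemble_context(top_docs, max_tokens=6000)
    # Generation: grounded answer with citations (Step 7)
    answer_text = generate_answer(query, context)
    # Post-processing: check grounding and extract citations before returning
    return {"answer": answer_text, "sources": [doc.metadata for doc in top_docs]}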
Step 1: Document Processing Pipeline
Chunking Strategy
Chunking is the single most impactful RAG decision. Bad chunking = bad retrieval.
Fixed-size chunking (fast, baseline):
from langchain.text_splitter import RecursiveCharacterTextSplitter
splitter = RecursiveCharacterTextSplitter(
    chunk_size=1000,  # Characters, not tokens
    chunk_overlap=200,  # 20% overlap prevents context cuts
    separators=["\n\n", "\n", ". ", " ", ""],  # Tries each in order
)
chunks = splitter.create_documents([text], metadatas=[{"source": "doc.pdf"}])
Semantic chunking (better coherence, slower):
from langchain_experimental.text_splitter import SemanticChunker
from langchain_openai import OpenAIEmbeddings
splitter = SemanticChunker(
    OpenAIEmbeddings(),
    breakpoint_threshold_type="percentile",  # or "standard_deviation", "interquartile"
    breakpoint_threshold_amount=95,
)
Hierarchical / parent-child chunking (best for long documents):
# Parent: 2000 char chunks for context
# Child: 400 char chunks for precision retrieval
# Retrieve child, return parent context
from langchain.retrievers import ParentDocumentRetriever
from langchain.storage import InMemoryStore
parent_splitter = RecursiveCharacterTextSplitter(chunk_size=2000)
child_splitter = RecursiveCharacterTextSplitter(chunk_size=400)
retriever = ParentDocumentRetriever(
    vectorstore=vectorstore,  # vector store indexing the small child chunks
    docstore=InMemoryStore(),  # stores the full parent chunks for return
    child_splitter=child_splitter,
    parent_splitter=parent_splitter,
)
Chunking rules of thumb:
- Q&A systems: 256-512 tokens per chunk
- Summarization: 512-1024 tokens
- Technical docs: Use semantic chunking + 100-200 token overlap
- Code: Chunk by function/class, not by character count (see the language-aware splitter sketch below)
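A minimal sketch of the code-chunking rule above using LangChain's language-aware splitter, which prefers class/function boundaries over raw character counts; the chunk size, file name, and source_code variable are illustrative:

from langchain.text_splitter import Language, RecursiveCharacterTextSplitter

# Splits on Python class/def boundaries first, then blank lines, then lines
code_splitter = RecursiveCharacterTextSplitter.from_language(
    language=Language.PYTHON,
    chunk_size=1500,
    chunk_overlap=0,
)
code_chunks = code_splitter.create_documents([source_code], metadatas=[{"source": "utils.py"}])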
Metadata Enrichment
Always attach metadata — it enables powerful filtered retrieval:
{
    "source": "user_manual_v2.pdf",
    "page": 42,
    "section": "Installation",
    "doc_type": "manual",
    "created_at": "2024-01-15",
    "last_modified": "2024-03-01",
    "language": "en",
    "product": "ProductX",
    "version": "2.0",
    "chunk_index": 5,
    "total_chunks": 23,
}
Step 2: Embedding Models
| Model | Provider | Dimensions | Best For |
| text-embedding-3-large | OpenAI | 3072 | General English, accuracy |
| text-embedding-3-small | OpenAI | 1536 | Cost-sensitive, fast |
| embed-english-v3.0 | Cohere | 1024 | English, reranking |
| embed-multilingual-v3.0 | Cohere | 1024 | Multi-language |
| nomic-embed-text | Nomic/Ollama | 768 | Open source, local |
| bge-m3 | BAAI | 1024 | Open source, multilingual, strong all-rounder |
from openai import OpenAI
client = OpenAI()
def embed_texts(texts: list[str], batch_size: int = 100) -> list[list[float]]:
    """Embed texts in batches to respect rate limits."""
    all_embeddings = []
    for i in range(0, len(texts), batch_size):
        batch = texts[i:i + batch_size]
        response = client.embeddings.create(
            model="text-embedding-3-small",
            input=batch,
            encoding_format="float",  # or "base64" for storage efficiency
        )
        all_embeddings.extend([item.embedding for item in response.data])
    return all_embeddings
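For the open-source rows in the table (e.g. bge-m3), a minimal local alternative using sentence-transformers; normalizing the vectors lets cosine and dot-product similarity be used interchangeably, and the batch size shown is just an illustrative default:

from sentence_transformers import SentenceTransformer

model = SentenceTransformer("BAAI/bge-m3")  # 1024-dim, multilingual

def embed_texts_local(texts: list[str]) -> list[list[float]]:
    """Embed texts locally; no API calls or rate limits."""
    embeddings = model.encode(texts, batch_size=32, normalize_embeddings=True)
    return embeddings.tolist()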
Step 3: Vector Database Selection
| Database | Type | Best For | Hosted |
| Pinecone | Managed | Production, large scale, teams | Yes |
| Weaviate | Open source | Hybrid search, complex filtering | Both |
| Chroma | Open source | Development, local, small scale | No |
| pgvector | PostgreSQL ext | Already using Postgres, small-medium | No |
| Qdrant | Open source | Performance, on-prem, Rust-based | Both |
| Milvus | Open source | Large scale, self-hosted | Both |
pgvector (Best for existing Postgres users)
-- Enable extension
CREATE EXTENSION IF NOT EXISTS vector;
-- Create table with embedding column
CREATE TABLE documents (
id UUID PRIMARY KEY DEFAULT gen_random_uuid(),
content TEXT NOT NULL,
embedding vector(1536), -- Matches your embedding model dimensions
metadata JSONB,
created_at TIMESTAMPTZ DEFAULT NOW()
);
-- Create IVFFlat index (faster queries, slight accuracy tradeoff)
CREATE INDEX ON documents USING ivfflat (embedding vector_cosine_ops)
WITH (lists = 100); -- start around rows/1000 (up to ~1M rows), sqrt(rows) beyond that
-- Create HNSW index (better accuracy, more memory)
CREATE INDEX ON documents USING hnsw (embedding vector_cosine_ops)
WITH (m = 16, ef_construction = 64);
-- Similarity search
SELECT id, content, metadata,
1 - (embedding <=> $1::vector) AS similarity
FROM documents
WHERE metadata->>'doc_type' = 'manual' -- Metadata filter
ORDER BY embedding <=> $1::vector
LIMIT 10;
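A minimal sketch of driving the table above from Python with psycopg 3 and the pgvector adapter; the connection string is an assumption, and chunk_text, embedding, and query_embedding are assumed to come from the embedding step:

import numpy as np
import psycopg
from pgvector.psycopg import register_vector
from psycopg.types.json import Jsonb

conn = psycopg.connect("postgresql://localhost/ragdb")
register_vector(conn)  # teaches psycopg how to send/receive vector columns

# Insert a chunk with its embedding and metadata
conn.execute(
    "INSERT INTO documents (content, embedding, metadata) VALUES (%s, %s, %s)",
    (chunk_text, np.array(embedding), Jsonb({"doc_type": "manual", "page": 42})),
)
conn.commit()

# Top-10 cosine-similarity search with a metadata filter
rows = conn.execute(
    """SELECT content, 1 - (embedding <=> %s) AS similarity
       FROM documents
       WHERE metadata->>'doc_type' = 'manual'
       ORDER BY embedding <=> %s
       LIMIT 10""",
    (np.array(query_embedding), np.array(query_embedding)),
).fetchall()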
Chroma (Development/Prototyping)
import os

import chromadb
from chromadb.utils import embedding_functions

client = chromadb.PersistentClient(path="./chroma_db")

# Use OpenAI embeddings
ef = embedding_functions.OpenAIEmbeddingFunction(
    api_key=os.environ["OPENAI_API_KEY"],
    model_name="text-embedding-3-small",
)
collection = client.get_or_create_collection(
    name="documents",
    embedding_function=ef,
    metadata={"hnsw:space": "cosine"},
)

# Add documents
collection.add(
    documents=["chunk text 1", "chunk text 2"],
    metadatas=[{"source": "doc1"}, {"source": "doc2"}],
    ids=["id1", "id2"],
)

# Query
results = collection.query(
    query_texts=["what is the installation process?"],
    n_results=5,
    where={"source": "user_manual_v2.pdf"},  # Metadata filter
    include=["documents", "metadatas", "distances"],
)
Pinecone (Production Scale)
import os

from pinecone import Pinecone, ServerlessSpec

pc = Pinecone(api_key=os.environ["PINECONE_API_KEY"])

# Create index
pc.create_index(
    name="documents",
    dimension=1536,
    metric="cosine",
    spec=ServerlessSpec(cloud="aws", region="us-east-1"),
)
index = pc.Index("documents")

# Upsert vectors
vectors = [
    {
        "id": chunk_id,
        "values": embedding,
        "metadata": {
            "text": chunk_text,  # Store text in metadata for retrieval
            "source": source,
            "page": page_num,
        },
    }
    for chunk_id, embedding, chunk_text, source, page_num in chunks
]
index.upsert(vectors=vectors, namespace="production")

# Query with metadata filter
results = index.query(
    vector=query_embedding,
    top_k=10,
    filter={"doc_type": {"$eq": "manual"}, "version": {"$gte": "2.0"}},
    include_metadata=True,
    namespace="production",
)
Step 4: Hybrid Search (Dense + Sparse)
Hybrid search combines semantic (dense vector) and keyword (sparse BM25) retrieval. Almost always outperforms pure vector search.
from langchain.retrievers import EnsembleRetriever
from langchain_community.retrievers import BM25Retriever
from langchain_community.vectorstores import Chroma
# Dense retriever (semantic)
vectorstore = Chroma.from_documents(docs, embedding=embeddings)
dense_retriever = vectorstore.as_retriever(search_kwargs={"k": 10})
# Sparse retriever (keyword)
bm25_retriever = BM25Retriever.from_documents(docs)
bm25_retriever.k = 10
# Combine with RRF (Reciprocal Rank Fusion)
ensemble_retriever = EnsembleRetriever(
    retrievers=[bm25_retriever, dense_retriever],
    weights=[0.4, 0.6],  # BM25 gets 40%, vector gets 60%
)
results = ensemble_retriever.invoke("installation steps")
Where hybrid search helps most:
- Technical queries with specific terms/codes
- Queries with proper nouns, product names, version numbers
- Mixed query styles: some users query semantically, some with keywords
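EnsembleRetriever fuses the two ranked lists with weighted Reciprocal Rank Fusion (RRF). A minimal sketch of the underlying idea: each document's fused score is the sum of weight / (k + rank) over every list it appears in, with k = 60 as the conventional constant:

def reciprocal_rank_fusion(ranked_lists: list[list[str]], weights: list[float], k: int = 60) -> list[str]:
    """Fuse ranked lists of document IDs; higher fused score means better."""
    scores: dict[str, float] = {}
    for ranking, weight in zip(ranked_lists, weights):
        for rank, doc_id in enumerate(ranking):
            scores[doc_id] = scores.get(doc_id, 0.0) + weight / (k + rank + 1)
    return sorted(scores, key=scores.get, reverse=True)

# fused_ids = reciprocal_rank_fusion([bm25_ids, dense_ids], weights=[0.4, 0.6])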
Step 5: Reranking
After initial retrieval (top-20), rerank to get the best top-5. This dramatically improves precision.
import os

import cohere

co = cohere.Client(os.environ["COHERE_API_KEY"])

# Initial retrieval: get 20 candidates (configure the retriever with k=20, e.g. search_kwargs={"k": 20})
initial_results = retriever.invoke(query)

# Rerank with Cohere
reranked = co.rerank(
    query=query,
    documents=[doc.page_content for doc in initial_results],
    top_n=5,
    model="rerank-english-v3.0",
    return_documents=True,
)

# Get reranked documents
final_docs = [initial_results[r.index] for r in reranked.results]
Cross-encoder rerankers to consider:
- Cohere rerank-english-v3.0 - best quality, API-based
- BAAI/bge-reranker-v2-m3 - open source, strong multilingual
- ms-marco-MiniLM-L-6-v2 - lightweight, runs locally
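If an API-based reranker is not an option, a minimal local sketch using the lightweight MiniLM cross-encoder from the list above via sentence-transformers; query and initial_results are assumed to come from the retrieval step:

from sentence_transformers import CrossEncoder

reranker = CrossEncoder("cross-encoder/ms-marco-MiniLM-L-6-v2")

# Score each (query, chunk) pair and keep the top 5
pairs = [(query, doc.page_content) for doc in initial_results]
scores = reranker.predict(pairs)
ranked = sorted(zip(initial_results, scores), key=lambda x: x[1], reverse=True)
final_docs = [doc for doc, _ in ranked[:5]]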
Step 6: Query Transformation
Don't just embed the raw user query. Transform it first.
HyDE (Hypothetical Document Embeddings)
Generate a hypothetical answer and embed that instead of the raw query. It often finds better chunks:

def hyde_query(query: str, llm) -> str:
    """Generate a hypothetical document to improve retrieval."""
    prompt = f"""Write a short document that would answer this question:

Question: {query}

Document:"""
    hypothetical_doc = llm.invoke(prompt).content
    return hypothetical_doc  # Embed this instead of the raw query
Query Decomposition
Break complex queries into sub-queries and retrieve for each (see the merge sketch after this snippet):

import json

def decompose_query(query: str, llm) -> list[str]:
    prompt = f"""Break this complex question into 2-4 simpler sub-questions.
Return as a JSON array of strings.

Question: {query}

Sub-questions:"""
    result = llm.invoke(prompt).content
    return json.loads(result)
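A minimal sketch of the "retrieve for each" half: run the retriever per sub-question and deduplicate before assembling context. Deduplicating on chunk text is an assumption for brevity; a chunk ID in metadata is more robust:

def retrieve_for_subqueries(query: str, retriever, llm) -> list:
    """Retrieve documents for each sub-question and merge, dropping duplicates."""
    sub_questions = decompose_query(query, llm)
    seen, merged = set(), []
    for sub_q in sub_questions:
        for doc in retriever.invoke(sub_q):
            if doc.page_content not in seen:
                seen.add(doc.page_content)
                merged.append(doc)
    return merged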
Contextual Compression
Use an LLM to extract only the relevant part of each retrieved chunk:

from langchain.retrievers import ContextualCompressionRetriever
from langchain.retrievers.document_compressors import LLMChainExtractor

compressor = LLMChainExtractor.from_llm(llm)
compression_retriever = ContextualCompressionRetriever(
    base_compressor=compressor,
    base_retriever=retriever,
)
Step 7: Generation with Context
Prompt Template
RAG_PROMPT = """You are a helpful assistant. Answer the question using ONLY the provided context.
If the context doesn't contain the answer, say "I don't have information about that."
Always cite your sources using [Source: document_name, page X].
Context:
{context}
Question: {question}
Answer:"""
Context Assembly with Token Budget
import tiktoken
def assemble_context(docs: list, max_tokens: int = 6000) -> str:
    """Fit as many docs as possible within token budget."""
    enc = tiktoken.encoding_for_model("gpt-4o")
    context_parts = []
    token_count = 0
    for doc in docs:
        text = (
            f"[Source: {doc.metadata.get('source', 'unknown')}, "
            f"Page {doc.metadata.get('page', '?')}]\n{doc.page_content}\n\n"
        )
        tokens = len(enc.encode(text))
        if token_count + tokens > max_tokens:
            break
        context_parts.append(text)
        token_count += tokens
    return "".join(context_parts)
Step 8: RAG Evaluation with RAGAS
Evaluate your RAG pipeline before shipping it.
from ragas import evaluate
from ragas.metrics import (
    faithfulness,        # Are answers grounded in context?
    answer_relevancy,    # Is the answer relevant to the question?
    context_precision,   # Is retrieved context useful?
    context_recall,      # Does context contain enough info?
)
from datasets import Dataset
from langchain_openai import ChatOpenAI, OpenAIEmbeddings

# Build evaluation dataset
eval_data = {
    "question": ["What is the return policy?"],
    "answer": ["Items can be returned within 30 days."],
    "contexts": [["Our return policy allows 30-day returns for unused items."]],
    "ground_truth": ["Returns accepted within 30 days."],
}
dataset = Dataset.from_dict(eval_data)

results = evaluate(
    dataset=dataset,
    metrics=[faithfulness, answer_relevancy, context_precision, context_recall],
    llm=ChatOpenAI(model="gpt-4o"),
    embeddings=OpenAIEmbeddings(),
)
print(results)
# faithfulness: 0.95 (are answers grounded in retrieved docs?)
# answer_relevancy: 0.88 (does answer address the question?)
# context_precision: 0.82 (is retrieved context useful?)
# context_recall: 0.90 (does context cover the answer?)
Target scores: all metrics > 0.80 before production. Anything below 0.70 means a pipeline stage is broken and needs fixing before anything else.
Common RAG Failure Modes & Fixes
| Problem | Symptom | Fix |
| Low faithfulness | LLM adds info not in context | Stronger system prompt, lower temperature |
| Low context recall | Right answer not in top-K | Increase K, fix chunking, improve embeddings |
| Low precision | Retrieved chunks are irrelevant | Add metadata filters, use hybrid search |
| Slow retrieval | >500ms per query | HNSW index, fewer dimensions, cache popular queries |
| Stale embeddings | Old info retrieved | Implement document versioning, re-embed on update |
| Context window exceeded | Truncation errors | Parent-child chunking, contextual compression |
| Poor multilingual | Bad non-English recall | Use multilingual embedding model (bge-m3) |
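For the stale-embeddings row, a minimal sketch of hash-based change detection: store a content hash alongside each chunk and re-embed only the chunks whose hash changed. The stored_hashes lookup is an assumption standing in for whatever docstore or metadata query holds the previous ingest state:

import hashlib

def chunk_hash(text: str) -> str:
    """Stable fingerprint of a chunk's content."""
    return hashlib.sha256(text.encode("utf-8")).hexdigest()

def chunks_to_reembed(chunks: dict[str, str], stored_hashes: dict[str, str]) -> list[str]:
    """Return IDs of chunks that are new or whose content changed since the last ingest."""
    return [
        chunk_id for chunk_id, text in chunks.items()
        if stored_hashes.get(chunk_id) != chunk_hash(text)
    ]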
Production Checklist
- [ ] Chunking tested against real queries
- [ ] Embeddings batch-ingested with error recovery
- [ ] Metadata schema designed and documented
- [ ] Hybrid search enabled (dense + BM25)
- [ ] Reranker in place for top-K selection
- [ ] RAGAS scores > 0.80 across all metrics
- [ ] Token budget enforced in context assembly
- [ ] Citation extraction working
- [ ] Incremental update pipeline for new documents
- [ ] Embedding model version locked (changes require re-embedding)
- [ ] Query caching for popular questions (see the sketch below)
- [ ] Monitoring: retrieval latency, answer quality drift
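For the query-caching item, a minimal in-process sketch keyed on the normalized query; a production system would more likely use Redis with a TTL, and rag_pipeline here is a placeholder for the full answer() flow sketched earlier:

from functools import lru_cache

@lru_cache(maxsize=1024)
def cached_answer(normalized_query: str) -> str:
    # The full pipeline runs only on cache misses
    return rag_pipeline(normalized_query)

def answer_with_cache(query: str) -> str:
    """Normalize whitespace and case so near-identical queries hit the same cache entry."""
    return cached_answer(" ".join(query.lower().split()))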
Related Skills
- llm-evaluation: Evaluate and improve LLM applications in production. Use when building LLM evaluation pipelines, measuring RAG quality, detecting hallucinations, benchmarking models, implementing LLMOps monitoring, selecting evaluation frameworks (RAGAS, Promptfoo, Langsmith, Braintrust), or designing human feedback loops. Covers evals-as-code, metric design, and continuous quality measurement.
- prompt-engineering-master: Design advanced prompts for LLM applications. Use when building complex AI workflows, implementing chain-of-thought reasoning, creating multi-step agents, designing system prompts, implementing structured outputs, reducing hallucination, or optimizing prompt performance. Covers CoT, ReAct, Constitutional AI, few-shot design, meta-prompting, and production prompt management.
- multi-agent-orchestration: Design and implement multi-agent AI systems. Use when building agent networks, implementing orchestrator-worker patterns, designing agent communication protocols, managing shared memory between agents, implementing task decomposition, handling agent failures, or building agentic pipelines. Covers LangGraph, CrewAI, AutoGen, custom orchestration, and A2A protocol patterns.
- claude-api-expert: Expert-level Anthropic Claude API usage: Messages API structure, model selection (Haiku vs Sonnet vs Opus), tool use with parallel calls, extended thinking, vision, streaming with content block events, prompt caching with cache_control, and context window management.
- embeddings-expert: Expert guide to text embeddings: model selection (OpenAI, E5, BGE, BAAI), semantic vs task-specific embeddings, matryoshka dimension reduction, ColBERT late interaction re-ranking, fine-tuning with contrastive loss, chunking strategy, multi-modal CLIP embeddings, and batching.