MongoDB Expert
MongoDB's document model is both its greatest strength and the source of most performance problems. The difference between a fast MongoDB application and a slow one is almost always document design and index strategy — not hardware or configuration. Understanding when to embed vs reference, how aggregation pipelines execute, and how to read explain() output unlocks order-of-magnitude improvements.
Core Mental Model
MongoDB optimizes for read patterns at write time — you design documents around the queries you will run, not around normalization. Every collection scan is a red flag; every query should hit an index. The aggregation pipeline is a composable transformation: stages execute in sequence, each passing its output as the next stage's input. Index field order in compound indexes follows the ESR rule: Equality fields first, Sort fields second, Range fields last.
Document Design: Embedding vs Referencing
The Core Questions
- How is the data accessed? If you always load the parent with the child, embed. If you often load children without the parent, reference.
- What is the cardinality? One-to-few (< ~100 items): embed. One-to-many: reference. Many-to-many: reference with arrays.
- Does the child change independently? If so, embedding creates update complexity.
- Is the document at risk of exceeding 16MB? Unbounded arrays must be referenced.
// ✅ EMBED: Blog post with comments (one-to-few, loaded together)
{
_id: ObjectId("..."),
title: "Understanding MVCC",
body: "...",
author_id: ObjectId("..."),
comments: [ // embedded — always loaded with post
{ author: "Alice", text: "Great article", created_at: ISODate("...") },
{ author: "Bob", text: "Very helpful", created_at: ISODate("...") }
]
}
// ✅ REFERENCE: Orders with line items (many-to-many, large sets)
// orders collection
{
_id: ObjectId("..."),
customer_id: ObjectId("..."), // reference
line_item_ids: [ObjectId("..."), ObjectId("...")], // reference array
total: 149.99,
status: "shipped"
}
// ✅ HYBRID: User with recent activity (bounded embed + overflow reference)
{
_id: ObjectId("..."),
username: "alice",
recent_activity: [ // last 10 only — bounded
{ type: "login", ts: ISODate("...") }
],
// overflow goes to separate activity_log collection
}
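Keeping the embedded array bounded is done at write time with $push + $each + $slice. A minimal sketch — the helper and the `username` filter are illustrative, not part of any driver API:

```javascript
// Builds an update that appends one activity entry and trims the array to the
// most recent `limit` items ($slice with a negative value keeps the tail).
function boundedPushUpdate(field, entry, limit) {
  return { $push: { [field]: { $each: [entry], $slice: -limit } } };
}

// Usage against the users collection above:
// db.users.updateOne(
//   { username: "alice" },
//   boundedPushUpdate("recent_activity", { type: "login", ts: new Date() }, 10)
// );
```

Older entries that fall off the array are written to the separate activity_log collection by the application.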
Aggregation Pipeline
Stages execute in order. Order matters for performance — filter early with $match, always.
Core Stages
// Full analytics pipeline example: top customers by revenue in last 30 days
db.orders.aggregate([
// Stage 1: Filter first — uses indexes, reduces docs for all downstream stages
{
$match: {
created_at: { $gte: new Date(Date.now() - 30 * 24 * 60 * 60 * 1000) },
status: { $in: ["completed", "shipped"] }
}
},
// Stage 2: Join with customers collection (like SQL JOIN)
{
$lookup: {
from: "customers",
localField: "customer_id",
foreignField: "_id",
as: "customer",
// combining localField/foreignField with pipeline requires MongoDB 5.0+
pipeline: [
{ $project: { name: 1, email: 1, tier: 1 } }
]
}
},
{ $unwind: "$customer" }, // flatten the array from $lookup
// Stage 3: Group by customer
{
$group: {
_id: "$customer_id",
customer_name: { $first: "$customer.name" },
customer_email: { $first: "$customer.email" },
total_revenue: { $sum: "$total" },
order_count: { $sum: 1 },
avg_order: { $avg: "$total" },
last_order: { $max: "$created_at" }
}
},
// Stage 4: Sort and limit
{ $sort: { total_revenue: -1 } },
{ $limit: 100 },
// Stage 5: Shape the output
{
$project: {
_id: 0,
customer_id: "$_id",
name: "$customer_name",
email: "$customer_email",
revenue: { $round: ["$total_revenue", 2] },
orders: "$order_count",
avg_order_value: { $round: ["$avg_order", 2] },
last_active: "$last_order"
}
}
]);
// $facet: multiple aggregations in one pass (for faceted search UI)
db.products.aggregate([
{ $match: { category: "electronics", in_stock: true } },
{
$facet: {
by_brand: [
{ $group: { _id: "$brand", count: { $sum: 1 } } },
{ $sort: { count: -1 } },
{ $limit: 10 }
],
price_ranges: [
{
$bucket: {
groupBy: "$price",
boundaries: [0, 50, 100, 250, 500, 1000, Infinity],
default: "Other",
output: { count: { $sum: 1 }, avg_price: { $avg: "$price" } }
}
}
],
total_count: [{ $count: "n" }]
}
}
]);
Index Strategy
ESR Rule for Compound Indexes
Equality → Sort → Range
// Query: find active users joined within a date range, sorted by join date
db.users.find({
status: "active", // E: equality
joined_at: { // R: range
$gte: ISODate("2024-01-01"),
$lt: ISODate("2025-01-01")
}
}).sort({ joined_at: 1 }); // S: sort
// Correct index order: E, S, R
db.users.createIndex(
{ status: 1, joined_at: 1 }, // equality, then sort/range on same field
{ name: "idx_status_joined" }
);
// Partial index: index only active users (smaller, faster)
db.users.createIndex(
{ joined_at: 1 },
{
partialFilterExpression: { status: "active" },
name: "idx_active_users_joined"
}
);
// Text index for full-text search (use Atlas Search for production)
db.articles.createIndex({ title: "text", body: "text" });
db.articles.find({ $text: { $search: "mongodb performance" } },
{ score: { $meta: "textScore" } })
.sort({ score: { $meta: "textScore" } });
// 2dsphere for geospatial
db.locations.createIndex({ coordinates: "2dsphere" });
db.locations.find({
coordinates: {
$near: {
$geometry: { type: "Point", coordinates: [-73.9857, 40.7484] },
$maxDistance: 5000 // meters
}
}
});
// Wildcard index for dynamic fields (JSONB equivalent)
db.events.createIndex({ "metadata.$**": 1 });
Reading explain() Output
// Always use executionStats for performance analysis
const result = db.orders.explain("executionStats").find({
customer_id: ObjectId("..."),
status: "pending"
});
// Key fields to check:
// executionStats.totalDocsExamined → should be close to nReturned
// executionStats.totalKeysExamined → index scans
// executionStats.executionTimeMillis → total query time
// winningPlan.stage:
// COLLSCAN = no index (bad unless expected)
// IXSCAN = index used (good)
// FETCH = retrieving full documents from the collection after an index scan
// PROJECTION_SIMPLE = efficient projection
// inputStage.indexName → which index was chosen
// Ratio to check: docsExamined / nReturned
// If > 10: selectivity problem — add a better index or compound index
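That ratio check can be automated. A sketch of a helper that flags poor selectivity from an explain("executionStats") result (the function name and threshold default are assumptions, not driver API):

```javascript
// Computes docsExamined / nReturned from an explain("executionStats") result.
// Ratios above `threshold` suggest the index is not selective enough.
function selectivityRatio(explainResult, threshold = 10) {
  const stats = explainResult.executionStats;
  const returned = stats.nReturned || 1; // avoid divide-by-zero on empty results
  const ratio = stats.totalDocsExamined / returned;
  return { ratio, ok: ratio <= threshold };
}

// Usage (mongosh / Node driver):
// const res = db.orders.explain("executionStats").find({ status: "pending" });
// selectivityRatio(res); // investigate any query where ok is false
```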
Sharding
Shard Key Selection
Bad shard key = hot shards, uneven distribution, poor query routing.
Good shard key properties:
✅ High cardinality (many distinct values)
✅ Even distribution (no hot values)
✅ Query isolation (most queries include the shard key)
✅ Monotonic growth is OK with hashed sharding
Bad shard key patterns:
❌ Low cardinality (status, boolean, small enum)
❌ Monotonically increasing without hashing (ObjectId, timestamps → hot last chunk)
❌ Rarely included in queries (causes scatter-gather across all shards)
// Hash sharding: even distribution, but loses range query locality
sh.shardCollection("mydb.users", { _id: "hashed" });
// Range sharding: preserves locality for range queries
sh.shardCollection("mydb.events", { tenant_id: 1, created_at: 1 });
// Good for: WHERE tenant_id = X (query routed to one shard)
// Compound shard key: tenant isolation + distribution
sh.shardCollection("mydb.messages", { tenant_id: 1, _id: "hashed" });
// tenant_id prefix routes tenant queries to subset of shards
// hashed _id prevents hot spots within tenant
Change Streams
// Watch a collection for changes (event-driven triggers)
const changeStream = db.orders.watch([
// Filter to specific operations
{
$match: {
operationType: { $in: ["insert", "update"] },
"fullDocument.status": "payment_complete"
}
}
], {
fullDocument: "updateLookup" // include the full doc after update
});
// Async iterator pattern (Node.js)
for await (const change of changeStream) {
const { operationType, fullDocument, updateDescription } = change;
if (operationType === "insert") {
await processNewOrder(fullDocument);
} else if (operationType === "update") {
const updatedFields = updateDescription.updatedFields;
await handleOrderUpdate(fullDocument, updatedFields);
}
// Resume token for fault tolerance
await saveResumeToken(change._id); // persist this
}
// Resume after crash using saved token
const savedToken = await loadResumeToken();
const stream = db.orders.watch([], {
resumeAfter: savedToken,
fullDocument: "updateLookup"
});
Transactions
// Transactions in MongoDB (4.0+ replica sets, 4.2+ sharded)
// Use sparingly — they have 3-5x performance overhead vs single-doc ops
// Best for: cross-collection writes that must be atomic
const session = client.startSession();
try {
await session.withTransaction(async () => {
const orders = client.db("shop").collection("orders");
const inventory = client.db("shop").collection("inventory");
// Both writes succeed or both roll back
await orders.insertOne(
{ customer_id: userId, items: cart, total: 99.99 },
{ session }
);
await inventory.updateMany(
{ _id: { $in: cart.map(i => i.product_id) } },
{ $inc: { stock: -1 } }, // simplified: decrements one unit per matched product
{ session }
);
}, {
readConcern: { level: "snapshot" },
writeConcern: { w: "majority" }
});
} finally {
await session.endSession();
}
// When NOT to use transactions:
// - Single document updates (already atomic)
// - Read-heavy operations
// - High-throughput inserts (kills write throughput)
// - When document redesign (embedding) can eliminate the need
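The first bullet is the most common escape hatch: because a single-document update is atomic, a guarded filter can replace a transaction entirely for cases like stock reservation. A sketch under assumed names (`reserveStockOp`, the `inventory` collection):

```javascript
// Builds a guarded decrement: the filter only matches while enough stock
// remains, so the check and the decrement happen atomically in one document —
// no transaction required.
function reserveStockOp(productId, qty) {
  return {
    filter: { _id: productId, stock: { $gte: qty } },
    update: { $inc: { stock: -qty } }
  };
}

// Usage — matchedCount === 0 means insufficient stock:
// const { filter, update } = reserveStockOp(productId, 2);
// const res = await db.collection("inventory").updateOne(filter, update);
// if (res.matchedCount === 0) throw new Error("insufficient stock");
```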
Connection Pool Tuning
// Node.js MongoDB driver connection options
const client = new MongoClient(uri, {
maxPoolSize: 50, // default 100 — limit to avoid overwhelming MongoDB
minPoolSize: 5, // keep warm connections
maxIdleTimeMS: 30000, // close idle connections after 30s
serverSelectionTimeoutMS: 5000, // fail fast on topology changes
socketTimeoutMS: 45000,
connectTimeoutMS: 10000,
heartbeatFrequencyMS: 10000,
retryWrites: true, // auto-retry transient write errors
retryReads: true,
readPreference: "primaryPreferred", // or "secondaryPreferred" for reads
writeConcern: { w: "majority", wtimeoutMS: 5000 }
});
Atlas Search
// Atlas Search: full-text with relevance scoring, superior to text indexes
db.articles.aggregate([
{
$search: {
index: "articles_search",
compound: {
must: [{
text: {
query: "machine learning transformers",
path: ["title", "body"],
fuzzy: { maxEdits: 1 } // typo tolerance
}
}],
should: [{
text: {
query: "machine learning transformers",
path: "title",
score: { boost: { value: 3 } } // title matches rank higher
}
}],
filter: [{
range: {
path: "published_at",
gte: new Date("2024-01-01")
}
}]
}
}
},
{ $limit: 20 },
{
$project: {
title: 1,
summary: 1,
score: { $meta: "searchScore" }
}
}
]);
Anti-Patterns
// ❌ Unbounded array growth (document exceeds 16MB, index grows unbounded)
// Never do this for high-volume collections:
{ user_id: 1, events: [...thousands of events...] }
// ✅ Separate collection with reference
// ❌ Using _id as a shard key with range sharding (hot last chunk)
sh.shardCollection("db.col", { _id: 1 }); // monotonically increasing → hot shard
// ✅ Hash the _id
sh.shardCollection("db.col", { _id: "hashed" });
// ❌ $where clause (executes JavaScript, can't use index, security risk)
db.users.find({ $where: "this.age > 30" });
// ✅ Use standard query operators
db.users.find({ age: { $gt: 30 } });
// ❌ Large $in arrays (> ~1000 values) — query planning overhead
db.products.find({ _id: { $in: [...10000 ids...] } });
// ✅ Paginate or use aggregation $lookup
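The pagination fix amounts to splitting the id list into batches so each $in stays small. A minimal sketch (batch size of 500 is an illustrative choice):

```javascript
// Splits a large id list into fixed-size batches.
function chunk(ids, size) {
  const out = [];
  for (let i = 0; i < ids.length; i += size) out.push(ids.slice(i, i + size));
  return out;
}

// Usage — one indexed query per batch instead of one giant $in:
// for (const batch of chunk(allIds, 500)) {
//   const docs = await db.collection("products")
//     .find({ _id: { $in: batch } }).toArray();
//   // process docs...
// }
```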
// ❌ Missing $match at pipeline start
db.orders.aggregate([
{ $group: { _id: "$customer_id", total: { $sum: "$amount" } } },
{ $match: { "_id": someId } }
]);
// ✅ Filter before grouping
db.orders.aggregate([
{ $match: { customer_id: someId } }, // index hit, far fewer docs to group
{ $group: { _id: "$customer_id", total: { $sum: "$amount" } } }
]);
Quick Reference
Document Design Decision Tree:
Always loaded together? → embed
One-to-few (< ~100)? → embed
One-to-many? → reference
Many-to-many? → reference arrays
Unbounded growth possible? → reference (never embed)
Child changes independently? → reference
Compound Index (ESR Rule):
1. Equality fields first (status = "active")
2. Sort fields second (.sort({ created_at: 1 }))
3. Range fields last ($gte, $lte, $in)
explain() Red Flags:
COLLSCAN on large collection → missing index
docsExamined >> nReturned → poor index selectivity
totalKeysExamined >> nReturned → index scans far more keys than needed (consider a more selective or covering index)
Shard Key Selection:
High cardinality? ✅ required
Not monotonically increasing? ✅ or use hashed
Included in most queries? ✅ required
Avoid: status, boolean, date alone
Change Streams:
Always persist resume token → fault-tolerant resume
Use fullDocument:updateLookup → get full doc on update
Filter at stream level → reduce application load