
mongodb-expert

Deep expertise in MongoDB document modeling, aggregation pipelines, indexing strategy, sharding, Atlas Search, change streams, and performance tuning. Trigger phrases: when working with MongoDB, aggregation pipeline design, MongoDB index strategy,


MongoDB Expert

MongoDB's document model is both its greatest strength and the source of most performance problems. The difference between a fast MongoDB application and a slow one is almost always document design and index strategy — not hardware or configuration. Understanding when to embed vs reference, how aggregation pipelines execute, and how to read explain() output unlocks order-of-magnitude improvements.

Core Mental Model

MongoDB optimizes for read patterns at write time — you design documents around the queries you will run, not around normalization. Every collection scan is a red flag; every query should hit an index. The aggregation pipeline is a composable transformation: stages execute in sequence, each passing its output as the next stage's input. Index field order in compound indexes follows the ESR rule: Equality fields first, Sort fields second, Range fields last.


Document Design: Embedding vs Referencing

The Core Questions

  1. How is the data accessed? If you always load the parent with the child, embed. If you often load children without the parent, reference.
  2. What is the cardinality? One-to-few (< ~100 items): embed. One-to-many: reference. Many-to-many: reference with arrays.
  3. Does the child change independently? If so, embedding creates update complexity.
  4. Is the document at risk of exceeding 16MB? Unbounded arrays must be referenced.

// ✅ EMBED: Blog post with comments (one-to-few, loaded together)
{
  _id: ObjectId("..."),
  title: "Understanding MVCC",
  body: "...",
  author_id: ObjectId("..."),
  comments: [            // embedded — always loaded with post
    { author: "Alice", text: "Great article", created_at: ISODate("...") },
    { author: "Bob",   text: "Very helpful",  created_at: ISODate("...") }
  ]
}

// ✅ REFERENCE: Orders with line items (one-to-many, potentially large sets)
// orders collection
{
  _id: ObjectId("..."),
  customer_id: ObjectId("..."),   // reference
  line_item_ids: [ObjectId("..."), ObjectId("...")],  // reference array
  total: 149.99,
  status: "shipped"
}

// ✅ HYBRID: User with recent activity (bounded embed + overflow reference)
{
  _id: ObjectId("..."),
  username: "alice",
  recent_activity: [   // last 10 only — bounded
    { type: "login", ts: ISODate("...") }
  ],
  // overflow goes to separate activity_log collection
}
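
The bounded array in the hybrid pattern above has to be enforced on every write. A minimal sketch (the helper name and the 10-entry limit are assumptions, not part of the original) using $push with the $each and $slice modifiers, which append and trim in one atomic update:

```javascript
// Build an update that appends one activity entry and trims the array
// to the `limit` most recent entries in the same atomic operation.
function buildActivityUpdate(entry, limit = 10) {
  return {
    $push: {
      recent_activity: {
        $each: [entry],    // append the new entry
        $slice: -limit     // keep only the last `limit` elements
      }
    }
  };
}

// db.users.updateOne({ username: "alice" },
//                    buildActivityUpdate({ type: "login", ts: new Date() }));
```

Because the trim happens in the same update as the append, the array can never grow past the limit, even under concurrent writers.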

Aggregation Pipeline

Stages execute in order, and order matters for performance: put $match first so every downstream stage processes fewer documents.

Core Stages

// Full analytics pipeline example: top customers by revenue in last 30 days
db.orders.aggregate([
  // Stage 1: Filter first — uses indexes, reduces docs for all downstream stages
  {
    $match: {
      created_at: { $gte: new Date(Date.now() - 30 * 24 * 60 * 60 * 1000) },
      status: { $in: ["completed", "shipped"] }
    }
  },

  // Stage 2: Join with customers collection (like SQL JOIN)
  {
    $lookup: {
      from: "customers",
      localField: "customer_id",
      foreignField: "_id",
      as: "customer",
      // pipeline refines the joined docs; combining pipeline with
      // localField/foreignField requires MongoDB 5.0+
      pipeline: [
        { $project: { name: 1, email: 1, tier: 1 } }
      ]
    }
  },
  { $unwind: "$customer" },   // flatten the array from $lookup

  // Stage 3: Group by customer
  {
    $group: {
      _id: "$customer_id",
      customer_name:  { $first: "$customer.name" },
      customer_email: { $first: "$customer.email" },
      total_revenue:  { $sum: "$total" },
      order_count:    { $sum: 1 },
      avg_order:      { $avg: "$total" },
      last_order:     { $max: "$created_at" }
    }
  },

  // Stage 4: Sort and limit
  { $sort: { total_revenue: -1 } },
  { $limit: 100 },

  // Stage 5: Shape the output
  {
    $project: {
      _id: 0,
      customer_id:    "$_id",
      name:           "$customer_name",
      email:          "$customer_email",
      revenue:        { $round: ["$total_revenue", 2] },
      orders:         "$order_count",
      avg_order_value: { $round: ["$avg_order", 2] },
      last_active:    "$last_order"
    }
  }
]);

// $facet: multiple aggregations in one pass (for faceted search UI)
db.products.aggregate([
  { $match: { category: "electronics", in_stock: true } },
  {
    $facet: {
      by_brand: [
        { $group: { _id: "$brand", count: { $sum: 1 } } },
        { $sort: { count: -1 } },
        { $limit: 10 }
      ],
      price_ranges: [
        {
          $bucket: {
            groupBy: "$price",
            boundaries: [0, 50, 100, 250, 500, 1000, Infinity],
            default: "Other",
            output: { count: { $sum: 1 }, avg_price: { $avg: "$price" } }
          }
        }
      ],
      total_count: [{ $count: "n" }]
    }
  }
]);

Index Strategy

ESR Rule for Compound Indexes

Equality → Sort → Range

// Query: find active users in a tier, sorted by join date, for a date range
db.users.find({
  status: "active",          // E: equality
  joined_at: {               // R: range
    $gte: ISODate("2024-01-01"),
    $lt:  ISODate("2025-01-01")
  }
}).sort({ joined_at: 1 });  // S: sort

// Correct index order: E, S, R
db.users.createIndex(
  { status: 1, joined_at: 1 },  // equality, then sort/range on same field
  { name: "idx_status_joined" }
);

// Partial index: index only active users (smaller, faster)
db.users.createIndex(
  { joined_at: 1 },
  {
    partialFilterExpression: { status: "active" },
    name: "idx_active_users_joined"
  }
);

// Text index for full-text search (use Atlas Search for production)
db.articles.createIndex({ title: "text", body: "text" });
db.articles.find({ $text: { $search: "mongodb performance" } },
                 { score: { $meta: "textScore" } })
           .sort({ score: { $meta: "textScore" } });

// 2dsphere for geospatial
db.locations.createIndex({ coordinates: "2dsphere" });
db.locations.find({
  coordinates: {
    $near: {
      $geometry: { type: "Point", coordinates: [-73.9857, 40.7484] },
      $maxDistance: 5000  // meters
    }
  }
});

// Wildcard index for dynamic fields (JSONB equivalent)
db.events.createIndex({ "metadata.$**": 1 });

Reading explain() Output

// Always use executionStats for performance analysis
const result = db.orders.explain("executionStats").find({
  customer_id: ObjectId("..."),
  status: "pending"
});

// Key fields to check:
// executionStats.totalDocsExamined  → should be close to nReturned
// executionStats.totalKeysExamined  → index scans
// executionStats.executionTimeMillis → total query time
// winningPlan.stage:
//   COLLSCAN = no index (bad unless expected)
//   IXSCAN   = index used (good)
//   FETCH    = fetching full documents from the collection after index scan
//   PROJECTION_SIMPLE = efficient projection
// inputStage.indexName → which index was chosen

// Ratio to check: docsExamined / nReturned
// If > 10: selectivity problem — add a better index or compound index
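
The ratio check above is easy to automate in monitoring scripts. A hedged sketch (the function name is an assumption) that computes it from the executionStats object returned by explain():

```javascript
// Ratio of documents examined to documents returned. Values well above 1
// (rule of thumb from above: > 10) suggest the index is not selective enough.
function selectivityRatio({ totalDocsExamined, nReturned }) {
  if (nReturned === 0) return totalDocsExamined === 0 ? 1 : Infinity;
  return totalDocsExamined / nReturned;
}

// const { executionStats } =
//   db.orders.find({ status: "pending" }).explain("executionStats");
// if (selectivityRatio(executionStats) > 10) { /* revisit indexes */ }
```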

Sharding

Shard Key Selection

Bad shard key = hot shards, uneven distribution, poor query routing.
Good shard key properties:
  ✅ High cardinality (many distinct values)
  ✅ Even distribution (no hot values)
  ✅ Query isolation (most queries include the shard key)
  ✅ Monotonic growth is OK with hashed sharding

Bad shard key patterns:
  ❌ Low cardinality (status, boolean, small enum)
  ❌ Monotonically increasing without hashing (ObjectId, timestamps → hot last chunk)
  ❌ Rarely included in queries (causes scatter-gather across all shards)

// Hash sharding: even distribution, but loses range query locality
sh.shardCollection("mydb.users", { _id: "hashed" });

// Range sharding: preserves locality for range queries
sh.shardCollection("mydb.events", { tenant_id: 1, created_at: 1 });
// Good for: WHERE tenant_id = X (query routed to one shard)

// Compound shard key: tenant isolation + distribution
sh.shardCollection("mydb.messages", { tenant_id: 1, _id: "hashed" });
// tenant_id prefix routes tenant queries to subset of shards
// hashed _id prevents hot spots within tenant

Change Streams

// Watch a collection for changes (event-driven triggers)
const changeStream = db.orders.watch([
  // Filter to specific operations
  {
    $match: {
      operationType: { $in: ["insert", "update"] },
      "fullDocument.status": "payment_complete"
    }
  }
], {
  fullDocument: "updateLookup"  // include the full doc after update
});

// Async iterator pattern (Node.js)
for await (const change of changeStream) {
  const { operationType, fullDocument, updateDescription } = change;

  if (operationType === "insert") {
    await processNewOrder(fullDocument);
  } else if (operationType === "update") {
    const updatedFields = updateDescription.updatedFields;
    await handleOrderUpdate(fullDocument, updatedFields);
  }

  // Resume token for fault tolerance
  await saveResumeToken(change._id);  // persist this
}

// Resume after crash using saved token
const savedToken = await loadResumeToken();
const stream = db.orders.watch([], {
  resumeAfter: savedToken,
  fullDocument: "updateLookup"
});

Transactions

// Transactions in MongoDB (4.0+ replica sets, 4.2+ sharded)
// Use sparingly — they have 3-5x performance overhead vs single-doc ops
// Best for: cross-collection writes that must be atomic

const session = client.startSession();

try {
  await session.withTransaction(async () => {
    const orders = client.db("shop").collection("orders");
    const inventory = client.db("shop").collection("inventory");

    // Both writes succeed or both roll back
    await orders.insertOne(
      { customer_id: userId, items: cart, total: 99.99 },
      { session }
    );

    await inventory.updateMany(
      { _id: { $in: cart.map(i => i.product_id) } },
      { $inc: { stock: -1 } },
      { session }
    );
  }, {
    readConcern:  { level: "snapshot" },
    writeConcern: { w: "majority" }
  });
} finally {
  await session.endSession();
}

// When NOT to use transactions:
// - Single document updates (already atomic)
// - Read-heavy operations
// - High-throughput inserts (kills write throughput)
// - When document redesign (embedding) can eliminate the need
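
The last point deserves an example: when the invariant lives in a single document, a conditional update replaces the transaction entirely. A sketch (the helper name and the stock/_id field names are assumptions) that builds an oversell-safe decrement:

```javascript
// Filter + update pair for an atomic, oversell-safe stock decrement.
// The filter only matches while enough stock remains, and $inc is atomic
// within one document, so no multi-document transaction is required.
function buildStockReservation(productId, qty) {
  return {
    filter: { _id: productId, stock: { $gte: qty } },
    update: { $inc: { stock: -qty } }
  };
}

// const { filter, update } = buildStockReservation(productId, 2);
// const res = await inventory.updateOne(filter, update);
// res.modifiedCount === 1 → reserved; 0 → insufficient stock
```

The check-and-decrement happens server-side in one operation, which is both faster and simpler than wrapping a read and a write in a transaction.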

Connection Pool Tuning

// Node.js MongoDB driver connection options
const client = new MongoClient(uri, {
  maxPoolSize:         50,    // default 100 — limit to avoid overwhelming MongoDB
  minPoolSize:         5,     // keep warm connections
  maxIdleTimeMS:       30000, // close idle connections after 30s
  serverSelectionTimeoutMS: 5000,  // fail fast on topology changes
  socketTimeoutMS:     45000,
  connectTimeoutMS:    10000,
  heartbeatFrequencyMS: 10000,
  retryWrites:         true,  // auto-retry transient write errors
  retryReads:          true,
  readPreference:      "primaryPreferred",  // or "secondaryPreferred" for reads
  writeConcern: { w: "majority", wtimeoutMS: 5000 }
});

Atlas Search

// Atlas Search: full-text with relevance scoring, superior to text indexes
db.articles.aggregate([
  {
    $search: {
      index: "articles_search",
      compound: {
        must: [{
          text: {
            query: "machine learning transformers",
            path: ["title", "body"],
            fuzzy: { maxEdits: 1 }  // typo tolerance
          }
        }],
        should: [{
          text: {
            query: "machine learning transformers",
            path: "title",
            score: { boost: { value: 3 } }  // title matches rank higher
          }
        }],
        filter: [{
          range: {
            path: "published_at",
            gte: new Date("2024-01-01")
          }
        }]
      }
    }
  },
  { $limit: 20 },
  {
    $project: {
      title: 1,
      summary: 1,
      score: { $meta: "searchScore" }
    }
  }
]);

Anti-Patterns

// ❌ Unbounded array growth (document exceeds 16MB, index grows unbounded)
// Never do this for high-volume collections:
{ user_id: 1, events: [...thousands of events...] }
// ✅ Separate collection with reference

// ❌ Using _id as a shard key with range sharding (hot last chunk)
sh.shardCollection("db.col", { _id: 1 });  // monotonically increasing → hot shard
// ✅ Hash the _id
sh.shardCollection("db.col", { _id: "hashed" });

// ❌ $where clause (executes JavaScript, can't use index, security risk)
db.users.find({ $where: "this.age > 30" });
// ✅ Use standard query operators
db.users.find({ age: { $gt: 30 } });

// ❌ Large $in arrays (> ~1000 values) — query planning overhead
db.products.find({ _id: { $in: [...10000 ids...] } });
// ✅ Paginate or use aggregation $lookup

// ❌ Missing $match at pipeline start
db.orders.aggregate([
  { $group: { _id: "$customer_id", total: { $sum: "$amount" } } },
  { $match: { "_id": someId } }
]);
// ✅ Filter before grouping
db.orders.aggregate([
  { $match: { customer_id: someId } },  // index hit, far fewer docs to group
  { $group: { _id: "$customer_id", total: { $sum: "$amount" } } }
]);

Quick Reference

Document Design Decision Tree:
  Always loaded together?     → embed
  One-to-few (< ~100)?        → embed
  One-to-many?                → reference
  Many-to-many?               → reference arrays
  Unbounded growth possible?  → reference (never embed)
  Child changes independently? → reference

Compound Index (ESR Rule):
  1. Equality fields first    (status = "active")
  2. Sort fields second       (.sort({ created_at: 1 }))
  3. Range fields last        ($gte, $lte, $in)

explain() Red Flags:
  COLLSCAN on large collection → missing index
  docsExamined >> nReturned   → poor index selectivity
  keysExamined >> nReturned   → index scans too wide a key range

Shard Key Selection:
  High cardinality?            ✅ required
  Not monotonically increasing? ✅ or use hashed
  Included in most queries?    ✅ required
  Avoid: status, boolean, date alone

Change Streams:
  Always persist resume token → fault-tolerant resume
  Use fullDocument:updateLookup → get full doc on update
  Filter at stream level → reduce application load
