MongoDB Sharding and Replica Set Operations Guide: From Cluster Design to Troubleshooting
- Introduction
- Sharding Architecture Overview
- Shard Key Selection Strategies
- Replica Set Configuration and Failover
- Chunk Migration and Balancer
- Read/Write Concern Configuration
- Monitoring and Performance Tuning
- Backup and Recovery
- Troubleshooting Scenarios
- Operations Checklist
- References

Introduction
MongoDB provides two core distributed mechanisms for large-scale data processing and high availability: Sharding and Replica Sets. Sharding distributes data horizontally across multiple servers to scale storage capacity and throughput, while Replica Sets maintain redundant copies of the data and provide automatic failover when a node fails. Combining these two mechanisms correctly is the foundation of reliable MongoDB operations in production.
This guide covers the full spectrum of production cluster operations: from Sharding architecture components and Shard Key selection strategies, through Replica Set failover mechanisms, chunk migration and balancer management, read/write concern configuration, monitoring and performance tuning, backup/recovery, and real-world troubleshooting scenarios.
Sharding Architecture Overview
A MongoDB Sharded Cluster consists of three components.
mongos (Query Router): Acts as a proxy that routes client requests to the appropriate shard. Because mongos is stateless, multiple instances can be deployed behind a load balancer.
Config Server: A Replica Set that stores cluster metadata (shard list, chunk ranges, data distribution information). Since MongoDB 3.4, Config Servers must be deployed as a Replica Set.
Shard: A Replica Set that stores actual data. Each shard is responsible for a subset (chunks) of the total data.
# Sharded Cluster Architecture (Minimum Configuration)
#
# Client
# |
# v
# mongos (Router) x 2 -- behind a load balancer
# |
# v
# Config Server Replica Set (3 nodes)
# |
# +--- Shard 1 Replica Set (Primary + 2 Secondary)
# +--- Shard 2 Replica Set (Primary + 2 Secondary)
# +--- Shard 3 Replica Set (Primary + 2 Secondary)
#
# Total nodes: 2 (mongos) + 3 (config) + 9 (shard) = 14
Initial Sharded Cluster Setup
// Connect to mongos and add shards
sh.addShard('shard1/mongo-shard1-a:27018,mongo-shard1-b:27018,mongo-shard1-c:27018')
sh.addShard('shard2/mongo-shard2-a:27018,mongo-shard2-b:27018,mongo-shard2-c:27018')
sh.addShard('shard3/mongo-shard3-a:27018,mongo-shard3-b:27018,mongo-shard3-c:27018')
// Enable sharding for the database
sh.enableSharding('myapp')
// Shard a collection with a Hashed key
sh.shardCollection('myapp.orders', { customerId: 'hashed' })
// Shard a collection with a Ranged key
sh.shardCollection('myapp.logs', { timestamp: 1 })
// Check cluster status
sh.status()
Shard Key Selection Strategies
The Shard Key is extremely difficult to change once set (resharding is supported from MongoDB 5.0 onward), so it must be chosen carefully during the design phase. A good Shard Key should have high cardinality, low frequency, and non-monotonic characteristics.
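Before committing to a key, it is worth measuring these properties against real data. A quick mongosh sketch (the collection and field names, myapp.orders and customerId, are hypothetical):
// Estimate cardinality and value frequency for a candidate shard key field
use myapp
const total = db.orders.estimatedDocumentCount()
const distinct = db.orders.aggregate([
  { $group: { _id: '$customerId' } },
  { $count: 'n' },
]).toArray()[0].n
print('cardinality: ' + distinct + ' distinct values across ' + total + ' documents')
// Top 5 most frequent values - a heavily skewed leader signals hotspot risk
db.orders.aggregate([
  { $group: { _id: '$customerId', n: { $sum: 1 } } },
  { $sort: { n: -1 } },
  { $limit: 5 },
]).forEach(printjson)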
Shard Key Strategy Comparison
| Strategy | Description | Pros | Cons | Ideal Use Case |
|---|---|---|---|---|
| Hashed Sharding | Distributes by hash of field value | Even distribution, prevents hotspots | Inefficient range queries, requires all-shard scans | High-frequency writes, no range queries needed |
| Ranged Sharding | Distributes by value range | Efficient range queries, targets specific shards | Possible hotspots, monotonic key issues | Time-based logs, frequent range queries |
| Zone Sharding | Places data by region/condition | Data locality, regulatory compliance | Complex setup, potential imbalance | Multi-region, data sovereignty requirements |
| Compound Key | Combination of multiple fields | Higher cardinality, query optimization | Increased design complexity | When a single field is insufficient |
Shard Key Configuration Examples
// 1. Hashed Shard Key - when even distribution is the top priority
sh.shardCollection('myapp.users', { _id: 'hashed' })
// 2. Compound Shard Key - satisfying cardinality and query patterns
sh.shardCollection('myapp.events', { tenantId: 1, createdAt: 1 })
// 3. Zone Sharding - data separation by region
sh.addShardToZone('shard1', 'KR')
sh.addShardToZone('shard2', 'US')
sh.addShardToZone('shard3', 'EU')
sh.updateZoneKeyRange(
'myapp.users',
{ region: 'KR', _id: MinKey },
{ region: 'KR', _id: MaxKey },
'KR'
)
sh.updateZoneKeyRange(
'myapp.users',
{ region: 'US', _id: MinKey },
{ region: 'US', _id: MaxKey },
'US'
)
// 4. Resharding (MongoDB 5.0+) - when a shard key change is needed
sh.reshardCollection('myapp.orders', { customerId: 1, orderId: 1 })
Shard Key Anti-Patterns
Using a monotonically increasing field (e.g., ObjectId, timestamp) as a Ranged Shard Key causes all new writes to concentrate on the last chunk, creating a Hot Shard. In this case, use Hashed Sharding or combine with a compound key. Fields with very low cardinality (e.g., boolean, status enums) cannot distribute data evenly and may cause Jumbo Chunks.
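The compound route can preserve query targeting: MongoDB 4.4+ allows one hashed field inside a compound shard key, so a non-monotonic prefix handles routing while the hashed suffix spreads inserts. A sketch with hypothetical names (myapp.devicelogs, deviceId):
// Compound shard key with a hashed suffix (MongoDB 4.4+): queries that
// include deviceId still target specific shards, while hashing keeps the
// monotonically increasing timestamp from piling onto one chunk
sh.shardCollection('myapp.devicelogs', { deviceId: 1, timestamp: 'hashed' })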
Replica Set Configuration and Failover
Replica Set Initialization
// Initialize Replica Set on the Primary
rs.initiate({
_id: 'shard1',
members: [
{ _id: 0, host: 'mongo-shard1-a:27018', priority: 10 },
{ _id: 1, host: 'mongo-shard1-b:27018', priority: 5 },
{ _id: 2, host: 'mongo-shard1-c:27018', priority: 1 },
],
settings: {
electionTimeoutMillis: 10000, // default 10 seconds
heartbeatTimeoutSecs: 10, // heartbeat timeout
chainingAllowed: true, // allow secondary-to-secondary chaining
},
})
// Check Replica Set status
rs.status()
// Check replication lag
rs.printReplicationInfo()
rs.printSecondaryReplicationInfo()
Failover Mechanism
MongoDB Replica Set failover operates on a Raft-based consensus algorithm. When the Primary becomes unresponsive, the remaining members initiate an election, and the member that receives a majority of votes becomes the new Primary.
Election conditions:
- A majority of members must be able to communicate with each other
- Members with priority 0 cannot become Primary
- Hidden and delayed members can vote but are excluded from Primary candidacy
- An election starts if the Primary is unreachable within electionTimeoutMillis (default 10 seconds)
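To watch an election happen, a throwaway mongosh loop (illustrative only; use proper monitoring in production) can poll member states:
// Poll rs.status() once a second and report the current Primary; stop with Ctrl-C
while (true) {
  const primary = rs.status().members.find(function (m) { return m.stateStr === 'PRIMARY' })
  print(new Date().toISOString() + ' primary: ' + (primary ? primary.name : 'NONE (election in progress)'))
  sleep(1000) // mongosh built-in; argument is milliseconds
}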
// Manual failover (for maintenance)
rs.stepDown(60) // relinquish Primary role for 60 seconds
// Promote a specific member by adjusting priority
cfg = rs.conf()
cfg.members[1].priority = 100
rs.reconfig(cfg)
// Hidden member setup (for backup)
cfg = rs.conf()
cfg.members[2].hidden = true
cfg.members[2].priority = 0
rs.reconfig(cfg)
// Delayed member setup (for data recovery, 1-hour delay)
cfg = rs.conf()
cfg.members[2].secondaryDelaySecs = 3600
cfg.members[2].hidden = true
cfg.members[2].priority = 0
rs.reconfig(cfg)
Chunk Migration and Balancer
How the Balancer Works
The balancer runs on the Primary of the Config Server Replica Set and migrates chunks when the imbalance between shards exceeds the migration threshold (based on chunk count in older releases; based on data-size difference since MongoDB 6.0). In a cluster with n shards, at most n/2 migrations can run concurrently.
// Check balancer status
sh.getBalancerState()
sh.isBalancerRunning()
// Stop/start balancer
sh.stopBalancer()
sh.startBalancer()
// Set balancer time window (run only between 2-6 AM) - stored in config.settings
use config
db.settings.updateOne(
  { _id: 'balancer' },
  { $set: { activeWindow: { start: '02:00', stop: '06:00' } } },
  { upsert: true }
)
// Disable balancing for a specific collection
sh.disableBalancing('myapp.orders')
// Manual chunk move
sh.moveChunk('myapp.orders', { customerId: 'user123' }, 'shard3')
// Check current migration status
db.adminCommand({ currentOp: true, desc: /migrate/ })
Chunk Management
// Change chunk size (default 128MB since MongoDB 6.0)
use config
db.settings.updateOne(
{ _id: "chunksize" },
{ $set: { value: 64 } },
{ upsert: true }
)
// Find jumbo chunks (MongoDB 5.0+ keys config.chunks by collection uuid
// rather than ns; join with config.collections to resolve the namespace)
db.chunks.find({ jumbo: true }).forEach(function(chunk) {
  print("Jumbo chunk on " + chunk.shard + ": " + tojson(chunk.min))
})
// Force split a jumbo chunk
sh.splitAt("myapp.orders", { customerId: "splitPoint" })
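// If a chunk spans a single shard key value and cannot be split, clear the
// jumbo flag instead (MongoDB 4.2.3+); namespace and key are illustrative
db.adminCommand({
  clearJumboFlag: "myapp.orders",
  find: { customerId: "splitPoint" }
})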
// Check chunk distribution
db.chunks.aggregate([
{ $group: { _id: "$shard", count: { $sum: 1 } } },
{ $sort: { count: -1 } }
])
Read/Write Concern Configuration
Read Preference Mode Comparison
| Mode | Read Target | Consistency | Latency | Ideal Use Case |
|---|---|---|---|---|
| primary | Primary only | Strong consistency | Baseline | Real-time data required |
| primaryPreferred | Primary first, fallback to Secondary | Mostly strong | Baseline | Primary failure fallback |
| secondary | Secondary only | Eventual consistency | Low (nearest node) | Reports, analytics queries |
| secondaryPreferred | Secondary first, fallback to Primary | Mostly eventual | Low | Read load distribution |
| nearest | Lowest network latency node | Eventual consistency | Minimum | Geo-distributed deployments |
Write Concern Level Comparison
| Write Concern | Description | Durability | Performance | Ideal Use Case |
|---|---|---|---|---|
| w: 0 | No acknowledgment | Very low | Highest | Logs, metrics (loss acceptable) |
| w: 1 | Primary acknowledgment only | Moderate | High | General writes |
| w: "majority" | Majority acknowledgment | High | Moderate | Critical data (recommended default) |
| w: N | N node acknowledgment | Very high | Low | Finance, strict requirements |
| j: true | Journal write confirmation | Highest | Low | Durability first |
// Read/Write Concern configuration examples
// 1. Collection-level Read Preference
db.orders.find({ status: 'pending' }).readPref('secondaryPreferred')
// 2. Connection string configuration
// mongodb://mongos1:27017,mongos2:27017/myapp?readPreference=secondaryPreferred&w=majority
// 3. Write Concern specification
db.orders.insertOne(
{ customerId: 'user123', amount: 50000 },
{ writeConcern: { w: 'majority', j: true, wtimeout: 5000 } }
)
// 4. Set global default Read/Write Concern
db.adminCommand({
setDefaultRWConcern: 1,
defaultWriteConcern: { w: 'majority', j: true },
defaultReadConcern: { level: 'majority' },
})
Monitoring and Performance Tuning
Essential Monitoring Queries
// 1. Check shard data distribution
db.orders.getShardDistribution()
// 2. Find long-running operations
db.currentOp({ secs_running: { $gt: 10 } })
// 3. Enable slow query profiling
db.setProfilingLevel(1, { slowms: 100 })
// 4. Analyze slow queries
db.system.profile
.find({
millis: { $gt: 100 },
ns: /myapp\.orders/,
})
.sort({ ts: -1 })
.limit(10)
// 5. Server status overview
db.serverStatus().opcounters // operation counters
db.serverStatus().connections // connection count
db.serverStatus().wiredTiger.cache // cache status
// 6. Check Replica Set replication lag
db.adminCommand({ replSetGetStatus: 1 }).members.forEach(function (m) {
if (m.stateStr === 'SECONDARY') {
var lag = (new Date() - m.optimeDate) / 1000
print(m.name + ' lag: ' + lag + 's')
}
})
// 7. Index usage statistics
db.orders.aggregate([{ $indexStats: {} }])
// 8. Config Server metadata consistency check (MongoDB 7.0+)
db.adminCommand({ checkMetadataConsistency: 1 })
Performance Tuning Checklist
- Index Covering: Use compound indexes that include the Shard Key so queries are processed on a single shard
- Minimize Scatter-Gather: Include the Shard Key in query filters to target specific shards (see the explain() sketch after this list)
- Connection Pooling: Set appropriate mongos connection pool sizes (default maxPoolSize=100)
- Read Preference: Route analytics queries to Secondary nodes
- Chunk Size Adjustment: For write-heavy workloads, reduce chunk size to minimize migration impact
- WiredTiger Cache: Set cacheSizeGB to 50-60% of system RAM
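A quick way to verify targeting, assuming myapp.orders is sharded on { customerId: 'hashed' } (the exact explain() output shape varies by MongoDB version):
// SINGLE_SHARD means the filter reached one shard; SHARD_MERGE means mongos
// fanned out to several shards and merged the results
const targeted = db.orders.find({ customerId: 'user123' }).explain()
print('with shard key: ' + targeted.queryPlanner.winningPlan.stage) // SINGLE_SHARD
const scattered = db.orders.find({ status: 'pending' }).explain()
print('without shard key: ' + scattered.queryPlanner.winningPlan.stage) // SHARD_MERGE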
Backup and Recovery
Backup with mongodump/mongorestore
# 1. Stop balancer before backup (sharded cluster)
mongosh --host mongos1:27017 --eval "sh.stopBalancer()"
# 2. Back up individual shards (--oplog for point-in-time consistency)
mongodump --host shard1/mongo-shard1-a:27018 \
--oplog --out /backup/shard1-$(date +%Y%m%d)
mongodump --host shard2/mongo-shard2-a:27018 \
--oplog --out /backup/shard2-$(date +%Y%m%d)
# 3. Back up Config Server
mongodump --host configRS/config1:27019 \
--oplog --out /backup/config-$(date +%Y%m%d)
# 4. Restart balancer
mongosh --host mongos1:27017 --eval "sh.startBalancer()"
# 5. Restore (--oplogReplay for point-in-time recovery)
mongorestore --host shard1/mongo-shard1-a:27018 \
--oplogReplay /backup/shard1-20260313
Point-in-Time Recovery (PITR) and Oplog
// Check oplog size and retention
db.getReplicationInfo()
// Example output:
// configured oplog size: 2048MB
// log length start to end: 172800secs (48hrs)
// oplog first event time: ...
// oplog last event time: ...
// Resize oplog (recommend minimum 48 hours of retention)
db.adminCommand({ replSetResizeOplog: 1, size: 4096 }) // 4GB
Production backup recommendations:
- For small clusters, mongodump is sufficient; for large-scale environments, use filesystem snapshots (LVM/EBS) or MongoDB Atlas Backup
- Configure oplog size to retain at least 48 hours of writes
- Use Delayed Secondary members for quick recovery from logical errors (accidental data deletion)
- Perform backup restoration tests regularly (at least monthly)
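A minimal restoration-test sketch in mongosh, assuming a dump was restored into a scratch mongod on port 27020 (all hosts and names hypothetical):
// Sanity-check a restore by comparing collection counts against production
const restored = connect('mongodb://localhost:27020/myapp')
const prod = connect('mongodb://mongos1:27017/myapp')
print('prod orders: ' + prod.orders.countDocuments({}))
print('restored orders: ' + restored.orders.countDocuments({}))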
Troubleshooting Scenarios
1. Shard Key Hotspot
Symptoms: Only one shard shows high CPU/disk I/O while the rest remain idle
Cause: Using a monotonically increasing field (ObjectId, timestamp) as a Ranged Shard Key, concentrating all new writes on the last chunk
Resolution:
- Perform resharding with a Hashed Shard Key (MongoDB 5.0+)
- Use a Compound Shard Key to increase cardinality
- Use sh.shardAndDistributeCollection() (MongoDB 8.0+) for fast redistribution, as sketched below
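A minimal sketch of that 8.0 helper; the argument shape here is an assumption, mirroring sh.shardCollection():
// MongoDB 8.0+: shard an existing collection and immediately distribute it
// (assumed signature; verify against the 8.0 docs before use)
sh.shardAndDistributeCollection('myapp.orders', { customerId: 'hashed' })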
2. Split-brain (Network Partition)
Symptoms: Two or more nodes acting as Primary simultaneously, risking data inconsistency
Cause: Network partition splits the Replica Set into two groups, each electing its own Primary
Resolution:
- Maintain an odd number (3, 5, 7) of Replica Set members so only one majority group can exist
- During a network partition, the Primary outside the majority group is automatically demoted to Secondary
- Use w: "majority" Write Concern to prevent writes that are not replicated to the majority
3. Chunk Migration Failure
Symptoms: Balancer logs show "Data transfer error" or "WriteConcernFailed" errors
Cause: Severe replication lag, insufficient network bandwidth, or lack of disk space on the target shard
Resolution:
// Enable _secondaryThrottle to ensure replication sync
use config
db.settings.updateOne(
{ _id: "balancer" },
{ $set: { "_secondaryThrottle": { w: 2 } } },
{ upsert: true }
)
// Reduce rangeDeleter batch size
db.adminCommand({
setParameter: 1,
rangeDeleterBatchSize: 32,
rangeDeleterBatchDelayMS: 100
})
4. Replica Lag
Symptoms: Secondary optimeDate falls tens of seconds or more behind the Primary
Cause: Heavy write load, slow disk I/O, index builds, network bottlenecks
Resolution:
- Check and increase WiredTiger cache size if needed
- Identify and terminate heavy queries running on Secondaries (see the sketch after this list)
- Verify network bandwidth and optimize oplog transfer paths
- Perform index builds in a rolling fashion (one node at a time)
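A sketch for finding (and optionally killing) long-running reads on a lagging Secondary; run it against that member directly, and treat the 30-second threshold as illustrative:
// List operations running longer than 30s; killOp is left commented out
db.currentOp({ secs_running: { $gt: 30 }, op: { $in: ['query', 'getmore'] } })
  .inprog.forEach(function (op) {
    print('opid ' + op.opid + ' (' + op.secs_running + 's) on ' + op.ns)
    // db.killOp(op.opid) // uncomment only after reviewing the operation
  })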
Operations Checklist
Pre-deployment:
- Was the Shard Key chosen based on query patterns, cardinality, and write distribution?
- Is each Replica Set configured with at least 3 members (odd number)?
- Is the Config Server deployed as a separate Replica Set (3 nodes)?
- Are there 2+ mongos instances behind a load balancer?
- Are authentication (keyFile or x.509) and network encryption (TLS) configured?
Daily operations (a combined spot-check sketch follows this list):
- Is the balancer running normally with even chunk distribution?
- Is Replica Lag within threshold (e.g., 10 seconds)?
- Is the connection count appropriate relative to maxIncomingConnections?
- Are slow query logs analyzed regularly?
- Is disk usage below 70%?
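A combined spot-check, run through mongos; database names and thresholds are illustrative:
// Balancer state, connection pressure, and chunk spread in one pass
print('balancer enabled: ' + sh.getBalancerState())
const conn = db.serverStatus().connections
print('connections: ' + conn.current + ' current, ' + conn.available + ' available')
db.getSiblingDB('config').chunks
  .aggregate([{ $group: { _id: '$shard', chunks: { $sum: 1 } } }])
  .forEach(printjson) // a lopsided distribution warrants balancer review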
Backup/Recovery:
- Is automated backup running daily?
- Does the oplog size retain at least 48 hours of data?
- Was a backup restoration test performed within the last 30 days?
- Is a Delayed Secondary configured for logical error protection?
Disaster preparedness:
- Was an automatic failover test performed within the last quarter?
- Are monitoring alerts set for Replica Lag, disk usage, and connection count?
- Is a recovery procedure for split-brain scenarios documented?
References
- MongoDB Official Docs - Sharding
- MongoDB Official Docs - Replication
- MongoDB Official Docs - Choose a Shard Key
- MongoDB Official Docs - Sharded Cluster Balancer
- MongoDB Official Docs - Backup and Restore
- MongoDB Official Docs - Troubleshoot Sharded Clusters
- Percona Blog - When Should I Enable MongoDB Sharding
- Severalnines - MongoDB Backup Management Tips for Sharded Clusters