Mongo petset

MongoDB is a document database that supports range and field queries (https://github.com/foxish/docker-mongodb/tree/master/kubernetes).

Concepts

Replication

A single server can run either standalone or as part of a replica set. A "replica set" is a set of mongod instances with exactly one primary.

  • Primary: receives writes and services reads. Can step down and become a secondary.
  • Secondary: replicates the primary's oplog. If the primary goes down, the secondaries hold an election.
  • Arbiter: used to achieve a majority vote when there is an even number of members. Holds no data, doesn't need a dedicated node, and never becomes primary.

Replication is asynchronous. Failover: if the primary doesn't communicate with the other members for more than 10 seconds, the secondaries hold an election.

Configuration

Write concern: { w: <value>, j: <boolean>, wtimeout: <number> }; controls how writes are acknowledged by the system (example below).

  • wtimeout is how long to wait for the acknowledgement before erroring out.
  • w: 0; the write is not acknowledged.
  • w: 1; acknowledged once the write has propagated to the primary (default).
  • w: majority; acknowledged by a majority of the voting members.
  • w: n; acknowledged by n members.
  • w: <tag set>; acknowledged by members having a particular tag.
  • j: true; wait until the write has been committed to the on-disk journal.
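A minimal sketch of passing a write concern per operation from the mongo shell (the collection name products and the values are made up for illustration):

db.products.insert(
  { item: "card", qty: 15 },
  { writeConcern: { w: "majority", j: true, wtimeout: 5000 } }
)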

Priority values are assigned to each member and are floating-point numbers between 0 and 1000. Priority-0 members can never become primary (they can still vote). Higher-priority members are more likely to call elections, and are more likely to win them.

Read concern: local/majority. local returns the instance's most recent data, which may later be rolled back; majority only returns data that has been acknowledged by a majority of the members.
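A sketch of changing a member's priority through reconfiguration (the member index and value are made up):

cfg = rs.conf()
cfg.members[2].priority = 0.5   // make this member less likely to become primary
rs.reconfig(cfg)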

Roles:

  • Arbiter: only votes, holds no data. Don't deploy more than 1 per replica set.
  • Hidden: priority 0 and invisible to clients, so it never services reads; it still votes and maintains a copy of the primary's data.
  • Delayed: typically hidden; maintains a copy of the primary's data with a fixed delay, to protect against e.g. human error (sketch below).
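A minimal sketch of turning an existing member into a hidden, delayed secondary (the member index and the one-hour delay are made up; slaveDelay is the config field name in the 3.x shell):

cfg = rs.conf()
cfg.members[2].priority = 0       // hidden members must have priority 0
cfg.members[2].hidden = true
cfg.members[2].slaveDelay = 3600  // apply the primary's operations one hour late
rs.reconfig(cfg)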

Initial Deployment

A simple MongoDB replica set with three members. We start with an image that turns on replication for the instance by supplying the right command-line flags (e.g. --replSet). This becomes the image that we supply to our petset with 3 replicas.

  • After the pods are created, we pick any one pod, connect to its mongo instance, and run rs.initiate(). That node becomes the primary. Then rs.add() the other two pods using their cluster domain names.
  • For example:
rs.add("mongodb-1.mongodb.default.svc.cluster.local") 
rs.add("mongodb-2.mongodb.default.svc.cluster.local") 

Scaling

  • Automatic failover works with petsets out of the box.
  • Adding new members involves finding the PRIMARY and running the corresponding rs.add(...) commands against it.
  • Reading from secondaries requires running rs.slaveOk() on each connection to a secondary (sketch below).
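For example (a sketch; the collection name is made up):

rs.status()          // reports which member is currently PRIMARY
rs.slaveOk()         // run on a connection to a secondary
db.products.find()   // reads against the secondary now succeed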

Fault tolerance

The number of members that can become unavailable while the cluster can still elect a primary. With 50 members of which 7 are voting, 46 can go down, but only 3 of the voting members: a majority of 7 is 4, so 4 voting members must stay up. WAN deployment: 1 member per DC in 3 DCs can tolerate a single DC going down.
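The same arithmetic, spelled out (the mongo shell evaluates plain JavaScript):

var voting = 7
var majority = Math.floor(voting / 2) + 1   // 4 voting members must remain up
var tolerance = voting - majority           // so at most 3 voting members may fail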

Oplog size: the default depends on the storage engine; there are 3 engines: in-memory, WiredTiger, MMAPv1 (see the check below).

  • MMAPv1 was the default and was preferred due to maturity.
  • WiredTiger was known to have some issues in the past but has been the default since 3.2.
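To check the configured oplog size and the time window it covers, for example:

rs.printReplicationInfo()   // oplog size, first/last event times
db.getReplicationInfo()     // the same information as a document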

Failover

New members, or secondaries that fall too far behind the oplog, must resync everything. Starting mongod with an empty data directory forces an initial sync. Starting it with a copy of a recent data directory from another member of the set speeds up the initial sync; this can be done using snapshots.
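To see how far each secondary has fallen behind (and whether a full resync is looming), for example (helper name as of the 3.x shell):

rs.printSlaveReplicationInfo()   // replication lag of each secondary behind the primary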

Changing hostnames

  • To change the hostname of a secondary member, remove the old hostname from the replica set and add the new one (sketch below).
  • Alternatively, stop all members and reconfigure offline using the same data directories.
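A sketch of the first approach, run against the primary (the new hostname is made up):

rs.remove("mongodb-1.mongodb.default.svc.cluster.local:27017")
rs.add("mongodb-1.mongodb.newns.svc.cluster.local:27017")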

2 problems

Rollbacks - during a network partition a secondary can't keep up with the primary; the primary then goes down and the stale secondary is elected primary; when the old primary rejoins the set, it must roll back the writes it accepted that were never replicated. Such a rollback will not happen if the write propagated to a healthy, reachable secondary, because that secondary will win the election.

Rebooting 2 secondaries simultaneously in a 3-member replica set forces the primary to step down, meaning it closes all client sockets ("Connection reset by peer") until one of the secondaries becomes available again.
