mikewadhera/MongoSFnotes.txt

## MongoSFnotes.txt
I. Map/Reduce, GeoSpatial Indexing & other cool features

* Client features

Safe Inserts - Batches a subsequent write and read together to ensure order preservation when using connection pooling in a multi-threaded client

Replication ACK - Can configure minimum acknowledgment level of slave writes. Client will raise exception if replication level not met during a write

GetMore operation - receive data from server in chunks

http://api.mongodb.org/js - reference for JS REPL

db.runCommand( { listCommands : 1 } ) - lists all possible commands

* Lightweight query language, $ prefixed operators

$slice -- similar to LIMIT
  - supports (offset, limit) & negative indices

$gt, $gte

$elemMatch
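
Sketch of the $slice behaviour noted above, in plain JavaScript rather than against a server. sliceProjection is a made-up helper (not a driver or shell API); it only mimics how the operator treats a count, a negative count, and an (offset, limit) pair. The collection and field names in the comment are hypothetical.

```javascript
// Emulates $slice on a plain array (illustrative helper only).
// The shell form would look something like:
//   db.posts.find({}, { comments: { $slice: [20, 10] } })   // "posts"/"comments" are hypothetical
function sliceProjection(arr, spec) {
  if (Array.isArray(spec)) {
    const [skip, limit] = spec;
    // A negative skip counts back from the end of the array.
    const start = skip < 0 ? Math.max(arr.length + skip, 0) : skip;
    return arr.slice(start, start + limit);
  }
  // Positive n: first n elements; negative n: last |n| elements.
  return spec >= 0 ? arr.slice(0, spec) : arr.slice(spec);
}

const comments = ['a', 'b', 'c', 'd', 'e'];
console.log(sliceProjection(comments, 2));        // first two
console.log(sliceProjection(comments, -2));       // last two
console.log(sliceProjection(comments, [1, 2]));   // skip 1, take 2
console.log(sliceProjection(comments, [-3, 2]));  // start 3 from the end, take 2
```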

* Other features

MapReduce - allows for arbitrary, ad-hoc queries. Write 2 functions in JS - mapper function emits key/values for each doc in collection, reducer linearly processes emitted key/values.
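
Sketch of the mapper/reducer flow described above, run over an in-memory array instead of a collection. Assumptions: mapReduce here is a toy function of my own, and emit is passed in as an argument (in MongoDB it is a global available inside the map function). The sample documents are invented.

```javascript
// Toy map/reduce over an array of documents (not the MongoDB command itself).
function mapReduce(docs, map, reduce) {
  const groups = new Map();
  const emit = (key, value) => {
    if (!groups.has(key)) groups.set(key, []);
    groups.get(key).push(value);
  };
  // The mapper runs once per document, with `this` bound to the document.
  for (const doc of docs) map.call(doc, emit);

  const out = {};
  for (const [key, values] of groups) {
    // Keys with a single emitted value skip the reducer entirely.
    out[key] = values.length === 1 ? values[0] : reduce(key, values);
  }
  return out;
}

const posts = [
  { tags: ['mongo', 'nosql'] },
  { tags: ['mongo'] },
  { tags: ['sharding', 'mongo'] },
];

const tagCounts = mapReduce(
  posts,
  function (emit) { this.tags.forEach((tag) => emit(tag, 1)); },
  (key, values) => values.reduce((a, b) => a + b, 0)
);
console.log(tagCounts);
```

The reducer returns the same shape it receives (a number), so it would still be correct if it were called again on partial results.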

GridFS - splits up binary files into chunked documents -- nothing "magical", implemented on top of vanilla client operations
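
Sketch of the "nothing magical" chunking idea: one metadata document plus N chunk documents that are reassembled in order. The helper names and the tiny chunk size are mine, chosen for illustration; the field names (files_id, n, data, length, chunkSize) follow the GridFS layout as I understand it.

```javascript
// Split a byte array into a file document and ordered chunk documents.
function toGridFsDocs(fileId, bytes, chunkSize) {
  const fileDoc = { _id: fileId, length: bytes.length, chunkSize };
  const chunkDocs = [];
  for (let offset = 0, n = 0; offset < bytes.length; offset += chunkSize, n++) {
    chunkDocs.push({ files_id: fileId, n, data: bytes.slice(offset, offset + chunkSize) });
  }
  return { fileDoc, chunkDocs };
}

// Reassemble the file by sorting chunks on n and concatenating their data.
function fromGridFsDocs(chunkDocs) {
  const sorted = [...chunkDocs].sort((a, b) => a.n - b.n);
  const total = sorted.reduce((sum, c) => sum + c.data.length, 0);
  const out = new Uint8Array(total);
  let pos = 0;
  for (const c of sorted) {
    out.set(c.data, pos);
    pos += c.data.length;
  }
  return out;
}

const original = Uint8Array.from([1, 2, 3, 4, 5, 6, 7, 8, 9, 10]);
const stored = toGridFsDocs('file-1', original, 4); // 4-byte chunks, just for the demo
const restored = fromGridFsDocs(stored.chunkDocs);
console.log(stored.chunkDocs.length, Array.from(restored));
```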

Indexes - create with ensureIndex({ <field> : 1 }) (1 = ascending, -1 = descending). Dates, timestamps and other increasing values are good candidates for indexes as older values are swapped to disk.
Wow: Fields that have arrays are exploded and each member is indexed separately to allow for efficient searching
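
Sketch of the "exploded" array indexing: every array member becomes its own index key pointing back at the document. buildMultikeyIndex and the sample data are invented for illustration, and a Map stands in for the real on-disk index structure.

```javascript
// Build a value -> [document ids] lookup, one entry per array member.
function buildMultikeyIndex(docs, field) {
  const index = new Map();
  for (const doc of docs) {
    const value = doc[field];
    const keys = Array.isArray(value) ? value : [value];
    // A Set avoids indexing the same document twice for a repeated member.
    for (const key of new Set(keys)) {
      if (!index.has(key)) index.set(key, []);
      index.get(key).push(doc._id);
    }
  }
  return index;
}

const articles = [
  { _id: 1, tags: ['mongo', 'geo'] },
  { _id: 2, tags: ['mongo'] },
  { _id: 3, tags: 'geo' }, // scalar values are indexed as a single key
];

const tagIndex = buildMultikeyIndex(articles, 'tags');
console.log(tagIndex.get('mongo'), tagIndex.get('geo'));
```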

GeoSpatial indexes - create index with ensureIndex({ <field> : '2d' })
Search with $near operator


II. Schema Design

Document-oriented storage
Not relational
Not OODB
JSON data

Row -> Document
Table -> Collection
Indexes -> Index
Join -> Embedding / Linking

* Embedding vs Linking

- "Contains relationship" : embed
- Embed = "pre joined"
- Links: transparent client/server turnarounds
- When in doubt embed - Note 4MB object size limit

Partial updates are supported
 - $set order.total = x
 - $push order.line_items(y)
 - Also O(1) partial updates with references are supported - i.e. update all line_items[0].price to line_items[1].price

* Data Structures

Trees

3 possible designs:
- Embed whole tree in 1 document
- Link to parent (index by parent id)
- Store array of ancestors along with immediate parent (index by ancestors & parent id)
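
Sketch of the third design (ancestors array plus immediate parent), using an in-memory array in place of a collection. The node ids and helper names are invented. The point is that "all descendants of X" becomes a single membership test on the ancestors array instead of a recursive walk.

```javascript
// Each node stores its immediate parent and the full list of ancestors.
const nodes = [
  { _id: 'root', parent: null, ancestors: [] },
  { _id: 'a', parent: 'root', ancestors: ['root'] },
  { _id: 'b', parent: 'a', ancestors: ['root', 'a'] },
  { _id: 'c', parent: 'root', ancestors: ['root'] },
];

// Roughly what db.nodes.find({ ancestors: id }) would return ("nodes" is a hypothetical collection).
function descendantsOf(id) {
  return nodes.filter((n) => n.ancestors.includes(id)).map((n) => n._id);
}

// Roughly what db.nodes.find({ parent: id }) would return.
function childrenOf(id) {
  return nodes.filter((n) => n.parent === id).map((n) => n._id);
}

console.log(descendantsOf('root'), descendantsOf('a'), childrenOf('root'));
```

The trade-off is that moving a subtree means rewriting the ancestors array of every node under it.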

OO inheritance

STI works well
Dynamic schema doesn't incur "sparse schema" penalty when a collection is highly heterogeneous
Indexing also works great as nulls are not indexed

* Atomicity

2 ways to get document-level atomicity:

  - $ operators -- atomic in-place changes, though not turing complete
  - CAS "compare and swap" (optimistic concurrency)
     usually implemented in your driver's save() function
       essentially: while (savedId != savedDoc.id) change(doc); savedId = save(doc); savedDoc = fetch(doc)
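
Sketch of the compare-and-swap loop, spelled out against an in-memory Map. Assumptions: the version field, the helper names, and the retry limit are all mine; the talk only said drivers tend to do this inside save(). The idea is read, change, write only if unchanged, else retry.

```javascript
// In-memory stand-in for a collection; each document carries a version counter.
const store = new Map([['order-1', { _id: 'order-1', total: 10, version: 0 }]]);

function fetchDoc(id) {
  return { ...store.get(id) }; // return a copy, like a fresh read
}

// Conditional write: succeeds only if nobody changed the doc since we read it.
function updateIfVersion(id, expectedVersion, newDoc) {
  const current = store.get(id);
  if (current.version !== expectedVersion) return false;
  store.set(id, { ...newDoc, version: expectedVersion + 1 });
  return true;
}

// Retry the read-change-write cycle until the conditional write lands.
function casUpdate(id, change, maxRetries = 5) {
  for (let attempt = 0; attempt < maxRetries; attempt++) {
    const doc = fetchDoc(id);
    const changed = change({ ...doc });
    if (updateIfVersion(id, doc.version, changed)) return attempt + 1;
  }
  throw new Error('too much contention, giving up');
}

// Simulate a competing writer sneaking in during the first attempt.
let interfered = false;
const attempts = casUpdate('order-1', (doc) => {
  if (!interfered) {
    interfered = true;
    updateIfVersion('order-1', 0, { _id: 'order-1', total: 99 });
  }
  doc.total += 5;
  return doc;
});
console.log(attempts, store.get('order-1'));
```

The first attempt loses to the competing write, so the change is re-applied on top of the newer document rather than silently overwriting it.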

* Capped collections
  - "special" collections with *very* fast read/write
  - at a high-level: preallocated append-only circular buffer
  - useful for activity streams
  - no delete (use delete flag and filter at app level)
  - can only perform update if the object doesn't change size
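
Sketch of the "preallocated circular buffer" idea behind capped collections. CappedBuffer is a toy class of mine that caps by document count to keep the example small (as I understand it, real capped collections are sized in bytes). Once full, each insert overwrites the oldest slot, which is why there is no delete.

```javascript
// Fixed-size ring buffer: inserts wrap around and overwrite the oldest entry.
class CappedBuffer {
  constructor(capacity) {
    this.capacity = capacity;
    this.slots = new Array(capacity); // preallocated up front
    this.count = 0;                   // total inserts ever made
  }

  insert(doc) {
    this.slots[this.count % this.capacity] = doc;
    this.count++;
  }

  // Returns surviving documents in insertion order, oldest first.
  find() {
    const n = Math.min(this.count, this.capacity);
    const out = [];
    for (let i = this.count - n; i < this.count; i++) {
      out.push(this.slots[i % this.capacity]);
    }
    return out;
  }
}

const feed = new CappedBuffer(3);
['login', 'post', 'like', 'logout'].forEach((event, seq) => feed.insert({ seq, event }));
console.log(feed.find().map((d) => d.event));
```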

* Indexing caveats

 - ensureIndex() blocks reads/writes
 - Background indexing feature - runs on low pri thread in background, reads are still blocked


III. MySQL to MongoDB

Wordnik - project to track languages (gps for english)
- language is not static
- capture information about all words (meaning/relations)
- lots of data in corpus

* MongoDB migration

Started seeing lockups on MyISAM tables when inserting at 4B rows (wondered if he tried InnoDB. Edit: someone asked this. Answer: InnoDB performance at 4B rows is resistant to optimization)
Researched noSQL solutions, chose MongoDB

* Migration:

> 5 B documents
> 1.5 TB
0 downtime

* Findings:

- MongoDB loves system resources
- run on dedicated physical hardware
- 8 core, 32gb ram, FC SAN
- Very bad performance with virtualization, switched to physical fibre channel enabled I/O, YMMV
- many X the disk space of MySQL (easy pill to swallow)

Two distinct use cases for storage:

- key/value storage - easy to migrate
- hierarchical - required more work

* Migration path:

- Using JDBC
- Implemented all CRUD methods in DAO
- Allowed swapping between Mongo and mysql at runtime
- Used auto ID generation in Mongo as backwards compatible with mysql
- Objects serialized to JSON using JacksonMapper
- De-normalized 12+ mysql tables to 2 collections
- Used embedded model to persist relations - no joins, over 20x faster
- Mysql had 50 key/value objects per second read throughput
- MongoDB had 250k key/value objects per second read throughput

(hmm.. makes me wonder how much of the performance boost was due to de-normalizing data model and how much was switching data stores)

Question: how do you deal with de-normalization and data changes?
Answer: in practice not a big deal as dictionary data doesn't change that often

* Bulk insert migration:

- MongoDB has fire-and-forget inserts
- Wrote migration script to do bulk inserts into MongoDB
- Shut off online writes for 24 hours
- Sustained 100k inserts/sec during migration

* Caching:

- Do not skimp on your disk or RAM
- MongoDB uses LRU cache with mmap() -- no longer needed to run memcached/out-of-process cache


IV. Event logging and funnel analysis

Justin.tv

Questions:
Who does what (funnels)
How valuable are groups of users (virality)
Are our changes working? (retention, funnel conversion)

(Chris thinks this is a rather sophomoric implementation and says our stuff is much further along)


V. Administration

* Backups

 - mongodump - similar to mysqldump, dumps BSON documents to a file
 - fsync + lock - faster, puts DB into read-only then takes FS snapshot, usually takes a few seconds

* Replication

 - use replication!
 - delayed option - good safeguard against accidental operations run on master
 - oplog size is important - transaction log is stored in capped collection - default size is 5% of free space
 - shell command on a slave shows you how far behind it is from master
 - see more in replication talk

* Monitoring

mongostat - similar to iostat <--- awesome

web console
 - "top" for collections
 - shows total locked time
 - shows all running operations

3rd party plugins

- munin (written by 10gen)
- ganglia
- nagios
- cacti

Disk layout

 - MongoDB pre-generates large files (extents) to avoid OS block buffering idiosyncrasies

Connections

 - Default max: 20,000

Log message levels:

From critical/bad to ok:
 - Asserts
 - Warnings
 - Message Asserts
 - User Asserts (your fault)


VI. Replication

Similar to mysql - async M/S
In practice, near realtime
Generally works fine over WAN (cross datacenter capable)
1 master can handle about 10 slaves

start master with --master --oplogSize <sizeinmb>
start slave with --slave --source <masterhost:masterport>

Oplog

 - Similar to transaction log in mysql
 - Circular buffer (actually a capped collection)
 - Slave receives + replays oplog from master
 - Slave will catch up with master on startup then go into op replay
 - Op replay is idempotent
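
Sketch of why idempotent replay matters: a slave can re-apply an overlapping stretch of the oplog after catching up and still end in the same state. The op format below is invented for illustration (not the real oplog document layout). The assumption is that updates are logged as the resulting value rather than as "increment by 1".

```javascript
// Apply one logged operation to a slave's data (a Map keyed by _id).
function applyOp(data, op) {
  if (op.type === 'insert') {
    data.set(op.doc._id, { ...op.doc }); // re-inserting simply overwrites
  } else if (op.type === 'set') {
    const existing = data.get(op._id) || { _id: op._id };
    data.set(op._id, { ...existing, ...op.fields });
  } else if (op.type === 'delete') {
    data.delete(op._id); // deleting a missing doc is a no-op
  }
}

function replay(data, oplog) {
  for (const op of oplog) applyOp(data, op);
  return data;
}

const oplog = [
  { type: 'insert', doc: { _id: 1, views: 0 } },
  { type: 'set', _id: 1, fields: { views: 1 } }, // result of an increment, logged as a value
  { type: 'set', _id: 1, fields: { views: 2 } },
];

const slave = replay(new Map(), oplog);
const afterFirst = JSON.stringify(slave.get(1));
replay(slave, oplog.slice(1)); // replay an overlapping tail a second time
const afterSecond = JSON.stringify(slave.get(1));
console.log(afterFirst, afterSecond);
```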

Replication Info

 - db.printReplicationInfo() shows replication state on connected master
 - db.printSlaveReplicationInfo() shows replication state on connected slave

Replica Pairs

 - Stable in next major version, 1.6 - now called Replica Sets
 - awesome
 - Single Master, N Slave with automatic failover & recovery
 - Consensus election of new master when master becomes partitioned
 - Lots of control over consensus - vote weighting, member types e.g. ability to have dummy arbiter for tie-breaking
 - Rack and datacenter aware
 - Driver will be notified of master change on next read/write against new topology (implemented as error + retry)
 - Write is "durable" when committed to majority of servers in the set
 - Caveat: a write on master is immediately visible to reads on master; true 100% read consistency is only on master (slaves may lag). In practice this is usually not an issue if using durable writes


VII. Sharding

What? Horizontal scaling
Why? Can scale wider than higher

* MongoDB supports automatic sharding

  - Range based
  - Zero downtime deployment/rollout
  - Full Consistency
  - Actually a framework
  - Completely orthogonal to replication

* Anatomy

Config servers:
 - at least 3 instances needed (changes use 2 phase commit)
 - service will turn read only if any instances are down
 - system is online as long as 1/3 of the set is up

Shards:
 - can be master, master/slave or replica sets (!)
 - sharding + replica sets = automatic horizontal partitioning with automatic failover (!!)
 - plain old mongod processes

Sharding Router (mongos):
  - Transparently handles all sharding routing logic
  - Stick between client and mongod -- looks just like a mongod to clients
  - Can have 1 or as many as you like
  - Can run directly on application server to avoid network overhead

* Writes

Inserts: require shard key (object must have that field, null is a valid shard key)
Removes: routed and/or scattered
Updates: routed or scattered

* Queries

By shard key: routed
Sorted by shard key: ?? didn't get this down
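
Sketch of "routed" versus "scattered" with range-based chunks: a query that includes the shard key can go to the one shard owning that range, anything else has to be sent to every shard. The chunk table, shard names, and the userId shard key are all invented, and this only handles equality on the key.

```javascript
// Chunk ranges are [min, max) on the shard key.
const chunks = [
  { min: -Infinity, max: 1000, shard: 'shard0' },
  { min: 1000, max: 5000, shard: 'shard1' },
  { min: 5000, max: Infinity, shard: 'shard2' },
];

// Decide which shards a query must be sent to.
function targetShards(query, shardKey) {
  if (shardKey in query) {
    const value = query[shardKey];
    const chunk = chunks.find((c) => value >= c.min && value < c.max);
    return [chunk.shard]; // routed: exactly one shard
  }
  return [...new Set(chunks.map((c) => c.shard))]; // scattered: ask everyone
}

const routed = targetShards({ userId: 1200 }, 'userId');
const scattered = targetShards({ name: 'bob' }, 'userId');
console.log(routed, scattered);
```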

* Router Operations

- split: breaking a chunk into 2
- migrate: move a chunk from 1 shard to another
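
Sketch of the two router operations on a toy chunk table. splitChunk and migrateChunk are illustrative names of mine, and only the range metadata is modelled here (no documents actually move).

```javascript
// split: replace one chunk with two chunks that share a boundary.
function splitChunk(chunks, index, splitPoint) {
  const c = chunks[index];
  if (!(splitPoint > c.min && splitPoint < c.max)) {
    throw new Error('split point outside chunk range');
  }
  const left = { ...c, max: splitPoint };
  const right = { ...c, min: splitPoint };
  return [...chunks.slice(0, index), left, right, ...chunks.slice(index + 1)];
}

// migrate: hand ownership of one chunk to another shard.
function migrateChunk(chunks, index, toShard) {
  return chunks.map((c, i) => (i === index ? { ...c, shard: toShard } : c));
}

let table = [{ min: 0, max: 100, shard: 'shard0' }];
table = splitChunk(table, 0, 50);         // [0,50) and [50,100), both on shard0
table = migrateChunk(table, 1, 'shard1'); // upper half moves to shard1
console.log(table);
```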

* Router Balancing

- distributes chunks *automatically*
- constantly running in the background of mongo routers
- factors system resources of shard instances: disk ops, cpu, data size

* Multi-DataCenter aware

- Intelligent geo homing
- Auto failover

* Limitations

- Unique indexes not based on shard key are not supported
- Doesn't scale past 20 Petabytes (sigh)

* Monitoring

- Use db.chunks.find() (in the config database) to introspect shards

* Demo

Wow. Demoing a live running 25-node cluster with replication + sharding. Been running since 8am. Sustaining 5M ops/sec. Currently doing 8.1M ops/sec.