Created
May 1, 2010 19:49
-
-
Save mikewadhera/386607 to your computer and use it in GitHub Desktop.
my notes from mongosf
This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
I. Map/Reduce, GeoSpatial Indexing & other cool features | |
* Client features | |
Safe Inserts - Batches a subsequent write and read together to ensure order preservation when using connection pooling in a multi-threaded client | |
Replication ACK - Can configure minimum acknowledgment level of slave writes. Client will raise exception if replication level not met during a write | |
GetMore operation - receive data from server in chunks | |
http://api.mongodb.org/js - reference for JS REPL | |
db.runCommand( (listCommands : 1) ) - lists all possible commands | |
* Lightweight query language, $ prefixed operators | |
$slice -- similar to LIMIT | |
- supports (offset, limit) & negative indices | |
$gt, $gte | |
$elemMatch | |
* Other features | |
MapReduce - allows for arbitrary, ad-hoc queries. Write 2 functions in JS - mapper function emits key/values for each doc in collection, reducer linearly processes emitted key/values. | |
GridFS - splits up binary files into chunked documents -- nothing "magical", implemented on top of vanilla client operations | |
Indexes - create with ensureIndex(field, asc/desc) Dates, timestamps and other increasing values are good candidates for indexes as older values are swapped to disk. | |
Wow: Fields that have arrays are exploded and each member is indexed separately to allow for efficient searching | |
GeoSpatial indexes - create index with ensureIndex(<field> : '2d') | |
Search with $near operator | |
II. Schema Design | |
Document-oriented storage | |
Not relational | |
Not OODB | |
JSON data | |
Row -> Document | |
Table -> Collection | |
Indexes -> Index | |
Join -> Embedding / Linking | |
* Embedding vs Linking | |
- "Contains relationship" : embed | |
- Embed = "pre joined" | |
- Links: transparent client/server turnarounds | |
- When in doubt embed - Note 4MB object size limit | |
Partial updates are supported | |
- $set order.total = x | |
- $push order.line_items(y) | |
- Also O(1) partial updates with references are supported - i.e. update all line_items[0].price to line_items[1].price | |
* Data Structures | |
Trees | |
3 possible designs: | |
- Embed whole tree in 1 document | |
- Link to parent (index by parent id) | |
- Store array of ancestors along with immediate parent (index by ancestors & parent id) | |
OO inheritance | |
STI works well | |
Dynamic schema doesn't incur "sparse schema" penalty when a collection is highly heterogenous | |
Indexing also works great as null's are not indexed | |
* Atomicity | |
2 ways to get document-level atomicity: | |
- $ operators -- atomic in-place changes, though not turing complete | |
- CAS "compare and swap" (optimistic concurrency) | |
usually implemented in your driver's save() function | |
essentially: while (savedId != savedDoc.id) change(doc); savedId = save(doc); savedDoc = fetch(doc) | |
* Capped collections | |
- "special" collections with *very* fast read/write | |
- at a high-level: preallocated append-only circular buffer | |
- useful for activity streams | |
- no delete (use delete flag and filter at app level) | |
- can only perform update if the object doesn't change size | |
* Indexing caveats | |
- ensureIndex() blocks reads/writes | |
- Background indexing feature - runs on low pri thread in background, reads are still blocked | |
III. MySQL to MongoDB | |
Wordnik - project to track languages (gps for english) | |
- language is not static | |
- capture information about all words (meaning/relations) | |
- lots of data in corpus | |
* MongoDB migration | |
Started seeing lockups on MyISAM tables when inserting at 4B rows (wondered if he tried InnoDB edit: someone asked this. answer: InnoDB performance at 4B rows is resistant to optimization) | |
Researched noSQL solutions, chose MongoDB | |
* Migration: | |
> 5 B documents | |
> 1.5 TB | |
0 downtime | |
* Findings: | |
- MongoDB loves system resources | |
- run on dedicated physical hardware | |
- 8 core, 32gb ram, FC SAN | |
- Very bad performance with virtualization, switched to physical fibre channel enabled I/O, YMMV | |
- many X the disk space of MySQL (easy pill to swallow) | |
Two distinct use cases for storage: | |
- key/value storage - easy to migrate | |
- hierarchical - required more work | |
* Migration path: | |
- Using JDBC | |
- Implemented all CRUD methods in DAO | |
- Allowed swapping between Mongo and mysql at runtime | |
- Used auto ID generation in Mongo as backwards compatible with mysql | |
- Objects serialized to JSON using JacksonMapper | |
- De-normalized 12+ mysql tables to 2 collections | |
- Used embedded model to persist relations - no joins over 20x faster | |
- Mysql had 50 key/value objects per second read throughput | |
- MongoDB had 250k key/value objects per second read throughput | |
(hmm.. makes me wonder how much of the performance boost was due to de-normalizing data model and how much was switching data stores) | |
Question: how do you deal with de-normalization and data changes? | |
Answer: in practice not a big deal as dictionary data doesn't change that often | |
* Bulk insert migration: | |
- MongoDB has fire-and-forget inserts | |
- Wrote migration script to do bulk inserts into MongoDB | |
- Shut off online writes for 24 hours | |
- Sustained 100k inserts/sec during migration | |
* Caching: | |
- Do not skimp on your disk or RAM | |
- MongoDB uses LRU cache with mmap() -- no longer needed to run memcached/out-of-process cache | |
IV. Event logging and funnel analysis | |
Justin.tv | |
Questions: | |
Who does what (funnels) | |
How valuable are groups of users (virality) | |
Are our changes working? (retention, funnel conversion) | |
(Chris thinks this is a rather sophomoric implementation and says our stuff is much further along) | |
V. Administration | |
* Backups | |
- mongodump - similar to mysqldump, dumps BSON documents to a file | |
- fsync + lock - faster, puts DB into read-only then takes FS snapshot, usually takes a few seconds | |
* Replication | |
- use replication! | |
- delayed option - good safe guard against accidental operations run on master | |
- oplog size is important - transaction log is stored in capped collection - default size is 5% of free space | |
- shell command on a slave shows you how far behind it is from master | |
- see more in replication talk | |
* Monitoring | |
mongostat - similar to iostat <--- awesome | |
web console | |
- "top" for collections | |
- shows total locked time | |
- shows all running operations | |
3rd party plugins | |
- munin (written by 10gen) | |
- ganglia | |
- nagios | |
- cacti | |
Disk layout | |
- MongoDB pre-generates large files (extents) to avoid OS block buffering idiosyncrasies | |
Connections | |
- Default max: 20,000 | |
Log message levels: | |
From critical/bad to ok: | |
- Asserts | |
- Warnings | |
- Message Asserts | |
- User Asserts (your fault) | |
VI. Replication | |
Similar to mysql - async M/S | |
In practice, near realtime | |
Generally works fine over WAN (cross datacenter capable) | |
1 master can handle about 10 slaves | |
start master with --master --oplog <sizeinmb> | |
start slave with --slave --source <masterhost:masterport> | |
Oplog | |
- Similar to transaction log in mysql | |
- Circular buffer (actually a capped collection) | |
- Slave receives + replays oplog from master | |
- Slave will catch up with master on startup then go into op replay | |
- Op replay is idempotent | |
Replication Info | |
- db.printReplicationInfo() shows replication state on connected master | |
- db.printSlaveReplicationInfo() shows replication state on connected slave | |
Replica Pairs | |
- Stable in next major version, 1.6 - now called Replica Sets | |
- awesome | |
- Single Master, N Slave with automatic failover & recovery | |
- Consensus election of new master when master becomes partitioned | |
- Lots of control over consensus - vote weighting, member types e.g. ability to have dummy arbiter for tie-breaking | |
- Rack and datacenter aware | |
- Driver will be notified of master change on next read/write against new topology (implemented as error + retry) | |
- Write is "durable" when committed to majority of servers in the set | |
- Caveat: read on master is immediately visible, true 100% read consistency is only on master. In practice this is usually not an issue if using durable writes | |
VII. Sharding | |
What? Vertical scaling | |
Why? Can scale wider than higher | |
* MongoDB supports automatic sharding | |
- Range based | |
- Zero downtime deployment/rollout | |
- Full Consistency | |
- Actually a framework | |
- Completely orthogonal to replication | |
* Anatomy | |
Config servers: | |
- at least 3 instances needed (changes use 2 phase commit) | |
- service will turn read only if any instances are down | |
- system is online so as long as 1/3 of the set is up | |
Shards: | |
- can be master, master/slave or replica sets (!) | |
- sharding + replica sets = automatic vertical partitioning with automatic failover (!!) | |
- plain old mongod processes | |
Sharding Router (mongos): | |
- Transparently handles all sharding routing logic | |
- Stick between client and mongod -- looks just like a mongod to clients | |
- Can have 1 or as many as you like | |
- Can run directly on application server to avoid network overhead | |
* Writes | |
Inserts: require shard key (object must have that field, null is a valid shard key) | |
Removes: routed and/or scattered | |
Updates: routed or scattered | |
* Queries | |
By shard key: routed | |
Sorted by shard key: ?? didn't get this down | |
* Router Operations | |
- split: breaking a chunk into 2 | |
- migrate: move a chunk from 1 shard to another | |
* Router Balancing | |
- distributes chunks *automatically* | |
- constantly running in the background of mongo routers | |
- factors system resources of shard instances: disk ops, cpu, data size | |
* Multi-DataCenter aware | |
- Intelligent geo honing | |
- Auto failover | |
* Limitations | |
- Unique indexes not based on shard key | |
- Doesn't scale past 20 Petabytes (sigh) | |
* Monitoring | |
- Use db.chunks.find() to introspect shards | |
* Demo | |
Wow. Demoing a live running 25-node cluster with replication + sharding. Been running since 8am. Sustaining 5M ops/sec. Currently doing 8.1M ops/sec. |
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment