@duien · Created February 8, 2011
Mongo Atlanta Notes

Tweet photo with #mongoatl to win!
Fur bus will shuttle back & forth to after-party, where there are 2 free drinks

Building an application with MongoDB

Will cover:

  • Data modeling, queries, geospatial, updates, map-reduce
  • Using location-based app as an example
  • Examples work in MongoDB JS shell
  1. current location => places near location
  2. add user-generated content
  3. Record user checkins
  4. Stats about checkins

Example document

place1 = {
  name: '10gen HQ',
  address: '134 5th Ave',
  city: 'New York',
  zip: '10011'
}

db.places.find({zip:'10011'}).limit(10)

Let's add some tags!

place1['tags'] = ['business', 'recommended']

db.places.find({zip:'10011', tags:'business'})

Let's add a location and index it as a geospatial point

place1['latlong'] = [40.0, 72.0]

db.places.ensureIndex({latlong:'2d'}) // this is a geo-spatial point!
db.places.find({latlong:{$near:[40,70]}})

Let's have some user-generated content too!

place1['tips'] = [
  { user: 'nosh', time: '6/26/2010', tip: "stop by for office hours Wednesdays from 4-6pm" },
  { ... }, { ... }
]

Querying this database

  • Set up indexes on the attributes you'll be querying
  • Find by $near, by regexp for typeahead, by tag (sketch below)
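
A regexp typeahead only uses the index when the pattern is an anchored, case-sensitive prefix -- a quick sketch (prefix value hypothetical):

db.places.ensureIndex({name: 1})
db.places.find({name: /^10gen/}).limit(10) // anchored prefix can walk the name index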

Inserting data

db.places.insert(place1)

Updating tips:

db.places.update({name:"10gen HQ"},
  { $push: { tips:
    { user: "nosh", ... } } })

We've now done steps 1 and 2!

user1 = {
  name: "nosh",
  email: "nosh@10gen.com",
  checkins: [ id, id ]
}

Choices: embed or reference
For checkins, we're referencing them instead of embedding -- this lets us do aggregate statistics on the checkins.

checkin1 = {
  place: "10gen HQ",
  ts: '9/20/2010 10:12:00',
  userId: id
}

Check-in in 2 operations

  • insert checkin into checkins collection

  • update ($push) user object

    ensureIndex({place: 1, ts: 1})
    ensureIndex({ts: 1})

    $set, $unset, $rename, $push, $pop, $pull, $addToSet, $inc
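
A minimal sketch of the two-step check-in, assuming the user1/checkin1 documents above and a checkins collection:

db.checkins.insert(checkin1)              // 1. record the checkin (shell fills in _id)
db.users.update({name: 'nosh'},
  { $push: { checkins: checkin1._id } })  // 2. append its id to the user's checkins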

We're now done with step 3

Simple stats:

db.checkins.find({place: "10gen HQ"}).sort({ts:-1}).limit(10) // last 10 checkins
db.checkins.find({place: "10gen HQ", ts: {$gt: midnight}}).count() // number of checkins since midnight

Stats with MapReduce

mapFunc = function(){
  emit(this.place, 1); }

reduceFunc = function(key, values){
  return Array.sum(values); }

res = db.checkins.mapReduce(mapFunc, reduceFunc,
  { query: { ts: { $gt: nowMinus3Hours } } })

//=> res = [{_id: "10gen HQ", value: 17}, {...}, {...}]

res.find({ value: { $gt: 15 },
           _id: { $in: [ ..., ..., ... ] } })

Deploying a MongoDB application

Single Master Deployments

  • One primary/master server
  • n secondary/slave servers
  • Configure as replica set for automated failover
  • Add more secondaries to scale reads
  • Good for read-heavy applications and smaller applications

Autosharded deployment

Create a few replica sets and distribute data between them automatically
Config server handles distribution and balancing

MongoS is the server that apps talk to; it knows where to find the data
This is all transparent to the application

Use Cases

  • RDBMS replacement for high-traffic web apps
  • Content-Management
  • Social, mobile
  • Lots of stuff!

Questions

MapReduce isn't really intended for realtime. Have a job that runs in the background to keep results fresh

MapReduce only uses one core right now, but it can work across shards. Will probably work across multiple cores soon too.

Hadoop plugin (experimental) might help get around this limitation

Community Resources

  • MongoDB.org -- tutorial, wiki, references, use cases, web shell you can try out, downloads (core & drivers)
  • groups.google.com/groups/mongodb-user - mailing list monitored by 10gen
  • jira.mongodb.org -- bug tracking -- questions about roadmap and features, request a feature, etc
  • irc.freenode.net #mongodb -- A lot of the 10gen team is on here
  • github.com/mongodb -- it's open source! db, drivers, etc
  • 10gen.com/events -- we're always looking for speakers
  • StackOverflow and Quora -- there's lots of good info here, but google group will get you quicker response

Schema Design

robert@10gen.com -- C# driver developer, based in Atlanta

Normalization - similar goals (to relational) also apply, but rules are different

Collections

Cheap to create (max 24k)
Don't have a schema -- individual documents have a schema. Common for documents to share schema
Consider using multiple collections tied together by naming conventions: LogData-2011-02-08

  • It's easy to clean up old data
  • Can help with queries and statistics

Documents

BSON used on disk and for wire protocol
JSON used for humans
DateTime in database is always in UTC
ObjectID - binary, 12 bytes (24-char hex is the usual representation)

Denormalization (in order example) is good for snapshotting data -- what was the customer's address at the time this order was placed

Rich document is

  • holistic representation
  • still easy to manipulate
  • pre-joined for fast retrieval

Document size

  • Max 4MB in earlier versions, now 16MB
  • Performance considerations long before reaching max size
  • GridFS is the way to store really large objects -- break them into chunks of 256K (see below)
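
From the shell you can see where a driver (or the mongofiles tool) put a file -- these are the default GridFS collections; fileId here is a hypothetical _id taken from fs.files:

db.fs.files.findOne()                  // file metadata: filename, length, chunkSize, md5
db.fs.chunks.find({files_id: fileId})  // the 256K chunks themselves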

Database design considerations

How can we manipulate it?
What kinds of queries?
What index?
Atomic updates? (document update is always atomic -- can't be atomic across multiple documents)
Access patterns

  • read/write ratio
  • types of updates and queries
  • data life-cycle

Considerations

  • You can't join - if used together, keep together (if possible)
  • Document writes are atomic

Document design

Make them map simply to application data
If you're using a higher level tool that maps objects to documents, a lot of this will be done for you
Mongo will add _id if you don't specify one. Custom one can be anything except array or document
Timestamp is embedded in id -- most drivers let you get at this data. They sort chronologically
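
For example, in the JS shell (the time lives in the id's leading 4 bytes):

id = ObjectId()
id.getTimestamp()  // the creation time embedded in the id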

Verify index exists

db.books.getIndexes()
[
  { _id: objectId,
    ns: 'test.books',
    key: { author: 1 },
    name: 'author_1'
  }
]

Examine the query plan

db.books.find({ author: '...'}).explain()
{
  cursor: 'BtreeCursor author_1',
  nscanned: 1, // how many were looked at
  nscannedObjects: 1,
  n: 1, // how many were returned
  millis: 1,
  indexBounds: {
    author: [
      [ ..., ... ]
    ]
  }
}

Lots of operators:

  • equality ({ author: '...'})
  • matches (regexp) ({ author: /^e/i }) // beginning with 'e'
  • ne, in, nin, mod, all, size, exists, type, lt, lte, gt, gte

count, distinct and group are available aggregation operators

Extending the schema

db.books.update(
  { title: 'The Old Man ...' },  // the doc to update
  {                              // the changes to apply
    $inc: { comments_count: 1 }, // will create attribute with value 0 first if necessary
    $push: { comments: comment } // will create empty array first if necessary
  }
)

$inc and $push will create correctly typed attributes if the attribute doesn't already exist

Single table inheritance

(Relational) you have to have columns for all the types, some will be null for each type
In mongo, just add the elements that make sense for that document
It's fine to create an index on fields that don't occur in every document
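
A quick sketch with a hypothetical shapes collection -- each document only carries the fields that make sense for its type:

db.shapes.insert({type: 'circle', radius: 5})
db.shapes.insert({type: 'rect', width: 3, height: 4})
db.shapes.ensureIndex({radius: 1})  // fine even though only circles have radius
db.shapes.find({radius: {$gt: 2}})  // only circles can match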

One-to-many

Embedded array

{ author: ...,
  comments: [
    { author: ... },
    { author: ...,
      replies: [
        { author: ... } // now we have a nested tree!
      ]
    }
  ]
}

Normalized

{ author: ...,
  comment_ids: [ 1, 2 ]
}

Links can go in both or either direction for normalized

Many-to-many

Store references on both sides

{
  name: 'baseball bat',
  category_ids: [ 1, 2 ]
}
{
  name: 'sports equipment',
  product_ids: [ 1, ... ]
}

find({ category_ids: 1})
find({ product_ids: 1})

Normalize it a little bit -- Drop product ids from category

find({category_ids: 1})

prod = db.products.findOne({ _id: 1 })
db.categories.find({ _id: { $in: prod.category_ids }})

Trees

Full tree in document (example in one-to-many)
Very hard to search -- data is at multiple levels
Could have documents that are too large
Link from parent -> child, child -> parent, or both
Also: arrays of ancestors

{ _id : 1 }
{ _id : 2, ancestors : [1], parent: 1 }
{ _id : 3, ancestors : [1, 2], parent: 2 }

Lets you do parent or ancestor queries fairly easily
Lets you find children or descendants very easily

Can also store string representing path to document. Using / or . as delimiter will make regexp queries awkward
Finding ancestors or siblings is difficult
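
A sketch of the path approach with a comma delimiter (so the regexp needs no escaping); the anchored regexp can use an index on path:

db.nodes.insert({_id: 3, path: '1,2,3'})
db.nodes.ensureIndex({path: 1})
db.nodes.find({path: /^1,2,/})  // the subtree under node 2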

Queues

Use findAndModify to get and update the document in one atomic command
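
A sketch of a worker atomically claiming a job from a hypothetical jobs collection:

job = db.jobs.findAndModify({
  query:  { inprogress: false },
  sort:   { priority: -1 },
  update: { $set: { inprogress: true, started: new Date() } }
})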

MongoDB and the Enterprise

Basically, you're competing with Oracle.

  • Scaling:
    • Buy a bigger DB server
    • Mostly vertical
  • Single-server durability is great until the server dies
  • Schema migrations are hard with so many instances
  • Closed-source = more bugs (or at least harder to know about bugs)
  • SQL = easy to shoot yourself in the foot
  • Licensing is hard for cloud deploys

Next-generation Video on Demand

  • Premium (high-bandwidth) content to cable, pc, and mobile
  • Migration path for current deployments : 50 - 200 installations

Want redundancy
New features should be fairly easy to add
Don't want to get backed into a corner on performance

Multi-site Oracle sucks

Enterprise requirements

  1. Transactions are required

    • ACID only works within a single database
    • Many hardware systems are not RDBMSs anyway
    • Database of Record
      Authoritative source for a certain type of data
    • Transaction is ACID for each database you touch, but the overall flow is not
  2. All data is relational

    • Lots of it is hierarchical, lots of one-to-many relationships
    • Relational: make it all normalized then denormalize until performance is good enough
    • NoSQL tradeoff
      • data redundancy
      • Better performance because of co-locality of data
      • When data access is down through hierarchy, it works really well
  3. NoSQL is new and unproven

    • Relational DBs have been dominant for a quarter century
    • NoSQL was around before SQL
    • Data retrieval was key/value for long time
    • RAIC : redundant array of inexpensive computers

Why MongoDB

Easy to use: Low friction, lots of language drivers
Avoid ORM mismatch
Get up and running quickly

Advantages

  • Fast binary drivers
  • C++ - no garbage collection
  • Document database - no schema
  • Secondary keys
  • Multi-server scaling

Challenges

  • Maturity - 2 years?! OMGWTFBBQ!
  • Expertise - There's not a lot of experts yet
  • DB Backups in sharded cluster - it's a little tricky

MongoDB is another tool, not the only tool

Results for Facebook Advertisers

Blinq - Social engagement advertising

  • Provide content and technology for advertisers
  • Headquarters in NY, tech and operations in Atl
  • BAM - Blinq ad manager built on Facebook ads API
    • Allows targeting based on OpenGraph info
    • Quickly launch and manage campaigns
    • Get performance data to optimize

Using: Rails, git, redis (Resque), EngineYard, MySQL for account data, MongoDB

Why Mongo?

Trying to get performance data on all ads
40 - 50 key indicators
Information needs to be available quickly

Configuration

4 servers in each environment

  • Master (1.6.5)
  • Utility: redis and mongo slave
  • App server
  • MySQL DB

Use this data to

  • Make basic graphs
  • Pivot using MapReduce in realtime
  • Compare common attributes across multiple ads

Gotchas

ObjectID != String
If you're not careful, you can end up with both types and things get weird
Symptoms: no data or incomplete data
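
For example, these two queries match different documents even though the hex looks the same (collection and id value hypothetical):

db.ads.find({_id: '4d51b4d6e4b0c65f12345678'})            // string key
db.ads.find({_id: ObjectId('4d51b4d6e4b0c65f12345678')})  // ObjectId key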

Not a perfect fit with Chef
It's hard to set up things like replica sets -- you have to run some commands from the Mongo shell, and that's awkward

Schemaless == your app has to be more careful with your schema
my_field vs. my_feild won't generate errors
Your mapper can sometimes help with this

Nice Surprises

It's a pain to set up replication with MySQL, and there's not a lot of feedback that it's right
It's dead easy with Mongo

Getting started is quick (compare: PostgreSQL)

SourceForge

2009: FossFor.us black ops project with user-generated content, new design

  • Started out using CouchDB
  • Adding fields was trivial and happened frequently
  • Scaling Couch to SF.net levels of traffic didn't work
  • Tried out some alternatives, landed on MongoDB

This experiment was successful, so start moving it to SF.net

Most traffic (90%) is on 3 pages

  • Project summary
  • File browser
  • Download

Pages are read-heavy with infrequent updates from "Develop" side

Original idea is 1 MongoDB document per project
Periodic updates via RSS and AMQP from "Develop"

Ended up moving releases into separate docs because extremely frequent releases on some projects bumped them over the 4MB limit

"Consume" Architecture

Consume is Python (TurboGears) app
Each server has its own slave
There's a single server that takes all the updates

Mongo's performance was actually so good the slaves weren't necessary
There's a single slave for durability

Downloads

New service in late 2009 -- allow non-SourceForge projects to use the mirror network

Stats calculated in Hadoop and stored/served in MongoDB
Same basic architecture as Consume

Was eventually merged into Consume codebase

Allura

Rewrite developer tools (change from PHP/db mash to Python and MongoDB)

More tools: wiki, tracker, discussion, Git, Hg, SVN

Using a single MongoDB replica set manually sharded by project
There's a database per project, but that's a bad idea because there's a lot of overhead

Release early and often, but move people over slowly

What We Liked

Performance -- 90% of traffic on one database works great

Schemaless server allows fast schema evolution in development
Usually you don't need any migrations at all

Replication is easy, making scalability and backups easy as well
Query language is very nice -- good performance without having to use map/reduce

GridFS is wonderful -- also, it's a filesystem shared between all your webservers without setting up NFS, Samba, etc.

Pitfalls

Too-large documents -- 4MB docs = 30 docs/sec (slow)

Ignoring indexing -- it's easy to forget in development
It's fast until it's not
Bad queries show up in your server log, so keep an eye on it

Always have an index on your sort field
If there's too much data to sort, Mongo just won't do it unless the sort field is indexed
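
Sketch (collection name hypothetical):

db.events.ensureIndex({ts: -1})
db.events.find().sort({ts: -1}).limit(10)  // sort is satisfied by the index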

Ignoring schema can be a problem

Don't use more databases than you need to -- each one takes a minimum of ~200MB of disk space

Using too many queries is bad
Keep an eye on your ORM -- it might be generating too many

Ming

sf.net/projects/merciless

Object-Document Mapper for Python

Your data does have a schema

  • Your schema lives in your app instead of DB
  • It's nice if the schema lives in one place

Sometimes you need a migration

  • Changing the structure/meaning of a field
  • Adding indexes
  • Sometimes lazy, sometimes eager -- Ming supports both
    You can define multiple schemas and it will migrate to the newest when you access a record

Modify objects in memory, they'll be automatically updated at the end of the web request

You can use objects instead of dicts

Inspired by SQLAlchemy but not as full-featured

MIM -- "Mongo in Memory" for unit tests

Indexing and Query Optimization

Indexing 101

Table scan -- look at every record in the database
Tree -- sorted so you can look at fewer objects

Profiling your Queries

db.setProfilingLevel( level )

0 - off
1 - slow operations (>100ms)
2 - all operations

db.system.profile.find({millis:{$gt:5}})
{
  ts: timestamp,
  info: "query test.foo ntoreturn:0 exception bytes:53",
  millis: 88
}

Using explain

When you use an index, nscanned is the number of index entries looked at; nscannedObjects is the objects scanned after that

db.coll.find({title:"My Blog"}).explain()
{
  cursor: "BasicCursor",
  indexBounds: [],
  nscanned: 57594, // how many we looked at
  nscannedObjects: 57594,
  n: 3,
  millis: 108
}

Creating Indexes

db.posts.ensureIndex({name: 1})

1 = ascending
-1 = descending

db.posts.ensureIndex({name: 1, date: -1})

Index above useful for posts from a person, sorted by date

db.posts.ensureIndex({title: 1}, {unique: true})

You can't add a second document with the same title

Indexing embedded documents

db.posts.save({
  title: "...",
  comments: [
    { author: '...' }
  ]
})

db.posts.ensureIndex({"comments.author": 1})

Multikeys

{ tags: ["mongodb", "cool"] }

db.posts.ensureIndex({tags: 1})

Automatically indexes all the items in the array

Covered Indexes

New in 1.7.4

Usually the index tree just stores the ObjectID, and then you pull the document out of the collection
When all the fields desired are in the index itself, it can answer straight from the index instead of looking in the collection
Make sure you exclude _id from the returned fields
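
Sketch of a covered query -- everything requested lives in the index:

db.posts.ensureIndex({name: 1})
db.posts.find({name: 'mongo'}, {name: 1, _id: 0})  // _id excluded, so no collection lookup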

Sparse Indexes

New in 1.7.4

Include documents in index only if the indexed value is present
Limited to a single field

db.people.ensureIndex({title: 1}, {sparse: true})

Geospatial Indexes

Set x/y coordinates and then use 2d index type
Get objects near or a certain distance from a point

Spherical support added in 1.7.0
Implemented on top of b-tree indexes
Geo-hashing plus grid-by-grid search

db.posts.ensureIndex({ latlong: '2d' })

Can be used in a compound index

Managing Indexes

Listing your indexes

db.posts.getIndexes()

db.posts.dropIndex( ... )

Creating an index is a blocking operation. However, you can build them in the background

db.posts.ensureIndex(..., {background: true})

Query Planning

How the database decides what index to use

Index will be used for:

  • find on key
  • $in query for key
  • range query for key
  • count query for key -- done from just the index
  • sort on key
  • find on key, returning only key -- done from just the index

Partial index use:

  • db.coll.find({b: 0}).sort({a: -1}) uses index for sorting, but not query

Compound indexes

If you index on { a: 1, b: -1 }, you can't use the index to:

db.collection.find({b: 0})

But you can use it for a query that's on just a
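
To spell that out:

db.collection.ensureIndex({a: 1, b: -1})
db.collection.find({a: 1})        // uses the index (leading field)
db.collection.find({a: 1, b: 0})  // uses the index
db.collection.find({b: 0})        // can't use it -- b alone isn't a prefix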

Picking an index

Starts running all the possible query plans in parallel
Remembers whichever one returns the quickest

Keeps track of number of updates, redoes query planning every 100 updates

Chooses query plan based on pattern of query -- "equality match on x" vs. "where x = 1"

With shards, query planning runs on each shard independently

Big and Fat

Using MongoDB with Deep and Diverse Datasets

Jeremy at Intridea -- @jm
Also wickhamhousebrand.com for bowties!
Working on Document Design for MongoDB from O'Reilly

Pharm MD - Audit medicines prescribed by lots of doctors

Lesson 1 -- Abstraction is a two-edged sword

Lesson 2 -- Schema design matters

  • Embedding works!
  • But be careful -- you can run into performance problems
  • Every write would overrun allocated space so it got really slow

Lesson 3 -- Schemaless is fun!

  • Transforming data is fun
  • Formless data is annoying
  • Arbitrary embedding is awesome
  • Building to work with schemaless data can lead to really powerful app concepts

Be wary. It can crush you.

  • Weird app behavior when you're expecting a value
  • Huge, long-running data transformations (eg, turning embedded documents into their own collection)
  • Annoying data transforms for dev environment
  • Difficult to version data models

Lesson 4 -- Dig deep

Server stats

  • opcounters : insert, query, update, delete, getmore, command
  • connections: current, available

mongostat is like top for mongo

admin console -- HTML representation of server stats

db._adminCommand({diagLogging: 1})
db.currentOp()
{
  inprog: [ { opid: 35, op: "query",  ns: "fundb.parties" } ]
}

Administration

mongo, mongod, and mongos are key binaries

Configurations settings for mongod:

  • --fork fork the server process
  • --logpath to output log to a file

Mongo has some very basic security, but prefers to run in a trusted environment

Reserved databases: admin, local

Special query to run commands

db.runCommand("isMaster")
db.commandHelp("isMaster")

See what commands are currently in progress

db.currentOp()

Use mongodump to get a binary image of a running database
Most people just back up from slave

Oplog is a capped collection in local database

Mongo will automatically copy initial image from master to slave

Two types of replication:

  • Master/slave has no automatic promotion
  • Replica sets include automatic promotion

Slave delay option -- keep a slave intentionally behind by some amount of time to protect against fat-fingering
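
A delayed member looks roughly like this in the replica set config (hostname hypothetical; a delayed member should also be priority 0 so it's never elected):

{ _id: 2, host: 'backup1.example.com', priority: 0, slaveDelay: 3600 }  // stays 1 hour behind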

Database Diversity

Combining multiple data stores in one application or system

Every database has strengths and weaknesses -- use the best for the situation
Keep in mind that they are separate, and querying across them is difficult
You may lose your speed gains by trying to work across them

RDBMS

Good at handling highly relational data
Complex graphs with a lot of edges

Use for

  • Inventory records of different product option combinations
  • User records
  • Financial transaction records
  • Order records

Document-based Databases

Encapsulated models with well-defined boundaries
Concise, finite trees
Well, "documents"

Use for

  • Order and order-line details for reporting
  • References to uploaded assets

Key-Value Stores

Things you only need to look up one way
Very, very fast access

Use for

  • Temporary storage of uploaded binary data
  • Cross-cloud configuration states

Connecting across databases

Different methods depending on the nature of the relationship

Simplest is 1-1 between RDBMS and K/V store -- just store the key in an RDBMS column
That pretty much also applies to a 1-1 relationship to a MongoDB record, but in reverse

For 1-to-many, lots of people serialize an array and stick it in a text column
Fast and easy, using $in query in Mongo
Using the $in query won't preserve order

Another 1-to-many option: add a key on the document pointing to the RDBMS record
Can't use it for bidirectional relationships
Usually the best approach when you know about it up-front, but watch out when implementing after the fact (if you use sharding)

Can also make the relationship into an embedded document to add more info:

{ ...,
  owner_relationship: {
    owner_id: 5,
    sort: 2
  }
}

Many-to-many can use an array of ids on the Mongo side

Obviously, index this stuff! You're going to be doing a lot of queries on it!
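
Since arrays get multikey indexes automatically, that's one ensureIndex per side:

db.products.ensureIndex({category_ids: 1})
db.categories.ensureIndex({product_ids: 1})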

Scaling with MongoDB

Replication -- high availability and data safety
Sharding -- scalability

Replication

Always asynchronous
Master/slave or replica sets

Consensus election to determine primary server
Automatic failover and recovery
Writes always go to primary
Reads can go to any
Defaults to reading from primary unless you say secondary is OK

A write is truly committed once it's on a majority of members of the set
Use getLastError to check that it's been replicated out
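
e.g., block until at least 2 members have the write (values hypothetical):

db.runCommand({ getlasterror: 1, w: 2, wtimeout: 5000 })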

Setting up (sketch after this list):

  • Give the set a name
  • Run initiation command
  • Allow one to be set as always a slave (priority = 0)
  • Two-node replica set with 1 node down can't elect a primary -- define an arbiter that can vote but doesn't store data
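
A minimal initiation sketch (set name and hostnames hypothetical):

rs.initiate({
  _id: 'myset',
  members: [
    { _id: 0, host: 'db1.example.com' },
    { _id: 1, host: 'db2.example.com' },
    { _id: 2, host: 'arb1.example.com', arbiterOnly: true }  // votes, stores no data
  ]
})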

Sharding

Automatically shards and reshards
It's the resharding that's the tricky part

Shard migration is done in the background
In fact, as far as clients are concerned, the chunks don't exist

There can (and will) be more chunks than shards
Shards are a place for storing chunks, not a 1-1 mapping

Sharding is similar to BigTable implementation -- everything else is very different

Roadmap

v1.6 -- August 2010

  • Sharding, replica sets

v1.8 -- Under development (1-3 weeks away)

  • Single-server durability (storage engine journaling for crash safety)
  • Enhancements to sharding/replica sets
  • Spherical coordinates
  • Covered indexes
  • $rename field

Later

  • Better map/reduce and aggregation capabilities
  • Full-text search
  • TTL timeout collections
  • Collection-level concurrency, eventually document-level