Skip to content

Instantly share code, notes, and snippets.

@juanmaguitar
Last active February 21, 2018 03:40
Show Gist options
  • Save juanmaguitar/3ce152b0e7f2a4bad291ffcf81732887 to your computer and use it in GitHub Desktop.
Save juanmaguitar/3ce152b0e7f2a4bad291ffcf81732887 to your computer and use it in GitHub Desktop.
MongoDb notes

Data Modeling

The key challenge in data modeling is balancing the needs of the application, the performance characteristics of the database engine, and the data retrieval patterns. When designing data models, always consider the application usage of the data (i.e. queries, updates, and processing of the data) as well as the inherent structure of the data itself.

One-To-One relationships

"Employee : Resumeé" case

  • document employee -> id of resumee (another collection)
  • document resumee -> id of employee (another collection)
  • document employee w/ embedded resumee
  • document resumee w/ embedded employee

Which one decide?

Depends on:

  • Freq of access data
  • Size of items (growability)
  • Atomicity of data

One-To-Many relationships

Basic

One-to-Few

Ex: addresses for a person.

> db.person.findOne()
{
  name: 'Kate Monster',
  ssn: '123-456-7890',
  addresses : [
     { street: '123 Sesame St', city: 'Anytown', cc: 'USA' },
     { street: '123 Avenue Q', city: 'New York', cc: 'USA' }
  ]
}

One-to-Many

Ex: parts for a product in a replacement parts ordering system.

Each Part would have its own document:

> db.parts.findOne()
{
    _id : ObjectID('AAAA'),
    partno : '123-aff-456',
    name : '#4 grommet',
    qty: 94,
    cost: 0.94,
    price: 3.99
}

Each Product would have its own document, which would contain an array of ObjectID references to the Parts that make up that Product:

> db.products.findOne()
{
    name : 'left-handed smoke shifter',
    manufacturer : 'Acme Corp',
    catalog_number: 1234,
    parts : [     // array of references to Part documents
        ObjectID('AAAA'),    // reference to the #4 grommet above
        ObjectID('F17C'),    // reference to a different Part
        ObjectID('D2AA'),
        // etc
    ]

Use an Application-level join to retrieve the parts for a particular product:

 // Fetch the Product document identified by this catalog number
> product = db.products.findOne({catalog_number: 1234});

// Fetch all the Parts that are linked to this Product
> product_parts = db.parts.find({_id: { $in : product.parts } } ).toArray() ;

For efficient operation, you’d need to have an index on ‘products.catalog_number’.

One-to-Squillions

Ex: event logging system that collects log messages for different machines.

> db.hosts.findOne()
{
    _id : ObjectID('AAAB'),
    name : 'goofy.example.com',
    ipaddr : '127.66.66.66'
}

>db.logmsg.findOne()
{
    time : ISODate("2014-03-28T09:42:41.382Z"),
    message : 'cpu is on fire!',
    host: ObjectID('AAAB')       // Reference to the Host document
}

Application-level join to find the most recent 5,000 messages for a host:

// find the parent ‘host’ document
> host = db.hosts.findOne({ipaddr : '127.66.66.66'});  // assumes unique index

// find the most recent 5000 log message documents linked to that host
> last_5k_msg = db.logmsg.find({host: host._id}).sort({time : -1}).limit(5000).toArray()

Conclusions (basics)

Consider two factors:

  • Will the entities on the “N” side of the One-to-N ever need to stand alone?
  • What is the cardinality of the relationship: is it one-to-few; one-to-many; or one-to-squillions?

Based on these factors, you can pick one of the three basic One-to-N schema designs:

  • Embed the N side if the cardinality is one-to-few and there is no need to access the embedded object outside the context of the parent object (ex: comments in a blog post)
  • Use an array of references to the N-side objects if the cardinality is one-to-many or if the N-side objects should stand alone for any reasons (ex: authors of books)
  • Use a reference to the One-side in the N-side objects if the cardinality is one-to-squillions (ex: logging actions of events)

Advanced

Two-Way Referencing

Tasks & Persons system

db.person.findOne()
{
    _id: ObjectID("AAF1"),
    name: "Kate Monster",
    tasks [     // array of references to Task documents
        ObjectID("ADF9"), 
        ObjectID("AE02"),
        ObjectID("AE73") 
        // etc
    ]
}


db.tasks.findOne()
{
    _id: ObjectID("ADF9"), 
    description: "Write lesson plan",
    due_date:  ISODate("2014-04-01"),
    owner: ObjectID("AAF1")     // Reference to Person document
}

Drawback:

  • reassign the task to another person => you’ll have to update both the reference from the Person to the Task document, and the reference from the Task to the Person.

### Denormalizing With “One-To-Many” Relationships

Denormalizing from Many -> One

Version of the Product document without denormalization...

> db.products.findOne()
{
    name : 'left-handed smoke shifter',
    manufacturer : 'Acme Corp',
    catalog_number: 1234,
    parts : [     // array of references to Part documents
        ObjectID('AAAA'),    // reference to the #4 grommet above
        ObjectID('F17C'),    // reference to a different Part
        ObjectID('D2AA'),
        // etc
    ]
}

Denormalizing would mean that you don’t have to perform the application-level join when displaying all of the part names for the product, but you would have to perform that join if you needed any other information about a part.

> db.products.findOne()
{
    name : 'left-handed smoke shifter',
    manufacturer : 'Acme Corp',
    catalog_number: 1234,
    parts : [
        { id : ObjectID('AAAA'), name : '#4 grommet' },         // Part name is denormalized
        { id: ObjectID('F17C'), name : 'fan blade assembly' },
        { id: ObjectID('D2AA'), name : 'power switch' },
        // etc
    ]
}

This new structure add just a bit more of client-side work to the application-level join:

// Fetch the product document
> product = db.products.findOne( { catalog_number: 1234 } );  

// Create an array of ObjectID()s containing *just* the part numbers
> part_ids = product.parts.map( part => part.id );

// Fetch all the Parts that are linked to this Product
> product_parts = db.parts.find({_id: { $in : part_ids } } ).toArray() ;

For example: assume the part name changes infrequently, but the quantity on hand changes frequently. This means that while it makes sense to denormalize the part name into the Product document, it does not make sense to denormalize the quantity on hand.

Denormalizing from One -> Many

> db.parts.findOne()
{
    _id : ObjectID('AAAA'),
    partno : '123-aff-456',
    name : '#4 grommet',
    product_name : 'left-handed smoke shifter',   // Denormalized from the ‘Product’ document
    product_catalog_number: 1234,                     // Ditto
    qty: 94,
    cost: 0.94,
    price: 3.99
}

The same considerations than the previous one but a bit worse

Denormalizing With “One-To-Squillions” Relationships

> db.logmsg.findOne()
{
    time : ISODate("2014-03-28T09:42:41.382Z"),
    message : 'cpu is on fire!',
    ipaddr : '127.66.66.66',
    host: ObjectID('AAAB')
}

In fact, if there’s only a limited amount of information you want to store at the “one” side, you can denormalize it ALL into the “squillions” side and get rid of the “one” collection altogether:

> db.logmsg.findOne()
{
    time : ISODate("2014-03-28T09:42:41.382Z"),
    message : 'cpu is on fire!',
    ipaddr : '127.66.66.66',
    hostname : 'goofy.example.com',
}

You can also denormalize into the “one” side. Lets say you want to keep the last 1000 messages from a host in the 'hosts’ document. You could use the $each / $slice functionality introduced in MongoDB 2.4 to keep that list sorted, and only retain the last 1000 messages:

//  Get log message from monitoring system
logmsg = get_log_msg();
log_message_here = logmsg.msg;
log_ip = logmsg.ipaddr;

// Get current timestamp
now = new Date()

// Find the _id for the host I’m updating
host_doc = db.hosts.findOne({ ipaddr : log_ip }, { _id:1 });  // Don’t return the whole document
host_id = host_doc._id;

// Insert the log message, the parent reference, and the denormalized data into the ‘many’ side

db.logmsg.save({
    time : now, 
    message : log_message_here, 
    ipaddr : log_ip, 
    host : host_id ) 
});
// Push the denormalized log message onto the ‘one’ side
db.hosts.update( 
    { _id: host_id }, 
    {
        $push : { 
            logmsgs : { 
                $each:  [ 
                    { 
                        time : now, message : 
                        log_message_here 
                    } 
                ],
                $sort:  { time : 1 },  
                // Only keep the latest ones 
                $slice: -1000 // Only keep the latest 1000
            }
         }
    } 
);

Conclussions

Denormalization allows you to avoid some application-level joins, at the expense of having more complex and expensive updates. Denormalizing one or more fields makes sense if those fields are read much more often than they are updated.

Benefits of embedding

  • Improved read performance
  • One Round trip to the DB

When to denormalize

1:1 → Embed 1:Many → Embed (from the many to the one) many:many → Link

MongoDB Notes

Clear Ideas

  • MongoDB follows an application driven schema
  • MongoDB supports rich documents (pre join / embed data)
  • "No Mongo joins" (... but there are ways to have them)
  • No constraints
  • Atomic operations
  • No declared schema (... but it can have one w/ Mongoose)

Import Data

Import (restore) from dumped data (BSON) [1]

Dumped data are files w/ .bson and .metadata.bson extensions, for example:

movieDetails.bson
movieDetails.metadata.json
movies.bson
movies.metadata.json
moviesScratch.bson
moviesScratch.metadata.json
reviews.bson
reviews.metadata.json
system.indexes.bson

To restore this data we use:

mongorestore --drop -d <database-name> <directory-of-dumped-backup>

mongorestore --drop -d movies video

Import from JSON

mongoimport -d crunchbase -c companies companies.json

Create companies collection in the crunchbase database and import documents in companies.json

Operations

Switch to DB

use video;

Collection

db.myCollection.create()
db.myCollection.drop()

Insert into DB

db.movies.insertOne({ "title": "Jaws", "year": 1975, "imdb": "tt0073195" });
db.movies.insertOne({ "title": "Mad Max 2: The Road Warrior", "year": 1981, "imdb": "tt0082694" })
db.movies.insertOne({ "title": "Raiders of the Lost Ark", "year": 1981, "imdb": "tt0082971" })

Specifying ID...

db.moviesScratch.insertOne({ "_id": "tt0075148", "title": "Rocky", "year": "1976" });

Select from DB

db.movies.find()
db.movies.find({})

db.movies.find({}).pretty()
db.movies.find({ 'title': 'Jaws' }).pretty()
db.movies.find({ 'year': 1981 }).pretty()

Cursors

Find returns a cursor

var c = db.movies.find()
c.objsLeftInBatch(); // 3
c.hasNext() // true
c.next()
c.next()
c.next()
c.objsLeftInBatch(); // 0
c.hasNext() // false

In the console it returns the first 20 results

db.movies.find()

Projections

db.movies.find({ rated: "PG-13"}, {title: 1}).pretty()
db.movies.find({ rated: "PG-13"}, {title: 1, _id: 0}).pretty()
db.movies.find({ rated: "PG-13"}, {writers: 0, actors: 0}).pretty()

Creating Documents

insertOne() & insertMany()

db.moviesScratch.insertOne({ "title": "Rocky", "year": "1976", "imdb": "tt0075148"});

db.moviesScratch.insertMany(
    [
        {
        "imdb" : "tt0084726",
        "title" : "Star Trek II: The Wrath of Khan",
        "year" : 1982,
        "type" : "movie"
        },
        {
        "imdb" : "tt0796366",
        "title" : "Star Trek",
        "year" : 2009,
        "type" : "movie"
        }
    ]
)

Reading Documents

> db.movies.find({ rated: "PG-13"}).count()
152
> db.movies.find({ rated: "PG-13", year: 2009 }).count()
8

> db.movieDetails.find({ "tomato.meter": 100 }).pretty()

Exact match

> db.movieDetails.find({ "writers": ["Ethan Cohen", "Joel Cohen"] }).count()

Any match in the array

> db.movieDetails.find({ "actors": "Jeff Bridges" }).count()

First value in the array

> db.movieDetails.find({ "actors.0": "Jeff Bridges" }).count()

Comparison Operators

db.movieDetails.find({ runtime: { $gt: 90 } }).count()
db.movieDetails.find({ runtime: { $gt: 90, $lt: 120 } }).count()
db.movieDetails.find({ "tomato.meter": { $gte: 95 }, runtime: { $gt: 180 } })
db.movieDetails.find({ rated: { $ne: "UNRATED" } }).count()
db.movieDetails.find({ rated: { $in: ["G", "PG"] } }).pretty()

schema

  • MMAP
  • Wired Tiger

### Indexes

Get Indexes

db.movies.getIndexes()

Create Index

db.members.createIndex( { "user_id": 1 } )

Create Unique Index

db.members.createIndex( { "user_id": 1 }, { unique: true } )
db.students.createIndex( { "student_id": 1, "class_id":1 }, { unique: true } )

Sparse Indexes (Unique index admitting null values)

db.students.createIndex( { "student_id": 1, "class_id":1 }, { unique: true, sparse: true } )

Foreground & Background (for creation)

  • Foreground
    • Fast
    • Blocks Writers & Readers in the database
  • Background
    • Slow
    • Don't block Readers/Writers
    db.members.createIndex( { "user_id": 1 }, { background: true } )

Explain

db.foo.explain() // returns an "explainable object"

db.foo.explain().find()
                .update()
                .remove()
                .aggregate()
                ....
                .help()

var exp = db.example.explain()
var exp = db.example.explain("executionStats")
var exp = db.example.explain("allPlainsExecution")

Covered Queries

Can be covered entirely by an index → 0 documents need to be inspected to get a result

With the following indexes

{ name : 1, dob : 1 }
{ _id : 1 }
{ hair : 1, name : 1 }

The folowing queriesare "covered" (can be returned just w/ the indexes)

db.example.find( { name : { $in : [ "Bart", "Homer" ] } }, {_id : 0, dob : 1, name : 1} )

Stats

db.students.stats()
db.students.totalIndexSize()

GeoSpatial Indexes

loc : [ x , y ]

db.restaurants.createIndex( { location : "2d" } )
db.restaurants.createIndex( { location : "2d", type: 1 } )

db.restaurants.find({ location: { $near: [x,y] }}).limit(20)
db.places.find({ location: { $near: [74,140] }}).limit(3)

GeoSpatial Spherical

GeospacialSpherical

loc : [ <longitude> , <latitude> ]
loc : { lng : <longitude> , lat : <latitude> }

Query Performance [1]

getIndexes()

> db.movies.getIndexes()
[
  {
    "v": 1,
    "key": {
      "_id": 1
    },
    "name": "_id_",
    "ns": "movies.movies"
  }
]
> db.movies.find({ year : { $gt : 1980, $lt: 1985 } }).count()
104


> db.movies.find({ year : { $gt : 1980, $lt: 1985 } }).explain()
{
  "queryPlanner": {
    "plannerVersion": 1,
    ...,
    "winningPlan": {
      "stage": "COLLSCAN",
      ...
    },
    ...
  },
  ...
}

Query w/o index

explain("executionStats")

> db.movies.find({ year : { $gt : 1980, $lt: 1985 } }).explain("executionStats")
{
  "queryPlanner": {
    "plannerVersion": 1,
    ...,
    "winningPlan": {
      "stage": "COLLSCAN",
      ...
    },
    ...
  },
  "executionStats": {
    "executionSuccess": true,
    "nReturned": 104,
    "executionTimeMillis": 3,
    "totalKeysExamined": 0,
    "totalDocsExamined": 3365,
    "executionStages": {
      "stage": "COLLSCAN",
      ...
  },
  ...
}
  • queryPlanner.winningPlan.stage displays COLLSCAN to indicate a collection scan.
  • executionStats.nReturned displays 104 to indicate that the query matches and returns 104 documents.
  • executionStats.totalDocsExamined display 3365 to indicate that MongoDB had to scan 3365 documents (i.e. all documents in the collection) to find the three matching documents.
    > db.movies.find({}).count()
    3365

createIndex

> db.movies.createIndex( { year: 1 } )
{
  "createdCollectionAutomatically": false,
  "numIndexesBefore": 1,
  "numIndexesAfter": 2,
  "ok": 1
}

Query w/ index

> db.movies.find({ year : { $gt : 1980, $lt: 1985 } }).explain("executionStats")
{
  "queryPlanner": {
    "plannerVersion": 1,
    ...,
    "winningPlan": {
      "stage": "FETCH",
      "inputStage": {
        "stage": "IXSCAN",
        "keyPattern": {
          "year": 1
        },
        ...
      }
    },
    ...
  },
  "executionStats": {
    "executionSuccess": true,
    "nReturned": 104,
    "executionTimeMillis": 5,
    "totalKeysExamined": 104,
    "totalDocsExamined": 104,
    "executionStages": {
      ...
    }
  },
  ...
}
  • queryPlanner.winningPlan.inputStage.stage displays IXSCAN to indicate index use.
  • executionStats.nReturned displays 104 to indicate that the query matches and returns 104 documents.
  • executionStats.totalKeysExamined display 104 to indicate that MongoDB scanned 104 index entries.
  • executionStats.totalDocsExamined display 104 to indicate that MongoDB scanned 104 documents.

dropIndex

> db.movies.dropIndex( { year: 1 } )
{
  "nIndexesWas": 2,
  "ok": 1
}
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment