The key challenge in data modeling is balancing the needs of the application, the performance characteristics of the database engine, and the data retrieval patterns. When designing data models, always consider the application usage of the data (i.e. queries, updates, and processing of the data) as well as the inherent structure of the data itself.
- document employee -> id of resumee (another collection)
- document resumee -> id of employee (another collection)
- document employee w/ embedded resumee
- document resumee w/ embedded employee
Which one decide?
Depends on:
- Freq of access data
- Size of items (growability)
- Atomicity of data
Ex: addresses for a person.
> db.person.findOne()
{
name: 'Kate Monster',
ssn: '123-456-7890',
addresses : [
{ street: '123 Sesame St', city: 'Anytown', cc: 'USA' },
{ street: '123 Avenue Q', city: 'New York', cc: 'USA' }
]
}
Ex: parts for a product in a replacement parts ordering system.
Each Part would have its own document:
> db.parts.findOne()
{
_id : ObjectID('AAAA'),
partno : '123-aff-456',
name : '#4 grommet',
qty: 94,
cost: 0.94,
price: 3.99
}
Each Product would have its own document, which would contain an array of ObjectID references to the Parts that make up that Product:
> db.products.findOne()
{
name : 'left-handed smoke shifter',
manufacturer : 'Acme Corp',
catalog_number: 1234,
parts : [ // array of references to Part documents
ObjectID('AAAA'), // reference to the #4 grommet above
ObjectID('F17C'), // reference to a different Part
ObjectID('D2AA'),
// etc
]
Use an Application-level join to retrieve the parts for a particular product:
// Fetch the Product document identified by this catalog number
> product = db.products.findOne({catalog_number: 1234});
// Fetch all the Parts that are linked to this Product
> product_parts = db.parts.find({_id: { $in : product.parts } } ).toArray() ;
For efficient operation, you’d need to have an index on ‘products.catalog_number’.
Ex: event logging system that collects log messages for different machines.
> db.hosts.findOne()
{
_id : ObjectID('AAAB'),
name : 'goofy.example.com',
ipaddr : '127.66.66.66'
}
>db.logmsg.findOne()
{
time : ISODate("2014-03-28T09:42:41.382Z"),
message : 'cpu is on fire!',
host: ObjectID('AAAB') // Reference to the Host document
}
Application-level join to find the most recent 5,000 messages for a host:
// find the parent ‘host’ document
> host = db.hosts.findOne({ipaddr : '127.66.66.66'}); // assumes unique index
// find the most recent 5000 log message documents linked to that host
> last_5k_msg = db.logmsg.find({host: host._id}).sort({time : -1}).limit(5000).toArray()
Consider two factors:
- Will the entities on the “N” side of the One-to-N ever need to stand alone?
- What is the cardinality of the relationship: is it one-to-few; one-to-many; or one-to-squillions?
Based on these factors, you can pick one of the three basic One-to-N schema designs:
- Embed the N side if the cardinality is one-to-few and there is no need to access the embedded object outside the context of the parent object (ex: comments in a blog post)
- Use an array of references to the N-side objects if the cardinality is one-to-many or if the N-side objects should stand alone for any reasons (ex: authors of books)
- Use a reference to the One-side in the N-side objects if the cardinality is one-to-squillions (ex: logging actions of events)
Tasks & Persons system
db.person.findOne()
{
_id: ObjectID("AAF1"),
name: "Kate Monster",
tasks [ // array of references to Task documents
ObjectID("ADF9"),
ObjectID("AE02"),
ObjectID("AE73")
// etc
]
}
db.tasks.findOne()
{
_id: ObjectID("ADF9"),
description: "Write lesson plan",
due_date: ISODate("2014-04-01"),
owner: ObjectID("AAF1") // Reference to Person document
}
Drawback:
- reassign the task to another person => you’ll have to update both the reference from the Person to the Task document, and the reference from the Task to the Person.
### Denormalizing With “One-To-Many” Relationships
Version of the Product document without denormalization...
> db.products.findOne()
{
name : 'left-handed smoke shifter',
manufacturer : 'Acme Corp',
catalog_number: 1234,
parts : [ // array of references to Part documents
ObjectID('AAAA'), // reference to the #4 grommet above
ObjectID('F17C'), // reference to a different Part
ObjectID('D2AA'),
// etc
]
}
Denormalizing would mean that you don’t have to perform the application-level join when displaying all of the part names for the product, but you would have to perform that join if you needed any other information about a part.
> db.products.findOne()
{
name : 'left-handed smoke shifter',
manufacturer : 'Acme Corp',
catalog_number: 1234,
parts : [
{ id : ObjectID('AAAA'), name : '#4 grommet' }, // Part name is denormalized
{ id: ObjectID('F17C'), name : 'fan blade assembly' },
{ id: ObjectID('D2AA'), name : 'power switch' },
// etc
]
}
This new structure add just a bit more of client-side work to the application-level join:
// Fetch the product document
> product = db.products.findOne( { catalog_number: 1234 } );
// Create an array of ObjectID()s containing *just* the part numbers
> part_ids = product.parts.map( part => part.id );
// Fetch all the Parts that are linked to this Product
> product_parts = db.parts.find({_id: { $in : part_ids } } ).toArray() ;
For example: assume the part name changes infrequently, but the quantity on hand changes frequently. This means that while it makes sense to denormalize the part name into the Product document, it does not make sense to denormalize the quantity on hand.
> db.parts.findOne()
{
_id : ObjectID('AAAA'),
partno : '123-aff-456',
name : '#4 grommet',
product_name : 'left-handed smoke shifter', // Denormalized from the ‘Product’ document
product_catalog_number: 1234, // Ditto
qty: 94,
cost: 0.94,
price: 3.99
}
The same considerations than the previous one but a bit worse
> db.logmsg.findOne()
{
time : ISODate("2014-03-28T09:42:41.382Z"),
message : 'cpu is on fire!',
ipaddr : '127.66.66.66',
host: ObjectID('AAAB')
}
In fact, if there’s only a limited amount of information you want to store at the “one” side, you can denormalize it ALL into the “squillions” side and get rid of the “one” collection altogether:
> db.logmsg.findOne()
{
time : ISODate("2014-03-28T09:42:41.382Z"),
message : 'cpu is on fire!',
ipaddr : '127.66.66.66',
hostname : 'goofy.example.com',
}
You can also denormalize into the “one” side. Lets say you want to keep the last 1000 messages from a host in the 'hosts’ document. You could use the $each / $slice functionality introduced in MongoDB 2.4 to keep that list sorted, and only retain the last 1000 messages:
// Get log message from monitoring system
logmsg = get_log_msg();
log_message_here = logmsg.msg;
log_ip = logmsg.ipaddr;
// Get current timestamp
now = new Date()
// Find the _id for the host I’m updating
host_doc = db.hosts.findOne({ ipaddr : log_ip }, { _id:1 }); // Don’t return the whole document
host_id = host_doc._id;
// Insert the log message, the parent reference, and the denormalized data into the ‘many’ side
db.logmsg.save({
time : now,
message : log_message_here,
ipaddr : log_ip,
host : host_id )
});
// Push the denormalized log message onto the ‘one’ side
db.hosts.update(
{ _id: host_id },
{
$push : {
logmsgs : {
$each: [
{
time : now, message :
log_message_here
}
],
$sort: { time : 1 },
// Only keep the latest ones
$slice: -1000 // Only keep the latest 1000
}
}
}
);
Denormalization allows you to avoid some application-level joins, at the expense of having more complex and expensive updates. Denormalizing one or more fields makes sense if those fields are read much more often than they are updated.
- Improved read performance
- One Round trip to the DB
1:1 → Embed 1:Many → Embed (from the many to the one) many:many → Link