Studies in MongoDB schema design - part 1

Anyone who has spent any time working with MongoDB knows that designing your document structure - your schema - is critical to application performance. Those who came from the relational database world soon realise that the old rules no longer apply: instead of modelling to "third normal form", we model our collections around the application object model and the queries that we need to optimize.

Data in a MongoDB database is represented as JSON (JavaScript Object Notation) documents, which are built up from a small set of very simple constructs: values, objects and arrays.

There are a lot of patterns that can be applied to MongoDB schema design - I'd recommend in particular The Little Mongo DB Schema Design Book and MongoDB Applied Design Patterns. There are also some good presentations available from MongoDB.

I'm not going to attempt to cover all or most of the possible design patterns, but I did want to compare two extremes of design from a query performance point of view. The two extremes are:

  • Embedding everything in a single document.
  • Linking collections using pointers to data in other collections. This is roughly equivalent to using a relational database's third normal form model.

There's a lot of room for compromise between the two approaches and a lot of non-performance related reasons for choosing one over the other (atomic updates and the 16M document limit for instance). Nevertheless, let's look at how the two extremes compare from a performance point of view - at least for a specific workload.

The Schema

I'm using the classic product-customer-orders-lineitem schema for this example. In the linked case, each of these entities gets its own collection. In a relational database, we'd diagram this design like this:

*[Figure: entity-relationship diagram for customers, orders, lineitems and products]*

So in this case we have four collections which look like this: https://gist.github.com/2f5c531bac8efbce888d9e0ac6eeaab7
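The gist spells out the exact structure; roughly, the four documents might look like this (the customerId, orderId and prodId join keys are my assumptions, not taken from the gist):

```js
// Sketch of the four linked collections (field names assumed)
db.customers.insertOne({ _id: 1, firstName: "Jane", lastName: "Doe",
                         email: "jane@example.com" });
db.products.insertOne({ _id: 100, productName: "Widget", price: 9.99 });
db.orders.insertOne({ _id: 1000, customerId: 1,          // -> customers._id
                      orderDate: ISODate("2018-02-01"), orderStatus: 0 });
db.lineitems.insertOne({ orderId: 1000, prodId: 100,     // -> orders, products
                         itemCount: 2 });
```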

In the embedded case we put absolutely everything into a single document. The result looks like this:

*[Figure: example of the fully embedded customer document]*
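In sketch form, using the same assumed field names, an embedded customer document has this shape:

```js
// Sketch of an embedded customer document - orders and their line items
// (including denormalized product details) nest inside the customer
db.customers.insertOne({
  _id: 1,
  firstName: "Jane", lastName: "Doe", email: "jane@example.com",
  orders: [
    { orderDate: ISODate("2018-02-01"),
      orderStatus: 0,
      lineItems: [
        { prodId: 100, productName: "Widget", price: 9.99, itemCount: 2 }
      ] }
  ]
});
```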

For my example, I had 1,000 customers, 1,000 products, 51,116 orders and 891,551 lineitems. I had the following indexes defined:

https://gist.github.com/0d782e22678739a5aa4d270c10a92a5c
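The gist has the exact definitions; given the queries that follow, a plausible set would be something along these lines:

```js
// Assumed index definitions supporting the queries below
db.customers.createIndex({ email: 1 });                  // customer lookup by email
db.customers.createIndex({ "orders.orderStatus": 1 });   // open orders, embedded model
db.orders.createIndex({ customerId: 1 });                // $lookup join to customers
db.orders.createIndex({ orderStatus: 1 });               // open orders, linked model
db.lineitems.createIndex({ orderId: 1 });                // $lookup join to orders
```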

Let's take a look at some typical operations that we might execute against these schemas.

Getting all data for a customer

It's very simple to get all the data for a customer when everything is embedded in a single document. We can retrieve it from the embedded version with a query like this:

https://gist.github.com/22137307d14f7db692cec677144f6dc1
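In sketch form, it's a single indexed find on the customer's email:

```js
// One document contains the customer, orders and line items
db.customers.find({ email: "jane@example.com" });
```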

With an index on email, it's almost instantaneous - 1,000 lookups took only 706ms.

Life is much harder with the fully normalized version. We need to use an aggregation or custom code to achieve the same result, and we need to be sure we have indexes on the $lookup join conditions. Here's the aggregation:

https://gist.github.com/234b5247c37377c6e8466e97797b3b5c
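A sketch of what that aggregation might look like, using the assumed customerId and orderId join keys:

```js
// Join customers -> orders -> lineitems to reassemble the customer's data
db.customers.aggregate([
  { $match: { email: "jane@example.com" } },
  { $lookup: { from: "orders", localField: "_id",
               foreignField: "customerId", as: "orders" } },
  { $unwind: "$orders" },
  { $lookup: { from: "lineitems", localField: "orders._id",
               foreignField: "orderId", as: "lineItems" } }
]);
```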

Not surprisingly, the aggregation/join takes way longer than the embedded equivalent: 1,000 lookups took 6,543 seconds.

Get all open orders

In a typical order processing scenario, we want to retrieve all the orders that are in an incomplete state. In our example, this is represented by orderStatus=0.

Of course, we can get customers with open orders like this: https://gist.github.com/27da9e6cef60456d53626bc8bee53d30
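That's a simple find on the embedded array - something like:

```js
// Customers that have at least one open (orderStatus = 0) order
db.customers.find({ "orders.orderStatus": 0 });
```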

That gives us all the customers with open orders, but if we just want the orders themselves, we are going to need to use the aggregation framework:

https://gist.github.com/bb4e862ef34a47032f4a93dd0349cab8
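In outline, the aggregation looks something like this:

```js
// Extract just the open orders from the embedded customer documents
db.customers.aggregate([
  { $match: { "orders.orderStatus": 0 } },   // narrows to candidate customers (indexed)
  { $unwind: "$orders" },
  { $match: { "orders.orderStatus": 0 } },   // filters the unwound orders themselves
  { $project: { _id: 0, orders: 1 } }
]);
```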

What's with the duplicate $match stages? The first $match gets us customers with open orders; the second gets us the orders themselves. We don't need the first to get the right results, but it does improve performance, since it can exploit an index - after we unwind, the indexed access path can no longer be used.

It's far easier to get the orders in the linked data model.

https://gist.github.com/695f3058ee297762c750b533d2fe9125
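It's a single indexed query - in sketch form:

```js
// Open orders come straight from the (indexed) orders collection
db.orders.find({ orderStatus: 0 });
```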

Not surprisingly, the simpler linked query gets the better performance: 60ms vs 5,397ms for the embedded aggregate query.

Top products

Most companies want to know their top products. For the embedded model, we need to unwind the lineitems and aggregate by product name:

https://gist.github.com/21c925324386cfb5b821d1985554d6cb
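Something like the following (the itemCount quantity field is an assumption on my part):

```js
// Top 10 products by quantity sold, from the embedded model
db.customers.aggregate([
  { $unwind: "$orders" },
  { $unwind: "$orders.lineItems" },
  { $group: { _id: "$orders.lineItems.productName",
              itemCount: { $sum: "$orders.lineItems.itemCount" } } },
  { $sort: { itemCount: -1 } },
  { $limit: 10 }
]);
```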

In the linked model we also need to aggregate, with $lookup joins between lineitems and products to get the product names:

https://gist.github.com/52e47499ed32e837b3da138f86fe70bb
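A sketch of that aggregation - note that the $limit comes before the $lookup, so only ten documents need to be joined:

```js
// Top 10 products in the linked model: aggregate first, join for names last
db.lineitems.aggregate([
  { $group: { _id: "$prodId", itemCount: { $sum: "$itemCount" } } },
  { $sort: { itemCount: -1 } },
  { $limit: 10 },
  { $lookup: { from: "products", localField: "_id",
               foreignField: "_id", as: "product" } }
]);
```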

Despite having to perform a join, the linked data model performs best. We only have to join after we have found the top 10 products, while in the embedded case we have to scan all of the data in the embedded collection. The linked case took 4,288ms compared with 9,369ms for the embedded case.

Inserting new orders

Next, I looked at inserting a new order for an existing customer. In the embedded case this is done with a single $push into the customer document:

https://gist.github.com/290ba3ce02a9f6579f68f4bb4aabb50b
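For example, with the order structure sketched earlier:

```js
// Append a new order, with its line items, to the customer document
db.customers.updateOne(
  { email: "jane@example.com" },
  { $push: { orders: {
      orderDate: new Date(), orderStatus: 0,
      lineItems: [ { prodId: 100, productName: "Widget",
                     price: 9.99, itemCount: 1 } ]
  } } }
);
```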

In the linked data model we have to insert into the lineitems collection and the orders collection:

https://gist.github.com/b1b93931ddfa2cdc1f5e881d85f3f883
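Roughly like this, assuming we generate the order's _id up front so the lineitems can reference it:

```js
// Insert the order first, then its line items
var orderId = ObjectId();
db.orders.insertOne({ _id: orderId, customerId: 1,
                      orderDate: new Date(), orderStatus: 0 });
db.lineitems.insertMany([
  { orderId: orderId, prodId: 100, itemCount: 1 },
  { orderId: orderId, prodId: 101, itemCount: 2 }
]);
```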

You might think that the single update would easily outperform the multiple inserts required by the linked model. But the update is actually quite an expensive operation - especially if there's not enough spare space to fit the grown document. The linked inserts - though more numerous - are simpler operations. As a result, the linked model still outperformed the embedded model in this example: 5,722ms vs 9,272ms for 500 order inserts.

Updating product prices

What if we want to update the price of a product? In the embedded case the product prices are embedded in the lineitems themselves. We can do this in a single operation in MongoDB 3.6 using the arrayFilters option:

https://gist.github.com/8ea3e78b5af113a59f37c0a9007b6669
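A sketch of such an update, with assumed prodId and price fields:

```js
// Set a new price wherever product 100 appears in embedded line items:
// $[] walks every order, $[item] picks out the matching line items
db.customers.updateMany(
  { "orders.lineItems.prodId": 100 },
  { $set: { "orders.$[].lineItems.$[item].price": 10.99 } },
  { arrayFilters: [ { "item.prodId": 100 } ] }
);
```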

Of course, in the linked model we can use a very simple update to the products collection:

https://gist.github.com/4c877dee754e9336233e1095edc9c804
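Something like:

```js
// One document holds the product's price in the linked model
db.products.updateOne({ _id: 100 }, { $set: { price: 10.99 } });
```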

The embedded model requires us to touch many more documents than the linked model does. As a result, 10 product price updates took 5,397ms vs only 12ms for the linked version.

Deletes

If we want to delete all the data for a single customer in the linked model, we need to iterate through the lineitems, orders and customers collections. The code would look something like this:

https://gist.github.com/1811e2f575c20cf022033fdbe6033788
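A sketch of that iteration, again using the assumed customerId and orderId keys:

```js
// Delete the customer's line items, then orders, then the customer itself
var cust = db.customers.findOne({ email: "jane@example.com" });
db.orders.find({ customerId: cust._id }).forEach(function (order) {
  db.lineitems.deleteMany({ orderId: order._id });
});
db.orders.deleteMany({ customerId: cust._id });
db.customers.deleteOne({ _id: cust._id });
```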

Of course, in the embedded case things are a lot easier:

https://gist.github.com/fca75543cfba17eb4722bd6b00f339a8
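In sketch form, one statement removes the customer and everything embedded within it:

```js
// Orders and line items live inside the customer document, so one delete suffices
db.customers.deleteOne({ email: "jane@example.com" });
```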

The linked example performs very poorly - 26,393ms vs 648ms for the embedded case. It might be possible to optimize the linked deletes by removing the orders and lineitems in bulk, but no matter what we do, we have over a thousand times more documents to delete than in the embedded case - it's always going to perform relatively poorly.

Summary and conclusion

Let's look at the relative performance of each of these examples graphically:

*[Figure: relative performance of the linked and embedded models for each operation]*

What we see is that while the embedded model is pretty good at fetching all the data for a single customer or for deleting a customer, it's not superior to the linked alternative in all cases.

The answer to the question "What is the best data model for my application?" is - and always has been - "it depends". In the old relational world we could do an almost mechanical mapping to third normal form as a starting point and then tweak it for the application workload. With a document database, we tend to start with the most embedded model possible and then pull things apart to reduce redundancy and improve performance. In this example we can see that the linked and embedded extremes each favour very different types of operations. Which model works best for you will depend on which operations are optimisation targets for your application.

And don't forget that in MongoDB you can implement hybrid models that embed a subset of data while linking to the full set. The classic example is a blog system that embeds the first page of comments while linking to subsequent pages.

Many things have changed in the world of databases since I wrote my first book on database performance 20 years ago. But the fundamentals of tuning have not changed - the only way to be certain that a data model is going to perform is to test it. Having a 'schema-free' database does not mean you can get away with a schema-free application, and it is still imperative to get the data model right if you want to achieve optimum application performance.
