Skip to content

Instantly share code, notes, and snippets.

@kbullaughey
Created April 29, 2012 03:29
Show Gist options
  • Star 0 You must be signed in to star a gist
  • Fork 0 You must be signed in to fork a gist
  • Save kbullaughey/2529968 to your computer and use it in GitHub Desktop.
Save kbullaughey/2529968 to your computer and use it in GitHub Desktop.
MongoDB Benchmarking - Speed of $in for looking up associations - a JOIN alternative.
// Benchmark to test how fast querying subsets of ratings fetching each rating
// in the set individually can be done
use benchmarking;
// For no particular reason, I'll sum up all the ratings
var sum = 0;
var cur = db.sets.find();
// Find user 1's ratings for all items in all sets
cur.forEach(function(set) {
set.items.forEach(function(item_id) {
var rating = db.ratings.findOne({user: 1, item_id: item_id});
sum += rating.rating;
});
});
print("rating sum: "+sum);
use benchmarking;
var num_items = 1000;
var num_sets = 400;
var items_per_set = 20;
var num_users = 10;
db.ratings.ensureIndex({user: 1, item_id: 1}, {unique: true});
var item_ids = [];
/* Create a list of items */
for (var i=0; i < num_items; i++) {
var x = {item: i};
db.items.save(x);
item_ids.push(x._id);
}
/* Create sets of these items */
for (var s=0; s < num_sets; s++) {
var set_items = [];
for (var i=0; i < items_per_set; i++) {
/* Pick a random item and push it into the list of IDs for this set */
var which = (s*items_per_set+i) % num_items;
set_items.push(item_ids[which]);
}
/* Put the set into the database */
var x = {set: s, items: set_items};
db.sets.save(x);
}
/* Create ratings for each user for each item */
for (var u=0; u < num_users; u++) {
var item_curr = db.items.find();
item_curr.forEach(function(item) {
/* Pick a random rating for this item and insert it */
var rating = Math.floor(Math.random()*5);
db.ratings.insert({user: u, item_id: item._id, rating: rating});
});
}
/* END */
// Benchmark to test how fast querying subsets of ratings using the 'in'
// operator can be done
use benchmarking;
// For no particular reason, I'll sum up all the ratings
var sum = 0;
var cur = db.sets.find();
// Find user 1's ratings for all items in all sets
cur.forEach(function(set) {
var res = db.ratings.find({user: 1, item_id: {$in: set.items}});
res.forEach(function(rating) { sum += rating.rating; });
});
print("rating sum: "+sum);
@kbullaughey
Copy link
Author

Suppose you have sets of items, and each item in each set has ratings by various people. This doesn't fit easily into a completely denormalized form. And without JOINs, I was worried about whether it would be slow to query the ratings for all the items of a set. Here I compare two means of querying such an arrangement.

I use the same data in each case, and the end result is the same. There is a collection of items. There is a collection of sets of items. Each set is a document that contains an array of ObjectIDs for items. Then there is a collection of ratings. Each rating is a document listing the person who gave the rating, the item the rating is for, and the rating itself. The question is how fast are various ways to query all the ratings for items in a particular set. In one case, individually.js, I query the ratings collection once for each item in each set. In the other case, I issue one query per set, providing an array, and using the $in operator to get all the ratings for the set of items. While the latter is certainly fewer queries, either way needs to find all the items. Will the one with fewer (but more complicated) queries be faster?

To run the benchmark, run the setup.js script using the Mongo console:

mongo < setup.js

This will use/create a database, benchmarking.

Then run the two benchmark test cases. The first is run as follows, using the time function to keep track of elapsed runtime:

time mongo < individually.js

And the other test case can be run as follows:

time mongo < using_in.js

On my macpro, I get 0.31s for the latter, which uses the $in operator, and 1.76s for the one that performs a query for each rating that is needed.

Thus, for associations like this, $in is clearly the better choice, performing nearly 6x faster. Also in absolute terms, the query is not slow. Each set gets the 20 associated ratings in less than 1ms.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment