Skip to content

Instantly share code, notes, and snippets.

@boutell
Created February 12, 2012 21:51
Show Gist options
  • Save boutell/1811024 to your computer and use it in GitHub Desktop.
Save boutell/1811024 to your computer and use it in GitHub Desktop.
Executing the same query in MongoDB and MySQL
I'm going to examine a plausible, nontrivial query that comes up for client sites and
consider how it would be implemented in both MongoDB and a MySQL back end that attempts
to support a lot of the capabilities we like in MongoDB.
I'm not going to look at how we retrieve the inner contents of a page (nested contents)
or information about subpages (related contents) because I have a pretty good idea how
we want to cope with those bits in both cases.
The query syntax here is just pseudocode, I'm not proposing it:
(published_at > :now) AND ((category_7 == 1) OR (tag_count > 2)) ORDER BY published_at
Note the use of category_7 as the property we check for to see if a content object has a
particular category. We do this because we don't want to assume a backend that supports
queries on deep structures and/or understands arrays. Similarly tag_count is something we
maintain separate from setting tag_1 and tag_7. We can provide helpers to make these things
more transparent.
What do implementations look like in MongoDB and MySQL, assuming that we
want all implementations to support the loosey-gooseyness of MongoDB as much
as practical?
The MongoDB implementation is:
db.content.find(
{ $and: [
{ published_at: {$gt: now} },
{ $or: [ { category_7: 1}, { tag_count: { $gt: 2 } } ] }
] })
.sort({ published_at: 1 });
Now let's look at the MySQL implementation. Let's assume that we don't have columns for all
of these things in the content table, because we're not requiring people to declare everything
in advance, and because there could be potentially hundreds of properties for different content types.
So we use the entity-attribute-value pattern. Each content object has a row in a content table
and we join that to an attr table which stores all the properties.
It turns out I can automatically translate the whole thing into a series of left joins and a 'where'
clause. It's straightforward even if clauses are nested as in this query.
The first left join is intended to actually return the properties of the chosen objects. The rest
are used to implement the query:
select c.*, a.* from content c left join attr a on c.id = a.content_id
left join attr b on c.id = a.content_id and b.name = 'published_at' and b.value > :now
left join attr c on c.id = a.content_id and c.name = 'category:7' and c.value = 1
left join attr d on d.id = a.content_id and d.name = 'tag_count' and d.value > 2
where (b.id and c.id and (c.id or d.id));
... But for performance we should use INNER JOIN wherever we can to avoid
trolling through more rows than necessary. If the query is simple like this:
(color == 'blue')
Or part of it descends strictly via AND clauses from the main query:
(published_at > :now) AND ((category_7 == 1) OR (tag_count > 2))
^^^ this part
Then that part of the query can be an inner join. Checking it in the 'where' is then
redundant although not harmful:
select c.*, a.* from content c left join attr a on c.id = a.content_id
inner join attr b on c.id = a.content_id and b.name = 'published_at' and b.value > :now
left join attr c on c.id = a.content_id and c.name = 'category:7' and c.value = 1
left join attr d on d.id = a.content_id and d.name = 'tag_count' and d.value > 2
where (b.id and c.id and (c.id or d.id))
I should be able to do an ORDER BY clause here to implement the sort:
ORDER BY b.value
I suspect it's best to put the final join for all the attributes of the
"winning" objects at the end to help MySQL understand what optimizations to make:
select c.*, a.* from content c
inner join attr b on c.id = a.content_id and a.name = 'published_at' and a.value > :date
left join attr c on c.id = a.content_id and c.name = 'category:7' and a.value = 1
left join attr d on d.id = a.content_id and d.name = 'tag_count' and a.value > 2
left join attr a on c.id = a.content_id
where (b.id and c.id and (c.id or d.id));
We should also support IN clauses because each IN clause only needs a single join
whereas a big dumb OR looking for various ids would generate many joins (without really
excessive cleverness anyway).
One big concern is that MySQL is case insensitive, while MongoDB is case sensitive.
This runs pretty deep and you can't change it in MongoDB - you have to explicitly
store lowercase versions of things for sorting purposes if you want to sort and
search that way (which we mostly do). It is possible to force MySQL to be case
sensitive by setting the utf8_general_cs collation (UTF8, General, Case Sensitive).
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment