Skip to content

Instantly share code, notes, and snippets.

@aliev
Forked from akaariai/join_promotion
Created April 2, 2018 04:00
Show Gist options
  • Star 1 You must be signed in to star a gist
  • Fork 0 You must be signed in to fork a gist
  • Save aliev/d674bb3c307da5430890e2152877444b to your computer and use it in GitHub Desktop.
Save aliev/d674bb3c307da5430890e2152877444b to your computer and use it in GitHub Desktop.
Django join promotion
=========================
Join promotion in the ORM
=========================
[NOTE: We need better terms than promote and demote for changing the join
type. These terms are extremely easy to mix up. Maybe the ORM methods could
be to_inner_joins and to_louter_joins instead of promote_joins and demote_joins?
I tried to clean up the mis-usages of promotion/demotion but there could still
be some cases where these are mixed up]
Join promotion refers to changing inner joins to left outer joins.
Similarly, join demotion is the action of changing left joins to inner joins.
The reason Django ORM uses inner joins is that they are much more performant
(the difference can be orders of magnitude). There is no other reason to
demote joins to inner joins. That is, when using the Django ORM, results will
always be correct even when all joins are set to left outer joins.
Join demotion is done in case the ORM can prove the results will be correct even
when inner joins are used. We can always demote any join that must have a value
on the remote side of the join (that is, join along non-nullable foreign key).
We will also demote joins when filtering in such a way that the filter can only
match when the value is non-null.
Lets assume we have the following models::
class Book(models.Model):
title = models.TextField()
alias = models.TextField(null=True)
author = models.ForeignKey(Author)
class Author(models.Model):
name = models.TextField()
favourite_book = models.ForeignKey(Book, null=True)
first_book = models.ForeignKey(Book, null=True)
When doing `Author.objects.filter(favourite_book__title='Foo')` we *can* use an
inner join. The foreign key is nullable, but if a given author has no favourite
book, its title naturally can't be 'Foo'.
On the other hand, when doing `Author.objects.filter(favourite_book__alias__isnull=True)`
we *must* use a left outer join - both those authors who have no favourite book, and
those author's whose favourite book has no alias must mach. (If you want only authors
who have a favourite book, but the books alias is null, then you must add
favourite_book__isnull=False to the filter().)
Things get a bit more complicated when OR combined queries are considered. For example,
when doing `Author.objects.filter(Q(favourite_book__title='Foo') | Q(first_book__title='Bar')),
we must use left outer joins. Why? Lets explain this with some example cases.
First case is an Author that has a favourite_book with title 'Foo', and no first_book at
all (the author isn't actually an author yet...). If we were to use inner joins, then the
join to first_book would remove the author row from results before filtering was done (this
is simply because inner joins remove any rows from the result set that have no mathing row
for the join). As the second case we can consider an author with first_book with title
'Bar', but no favourite book - again the inner join would remove the result even before the
WHERE clause of the query gets applied.
One the other hand, when doing Q(favourite_book__title='Foo') | Q(favourite_book__title='Bar') we
can use an inner join. The reason is that if an author has no favourite
book, then neither of the above filters can match. Still, if the query was
Q(favourite_book__title='Foo') | Q(favourite_book__title__isnull=True), then we must
again use left outer join, as otherwise the latter condition would miss those cases
where the title is null because the author has no favourite book.
When doing annotations, it's notable that we must use left outer joins for any
new join created for an annotation (keeping in mind that we never demote joins for relations
that must always have results). Otherwise adding an annotation could remove
results by the inner join it creates.
Also, after we have decided a given join must be a left join, all joins "starting" from that
join must be left joins, too. The reason is again that an inner join, even after an louter
join will remove results. So, in a query like::
SELECT *
FROM author
LEFT JOIN book ON author.favourite_book_id = author.id
INNER JOIN author ON book.author_id = author.id
the inner join effectively changes the left join to book into an inner join.
Currently, the only lookup that can match null values is isnull. However, it seems plausible
that custom lookups can be used to create other lookups that match null values. This will be
immediately obvious when expressions are allowed in the filter() clause. For example,
.filter(Equal(Coalesce('favourite_book__title', 'first_book__title'), 'Foo')) must use
left outer joins for the joins, as otherwise an author that has favourite_book with title
'Foo', but no first book would not be matched.
Currently, when doing .annotate(coalesced=Coalesce('favourite_book__title', 'first_book__title')).filter(colesced='Foo')
we will use left outer joins. This is because any annotation will use left outer joins for any
new joins added to the query, and we don't currently do join demotion when filtering on
annotations.
How do we decide which joins to promote in sql.Query?
=====================================================
In Django, each individual filter clause is built with sql.Query.build_filter(). The build_filter()
method returns the lookup object to be added to the query, and it also returns the joins it
would prefer to be inner joins (I just realized I have made a big mistake in naming the return
value "needed_inner", when in fact it should be something like "joins_that_can_be_inner").
The _add_q() method then uses a JoinPromoter object to count "votes" for inner joins. It does this
for each Q-object separately. The way this works is that each child of given Q-object is first
processed by build_filter, then the votes are added to the JoinPromoter.
The JoinPromoter has logic for deciding if a join should be promoted to louter join, or if a join
should be demoted to an inner join.
The join promoter needs votes from all children of a Q object, and it also needs to know how many
children the Q object has, and the connector of the Q object. Still, it needs to know
if we are dealing with a negated lookup (as NOT AND == OR for join promotion [TODO: add reason
why]).
What does the JoinPromoter do? For the OR case, it demotes all joins where each of the Q
object's children reported that that join can be inner join. As we saw before in the example
cases, when all child clauses use the same join, and all child clauses prefer that join to
be an inner join (ie none of the clauses are interested in null values), then those joins
can be demoted to inner joins. If there are less votes than child clauses for given join,
then the promoter decides that the joins need to be promoted. Finally those joins that
have no votes at all (that is, they weren't used by the Q-object the promoter is dealing
with), then that join will be left as is.
So, getting back to our two original examples. Case `Q(favourite_book__title='Foo') | Q(first_book__title='Bar')`.
The build_filter() for the first condition reports that the favourite_book join can be an inner join,
and for the second condition similarly the first_book join can be an inner join. But, as this is an
OR filter, and both of the joins have only one vote, we decide that both of those joins must
be left outer joins.
On the other hand, for the case `Q(favourite_book__title='Foo') | Q(favourite_book__title='Bar')`,
both filters report that favourite_book can be an inner join. The join promoter sees there are
as many votes for favourite_book join to be inner join as there are conditions, and thus it decides
the join can be demoted to inner join.
[To be continued later on...]
Other notes:
- When ANDin (or NOT ORing) conditions, a single vote for inner join is enough to turn the
join into inner join. If a given clause of ANDed or NOT ORed condition does not match,
then the whole clause can not match.
- Each Q-object reports the joins is prefers to be inner joins separately to its parent, and
Q-object children are treated just like build_filter() children.
- So, for example (Q(favourite_book__title='Foo') | Q(first_book__title='Bar')) & Q(favourite_book__title__icontains='o')
will produce an inner join for favourite_book. First, join promoter handles the ORed children
of the whole Q-object as we have seen above. So, both joins are promoted to left outer
joins, and that children reports that no joins are can be demoted to inner joins (from
that child's perspective). Next, the icontains lookup is handled, and that lookup
says that it prefers the join to be inner join. And, as it is AND connected, a single
vote for inner join is enough to demote the join. Note that this is correct, as if
the icontain lookup gets a null value, that lookup can't match.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment