koniiiik/proposal.rst

## proposal.rst

      
    Raw
  

              proposal.rst
            
          
    Composite Fields, vol. 2


Abstract

Two years ago, as part of GSoC 2011, I started working on an
implementation of composite fields as a means to support multi-column
primary keys in models. While the project was successful, the
timeframe wasn't sufficient to finish it and get it into a state where it
could be merged into Django. This year I propose to take the last
remaining steps for this project to be ready and hopefully even extend it
with some extra features which were left out of the initial project.

Aim of this project

Since I started working on this two years ago, I managed (with the help of
a few people) to get most of the hard work done. The biggest part that I
didn't get around to implementing was support in the ORM for multi-column
joins. This has, however, been implemented recently by Jeremy Tillman and
Anssi, which means there are only a few things left to be done.
First of all, the code sitting in my github repo is badly out of date,
which means it needs to be updated to the current state of master. While
I'm at it, I also want to clean up the revision history to get it as close
to a state where it could be just reviewed and merged directly as
possible.
Beside this, there are only the juicier features left that we initially
wanted to leave unsupported and get back to them later. Those are the
following (in no particular order):

GenericForeignKey support for CompositeField
more intelligent handling of __in lookups
database instrospection and inspectdb
detection of a change of the primary key in model instances


Summary of CompositeField

This section summarizes the basic API as established in the proposal for
GSoC 2011 [1].
A CompositeField requires a list of enclosed regular model fields as
positional arguments, as shown in this example:
class SomeModel(models.Model):
    first_field = models.IntegerField()
    second_field = models.CharField(max_length=100)
    composite = models.CompositeField(first_field, second_field)

The model class then contains a descriptor for the composite field, which
returns a CompositeValue which is a customized namedtuple, the
descriptor accepts any iterable of the appropriate length. An example
interactive session:
>>> instance = new SomeModel(first_field=47, second_field="some string")
>>> instance.composite
CompositeObject(first_field=47, second_field='some string')
>>> instance.composite.first_field
47
>>> instance.composite[1]
'some string'
>>> instance.composite = (74, "other string")
>>> instance.first_field, instance.second_field
(74, 'other string')

CompositeField supports the following standard field options:
unique, db_index, primary_key. The first two will simply add a
corresponding tuple to model._meta.unique_together or
model._meta.index_together. Other field options don't make much sense
in the context of composite fields.
Supported QuerySet filters will be exact and in. The former
should be clear enough, the latter is elaborated in a separate section.
It will be possible to use a CompositeField as a target field of
ForeignKey, OneToOneField and ManyToManyField. This is
described in more detail in the following section.

Changes in ForeignKey

Currently ForeignKey is a regular concrete field which manages both
the raw value stored in the database and the higher-level relationship
semantics. Managing the raw value is simple enough for simple
(single-column) targets. However, in the case of a composite target field,
this task becomes more complex. The biggest problem is that many parts of
the ORM work under the assumption that for each database column there is a
model field it can assign the value from the column to. While it might be
possible to lift this restriction, it would be a really complex project by
itself.
On the other hand, there is the abstraction of virtual fields working on
top of other fields which is required for this project anyway. The way
forward would be to use this abstraction for relationship fields.
Currently, ForeignKey (and by extension OneToOneField) is the only
field whose name and attname differ, where name stores the
value dictated by the semantics of the field and attname stores the
raw value from the database.
We can use this to our advantage and put an auxiliary field into the
attname of each ForeignKey, which would be of the same database
type as the target field, and turn ForeignKey into a virtual field on
top of the auxiliary field. This solution has the advantage that it
offloads the need to manage the raw database value off ForeignKey and
uses a field specifically intended for the task.
In order to keep this backwards compatible and avoid the need to
explicitly create two fields for each ForeignKey, the auxiliary field
needs to be created automatically during the phase where a model class is
created by its metaclass. Initially I implemented this as a method on
ForeignKey which takes the target field and creates its copy, touches
it up and adds it to the model class. However, this requires performing
special tasks with certain types of fields, such as AutoField which
needs to be turned into an IntegerField or CompositeField which
requires copying its enclosed fields as well.
A better approach is to add a method such as create_auxiliary_copy on
Field which would create all new field instances and add them to the
appropriate model class.
One possible problem with these changes is that they change the contents
of _meta.fields in each model out there that contains a relationship
field. For example, if a model contains the following fields:
['id',
 'name',
 'address',
 'place_ptr',
 'rating',
 'serves_hot_dogs',
 'serves_pizza',
 'chef']

where place_ptr is a OneToOneField and chef is a
ForeignKey, after the change it will contain the following list:
['id',
 'name',
 'address',
 'place_ptr',
 'place_ptr_id',
 'rating',
 'serves_hot_dogs',
 'serves_pizza',
 'chef',
 'chef_id']

This causes a lot of failures in the Django test suite, because there are
a lot of tests relying on the contents of _meta.fields or other
related attributes/properties. (Actually, this example is taken from one
of these tests,
model_inheritance.tests.ModelInheritanceTests.test_multiple_table.)
Fixing these is fairly simple, all they need is to add the appropriate
__id fields. However, this raises a concern of how _meta is
regarded. It has always been a private API officially, but everyone uses
it in their projects anyway. I still think the change is worth it, but it
might be a good idea to include a note about the change in the release
notes.

Porting previous work on top of master

The first major task of this project is to take the code I wrote as part
of GSoC 2011 and sync it with the current state of master. The order in
which I implemented things two years ago was to implement
CompositeField first and then I did a refactor of ForeignKey which
is required to make it support CompositeField. This turned out to be
inefficient with respect to the development process, because some parts of
the refactor broke the introduced CompositeField functionality,
meaning I had to effectively reimplement parts of it again. Also, some
abstractions introduced by the refactor made it possible to rewrite
certain parts in a cleaner way than what was necessary for
CompositeField alone (e.g. database creation or certain features of
model._meta).
In light of these findings I am convinced that a better approach would be
to first do the required refactor of ForeignKey and implement
CompositeField as the next step. This will result in a better maintainable
development branch and a cleaner revision history, making it easier to
review the work before its eventual inclusion into Django.

__in lookups for CompositeField

The existing implementation of CompositeField handles __in lookups
in the generic, backend-independent WhereNode class and uses a
disjunctive normal form expression as in the following example:
SELECT a, b, c FROM tbl1, tbl2
WHERE (a = 1 AND b = 2 AND c = 3) OR (a = 4 AND b = 5 AND c = 6);

The problem with this solution is that in cases where the list of values
contains tens or hundreds of tuples, this DNF expression will be extremely
long and the database will have to evaluate it for each and every row,
without a possibility of optimizing the query.
Certain database backends support the following alternative:
SELECT a, b, c FROM tbl1, tbl2
WHERE (a, b, c) IN [(1, 2, 3), (4, 5, 6)];

This would probably be the best option, but it can't be used by SQLite,
for instance. This is also the reason why the DNF expression was
implemented in the first place.
In order to support this more natural syntax, the DatabaseOperations
needs to be extended with a method such as composite_in_sql.
However, this leaves the issue of the inefficient DNF unresolved for
backends without support for tuple literals. For such backends, the
following expression is proposed:
SELECT a, b, c FROM tbl1, tbl2
WHERE EXISTS (SELECT a1, b1, c1, FROM (SELECT 1 as a, 2 as b, 3 as c
                                       UNION SELECT 4, 5, 6)
              WHERE a1=1 AND b1=b AND c1=c);

Since both syntaxes are rather generic and at least one of them should fit
any database backend directly, a new flag will be introduced,
DatabaseFeatures.supports_tuple_literals which the default
implementation of composite_in_sql will consult in order to choose
between the two options.

contenttypes and GenericForeignKey

Two years ago we gave this some thought and decided to leave this for
a later stage. I think the later stage is here.
It's fairly easy to represent composite values as strings. Given an
escape function which uniquely escapes commas, something like the
following works quite well:
",".join(escape(value) for value in composite_value)

However, in order to support JOINs generated by GenericRelation, we
need to be able to reproduce exactly the same encoding using an SQL
expression which would be used in the JOIN condition.
Luckily, while thus encoded strings need to be possible to decode in
Python (for example, when retrieving the related object using
GenericForeignKey or when the admin decodes the primary key from URL),
this isn't necessary at the database level. Using SQL we only ever need to
perform this in one direction, that is from a tuple of values into a
string.
That means we can use a generalized version of the function
django.contrib.admin.utils.quote which replaces each unsafe
character with its ASCII value in hexadecimal base, preceded by an escape
character. In this case, only two characters are unsafe -- comma (which is
used to separate the values) and an escape character (which I arbitrarily
chose as '~').
To reproduce this encoding, all values need to be cast to strings and then
for each such string two calls to the replace functions are made:
replace(replace(CAST (`column` AS text), '~', '~7E'), ',', '~2C')

According to available documentation, all four supported database backends
provide the replace function. [2] [3] [4] [5]
Even though the replace function seems to be available in all major
database servers (even ones not officially supported by Django, including
MSSQL, DB2, Informix and others), this is still probably best left to the
database backend and will be implemented as
DatabaseOperations.composite_value_to_text_sql.
One possible pitfall of this implementation might be that it may not work
with any column type that isn't an integer or a text string due to a
simple fact – the string the database would cast it to will probably
differ from the one Python will use. However, I'm not sure there's
anything we can do about this, especially since the string representation
chosen by the database may be specific for each database server. Therefore
I'm inclined to declare GenericRelation unsupported for models with a
composite primary key containing any special columns. This should be
extremely rare anyway.

Database introspection, inspectdb

There are three main goals concerning database introspection in this
project. The first is to ensure the output of inspectdb remains the
same as it is now for models with simple primary keys and simple foreign
key references, or at least equivalent. While this shouldn't be too
difficult to achieve, it will still be regarded with high importance.
The second goal is to extend inspectdb to also create a
CompositeField in models where the table contains a composite primary
key. This part shouldn't be too difficult,
DatabaseIntrospection.get_primary_key_column will be renamed to
get_primary_key which will return a tuple of columns and in case the
tuple contains more than one element, an appropriate CompositeField
will be added. This will also require updating
DatabaseWrapper.check_constraints for certain backends since it uses
get_primary_key_column.
The third goal is to also make inspectdb aware of composite foreign
keys. This will need a rewrite of get_relations which will have to
return a mapping between tuples of columns instead of single columns. It
should also ensure each tuple of columns pointed to by a foreign key gets
a CompositeField. This part will also probably require some changes in
other backend methods as well, especially since each backend has a unique
tangle of introspection methods.
This part requires a tremendous amount of work, because practically every
single change needs to be done four times and needs separate research of
the specific backend in question. Therefore I can't promise to deliver full support
for all features mentioned in this section for all backends. I'd say
backwards compatibility is a requirement, recognition of composite primary
keys is a highly wanted feature that I'll try to implement for as many
backends as possible and recognition of composite foreign keys would be a
nice extra to have for at least one or two backends.
I'll be implementing the features for the individual backends in the
following order: PostgreSQL, MySQL, SQLite and Oracle. I put PostgreSQL
first because, well, this is the backend with the best support in Django
(and also because it is the one where I'd actually use the features I'm
proposing). Oracle comes last because I don't have any way to test it and
I'm afraid I'd be stabbing in the dark anyway. Of the two remaining
backends I put MySQL first for two reasons. First, I don't think people
need to run inspectdb on SQLite databases too often (if ever). Second,
on MySQL the task seems marginally easier as the database has
introspection features other than just “give me the SQL statement used to
create this table”, whose parsing is most likely going to be a complete
mess.
All in all, extending inspectdb features is a tedious and difficult
task with shady outcome, which I'm well aware of. Still, I would like to
try to at least implement the easier parts for the most used backends. It
might quite possibly turn out that I won't manage to implement more than
composite primary key detection for PostgreSQL. This is the reason I keep
this as one of the last features I intend to work on, as shown in the
timeline. It isn't a necessity, we can always just add a note to the docs
that inspectdb just can't detect certain scenarios and ask people to
edit their models manually.

Updatable primary keys in models

The algorithm that determines what kind of database query to issue on
model.save() is a fairly simple and well-documented one [6]. If a row
exists in the database with the value of its primary key equal to the
saved object, it is updated, otherwise a new row is inserted. This
behavior is intuitive and works well for models where the primary key is
automatically created by the framework (be it an AutoField or a parent
link in the case of model inheritance).
However, as soon as the primary key is explicitly created, the behavior
becomes less intuitive and might be confusing, for example, to users of the
admin. For instance, say we have the following model:
class Person(models.Model):
    first_name = models.CharField(max_length=47)
    last_name = models.CharField(max_length=47)
    shoe_size = models.PositiveSmallIntegerField()

    full_name = models.CompositeField(first_name, last_name,
                                      primary_key=True)

Then we register the model in the admin using the standard one-liner:
admin.site.register(Person)

Since we haven't excluded any fields, all three fields will be editable in
the admin. Now, suppose there's an instance whose full_name is
CompositeValue(first_name='Darth', last_name='Vadur'). A user decides
to fix the last name using the admin, hits the “Save” button and instead
of fixing an existing record, a new one will appear with the new value,
while the old one remains untouched.  This behavior is clearly broken from
the point of view of the user.
It can be argued that it is the developer's fault that the database schema
is poorly chosen and that they expose the primary key to their users.
While this may be true in some cases, it is still to some extent a
subjective matter.
Therefore I propose a new behavior for model.save() where it would
detect a change in the instance's primary key and in that case issue an
UPDATE for the right row, i.e. WHERE primary_key = previous_value.
Of course, just going ahead and changing the behavior in this way for all
models would be backwards incompatible. To do this properly, we would need
to make this an opt-in feature. This can be achieved in multiple ways.

add a keyword argument such as update_pk to Model.save
add a new option to Model.Meta, updatable_pk
make this a project-wide setting

Option 3 doesn't look pleasant and I think I can safely eliminate that.
Option 2 is somewhat better, although it adds a new Meta option.
Option 1 is the most flexible solution, however, it does not change the
behavior of the admin, at least not by default. This can be worked around
by overriding the save method to use a different default:
class MyModel(models.Model):
    def save(self, update_pk=True, **kwargs):
        kwargs['update_pk'] = update_pk
        return super(MyModel, self).save(**kwargs)

To avoid the need to repeat this for each model, a class decorator might
be provided to perform this automatically.
In order to implement this new behavior a little bit of extra complexity
would have to be added to models. Model instances would need to store the
last known value of the primary key as retrieved from the database. On
save it would just find out whether the last known value is present and in
that case issue an UPDATE using the old value in the WHERE
condition.
So far so good, this could be implemented fairly easily. However, the
problem becomes considerably more difficult as soon as we take into
account the fact that updating a primary key value may break foreign key
references. In order to avoid breaking references the on_delete
mechanism of ForeignKey would have to be extended to support updates
as well. This means that the collector used by deletion will need to be
extended as well.
The problem becomes particularly nasty if we realize that a ForeignKey
might be part of a primary key, which means the collector needs to keep
track of which field depends on which in a graph of potentially unlimited
size. Compared to this, deletion is simpler as it only needs to find a
list of all affected model instances as opposed to having to keep track of
which field to update using which value.
Given the complexity of this problem and the fact that it is not directly
related to composite fields, this is left as the last feature which will
be implemented only if I manage to finish everything else on time.

Timeline

Our exam period starts on May 20. and ends on June 28., which means that I
can't guarantee I will be able to fully dedicate myself to the project for
the first two weeks, however, if nothing goes wrong, I should be able to
pass all exams before June 17.
Also, I intend to go to EuroPython, which means the first week of July
won't be as fruitful as the others, but otherwise I'm ready to work full
time on the project.

Week  1 (Jun 17. - Jun 23.)


porting the required virtual field changes, like a VirtualField
class and django.db.models.options changes
revisiting the documentation I wrote two years ago to reflect the
evolution this project has gone through


Week  2 (Jun 24. - Jun 30.)


porting the virtual ForeignKey patchset on top of master to get most
of the functionality right


Week  3 (Jul  1. - Jul  7.)


EuroPython 2013, Florence
I intend to spend the full two days of sprints working on this but I
can't promise much more during this week.
running through the tests and fixing those that need updating, ironing
out the remaining wrinkles


Week  4 (Jul  8. - Jul 14.)


basic CompositeField functionality, CompositeValue and
descriptor protocol


Week  5 (Jul 15. - Jul 21.)


db_index, unique and primary_key for CompositeField
backend-dependent SQL for __in lookups on CompositeField


Week  6 (Jul 22. - Jul 28.)


the few patches required to make admin work with composite primary keys


Week  7 (Jul 29. - Aug  4.)


composite ForeignKey: basic functionality, descriptors, database
JOINs


Week  8 (Aug  5. - Aug 11.)


composite ForeignKey: customization of auxiliary fields, adapting
ModelForms and the admin in case it is necessary


Week  9 (Aug 12. - Aug 18.)


GenericForeignKey and GenericRelation support


Week 10 (Aug 19. - Aug 25.)


database introspection: create composite fields where necessary


Week 11 (Aug 26. - Sep  1.)


database introspection: composite ForeignKey


Week 12 (Sep  2. - Sep  8.)


better handling of modifications to primary keys: detecting the fact and
issuing an appropriate DB query


Week 13 (Sep  9. - Sep 15.)


revision log cleanup
better handling of modifications to primary keys: cascading primary key
updates


Week 14 (Sep 16. - Sep 22.)


this is after the suggested “soft pencils down” deadline, therefore I'll
keep this week as a reserve


Deliverables and fallbacks

As the timeline shows, the first few weeks will be dedicated to internal
refactors of the relationship handling parts of the ORM, which means the
outcome won't be entirely evident or useful by itself. By the end of the
sixth week, I expect the refactor of ForeignKey to be stable for
concrete target fields. Also, the non-relationship part of
CompositeField functionality should be ported on top of this.
Next, there is support for composite foreign keys. This is the core of the
work and the absolute minimum I want to squeeze out of this project to
consider it at least partially successful.
All the features that appear later in the timeline are optional, ordered
by the priority with which I would like to implement them. In the unlikely
case of extreme difficulties with the previous parts of this project some
(or possibly all) of them may be left out.

A note about the history of this project

During GSoC 2011 the project progressed steadily, especially during the
first half. The second half was a little bit more blurry because I had no
idea how difficult the task of implementing composite foreign keys would
be before I started to actually implement them. Still, even though the
second half didn't yield a working implementation (just a half-baked
refactor), it was crucial for me to understand how it can be done.
Unfortunately, soon thereafter school kicked in and along with it
homework, assignments, thesis and other projects related to programming
contests we organize throughout the academic year. This left me with
barely enough time to sync the code with master from time to time, let
alone make any significant progress.
I didn't apply for GSoC 2012 because I knew I wouldn't have enough time to
work on the project because of my thesis and state exams. This would make
it a poor candidate for a GSoC project and even if it did get chosen, it
wouldn't feel fair to other GSoC participants.
Without any deadlines ahead of me and without the GSoC participation, I
was lacking the motivation to keep working on this. I admit, the lack of
progress throughout last summer was largely due to my laziness. I did sync
the code from time to time but I stopped around the time when Aymeric
started pushing all the autogenerated Py3k commits. Sorting out the merge
conflicts and updating a patchset of this scope to the new Py3k
compatibility standards was a really painful process.
I managed to at least get the code through the Py3k changes sometime in
October before the RuPy.eu conference, putting aside everything else. At
that time I was really confident I could keep working on it, but
obviously, that didn't happen.
All in all, the reasons why there hasn't been any significant progress can
be summarized as these two:

lack of free time (don't we all know this one?)
lack of short-term incentive (a.k.a. laziness)

Why I think the upcoming GSoC will help?
Regarding the first one, having planned half a year in advance for a
Summer of Code project means that I don't have any other plans for the
summer beside this. (Well, except for EuroPython, which I hope will help
this project advance anyway.)
As far as motivation is concerned, there's the obvious one -- a nice
paycheck from Google if I succeed to deliver. However, that's not the only
one, I think the fact that I would have a mentor supervising my work, whom
I respond to throughout the summer, has a really great effect in this
regard.

References


[1] GSoC 2011 proposal: Composite Fields
(https://gist.github.com/koniiiik/5450625)


[2] PostgreSQL documentation: Other String Functions
(http://www.postgresql.org/docs/9.1/static/functions-string.html#FUNCTIONS-STRING-OTHER)


[3] SQLite documentation: Core Functions
(http://www.sqlite.org/lang_corefunc.html#replace)


[4] Oracle® Database SQL Reference: REPLACE
(http://docs.oracle.com/cd/B19306_01/server.102/b14200/functions134.htm)


[5] MySQL Reference Manual: String Functions
(http://dev.mysql.com/doc/refman/5.0/en/string-functions.html#function_replace)


[6] Django documentation: Model instance reference
(https://docs.djangoproject.com/en/dev/ref/models/instances/#how-django-knows-to-update-vs-insert)