Skip to content

Instantly share code, notes, and snippets.

@koniiiik
Last active March 20, 2019 20:30
Show Gist options
  • Save koniiiik/5408673 to your computer and use it in GitHub Desktop.
Save koniiiik/5408673 to your computer and use it in GitHub Desktop.
Google Summer of Code 2013 proposal: Composite Fields, vol. 2

Composite Fields, vol. 2

Abstract

Two years ago, as part of GSoC 2011, I started working on an implementation of composite fields as a means to support multi-column primary keys in models. While the project was successful, the timeframe wasn't sufficient to finish it and get it into a state where it could be merged into Django. This year I propose to take the last remaining steps for this project to be ready and hopefully even extend it with some extra features which were left out of the initial project.

Aim of this project

Since I started working on this two years ago, I managed (with the help of a few people) to get most of the hard work done. The biggest part that I didn't get around to implementing was support in the ORM for multi-column joins. This has, however, been implemented recently by Jeremy Tillman and Anssi, which means there are only a few things left to be done.

First of all, the code sitting in my github repo is badly out of date, which means it needs to be updated to the current state of master. While I'm at it, I also want to clean up the revision history to get it as close to a state where it could be just reviewed and merged directly as possible.

Beside this, there are only the juicier features left that we initially wanted to leave unsupported and get back to them later. Those are the following (in no particular order):

  • GenericForeignKey support for CompositeField
  • more intelligent handling of __in lookups
  • database instrospection and inspectdb
  • detection of a change of the primary key in model instances

Summary of CompositeField

This section summarizes the basic API as established in the proposal for GSoC 2011 [1].

A CompositeField requires a list of enclosed regular model fields as positional arguments, as shown in this example:

class SomeModel(models.Model):
    first_field = models.IntegerField()
    second_field = models.CharField(max_length=100)
    composite = models.CompositeField(first_field, second_field)

The model class then contains a descriptor for the composite field, which returns a CompositeValue which is a customized namedtuple, the descriptor accepts any iterable of the appropriate length. An example interactive session:

>>> instance = new SomeModel(first_field=47, second_field="some string")
>>> instance.composite
CompositeObject(first_field=47, second_field='some string')
>>> instance.composite.first_field
47
>>> instance.composite[1]
'some string'
>>> instance.composite = (74, "other string")
>>> instance.first_field, instance.second_field
(74, 'other string')

CompositeField supports the following standard field options: unique, db_index, primary_key. The first two will simply add a corresponding tuple to model._meta.unique_together or model._meta.index_together. Other field options don't make much sense in the context of composite fields.

Supported QuerySet filters will be exact and in. The former should be clear enough, the latter is elaborated in a separate section.

It will be possible to use a CompositeField as a target field of ForeignKey, OneToOneField and ManyToManyField. This is described in more detail in the following section.

Changes in ForeignKey

Currently ForeignKey is a regular concrete field which manages both the raw value stored in the database and the higher-level relationship semantics. Managing the raw value is simple enough for simple (single-column) targets. However, in the case of a composite target field, this task becomes more complex. The biggest problem is that many parts of the ORM work under the assumption that for each database column there is a model field it can assign the value from the column to. While it might be possible to lift this restriction, it would be a really complex project by itself.

On the other hand, there is the abstraction of virtual fields working on top of other fields which is required for this project anyway. The way forward would be to use this abstraction for relationship fields. Currently, ForeignKey (and by extension OneToOneField) is the only field whose name and attname differ, where name stores the value dictated by the semantics of the field and attname stores the raw value from the database.

We can use this to our advantage and put an auxiliary field into the attname of each ForeignKey, which would be of the same database type as the target field, and turn ForeignKey into a virtual field on top of the auxiliary field. This solution has the advantage that it offloads the need to manage the raw database value off ForeignKey and uses a field specifically intended for the task.

In order to keep this backwards compatible and avoid the need to explicitly create two fields for each ForeignKey, the auxiliary field needs to be created automatically during the phase where a model class is created by its metaclass. Initially I implemented this as a method on ForeignKey which takes the target field and creates its copy, touches it up and adds it to the model class. However, this requires performing special tasks with certain types of fields, such as AutoField which needs to be turned into an IntegerField or CompositeField which requires copying its enclosed fields as well.

A better approach is to add a method such as create_auxiliary_copy on Field which would create all new field instances and add them to the appropriate model class.

One possible problem with these changes is that they change the contents of _meta.fields in each model out there that contains a relationship field. For example, if a model contains the following fields:

['id',
 'name',
 'address',
 'place_ptr',
 'rating',
 'serves_hot_dogs',
 'serves_pizza',
 'chef']

where place_ptr is a OneToOneField and chef is a ForeignKey, after the change it will contain the following list:

['id',
 'name',
 'address',
 'place_ptr',
 'place_ptr_id',
 'rating',
 'serves_hot_dogs',
 'serves_pizza',
 'chef',
 'chef_id']

This causes a lot of failures in the Django test suite, because there are a lot of tests relying on the contents of _meta.fields or other related attributes/properties. (Actually, this example is taken from one of these tests, model_inheritance.tests.ModelInheritanceTests.test_multiple_table.) Fixing these is fairly simple, all they need is to add the appropriate __id fields. However, this raises a concern of how _meta is regarded. It has always been a private API officially, but everyone uses it in their projects anyway. I still think the change is worth it, but it might be a good idea to include a note about the change in the release notes.

Porting previous work on top of master

The first major task of this project is to take the code I wrote as part of GSoC 2011 and sync it with the current state of master. The order in which I implemented things two years ago was to implement CompositeField first and then I did a refactor of ForeignKey which is required to make it support CompositeField. This turned out to be inefficient with respect to the development process, because some parts of the refactor broke the introduced CompositeField functionality, meaning I had to effectively reimplement parts of it again. Also, some abstractions introduced by the refactor made it possible to rewrite certain parts in a cleaner way than what was necessary for CompositeField alone (e.g. database creation or certain features of model._meta).

In light of these findings I am convinced that a better approach would be to first do the required refactor of ForeignKey and implement CompositeField as the next step. This will result in a better maintainable development branch and a cleaner revision history, making it easier to review the work before its eventual inclusion into Django.

__in lookups for CompositeField

The existing implementation of CompositeField handles __in lookups in the generic, backend-independent WhereNode class and uses a disjunctive normal form expression as in the following example:

SELECT a, b, c FROM tbl1, tbl2
WHERE (a = 1 AND b = 2 AND c = 3) OR (a = 4 AND b = 5 AND c = 6);

The problem with this solution is that in cases where the list of values contains tens or hundreds of tuples, this DNF expression will be extremely long and the database will have to evaluate it for each and every row, without a possibility of optimizing the query.

Certain database backends support the following alternative:

SELECT a, b, c FROM tbl1, tbl2
WHERE (a, b, c) IN [(1, 2, 3), (4, 5, 6)];

This would probably be the best option, but it can't be used by SQLite, for instance. This is also the reason why the DNF expression was implemented in the first place.

In order to support this more natural syntax, the DatabaseOperations needs to be extended with a method such as composite_in_sql.

However, this leaves the issue of the inefficient DNF unresolved for backends without support for tuple literals. For such backends, the following expression is proposed:

SELECT a, b, c FROM tbl1, tbl2
WHERE EXISTS (SELECT a1, b1, c1, FROM (SELECT 1 as a, 2 as b, 3 as c
                                       UNION SELECT 4, 5, 6)
              WHERE a1=1 AND b1=b AND c1=c);

Since both syntaxes are rather generic and at least one of them should fit any database backend directly, a new flag will be introduced, DatabaseFeatures.supports_tuple_literals which the default implementation of composite_in_sql will consult in order to choose between the two options.

contenttypes and GenericForeignKey

Two years ago we gave this some thought and decided to leave this for a later stage. I think the later stage is here.

It's fairly easy to represent composite values as strings. Given an escape function which uniquely escapes commas, something like the following works quite well:

",".join(escape(value) for value in composite_value)

However, in order to support JOINs generated by GenericRelation, we need to be able to reproduce exactly the same encoding using an SQL expression which would be used in the JOIN condition.

Luckily, while thus encoded strings need to be possible to decode in Python (for example, when retrieving the related object using GenericForeignKey or when the admin decodes the primary key from URL), this isn't necessary at the database level. Using SQL we only ever need to perform this in one direction, that is from a tuple of values into a string.

That means we can use a generalized version of the function django.contrib.admin.utils.quote which replaces each unsafe character with its ASCII value in hexadecimal base, preceded by an escape character. In this case, only two characters are unsafe -- comma (which is used to separate the values) and an escape character (which I arbitrarily chose as '~').

To reproduce this encoding, all values need to be cast to strings and then for each such string two calls to the replace functions are made:

replace(replace(CAST (`column` AS text), '~', '~7E'), ',', '~2C')

According to available documentation, all four supported database backends provide the replace function. [2] [3] [4] [5]

Even though the replace function seems to be available in all major database servers (even ones not officially supported by Django, including MSSQL, DB2, Informix and others), this is still probably best left to the database backend and will be implemented as DatabaseOperations.composite_value_to_text_sql.

One possible pitfall of this implementation might be that it may not work with any column type that isn't an integer or a text string due to a simple fact – the string the database would cast it to will probably differ from the one Python will use. However, I'm not sure there's anything we can do about this, especially since the string representation chosen by the database may be specific for each database server. Therefore I'm inclined to declare GenericRelation unsupported for models with a composite primary key containing any special columns. This should be extremely rare anyway.

Database introspection, inspectdb

There are three main goals concerning database introspection in this project. The first is to ensure the output of inspectdb remains the same as it is now for models with simple primary keys and simple foreign key references, or at least equivalent. While this shouldn't be too difficult to achieve, it will still be regarded with high importance.

The second goal is to extend inspectdb to also create a CompositeField in models where the table contains a composite primary key. This part shouldn't be too difficult, DatabaseIntrospection.get_primary_key_column will be renamed to get_primary_key which will return a tuple of columns and in case the tuple contains more than one element, an appropriate CompositeField will be added. This will also require updating DatabaseWrapper.check_constraints for certain backends since it uses get_primary_key_column.

The third goal is to also make inspectdb aware of composite foreign keys. This will need a rewrite of get_relations which will have to return a mapping between tuples of columns instead of single columns. It should also ensure each tuple of columns pointed to by a foreign key gets a CompositeField. This part will also probably require some changes in other backend methods as well, especially since each backend has a unique tangle of introspection methods.

This part requires a tremendous amount of work, because practically every single change needs to be done four times and needs separate research of the specific backend in question. Therefore I can't promise to deliver full support for all features mentioned in this section for all backends. I'd say backwards compatibility is a requirement, recognition of composite primary keys is a highly wanted feature that I'll try to implement for as many backends as possible and recognition of composite foreign keys would be a nice extra to have for at least one or two backends.

I'll be implementing the features for the individual backends in the following order: PostgreSQL, MySQL, SQLite and Oracle. I put PostgreSQL first because, well, this is the backend with the best support in Django (and also because it is the one where I'd actually use the features I'm proposing). Oracle comes last because I don't have any way to test it and I'm afraid I'd be stabbing in the dark anyway. Of the two remaining backends I put MySQL first for two reasons. First, I don't think people need to run inspectdb on SQLite databases too often (if ever). Second, on MySQL the task seems marginally easier as the database has introspection features other than just “give me the SQL statement used to create this table”, whose parsing is most likely going to be a complete mess.

All in all, extending inspectdb features is a tedious and difficult task with shady outcome, which I'm well aware of. Still, I would like to try to at least implement the easier parts for the most used backends. It might quite possibly turn out that I won't manage to implement more than composite primary key detection for PostgreSQL. This is the reason I keep this as one of the last features I intend to work on, as shown in the timeline. It isn't a necessity, we can always just add a note to the docs that inspectdb just can't detect certain scenarios and ask people to edit their models manually.

Updatable primary keys in models

The algorithm that determines what kind of database query to issue on model.save() is a fairly simple and well-documented one [6]. If a row exists in the database with the value of its primary key equal to the saved object, it is updated, otherwise a new row is inserted. This behavior is intuitive and works well for models where the primary key is automatically created by the framework (be it an AutoField or a parent link in the case of model inheritance).

However, as soon as the primary key is explicitly created, the behavior becomes less intuitive and might be confusing, for example, to users of the admin. For instance, say we have the following model:

class Person(models.Model):
    first_name = models.CharField(max_length=47)
    last_name = models.CharField(max_length=47)
    shoe_size = models.PositiveSmallIntegerField()

    full_name = models.CompositeField(first_name, last_name,
                                      primary_key=True)

Then we register the model in the admin using the standard one-liner:

admin.site.register(Person)

Since we haven't excluded any fields, all three fields will be editable in the admin. Now, suppose there's an instance whose full_name is CompositeValue(first_name='Darth', last_name='Vadur'). A user decides to fix the last name using the admin, hits the “Save” button and instead of fixing an existing record, a new one will appear with the new value, while the old one remains untouched. This behavior is clearly broken from the point of view of the user.

It can be argued that it is the developer's fault that the database schema is poorly chosen and that they expose the primary key to their users. While this may be true in some cases, it is still to some extent a subjective matter.

Therefore I propose a new behavior for model.save() where it would detect a change in the instance's primary key and in that case issue an UPDATE for the right row, i.e. WHERE primary_key = previous_value.

Of course, just going ahead and changing the behavior in this way for all models would be backwards incompatible. To do this properly, we would need to make this an opt-in feature. This can be achieved in multiple ways.

  1. add a keyword argument such as update_pk to Model.save
  2. add a new option to Model.Meta, updatable_pk
  3. make this a project-wide setting

Option 3 doesn't look pleasant and I think I can safely eliminate that. Option 2 is somewhat better, although it adds a new Meta option. Option 1 is the most flexible solution, however, it does not change the behavior of the admin, at least not by default. This can be worked around by overriding the save method to use a different default:

class MyModel(models.Model):
    def save(self, update_pk=True, **kwargs):
        kwargs['update_pk'] = update_pk
        return super(MyModel, self).save(**kwargs)

To avoid the need to repeat this for each model, a class decorator might be provided to perform this automatically.

In order to implement this new behavior a little bit of extra complexity would have to be added to models. Model instances would need to store the last known value of the primary key as retrieved from the database. On save it would just find out whether the last known value is present and in that case issue an UPDATE using the old value in the WHERE condition.

So far so good, this could be implemented fairly easily. However, the problem becomes considerably more difficult as soon as we take into account the fact that updating a primary key value may break foreign key references. In order to avoid breaking references the on_delete mechanism of ForeignKey would have to be extended to support updates as well. This means that the collector used by deletion will need to be extended as well.

The problem becomes particularly nasty if we realize that a ForeignKey might be part of a primary key, which means the collector needs to keep track of which field depends on which in a graph of potentially unlimited size. Compared to this, deletion is simpler as it only needs to find a list of all affected model instances as opposed to having to keep track of which field to update using which value.

Given the complexity of this problem and the fact that it is not directly related to composite fields, this is left as the last feature which will be implemented only if I manage to finish everything else on time.

Timeline

Our exam period starts on May 20. and ends on June 28., which means that I can't guarantee I will be able to fully dedicate myself to the project for the first two weeks, however, if nothing goes wrong, I should be able to pass all exams before June 17.

Also, I intend to go to EuroPython, which means the first week of July won't be as fruitful as the others, but otherwise I'm ready to work full time on the project.

Week 1 (Jun 17. - Jun 23.)

  • porting the required virtual field changes, like a VirtualField class and django.db.models.options changes
  • revisiting the documentation I wrote two years ago to reflect the evolution this project has gone through

Week 2 (Jun 24. - Jun 30.)

  • porting the virtual ForeignKey patchset on top of master to get most of the functionality right

Week 3 (Jul 1. - Jul 7.)

  • EuroPython 2013, Florence
  • I intend to spend the full two days of sprints working on this but I can't promise much more during this week.
  • running through the tests and fixing those that need updating, ironing out the remaining wrinkles

Week 4 (Jul 8. - Jul 14.)

  • basic CompositeField functionality, CompositeValue and descriptor protocol

Week 5 (Jul 15. - Jul 21.)

  • db_index, unique and primary_key for CompositeField
  • backend-dependent SQL for __in lookups on CompositeField

Week 6 (Jul 22. - Jul 28.)

  • the few patches required to make admin work with composite primary keys

Week 7 (Jul 29. - Aug 4.)

  • composite ForeignKey: basic functionality, descriptors, database JOINs

Week 8 (Aug 5. - Aug 11.)

  • composite ForeignKey: customization of auxiliary fields, adapting ModelForms and the admin in case it is necessary

Week 9 (Aug 12. - Aug 18.)

  • GenericForeignKey and GenericRelation support

Week 10 (Aug 19. - Aug 25.)

  • database introspection: create composite fields where necessary

Week 11 (Aug 26. - Sep 1.)

  • database introspection: composite ForeignKey

Week 12 (Sep 2. - Sep 8.)

  • better handling of modifications to primary keys: detecting the fact and issuing an appropriate DB query

Week 13 (Sep 9. - Sep 15.)

  • revision log cleanup
  • better handling of modifications to primary keys: cascading primary key updates

Week 14 (Sep 16. - Sep 22.)

  • this is after the suggested “soft pencils down” deadline, therefore I'll keep this week as a reserve

Deliverables and fallbacks

As the timeline shows, the first few weeks will be dedicated to internal refactors of the relationship handling parts of the ORM, which means the outcome won't be entirely evident or useful by itself. By the end of the sixth week, I expect the refactor of ForeignKey to be stable for concrete target fields. Also, the non-relationship part of CompositeField functionality should be ported on top of this.

Next, there is support for composite foreign keys. This is the core of the work and the absolute minimum I want to squeeze out of this project to consider it at least partially successful.

All the features that appear later in the timeline are optional, ordered by the priority with which I would like to implement them. In the unlikely case of extreme difficulties with the previous parts of this project some (or possibly all) of them may be left out.

A note about the history of this project

During GSoC 2011 the project progressed steadily, especially during the first half. The second half was a little bit more blurry because I had no idea how difficult the task of implementing composite foreign keys would be before I started to actually implement them. Still, even though the second half didn't yield a working implementation (just a half-baked refactor), it was crucial for me to understand how it can be done.

Unfortunately, soon thereafter school kicked in and along with it homework, assignments, thesis and other projects related to programming contests we organize throughout the academic year. This left me with barely enough time to sync the code with master from time to time, let alone make any significant progress.

I didn't apply for GSoC 2012 because I knew I wouldn't have enough time to work on the project because of my thesis and state exams. This would make it a poor candidate for a GSoC project and even if it did get chosen, it wouldn't feel fair to other GSoC participants.

Without any deadlines ahead of me and without the GSoC participation, I was lacking the motivation to keep working on this. I admit, the lack of progress throughout last summer was largely due to my laziness. I did sync the code from time to time but I stopped around the time when Aymeric started pushing all the autogenerated Py3k commits. Sorting out the merge conflicts and updating a patchset of this scope to the new Py3k compatibility standards was a really painful process.

I managed to at least get the code through the Py3k changes sometime in October before the RuPy.eu conference, putting aside everything else. At that time I was really confident I could keep working on it, but obviously, that didn't happen.

All in all, the reasons why there hasn't been any significant progress can be summarized as these two:

  1. lack of free time (don't we all know this one?)
  2. lack of short-term incentive (a.k.a. laziness)

Why I think the upcoming GSoC will help?

Regarding the first one, having planned half a year in advance for a Summer of Code project means that I don't have any other plans for the summer beside this. (Well, except for EuroPython, which I hope will help this project advance anyway.)

As far as motivation is concerned, there's the obvious one -- a nice paycheck from Google if I succeed to deliver. However, that's not the only one, I think the fact that I would have a mentor supervising my work, whom I respond to throughout the summer, has a really great effect in this regard.

References

[1]GSoC 2011 proposal: Composite Fields (https://gist.github.com/koniiiik/5450625)
[2]PostgreSQL documentation: Other String Functions (http://www.postgresql.org/docs/9.1/static/functions-string.html#FUNCTIONS-STRING-OTHER)
[3]SQLite documentation: Core Functions (http://www.sqlite.org/lang_corefunc.html#replace)
[4]Oracle® Database SQL Reference: REPLACE (http://docs.oracle.com/cd/B19306_01/server.102/b14200/functions134.htm)
[5]MySQL Reference Manual: String Functions (http://dev.mysql.com/doc/refman/5.0/en/string-functions.html#function_replace)
[6]Django documentation: Model instance reference (https://docs.djangoproject.com/en/dev/ref/models/instances/#how-django-knows-to-update-vs-insert)
@adepue
Copy link

adepue commented Apr 22, 2013

The proposal sounds good to @jtillman and myself. We would be willing to help as well if you'd be ok splitting the work up a bit and we can help contribute.

@koniiiik
Copy link
Author

Thanks for the feedback. I just managed to extend the proposal to its full scope with a lot of gory details. I'd appreciate any feedback on the added sections if you ever manage to find enough time to scan through it. (-;

Regarding splitting the work… I'm all for it, the problem is, I'm not sure the GSoC rules allow this. As far as I'm aware, all participants have to produce the code themselves. However, this still makes it possible to chop off one or two of the features from the end of the timeline and work on them separately, not as part of the GSoC. I think the updatable primary keys are a good candidate for this. I suspect this feature alone would be sufficient to fill a large part of a Summer of Code project if it is to be done properly. I just put it into my proposal as a reserve

By the way, @jtillman, great work with the multi-column joins. (And sorry about mangling your name, brainfart on my part…)

@charettes
Copy link

Great proposal, sounds good to me as well.

Concerning the opt-in mechanism for the update_pk in the admin I'd say we could default to True if the field is composite.

i.e.

class MyModelAdmin(ModelAdmin):
    def save_model(self, request, obj, form, change):
        obj.save(update_pk=obj._meta.primary_key.composite)

This would maintain backward compatibility while respecting the principle of least astonishment.

@koniiiik
Copy link
Author

@charettes: That idea crossed my mind as well, the problem is, the issue manifests in any model with an explicit primary key, not just those with a composite PK. Somehow I dislike it a bit because I find it a little bit inconsistent, but I admit that's completely subjective.

@aaugustin
Copy link

It doesn't matter, but FYI, the port to Python 3 was mostly performed manually.

I used a 2to3 fixer to apply the @python_2_unicode_compatible decorator; besides this, the closest to automation was our text editor's "replace-all" function :)

@jtillman
Copy link

I wanted to chime in on the idea of the updatable primary key.

I get the idea but it really scares me. I don't see any case where updating the fields of the primary key would be wise. Building a feature for it would promote bad practice and cause performance and database issues for those who use it. Knowing that we would be able to work around this with Model.objects.filter(...).update(<pk_fields>), I wouldn't add it as its own feature.

@koniiiik
Copy link
Author

koniiiik commented Jun 2, 2013

I'll just leave a note here that I just sent an email to django-developers with a few questions, also tackling the concerns raised by @jtillman, the thread is available here.

@aaugustin: Wow, that's impressive. (-:

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment