Skip to content

Instantly share code, notes, and snippets.

@auvipy
Last active April 17, 2017 04:46
Show Gist options
  • Save auvipy/33a57f6166549fd7c4c8 to your computer and use it in GitHub Desktop.
Save auvipy/33a57f6166549fd7c4c8 to your computer and use it in GitHub Desktop.
SQLAlchemy Integration with django: GSOC 2016 proposal

SQL-alchemy integration with django

Abstract

A common request over the entire life of Django has been to use Django's forms (and in particular, Django's Admin) with data stores that aren't Django's ORM. SQLAlchemy is a popular choice for those using SQL databases. With the formalization of django Meta API it is now possible to use that formalize API for interfacing with other non-django-orm data stores. So SQLalchemy integration to django refers to the work arounds to make the SQLalchemy table definitions works and quaks like django and to be used with modelforms and django admin.

Motivation

Hi all,

A good while back I put forward the idea of using SQLAlchemy Core in Django [1]. Having had more experience working with SQLAlchemy, I'm putting that idea forward as a formal proposal, as I mentioned in a more recent thread here.

Apologies in advance for the length! I've included a few 'TL;DR' summaries and headings in the different sections which you might want to scan first.

=== Proposal ===

We should re-write our query generation code to use SQLAlchemy Core. This includes DDL queries as well as DML (e.g. CREATE TABLE as well as SELECT, INSERT etc).

This would also involve replacing our database connection objects with SQLAlchemy's. In this proposal, our high-level ORM, with model definition and query API, would remain the same - we wouldn't be using the SQLAlchemy ORM.

This is a "Django 2.0" proposal i.e. not immediate future, and not fully backwards compatible.

=== Background - Django 2.0 philosophy ===

TL;DR: "evolution not revolution, some backwards incompatibilities"

Although we haven't really discussed the timing of Django 2.0, or its philosophy, I think we should be doing. My own assumption is that it should be evolution, not revolution, similar to Python 3, where we break stuff that has dogged us in the past, and will hamper us going forward, but don't make radical changes where not necessary. This proposal fits that basic assumption.

Also, in Django to date we've eschewed external dependencies. That has been partly due to the poor and confusing state of Python packaging, which is hopefully improving. I think it will make us a very poor Python citizen if we don't reverse this policy at some point, and Django 2.0 is an obvious point to do it.

This proposal does not represent a move away from being a 'full stack' framework. "Full stack" does not mean "no dependencies".

Our current recommended installation method is via pip [2], and we would be doing our users a disservice to recommend otherwise. Installation via pip actually makes the instructions shorter - manual methods require things like removing old versions manually - and for anything beyond trivial use of Django you have to know pip anyway.

So, with our recommended installation method, adding a dependency doesn't make things any more difficult at all.

=== Background - ORM philosophy ===

TL;DR: "Let's make Django's DB layer the best it can be for relational databases."

Whether we call it "the ORM" or "the model layer" as some people prefer, I think it's fairly certain that the overwhelming majority of our users are using relational databases.

Many of the things that make Django a compelling choice, including the admin and re-usable apps, either don't work or are of little use if you are not using a relational database.

So my philosophy is that we should aim to provide a really excellent ORM that will take users as far as possible.

This doesn't preclude having non-relational support in Django. But it seems very strange to make that the focus, when we've had little-to-no support for it so far, or to allow that support to limit how well we can cater for the 99%.

=== Motivation ===

== Motivation 1: Django's ORM leaves you high and dry when you reach its limits.

While the ORM can do a surprising number of queries, there are plenty it can't, and in all the medium-to-large projects I've worked on I've gone beyond what the ORM can do.

At this point, you've got a few options, from easiest to hardest:

A) Do the aggregation/filtering etc in Python.

B) Write raw SQL.

C) Use SQLAlchemy or some other SQL generation tool.

D) Write a patch to extend the ORM.

None of these is great:

A) Data manipulation in Python, when it could be done in SQL, is obviously a bad idea since it is usually very inefficient. But I've seen a lot of code that does this, because it was hard/impossible to get Django's ORM to do the query needed.

This anti-pattern will also give Django applications the reputation for being slow. Obviously, we can point the finger at the developer, but if we've made it hard for the developer to do the right thing, that is unfair.

B) Raw SQL fails if you have dynamic queries i.e. where the shape of the query can vary.

Example 1: you are writing library code e.g. a re-usable app that knows nothing about the tables it is actually querying, and may have been given any arbitrary QuerySet as an input to manipulate in some way.

Example 2: even if you have full knowledge of the tables, you might have additional WHERE clauses/JOINs/sub queries in some cases, that you want to programmatically add to the query.

I've come across both these types in projects I've been involved in, and I know I'm far from the only one.

Raw SQL can also fail if you are manually writing 'static' queries but need compatibility with multiple DB backends.

C) For SQL generation, SQLAlchemy is the best, but for good reason it comes with its own database connection objects. Having two sources of connection objects causes problems, such as queries not seeing what was done in the transaction established by the other connection.

This can be especially painful for tests even if you're not making much use of transactions in your live system, since it leads to needing the much slower TransactionTestCase and hacks to make it work.

D) This is basically too hard for most of us, including core developers. One reason is that the the code involved is not in the best shape. (More below. This might change if Anssi's work succeeds, but I still have big doubts about how far we can push our ORM).

So, for these reasons, when you reach the end of what the ORM can provide, you are stuck with a lot of poor or painful options. It sucks to try to explain to a customer that although the feature they want is relatively straightforward, and easily supportable by the major component (the database), the limitations of the tool you are using make it impractical. It really sucks when the tool in question is Django.

If SQLAlchemy was used for our query generation, however, it would be easy to get a query object for 'the query so far' and start from there, using the full power of SQLAlchemy Expression Language to build whatever query you need.

== Motivation 2: More db backends

SQLAlchemy has a load of code to cope with different DB dialects, and we have some too. This is largely overlapping, which is duplicated work. It's not entirely overlapping though, and there are some people who can't use one of Django/SQLAlchemy due to choice of DB. If we combined work, both projects would benefit - SQLAlchemy would get a large number of people now interested in fixing whatever backend support is missing in their code, and we'd get all their existing backends for free.

Mike Bayer is enthusiastic about this possibility [3].

== Motivation 3: Code cleanup

As mentioned, our own query generation has been pushed to the limit and beyond. (I'm talking about classes like Query, SQLCompiler). It has grown and grown, so that the Query class is now a 2000 line behemoth, containing a constructor with over 40 assignments to 'self'.

Most of the core developers are scared to touch this stuff, AFAICS, myself included. It has no virtually no unit tests. While it has very high test coverage, it is tested only by tests that check QuerySet and other high-level APIs.

As such, it's very difficult to change, and it may well be beyond our ability to successfully refactor.

Switching to SQLAlchemy would force us to rewrite this code, which is for our own good. In addition, large chunks of it can be dropped entirely (i.e. most of database specific stuff). This will reduce our maintenance load going forward (eventually).

(BTW, I'm not saying that we should let the existing code continue to rot, we should of course try to clean it up as best we can, and that effort is not wasted - I'm talking about a longer term strategy here. If we can refactor this code, great - this motivation can be dropped from the list, but I think the others still stand).

== Motivation 4: Python community

Having another project as a dependency would be a good thing. It would prevent Django becoming a walled garden, and give us greater exposure to the broader talent in Python-world. SQLAlchemy itself is a fantastic library, from which we can learn a lot.

=== Cost/risk for us ===

There is no doubt this would require a lot of work, but I think it is achievable. Both SQLAlchemy and Django have connection objects that 'mostly' implement the DBAPI standard, which will help with the database connection stuff.

Also, not doing this may represent a greater cost for us and our users in the long run. There is no doubt that this layer of our code represents a large amount of technical debt that we need to deal with at some point.

We may be able to re-use a lot of the work that has already been done in django-sqlalchemy [4], which is similar to this proposal in some ways, or at least has done some of mapping.

It's possible that we won't be able to remove as much code as I'm suggesting, leading to us having our own database-compatibility layer on top of SQLA's, which might make our code ugly and harder to maintain than doing it all ourselves.

For example, Oracle doesn't support LIMIT/OFFSET, and we have our own workaround. SQLA also has a workaround that works transparently (I think it is basically the same SQL), so we can delete all our code for this. However, if there is a workaround we've got that SQLA doesn't have, or doesn't apply transparently, or works differently to ours in an important way, we might still need our own 'dialect' classes. It may be possible to get SQLA to accept patches for this in some cases, or use SQLA's extension points to modify it's behaviour e.g. [5]. Mike Bayer would definitely work with us on this (I've emailed him about it).

Finally, this change would probably require us to start helping out with SQLAlchemy, at the very least investing enough to be able to effectively contribute patches etc.

=== Implementation ===

(You can skip this section, it's just rough ideas I've come up with in trying to assess the amount of work needed).

There are different strategies I can think of to break it down into steps that will leave Django roughly working after each major step. Here are some:

== Strategy 1:

The most direct, and in some ways the simplest, but requires a few huge jumps.

  1. Replace all our connection objects with SQLAlchemy's. Each individual backend (sqlite, postgres etc) would return SQLA connection objects. All SQL generation would still be done in Django.

  2. Replace DDL SQL generation with SQLA Core.

  3. Replace DML SQL generation with SQLA Core.

  4. Replace backend/*/introspection.py

== Strategy 2:

  1. Rewrite one of our backends (probably SQLite) so that it uses SQLA to connect to SQLite. SQLite is a good choice because we tend to use it for running the test suite fast, but not for production, maximising the number of people who might be willing to try out this branch.

  2. Rewrite its DDL SQL generation so it uses SQLA Expression Language.

  3. Write a SQLCompiler class for SQLite, in the same way that we have a MySQL/Oracle SQLCompiler, but rewriting all the methods, to use SQLA Expression Language to build the queries.

  4. Allow backends to provide their own QuerySet, and give SQLite its own. The implementation delegates to a new Query class that uses SQLA to build query objects.

  5. One-by-one, migrate all the other backends so that they inherit from the SQLite backend and classes, added fixes in subclasses where necessary. Move the fixes down into the backend where possible.

  6. Eliminate the whole 'backend' concept, pulling up the SQLite implementation as the only implementation, renaming it etc, and removing our own connection wrappers.

(Steps 3 and 4 might need to be combined/reversed).

== Strategy 3:

Same as strategy 2 except start by creating a 'SQLAlchemy' backend (which can connect to any of the DBs via configuration). It might be more confusing as it involves an extra layer of indirection. However, it might be a better strategy, because you could actually develop this in a branch that can be used in production, because just by a config change you can use the new method. You could even merge this branch into master regularly, which would encourage development.

Also, in this strategy, the first step - a SQLAlchemy backend that still generates all queries the old way but has SQLAlchemy connection objects

  • would be useful by itself. It would solve the problem of transactions and a slow test suite for the case of a mixed Django/SQLA code base (see above).

When the SQLAlchemy backend can fully pass the test suite for all different DBs, and is using all SLQAlchemy stuff for SQL generation, it will become the only backend, and the extra layer of indirection can be removed, along with all the old code we are no longer using.

In all strategies, I haven't really thought about the contrib.gis and the backends needed there. There are some SQLAlchemy libraries with GIS extensions - e.g. GeoAlchemy [6]

=== Negative impact for users ===

There would be backwards incompatibilities for anyone using internals, although quite a lot of them could be kept to a minimum with some shims (probably temporary shims, which would be removed eventually). Some documented APIs may be problematic e.g. QuerySet.extra(). These already have various bugs, and it is possible that the re-write may expose the bugs as being symptoms of fundamental flaws, rather than things we can fix. We probably want to drop extra() for 2.0 anyway.

The level of incompatibilities, and the need for external dependency, make me think that this is best suited for Django 2.0,

=== Further work or alternatives ===

This may or may not lead to adopting the SQLAlchemy ORM. This proposal makes sense with or without that. That would be a much bigger change, because it would change a lot publicly documented and highly used APIs, and would introduce more subtle semantic changes due to things like SQLA's Unit-of-Work etc.

I've emailed Mike Bayer about this proposal, and he is certainly happy with the idea in general. He would want us to consider actually an alternative, which is to use the SQLAlchemy ORM, not just Core, with some backwards compatible wrappers for things like QuerySet. I wouldn't be against that proposal, as I've heard great things about the SQLA ORM, but I'm not in a position to recommend it myself, and it does introduce some major semantic changes, even with a compatibility shim for syntax.

Another change this might lead to is getting rid of Q objects and F objects, which are a poor man's way of referring to chunks of SQL, and can at least be better implemented using SQLA's Expression Language, and possibly removed altogether.

Finally, I should say thanks to Mike Bayer for the time he took just looking at this proposal, and for his feedback, some of which I've incorporated into this.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment