CouchDB as an application database kind of sucks. Basically the only reason to use it is that it easily supports unstructured data. Even so, its limitations have a ton of negative ripple effects on basically the entire development process.
Luckily, Postgres has an hstore column type and has recently added a json column type. json supports arbitrary JSON, but it's stored as text, so operations on it are slower. hstore stores only a flat mapping of string keys and values in a fast binary format; nesting is on tap for 9.4 (Q3 2014), and non-string types are on the roadmap. In the meantime, nesting can be simulated by storing the full path to a value as the string key, and types can be handled in the application layer, for example with a query-building abstraction that automatically casts hstore values to their expected types in database queries.
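As a sketch of what that application-layer handling could look like (the function names and type map here are hypothetical, not existing code), nested documents can be flattened into path-keyed hstore pairs, with a per-property type map used to cast values back on read:

```python
# Sketch: simulate hstore nesting by encoding the full path in the key,
# and handle types in the application layer. Names are illustrative.

def flatten(doc, prefix=""):
    """Flatten a nested dict into hstore-style string keys/values,
    encoding the full path to each value in the key."""
    flat = {}
    for key, value in doc.items():
        path = f"{prefix}.{key}" if prefix else key
        if isinstance(value, dict):
            flat.update(flatten(value, path))
        else:
            flat[path] = str(value)  # hstore values must be strings
    return flat

# Expected type per property; a query builder could use the same map to
# emit SQL casts like (data->'age')::int instead of casting in Python.
TYPES = {"age": int, "address.zip": str}

def read_property(flat, key):
    """Read a flattened value back, cast to its expected type."""
    return TYPES.get(key, str)(flat[key])

doc = {"age": 31, "address": {"zip": "02139", "city": "Boston"}}
flat = flatten(doc)
# flat == {"age": "31", "address.zip": "02139", "address.city": "Boston"}
```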
Postgres supports a number of other features that all together make it look like a great solution for scalably storing multi-tenant semi-structured data.
Because of how much better Postgres could be than CouchDB, I think it's important to fully explore Postgres and determine what the right solution is. Even if we go with something other than the right solution, it's important to know what tradeoffs have been made so they can be learned from and potentially addressed in the future.
CouchDB has a lot of very big drawbacks that pervasively depress productivity and power.
For data that actually has a schema, it's ideal for the schema to live in the database. Otherwise, the schema lives in the code or, even worse, in someone's head, which is bad for maintainability.
Schema in the database eliminates a whole class of bugs.
Lack of ACID means that when doing complex multi-step operations with CouchDB, you lose guarantees that your data is correct.
Inability to do many types of complex queries with CouchDB leads to pervasive slowness due to the N+1 query problem.
Map-reduce views are hard to write and allow you to do a limited subset of the things you can do with SQL.
Reindexing is probably the worst.
First of all, it's a much better development experience to be able to change a query and refresh a page than having to reindex first.
Secondly, reindexing is a serious impediment to continuous deployment. It requires a fairly complicated setup that periodically checks whether the views currently being preindexed have finished, and only then completes the deploy.
As we get more and more data and reindexing takes longer, this problem will become more acute.
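The core of that deploy gate can be sketched as a readiness check over CouchDB's per-design-doc status endpoint. This assumes the response shape of GET /db/_design/{ddoc}/_info, which reports view_index.updater_running; the function name and surrounding script are hypothetical:

```python
# Sketch of the deploy gate: a deploy script fetches _info for each design
# doc over HTTP, then sleeps and retries until this returns True.

def views_ready(info_responses):
    """info_responses: parsed _info JSON bodies, one per design doc.

    Ready only when no view indexer is still running."""
    return all(
        not info.get("view_index", {}).get("updater_running", False)
        for info in info_responses
    )
```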
Continuous deployment is pretty important.
- It forces you to automate dependencies
- It prevents you from having to do manual deploys, which eat up a huge amount of time and stress.
- For changes that don't get QA'd and can be deployed to prod immediately, it makes it easier to pinpoint what change caused a bug
- Knowing that as soon as you merge something it's going to get deployed has a psychological effect of making people review changes more attentively
- Code review is the most effective way of catching bugs and imparting knowledge, so continuous deployment therefore can lead to a nice virtuous cycle
- For a change that needs to be QA'd, it makes it easier to set up a bunch of automated deployments - staging, beta, etc. - that can minimize QA overhead for certain types of changes, offload more testing to staff and users, and ensure that QA happens after code review, which is vital for incentivizing quality incremental changes
Currently we have an ETL setup to duplicate some data into Postgres, and we duplicate some data in ElasticSearch.
Both of these require custom application code, which is fairly complex and requires expertise to maintain.
They add even more complexity to any potential continuous deployment setup, and they seem to have issues that require manual intervention fairly regularly.
Additionally, Fluff reports are harder to write than SQL, as I've argued in the past.
On top of that, this is quite a lot of duplicated data, and in the ElasticSearch case it's probably going to be quite expensive to duplicate everything.
In order to do the "super case list" with ElasticSearch, we use a really ugly hack that converts every key-value pair into some sort of dictionary, to work around the fact that different domains might use different types for the same property. This makes queries hard to read and write manually and requires that you use a particular abstraction in Python to write queries.
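The shape of that hack is presumably something like the following sketch (not the actual code — field names and structure are assumptions): each property becomes an entry whose value lands in a type-specific field, so two domains using the same property name with different types never collide in the ElasticSearch mapping.

```python
# Hypothetical sketch of the key/value transformation described above.

def to_kv_entries(properties):
    """Turn a flat property dict into typed key/value entries so that
    conflicting types for the same key can't break the ES mapping."""
    entries = []
    for key, value in properties.items():
        entry = {"key": key}
        if isinstance(value, bool):  # check bool before int/float
            entry["value_bool"] = value
        elif isinstance(value, (int, float)):
            entry["value_numeric"] = value
        else:
            entry["value_string"] = str(value)
        entries.append(entry)
    return entries
```

Querying "age > 30" then requires a nested query matching key == "age" and value_numeric > 30, which is exactly why these queries are painful to write by hand.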
A handful of Postgres features together should be sufficient to do everything that CommCare HQ currently does, at least as well as it's done now.
Here's the basic idea (hstore syntax may not be exactly correct):
Store data using hstore.
CREATE EXTENSION IF NOT EXISTS hstore;
CREATE TABLE forms (domain text, data hstore);
We solve the problem of different domains having different types for the same property by using partial indexes.
CREATE INDEX foo_form_property_bar ON forms ((data->'bar'))
WHERE domain = 'foo';
We can create a view, which will be fast because of the indexes we've created.
CREATE VIEW foo_form_type_baz AS
SELECT data->'bar' AS bar
FROM forms
WHERE domain = 'foo'
AND data->'type' = 'baz';
Then someone (such as a standard BI tool) can query this like
SELECT * FROM foo_form_type_baz
WHERE bar = 'some_value';
We can update these indexes and views for all the possible form properties every time an app definition changes.
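That update step could be driven by simple DDL generators keyed off the app definition. The helper names and the naive identifier interpolation below are illustrative only; a real version would sanitize and quote identifiers properly:

```python
# Sketch: regenerate partial indexes and per-form-type views whenever an
# app definition changes. Identifier handling is deliberately simplistic.

def index_ddl(domain, prop):
    """Partial expression index on one property for one domain."""
    return (
        f"CREATE INDEX {domain}_form_property_{prop} "
        f"ON forms ((data->'{prop}')) WHERE domain = '{domain}';"
    )

def view_ddl(domain, form_type, props):
    """View exposing a domain's form type as ordinary typed-looking columns."""
    cols = ", ".join(f"data->'{p}' AS {p}" for p in props)
    return (
        f"CREATE VIEW {domain}_form_type_{form_type} AS "
        f"SELECT {cols} FROM forms "
        f"WHERE domain = '{domain}' AND data->'type' = '{form_type}';"
    )
```

On each app definition change, the migration job would diff the desired set of indexes/views against the current catalog and issue the missing CREATE (and stale DROP) statements.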
Additionally, if we know that a form property is a GPS value, we can do something like this in order to have access to the advanced features of PostGIS:
CREATE INDEX foo_form_property_gpsval ON forms
USING GIST (ST_PointFromText(data->'gpsval'))
WHERE domain = 'foo';
But what about horizontal scalability?
This can be accomplished with partitions and tablespaces:
CREATE TABLESPACE tablespace1 LOCATION '/mnt/sda1/postgresql/data'; -- networked drive on one machine
CREATE TABLESPACE tablespace2 LOCATION '/mnt/sda2/postgresql/data'; -- networked drive on another machine
CREATE TABLE forms_1 (
    CHECK ( domain = 'domain_one' )
) INHERITS (forms) TABLESPACE tablespace1;
CREATE TABLE forms_2 (
    CHECK ( domain = 'domain_two' )
) INHERITS (forms) TABLESPACE tablespace2;
This provides transparent database-level sharding. It's probably not worth automating changes to the sharding function; it would just require some manual monitoring, and moving data when a machine gets full.
Note that with inheritance-based partitioning, indexes aren't inherited, so they should be created on each partition table rather than on the parent.
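The manual side of this scheme amounts to a small amount of application-side bookkeeping, sketched below (the mapping and helper names are illustrative, not existing code): a static domain-to-partition map for routing writes, plus a generator for the DDL when a partition is added. With the CHECK constraints above and constraint_exclusion enabled, reads against the parent forms table are pruned automatically.

```python
# Sketch of application-side bookkeeping for domain-based sharding.

DOMAIN_TO_PARTITION = {
    "domain_one": "forms_1",
    "domain_two": "forms_2",
}

def partition_for(domain):
    """Route writes to the partition table holding this domain's rows."""
    return DOMAIN_TO_PARTITION[domain]

def partition_ddl(n, domain, tablespace):
    """Generate DDL for a new child partition on a given tablespace."""
    return (
        f"CREATE TABLE forms_{n} (\n"
        f"    CHECK (domain = '{domain}')\n"
        f") INHERITS (forms) TABLESPACE {tablespace};"
    )
```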
- Minimal index size; reindexing can be done online, meaning preindexing before a deploy is not necessary.
- No complicated data duplication necessary. If you want to reduce reporting load on the main database, you can just duplicate the entire database once. Very simple, and probably not even necessary.
Start duplicating all data between CouchDB and Postgres, back and forth. (The Postgres-to-CouchDB direction is probably going to be a lot of work with unfamiliar tooling.)
Stop writing new features in CouchDB and ElasticSearch. Gradually migrate old code to Postgres.
Using plv8js and the fact that CouchDB presents a relatively small interface, it's not completely inconceivable to modify Couchdbkit to use Postgres instead of CouchDB. You would lose caching of reduce views, some things might be impossible, and it would probably be hard for any map view that emits more than once. However, I doubt this is worth the effort over just porting old code.
There's a reason they call Postgres the open source version of Oracle.