nolanlawson/README.md

## README.md

      
    Raw
  

              README.md
            
          
    PouchDB map/reduce view specification

This document describes the proposed implementation for persisted map/reduce views in PouchDB.  This is based on the ongoing discussion in pouchdb/mapreduce#12.
I decided to make this a formal-ish spec, because the dicussion was starting to go long and it was becoming unclear what implementation we were talking about.
Required API

Basically we need to support the following operations:
saveQuery(viewName, queryDefinition)

(Or createIndex or whatever.)  Basically this corresponds to PUTing a _design doc in CouchDB; it starts the whole index-building process.  queryDefinition is the object with map/reduce functions, e.g. {map : ..., reduce: ...}.
query(viewName, opts)

Works the same as the current query(), but faster because everything is a range query in the database.  Need to support standard key, keys, startkey, endkey, startkey_docid, endkey_docid, skip, and offset options.
Docs can emit multiple key/value combos, and the results are returned in the order [key, docid, value] using the standard CouchDB collation ordering.
TODO: How to we know when the index is stale?
onDocDeleted(doc)

When a doc is deleted, all its associated map/reduce outputs get deleted too.
onDocUpdated(doc, revId)

When a doc is updated, all its associated map/reduce outputs get deleted and replaced with the new ones.
Proposed implementation

Thanks to the indexable string implementation, we can map every key to a string that maintains CouchDB's collation ordering regardless of the type of the key - string, number, object, array, whatever.  (Of course we'll miss some intricacies of libicu ordering, case-sensitivity etc., but it's close enough.)
We'll leverage toIndexableString() to make the output of every emit its own doc, where the _id is toIndexableString([key, docid, value]).  Notice that this also allows us to easily perform range queries in the database itself (!) regardless of whether it's WebSQL, IDB, or LevelDB.  It's a huge performance win, which is the whole point of caching these queries.
However, we'll also need to be able to look up these documents by docid alone (when a doc is deleted or updated).  So we'll store two types of documents: one that simply contains the emitted key/value output, and one that contains a list of references to the previous type.  To distinguish the two, we'll again leverage toIndexableString and simply mark the first type as 1 and the second type as 2 - e.g. toIndexableString([1, key, docid, value]) for the first type and toIndexableString([2, docid]) for the second type.
Example

Consider a database with two documents:
[
	{
	    "_id": "doc1", 
	    "_rev": "1-0348cd6cc49cb4cacdb9b94c87c83808", 
	    "foo": 1 
	}, 
	{
	    "_id": "doc2", 
	    "_rev": "2-4145685b040d90543ed2704a0727be5c", 
	    "foo": 2
	}
]
Assume the user saves the map function:
function (doc) {
  emit(doc.foo, "bar");
  emit(doc.foo, "baz");
}
Using the proposed implementation, we create a new PouchDB (either namespaced by the view name or its md5sum, tbd) and save four map/reduce output docs (2 emits × 2 docs):
[
    {
        "_id": "5_3_2_324_1\u00003_2_324_1\u00004_doc1\u00004_bar\u0000\u0000", 
        "docid": "doc1", 
        "docrev": "1-0348cd6cc49cb4cacdb9b94c87c83808", 
        "key": 1, 
        "value": "bar"
    }, 
    {
        "_id": "5_3_2_324_1\u00003_2_324_1\u00004_doc1\u00004_baz\u0000\u0000", 
        "docid": "doc1", 
        "docrev": "1-0348cd6cc49cb4cacdb9b94c87c83808", 
        "key": 1, 
        "value": "baz"
    }, 
    {
        "_id": "5_3_2_324_1\u00003_2_324_2\u00004_doc2\u00004_bar\u0000\u0000", 
        "docid": "doc2", 
        "docrev": "2-4145685b040d90543ed2704a0727be5c", 
        "key": 2, 
        "value": "bar"
    }, 
    {
        "_id": "5_3_2_324_1\u00003_2_324_2\u00004_doc2\u00004_baz\u0000\u0000", 
        "docid": "doc2", 
        "docrev": "2-4145685b040d90543ed2704a0727be5c", 
        "key": 2, 
        "value": "baz"
    }
]
In each case, the _id is simply toIndexableString([1, key, docid, value]).  Rev is included to be used with include_docs on later queries.
We also index two of the second type of document (one per user doc) to map back to these docs:
[
    {
        "1-0348cd6cc49cb4cacdb9b94c87c83808": [
            "5_3_2_324_1\u00004_doc1\u00004_bar\u0000\u0000", 
            "5_3_2_324_1\u00004_doc1\u00004_baz\u0000\u0000"
        ], 
        "_id": "5_3_2_324_2\u00004_doc1\u0000\u0000"
    }, 
    {
        "2-4145685b040d90543ed2704a0727be5c": [
            "5_3_2_324_2\u00004_doc2\u00004_bar\u0000\u0000", 
            "5_3_2_324_2\u00004_doc2\u00004_baz\u0000\u0000"
        ], 
        "_id": "5_3_2_324_2\u00004_doc2\u0000\u0000"
    }
]
The first one corresponds to doc1, the second to doc2 (the ids are just toIndexableString([2, docid]).  All they do is map revIds back to the ids of the previous type of documents.  (We may or may not actually have to include the revId itself, TBD)
How this works

For query(viewName, opts), we just do a range query where startkey, endkey, etc. have been converted to their indexableString equivalents.  If the user specifies include_docs=true, then we already have the docid and docrev, so we can easily pull that info out of the database. (TODO: can we handle joined docs in a similar way?)
For onDocDeleted(doc), we do a database-level transaction (WebSQL and IDB support this, LevelDB...?) that deletes the second type of doc and all its associated docs.
For onDocUpdated(doc, revId), we do a database-level transaction that modifies the second type of doc, deletes all instances of the first type of doc (corresponding to the old revid), and adds new instances of the first type of doc for the new emitted keys/values.  (And maybe if the emitted keys/values didn't change, we don't do anything.)
TODOs


Joined docs?
Reduce?
LevelDB?