This document describes the proposed implementation for persisted map/reduce views in PouchDB. This is based on the ongoing discussion in pouchdb/mapreduce#12.
I decided to make this a formal-ish spec, because the dicussion was starting to go long and it was becoming unclear what implementation we were talking about.
Basically we need to support the following operations:
saveQuery(viewName, queryDefinition)
(Or createIndex
or whatever.) Basically this corresponds to PUT
ing a _design
doc in CouchDB; it starts the whole index-building process. queryDefinition
is the object with map/reduce functions, e.g. {map : ..., reduce: ...}
.
query(viewName, opts)
Works the same as the current query()
, but faster because everything is a range query in the database. Need to support standard key
, keys
, startkey
, endkey
, startkey_docid
, endkey_docid
, skip
, and offset
options.
Docs can emit multiple key/value combos, and the results are returned in the order [key, docid, value]
using the standard CouchDB collation ordering.
TODO: How to we know when the index is stale
?
onDocDeleted(doc)
When a doc is deleted, all its associated map/reduce outputs get deleted too.
onDocUpdated(doc, revId)
When a doc is updated, all its associated map/reduce outputs get deleted and replaced with the new ones.
Thanks to the indexable string implementation, we can map every key to a string that maintains CouchDB's collation ordering regardless of the type of the key - string, number, object, array, whatever. (Of course we'll miss some intricacies of libicu ordering, case-sensitivity etc., but it's close enough.)
We'll leverage toIndexableString()
to make the output of every emit
its own doc, where the _id
is toIndexableString([key, docid, value])
. Notice that this also allows us to easily perform range queries in the database itself (!) regardless of whether it's WebSQL, IDB, or LevelDB. It's a huge performance win, which is the whole point of caching these queries.
However, we'll also need to be able to look up these documents by docid alone (when a doc is deleted or updated). So we'll store two types of documents: one that simply contains the emitted key/value output, and one that contains a list of references to the previous type. To distinguish the two, we'll again leverage toIndexableString
and simply mark the first type as 1
and the second type as 2
- e.g. toIndexableString([1, key, docid, value])
for the first type and toIndexableString([2, docid])
for the second type.
Consider a database with two documents:
[
{
"_id": "doc1",
"_rev": "1-0348cd6cc49cb4cacdb9b94c87c83808",
"foo": 1
},
{
"_id": "doc2",
"_rev": "2-4145685b040d90543ed2704a0727be5c",
"foo": 2
}
]
Assume the user saves the map function:
function (doc) {
emit(doc.foo, "bar");
emit(doc.foo, "baz");
}
Using the proposed implementation, we create a new PouchDB (either namespaced by the view name or its md5sum, tbd) and save four map/reduce output docs (2 emits × 2 docs):
[
{
"_id": "5_3_2_324_1\u00003_2_324_1\u00004_doc1\u00004_bar\u0000\u0000",
"docid": "doc1",
"docrev": "1-0348cd6cc49cb4cacdb9b94c87c83808",
"key": 1,
"value": "bar"
},
{
"_id": "5_3_2_324_1\u00003_2_324_1\u00004_doc1\u00004_baz\u0000\u0000",
"docid": "doc1",
"docrev": "1-0348cd6cc49cb4cacdb9b94c87c83808",
"key": 1,
"value": "baz"
},
{
"_id": "5_3_2_324_1\u00003_2_324_2\u00004_doc2\u00004_bar\u0000\u0000",
"docid": "doc2",
"docrev": "2-4145685b040d90543ed2704a0727be5c",
"key": 2,
"value": "bar"
},
{
"_id": "5_3_2_324_1\u00003_2_324_2\u00004_doc2\u00004_baz\u0000\u0000",
"docid": "doc2",
"docrev": "2-4145685b040d90543ed2704a0727be5c",
"key": 2,
"value": "baz"
}
]
In each case, the _id
is simply toIndexableString([1, key, docid, value])
. Rev is included to be used with include_docs
on later queries.
We also index two of the second type of document (one per user doc) to map back to these docs:
[
{
"1-0348cd6cc49cb4cacdb9b94c87c83808": [
"5_3_2_324_1\u00004_doc1\u00004_bar\u0000\u0000",
"5_3_2_324_1\u00004_doc1\u00004_baz\u0000\u0000"
],
"_id": "5_3_2_324_2\u00004_doc1\u0000\u0000"
},
{
"2-4145685b040d90543ed2704a0727be5c": [
"5_3_2_324_2\u00004_doc2\u00004_bar\u0000\u0000",
"5_3_2_324_2\u00004_doc2\u00004_baz\u0000\u0000"
],
"_id": "5_3_2_324_2\u00004_doc2\u0000\u0000"
}
]
The first one corresponds to doc1, the second to doc2 (the ids are just toIndexableString([2, docid])
. All they do is map revIds back to the ids of the previous type of documents. (We may or may not actually have to include the revId itself, TBD)
For query(viewName, opts)
, we just do a range query where startkey
, endkey
, etc. have been converted to their indexableString
equivalents. If the user specifies include_docs=true
, then we already have the docid
and docrev
, so we can easily pull that info out of the database. (TODO: can we handle joined docs in a similar way?)
For onDocDeleted(doc)
, we do a database-level transaction (WebSQL and IDB support this, LevelDB...?) that deletes the second type of doc and all its associated docs.
For onDocUpdated(doc, revId)
, we do a database-level transaction that modifies the second type of doc, deletes all instances of the first type of doc (corresponding to the old revid), and adds new instances of the first type of doc for the new emitted keys/values. (And maybe if the emitted keys/values didn't change, we don't do anything.)
- Joined docs?
- Reduce?
- LevelDB?