Edit: we had to roll back this feature. It ended up introducing conflicts in PouchDB (#2685). The TouchDB implementation is still the fastest I'm aware of.
Up until PouchDB 3.0.x, our replication algorithm was largely the same as TouchDB's, as described here.
However, recently we have discovered an improvement to this algorithm, which results in a nearly 20x speedup, as measured by using the npm-browser project. Details are in these GitHub issues: #2472, #2475, and #2647.
Obviously this "20x" number will vary depending on latency and connection speed, but we're definitely in the "decimal order of magnitude" area, which is great. I cannot take credit for this discovery; it was Dale Harvey who proposed it (#2466).
Our new algorithm is almost identical to the original TouchDB algorithm, with one crucial difference. At step 5, notice it says:
5. GET each such revision from the source database. Use the ?revs=true URL parameter to include its list of parent revisions, so the target database can update its revision tree.
Ouch, a GET
request for every document you want to replicate! That's a helluva lot of HTTP requests. And with CORS it's even worse, because it ends up being two requests per document (one OPTIONS
for each GET
).
However, let's imagine you replace Step 3, which reads:
3. Fetch the source database’s _changes feed, starting just past the last source sequence ID (if any).
with this instead:
3. Fetch the source database’s _changes feed, starting just past the last source sequence ID (if any). Use
include_docs=true
to fetch the documents at the same time that you fetch the changes.
If you use include_docs=true
to get the latest version of each document in the _changes
feed itself, then you can actually skip step 5 for most documents, meaning you don't need to do a separate GET
for each document in the changes feed. There are only three exceptional cases, where you do need to issue a separate GET
:
_deleted
documents (because CouchDB doesn't let you post them to the end of the rev tree withnew_edits=false
, unless you include at least the most recent non-deleted revision)- documents with
_attachments
(because you don't get these from_changes
) - documents with more than one missing revision, or where the revision is not what you got from the
_changes
feed.
This is actually a very minor change to the algorithm, which results in a huge boost in performance. You can see our own implementation here, and you may note that it's only about ~20 lines of code.
The other change is that we no longer need the TouchDB optimization of fetching the generation-1 revisions using all_docs
(as described in #993). Since we just get those documents anyway in the _changes
feed, that step is no longer necessary.
We discovered the correct implementation of this algorithm through a painful process of trial-and-error. (The hasty PouchDB 3.0.2 release, which followed only a week after 3.0.0, is a testament to this.) So hopefully this writeup can prove useful to other members of the Couch family.
Our celebrations turned out to be premature. This "improvement" introduced bogus conflicts: pouchdb/pouchdb#2685.
We're going to have to revert this, sadly.