Skip to content

Instantly share code, notes, and snippets.

@nolanlawson
Last active August 29, 2015 14:05
Show Gist options
  • Star 0 You must be signed in to star a gist
  • Fork 0 You must be signed in to fork a gist
  • Save nolanlawson/340cb898f8ed9f3db8a0 to your computer and use it in GitHub Desktop.
Save nolanlawson/340cb898f8ed9f3db8a0 to your computer and use it in GitHub Desktop.
Improvements to the TouchDB replication algorithm in PouchDB 3.0.2

Improvements to the TouchDB replication algorithm in PouchDB 3.0.2

Edit: we had to roll back this feature. It ended up introducing conflicts in PouchDB (#2685). The TouchDB implementation is still the fastest I'm aware of.

Up until PouchDB 3.0.x, our replication algorithm was largely the same as TouchDB's, as described here.

However, recently we have discovered an improvement to this algorithm, which results in a nearly 20x speedup, as measured by using the npm-browser project. Details are in these GitHub issues: #2472, #2475, and #2647.

Obviously this "20x" number will vary depending on latency and connection speed, but we're definitely in the "decimal order of magnitude" area, which is great. I cannot take credit for this discovery; it was Dale Harvey who proposed it (#2466).

Our new algorithm is almost identical to the original TouchDB algorithm, with one crucial difference. At step 5, notice it says:

5. GET each such revision from the source database. Use the ?revs=true URL parameter to include its list of parent revisions, so the target database can update its revision tree.

Ouch, a GET request for every document you want to replicate! That's a helluva lot of HTTP requests. And with CORS it's even worse, because it ends up being two requests per document (one OPTIONS for each GET).

However, let's imagine you replace Step 3, which reads:

3. Fetch the source database’s _changes feed, starting just past the last source sequence ID (if any).

with this instead:

3. Fetch the source database’s _changes feed, starting just past the last source sequence ID (if any). Use include_docs=true to fetch the documents at the same time that you fetch the changes.

If you use include_docs=true to get the latest version of each document in the _changes feed itself, then you can actually skip step 5 for most documents, meaning you don't need to do a separate GET for each document in the changes feed. There are only three exceptional cases, where you do need to issue a separate GET:

  1. _deleted documents (because CouchDB doesn't let you post them to the end of the rev tree with new_edits=false, unless you include at least the most recent non-deleted revision)
  2. documents with _attachments (because you don't get these from _changes)
  3. documents with more than one missing revision, or where the revision is not what you got from the _changes feed.

This is actually a very minor change to the algorithm, which results in a huge boost in performance. You can see our own implementation here, and you may note that it's only about ~20 lines of code.

The other change is that we no longer need the TouchDB optimization of fetching the generation-1 revisions using all_docs (as described in #993). Since we just get those documents anyway in the _changes feed, that step is no longer necessary.

We discovered the correct implementation of this algorithm through a painful process of trial-and-error. (The hasty PouchDB 3.0.2 release, which followed only a week after 3.0.0, is a testament to this.) So hopefully this writeup can prove useful to other members of the Couch family.

@AdrianRossouw
Copy link

Man. I don't actively follow pouchdb, but I thought this is just how it always worked.
good job finding this and giving me another reminder that I really need to build something
with it soon =)

@nolanlawson
Copy link
Author

Our celebrations turned out to be premature. This "improvement" introduced bogus conflicts: pouchdb/pouchdb#2685.

We're going to have to revert this, sadly.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment