thomcc/remerge-design.md

## remerge-design.md

      
    Raw
  

              remerge-design.md
            
          
    Remerge design

Caveat: These are fairly rough notes.
Some team in the future pulls in the remerge android component/xpcom component/jsm/etc, authors a schema and initializes a remerge database.
They insert/update records in the database, and these changes are all checked for validity versus the schema.
The schema provides:

The name of the collection.
The version of this schema.
The minimum version a client must understand in order to sync with this schema (used in the case that this schema is downloaded by an older client), allowing them to evolve their schema while only locking out clients they no longer care about supporting
The set of "remerge features" used by the schema/collection (which allows us to evolve remerge without forcing compatibility breaks on older )
Any keys (potentially compound) for the collection (this is, perhaps confusingly, known as "dedupe_on").
Description of the data types of every field for records belonging to the collection.

Schema updates are synced to older clients in addition to record changes. In the case that the schema is newer than the schema version that a client was initialized with (the "native" schema), both schemas are kept active, however the "newer-than-native" schema is essentially only used when syncing (There are a few exceptions here -- if the newer schema specifies that there should be a creation or last-update timestamp on a record, things of that sort will be applied, read the documentation for details).
Clients use pairs of (vector clocks, server timestamps) to detect and resolve conflicts, and perform proper three way merges when possible, falling back to two way merging when it's not. They maintain a mirror of what's expected to be on the server.
Issues with remerge design

Some issues with the design of remerge have become apparent when implementing it, and for the sake of transparency for the upcoming architecture review, I figured I'll go over them.


Remerge is not particularly robust in the face of mistakes in the schema definition, and it lacks methods of detecting these mistakes in advance in many cases.

Even things as innocuous as accidentally forgetting to bump the schema version number,
The outcome here depends on what the flaw in the schema in question is, and ranges from "some clients will have sync errors indefinitely" to "no clients can successfully sync until an updated schema is put out and propogated to all syncing clients".


Remerge is not robust against various forms of data corruption.

In particular, it might be worth storing checksums with the schemas, to protect against the sorts of corruptions we know happen in practice. A big question would be how to handle bad clients.


Vector clocks grow without bound.

Interval tree clocks are likely the best solution, however they have substantial complexity

I spent some time sketching out what would be required here, and ended up with a 2.5kloc rust crate that implements 80% of what we need, compared to the 250 line, including tests, module that was required for vector clocks. Moreover, while there is a conceptual beauty to the interval tree clock algorithm, it's very math-heavy (Client identifiers are specially encoded functions of real in [0, 1) -> bool -- the intervals in question are the sections of these the input where the function returns true). Ultimately I came to the conclusion that it was too complex and hard to reason about for our purposes... At least for now.


Short of this, arbitrary pruning was planned, but it's not 100% clear how to do this in practice. Ideally we'd do it in such a way that a pruned client would be considered conflicting on return


No ability for adding indices, or optimized queries.

Desires for these things have cropped up with logins since the initial design of remerge. Some form of caching could be added, but it's unclear how far we would want to go here. In particular, caching would not work well with a multiprocess client.


Unclear how to handle clients which are known to be bad. For example, the system permits detecting that client consistently uploads an invalid schema, but it's unclear what to do with it.


I think these are the major ones. On the bright side, remerge has a system builtin for fundamental changes like the ones that might be required to fix these issues (remerge_features).
Additionally, we expect chrome.storage.sync to be a relatively static system (no schema migration), so it shouldn't be impacted by these issues.
I'll send out the document for the architecture review itself shortly.