Generic Data Sync

Note: this text may be stale; see this Google Doc for the latest version.

Unfortunately, with the (indefinite) "pause" of Mentat, there's no obvious path forward for new synced data types beyond 'the Sync team implements a new component'. Presumably, at some point we decided that new synced data types were desirable but that this approach was unworkable, hence designing Mentat. After some thought, I've come up with a plan that gets us some of the benefits of Mentat, with the following major advantages over it:

  1. Works on top of Sync 1.5
    • A couple of extensions to the Sync 1.5 server protocol would help, but are not necessary.
  2. Doesn't change the sync data model substantially.
  3. Doesn't require us to implement a complex database system.

Background/Goals/Etc

At one of the All Hands, Lina gave a presentation that defined three different types of synced data stores.

  1. Tree stores (bookmarks). The defining features of these stores are that:
    1. They represent a tree.
    2. They are considered corrupt if tree constraints are invalidated.
  2. Log stores (history). The defining features of these stores are that:
    1. They are typically too large to fit in memory.
    2. We expect to sync only a subset of the records in them.
  3. Record stores (logins, addresses, credit cards, addons, etc)

This document describes a plan for syncing "Type 3" data stores in a generic way, extended to allow the following additional features not present in the current system:

  1. Some degree of schema evolution.
  2. Inter-record references (even across collections).

Description

Basic Type-3 Store support

We'll start with how to support type-3 stores without the two extra features, and I'll then explain how to add those.

Essentially, the Logins module serves as something of a template for the basic idea. It implements proper sync with three-way merge, and most of it can be done relatively independently of the data storage. Additionally, the API exposed over the FFI has very little dependence on the type of data stored -- it returns JSON blobs.
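
To make that concrete, here's a minimal sketch of the kind of type-erased store API this implies. The trait and method names are hypothetical, not an existing interface:

/// A record store that only deals in JSON blobs; it needs no knowledge of the
/// shape of the records it holds. (Sketch only -- names are hypothetical.)
pub trait RecordStore {
    type Error;

    /// Fetch a record as a JSON blob, or None if no such record exists.
    fn get(&self, collection: &str, guid: &str) -> Result<Option<String>, Self::Error>;

    /// Insert or update a record from a JSON blob, returning its guid.
    fn upsert(&self, collection: &str, record_json: &str) -> Result<String, Self::Error>;

    /// Write a tombstone for a record.
    fn delete(&self, collection: &str, guid: &str) -> Result<(), Self::Error>;
}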

We'd have a schema something like this:

CREATE TABLE IF NOT EXISTS local_records (
    -- Row ID
    id             INTEGER PRIMARY KEY,
    -- Sync GUID
    guid           TEXT NOT NULL UNIQUE CHECK(length(guid) == 12),
    -- The record payload
    record_json    TEXT NOT NULL CHECK(json_valid(record_json)),
    -- Local modification timestamp in milliseconds
    local_modified INTEGER NOT NULL DEFAULT 0 CHECK(local_modified >= 0),
    -- Is this a tombstone
    is_deleted     TINYINT NOT NULL DEFAULT 0,
    -- Sync status, one of 0 through 3 (see the notes below)
    sync_status    TINYINT NOT NULL DEFAULT 0 CHECK(sync_status BETWEEN 0 AND 3),
    -- Support for storing multiple collections in the same database.
    coll_id        INTEGER NOT NULL,
    FOREIGN KEY(coll_id) REFERENCES collections(id)
);

CREATE TABLE IF NOT EXISTS mirror_records (
    -- Row ID
    id             INTEGER PRIMARY KEY,
    -- Sync GUID
    guid           TEXT NOT NULL UNIQUE CHECK(length(guid) == 12),
    -- The payload
    record_json    TEXT NOT NULL        CHECK(json_valid(record_json)),
    -- in milliseconds (a sync15::ServerTimestamp multiplied by 1000 and truncated)
    server_modified INTEGER NOT NULL CHECK(server_modified >= 0),
    -- Whether or not the item in local_records overrides this
    is_overridden   TINYINT NOT NULL DEFAULT 0,
    -- Support for storing multiple collections in the same database.
    coll_id         INTEGER NOT NULL,
    FOREIGN KEY(coll_id) REFERENCES collections(id)
);

-- Fairly simple, exists so that we don't need 1 database per collection.
CREATE TABLE IF NOT EXISTS collections (
    id        INTEGER PRIMARY KEY,
    name      TEXT NOT NULL UNIQUE,
    -- Server last sync timestamp (1000 * sync15::ServerTimestamp),
    -- or null if we've never synced.
    last_sync INTEGER
);

Most of these fields are the same as in logins, which has good documentation for them in its header comment in schema.rs. Some subtle differences are:

  • We're using CHECK constraints heavily.
    • It's possible we wouldn't use all of these, as they could cause extensibility problems in the future (for example, if we need to add a new sync status).
  • sync_status is checked as BETWEEN 0 AND 3, whereas logins checks (in Rust code) that it's between 0 and 2. The extra value is a new SyncStatus::Unknown, which logins doesn't support, but which we have supported on other collections in the past (see the sketch after this list).
  • Most modified timestamps default to 0, not NULL, when unset.
  • The existence of collections and coll_id, which are just a way to avoid requiring a large number of database files.
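
For reference, here's a sketch of what the four allowed sync_status values might look like as a Rust enum. The first three variants mirror logins' SyncStatus; the exact numeric mapping here is an assumption:

#[repr(u8)]
#[derive(Clone, Copy, Debug, PartialEq, Eq)]
pub enum SyncStatus {
    /// The record is in sync with the mirror.
    Synced = 0,
    /// The record has local changes that need to be uploaded.
    Changed = 1,
    /// The record was created locally and has never been uploaded.
    New = 2,
    /// We don't know the state of the record (logins doesn't support this,
    /// but we have supported it on other collections in the past).
    Unknown = 3,
}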

However, this doesn't get us syncing yet. In logins, to perform two- or three-way merges, we need some ability to reconcile changes, which requires knowledge of the data we're syncing. To support this, we'd have an API for creating a collection that would produce an object similar to the following:

use std::collections::HashMap;

/// How to merge a given field. It's possible this would somehow be combined
/// with FieldType (below) to statically check some of the 'Numeric fields only'
/// stuff, but that's not the point.
pub enum FieldMergeStrategy {
    /// Take the value for the field that was changed most recently.
    ///
    /// The default, and recommended value for most fields.
    ///
    /// Allowed for any type of field.
    TakeNewest,

    /// Take the value for the field that was changed least recently.
    ///
    /// Use this for things like creation metadata, or other things which should
    /// not change once set.
    ///
    /// Allowed for any type of field.
    TakeOldest,

    /// Use to indicate that this field is conceptually part of another field.
    ///
    /// Use this for cases like address pt 1/pt 2, where splitting the field
    /// naively will result in corruption.
    ///
    /// Allowed for any type of field.
    ///
    /// Note: Poorly thought out, may not be necessary even for cases like addresses
    /// (Not sure we'd ever split the fields up incorrectly if we're always doing
    /// proper 3WM...)
    TakeComposite { other_field_name: String },

    /// Numeric fields only: Take the maximum value between the two fields.
    ///
    /// Use this for things like last use timestamps.
    TakeNumMax,

    /// Numeric fields only: Take the minimum value between the two fields.
    ///
    /// Use this for things like creation timestamps.
    TakeNumMin,

    /// Numeric fields only: Treat the value as if it's a rolling sum. This actually does
    /// something like `out.field += max(remote.field - mirror.field, 0)` (i.e.
    /// it does the right thing).
    ///
    /// Use this for things like use counters.
    TakeNumSum,

    /// Boolean fields only: Merge as `true` if *any* of the fields are set to true
    TakeBoolOr,

    /// Boolean fields only: Merge as `true` if *all* of the fields are set to true
    TakeBoolAnd,

    /// Possibly more. Custom behaviors are possible but will lead to problems, as described later.
}

pub enum FieldType {
    /// Indicates that this field must be a string.
    Text,
    /// Indicates that this field is numeric (timestamps count here).
    Number,
    /// Indicates that this field is a boolean flag.
    Boolean,
}

pub struct Field {
    /// The name of the field.
    pub name: String,
    /// Whether or not the field is required.
    /// Note: This is probably a bad idea to allow in synced collections,
    /// unless we auto-populate an empty default value.
    pub required: bool,
    /// The type of the field. Note that `None` means any type of value is allowed here.
    pub field_type: Option<FieldType>,
    /// How to merge the field.
    pub merge_strategy: FieldMergeStrategy,
}

pub struct MergeSchema {
    /// How to merge each field.
    ///
    /// Note: Unknown fields are preserved, are merged by TakeNewest,
    /// have no type constraints, etc.
    ///
    /// Poorly thought out: It's possible we could allow `field.sub_prop.etc`
    /// for nested fields?
    pub fields: HashMap<String, Field>,

    /// List of field names where if all values match, then the records should
    /// be considered 'duplicates' and deduped. Examples:
    ///
    /// - `url` for history entries
    /// - The combination of `hostname`, `username` for logins
    /// - addon id for addons
    /// - etc.
    pub dedupe_on: Vec<String>,

    /// If true, we'll just take the newer record and perform no merging.
    /// Note: Poorly thought out, but it seems prudent to allow an
    /// escape hatch to this behavior.
    pub unmergable: bool,
}
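
As a usage example, here's roughly how a logins-like collection might describe itself with this API. The field names and strategies are chosen for illustration, not a final schema:

fn logins_like_schema() -> MergeSchema {
    let mut fields = HashMap::new();
    fields.insert("hostname".to_string(), Field {
        name: "hostname".to_string(),
        required: true,
        field_type: Some(FieldType::Text),
        merge_strategy: FieldMergeStrategy::TakeNewest,
    });
    fields.insert("username".to_string(), Field {
        name: "username".to_string(),
        required: false,
        field_type: Some(FieldType::Text),
        merge_strategy: FieldMergeStrategy::TakeNewest,
    });
    fields.insert("timeCreated".to_string(), Field {
        name: "timeCreated".to_string(),
        required: false,
        field_type: Some(FieldType::Number),
        // Creation time should never move forward; keep the smallest value.
        merge_strategy: FieldMergeStrategy::TakeNumMin,
    });
    fields.insert("timesUsed".to_string(), Field {
        name: "timesUsed".to_string(),
        required: false,
        field_type: Some(FieldType::Number),
        // Use counters are rolling sums, so concurrent uses aren't lost.
        merge_strategy: FieldMergeStrategy::TakeNumSum,
    });
    MergeSchema {
        fields,
        // Records with the same hostname and username are considered duplicates.
        dedupe_on: vec!["hostname".to_string(), "username".to_string()],
        unmergable: false,
    }
}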

We'd then use this to perform two- and three-way merges, and store it in the database in a new table.
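
For illustration, applying a single field's strategy during a three-way merge might look something like this. It assumes serde_json::Value for field values and that `mirror` is the shared parent; the helper name and parameters are made up for this sketch, and only a few strategies are covered:

use serde_json::Value;

/// Pick the merged value of one field given the local and remote values, the
/// shared parent (mirror) value, and which side was modified more recently.
/// Returning `None` means "fall back to the default TakeNewest behavior".
fn merge_field(
    strategy: &FieldMergeStrategy,
    local: &Value,
    remote: &Value,
    mirror: Option<&Value>,
    local_is_newer: bool,
) -> Option<Value> {
    match strategy {
        FieldMergeStrategy::TakeNewest => {
            Some(if local_is_newer { local.clone() } else { remote.clone() })
        }
        FieldMergeStrategy::TakeOldest => {
            Some(if local_is_newer { remote.clone() } else { local.clone() })
        }
        FieldMergeStrategy::TakeNumMax => {
            Some(Value::from(local.as_f64()?.max(remote.as_f64()?)))
        }
        FieldMergeStrategy::TakeNumSum => {
            // `out.field += max(remote.field - mirror.field, 0)`: treat the
            // value as a rolling sum so concurrent increments aren't lost.
            let delta = (remote.as_f64()? - mirror.and_then(Value::as_f64).unwrap_or(0.0)).max(0.0);
            Some(Value::from(local.as_f64()? + delta))
        }
        FieldMergeStrategy::TakeBoolOr => {
            Some(Value::from(local.as_bool()? || remote.as_bool()?))
        }
        // Remaining strategies elided in this sketch.
        _ => None,
    }
}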

Additionally, if we sync a record containing this information to the server (possibly as a $collection_name/schema record, or something), we get the property of 'some degree of schema evolution' for free. We'd need to come up with versioning, and would likely need to provide guidance about which schema migrations are safe (e.g. adding new fields is fine, removing them is not, etc.).
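
For illustration, the uploaded $collection_name/schema record could contain something like the following; the exact shape and versioning semantics here are assumptions (serde is used for the JSON mapping):

use serde::{Deserialize, Serialize};

/// A sketch of a `$collection_name/schema` record. `version` would only be
/// bumped for migrations documented as safe (e.g. adding fields); clients
/// that don't know a field yet just treat it as "unknown" (preserved and
/// merged by TakeNewest).
#[derive(Serialize, Deserialize)]
pub struct CollectionSchemaRecord {
    /// The collection this schema applies to.
    pub collection: String,
    /// Monotonically increasing schema version.
    pub version: u32,
    /// Field definitions keyed by field name (a wire form of `MergeSchema`).
    pub fields: serde_json::Map<String, serde_json::Value>,
    /// Field names used for deduping.
    pub dedupe_on: Vec<String>,
}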

Inter-record references

This has historically been a problem for us, but could be solved, I think, by adding a new Guid type to the FieldType enum above.

The flaw in that strategy is record ID changes. The solution is to record them on the server somewhere (either in a new collection, or via a new server API that takes a list of IDs and returns the live versions of those IDs). Then, after syncing all records, we'd fix up all the guids. Note that this wouldn't work for bookmarks, primarily because of ordering.
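
A sketch of what that post-sync fixup pass could look like for a single record, given a map from dead guids to their live replacements (the function name and shape here are hypothetical):

use std::collections::HashMap;
use serde_json::Value;

/// Rewrite any Guid-typed fields that point at a record whose id has changed.
/// `renames` maps old guids to their live replacements, and `guid_fields`
/// lists which fields in this collection have FieldType::Guid.
/// Returns true if the record was modified (and so needs to be re-saved).
fn fixup_guid_references(
    record: &mut Value,
    guid_fields: &[&str],
    renames: &HashMap<String, String>,
) -> bool {
    let mut changed = false;
    if let Some(obj) = record.as_object_mut() {
        for field in guid_fields {
            if let Some(Value::String(guid)) = obj.get_mut(*field) {
                if let Some(new_guid) = renames.get(guid.as_str()) {
                    *guid = new_guid.clone();
                    changed = true;
                }
            }
        }
    }
    changed
}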

Unsolved issues

  • Indices
  • Perf of deduping
  • Array types
  • Handling evolution of the $collection_name/schema definition format.

Conclusion

I sort of ran out of steam writing this towards the end, so apologies if I didn't elaborate well enough. Let me know.

Something like this would probably not take much more work than a new collection type, and would offer many of the benefits that Mentat had promised, without many of the problems.
