
@b5
Created October 14, 2019 18:02

Log

Qri is a version control system (VCS) for datasets. Since most VCSs provide some system for collaborating, we aim to do the same. Collaboration within the context of a version control system means coordinating who has done what. Providing tools to synchronize collaborators' actions is a prerequisite for working together effectively.

The characteristics we want:

  • Provide a foundation for reasoning about version histories.
  • Track & map to human names.
  • Decentralized.
  • Offline first. Accommodate sync lags that may be years in length.
  • As-small-as-possible storage footprint; size must be calculable.
  • Cannot require manual intervention for sync.

We need to be able to use a logbook to detect states of conflict and to work out how to fix them.

Overview

Logs have three primary data types. From smallest to largest they are operations, logs, and books.

  1. An operation is a record of an action taken.
  • operations have a method that classifies the type of action (e.g. init, amend, delete)
  • operations have a model that defines the type of resource being acted upon
  • operations have prev & path fields for building causal histories
  • operations have meta fields for notes specific to an operation
  2. A log is an ordered set of operations authored by a single user.
  • logs are append-only, and can only be written to by their author
  • reducing a set of operations for a given model returns the model state from a log
  • all logs have a human-readable name
  • The name and author of a log can be changed by
  • logs can be arranged into hierarchies (logs containing other logs)
  3. A book is a collection of logs.
  • logbooks can be queried for logs by named paths that traverse log hierarchies
  • logbooks are encrypted at rest for storage with the author's private key
  • a logbook can merge logs from collaborators by adding foreign logs to their book
  • merged logs from other authors are replicas
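
The relationship between the three types can be sketched in Go. This is a minimal illustration rather than Qri's implementation: the `Log` and `Book` types, the `Append` and `Log` methods, and `ErrNotAuthor` are hypothetical names, and `Op` is pared down to two fields. It shows two of the rules above: only a log's author may append to it, and a book resolves logs by a path of names through the hierarchy.

```go
package main

import (
	"errors"
	"fmt"
)

// Op is a pared-down operation (hypothetical; just enough for this sketch).
type Op struct {
	AuthorID string
	Note     string
}

// Log is an append-only list of operations owned by a single author.
// Child logs form the hierarchy (logs containing other logs).
type Log struct {
	AuthorID string
	Name     string
	Ops      []Op
	Logs     []*Log
}

// ErrNotAuthor is returned when someone other than the owner tries to write.
var ErrNotAuthor = errors.New("only the log's author may append")

// Append adds op to the end of the log, refusing writes from other authors.
func (l *Log) Append(op Op) error {
	if op.AuthorID != l.AuthorID {
		return ErrNotAuthor
	}
	l.Ops = append(l.Ops, op)
	return nil
}

// Book is a collection of top-level logs.
type Book struct {
	Logs []*Log
}

// Log walks the hierarchy by name, one path element per level.
func (b *Book) Log(path ...string) (*Log, error) {
	children := b.Logs
	var found *Log
	for _, name := range path {
		found = nil
		for _, l := range children {
			if l.Name == name {
				found = l
				break
			}
		}
		if found == nil {
			return nil, fmt.Errorf("log %q not found", name)
		}
		children = found.Logs
	}
	if found == nil {
		return nil, errors.New("empty path")
	}
	return found, nil
}
```

Under these assumptions, a collaborator holding a replica of another author's log can read it through `book.Log("nour", "world_bank_population")`, but any `Append` carrying a different `AuthorID` is rejected.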

So long as the rules are followed, logbooks cannot fall into conflict. Coordination in a distributed context is notoriously difficult, and recent research in Conflict-Free Replicated Data Types (CRDTs) looks very promising for this. Logs are an example of a CRDT.

Example

Let's use the following collaboration story for illustration purposes. Nour & Amber are our two collaborators.

  1. Nour creates a profile
  2. Nour initializes a new dataset named nour/population
  3. Nour saves a version to nour/population
  4. Nour amends that version to a new commit hash
  5. Amber creates a profile
  6. Amber clones nour/population, syncing to nour's log
  7. Nour renames nour/population to nour/world_bank_population
  8. Nour saves a new version of nour/world_bank_population
  9. Nour deletes a version of nour/world_bank_population
  10. Amber syncs her log to the tip of nour/world_bank_population
  11. Amber initializes a new dataset named: amber/white_wine_quality
  12. Amber renames herself to renamed_amber

An abbreviated look at Amber's logbook at step 12 would look something like this:

book:
  renamed_amber:
    ops: InitProfile, AmendProfile
    white_wine_quality:
      ops: InitName
  nour:
    ops: InitProfile
    world_bank_population:
      ops: InitName, InitCommit, AmendCommit, AmendName, InitCommit, RemoveCommit

In the above, renamed_amber, nour, white_wine_quality and world_bank_population are all logs, and ops is the list of operations recorded in each log.
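
To make "reducing a set of operations returns the model state" concrete, here is a hedged Go sketch that replays the six operations in the world_bank_population log. The `Method` and `Model` enumerations, the `DatasetState` type, and the `Reduce` function are all hypothetical. The sketch also assumes that an amend replaces the commit named by `Prev` and that a remove drops the commit named by `Ref`; the story above does not say which version Nour deleted.

```go
package main

// Method and Model are hypothetical enumerations for this sketch.
type Method int
type Model int

const (
	MethodInit Method = iota
	MethodAmend
	MethodRemove
)

const (
	ModelName Model = iota
	ModelCommit
)

// Op carries only the fields this sketch needs.
type Op struct {
	Method Method
	Model  Model
	Name   string // new name, for name operations
	Ref    string // commit identifier, for commit operations
	Prev   string // commit being amended, for AmendCommit
}

// DatasetState is the result of replaying a dataset log.
type DatasetState struct {
	Name    string
	Commits []string
}

// Reduce replays ops in order and returns the resulting state.
func Reduce(ops []Op) DatasetState {
	var s DatasetState
	for _, op := range ops {
		switch op.Model {
		case ModelName:
			// InitName and AmendName both set the current name.
			if op.Method == MethodInit || op.Method == MethodAmend {
				s.Name = op.Name
			}
		case ModelCommit:
			switch op.Method {
			case MethodInit:
				s.Commits = append(s.Commits, op.Ref)
			case MethodAmend:
				// Swap the amended commit for its replacement.
				for i, c := range s.Commits {
					if c == op.Prev {
						s.Commits[i] = op.Ref
					}
				}
			case MethodRemove:
				// Keep every commit except the removed one.
				kept := s.Commits[:0]
				for _, c := range s.Commits {
					if c != op.Ref {
						kept = append(kept, c)
					}
				}
				s.Commits = kept
			}
		}
	}
	return s
}
```

Replaying InitName, InitCommit, AmendCommit, AmendName, InitCommit, RemoveCommit with these rules leaves a dataset named world_bank_population holding only the amended first commit, provided the removed commit is the second one.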

Operations

An operation has a defined set of fields:

// Op is an operation, a single atomic unit in a log that describes a state
// change
type Op struct {
	// author information
	AuthorID string // identifier for author
	Name     string // human-readable name for the reference

	// method indicates the action performed on a given model
	Method OpMethod // type of operation
	Model  uint32   // data model to operate on

	// references for building causal histories
	Ref       string   // identifier of data this operation is documenting
	Prev      string   // previous reference in a causal history
	Relations []string // references this operation relates to. usage is operation type-dependent

	// meta fields that have no informative value to operations
	Timestamp int64  // operation timestamp, for annotation purposes only
	Size      uint64 // size of the referenced value in bytes
	Note      string // operation annotation for users. eg: commit title
}

By using a predetermined set of fields we can keep size guarantees in place. Software

Naming

Academic literature distinguishes between operation-based and state-based CRDTs. In an operation-based CRDT the data that gets passed around for replication is a list of "operations" that refer to outside logic needed to calculate replicated state. In a state-based CRDT, the state itself is passed around, and no logic is needed to determine state.

State-based CRDTs are generally easier to work with, but come at the cost of being more expensive in terms of space.
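
The distinction is easiest to see on a grow-only counter, a standard textbook CRDT that is not part of Qri. The Go sketch below shows both forms under hypothetical names (`IncOp`, `Apply`, `GCounter`, `Merge`). The operation-based form ships increments and relies on the receiver knowing how to apply them; the state-based form ships the whole per-replica table and merges by taking maxima.

```go
package main

// IncOp is the unit an operation-based counter replicates. Receivers must
// know that applying it means "add N".
type IncOp struct {
	Replica string
	N       uint64
}

// Apply folds operations into a total. Each operation must be delivered
// exactly once for the result to be correct.
func Apply(ops []IncOp) uint64 {
	var total uint64
	for _, op := range ops {
		total += op.N
	}
	return total
}

// GCounter is the state-based form: one running count per replica.
type GCounter map[string]uint64

// Merge takes the per-replica maximum, which makes merging commutative,
// associative, and idempotent.
func Merge(a, b GCounter) GCounter {
	out := GCounter{}
	for id, n := range a {
		out[id] = n
	}
	for id, n := range b {
		if n > out[id] {
			out[id] = n
		}
	}
	return out
}

// Value sums the per-replica counts.
func (g GCounter) Value() uint64 {
	var total uint64
	for _, n := range g {
		total += n
	}
	return total
}
```

The space trade-off is visible here: `GCounter` carries an entry for every replica on every sync, while `IncOp` values are small but only make sense alongside the `Apply` logic.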

Uniform fields in an op-based CRDT form a state-based CRDT.

Qri Example

Names are an emergent property of an Opset. For this to work, logs need to be consistent, meaning two remove operations can't follow a single init operation. A single actor runs qri init, which creates a profile, so we record the operation in an opset that doesn't yet have a home:

initProfile := Op{
	Method:   MethodInit,
	Model:    Profile,
	AuthorID: AuthorHashIdentifier,
	Name:     "periwinkle_blue_welsh_corgi",
}

This operation is a special initialization operation. Init ops have no causal predecessor, and always have the author as the prev field and the name as the subject field. Init ops are the only type of operation that can kick off an opset. They also determine the log model of the opset. Since this opset starts with an OpInitProfile, this is said to be a Profile type opset, and its label will be based on the profile operation type. So we put it in an Opset as entry zero:

Opset[0:initProfile]

Init ops are the only type of operation able to create an opset because initialization operations contain the ingredients for determining a name. Mapping prev to label.Author and subject to label.Name gives us a labelled Opset:

Label[QmB5,periwinkle_blue_welsh_corgi]: Opset[0:initProfile]
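
Deriving a label from an opset could look roughly like the Go sketch below. It is an assumption-heavy illustration: `Label`, `LabelFor`, and the value of `MethodInit` are hypothetical, and it reads the author and name from the `AuthorID` and `Name` fields of the `Op` struct defined earlier rather than from prev and subject. It does not account for later operations that rename the log.

```go
package main

import "errors"

// OpMethod classifies the action an operation performs.
type OpMethod uint32

// MethodInit marks an initialization operation. The constant's value is an
// assumption; the document does not list method values.
const MethodInit OpMethod = 1

// Op carries only the fields needed to derive a label.
type Op struct {
	Method   OpMethod
	AuthorID string
	Name     string
}

// Label names an opset by author and human-readable name.
type Label struct {
	Author string
	Name   string
}

// LabelFor derives a label from the first operation of an opset. Only an
// init operation can start an opset, so anything else is an error.
func LabelFor(opset []Op) (Label, error) {
	if len(opset) == 0 {
		return Label{}, errors.New("empty opset has no label")
	}
	first := opset[0]
	if first.Method != MethodInit {
		return Label{}, errors.New("opset must begin with an init operation")
	}
	return Label{Author: first.AuthorID, Name: first.Name}, nil
}
```

Because the label is computed from entry zero rather than stored alongside the opset, two replicas holding the same operations always agree on it.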

We then create a Log to hold the Label:Opset pair.

Log:
  Label[QmB5,periwinkle_blue_welsh_corgi]: Opset[0:initProfile]

Here we've derived a log by playing the sequence of operations. The sequence is exactly one operation long. Everything, at the end of the day, is a sequence of operations.

I'm going to call this opset the user opset, because it's the one that creates a user. So the type of init operation is special to the opset, which is how we get the label, but we have another problem: this label is special to the log. Eventually this log will contain other labels. It's the "master" opset that determines the ownership of this log. We need to indicate somehow in the log that this particular label is special. We get there by making a strange statement: this log has no label. Right now it looks like we should just echo the label we just made up to be the label of the log itself, but that would be arbitrary. We need all labels to derive from a sequence of operations. So, I'd posit that because this log is self-sovereign, it has no label, which we can model like this:

Access Control

Logs intentionally push the issue of access control out to another system. Permission

Changes to the Access Control List are

Log Conflicts

Despite having "conflict free" written all over this document, it's very possible for two logbook logs to fall into conflict. These are a few scenarios where this will happen:

  1. A user copies their profile to two computers, makes edits to the same dataset on both machines
  2. A user's private key is compromised, and a malicious actor uses that private key to take actions on the same dataset.

In both scenarios the same author has taken action on the same log from distinct places, and the replicated log has divergent operations appended to its opset.
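
Divergence of this kind can be detected by comparing two replicas of the same log position by position. The Go sketch below is one hedged way to do that, assuming each operation can be reduced to a comparable identifier; the `Compare` function and the single-field `Op` are hypothetical. If one replica is simply a prefix of the other, the shorter one is behind and can fast-forward. If both hold operations past the shared prefix, the log has diverged.

```go
package main

// Op is reduced to a single comparable identifier for this sketch.
type Op struct {
	Ref string
}

// Compare reports how two replicas of the same log relate. It returns the
// length of their shared prefix and whether they have diverged, meaning
// both replicas hold operations the other lacks after that prefix.
func Compare(a, b []Op) (common int, diverged bool) {
	for common < len(a) && common < len(b) && a[common] == b[common] {
		common++
	}
	diverged = common < len(a) && common < len(b)
	return common, diverged
}
```

The returned prefix length marks the last operation both replicas agree on, which is where any repair would have to start.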

In the second scenario the user should immediately proceed to key rotation, a process outside of concern
