Skip to content

Instantly share code, notes, and snippets.

@philandstuff
Created June 1, 2015 11:09
Show Gist options
  • Save philandstuff/8a0ec39f56a2077632ea to your computer and use it in GitHub Desktop.
Save philandstuff/8a0ec39f56a2077632ea to your computer and use it in GitHub Desktop.
data modelling for registers

data model

needs

  • fetch a particular version of a record (permalink)
  • fetch the latest version of a record
  • view all (current) records
  • export whole register inc history
  • export last N versions of history
    • update my local copy of register with changes (streaming)
  • download a proof that a record exists in the register
  • verify a whole register inc history
  • verify a backup
  • search for a record
  • search for a record through history
  • have confidence that history hasn’t been rewritten
  • compare two versions of a record
  • view versions prior to a given record
  • find previous and next versions of a record

threat model

  • restaurant owner wants to fake food hygiene results
  • drunk/speeding driver wants to not lose licence
  • criminals want to create fake driving licences

defences

  • verification: verify a record really came from a register
  • freshness check: check a record hasn’t been superseded

verification model

  • download a file which is a “proof”
  • supporting code which checks a proof
  • proof contains:
    • signature of version of register
    • sufficient other information to tie hash of record to signature
    • cf log proofs from certificate transparency

worries

  • key rotation
    • old proofs? eg from 1992?

two integrity models

git

  • file contents are “blobs” and accessed by hash of content
  • directories are “trees”:
    • a tree is a map from filenames to trees or blobs
    • file permissions (in particular, is execute bit set?)
  • commits are:
    • metadata (author, committer, date etc)
    • tree
    • previous commit (multiple allowed for merges)
  • integrity is achieved by signing a commit:
    • from signature, know commit is good
    • from commit, know tree is good
    • from tree, know blob is good
    • verifying old records involves chasing the “parent” links

certificate transparency

  • store events in a log
  • log is a balanced binary tree, with each node labelled by hash of its children
  • appending to a log involves rewriting the tree path down to the new nodes
  • integrity has two parts:
    • how do we know a newer tree hasn’t dropped any records? (“consistency proof”)
    • how do we know a particular event appears in the log? (“audit proof”)

comparison

  • consistency proofs
    • git: here they are relatively easy: to show that commit abcd123 includes everything in version hash 1234cba, you just follow abcd123’s parent version chain until you find 1234cba (or run out of parents).
    • certificate transparency: here the server is expected to provide a consistency proof to the client. Because events are appended to the log via path rewriting, the new version doesn’t necessarily contain the tree head of the old version. the proof needs to find a set of intermediate nodes which exist in both the old and new log versions, such that the set of nodes can reach all leaf nodes in the old version.
  • audit proofs
    • git: given a commit and a file within that commit, can we get a guarantee that it really originated from the authentic repository?
      • yes, if the commit is signed
    • certificate transparency: given an event and a log, can we get a guarantee that the event exists in the log?

implementation

A proposed model has been implemented in a branch, using the following table structure:

Record table:

hashentry
abcd123{“postcode”:”WC1A 1AA”,”latlong”:”0.5W,51N”}
1357bde{“postcode”:”SAN TA1”,”latlong”:”0E,90N”}
654321f{“postcode”:”WC1A 1AA”,”latlong”:”0.51W,51.1N”}

Version table:

versionhashparentsignaturetimestamprecords
1ae1f432015-05-30T12:00:00+000{“WC1A 1AA”:”abcd123”}
21ef492ae1f432015-05-30T12:30:00+000{“WC1A 1AA”:”abcd123”, “SAN TA1”:”1357bde”}
35617a81ef4922015-05-30T13:00:00+000{“WC1A 1AA”:”654321f”, “SAN TA1”:”1357bde”}

analysis

  • The records column in the version table will have values which will get very big. This will not scale to large registers:
    • Suppose you have a version with 1,000,000 records. Then an update which changes or adds a record needs to copy all 1,000,000 entries in the records column.
    • This could be ameliorated by adopting a more treelike structure such as certificate transparency’s merkle trees; then only the path through the tree to the updated record needs to be rewritten, taking logarithmic space rather than linear space.
  • the data model has some derived data. In particular, the version number column isn’t necessary because it duplicates information provided by the parent column chain.
    • the column is still useful for performing range queries which would be difficult to perform using the parent chain
    • it can be seen as an index to support certain queries
  • retracting records is supported by removing them from the records column.
  • streaming updates are supported by range queries on the version number
  • we can support “downloading a proof” for a single record without leaking information about other records by supplying all the hashes needed to make up the hash of the version, similar to certificate transparency audit proofs.

alternatives

We could use something like irmin, a database with a git-based model, to manage this for us.

We could use a more explicit event sourcing model, where record updates are events. That might look something like this:

Record table:

hashentry
abcd123{“postcode”:”WC1A 1AA”,”latlong”:”0.5W,51N”}
1357bde{“postcode”:”SAN TA1”,”latlong”:”0E,90N”}
654321f{“postcode”:”WC1A 1AA”,”latlong”:”0.51W,51.1N”}

Event table:

versionhashparentsignaturetimestamp“primary key”record_hash
1ae1f432015-05-30T12:00:00+000WC1A 1AAabcd123
21ef492ae1f432015-05-30T12:30:00+000SAN TA11357bde
35617a81ef4922015-05-30T13:00:00+000WC1A 1AA654321f

This has the same information but takes less space as the records column is now implicit. This means there’s more effort involved in finding the latest version of a record (need to find the highest versioned event with the given primary key) and more indexes might be needed to support search. Record retraction can be supported by making the record_hash column nullable. This model enforces exactly one record update per register version; we could support more by adding a separate “version” table which defined a version by recording the record high water mark.

The hash chaining in the event table could easily be replaced by a merkle tree as an alternative integrity guarantee implementation.

conclusions

  • we can definitely produce a data model which satisfies the needs documented above.
  • there are several alternative implementations for structuring record history and integrity guarantees, with tradeoffs in ease of query implementation.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment