- fetch a particular version of a record (permalink)
- fetch the latest version of a record
- view all (current) records
- export whole register inc history
- export last N versions of history
- update my local copy of register with changes (streaming)
- download a proof that a record exists in the register
- verify a whole register inc history
- verify a backup
- search for a record
- search for a record through history
- have confidence that history hasn’t been rewritten
- compare two versions of a record
- view versions prior to a given record
- find previous and next versions of a record
- restaurant owner wants to fake food hygiene results
- drunk/speeding driver wants to not lose licence
- criminals want to create fake driving licences
- verification: verify a record really came from a register
- freshness check: check a record hasn’t been superseded
- download a file which is a “proof”
- supporting code which checks a proof
- proof contains:
- signature of version of register
- sufficient other information to tie hash of record to signature
- cf log proofs from certificate transparency
- key rotation
- old proofs? eg from 1992?
- file contents are “blobs” and accessed by hash of content
- directories are “trees”:
- a tree is a map from filenames to trees or blobs
- file permissions (in particular, is execute bit set?)
- commits are:
- metadata (author, committer, date etc)
- tree
- previous commit (multiple allowed for merges)
- integrity is achieved by signing a commit:
- from signature, know commit is good
- from commit, know tree is good
- from tree, know blob is good
- verifying old records involves chasing the “parent” links
- store events in a log
- log is a balanced binary tree, with each node labelled by hash of its children
- appending to a log involves rewriting the tree path down to the new nodes
- integrity has two parts:
- how do we know a newer tree hasn’t dropped any records? (“consistency proof”)
- how do we know a particular event appears in the log? (“audit proof”)
- consistency proofs
- git: here they are relatively easy: to show that commit abcd123 includes everything in version hash 1234cba, you just follow abcd123’s parent version chain until you find 1234cba (or run out of parents).
- certificate transparency: here the server is expected to provide a consistency proof to the client. Because events are appended to the log via path rewriting, the new version doesn’t necessarily contain the tree head of the old version. the proof needs to find a set of intermediate nodes which exist in both the old and new log versions, such that the set of nodes can reach all leaf nodes in the old version.
- audit proofs
- git: given a commit and a file within that commit, can we get
a guarantee that it really originated from the authentic
repository?
- yes, if the commit is signed
- certificate transparency: given an event and a log, can we
get a guarantee that the event exists in the log?
- yes, if the server can provide a proof
- git: given a commit and a file within that commit, can we get
a guarantee that it really originated from the authentic
repository?
A proposed model has been implemented in a branch, using the following table structure:
Record table:
hash | entry |
---|---|
abcd123 | {“postcode”:”WC1A 1AA”,”latlong”:”0.5W,51N”} |
1357bde | {“postcode”:”SAN TA1”,”latlong”:”0E,90N”} |
654321f | {“postcode”:”WC1A 1AA”,”latlong”:”0.51W,51.1N”} |
Version table:
version | hash | parent | signature | timestamp | records |
---|---|---|---|---|---|
1 | ae1f43 | 2015-05-30T12:00:00+000 | {“WC1A 1AA”:”abcd123”} | ||
2 | 1ef492 | ae1f43 | 2015-05-30T12:30:00+000 | {“WC1A 1AA”:”abcd123”, “SAN TA1”:”1357bde”} | |
3 | 5617a8 | 1ef492 | 2015-05-30T13:00:00+000 | {“WC1A 1AA”:”654321f”, “SAN TA1”:”1357bde”} |
- The
records
column in the version table will have values which will get very big. This will not scale to large registers:- Suppose you have a version with 1,000,000 records. Then an
update which changes or adds a record needs to copy all
1,000,000 entries in the
records
column. - This could be ameliorated by adopting a more treelike structure such as certificate transparency’s merkle trees; then only the path through the tree to the updated record needs to be rewritten, taking logarithmic space rather than linear space.
- Suppose you have a version with 1,000,000 records. Then an
update which changes or adds a record needs to copy all
1,000,000 entries in the
- the data model has some derived data. In particular, the
version number column isn’t necessary because it duplicates
information provided by the
parent
column chain.- the column is still useful for performing range queries which would be difficult to perform using the parent chain
- it can be seen as an index to support certain queries
- retracting records is supported by removing them from the
records
column. - streaming updates are supported by range queries on the version number
- we can support “downloading a proof” for a single record without leaking information about other records by supplying all the hashes needed to make up the hash of the version, similar to certificate transparency audit proofs.
We could use something like irmin, a database with a git-based model, to manage this for us.
We could use a more explicit event sourcing model, where record updates are events. That might look something like this:
Record table:
hash | entry |
---|---|
abcd123 | {“postcode”:”WC1A 1AA”,”latlong”:”0.5W,51N”} |
1357bde | {“postcode”:”SAN TA1”,”latlong”:”0E,90N”} |
654321f | {“postcode”:”WC1A 1AA”,”latlong”:”0.51W,51.1N”} |
Event table:
version | hash | parent | signature | timestamp | “primary key” | record_hash |
---|---|---|---|---|---|---|
1 | ae1f43 | 2015-05-30T12:00:00+000 | WC1A 1AA | abcd123 | ||
2 | 1ef492 | ae1f43 | 2015-05-30T12:30:00+000 | SAN TA1 | 1357bde | |
3 | 5617a8 | 1ef492 | 2015-05-30T13:00:00+000 | WC1A 1AA | 654321f |
This has the same information but takes less space as the records
column is now implicit. This means there’s more effort involved in
finding the latest version of a record (need to find the highest
versioned event with the given primary key) and more indexes might be
needed to support search. Record retraction can be supported by
making the record_hash
column nullable. This model enforces exactly
one record update per register version; we could support more by
adding a separate “version” table which defined a version by recording
the record high water mark.
The hash chaining in the event table could easily be replaced by a merkle tree as an alternative integrity guarantee implementation.
- we can definitely produce a data model which satisfies the needs documented above.
- there are several alternative implementations for structuring record history and integrity guarantees, with tradeoffs in ease of query implementation.