johnchandlerburnham/lurk_content_addressing.md Secret

## lurk_content_addressing.md

      
    Raw
  

              lurk_content_addressing.md
            
          
    Content-Addressing Lurk Data

Overview

From the Lurk README:

Lurk is a statically scoped dialect of Lisp, influenced by Scheme and Common
Lisp. A language specification and reference implementation focused on
describing and developing the core language can be found in the
lurk repo.
Lurk's distinguishing feature relative to most programming languages is that
correct execution of Lurk programs can be directly proved using zk-SNARKs. The
resulting proofs are succinct: they are relatively small, can be verified
quickly, and they reveal only the information explicitly contained in the
statement to be proved.
Lurk's distinguishing feature relative to most zk-SNARK authoring languages is
that Lurk is Turing complete, so arbitrary computational claims can be made and
proved (subject to resource limitations, obviously). Because Lurk is a Lisp,
its code is simply Lurk data, and any Lurk data can be directly evaluated as a
Lurk program. Lurk constructs compound data using SNARK-friendly Poseidon
hashes (provided by Neptune), so
its data is naturally content-addressable.

Lurk's natural content-addressibility raises an interesting question: "How does
Lurk data integrate with the Interplanetary Linked Data (IPLD)
ecosystem of content-addressed data formats?
There are two primary ways to content address Lurk data, based on two distinct
representations:

The Lurk Syntax Tree representation, where data is stored as a recursive tree
The Lurk Scalar Graph representations, where data is stored as a
content-addressed adjacency-list

These two representations are similar, but are not identical and used in
different contexts: The Syntax representation corresponds better to Lurk
programs as written by developers in source files, whereas the Scalar Graph
representation corresponds better to the backend evaluation of Lurk programs.
Unlike the Syntax Tree representation, the Scalar Graph can represent programs
in a state of partial execution as
"thunks".
Lurk Syntax Tree

At the source code level, Lurk consists of S-expressions:
; A partial EBNF grammar for Lurk Expressions

<expr> ::= <atom>
        |  <list>
        |  <improper-list>

<list> ::= "(" {<expr>} ")"
<improper-list> ::= "(" {<expr>} . <expr> ")"

<atom> ::= <symbol>         ; an alphanumeric identifier
        |  <string-literal> ; an escaped unicode string literal
        |  <scalar-number>  ; an element of a specific finite field

...

For brevity the definitions of <symbol> <string-literal> and
<scalar-number> are not included, as their specific implementation details at
the parsing level are not relevant for the purposes of this document. A full
parser for the Lurk syntax can be found here:
https://github.com/lurk-lang/lurk-rs/blob/997c0033e0f393846b279b9472e76a663a61cfdd/src/parser/syntax.rs
Here is an example of what Lurk source code looks like is:
(letrec ((next (lambda (a b n target)
                 (if (eq n target)
                     a
                     (next b
                           (+ a b)
                           (+ 1 n)
                           target))))
         (fib (next 0 1 0)))
        (current-env))

Rust datatype representation

Once parsed, this becomes represented in the lurk-rs Rust implementations as:
pub enum Print<F: PrimeField> {
    List(Vec<Print<F>>),
    ImproperList(Vec<Print<F>>, Box<Print<F>>),
    Sym(String),
    Num(Num<F>),
    Str(String),
}
which can be transformed into
pub enum Term<F: PrimeField> {
    Nil,
    Cons(Box<Print<F>>, Box<Print<F>>),
    Sym(String),
    Num(Num<F>),
    Str(String),
}
by unfolding all proper lists (x y z) into cons-cells (x . (y . (z . nil))),
using the reserved symbol nil to indicate termination (The variant Term::Nil
is is a factored out Term::Sym("nil") for implementation reasons). Improper
lists are unfolded in the same way, except by using the provided terminator
element instead of nil, so that (x y . z) becomes (x . (y . z))
IPLD representation

The Lurk source programs can be transformed into the IPLD data model (or JSON)
using the following schema:
type Term union {
  | Cons Cons
  | Sym string
  | Literal Literal
} representation kinded

type Cons struct {
  car Term
  cdr Term
} representation tuple

type Literal union {
  tag int
  val string
} representation tuple

(NB: This requires relaxing the IPLD schema condition that structs cannot be
used as representation kinds.)
Where Literal consists of a unique integer tag and the base64 bytes
serialization of Term::Str, Term::Num, or any future literal variant which
might be added to Lurk expressions. The value of this unique integer tag for
literals is taken from Lurk's tagged scalar pointers
(https://github.com/lurk-lang/lurk-rs/blob/274add6bb7f3159e42465110e4c7fe73608e8a17/src/store.rs#L538)
and is explained further below.
For illustration of the above mapping, the Lurk expression (x 42 "foo") would
map to the IPLD (or JSON) representation: ["x", [[5, "Kg=="], [6, "Zm9v"]]]
assuming the bytestring representation of 42 is 0x42 in the field byte
serialization.
Some notes about this schema: We explicitly unfold all Lurk lists into
cons-pairs so that every lurk expression has a single canonical representation
in the IPLD data model.  We also specifically eschew IPLD maps in our schema to
avoid potential codec specific ordering
issues.
Lurk Scalar Graph

The above textual syntax can be seen as a direct tree representation of Lurk
data or Lurk programs (which are the same, since Lurk is homoiconic). However,
Lurk's evaluation model requires an indirect graph representation in order to
efficiently delay or reorder execution between different "frames" of its zkSNARK
backend, as well as to allow for the possiblity of hiding specific Lurk data
behind a type-tagged content-addressed pointer: One can think of this graph at
an abstract level as hash-consing the Lurk syntax tree, and then storing the
nodes of that tree separately as a directed adjacency list.
For illustration on works, suppose we take our expression (x 42 "foo") from
the previous section, but rendered as a tree:
  cons
  /  \
 x   cons
      / \
    42  cons
        /  \
      "foo" nil

In the direct representation, we have all the nodes of this tree immediately
represented, with parents recursively containing children. However, we can also
represent this tree as a graph adjacency list, by numbering the nodes:
  0: cons
  /       \
 1: x   2: cons
       /      \
    3: 42  4: cons
          /      \
      5: "foo"  6: nil

And then separating:
0: cons 1 2
1: x
2: cons 3 4
3: 42
4: cons 5 6
5: "foo"
6: nil

However, assigning numeric indices in this way is arbitrary. We can generate
content-addressed indices by hashing each node:
<hash-0>: cons <hash-1> <hash-2>
<hash-1>: x
<hash-2>: cons <hash-3> <hash-4>
<hash-3>: 42
<hash-4>: cons <hash-5> <hash-6>
<hash-5>: "foo"
<hash-6>: nil

In Lurk we use the zkSNARK-friendly hash algorithm Poseidon, which outputs a
finite-field element, rather than a bytestring digest. So each of the above
hashes is a field element F for whatever PrimeField the Lurk backend is
operating over.
Addtionally, we want to tag each of these hashes with information describing the
type of node they were generated from:
(<cons>, <hash-0>): cons (<sym>, <hash-1>) (<cons>, <hash-2>)
(<sym> , <hash-1>): x
(<cons>, <hash-2>): cons (<num>, <hash-3>) (<cons>, <hash-4>)
(<num> , <hash-3>): 42
(<cons>, <hash-4>): cons (<str>, <hash-5>) (<nil>, <hash-6>)
(<str> , <hash-5>): "foo"
(<nil> , <hash-6>): nil

We include this tag in our pointers for several reasons:
First, our Poseidon hash function for each node depends its type, so hashing
cons uses an arity 4 hash, hashing a number literal uses the scalar directly
without any hashing, and hashing a string uses a particular
implementation-specific
algorithm.
This information is needed in order to dereference our hash pointer.
Secondly, some nodes of our syntax tree the same contents, and differ only by
their type, such as symbols and string literals. The type-tag disambiguates
collisions, such as between symbol foo and string "foo". While in principle
we could also solve this by always including the tag in the preimage of our
hash, that would add additional cost or complexity.
Finally, and most importantly, Lurk's evaluation model adds a notion of delayed
computation, called a continuation. These continuations are also
content-addressed, and in order to correctly execute them Lurk needs the
additional context provided by the type-tag
(https://github.com/lurk-lang/lurk-rs/blob/274add6bb7f3159e42465110e4c7fe73608e8a17/src/store.rs#L562).
Rust datatype representation

As described in the previous section, the three components of our expression
Graph are (with their corresponding Rust types):

Tagged hash-pointers (ScalarPtr)
Graph Nodes (ScalarExpression)
A map of pointers to nodes (ScalarStore)

As this document is focused on content-addressing, we are only interested in the
canonical representation of Lurk Scalar Graphs in Rust. There are other
non-canonical datatypes (such as Ptr, Expression, and Store) which are
used to optimize evaluation in the lurk-rs implementation, but these can be
ignored for present purposes.
ScalarPtr: Tagged hash-pointer

The canonical representation of our tagged hash-pointers is:
pub struct ScalarPtr<F: PrimeField>(F, F);
Where the first field element corresponds to a 16-bit tag describing the type of
the node:
#[repr(u16)]
pub enum Tag {
    Nil = 0b0000_0000_0000_0000,
    Cons,
    Sym,
    Fun,
    Num,
    Thunk,
    Str,
}
And the second element is a Poseidon hash of some ScalarExpression of the type
described by the tag.
ScalarExpression: Nodes in an expression graph

The corresponding canonical node representation of the above is:
pub enum ScalarExpression<F: PrimeField> {
  Nil,
  Cons(ScalarPtr<F>, ScalarPtr<F>),
  Sym(String),
  Fun {
    arg: ScalarPtr<F>,
    body: ScalarPtr<F>,
    closed_env: ScalarPtr<F>,
  },
  Num(Num<F>),
  Str(String),
  Thunk(ScalarThunk<F>)
}
Where Nil again the factored symbol nil, Fun is a factored out
representation of lambdas e.g.  (lambda (x) x). Those forms are separated for
ease of conversion/evaluation.
ScalarStore: A content-addressed expression graph

The containing structure for a full canonical Lurk expression graph is:
pub struct ScalarStore<F: PrimeField> {
    scalar_map: BTreeMap<ScalarPtr<F>, ScalarExpression<F>>,
    ...
}
which is a map of tagged-hashes to individual nodes of our expression.
ScalarThunk, ScalarContPtr, ContTag

The content addressing of these datatypes are not included in this document for
conciseness, but proceed on essentially the same lines as ScalarExpression,
ScalarPtr, and Tag.
IPLD representation

When transforming our graph to IPLD, the most crucial step is to construct our
mapping such that our Lurk pointers correspond to IPLD Links (or CIDs).  This is
necessary to allow individual entries or sub-graphs in our ScalarStore to be
independently serialized and shared over e.g. IPFS without complex and costly
traversal and re-mapping of identifiers.
In other words, we would like the following diagram to correctly commute:
Ipfs: CID  --> IPLD
       ^         ^
       |         |
       V         V
Lurk: Ptr ---> Expr

ScalarPtr as CID

The challenge with this requirement is that ScalarPtr contains a tag, and
CID generally does not. Without a way to embed our Tag into CID we would
be unable to correctly read individual nodes in our expression graph without
re-traversing the entire graph (which we may not be able to do in the cases of
opaque expressions, i.e. private zkSNARK inputs).
For instance, let's consider our earlier example (x 42 "foo"), but focus only
the subexpression ("foo" . nil):
(<cons>, <hash-4>): cons (<str>, <hash-5>) (<nil>, <hash-6>)
(<str> , <hash-5>): "foo"
(<nil> , <hash-6>): nil

In order to compute <hash-4>, we need to know the tags of the expressions
inside the cons, in this case <str> and <nil>, since the Lurk algorithm
for hashing a cons node is (roughly):
poseidon-4(car-tag, car-hash, cdr-tag, cdr-hash)

where car and cdr are respectively the left-hand and right-hand contents of
the cons.
Fortunately, we can solve this problem by using the CID Multicodec field to
contain our tags. The CID codec field is currently limited to 64-bits, which
is generally too small to store a full field-element, but fortunately Lurk tags
can be represented as 16-bit unsigned integers.
In order to represent ScalarPtr as a CID, we can therefore perform the following
operation:
/// Convert a field element `F` into a 64-bit multicodec
fn to_multicodec<F: LurkField>(f: F) -> Option<u64> {

let tag: u16 = F::to_tag(f)?; // Convert a field element to a 16-bit tag

let codec = 
  F::LURK_PREFIX << 48 |  /// a unique 16-bit prefix in the multicodec table
  F::LURK_VERSION << 32 | /// a 16-bit version number
  F::LURK_FIELD << 16 |   /// a 16-bit field identifier
  u64::from(tag))      /// the 16-bit tag
Some(codec)
}
With this embedding of Lurk tags as multicodecs, we can then embed our ScalarPtrs as CIDs:
fn scalar_ptr_to_cid<F: LurkField>(x: ScalarPtr<F>) -> Option<Cid> {
  let codec  = x.0.to_multicodec()?;
  let digest = Multihash::wrap(F::LURK_HASH, x.1.to_repr().as_ref())?;
  let cid = Cid::new_v1(codec, digest);
  Some(cid)
}
This requires reserving all possible entries in the current multicodec table
with a given 16-bit prefix (e.g. currently no multicodec begins with 0xC0DE or
0xDA7A), in order to allow for future breaking changes to Lurk, and so to
prevent mixing graphs generated over incompatible fields.
If it does not prove possible to reserve this range on the current multicodec,
the two alternatives are:

Propose an extension to the Multicodec and CID specification which allows for
arbitrary precision codecs. This would solve a potential issue of multicodec
scarcity, but would require additional support from the multiformats
standards. This is described in Appendix A of this document.
Use a single Lurk multicodec and the identity multihash codec to allow the
concatenation of the tag and poseidon hash to be expressed the CID digest
field. This has the advantage of being easily implementable without any
multiformats standards changes, but has disadvantage of potentially making it
more difficult in for IPFS implementations to store/host/traverse Lurk data
in the future. In the other options, the Multihash codec F::LURK_HASH would
correspond to the correct multihash codec for the Poseidon hash function used
to construct the pointers or links.

ScalarExpression as IPLD

We can express the encoding of ScalarExpression as an IPLD Schema, treating the
above embedding of ScalarPtr to CID as an advanced type (and as previously in
the Syntax schame, we must allow struct to be used as discriminant in the
kinded representation):
advanced ScalarPtr # Using the above tagged pointer representation

type ScalarExpression union{
  | Nil Nil
  | Cons Cons
  | Sym Sym
  | Fun Fun
  | Num Num
  | Str Str
  | Thunk Thunk
} representation kinded

type Nil struct {
  tag int            # value always equals Tag::Nil
} representation tuple

type Cons struct {
  tag int            # value always equals Tag::Cons
  car ScalarPtr
  cdr ScalarPtr
} representation tuple

type Fun struct {
  tag int            # value always equals Tag::Fun
  arg ScalarPtr
  bod ScalarPrt
  env ScalarPtr
} representation tuple

type Sym struct {
  tag int            # value always equals Tag::Sym
  sym ScalarPtr
} representation tuple

type Num struct {
  tag int            # value always equals Tag::Num
  num ScalarPtr
} representation tuple

type Str struct {
  tag int            # value always equals Tag::Str
  str ScalarPtr
} representation tuple

advanced ScalarThunk # Elided for brevity

type Thunk struct {
  tag int            # value always equals Tag::Thunk
  thunk ScalarThunk
} representation tuple

A property-tested implementation of the above can be found here:
https://github.com/lurk-lang/lurk-rs/blob/012279bb693b5d919dda1b249e5fbdf1756ee2d6/src/scalar_store.rs#L191
In principle, including the tag in the IPLD schema for each variant of
ScalarExpression is unnecessary, as long as we always have a prior tagged
ScalarPtr as context. However, since one of our goals is to allow
ScalarExpressions to be independently serializable, we do not want to always
assume this.
ScalarStore

Given the above schemas for ScalarPtr and ScalarExpression, the schema for
ScalarStore is quite simple:

type Node struct {
  ptr ScalarPtr
  expr ScalarExpression
}

type ScalarStore [Node]

An implementation can be found here: https://github.com/lurk-lang/lurk-rs/blob/012279bb693b5d919dda1b249e5fbdf1756ee2d6/src/scalar_store.rs#L101
Appendix A: CIDv2 and arbitrary Precision Multicodec

Currently multicodec is defined to use unsigned-varint, whose specification
has the folllowing constraint:

For security, to avoid memory attacks, we use a "practical max" of 9 bytes.
Though there is no theoretical limit, and future specs can grow this number if
it is truly necessary to have code or length values equal to or larger than
2^63.

And indeed, implementations of CIDv1 generally store their codec as u64,
as in the rust-cid
implementation:
pub struct Cid<const S: usize> {
    /// The version of CID.
    version: Version,
    /// The codec of CID.
    codec: u64,
    /// The multihash of CID.
    hash: Multihash<S>,
}
which is encoded as:
<cidv1> ::= <multibase-prefix><multicodec-cidv1><multicodec-content-type><multihash-content-address>

We can relax this constraint, however, with a new CID version Cidv2:
pub struct Cidv2<const S: usize> {
    /// The codec of CID.
    codec: num_bigint::BigUInt,
    /// The multihash of CID.
    hash: Multihash<S>,
}
which is encoded as:
<cidv2> ::= <multibase-prefix><multicodec-cidv2><multicodec-size><multicodec-content-type><multihash-content-address>

where <multicodec-size> is a single byte describing the number of bytes needed to store the codec.
This increases the size of our multicodec-space from 2^64 possible codecs, to
2^(8 * 2^64), which is for all practical purposes is infiinite.
Additionally, because the size parameter is encoded as an unsigned-varint, this
costs only one additional byte of size for codecs between 8 and 127 bytes, which
is more than enough, to e.g. fit a 256-bit or 512 bit public key into a CID.
Appendix B: dag-lurk, mapping the IPLD data model to Lurk

Many possibles mappings of the IPLD data model to Lurk data exist, here is one such possible:
Ipld::Bool => Lurk booleans `t` and `nil`
Ipld::String => Lurk string
Ipld::Integer => Lurk Numbers for positive integers, and the pair of a boolean and Number for negative integers
Ipld::Null => the Lurk symbol `null` (`nil` can be used, but breaks bijection)
Ipld::List => Lurk lists `(a b c)`
Ipld::Float => ("float" <base64-bytes>)
Ipld::Bytes => ("bytes" <base64-bytes>)
Ipld::Map => Lurk lists of key-value pairs: `(("foo" a) ("bar" b))`
Ipld::Link => ("link" <base64-bytes>)

This mapping could be improved by adding bytestring literals to Lurk, as well as
a way to directly represent ScalarPtr at the syntax level.