Skip to content

Instantly share code, notes, and snippets.

@johnchandlerburnham
Last active April 27, 2022 09:56
Show Gist options
  • Save johnchandlerburnham/d9b1b88d49b1e98af607754c0034f1c7 to your computer and use it in GitHub Desktop.
Save johnchandlerburnham/d9b1b88d49b1e98af607754c0034f1c7 to your computer and use it in GitHub Desktop.
Content Addressing Lurk Data

Content-Addressing Lurk Data

Overview

From the Lurk README:

Lurk is a statically scoped dialect of Lisp, influenced by Scheme and Common Lisp. A language specification and reference implementation focused on describing and developing the core language can be found in the lurk repo.

Lurk's distinguishing feature relative to most programming languages is that correct execution of Lurk programs can be directly proved using zk-SNARKs. The resulting proofs are succinct: they are relatively small, can be verified quickly, and they reveal only the information explicitly contained in the statement to be proved.

Lurk's distinguishing feature relative to most zk-SNARK authoring languages is that Lurk is Turing complete, so arbitrary computational claims can be made and proved (subject to resource limitations, obviously). Because Lurk is a Lisp, its code is simply Lurk data, and any Lurk data can be directly evaluated as a Lurk program. Lurk constructs compound data using SNARK-friendly Poseidon hashes (provided by Neptune), so its data is naturally content-addressable.

Lurk's natural content-addressibility raises an interesting question: "How does Lurk data integrate with the Interplanetary Linked Data (IPLD) ecosystem of content-addressed data formats?

There are two primary ways to content address Lurk data, based on two distinct representations:

  1. The Lurk Syntax Tree representation, where data is stored as a recursive tree
  2. The Lurk Scalar Graph representations, where data is stored as a content-addressed adjacency-list

These two representations are similar, but are not identical and used in different contexts: The Syntax representation corresponds better to Lurk programs as written by developers in source files, whereas the Scalar Graph representation corresponds better to the backend evaluation of Lurk programs. Unlike the Syntax Tree representation, the Scalar Graph can represent programs in a state of partial execution as "thunks".

Lurk Syntax Tree

At the source code level, Lurk consists of S-expressions:

; A partial EBNF grammar for Lurk Expressions

<expr> ::= <atom>
        |  <list>
        |  <improper-list>

<list> ::= "(" {<expr>} ")"
<improper-list> ::= "(" {<expr>} . <expr> ")"

<atom> ::= <symbol>         ; an alphanumeric identifier
        |  <string-literal> ; an escaped unicode string literal
        |  <scalar-number>  ; an element of a specific finite field

...

For brevity the definitions of <symbol> <string-literal> and <scalar-number> are not included, as their specific implementation details at the parsing level are not relevant for the purposes of this document. A full parser for the Lurk syntax can be found here: https://github.com/lurk-lang/lurk-rs/blob/997c0033e0f393846b279b9472e76a663a61cfdd/src/parser/syntax.rs

Here is an example of what Lurk source code looks like is:

(letrec ((next (lambda (a b n target)
                 (if (eq n target)
                     a
                     (next b
                           (+ a b)
                           (+ 1 n)
                           target))))
         (fib (next 0 1 0)))
        (current-env))

Rust datatype representation

Once parsed, this becomes represented in the lurk-rs Rust implementations as:

pub enum Print<F: PrimeField> {
    List(Vec<Print<F>>),
    ImproperList(Vec<Print<F>>, Box<Print<F>>),
    Sym(String),
    Num(Num<F>),
    Str(String),
}

which can be transformed into

pub enum Term<F: PrimeField> {
    Nil,
    Cons(Box<Print<F>>, Box<Print<F>>),
    Sym(String),
    Num(Num<F>),
    Str(String),
}

by unfolding all proper lists (x y z) into cons-cells (x . (y . (z . nil))), using the reserved symbol nil to indicate termination (The variant Term::Nil is is a factored out Term::Sym("nil") for implementation reasons). Improper lists are unfolded in the same way, except by using the provided terminator element instead of nil, so that (x y . z) becomes (x . (y . z))

IPLD representation

The Lurk source programs can be transformed into the IPLD data model (or JSON) using the following schema:

type Term union {
  | Cons Cons
  | Sym string
  | Literal Literal
} representation kinded

type Cons struct {
  car Term
  cdr Term
} representation tuple

type Literal union {
  tag int
  val string
} representation tuple

(NB: This requires relaxing the IPLD schema condition that structs cannot be used as representation kinds.)

Where Literal consists of a unique integer tag and the base64 bytes serialization of Term::Str, Term::Num, or any future literal variant which might be added to Lurk expressions. The value of this unique integer tag for literals is taken from Lurk's tagged scalar pointers (https://github.com/lurk-lang/lurk-rs/blob/274add6bb7f3159e42465110e4c7fe73608e8a17/src/store.rs#L538) and is explained further below.

For illustration of the above mapping, the Lurk expression (x 42 "foo") would map to the IPLD (or JSON) representation: ["x", [[5, "Kg=="], [6, "Zm9v"]]] assuming the bytestring representation of 42 is 0x42 in the field byte serialization.

Some notes about this schema: We explicitly unfold all Lurk lists into cons-pairs so that every lurk expression has a single canonical representation in the IPLD data model. We also specifically eschew IPLD maps in our schema to avoid potential codec specific ordering issues.

Lurk Scalar Graph

The above textual syntax can be seen as a direct tree representation of Lurk data or Lurk programs (which are the same, since Lurk is homoiconic). However, Lurk's evaluation model requires an indirect graph representation in order to efficiently delay or reorder execution between different "frames" of its zkSNARK backend, as well as to allow for the possiblity of hiding specific Lurk data behind a type-tagged content-addressed pointer: One can think of this graph at an abstract level as hash-consing the Lurk syntax tree, and then storing the nodes of that tree separately as a directed adjacency list.

For illustration on works, suppose we take our expression (x 42 "foo") from the previous section, but rendered as a tree:

  cons
  /  \
 x   cons
      / \
    42  cons
        /  \
      "foo" nil

In the direct representation, we have all the nodes of this tree immediately represented, with parents recursively containing children. However, we can also represent this tree as a graph adjacency list, by numbering the nodes:

  0: cons
  /       \
 1: x   2: cons
       /      \
    3: 42  4: cons
          /      \
      5: "foo"  6: nil

And then separating:

0: cons 1 2
1: x
2: cons 3 4
3: 42
4: cons 5 6
5: "foo"
6: nil

However, assigning numeric indices in this way is arbitrary. We can generate content-addressed indices by hashing each node:

<hash-0>: cons <hash-1> <hash-2>
<hash-1>: x
<hash-2>: cons <hash-3> <hash-4>
<hash-3>: 42
<hash-4>: cons <hash-5> <hash-6>
<hash-5>: "foo"
<hash-6>: nil

In Lurk we use the zkSNARK-friendly hash algorithm Poseidon, which outputs a finite-field element, rather than a bytestring digest. So each of the above hashes is a field element F for whatever PrimeField the Lurk backend is operating over.

Addtionally, we want to tag each of these hashes with information describing the type of node they were generated from:

(<cons>, <hash-0>): cons (<sym>, <hash-1>) (<cons>, <hash-2>)
(<sym> , <hash-1>): x
(<cons>, <hash-2>): cons (<num>, <hash-3>) (<cons>, <hash-4>)
(<num> , <hash-3>): 42
(<cons>, <hash-4>): cons (<str>, <hash-5>) (<nil>, <hash-6>)
(<str> , <hash-5>): "foo"
(<nil> , <hash-6>): nil

We include this tag in our pointers for several reasons:

First, our Poseidon hash function for each node depends its type, so hashing cons uses an arity 4 hash, hashing a number literal uses the scalar directly without any hashing, and hashing a string uses a particular implementation-specific algorithm. This information is needed in order to dereference our hash pointer.

Secondly, some nodes of our syntax tree the same contents, and differ only by their type, such as symbols and string literals. The type-tag disambiguates collisions, such as between symbol foo and string "foo". While in principle we could also solve this by always including the tag in the preimage of our hash, that would add additional cost or complexity.

Finally, and most importantly, Lurk's evaluation model adds a notion of delayed computation, called a continuation. These continuations are also content-addressed, and in order to correctly execute them Lurk needs the additional context provided by the type-tag (https://github.com/lurk-lang/lurk-rs/blob/274add6bb7f3159e42465110e4c7fe73608e8a17/src/store.rs#L562).

Rust datatype representation

As described in the previous section, the three components of our expression Graph are (with their corresponding Rust types):

  • Tagged hash-pointers (ScalarPtr)
  • Graph Nodes (ScalarExpression)
  • A map of pointers to nodes (ScalarStore)

As this document is focused on content-addressing, we are only interested in the canonical representation of Lurk Scalar Graphs in Rust. There are other non-canonical datatypes (such as Ptr, Expression, and Store) which are used to optimize evaluation in the lurk-rs implementation, but these can be ignored for present purposes.

ScalarPtr: Tagged hash-pointer

The canonical representation of our tagged hash-pointers is:

pub struct ScalarPtr<F: PrimeField>(F, F);

Where the first field element corresponds to a 16-bit tag describing the type of the node:

#[repr(u16)]
pub enum Tag {
    Nil = 0b0000_0000_0000_0000,
    Cons,
    Sym,
    Fun,
    Num,
    Thunk,
    Str,
}

And the second element is a Poseidon hash of some ScalarExpression of the type described by the tag.

ScalarExpression: Nodes in an expression graph

The corresponding canonical node representation of the above is:

pub enum ScalarExpression<F: PrimeField> {
  Nil,
  Cons(ScalarPtr<F>, ScalarPtr<F>),
  Sym(String),
  Fun {
    arg: ScalarPtr<F>,
    body: ScalarPtr<F>,
    closed_env: ScalarPtr<F>,
  },
  Num(Num<F>),
  Str(String),
  Thunk(ScalarThunk<F>)
}

Where Nil again the factored symbol nil, Fun is a factored out representation of lambdas e.g. (lambda (x) x). Those forms are separated for ease of conversion/evaluation.

ScalarStore: A content-addressed expression graph

The containing structure for a full canonical Lurk expression graph is:

pub struct ScalarStore<F: PrimeField> {
    scalar_map: BTreeMap<ScalarPtr<F>, ScalarExpression<F>>,
    ...
}

which is a map of tagged-hashes to individual nodes of our expression.

ScalarThunk, ScalarContPtr, ContTag

The content addressing of these datatypes are not included in this document for conciseness, but proceed on essentially the same lines as ScalarExpression, ScalarPtr, and Tag.

IPLD representation

When transforming our graph to IPLD, the most crucial step is to construct our mapping such that our Lurk pointers correspond to IPLD Links (or CIDs). This is necessary to allow individual entries or sub-graphs in our ScalarStore to be independently serialized and shared over e.g. IPFS without complex and costly traversal and re-mapping of identifiers.

In other words, we would like the following diagram to correctly commute:

Ipfs: CID  --> IPLD
       ^         ^
       |         |
       V         V
Lurk: Ptr ---> Expr

ScalarPtr as CID

The challenge with this requirement is that ScalarPtr contains a tag, and CID generally does not. Without a way to embed our Tag into CID we would be unable to correctly read individual nodes in our expression graph without re-traversing the entire graph (which we may not be able to do in the cases of opaque expressions, i.e. private zkSNARK inputs).

For instance, let's consider our earlier example (x 42 "foo"), but focus only the subexpression ("foo" . nil):

(<cons>, <hash-4>): cons (<str>, <hash-5>) (<nil>, <hash-6>)
(<str> , <hash-5>): "foo"
(<nil> , <hash-6>): nil

In order to compute <hash-4>, we need to know the tags of the expressions inside the cons, in this case <str> and <nil>, since the Lurk algorithm for hashing a cons node is (roughly):

poseidon-4(car-tag, car-hash, cdr-tag, cdr-hash)

where car and cdr are respectively the left-hand and right-hand contents of the cons.

Fortunately, we can solve this problem by using the CID Multicodec field to contain our tags. The CID codec field is currently limited to 64-bits, which is generally too small to store a full field-element, but fortunately Lurk tags can be represented as 16-bit unsigned integers.

In order to represent ScalarPtr as a CID, we can therefore perform the following operation:

/// Convert a field element `F` into a 64-bit multicodec
fn to_multicodec<F: LurkField>(f: F) -> Option<u64> {

let tag: u16 = F::to_tag(f)?; // Convert a field element to a 16-bit tag

let codec = 
  F::LURK_PREFIX << 48 |  /// a unique 16-bit prefix in the multicodec table
  F::LURK_VERSION << 32 | /// a 16-bit version number
  F::LURK_FIELD << 16 |   /// a 16-bit field identifier
  u64::from(tag))      /// the 16-bit tag
Some(codec)
}

With this embedding of Lurk tags as multicodecs, we can then embed our ScalarPtrs as CIDs:

fn scalar_ptr_to_cid<F: LurkField>(x: ScalarPtr<F>) -> Option<Cid> {
  let codec  = x.0.to_multicodec()?;
  let digest = Multihash::wrap(F::LURK_HASH, x.1.to_repr().as_ref())?;
  let cid = Cid::new_v1(codec, digest);
  Some(cid)
}

This requires reserving all possible entries in the current multicodec table with a given 16-bit prefix (e.g. currently no multicodec begins with 0xC0DE or 0xDA7A), in order to allow for future breaking changes to Lurk, and so to prevent mixing graphs generated over incompatible fields.

If it does not prove possible to reserve this range on the current multicodec, the two alternatives are:

  1. Propose an extension to the Multicodec and CID specification which allows for arbitrary precision codecs. This would solve a potential issue of multicodec scarcity, but would require additional support from the multiformats standards. This is described in Appendix A of this document.
  2. Use a single Lurk multicodec and the identity multihash codec to allow the concatenation of the tag and poseidon hash to be expressed the CID digest field. This has the advantage of being easily implementable without any multiformats standards changes, but has disadvantage of potentially making it more difficult in for IPFS implementations to store/host/traverse Lurk data in the future. In the other options, the Multihash codec F::LURK_HASH would correspond to the correct multihash codec for the Poseidon hash function used to construct the pointers or links.

ScalarExpression as IPLD

We can express the encoding of ScalarExpression as an IPLD Schema, treating the above embedding of ScalarPtr to CID as an advanced type (and as previously in the Syntax schame, we must allow struct to be used as discriminant in the kinded representation):

advanced ScalarPtr # Using the above tagged pointer representation

type ScalarExpression union{
  | Nil Nil
  | Cons Cons
  | Sym Sym
  | Fun Fun
  | Num Num
  | Str Str
  | Thunk Thunk
} representation kinded

type Nil struct {
  tag int            # value always equals Tag::Nil
} representation tuple

type Cons struct {
  tag int            # value always equals Tag::Cons
  car ScalarPtr
  cdr ScalarPtr
} representation tuple

type Fun struct {
  tag int            # value always equals Tag::Fun
  arg ScalarPtr
  bod ScalarPrt
  env ScalarPtr
} representation tuple

type Sym struct {
  tag int            # value always equals Tag::Sym
  sym ScalarPtr
} representation tuple

type Num struct {
  tag int            # value always equals Tag::Num
  num ScalarPtr
} representation tuple

type Str struct {
  tag int            # value always equals Tag::Str
  str ScalarPtr
} representation tuple

advanced ScalarThunk # Elided for brevity

type Thunk struct {
  tag int            # value always equals Tag::Thunk
  thunk ScalarThunk
} representation tuple

A property-tested implementation of the above can be found here: https://github.com/lurk-lang/lurk-rs/blob/012279bb693b5d919dda1b249e5fbdf1756ee2d6/src/scalar_store.rs#L191

In principle, including the tag in the IPLD schema for each variant of ScalarExpression is unnecessary, as long as we always have a prior tagged ScalarPtr as context. However, since one of our goals is to allow ScalarExpressions to be independently serializable, we do not want to always assume this.

ScalarStore

Given the above schemas for ScalarPtr and ScalarExpression, the schema for ScalarStore is quite simple:


type Node struct {
  ptr ScalarPtr
  expr ScalarExpression
}

type ScalarStore [Node]

An implementation can be found here: https://github.com/lurk-lang/lurk-rs/blob/012279bb693b5d919dda1b249e5fbdf1756ee2d6/src/scalar_store.rs#L101

Appendix A: CIDv2 and arbitrary Precision Multicodec

Currently multicodec is defined to use unsigned-varint, whose specification has the folllowing constraint:

For security, to avoid memory attacks, we use a "practical max" of 9 bytes. Though there is no theoretical limit, and future specs can grow this number if it is truly necessary to have code or length values equal to or larger than 2^63.

And indeed, implementations of CIDv1 generally store their codec as u64, as in the rust-cid implementation:

pub struct Cid<const S: usize> {
    /// The version of CID.
    version: Version,
    /// The codec of CID.
    codec: u64,
    /// The multihash of CID.
    hash: Multihash<S>,
}

which is encoded as:

<cidv1> ::= <multibase-prefix><multicodec-cidv1><multicodec-content-type><multihash-content-address>

We can relax this constraint, however, with a new CID version Cidv2:

pub struct Cidv2<const S: usize> {
    /// The codec of CID.
    codec: num_bigint::BigUInt,
    /// The multihash of CID.
    hash: Multihash<S>,
}

which is encoded as:

<cidv2> ::= <multibase-prefix><multicodec-cidv2><multicodec-size><multicodec-content-type><multihash-content-address>

where <multicodec-size> is a single byte describing the number of bytes needed to store the codec.

This increases the size of our multicodec-space from 2^64 possible codecs, to 2^(8 * 2^64), which is for all practical purposes is infiinite.

Additionally, because the size parameter is encoded as an unsigned-varint, this costs only one additional byte of size for codecs between 8 and 127 bytes, which is more than enough, to e.g. fit a 256-bit or 512 bit public key into a CID.

Appendix B: dag-lurk, mapping the IPLD data model to Lurk

Many possibles mappings of the IPLD data model to Lurk data exist, here is one such possible:

Ipld::Bool => Lurk booleans `t` and `nil`
Ipld::String => Lurk string
Ipld::Integer => Lurk Numbers for positive integers, and the pair of a boolean and Number for negative integers
Ipld::Null => the Lurk symbol `null` (`nil` can be used, but breaks bijection)
Ipld::List => Lurk lists `(a b c)`
Ipld::Float => ("float" <base64-bytes>)
Ipld::Bytes => ("bytes" <base64-bytes>)
Ipld::Map => Lurk lists of key-value pairs: `(("foo" a) ("bar" b))`
Ipld::Link => ("link" <base64-bytes>)

This mapping could be improved by adding bytestring literals to Lurk, as well as a way to directly represent ScalarPtr at the syntax level.

@vmx
Copy link

vmx commented Apr 14, 2022

I have questions especially about the (Lurk Scalar Graph section)[https://gist.github.com/johnchandlerburnham/d9b1b88d49b1e98af607754c0034f1c7#lurk-scalar-graph].

I'll reiterate and try to explain some things with my own words. I've the feeling I misunderstand something, so I hope it'll be easy to spot.

I'll start where the separation from the graph is happening. I think of:

0: cons 1 2
1: x
2: cons 3 4
3: 42
4: cons 5 6
5: "foo"
6: nil

as:

0: [1 2]
1: x
2: [3 4]
3: 42
4: [5 6]
5: "foo"
6: nil

As all things have implicit types/tags. Or more explicit:

0: cons 1 2
1: sym x
2: cons 3 4
3: num 42
4: cons 5 6
5: str "foo"
6: nil

The next one is then:

<hash-0>: cons <hash-1> <hash-2>
<hash-1>: sym x
<hash-2>: cons <hash-3> <hash-4>
<hash-3>: num 42
<hash-4>: cons <hash-5> <hash-6>
<hash-5>: str "foo"
<hash-6>: nil

And with tags it's:

(<cons>, <hash-0>): <sym> <hash-1> <cons> <hash-2>
(<sym> , <hash-1>): <x>
(<cons>, <hash-2>): <num> <hash-3> <cons> <hash-4>
(<num> , <hash-3>): <42>
(<cons>, <hash-4>): <str> <hash-5> <nil> <hash-6>
(<str> , <hash-5>): "foo"
(<nil> , <hash-6>): <nil>

And that's (as I understand it) is exactly the point. Everything in angle brackets is a field element. In this example only strings and cons need to be hashed, so it could be written as:

(<cons>, <hash-0>): <sym> <x> <cons> <hash-2>
(<cons>, <hash-2>): <num> <42> <cons> <hash-4>
(<cons>, <hash-4>): <str> <hash-5> <nil> <nil>
(<str> , <hash-5>): "foo"

This means that <hash-1>, <hash-3> and <hash-6> aren't actually pointers, but the data itself. Is that correct?

And you know whether it's a pointer or the data thanks to the tag.

So in this example the tag information is stored redundantly, except for <hash-0>. So without losing any information it could be written as:

(<cons>, <hash-0>): <sym> <x> <cons> <hash-2>
<hash-2>: <num> <42> <cons> <hash-4>
<hash-4>: <str> <hash-5> <nil> <nil>
<hash-5>: "foo"

Now to the reasons to use the tag as part of the pointer:

First, our Poseidon hash function for each node depends its type, so hashing
cons uses an arity 4 hash, hashing a number literal uses the scalar directly
without any hashing, and hashing a string uses a particular
implementation-specific
algorithm
.
This information is needed in order to dereference our hash pointer.

That's correct if you e.g. would only see <hash-5> without any further information/context. But where would you get <hash-5> from? Wouldn't you get it from <hash-4> and hence know that <hash-5> is a string?

Secondly, some nodes of our syntax tree the same contents, and differ only by
their type, such as symbols and string literals. The type-tag disambiguates
collisions, such as between symbol foo and string "foo". While in principle
we could also solve this by always including the tag in the preimage of our
hash, that would add additional cost or complexity.

I might misunderstand it, but to me it's the point of content-addressing. If it's the same content, symbol foo and string foo, then it's the same hash. For content-addressing purpose it doesn't matter. The (what I call) semantic meaning comes from the context. In <hash-4> you know that you need to interpret the content of <hash-5> as string and not as symbol.

Finally, and most importantly, Lurk's evaluation model adds a notion of delayed
computation, called a continuation. These continuations are also
content-addressed, and in order to correctly execute them Lurk needs the
additional context provided by the type-tag
(https://github.com/lurk-lang/lurk-rs/blob/274add6bb7f3159e42465110e4c7fe73608e8a17/src/store.rs#L562).

Same as above, don't you know that already from the context?

@vmx
Copy link

vmx commented Apr 27, 2022

From discussions via chat. One idea to use CIDv1 is to use in the codec field something with the following structure:

struct LurkCodec {
    // Lurk prefix, e.g. 0xc0de.
    prefix: u16,
    / Version number.
    version: u16,
    // Unique field or hash codec (so that we don't mix data from different lurk backends).
    field: u16,
    // Tag
    tag: u16
}

Let's say we use that approach. Could you please give a real world example (e.g. [1, 2]), with the actual values. What I'm looking for is an instance of a CIDv1, with the LurkCodec expanded into the structure above and the bytes (perhaps explained a bit) such a CID would point to (the actual content). What I e.g. hope to see is, how CIDs are constructed from the data when traversing the graph.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment