From the Lurk README:
Lurk is a statically scoped dialect of Lisp, influenced by Scheme and Common Lisp. A language specification and reference implementation focused on describing and developing the core language can be found in the
lurk
repo.Lurk's distinguishing feature relative to most programming languages is that correct execution of Lurk programs can be directly proved using zk-SNARKs. The resulting proofs are succinct: they are relatively small, can be verified quickly, and they reveal only the information explicitly contained in the statement to be proved.
Lurk's distinguishing feature relative to most zk-SNARK authoring languages is that Lurk is Turing complete, so arbitrary computational claims can be made and proved (subject to resource limitations, obviously). Because Lurk is a Lisp, its code is simply Lurk data, and any Lurk data can be directly evaluated as a Lurk program. Lurk constructs compound data using SNARK-friendly Poseidon hashes (provided by Neptune), so its data is naturally content-addressable.
Lurk's natural content-addressibility raises an interesting question: "How does Lurk data integrate with the Interplanetary Linked Data (IPLD) ecosystem of content-addressed data formats?
There are two primary ways to content address Lurk data, based on two distinct representations:
- The Lurk Syntax Tree representation, where data is stored as a recursive tree
- The Lurk Scalar Graph representations, where data is stored as a content-addressed adjacency-list
These two representations are similar, but are not identical and used in different contexts: The Syntax representation corresponds better to Lurk programs as written by developers in source files, whereas the Scalar Graph representation corresponds better to the backend evaluation of Lurk programs. Unlike the Syntax Tree representation, the Scalar Graph can represent programs in a state of partial execution as "thunks".
At the source code level, Lurk consists of S-expressions:
; A partial EBNF grammar for Lurk Expressions
<expr> ::= <atom>
| <list>
| <improper-list>
<list> ::= "(" {<expr>} ")"
<improper-list> ::= "(" {<expr>} . <expr> ")"
<atom> ::= <symbol> ; an alphanumeric identifier
| <string-literal> ; an escaped unicode string literal
| <scalar-number> ; an element of a specific finite field
...
For brevity the definitions of <symbol>
<string-literal>
and
<scalar-number>
are not included, as their specific implementation details at
the parsing level are not relevant for the purposes of this document. A full
parser for the Lurk syntax can be found here:
https://github.com/lurk-lang/lurk-rs/blob/997c0033e0f393846b279b9472e76a663a61cfdd/src/parser/syntax.rs
Here is an example of what Lurk source code looks like is:
(letrec ((next (lambda (a b n target)
(if (eq n target)
a
(next b
(+ a b)
(+ 1 n)
target))))
(fib (next 0 1 0)))
(current-env))
Once parsed, this becomes represented in the lurk-rs Rust implementations as:
pub enum Print<F: PrimeField> {
List(Vec<Print<F>>),
ImproperList(Vec<Print<F>>, Box<Print<F>>),
Sym(String),
Num(Num<F>),
Str(String),
}
which can be transformed into
pub enum Term<F: PrimeField> {
Nil,
Cons(Box<Print<F>>, Box<Print<F>>),
Sym(String),
Num(Num<F>),
Str(String),
}
by unfolding all proper lists (x y z)
into cons-cells (x . (y . (z . nil)))
,
using the reserved symbol nil
to indicate termination (The variant Term::Nil
is is a factored out Term::Sym("nil")
for implementation reasons). Improper
lists are unfolded in the same way, except by using the provided terminator
element instead of nil
, so that (x y . z)
becomes (x . (y . z))
The Lurk source programs can be transformed into the IPLD data model (or JSON) using the following schema:
type Term union {
| Cons Cons
| Sym string
| Literal Literal
} representation kinded
type Cons struct {
car Term
cdr Term
} representation tuple
type Literal union {
tag int
val string
} representation tuple
(NB: This requires relaxing the IPLD schema condition that structs cannot be used as representation kinds.)
Where Literal
consists of a unique integer tag and the base64 bytes
serialization of Term::Str
, Term::Num
, or any future literal variant which
might be added to Lurk expressions. The value of this unique integer tag for
literals is taken from Lurk's tagged scalar pointers
(https://github.com/lurk-lang/lurk-rs/blob/274add6bb7f3159e42465110e4c7fe73608e8a17/src/store.rs#L538)
and is explained further below.
For illustration of the above mapping, the Lurk expression (x 42 "foo")
would
map to the IPLD (or JSON) representation: ["x", [[5, "Kg=="], [6, "Zm9v"]]]
assuming the bytestring representation of 42
is 0x42
in the field byte
serialization.
Some notes about this schema: We explicitly unfold all Lurk lists into cons-pairs so that every lurk expression has a single canonical representation in the IPLD data model. We also specifically eschew IPLD maps in our schema to avoid potential codec specific ordering issues.
The above textual syntax can be seen as a direct tree representation of Lurk data or Lurk programs (which are the same, since Lurk is homoiconic). However, Lurk's evaluation model requires an indirect graph representation in order to efficiently delay or reorder execution between different "frames" of its zkSNARK backend, as well as to allow for the possiblity of hiding specific Lurk data behind a type-tagged content-addressed pointer: One can think of this graph at an abstract level as hash-consing the Lurk syntax tree, and then storing the nodes of that tree separately as a directed adjacency list.
For illustration on works, suppose we take our expression (x 42 "foo")
from
the previous section, but rendered as a tree:
cons
/ \
x cons
/ \
42 cons
/ \
"foo" nil
In the direct representation, we have all the nodes of this tree immediately represented, with parents recursively containing children. However, we can also represent this tree as a graph adjacency list, by numbering the nodes:
0: cons
/ \
1: x 2: cons
/ \
3: 42 4: cons
/ \
5: "foo" 6: nil
And then separating:
0: cons 1 2
1: x
2: cons 3 4
3: 42
4: cons 5 6
5: "foo"
6: nil
However, assigning numeric indices in this way is arbitrary. We can generate content-addressed indices by hashing each node:
<hash-0>: cons <hash-1> <hash-2>
<hash-1>: x
<hash-2>: cons <hash-3> <hash-4>
<hash-3>: 42
<hash-4>: cons <hash-5> <hash-6>
<hash-5>: "foo"
<hash-6>: nil
In Lurk we use the zkSNARK-friendly hash algorithm Poseidon, which outputs a
finite-field element, rather than a bytestring digest. So each of the above
hashes is a field element F
for whatever PrimeField
the Lurk backend is
operating over.
Addtionally, we want to tag each of these hashes with information describing the type of node they were generated from:
(<cons>, <hash-0>): cons (<sym>, <hash-1>) (<cons>, <hash-2>)
(<sym> , <hash-1>): x
(<cons>, <hash-2>): cons (<num>, <hash-3>) (<cons>, <hash-4>)
(<num> , <hash-3>): 42
(<cons>, <hash-4>): cons (<str>, <hash-5>) (<nil>, <hash-6>)
(<str> , <hash-5>): "foo"
(<nil> , <hash-6>): nil
We include this tag in our pointers for several reasons:
First, our Poseidon hash function for each node depends its type, so hashing
cons
uses an arity 4 hash, hashing a number literal uses the scalar directly
without any hashing, and hashing a string uses a particular
implementation-specific
algorithm.
This information is needed in order to dereference our hash pointer.
Secondly, some nodes of our syntax tree the same contents, and differ only by
their type, such as symbols and string literals. The type-tag disambiguates
collisions, such as between symbol foo
and string "foo"
. While in principle
we could also solve this by always including the tag in the preimage of our
hash, that would add additional cost or complexity.
Finally, and most importantly, Lurk's evaluation model adds a notion of delayed computation, called a continuation. These continuations are also content-addressed, and in order to correctly execute them Lurk needs the additional context provided by the type-tag (https://github.com/lurk-lang/lurk-rs/blob/274add6bb7f3159e42465110e4c7fe73608e8a17/src/store.rs#L562).
As described in the previous section, the three components of our expression Graph are (with their corresponding Rust types):
- Tagged hash-pointers (
ScalarPtr
) - Graph Nodes (
ScalarExpression
) - A map of pointers to nodes (
ScalarStore
)
As this document is focused on content-addressing, we are only interested in the
canonical representation of Lurk Scalar Graphs in Rust. There are other
non-canonical datatypes (such as Ptr
, Expression
, and Store
) which are
used to optimize evaluation in the lurk-rs
implementation, but these can be
ignored for present purposes.
The canonical representation of our tagged hash-pointers is:
pub struct ScalarPtr<F: PrimeField>(F, F);
Where the first field element corresponds to a 16-bit tag describing the type of the node:
#[repr(u16)]
pub enum Tag {
Nil = 0b0000_0000_0000_0000,
Cons,
Sym,
Fun,
Num,
Thunk,
Str,
}
And the second element is a Poseidon hash of some ScalarExpression of the type described by the tag.
The corresponding canonical node representation of the above is:
pub enum ScalarExpression<F: PrimeField> {
Nil,
Cons(ScalarPtr<F>, ScalarPtr<F>),
Sym(String),
Fun {
arg: ScalarPtr<F>,
body: ScalarPtr<F>,
closed_env: ScalarPtr<F>,
},
Num(Num<F>),
Str(String),
Thunk(ScalarThunk<F>)
}
Where Nil
again the factored symbol nil
, Fun
is a factored out
representation of lambdas e.g. (lambda (x) x)
. Those forms are separated for
ease of conversion/evaluation.
The containing structure for a full canonical Lurk expression graph is:
pub struct ScalarStore<F: PrimeField> {
scalar_map: BTreeMap<ScalarPtr<F>, ScalarExpression<F>>,
...
}
which is a map of tagged-hashes to individual nodes of our expression.
The content addressing of these datatypes are not included in this document for
conciseness, but proceed on essentially the same lines as ScalarExpression
,
ScalarPtr
, and Tag
.
When transforming our graph to IPLD, the most crucial step is to construct our
mapping such that our Lurk pointers correspond to IPLD Links (or CIDs). This is
necessary to allow individual entries or sub-graphs in our ScalarStore
to be
independently serialized and shared over e.g. IPFS without complex and costly
traversal and re-mapping of identifiers.
In other words, we would like the following diagram to correctly commute:
Ipfs: CID --> IPLD
^ ^
| |
V V
Lurk: Ptr ---> Expr
The challenge with this requirement is that ScalarPtr
contains a tag, and
CID
generally does not. Without a way to embed our Tag
into CID
we would
be unable to correctly read individual nodes in our expression graph without
re-traversing the entire graph (which we may not be able to do in the cases of
opaque expressions, i.e. private zkSNARK inputs).
For instance, let's consider our earlier example (x 42 "foo")
, but focus only
the subexpression ("foo" . nil)
:
(<cons>, <hash-4>): cons (<str>, <hash-5>) (<nil>, <hash-6>)
(<str> , <hash-5>): "foo"
(<nil> , <hash-6>): nil
In order to compute <hash-4>
, we need to know the tags of the expressions
inside the cons
, in this case <str>
and <nil>
, since the Lurk algorithm
for hashing a cons
node is (roughly):
poseidon-4(car-tag, car-hash, cdr-tag, cdr-hash)
where car
and cdr
are respectively the left-hand and right-hand contents of
the cons
.
Fortunately, we can solve this problem by using the CID Multicodec field to
contain our tags. The CID codec
field is currently limited to 64-bits, which
is generally too small to store a full field-element, but fortunately Lurk tags
can be represented as 16-bit unsigned integers.
In order to represent ScalarPtr as a CID, we can therefore perform the following operation:
/// Convert a field element `F` into a 64-bit multicodec
fn to_multicodec<F: LurkField>(f: F) -> Option<u64> {
let tag: u16 = F::to_tag(f)?; // Convert a field element to a 16-bit tag
let codec =
F::LURK_PREFIX << 48 | /// a unique 16-bit prefix in the multicodec table
F::LURK_VERSION << 32 | /// a 16-bit version number
F::LURK_FIELD << 16 | /// a 16-bit field identifier
u64::from(tag)) /// the 16-bit tag
Some(codec)
}
With this embedding of Lurk tags as multicodecs, we can then embed our ScalarPtrs as CIDs:
fn scalar_ptr_to_cid<F: LurkField>(x: ScalarPtr<F>) -> Option<Cid> {
let codec = x.0.to_multicodec()?;
let digest = Multihash::wrap(F::LURK_HASH, x.1.to_repr().as_ref())?;
let cid = Cid::new_v1(codec, digest);
Some(cid)
}
This requires reserving all possible entries in the current multicodec table
with a given 16-bit prefix (e.g. currently no multicodec begins with 0xC0DE
or
0xDA7A
), in order to allow for future breaking changes to Lurk, and so to
prevent mixing graphs generated over incompatible fields.
If it does not prove possible to reserve this range on the current multicodec, the two alternatives are:
- Propose an extension to the Multicodec and CID specification which allows for arbitrary precision codecs. This would solve a potential issue of multicodec scarcity, but would require additional support from the multiformats standards. This is described in Appendix A of this document.
- Use a single Lurk multicodec and the identity multihash codec to allow the
concatenation of the tag and poseidon hash to be expressed the CID digest
field. This has the advantage of being easily implementable without any
multiformats standards changes, but has disadvantage of potentially making it
more difficult in for IPFS implementations to store/host/traverse Lurk data
in the future. In the other options, the Multihash codec
F::LURK_HASH
would correspond to the correct multihash codec for the Poseidon hash function used to construct the pointers or links.
We can express the encoding of ScalarExpression as an IPLD Schema, treating the
above embedding of ScalarPtr to CID as an advanced type (and as previously in
the Syntax schame, we must allow struct
to be used as discriminant in the
kinded
representation):
advanced ScalarPtr # Using the above tagged pointer representation
type ScalarExpression union{
| Nil Nil
| Cons Cons
| Sym Sym
| Fun Fun
| Num Num
| Str Str
| Thunk Thunk
} representation kinded
type Nil struct {
tag int # value always equals Tag::Nil
} representation tuple
type Cons struct {
tag int # value always equals Tag::Cons
car ScalarPtr
cdr ScalarPtr
} representation tuple
type Fun struct {
tag int # value always equals Tag::Fun
arg ScalarPtr
bod ScalarPrt
env ScalarPtr
} representation tuple
type Sym struct {
tag int # value always equals Tag::Sym
sym ScalarPtr
} representation tuple
type Num struct {
tag int # value always equals Tag::Num
num ScalarPtr
} representation tuple
type Str struct {
tag int # value always equals Tag::Str
str ScalarPtr
} representation tuple
advanced ScalarThunk # Elided for brevity
type Thunk struct {
tag int # value always equals Tag::Thunk
thunk ScalarThunk
} representation tuple
A property-tested implementation of the above can be found here: https://github.com/lurk-lang/lurk-rs/blob/012279bb693b5d919dda1b249e5fbdf1756ee2d6/src/scalar_store.rs#L191
In principle, including the tag in the IPLD schema for each variant of ScalarExpression is unnecessary, as long as we always have a prior tagged ScalarPtr as context. However, since one of our goals is to allow ScalarExpressions to be independently serializable, we do not want to always assume this.
Given the above schemas for ScalarPtr
and ScalarExpression
, the schema for
ScalarStore
is quite simple:
type Node struct {
ptr ScalarPtr
expr ScalarExpression
}
type ScalarStore [Node]
An implementation can be found here: https://github.com/lurk-lang/lurk-rs/blob/012279bb693b5d919dda1b249e5fbdf1756ee2d6/src/scalar_store.rs#L101
Currently multicodec
is defined to use unsigned-varint
, whose specification
has the folllowing constraint:
For security, to avoid memory attacks, we use a "practical max" of 9 bytes. Though there is no theoretical limit, and future specs can grow this number if it is truly necessary to have code or length values equal to or larger than
2^63
.
And indeed, implementations of CIDv1
generally store their codec
as u64
,
as in the rust-cid
implementation:
pub struct Cid<const S: usize> {
/// The version of CID.
version: Version,
/// The codec of CID.
codec: u64,
/// The multihash of CID.
hash: Multihash<S>,
}
which is encoded as:
<cidv1> ::= <multibase-prefix><multicodec-cidv1><multicodec-content-type><multihash-content-address>
We can relax this constraint, however, with a new CID version Cidv2
:
pub struct Cidv2<const S: usize> {
/// The codec of CID.
codec: num_bigint::BigUInt,
/// The multihash of CID.
hash: Multihash<S>,
}
which is encoded as:
<cidv2> ::= <multibase-prefix><multicodec-cidv2><multicodec-size><multicodec-content-type><multihash-content-address>
where <multicodec-size>
is a single byte describing the number of bytes needed to store the codec.
This increases the size of our multicodec-space from 2^64
possible codecs, to
2^(8 * 2^64)
, which is for all practical purposes is infiinite.
Additionally, because the size parameter is encoded as an unsigned-varint, this costs only one additional byte of size for codecs between 8 and 127 bytes, which is more than enough, to e.g. fit a 256-bit or 512 bit public key into a CID.
Many possibles mappings of the IPLD data model to Lurk data exist, here is one such possible:
Ipld::Bool => Lurk booleans `t` and `nil`
Ipld::String => Lurk string
Ipld::Integer => Lurk Numbers for positive integers, and the pair of a boolean and Number for negative integers
Ipld::Null => the Lurk symbol `null` (`nil` can be used, but breaks bijection)
Ipld::List => Lurk lists `(a b c)`
Ipld::Float => ("float" <base64-bytes>)
Ipld::Bytes => ("bytes" <base64-bytes>)
Ipld::Map => Lurk lists of key-value pairs: `(("foo" a) ("bar" b))`
Ipld::Link => ("link" <base64-bytes>)
This mapping could be improved by adding bytestring literals to Lurk, as well as
a way to directly represent ScalarPtr
at the syntax level.
I have questions especially about the (Lurk Scalar Graph section)[https://gist.github.com/johnchandlerburnham/d9b1b88d49b1e98af607754c0034f1c7#lurk-scalar-graph].
I'll reiterate and try to explain some things with my own words. I've the feeling I misunderstand something, so I hope it'll be easy to spot.
I'll start where the separation from the graph is happening. I think of:
as:
As all things have implicit types/tags. Or more explicit:
The next one is then:
And with tags it's:
And that's (as I understand it) is exactly the point. Everything in angle brackets is a field element. In this example only strings and cons need to be hashed, so it could be written as:
This means that
<hash-1>
,<hash-3>
and<hash-6>
aren't actually pointers, but the data itself. Is that correct?And you know whether it's a pointer or the data thanks to the tag.
So in this example the tag information is stored redundantly, except for
<hash-0>
. So without losing any information it could be written as:Now to the reasons to use the tag as part of the pointer:
That's correct if you e.g. would only see
<hash-5>
without any further information/context. But where would you get<hash-5>
from? Wouldn't you get it from<hash-4>
and hence know that<hash-5>
is a string?I might misunderstand it, but to me it's the point of content-addressing. If it's the same content, symbol
foo
and stringfoo
, then it's the same hash. For content-addressing purpose it doesn't matter. The (what I call) semantic meaning comes from the context. In<hash-4>
you know that you need to interpret the content of<hash-5>
as string and not as symbol.Same as above, don't you know that already from the context?