general model for merkle graphs as first class types, salvaged from an email

Abstract

This document proposes a model for content addressed storage and interchange of arbitrary graphs of data (where vertices are byte strings) in a decentralized setting, requiring only pairwise agreement about preferred hash functions or formats (as opposed to global agreement).

The main contribution is an interpretation of arbitrary Merklized data structures with concrete hash functions as dependent types, for which unambiguous type identifiers can be derived using any locally preferred hash function and hash consing.

Motivation

Graphs are the de facto universal data type

Pointer based memory models are a near universal graph based abstraction (purely functional data structures are restricted to directed acyclic structures), and many standard data structures such as trees, lists, maps/relations, and sets are special cases of graphs, realized directly in a pointer based model.

However, for data storage we typically only use these special cases: filesystems generally have a single tree hierarchy containing mutable strings (files), the DNS system is a single tree hierarchy, relational databases demand that global invariants of some graph structure be specified in the relational schema, and common interchange formats like JSON or protocol buffers typically encode tree-like structures into linear sequences.

This creates a common pattern in many applications - as data is swapped in and out of the memory model, much like in a virtual memory system, it must be encoded or translated into these more linear formats at the boundary between different programs or programming languages.

Graphs can model monotonically increasing information

Trees and lists have no naturally commutative merge operation outside the context of a semantic interpretation. For example, given knowledge that two trees represent parts of the same hierarchy (e.g. directories of files), their contents can be combined into a single tree (e.g. unioning all the data, but generating conflicts for files with the same path but different content).

By treating these as special cases of graphs, otherwise conflicting data can still be idempotently merged, i.e. instead of a conflict, the two trees can be represented as alternatives. This does nothing to provide semantic merging - a human is still required to define a precedence ordering between these alternatives - but unlike filesystems or monolithic files, if the human isn't available or able to define an ordering, the data can still be captured and retained, permitting lossless, incremental progress towards a future consistent state.

Hash agnostic content addressing

Content addressing is suitable for distributed networks, but the coordination of formats and protocols introduces another notion of centralization, both in terms of defining standard formats and in terms of producing compatible implementations. Furthermore, these standards must be updated over time, e.g. as hash functions are weakened or deprecated.

By abstracting this into a more general model that is agnostic of the particular hash function(s) or formats used, different specific instances of content addressing systems can be used interchangeably or in parallel while conceptually covering parts of the same combined graph.

This abstract graph already exists, and can be thought of as already containing every git repository in existence, the entire bittorrent and IPFS networks, various blockchains, etc, but its constituent parts are poorly connected and hard to traverse (thought of as a single immutable memory space, the data reside in different segments that require different addressing modes).

Variations of Graphs of Strings

We will consider several different graph constructions and how they relate to each other, where the set of vertices is always a subset of the set of all finite strings, i.e. {0,1}*, though in practice (and without loss of generality) {0..255}* (i.e. byte strings).

The first degree of freedom we must consider is whether or not strings identify vertices or merely label them (in which case two distinct vertices can have the same label).

The second is whether or not edges (only considering unlabeled, directed edges) are intrinsic or extrinsic to the vertex, i.e. do the vertex string values explicitly list their out neighbourhood.
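
As a rough Haskell sketch of these two degrees of freedom (the type and field names here are illustrative, not taken from any particular library):

import qualified Data.ByteString as BS
import           Data.Map (Map)
import           Data.Set (Set)

-- non-identifying labels, extrinsic edges: vertices are opaque identifiers,
-- with the label map and the edge relation stored separately
data LabeledGraph id = LabeledGraph
  { labels :: Map id BS.ByteString
  , edges  :: Set (id, id)
  }

-- identifying strings, intrinsic edges: the graph is just a set of strings,
-- and the out neighbourhood of each vertex is recovered by parsing it
type IntrinsicGraph = Set BS.ByteString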

For example, if we have a filesystem tree:

mkdir -p /foo/foo
echo foo > /foo/foo/foo

then the same string foo appears 4 times, as the contents of the file, as its name, as the containing directory, and as the parent directory.

As a labeled graph with extrinsic edges, we can think of this as:

foo -> foo -> foo -> foo

and infer that the rightmost foo, due to its out degree of 0, represents contents, that its predecessor in the chain represents a filename, and that the others represent directory names.

To represent this as a graph where strings uniquely identify vertices, we need to qualify everything, for example using a type tag, and a full path to maintain distinctness:

"dir:/foo" -> "dir:/foo/foo" -> "file:/foo/foo/foo" -> "blob:foo"

With intrinsic edges we need some way of referencing vertices (e.g. a symbol or numbering scheme), and the full graph can be represented as an unordered set of vertex strings, where the set of edges can be recovered by parsing.

Merkle Graphs

In Merkle graphs vertices are identified by strings and edges are intrinsic, but we also identify vertices using their labels' image under a hash function.

Assuming the collision resistance property, we can treat this function as injective (even though the pigeonhole principle proves that it isn't), and that provides a partial order relation over the vertices.

As an example, with sha256 truncated to 2 bytes, git inspired type tags, and something resembling s-expressions to encode directory tree objects, the directory tree structure above can be mapped to the following relation (set of ordered pairs):

{ <"blob:foo", "6c9a">
, <"tree:((foo . 6c9a))", "9dee">
, <"tree:((foo . 9dee))", "e1a3">
, <"tree:((foo . e1a3))", "775f">
}

Due to the partial order relation over the strings, the set of hashes is not just isomorphic to their pre-images, but also to subgraphs, specifically the transitive closures of the vertices (i.e. 9dee refers to the graph "tree:((foo . 6c9a))" -> "blob:foo").
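
As a minimal Haskell sketch of this hash consing, assuming the cryptonite and memory packages (the exact two byte digests depend on the precise byte strings hashed, so the values above should be read as illustrative):

{-# LANGUAGE OverloadedStrings #-}
import           Crypto.Hash (SHA256 (..), hashWith)
import           Data.ByteArray (convert)
import           Data.ByteArray.Encoding (Base (Base16), convertToBase)
import qualified Data.ByteString as BS

-- sha256 truncated to 2 bytes, rendered as hex, as in the relation above
shortId :: BS.ByteString -> BS.ByteString
shortId = convertToBase Base16 . BS.take 2 . convert . hashWith SHA256

-- build the relation bottom up by hash consing: each tree vertex embeds the
-- short id of its child, and is itself paired with its own short id
relation :: [(BS.ByteString, BS.ByteString)]
relation = map (\v -> (v, shortId v)) [blob, tree1, tree2, tree3]
  where
    blob  = "blob:foo"
    tree1 = "tree:((foo . " <> shortId blob  <> "))"
    tree2 = "tree:((foo . " <> shortId tree1 <> "))"
    tree3 = "tree:((foo . " <> shortId tree2 <> "))"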

Pointer Memory Model

At a point in time, a pointer memory model can also be represented by such a graph. In this case vertices are merely labeled by strings; specifically, each vertex is the memory contents at a pointer location (assuming one can infer the lengths of pointer referents either by having static type information, by parsing the target location, or by simply assuming it continues until the end of the segment). The edges are intrinsic to the vertices in this representation, and the labels are not unique.

Since the assumptions here are weaker, we can interpret all Merkle graphs as memory graphs, where all referents are immutable and hashes are pointers; this will be elaborated below.

Merklizing Cyclic Graphs With Algebraic Graphs

Since content addressed data is limited to directed acyclic graphs, in order to represent cyclic graphs with extrinsic edges in a Merklizable form, we need some sort of symbolic indirection, and for this we will assume an algebraic representation due to Andrey Mokhov.

This representation has no inherent partiality (unlike the traditional <V, E> representation, there is no way for the edges and vertices to go out of sync). It defines two kinds of terms, empty graphs and vertex singletons, and two binary operations called overlay and connect (though overlay can be thought of as a special case of connect).

The overlay of two graphs, A and B, denoted A + B, is the non disjoint graph union <V_A ∪ V_B, E_A ∪ E_B>.

Connect of A and B, denoted A * B, is <V_A ∪ V_B, E_A ∪ E_B ∪ (V_A × V_B)>, in other words the overlay of A and B with edges added from every vertex of A to every vertex of B (it is non commutative).

This representation is able to represent any graph as an expression tree, and since these expressions are trees they are trivially Merklizable.
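
The core of this representation in Haskell is a small recursive data type (this mirrors the Graph type of Mokhov's algebraic-graphs library, which provides empty, vertex, overlay and connect as its basic constructors):

-- empty graph, single vertex, overlay (+) and connect (*)
data Graph a
  = Empty
  | Vertex a
  | Overlay (Graph a) (Graph a)
  | Connect (Graph a) (Graph a)

-- the filesystem example from above as an expression tree
-- (the "dir:/foo" -> ... -> "blob:foo" chain)
example :: Graph String
example =
      Connect (Vertex "dir:/foo") (Vertex "dir:/foo/foo")
  `Overlay`
      Connect (Vertex "dir:/foo/foo") (Vertex "file:/foo/foo/foo")
  `Overlay`
      Connect (Vertex "file:/foo/foo/foo") (Vertex "blob:foo")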

It's worth noting this representation has several variants, and it can also represent labeled edges if connect is parameterized. This is also how overlay becomes a special case of connect - parameterized over a boolean, the value of the parameter signifies whether or not an edge exists. See the recent haskellx talk for more details, as the paper has not (yet?) been updated with this generalization.

Using this representation we can represent any arbitrary graph over unique bytestrings (it considers equivalent vertices as the same) either as a Merkle graph directly, if it is a DAG, or symbolically if it has cycles or we wish to avoid intrinsically encoding the edges. If distinct vertices require identical bytestring labels, this requires another layer of indirection (e.g. UUIDs).

Structured Vertices As Graphs

Returning to the serialized trees example, we can consider the parse tree of every compound vertex encoded into a string as a graph with extrinsic edges and non identifying labels.

For instance the vertex "tree:((foo . 6c9a) (bar . fcde))" can be parsed into the following edge set:

{ <"tree", "foo">
, <"tree", "bar">
, <"foo", "6c9a">
, <"bar", "fcde">
}

(though note that in this case vertices are identified by their labels for brevity's sake).
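
A small Haskell sketch of reading those extrinsic edges off an (already parsed) compound vertex; the SExp type and the pairing convention here are illustrative assumptions:

-- the parse tree of a compound vertex: atoms are strings, nodes are lists
data SExp = Atom String | List [SExp]

-- "tree:((foo . 6c9a) (bar . fcde))", already parsed
vertexParse :: SExp
vertexParse = List [ Atom "tree"
                   , List [Atom "foo", Atom "6c9a"]
                   , List [Atom "bar", Atom "fcde"] ]

-- extrinsic edges: the tag points at each entry name, and each name points
-- at the hash it is paired with
edgesOf :: SExp -> [(String, String)]
edgesOf (List (Atom tag : entries)) =
  concat [ [(tag, name), (name, ref)] | List [Atom name, Atom ref] <- entries ]
edgesOf _ = []

-- edgesOf vertexParse
--   == [("tree","foo"),("foo","6c9a"),("tree","bar"),("bar","fcde")]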

Encoding Graphs as Canonical S-Expressions

For encoding n-ary tree structures over byte strings, canonical S-expressions are an almost ideal representation.

This is because canonical S-expressions do not introduce any extraneous data types and do not depend on byte or bit endianness (well, technically ASCII decimals are big endian, but at least there is good agreement about that ;-).

Using the algebraic form, any graph over byte strings can be unambiguously serialized as a canonical S-expression. Merkle graphs can be given as simple lists of vertices (assuming a hash function will be used to recover the edge set), and simpler special cases can be encoded as canonical S-expressions directly (i.e. with the extrinsic edge information implicit in the tree structure, and with identifying or non-identifying labels; if labels are identifying then a tree with duplication can be thought of as a traversal over a DAG).
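
A minimal sketch of the canonical encoding itself (atoms as <decimal length>:<bytes>, lists parenthesized), using Haskell Strings for brevity:

-- a canonical S-expression: atoms are raw byte strings, lists are nested
data CSexp = CAtom String | CList [CSexp]

encode :: CSexp -> String
encode (CAtom s)  = show (length s) ++ ":" ++ s
encode (CList xs) = "(" ++ concatMap encode xs ++ ")"

-- e.g. the /foo/foo/foo tree from the filesystem example:
-- encode (CList [CAtom "foo", CList [CAtom "foo", CList [CAtom "foo", CAtom "foo"]]])
--   == "(3:foo(3:foo(3:foo3:foo)))"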

Graphs Over Strings Summary

  • the most general graph form has labeled vertices and extrinsic edges
  • with indirection we can encode such arbitrary graphs as trees, using the algebraic representation
  • we can collapse subtrees into compound vertices encoded as canonical s-expressions (without loss of generality), and this is invertible, so it gives a bijection between subgraphs and vertices
  • with a hash function we can identify hash image vertices with their preimages, another bijection between vertices (and, composed with the previous bijection, between vertices and graphs; computing this bijection requires inverted indices)
  • we can represent all of the above algebraically, resulting in expressions which are trees (this is only a bijection from graphs to equivalence classes of algebraic expressions, but we can define a canonical form by lexicographically ordering vertices)

By composing these forms we can translate between multiple concrete representations:

  • as an abstract graph over strings (e.g. using a graph, or relational database)
  • as monolithic blobs (as csexps or some other format, i.e. any data structure to be parsed)
    • ... either directly encoding a special case of a graph structure, or encoding an algebraic expression into a tree
  • assuming some hash function, with merklization
    • ... in a KVP, relational, or native content addressing database
    • ... directly or as algebraic expressions

Note that we do not need to restrict ourselves to one hash function, as the collision resistance property can be assumed to carry over to the sum or product type of multiple collision resistant functions (i.e. we can think of it as an injection from preimages to tuples of all hash images, of which only one component is required for inversion, or as a surjection from preimages to multiple hash images, which form an equivalence class).
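
As a sketch of the product type reading (all names here are hypothetical): a preimage maps to a tuple of digests, any one component of which is enough to identify it, so two identifiers agree if any shared component agrees.

-- digests of one preimage under two hash functions; either one suffices for
-- lookup, so identifiers are considered equal if any component matches
data MultiId = MultiId { idSha1 :: Maybe String, idSha256 :: Maybe String }

sameVertex :: MultiId -> MultiId -> Bool
sameVertex a b = agree idSha1 || agree idSha256
  where
    agree f = case (f a, f b) of
      (Just x, Just y) -> x == y
      _                -> False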

Representing Computations

So far all these representations are concerned with static data. However, we can also think of arbitrary computations as relations between graphs representing the inputs and outputs, and these relations are of course also representable as graphs.

Since the overlay operation (or non disjoint graph union) is trivially commutative, unlike a general merge operation on trees, the overlay of any two graphs is a least upper bound (in the join semilattice sense).

This is a central idea: the lattice order implies that as we compute data, we can always take our output graph and overlay it on the previous state, i.e. as inputs are added and new outputs are computed, the graph of all information grows monotonically (analogous to git fast forward merges, or Radul/Sussman propagator cells, etc).
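
A small sketch of why overlay is a join, in terms of the traditional <V, E> view (the G type here is illustrative):

import           Data.Set (Set)
import qualified Data.Set as Set

-- the <V, E> view of a graph over strings
data G = G { verts :: Set String, edges :: Set (String, String) }
  deriving (Eq, Show)

-- overlay is the non disjoint union, hence commutative and idempotent, and
-- both arguments are always subgraphs of the result: a least upper bound
overlayG :: G -> G -> G
overlayG (G v1 e1) (G v2 e2) = G (Set.union v1 v2) (Set.union e1 e2)

subgraphOf :: G -> G -> Bool
subgraphOf (G v1 e1) (G v2 e2) =
  v1 `Set.isSubsetOf` v2 && e1 `Set.isSubsetOf` e2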

This motivates a sort of distributed LISP machine interpretation of any and all computation - all intermediate steps in the evaluation can be represented concretely as lambda expressions, and the sequence of beta reductions itself can be represented as a graph. All atoms are immutable bytestrings, and the memory graph only grows in time (but may be subject to garbage collection); in contrast, typical graph reduction models substitute expression vertices with their results. Since the memory graph grows monotonically, we can think of function evaluation as enumerating a search space, and due to the commutative nature of the overlay operation, we can use this as a unified, multithreaded model for all computations on all data that exists anywhere: every disk or container image can be represented as a bytestring or indirectly by its hash, every physical memory layout or virtual address space at a point in time can be serialized, as can every string, every screen bitmap, and every encoded value, and traditional operations such as the execution of a program binary in a POSIX file namespace can be represented as LISP style function application with the program binary as the function and the environment as its operands. Obviously not all these artefacts should be concretely stored or represented like this, and indeed doing so would be prohibitively expensive; abstractly this graph is just some subset of the cartesian product of all finite strings with itself.

We can also think of representing function evaluation not as linear graphs of successive intermediate states of beta reductions, but using an edge labeled model, where functions are edges linking inputs to outputs. However, this representation seems better suited for display/UI purposes than as a data storage model (i.e. edges representing morphisms are intuitive, but to the machine, functions generally have a concrete representation given by a vertex (source code), by a graph (e.g. an abstract syntax tree or some other logical form), or by a literal mapping between two sets of vertices or subgraphs, i.e. relations modeled statically instead of given by a formula).

Hash Function Agnostic Content Addressing

To address the problem of multiple Merklized representations (different encodings/schemas, different hash functions) we can consider some sort of minimal language with a lambda calculus based syntax (representable as S-expressions), sufficient for defining such functions.

Defining Primitive Operations

A total functional language would be preferable for defining such low level operations, e.g. Simplicity or Dhall would be a good fit (incidentally both already have native merklized representations).

For example, to encode data in git's format we would need zlib, a length function over byte strings, encoding of numerical values to decimal digits, string concatenation, and bit arithmetic operations for sha1. These operations would suffice for constructing canonical s-expressions as well.
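
For instance, git's blob identifier is just sha1 over a short header (type tag, a space, the decimal length, and a NUL byte) concatenated with the contents; zlib only enters when the object is written to disk. A sketch assuming the cryptonite package:

{-# LANGUAGE OverloadedStrings #-}
import           Crypto.Hash (SHA1 (..), hashWith)
import qualified Data.ByteString as BS
import qualified Data.ByteString.Char8 as B8

-- sha1("blob <decimal length>\0<contents>"), rendered as hex
gitBlobId :: BS.ByteString -> String
gitBlobId contents = show (hashWith SHA1 (header <> contents))
  where header = "blob " <> B8.pack (show (BS.length contents)) <> "\0"

-- gitBlobId "foo\n" == "257cc5642cb1a054f08cc83f2d943e56fd3ebe99", the object
-- that shows up in the .git/objects listing further below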

Modeling Type Constructors for Merklized Data Types

Git prefixes a blob or tree header onto the strings it hashes - essentially a tagged union or product type. However, we can think of these not as prefixing strings before applying sha1, but rather as distinct functions - i.e. string concatenation curried with "blob" (together with length encoding) composed onto sha1 defines a separate hash function that works on bare strings (zlib only applies to on-disk storage, not to the hash itself). Hash values produced by this function unambiguously refer to their preimages as blobs.

Thinking of this as a distinct hash function instead of as sha1 applied to a prefixed string is in some sense like static instead of dynamic typing: the type information is relegated to the context (in this case committed to by the hash), instead of being included in the referent. In other words this concrete hash function built out of sha1 is analogous to the type constructor Blob provided by the following Haskell style variant type:

data GitObject = Blob String
               | Tree String
               | Commit String

Note that in this example Tree also takes a String as an argument, as opposed to something like [(String, Object)] or a better model of what Git trees actually encode; the purpose of the Tree type here is to allow correct interpretation of the string as encoding a tree, and to ensure that a blob with the same string value would not be incorrectly interpreted as such.
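
A sketch of the same idea as code, again assuming the cryptonite package: each constructor corresponds to its own hash function over bare strings, rather than one hash function applied to prefixed strings.

{-# LANGUAGE OverloadedStrings #-}
import           Crypto.Hash (Digest, SHA1 (..), hashWith)
import qualified Data.ByteString as BS
import qualified Data.ByteString.Char8 as B8

-- git style framing: tag, a space, the decimal length, a NUL, then the payload
framed :: BS.ByteString -> BS.ByteString -> BS.ByteString
framed tag payload =
  tag <> " " <> B8.pack (show (BS.length payload)) <> "\0" <> payload

-- sha1 composed with a different framing per type constructor; images of
-- hashBlob unambiguously refer to their preimages as blobs, images of
-- hashTree as trees
hashBlob, hashTree :: BS.ByteString -> Digest SHA1
hashBlob = hashWith SHA1 . framed "blob"
hashTree = hashWith SHA1 . framed "tree"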

Deriving Type Identifiers

Suppose we arbitrarily pick some hash function, e.g. sha256. By hashing these composite functions' definitions (i.e. their ASTs) we can then derive globally unambiguous identifiers which capture the notion of prefixing strings with the variant type tag blob and hashing them with sha1.
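
A sketch of deriving such an identifier: serialize the definition of the composite hash function (here a toy AST over named primitives, purely for illustration) as a canonical S-expression and hash that with the locally preferred function.

{-# LANGUAGE OverloadedStrings #-}
import           Crypto.Hash (SHA256 (..), hashWith)
import qualified Data.ByteString as BS
import qualified Data.ByteString.Char8 as B8

-- a toy AST for compositions of named primitives
data Expr = Prim BS.ByteString | Apply Expr [Expr]

-- canonical S-expression atom: <decimal length>:<bytes>
atom :: BS.ByteString -> BS.ByteString
atom s = B8.pack (show (BS.length s)) <> ":" <> s

encode :: Expr -> BS.ByteString
encode (Prim name)    = atom name
encode (Apply f args) = "(" <> BS.concat (map encode (f : args)) <> ")"

-- the type identifier is the locally preferred hash of the encoded definition;
-- a different local hash function yields a different, equally unambiguous id
typeId :: Expr -> String
typeId = show . hashWith SHA256 . encode

-- e.g. a constructor for "strings framed as git blobs, hashed with sha1"
gitBlob :: Expr
gitBlob = Apply (Prim "compose") [Prim "sha1", Apply (Prim "frame") [Prim "blob"]]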

Although these identifiers are unambiguous they are not unique, as they depend on our locally preferred hash functions, i.e. the recursive data types that model git's representation by hash consing arbitrary strings tagged as blobs or trees will have one hash under sha256 and another under sha1. Similarly, equivalent types defined using different primitives or ASTs will of course also produce different identifiers.

This means that we can use an open world instead of a closed world model for arbitrary types - e.g. instead of git's 4 types - blob, tree, commit & tag (or 7 if considering packfile internals) - this set is open for extension and depends on the available definitions of functions that are assumed to be collision resistant.

The images of these functions can be thought of as typed pointers: finding a preimage necessarily requires some sort of inverted lookup table of previously hashed strings, and constructing that table requires knowledge of the specific hash function acting as a type constructor.

Model Extensionality

The implication is that this model can be self hosting with very little required to bootstrap it, with arbitrary collections of these hash consed type definitions satisfying a join semilattice order - type identifiers will not conflict (though the least upper bound of a set of types may not necessarily represent equivalent types as identical).

In other words, starting with simple unordered sets of immutable strings and a bootstrap hash function, or graphs over immutable strings and no hash function, data from Merkle graphs of otherwise mutually incompatible systems can be represented in the same abstract graph, and insertion of graphs representing the type constructors for those systems allows recovery of the data encoded in their concrete representations in a form that is independent of the particulars of the hashing (analogous to git fast-export's format).

Concrete Git example

Consider the above filesystem example as it would be represented by git on disk (ignoring packfiles, only .git/objects), and suppose this directory tree structure is represented as a graph with non unique labels and extrinsic edges.

git init
mkdir -p foo/foo
echo foo > foo/foo/foo
git add foo && git commit -m foo

The Merklized data on disk can be inspected:

$ tree --prune .git/objects
.git/objects/
├── 20
│   ├── 137b580832c31eaf18e4ec3cebad13c622a486
│   └── 5f6b799e7d5c2524468ca006a0131aa57ecce7
├── 25
│   └── 7cc5642cb1a054f08cc83f2d943e56fd3ebe99
├── 68
│   └── 853dd9adce76c3581ffa80f158f3438cd35df9
└── d8
    └── f9470a64af05df287f1a7d340b8ba91ecacbd2
$ pigz -d < .git/objects/25/7cc5642cb1a054f08cc83f2d943e56fd3ebe99 
blob 4foo

In algebraic graph form (partial listing, where || denotes concatenation and zlib(...) denotes zlib compressed strings):

G_git = ( "path:.git", "path:.git/objects" * "path:.git/objects/25",
        + "path:.git/objects/25" * ".git/objects/25/7cc5642cb1a054f08cc83f2d943e56fd3ebe99"
        + ".git/objects/25/7cc5642cb1a054f08cc83f2d943e56fd3ebe99" * "blob:" || zlib("blob 4foo")
        + ...
        )

or in the more traditional notation:

V = { "path:.git", "path:.git/objects"
    , "path:.git/objects/25"
    , ".git/objects/25/7cc5642cb1a054f08cc83f2d943e56fd3ebe99"
    , "blob:" || zlib("blob 4foo")
    , ...
    }
    
E = { <"path:.git", "path:.git/objects">
    , <"path:.git/objects", "path:.git/objects/25">
    , <"path:.git/objects/25", ".git/objects/25/7cc5642cb1a054f08cc83f2d943e56fd3ebe99">
    , <"path:.git/objects/25/7cc5642cb1a054f08cc83f2d943e56fd3ebe99", "blob:" || zlib("blob 4foo")>
    , ...
    }

The entire repository contents can also be exported in a non Merklized form:

$ git fast-export master
blob
mark :1
data 4
foo

reset refs/heads/master
commit refs/heads/master
mark :2
author User <user@example.com> 1542720897 +0000
committer User <user@example.com> 1542720897 +0000
data 4
foo
M 100644 :1 foo/foo/foo

And ignoring the authorship and permissions data, this content can be represented using the graph expression:

G_content = ( "path:/foo" * "path:/foo/foo"
            + "path:/foo/foo" * "path:/foo/foo/foo"
            + "path:/foo/foo/foo" * "blob:foo"
            )

or more explicitly:

V = { "path:/foo"
    , "path:/foo/foo"
    , "path:/foo/foo/foo"
    , "blob:foo"
    }
E = { <"path:/foo", "path:/foo/foo">
    , <"path:/foo/foo", "path:/foo/foo/foo">
    , <"path:/foo/foo/foo", "blob:foo">
    }

These forms are interchangeable given knowledge of Git's internal encoding, i.e. we can start with the .git/objects directory as a graph, map it to a parsed graph of the same data by decompressing the leaves and parsing the structured types, resolve the inverted sha1 relation into edges, and then recover the sha1 agnostic representation of the same contents (the non unique label, extrinsic edge representation, as given e.g. by the fast-export format), and subsequently map that to some other merklized representation. All of these mappings are invertible, and the definitions of these mappings are themselves just graphs over strings, so the inputs and the outputs, as well as the sequence of conversions and their definitions, can all reside in a single combined graph, all of which may be content addressed or conveyed independently of any hash function depending on the circumstances.

This graph is roughly G_git + G_content, further overlaid with the definitions of the encoding functions and graphs representing parse trees of the encoded tree and commit object.

Idempotent n-Way Synchronization Between Content Addressing Systems

Suppose you start with e.g. PerKeep, create a new permanode, mount it via FUSE, run git init there, and then start a sync loop - for every object in PerKeep, run git hash-object to insert it into the objects database, and for every object in the git database, insert it into PerKeep via the REST API, and implicitly via the FUSE mount. This process will never converge - since the two schemas are mutually unintelligible, as contents are hashed it continually generates seemingly random data to be inserted into either store.

Using this generalized model, this bidirectional synchronization can be made idempotent, converging on the least upper bound: the graph where both schemas are represented as merklized functions, all primary information (user inputs) is merklized in both systems natively, and the equivalence relation between this graph and the two concrete merklized formats is represented as a first class object, stored in both PerKeep and git, such that in the case of catastrophic loss of either the .git directory or the PerKeep blob store, the other system can be used to make a full recovery (including identical permanodes).

Implications of Merklized Computational Model

This model can effectively blur the boundaries between the existing content addressing systems, but now the rabbit hole really starts getting interesting, because we can also think of the union of all git repositories in existence, the union of all parse trees over those, every compiler as a function in this model, etc. Hashes are obviously references to either strings or entire subgraphs, but more generally any deterministic program with known inputs is also just a reference to the graph of its output. In other words, this approach naturally suggests seeing all possible computations as a search space that can be addressed.

For high level programming, the Radul-Sussman propagator model or miniKanren style relational programming is a very natural fit for working against a declarative graph, but other compelling languages that come to mind are Hazel, Eve, Shen, and Unison, to name just a few examples well suited for the domain.

In specifying locations in this graph, we can also consider subsets of programming languages, and apply a more relational model to their definition. First off, formulae defining functions reside in a single unified AST of all code, which can be interacted with as a live programming environment. In this environment equivalence of functions, or reductions of functions (or rather multifunctions), can be tracked - for example, suppose you have a file that is syntactically parseable as two languages, let's say Java and C; there is likely some set of variable assignments for which they are identical, and due to semantic differences, at some point a divergence - it's reasonable to represent this kind of syntactic ambiguity in a nondestructive way.

Literate programming becomes much more natural, as syntactic constructs and documentation can be directly interleaved as merklized linked data. In particular, notebook environments like Jupyter also become a lot saner (intermediate kernel states and sessions can be merklized instead of relying on mutable state and the order of cell evaluation, and the notebook format itself is also a graph).

The concept of an app also becomes an unnecessarily oppressive abstraction; instead we can deal directly with functions over graphs as the unit of abstraction, utilizing a lens metaphor instead of a box metaphor (e.g. think of how needlessly constraining the notion of putting your data "into" Excel is... if we ignore the historical accident of tree based filesystems, the only remaining justification for this approach is to enforce vendor lock in).

The RDF graph of the semantic web as well as Ted Nelson's Xanadu model both become a lot more robust than the way they are currently conceived, since they would no longer rely on URLs and the DNS model, but could instead use content addressing URNs, and relate content in networks which are currently not considered a part of "the web".

TODO - these examples need to be integrated into the main doc, originally they were sent out in a later email as elaborations

identifying labels means the pair <"foo", ...> represents an edge from the vertex foo, but this is a problem if you have multiple vertices labeled foo, so basically it means your label set is isomorphic to the vertex set. if this is the case, then you can identify the vertices with the labels and just pretend foo is the vertex

merkle graphs let you do that because the label also includes the out neighborhood, so the string label uniquely identifies the vertex in relation to all its neighbours

to represent without labels, i'd have to say that vertex 1 points at vertex 2, and vertex 3 at vertex 4, etc, and then say vertex 1 is labeled foo, vertex 2 is edd0..., etc. the filesystem example demonstrates these variants: basically there are 4 strings with value foo (2 directory names, a file name, and the file contents), so we can't just use foo as the label if we want to distinguish between them. we can (sort of) do that just for the file contents, but if we use extrinsic labels, we can have a label set { <x, "foo"> | x \in V } and V = { 1, 2, 3, 4 }, and then the edges representing this hierarchy would be something like { <1,2>, <2,3>, <3,4> }. file contents would be represented by vertices with an out degree of 0, filenames would be vertices with exactly one out edge, to a file contents vertex, and directories are everything else, with their contents represented by those vertices' out neighbourhoods.

but if we want to model this exact same graph with intrinsic labels, we can add type identifiers, e.g. blob: for the contents and path: for the file/directory names, and we can distinguish between the 3 path entries named foo by giving an FQ path, removing ambiguity:

V = { "path:/foo", "path:/foo/foo", "path:/foo/foo/foo", "blob:foo" }
E = { <"path:/foo", "path:/foo/foo">, <"path:/foo/foo", "path:/foo/foo/foo">, <"path:/foo/foo/foo", "blob:foo"> }

if we wanted to model blob:foo as just foo we could still do that if we kept the edges intrinsic to the labels, so then the vertex representing the file named foo would have to include a hash of the bare string foo (with the usual caveats of a 2nd preimage attack, i.e. how do you model a file whose contents are exactly the label of a directory node on the same graph).

to represent these variants as csexps, there are several options:

if we number the vertices, then we can provide a list of pairs to map from vertex IDs to strings, so it'd be something like (1:13:foo1:23:foo1:33:foo1:43:foo) (assuming atoms representing vertex IDs are encoded as decimal numbers), or maybe ((1:13:foo)(1:23:foo) ... ), which would represent the s-exp ((1 . foo) (2 . foo) ... )

and then the graph could be just 1 * 2 + 2 * 3 + 3 * 4 in algebraic form, and an s expression representing that could be (+ (* 1 2) (* 2 3) (* 3 4)).

in this special case, we can also just model the entire thing as (3:foo(3:foo(3:foo3:foo))), since it's a tree, and we can also use the algebraic form to represent the edges if we use fq paths: (1:+(1:*9:path:/foo13:path:/foo/foo)(1:*13:path:/foo/foo17:path:/foo/foo/foo)(1:*17:path:/foo/foo/foo8:blob:foo)).

we can also represent this by merklizing. assuming two hashes, one for blobs, one for paths, let's say we encode files as just a csexp pair, and directories as a csexp list with the label as the first element. denoting H_blob("foo") as XXX, the vertex path:/foo/foo/foo would be represented as (3:foo3:XXX). denoting H_path("(3:foo3:XXX)") as YYY, we can then model path:/foo/foo as (3:foo3:YYY). let's say that hashes to ZZZ, then the root is just (3:foo3:ZZZ). the entire graph, with implicitly defined hash functions, is given as the set { "foo", "(3:foo3:XXX)", "(3:foo3:YYY)", "(3:foo3:ZZZ)" }. that can be encoded as a csexp trivially: (3:foo12:(3:foo3:XXX)12:(3:foo3:YYY)12:(3:foo3:ZZZ)).

to recover the graph structure we need explicit definitions of H_blob and H_path, we need to compute the images of all elements in this set, and we need to know to parse the compound nodes as csexps and look up their constituent hashes in the pre-image tables

i think that's all the ways you can encode that example as csexps, assuming cs-exp parsing and the hash functions are primitive operations
