pervognsen/algebraic_data_type_numeral_encoding.md

## algebraic_data_type_numeral_encoding.md

      
    Raw
  

              algebraic_data_type_numeral_encoding.md
            
          
    (Originally written as a reply to an HN submission of this article: https://www.cs.virginia.edu/~lat7h/blog/posts/434.html)
There's a simple recipe for arithmetically encoding recursive algebraic data types (in the functional programming sense) which is related to this.
What you might have seen is Goedel numbering where a finite sequence of natural numbers a_0, a_1, ..., a_n (where n isn't fixed but can vary per sequence) is mapped bijectively onto p_0^a_0 a_1^p_1 ... a_n^p_n where p_0, p_1, ... is an enumeration of the primes.
However, if you want to represent trees instead of sequences, you have a better, simpler option. The key is the existence of a bijective pairing function between N^2 and N, which you can write as <m, n> for m, n in N.
You have a lot of choices for how to construct the pairing function. But a curious fact is that there is essentially one polynomial pairing function and it's the one you saw in class when you learned that the rationals are countable: https://en.wikipedia.org/wiki/Fueter%E2%80%93P%C3%B3lya_theorem
From this perspective it should be clear that bit interleaving provides one such pairing function.
Here's a larger example of how you can use this for encoding algebraic data types. The data type in this article is an unlabeled binary tree:
data Tree = Leaf | Node Tree Tree

I will use square brackets to denote the encoding. Then
[Leaf]     = 0
[Node a b] = 1 + 2*<[a], [b]>

So the tag for the sum type is the residue modulo 2, and if the tag is 1 you can divide by 2 (shift right by one bit) to extract the encoded pair of subtrees. If the sum type had three terms you could use 3 instead of 2:
data Tree = WhiteLeaf | BlackLeaf | Node Tree Tree

[WhiteLeaf] = 0
[BlackLeaf] = 1
[Node a b]  = 2 + 3*<[a], [b]>

This also works with labeled trees:
data Tree = Leaf N | Node Tree Tree

[Leaf n]   = 0 + 2*n
[Node a b] = 1 + 2*<[a], [b]>

If the label type wasn't a natural number, you'd just recursively encode it.
In the above I treated the <m, n> pairing as a black box that can be implemented in different ways (e.g. Cantor pairing, bit interleaving). You can abstract out the encoding of sums as well. You're looking for a bijection between N and Inl N | Inr N and I made the particular choice
[Inl n] = 0 + 2*n
[Inr n] = 1 + 2*n

But you have other options as well, though this is probably the simplest. And once you have an encoding for sums and products you can apply them recursively to encode any recursive algebraic data type.
Note that the sum and pair encoding functions are bijective on the natural numbers, but the corresponding encodings for algebraic data types aren't necessarily bijective, only injective. For the unlabeled binary tree example, the number 2 has a tag of 0 (leaf) but isn't the encoding of any tree. However, the encoding for labeled binary trees is bijective if the label type is bijectively encoded. The problem with this kind of recursively constructed encoding that terminates in finite types is that you obviously cannot have a bijection between a finite set and an infinite set like N.
To tie it back into the article, he uses this encoding:
data Tree = Leaf | Node Tree Tree

[Leaf]     = 0
[Node a b] = 1 + <[a], [b]>

You can extend this to my example with two kinds of leaves:
data Tree = WhiteLeaf | BlackLeaf | Node Tree Tree

[WhiteLeaf] = 0
[BlackLeaf] = 1
[Node a b]  = 2 + <[a], [b]>

Note that the tag ordering is all-important here. This only works when the tag ordering is "tail recursive". If you used 0 as the tag for Node, then there'd be an ambiguity between leaves and proper subtrees; you wouldn't know if 1 corresponded to a leaf or 0 + encoded_subtree_pair.
So, this kind of prefix-sum encoding is bijective but can only accommodate one unbounded term, which must be assigned the final tag. No tag ordering can work for this:
data Tree = Leaf | WhiteNode Tree Tree | BlackNode Tree Tree

[Leaf] = 0
[WhiteNode a b] = 1 + <[a], [b]>
[BlackNode a b] = 2 + <[a], [b]> // Wrong!

I haven't thought about ordinals for a long time, but this feels related to the fact that 1 + ω = ω is not equal to ω + 1.