Skip to content

Instantly share code, notes, and snippets.

@mitchellh
Last active June 30, 2021 15:01
Show Gist options
  • Save mitchellh/35836d98385fe2c0f0846cc100492224 to your computer and use it in GitHub Desktop.
Save mitchellh/35836d98385fe2c0f0846cc100492224 to your computer and use it in GitHub Desktop.
Radix tree and validation in TLA+
This module contains operations for working with radix trees. A radix tree
is a data structure for efficient storage and lookup of values that often
share prefixes, typically used with strings.
A common question when I show this to people is: how do I add to the tree?
delete? update? For these, grab the Range of the tree, use set logic to
add/remove any elements, and construct a new tree with RadixTree.
For educational purposes, I've heavily commented all the operations. I
recommend using the constant expression evaluator to try the building blocks
to learn how things work if you're confused. That's how I learned.
----------------------------- MODULE RadixTrees -----------------------------
LOCAL INSTANCE FiniteSets
LOCAL INSTANCE Sequences
LOCAL INSTANCE Integers
-----------------------------------------------------------------------------
\* Helpers that aren't Radix-tree specific.
\* shortestSeq returns the shortest sequence in a set.
LOCAL shortestSeq(set) ==
CHOOSE seq \in set:
\A other \in set:
/\ Len(seq) <= Len(other)
\* Filter the sequence to only return the values that start with c.
LOCAL filterPrefix(set, c) == { seq \in set: seq[1] = c }
\* Strip the prefix from each element in set. This assumes that each
\* element in set already starts with prefix. Empty values will not
\* be returned.
LOCAL stripPrefix(set, prefix) ==
{ SubSeq(seq, Len(prefix)+1, Len(seq)): seq \in set \ {prefix} }
\* Returns the set of first characters of a set of char sequences.
LOCAL firstChars(set) == { seq[1]: seq \in set }
\* Find the longest shared prefix of a set of sequences. Sequences can
\* be different lengths, but must have comparable types.
\* i.e. longestPrefix({1,2},{1,2,3},{1,2,5}) == {1,2}
LOCAL longestPrefix(set) ==
\* Every item must at least have a common first character otherwise
\* the longest prefix is surely empty
IF \A x \in set, y \in set:
/\ Len(x) >= 1
/\ Len(y) >= 1
/\ x[1] = y[1]
THEN
LET
seq == shortestSeq(set)
end == CHOOSE end \in 1..Len(seq):
/\ \A i \in 1..end:
\A x \in set, y \in set: x[i] = y[i]
/\ IF Len(seq) <= end
THEN TRUE
ELSE \E x \in set, y \in set: x[end+1] /= y[end+1]
IN SubSeq(seq, 1, end)
ELSE <<>>
-----------------------------------------------------------------------------
\* Radix tree helpers
RECURSIVE range(_, _)
LOCAL range(T, prefix) ==
LET
current == IF T.Value THEN {prefix \o T.Prefix} ELSE {}
\* current value of node (if exists)
children == UNION {
range(T.Edges[edge], prefix \o T.Prefix):
edge \in DOMAIN T.Edges
}
\* child values for each edge. this creates a set of sets
\* so we call union to flatten it.
IN current \cup children
-----------------------------------------------------------------------------
\* Returns the constructed radix tree for the set of keys Keys.
RECURSIVE RadixTree(_)
RadixTree(Keys) ==
IF Keys = {} THEN {} \* base case, no keys empty tree
ELSE LET
prefix == longestPrefix(Keys)
\* longest shared prefix
keys == stripPrefix(Keys, prefix)
\* keys for children, prefix stripped
edgeLabels == firstChars(stripPrefix(Keys, prefix))
\* labels for each edge (single characters)
IN [
Prefix |-> prefix,
Value |-> prefix \in Keys, \* true if this represents an inserted value
Edges |-> [L \in edgeLabels |-> RadixTree(filterPrefix(keys, L))]
]
\* Range returns all of the values that are in the radix tree T.
Range(T) == range(T, <<>>)
=============================================================================
\* Modification History
\* Last modified Tue Jun 29 09:07:10 PDT 2021 by mitchellh
\* Created Mon Jun 28 08:08:13 PDT 2021 by mitchellh
------------------------ MODULE RadixTreesValidation ------------------------
EXTENDS FiniteSets, Integers, RadixTrees
\* Set of characters to use for the alphabet of generated strings.
CONSTANT Alphabet
\* Length of input strings generated
CONSTANT MinLength, MaxLength
ASSUME
/\ {MinLength, MaxLength} \subseteq Nat
/\ MinLength <= MaxLength
\* Number of unique elements to construct the radix tree with. This
\* is a set of numbers so you can test with inputs of multiple sizes.
CONSTANT ElementCounts
ASSUME ElementCounts \subseteq Nat
\* Inputs is the set of input strings valid for the tree.
Inputs == UNION { [1..n -> Alphabet]: n \in MinLength..MaxLength }
\* InputSets is the full set of possible inputs we can send to the radix tree.
InputSets == { T \in SUBSET Inputs: Cardinality(T) \in ElementCounts }
-----------------------------------------------------------------------------
\* The range of a radix tree should be the set of its inputs.
RangeIsInput ==
\A input \in InputSets:
Range(RadixTree(input)) = input
\* The expression that should be checked for validity in the model.
Valid == RangeIsInput
=============================================================================
\* Modification History
\* Last modified Tue Jun 29 09:06:56 PDT 2021 by mitchellh
\* Created Tue Jun 29 08:02:38 PDT 2021 by mitchellh
@hwayne
Copy link

hwayne commented Jun 29, 2021

Pretty solid modeling! Couple comments on my first quick skim:


You can write \A x, y \in set: instead of \A x \in set, y \in set.


On line #56, you have a statement of form IF x THEN TRUE ELSE P. The truth table of this is

x P o
T T T
T F T
F T T
F F F

So it's just x \/ P.


I'd probably write longestPrefix as something like

longestPrefix(set) ==
  LET
    seq == ShortestSeq(set)
    preflens == {0} \cup {
      i \in 1..Len(seq):
        \A s1, s2 in set: SubSeq(s1, 1, i) = SubSeq(s2, 1, i) } 
  IN SubSeq(seq, 1, i)

(SubSeq(seq, 1, 0) is the empty sequence in all cases)


range and RadixTree are the core parts of the algorithm but I haven't studied them yet.

@mitchellh
Copy link
Author

Thanks @hwayne! I've made a couple of those changes (I left longestPrefix as is for now). I had x \/ P prior but I think due to some iteration/refactoring it came out into that weird IF expr. Thanks for catching that again, cleaned that up.

Note that I originally tried to write this similar to your LinkedList ex in Ch 8 of your book where I created the set of all [including invalid] trees that are roughly shaped like radix trees, then use filtering to get the one I want. Even for small alphabets/lengths this created too large of sets, so I had to fall back to a more iterative/programmery approach seen here. Not sure if I did something wrong.

@hwayne
Copy link

hwayne commented Jun 29, 2021

The iterative programmatic approach is perfectly fine; while it's nice to write data structures as formal definitions, it quickly leads to exponential blowup, so we have to compromise in the face of reality :)

@lemmy
Copy link

lemmy commented Jun 29, 2021

@mitchellh TLC lets you override operators with code should the pure TLA+ definition become a bottleneck (a profiler helps find bottlenecks). This way, we can write specs that favor readability over efficiency while helping tools. Almost all of the standard modules and some of the community modules have code overrides.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment