-
-
Save mitchellh/35836d98385fe2c0f0846cc100492224 to your computer and use it in GitHub Desktop.
This module contains operations for working with radix trees. A radix tree | |
is a data structure for efficient storage and lookup of values that often | |
share prefixes, typically used with strings. | |
A common question when I show this to people is: how do I add to the tree? | |
delete? update? For these, grab the Range of the tree, use set logic to | |
add/remove any elements, and construct a new tree with RadixTree. | |
For educational purposes, I've heavily commented all the operations. I | |
recommend using the constant expression evaluator to try the building blocks | |
to learn how things work if you're confused. That's how I learned. | |
----------------------------- MODULE RadixTrees ----------------------------- | |
LOCAL INSTANCE FiniteSets | |
LOCAL INSTANCE Sequences | |
LOCAL INSTANCE Integers | |
----------------------------------------------------------------------------- | |
\* Helpers that aren't Radix-tree specific. | |
\* shortestSeq returns the shortest sequence in a set. | |
LOCAL shortestSeq(set) == | |
CHOOSE seq \in set: | |
\A other \in set: | |
/\ Len(seq) <= Len(other) | |
\* Filter the sequence to only return the values that start with c. | |
LOCAL filterPrefix(set, c) == { seq \in set: seq[1] = c } | |
\* Strip the prefix from each element in set. This assumes that each | |
\* element in set already starts with prefix. Empty values will not | |
\* be returned. | |
LOCAL stripPrefix(set, prefix) == | |
{ SubSeq(seq, Len(prefix)+1, Len(seq)): seq \in set \ {prefix} } | |
\* Returns the set of first characters of a set of char sequences. | |
LOCAL firstChars(set) == { seq[1]: seq \in set } | |
\* Find the longest shared prefix of a set of sequences. Sequences can | |
\* be different lengths, but must have comparable types. | |
\* i.e. longestPrefix({1,2},{1,2,3},{1,2,5}) == {1,2} | |
LOCAL longestPrefix(set) == | |
\* Every item must at least have a common first character otherwise | |
\* the longest prefix is surely empty | |
IF \A x \in set, y \in set: | |
/\ Len(x) >= 1 | |
/\ Len(y) >= 1 | |
/\ x[1] = y[1] | |
THEN | |
LET | |
seq == shortestSeq(set) | |
end == CHOOSE end \in 1..Len(seq): | |
/\ \A i \in 1..end: | |
\A x \in set, y \in set: x[i] = y[i] | |
/\ IF Len(seq) <= end | |
THEN TRUE | |
ELSE \E x \in set, y \in set: x[end+1] /= y[end+1] | |
IN SubSeq(seq, 1, end) | |
ELSE <<>> | |
----------------------------------------------------------------------------- | |
\* Radix tree helpers | |
RECURSIVE range(_, _) | |
LOCAL range(T, prefix) == | |
LET | |
current == IF T.Value THEN {prefix \o T.Prefix} ELSE {} | |
\* current value of node (if exists) | |
children == UNION { | |
range(T.Edges[edge], prefix \o T.Prefix): | |
edge \in DOMAIN T.Edges | |
} | |
\* child values for each edge. this creates a set of sets | |
\* so we call union to flatten it. | |
IN current \cup children | |
----------------------------------------------------------------------------- | |
\* Returns the constructed radix tree for the set of keys Keys. | |
RECURSIVE RadixTree(_) | |
RadixTree(Keys) == | |
IF Keys = {} THEN {} \* base case, no keys empty tree | |
ELSE LET | |
prefix == longestPrefix(Keys) | |
\* longest shared prefix | |
keys == stripPrefix(Keys, prefix) | |
\* keys for children, prefix stripped | |
edgeLabels == firstChars(stripPrefix(Keys, prefix)) | |
\* labels for each edge (single characters) | |
IN [ | |
Prefix |-> prefix, | |
Value |-> prefix \in Keys, \* true if this represents an inserted value | |
Edges |-> [L \in edgeLabels |-> RadixTree(filterPrefix(keys, L))] | |
] | |
\* Range returns all of the values that are in the radix tree T. | |
Range(T) == range(T, <<>>) | |
============================================================================= | |
\* Modification History | |
\* Last modified Tue Jun 29 09:07:10 PDT 2021 by mitchellh | |
\* Created Mon Jun 28 08:08:13 PDT 2021 by mitchellh |
------------------------ MODULE RadixTreesValidation ------------------------ | |
EXTENDS FiniteSets, Integers, RadixTrees | |
\* Set of characters to use for the alphabet of generated strings. | |
CONSTANT Alphabet | |
\* Length of input strings generated | |
CONSTANT MinLength, MaxLength | |
ASSUME | |
/\ {MinLength, MaxLength} \subseteq Nat | |
/\ MinLength <= MaxLength | |
\* Number of unique elements to construct the radix tree with. This | |
\* is a set of numbers so you can test with inputs of multiple sizes. | |
CONSTANT ElementCounts | |
ASSUME ElementCounts \subseteq Nat | |
\* Inputs is the set of input strings valid for the tree. | |
Inputs == UNION { [1..n -> Alphabet]: n \in MinLength..MaxLength } | |
\* InputSets is the full set of possible inputs we can send to the radix tree. | |
InputSets == { T \in SUBSET Inputs: Cardinality(T) \in ElementCounts } | |
----------------------------------------------------------------------------- | |
\* The range of a radix tree should be the set of its inputs. | |
RangeIsInput == | |
\A input \in InputSets: | |
Range(RadixTree(input)) = input | |
\* The expression that should be checked for validity in the model. | |
Valid == RangeIsInput | |
============================================================================= | |
\* Modification History | |
\* Last modified Tue Jun 29 09:06:56 PDT 2021 by mitchellh | |
\* Created Tue Jun 29 08:02:38 PDT 2021 by mitchellh |
Thanks @hwayne! I've made a couple of those changes (I left longestPrefix
as is for now). I had x \/ P
prior but I think due to some iteration/refactoring it came out into that weird IF
expr. Thanks for catching that again, cleaned that up.
Note that I originally tried to write this similar to your LinkedList ex in Ch 8 of your book where I created the set of all [including invalid] trees that are roughly shaped like radix trees, then use filtering to get the one I want. Even for small alphabets/lengths this created too large of sets, so I had to fall back to a more iterative/programmery approach seen here. Not sure if I did something wrong.
The iterative programmatic approach is perfectly fine; while it's nice to write data structures as formal definitions, it quickly leads to exponential blowup, so we have to compromise in the face of reality :)
@mitchellh TLC lets you override operators with code should the pure TLA+ definition become a bottleneck (a profiler helps find bottlenecks). This way, we can write specs that favor readability over efficiency while helping tools. Almost all of the standard modules and some of the community modules have code overrides.
Pretty solid modeling! Couple comments on my first quick skim:
You can write
\A x, y \in set:
instead of\A x \in set, y \in set
.On line #56, you have a statement of form
IF x THEN TRUE ELSE P
. The truth table of this isSo it's just
x \/ P
.I'd probably write
longestPrefix
as something like(
SubSeq(seq, 1, 0)
is the empty sequence in all cases)range
andRadixTree
are the core parts of the algorithm but I haven't studied them yet.