Last active
June 30, 2021 15:01
-
-
Save mitchellh/35836d98385fe2c0f0846cc100492224 to your computer and use it in GitHub Desktop.
Radix tree and validation in TLA+
This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
This module contains operations for working with radix trees. A radix tree | |
is a data structure for efficient storage and lookup of values that often | |
share prefixes, typically used with strings. | |
A common question when I show this to people is: how do I add to the tree? | |
delete? update? For these, grab the Range of the tree, use set logic to | |
add/remove any elements, and construct a new tree with RadixTree. | |
For educational purposes, I've heavily commented all the operations. I | |
recommend using the constant expression evaluator to try the building blocks | |
to learn how things work if you're confused. That's how I learned. | |
----------------------------- MODULE RadixTrees ----------------------------- | |
LOCAL INSTANCE FiniteSets | |
LOCAL INSTANCE Sequences | |
LOCAL INSTANCE Integers | |
----------------------------------------------------------------------------- | |
\* Helpers that aren't Radix-tree specific. | |
\* shortestSeq returns the shortest sequence in a set. | |
LOCAL shortestSeq(set) == | |
CHOOSE seq \in set: | |
\A other \in set: | |
/\ Len(seq) <= Len(other) | |
\* Filter the sequence to only return the values that start with c. | |
LOCAL filterPrefix(set, c) == { seq \in set: seq[1] = c } | |
\* Strip the prefix from each element in set. This assumes that each | |
\* element in set already starts with prefix. Empty values will not | |
\* be returned. | |
LOCAL stripPrefix(set, prefix) == | |
{ SubSeq(seq, Len(prefix)+1, Len(seq)): seq \in set \ {prefix} } | |
\* Returns the set of first characters of a set of char sequences. | |
LOCAL firstChars(set) == { seq[1]: seq \in set } | |
\* Find the longest shared prefix of a set of sequences. Sequences can | |
\* be different lengths, but must have comparable types. | |
\* i.e. longestPrefix({1,2},{1,2,3},{1,2,5}) == {1,2} | |
LOCAL longestPrefix(set) == | |
\* Every item must at least have a common first character otherwise | |
\* the longest prefix is surely empty | |
IF \A x \in set, y \in set: | |
/\ Len(x) >= 1 | |
/\ Len(y) >= 1 | |
/\ x[1] = y[1] | |
THEN | |
LET | |
seq == shortestSeq(set) | |
end == CHOOSE end \in 1..Len(seq): | |
/\ \A i \in 1..end: | |
\A x \in set, y \in set: x[i] = y[i] | |
/\ IF Len(seq) <= end | |
THEN TRUE | |
ELSE \E x \in set, y \in set: x[end+1] /= y[end+1] | |
IN SubSeq(seq, 1, end) | |
ELSE <<>> | |
----------------------------------------------------------------------------- | |
\* Radix tree helpers | |
RECURSIVE range(_, _) | |
LOCAL range(T, prefix) == | |
LET | |
current == IF T.Value THEN {prefix \o T.Prefix} ELSE {} | |
\* current value of node (if exists) | |
children == UNION { | |
range(T.Edges[edge], prefix \o T.Prefix): | |
edge \in DOMAIN T.Edges | |
} | |
\* child values for each edge. this creates a set of sets | |
\* so we call union to flatten it. | |
IN current \cup children | |
----------------------------------------------------------------------------- | |
\* Returns the constructed radix tree for the set of keys Keys. | |
RECURSIVE RadixTree(_) | |
RadixTree(Keys) == | |
IF Keys = {} THEN {} \* base case, no keys empty tree | |
ELSE LET | |
prefix == longestPrefix(Keys) | |
\* longest shared prefix | |
keys == stripPrefix(Keys, prefix) | |
\* keys for children, prefix stripped | |
edgeLabels == firstChars(stripPrefix(Keys, prefix)) | |
\* labels for each edge (single characters) | |
IN [ | |
Prefix |-> prefix, | |
Value |-> prefix \in Keys, \* true if this represents an inserted value | |
Edges |-> [L \in edgeLabels |-> RadixTree(filterPrefix(keys, L))] | |
] | |
\* Range returns all of the values that are in the radix tree T. | |
Range(T) == range(T, <<>>) | |
============================================================================= | |
\* Modification History | |
\* Last modified Tue Jun 29 09:07:10 PDT 2021 by mitchellh | |
\* Created Mon Jun 28 08:08:13 PDT 2021 by mitchellh |
This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
------------------------ MODULE RadixTreesValidation ------------------------ | |
EXTENDS FiniteSets, Integers, RadixTrees | |
\* Set of characters to use for the alphabet of generated strings. | |
CONSTANT Alphabet | |
\* Length of input strings generated | |
CONSTANT MinLength, MaxLength | |
ASSUME | |
/\ {MinLength, MaxLength} \subseteq Nat | |
/\ MinLength <= MaxLength | |
\* Number of unique elements to construct the radix tree with. This | |
\* is a set of numbers so you can test with inputs of multiple sizes. | |
CONSTANT ElementCounts | |
ASSUME ElementCounts \subseteq Nat | |
\* Inputs is the set of input strings valid for the tree. | |
Inputs == UNION { [1..n -> Alphabet]: n \in MinLength..MaxLength } | |
\* InputSets is the full set of possible inputs we can send to the radix tree. | |
InputSets == { T \in SUBSET Inputs: Cardinality(T) \in ElementCounts } | |
----------------------------------------------------------------------------- | |
\* The range of a radix tree should be the set of its inputs. | |
RangeIsInput == | |
\A input \in InputSets: | |
Range(RadixTree(input)) = input | |
\* The expression that should be checked for validity in the model. | |
Valid == RangeIsInput | |
============================================================================= | |
\* Modification History | |
\* Last modified Tue Jun 29 09:06:56 PDT 2021 by mitchellh | |
\* Created Tue Jun 29 08:02:38 PDT 2021 by mitchellh |
@mitchellh TLC lets you override operators with code should the pure TLA+ definition become a bottleneck (a profiler helps find bottlenecks). This way, we can write specs that favor readability over efficiency while helping tools. Almost all of the standard modules and some of the community modules have code overrides.
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
The iterative programmatic approach is perfectly fine; while it's nice to write data structures as formal definitions, it quickly leads to exponential blowup, so we have to compromise in the face of reality :)