Skip to content

Instantly share code, notes, and snippets.

@mitchellh
Last active June 30, 2021 15:01
Show Gist options
  • Save mitchellh/35836d98385fe2c0f0846cc100492224 to your computer and use it in GitHub Desktop.
Save mitchellh/35836d98385fe2c0f0846cc100492224 to your computer and use it in GitHub Desktop.
Radix tree and validation in TLA+
This module contains operations for working with radix trees. A radix tree
is a data structure for efficient storage and lookup of values that often
share prefixes, typically used with strings.
A common question when I show this to people is: how do I add to the tree?
delete? update? For these, grab the Range of the tree, use set logic to
add/remove any elements, and construct a new tree with RadixTree.
For educational purposes, I've heavily commented all the operations. I
recommend using the constant expression evaluator to try the building blocks
to learn how things work if you're confused. That's how I learned.
----------------------------- MODULE RadixTrees -----------------------------
LOCAL INSTANCE FiniteSets
LOCAL INSTANCE Sequences
LOCAL INSTANCE Integers
-----------------------------------------------------------------------------
\* Helpers that aren't Radix-tree specific.
\* shortestSeq returns the shortest sequence in a set.
LOCAL shortestSeq(set) ==
CHOOSE seq \in set:
\A other \in set:
/\ Len(seq) <= Len(other)
\* Filter the sequence to only return the values that start with c.
LOCAL filterPrefix(set, c) == { seq \in set: seq[1] = c }
\* Strip the prefix from each element in set. This assumes that each
\* element in set already starts with prefix. Empty values will not
\* be returned.
LOCAL stripPrefix(set, prefix) ==
{ SubSeq(seq, Len(prefix)+1, Len(seq)): seq \in set \ {prefix} }
\* Returns the set of first characters of a set of char sequences.
LOCAL firstChars(set) == { seq[1]: seq \in set }
\* Find the longest shared prefix of a set of sequences. Sequences can
\* be different lengths, but must have comparable types.
\* i.e. longestPrefix({1,2},{1,2,3},{1,2,5}) == {1,2}
LOCAL longestPrefix(set) ==
\* Every item must at least have a common first character otherwise
\* the longest prefix is surely empty
IF \A x \in set, y \in set:
/\ Len(x) >= 1
/\ Len(y) >= 1
/\ x[1] = y[1]
THEN
LET
seq == shortestSeq(set)
end == CHOOSE end \in 1..Len(seq):
/\ \A i \in 1..end:
\A x \in set, y \in set: x[i] = y[i]
/\ IF Len(seq) <= end
THEN TRUE
ELSE \E x \in set, y \in set: x[end+1] /= y[end+1]
IN SubSeq(seq, 1, end)
ELSE <<>>
-----------------------------------------------------------------------------
\* Radix tree helpers
RECURSIVE range(_, _)
LOCAL range(T, prefix) ==
LET
current == IF T.Value THEN {prefix \o T.Prefix} ELSE {}
\* current value of node (if exists)
children == UNION {
range(T.Edges[edge], prefix \o T.Prefix):
edge \in DOMAIN T.Edges
}
\* child values for each edge. this creates a set of sets
\* so we call union to flatten it.
IN current \cup children
-----------------------------------------------------------------------------
\* Returns the constructed radix tree for the set of keys Keys.
RECURSIVE RadixTree(_)
RadixTree(Keys) ==
IF Keys = {} THEN {} \* base case, no keys empty tree
ELSE LET
prefix == longestPrefix(Keys)
\* longest shared prefix
keys == stripPrefix(Keys, prefix)
\* keys for children, prefix stripped
edgeLabels == firstChars(stripPrefix(Keys, prefix))
\* labels for each edge (single characters)
IN [
Prefix |-> prefix,
Value |-> prefix \in Keys, \* true if this represents an inserted value
Edges |-> [L \in edgeLabels |-> RadixTree(filterPrefix(keys, L))]
]
\* Range returns all of the values that are in the radix tree T.
Range(T) == range(T, <<>>)
=============================================================================
\* Modification History
\* Last modified Tue Jun 29 09:07:10 PDT 2021 by mitchellh
\* Created Mon Jun 28 08:08:13 PDT 2021 by mitchellh
------------------------ MODULE RadixTreesValidation ------------------------
EXTENDS FiniteSets, Integers, RadixTrees
\* Set of characters to use for the alphabet of generated strings.
CONSTANT Alphabet
\* Length of input strings generated
CONSTANT MinLength, MaxLength
ASSUME
/\ {MinLength, MaxLength} \subseteq Nat
/\ MinLength <= MaxLength
\* Number of unique elements to construct the radix tree with. This
\* is a set of numbers so you can test with inputs of multiple sizes.
CONSTANT ElementCounts
ASSUME ElementCounts \subseteq Nat
\* Inputs is the set of input strings valid for the tree.
Inputs == UNION { [1..n -> Alphabet]: n \in MinLength..MaxLength }
\* InputSets is the full set of possible inputs we can send to the radix tree.
InputSets == { T \in SUBSET Inputs: Cardinality(T) \in ElementCounts }
-----------------------------------------------------------------------------
\* The range of a radix tree should be the set of its inputs.
RangeIsInput ==
\A input \in InputSets:
Range(RadixTree(input)) = input
\* The expression that should be checked for validity in the model.
Valid == RangeIsInput
=============================================================================
\* Modification History
\* Last modified Tue Jun 29 09:06:56 PDT 2021 by mitchellh
\* Created Tue Jun 29 08:02:38 PDT 2021 by mitchellh
@hwayne
Copy link

hwayne commented Jun 29, 2021

The iterative programmatic approach is perfectly fine; while it's nice to write data structures as formal definitions, it quickly leads to exponential blowup, so we have to compromise in the face of reality :)

@lemmy
Copy link

lemmy commented Jun 29, 2021

@mitchellh TLC lets you override operators with code should the pure TLA+ definition become a bottleneck (a profiler helps find bottlenecks). This way, we can write specs that favor readability over efficiency while helping tools. Almost all of the standard modules and some of the community modules have code overrides.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment