Skip to content

Instantly share code, notes, and snippets.

@rchikhi
Last active March 14, 2021 20:55
Show Gist options
  • Star 7 You must be signed in to star a gist
  • Fork 2 You must be signed in to fork a gist
  • Save rchikhi/ac3d7e7b71e5547cef72a80c8e358283 to your computer and use it in GitHub Desktop.
Save rchikhi/ac3d7e7b71e5547cef72a80c8e358283 to your computer and use it in GitHub Desktop.
Wheeler graphs
Gagie, Manzini, Siren
Theoretical Computer Science, 2017
https://www.sciencedirect.com/science/article/pii/S0304397517305285
Notes of a whiteboard presentation to the Bonsai team in Lille.
These notes largely follow the paper.
Rayan Chikhi, 2019
Some history (not in the paper):
- 1927 David Wheeler born
- 1951 1st PhD in compsci (Wheeler)
- 1953 subroutines invented, replacing GOTOs (Wheeler, Wilkes)
- 1963 Michael Burrows born
- 1983 (B)WT discovered by Wheeler
- 1988 Burrows PhD thesis (in Cambridge)
- 1994 BWT article
- 1995 AltaVista (Burrows wrote the indexer, Monier the crawler)
- 1996 bzip2 (Seward, same author as valgrind)
- 2000 FM-index
- 2005 XBWT
- 2012 BOSS
- 2017 Wheeler graphs
Due to seemingly unrelated BWT variants on tree, graphs, alignments,
(some not even invertible nor compressible,) paper starts with the
blind man / elephant analogy:
one touching a leg thinking it's a tree,
the trunk thinking it's a snake,
the tail thinking it's a rope.
Setting:
- totally ordered alphabet
- directed (multi-)graph,
each edge labeled by a single character
Wheeler graph:
- there exists an ordering (not necessarily total)
of the nodes such that:
- nodes with in-degree=0 precede all others
- given two edges:
u --a --> v
u' --a'--> v'
a < a' => v < v'
(else, a=a') u < u' => v <= v'
Example:
 [ ] -a-> [ ] -b-> [ ]
\____________b___/^
is a Wheeler graph, with possible node order
1 2 3
But if we changed e.g. the first b into a c, it wouldn't remain a Wheeler graph.
Consequence:
- all edges entering a given node have the same label
(otherwise, u,u' be the in-neighbors of v=v', a<a' implies v!=v', contrad)
Succinct encoding of Wheeler graphs:
- binary vectors I and O of in/out-degrees (resp.) unary encoded using 0's separated by 1's,
- string L of outgoing edges labels, given in the order of the nodes,
- C array (number of edges with label <=c).
For the graph above:
 1 -a-> 2 -b-> 3
\________b__/^
O = 001011
L = ab b
I = 101001
*In the paper, the authors directly present Path Coherence, but instead I've started with the motivation example*
Motivating example:
- Without using suffix array or BWT,
build a data structure supporting efficient substring membership queries within a string s
- using a deterministic finite automaton (DFA) -> the suffix trie of s
- using a non-deterministic finite automaton (NFA):
- linear path consisting of all characters of s,
- as well as all links from root (=epsilon-transitions)
- all states are accepting
Recall the definitions from Wikipedia:
- DFA and NFA = (states, alphabet, transition function, start state, accept states)
- transition function in DFA: (state,letter) -> state
- in NFA: (state,letter) -> set of states
- DFA accepts s = exists sequences of states start,r_1,..,r_n\in accept, such that r_{i+1} = transition(r_i,s_i)
- NFA, same but r_{i+1} \in transition(..)
Rest of motivating example:
- DFA is a Wheeler graph
- NFA is not, we need to:
- eliminate epsilon-transitions,
because all incoming nodes don't have same label in original NFA (but in Wheeler graphs, they do),
- make all states initial
- order nodes according to reversed lexi order of prefixes leading to that node
Example with s=ABRACA (not in the Wheeler graph paper - my own):
NFA (with nodes labelled by the prefix just for visual clarity):
epsilon
-A-> A
-B-> AB
-R-> ABR
-A-> ABRA
-C-> ABRAC
-A-> ABRACA
reverse lexicographical prefixes:
epsilon 0
A 1
ABRACA 2
ABRA 3
AB 4
ABRAC 5
ABR 6
Wheeler graph: the NFA above but with node order given by the above prefixes, and discarding node labels
0123456
L = AB$CRAA
(+ the C, O and I arrays)
Relationship between the succinct Wheeler graph representation and BWT:
- since ABRACA has no out-neighbors, add a fictional one labeled by $
- compare with BWT of reverse(ABRACA$)=$ACARBA:
$ACARBA
A$ACARB
ACARBA$
ARBA$AC
BA$ACAR
CARBA$A
RBA$ACA
BWT($ACARBA) = AB$CRAA = L
Path coherence:
- if there exists a total order on the nodes, and
- now fix a consecutive range [i,j] and a string s
- consider nodes reachable from nodes [i,j] in |s| steps, following labels from s
- path coherence means those nodes also form a consecutive range [i',j']
Lemma:
- Wheeler graph => path coherent
Proof:
- consider [i,j] a consecutive range and some character a
- let [i',j'] be the smallest range that contain all nodes reacheable from [i,j] in one step following an edge labelled by a
- note that (by choice of that range) i' and j' have positive in-degree
- by def of Wheeler graphs (nodes with in-degree 0 precede all others), all [i',j'] have positive in-degree
- assume there exists i'<v<j' having an incoming edge a'!=a
- i' < v => a < a'
- v < j => a' < a (Wheeler def + modus tollens)
- contradiction
- so all nodes in [i', j'] are labelled by a
- furthermore, now that we know that all labels are equals,
using last part of Wheeler def + modus tollens again,
all nodes with destination i'<=x<=j' must originate from [i,j]
- so to recap, by construction, nodes reachable from [i,j] are in the range [i',j']
- but as shown above, all nodes in [i',j'] originate from [i,j]
- it follows that nodes reachable from [i,j] with label a are those in the consecutive range [i',j']
- rest is induction on string length
Data structure result:
- n nodes, e edges, alphabet A
- 2(e+n) + e log(|A|) + |A|log(e) + o(n+elog(|A|)) bits.
- forward and backward *edge* traversal in O(log(|A|)) [-- for backward I don't yet see how, the authors don't go into details.]
[For backward I can see how to determine the incoming edges label in O(|A|):
- test for all letters whether C[c] < select(I,j) - j <=C[c+1]
(following the inequality, given in the paper, for testing if x_j has all incoming edges labeled c)
- the c that works gives the incoming edge label
- but this doesn't give each incoming node index]
Then there is another result given on smaller space encoding, based on entropy of labels,
but I had some problems understanding that paragraph due to the redefinition of $L_i$/$L_k$ notations.
Theorem (6):
- given a FA (DFA or NFA in fact), no epsilon-transitions, 1 initial state or all of them initial
- if it's a Wheeler graph, then it can be stored compactly in 2(e+n)+... space
- and can determine string acceptance in O(|s|log(|A|)) time
Proof:
- If it's a Wheeler graph, it's path coherent (Lemma proved above)
- Consider the initial states: they form a consecutive range
- Thus (by def of path coherence) nodes reachable from initial states form a consecutive range
- By induction, one can go form the interval of each prefix of s to the interval of the next prefix in O(log|A|)
Other Wheeler graphs:
- collection of cyclic strings, circular queries
- collection of strings through a trie
- a trie is a Wheeler graph
- XBWT (same as in tries)
- dBGs (BOSS without the dummy $..$ nodes)
- FM-index of alignments (many authors incl. Lecroq)
- GCSA
- PBWT
- wavelet trees
*This is all the second half of the Wheeler graph paper, but I skipped those other graphs in my presentation*
I added an example for tries (note: not suffix tries), but didn't have time to present it in the hour.
Consider a set of strings:
ABC
BAC
ABA
ACA
Corresponding trie:
.
/ \
A B
/ \ |
B C A
/ \ | |
A C A C
- it's a Wheeler graph
- for substring queries, all states need to be initial
- also all states are accepting
order of nodes in Wheeler graph:
epsilon 0
A 1
BA 2
ABA 3
ACA 4
B 5
AB 6
AC 7
BAC 8
ABC 9
0 1 2 345 6 7 89
O = 0010010111010010111
L = AB BC C A AC A
Search for substring 'CA' in that graph:
- all states are initial so the range when traversing 'C' is [7,9]
- it can be obtained from the C array
- then, need to traverse graph by following 'A' from that range
- transition(7,'A') = 4
- transition(8,'A') = none
- transition(9,'A') = none
- The image range is thus {4}, telling us 'CA' is there
[- it remains to understand why we can get away with not querying the whole range, given those 'none'. I'm likely missing something; maybe using the O array]
Another example, query of 'AC':
- range for 'A': [1,4]
- transition(1,'C') = 7
- transition(4,'C') = none
- transition(3,'C') = none
- transition(2,'C') = 8
- result range: [7,8]
(Just out of curiosity: if we had put $'s at the leaves of the trie, the encoding would be:
L = ABBCC$$AACA$$
This isn't the BWT of ABA$ABC$ACA$BAC; one can see this because already they're of different lengths.)
---
Questions from the audience:
- Where do Bowtie/BWA fit in the history timeline? [2009]
- Other articles used Wheeler graphs? [yes: tunneling, read indexing, and last week: Wheeler language]
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment