rchikhi/wheeler_graphs.txt

## wheeler_graphs.txt
Wheeler graphs

Gagie, Manzini, Siren
Theoretical Computer Science, 2017
https://www.sciencedirect.com/science/article/pii/S0304397517305285

Notes of a whiteboard presentation to the Bonsai team in Lille.
These notes largely follow the paper.
Rayan Chikhi, 2019

Some history (not in the paper):

- 1927 David Wheeler born
- 1951 1st PhD in compsci (Wheeler)
- 1953 subroutines invented, replacing GOTOs (Wheeler, Wilkes)
- 1963 Michael Burrows born
- 1983 (B)WT discovered by Wheeler
- 1988 Burrows PhD thesis (in Cambridge)
- 1994 BWT article
- 1995 AltaVista (Burrows wrote the indexer, Monier the crawler)
- 1996 bzip2 (Seward, same author as valgrind)
- 2000 FM-index
- 2005 XBWT
- 2012 BOSS
- 2017 Wheeler graphs

Due to seemingly unrelated BWT variants on tree, graphs, alignments,
(some not even invertible nor compressible,) paper starts with the
blind man / elephant analogy:
one touching a leg thinking it's a tree,
the trunk thinking it's a snake,
the tail thinking it's a rope.

Setting:
- totally ordered alphabet
- directed (multi-)graph,
  each edge labeled by a single character

Wheeler graph:
- there exists an ordering (not necessarily total)
  of the nodes such that:
  - nodes with in-degree=0 precede all others
  - given two edges:

       u  --a -->  v
       u' --a'-->  v'

               a < a' =>  v <  v'
  (else, a=a') u < u' =>  v <= v'

Example:

 [ ] -a-> [ ] -b-> [ ]
  \____________b___/^

 is a Wheeler graph, with possible node order
 1         2        3

But if we changed e.g. the first b into a c, it wouldn't remain a Wheeler graph.

Consequence:
- all edges entering a given node have the same label
(otherwise, u,u' be the in-neighbors of v=v', a<a' implies v!=v', contrad)

Succinct encoding of Wheeler graphs:
- binary vectors I and O of in/out-degrees (resp.) unary encoded using 0's separated by 1's,
- string L of outgoing edges labels, given in the order of the nodes,
- C array (number of edges with label <=c).

For the graph above:

 1 -a-> 2 -b-> 3
  \________b__/^

O = 001011
L = ab b

I = 101001

*In the paper, the authors directly present Path Coherence, but instead I've started with the motivation example*

Motivating example:
- Without using suffix array or BWT,
  build a data structure supporting efficient substring membership queries within a string s
- using a deterministic finite automaton (DFA) -> the suffix trie of s
- using a non-deterministic finite automaton (NFA):
  - linear path consisting of all characters of s,
  - as well as all links from root (=epsilon-transitions)
  - all states are accepting

Recall the definitions from Wikipedia:
- DFA and NFA = (states, alphabet, transition function, start state, accept states)
- transition function in DFA: (state,letter) -> state
-                     in NFA: (state,letter) -> set of states
- DFA accepts s = exists sequences of states start,r_1,..,r_n\in accept, such that r_{i+1} = transition(r_i,s_i)
- NFA, same but r_{i+1} \in transition(..)

Rest of motivating example:
- DFA is a Wheeler graph
- NFA is not, we need to:
   - eliminate epsilon-transitions,
     because all incoming nodes don't have same label in original NFA (but in Wheeler graphs, they do),
   - make all states initial
   - order nodes according to reversed lexi order of prefixes leading to that node

Example with s=ABRACA (not in the Wheeler graph paper - my own):

NFA (with nodes labelled by the prefix just for visual clarity):

   epsilon
   -A-> A
   -B-> AB
   -R-> ABR
   -A-> ABRA
   -C-> ABRAC
   -A-> ABRACA

reverse lexicographical prefixes:

epsilon 0
      A 1
 ABRACA 2
   ABRA 3
     AB 4
  ABRAC 5
    ABR 6

Wheeler graph: the NFA above but with node order given by the above prefixes, and discarding node labels

    0123456
L = AB$CRAA

(+ the C, O and I arrays)

Relationship between the succinct Wheeler graph representation and BWT:
- since ABRACA has no out-neighbors, add a fictional one labeled by $
- compare with BWT of reverse(ABRACA$)=$ACARBA:

$ACARBA
A$ACARB
ACARBA$
ARBA$AC
BA$ACAR
CARBA$A
RBA$ACA

BWT($ACARBA) = AB$CRAA = L

Path coherence:
- if there exists a total order on the nodes, and
- now fix a consecutive range [i,j] and a string s
- consider nodes reachable from nodes [i,j] in |s| steps, following labels from s
- path coherence means those nodes also form a consecutive range [i',j']

Lemma:
- Wheeler graph => path coherent

Proof:
- consider [i,j] a consecutive range and some character a
- let [i',j'] be the smallest range that contain all nodes reacheable from [i,j] in one step following an edge labelled by a
- note that (by choice of that range) i' and j' have positive in-degree
- by def of Wheeler graphs (nodes with in-degree 0 precede all others), all [i',j'] have positive in-degree

- assume there exists i'<v<j' having an incoming edge a'!=a
- i' < v => a  < a'
- v  < j => a' < a (Wheeler def + modus tollens)
- contradiction
- so all nodes in [i', j'] are labelled by a

- furthermore, now that we know that all labels are equals,
  using last part of Wheeler def + modus tollens again,
  all nodes with destination i'<=x<=j' must originate from [i,j]

- so to recap, by construction, nodes reachable from [i,j] are in the range [i',j']
- but as shown above, all nodes in [i',j'] originate from [i,j]
- it follows that nodes reachable from [i,j] with label a are those in the consecutive range [i',j']
- rest is induction on string length

Data structure result:
- n nodes, e edges, alphabet A
- 2(e+n) + e log(|A|) + |A|log(e) + o(n+elog(|A|)) bits.
- forward and backward *edge* traversal in O(log(|A|)) [-- for backward I don't yet see how, the authors don't go into details.]

[For backward I can see how to determine the incoming edges label in O(|A|):
- test for all letters whether C[c] <  select(I,j) - j <=C[c+1]
  (following the inequality, given in the paper, for testing if x_j has all incoming edges labeled c)
- the c that works gives the incoming edge label
- but this doesn't give each incoming node index]

Then there is another result given on smaller space encoding, based on entropy of labels,
but I had some problems understanding that paragraph due to the redefinition of $L_i$/$L_k$ notations.

Theorem (6):
- given a FA (DFA or NFA in fact), no epsilon-transitions, 1 initial state or all of them initial
- if it's a Wheeler graph, then it can be stored compactly in 2(e+n)+... space
- and can determine string acceptance in O(|s|log(|A|)) time

Proof:
- If it's a Wheeler graph, it's path coherent (Lemma proved above)
- Consider the initial states: they form a consecutive range
- Thus (by def of path coherence) nodes reachable from initial states form a consecutive range
- By induction, one can go form the interval of each prefix of s to the interval of the next prefix in O(log|A|)

Other Wheeler graphs:
- collection of cyclic strings, circular queries
- collection of strings through a trie
- a trie is a Wheeler graph
- XBWT (same as in tries)
- dBGs (BOSS without the dummy $..$ nodes)
- FM-index of alignments (many authors incl. Lecroq)
- GCSA
- PBWT
- wavelet trees

*This is all the second half of the Wheeler graph paper, but I skipped those other graphs in my presentation*

I added an example for tries (note: not suffix tries), but didn't have time to present it in the hour.

Consider a set of strings:
ABC
BAC
ABA
ACA

Corresponding trie:
	.
       / \
      A   B
     / \  |
    B   C A
   / \  | |
  A   C A C

- it's a Wheeler graph
- for substring queries, all states need to be initial
- also all states are accepting

order of nodes in Wheeler graph:

epsilon 0
  A     1
 BA     2
ABA     3
ACA     4
  B     5
 AB     6
 AC     7
BAC     8
ABC     9

    0  1  2 345 6  7 89
O = 0010010111010010111
L = AB BC C   A AC A

Search for substring 'CA' in that graph:
- all states are initial so the range when traversing 'C' is [7,9]
- it can be obtained from the C array
- then, need to traverse graph by following 'A' from that range
- transition(7,'A') = 4
- transition(8,'A') = none
- transition(9,'A') = none
- The image range is thus {4}, telling us 'CA' is there
[- it remains to understand why we can get away with not querying the whole range, given those 'none'. I'm likely missing something; maybe using the O array]

Another example, query of 'AC':
- range for 'A': [1,4]
- transition(1,'C') = 7
- transition(4,'C') = none
- transition(3,'C') = none
- transition(2,'C') = 8
- result range: [7,8]

(Just out of curiosity: if we had put $'s at the leaves of the trie, the encoding would be:
L = ABBCC$$AACA$$
This isn't the BWT of ABA$ABC$ACA$BAC; one can see this because already they're of different lengths.)

---

Questions from the audience:
- Where do Bowtie/BWA fit in the history timeline? [2009]
- Other articles used Wheeler graphs? [yes: tunneling, read indexing, and last week: Wheeler language]
	Wheeler graphs

	Gagie, Manzini, Siren
	Theoretical Computer Science, 2017
	https://www.sciencedirect.com/science/article/pii/S0304397517305285

	Notes of a whiteboard presentation to the Bonsai team in Lille.
	These notes largely follow the paper.
	Rayan Chikhi, 2019

	Some history (not in the paper):

	- 1927 David Wheeler born
	- 1951 1st PhD in compsci (Wheeler)
	- 1953 subroutines invented, replacing GOTOs (Wheeler, Wilkes)
	- 1963 Michael Burrows born
	- 1983 (B)WT discovered by Wheeler
	- 1988 Burrows PhD thesis (in Cambridge)
	- 1994 BWT article
	- 1995 AltaVista (Burrows wrote the indexer, Monier the crawler)
	- 1996 bzip2 (Seward, same author as valgrind)
	- 2000 FM-index
	- 2005 XBWT
	- 2012 BOSS
	- 2017 Wheeler graphs

	Due to seemingly unrelated BWT variants on tree, graphs, alignments,
	(some not even invertible nor compressible,) paper starts with the
	blind man / elephant analogy:
	one touching a leg thinking it's a tree,
	the trunk thinking it's a snake,
	the tail thinking it's a rope.

	Setting:
	- totally ordered alphabet
	- directed (multi-)graph,
	each edge labeled by a single character

	Wheeler graph:
	- there exists an ordering (not necessarily total)
	of the nodes such that:
	- nodes with in-degree=0 precede all others
	- given two edges:

	u --a --> v
	u' --a'--> v'

	a < a' => v < v'
	(else, a=a') u < u' => v <= v'

	Example:

	[ ] -a-> [ ] -b-> [ ]
	\____________b___/^

	is a Wheeler graph, with possible node order
	1 2 3

	But if we changed e.g. the first b into a c, it wouldn't remain a Wheeler graph.

	Consequence:
	- all edges entering a given node have the same label
	(otherwise, u,u' be the in-neighbors of v=v', a<a' implies v!=v', contrad)

	Succinct encoding of Wheeler graphs:
	- binary vectors I and O of in/out-degrees (resp.) unary encoded using 0's separated by 1's,
	- string L of outgoing edges labels, given in the order of the nodes,
	- C array (number of edges with label <=c).

	For the graph above:

	1 -a-> 2 -b-> 3
	\________b__/^

	O = 001011
	L = ab b

	I = 101001

	In the paper, the authors directly present Path Coherence, but instead I've started with the motivation example

	Motivating example:
	- Without using suffix array or BWT,
	build a data structure supporting efficient substring membership queries within a string s
	- using a deterministic finite automaton (DFA) -> the suffix trie of s
	- using a non-deterministic finite automaton (NFA):
	- linear path consisting of all characters of s,
	- as well as all links from root (=epsilon-transitions)
	- all states are accepting

	Recall the definitions from Wikipedia:
	- DFA and NFA = (states, alphabet, transition function, start state, accept states)
	- transition function in DFA: (state,letter) -> state
	- in NFA: (state,letter) -> set of states
	- DFA accepts s = exists sequences of states start,r_1,..,r_n\in accept, such that r_{i+1} = transition(r_i,s_i)
	- NFA, same but r_{i+1} \in transition(..)

	Rest of motivating example:
	- DFA is a Wheeler graph
	- NFA is not, we need to:
	- eliminate epsilon-transitions,
	because all incoming nodes don't have same label in original NFA (but in Wheeler graphs, they do),
	- make all states initial
	- order nodes according to reversed lexi order of prefixes leading to that node

	Example with s=ABRACA (not in the Wheeler graph paper - my own):

	NFA (with nodes labelled by the prefix just for visual clarity):

	epsilon
	-A-> A
	-B-> AB
	-R-> ABR
	-A-> ABRA
	-C-> ABRAC
	-A-> ABRACA

	reverse lexicographical prefixes:

	epsilon 0
	A 1
	ABRACA 2
	ABRA 3
	AB 4
	ABRAC 5
	ABR 6

	Wheeler graph: the NFA above but with node order given by the above prefixes, and discarding node labels

	0123456
	L = AB$CRAA

	(+ the C, O and I arrays)

	Relationship between the succinct Wheeler graph representation and BWT:
	- since ABRACA has no out-neighbors, add a fictional one labeled by $
	- compare with BWT of reverse(ABRACA$)=$ACARBA:

	$ACARBA
	A$ACARB
	ACARBA$
	ARBA$AC
	BA$ACAR
	CARBA$A
	RBA$ACA

	BWT($ACARBA) = AB$CRAA = L

	Path coherence:
	- if there exists a total order on the nodes, and
	- now fix a consecutive range [i,j] and a string s
	- consider nodes reachable from nodes [i,j] in \|s\| steps, following labels from s
	- path coherence means those nodes also form a consecutive range [i',j']

	Lemma:
	- Wheeler graph => path coherent

	Proof:
	- consider [i,j] a consecutive range and some character a
	- let [i',j'] be the smallest range that contain all nodes reacheable from [i,j] in one step following an edge labelled by a
	- note that (by choice of that range) i' and j' have positive in-degree
	- by def of Wheeler graphs (nodes with in-degree 0 precede all others), all [i',j'] have positive in-degree

	- assume there exists i'<v<j' having an incoming edge a'!=a
	- i' < v => a < a'
	- v < j => a' < a (Wheeler def + modus tollens)
	- contradiction
	- so all nodes in [i', j'] are labelled by a

	- furthermore, now that we know that all labels are equals,
	using last part of Wheeler def + modus tollens again,
	all nodes with destination i'<=x<=j' must originate from [i,j]

	- so to recap, by construction, nodes reachable from [i,j] are in the range [i',j']
	- but as shown above, all nodes in [i',j'] originate from [i,j]
	- it follows that nodes reachable from [i,j] with label a are those in the consecutive range [i',j']
	- rest is induction on string length

	Data structure result:
	- n nodes, e edges, alphabet A
	- 2(e+n) + e log(\|A\|) + \|A\|log(e) + o(n+elog(\|A\|)) bits.
	- forward and backward edge traversal in O(log(\|A\|)) [-- for backward I don't yet see how, the authors don't go into details.]

	[For backward I can see how to determine the incoming edges label in O(\|A\|):
	- test for all letters whether C[c] < select(I,j) - j <=C[c+1]
	(following the inequality, given in the paper, for testing if x_j has all incoming edges labeled c)
	- the c that works gives the incoming edge label
	- but this doesn't give each incoming node index]

	Then there is another result given on smaller space encoding, based on entropy of labels,
	but I had some problems understanding that paragraph due to the redefinition of $L_i$/$L_k$ notations.

	Theorem (6):
	- given a FA (DFA or NFA in fact), no epsilon-transitions, 1 initial state or all of them initial
	- if it's a Wheeler graph, then it can be stored compactly in 2(e+n)+... space
	- and can determine string acceptance in O(\|s\|log(\|A\|)) time

	Proof:
	- If it's a Wheeler graph, it's path coherent (Lemma proved above)
	- Consider the initial states: they form a consecutive range
	- Thus (by def of path coherence) nodes reachable from initial states form a consecutive range
	- By induction, one can go form the interval of each prefix of s to the interval of the next prefix in O(log\|A\|)

	Other Wheeler graphs:
	- collection of cyclic strings, circular queries
	- collection of strings through a trie
	- a trie is a Wheeler graph
	- XBWT (same as in tries)
	- dBGs (BOSS without the dummy $..$ nodes)
	- FM-index of alignments (many authors incl. Lecroq)
	- GCSA
	- PBWT
	- wavelet trees

	This is all the second half of the Wheeler graph paper, but I skipped those other graphs in my presentation

	I added an example for tries (note: not suffix tries), but didn't have time to present it in the hour.

	Consider a set of strings:
	ABC
	BAC
	ABA
	ACA

	Corresponding trie:
	.
	/ \
	A B
	/ \ \|
	B C A
	/ \ \| \|
	A C A C

	- it's a Wheeler graph
	- for substring queries, all states need to be initial
	- also all states are accepting

	order of nodes in Wheeler graph:

	epsilon 0
	A 1
	BA 2
	ABA 3
	ACA 4
	B 5
	AB 6
	AC 7
	BAC 8
	ABC 9

	0 1 2 345 6 7 89
	O = 0010010111010010111
	L = AB BC C A AC A

	Search for substring 'CA' in that graph:
	- all states are initial so the range when traversing 'C' is [7,9]
	- it can be obtained from the C array
	- then, need to traverse graph by following 'A' from that range
	- transition(7,'A') = 4
	- transition(8,'A') = none
	- transition(9,'A') = none
	- The image range is thus {4}, telling us 'CA' is there
	[- it remains to understand why we can get away with not querying the whole range, given those 'none'. I'm likely missing something; maybe using the O array]

	Another example, query of 'AC':
	- range for 'A': [1,4]
	- transition(1,'C') = 7
	- transition(4,'C') = none
	- transition(3,'C') = none
	- transition(2,'C') = 8
	- result range: [7,8]

	(Just out of curiosity: if we had put $'s at the leaves of the trie, the encoding would be:
	L = ABBCC$$AACA$$
	This isn't the BWT of ABA$ABC$ACA$BAC; one can see this because already they're of different lengths.)

	---

	Questions from the audience:
	- Where do Bowtie/BWA fit in the history timeline? [2009]
	- Other articles used Wheeler graphs? [yes: tunneling, read indexing, and last week: Wheeler language]