rchikhi/alg1-ropebwt2.txt

## alg1-ropebwt2.txt
This document is a partial presentation of the RopeBWT2 pre-print

http://arxiv.org/abs/1406.0426

It is the transcript of a presentation made within the Medvedev group at Penn State
in July 2014. It focuses on illustrating some notions from the methods, and
illustrating and proving Algorithm 1. While this document does not cover the main
contribution of the RopeBWT2 paper, I hope it can be helpful towards understanding
the theoretical foundations that led to Algorithms 2 and 3.


BWT
---

First, recall how the Burrows Wheeler Transform of a word s, noted BWT(s), is computed.

As an example, let s=abra$, where $ is a special sentinel character. It is set to be
lexicographically smaller than all other letters.

Consider the suffixes of s, indexed by position of the first letter:

0 abra$
1 bra$
2 ra$
3 a$
4 $

Now sort them by lexicographical order (effectively permuting the indices):

4 $
3 a$
0 abra$
1 bra$
2 ra$

Note that the suffix array of "abra$" is the resulting permutation of the indices,
i.e. [4, 3, 0, 1, 2].

Now transform each suffix by making it circular (i.e. the last letter of s
is immediately followed by the first letter of s):

4 $abra
3 a$abr
0 abra$
1 bra$a
2 ra$ab

The BWT is obtained by reading the last character of each suffix in this sorted order.

  BWT(abra$) = ar$ab

It is possible to do the inverse operation of reconstructing the original string from
its BWT, i.e. UNBWT(ar$ab) = abra$.

See the Wikipedia page of the BWT
(http://en.wikipedia.org/wiki/Burrows%E2%80%93Wheeler_transform#Explanation)
for the inverse algorithm.


Segmentation of the BWT
-----------------------

Let B be the BWT of a string s. B can be segmented as follows: for a character c that
appears in s, let B_c be a sub-sequence of the BWT, obtained by taking the last letters
of all suffixes that start with c. By construction, those letters are consecutive in the
BWT, because all suffixes starting with c are ordered consecutively. Thus B_c is a
sub-string of the BWT. The BWT is the concatenation of B_c's for all characters c, in order.

In our previous example B = BWT(abra$),

  B = B_$ B_a B_b B_r

where

  B_$ = a
  B_a = r$
  B_b = a
  B_r = b

One does not need to sort the suffixes to obtain this segmentation. It suffices to
observe that the length of B_c is the number of occurrences of c in s.


Ranking and inserting
---------------------

Along with the segmentation of the BWT, Algorithm 1 uses two more operations:

 rank(c,k,B) returns the number of occurrences of the character c in the string B[1..k-1].

 insert(c,k,B) inserts the character c at position k in B


Algorithm 1 - Illustration
--------------------------

Algorithm 1 is InsertIO1(B,P) in the paper, which takes as input:

 B (the BWT of a string T#, where # is a sentinel character)
 P (any string)

It outputs BWT(T#P$), where $ is a character lexicographically larger than #,
but lexicographically smaller than all other letters.

An order of sentinels is introduced in the Methods section of the paper
($_0 < $_1 < ... < $_(m-1)), but not in the presentation of Algorithm 1.
Yet, it is important.
We will later see that Algorithm 1 would be incorrect if we considered that # = $.

Let's run Algorithm 1 on an example: InsertIO1(B = "ar#ab", P = "da")

Recall the segmentation of B, taking into account the new characters in da$, is:

 B_# B_$ B_a B_b B_d B_r,

where

 B_# = "a"
 B_$ = ""
 B_a = "r#"
 B_b = "a"
 B_d = ""
 B_r = "b".

The steps of the Algorithm are:

c <- "$"
k <- 0 = number of occurrences of $ in B

// first iteration of the for loop
i = 1, P[i] = "a"
insert("a", 0, B_$), now B_$ = "a"
k <- rank("a", 0, B_$) + [number of occurrences of "a" in B_# ]
  <- 0 + 1 = 1
c <- "a"

// second iteration of the for loop
i = 0, P[i] = "d"
insert("d", 1, B_a), now B_a = "rd#"
k <- rank("d", 1, B_a) + [number of occurrences of "d" in B_#, B_$, B_a and B_b]
  <- 0 + 0 = 0
c <- "d"

// third iteration of the for loop
i = -1, P[i] = "$"
insert("$", 0, B_d), now B_d = "$"
// we can skip the assignments of k and c as this is the last iteration

Finally, B = aard#a$b is the result of the algorithm.

One can verify that indeed, BWT(abra#da$) = aard#a$b (when doing so, recall that # < $).


Algorithm 1 needs ordered sentinels
------------------------------------

A quick observation is that T must not end with the sentinel $ (the one corresponding
to the string that will be appended), else Algorithm 1 would be incorrect.
To see this, we will run Algorithm 1 on the string abra$da$ and see that it does
not update the BWT correctly.

First, observe that BWT(abra$da$) = aadr$a$b

(Here are the details:
0 abra$da$    7 $abra$da
1 bra$da$a    4 $da$abra
2 ra$da$ab    6 a$abra$d
3 a$da$abr -> 3 a$da$abr
4 $da$abra    0 abra$da$
5 da$abra$    1 bra$da$a
6 a$abra$d    5 da$abra$
7 $abra$da    2 ra$da$ab)

The first two iterations of Algorithm 1 are:

c <- "$"
k <- 1 = number of occurrences of $ in B
// first iteration of the for loop
i = 1, P[i] = "a"
insert("a", 1, B_$), now B_$ = "aa"
k <- rank("a", 1, B_$) + [ no letter is smaller than $]
  <- 1 + 0 = 1
c <- "a"
// second iteration of the for loop
i = 0, P[i] = "d"
insert("d", 1, B_a)

B_a is now "rd$" whereas it should be "dr$".


Proof of Algorithm 1
--------------------

We need to prove that the insert() operations are done at the right index, in the
right string B_c.

A similar property was proven in [Bauer 2013]. We restate it in this Lemma:

Lemma 1:

Let s be a string and i an integer in [0,|s|-1].
Let n[i] be the number of suffixes smaller than s[i..] starting with s[i].
Let B = BWT(s), segmented into strings B_c for each character c.
Then, n[i] is also the number of occurrences of s[i] in the string b, where b is the concatenation
of all B_c's with c < s[i+1] and B_c[0..n[i+1]-1] with c = s[i+1].

Proof:

Let s[i..] = xyz, where x and y are characters, and z is a string.
Then, s[i+1..] = yz. (y = s[i+1])
We are interested in the number of suffixes smaller than xyz that start with the character x.
This is equivalent to finding the number of (circularized) suffixes ending with x that are smaller
than yz. In other words, this is the number of suffixes smaller than yz that contribute to a
letter x in the BWT.
Suffixes that start with a letter strictly smaller than y are exactly those that correspond
to letters in B_c with c < y. Thus, those which end with x contribute to the letter x in B_c,
c < y. Thus, the number of letters x in B_c, c < y, is exactly the number of suffixes that
start with a letter strictly smaller than y and end with x.
Now, we need to count how many suffixes start with y, end with the letter x, and are smaller
than yz. These suffixes contribute to B_y. There are exactly n[i+1] suffixes that start with y and
are smaller than yz, thus n[i+1] letters in B_y before the letter corresponding to yz.
Thus, the number of suffixes that start with y, end with x, and are smaller than yz is the
number of letters x in B_y[1..n[i+1]-1].
Putting it all together, the number of suffixes smaller than yz that end with x is the number
of x's in B_c, c < y, plus the number of x in B_y[1..n[i+1]-1]. QED.

Observe that the position k where P[i] is inserted in Algorithm 1 is exactly n[i+1], and the
segment B_c is for c = P[i+1]. The update rule for k follows Lemma 1.
	This document is a partial presentation of the RopeBWT2 pre-print

	http://arxiv.org/abs/1406.0426

	It is the transcript of a presentation made within the Medvedev group at Penn State
	in July 2014. It focuses on illustrating some notions from the methods, and
	illustrating and proving Algorithm 1. While this document does not cover the main
	contribution of the RopeBWT2 paper, I hope it can be helpful towards understanding
	the theoretical foundations that led to Algorithms 2 and 3.


	BWT
	---

	First, recall how the Burrows Wheeler Transform of a word s, noted BWT(s), is computed.

	As an example, let s=abra$, where $ is a special sentinel character. It is set to be
	lexicographically smaller than all other letters.

	Consider the suffixes of s, indexed by position of the first letter:

	0 abra$
	1 bra$
	2 ra$
	3 a$
	4 $

	Now sort them by lexicographical order (effectively permuting the indices):

	4 $
	3 a$
	0 abra$
	1 bra$
	2 ra$

	Note that the suffix array of "abra$" is the resulting permutation of the indices,
	i.e. [4, 3, 0, 1, 2].

	Now transform each suffix by making it circular (i.e. the last letter of s
	is immediately followed by the first letter of s):

	4 $abra
	3 a$abr
	0 abra$
	1 bra$a
	2 ra$ab

	The BWT is obtained by reading the last character of each suffix in this sorted order.

	BWT(abra$) = ar$ab

	It is possible to do the inverse operation of reconstructing the original string from
	its BWT, i.e. UNBWT(ar$ab) = abra$.

	See the Wikipedia page of the BWT
	(http://en.wikipedia.org/wiki/Burrows%E2%80%93Wheeler_transform#Explanation)
	for the inverse algorithm.


	Segmentation of the BWT
	-----------------------

	Let B be the BWT of a string s. B can be segmented as follows: for a character c that
	appears in s, let B_c be a sub-sequence of the BWT, obtained by taking the last letters
	of all suffixes that start with c. By construction, those letters are consecutive in the
	BWT, because all suffixes starting with c are ordered consecutively. Thus B_c is a
	sub-string of the BWT. The BWT is the concatenation of B_c's for all characters c, in order.

	In our previous example B = BWT(abra$),

	B = B_$ B_a B_b B_r

	where

	B_$ = a
	B_a = r$
	B_b = a
	B_r = b

	One does not need to sort the suffixes to obtain this segmentation. It suffices to
	observe that the length of B_c is the number of occurrences of c in s.


	Ranking and inserting
	---------------------

	Along with the segmentation of the BWT, Algorithm 1 uses two more operations:

	rank(c,k,B) returns the number of occurrences of the character c in the string B[1..k-1].

	insert(c,k,B) inserts the character c at position k in B



	Algorithm 1 - Illustration
	--------------------------

	Algorithm 1 is InsertIO1(B,P) in the paper, which takes as input:

	B (the BWT of a string T#, where # is a sentinel character)
	P (any string)

	It outputs BWT(T#P$), where $ is a character lexicographically larger than #,
	but lexicographically smaller than all other letters.

	An order of sentinels is introduced in the Methods section of the paper
	($_0 < $_1 < ... < $_(m-1)), but not in the presentation of Algorithm 1.
	Yet, it is important.
	We will later see that Algorithm 1 would be incorrect if we considered that # = $.

	Let's run Algorithm 1 on an example: InsertIO1(B = "ar#ab", P = "da")

	Recall the segmentation of B, taking into account the new characters in da$, is:

	B_# B_$ B_a B_b B_d B_r,

	where

	B_# = "a"
	B_$ = ""
	B_a = "r#"
	B_b = "a"
	B_d = ""
	B_r = "b".

	The steps of the Algorithm are:

	c <- "$"
	k <- 0 = number of occurrences of $ in B

	// first iteration of the for loop
	i = 1, P[i] = "a"
	insert("a", 0, B_$), now B_$ = "a"
	k <- rank("a", 0, B_$) + [number of occurrences of "a" in B_# ]
	<- 0 + 1 = 1
	c <- "a"

	// second iteration of the for loop
	i = 0, P[i] = "d"
	insert("d", 1, B_a), now B_a = "rd#"
	k <- rank("d", 1, B_a) + [number of occurrences of "d" in B_#, B_$, B_a and B_b]
	<- 0 + 0 = 0
	c <- "d"

	// third iteration of the for loop
	i = -1, P[i] = "$"
	insert("$", 0, B_d), now B_d = "$"
	// we can skip the assignments of k and c as this is the last iteration

	Finally, B = aard#a$b is the result of the algorithm.

	One can verify that indeed, BWT(abra#da$) = aard#a$b (when doing so, recall that # < $).


	Algorithm 1 needs ordered sentinels
	------------------------------------

	A quick observation is that T must not end with the sentinel $ (the one corresponding
	to the string that will be appended), else Algorithm 1 would be incorrect.
	To see this, we will run Algorithm 1 on the string abra$da$ and see that it does
	not update the BWT correctly.

	First, observe that BWT(abra$da$) = aadr$a$b

	(Here are the details:
	0 abra$da$ 7 $abra$da
	1 bra$da$a 4 $da$abra
	2 ra$da$ab 6 a$abra$d
	3 a$da$abr -> 3 a$da$abr
	4 $da$abra 0 abra$da$
	5 da$abra$ 1 bra$da$a
	6 a$abra$d 5 da$abra$
	7 $abra$da 2 ra$da$ab)

	The first two iterations of Algorithm 1 are:

	c <- "$"
	k <- 1 = number of occurrences of $ in B
	// first iteration of the for loop
	i = 1, P[i] = "a"
	insert("a", 1, B_$), now B_$ = "aa"
	k <- rank("a", 1, B_$) + [ no letter is smaller than $]
	<- 1 + 0 = 1
	c <- "a"
	// second iteration of the for loop
	i = 0, P[i] = "d"
	insert("d", 1, B_a)

	B_a is now "rd$" whereas it should be "dr$".


	Proof of Algorithm 1
	--------------------

	We need to prove that the insert() operations are done at the right index, in the
	right string B_c.

	A similar property was proven in [Bauer 2013]. We restate it in this Lemma:

	Lemma 1:

	Let s be a string and i an integer in [0,\|s\|-1].
	Let n[i] be the number of suffixes smaller than s[i..] starting with s[i].
	Let B = BWT(s), segmented into strings B_c for each character c.
	Then, n[i] is also the number of occurrences of s[i] in the string b, where b is the concatenation
	of all B_c's with c < s[i+1] and B_c[0..n[i+1]-1] with c = s[i+1].

	Proof:

	Let s[i..] = xyz, where x and y are characters, and z is a string.
	Then, s[i+1..] = yz. (y = s[i+1])
	We are interested in the number of suffixes smaller than xyz that start with the character x.
	This is equivalent to finding the number of (circularized) suffixes ending with x that are smaller
	than yz. In other words, this is the number of suffixes smaller than yz that contribute to a
	letter x in the BWT.
	Suffixes that start with a letter strictly smaller than y are exactly those that correspond
	to letters in B_c with c < y. Thus, those which end with x contribute to the letter x in B_c,
	c < y. Thus, the number of letters x in B_c, c < y, is exactly the number of suffixes that
	start with a letter strictly smaller than y and end with x.
	Now, we need to count how many suffixes start with y, end with the letter x, and are smaller
	than yz. These suffixes contribute to B_y. There are exactly n[i+1] suffixes that start with y and
	are smaller than yz, thus n[i+1] letters in B_y before the letter corresponding to yz.
	Thus, the number of suffixes that start with y, end with x, and are smaller than yz is the
	number of letters x in B_y[1..n[i+1]-1].
	Putting it all together, the number of suffixes smaller than yz that end with x is the number
	of x's in B_c, c < y, plus the number of x in B_y[1..n[i+1]-1]. QED.

	Observe that the position k where P[i] is inserted in Algorithm 1 is exactly n[i+1], and the
	segment B_c is for c = P[i+1]. The update rule for k follows Lemma 1.