Skip to content

Instantly share code, notes, and snippets.

@markormesher
Last active November 5, 2023 15:06
Show Gist options
  • Save markormesher/7e3b39028f7e7237b098ebf35dcd6545 to your computer and use it in GitHub Desktop.
Save markormesher/7e3b39028f7e7237b098ebf35dcd6545 to your computer and use it in GitHub Desktop.
The Doubling Algorithm for Suffix Array Construction

The Doubling Algorithm for Suffix Array Construction

I'll use x = "banana", m = 6 as an example. The suffixes are:

0:  banana
1:  anana
2:  nana
etc.

First, we'll work out R1, which is the rank of each suffix according to it's first letter only. Because we're just comparing single letters, we can do it in Theta(m) with a bucket sort. There will be some duplicates, and that's fine. We only have three different letters, so we give all of the A's a rank of 0, all of the B's a rank of 1, and all of the N's a rank of 2. We end up with this:

    R1
0:  1
1:  0
2:  2
3:  0
4:  2
5:  0

Now, we'll work out R2, which is the rank of the suffixes according to their first two letters. Instead of comparing actual characters, we can just use the ranks we figured out in the last round, using the "formula" below:

R2k[i] = rank of the pair (Rk[i], Rk[i + k])

For building R2 from R1, this means:

R2[i] = rank of the pair (R1[i], R1[i + 1])

What this states is that the R2 rank of suffix 0 is the same as the rank of the pair (R1(0), R1(1)). We're going to bucket sort according to the rank of R1(0) (i.e. "banana"), then bucket sort again within each set of matching R1(0) values according to the rank of R1(1) (i.e. "anana") to break ties.

We'll end up with this:

    R1  R2
0:  1   1,0 -> 2
1:  0   0,2 -> 1
2:  2   2,0 -> 3
3:  0   0,2 -> 1
4:  2   2,0 -> 3
5:  0   0,* -> 0

Note that the last position has * to indicate that there is no "next rank" to sort with. That's fine, because * "beats" everything, meaning the single "a" at prefix 5 will sort before the "ana" at prefix 3, even though their pairs both start with 0.

We can continue for a third round, moving up to R4 this time and using the R2 ranks (this is why it's called the "doubling" method). We'll use the same method as before:

R4[i] = rank of the pair (R2[i], R2[i + 2])

That gives us this:

    R1  R2         R4
0:  1   1,0 -> 2   2,3 -> 3
1:  0   0,2 -> 1   1,1 -> 2
2:  2   2,0 -> 3   3,3 -> 5
3:  0   0,2 -> 1   1,0 -> 1
4:  2   2,0 -> 3   3,* -> 4
5:  0   0,* -> 0   0,* -> 0

Now we have no duplicates, so we can stop. Reading right to left, we can construct the suffix (rank 0 is given to position 5, rank 1 is given to position 3, etc).

The final output:

SA = 5  (a)
     3  (ana)
     1  (anana)
     0  (banana)
     4  (na)
     2  (nana)

In this case, sorting according to the first 4 characters was enough to create a unique rank for each suffix. In the worst case, we'd need to keep going for O(log m) rounds to sort according to all m letters. The bucket sort at each iteration runs in Theta(m), so the overall running time is O(m * log m).

@PhoebeWangintw
Copy link

PhoebeWangintw commented Jan 15, 2020

Hi, thanks for the nice tutorial!
However, I found two possible mistakes.
The first one lies in the formula of calculating the rank of the suffix

R2k[i] = rank of the pair (Rk[i], Rk[i + k])

Is it:

R2^k[i] = rank of the pair (Rk[i], Rk[i + k])

This will make R2 and R4 more clear.
eg. When k=2 (R4=R2^2):

R2^2[i] = rank of the pair (R2[i], R2[i + 2])

Another possible mistake is:

    R1  R2         R4
0:  1   1,0 -> 2   2,3 -> 3
1:  0   0,2 -> 1   1,1 -> 2
2:  2   2,0 -> 3   3,[2] -> 5
3:  0   0,2 -> 1   1,0 -> 1
4:  2   2,0 -> 3   3,* -> 4
5:  0   0,* -> 0   0,* -> 0

The number in [] should be 3.
Still thanks again for the tutorial! :)

@markormesher
Copy link
Author

Hi @PhoebeWangintw, thanks for your comments.

  • R2k[i] is used because each rank is created by doubling the number, rather than squaring it. Elements of R2 are combined to make R4 (where doubling and squaring both work, because 2x2 = 2^2 = 4), but if you have to go further then elements of R4 are used to make R8, elements of R8 are used to make R16, etc.
  • Well spotted on the second point, you're absolutely right - I dug out my original paper version of these notes and that number was originally a 3 before typing them up, not a 2. I will fix that. Thankfully it doesn't change the result of the calculation in this example because the only other pair that would end up in the same first bucket is 3,*.

@ywsoliman
Copy link

Hey, thanks for that great tutorial, where can I find the implementation of that algorithm?

@markormesher
Copy link
Author

Hi @ywsoliman. I haven't implemented this algorithm anywhere I'm afraid, this write-up is just an explanation of the theory. However, I found this write-up by searching online for the algorithm name plus "implementation", which should get you started.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment