Skip to content

Instantly share code, notes, and snippets.

@markormesher
Last active November 5, 2023 15:06
Show Gist options
  • Save markormesher/7e3b39028f7e7237b098ebf35dcd6545 to your computer and use it in GitHub Desktop.
Save markormesher/7e3b39028f7e7237b098ebf35dcd6545 to your computer and use it in GitHub Desktop.
The Doubling Algorithm for Suffix Array Construction

The Doubling Algorithm for Suffix Array Construction

I'll use x = "banana", m = 6 as an example. The suffixes are:

0:  banana
1:  anana
2:  nana
etc.

First, we'll work out R1, which is the rank of each suffix according to it's first letter only. Because we're just comparing single letters, we can do it in Theta(m) with a bucket sort. There will be some duplicates, and that's fine. We only have three different letters, so we give all of the A's a rank of 0, all of the B's a rank of 1, and all of the N's a rank of 2. We end up with this:

    R1
0:  1
1:  0
2:  2
3:  0
4:  2
5:  0

Now, we'll work out R2, which is the rank of the suffixes according to their first two letters. Instead of comparing actual characters, we can just use the ranks we figured out in the last round, using the "formula" below:

R2k[i] = rank of the pair (Rk[i], Rk[i + k])

For building R2 from R1, this means:

R2[i] = rank of the pair (R1[i], R1[i + 1])

What this states is that the R2 rank of suffix 0 is the same as the rank of the pair (R1(0), R1(1)). We're going to bucket sort according to the rank of R1(0) (i.e. "banana"), then bucket sort again within each set of matching R1(0) values according to the rank of R1(1) (i.e. "anana") to break ties.

We'll end up with this:

    R1  R2
0:  1   1,0 -> 2
1:  0   0,2 -> 1
2:  2   2,0 -> 3
3:  0   0,2 -> 1
4:  2   2,0 -> 3
5:  0   0,* -> 0

Note that the last position has * to indicate that there is no "next rank" to sort with. That's fine, because * "beats" everything, meaning the single "a" at prefix 5 will sort before the "ana" at prefix 3, even though their pairs both start with 0.

We can continue for a third round, moving up to R4 this time and using the R2 ranks (this is why it's called the "doubling" method). We'll use the same method as before:

R4[i] = rank of the pair (R2[i], R2[i + 2])

That gives us this:

    R1  R2         R4
0:  1   1,0 -> 2   2,3 -> 3
1:  0   0,2 -> 1   1,1 -> 2
2:  2   2,0 -> 3   3,3 -> 5
3:  0   0,2 -> 1   1,0 -> 1
4:  2   2,0 -> 3   3,* -> 4
5:  0   0,* -> 0   0,* -> 0

Now we have no duplicates, so we can stop. Reading right to left, we can construct the suffix (rank 0 is given to position 5, rank 1 is given to position 3, etc).

The final output:

SA = 5  (a)
     3  (ana)
     1  (anana)
     0  (banana)
     4  (na)
     2  (nana)

In this case, sorting according to the first 4 characters was enough to create a unique rank for each suffix. In the worst case, we'd need to keep going for O(log m) rounds to sort according to all m letters. The bucket sort at each iteration runs in Theta(m), so the overall running time is O(m * log m).

@markormesher
Copy link
Author

Hi @ywsoliman. I haven't implemented this algorithm anywhere I'm afraid, this write-up is just an explanation of the theory. However, I found this write-up by searching online for the algorithm name plus "implementation", which should get you started.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment