Skip to content

Instantly share code, notes, and snippets.

@markormesher
Last active June 6, 2024 00:08
Show Gist options
  • Save markormesher/59b990fba09972b4737e7ed66912e044 to your computer and use it in GitHub Desktop.
Save markormesher/59b990fba09972b4737e7ed66912e044 to your computer and use it in GitHub Desktop.
The Skew (Linear Time) Algorithm for Suffix Array Construction

The Skew (Linear Time) Algorithm for Suffix Array Construction

I'll use x = "processing" as an example. Throughout the example * represents the null character, which sorts before any other character (i.e. * < a < b < c < ...).

We start by creating three groups of suffixes - S0, S1 and S2 - so that each suffix in the group Sk starts at the index 3q + k for some value of q. In plain English:

  • S0 suffixes start at positions 0, 3, 6, etc.
  • S1 suffixes start at positions 1, 4, 7, etc.
  • S2 suffixes start at positions 2, 5, 8, etc.

The groups for "processing" are:

S0 = 0: processing
     3: cessing
     6: sing
     9: g
     
S1 = 1: rocessing
     4: essing
     7: ing
     
S2 = 2: ocessing
     5: ssing
     8: ng

We then construct S12, which is made up of the 3-grams of suffixes from S1 and S2. This is shown below, with the starting indexes above each 3-gram (this will be helpful later).

       1     4     7      2     5     8
S12 = roc | ess | ing || oce | ssi | ng*

Because the elements of S12 are all 3 characters long, we can sort them in linear time using a bucket/radix sort and rename each 3-letter block with it's rank, as follows:

       1     4     7      2     5     8
S12 = roc | ess | ing || oce | ssi | ng*
       4     0     1      3     5     2

The renamed S12 is now just "401352", which we can use as the ranks for suffixes in S1 and S2, meaning we have sorted those parts of the whole list. It contains no duplicates so no recursive step is needed; if there were duplicates here, we would apply the entire algorithm again to that string until we can uniquely sort the suffixes in S1 and S2. Because the problem shrinks with each recursion, the algorithm remains linear (see recursion statement at the end).

Now that S1 and S2 are sorted, we can use them to sort the elements of S0 by replacing each suffix with a pair made up of it's first letter and the rank of it's remainder (denoted R(...) below). The ranks of the remainder can be read from the result above (which is why it is useful to write the start indexes of the 3-grams in S12). By re-mapping suffixes to pairs, we can sort S0 in linear time. This is done as follows:

                                                                     Rank
0: processing  ->  (p, R(rocessing))  ->  (p, R(1))   ->  (p, 4)  ->  2
3: cessing     ->  (c, R(essing))     ->  (c, R(4))   ->  (c, 0)  ->  0
6: sing        ->  (s, R(ing))        ->  (s, R(7))   ->  (s, 1)  ->  3
9: g           ->  (g, R(*))          ->  (g, R(10))  ->  (g, *)  ->  1

We've now sorted S1 and S2 into one list and S0 into another list, both in linear time. If we can merge them in linear time, we have a linear algorithm. We can do this by replacing chunks of suffixes with the ranks of those chunks to create pairs and triples, just as we did to sort S0.

In the table below, the top row gives the starting indexes (i) of the sorted suffixes in S0 first, then the starting indexes of the sorted suffixes in S1 and S2 (both created by just reading from earlier results). The next two rows show the pairs (1c) and triples (2c) that will be used to compare elements in constant time.

Pairs are formed by reading the first character of a suffix and the rank of the rest of it; triples are formed by reading the first two characters and the rank of whatever is left. Different items have a pair, triple or both based on the following properties:

  • Suffixes from S0 can form a pair by reading one character and a rank from S1 and a triple by reading two characters and a rank from S2.
  • Suffixes from S1 can form a pair by reading one character and a rank from S2.
  • Suffixes from S2 can form a triple by reading two characters and a rank from S1.

We always want to read ranks from S1 or S2 so that comparisons are consistent.

|  i  |  3  |  9  |  0  |  6  ||  4  |  7  |  8  |  2  |  1  |  5  |
|-----|-----------------------||-----------------------------------|
|  1c |  c0    g*    p4    s1 ||  e5    i2                r3       |
|  2c | ce5   g**   pr3   si2 ||             ng*   oc0         ss1 |

Now that each suffix can be represented as a pair or triple we can compare them in constant time, so the lists can be merged easily:

  • Comparing 3 vs. 4: c0 "beats" e5, so SA[0] = 3.
  • Comparing 9 vs. 4: e5 "beats" g*, so SA[1] = 4.
  • Comparing 9 vs. 7: g* "beats" i2, so SA[2] = 9.
  • etc.

After the lists are fully merged, we get the following result:

SA = 3  (cessing)
     4  (essing)
     9  (g)
     7  (ing)
     8  (ng)
     2  (ocessing)
     0  (processing)
     1  (rocessing)
     6  (sing)
     5  (ssing)

The running time of the algorithm is in the form T(n) = T(2n/3) + O(n) which resolves to O(n).

@naturecreator
Copy link

Hello Mark,

The way you explained the Skew Algorithm is commendable. Each and every bit is crystal clear. This article is helping me a lot. But, I would like to know how to perform recursion on the string because this is the topic on which I have to present a seminar and also would like to know why do we need pairs and triples as in what are their roles / importance. I would be grateful if you could help me in this regarding.

@markormesher
Copy link
Author

Hi @naturecreator. Unfortunately I wrote these notes about 3 years ago when I was studying text search and processing but I've not touched the domain much since then, so I'm probably not the best person to answer your questions. However, you might want to try the following resources:

@naturecreator
Copy link

Hello Mark,

Thank you for providing the information :)

@itsKarad
Copy link

itsKarad commented Jun 6, 2024

Great explanation, thank you!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment