markormesher/skew-algorithm.md

## skew-algorithm.md

      
    Raw
  

              skew-algorithm.md
            
          
    The Skew (Linear Time) Algorithm for Suffix Array Construction

I'll use x = "processing" as an example. Throughout the example * represents the null character, which sorts before any other character (i.e. *  < a < b < c < ...).
We start by creating three groups of suffixes - S0, S1 and S2 - so that each suffix in the group Sk starts at the index 3q + k for some value of q. In plain English:

S0 suffixes start at positions 0, 3, 6, etc.
S1 suffixes start at positions 1, 4, 7, etc.
S2 suffixes start at positions 2, 5, 8, etc.

The groups for "processing" are:
S0 = 0: processing
     3: cessing
     6: sing
     9: g
     
S1 = 1: rocessing
     4: essing
     7: ing
     
S2 = 2: ocessing
     5: ssing
     8: ng

We then construct S12, which is made up of the 3-grams of suffixes from S1 and S2. This is shown below, with the starting indexes above each 3-gram (this will be helpful later).
       1     4     7      2     5     8
S12 = roc | ess | ing || oce | ssi | ng*

Because the elements of S12 are all 3 characters long, we can sort them in linear time using a bucket/radix sort and rename each 3-letter block with it's rank, as follows:
       1     4     7      2     5     8
S12 = roc | ess | ing || oce | ssi | ng*
       4     0     1      3     5     2

The renamed S12 is now just "401352", which we can use as the ranks for suffixes in S1 and S2, meaning we have sorted those parts of the whole list. It contains no duplicates so no recursive step is needed; if there were duplicates here, we would apply the entire algorithm again to that string until we can uniquely sort the suffixes in S1 and S2. Because the problem shrinks with each recursion, the algorithm remains linear (see recursion statement at the end).
Now that S1 and S2 are sorted, we can use them to sort the elements of S0 by replacing each suffix with a pair made up of it's first letter and the rank of it's remainder (denoted R(...) below). The ranks of the remainder can be read from the result above (which is why it is useful to write the start indexes of the 3-grams in S12). By re-mapping suffixes to pairs, we can sort S0 in linear time. This is done as follows:
                                                                     Rank
0: processing  ->  (p, R(rocessing))  ->  (p, R(1))   ->  (p, 4)  ->  2
3: cessing     ->  (c, R(essing))     ->  (c, R(4))   ->  (c, 0)  ->  0
6: sing        ->  (s, R(ing))        ->  (s, R(7))   ->  (s, 1)  ->  3
9: g           ->  (g, R(*))          ->  (g, R(10))  ->  (g, *)  ->  1

We've now sorted S1 and S2 into one list and S0 into another list, both in linear time. If we can merge them in linear time, we have a linear algorithm. We can do this by replacing chunks of suffixes with the ranks of those chunks to create pairs and triples, just as we did to sort S0.
In the table below, the top row gives the starting indexes (i) of the sorted suffixes in S0 first, then the starting indexes of the sorted suffixes in S1 and S2 (both created by just reading from earlier results). The next two rows show the pairs (1c) and triples (2c) that will be used to compare elements in constant time.
Pairs are formed by reading the first character of a suffix and the rank of the rest of it; triples are formed by reading the first two characters and the rank of whatever is left. Different items have a pair, triple or both based on the following properties:

Suffixes from S0 can form a pair by reading one character and a rank from S1 and a triple by reading two characters and a rank from S2.
Suffixes from S1 can form a pair by reading one character and a rank from S2.
Suffixes from S2 can form a triple by reading two characters and a rank from S1.

We always want to read ranks from S1 or S2 so that comparisons are consistent.
|  i  |  3  |  9  |  0  |  6  ||  4  |  7  |  8  |  2  |  1  |  5  |
|-----|-----------------------||-----------------------------------|
|  1c |  c0    g*    p4    s1 ||  e5    i2                r3       |
|  2c | ce5   g**   pr3   si2 ||             ng*   oc0         ss1 |

Now that each suffix can be represented as a pair or triple we can compare them in constant time, so the lists can be merged easily:

Comparing 3 vs. 4: c0 "beats" e5, so SA[0] = 3.
Comparing 9 vs. 4: e5 "beats" g*, so SA[1] = 4.
Comparing 9 vs. 7: g* "beats" i2, so SA[2] = 9.
etc.

After the lists are fully merged, we get the following result:
SA = 3  (cessing)
     4  (essing)
     9  (g)
     7  (ing)
     8  (ng)
     2  (ocessing)
     0  (processing)
     1  (rocessing)
     6  (sing)
     5  (ssing)

The running time of the algorithm is in the form T(n) = T(2n/3) + O(n) which resolves to O(n).