@pervognsen
Last active May 17, 2020
The trick to designing transpose algorithms for both small and large problems is to recognize their simple recursive structure.

For a matrix A, let's denote its transpose by T(A) as a shorthand. First, suppose A is a 2x2 matrix:

    [A00 A01]
A = [A10 A11]

Then we have:

       [A00 A10]
T(A) = [A01 A11]

If you subdivide any matrix A into a 2x2 grid of conformant blocks, the same formula holds as long as we recursively transpose the blocks:

       [T(A00) T(A10)]
T(A) = [T(A01) T(A11)]

You can think of the original 2x2 scalar form as a special case since transposing a scalar is a no-op. (For complex matrices, the conjugate transpose, which is usually the operation you want, conjugates each scalar, so the base case isn't a no-op there.)

This works even if the subdivision into 2x2 blocks is asymmetric. For example, if A is a 10x10 matrix you could divide it into blocks of size 3x3, 3x7, 7x3 and 7x7. (The most symmetric case is when you can recursively divide the matrix into equal quarters at every step. This requires that the dimensions of the matrix are powers of two.)
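To make the block identity concrete, here is a small sketch in C (the function names and the 3/7 split are my own, for illustration): a 10x10 row-major matrix is transposed by splitting it once at row/column 3, transposing each of the four blocks naively, and letting the coordinate swap place the off-diagonal blocks in their exchanged positions.

```c
#include <assert.h>

// Write the transpose of A's rows x cols block at (r0, c0) into the n x n
// matrix B. Since transposition swaps coordinates, it lands at (c0, r0).
static void naive_block_T(const double *A, double *B, int n,
                          int r0, int c0, int rows, int cols) {
    for (int i = 0; i < rows; i++)
        for (int j = 0; j < cols; j++)
            B[(c0 + j) * n + (r0 + i)] = A[(r0 + i) * n + (c0 + j)];
}

// Transpose a 10x10 matrix via one asymmetric 3/7 subdivision.
static void transpose10(const double *A, double *B) {
    naive_block_T(A, B, 10, 0, 0, 3, 3);  // T(A00) stays top-left
    naive_block_T(A, B, 10, 0, 3, 3, 7);  // T(A01) lands bottom-left
    naive_block_T(A, B, 10, 3, 0, 7, 3);  // T(A10) lands top-right
    naive_block_T(A, B, 10, 3, 3, 7, 7);  // T(A11) stays bottom-right
}
```

Note that no explicit "swap the off-diagonal blocks" step appears: writing block (r0, c0) to position (c0, r0) performs the exchange automatically.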

To see how this works in practice, let's try to transpose a 4x4 matrix:

    [A00 A01 A02 A03]
    [A10 A11 A12 A13]
    [A20 A21 A22 A23]
    [A30 A31 A32 A33]

We first imagine it subdivided into 2x2 blocks:

    [A00 A01|A02 A03]
    [A10 A11|A12 A13]
    [-------+-------]
    [A20 A21|A22 A23]
    [A30 A31|A32 A33]

We internally transpose each of the blocks:

    [A00 A10|A02 A12]
    [A01 A11|A03 A13]
    [-------+-------]
    [A20 A30|A22 A32]
    [A21 A31|A23 A33]

Then we swap the two off-diagonal blocks:

    [A00 A10|A20 A30]
    [A01 A11|A21 A31]
    [-------+-------]
    [A02 A12|A22 A32]
    [A03 A13|A23 A33]

Now suppose we want to implement this with SSE intrinsics. Our 4x4 matrix is stored as four row vectors which I will call A0, A1, A2, A3. The first step was transposing each of the 2x2 blocks which is accomplished with unpacklo/unpackhi:

    B00 = unpacklo(A0, A1) = [A00 A10
                              A01 A11]

This is the transpose of the upper left block stored as a row-major vector. Similarly for the other transposed blocks:

    B01 = unpackhi(A0, A1) = [A02 A12
                              A03 A13]
    B10 = unpacklo(A2, A3) = [A20 A30
                              A21 A31]
    B11 = unpackhi(A2, A3) = [A22 A32
                              A23 A33]

We can use movelh/movehl to convert back from row-major blocks to rows:

    movelh(B00, B01) = [A00 A10|A02 A12]
    movehl(B01, B00) = [A01 A11|A03 A13]
                       [-------+-------]
    movelh(B10, B11) = [A20 A30|A22 A32]
    movehl(B11, B10) = [A21 A31|A23 A33]

However, we needed to swap B01 and B10 in the second phase of the recursive transpose. We don't need to do this as a separate operation. We can just swap B01 and B10 as operands to movelh/movehl:

    movelh(B00, B10) = [A00 A10|A20 A30]
    movehl(B10, B00) = [A01 A11|A21 A31]
                       [-------+-------]
    movelh(B01, B11) = [A02 A12|A22 A32]
    movehl(B11, B01) = [A03 A13|A23 A33]
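Putting the pieces together, the kernel might look like this in C with SSE intrinsics (a sketch; the in-place, unaligned-load interface is my choice, not part of the description above):

```c
#include <xmmintrin.h>
#include <assert.h>

// 4x4 transpose of a row-major float matrix, in place.
// unpacklo/unpackhi build the transposed 2x2 blocks B00..B11;
// movelh/movehl reassemble rows, with the off-diagonal block swap
// done "for free" by the choice of operands.
static void transpose4x4(float *m) {
    __m128 A0 = _mm_loadu_ps(m + 0);
    __m128 A1 = _mm_loadu_ps(m + 4);
    __m128 A2 = _mm_loadu_ps(m + 8);
    __m128 A3 = _mm_loadu_ps(m + 12);

    __m128 B00 = _mm_unpacklo_ps(A0, A1);  // [A00 A10 A01 A11]
    __m128 B01 = _mm_unpackhi_ps(A0, A1);  // [A02 A12 A03 A13]
    __m128 B10 = _mm_unpacklo_ps(A2, A3);  // [A20 A30 A21 A31]
    __m128 B11 = _mm_unpackhi_ps(A2, A3);  // [A22 A32 A23 A33]

    _mm_storeu_ps(m + 0,  _mm_movelh_ps(B00, B10));  // [A00 A10 A20 A30]
    _mm_storeu_ps(m + 4,  _mm_movehl_ps(B10, B00));  // [A01 A11 A21 A31]
    _mm_storeu_ps(m + 8,  _mm_movelh_ps(B01, B11));  // [A02 A12 A22 A32]
    _mm_storeu_ps(m + 12, _mm_movehl_ps(B11, B01));  // [A03 A13 A23 A33]
}
```

(Note the operand order of _mm_movehl_ps: it takes the high half of its second operand first, which is why movehl(B10, B00) yields the row starting with A01 A11.)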

This was a little fiddly but I wanted to show how this recursive description works even at the smallest scale. But where it really shines is when you want to scale this 4x4 kernel to larger sizes. If you want to transpose an 8x8 matrix, you can just instantiate your 4x4 kernel for the four 4x4 submatrices. Then you have to swap the two off-diagonal blocks, but this is just "register renaming", so it doesn't actually involve any additional work! It's like how we just swapped B01 and B10 as operands to movelh/movehl in the 4x4 case. (Of course, once you exceed the capacity of your register file, you'll have to start writing back to memory. In that case the register renaming applies to the source operands of the stores to memory.)

If you apply this recursive approach using actual recursive function calls rather than static instantiation of fixed-size kernels (you can still use your fixed-size kernels as base cases for the recursive calls), you get a so-called cache-oblivious algorithm for matrix transposition which will scale efficiently to arbitrarily huge matrices without knowing about the target machine's specific memory hierarchy characteristics (cache size, cacheline size, page size, etc).
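A minimal sketch of the cache-oblivious version, as an out-of-place transpose in C (the 4x4 cutoff and the naive base case are placeholders; a tuned implementation would bottom out in a fixed-size SIMD kernel like the one above):

```c
#include <assert.h>

// Recursively transpose the rows x cols matrix A (leading dimension lda)
// into B (leading dimension ldb), splitting the larger dimension in half
// until the block is small enough to transpose directly. The recursion
// never needs to know cache sizes: every level of the memory hierarchy
// is used efficiently once the blocks fit in it.
static void transpose_rec(const float *A, float *B, int lda, int ldb,
                          int rows, int cols) {
    if (rows <= 4 && cols <= 4) {
        for (int i = 0; i < rows; i++)
            for (int j = 0; j < cols; j++)
                B[j * ldb + i] = A[i * lda + j];
    } else if (rows >= cols) {
        int h = rows / 2;  // split the row range
        transpose_rec(A, B, lda, ldb, h, cols);
        transpose_rec(A + h * lda, B + h, lda, ldb, rows - h, cols);
    } else {
        int h = cols / 2;  // split the column range
        transpose_rec(A, B, lda, ldb, rows, h);
        transpose_rec(A + h, B + h * ldb, lda, ldb, rows, cols - h);
    }
}
```

The leading-dimension parameters let each recursive call operate on a submatrix of the original storage without any copying.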
