public
Last active

The design of a fast & generic algorithm for Reduction along given dimensions

  • Download Gist
rdim_design.md
Markdown

Reduction Along Given Dimensions

Reduction (e.g. sum, maximum, etc) along given dimensions are widely used in scientific computing. However, implementing a fast and generic function for such computation is not as trivial as some may consider. In the following, I would like to introduce such an implementation. Particularly, I would focus on implementing the sum function. Most of the techniques, however, can be directly applied to implement other types of reduction.

The codes are provided in this gist.

Colwise and Rowwise Sum

Let's begin with simpler examples, namely, colwise reduction and rowwise sum. These are the most common computation routines and the discussion of their implementation may help to illustrate several key ideas.

Implementation of colwise sum is relatively straightforward -- you simply compute the sum of each column, as

function sum_colwise{T}(a::Array{T})
    m, n = size(a)
    dst = Array(T, 1, n)
    oa = 0
    for j = 1 : n
        s = a[oa + 1]
        for i = 2 : m
            s += a[oa + i]
        end
        dst[j] = s
        oa += m
    end
    return dst
end

However, simply using the same way to implement rowwise sum is not very efficient. In Julia, arrays are organized in a column-major order, which means that given an array of size (m, n), adjacent elements along a row are m positions away from each other in the underlying memory layout. Hence, scaning such an array in a rowwise manner, that is, scaning one row and then jumping to the begining of the next row, is likely to cause a lot of cache missing, especially when m is large.

An efficient implementation of the rowwise sum should follow a different pattern: initialize the result using the first column, and then add the remaining columns of the input array in a colwise manner, as

function sum_rowwise{T}(a::Array{T})
    m, n = size(a)
    dst = Array(T, m, 1)

    # initialize using the first column of a
    for i = 1 : m
        dst[i] = a[i]
    end

    # add remaining columns
    oa = m
    for j = 2 : n
        dst[i] += a[oa+i]
        oa += m
    end
    return dst
end

Vector Kernels as Building Blocks

From the codes above, we can see that there are three vector operations lying at the core:

  • compute the sum of a subvector (here, subvector refers to a contiguous range of an array)
  • use a subvector of the input to initialize a subvector of the output (here, the initialize is done by copying)
  • accumulate a subvector of the input to a subvector of the output

For convenience, these operations are referred to as vector kernels in the following discussion. Particularly, for sum, the first kernel is already available in the Julia base, that is sum_seq, the other two kernels can be implemented as follows:

vcopy!(dst::Array, od::Int, a::Array, oa::Int, n::Int) = copy!(dst, od+1, a, oa+1, n)

function vadd!(dst::Array, od::Int, a::Array, oa::Int, n::Int)
    for i = 1 : n
        @inbounds dst[od+i] += a[oa+i]
    end    
end

With these kernels, the methods above can be rewritten as

function sum_colwise{T}(a::Array{T})
    m, n = size(a)
    dst = Array(T, 1, n)
    oa = 0
    for j = 1 : n
        dst[j] = sum_seq(a, oa+1, oa+m)
        oa += m
    end
    return dst
end

function sum_rowwise{T}(a::Array{T})
    m, n = size(a)
    dst = Array(T, m, 1)
    vcopy!(dst, 0, a, 0, m)
    oa = m
    for j = 2:n
        vadd!(dst, 0, a, oa, m)
        oa += m
    end
    return dst
end

Using vector kernels as building blocks allows one to highly optimize the implementation without modifying the overall skeleton. For example, we already use sum_seq and copy! to improve the performance & accuracy of the computation. When SIMD support is ready, we can substantially improve the performance of these functions without modifying the higher level skeletons -- all what we need to do is to re-implement several vector kernels using SIMD instructions.

Actually, the idea of using vector kernels have been very successfully adopted in the BLAS system, where higher level functions rely on highly vector kernels (BLAS Level-1 routines) as building blocks.

Virtual Shape

Next, I will discuss how to generate these functions to arrays of higher rank and reduction along an arbitrary set of dimensions.

An important observation here is that reduction over arrays of higher ranks (e.g. cubes or 4D arrays) can often be reduced to equivalent computation over arrays of lower ranks.

For example, let's consider an array a of size (3, 4, 5, 2). Reduction of this array along the first dimension (e.g. sum(a, 1)) is equivalent to the reduction over an array of size (3, 40). So to implement sum(a, 1) for this 4D array a, we can do the following:

  1. construct a result array dst of size (1, 4, 5, 2), and consider it as a vector of length 40.
  2. consider the input array a as a matrix of size (3, 40).
  3. perform the column-wise sum as above.

In other words, when doing reduction, we may consider two shapes for each array: the actual shape that you may obtain using size(a) and the virtual shape that the algorithm utilizes to arrange the computation. These two shapes have the same number of elements, but may have different number of dimensions. For the example above, the actual shape of a is (3, 4, 5, 2), while its virtual shape is (3, 40).

Derivation of the virtual shape from the actual shape can follow the procedure below:

  1. Attach with each dimension of the actual shape with a boolean variable. Set it to true if the dimension is in the reduction region (i.e. there is reduction along this dimension), or false otherwise.

  2. Compress consecutive dimensions with the same boolean tag into a single virtual dimension.

Take the example above for instance, the actual shape is (3, 4, 5, 2), and when doing sum(a, 1), the boolean values are (true, false, false, false), so we can compress it into a virtual shape of lower rank as (3, 40).

More examples: for sum(a, (2, 3)), the virtual shape is (3, 20, 2). The gist has a function named rcompress_dims to compress the actual size tuple into a virtual shape of possible lower rank.

Recursive Slicing

Using virtual shape, some reduction computation for arrays of higher ranks can be reduced to two basic paradigms: colwise reduction and rowwise reduction. Yet, there remain some more complex cases where virtual shape itself is not enough. For example, if we do sum(a, (1, 3)), the virtual shape remains the same as size(a), which has four dimensions.

The strategy that I take to solve this problem is recursive slicing. The following example illustrates this approach. Consider an array with virtual shape (m, n, k), the reduction over this array can be summarized by the following pseudo code

rlast = # whether the last dimension k is to be reduced

if rlast  # the last dimension is reduced
    # initialize the result using the first slice
    sum!(dst, slice(a, 1))     
    for i = 2:k
        # accumulate the reduction results from remaining slices
        add_sum!(dst, slice(a, i))
    end
else   # the last dimension is preserved
    # perform reduction for each slice independently
    for i = 1 : k
        sum!(slice(dst, i), slice(a, i))
    end
end

Note: this is just a code skeleton mainly for conveying the basic idea, the actual code may look more complex than this. Here, the slices are with respect to the virtual shape. An array of virtual shape (m, n, k) is considered to be comprised of k slices of size (m, n). Here, slice(a, i) refers to the procedure of extracting such a slice. In the actual implementation, there is no such a function, I basically use offsets to maintain the position of each slice, increment the offset at each iteration, and pass the offset as well as the slice size as arguments in the recursive call. No copying is done throughout the entire procedure.

This entire procedure is very regular and therefore can be written as a code-generation macro as what I did in the gist. With such a macro, various reduction functions such as sum, maximum, minimum, prod, etc can be easily generated.

Please sign in to comment on this gist.

Something went wrong with that request. Please try again.