PyTorch CPU Performance Optimization Tutorial - Section II

Part II: Parallelization Techniques

(Training material on PyTorch CPU performance optimization)

Chinese version of this chapter: link.

This section contains the following subjects:

  • Dimension Collapse
  • Dimension Blocking
  • Parallel Dim Reduce

1. Dimension Collapse

Dimension Collapse is a useful technique that helps us simplify the process of traversing the elements of an ND array. The concept applies to the following scenarios:

1.1 Logically Generic & Physically Contiguous

Suppose we have a contiguous 2D tensor with shape [M, N] and want to perform an in-place add with a scalar value. Intuitively we could launch two nested loops to traverse the tensor, but we can also view the tensor as 1D because: 1) the operation is generic across dimensions M and N; 2) the physical memory is contiguous. To parallelize it, we simply divide the whole tensor equally among the threads, as shown in Fig-1(a).

(Fig-1: dimension collapse)

This mapping from a 2D index to a 1D index is a form of dimension collapse.
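
A minimal sketch of this case, assuming a raw float pointer rather than the actual ATen tensor machinery:

  // Minimal sketch: in-place scalar add on a contiguous [M, N] tensor viewed as 1D.
  #include <ATen/Parallel.h>

  void add_scalar_contiguous(float* data, int64_t M, int64_t N, float value) {
    int64_t numel = M * N;  // collapse [M, N] -> [numel]
    at::parallel_for(0, numel, /*grain_size=*/2048, [&](int64_t begin, int64_t end) {
      // each thread gets a contiguous chunk [begin, end)
      for (int64_t i = begin; i < end; i++) {
        data[i] += value;
      }
    });
  }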

1.2 Logically Generic & Physically Non-Contiguous

Same problem as before, but this time suppose the tensor is non-contiguous, with strides of [N + N', 1]. There are multiple approaches to traversing the tensor in parallel:

Parallel on the outer dimension: this is good enough if the size of the outer dimension, a.k.a. M, is big enough to fully utilize all available CPU cores; the inner dimension can be left for vectorization, as shown in Fig-1(b);

Parallel on the inner dimension: usually not a good choice, since chopping up the inner dimension leads to non-contiguous memory access per thread (in this case, with a leading dimension size of N + N');

Map to a 1D index: if the size of the outer dimension is not big enough, say M = 4 while we have 20 CPU cores, and N is extremely large, simply parallelizing on the outer dimension leaves 16 of the 20 CPU cores idle. In this scenario, more parallelism must be exploited to fully utilize the CPU, so again we map the 2D index to 1D for parallelization. But this time we need to convert the global 1D index (i) back to the 2D index (m, n) before accessing the data on each thread.

Going back to the previous chapter, "Data Indexing with Strides", we can calculate the data offset given the index (m, n): offset = m * (N + N') + n. We also have multiple approaches to compute (m, n) given the global index (i):

  • div and mod: the most commonly used method, but not very efficient, since integer division is a slow instruction (GCC will combine the div and mod into one, but it is still not good enough);
  • incremental: we only need to calculate the 2D index of the very first element on each thread and then handle the remaining indices incrementally. data_index_init and data_index_step do this job; the benefit is that only a very limited number of idiv instructions are needed instead of one per element, as shown in Fig-1(c) above (see the sketch after this list).
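
A minimal sketch of the incremental approach on the non-contiguous [M, N] tensor with row stride N + N'; the index bookkeeping here is a simplified stand-in for data_index_init / data_index_step, not their actual signatures:

  // Minimal sketch: in-place scalar add on a strided [M, N] tensor, parallel
  // over the collapsed 1D index, converting back to (m, n) incrementally.
  void add_scalar_strided(float* data, int64_t M, int64_t N,
                          int64_t row_stride, float value) {
    at::parallel_for(0, M * N, /*grain_size=*/2048, [&](int64_t begin, int64_t end) {
      // one div/mod per thread to recover (m, n) from the first global index
      int64_t m = begin / N;
      int64_t n = begin % N;
      for (int64_t i = begin; i < end; i++) {
        data[m * row_stride + n] += value;
        // step the 2D index incrementally instead of div/mod per element
        if (++n == N) { n = 0; ++m; }
      }
    });
  }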

Blocking to vectorize: to ease the process of vectorization, we can do blocking on the original tensor, viewing it as a 3D tensor of [M, K, block_size], where block_size is usually chosen to be a multiple of the vector length. The outer dimensions M and K are then collapsed for parallelization, as shown in Fig-1(d) above.

Sometimes we also do blocking to avoid a remainder, which is inefficient to vectorize; a sketch follows below.
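
A sketch of Fig-1(d), assuming N divides evenly by block_size and using at::vec::Vectorized for the inner loop (header paths and the vector API may vary across PyTorch versions):

  #include <ATen/Parallel.h>
  #include <ATen/cpu/vec/vec.h>

  using Vec = at::vec::Vectorized<float>;

  void add_scalar_blocked(float* data, int64_t M, int64_t N,
                          int64_t row_stride, float value) {
    constexpr int64_t block_size = 4 * Vec::size();  // multiple of the vector length
    int64_t K = N / block_size;                      // assume N % block_size == 0
    // collapse {M, K} for parallelization, vectorize over block_size
    at::parallel_for(0, M * K, 1, [&](int64_t begin, int64_t end) {
      for (int64_t i = begin; i < end; i++) {
        int64_t m = i / K;
        int64_t k = i % K;
        float* ptr = data + m * row_stride + k * block_size;
        for (int64_t d = 0; d < block_size; d += Vec::size()) {
          Vec x = Vec::loadu(ptr + d);
          (x + Vec(value)).store(ptr + d);
        }
      }
    });
  }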

1.3 Logically Non-Generic & Physically Contiguous

Most operators with a kernel parameter fall into this scenario, e.g. the MaxPool2d example from the previous chapter.

Going back to the MaxPool2d example, the very first CPU implementation in PyTorch looked like this (channels first):

  /// pseudo code for max_pool2d channels first

  void max_pool2d_update_output_frame(/* ... */) {
    // parallel on C
    at::parallel_for(0, channels, 0, [&](int64_t begin, int64_t end) {
      // pool each channel in [begin, end)
    });
  }

  void max_pool2d_update_output(/* ... */) {
    // parallel on N
    at::parallel_for(0, nbatch, 0, [&](int64_t begin, int64_t end) {
      for (int64_t n = begin; n < end; n++) {
        max_pool2d_update_output_frame(/* ... */);
      }
    });
  }

There exists a couple of issues with this implementation:

  • It has nested omp loops: the rule of thumb is to avoid nested omp loops, since the omp runtime has no global resource-management mechanism like tbb, so different omp thread pools are likely to compete for resources, i.e. 'over-subscription'. PyTorch therefore forces any inner nested omp loop to run sequentially, so there is no over-subscription hazard as long as you let PyTorch handle the threading.
  • It's not flexible: with an input shape of [4, 64, 112, 112], only 4 cores will be used, since the inner parallel loop over the 64 channels is forced to run sequentially; with an input shape of [1, 3, 224, 224], only 3 cores will be used, as the code ends up parallelizing on the inner loop.

Ideally we could parallelize over the entire tensor with the dimension collapse technique described in the previous chapter; we only need 3 indices, for nbatch * channels, height and width, since nbatch and channels are logically generic. A sketch is shown below.
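
A minimal sketch of this idea in the same pseudo style as above (shapes and variable names are assumed; this is not the actual ATen kernel):

  /// sketch: max_pool2d channels first, parallel over collapsed N * C
  void max_pool2d_update_output(/* ... */) {
    // collapse {nbatch, channels} into a single dimension for parallelization
    at::parallel_for(0, nbatch * channels, 0, [&](int64_t begin, int64_t end) {
      for (int64_t nc = begin; nc < end; nc++) {
        const scalar_t* input_plane = input_data + nc * input_height * input_width;
        scalar_t* output_plane = output_data + nc * output_height * output_width;
        // pool input_plane [input_height, input_width] into
        // output_plane [output_height, output_width]
      }
    });
  }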

2. Dimension Blocking

Dimension Blocking is the reverse of Dimension Collapse: it means viewing a tensor in a higher dimension than it actually has. We may need to do blocking for several reasons: 1) to extend parallelism; 2) to eliminate the remainder of vectorization; etc.

Fig-1(d) is an example of blocking for vectorization.

The following is an example of using blocking to extend parallelism: torch.cumsum(), also referred to as prefix_sum. #74899 and #74900 target improving cumsum() CPU performance. The parallel scheme in the PRs is fairly simple: parallelize on the outer dimensions when dim = -1, and vectorize on the innermost dimension.

The PRs don't go to extremes: when the outer dimension size is 1, i.e. cumsum on a 1D tensor, the kernel runs sequentially. The difficulty lies in the fact that cumsum has a data dependency which makes it essentially a sequential operator. With a little modification, it can be parallelized with a 3-step approach, as shown in Fig-2(a) and sketched in code below:

Equally divide the input data among the threads, and then:

  • step-1: do prefix_sum per thread, i.e. each thread starts with a zero offset;
  • step-2: update the global offset for each thread;
  • step-3: apply the global offset for each thread.

(Fig-2: parallel prefix sum)
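
A minimal sketch of the 3-step scheme on a 1D float array; the per-thread chunking below is illustrative glue, not the code from the PRs:

  #include <ATen/Parallel.h>
  #include <algorithm>
  #include <vector>

  void prefix_sum(const float* src, float* dst, int64_t numel) {
    int64_t num_threads = at::get_num_threads();
    std::vector<float> offset(num_threads + 1, 0.f);
    int64_t chunk = (numel + num_threads - 1) / num_threads;

    // step-1: local prefix sum per chunk, each chunk starting from zero
    at::parallel_for(0, num_threads, 1, [&](int64_t begin, int64_t end) {
      for (int64_t t = begin; t < end; t++) {
        int64_t lo = t * chunk, hi = std::min(lo + chunk, numel);
        float sum = 0.f;
        for (int64_t i = lo; i < hi; i++) {
          sum += src[i];
          dst[i] = sum;
        }
        offset[t + 1] = sum;
      }
    });

    // step-2: sequential scan over the per-chunk totals -> global offsets
    for (int64_t t = 1; t <= num_threads; t++) {
      offset[t] += offset[t - 1];
    }

    // step-3: apply each chunk's global offset (revisits the data once more)
    at::parallel_for(0, num_threads, 1, [&](int64_t begin, int64_t end) {
      for (int64_t t = begin; t < end; t++) {
        int64_t lo = t * chunk, hi = std::min(lo + chunk, numel);
        for (int64_t i = lo; i < hi; i++) {
          dst[i] += offset[t];
        }
      }
    });
  }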

The parallel scheme above has one problem: we visit the same data twice, in step-1 and step-3. If the tensor size is really big and the payload for each thread exceeds L2 (which is private per core), performance drops dramatically, since the input data will have been flushed out of L2 by the time it is revisited in step-3.

An improvement upon this parallel scheme is to block the tensor into [M, N], where:

  N = T * block_size;
  M = numel / N

Here block_size corresponds to the payload for each thread and should be chosen so that it stays within L2. That is to say, T * block_size elements are handled in parallel, and this iterates M times, as shown in Fig-2(b).

The manner of blocking can be really flexible; usually we need to take the input shape into consideration to choose the optimal blocking scheme. Here is another dimension blocking example, which optimizes log_softmax when dim != -1: #64726.

3. Parallel Dim Reduce

Another commonly used parallel scheme is parallelized dim-reduce. Let's take the example of BatchNorm2d's collect-stats functionality on the channels first and channels last memory formats.

The process of collecting stats across N, H and W is a type of dim-reduce, from [N, C, H, W] to [C]. On channels first it is a horizontal reduce, and on channels last it is a vertical reduce.

3.1 Channels First Kernel

On the CF memory format it is very straightforward: we can directly parallelize on channels and accumulate across the rest of the dimensions, as shown in Fig-3(a):

(Fig-3: dim reduce)
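
A minimal sketch of the CF kernel, mirroring the variable names of the CL code below (nbatch, n_channel, image_size = H * W, input_data and mean_data are assumed):

  // parallel directly on C; each thread accumulates its channels over N * H * W
  at::parallel_for(0, n_channel, 1, [&](int64_t begin, int64_t end) {
    for (const auto c : c10::irange(begin, end)) {
      accscalar_t sum = 0;
      for (const auto n : c10::irange(nbatch)) {
        const scalar_t* ptr = input_data + n * n_channel * image_size + c * image_size;
        for (const auto i : c10::irange(image_size)) {
          sum += ptr[i];
        }
      }
      mean_data[c] = sum / (nbatch * image_size);
    }
  });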

3.2 Channels Last Kernel

On the CL memory format it is a little trickier: if we run sequentially, we can simply employ the one-pass approach shown in the left part of Fig-3(b). If we directly parallelize on the outer dimension of NHW, it leads to write conflicts. To resolve this, a two-pass approach is applied:

First we need to allocate a temp buffer of size [T, C], where T is the number of threads:

  • Pass-I: parallelize on the outer dimension of NHW and reduce [NHW, C] to [T, C];
  • Pass-II: reduce the temp buffer from [T, C] to the output [C].

  int num_threads = at::get_num_threads();
  Tensor buffer = at::empty({num_threads, n_channel}, input.options()).zero_();
  scalar_t* buffer_data = buffer.data_ptr<scalar_t>();

  // Pass-I: parallel on the outer NHW dimension, accumulate per-thread partial sums into buffer [T, C]
  at::parallel_for(0, N, 1, [&](int64_t begin, int64_t end) {
    int tid = at::get_thread_num();
    scalar_t* buffer_ptr = buffer_data + tid * n_channel;
    for (const auto i : c10::irange(begin, end)) {
      const scalar_t* x_ptr = input_data + i * n_channel;
      vec::map2<scalar_t>(
          [](Vec x, Vec y) { return x + y; },
          buffer_ptr,
          x_ptr,
          buffer_ptr,
          n_channel);
    }
  });

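  // Pass-II: reduce the per-thread buffer [T, C] to the final mean [C]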
  at::parallel_for(0, n_channel, 1, [&](int64_t begin, int64_t end) {
    for (const auto c : c10::irange(begin, end)) {
      accscalar_t sum = 0;
      for (const auto t : c10::irange(num_threads)) {
        sum += buffer_data[t * n_channel + c];
      }
      scalar_t mean = sum / N;
      mean_data[c] = mean;
    }
  });

Notes on the above kernel:

  • vec::map2<>() is a vectorized helper that applies the lambda elementwise over two input arrays; here it accumulates a row of x (C contiguous values) into the per-thread buffer.