@fabianmcg
Created April 25, 2023 15:35
[RFC] Extending MLIR GPU device codegen pipeline

Summary

The current pipeline for generating GPU code has several drawbacks: it cannot link against bytecode libraries such as libdevice (or perform device linking in general), and it requires building MLIR on the target system to obtain executable code, to name a few.

This proposal introduces an alternative pipeline for generating device code. Essentially, it would generate clang-compliant offload LLVM IR and leave the task of producing the final executable to clang, removing that burden from MLIR and leveraging clang's capabilities for offload code generation.

Important: this won't introduce a clang build dependency to MLIR.

Context & Motivation

The current approach for lowering the GPU dialect can be summed up by:

mlir-opt --convert-gpu-to-nvvm \
         --convert-nvgpu-to-nvvm \
         --gpu-to-cubin \
         --gpu-to-llvm input.mlir

Device code generation happens inside either --gpu-to-cubin or --gpu-to-hsaco. These serialization passes follow the process:

(NVVM | ROCDL) + LLVM -> LLVM IR -> Device binary -> Embed the device binary as a constant in the host IR

This pipeline works in many cases; however, as soon as users start using operations like math.sqrt, they run into issues. Most of these issues appear because there is no linking step during serialization (see SerializeToCubin.cpp), so the current system cannot scale to more complex device code.
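As an illustration (the pipeline mirrors the one above; the file name and error text are representative, not verbatim): math.sqrt lowers to a call to libdevice's __nv_sqrtf, which the serialization pass never links in.

# sqrt-kernel.mlir contains a kernel using math.sqrt.
mlir-opt --convert-gpu-to-nvvm --gpu-to-cubin --gpu-to-llvm sqrt-kernel.mlir
# ...error: unresolved extern function '__nv_sqrtf'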

Another issue with the current approach is that users are forced to build MLIR on every target system (or at least on a system with a full pipeline), as the serialization passes and the ExecutionEngine depend on an installation of either ROCm or CUDA. Note that this requirement is not imposed on pure host code: a user can translate down to LLVM IR on one machine, copy the file to the target system, and generate the executable with clang on the target system.

Why is this last point an issue? MLIR moves fast, so weekly recompiles are not uncommon. These weekly builds might be prohibitive for users with limited and precious compute time on target systems (e.g., DOE scientists), but with otherwise unlimited time on other systems where they can build MLIR. Hence the ability to generate LLVM IR on one machine and compile it with clang on a different one would prove useful.
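For pure host code that split workflow already works today; a minimal sketch (file names are illustrative):

# On the build machine: lower to LLVM IR, no CUDA or ROCm installation needed.
mlir-translate --mlir-to-llvmir host.mlir -o host.ll
# On the target system: produce the executable with clang.
clang -O3 host.ll -o host.exe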

Clang Offload

Clang can handle offloading to multiple target architectures, such as NVPTX and AMDGPU, with ease, automatically calling vendor tools like ptxas and even forwarding arguments to these tools, which makes the process extensible.

Clang currently has two offload drivers: the default one and --offload-new-driver. The new driver is not yet available on Windows / macOS; however, it has some important benefits:

  • Simpler to use.
  • It will (probably) eventually become the default (there are some issues to fix to support other platforms).
  • It's more powerful in its capabilities and provides a more uniform compilation process.
  • Capable of performing device LTO (see the sketch below).

For more information see Offloading Design.
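For instance, device LTO with the new driver is a single extra flag. The sketch below reuses the example from the next section; sm_70 is just an example architecture:

# Device LTO: device IR is linked and optimized as a whole before ptxas runs.
clang -fgpu-rdc --offload-new-driver -foffload-lto --offload-arch=sm_70 example.cu -lcudart -o example.exe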

Offloading example

// example.cu
__global__ void fill(int n, float a, float *x) {
  int i = threadIdx.x + blockIdx.x * blockDim.x;
  if (i < n)
    x[i] = a;
}
int main(void) {
  int n = 1 << 20;
  float *x;
  cudaMallocManaged(&x, n * sizeof(float));
  fill<<<(n + 255) / 256, 256>>>(n, 2.0f, x);
  cudaDeviceSynchronize();
  cudaFree(x);
}

New driver compilation process:

# Create host & device IR.
clang -fgpu-rdc --offload-new-driver --offload-device-only --offload-arch=sm_70 -o example_dev.bc -c -emit-llvm example.cu
clang -fgpu-rdc --offload-new-driver --offload-host-only -o example_host.bc -c -emit-llvm example.cu

# Compile everything together.
clang-offload-packager -o example_dev.bin --image=file=example_dev.bc,arch=sm_70,triple=nvptx64-nvidia-cuda,kind=cuda
clang -fgpu-rdc --offload-new-driver example_host.bc -Xclang -fembed-offload-object=example_dev.bin -O3 -lcudart -o example.exe

The last step can be replaced by:

# The output of this command will have host & device code, with the device code embedded as a constant, and some other necessary annotations.
clang -fgpu-rdc --offload-new-driver example_host.bc -Xclang -fembed-offload-object=example_dev.bin -O3 -c -emit-llvm -o example_full.bc
clang -fgpu-rdc --offload-new-driver example_full.bc -lcudart -o example.exe

The output of the first command (example_dev.bc) is plain LLVM IR device offload code. The output of the second command (example_host.bc) is LLVM IR with two additional annotations:

@.omp_offloading.entry_name = internal unnamed_addr constant [12 x i8] c"_Z4fillifPf\00"
@.omp_offloading.entry._Z4fillifPf = weak constant %struct.__tgt_offload_entry { ptr @_Z19__device_stub__fillifPf, ptr @.omp_offloading.entry_name, i64 0, i32 0, i32 0 }, section "cuda_offloading_entries", align 1

The output from embedding example_dev.bin into example_host.bc and producing example_full.bc just adds the following annotations:

@llvm.embedded.object = private constant [<num> x i8] c"...", section ".llvm.offloading", align 8, !exclude !0
@llvm.compiler.used = appending global [1 x ptr] [ptr @llvm.embedded.object], section "llvm.metadata"

The key step of transforming everything into an executable is performed under the hood by clang, which calls the clang-linker-wrapper tool. This tool compiles the device LLVM IR down to fatbin and generates the necessary code to register the kernels.
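To see the wrapper being scheduled, clang's -### flag prints the planned job commands without running them (exact flags vary by clang version):

# clang-linker-wrapper shows up as the final job of the link step.
clang -### -fgpu-rdc --offload-new-driver example_full.bc -lcudart -o example.exe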

Default driver compilation process:

It's more rigid: unlike the new driver, it cannot be invoked from a combination of host IR and device IR files. It requires either a CUDA input file or an LLVM IR file with many more target-specific annotations, including kernel registration code.
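In practice this means compiling straight from source; a sketch of a default-driver invocation for the same example (there is no supported way to resume from the split IR files shown above):

# Default driver: the host/device split, device compilation, and kernel
# registration all happen internally in a single invocation.
clang --offload-arch=sm_70 example.cu -lcudart -o example.exe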

Possible routes

From easiest to implement to hardest:

  1. Keep most of the existing infrastructure, but instead of serializing everything in one pass, have a pass that emits device LLVM IR, let the user manually compile that IR down to cubin, and create a standalone tool for embedding it into the host IR.
    • Drawbacks: no device LTO, the extensibility issues remain, and there is an extra tool to maintain.
  2. Create a new serialization pipeline, but instead of compiling down to binary, serialize to device bytecode, introduce clang-compatible offload annotations for the new driver, and let clang generate the executable.
    • Drawbacks: it would only work on Linux until the new clang offload driver becomes available on all platforms. No JIT. It requires a new runner library, as clang uses the CUDA runtime instead of the driver API.
  3. Do 2, but with the old driver.
    • Drawbacks: it is going to be deprecated once clang makes the switch, and the amount of extra IR to generate is considerable.

I'm inclined to go with option 2. It's relatively simple to implement, and in the long run it's just the better option: it provides more extensibility for generating device code, more optimization opportunities, and it is easier to maintain.
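As a rough end-to-end sketch of option 2 (the pass name --gpu-to-offload-llvm is hypothetical, not an existing flag):

# Emit device bytecode plus clang-compatible offload annotations instead of a
# device binary, then let clang's new driver produce the executable.
mlir-opt --convert-gpu-to-nvvm --gpu-to-offload-llvm --gpu-to-llvm input.mlir | \
  mlir-translate --mlir-to-llvmir -o input.bc
clang -fgpu-rdc --offload-new-driver input.bc -lcudart -o input.exe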

Note: anywhere CUDA or related terms appear, they can be swapped for HIP.
