Note to self
This works with Clang 12 on Manjaro, and may work on other Linux distributions. CUDA 11.1 is installed with pacman in /opt/cuda.
git clone https://github.com/llvm/llvm-project.git
This first build uses whatever C and C++ compilers are already on your system (I have GCC 10.2).
cd llvm-project
mkdir build
cd build
cmake ../llvm/ -DCMAKE_BUILD_TYPE=Release -DCLANG_OPENMP_NVPTX_DEFAULT_ARCH=sm_61 -DLIBOMPTARGET_NVPTX_COMPUTE_CAPABILITIES=61 -DLLVM_TARGETS_TO_BUILD="X86;NVPTX" -DLLVM_ENABLE_PROJECTS="clang;openmp"
make -j$(nproc)
Replace 61 with the compute capability of your GPU (e.g. 70 for a V100). Note that this will take forever.
The resulting Clang binary and OpenMP library can already be used, but compiling OpenMP offload code with them gives a warning like
clang-12: warning: No library 'libomptarget-nvptx-sm_61.bc' found in the default clang lib directory or in LIBRARY_PATH. Expect degraded performance due to no inlining of runtime functions on target devices. [-Wopenmp-target]
This is why I do the next step, called "bootstrapping".
Repeat the build procedure, but now using the compilers built in the previous step.
cd ..
mkdir build2
cd build2
CC=../build/bin/clang CXX=../build/bin/clang++ cmake ../llvm/ -DCMAKE_BUILD_TYPE=Release -DCLANG_OPENMP_NVPTX_DEFAULT_ARCH=sm_61 -DLIBOMPTARGET_NVPTX_COMPUTE_CAPABILITIES=61 -DLLVM_TARGETS_TO_BUILD="X86;NVPTX" -DLLVM_ENABLE_PROJECTS="clang;openmp" -DCMAKE_INSTALL_PREFIX=$HOME/.local
make -j$(nproc)
make install
Again, remember to replace 61. Also, pick an install prefix whose bin and lib directories are in your PATH and LD_LIBRARY_PATH. If you have other Clangs lying around, e.g. from pacman/apt, the new prefix should come first in those paths.
A stupid example:
#include <cstdio>

int main() {
    int N = 1e8;
    float *x = new float[N];
    float *y = new float[N];
    for (int i = 0; i < N; i++) x[i] = 1.0f; // initialize on the host

    #pragma omp target teams distribute parallel for map(to: x[0:N]) map(from: y[0:N])
    for (int i = 0; i < N; i++) {
        y[i] = 0.0f; // y is uninitialized on the device (map(from:)), so set it first
        for (int j = 0; j < 1000; j++) {
            y[i] += 3 * x[i];
        }
    }

    printf("%g\n", y[5]); // prints 3000
    delete[] x;
    delete[] y;
}
Compile for GPU:
clang++ -fopenmp -fopenmp-targets=nvptx64 -O3 run.cpp && time ./a.out
Compile for CPU (threads):
clang++ -fopenmp -O3 run.cpp && time ./a.out
This example runs roughly 6 times faster on my P2000 than on my i7-8850H. The difference increases with the number of inner repetitions as the data transfer becomes less important.
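To be sure the GPU timing isn't secretly a host fallback, the standard OpenMP 5.0 environment variable OMP_TARGET_OFFLOAD can force offloading (a sketch; Clang's libomptarget honors this variable):

```shell
# Abort at runtime if the target region cannot be offloaded to the GPU,
# instead of silently falling back to the CPU:
OMP_TARGET_OFFLOAD=MANDATORY ./a.out
```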