Investigating OpenMP on macOS

Investigating the use of OpenMP on macOS for Python-related things

This assumes you have installed the command line tools on macOS. The first two sections look at installing OpenMP from scratch; however, on macOS it may be easier to simply use Homebrew (brew install libomp).

Install CMake

  1. Download source from https://github.com/Kitware/CMake/releases/download/v3.15.1/cmake-3.15.1.tar.gz (there may be a later release).
  2. Untar the files: tar xzvf cmake-3.15.1.tar.gz
  3. Go into the directory and build:
cd cmake-3.15.1
./bootstrap
make
sudo make install

By default cmake is installed in /usr/local/bin.

Install OpenMP

  1. Grab the source from the git repository: git clone https://github.com/llvm/llvm-project.git

  2. Enter the openmp directory and make a build directory:

cd llvm-project/openmp
mkdir build
cd build
  3. Configure with cmake (https://github.com/llvm-mirror/openmp/blob/master/README.rst):
cmake -DCMAKE_C_COMPILER=clang -DCMAKE_CXX_COMPILER=clang++ ..
  4. Then build:
make
sudo make install

By default libomp.dylib is installed in /usr/local/lib and the headers are installed in /usr/local/include. Both of these are standard search locations for the compiler and linker. If the library is installed somewhere else, you may need to set the DYLD_LIBRARY_PATH environment variable so that the dynamic loader can find it at runtime.
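For example (assuming a hypothetical non-standard install prefix of /opt/libomp):

export DYLD_LIBRARY_PATH="/opt/libomp/lib:$DYLD_LIBRARY_PATH"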

The https://openmp.llvm.org/ webpage states:

The runtime can be built with gcc, icc or clang. However, note that a runtime built with clang cannot be guaranteed to work with OpenMP code compiled by the other compilers, since clang does not support a 128-bit float type, and cannot therefore generate the code used for reductions of that type (which may occur in user code compiled by the other compilers).

  5. To make a static build (libomp.a) you have to set LIBOMP_ENABLE_SHARED:BOOL=FALSE in CMakeCache.txt.
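For example, the same cache variable can be set while configuring (passing -D on the cmake command line should be equivalent to editing CMakeCache.txt):

cmake -DCMAKE_C_COMPILER=clang -DCMAKE_CXX_COMPILER=clang++ -DLIBOMP_ENABLE_SHARED=FALSE ..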

An OpenMP example

Let's create a small program to check that OpenMP is installed and working. Save the following as example.c:

#include <stdio.h>
#include <omp.h>
 
int main() {
  #pragma omp parallel num_threads(3)
  {
    int id = omp_get_thread_num();
    int data = id;
    int total = omp_get_num_threads();
    printf("Greetings from process %d out of %d with Data %d\n", id, total, data);
  }
  printf("parallel for ends.\n");
  return 0;
}

Now compile (on macOS the gcc command is actually Apple clang, which is why -Xpreprocessor is needed to pass -fopenmp through to the preprocessor):

gcc -Xpreprocessor -fopenmp example.c -I /usr/local/include -L /usr/local/lib -lomp -o example.bin

Then run the program: ./example.bin. Output like the following is produced:

Greetings from process 0 out of 3 with Data 0
Greetings from process 1 out of 3 with Data 1
Greetings from process 2 out of 3 with Data 2
parallel for ends.

The output may vary a little on your machine.

A Cython example with OpenMP

omp_testing.pyx

from cython.parallel import prange
  
cdef int i
cdef int n = 30
cdef int sum = 0

for i in prange(n, nogil=True):
    sum += i

print(sum)

The following setup.py file can be used.

from distutils.core import setup
from distutils.extension import Extension
from Cython.Distutils import build_ext


ext_modules=[
    Extension("omp_testing",
              ["omp_testing.pyx"],
              libraries=[],
              extra_compile_args=[],
              extra_link_args=['-lomp']
              )
]

setup(
  name="omp_testing",
  cmdclass={"build_ext": build_ext},
  ext_modules=ext_modules
)

The following environment variables need to be set. I've not had success putting them in extra_compile_args.

export CFLAGS="-Xpreprocessor -fopenmp $CFLAGS"
export CXXFLAGS="-Xpreprocessor -fopenmp $CXXFLAGS"

The -Xpreprocessor flag allows the OpenMP pragmas to be processed by passing -fopenmp through to the preprocessor. I didn't need to specify the library location (/usr/local/lib) because it's a default search path for the linker.
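To build and try the extension (importing the module runs the module-level loop, so it should print 435, the sum of 0 to 29):

python setup.py build_ext --inplace
python -c "import omp_testing"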

There are also extra tests that you can try from the Cython codebase.

If you're experiencing problems running those extra tests, try adding -D_OPENMP to extra_compile_args. Several locations in the generated C code are guarded by that macro:

#ifdef _OPENMP
#pragma omp parallel num_threads(__pyx_t_2)
#endif /* _OPENMP */

OpenMP can cause crashes if more than one copy of the OpenMP runtime is used

  1. I tried running the cyreflect.pyx extension, which uses a prange loop. The extension is linked against the libomp.dylib built above (running otool -L cyreflect.*.so shows the linkage). When I tried plotting in Jupyter with matplotlib the Python kernel crashed with the following message:

OMP: Error #15: Initializing libiomp5.dylib, but found libomp.dylib already initialized. OMP: Hint This means that multiple copies of the OpenMP runtime have been linked into the program. That is dangerous, since it can degrade performance or cause incorrect results. The best thing to do is to ensure that only a single OpenMP runtime is linked into the process, e.g. by avoiding static linking of the OpenMP runtime in any library. As an unsafe, unsupported, undocumented workaround you can set the environment variable KMP_DUPLICATE_LIB_OK=TRUE to allow the program to continue to execute, but that may cause crashes or silently produce incorrect results. For more information, please see http://www.intel.com/software/products/support/.

The reason for this crash is that some functionality in either matplotlib or numpy uses the libiomp5.dylib library, which ships with numpy+MKL, and libomp.dylib and libiomp5.dylib conflict with each other. The crash disappears if you set the environment variable KMP_DUPLICATE_LIB_OK=TRUE. That unsupported, undocumented workaround sounds scary, given that it may silently produce incorrect results or cause crashes!
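If you do want to try the workaround, set it in the shell before starting Python or Jupyter:

export KMP_DUPLICATE_LIB_OK=TRUE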

  2. If the '-lomp' entry is removed from extra_link_args in the setup.py file, the extension still builds. Running otool -L cyreflect.*.so no longer shows a path for the omp linkage, and running nm -gC cyreflect.*.so indicates that there are unresolved symbols for OpenMP (e.g. U ___kmpc_barrier). When the extension is run again there is no longer a crash; the symbols required for the extension to run are presumably resolved from libiomp5.dylib, not /usr/local/lib/libomp.dylib.

Complications for building a Python extension with OpenMP

  1. Detection of OpenMP support: the setup.py file has to dynamically figure out whether the compiler and available libraries offer OpenMP support. Astropy has a helper package that can do that, and the modified version in sklearn looks a bit better. Either way it is a non-trivial amount of machinery.
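A minimal sketch of that compile-test detection, in the spirit of the astropy/sklearn helpers (the flag table below is an assumption; the real helpers are considerably more thorough):

import os
import sys
import tempfile
from distutils.ccompiler import new_compiler
from distutils.errors import CompileError, LinkError
from distutils.sysconfig import customize_compiler

TEST_CODE = """
#include <omp.h>
#include <stdio.h>
int main(void) {
    #pragma omp parallel
    printf("thread %d\\n", omp_get_thread_num());
    return 0;
}
"""

def get_openmp_flags():
    # candidate (compile_args, link_args) per platform -- an assumption,
    # real helpers probe several possibilities
    if sys.platform == "darwin":
        return ["-Xpreprocessor", "-fopenmp"], ["-lomp"]
    if sys.platform == "win32":
        return ["/openmp"], []
    return ["-fopenmp"], ["-fopenmp"]

def has_openmp():
    compile_args, link_args = get_openmp_flags()
    cc = new_compiler()
    customize_compiler(cc)
    start_dir = os.getcwd()
    with tempfile.TemporaryDirectory() as tmp:
        try:
            # compile and link a tiny OpenMP program in a scratch directory
            os.chdir(tmp)
            with open("test_openmp.c", "w") as f:
                f.write(TEST_CODE)
            objects = cc.compile(["test_openmp.c"],
                                 extra_postargs=compile_args)
            cc.link_executable(objects, "test_openmp",
                               extra_postargs=link_args)
        except (CompileError, LinkError):
            return False
        finally:
            os.chdir(start_dir)
    return True

print(has_openmp())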

  2. If MKL is used by numpy (or another package) then libiomp5.dylib will be available. If it is present there is no need to link an OpenMP library of your own during the build process; the required symbols will be found by the dynamic linker during execution. However, if libiomp5.dylib is not available, then you need to link an OpenMP library during the build, or have one available on the library search path during execution. Unfortunately, if you have linked your own OpenMP library and other packages use the libiomp5.dylib library as well, a crash will result. In this circumstance the KMP_DUPLICATE_LIB_OK environment variable has to be set, which is a bodge.

Unfortunately, when building and distributing wheels you have to include an OpenMP library in the wheel, because you don't know whether the libiomp5 library will be available. To bundle the OpenMP library, auditwheel can be used on Linux and delocate on macOS. When the package is used, the KMP_DUPLICATE_LIB_OK environment variable would have to be set on import, before multiple OpenMP libraries get initialised (a sketch follows this item). If there were some way of figuring out which libraries were present, and unloading one, that would be awesome.

Note that building OpenMP also creates a symlink /usr/local/lib/libiomp5.dylib, which points to /usr/local/lib/libomp.dylib. If you specify '-liomp5' in the setup.py file then the extension works with no crashes. It's unclear to me which library gets used during linking and during execution. If several packages bundle libomp, one could end up with many copies loaded at the same time. I wonder if it would make sense to make a PyPI package solely for the OpenMP runtime?
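As a sketch of that import-time workaround (mypackage is a hypothetical name), the first lines of mypackage/__init__.py could read:

# mypackage/__init__.py -- hypothetical package
import os

# must run before any extension that initialises an OpenMP runtime is
# imported; setdefault lets users override the value from their environment
os.environ.setdefault("KMP_DUPLICATE_LIB_OK", "TRUE")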

  3. What flags are required on different systems? On Linux one has to use -fopenmp, on Windows it's /openmp; what about other OSes? The helper code mentioned above deals with this.

  4. Is the linked OpenMP runtime fork-safe? See https://joblib.readthedocs.io/en/latest/parallel.html#bad-interaction-of-multiprocessing-and-third-party-libraries and https://codewithoutrules.com/2018/09/04/python-multiprocessing.

Some openmp runtimes aren’t fork-safe. Most notably, this includes gcc’s libgomp. Upon entering the first openmp parallel region, the runtime initializes a thread pool which won’t be rebuilt in the child after fork. This means that any parallel regions in the child will deadlock. Single threaded openmp loops seem to be safe though.

The difficulty here is that extension code may not know how it's going to be called. Will the user call it in a single process, or from a Pool? Perhaps it's safer to use spawn or forkserver when initialising a Pool; this can be done with multiprocessing.get_context, as shown below. Perhaps any functions that use OpenMP should default to a single thread unless the user specifically requests otherwise.
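A minimal sketch of the spawn approach (work is a placeholder for a function that calls into an OpenMP-parallelised extension):

import multiprocessing

def work(x):
    # placeholder: imagine this calls into an OpenMP-parallelised extension
    return x * x

if __name__ == "__main__":
    # "spawn" starts fresh interpreter processes, so no stale OpenMP thread
    # pool is inherited from the parent after a fork
    ctx = multiprocessing.get_context("spawn")
    with ctx.Pool(4) as pool:
        print(pool.map(work, range(8)))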

  5. Oversubscription of processors: if there is a hierarchical call structure and each layer is able to multithread, then it's important to avoid oversubscription. For example, consider A --> B --> C. If all of A, B and C can use OpenMP/multiprocessing/pthreads/MPI, then it's entirely possible that the processor becomes oversubscribed. For a given application it could make sense to move the parallelisation around: A could be parallelised over multiple nodes of a cluster using MPI, B distributed across the processors of a node using multiprocessing.Pool, and C left single threaded. However, if the calculation is done on a single CPU, it might make sense to parallelise C with OpenMP, with serial execution of A and B. This means that all layers of the application should offer fine-grained control over how they achieve parallelisation (see the sketch after the quote below). In sklearn they say:

On OSX, we can get a runtime error due to multiple OpenMP libraries loaded simultaneously. This can happen for instance when calling BLAS inside a prange. Setting the following environment variable allows multiple OpenMP libraries to be loaded. It should not degrade performances since we manually take care of potential over-subcription performance issues, in sections of the code where nested OpenMP loops can happen, by dynamically reconfiguring the inner OpenMP runtime to temporarily disable it while under the scope of the outer OpenMP parallel section.
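One way of exercising that sort of control from Python is the threadpoolctl package (pip install threadpoolctl), which can temporarily cap the thread count of whatever OpenMP/BLAS runtimes are loaded in the process. A sketch, assuming numpy is installed:

import numpy as np
from threadpoolctl import threadpool_limits

a = np.random.rand(2000, 2000)

# restrict all detected OpenMP/BLAS thread pools to a single thread while
# the outer layer handles the parallelism; limits are restored on exit
with threadpool_limits(limits=1):
    b = a @ a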
