Parallel Python within the same process or hacking around the cursed GIL with a hand-rolled library loader

From its obscure beginnings in Amsterdam, the Python programming language has become a fundamental building block of our digital society. It is used literally everywhere and by everyone for a mind-bogglingly wide variety of tasks.

Python is also the lingua franca of Data Science, tying together tools for data loading, wrangling, analysis and AI. There is a massive ecosystem of contributed Python packages, which - for example - allows reading every obscure data format under the sun. This makes Python and its ecosystem extremely valuable for analytical data management systems: Users are likely somewhat familiar with Python due to its immense popularity and the ecosystem provides solutions for most data problems. As a result, Python is being integrated into SQL systems, typically through so-called User-Defined Functions (UDFs). For example, Apache Spark relies heavily on Python UDFs to express more complex scalar transformations during query processing.

How exactly Python is being integrated differs between systems. DuckDB for example supports scalar Python UDFs. Here is a simple if nonsensical example of a scalar DuckDB Python UDF that just adds 42 to an integer:

duckdb.create_function('add42', lambda x: x + 42, [duckdb.typing.BIGINT], duckdb.typing.BIGINT)

duckdb.sql('select add42(range) from range(5)')
┌────────────────┐
│ add42("range") │
│     int64      │
├────────────────┤
│             42 │
│             43 │
│             44 │
│             45 │
│             46 │
└────────────────┘

Unfortunately, despite its popularity, Python is not without some warts. One of the bigger problems in Python today is the harmless-sounding Global Interpreter Lock (GIL). The GIL protects access to Python's internal data structures such that only one thread can interact with the Python interpreter at a given time. DuckDB in general automatically parallelizes queries across many hardware threads for greatly improved performance. But because of the GIL, queries that include a Python UDF will be forced to use only a single thread.
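
To see this serialization from C, here is a minimal sketch, assuming a standard CPython installation with its C API headers available (link with python3-config --embed --ldflags): two threads embed the same interpreter, but each must hold the single GIL while it runs Python code, so the CPU-bound work runs one thread after the other instead of in parallel.

#include <Python.h>
#include <pthread.h>

static void* worker(void* arg) {
    // each thread must acquire the one process-wide GIL before touching the interpreter
    PyGILState_STATE gs = PyGILState_Ensure();
    PyRun_SimpleString("sum(range(10_000_000))"); // CPU-bound, holds the GIL throughout
    PyGILState_Release(gs);
    return NULL;
}

int main() {
    Py_InitializeEx(0);
    // release the GIL held by the main thread so the workers can take turns with it
    PyThreadState* ts = PyEval_SaveThread();

    pthread_t t1, t2;
    pthread_create(&t1, NULL, worker, NULL);
    pthread_create(&t2, NULL, worker, NULL);
    pthread_join(t1, NULL);
    pthread_join(t2, NULL);
    // despite two OS threads, the two workers executed serially

    PyEval_RestoreThread(ts);
    Py_FinalizeEx();
    return 0;
}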

Other systems like Apache Spark take a different architectural approach: they launch many additional processes (not threads) that each run their own Python interpreter. This way, there are many instances of the lock and parallelism can be achieved. However, running the Python code in external processes requires us to ship the function's input to the separate process and the result back to the database process. It is also problematic for an unobtrusive in-process system like DuckDB to suddenly launch additional operating system processes.

While there are some attempts to get rid of the Python GIL in general, their timelines reach many years into the future, and even then we would have to wait for widespread adoption of the new Python versions. Overall, we end up with a pretty unfortunate situation: running Python threads in parallel in a single process is effectively impossible because of the GIL, and using multiple processes creates a lot of overhead because data has to be shipped around.

But what if the GIL was not that global after all? To explore this question, we have to dive a bit deeper into how modern operating systems handle dynamic libraries. Say we would like to run some Python code from C. Below is a snippet that achieves this using dlopen() and dlsym(), the run-time interface to the dynamic linker.

#include <dlfcn.h>

int main() {
    // open the Python runtime dynamic library (the path will differ per system)
    void* lib_hdl = dlopen("/opt/homebrew/opt/python@3.11/Frameworks/Python.framework/Versions/3.11/Python", RTLD_NOW);

    // find the function pointers, casting from the void* that dlsym() returns
    void (*initialize)(int)  = (void (*)(int))dlsym(lib_hdl, "Py_InitializeEx");
    int  (*run)(const char*) = (int (*)(const char*))dlsym(lib_hdl, "PyRun_SimpleString");
    int  (*finalize)(void)   = (int (*)(void))dlsym(lib_hdl, "Py_FinalizeEx");

    // run some Python code
    initialize(0);
    run("print('Hello, World')");
    finalize();
    return 0;
}

Of course, your path to the Python dynamic library will be different, but this should just compile and print "Hello, World". Under the hood, dlopen() will instruct the operating system loader to load the Python runtime library into the process's virtual address space. Then, dlsym() will find the addresses of Python C API functions like PyRun_SimpleString, and we can call those functions directly.

It is important to understand that while dlopen() returns an opaque handle to pass to dlsym(), it will internally only load a library once per process. While it is perfectly possible to launch ten threads and call dlopen() in each of them, the operating system will load the library (and all its dependencies) into the process exactly once. There is a good reason for this: in general, we do not want multiple copies of the same library in the same address space. But in our case, it also means that the GIL exists exactly once per process, which leads to the lack of parallelism described above.
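
We can observe this behavior directly. The following sketch, reusing the library path from above, opens the same library twice and compares what comes back:

#include <dlfcn.h>
#include <stdio.h>

int main() {
    const char* path = "/opt/homebrew/opt/python@3.11/Frameworks/Python.framework/Versions/3.11/Python";
    void* a = dlopen(path, RTLD_NOW);
    void* b = dlopen(path, RTLD_NOW);
    // the second dlopen() only bumps a reference count on the already-loaded library
    printf("same handle: %d\n", a == b); // prints 1
    printf("same symbol: %d\n",
           dlsym(a, "PyRun_SimpleString") == dlsym(b, "PyRun_SimpleString")); // prints 1
    return 0;
}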

But there is no reason to use the operating-system provided loader in the first place. After all, dynamic libraries just contain binary code that the operating system maps into the virtual address space. How hard could it possibly be?

First of all, each operating system has a different format for binaries: Windows uses the Portable Executable (PE) format, Linux has its aptly named Executable and Linking Format (ELF), and macOS uses something called Mach-O. They are all very similar; we will use Mach-O as the example in the following. As a gross oversimplification, a Mach-O dynamic library contains a number of symbols, for example PyRun_SimpleString, which can be either functions or global variables (like the GIL). Functions can refer to internal functions and globals, or to functions and globals from other libraries.
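
To get a feel for the format, this small sketch reads the Mach-O header at the start of the Python library and prints a few of its fields. It assumes a thin (single-architecture) build; a universal "fat" binary would instead start with a fat header that points at per-architecture Mach-O slices.

#include <mach-o/loader.h>
#include <stdio.h>

int main() {
    FILE* f = fopen("/opt/homebrew/opt/python@3.11/Frameworks/Python.framework/Versions/3.11/Python", "rb");
    if (!f) return 1;
    // the fixed-size header sits at offset 0, followed by ncmds load commands
    struct mach_header_64 hdr;
    fread(&hdr, sizeof(hdr), 1, f);
    printf("magic: 0x%x, load commands: %u, flags: 0x%x\n", hdr.magic, hdr.ncmds, hdr.flags);
    fclose(f);
    return 0;
}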

For internal references, we cannot directly use memory addresses because the library will not be loaded at the same base address every time. Instead, modern libraries use position-independent code with relative addresses for internal references, e.g. "read from the global variable 1254 instructions ahead" instead of "read from the global variable at address 5434234234". This way, the binary can be loaded at any address and will still run fine.
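
The arithmetic is worth spelling out. This toy sketch, with made-up offsets, mimics what a relative reference does: the distance between an instruction and "its" global is fixed at link time, while the base address is only chosen at load time.

#include <stdint.h>
#include <stdio.h>

int main() {
    uint8_t image[4096];                        // stand-in for the mapped library
    uintptr_t base      = (uintptr_t)image;     // picked by the loader at run time
    uintptr_t instr     = base + 0x100;         // some instruction in the code section
    intptr_t  offset    = 0x800;                // "the global lives 0x800 bytes ahead"
    uintptr_t global_at = instr + offset;       // correct wherever the image landed
    printf("base %p -> global at %p\n", (void*)base, (void*)global_at);
    return 0;
}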

For references to other libraries, the linker that produced the dynamic library does not know where in the address space those symbols will end up, because this is decided by the operating system at runtime. The locations are also randomized on purpose to make it harder to exploit bugs. This is why the loader will first load a library's dependencies, and then "fix up" a lookup table with the real addresses of the symbols that live in other libraries.

It turns out this whole process can be implemented from user space: we need mmap() to place the library binary in the address space, we have to perform the required fixups, and we mark the executable parts as such using mprotect(). This means we can completely bypass the system-provided dlopen() and its (for us) unhelpful behavior of loading the same library only once. We can "just" implement our own loader and put the entire Python runtime and its dependencies into memory multiple times. While this is somewhat wasteful of address space, it does not actually require significant additional physical memory: the memory-mapped libraries are mostly unchanged, which means the operating system can use copy-on-write semantics and keep sharing the unmodified pages.
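
Here is a heavily simplified skeleton of such a loader, not DuckDB's actual implementation: my_dlopen() is our hypothetical replacement, and apply_fixups() is a stand-in (left as a comment) for the real relocation logic, which would walk the load commands, map each segment at its proper offset, and resolve every external symbol.

#include <fcntl.h>
#include <stdio.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <unistd.h>

void* my_dlopen(const char* path, int mode) {
    (void)mode;
    int fd = open(path, O_RDONLY);
    struct stat st;
    fstat(fd, &st);

    // 1. place the library image in our address space; MAP_PRIVATE gives
    //    copy-on-write pages, so pages we never modify stay shared
    void* image = mmap(NULL, st.st_size, PROT_READ | PROT_WRITE, MAP_PRIVATE, fd, 0);
    close(fd);

    // 2. rewrite the lookup tables with the real addresses of external
    //    symbols: apply_fixups(image);  (hypothetical relocation helper)

    // 3. only then mark the code pages executable, e.g.:
    //    mprotect(text_start, text_size, PROT_READ | PROT_EXEC);
    return image;
}

int main() {
    const char* path = "/opt/homebrew/opt/python@3.11/Frameworks/Python.framework/Versions/3.11/Python";
    // unlike dlopen(), calling our loader twice yields two distinct images
    printf("copy 1: %p\ncopy 2: %p\n", my_dlopen(path, 0), my_dlopen(path, 0));
    return 0;
}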

Of course, Python itself also liberally uses dlopen() to load the native-code parts of its various built-in and contributed modules. We do not want to use the operating system's loader for those libraries either; they might also contain globals that would otherwise hamper our parallelism. But since we control the loading process, we can patch the dlopen() calls made by Python to refer to our own version, and so on recursively. This allows us to completely control the loading process all the way down the explicit (declared in the library) and implicit (via dlopen) dependency chain.
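
Conceptually, the recursive trick looks like this sketch: while fixing up a freshly loaded library, any reference it has to "dlopen" is bound to our loader instead of the system one, so Python's extension modules get private copies too. resolve_import() is a hypothetical helper that the fixup code would call once per imported symbol.

#include <dlfcn.h>
#include <string.h>

void* my_dlopen(const char* path, int mode); // our loader from the sketch above

void* resolve_import(const char* symbol_name) {
    if (strcmp(symbol_name, "dlopen") == 0) {
        return (void*)my_dlopen;             // redirect into our loader, recursively
    }
    return dlsym(RTLD_DEFAULT, symbol_name); // everything else: regular lookup
}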

While implementing your own loader might sound dangerous and error-prone, Apple has thankfully open-sourced their loader implementation. This means we can re-use large portions of the quite intricate relocation code. We have implemented an experimental proof-of-concept parallel_python extension for DuckDB that ships with the hacked loader.

The extension makes use of DuckDB's morsel-driven parallelism model, where queries are separated into individual streaming pipelines. Each pipeline is then parallelized over a number of worker threads using a work-stealing approach. For scalar Python functions, execution takes place in a physical projection operator. In DuckDB, each physical operator can have two states: a global state that exists from the time the operator is created as part of a physical query plan, and a local state that is local to each thread. We use the local state to load a thread-local Python interpreter. Then, during execution of the scalar Python function, we can use this fully isolated interpreter to run our Python code without suffering from the dreaded GIL.
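
The rough shape of this, sketched in C rather than DuckDB's actual C++ operator API, might look as follows; GlobalState, LocalState, and my_dlsym() are hypothetical names standing in for the real extension code and for symbol lookup inside our private copy.

void* my_dlopen(const char* path, int mode);    // hypothetical loader from above
void* my_dlsym(void* handle, const char* name); // hypothetical lookup in our copy

typedef struct {
    const char* python_expression;  // e.g. "x + 42", shared across all threads
} GlobalState;

typedef struct {
    void* python_lib;               // a private copy of the Python runtime
    int (*run)(const char*);        // PyRun_SimpleString resolved in *this* copy
} LocalState;

LocalState make_local_state(const char* python_library_path) {
    LocalState ls;
    // every worker thread maps its own Python image, each with its own GIL
    ls.python_lib = my_dlopen(python_library_path, 0);
    ls.run = (int (*)(const char*))my_dlsym(ls.python_lib, "PyRun_SimpleString");
    return ls;
}

Using the proof-of-concept extension, we can run the following SQL code: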

-- point the extension to our Python library
SET parallel_python_library='/opt/homebrew/Cellar/python@3.11/3.11.5/Frameworks/Python.framework/Versions/3.11/Python';

-- create a table with 100M integers
CREATE TABLE t AS FROM generate_series(100000000);

-- Use Python to add 42 to each value in t
SELECT min(python_integer_function('x + 42', generate_series)) FROM t;

On my ten-core MacBook, this query finishes in ca. 30 seconds, compared to ca. 250 seconds when using a single thread. This shows that the ten Python interpreters are not bound by a shared GIL, and we can scale the computation without relying on external processes.

@sungchun12

This is very very very cool. You really opened up my imagination to how fast python can actually be when you dig deep!

@bmorphism

thank you! @sungchun12 yes! and combined with this, which opened my eyes to just how formal a subset of Python can be, it makes the Mojo decision more understandable: https://docs.discopy.org/en/main/api/semantics.html
