@perrygeo · May 21, 2017
Parallelism and Concurrency in Python

There should be one-- and preferably only one --obvious way to do it.

-- Tim Peters, Zen of Python

When we need our programs to run faster, the first place we often look is parallelism and concurrency. By doing more things at once, we hope for significant speed gains. Python has a number of techniques optimized for different use cases; there is no one obvious way to do it! So how do you decide the correct approach?

Here's my quick take on the primary tools for parallel execution in Python, their strengths and weaknesses. I'll focus mainly on the tools in the standard library, but I'll touch on a few third-party modules that provide interesting features.

Background

The difference between parallel and concurrent is subtle but important. Both refer to running multiple things at the same time. They both imply a non-linear flow to your program, usually with the goal of making the program as a whole faster.

Parallel code is actually doing multiple things at once. Think about stirring soup while simultaneously flipping a burger with the other hand. This often applies to CPU-bound tasks.

Concurrent code is doing a single thing while waiting on other things. To extend the cooking analogy, you can be making eggs while you wait on your coffee. This applies to IO-bound tasks such as making network requests.

Side note: The GIL

You hear a lot of talk about the infamous Python global interpreter lock, known by its pleasant acronym: the GIL. Much has been said on the GIL [TODO]

For our purposes, the GIL means that a single Python process can only execute one thread at a time. Other threads can be waiting, just not running code. Therefore a single Python process is generally capable only of achieving concurrency, not parallelism.
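
To see it in action, here's a quick sketch (the function and numbers are just for illustration) that runs a CPU-bound loop twice, first sequentially and then in two threads:

import time
import threading

def count(n):
    # pure-Python busy loop: CPU-bound, holds the GIL while it runs
    while n > 0:
        n -= 1

N = 10000000

start = time.time()
count(N)
count(N)
print("sequential:", time.time() - start)

start = time.time()
threads = [threading.Thread(target=count, args=(N,)) for _ in range(2)]
for t in threads:
    t.start()
for t in threads:
    t.join()
print("threaded:", time.time() - start)

On CPython the threaded version typically takes about as long as the sequential one, and often longer, because the two threads take turns holding the GIL rather than running in parallel.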

Python tools for concurrency and parallelism

threading

The Python threading module in the standard library provides everything you need to build concurrent applications. Raw threads are familiar to programmers coming from C and other languages that use real operating system threads. The interface is also very low level, and it's easy to make mistakes that result in Heisenbugs. Recommended only if your problem is IO-bound and you can confidently look in the mirror and say "I enjoy writing thread-safe code that is provably correct!"
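
For reference, here's a minimal sketch of the raw API (the URL list and fetch function are hypothetical, and it assumes the third-party requests library):

import threading
import requests

urls = ["http://example.com"] * 3  # hypothetical work items

results = []
lock = threading.Lock()

def fetch(url):
    # IO-bound: the GIL is released while waiting on the network
    resp = requests.get(url)
    with lock:  # guard shared state; skipping locks is how Heisenbugs happen
        results.append((url, resp.status_code))

threads = [threading.Thread(target=fetch, args=(url,)) for url in urls]
for t in threads:
    t.start()
for t in threads:
    t.join()  # wait for every thread to finish
print(results)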

multiprocessing

The multiprocessing module gives you the ability to create separate Python processes for achieving parallelism. These processes can communicate through various mechanisms; in some ways, it provides a threading-like interface for handling processes. With the potential for shared state, locks, message passing, etc., it can get complicated, and you can shoot yourself in the foot in much the same ways as with threads. But it works well when your workload is CPU-bound and you need to spread it across multiple cores, especially if your problem is embarrassingly parallel.
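
A minimal sketch with multiprocessing.Pool (the worker function is hypothetical):

import multiprocessing

def lengthy_calculation(n):
    # CPU-bound work; each worker process has its own interpreter
    # and its own GIL, so the calls truly run in parallel
    return sum(i * i for i in range(n))

if __name__ == "__main__":  # required so child processes can import this module safely
    with multiprocessing.Pool(processes=4) as pool:
        results = pool.map(lengthy_calculation, [10 ** 6] * 8)
    print(results)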

asyncio

Asyncio provides an entirely new paradigm for concurrent programming. It uses Python 3-specific syntax to define asynchronous functions and run them in an explicit event loop. The primary improvement is that the functions read more like linear code and are arguably easier to test and reason about than their threading equivalents. I would consider asyncio a low-level framework for building applications where async is core to the problem (networking libraries, socket servers, etc.). One of the downsides is that you need to buy in to asyncio fully; most popular IO libraries are synchronous (blocking) by nature, and they need to be replaced wholesale by their asyncio equivalents. That ecosystem is still evolving and best practices are hard to come by.
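
A minimal sketch of the syntax (Python 3.5+; asyncio.sleep stands in for a real non-blocking request, since a blocking call like requests.get would stall the whole event loop):

import asyncio

async def fetch(url):
    # pretend this is a non-blocking network call, e.g. via aiohttp
    await asyncio.sleep(1)
    return url

async def main():
    urls = ["http://example.com/%d" % i for i in range(10)]
    # run all ten coroutines concurrently on a single thread
    results = await asyncio.gather(*(fetch(u) for u in urls))
    print(len(results), "responses in about 1 second, not 10")

loop = asyncio.get_event_loop()
loop.run_until_complete(main())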

concurrent.futures executors

Providing a high-level interface over both threading and multiprocessing, Executors give us the ability to solve common parallelism and concurrency problems with a nice abstraction. (The examples below assume import requests and from concurrent import futures, with urls being a list of URL strings.) Just as we can use the built-in map to apply a synchronous function to every element of a sequence:

responses = map(requests.get, urls)  # not parallel; one request at a time

We can do the same thing with ThreadPoolExecutor.map, concurrently across threads:

with futures.ThreadPoolExecutor(max_workers=20) as executor:
    responses = executor.map(requests.get, urls)

or with ProcessPoolExecutor.map, in parallel across processes:

with futures.ProcessPoolExecutor(max_workers=4) as executor:
    results = executor.map(lengthy_calculation, responses)

You don't get as much fine-grained control as with the lower-level primitives, but Executors work very well for the majority of common tasks that could benefit from either parallel or concurrent execution.
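
And when you do want slightly finer control, say handling each result as soon as it's ready, the same module has futures.as_completed. A sketch, reusing the hypothetical urls list from above:

from concurrent import futures
import requests

with futures.ThreadPoolExecutor(max_workers=20) as executor:
    # submit returns a Future immediately; as_completed yields futures
    # in completion order, not submission order
    pending = {executor.submit(requests.get, url): url for url in urls}
    for future in futures.as_completed(pending):
        url = pending[future]
        try:
            print(url, future.result().status_code)
        except Exception as exc:
            print(url, "failed:", exc)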

Releasing the GIL

By writing intensive code in a C extension module, you can explicitly release the GIL and use C-level threading to exploit multiple cores for true parallelism. The downside is that none of your calculations can touch any Python object; everything happens in C land. Of course this means you need to write (or generate through something like Cython) safe multithreaded C code, not a task to be taken lightly.
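
For a taste, here's a Cython sketch (a hypothetical compute.pyx, assuming it's compiled with OpenMP support) where prange releases the GIL and splits the loop across cores:

# compute.pyx -- hypothetical; build with OpenMP enabled
from cython.parallel import prange

def sum_squares(double[:] data):
    cdef double total = 0.0
    cdef Py_ssize_t i
    # nogil=True drops the GIL for the loop; the body must not
    # touch any Python objects, only C-level types
    for i in prange(data.shape[0], nogil=True):
        total += data[i] * data[i]
    return total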

Distributed computing

Some applications (think dedicated data-crunching pipelines) require more than one computer to operate effectively. The learning curve from running an application locally to managing multiple nodes in a distributed system is steep. But by running your code on multiple machines, you can effectively replicate the multiprocessing approach across huge clusters of processing power.

Queue-based systems are popular: you place tasks in a job queue to be picked up later by the next available worker, so parallelism is limited only by how many workers you have available. Systems like Celery are a mature solution when each worker is just a Python function. Something like ecs-watchbot is good if you're in the AWS ecosystem and want to run workers as Docker containers. Services like AWS Lambda plus Simple Queue Service (SQS) provide the building blocks for a queue-based distributed system without having to manage server infrastructure.
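
For a flavor of the queue-based style, here's a minimal Celery sketch (the broker URL and the task body are assumptions on my part):

# tasks.py -- assumes a Redis broker running at the given URL
from celery import Celery

app = Celery("tasks", broker="redis://localhost:6379/0")

@app.task
def process(record):
    # whatever per-task work your pipeline needs (hypothetical)
    return len(record)

Start any number of workers with celery -A tasks worker, then enqueue jobs from your application with process.delay("some payload"); the next available worker picks each one up.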

There are more advanced approaches to distributed scheduling that are optimized for larger systems. Airflow lets you express your workflow as a directed acyclic graph (DAG) written in Python. Dask is a distributed scheduler optimized for interactive use, e.g. scientific computing in a Jupyter notebook.

Abstractions over the standard lib

Now we're really delving into the third-party modules. A search on PyPI reveals a number of modules for doing concurrent/parallel execution [TODO rough count]. I won't pretend to have done an exhaustive search, and I wouldn't discourage anyone from looking at these, but...

  • The abstractions tend to be pretty leaky. In practice you'll often need to fork the source code and tweak the underlying details to get optimal performance.

  • The quality, testing, and support vary widely.

  • As a high-level abstraction for simple concurrency and parallelism, I haven't found one that works quite as well as concurrent executors.

In other words, if your use case is too tricky for Executors, you might be better off starting with the stdlib primitives.

That being said, python-pmap and multitasking look interesting.
