
@mangecoeur
Last active January 9, 2024 16:04
Easy parallel python with concurrent.futures

As of version 3.3, Python includes the very promising concurrent.futures module, with elegant context managers for running tasks concurrently. Thanks to its simple and consistent interface you can use both threads and processes with minimal effort.

For most CPU-bound tasks - anything involving heavy number crunching - you want your program to use all the CPUs in your PC. The simplest way to run a CPU-bound task in parallel is to use the ProcessPoolExecutor, which will create enough sub-processes to keep all your CPUs busy.

We use the context manager thusly:

with concurrent.futures.ProcessPoolExecutor() as executor:
    result = executor.map(function, iterable)

The snippet above uses the map() function, which is by far the easiest way to split up a parallel task. It simply applies the function provided to each element in the iterable. The function could be complex and the iterable could be a list of large data tables - or even the same table repeated. For instance, if you want to run a function 10 times on the same data, you could do:

executor.map(fun, [data] * 10)
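As a minimal self-contained sketch of the pattern above (the choice of math.factorial as the mapped function is purely illustrative):

```python
import concurrent.futures
import math

numbers = list(range(10, 16))

# One worker process per CPU by default; each math.factorial call
# runs in a separate process, so all cores can be busy at once.
with concurrent.futures.ProcessPoolExecutor() as executor:
    results = list(executor.map(math.factorial, numbers))

print(results)  # [3628800, 39916800, 479001600, ...]
```

Note that executor.map returns an iterator, so wrap it in list() if you want the values immediately. On Windows (and on macOS with the default spawn start method) the executor code needs to live under an `if __name__ == "__main__":` guard, because each worker re-imports the main module.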

I’ve found this is one of the easiest and most reliable approaches. Functionally it appears to be equivalent to using the multiprocessing module in this way:

pool = multiprocessing.Pool()
pool.map(…)

However, when using this in combination with pandas I have run into odd, hard-to-track-down bugs where pool.map sometimes refused to work with an iterable longer than the number of processes - even though other times it worked fine. I haven't yet had this problem with concurrent.futures, and I suspect it's the result of some strange interaction between pandas data and the internal mechanisms used to pass Python objects between processes.

By using the ProcessPoolExecutor we are creating new processes and so circumventing the GIL (Global Interpreter Lock). We won't go into the GIL just now - basically it means you can't use threads for CPU-intensive parallel processing in the way you might in other languages.

You can use Threads via the ThreadPoolExecutor, like so:

with concurrent.futures.ThreadPoolExecutor() as executor:
    result = executor.map(function, iterable)

You just change the class name and you are good to go. Using threads you run into the GIL - your threads won't run truly in parallel. However, if you are doing IO-bound work (like reading files or accessing the network), these operations usually release the GIL, so in fact you will see a performance improvement. For example, say you have to make 1000 network requests: using concurrent.futures means you don't get stuck waiting on a few slow responses when you could be running all the others. Because you only have to change one bit of code, you can easily experiment to see which executor works best.
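As a sketch of the IO-bound case - using time.sleep to stand in for network latency, with placeholder URLs - ten "requests" complete in roughly the time of one, because a sleeping thread (like a thread blocked on real IO) releases the GIL:

```python
import concurrent.futures
import time

def fetch(url):
    # Stand-in for a network call; sleep, like real IO, releases the GIL
    time.sleep(0.2)
    return f"response from {url}"

urls = [f"https://example.com/{n}" for n in range(10)]

start = time.perf_counter()
with concurrent.futures.ThreadPoolExecutor(max_workers=10) as executor:
    responses = list(executor.map(fetch, urls))
elapsed = time.perf_counter() - start

print(f"{len(responses)} responses in {elapsed:.2f}s")  # ~0.2s, not ~2s
```

Running the same loop sequentially would take about ten times as long, since each request would wait for the previous one to finish.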

@HelloGrayson

+1 great little writeup :)

@pbreach

pbreach commented Sep 20, 2015

So launching threads is only truly parallel if the function being mapped releases the GIL, correct? Say if I were to use the nogil=True argument when using jit to compile a function in Numba or when writing a function in Cython for example. Would the function be operating truly in parallel in your threading example?

@phoustona

Looks like the answer is yes, from this thread:

https://groups.google.com/a/continuum.io/forum/#!msg/numba-users/hGsFferxCbI/1qTlWE2D4XIJ

"For multithreading, we have an example in the Numba manual:

http://numba.pydata.org/numba-doc/0.19.1/user/examples.html#multi-threading

To echo the note in the manual: concurrent.futures is very nice if you are using Python 3. It lets you apply a function (which you can write with Numba and pass the nogil=True option to @jit) to every element of an iterable."

@xoelop

xoelop commented Dec 4, 2018

Great post, thanks!!!

@ankitbhatia8993

How can I pass arguments to the called function?

@NameArtem

NameArtem commented Apr 10, 2019

@ankitbhatia8993 hi! For passing arguments you could use this way:

import concurrent.futures

def func(a, b):
    return a + b

# Process the rows in chunks in parallel; a and b are equal-length iterables
with concurrent.futures.ProcessPoolExecutor() as pool:
    result = list(pool.map(func, range(100), range(100), chunksize=10))

@areliai

areliai commented Jun 3, 2022

Just one argument of my function is iterable, How should I use it?
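One way to handle this (the function and names below are illustrative) is functools.partial, which freezes the fixed arguments so that executor.map only has to supply the one that varies:

```python
import concurrent.futures
from functools import partial

def scale(factor, offset, x):
    return factor * x + offset

# Freeze factor=3 and offset=1; map then varies only x
scale_by_3 = partial(scale, 3, 1)

with concurrent.futures.ThreadPoolExecutor() as executor:
    result = list(executor.map(scale_by_3, range(5)))

print(result)  # [1, 4, 7, 10, 13]
```

The same pattern works with ProcessPoolExecutor, since partial objects pickle as long as the underlying function is defined at module level.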
