{
"cells": [
{
"cell_type": "markdown",
"metadata": {
"nbpresent": {
"id": "89b0588a-2716-4250-8b76-f4a6e1a55573"
}
},
"source": [
"# High-Performance Python, or When to Write For Loops in Python\n",
"Author: Lehman Garrison ([lgarrison.github.io](https://lgarrison.github.io))\n",
"\n",
"Last updated: 7/16/2017"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"An introduction to a few modern Python libraries for high-performance numerical operations."
]
},
{
"cell_type": "markdown",
"metadata": {
"nbpresent": {
"id": "eb59e0f8-9878-43e4-ac8e-37c202e96d5c"
}
},
"source": [
"## Numexpr\n",
"https://github.com/pydata/numexpr\n",
"\n",
"Numexpr helps accelerate simple, one-line NumPy expressions. It mainly does so by avoiding temporary arrays, but it's also has great parallelism and vectorization support.\n",
"\n",
"Unfortunately, only a subset of NumPy is presently supported. The feature set will be greatly expanded in [Numexpr 3.0](https://github.com/pydata/numexpr/tree/numexpr-3.0), but it's still in development."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Example: converting a density field to a density contrast"
]
},
{
"cell_type": "code",
"execution_count": 1,
"metadata": {
"collapsed": false
},
"outputs": [],
"source": [
"import numpy as np\n",
"import numexpr as ne\n",
"ne.set_num_threads(10);"
]
},
{
"cell_type": "code",
"execution_count": 2,
"metadata": {
"collapsed": false
},
"outputs": [],
"source": [
"rho = np.empty((512,512,512), dtype=np.float32)\n",
"rho[:] = np.random.random(rho.shape)\n",
"rho_mean = rho.mean(dtype=np.float64).astype(np.float32) # Use double precision for intermediate accumulations"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"The NumPy way:"
]
},
{
"cell_type": "code",
"execution_count": 3,
"metadata": {
"collapsed": false
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"1 loop, best of 3: 1.61 s per loop\n"
]
}
],
"source": [
"%%timeit\n",
"delta = np.exp((rho/rho_mean - 1.)**2.)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"The Numexpr way:"
]
},
{
"cell_type": "code",
"execution_count": 4,
"metadata": {
"collapsed": false
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"1 loop, best of 3: 403 ms per loop\n"
]
}
],
"source": [
"%%timeit\n",
"delta = ne.evaluate('exp((rho/rho_mean - 1.)**2.)')"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"We were using 10 cores. Did our speedup come from multi-threading or loop-blocking/vectorization?"
]
},
{
"cell_type": "code",
"execution_count": 5,
"metadata": {
"collapsed": false
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"1 loop, best of 3: 329 ms per loop\n",
"1 loop, best of 3: 2.63 s per loop\n"
]
}
],
"source": [
"ne.set_num_threads(10)\n",
"%timeit ne.evaluate('exp((rho/rho_mean - 1.)**2.)')\n",
"ne.set_num_threads(1)\n",
"%timeit ne.evaluate('exp((rho/rho_mean - 1.)**2.)')"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Wait, what happened? Why is single-threaded NumExpr slower than NumPy?"
]
},
{
"cell_type": "code",
"execution_count": 6,
"metadata": {
"collapsed": false
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"float32\n",
"float64\n"
]
}
],
"source": [
"np_delta = np.exp((rho/rho_mean - 1.)**2.)\n",
"ne_delta = ne.evaluate('exp((rho/rho_mean - 1.)**2.)')\n",
"print np_delta.dtype\n",
"print ne_delta.dtype"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"NumExpr changed the computation to double precision while we weren't looking! Floats like `1.` are always interpreted as `double`s (in Python and NumExpr), and NumExpr uses the highest precision of any operand (in symmetric binary operators, but not `**` for some reason). This is in contrast to NumPy, which respects the `dtype` of the array operands.\n",
"\n",
"There are two solutions:\n",
"- define an intermediate variable like `one = np.float32(1.)`\n",
"- use an integer `1`"
]
},
{
"cell_type": "code",
"execution_count": 7,
"metadata": {
"collapsed": false
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"float32\n"
]
}
],
"source": [
"ne_delta = ne.evaluate('exp((rho/rho_mean - 1)**2.)')\n",
"print ne_delta.dtype"
]
},
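{
"cell_type": "markdown",
"metadata": {},
"source": [
"The intermediate-variable approach works just as well (a quick sketch; `one` is our own name, which `ne.evaluate` picks up from the calling frame):"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": false
},
"outputs": [],
"source": [
"# Using a float32 constant keeps NumExpr from upcasting the expression\n",
"one = np.float32(1.)\n",
"ne_delta = ne.evaluate('exp((rho/rho_mean - one)**2.)')\n",
"print ne_delta.dtype"
]
},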
{
"cell_type": "code",
"execution_count": 8,
"metadata": {
"collapsed": false
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"1 loop, best of 3: 198 ms per loop\n",
"1 loop, best of 3: 1.5 s per loop\n"
]
}
],
"source": [
"ne.set_num_threads(10)\n",
"%timeit ne.evaluate('exp((rho/rho_mean - 1)**2.)')\n",
"ne.set_num_threads(1)\n",
"%timeit ne.evaluate('exp((rho/rho_mean - 1)**2.)')"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"So even in the single-threaded case, we got a 10% speedup. NumExpr achieves this by doing the division and subtraction on small blocks of the array rather than doing the divison on the whole array then a second pass to do the subtraction."
]
},
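{
"cell_type": "markdown",
"metadata": {},
"source": [
"To see what loop-blocking means, here's the same computation done by hand in chunks (an illustrative sketch only; NumExpr does this internally with compiled, multi-threaded loops rather than a Python-level one):"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": false
},
"outputs": [],
"source": [
"# Manual loop-blocking sketch: process the array in cache-sized chunks,\n",
"# so the intermediate (chunk/rho_mean - 1) never materializes in full\n",
"flat = rho.reshape(-1)\n",
"result = np.empty_like(flat)\n",
"blocksize = 4096\n",
"for start in range(0, flat.size, blocksize):\n",
"    chunk = flat[start:start+blocksize]\n",
"    result[start:start+blocksize] = np.exp((chunk/rho_mean - 1)**2.)\n",
"result = result.reshape(rho.shape)"
]
},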
{
"cell_type": "markdown",
"metadata": {},
"source": [
"NumExpr also supports an `out` keyword arugment for true in-place operations, which can be useful for large arrays:"
]
},
{
"cell_type": "code",
"execution_count": 9,
"metadata": {
"collapsed": false
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"NumPy:\n",
"\t1 loop, best of 3: 1.18 s per loop\n",
"NumPy inplace:\n",
"\t1 loop, best of 3: 401 ms per loop\n",
"NumExpr:\n",
"\t1 loop, best of 3: 1.23 s per loop\n",
"NumExpr inplace:\n",
"\t1 loop, best of 3: 541 ms per loop\n"
]
}
],
"source": [
"rho_double = np.random.random((768,768,768))\n",
"\n",
"rho = rho_double.copy()\n",
"print 'NumPy:\\n\\t',\n",
"%timeit np_rho2 = rho**2\n",
"\n",
"rho = rho_double.copy()\n",
"print 'NumPy inplace:\\n\\t',\n",
"%timeit global rho; rho **= 2\n",
"\n",
"rho = rho_double.copy()\n",
"print 'NumExpr:\\n\\t',\n",
"ne.set_num_threads(1)\n",
"%timeit ne_rho2 = ne.evaluate('rho**2')\n",
"\n",
"rho = rho_double.copy()\n",
"print 'NumExpr inplace:\\n\\t',\n",
"ne_inplace_rho2 = rho\n",
"%timeit ne.evaluate('rho**2', out=ne_inplace_rho2)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"NumExpr is actually slower than NumPy in this simple case; it's possible that the NumExpr overhead isn't worth it for this simple operation. But if we were to switch to multi-threaded we would see a speed-up:"
]
},
{
"cell_type": "code",
"execution_count": 10,
"metadata": {
"collapsed": false
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"NumExpr inplace, multi-threaded:\n",
"\t10 loops, best of 3: 167 ms per loop\n"
]
}
],
"source": [
"ne.set_num_threads(10)\n",
"rho = rho_double.copy()\n",
"print 'NumExpr inplace, multi-threaded:\\n\\t',\n",
"ne_inplace_rho2 = rho\n",
"%timeit ne.evaluate('rho**2', out=ne_inplace_rho2)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Example: many operations\n",
"A quick example of an extreme speedup with math-heavy operations using numexpr"
]
},
{
"cell_type": "code",
"execution_count": 11,
"metadata": {
"collapsed": false
},
"outputs": [],
"source": [
"rho = np.empty((512,512,512), dtype=np.float32)\n",
"rho[:] = np.random.random(rho.shape)"
]
},
{
"cell_type": "code",
"execution_count": 12,
"metadata": {
"collapsed": false
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"1 loop, best of 3: 11.9 s per loop\n",
"1 loop, best of 3: 2.51 s per loop\n",
"1 loop, best of 3: 318 ms per loop\n"
]
}
],
"source": [
"%timeit (np.sin(rho**2) + np.cos(rho**3))**.5\n",
"ne.set_num_threads(1)\n",
"%timeit ne.evaluate('(sin(rho**2) + cos(rho**3))**.5')\n",
"ne.set_num_threads(10)\n",
"%timeit ne.evaluate('(sin(rho**2) + cos(rho**3))**.5')"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Numba\n",
"http://numba.pydata.org/"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Numba compiles Python functions into native code at the moment you call the function --- this is called \"just-in-time\" compilation. The resulting code can often match native C code while retaining Python flexibility *and being written in pure Python*.\n",
"\n",
"Numba is much more heavy-weight than NumExpr and is best seen as an alternative to something like Cython. There's no new language to learn, but you do have to write some funny-looking Python.\n",
"\n",
"Numba has much more developer support than NumExpr (and has a [Moore Foundation grant](https://www.continuum.io/blog/developer-blog/gordon-and-betty-moore-foundation-grant-numba-and-dask) for its development). It's likely going to continue to mature in the coming years."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"First, an annoyance: you have to set the number of threads before importing Numba for the first time."
]
},
{
"cell_type": "code",
"execution_count": 13,
"metadata": {
"collapsed": true
},
"outputs": [],
"source": [
"import os\n",
"os.environ['NUMBA_NUM_THREADS'] = '10'\n",
"import numba as nb\n",
"import numpy as np"
]
},
{
"cell_type": "code",
"execution_count": 14,
"metadata": {
"collapsed": false
},
"outputs": [],
"source": [
"arr = np.empty((4096,4096), dtype=np.float64)\n",
"arr[:] = np.random.random(arr.shape)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Numba has one main function: `nb.jit`. It's used as a decorator:"
]
},
{
"cell_type": "code",
"execution_count": 15,
"metadata": {
"collapsed": true
},
"outputs": [],
"source": [
"def py_sum2d(arr):\n",
" M, N = arr.shape\n",
" result = 0.0\n",
" for i in range(M):\n",
" for j in range(N):\n",
" result += arr[i,j]\n",
" return result\n",
"\n",
"@nb.jit\n",
"def nb_sum2d(arr):\n",
" M, N = arr.shape\n",
" result = 0.0\n",
" for i in range(M):\n",
" for j in range(N):\n",
" result += arr[i,j]\n",
" return result"
]
},
{
"cell_type": "code",
"execution_count": 16,
"metadata": {
"collapsed": false
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"1 loop, best of 3: 3.2 s per loop\n",
"The slowest run took 7.74 times longer than the fastest. This could mean that an intermediate result is being cached.\n",
"100 loops, best of 3: 17 ms per loop\n"
]
}
],
"source": [
"%timeit py_sum2d(arr)\n",
"%timeit nb_sum2d(arr)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"The first time we run the timing test, we get a message that one of the Numba timing tests took way longer than the others. That's because the first time we call the jitted function, it has to be compiled for the dtype of the arguments. The compiled function is cached, though, making subsequent calls faster."
]
},
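{
"cell_type": "markdown",
"metadata": {},
"source": [
"If the first-call compilation cost is a problem (say, inside a benchmark), we can compile eagerly by passing an explicit signature to `nb.jit` (a quick sketch; `nb_sum2d_eager` is our own name):"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": false
},
"outputs": [],
"source": [
"# An explicit signature makes Numba compile at decoration time\n",
"# (eager compilation) instead of lazily at the first call\n",
"@nb.jit('float64(float64[:,:])', nopython=True)\n",
"def nb_sum2d_eager(arr):\n",
"    M, N = arr.shape\n",
"    result = 0.0\n",
"    for i in range(M):\n",
"        for j in range(N):\n",
"            result += arr[i,j]\n",
"    return result"
]
},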
{
"cell_type": "markdown",
"metadata": {},
"source": [
"It's a good idea to always pass the `nopython=True` kwarg. This prevents Numba from falling back into object mode if it finds something it can't compile to machine code, e.g. a Python function call:"
]
},
{
"cell_type": "code",
"execution_count": 17,
"metadata": {
"collapsed": false
},
"outputs": [],
"source": [
"import scipy.special\n",
"\n",
"@nb.jit(nopython=True)\n",
"def nb_sum2d_j1(arr):\n",
" M, N = arr.shape\n",
" result = 0.\n",
" for i in range(M):\n",
" for j in range(N):\n",
" result += scipy.special.j1(arr[i,j])\n",
" return result"
]
},
{
"cell_type": "code",
"execution_count": 18,
"metadata": {
"collapsed": false,
"scrolled": true
},
"outputs": [
{
"ename": "UntypedAttributeError",
"evalue": "Caused By:\nTraceback (most recent call last):\n File \"/home/lgarrison/anaconda2/lib/python2.7/site-packages/numba/compiler.py\", line 238, in run\n stage()\n File \"/home/lgarrison/anaconda2/lib/python2.7/site-packages/numba/compiler.py\", line 452, in stage_nopython_frontend\n self.locals)\n File \"/home/lgarrison/anaconda2/lib/python2.7/site-packages/numba/compiler.py\", line 841, in type_inference_stage\n infer.propagate()\n File \"/home/lgarrison/anaconda2/lib/python2.7/site-packages/numba/typeinfer.py\", line 773, in propagate\n raise errors[0]\nUntypedAttributeError: Unknown attribute 'j1' of type Module(<module 'scipy.special' from '/home/lgarrison/anaconda2/lib/python2.7/site-packages/scipy/special/__init__.pyc'>)\nFile \"<ipython-input-17-a93836d82e7b>\", line 9\n[1] During: typing of get attribute at <ipython-input-17-a93836d82e7b> (9)\n\nFailed at nopython (nopython frontend)\nUnknown attribute 'j1' of type Module(<module 'scipy.special' from '/home/lgarrison/anaconda2/lib/python2.7/site-packages/scipy/special/__init__.pyc'>)\nFile \"<ipython-input-17-a93836d82e7b>\", line 9\n[1] During: typing of get attribute at <ipython-input-17-a93836d82e7b> (9)",
"output_type": "error",
"traceback": [
"\u001b[0;31m\u001b[0m",
"\u001b[0;31mUntypedAttributeError\u001b[0mTraceback (most recent call last)",
"\u001b[0;32m<ipython-input-18-994225f75ce8>\u001b[0m in \u001b[0;36m<module>\u001b[0;34m()\u001b[0m\n\u001b[0;32m----> 1\u001b[0;31m \u001b[0mnb_sum2d_j1\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0marr\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0m",
"\u001b[0;32m/home/lgarrison/anaconda2/lib/python2.7/site-packages/numba/dispatcher.pyc\u001b[0m in \u001b[0;36m_compile_for_args\u001b[0;34m(self, *args, **kws)\u001b[0m\n\u001b[1;32m 328\u001b[0m for i, err in failed_args))\n\u001b[1;32m 329\u001b[0m \u001b[0me\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0mpatch_message\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0mmsg\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0;32m--> 330\u001b[0;31m \u001b[0;32mraise\u001b[0m \u001b[0me\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0m\u001b[1;32m 331\u001b[0m \u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m 332\u001b[0m \u001b[0;32mdef\u001b[0m \u001b[0minspect_llvm\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0mself\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0msignature\u001b[0m\u001b[0;34m=\u001b[0m\u001b[0mNone\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m:\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n",
"\u001b[0;31mUntypedAttributeError\u001b[0m: Caused By:\nTraceback (most recent call last):\n File \"/home/lgarrison/anaconda2/lib/python2.7/site-packages/numba/compiler.py\", line 238, in run\n stage()\n File \"/home/lgarrison/anaconda2/lib/python2.7/site-packages/numba/compiler.py\", line 452, in stage_nopython_frontend\n self.locals)\n File \"/home/lgarrison/anaconda2/lib/python2.7/site-packages/numba/compiler.py\", line 841, in type_inference_stage\n infer.propagate()\n File \"/home/lgarrison/anaconda2/lib/python2.7/site-packages/numba/typeinfer.py\", line 773, in propagate\n raise errors[0]\nUntypedAttributeError: Unknown attribute 'j1' of type Module(<module 'scipy.special' from '/home/lgarrison/anaconda2/lib/python2.7/site-packages/scipy/special/__init__.pyc'>)\nFile \"<ipython-input-17-a93836d82e7b>\", line 9\n[1] During: typing of get attribute at <ipython-input-17-a93836d82e7b> (9)\n\nFailed at nopython (nopython frontend)\nUnknown attribute 'j1' of type Module(<module 'scipy.special' from '/home/lgarrison/anaconda2/lib/python2.7/site-packages/scipy/special/__init__.pyc'>)\nFile \"<ipython-input-17-a93836d82e7b>\", line 9\n[1] During: typing of get attribute at <ipython-input-17-a93836d82e7b> (9)"
]
}
],
"source": [
"nb_sum2d_j1(arr)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Numba does let you call ctypes and CFFI functions in `nopython` mode, as we'll see later."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"One of Numba's main drawbacks is lack of broad support for parallelism. `jit` has a `nogil` option to release the global interpreter lock, which is useful if you have a higher-level parallelism scheme in place (e.g. Python `threading`), but it gives no speedup on its own.\n",
"\n",
"However, by using the less-flexible `vectorize` or `guvectorize` decorators, we can get excellent parallelization in some specialized cases:"
]
},
{
"cell_type": "code",
"execution_count": 19,
"metadata": {
"collapsed": false
},
"outputs": [],
"source": [
"@nb.vectorize([nb.float64(nb.float64, nb.float64)], nopython=True, target='cpu')\n",
"def vec_op(a, b):\n",
" return np.sin(a**b)**2.\n",
"\n",
"@nb.vectorize([nb.float64(nb.float64, nb.float64)], nopython=True, target='parallel')\n",
"def parallel_vec_op(a, b):\n",
" return np.sin(a**b)**2."
]
},
{
"cell_type": "code",
"execution_count": 20,
"metadata": {
"collapsed": false
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"1 loop, best of 3: 1.32 s per loop\n",
"1 loop, best of 3: 215 ms per loop\n"
]
}
],
"source": [
"%timeit vec_op(arr, 2*arr)\n",
"%timeit parallel_vec_op(arr, 2*arr)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"So we got an excellent speedup, but we also had to specify the input dtypes. There are dynamic ufuncs that give more flexibility, but sometimes they don't behave as expected with the `parallel` target.\n",
"\n",
"Also note that `vectorize` only lets us operate element-wise on the input arrays. `guvectorize` gives us access to the whole array if we need, and has a `parallel` target. It's the closest thing to a parallel version of `jit`, but it has more programmer overhead. Here's an example:"
]
},
{
"cell_type": "code",
"execution_count": 21,
"metadata": {
"collapsed": false
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"1 loop, best of 3: 1.1 s per loop\n",
"1 loop, best of 3: 1.07 s per loop\n"
]
}
],
"source": [
"@nb.guvectorize([(nb.float64[:,:], nb.float64[:,:], nb.float64[:,:])], '(nx,ny),(nx,ny)->(nx,ny)', target='cpu', nopython=True)\n",
"def guvec(a, b, c):\n",
" M, N = a.shape\n",
" for i in range(M):\n",
" for j in range(N):\n",
" c[i,j] = a[i,j]**b[i,j]\n",
"\n",
" \n",
"@nb.guvectorize([(nb.float64[:,:], nb.float64[:,:], nb.float64[:,:])], '(nx,ny),(nx,ny)->(nx,ny)', target='parallel', nopython=True)\n",
"def parallel_guvec(a, b, c):\n",
" M, N = a.shape\n",
" for i in range(M):\n",
" for j in range(N):\n",
" c[i,j] = a[i,j]**b[i,j]\n",
"\n",
"%timeit guvec(arr, 2*arr)\n",
"%timeit parallel_guvec(arr, 2*arr)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Wait, why didn't we get any parallel speedup? In Numba, parallelization only happens when broadcasting, so let's write it in a way that broadcasts over the first dimension:"
]
},
{
"cell_type": "code",
"execution_count": 22,
"metadata": {
"collapsed": false
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"10 loops, best of 3: 181 ms per loop\n"
]
}
],
"source": [
"@nb.guvectorize([(nb.float64[:], nb.float64[:], nb.float64[:])], '(ny),(ny)->(ny)', target='parallel', nopython=True)\n",
"def parallel_guvec(a, b, c):\n",
" N, = a.shape\n",
" for j in range(N):\n",
" c[j] = a[j]**b[j]\n",
"\n",
"%timeit parallel_guvec(arr, 2*arr)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"I've hacked a fairly general solution that achieves parallelism by broadcasting over an array of indices (I call it `parallel_bcast`). I'm expecting it will be superceded by Numba functionality within a few years, though."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"**Update**: it's here! Numba now offers `numba.prange` as a replacement to any `range` statement. You must also pass `parallel=True` to the `jit` decorator. For example:"
]
},
{
"cell_type": "code",
"execution_count": 23,
"metadata": {
"collapsed": false
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"1 loop, best of 3: 182 ms per loop\n"
]
}
],
"source": [
"@nb.jit(nopython=True, parallel=True)\n",
"def nb_prange(a, b):\n",
" M, N = a.shape\n",
" c = np.empty_like(a)\n",
" for i in nb.prange(M):\n",
" for j in range(N):\n",
" c[i,j] = a[i,j]**b[i,j]\n",
" \n",
"%timeit nb_prange(arr, 2*arr)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"See [below](#Experimental:-Using-nb.prange-for-histogramming) for a non-trivial example where `prange` fails to live up to its promise, however."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Numba also offers `gpu` and `cuda` targets for transparent offloading to the GPU, but I haven't tried it myself."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Example: radial histogramming"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"The above examples were pretty trivial. Numba really shines when you can use explicit looping to calculate intermediate values that you would otherwise have to store all at once in a big Numpy array. In this example, we'll bin values on a rectangular lattice into spherical shells. Thus, we need to know the radial distance to each lattice site. This is what we'll calculate on-the-fly with Numba."
]
},
{
"cell_type": "code",
"execution_count": 24,
"metadata": {
"collapsed": false
},
"outputs": [],
"source": [
"def py_radial_hist(values, box, bin_edges):\n",
" nx,ny,nz = values.shape\n",
" X,Y,Z = np.ogrid[0:box:box/nx, 0:box:box/ny, 0:box:box/nz]\n",
" radii = X**2 + Y**2 + Z**2\n",
" \n",
" return np.histogram(radii, bins=bin_edges**2, weights=values)\n",
"\n",
"\n",
"@nb.jit(nopython=True)\n",
"def nb_radial_hist(values, boxsize, bin_edges):\n",
" nx,ny,nz = values.shape\n",
" \n",
" histogram = np.zeros(len(bin_edges)-1)\n",
"\n",
" # Do binning with squared distances\n",
" bin_edges = bin_edges**2\n",
" nbp1 = len(bin_edges)\n",
"\n",
" for i in range(nx):\n",
" dx2 = (boxsize/nx*i)**2\n",
" for j in range(ny):\n",
" dy2 = (boxsize/ny*j)**2\n",
" for k in range(nz):\n",
" dz2 = (boxsize/nz*k)**2\n",
" dist = dx2 + dy2 + dz2\n",
" if dist < bin_edges[0] or dist > bin_edges[-1]:\n",
" continue\n",
" for b in range(1,nbp1):\n",
" if dist < bin_edges[b]:\n",
" histogram[b-1] += values[i,j,k]\n",
" break\n",
" else: # last bin is closed\n",
" histogram[-1] += values[i,j,k]\n",
" \n",
" return histogram"
]
},
{
"cell_type": "code",
"execution_count": 25,
"metadata": {
"collapsed": true
},
"outputs": [],
"source": [
"box = 1.\n",
"bin_edges = np.linspace(0,box,100)\n",
"values = np.random.random((512,512,512))"
]
},
{
"cell_type": "code",
"execution_count": 26,
"metadata": {
"collapsed": false,
"scrolled": true
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"1 loop, best of 3: 9.73 s per loop\n",
"1 loop, best of 3: 5.98 s per loop\n"
]
}
],
"source": [
"%timeit py_radial_hist(values, box, bin_edges)\n",
"%timeit nb_radial_hist(values, box, bin_edges)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"So we cut the runtime of the Numpy implementation by 1/3, and we didn't even consider the memory usage. The Numpy implementation doubles the memory usage of the input array, since it has to construct the radii. This can be a big problem when dealing with $2048^3$ FFT meshes, for example.\n",
"\n",
"In practice, I use a `parallel_bcast`-based parallel implementation, which gives a huge additional speedup. In theory, we could achieve this parallelism with `nb.prange`. Let's explore this below."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Experimental: Using `nb.prange` for histogramming\n",
"\n",
"`nb.prange` has a lot of promise, but as an \"experimental feature\" it's often clunky to use in practice. Let's try to apply it to this histogramming example and see what happens.\n",
"\n",
"First, let's write a naive implementation with a race condition. Note that to complile, the `prange` must be in the innermost loop."
]
},
{
"cell_type": "code",
"execution_count": 27,
"metadata": {
"collapsed": true
},
"outputs": [],
"source": [
"@nb.jit(nopython=True, parallel=True)\n",
"def BAD_nb_parallel_radial_hist(values, boxsize, bin_edges):\n",
" nx,ny,nz = values.shape\n",
" \n",
" histogram = np.zeros(len(bin_edges)-1)\n",
" \n",
" # parallel=True quirk: some versions wouldn't compile without squaring in-place\n",
" bin_edges = bin_edges**2\n",
" nbp1 = len(bin_edges)\n",
" \n",
" for i in range(nx):\n",
" for j in range(ny):\n",
" # another quirk: prange must be in inner loop\n",
" for k in nb.prange(nz):\n",
" dx2 = (boxsize/nx*i)**2\n",
" dy2 = (boxsize/ny*j)**2\n",
" dz2 = (boxsize/nz*k)**2\n",
" dist = dx2 + dy2 + dz2\n",
" if dist < bin_edges[0] or dist > bin_edges[-1]:\n",
" continue\n",
" for b in range(1,nbp1):\n",
" if dist < bin_edges[b]:\n",
" histogram[b-1] += values[i,j,k]\n",
" break\n",
" else: # last bin is closed\n",
" histogram[-1] += values[i,j,k]\n",
" \n",
" # also some versions wouldn't compile without a return!\n",
" return histogram"
]
},
{
"cell_type": "code",
"execution_count": 28,
"metadata": {
"collapsed": false
},
"outputs": [],
"source": [
"py_answer = py_radial_hist(values, box, bin_edges)[0]\n",
"nb_answer = nb_radial_hist(values, box, bin_edges)\n",
"BAD_parallel_answer = BAD_nb_parallel_radial_hist(values, box, bin_edges)"
]
},
{
"cell_type": "code",
"execution_count": 29,
"metadata": {
"collapsed": false
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"True\n",
"False\n"
]
}
],
"source": [
"print np.allclose(py_answer, nb_answer)\n",
"print np.allclose(py_answer, BAD_parallel_answer)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"As we expected, the race condition caused the parallel version to give the wrong answer. In fact, it doesn't even give a consistent answer from run to run:"
]
},
{
"cell_type": "code",
"execution_count": 30,
"metadata": {
"collapsed": false
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"False\n"
]
}
],
"source": [
"BAD_parallel_answer2 = BAD_nb_parallel_radial_hist(values, box, bin_edges)\n",
"print np.allclose(BAD_parallel_answer, BAD_parallel_answer2)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"We can write a parallel-safe version by giving each `k` its own temporary histogram space to write into:"
]
},
{
"cell_type": "code",
"execution_count": 31,
"metadata": {
"collapsed": true
},
"outputs": [],
"source": [
"@nb.jit(nopython=True, parallel=True)\n",
"def nb_parallel_radial_hist(values, boxsize, bin_edges):\n",
" nx,ny,nz = values.shape\n",
" \n",
" histogram = np.zeros((nz, len(bin_edges)-1))\n",
" \n",
" bin_edges = bin_edges**2\n",
" nbp1 = len(bin_edges)\n",
" \n",
" for i in range(nx):\n",
" for j in range(ny):\n",
" # prange only works on inner loop\n",
" for k in nb.prange(nz):\n",
" dx2 = (boxsize/nx*i)**2\n",
" dy2 = (boxsize/ny*j)**2\n",
" dz2 = (boxsize/nz*k)**2\n",
" dist = dx2 + dy2 + dz2\n",
" if dist < bin_edges[0] or dist > bin_edges[-1]:\n",
" continue\n",
" for b in range(1,nbp1):\n",
" if dist < bin_edges[b]:\n",
" histogram[k, b-1] += values[i,j,k]\n",
" break\n",
" else: # last bin is closed\n",
" histogram[k, -1] += values[i,j,k]\n",
" \n",
" # Silly! This could be written as\n",
" # reduced_hist = histogram.sum(axis=0)\n",
" # but Numba auto-parallelization doesn't support axis reductions\n",
" reduced_hist = np.zeros(len(bin_edges)-1)\n",
" for b in range(len(reduced_hist)):\n",
" for k in range(nz):\n",
" reduced_hist[b] += histogram[k, b]\n",
" \n",
" return reduced_hist"
]
},
{
"cell_type": "code",
"execution_count": 32,
"metadata": {
"collapsed": false
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"True\n"
]
}
],
"source": [
"parallel_answer = nb_parallel_radial_hist(values, box, bin_edges)\n",
"print np.allclose(py_answer, parallel_answer)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Finally, how do the timings stack up?"
]
},
{
"cell_type": "code",
"execution_count": 33,
"metadata": {
"collapsed": false
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"1 loop, best of 3: 5.95 s per loop\n",
"1 loop, best of 3: 10 s per loop\n",
"1 loop, best of 3: 10 s per loop\n"
]
}
],
"source": [
"%timeit nb_radial_hist(values, box, bin_edges)\n",
"%timeit BAD_nb_parallel_radial_hist(values, box, bin_edges)\n",
"%timeit nb_parallel_radial_hist(values, box, bin_edges)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"That didn't go as expected. If each thread is only operating on one element at a time, that could explain the inefficiency. Hopefully it's implemented internally to operate on `nz/nthreads` elements at a time, though.\n",
"\n",
"We could try flattening the nested loops, but then we'd need `nx*ny*nz` temporary space! We actually only need `nthreads` temporary space, but the local thread ID is completely hidden from us. Let's write an egregiously wrong version just to test the speed:"
]
},
{
"cell_type": "code",
"execution_count": 34,
"metadata": {
"collapsed": true
},
"outputs": [],
"source": [
"@nb.jit(nopython=True, parallel=True)\n",
"def BAD_nb_parallel_radial_hist_flat(values, boxsize, bin_edges):\n",
" nx,ny,nz = values.shape\n",
" \n",
" histogram = np.zeros(len(bin_edges)-1)\n",
" \n",
" bin_edges = bin_edges**2\n",
" nbp1 = len(bin_edges)\n",
" \n",
" for x in nb.prange(nx*ny*nz):\n",
" i = x / (ny*nz)\n",
" j = (x % (ny*nz)) / nz\n",
" k = x % nz\n",
" dx2 = (boxsize/nx*i)**2\n",
" dy2 = (boxsize/ny*j)**2\n",
" dz2 = (boxsize/nz*k)**2\n",
" dist = dx2 + dy2 + dz2\n",
" if dist < bin_edges[0] or dist > bin_edges[-1]:\n",
" continue\n",
" for b in range(1,nbp1):\n",
" if dist < bin_edges[b]:\n",
" histogram[b-1] += values[i,j,k]\n",
" break\n",
" else: # last bin is closed\n",
" histogram[-1] += values[i,j,k]\n",
" \n",
" return histogram"
]
},
{
"cell_type": "code",
"execution_count": 35,
"metadata": {
"collapsed": false
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"1 loop, best of 3: 1.57 s per loop\n"
]
}
],
"source": [
"%timeit BAD_nb_parallel_radial_hist_flat(values, box, bin_edges)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"That was about as fast as we expected, but it still gives the wrong answer. Maybe when `prange` supports nested loops it will be possible to write a correct version of this.\n",
"\n",
"For fun, let's time the implementation with my custom parallelization scheme (\"`parallel_bcast`\"). It's also doing a periodic wrap, unlike the above, so it's doing more work."
]
},
{
"cell_type": "code",
"execution_count": 36,
"metadata": {
"collapsed": false
},
"outputs": [
{
"name": "stderr",
"output_type": "stream",
"text": [
"/home/lgarrison/abacus/python/Abacus/Analysis/PowerSpectrum/Histogram.py:215: RuntimeWarning: invalid value encountered in divide\n",
" radii /= counts\n",
"/home/lgarrison/abacus/python/Abacus/Analysis/PowerSpectrum/Histogram.py:222: RuntimeWarning: invalid value encountered in divide\n",
" weights /= counts\n"
]
},
{
"name": "stdout",
"output_type": "stream",
"text": [
"1 loop, best of 3: 1.16 s per loop\n"
]
}
],
"source": [
"import Abacus.Analysis.PowerSpectrum.Histogram as AbacusHistogram\n",
"#reload(AbacusHistogram.Histogram.)\n",
"# Pre-compute the mean radius of the bin and the counts per bin\n",
"bin_info = AbacusHistogram.RadialBinGrid(box, values, bin_edges)\n",
"%timeit AbacusHistogram.RadialBinGrid(box, values, bin_edges, bin_info=bin_info)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## CFFI\n",
"https://bitbucket.org/cffi/cffi"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"CFFI (C Foreign Function Interface) is a way to call C code from Python. It's designed as a replacement to tools like SWIG, ctypes, or SciPy.weave (now deprecated).\n",
"\n",
"SWIG is probably still appropriate if you want to call your C code from many languages, not just Python. But ctypes or CFFI are the preferred Python solutions, with a preference for CFFI. We'll see why below."
]
},
{
"cell_type": "markdown",
"metadata": {
"collapsed": true
},
"source": [
"### Example: calling a GSL function from Numba"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Let's go back to the Numba example where we tried to call `scipy.special.j1` but Numba wouldn't allow it in `nopython` mode. We'll use the recommended \"API-level, out-of-line\" mode. This means the C compiler is invoked, and we have to run a build script beforehand.\n",
"\n",
"We're going to write the script to disk and run it, because this would probably be its own script in a real application."
]
},
{
"cell_type": "code",
"execution_count": 37,
"metadata": {
"collapsed": false
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Overwriting build_ffilib.py\n"
]
}
],
"source": [
"%%writefile build_ffilib.py\n",
"if __name__ == '__main__':\n",
" # Compile the FFI lib\n",
" import cffi\n",
" ffibuilder = cffi.FFI()\n",
" ffibuilder.set_source('_gslffilib',\n",
" r\"\"\"\n",
" #include <gsl/gsl_sf_bessel.h> // This gets compiled\n",
" \"\"\", libraries=['gsl', 'gslcblas'])\n",
" ffibuilder.cdef(\"\"\"\n",
" double gsl_sf_bessel_j1 (double x); // This can be copied straight from the man page and is parsed by CFFI\n",
" \"\"\")\n",
" \n",
" ffibuilder.compile(verbose=True)"
]
},
{
"cell_type": "code",
"execution_count": 38,
"metadata": {
"collapsed": false
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"generating ./_gslffilib.c\n",
"(already up-to-date)\n",
"running build_ext\n",
"building '_gslffilib' extension\n",
"gcc -pthread -fno-strict-aliasing -g -O2 -DNDEBUG -g -fwrapv -O3 -Wall -Wstrict-prototypes -fPIC -I/home/lgarrison/anaconda2/include/python2.7 -c _gslffilib.c -o ./_gslffilib.o\n",
"gcc -pthread -shared -L/home/lgarrison/anaconda2/lib -Wl,-rpath=/home/lgarrison/anaconda2/lib,--no-as-needed ./_gslffilib.o -L/home/lgarrison/anaconda2/lib -lgsl -lgslcblas -lpython2.7 -o ./_gslffilib.so\n"
]
}
],
"source": [
"!python build_ffilib.py"
]
},
{
"cell_type": "code",
"execution_count": 39,
"metadata": {
"collapsed": false
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"build_ffilib.py _gslffilib.c _gslffilib.o \u001b[0m\u001b[01;32m_gslffilib.so\u001b[0m*\r\n"
]
}
],
"source": [
"%ls *ffi*"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"We can see that we just compiled `_gslffilib.so`. Here's how we use it:"
]
},
{
"cell_type": "code",
"execution_count": 40,
"metadata": {
"collapsed": false
},
"outputs": [
{
"data": {
"text/plain": [
"0.30116867893975674"
]
},
"execution_count": 40,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"import _gslffilib\n",
"_gslffilib.lib.gsl_sf_bessel_j1(1.)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"And here's how we use it with Numba:"
]
},
{
"cell_type": "code",
"execution_count": 41,
"metadata": {
"collapsed": false
},
"outputs": [],
"source": [
"import numpy as np\n",
"import numba as nb\n",
"\n",
"import numba.cffi_support\n",
"numba.cffi_support.register_module(_gslffilib)\n",
"gsl_sf_bessel_j1 = _gslffilib.lib.gsl_sf_bessel_j1"
]
},
{
"cell_type": "code",
"execution_count": 42,
"metadata": {
"collapsed": false
},
"outputs": [],
"source": [
"import scipy.special\n",
"\n",
"@nb.jit(nopython=True)\n",
"def nb_sum2d_cffi(arr):\n",
" M, N = arr.shape\n",
" result = 0.\n",
" for i in range(M):\n",
" for j in range(N):\n",
" result += gsl_sf_bessel_j1(arr[i,j])\n",
" return result"
]
},
{
"cell_type": "code",
"execution_count": 43,
"metadata": {
"collapsed": false
},
"outputs": [
{
"data": {
"text/plain": [
"2593.4050247737537"
]
},
"execution_count": 43,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"nb_sum2d_cffi(np.random.random((128,128)))"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"There are other ways to use CFFI (ABI vs API, in-line vs out-of-line), but the above is the preferred approach for most applications."
]
},
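{
"cell_type": "markdown",
"metadata": {},
"source": [
"For contrast, here's the ABI-level, in-line mode (a minimal sketch): no compiler is invoked, and the declaration is trusted blindly, with the same caveats as ctypes below."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": false
},
"outputs": [],
"source": [
"import cffi\n",
"\n",
"# ABI in-line mode: cdef() the declaration and dlopen() the shared\n",
"# library at runtime; nothing is checked against the real header\n",
"ffi = cffi.FFI()\n",
"ffi.cdef('double gsl_sf_bessel_j1(double x);')\n",
"C = ffi.dlopen('libgsl.so')\n",
"print C.gsl_sf_bessel_j1(1.)"
]
},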
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### CFFI vs ctypes"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Here's how we would call `gsl_sf_bessel_j1` with ctypes."
]
},
{
"cell_type": "code",
"execution_count": 44,
"metadata": {
"collapsed": false
},
"outputs": [],
"source": [
"import ctypes as ct\n",
"libgsl = ct.cdll.LoadLibrary(\"libgsl.so\")\n",
"\n",
"# Set the argument and return types\n",
"libgsl.gsl_sf_bessel_j1.restype = ct.c_double\n",
"libgsl.gsl_sf_bessel_j1.argtypes = [ct.c_double]"
]
},
{
"cell_type": "code",
"execution_count": 45,
"metadata": {
"collapsed": false
},
"outputs": [
{
"data": {
"text/plain": [
"0.30116867893975674"
]
},
"execution_count": 45,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"libgsl.gsl_sf_bessel_j1(1.)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"That was a lot easier, so why not use ctypes all the time? First of all, ctypes is ABI-only --- it never invokes the C compiler. This is very error prone and can easily cause segfaults if the wrong argments are passed, wrong signatures are specified, etc. It's also hard to keep updated if the C call signature changes. CFFI, on the other hand, just requires that you copy/paste the C call signature (or load it programmatically). Most modern Python projects are moving away from ctypes towards CFFI.\n",
"\n",
"Finally, CFFI can compile your raw C code for you, if you don't already have it in a shared library. We didn't demonstrate that here, but it's a much more portable and low-maintenance solution than trying to detect what compilers are available on the target platform, for example."
]
},
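{
"cell_type": "markdown",
"metadata": {},
"source": [
"Here's a taste of that mode (a minimal sketch; `my_clip` and `_rawffilib` are made-up names): we hand CFFI raw C source instead of an existing library, and it compiles and wraps it for us."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": false
},
"outputs": [],
"source": [
"import cffi\n",
"\n",
"# API out-of-line mode with raw C source: CFFI invokes the C compiler\n",
"# on the code in set_source() and builds a Python wrapper module\n",
"ffibuilder = cffi.FFI()\n",
"ffibuilder.set_source('_rawffilib',\n",
"    r\"\"\"\n",
"    double my_clip(double x, double lo, double hi) {\n",
"        return x < lo ? lo : (x > hi ? hi : x);\n",
"    }\n",
"    \"\"\")\n",
"ffibuilder.cdef('double my_clip(double x, double lo, double hi);')\n",
"ffibuilder.compile(verbose=False)\n",
"\n",
"from _rawffilib import lib\n",
"print lib.my_clip(3.14, 0., 1.)"
]
},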
{
"cell_type": "markdown",
"metadata": {},
"source": [
"A final note: CFFI was developed as part of the PyPy project, which is a performance-oriented alternative to CPython. However, I wouldn't recommend including PyPy in your workflow, as their approach requires re-implementing basic libraries like Numpy to work with PyPy (\"NumPyPy\"). This is ultimately an unsustainable approach."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Dask\n",
"http://dask.pydata.org/en/latest/"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"A library for out-of-core, asynchronous, or distributed jobs on big data. It has a lot of promise, but hasn't quite lived up to its potential in my experience. It's being funded by the same grant as Numba, so I'm expecting a lot of maturation in the next few years."
]
}
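,
{
"cell_type": "markdown",
"metadata": {},
"source": [
"As a taste of the interface (a minimal sketch using `dask.array`): arrays are split into chunks, operations build a task graph lazily, and `.compute()` executes it, potentially in parallel or out-of-core."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": false
},
"outputs": [],
"source": [
"import dask.array as da\n",
"\n",
"# Each 1024x1024 chunk is an independent task; nothing runs until compute()\n",
"x = da.random.random((8192, 8192), chunks=(1024, 1024))\n",
"print (x - x.mean()).std().compute()"
]
}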
],
"metadata": {
"kernelspec": {
"display_name": "Python [Root]",
"language": "python",
"name": "Python [Root]"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 2
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython2",
"version": "2.7.12"
},
"nbpresent": {
"slides": {
"37250ad3-4229-4ada-88f0-214599096b6d": {
"id": "37250ad3-4229-4ada-88f0-214599096b6d",
"prev": "9adc90d1-b908-481e-aa92-54aaf3a1bd77",
"regions": {
"46e46ae8-5023-4d92-9be3-710af3dabf80": {
"attrs": {
"height": 0.8,
"width": 0.8,
"x": 0.1,
"y": 0.1
},
"id": "46e46ae8-5023-4d92-9be3-710af3dabf80"
}
}
},
"9adc90d1-b908-481e-aa92-54aaf3a1bd77": {
"id": "9adc90d1-b908-481e-aa92-54aaf3a1bd77",
"prev": null,
"regions": {}
}
},
"themes": {}
}
},
"nbformat": 4,
"nbformat_minor": 0
}