mrocklin/numba-cuda-stencil.ipynb

## numba-cuda-stencil.ipynb
{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Comparing Stencil Computations on the CPU and GPU\n",
    "=================================================\n",
    "\n",
    "This notebook builds off of a [recent blogpost](https://blog.dask.org/2019/04/09/numba-stencil) using Numba, Dask, NumPy, to build parallel compiled computations easily.\n",
    "\n",
    "At the end of that post I posed the question of how we could run these computations on the GPU.  I was curious both on how much harder it would be to write and on how much faster it would go.  This notebook explores this question by using [numba.cuda.jit](https://numba.pydata.org/numba-doc/dev/cuda/index.html).\n",
    "\n",
    "We learn that it is, in fact, much faster to use a GPU and that, if you're ok with copy-pasting a bit of code that you might not understand, it's also quite easy for someone who is unfamiliar with GPUs, like myself."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Stencil computations on CPU"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 1,
   "metadata": {},
   "outputs": [],
   "source": [
    "import numpy as np\n",
    "import numba\n",
    "\n",
    "@numba.stencil\n",
    "def _smooth(x):\n",
    "    return (x[-1, -1] + x[-1, 0] + x[-1, 1] +\n",
    "            x[ 0, -1] + x[ 0, 0] + x[ 0, 1] +\n",
    "            x[ 1, -1] + x[ 1, 0] + x[ 1, 1]) // 9\n",
    "\n",
    "@numba.njit\n",
    "def smooth_cpu(x):\n",
    "    return _smooth(x)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 2,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "621 ms ± 15.1 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)\n"
     ]
    }
   ],
   "source": [
    "x_cpu = np.ones((10000, 10000), dtype='int8')\n",
    "\n",
    "%timeit smooth_cpu(x_cpu)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Stencil computations on GPU\n",
    "\n",
    "Using the `numba.cuda` module I'm able to get about a 200x increase with a modest increase in code complexity."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 3,
   "metadata": {},
   "outputs": [],
   "source": [
    "from numba import cuda\n",
    "\n",
    "@cuda.jit\n",
    "def smooth_gpu(x, out):\n",
    "    i, j = cuda.grid(2)\n",
    "    n, m = x.shape\n",
    "    if 1 <= i < n - 1 and 1 <= j < m - 1:\n",
    "        out[i, j] = (x[i - 1, j - 1] + x[i - 1, j] + x[i - 1, j + 1] +\n",
    "                     x[i    , j - 1] + x[i    , j] + x[i    , j + 1] +\n",
    "                     x[i + 1, j - 1] + x[i + 1, j] + x[i + 1, j + 1]) // 9"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 4,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "2.87 ms ± 90.8 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)\n"
     ]
    }
   ],
   "source": [
    "import cupy, math\n",
    "\n",
    "x_gpu = cupy.ones((10000, 10000), dtype='int8')\n",
    "out_gpu = cupy.zeros((10000, 10000), dtype='int8')\n",
    "\n",
    "# I copied the four lines below from the Numba docs\n",
    "threadsperblock = (16, 16)\n",
    "blockspergrid_x = math.ceil(x_gpu.shape[0] / threadsperblock[0])\n",
    "blockspergrid_y = math.ceil(x_gpu.shape[1] / threadsperblock[1])\n",
    "blockspergrid = (blockspergrid_x, blockspergrid_y)\n",
    "\n",
    "%timeit smooth_gpu[blockspergrid, threadsperblock](x_gpu, out_gpu)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "*Note: the GPU solution here cheats a bit because it pre-allocates the output array*"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Final thoughts\n",
    "\n",
    "GPUs are fast at doing local stencil computations.  This isn't surprising, but it's nice to see.  \n",
    "\n",
    "What is surprisingly pleasant is how easy it is to write CUDA-like array computing code from Python with Numba and CuPy.\n",
    "\n",
    "Actually that's not entirely true.  I still don't fully understand how I should determine the `blockspergrid` and `threadsperblock` parameters to the `smooth_gpu` kernel execution, and I suspect that many novice users share my uncertainty.  I am curious if there is a way to make a decent choice here given information that I do know (like the shape of the grid I'd like to compute over) and other information that can be automatically  pulled from the hardware.  I would much rather say, for example...\n",
    "\n",
    "```python\n",
    "@cuda.jit\n",
    "def smooth_gpu(x, out):\n",
    "    n, m = x.shape\n",
    "    i, j = cuda.grid[1:n - 1, 1:m - 1]  # <-- define grid here?\n",
    "    \n",
    "    out[i, j] = (x[i - 1, j - 1] + x[i - 1, j] + x[i - 1, j + 1] +\n",
    "                 x[i    , j - 1] + x[i    , j] + x[i    , j + 1] +\n",
    "                 x[i + 1, j - 1] + x[i + 1, j] + x[i + 1, j + 1]) // 9\n",
    "    \n",
    "\n",
    "smooth_gpu(x_gpu, out_gpu)\n",
    "```\n",
    "\n",
    "... or something similar (I don't know the technical constraints here well enough to pose solutions with much probability of success."
   ]
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python [conda env:cudf]",
   "language": "python",
   "name": "conda-env-cudf-py"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.7.1"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 2
}
	{
	"cells": [
	{
	"cell_type": "markdown",
	"metadata": {},
	"source": [
	"Comparing Stencil Computations on the CPU and GPU\n",
	"=================================================\n",
	"\n",
	"This notebook builds off of a [recent blogpost](https://blog.dask.org/2019/04/09/numba-stencil) using Numba, Dask, NumPy, to build parallel compiled computations easily.\n",
	"\n",
	"At the end of that post I posed the question of how we could run these computations on the GPU. I was curious both on how much harder it would be to write and on how much faster it would go. This notebook explores this question by using [numba.cuda.jit](https://numba.pydata.org/numba-doc/dev/cuda/index.html).\n",
	"\n",
	"We learn that it is, in fact, much faster to use a GPU and that, if you're ok with copy-pasting a bit of code that you might not understand, it's also quite easy for someone who is unfamiliar with GPUs, like myself."
	]
	},
	{
	"cell_type": "markdown",
	"metadata": {},
	"source": [
	"## Stencil computations on CPU"
	]
	},
	{
	"cell_type": "code",
	"execution_count": 1,
	"metadata": {},
	"outputs": [],
	"source": [
	"import numpy as np\n",
	"import numba\n",
	"\n",
	"@numba.stencil\n",
	"def _smooth(x):\n",
	" return (x[-1, -1] + x[-1, 0] + x[-1, 1] +\n",
	" x[ 0, -1] + x[ 0, 0] + x[ 0, 1] +\n",
	" x[ 1, -1] + x[ 1, 0] + x[ 1, 1]) // 9\n",
	"\n",
	"@numba.njit\n",
	"def smooth_cpu(x):\n",
	" return _smooth(x)"
	]
	},
	{
	"cell_type": "code",
	"execution_count": 2,
	"metadata": {},
	"outputs": [
	{
	"name": "stdout",
	"output_type": "stream",
	"text": [
	"621 ms ± 15.1 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)\n"
	]
	}
	],
	"source": [
	"x_cpu = np.ones((10000, 10000), dtype='int8')\n",
	"\n",
	"%timeit smooth_cpu(x_cpu)"
	]
	},
	{
	"cell_type": "markdown",
	"metadata": {},
	"source": [
	"## Stencil computations on GPU\n",
	"\n",
	"Using the `numba.cuda` module I'm able to get about a 200x increase with a modest increase in code complexity."
	]
	},
	{
	"cell_type": "code",
	"execution_count": 3,
	"metadata": {},
	"outputs": [],
	"source": [
	"from numba import cuda\n",
	"\n",
	"@cuda.jit\n",
	"def smooth_gpu(x, out):\n",
	" i, j = cuda.grid(2)\n",
	" n, m = x.shape\n",
	" if 1 <= i < n - 1 and 1 <= j < m - 1:\n",
	" out[i, j] = (x[i - 1, j - 1] + x[i - 1, j] + x[i - 1, j + 1] +\n",
	" x[i , j - 1] + x[i , j] + x[i , j + 1] +\n",
	" x[i + 1, j - 1] + x[i + 1, j] + x[i + 1, j + 1]) // 9"
	]
	},
	{
	"cell_type": "code",
	"execution_count": 4,
	"metadata": {},
	"outputs": [
	{
	"name": "stdout",
	"output_type": "stream",
	"text": [
	"2.87 ms ± 90.8 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)\n"
	]
	}
	],
	"source": [
	"import cupy, math\n",
	"\n",
	"x_gpu = cupy.ones((10000, 10000), dtype='int8')\n",
	"out_gpu = cupy.zeros((10000, 10000), dtype='int8')\n",
	"\n",
	"# I copied the four lines below from the Numba docs\n",
	"threadsperblock = (16, 16)\n",
	"blockspergrid_x = math.ceil(x_gpu.shape[0] / threadsperblock[0])\n",
	"blockspergrid_y = math.ceil(x_gpu.shape[1] / threadsperblock[1])\n",
	"blockspergrid = (blockspergrid_x, blockspergrid_y)\n",
	"\n",
	"%timeit smooth_gpu[blockspergrid, threadsperblock](x_gpu, out_gpu)"
	]
	},
	{
	"cell_type": "markdown",
	"metadata": {},
	"source": [
	"Note: the GPU solution here cheats a bit because it pre-allocates the output array"
	]
	},
	{
	"cell_type": "markdown",
	"metadata": {},
	"source": [
	"## Final thoughts\n",
	"\n",
	"GPUs are fast at doing local stencil computations. This isn't surprising, but it's nice to see. \n",
	"\n",
	"What is surprisingly pleasant is how easy it is to write CUDA-like array computing code from Python with Numba and CuPy.\n",
	"\n",
	"Actually that's not entirely true. I still don't fully understand how I should determine the `blockspergrid` and `threadsperblock` parameters to the `smooth_gpu` kernel execution, and I suspect that many novice users share my uncertainty. I am curious if there is a way to make a decent choice here given information that I do know (like the shape of the grid I'd like to compute over) and other information that can be automatically pulled from the hardware. I would much rather say, for example...\n",
	"\n",
	"```python\n",
	"@cuda.jit\n",
	"def smooth_gpu(x, out):\n",
	" n, m = x.shape\n",
	" i, j = cuda.grid[1:n - 1, 1:m - 1] # <-- define grid here?\n",
	" \n",
	" out[i, j] = (x[i - 1, j - 1] + x[i - 1, j] + x[i - 1, j + 1] +\n",
	" x[i , j - 1] + x[i , j] + x[i , j + 1] +\n",
	" x[i + 1, j - 1] + x[i + 1, j] + x[i + 1, j + 1]) // 9\n",
	" \n",
	"\n",
	"smooth_gpu(x_gpu, out_gpu)\n",
	"```\n",
	"\n",
	"... or something similar (I don't know the technical constraints here well enough to pose solutions with much probability of success."
	]
	}
	],
	"metadata": {
	"kernelspec": {
	"display_name": "Python [conda env:cudf]",
	"language": "python",
	"name": "conda-env-cudf-py"
	},
	"language_info": {
	"codemirror_mode": {
	"name": "ipython",
	"version": 3
	},
	"file_extension": ".py",
	"mimetype": "text/x-python",
	"name": "python",
	"nbconvert_exporter": "python",
	"pygments_lexer": "ipython3",
	"version": "3.7.1"
	}
	},
	"nbformat": 4,
	"nbformat_minor": 2
	}