{
"cells": [
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Computations can be bound by memory access\n",
"\n",
"Even though we think of parallelism in terms of \"across cpus\" there are several other resources that can limit our ability to parallelize. These include resources like disk and network I/O, which most people have some experience with, but also more subtle limitations, like memory access.\n",
"\n",
"This notebook shows a simple example of how (not) optimizing for memory access can limit parallelism even for simple computations.\n",
"\n",
"This is part of a conversation between Dask and SKImage developers about parallelizing scikit-image functions."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### A small summation function\n",
"\n",
"We write a summation function that iterates along the last dimension first, and then the first dimension. We compile this with numba for ease, but the same lessons apply to C, Fortran, or Cython."
]
},
{
"cell_type": "code",
"execution_count": 1,
"metadata": {},
"outputs": [],
"source": [
"from numba import jit\n",
"\n",
"@jit(nopython=True, nogil=True)\n",
"def mysum(x):\n",
" total = 0\n",
" for i in range(x.shape[0]):\n",
" for j in range(x.shape[1]):\n",
" total += x[i, j]\n",
" return total"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Apply to large numpy array"
]
},
{
"cell_type": "code",
"execution_count": 2,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"3.2"
]
},
"execution_count": 2,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"import numpy as np\n",
"x = np.empty((20000, 20000))\n",
"x.nbytes / 1e9"
]
},
{
"cell_type": "code",
"execution_count": 3,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"CPU times: user 732 ms, sys: 4 ms, total: 736 ms\n",
"Wall time: 765 ms\n"
]
},
{
"data": {
"text/plain": [
"0.0"
]
},
"execution_count": 3,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"%time mysum(x) # First time takes a little bit longer due to JIT compilation"
]
},
{
"cell_type": "code",
"execution_count": 4,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"CPU times: user 488 ms, sys: 0 ns, total: 488 ms\n",
"Wall time: 488 ms\n"
]
},
{
"data": {
"text/plain": [
"0.0"
]
},
"execution_count": 4,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"%time mysum(x) "
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Measure parallelism\n",
"\n",
"We use the `ptime` magic, get this with \n",
"\n",
" pip install ptime"
]
},
{
"cell_type": "code",
"execution_count": 5,
"metadata": {},
"outputs": [],
"source": [
"%load_ext ptime"
]
},
{
"cell_type": "code",
"execution_count": 6,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Total serial time: 2.01 s\n",
"Total parallel time: 0.52 s\n",
"For a 3.86X speedup across 4 threads\n"
]
}
],
"source": [
"%ptime -n 4 mysum(x)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"This parallelizes well"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Change ordering to Fortran\n",
"\n",
"Now we switch to fortran ordering so that our for-loops proceed against the strides of the array. They access different regions in memory that are not close to each other."
]
},
{
"cell_type": "code",
"execution_count": 7,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"(160000, 8)"
]
},
"execution_count": 7,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"x.strides"
]
},
{
"cell_type": "code",
"execution_count": 8,
"metadata": {},
"outputs": [],
"source": [
"del x # make space"
]
},
{
"cell_type": "code",
"execution_count": 9,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"(8, 160000)"
]
},
"execution_count": 9,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"y = np.empty((20000, 20000), order='F')\n",
"y.strides"
]
},
{
"cell_type": "code",
"execution_count": 10,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"CPU times: user 688 ms, sys: 8 ms, total: 696 ms\n",
"Wall time: 712 ms\n"
]
},
{
"data": {
"text/plain": [
"0.0"
]
},
"execution_count": 10,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"%time mysum(y)"
]
},
{
"cell_type": "code",
"execution_count": 11,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Total serial time: 2.19 s\n",
"Total parallel time: 1.03 s\n",
"For a 2.13X speedup across 4 threads\n"
]
}
],
"source": [
"%ptime -n 4 mysum(y)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"So even though this seems to work decently well sequentially (the memory hierarchy can keep up) it fails to parallelize well (the memory can't quite support more than two threads thrashing memory at once."
]
}
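,
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### A possible fix: match the loop order to the memory layout\n",
"\n",
"As a sketch (not run in this notebook), one way to recover parallel scaling for the Fortran-ordered array is to swap the loops so that the inner loop walks the contiguous first dimension. The name `mysum_fortran` is just an illustrative choice, and the `%ptime` line is left commented out because we have not measured it here."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"from numba import jit\n",
"\n",
"@jit(nopython=True, nogil=True)\n",
"def mysum_fortran(x):\n",
"    # Inner loop walks the first axis, which is the contiguous one for Fortran order\n",
"    total = 0\n",
"    for j in range(x.shape[1]):\n",
"        for i in range(x.shape[0]):\n",
"            total += x[i, j]\n",
"    return total\n",
"\n",
"# %ptime -n 4 mysum_fortran(y)  # expected to scale much like the C-ordered case"
]
}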
],
"metadata": {
"kernelspec": {
"display_name": "Python 3",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.6.0"
}
},
"nbformat": 4,
"nbformat_minor": 2
}