{
"cells": [
{
"cell_type": "markdown",
"metadata": {},
"source": [
"#Calculating a Pearson Correlation Matrix"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"This notebook is going to look at two different ways to calculate a Pearson's Correlation matrix. One way is significantly faster than the other. "
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"###Watermark"
]
},
{
"cell_type": "code",
"execution_count": 1,
"metadata": {
"collapsed": false
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Jessime Kirk Sat Oct 03 2015 \n",
"\n",
"CPython 2.7.10\n",
"IPython 3.2.0\n",
"\n",
"numpy 1.9.2\n",
"scipy 0.16.0\n",
"\n",
"compiler : GCC 4.4.7 20120313 (Red Hat 4.4.7-1)\n",
"system : Linux\n",
"release : 3.16.0-50-generic\n",
"machine : x86_64\n",
"processor : x86_64\n",
"CPU cores : 4\n",
"interpreter: 64bit\n"
]
}
],
"source": [
"%load_ext watermark\n",
"%watermark -a 'Jessime Kirk' -nmv --packages numpy,scipy"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"#Data"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Before we get started, a little bit of context might be useful. We are going to generate a few random matricies which are going to stand in for actual data. Whatever the actual data is, think of each row of the matrix as a sample, or a specific instance of a set of measurements. Let's pretend we are working with mice data. Each row is a mouse. Each column, then, is a set of features describing the mouse. Its height, its weight, its glucose levels, whatever. A question we might want to ask is, how similar is each mouse? Pearson's Correlation is one way of doing that. \n",
"\n",
"Let's go ahead and make some data."
]
},
{
"cell_type": "code",
"execution_count": 2,
"metadata": {
"collapsed": true
},
"outputs": [],
"source": [
"import numpy as np"
]
},
{
"cell_type": "code",
"execution_count": 3,
"metadata": {
"collapsed": false
},
"outputs": [
{
"data": {
"text/plain": [
"array([[2, 8, 6, 1, 8, 9, 5, 3, 7],\n",
" [3, 8, 8, 6, 3, 7, 3, 2, 1],\n",
" [6, 8, 1, 7, 4, 2, 9, 9, 2],\n",
" [3, 1, 6, 6, 7, 9, 1, 5, 6],\n",
" [5, 2, 6, 3, 7, 1, 5, 7, 2]])"
]
},
"execution_count": 3,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"small_data = np.random.randint(1, 10, [5, 9])\n",
"small_data"
]
},
{
"cell_type": "code",
"execution_count": 4,
"metadata": {
"collapsed": false
},
"outputs": [
{
"data": {
"text/plain": [
"array([[39, 27, 65, ..., 82, 71, 4],\n",
" [48, 59, 63, ..., 32, 29, 29],\n",
" [ 6, 80, 27, ..., 81, 66, 8],\n",
" ..., \n",
" [77, 19, 8, ..., 5, 88, 23],\n",
" [75, 50, 98, ..., 36, 5, 85],\n",
" [63, 31, 70, ..., 69, 79, 63]])"
]
},
"execution_count": 4,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"medium_data = np.random.randint(1,100, [50, 90])\n",
"medium_data"
]
},
{
"cell_type": "code",
"execution_count": 30,
"metadata": {
"collapsed": false
},
"outputs": [
{
"data": {
"text/plain": [
"array([[687, 707, 473, ..., 699, 59, 10],\n",
" [370, 713, 505, ..., 6, 376, 644],\n",
" [959, 545, 776, ..., 642, 714, 659],\n",
" ..., \n",
" [785, 828, 365, ..., 501, 738, 73],\n",
" [233, 950, 36, ..., 845, 262, 906],\n",
" [217, 407, 413, ..., 322, 268, 131]])"
]
},
"execution_count": 30,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"large_data = np.random.randint(1,1000, [500, 900])\n",
"large_data"
]
},
{
"cell_type": "code",
"execution_count": 31,
"metadata": {
"collapsed": false
},
"outputs": [],
"source": [
"large_data_rows = np.random.randint(1,1000, [5000, 900])\n",
"large_data_cols = np.random.randint(1,1000, [500, 9000])"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"#Pearson with For Loop"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"The Pearson matrix is going to be a symmetric, square matrix of pairwise comparisons. Element [0,0] will be the first row compaired to the first row, so it will result in a 1 (a thing is perfectly correlated with itself). Element [0,1] will be the first row compared to the second row, and so forth. \n",
"\n",
"The basic way to do this calculation is to one-by-one do these calculations. Let's write a function that will do it that way."
]
},
{
"cell_type": "code",
"execution_count": 6,
"metadata": {
"collapsed": true
},
"outputs": [],
"source": [
"from scipy.stats import pearsonr"
]
},
{
"cell_type": "code",
"execution_count": 7,
"metadata": {
"collapsed": true
},
"outputs": [],
"source": [
"def pearson_long(data):\n",
" \"\"\"Calculate a pearson correlation matrix element by element\"\"\"\n",
" corr_mat = np.zeros([len(data), len(data)])\n",
" for j, row1, in enumerate(data):\n",
" for k, row2 in enumerate(data):\n",
" dist = pearsonr(row1,row2)[0]\n",
" corr_mat[j][k] = dist\n",
" return corr_mat"
]
},
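{
"cell_type": "markdown",
"metadata": {},
"source": [
"As an aside: since the matrix is symmetric, we could in principle compute only the upper triangle and mirror it, roughly halving the number of pearsonr calls. A minimal sketch of that idea (not used in the timings below, and the function name is just illustrative):"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": true
},
"outputs": [],
"source": [
"def pearson_long_sym(data):\n",
"    \"\"\"Like pearson_long, but exploits symmetry to halve the pearsonr calls.\"\"\"\n",
"    n = len(data)\n",
"    corr_mat = np.zeros([n, n])\n",
"    for j in range(n):\n",
"        for k in range(j, n):\n",
"            # Compute each pair once and mirror it across the diagonal\n",
"            corr_mat[j, k] = pearsonr(data[j], data[k])[0]\n",
"            corr_mat[k, j] = corr_mat[j, k]\n",
"    return corr_mat"
]
},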
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Now we can run this with our datasets. First, let's just see the results."
]
},
{
"cell_type": "code",
"execution_count": 11,
"metadata": {
"collapsed": false
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"[[ 1. 0.254 -0.499 0.247 -0.322]\n",
" [ 0.254 1. -0.2 0.061 -0.348]\n",
" [-0.499 -0.2 1. -0.709 0.214]\n",
" [ 0.247 0.061 -0.709 1. -0.097]\n",
" [-0.322 -0.348 0.214 -0.097 1. ]]\n"
]
}
],
"source": [
"print np.around(pearson_long(small_data), decimals=3)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Now let's use the timeit magic to see how long the math takes."
]
},
{
"cell_type": "code",
"execution_count": 12,
"metadata": {
"collapsed": false
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"1000 loops, best of 3: 1.15 ms per loop\n"
]
}
],
"source": [
"%timeit pearson_long(small_data)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Okay, that seems reasonably quick. Let's time the medium dataset."
]
},
{
"cell_type": "code",
"execution_count": 13,
"metadata": {
"collapsed": false
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"10 loops, best of 3: 122 ms per loop\n"
]
}
],
"source": [
"%timeit pearson_long(medium_data)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"The medium dataset it 10x as big in both rows and columns, and took ~100x as long to compute. Let's see if that holds true for the larger ones as well."
]
},
{
"cell_type": "code",
"execution_count": 32,
"metadata": {
"collapsed": false
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"1 loops, best of 3: 14.8 s per loop\n"
]
}
],
"source": [
"%timeit pearson_long(large_data)"
]
},
{
"cell_type": "code",
"execution_count": 48,
"metadata": {
"collapsed": false
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"1 loops, best of 3: 24min 58s per loop\n"
]
}
],
"source": [
"%timeit pearson_long(large_data_rows)"
]
},
{
"cell_type": "code",
"execution_count": 47,
"metadata": {
"collapsed": false
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"1 loops, best of 3: 37.7 s per loop\n"
]
}
],
"source": [
"%timeit pearson_long(large_data_cols)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"So the 100x trend is still more or less holding. And 15s for the large_data doesn't seem that bad. The problem is that the 'large_data' matrix is actually small itself. \n",
"\n",
"The rows vs. cols experiments show that, as we might expect, the for loops are likely sucking a lot of our resources.\n",
"\n",
"We can use Ipython's build-in profiler magic on the large_data and see exactly where time is going."
]
},
{
"cell_type": "code",
"execution_count": 52,
"metadata": {
"collapsed": false
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
" "
]
}
],
"source": [
"profile = %prun -r pearson_long(large_data)"
]
},
{
"cell_type": "code",
"execution_count": 56,
"metadata": {
"collapsed": false
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
" 12998006 function calls in 19.456 seconds\n",
"\n",
" Ordered by: internal time\n",
"\n",
" ncalls tottime percall cumtime percall filename:lineno(function)\n",
" 250000 4.167 0.000 18.940 0.000 stats.py:2498(pearsonr)\n",
" 1250000 3.899 0.000 3.899 0.000 {method 'reduce' of 'numpy.ufunc' objects}\n",
" 500000 2.039 0.000 6.532 0.000 _methods.py:53(_mean)\n",
" 249500 1.594 0.000 2.493 0.000 stats.py:4570(betai)\n",
" 1749500 1.252 0.000 1.252 0.000 {numpy.core.multiarray.array}\n",
" 500000 1.085 0.000 4.141 0.000 stats.py:4688(ss)\n",
" 500000 0.802 0.000 1.046 0.000 _methods.py:43(_count_reduce_items)\n",
" 1249500 0.627 0.000 1.517 0.000 numeric.py:394(asarray)\n",
" 500000 0.556 0.000 2.058 0.000 fromnumeric.py:1631(sum)\n",
" 1500000 0.544 0.000 0.544 0.000 {isinstance}\n",
" 1 0.517 0.517 19.456 19.456 <ipython-input-7-2e56e00094ae>:1(pearson_long)\n",
" 249500 0.511 0.000 0.511 0.000 {numpy.core.multiarray.where}\n",
" 500000 0.401 0.000 0.997 0.000 stats.py:208(_chk_asarray)\n",
" 500000 0.281 0.000 0.643 0.000 numeric.py:464(asanyarray)\n",
" 500000 0.238 0.000 6.770 0.000 {method 'mean' of 'numpy.ndarray' objects}\n",
" 500000 0.188 0.000 1.304 0.000 _methods.py:31(_sum)\n",
" 500000 0.181 0.000 0.181 0.000 {issubclass}\n",
" 500000 0.167 0.000 0.167 0.000 {range}\n",
" 500000 0.160 0.000 0.160 0.000 {hasattr}\n",
" 250000 0.108 0.000 0.108 0.000 {min}\n",
" 250000 0.057 0.000 0.057 0.000 {max}\n",
" 250000 0.051 0.000 0.051 0.000 {abs}\n",
" 250002 0.034 0.000 0.034 0.000 {len}\n",
" 1 0.000 0.000 0.000 0.000 {numpy.core.multiarray.zeros}\n",
" 1 0.000 0.000 19.456 19.456 <string>:1(<module>)\n",
" 1 0.000 0.000 0.000 0.000 {method 'disable' of '_lsprof.Profiler' objects}\n",
"\n",
"\n",
"<pstats.Stats instance at 0x7fb13db01d40>\n"
]
}
],
"source": [
"print profile.print_stats()"
]
},
{
"cell_type": "code",
"execution_count": 60,
"metadata": {
"collapsed": false
},
"outputs": [
{
"data": {
"text/plain": [
"0.9734786184210528"
]
},
"execution_count": 60,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"18.940/19.456"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Virtually all of our time is spent in the pearsonr function, which is called 250,000 times. One option for a speedup would be to vectorize my code. The horror of looping over numpy arrays is a fairly common issue with people (myself included) transitioning from Python to numpy. \n",
"\n",
"All of the math that's done internally in each of the pearsonr calls is lost the next time we need it. So while vectorization would be useful, if we could do everything at once, that would be significantly better. "
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"#Pearson with Linear Algebra"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"The second method for calculating our desired matrix involves just a bit of linear algebra."
]
},
{
"cell_type": "code",
"execution_count": 38,
"metadata": {
"collapsed": false
},
"outputs": [],
"source": [
"def pearson_short(data):\n",
" \"\"\"Calculate a pearson correlation matrix by matrix multipying a row standardized matrix\"\"\"\n",
" data = (data.T - np.mean(data, axis=1)).T\n",
" data = (data.T / np.std(data, axis=1)).T\n",
" data = np.inner(data, data)/data.shape[1]\n",
" return data"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Despite the docstring, what's going on in this function may not be immediately obvious, even though the idea is pretty simple. There are three steps:\n",
"\n",
"1. For each row in the data, find out it's mean. Subtract that mean from each element in that row.\n",
"2. For each row, find it's standard deviation. Divide each element in the row by that standard deviation.\n",
"3. Take the inner product of the matrix and itself. \n",
"\n",
"Not so bad, right? Before we start timing, let's print out the small matrix produced my this function and verify that it looks the same as the other."
]
},
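{
"cell_type": "markdown",
"metadata": {},
"source": [
"In case the linear algebra isn't obvious, here's why the standardize-and-multiply trick works. For rows $x_j$ and $x_k$, each of length $n$, the Pearson correlation is\n",
"\n",
"$$r_{jk} = \\frac{\\sum_{i=1}^{n}(x_{ji} - \\bar{x}_j)(x_{ki} - \\bar{x}_k)}{n\\,\\sigma_j \\sigma_k}$$\n",
"\n",
"where $\\sigma$ is the population standard deviation (what np.std computes by default). If we first standardize each row, $z_{ji} = (x_{ji} - \\bar{x}_j)/\\sigma_j$, this collapses to\n",
"\n",
"$$r_{jk} = \\frac{1}{n}\\sum_{i=1}^{n} z_{ji} z_{ki},$$\n",
"\n",
"which is exactly the $(j, k)$ entry of np.inner(z, z) divided by the number of columns."
]
},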
{
"cell_type": "code",
"execution_count": 41,
"metadata": {
"collapsed": false
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"[[ 1. 0.254 -0.499 0.247 -0.322]\n",
" [ 0.254 1. -0.2 0.061 -0.348]\n",
" [-0.499 -0.2 1. -0.709 0.214]\n",
" [ 0.247 0.061 -0.709 1. -0.097]\n",
" [-0.322 -0.348 0.214 -0.097 1. ]]\n"
]
}
],
"source": [
"print np.around(pearson_short(small_data), decimals=3)"
]
},
{
"cell_type": "code",
"execution_count": 68,
"metadata": {
"collapsed": false
},
"outputs": [
{
"data": {
"text/plain": [
"array([[ True, True, True, True, True],\n",
" [ True, True, True, True, True],\n",
" [ True, True, True, True, True],\n",
" [ True, True, True, True, True],\n",
" [ True, True, True, True, True]], dtype=bool)"
]
},
"execution_count": 68,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"#Round to 7 decimals to allow a little room for floating point error\n",
"np.around(pearson_short(small_data), decimals=7) == np.around(pearson_long(small_data), decimals=7)"
]
},
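{
"cell_type": "markdown",
"metadata": {},
"source": [
"A slightly more direct check is np.allclose, which compares the two results element by element within a small floating point tolerance:"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": false
},
"outputs": [],
"source": [
"# Equivalent check with a built-in tolerance for floating point error\n",
"np.allclose(pearson_short(small_data), pearson_long(small_data))"
]
},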
{
"cell_type": "code",
"execution_count": 39,
"metadata": {
"collapsed": false
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"The slowest run took 2547.23 times longer than the fastest. This could mean that an intermediate result is being cached \n",
"1 loops, best of 3: 102 µs per loop\n"
]
}
],
"source": [
"%timeit pearson_short(small_data)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Okay, let's try the next one."
]
},
{
"cell_type": "code",
"execution_count": 40,
"metadata": {
"collapsed": false
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"The slowest run took 101.01 times longer than the fastest. This could mean that an intermediate result is being cached \n",
"1000 loops, best of 3: 202 µs per loop\n"
]
}
],
"source": [
"%timeit pearson_short(medium_data)"
]
},
{
"cell_type": "code",
"execution_count": 42,
"metadata": {
"collapsed": false
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"10 loops, best of 3: 34.3 ms per loop\n"
]
}
],
"source": [
"%timeit pearson_short(large_data)"
]
},
{
"cell_type": "code",
"execution_count": 43,
"metadata": {
"collapsed": false
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"1 loops, best of 3: 2.46 s per loop\n"
]
}
],
"source": [
"%timeit pearson_short(large_data_rows)"
]
},
{
"cell_type": "code",
"execution_count": 44,
"metadata": {
"collapsed": false
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"1 loops, best of 3: 314 ms per loop\n"
]
}
],
"source": [
"%timeit pearson_short(large_data_cols)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"As we expected to see, everything as achieved a significant speedup. Lets go ahead and see where we're spending our time in these matrix manipulations."
]
},
{
"cell_type": "code",
"execution_count": 57,
"metadata": {
"collapsed": false
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
" "
]
}
],
"source": [
"profile = %prun -r pearson_short(large_data)"
]
},
{
"cell_type": "code",
"execution_count": 58,
"metadata": {
"collapsed": false
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
" 30 function calls in 0.037 seconds\n",
"\n",
" Ordered by: internal time\n",
"\n",
" ncalls tottime percall cumtime percall filename:lineno(function)\n",
" 1 0.030 0.030 0.030 0.030 {numpy.core._dotblas.inner}\n",
" 1 0.004 0.004 0.037 0.037 <ipython-input-38-d947a60be413>:1(pearson_short)\n",
" 3 0.002 0.001 0.002 0.001 {method 'reduce' of 'numpy.ufunc' objects}\n",
" 1 0.001 0.001 0.002 0.002 _methods.py:77(_var)\n",
" 1 0.000 0.000 0.001 0.001 _methods.py:53(_mean)\n",
" 2 0.000 0.000 0.000 0.000 _methods.py:43(_count_reduce_items)\n",
" 3 0.000 0.000 0.000 0.000 {numpy.core.multiarray.array}\n",
" 3 0.000 0.000 0.000 0.000 numeric.py:464(asanyarray)\n",
" 1 0.000 0.000 0.002 0.002 _methods.py:123(_std)\n",
" 1 0.000 0.000 0.001 0.001 fromnumeric.py:2651(mean)\n",
" 1 0.000 0.000 0.002 0.002 fromnumeric.py:2737(std)\n",
" 1 0.000 0.000 0.037 0.037 <string>:1(<module>)\n",
" 6 0.000 0.000 0.000 0.000 {isinstance}\n",
" 1 0.000 0.000 0.000 0.000 {max}\n",
" 3 0.000 0.000 0.000 0.000 {issubclass}\n",
" 1 0.000 0.000 0.000 0.000 {method 'disable' of '_lsprof.Profiler' objects}\n",
"\n",
"\n",
"<pstats.Stats instance at 0x7fb13db0be60>\n"
]
}
],
"source": [
"print profile.print_stats()"
]
},
{
"cell_type": "code",
"execution_count": 59,
"metadata": {
"collapsed": false
},
"outputs": [
{
"data": {
"text/plain": [
"0.8108108108108109"
]
},
"execution_count": 59,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
".03/.037"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Here 81.1% of our time is spent in the matrix multiplication calculation. 'numpy.core.\\_dotblas.inner' indicated that numpy is running the highly optimized [BLAS](http://www.netlib.org/blas/) library for us in the background. Especially for large matricies, this is the only way to go.\n",
"\n",
"Before we wrap up, we definitely need to calculate our fold speedup."
]
},
{
"cell_type": "code",
"execution_count": 69,
"metadata": {
"collapsed": false
},
"outputs": [
{
"data": {
"text/plain": [
"525.8378378378378"
]
},
"execution_count": 69,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"19.456/0.037"
]
},
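{
"cell_type": "markdown",
"metadata": {},
"source": [
"Worth noting: numpy ships a built-in for this. np.corrcoef treats each row as a variable by default, so it should produce the same matrix as pearson_short (a quick sanity check, not timed here):"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": false
},
"outputs": [],
"source": [
"# np.corrcoef computes the same rows-by-rows Pearson correlation matrix\n",
"np.allclose(np.corrcoef(small_data), pearson_short(small_data))"
]
},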
{
"cell_type": "markdown",
"metadata": {},
"source": [
"#Conclusion"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Row standardizing and matrix multiplying an instances by features matrix to find a Pearson correlation matrix is significantly faster than an elementwise calculation.\n",
"\n",
"In the context of my work, the computer would run for **4 minutes** instead of **35 hours**."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": true
},
"outputs": [],
"source": []
}
],
"metadata": {
"kernelspec": {
"display_name": "Python 2",
"language": "python",
"name": "python2"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 2
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython2",
"version": "2.7.10"
}
},
"nbformat": 4,
"nbformat_minor": 0
}