Skip to content

Instantly share code, notes, and snippets.

@sujitpal
Last active June 27, 2018 07:54
Show Gist options
  • Save sujitpal/9999328 to your computer and use it in GitHub Desktop.
Save sujitpal/9999328 to your computer and use it in GitHub Desktop.
Using Market Basket (Apriori) analysis to predict frequent gene expression itemsets to predict Lung Cancer type
Display the source blob
Display the rendered blob
Raw
{
"metadata": {
"name": ""
},
"nbformat": 3,
"nbformat_minor": 0,
"worksheets": [
{
"cells": [
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Using Market Basket Analysis to predict Gene Expressions in various Lung Cancers ##\n",
"\n",
"We use Lung Cancer data from the Dana Farber Cancer Institute, available from the [Kent Ridge Bio-Medical Datasets](http://datam.i2r.a-star.edu.sg/datasets/krbd/) page.\n",
"\n",
"The data contains gene expression data for 203 individuals. Each record has 12,600 gene expressions so its a very wide record. The target variable (ie, the type of Lung Cancer) can be one of 5 values each corresponding to one type of Lung Cancer - Adenocarcinoma of Lung (ADEN), Squamous Cell Carcinoma (SQUA), Carcinoid (COID), Small Cell Lung Cancer (SCLC) and Normal (NORMAL).\n",
"\n",
"Our objective is to analyze this data and identify sets of gene expressions that occur frequently for a given lung cancer type using the [Apriori algorithm](http://en.wikipedia.org/wiki/Apriori_algorithm). The hope is that being able to specify frequently co-occurring gene expressions to identify a disease would make it easier to predict or confirm a disease based on gene expressions.\n",
"\n",
"We will do our analysis with Python so we first import the necessary libraries."
]
},
{
"cell_type": "code",
"collapsed": false,
"input": [
"import pandas as pd\n",
"import numpy as np\n",
"import matplotlib.pyplot as plt\n",
"from sklearn import preprocessing\n",
"from operator import itemgetter\n",
"%matplotlib inline"
],
"language": "python",
"metadata": {},
"outputs": [],
"prompt_number": 9
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"We then read in the data from the two files provided (lung-harvard.data, which contains the gene expressions and the target variable, and lung-harvard.names, which contains the column names)."
]
},
{
"cell_type": "code",
"collapsed": false,
"input": [
"colf = open(\"../data/lung-harvard.names\", 'rb')\n",
"targets = set()\n",
"colnames = []\n",
"for col in colf:\n",
" col = col.strip()\n",
" if len(targets) == 0:\n",
" targets.update(col.split(\",\"))\n",
" continue\n",
" if len(col) == 0: continue\n",
" colnames.append(col.split(\":\")[0].strip())\n",
"colnames.append(\"clazz\")\n",
"colf.close()\n",
"data = pd.read_csv(\"../data/lung-harvard.data\", names=colnames)"
],
"language": "python",
"metadata": {},
"outputs": [],
"prompt_number": 10
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"We then split the dataset by the lung cancer type (ie, the value of the target variable). The targets available are:"
]
},
{
"cell_type": "code",
"collapsed": false,
"input": [
"targets"
],
"language": "python",
"metadata": {},
"outputs": [
{
"metadata": {},
"output_type": "pyout",
"prompt_number": 11,
"text": [
"{'ADEN', 'COID', 'NORMAL', 'SCLC', 'SQUA'}"
]
}
],
"prompt_number": 11
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"We extract the subset of the data for each target and report the number of records in each subset."
]
},
{
"cell_type": "code",
"collapsed": false,
"input": [
"data_ADEN = data[data.clazz == 'ADEN']\n",
"data_COID = data[data.clazz == 'COID']\n",
"data_NORMAL = data[data.clazz == 'NORMAL']\n",
"data_SCLC = data[data.clazz == 'SCLC']\n",
"data_SQUA = data[data.clazz == 'SQUA']\n",
"\n",
"[len(data_ADEN), len(data_COID), len(data_NORMAL), len(data_SCLC), len(data_SQUA)]"
],
"language": "python",
"metadata": {},
"outputs": [
{
"metadata": {},
"output_type": "pyout",
"prompt_number": 12,
"text": [
"[139, 20, 17, 6, 21]"
]
}
],
"prompt_number": 12
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"We know that cancer is caused by over- or under-expression of genes. In order to find the \"normal\" range, we will compute the mean and standard deviation for each gene expression (column) for the NORMAL subset."
]
},
{
"cell_type": "code",
"collapsed": false,
"input": [
"X = data_NORMAL.values[:, 0:-1]\n",
"\n",
"numcols = X.shape[1]\n",
"means = np.zeros(numcols)\n",
"stdevs = np.zeros(numcols)\n",
"\n",
"for col in range(0, X.shape[1]):\n",
" Xcol = X[:, col]\n",
" means[col] = np.mean(Xcol)\n",
" stdevs[col] = np.std(Xcol)\n",
"\n",
"means, stdevs"
],
"language": "python",
"metadata": {},
"outputs": [
{
"metadata": {},
"output_type": "pyout",
"prompt_number": 13,
"text": [
"(array([-18.12764706, -7.62941176, -10.63117647, ..., 34.87176471,\n",
" -86.04294118, 55.99117647]),\n",
" array([ 16.28329874, 17.10513183, 15.87074367, ..., 41.82676658,\n",
" 41.65590881, 27.18601217]))"
]
}
],
"prompt_number": 13
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Using the [68-95-99.7 rule](http://en.wikipedia.org/wiki/68%E2%80%9395%E2%80%9399.7_rule), we will consider gene expression values in the other subsets \"normal\" if they occur within mean +/- 1 standard deviation of the corresponding gene expression in the NORMAL subset. Normal values will be indicated by 0 and abnormal values by 1. \n",
"\n",
"We show the transformation described above for the subset of data for ADEN. The sequence of steps will be similar for the other subsets."
]
},
{
"cell_type": "code",
"collapsed": false,
"input": [
"X = data_ADEN.values[:, 0:-1]\n",
"for col in range(0, X.shape[1]):\n",
" Xcol = X[:, col]\n",
" Xcol[(Xcol >= means[col] - stdevs[col]) & (Xcol <= means[col] + stdevs[col])] = 0\n",
" Xcol[Xcol != 0] = 1\n",
" X[:, col] = Xcol\n",
"X"
],
"language": "python",
"metadata": {},
"outputs": [
{
"metadata": {},
"output_type": "pyout",
"prompt_number": 14,
"text": [
"array([[0, 1, 0, ..., 0, 0, 0],\n",
" [1, 0, 1, ..., 1, 1, 1],\n",
" [0, 0, 0, ..., 0, 1, 0],\n",
" ..., \n",
" [0, 0, 1, ..., 0, 1, 0],\n",
" [1, 1, 1, ..., 1, 1, 1],\n",
" [0, 1, 0, ..., 1, 0, 1]], dtype=object)"
]
}
],
"prompt_number": 14
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"We now want to find the gene expressions which most frequently co-occur in the subset for ADEN. To find this, we calculate the sum of the rows. The entries with high value are the ones which co-occur the most frequently. \n",
"\n",
"The graph below shows how the sum of the rows (indicating how frequent a gene expression is in ADEN patients) decrease. A few genes are expressed in all ADEN patients in the sample, but the frequency falls off as shown."
]
},
{
"cell_type": "code",
"collapsed": false,
"input": [
"top_features = sorted(enumerate(np.sum(X, axis=0).tolist()), key=itemgetter(1), reverse=True)\n",
"plt.plot([x[1] for x in top_features])"
],
"language": "python",
"metadata": {},
"outputs": [
{
"metadata": {},
"output_type": "pyout",
"prompt_number": 15,
"text": [
"[<matplotlib.lines.Line2D at 0x69637d0>]"
]
},
{
"metadata": {},
"output_type": "display_data",
"png": "iVBORw0KGgoAAAANSUhEUgAAAYEAAAEACAYAAABVtcpZAAAABHNCSVQICAgIfAhkiAAAAAlwSFlz\nAAALEgAACxIB0t1+/AAAIABJREFUeJzt3Xt0lNW9xvHvpAS8gAREJpoBAyEhhEtEBNSKDNKQooCI\nCIIHKKi1h6UtaBGP51TQCgl4aMVKl60IpNJyqctl0AMpCA544aZRqkRFIYGEXFQw3ALkts8fr4SE\nJAjJzLwzmeez1iwye2ayn7SSH+++vNthjDGIiEhICrM7gIiI2EdFQEQkhKkIiIiEMBUBEZEQpiIg\nIhLCVARERELYeYvAlClTcDqd9OzZs9ZrCxYsICwsjMOHD1e1paSkEBsbS3x8POvXr/d+WhER8arz\nFoHJkyeTkZFRqz03N5cNGzZw7bXXVrVlZWWxatUqsrKyyMjIYOrUqVRWVno/sYiIeM15i8CAAQNo\n06ZNrfZHH32U+fPn12hLT09n3LhxhIeHEx0dTZcuXdixY4d304qIiFdd9JxAeno6LpeLXr161WjP\nz8/H5XJVPXe5XBw8eLDxCUVExGeaXcybS0pKmDt3Lhs2bKhqO99dJxwOR8OTiYiIz11UEdi7dy85\nOTkkJiYCkJeXR58+fdi+fTtRUVHk5uZWvTcvL4+oqKha3yMqqgv5+XsbGVtEJLTExMTw9ddfe/8b\nmx+RnZ1tevToUedr0dHR5tChQ8YYY3bv3m0SExPN6dOnzb59+0znzp1NZWVlrc8A5tixH+s1cM2a\nNcvuCI2i/PZSfvsEc3ZjrN+dvnDeOYFx48Zx8803s2fPHjp06MDSpUtrvF59uCchIYExY8aQkJDA\n0KFD+fOf/1zvcNB77zW6domIiBecdzhoxYoV5/3wvn37ajx/8sknefLJJ3+003//G37+8wtIJyIi\nPmXLjuFgXjnqdrvtjtAoym8v5bdPMGf3JccPY03+69DhID7e8Pnn/uxVRCS4ORyO867GbPD3taMI\ngEHnmYmIXDhfFQHbbiBXUmJXzyIicoYtRSAiAnJy7OhZRESqs6UIdOpkrRASERF72VIE+vSBd9+1\no2cREanOliKQmAgFBXb0LCIi1dlSBKKjNScgIhIIbCkC114L+fl29CwiItXZsk/g8GFD27Zor4CI\nyAVqUvsEIiKsP995x47eRUTkDFuKgMMBEybApk129C4iImfYtmM4JgZ27bKrdxERARuLwK23ws6d\ndvUuIiJg08SwMYbiYmjTRpPDIiIXoklNDAO0bm39qU1jIiL2sa0IOBzQtSsUF9uVQEREbCsCAJdf\nDidO2JlARCS02VoESkshPd3OBCIioc3WIjB5Mrz5plUMRETE/2wtAkOGWHsFPv7YzhQiIqHL1iLQ\no4e1X0C3jxARscd5i8CUKVNwOp307Nmzqm3GjBl069aNxMRERo0axZEjR6peS0lJITY2lvj4eNav\nX39BAQYMgMzMBqYXEZFGOW8RmDx5MhkZGTXahgwZwu7du9m1axdxcXGkpKQAkJWVxapVq8jKyiIj\nI4OpU6dSWVn5owGSkuCf/4STJxvxU4iISIOctwgMGDCANm3a1GhLSkoiLMz6WP/+/cnLywMgPT2d\ncePGER4eTnR0NF26dGHHjh0/GmDgQLjsMti6taE/goiINFSj5gSWLFnC7bffDkB+fj4ul6vqNZfL\nxcGDBy/o+wwapCEhERE7NGvoB+fMmUPz5s0ZP358ve9xOBx1ts+ePbvqa7fbzY03ulm0CH7724am\nERFpWjweDx6Px+f9NKgILFu2jLVr17Jx48aqtqioKHJzc6ue5+XlERUVVefnqxcB67Pwu9/BsWPQ\nqlVDEomINC1utxu32131/Omnn/ZJPxc9HJSRkcFzzz1Heno6l1xySVX7iBEjWLlyJaWlpWRnZ/PV\nV1/Rr1+/C/qe114LV1wBmzdfbBoREWmM8xaBcePGcfPNN/Pll1/SoUMHlixZwiOPPMLx48dJSkqi\nd+/eTJ06FYCEhATGjBlDQkICQ4cO5c9//nO9w0Hnat4chg6Fcy4QRETEx2w7T+BcWVnQvTscPaoh\nIRGRczW58wTOFRsLLVrAokV2JxERCR0BUwTCw+HZZ60icAF7zERExAsCZjgIoKgIIiOt08YiI/2Z\nSkQksPlqOCigigCAywWdO8OWLX4MJSIS4Jr8nMAZ6enw/vuwf7/dSUREmr6AKwLdukGHDrB6td1J\nRESavoArApddBg88AC+8YHcSEZGmL+DmBAC++QacTjhyxNpJLCIS6kJmTgCgXTto3RqWLLE7iYhI\n0xaQRSAsDGbMgJdfhm3b7E4jItJ0BWQRAGteID7eWi0kIiK+EbBFwOmEu+6C1FTYs8fuNCIiTVNA\nTgxXN3AgXHcdLFgAzRp8BI6ISHALqYnh6h57DJYvh5wcu5OIiDQ9AX8lAHDLLXDgAGzfDldf7aNg\nIiIBLGSvBADWrbOWjb72GpSX251GRKTpCIoi0KoV3HcfzJoFO3bYnUZEpOkIiuGgM0aMgG+/hcWL\nrVPIRERCRUgPB52xYAG0bAn/+heUldmdRkQk+AVVEYiNhWHD4Omn4Xe/szuNiEjwC7qV97/5DURE\nwPPPQ5cu1s5iERFpmKC6EjhjyBAYPhymTbM7iYhIcAuqieHqjLHuNNqyJWzcaB1GIyLSVNkyMTxl\nyhScTic9e/asajt8+DBJSUnExcUxZMgQiouLq15LSUkhNjaW+Ph41q9f7/Ww1TkckJdnrRL6xz/g\n88992p2ISJN03iIwefJkMjIyarSlpqaSlJTEnj17GDx4MKmpqQBkZWWxatUqsrKyyMjIYOrUqVRW\nVvouOdaBM+PHg8cDjzzi065ERJqk8xaBAQMG0KZNmxpta9asYdKkSQBMmjSJN954A4D09HTGjRtH\neHg40dHRdOnShR1+2Nk1eTL86U+wcyc8+qjPuxMRaVIuemK4qKgIp9MJgNPppKioCID8/HxcLlfV\n+1wuFwcPHvRSzPPr1QvS0qxHdrZuLSEicqEatUTU4XDgcDjO+3pdZs+eXfW12+3G7XY3JgZhYdb+\ngQ4drNtO//738OtfN+pbiojYyuPx4PF4fN7PRRcBp9NJYWEhkZGRFBQU0L59ewCioqLIzc2tel9e\nXh5RUVF1fo/qRcBbmjWDTz6BZ5+FVaus1UJJSV7vRkTEL879B/LTTz/tk34uejhoxIgRpKWlAZCW\nlsbIkSOr2leuXElpaSnZ2dl89dVX9OvXz7tpL8C4cRATAy+95PeuRUSCznn3CYwbN47Nmzfz3Xff\n4XQ6eeaZZ7jzzjsZM2YMBw4cIDo6mtWrVxMREQHA3LlzWbJkCc2aNWPhwoUkJyfX7tBHa12r27bN\nOoOgZUt4/XW47Tafdici4nO++t0ZtJvFfsyxY9ay0ebN4Y47rDuQnmf6QkQkoOkuohepVSsYM8Yq\nBuPHQ36+3YlERAJPk70SqC4xEVq0gLlz4Wc/82vXIiJeoSuBRnjtNUhIgDVr4OuvrfsOiYhIiBSB\n2Fi46y5Yv97aWPbRR3YnEhEJDCExHFTdHXdYk8XR0daGsk6dbIsiInLBNBzkJf/93zBwIGzdat14\nTkQklIVcEbj5Zuswmp//HB58EC65xDqlTEQkFIXccNAZxsDp0/CXv8Cbb8J//Af06QPVjk4QEQkY\n2izmI1lZMH8+HDgAbdtaK4lERAKNioCPbd5s7Sru1g1+8Qv41a/sTiQicpaKgI9VVEBmprWMdMsW\nmDcP2rWDakckiIjYRkXAT3bvhgkT4NQpKCuDr76yO5GIiIqA3xUXQ2SkdXxlmzYwZ45uQCci9lER\n8DNjYMUKOHLEOrv4m2+sm9KJiNhBRcBGvXrBp59aXz/1FPjogB8RkXr56ndno84YDhW7dll//u1v\n1oll11xj3ZV0wgT4yU/szSYi0hght2O4IRwO63HrrdZtqTMzYfp0TRqLSPDTcFADDRhgzRNcdpn1\nfNo0mDTJ3kwi0nRpTiDAFBZCQYH19euvw2efwRNPWM+7doUfjl0WEfEKFYEAtnOndZ6xMVZxuPde\na7OZiIi3qAgEiaVLITXVulupwwGPPw7x8XanEpFgp9VBQeLOO61f/sZYq4k2bjx7cE2LFvZmExE5\nl64EfOh//9c6xAagvNy6Spg40d5MIhKcAm44KCUlheXLlxMWFkbPnj1ZunQpJ06cYOzYsezfv5/o\n6GhWr15NxDkzpKFUBKr7n/+BTz6BIUOgWTOYMsU60EZE5EIE1PGSOTk5vPzyy2RmZvLpp59SUVHB\nypUrSU1NJSkpiT179jB48GBSU1O9nTdojRoFnTvD11/Ds8/Cxx/bnUhEpIFF4IorriA8PJySkhLK\ny8spKSnhmmuuYc2aNUz6YbH8pEmTeOONN7waNphdfz288IL16NsXxo6FhATroZVEImKXBk0Mt23b\nlscee4yOHTty6aWXkpycTFJSEkVFRTidTgCcTidFRUVeDdtULFt2do/BO+/AypVw003W8/Bw6N8f\nwrSXW0T8oEFFYO/evTz//PPk5OTQunVr7rnnHpYvX17jPQ6HA0c9916ePXt21ddutxu3292QGEGr\nTRvrAda8wGuvWXMGYA0Tvf++ddM6EQldHo8Hj8fj834aNDG8atUqNmzYwOLFiwF49dVX2bZtG5s2\nbeKdd94hMjKSgoICBg0axBdffFGzwxCdGL5QgwdbS0zbt7f+fOYZiImxO5WI2C2gJobj4+PZtm0b\nJ0+exBjD22+/TUJCAsOHDyctLQ2AtLQ0Ro4c6dWwoeAPf7BWDg0bBvv3g8cDR49ajxMn7E4nIk1N\ng5eIzp8/n7S0NMLCwrj++utZvHgxx44dY8yYMRw4cEBLRL3g2Wdh/vyzz0tKrOMvu3a1L5OI2CPg\n9gk0uEMVgQYbNMgqAJ07W89bt4Zf/lLHXoqEgoAaDhJ7TJsGV1wB331nPaZNs85CFhFpKF0JBLH4\neGuIqFm1NV6zZulcA5GmSDeQk1o++KDmlcDSpbBmjXX85RlXXw09evg/m4gEB10JNCHbt5/dbwBw\n+jTk5cG+ffZlEhHv0MSwXLSjR6FdOxg+vGb7sGEwebI9mUSkYVQEpEE2bLCKwRkffwzbtsGKFTXf\n17Yt/OQn/s0mIhdORUC84osvwO2GioqzbSUlMHMmPPWUbbFE5EdoYli8Ij7eOge5updesiaVz91v\nMHo0dOvmv2wi4n8qAsLQoXDwIJSWnm17913raqHavf5EpAnScJDU6S9/gcces3YlVzd16tkjM0XE\nfzQnIH5VWXn2zIMz/u//4G9/s+YPqgsLg9tug0sv9V8+kVCjIiC227sXHn3UKhDVffQRLFoEd91l\nTy6RUKAiIAHrwQdhxw5wuWq2jxxpvSYijaciIAErLw927arZ9uGHsGWLdZTmuSIjrWM0ReTCqQhI\nUPnyS0hOrrkfAeDYMevup1p1JHJxtE9AgkrXrpCTU7v9pZdg8WIoL6/Z3r49/PrXfokmItXoSkD8\nKjcXfjiBtIbZs+HUqZq3xRaRszQcJE2aywUnT9betexwWPc5+tnP7MklEihUBKRJO3HCKgLnmjnT\nurHdkCG1X+vWDbp39302kUCgIiAhae1aWLKkdvu331orjN5+2/+ZROygIiBSza5dMGAA9OtX+7U7\n7oDp0/2fScSXVAREqqmstG5yV1ZWs33XLnjtNXj11bo/53LBJZf4Pp+It6kIiFyAAwes+YNzl6CC\ndR7zQw/BnDn+zyXSWAG3T6C4uJgHHniA3bt343A4WLp0KbGxsYwdO5b9+/cTHR3N6tWriYiI8GZe\nkfPq2NE6OKcur7wCCxdaZy/X9TntU5BQ1OArgUmTJjFw4ECmTJlCeXk5J06cYM6cObRr147HH3+c\nefPm8f3335OamlqzQ10JiE0KC+Hvf4dz//MrL4ff/95aoSQSqAJqOOjIkSP07t2bffv21WiPj49n\n8+bNOJ1OCgsLcbvdfHHOP8tUBCTQGANXXlnzUJ3qOnSAzz/3byaRcwXUcFB2djZXXXUVkydPZteu\nXfTp04fnn3+eoqIinE4nAE6nk6KiIq+GFfEFh8M6O+HcSWawCkTbttZEc1hYzdfCwuDOO+Gyy/yT\nU8QXGlQEysvLyczM5MUXX6Rv375MmzatzmEfx7nbP38wu9rdw9xuN263uyExRLymRQvrUZdHH4WM\njNrtmzdDRIR1PKeIt3k8Hjwej8/7adBwUGFhITfddBPZ2dkAvPfee6SkpLBv3z7eeecdIiMjKSgo\nYNCgQRoOkiZr4kTrHIWrrqr79RtugD/+0b+ZpOkKqDkBgFtvvZXFixcTFxfH7NmzKSkpAeDKK69k\n5syZpKamUlxcrIlhabIKC+Grr+p/bfp0+Ne/6n49NhaaN/ddNml6Aq4I7Nq1iwceeIDS0lJiYmJY\nunQpFRUVjBkzhgMHDtS7RFRFQELByZMwcGDdK44KC63VSFOn+j+XBK+AKwIN7lBFQELc7NnWHEPf\nvrVfS06GYcP8HkmCgIqASBORnQ1vvVW7/d//hkOH4PXX/Z9JAp+KgEgTt3Mn3Hhj7aWoYE1Cv/KK\n/zNJ4FAREAkB5eW1dzRv2WLNHzz+eO33OxzWXoUrr/RPPrFPQG0WExHfqOt4zd69we2GDz6o/dr7\n71tF4/77fR5NmihdCYgEsSeegH/8A37YqF/D6NHWyWzSNGg4SERqOXoUvvyydvuOHbB8OSxaVPu1\na6/V8FEwUhEQkQu2fz/cc0/tcxWKi6F/f1ixwp5c0nAqAiLSaG+/DZMn132/o+RkuPtu/2eSC6Mi\nICKNVlJiXQVUVNRs//RTyMmB9PTan3E4rIfYS0VARHxm1y7rhnfnFgdjYP58mDHDnlxylpaIiojP\nJCbWfZ7CokXwz3/C5ZfXbL/pJmvpqgQ/XQmISL2+/NI6l7m6ffugdWtYtcqeTKFKw0EiEhDeeQdG\njIBOnWq2X389LFtmS6SQoCIgIgGhstI6c7n6/MF338GYMbBmTc33xsbWf+iOXBwVAREJWGVl1i2w\njx072/btt9btLl5+2bZYTYqKgIgElfR0+PWvrcN1zggPhz/8wZpTkIujIiAiQeX4cXjjDWv46Ixn\nnrFuiX3mQJ3mzeu+aZ7UpiIgIkHvF7+A1autrysr4ac/hY0bbY0UNFQERKRJ2bMHbrkF/uu/rOeX\nXQYPPlj3oTqiIiAiTUxpKcyaBadOWc9feQU++ww6drQ3V6BSERCRJu2GG6ylpi1aWM8HDoS//tXe\nTIFERUBEmrTvvoNDh6yvs7Phl7+EtDTreViYdauK5s3ty2c3FQERCRklJdbJaCdPWs8//RT+9je4\n/XZ7c9kpIItARUUFN9xwAy6XizfffJPDhw8zduxY9u/fT3R0NKtXryYiIqJmhyoCInKRJk60JpKv\nvdbaa/CnP0GbNnan8i9f/e5s1Dz8woULSUhIwPHDzcZTU1NJSkpiz549DB48mNTUVK+EFJHQ9tRT\nMH06jBplHZ354Yfw/fdnrxSk4RpcBPLy8li7di0PPPBAVXVas2YNkyZNAmDSpEm88cYb3kkpIiGt\nSxcYO9Z6uN3WfYqio6F7d7uTBb8G79WbPn06zz33HEePHq1qKyoqwul0AuB0OikqKmp8QhGRav76\nV+tx+jS0bAkpKVa7ywUTJtibLRg1qAi89dZbtG/fnt69e+PxeOp8j8PhqBomOtfs2bOrvna73bjd\n7obEEJEQ1qIFpKZaq4rKyqxi0JSKgMfjqff3qzc1aGL4ySef5NVXX6VZs2acOnWKo0ePMmrUKHbu\n3InH4yEyMpKCggIGDRrEF198UbNDTQyLiJcZA61awZVXWsXh/feb3i2sA2pieO7cueTm5pKdnc3K\nlSu57bbbePXVVxkxYgRpPyzsTUtLY+TIkV4NKyJSF4cDcnJgyxarCKxaBTt32p0qOHjlLh1nhn2e\neOIJNmzYQFxcHJs2beKJJ57wxrcXEflR7dpZS0jvuw/efBNuvdW6QpDz02YxEWmSWrWCQYPgnnua\nxlxBQA0HiYgEun/9CxISYOVKa/JY6qYrARFpsv79bxg6FL75xpojuO46uxM1XEDeNqJBHaoIiIif\n/fzn1mqhvn2tIy+DkYaDREQa6LHHICYGfvtb6+Z0cpauBEQkZMTEwLFjcOONsGaN3Wkujq4EREQa\n6ZNP4O234b33YN06LSEFFQERCSGtWlkrhu64w7oZ3Z49dieyn4aDRCQk/fSnUFkJDz0Ev/iF3Wl+\nnIaDRES8aMkSuO02eO210D6XQEVAREJS167WuQQ7dwbvslFvUBEQkZCVmAgvvwybN8PWrXansYeK\ngIiEtAEDoFcv6yD7UKQiICIhrU0bmDQJXnoJNm60O43/aXWQiAjWLagrK2HFCruT1E33DhIR8aG1\na639A998E5inkmmJqIiID91+O7RvDwsW2J3Ev1QERER+MGsWrF9vdwr/0nCQiMgPcnKgUyc4fhwu\nv9zuNDVpOEhExMeio6F1a+uM4lChIiAiUs1dd4XWUlENB4mIVPPmmzBiRODdZlpLREVE/KC8HMLD\nobAQnE6705ylOQERET9o1gzi4mDVKruT+EeDikBubi6DBg2ie/fu9OjRgxdeeAGAw4cPk5SURFxc\nHEOGDKG4uNirYUVE/GH0aHj/fbtT+EeDikB4eDh//OMf2b17N9u2bWPRokV8/vnnpKamkpSUxJ49\nexg8eDCpqanezisi4nPXXWctFw0FXpkTGDlyJA8//DAPP/wwmzdvxul0UlhYiNvt5osvvqjZoeYE\nRCTAffWVNSQUSL+qAnZiOCcnh4EDB/LZZ5/RsWNHvv/+ewCMMbRt27bqeVWHKgIiEgQcDvj6a4iJ\nsTuJxVe/O5s15sPHjx/n7rvvZuHChbRq1arGaw6HA4fDUefnZs+eXfW12+3G7XY3JoaIiNclJkJm\npn1FwOPx4PF4fN5Pg68EysrKGDZsGEOHDmXatGkAxMfH4/F4iIyMpKCggEGDBmk4SESC0r33QkIC\nPPWU3UksAbVE1BjD/fffT0JCQlUBABgxYgRpaWkApKWlMXLkSO+kFBHxs0CbE/CVBl0JvPfee9x6\n66306tWrasgnJSWFfv36MWbMGA4cOEB0dDSrV68mIiKiZoe6EhCRIDB/Pnz7LTz3nN1JLAE1J3DL\nLbdQWVlZ52tvv/12owKJiASCFi3gs8/sTuF72jEsIlKHrl2tU8aaOhUBEZE6tG8PBQV2p/A9FQER\nkTpcdRWEhcBvyBD4EUVELl7LltYJY02dioCISB1atoQjR+DwYbuT+JaKgIhIHcLDrd3Chw7ZncS3\nVAREROpx+eVw4oTdKXxLRUBE5Dxyc+1O4FsqAiIi9ejcuenvFVAREBGpR6tWmhgWEQlZUVFQVmZ3\nCt9SERARqcell8LJk3an8C0VARGReqgIiIiEsEsvhexsu1P4loqAiEg9WrWydg03ZSoCIiL16NQJ\nTp+2O4VvqQiIiNSjZUsoLLQ7hW+pCIiI1KNVK8jPtzuFb6kIiIjUIyLCmhxuylQERETq0aKF5gRE\nREJW8+YqAiIiIat5c+u2EcbYncR3vF4EMjIyiI+PJzY2lnnz5nn724uI+E1YmHW4TGmp3Ul8x6tF\noKKigocffpiMjAyysrJYsWIFn3/+uTe7sJ3H47E7QqMov72U3z4Nzd7U5wW8WgR27NhBly5diI6O\nJjw8nHvvvZf09HRvdmG7YP5LAMpvN+W3T0Ozl5bCqVPezRJIvFoEDh48SIcOHaqeu1wuDh486M0u\nRET86ooroKjI7hS+49Ui4HA4vPntRERsFxsLx47ZncKHjBdt3brVJCcnVz2fO3euSU1NrfGemJgY\nA+ihhx566HERj5iYGG/+uq7iMMZ7i5/Ky8vp2rUrGzdu5JprrqFfv36sWLGCbt26easLERHxomZe\n/WbNmvHiiy+SnJxMRUUF999/vwqAiEgA8+qVgIiIBBe/7hgOxI1kubm5DBo0iO7du9OjRw9eeOEF\nAA4fPkxSUhJxcXEMGTKE4uLiqs+kpKQQGxtLfHw869evr2r/6KOP6NmzJ7GxsfzmN7/x689RUVFB\n7969GT58eNDlLy4uZvTo0XTr1o2EhAS2b98eVPlTUlLo3r07PXv2ZPz48Zw+fTqg80+ZMgWn00nP\nnj2r2ryZ9/Tp04wdO5bY2FhuvPFG9u/f79PsM2bMoFu3biQmJjJq1CiOVDsFJpCy15f/jAULFhAW\nFsbhw4f9m98nMw11KC8vNzExMSY7O9uUlpaaxMREk5WV5a/u61VQUGA+/vhjY4wxx44dM3FxcSYr\nK8vMmDHDzJs3zxhjTGpqqpk5c6Yxxpjdu3ebxMREU1paarKzs01MTIyprKw0xhjTt29fs337dmOM\nMUOHDjXr1q3z28+xYMECM378eDN8+HBjjAmq/BMnTjSvvPKKMcaYsrIyU1xcHDT5s7OzTadOncyp\nU6eMMcaMGTPGLFu2LKDzb9myxWRmZpoePXpUtXkz76JFi8x//ud/GmOMWblypRk7dqxPs69fv95U\nVFQYY4yZOXNmwGavL78xxhw4cMAkJyeb6Ohoc+jQIb/m91sR+OCDD2qsHEpJSTEpKSn+6v6C3Xnn\nnWbDhg2ma9euprCw0BhjFYquXbsaY2qveEpOTjZbt241+fn5Jj4+vqp9xYoV5qGHHvJL5tzcXDN4\n8GCzadMmM2zYMGOMCZr8xcXFplOnTrXagyX/oUOHTFxcnDl8+LApKyszw4YNM+vXrw/4/NnZ2TV+\nEXkzb3Jystm2bZsxxirq7dq182n26l5//XVz3333BWz2+vKPHj3a7Nq1q0YR8Fd+vw0HBcNGspyc\nHD7++GP69+9PUVERTqcTAKfTSdEPu0Xy8/NxuVxVnznzc5zbHhUV5befb/r06Tz33HOEhZ39vzNY\n8mdnZ3PVVVcxefJkrr/+eh588EFOnDgRNPnbtm3LY489RseOHbnmmmuIiIggKSkpaPKf4c281f+u\nN2vWjNatW9cY4vClJUuWcPvttwdV9vT0dFwuF7169arR7q/8fisCgb6R7Pjx49x9990sXLiQVq1a\n1XjN4XCnUIgUAAADJklEQVQEbP633nqL9u3b07t3b0w9c/yBnL+8vJzMzEymTp1KZmYml19+Oamp\nqTXeE8j59+7dy/PPP09OTg75+fkcP36c5cuX13hPIOevS7DlPWPOnDk0b96c8ePH2x3lgpWUlDB3\n7lyefvrpqrb6/h77it+KQFRUFLm5uVXPc3Nza1QzO5WVlXH33XczYcIERo4cCVj/Gir84XDRgoIC\n2rdvD9T+OfLy8nC5XERFRZGXl1ejPSoqyufZP/jgA9asWUOnTp0YN24cmzZtYsKECUGT3+Vy4XK5\n6Nu3LwCjR48mMzOTyMjIoMj/4YcfcvPNN3PllVfSrFkzRo0axdatW4Mm/xne+O/lzN/nqKgoDhw4\nAFhF/siRI7Rt29an+ZctW8batWv5+9//XtUWDNn37t1LTk4OiYmJdOrUiby8PPr06UNRUZH/8jdy\neOuClZWVmc6dO5vs7Gxz+vTpgJkYrqysNBMmTDDTpk2r0T5jxoyq8biUlJRak02nT582+/btM507\nd66arOnXr5/Ztm2bqays9PvEsDHGeDyeqjmBYMo/YMAA8+WXXxpjjJk1a5aZMWNG0OT/5JNPTPfu\n3U1JSYmprKw0EydONC+++GLA5z93XNqbeRctWmR+9atfGWOs8WpvT66em33dunUmISHBfPvttzXe\nF4jZ68pfXV0Tw77O77ciYIwxa9euNXFxcSYmJsbMnTvXn13X69133zUOh8MkJiaa6667zlx33XVm\n3bp15tChQ2bw4MEmNjbWJCUlme+//77qM3PmzDExMTGma9euJiMjo6r9ww8/ND169DAxMTHmkUce\n8fvP4vF4qlYHBVP+Tz75xNxwww2mV69e5q677jLFxcVBlX/evHkmISHB9OjRw0ycONGUlpYGdP57\n773XXH311SY8PNy4XC6zZMkSr+Y9deqUueeee0yXLl1M//79TXZ2ts+yv/LKK6ZLly6mY8eOVX9/\nz6yOCbTs1fM3b9686n/76jp16lRVBPyVX5vFRERCmI6XFBEJYSoCIiIhTEVARCSEqQiIiIQwFQER\nkRCmIiAiEsJUBEREQpiKgIhICPt/X9Qx17S0fPkAAAAASUVORK5CYII=\n",
"text": [
"<matplotlib.figure.Figure at 0x4628d50>"
]
}
],
"prompt_number": 15
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"In order to find the most frequent \"itemsets\" of gene expressions, we use the [Apriori algorithm](http://en.wikipedia.org/wiki/Apriori_algorithm). Unfortunately, the algorithm has order of complexity O(2^n) where n is the number of items, so we will limit n to the top 10 gene expressions identified in top_features above.\n",
"\n",
"The implementation we will use is from the blog post [Machine learning and Data Mining - Association Analysis with Python](http://aimotion.blogspot.com/2013/01/machine-learning-and-data-mining.html). The algorithm expects to see inputs as a list of lists, where each inner list corresponds to a row of data, with the entries corresponding to the actual gene name (or its index).\n",
"\n",
"We convert the data in X to market basket form below:"
]
},
{
"cell_type": "code",
"collapsed": false,
"input": [
"baskets = []\n",
"top_gene_ids = [x[0] for x in top_features[0:10]]\n",
"for row in range(0, X.shape[0]):\n",
" Xrow = X[row, :]\n",
" basket = []\n",
" for col in top_gene_ids:\n",
" if X[row, col] > 0: basket.append(colnames[col])\n",
" baskets.append(basket)"
],
"language": "python",
"metadata": {},
"outputs": [],
"prompt_number": 16
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"The code below is from the blog post referenced above. We will not explain the code block below, please refer to the blog post for more details."
]
},
{
"cell_type": "code",
"collapsed": false,
"input": [
"def load_dataset():\n",
" \"Load the sample dataset.\"\n",
" return [[1, 3, 4], [2, 3, 5], [1, 2, 3, 5], [2, 5]]\n",
" \n",
"def createC1(dataset):\n",
" \"Create a list of candidate item sets of size one.\"\n",
" c1 = []\n",
" for transaction in dataset:\n",
" for item in transaction:\n",
" if not [item] in c1:\n",
" c1.append([item])\n",
" c1.sort()\n",
" return map(frozenset, c1)\n",
" \n",
"def scanD(dataset, candidates, min_support):\n",
" \"Returns all candidates that meets a minimum support level\"\n",
" sscnt = {}\n",
" for tid in dataset:\n",
" for can in candidates:\n",
" if can.issubset(tid):\n",
" sscnt.setdefault(can, 0)\n",
" sscnt[can] += 1\n",
" num_items = float(len(dataset))\n",
" retlist = []\n",
" support_data = {}\n",
" for key in sscnt:\n",
" support = sscnt[key] / num_items\n",
" if support >= min_support:\n",
" retlist.insert(0, key)\n",
" support_data[key] = support\n",
" return retlist, support_data\n",
" \n",
"def aprioriGen(freq_sets, k):\n",
" \"Generate the joint transactions from candidate sets\"\n",
" retList = []\n",
" lenLk = len(freq_sets)\n",
" for i in range(lenLk):\n",
" for j in range(i + 1, lenLk):\n",
" L1 = list(freq_sets[i])[:k - 2]\n",
" L2 = list(freq_sets[j])[:k - 2]\n",
" L1.sort()\n",
" L2.sort()\n",
" if L1 == L2:\n",
" retList.append(freq_sets[i] | freq_sets[j])\n",
" return retList\n",
" \n",
"def apriori(dataset, minsupport=0.5):\n",
" \"Generate a list of candidate item sets\"\n",
" C1 = createC1(dataset)\n",
" D = map(set, dataset)\n",
" L1, support_data = scanD(D, C1, minsupport)\n",
" L = [L1]\n",
" k = 2\n",
" while (len(L[k - 2]) > 0):\n",
" Ck = aprioriGen(L[k - 2], k)\n",
" Lk, supK = scanD(D, Ck, minsupport)\n",
" support_data.update(supK)\n",
" L.append(Lk)\n",
" k += 1\n",
" return L, support_data"
],
"language": "python",
"metadata": {},
"outputs": [],
"prompt_number": 17
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"We call the code above to find the top 5 frequent itemsets that have a support of greater than 0.99, ie these itemsets occur in almost all the patients in the dataset.\n",
"\n",
"The extremely complex line at the end below is to pull the actual gene names and the support value out of an extremely complicated data structure returned by the Apriori algorithm, and to print out the top 5 (longest and with highest support) itemsets for the ADEN subset."
]
},
{
"cell_type": "code",
"collapsed": false,
"input": [
"L, support_data = apriori(baskets, minsupport=0.5)\n",
"[\"%s => %s\" % (\",\".join([g for g in x[0]]), x[2]) for x in \n",
" sorted([(x,len(x),support_data[x]) \n",
" for x in support_data], key=itemgetter(1,2), reverse=True)][0:5]"
],
"language": "python",
"metadata": {},
"outputs": [
{
"metadata": {},
"output_type": "pyout",
"prompt_number": 18,
"text": [
"['39631_at,38177_at,37398_at,34708_at,40841_at,36569_at,35868_at,38995_at,39760_at,268_at => 0.978417266187',\n",
" '39631_at,38177_at,37398_at,34708_at,40841_at,36569_at,35868_at,39760_at,268_at => 0.985611510791',\n",
" '39631_at,38177_at,37398_at,34708_at,40841_at,36569_at,35868_at,38995_at,268_at => 0.985611510791',\n",
" '39631_at,37398_at,34708_at,40841_at,36569_at,35868_at,38995_at,39760_at,268_at => 0.978417266187',\n",
" '39631_at,38177_at,37398_at,34708_at,40841_at,35868_at,38995_at,39760_at,268_at => 0.978417266187']"
]
}
],
"prompt_number": 18
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"This kind of analysis can help doctors diagnose different types of lung cancer more effectively. Rather than having to look at all 12,500 different gene expressions, a doctor could only look at the top few sets (or baskets) of \"important\" gene expressions provided by market-basket analysis."
]
}
],
"metadata": {}
}
]
}
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment