Skip to content

Instantly share code, notes, and snippets.

Show Gist options
  • Save antoine-tran/21899fee8ef77e5986c4af1dd4da9c5c to your computer and use it in GitHub Desktop.
Save antoine-tran/21899fee8ef77e5986c4af1dd4da9c5c to your computer and use it in GitHub Desktop.
Bootstrap resampling with Python
Display the source blob
Display the rendered blob
Raw
{
"metadata": {
"name": "2013_10_07_bootstrap_SIDS"
},
"nbformat": 3,
"nbformat_minor": 0,
"worksheets": [
{
"cells": [
{
"cell_type": "code",
"collapsed": false,
"input": "!date",
"language": "python",
"metadata": {},
"outputs": [
{
"output_type": "stream",
"stream": "stdout",
"text": "Mon Oct 7 09:45:02 PDT 2013\r\n"
}
],
"prompt_number": 1
},
{
"cell_type": "markdown",
"metadata": {},
"source": "#Bootstrap resampling with Python\n \nNext week my work on Iraqi mortality will come out! This has been quite an involved process and there are many details that I'm excited to share. And I'm very excited and a little nervous to see how this work is received.\n \nBut for today, I am focusing on a little piece of methodology that I need again for a completely different application: *bootstrap resampling*.\n \nWe used bootstrap in the Iraq mortality research to provide robust uncertainty bounds for all of our estimates. For the simple estimates, based on household deaths and crude mortality rates, the bootstrap was a convenience. But for the more complicated estimates, the ones based on the sibling survival method, the bootstrap was essential.\n\nAnd now for that completely different use: I have a great dataset on US hospital admissions, but I've promised not to share it in certain ways as a condition of use. And I have a great student, who is going to build me the ultimate data browser for this data. But he is not cleared to access the data yet! I don't want any delays, because UW is on the quarter system and this student's independent study will be over before I know it.\n\nEnter the bootstrap: I will make a dataset that looks just like my great dataset, but has all of the columns resampled independently. It will be just fine for testing, and even have simple statistics that make sense. But it will not reveal any data that I am not supposed to reveal. There is an expansive theory of this sort of privacy preserving database transformation (actually several theories). Is there an easily readable result that says bootstrap resampling the columns independently leaks no information beyond publishing marginal distributions? Or that it leaks more?? (That can't be, can it?)"
},
{
"cell_type": "code",
"collapsed": false,
"input": "import numpy as np, pandas as pd",
"language": "python",
"metadata": {},
"outputs": [],
"prompt_number": 3
},
{
"cell_type": "markdown",
"metadata": {},
"source": "As far as doing the work, the bootstrap in Python is quite simple. "
},
{
"cell_type": "code",
"collapsed": false,
"input": "def bootstrap_resample(X, n=None):\n \"\"\" Bootstrap resample an array_like\n Parameters\n ----------\n X : array_like\n data to resample\n n : int, optional\n length of resampled array, equal to len(X) if n==None\n Results\n -------\n returns X_resamples\n \"\"\"\n if n == None:\n n = len(X)\n \n resample_i = np.floor(np.random.rand(n)*len(X)).astype(int)\n X_resample = X[resample_i]\n return X_resample",
"language": "python",
"metadata": {},
"outputs": [],
"prompt_number": 34
},
{
"cell_type": "code",
"collapsed": false,
"input": "X = arange(10000)\nX_resample = bootstrap_resample(X, n=5000)\nprint 'original mean:', X.mean()\nprint 'resampled mean:', X_resample.mean()",
"language": "python",
"metadata": {},
"outputs": [
{
"output_type": "stream",
"stream": "stdout",
"text": "original mean: 4999.5\nresampled mean: 5030.4096\n"
}
],
"prompt_number": 35
},
{
"cell_type": "code",
"collapsed": false,
"input": "def test_bsr_shape():\n # test without resampling length parameter\n X = arange(10000)\n X_resample = bootstrap_resample(X)\n assert X_resample.shape == (10000,), 'resampled length should be 10000'\n \n # test with resampling length parameter\n n = 5000\n X_resample = bootstrap_resample(X, n=n)\n assert X_resample.shape == (n,), 'resampled length should be %d' % n\ntest_bsr_shape()",
"language": "python",
"metadata": {},
"outputs": [],
"prompt_number": 36
},
{
"cell_type": "code",
"collapsed": false,
"input": "def test_bsr_mean():\n # test that means are close\n np.random.seed(123456) # set seed so that randomness does not lead to failed test\n X = arange(10000)\n X_resample = bootstrap_resample(X, 5000)\n assert abs(X_resample.mean() - X.mean()) / X.mean() < 1e-2, 'means should be approximately equal'\ntest_bsr_mean()",
"language": "python",
"metadata": {},
"outputs": [],
"prompt_number": 41
},
{
"cell_type": "markdown",
"metadata": {},
"source": "I want to use this easily with Pandas, but there is a little bit of trouble with indexing to watch out for:"
},
{
"cell_type": "code",
"collapsed": false,
"input": "def test_bsr_on_df():\n # test that means are close for pd.DataFrame with unusual index\n np.random.seed(123456) # set seed so that randomness does not lead to failed test\n X = pd.Series(arange(10000), index=arange(10000)*10)\n \n X_resample = bootstrap_resample(X, 5000)\n print X_resample.mean(), X.mean()\n assert abs(X_resample.mean() - X.mean()) / X.mean() < 1e-2, 'means should be approximately equal'\n \ntest_bsr_on_df()",
"language": "python",
"metadata": {},
"outputs": [
{
"ename": "AssertionError",
"evalue": "means should be approximately equal",
"output_type": "pyerr",
"traceback": [
"\u001b[1;31m---------------------------------------------------------------------------\u001b[0m\n\u001b[1;31mAssertionError\u001b[0m Traceback (most recent call last)",
"\u001b[1;32m<ipython-input-47-41ba30c94f56>\u001b[0m in \u001b[0;36m<module>\u001b[1;34m()\u001b[0m\n\u001b[0;32m 8\u001b[0m \u001b[1;32massert\u001b[0m \u001b[0mabs\u001b[0m\u001b[1;33m(\u001b[0m\u001b[0mX_resample\u001b[0m\u001b[1;33m.\u001b[0m\u001b[0mmean\u001b[0m\u001b[1;33m(\u001b[0m\u001b[1;33m)\u001b[0m \u001b[1;33m-\u001b[0m \u001b[0mX\u001b[0m\u001b[1;33m.\u001b[0m\u001b[0mmean\u001b[0m\u001b[1;33m(\u001b[0m\u001b[1;33m)\u001b[0m\u001b[1;33m)\u001b[0m \u001b[1;33m/\u001b[0m \u001b[0mX\u001b[0m\u001b[1;33m.\u001b[0m\u001b[0mmean\u001b[0m\u001b[1;33m(\u001b[0m\u001b[1;33m)\u001b[0m \u001b[1;33m<\u001b[0m \u001b[1;36m1e-2\u001b[0m\u001b[1;33m,\u001b[0m \u001b[1;34m'means should be approximately equal'\u001b[0m\u001b[1;33m\u001b[0m\u001b[0m\n\u001b[0;32m 9\u001b[0m \u001b[1;33m\u001b[0m\u001b[0m\n\u001b[1;32m---> 10\u001b[1;33m \u001b[0mtest_bsr_on_df\u001b[0m\u001b[1;33m(\u001b[0m\u001b[1;33m)\u001b[0m\u001b[1;33m\u001b[0m\u001b[0m\n\u001b[0m",
"\u001b[1;32m<ipython-input-47-41ba30c94f56>\u001b[0m in \u001b[0;36mtest_bsr_on_df\u001b[1;34m()\u001b[0m\n\u001b[0;32m 6\u001b[0m \u001b[0mX_resample\u001b[0m \u001b[1;33m=\u001b[0m \u001b[0mbootstrap_resample\u001b[0m\u001b[1;33m(\u001b[0m\u001b[0mX\u001b[0m\u001b[1;33m,\u001b[0m \u001b[1;36m5000\u001b[0m\u001b[1;33m)\u001b[0m\u001b[1;33m\u001b[0m\u001b[0m\n\u001b[0;32m 7\u001b[0m \u001b[1;32mprint\u001b[0m \u001b[0mX_resample\u001b[0m\u001b[1;33m.\u001b[0m\u001b[0mmean\u001b[0m\u001b[1;33m(\u001b[0m\u001b[1;33m)\u001b[0m\u001b[1;33m,\u001b[0m \u001b[0mX\u001b[0m\u001b[1;33m.\u001b[0m\u001b[0mmean\u001b[0m\u001b[1;33m(\u001b[0m\u001b[1;33m)\u001b[0m\u001b[1;33m\u001b[0m\u001b[0m\n\u001b[1;32m----> 8\u001b[1;33m \u001b[1;32massert\u001b[0m \u001b[0mabs\u001b[0m\u001b[1;33m(\u001b[0m\u001b[0mX_resample\u001b[0m\u001b[1;33m.\u001b[0m\u001b[0mmean\u001b[0m\u001b[1;33m(\u001b[0m\u001b[1;33m)\u001b[0m \u001b[1;33m-\u001b[0m \u001b[0mX\u001b[0m\u001b[1;33m.\u001b[0m\u001b[0mmean\u001b[0m\u001b[1;33m(\u001b[0m\u001b[1;33m)\u001b[0m\u001b[1;33m)\u001b[0m \u001b[1;33m/\u001b[0m \u001b[0mX\u001b[0m\u001b[1;33m.\u001b[0m\u001b[0mmean\u001b[0m\u001b[1;33m(\u001b[0m\u001b[1;33m)\u001b[0m \u001b[1;33m<\u001b[0m \u001b[1;36m1e-2\u001b[0m\u001b[1;33m,\u001b[0m \u001b[1;34m'means should be approximately equal'\u001b[0m\u001b[1;33m\u001b[0m\u001b[0m\n\u001b[0m\u001b[0;32m 9\u001b[0m \u001b[1;33m\u001b[0m\u001b[0m\n\u001b[0;32m 10\u001b[0m \u001b[0mtest_bsr_on_df\u001b[0m\u001b[1;33m(\u001b[0m\u001b[1;33m)\u001b[0m\u001b[1;33m\u001b[0m\u001b[0m\n",
"\u001b[1;31mAssertionError\u001b[0m: means should be approximately equal"
]
},
{
"output_type": "stream",
"stream": "stdout",
"text": "504.862785863 4999.5\n"
}
],
"prompt_number": 47
},
{
"cell_type": "markdown",
"metadata": {},
"source": "Why didn't that work? Because Pandas has silently dealt with the rows for which the indices are missing, returning NaNs, and then silently deat with the NaNs, dropping them from the average calculation."
},
{
"cell_type": "code",
"collapsed": false,
"input": "def bootstrap_resample(X, n=None):\n \"\"\" Bootstrap resample an array_like\n Parameters\n ----------\n X : array_like\n data to resample\n n : int, optional\n length of resampled array, equal to len(X) if n==None\n Results\n -------\n returns X_resamples\n \"\"\"\n if isinstance(X, pd.Series):\n X = X.copy()\n X.index = range(len(X.index))\n if n == None:\n n = len(X)\n \n resample_i = np.floor(np.random.rand(n)*len(X)).astype(int)\n X_resample = np.array(X[resample_i]) # TODO: write a test demonstrating why array() is important\n return X_resample\ntest_bsr_on_df()",
"language": "python",
"metadata": {},
"outputs": [
{
"output_type": "stream",
"stream": "stdout",
"text": "5035.0484 4999.5\n"
}
],
"prompt_number": 56
},
{
"cell_type": "markdown",
"metadata": {},
"source": "And that is all I need to resample my dataset and make a version suitable for quick-starting an independent study:"
},
{
"cell_type": "code",
"collapsed": false,
"input": "df = pd.read_csv('/home/j/Project/Models/network_LoS/IA_2007.csv', index_col=0, low_memory=False)",
"language": "python",
"metadata": {},
"outputs": [],
"prompt_number": 62
},
{
"cell_type": "code",
"collapsed": false,
"input": "# moderately large file, for a single state/year\ndf.shape",
"language": "python",
"metadata": {},
"outputs": [
{
"metadata": {},
"output_type": "pyout",
"prompt_number": 63,
"text": "(367736, 361)"
}
],
"prompt_number": 63
},
{
"cell_type": "code",
"collapsed": false,
"input": "df_resampled = pd.DataFrame(index=df.index, columns=df.columns, dtype=df.dtypes)",
"language": "python",
"metadata": {},
"outputs": [],
"prompt_number": 64
},
{
"cell_type": "code",
"collapsed": false,
"input": "for col in df.columns:\n df_resampled[col] = bootstrap_resample(df[col])",
"language": "python",
"metadata": {},
"outputs": [],
"prompt_number": 65
},
{
"cell_type": "code",
"collapsed": false,
"input": "df.ix[:10,:10]",
"language": "python",
"metadata": {},
"outputs": [
{
"html": "<div style=\"max-height:1000px;max-width:1500px;overflow:auto;\">\n<table border=\"1\" class=\"dataframe\">\n <thead>\n <tr style=\"text-align: right;\">\n <th></th>\n <th>age</th>\n <th>ageday</th>\n <th>agemonth</th>\n <th>amonth</th>\n <th>asource</th>\n <th>asourceu</th>\n <th>asource_</th>\n <th>atype</th>\n <th>aweekend</th>\n <th>died</th>\n </tr>\n </thead>\n <tbody>\n <tr>\n <th>0 </th>\n <td> 85</td>\n <td>-999</td>\n <td>-999</td>\n <td> 1</td>\n <td> 5</td>\n <td> 1</td>\n <td> 1</td>\n <td> 3</td>\n <td> 0</td>\n <td> 0</td>\n </tr>\n <tr>\n <th>1 </th>\n <td> 92</td>\n <td>-999</td>\n <td>-999</td>\n <td> 1</td>\n <td> 5</td>\n <td> 1</td>\n <td> 1</td>\n <td> 3</td>\n <td> 0</td>\n <td> 1</td>\n </tr>\n <tr>\n <th>2 </th>\n <td> 79</td>\n <td>-999</td>\n <td>-999</td>\n <td> 2</td>\n <td> 5</td>\n <td> 1</td>\n <td> 1</td>\n <td> 2</td>\n <td> 0</td>\n <td> 0</td>\n </tr>\n <tr>\n <th>3 </th>\n <td> 85</td>\n <td>-999</td>\n <td>-999</td>\n <td> 1</td>\n <td> 1</td>\n <td> 7</td>\n <td> 7</td>\n <td> 1</td>\n <td> 1</td>\n <td> 0</td>\n </tr>\n <tr>\n <th>4 </th>\n <td> 79</td>\n <td>-999</td>\n <td>-999</td>\n <td> 1</td>\n <td> 5</td>\n <td> 1</td>\n <td> 1</td>\n <td> 3</td>\n <td> 1</td>\n <td> 0</td>\n </tr>\n <tr>\n <th>5 </th>\n <td> 70</td>\n <td>-999</td>\n <td>-999</td>\n <td> 1</td>\n <td> 5</td>\n <td> 1</td>\n <td> 1</td>\n <td> 3</td>\n <td> 0</td>\n <td> 0</td>\n </tr>\n <tr>\n <th>6 </th>\n <td> 86</td>\n <td>-999</td>\n <td>-999</td>\n <td> 1</td>\n <td> 5</td>\n <td> 1</td>\n <td> 1</td>\n <td> 3</td>\n <td> 0</td>\n <td> 0</td>\n </tr>\n <tr>\n <th>7 </th>\n <td> 86</td>\n <td>-999</td>\n <td>-999</td>\n <td> 3</td>\n <td> 5</td>\n <td> 1</td>\n <td> 1</td>\n <td> 2</td>\n <td> 1</td>\n <td> 0</td>\n </tr>\n <tr>\n <th>8 </th>\n <td> 86</td>\n <td>-999</td>\n <td>-999</td>\n <td> 3</td>\n <td> 5</td>\n <td> 1</td>\n <td> 1</td>\n <td> 3</td>\n <td> 1</td>\n <td> 0</td>\n </tr>\n <tr>\n <th>9 </th>\n <td> 79</td>\n <td>-999</td>\n <td>-999</td>\n <td> 2</td>\n <td> 5</td>\n <td> 1</td>\n <td> 1</td>\n <td> 3</td>\n <td> 0</td>\n <td> 0</td>\n </tr>\n <tr>\n <th>10</th>\n <td> 84</td>\n <td>-999</td>\n <td>-999</td>\n <td> 3</td>\n <td> 5</td>\n <td> 1</td>\n <td> 1</td>\n <td> 3</td>\n <td> 0</td>\n <td> 0</td>\n </tr>\n </tbody>\n</table>\n</div>",
"metadata": {},
"output_type": "pyout",
"prompt_number": 66,
"text": " age ageday agemonth amonth asource asourceu asource_ atype aweekend \\\n0 85 -999 -999 1 5 1 1 3 0 \n1 92 -999 -999 1 5 1 1 3 0 \n2 79 -999 -999 2 5 1 1 2 0 \n3 85 -999 -999 1 1 7 7 1 1 \n4 79 -999 -999 1 5 1 1 3 1 \n5 70 -999 -999 1 5 1 1 3 0 \n6 86 -999 -999 1 5 1 1 3 0 \n7 86 -999 -999 3 5 1 1 2 1 \n8 86 -999 -999 3 5 1 1 3 1 \n9 79 -999 -999 2 5 1 1 3 0 \n10 84 -999 -999 3 5 1 1 3 0 \n\n died \n0 0 \n1 1 \n2 0 \n3 0 \n4 0 \n5 0 \n6 0 \n7 0 \n8 0 \n9 0 \n10 0 "
}
],
"prompt_number": 66
},
{
"cell_type": "code",
"collapsed": false,
"input": "df_resampled.ix[:10,:10]",
"language": "python",
"metadata": {},
"outputs": [
{
"html": "<div style=\"max-height:1000px;max-width:1500px;overflow:auto;\">\n<table border=\"1\" class=\"dataframe\">\n <thead>\n <tr style=\"text-align: right;\">\n <th></th>\n <th>age</th>\n <th>ageday</th>\n <th>agemonth</th>\n <th>amonth</th>\n <th>asource</th>\n <th>asourceu</th>\n <th>asource_</th>\n <th>atype</th>\n <th>aweekend</th>\n <th>died</th>\n </tr>\n </thead>\n <tbody>\n <tr>\n <th>0 </th>\n <td> 45</td>\n <td>-999</td>\n <td>-999</td>\n <td> 5</td>\n <td> 1</td>\n <td> NaN</td>\n <td> 7</td>\n <td> 3</td>\n <td> 1</td>\n <td> 0</td>\n </tr>\n <tr>\n <th>1 </th>\n <td> 72</td>\n <td>-999</td>\n <td>-999</td>\n <td> 4</td>\n <td> 5</td>\n <td> 1</td>\n <td> 7</td>\n <td> 3</td>\n <td> 0</td>\n <td> 0</td>\n </tr>\n <tr>\n <th>2 </th>\n <td> 66</td>\n <td>-999</td>\n <td>-999</td>\n <td> 1</td>\n <td>-999</td>\n <td> 1</td>\n <td> NaN</td>\n <td> 4</td>\n <td> 0</td>\n <td> 0</td>\n </tr>\n <tr>\n <th>3 </th>\n <td> 50</td>\n <td>-999</td>\n <td>-999</td>\n <td> 5</td>\n <td>-999</td>\n <td> 1</td>\n <td> 7</td>\n <td> 3</td>\n <td> 0</td>\n <td> 0</td>\n </tr>\n <tr>\n <th>4 </th>\n <td> 37</td>\n <td>-999</td>\n <td>-999</td>\n <td> 12</td>\n <td> 5</td>\n <td> 3</td>\n <td> 2</td>\n <td> 3</td>\n <td> 0</td>\n <td> 0</td>\n </tr>\n <tr>\n <th>5 </th>\n <td> 41</td>\n <td>-999</td>\n <td>-999</td>\n <td> 6</td>\n <td> 5</td>\n <td> NaN</td>\n <td> 7</td>\n <td> 3</td>\n <td> 0</td>\n <td> 0</td>\n </tr>\n <tr>\n <th>6 </th>\n <td> 68</td>\n <td>-999</td>\n <td>-999</td>\n <td> 7</td>\n <td>-999</td>\n <td> 7</td>\n <td> 7</td>\n <td> 3</td>\n <td> 0</td>\n <td> 0</td>\n </tr>\n <tr>\n <th>7 </th>\n <td> 66</td>\n <td>-999</td>\n <td> 0</td>\n <td> 7</td>\n <td> 5</td>\n <td> 1</td>\n <td> 7</td>\n <td> 1</td>\n <td> 0</td>\n <td> 0</td>\n </tr>\n <tr>\n <th>8 </th>\n <td> 22</td>\n <td>-999</td>\n <td>-999</td>\n <td> 10</td>\n <td> 1</td>\n <td> 7</td>\n <td> 1</td>\n <td> 3</td>\n <td> 0</td>\n <td> 0</td>\n </tr>\n <tr>\n <th>9 </th>\n <td> 62</td>\n <td>-999</td>\n <td>-999</td>\n <td> 8</td>\n <td> 5</td>\n <td> 1</td>\n <td> NaN</td>\n <td> 3</td>\n <td> 0</td>\n <td> 0</td>\n </tr>\n <tr>\n <th>10</th>\n <td> 76</td>\n <td>-999</td>\n <td> 0</td>\n <td> 10</td>\n <td> 3</td>\n <td> 1</td>\n <td> 1</td>\n <td> 3</td>\n <td> 0</td>\n <td> 0</td>\n </tr>\n </tbody>\n</table>\n</div>",
"metadata": {},
"output_type": "pyout",
"prompt_number": 67,
"text": " age ageday agemonth amonth asource asourceu asource_ atype aweekend \\\n0 45 -999 -999 5 1 NaN 7 3 1 \n1 72 -999 -999 4 5 1 7 3 0 \n2 66 -999 -999 1 -999 1 NaN 4 0 \n3 50 -999 -999 5 -999 1 7 3 0 \n4 37 -999 -999 12 5 3 2 3 0 \n5 41 -999 -999 6 5 NaN 7 3 0 \n6 68 -999 -999 7 -999 7 7 3 0 \n7 66 -999 0 7 5 1 7 1 0 \n8 22 -999 -999 10 1 7 1 3 0 \n9 62 -999 -999 8 5 1 NaN 3 0 \n10 76 -999 0 10 3 1 1 3 0 \n\n died \n0 0 \n1 0 \n2 0 \n3 0 \n4 0 \n5 0 \n6 0 \n7 0 \n8 0 \n9 0 \n10 0 "
}
],
"prompt_number": 67
},
{
"cell_type": "code",
"collapsed": false,
"input": "df.age.mean(), df_resampled.age.mean()",
"language": "python",
"metadata": {},
"outputs": [
{
"metadata": {},
"output_type": "pyout",
"prompt_number": 68,
"text": "(51.52543128766289, 51.475596623664806)"
}
],
"prompt_number": 68
},
{
"cell_type": "code",
"collapsed": false,
"input": "df_resampled.to_csv('/home/j/Project/Models/network_LoS/IA_2007_resampled.csv')",
"language": "python",
"metadata": {},
"outputs": [],
"prompt_number": 69
},
{
"cell_type": "code",
"collapsed": false,
"input": "",
"language": "python",
"metadata": {},
"outputs": []
}
],
"metadata": {}
}
]
}
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment