20170927_pandas_timeseries_rolling_large-window_unique-count.ipynb
{
"cells": [
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# How to efficiently compute a rolling unique count in a `pandas` time series?\n",
"\n",
"I have a time series of people visiting a building. Each person has a unique ID. I want to know the number of unique people visiting the building in the last 365 days.\n",
"\n",
"`pandas` does not seem to have a built-in method for this calculation. The calculation becomes computationally intensive when there is a large number of unique visitors and/or a large window. (The actual data is larger than this example.)\n",
"\n",
"Is there a better way to calculate this than what I've done below? I'm not sure why my `windowed_nunique` function (under \"Speed test 3\") is off by 1.\n",
"\n",
"Related links:\n",
"* https://github.com/pandas-dev/pandas/issues/14336"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Initialization"
]
},
{
"cell_type": "code",
"execution_count": 1,
"metadata": {},
"outputs": [],
"source": [
"# Import libraries.\n",
"import pandas as pd\n",
"import numba\n",
"import numpy as np"
]
},
{
"cell_type": "code",
"execution_count": 2,
"metadata": {
"scrolled": true
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Created data of people visiting a building:\n"
]
},
{
"data": {
"text/html": [
"<div>\n",
"<style>\n",
" .dataframe thead tr:only-child th {\n",
" text-align: right;\n",
" }\n",
"\n",
" .dataframe thead th {\n",
" text-align: left;\n",
" }\n",
"\n",
" .dataframe tbody tr th {\n",
" vertical-align: top;\n",
" }\n",
"</style>\n",
"<table border=\"1\" class=\"dataframe\">\n",
" <thead>\n",
" <tr style=\"text-align: right;\">\n",
" <th></th>\n",
" <th>Date</th>\n",
" <th>PersonId</th>\n",
" </tr>\n",
" </thead>\n",
" <tbody>\n",
" <tr>\n",
" <th>0</th>\n",
" <td>2010-01-01</td>\n",
" <td>76</td>\n",
" </tr>\n",
" <tr>\n",
" <th>1</th>\n",
" <td>2010-01-01</td>\n",
" <td>63</td>\n",
" </tr>\n",
" <tr>\n",
" <th>2</th>\n",
" <td>2010-01-01</td>\n",
" <td>89</td>\n",
" </tr>\n",
" <tr>\n",
" <th>3</th>\n",
" <td>2010-01-01</td>\n",
" <td>81</td>\n",
" </tr>\n",
" <tr>\n",
" <th>4</th>\n",
" <td>2010-01-01</td>\n",
" <td>7</td>\n",
" </tr>\n",
" <tr>\n",
" <th>5</th>\n",
" <td>2010-01-02</td>\n",
" <td>22</td>\n",
" </tr>\n",
" <tr>\n",
" <th>6</th>\n",
" <td>2010-01-02</td>\n",
" <td>83</td>\n",
" </tr>\n",
" <tr>\n",
" <th>7</th>\n",
" <td>2010-01-02</td>\n",
" <td>78</td>\n",
" </tr>\n",
" <tr>\n",
" <th>8</th>\n",
" <td>2010-01-02</td>\n",
" <td>47</td>\n",
" </tr>\n",
" <tr>\n",
" <th>9</th>\n",
" <td>2010-01-02</td>\n",
" <td>68</td>\n",
" </tr>\n",
" <tr>\n",
" <th>10</th>\n",
" <td>2010-01-02</td>\n",
" <td>72</td>\n",
" </tr>\n",
" <tr>\n",
" <th>11</th>\n",
" <td>2010-01-03</td>\n",
" <td>89</td>\n",
" </tr>\n",
" <tr>\n",
" <th>12</th>\n",
" <td>2010-01-03</td>\n",
" <td>94</td>\n",
" </tr>\n",
" <tr>\n",
" <th>13</th>\n",
" <td>2010-01-03</td>\n",
" <td>44</td>\n",
" </tr>\n",
" <tr>\n",
" <th>14</th>\n",
" <td>2010-01-04</td>\n",
" <td>67</td>\n",
" </tr>\n",
" <tr>\n",
" <th>15</th>\n",
" <td>2010-01-04</td>\n",
" <td>88</td>\n",
" </tr>\n",
" <tr>\n",
" <th>16</th>\n",
" <td>2010-01-04</td>\n",
" <td>90</td>\n",
" </tr>\n",
" <tr>\n",
" <th>17</th>\n",
" <td>2010-01-05</td>\n",
" <td>30</td>\n",
" </tr>\n",
" <tr>\n",
" <th>18</th>\n",
" <td>2010-01-05</td>\n",
" <td>90</td>\n",
" </tr>\n",
" <tr>\n",
" <th>19</th>\n",
" <td>2010-01-05</td>\n",
" <td>70</td>\n",
" </tr>\n",
" <tr>\n",
" <th>20</th>\n",
" <td>2010-01-06</td>\n",
" <td>10</td>\n",
" </tr>\n",
" <tr>\n",
" <th>21</th>\n",
" <td>2010-01-06</td>\n",
" <td>77</td>\n",
" </tr>\n",
" <tr>\n",
" <th>22</th>\n",
" <td>2010-01-07</td>\n",
" <td>15</td>\n",
" </tr>\n",
" <tr>\n",
" <th>23</th>\n",
" <td>2010-01-08</td>\n",
" <td>78</td>\n",
" </tr>\n",
" <tr>\n",
" <th>24</th>\n",
" <td>2010-01-08</td>\n",
" <td>81</td>\n",
" </tr>\n",
" <tr>\n",
" <th>25</th>\n",
" <td>2010-01-08</td>\n",
" <td>49</td>\n",
" </tr>\n",
" <tr>\n",
" <th>26</th>\n",
" <td>2010-01-08</td>\n",
" <td>96</td>\n",
" </tr>\n",
" <tr>\n",
" <th>27</th>\n",
" <td>2010-01-08</td>\n",
" <td>92</td>\n",
" </tr>\n",
" <tr>\n",
" <th>28</th>\n",
" <td>2010-01-09</td>\n",
" <td>35</td>\n",
" </tr>\n",
" <tr>\n",
" <th>29</th>\n",
" <td>2010-01-09</td>\n",
" <td>69</td>\n",
" </tr>\n",
" <tr>\n",
" <th>...</th>\n",
" <td>...</td>\n",
" <td>...</td>\n",
" </tr>\n",
" <tr>\n",
" <th>9151</th>\n",
" <td>2014-12-26</td>\n",
" <td>89</td>\n",
" </tr>\n",
" <tr>\n",
" <th>9152</th>\n",
" <td>2014-12-26</td>\n",
" <td>54</td>\n",
" </tr>\n",
" <tr>\n",
" <th>9153</th>\n",
" <td>2014-12-26</td>\n",
" <td>56</td>\n",
" </tr>\n",
" <tr>\n",
" <th>9154</th>\n",
" <td>2014-12-26</td>\n",
" <td>76</td>\n",
" </tr>\n",
" <tr>\n",
" <th>9155</th>\n",
" <td>2014-12-26</td>\n",
" <td>95</td>\n",
" </tr>\n",
" <tr>\n",
" <th>9156</th>\n",
" <td>2014-12-26</td>\n",
" <td>32</td>\n",
" </tr>\n",
" <tr>\n",
" <th>9157</th>\n",
" <td>2014-12-27</td>\n",
" <td>90</td>\n",
" </tr>\n",
" <tr>\n",
" <th>9158</th>\n",
" <td>2014-12-27</td>\n",
" <td>73</td>\n",
" </tr>\n",
" <tr>\n",
" <th>9159</th>\n",
" <td>2014-12-28</td>\n",
" <td>90</td>\n",
" </tr>\n",
" <tr>\n",
" <th>9160</th>\n",
" <td>2014-12-28</td>\n",
" <td>55</td>\n",
" </tr>\n",
" <tr>\n",
" <th>9161</th>\n",
" <td>2014-12-28</td>\n",
" <td>88</td>\n",
" </tr>\n",
" <tr>\n",
" <th>9162</th>\n",
" <td>2014-12-28</td>\n",
" <td>49</td>\n",
" </tr>\n",
" <tr>\n",
" <th>9163</th>\n",
" <td>2014-12-28</td>\n",
" <td>93</td>\n",
" </tr>\n",
" <tr>\n",
" <th>9164</th>\n",
" <td>2014-12-29</td>\n",
" <td>51</td>\n",
" </tr>\n",
" <tr>\n",
" <th>9165</th>\n",
" <td>2014-12-29</td>\n",
" <td>63</td>\n",
" </tr>\n",
" <tr>\n",
" <th>9166</th>\n",
" <td>2014-12-29</td>\n",
" <td>27</td>\n",
" </tr>\n",
" <tr>\n",
" <th>9167</th>\n",
" <td>2014-12-29</td>\n",
" <td>92</td>\n",
" </tr>\n",
" <tr>\n",
" <th>9168</th>\n",
" <td>2014-12-29</td>\n",
" <td>53</td>\n",
" </tr>\n",
" <tr>\n",
" <th>9169</th>\n",
" <td>2014-12-30</td>\n",
" <td>66</td>\n",
" </tr>\n",
" <tr>\n",
" <th>9170</th>\n",
" <td>2014-12-30</td>\n",
" <td>92</td>\n",
" </tr>\n",
" <tr>\n",
" <th>9171</th>\n",
" <td>2014-12-30</td>\n",
" <td>94</td>\n",
" </tr>\n",
" <tr>\n",
" <th>9172</th>\n",
" <td>2014-12-30</td>\n",
" <td>75</td>\n",
" </tr>\n",
" <tr>\n",
" <th>9173</th>\n",
" <td>2014-12-30</td>\n",
" <td>27</td>\n",
" </tr>\n",
" <tr>\n",
" <th>9174</th>\n",
" <td>2014-12-30</td>\n",
" <td>99</td>\n",
" </tr>\n",
" <tr>\n",
" <th>9175</th>\n",
" <td>2014-12-31</td>\n",
" <td>83</td>\n",
" </tr>\n",
" <tr>\n",
" <th>9176</th>\n",
" <td>2014-12-31</td>\n",
" <td>42</td>\n",
" </tr>\n",
" <tr>\n",
" <th>9177</th>\n",
" <td>2014-12-31</td>\n",
" <td>44</td>\n",
" </tr>\n",
" <tr>\n",
" <th>9178</th>\n",
" <td>2015-01-01</td>\n",
" <td>93</td>\n",
" </tr>\n",
" <tr>\n",
" <th>9179</th>\n",
" <td>2015-01-01</td>\n",
" <td>30</td>\n",
" </tr>\n",
" <tr>\n",
" <th>9180</th>\n",
" <td>2015-01-01</td>\n",
" <td>80</td>\n",
" </tr>\n",
" </tbody>\n",
"</table>\n",
"<p>9181 rows × 2 columns</p>\n",
"</div>"
],
"text/plain": [
" Date PersonId\n",
"0 2010-01-01 76\n",
"1 2010-01-01 63\n",
"2 2010-01-01 89\n",
"3 2010-01-01 81\n",
"4 2010-01-01 7\n",
"5 2010-01-02 22\n",
"6 2010-01-02 83\n",
"7 2010-01-02 78\n",
"8 2010-01-02 47\n",
"9 2010-01-02 68\n",
"10 2010-01-02 72\n",
"11 2010-01-03 89\n",
"12 2010-01-03 94\n",
"13 2010-01-03 44\n",
"14 2010-01-04 67\n",
"15 2010-01-04 88\n",
"16 2010-01-04 90\n",
"17 2010-01-05 30\n",
"18 2010-01-05 90\n",
"19 2010-01-05 70\n",
"20 2010-01-06 10\n",
"21 2010-01-06 77\n",
"22 2010-01-07 15\n",
"23 2010-01-08 78\n",
"24 2010-01-08 81\n",
"25 2010-01-08 49\n",
"26 2010-01-08 96\n",
"27 2010-01-08 92\n",
"28 2010-01-09 35\n",
"29 2010-01-09 69\n",
"... ... ...\n",
"9151 2014-12-26 89\n",
"9152 2014-12-26 54\n",
"9153 2014-12-26 56\n",
"9154 2014-12-26 76\n",
"9155 2014-12-26 95\n",
"9156 2014-12-26 32\n",
"9157 2014-12-27 90\n",
"9158 2014-12-27 73\n",
"9159 2014-12-28 90\n",
"9160 2014-12-28 55\n",
"9161 2014-12-28 88\n",
"9162 2014-12-28 49\n",
"9163 2014-12-28 93\n",
"9164 2014-12-29 51\n",
"9165 2014-12-29 63\n",
"9166 2014-12-29 27\n",
"9167 2014-12-29 92\n",
"9168 2014-12-29 53\n",
"9169 2014-12-30 66\n",
"9170 2014-12-30 92\n",
"9171 2014-12-30 94\n",
"9172 2014-12-30 75\n",
"9173 2014-12-30 27\n",
"9174 2014-12-30 99\n",
"9175 2014-12-31 83\n",
"9176 2014-12-31 42\n",
"9177 2014-12-31 44\n",
"9178 2015-01-01 93\n",
"9179 2015-01-01 30\n",
"9180 2015-01-01 80\n",
"\n",
"[9181 rows x 2 columns]"
]
},
"execution_count": 2,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"# Create data of people visiting a building.\n",
"\n",
"np.random.seed(seed=0)\n",
"dates = pd.date_range(start='2010-01-01', end='2015-01-01', freq='D')\n",
"window = 365 # days\n",
"num_pids = 100\n",
"probs = np.linspace(start=0.001, stop=0.1, num=num_pids)\n",
"\n",
"df = pd.\\\n",
" DataFrame(\n",
" data=[(date, pid)\n",
" for (pid, prob) in zip(range(num_pids), probs)\n",
" for date in np.compress(np.random.binomial(n=1, p=prob, size=len(dates)), dates)],\n",
" columns=['Date', 'PersonId'])\\\n",
" .sort_values(by='Date')\\\n",
" .reset_index(drop=True)\n",
"\n",
"print(\"Created data of people visiting a building:\")\n",
"df"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Speed reference"
]
},
{
"cell_type": "code",
"execution_count": 3,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"3.41 ms ± 233 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)\n"
]
}
],
"source": [
"%%timeit\n",
"# This counts the number of people visiting the building, not the number of unique people.\n",
"# Provided as a speed reference.\n",
"df.rolling(window='{:d}D'.format(window), on='Date').count()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Speed test 1"
]
},
{
"cell_type": "code",
"execution_count": 4,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"2.25 s ± 160 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)\n"
]
}
],
"source": [
"%%timeit\n",
"df.rolling(window='{:d}D'.format(window), on='Date').apply(lambda arr: pd.Series(arr).nunique())"
]
},
{
"cell_type": "code",
"execution_count": 5,
"metadata": {},
"outputs": [],
"source": [
"# Save results as a reference to check calculation accuracy.\n",
"ref = df.rolling(window='{:d}D'.format(window), on='Date').apply(lambda arr: pd.Series(arr).nunique())['PersonId'].values"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Speed test 2"
]
},
{
"cell_type": "code",
"execution_count": 6,
"metadata": {},
"outputs": [],
"source": [
"# Define a custom function and compile it with numba's just-in-time compiler.\n",
"@numba.jit(nopython=True)\n",
"def nunique(arr):\n",
"    return len(set(arr))"
]
},
{
"cell_type": "code",
"execution_count": 7,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"444 ms ± 54.6 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)\n"
]
}
],
"source": [
"%%timeit\n",
"df.rolling(window='{:d}D'.format(window), on='Date').apply(nunique)"
]
},
{
"cell_type": "code",
"execution_count": 8,
"metadata": {},
"outputs": [],
"source": [
"# Check accuracy of results.\n",
"test = df.rolling(window='{:d}D'.format(window), on='Date').apply(nunique)['PersonId'].values\n",
"assert all((ref == test))"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Speed test 3"
]
},
{
"cell_type": "code",
"execution_count": 9,
"metadata": {},
"outputs": [],
"source": [
"# Define a custom function and compile it with numba's just-in-time compiler.\n",
"@numba.jit(nopython=True)\n",
"def windowed_nunique(dates, pids, window):\n",
"    r\"\"\"Track number of unique persons in window,\n",
"    reading through arrays only once.\n",
"\n",
"    Args:\n",
"        dates (numpy.ndarray): Array of dates as number of days since epoch,\n",
"            sorted in ascending order.\n",
"        pids (numpy.ndarray): Array of non-negative integer person identifiers.\n",
"        window (int): Width of window in units of difference of `dates`.\n",
"\n",
"    Returns:\n",
"        ucts (numpy.ndarray): Array of unique counts\n",
"\n",
"    Raises:\n",
"        AssertionError: Raised if `dates.shape != pids.shape`\n",
"\n",
"    Notes:\n",
"        * Evicts rows only when `date - date_min > window`, so a row exactly\n",
"          `window` days old is still counted. A pandas offset window with the\n",
"          default `closed='right'` excludes that row, so results may differ\n",
"          from `pandas.core.window.Rolling` with a time series offset alias.\n",
"\n",
"    \"\"\"\n",
"\n",
"    # Check arguments.\n",
"    assert dates.shape == pids.shape\n",
"\n",
"    # Initialize counters.\n",
"    idx_min = 0\n",
"    idx_max = dates.shape[0]\n",
"    date_min = dates[idx_min]\n",
"    pid_min = pids[idx_min]\n",
"    pid_max = np.max(pids)\n",
"    # Allocate pid_max + 1 slots so that index pid_max is in bounds; numba's\n",
"    # nopython mode does not bounds-check array access by default, so an\n",
"    # array of length pid_max silently miscounts the person with the largest ID.\n",
"    pid_cts = np.zeros(pid_max + 1, dtype=np.int64)\n",
"    pid_cts[pid_min] = 1\n",
"    uct = 1\n",
"    ucts = np.zeros(idx_max, dtype=np.int64)\n",
"    ucts[idx_min] = uct\n",
"    idx = 1\n",
"\n",
"    # For each (date, person)...\n",
"    while idx < idx_max:\n",
"\n",
"        # If person count went from 0 to 1, increment unique person count.\n",
"        date = dates[idx]\n",
"        pid = pids[idx]\n",
"        pid_cts[pid] += 1\n",
"        if pid_cts[pid] == 1:\n",
"            uct += 1\n",
"\n",
"        # For past dates outside of window...\n",
"        while (date - date_min) > window:\n",
"\n",
"            # If person count went from 1 to 0, decrement unique person count.\n",
"            pid_cts[pid_min] -= 1\n",
"            if pid_cts[pid_min] == 0:\n",
"                uct -= 1\n",
"            idx_min += 1\n",
"            date_min = dates[idx_min]\n",
"            pid_min = pids[idx_min]\n",
"\n",
"        # Record unique person count.\n",
"        ucts[idx] = uct\n",
"        idx += 1\n",
"\n",
"    return ucts"
]
},
{
"cell_type": "code",
"execution_count": 10,
"metadata": {},
"outputs": [],
"source": [
"# Format dates and person IDs.\n",
"df['DateEpoch'] = (df['Date'] - pd.to_datetime('1970-01-01'))/pd.to_timedelta(1, unit='D')\n",
"df['DateEpoch'] = df['DateEpoch'].astype(int)"
]
},
{
"cell_type": "code",
"execution_count": 11,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"232 µs ± 110 µs per loop (mean ± std. dev. of 7 runs, 1 loop each)\n"
]
}
],
"source": [
"%%timeit\n",
"windowed_nunique(\n",
" dates=df['DateEpoch'].astype(int).values,\n",
" pids=df['PersonId'].astype(int).values,\n",
" window=window)"
]
},
{
"cell_type": "code",
"execution_count": 12,
"metadata": {},
"outputs": [],
"source": [
"# Check accuracy of results.\n",
"test = windowed_nunique(\n",
" dates=df['DateEpoch'].values,\n",
" pids=df['PersonId'].values,\n",
" window=window)\n",
"# Note: Method may be off by 1.\n",
"assert all(np.isclose(ref, np.asarray(test), atol=1))"
]
},
{
"cell_type": "code",
"execution_count": 13,
"metadata": {
"scrolled": true
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Where reference ('ref') calculation of number of unique people doesn't match 'test':\n"
]
},
{
"data": {
"text/html": [
"<div>\n",
"<style>\n",
" .dataframe thead tr:only-child th {\n",
" text-align: right;\n",
" }\n",
"\n",
" .dataframe thead th {\n",
" text-align: left;\n",
" }\n",
"\n",
" .dataframe tbody tr th {\n",
" vertical-align: top;\n",
" }\n",
"</style>\n",
"<table border=\"1\" class=\"dataframe\">\n",
" <thead>\n",
" <tr style=\"text-align: right;\">\n",
" <th></th>\n",
" <th>Date</th>\n",
" <th>PersonId</th>\n",
" <th>DateEpoch</th>\n",
" <th>ref</th>\n",
" <th>test</th>\n",
" </tr>\n",
" </thead>\n",
" <tbody>\n",
" <tr>\n",
" <th>78</th>\n",
" <td>2010-01-19</td>\n",
" <td>99</td>\n",
" <td>14628</td>\n",
" <td>56.0</td>\n",
" <td>55</td>\n",
" </tr>\n",
" <tr>\n",
" <th>79</th>\n",
" <td>2010-01-19</td>\n",
" <td>96</td>\n",
" <td>14628</td>\n",
" <td>56.0</td>\n",
" <td>55</td>\n",
" </tr>\n",
" <tr>\n",
" <th>80</th>\n",
" <td>2010-01-19</td>\n",
" <td>88</td>\n",
" <td>14628</td>\n",
" <td>56.0</td>\n",
" <td>55</td>\n",
" </tr>\n",
" <tr>\n",
" <th>81</th>\n",
" <td>2010-01-20</td>\n",
" <td>94</td>\n",
" <td>14629</td>\n",
" <td>56.0</td>\n",
" <td>55</td>\n",
" </tr>\n",
" <tr>\n",
" <th>82</th>\n",
" <td>2010-01-20</td>\n",
" <td>48</td>\n",
" <td>14629</td>\n",
" <td>57.0</td>\n",
" <td>56</td>\n",
" </tr>\n",
" <tr>\n",
" <th>83</th>\n",
" <td>2010-01-20</td>\n",
" <td>74</td>\n",
" <td>14629</td>\n",
" <td>57.0</td>\n",
" <td>56</td>\n",
" </tr>\n",
" <tr>\n",
" <th>84</th>\n",
" <td>2010-01-20</td>\n",
" <td>95</td>\n",
" <td>14629</td>\n",
" <td>57.0</td>\n",
" <td>56</td>\n",
" </tr>\n",
" <tr>\n",
" <th>85</th>\n",
" <td>2010-01-20</td>\n",
" <td>70</td>\n",
" <td>14629</td>\n",
" <td>57.0</td>\n",
" <td>56</td>\n",
" </tr>\n",
" <tr>\n",
" <th>86</th>\n",
" <td>2010-01-21</td>\n",
" <td>71</td>\n",
" <td>14630</td>\n",
" <td>57.0</td>\n",
" <td>56</td>\n",
" </tr>\n",
" <tr>\n",
" <th>87</th>\n",
" <td>2010-01-21</td>\n",
" <td>62</td>\n",
" <td>14630</td>\n",
" <td>57.0</td>\n",
" <td>56</td>\n",
" </tr>\n",
" <tr>\n",
" <th>88</th>\n",
" <td>2010-01-21</td>\n",
" <td>77</td>\n",
" <td>14630</td>\n",
" <td>57.0</td>\n",
" <td>56</td>\n",
" </tr>\n",
" <tr>\n",
" <th>89</th>\n",
" <td>2010-01-21</td>\n",
" <td>65</td>\n",
" <td>14630</td>\n",
" <td>57.0</td>\n",
" <td>56</td>\n",
" </tr>\n",
" <tr>\n",
" <th>90</th>\n",
" <td>2010-01-21</td>\n",
" <td>63</td>\n",
" <td>14630</td>\n",
" <td>57.0</td>\n",
" <td>56</td>\n",
" </tr>\n",
" <tr>\n",
" <th>91</th>\n",
" <td>2010-01-21</td>\n",
" <td>74</td>\n",
" <td>14630</td>\n",
" <td>57.0</td>\n",
" <td>56</td>\n",
" </tr>\n",
" <tr>\n",
" <th>92</th>\n",
" <td>2010-01-21</td>\n",
" <td>54</td>\n",
" <td>14630</td>\n",
" <td>57.0</td>\n",
" <td>56</td>\n",
" </tr>\n",
" <tr>\n",
" <th>93</th>\n",
" <td>2010-01-21</td>\n",
" <td>86</td>\n",
" <td>14630</td>\n",
" <td>57.0</td>\n",
" <td>56</td>\n",
" </tr>\n",
" <tr>\n",
" <th>94</th>\n",
" <td>2010-01-22</td>\n",
" <td>32</td>\n",
" <td>14631</td>\n",
" <td>58.0</td>\n",
" <td>57</td>\n",
" </tr>\n",
" <tr>\n",
" <th>95</th>\n",
" <td>2010-01-22</td>\n",
" <td>85</td>\n",
" <td>14631</td>\n",
" <td>58.0</td>\n",
" <td>57</td>\n",
" </tr>\n",
" <tr>\n",
" <th>96</th>\n",
" <td>2010-01-22</td>\n",
" <td>80</td>\n",
" <td>14631</td>\n",
" <td>59.0</td>\n",
" <td>58</td>\n",
" </tr>\n",
" <tr>\n",
" <th>97</th>\n",
" <td>2010-01-22</td>\n",
" <td>72</td>\n",
" <td>14631</td>\n",
" <td>59.0</td>\n",
" <td>58</td>\n",
" </tr>\n",
" <tr>\n",
" <th>98</th>\n",
" <td>2010-01-22</td>\n",
" <td>97</td>\n",
" <td>14631</td>\n",
" <td>59.0</td>\n",
" <td>58</td>\n",
" </tr>\n",
" <tr>\n",
" <th>99</th>\n",
" <td>2010-01-22</td>\n",
" <td>57</td>\n",
" <td>14631</td>\n",
" <td>59.0</td>\n",
" <td>58</td>\n",
" </tr>\n",
" <tr>\n",
" <th>100</th>\n",
" <td>2010-01-23</td>\n",
" <td>50</td>\n",
" <td>14632</td>\n",
" <td>60.0</td>\n",
" <td>59</td>\n",
" </tr>\n",
" <tr>\n",
" <th>101</th>\n",
" <td>2010-01-23</td>\n",
" <td>96</td>\n",
" <td>14632</td>\n",
" <td>60.0</td>\n",
" <td>59</td>\n",
" </tr>\n",
" <tr>\n",
" <th>102</th>\n",
" <td>2010-01-23</td>\n",
" <td>57</td>\n",
" <td>14632</td>\n",
" <td>60.0</td>\n",
" <td>59</td>\n",
" </tr>\n",
" <tr>\n",
" <th>103</th>\n",
" <td>2010-01-23</td>\n",
" <td>30</td>\n",
" <td>14632</td>\n",
" <td>60.0</td>\n",
" <td>59</td>\n",
" </tr>\n",
" <tr>\n",
" <th>104</th>\n",
" <td>2010-01-23</td>\n",
" <td>92</td>\n",
" <td>14632</td>\n",
" <td>60.0</td>\n",
" <td>59</td>\n",
" </tr>\n",
" <tr>\n",
" <th>105</th>\n",
" <td>2010-01-23</td>\n",
" <td>61</td>\n",
" <td>14632</td>\n",
" <td>61.0</td>\n",
" <td>60</td>\n",
" </tr>\n",
" <tr>\n",
" <th>106</th>\n",
" <td>2010-01-23</td>\n",
" <td>52</td>\n",
" <td>14632</td>\n",
" <td>61.0</td>\n",
" <td>60</td>\n",
" </tr>\n",
" <tr>\n",
" <th>107</th>\n",
" <td>2010-01-24</td>\n",
" <td>67</td>\n",
" <td>14633</td>\n",
" <td>61.0</td>\n",
" <td>60</td>\n",
" </tr>\n",
" <tr>\n",
" <th>...</th>\n",
" <td>...</td>\n",
" <td>...</td>\n",
" <td>...</td>\n",
" <td>...</td>\n",
" <td>...</td>\n",
" </tr>\n",
" <tr>\n",
" <th>9151</th>\n",
" <td>2014-12-26</td>\n",
" <td>89</td>\n",
" <td>16430</td>\n",
" <td>97.0</td>\n",
" <td>96</td>\n",
" </tr>\n",
" <tr>\n",
" <th>9152</th>\n",
" <td>2014-12-26</td>\n",
" <td>54</td>\n",
" <td>16430</td>\n",
" <td>97.0</td>\n",
" <td>96</td>\n",
" </tr>\n",
" <tr>\n",
" <th>9153</th>\n",
" <td>2014-12-26</td>\n",
" <td>56</td>\n",
" <td>16430</td>\n",
" <td>97.0</td>\n",
" <td>96</td>\n",
" </tr>\n",
" <tr>\n",
" <th>9154</th>\n",
" <td>2014-12-26</td>\n",
" <td>76</td>\n",
" <td>16430</td>\n",
" <td>97.0</td>\n",
" <td>96</td>\n",
" </tr>\n",
" <tr>\n",
" <th>9155</th>\n",
" <td>2014-12-26</td>\n",
" <td>95</td>\n",
" <td>16430</td>\n",
" <td>97.0</td>\n",
" <td>96</td>\n",
" </tr>\n",
" <tr>\n",
" <th>9156</th>\n",
" <td>2014-12-26</td>\n",
" <td>32</td>\n",
" <td>16430</td>\n",
" <td>97.0</td>\n",
" <td>96</td>\n",
" </tr>\n",
" <tr>\n",
" <th>9157</th>\n",
" <td>2014-12-27</td>\n",
" <td>90</td>\n",
" <td>16431</td>\n",
" <td>97.0</td>\n",
" <td>96</td>\n",
" </tr>\n",
" <tr>\n",
" <th>9158</th>\n",
" <td>2014-12-27</td>\n",
" <td>73</td>\n",
" <td>16431</td>\n",
" <td>97.0</td>\n",
" <td>96</td>\n",
" </tr>\n",
" <tr>\n",
" <th>9159</th>\n",
" <td>2014-12-28</td>\n",
" <td>90</td>\n",
" <td>16432</td>\n",
" <td>97.0</td>\n",
" <td>96</td>\n",
" </tr>\n",
" <tr>\n",
" <th>9160</th>\n",
" <td>2014-12-28</td>\n",
" <td>55</td>\n",
" <td>16432</td>\n",
" <td>97.0</td>\n",
" <td>96</td>\n",
" </tr>\n",
" <tr>\n",
" <th>9161</th>\n",
" <td>2014-12-28</td>\n",
" <td>88</td>\n",
" <td>16432</td>\n",
" <td>97.0</td>\n",
" <td>96</td>\n",
" </tr>\n",
" <tr>\n",
" <th>9162</th>\n",
" <td>2014-12-28</td>\n",
" <td>49</td>\n",
" <td>16432</td>\n",
" <td>97.0</td>\n",
" <td>96</td>\n",
" </tr>\n",
" <tr>\n",
" <th>9163</th>\n",
" <td>2014-12-28</td>\n",
" <td>93</td>\n",
" <td>16432</td>\n",
" <td>97.0</td>\n",
" <td>96</td>\n",
" </tr>\n",
" <tr>\n",
" <th>9164</th>\n",
" <td>2014-12-29</td>\n",
" <td>51</td>\n",
" <td>16433</td>\n",
" <td>97.0</td>\n",
" <td>96</td>\n",
" </tr>\n",
" <tr>\n",
" <th>9165</th>\n",
" <td>2014-12-29</td>\n",
" <td>63</td>\n",
" <td>16433</td>\n",
" <td>97.0</td>\n",
" <td>96</td>\n",
" </tr>\n",
" <tr>\n",
" <th>9166</th>\n",
" <td>2014-12-29</td>\n",
" <td>27</td>\n",
" <td>16433</td>\n",
" <td>97.0</td>\n",
" <td>96</td>\n",
" </tr>\n",
" <tr>\n",
" <th>9167</th>\n",
" <td>2014-12-29</td>\n",
" <td>92</td>\n",
" <td>16433</td>\n",
" <td>97.0</td>\n",
" <td>96</td>\n",
" </tr>\n",
" <tr>\n",
" <th>9168</th>\n",
" <td>2014-12-29</td>\n",
" <td>53</td>\n",
" <td>16433</td>\n",
" <td>97.0</td>\n",
" <td>96</td>\n",
" </tr>\n",
" <tr>\n",
" <th>9169</th>\n",
" <td>2014-12-30</td>\n",
" <td>66</td>\n",
" <td>16434</td>\n",
" <td>97.0</td>\n",
" <td>96</td>\n",
" </tr>\n",
" <tr>\n",
" <th>9170</th>\n",
" <td>2014-12-30</td>\n",
" <td>92</td>\n",
" <td>16434</td>\n",
" <td>97.0</td>\n",
" <td>96</td>\n",
" </tr>\n",
" <tr>\n",
" <th>9171</th>\n",
" <td>2014-12-30</td>\n",
" <td>94</td>\n",
" <td>16434</td>\n",
" <td>97.0</td>\n",
" <td>96</td>\n",
" </tr>\n",
" <tr>\n",
" <th>9172</th>\n",
" <td>2014-12-30</td>\n",
" <td>75</td>\n",
" <td>16434</td>\n",
" <td>97.0</td>\n",
" <td>96</td>\n",
" </tr>\n",
" <tr>\n",
" <th>9173</th>\n",
" <td>2014-12-30</td>\n",
" <td>27</td>\n",
" <td>16434</td>\n",
" <td>97.0</td>\n",
" <td>96</td>\n",
" </tr>\n",
" <tr>\n",
" <th>9174</th>\n",
" <td>2014-12-30</td>\n",
" <td>99</td>\n",
" <td>16434</td>\n",
" <td>97.0</td>\n",
" <td>96</td>\n",
" </tr>\n",
" <tr>\n",
" <th>9175</th>\n",
" <td>2014-12-31</td>\n",
" <td>83</td>\n",
" <td>16435</td>\n",
" <td>97.0</td>\n",
" <td>96</td>\n",
" </tr>\n",
" <tr>\n",
" <th>9176</th>\n",
" <td>2014-12-31</td>\n",
" <td>42</td>\n",
" <td>16435</td>\n",
" <td>97.0</td>\n",
" <td>96</td>\n",
" </tr>\n",
" <tr>\n",
" <th>9177</th>\n",
" <td>2014-12-31</td>\n",
" <td>44</td>\n",
" <td>16435</td>\n",
" <td>97.0</td>\n",
" <td>96</td>\n",
" </tr>\n",
" <tr>\n",
" <th>9178</th>\n",
" <td>2015-01-01</td>\n",
" <td>93</td>\n",
" <td>16436</td>\n",
" <td>97.0</td>\n",
" <td>96</td>\n",
" </tr>\n",
" <tr>\n",
" <th>9179</th>\n",
" <td>2015-01-01</td>\n",
" <td>30</td>\n",
" <td>16436</td>\n",
" <td>97.0</td>\n",
" <td>96</td>\n",
" </tr>\n",
" <tr>\n",
" <th>9180</th>\n",
" <td>2015-01-01</td>\n",
" <td>80</td>\n",
" <td>16436</td>\n",
" <td>97.0</td>\n",
" <td>96</td>\n",
" </tr>\n",
" </tbody>\n",
"</table>\n",
"<p>9044 rows × 5 columns</p>\n",
"</div>"
],
"text/plain": [
" Date PersonId DateEpoch ref test\n",
"78 2010-01-19 99 14628 56.0 55\n",
"79 2010-01-19 96 14628 56.0 55\n",
"80 2010-01-19 88 14628 56.0 55\n",
"81 2010-01-20 94 14629 56.0 55\n",
"82 2010-01-20 48 14629 57.0 56\n",
"83 2010-01-20 74 14629 57.0 56\n",
"84 2010-01-20 95 14629 57.0 56\n",
"85 2010-01-20 70 14629 57.0 56\n",
"86 2010-01-21 71 14630 57.0 56\n",
"87 2010-01-21 62 14630 57.0 56\n",
"88 2010-01-21 77 14630 57.0 56\n",
"89 2010-01-21 65 14630 57.0 56\n",
"90 2010-01-21 63 14630 57.0 56\n",
"91 2010-01-21 74 14630 57.0 56\n",
"92 2010-01-21 54 14630 57.0 56\n",
"93 2010-01-21 86 14630 57.0 56\n",
"94 2010-01-22 32 14631 58.0 57\n",
"95 2010-01-22 85 14631 58.0 57\n",
"96 2010-01-22 80 14631 59.0 58\n",
"97 2010-01-22 72 14631 59.0 58\n",
"98 2010-01-22 97 14631 59.0 58\n",
"99 2010-01-22 57 14631 59.0 58\n",
"100 2010-01-23 50 14632 60.0 59\n",
"101 2010-01-23 96 14632 60.0 59\n",
"102 2010-01-23 57 14632 60.0 59\n",
"103 2010-01-23 30 14632 60.0 59\n",
"104 2010-01-23 92 14632 60.0 59\n",
"105 2010-01-23 61 14632 61.0 60\n",
"106 2010-01-23 52 14632 61.0 60\n",
"107 2010-01-24 67 14633 61.0 60\n",
"... ... ... ... ... ...\n",
"9151 2014-12-26 89 16430 97.0 96\n",
"9152 2014-12-26 54 16430 97.0 96\n",
"9153 2014-12-26 56 16430 97.0 96\n",
"9154 2014-12-26 76 16430 97.0 96\n",
"9155 2014-12-26 95 16430 97.0 96\n",
"9156 2014-12-26 32 16430 97.0 96\n",
"9157 2014-12-27 90 16431 97.0 96\n",
"9158 2014-12-27 73 16431 97.0 96\n",
"9159 2014-12-28 90 16432 97.0 96\n",
"9160 2014-12-28 55 16432 97.0 96\n",
"9161 2014-12-28 88 16432 97.0 96\n",
"9162 2014-12-28 49 16432 97.0 96\n",
"9163 2014-12-28 93 16432 97.0 96\n",
"9164 2014-12-29 51 16433 97.0 96\n",
"9165 2014-12-29 63 16433 97.0 96\n",
"9166 2014-12-29 27 16433 97.0 96\n",
"9167 2014-12-29 92 16433 97.0 96\n",
"9168 2014-12-29 53 16433 97.0 96\n",
"9169 2014-12-30 66 16434 97.0 96\n",
"9170 2014-12-30 92 16434 97.0 96\n",
"9171 2014-12-30 94 16434 97.0 96\n",
"9172 2014-12-30 75 16434 97.0 96\n",
"9173 2014-12-30 27 16434 97.0 96\n",
"9174 2014-12-30 99 16434 97.0 96\n",
"9175 2014-12-31 83 16435 97.0 96\n",
"9176 2014-12-31 42 16435 97.0 96\n",
"9177 2014-12-31 44 16435 97.0 96\n",
"9178 2015-01-01 93 16436 97.0 96\n",
"9179 2015-01-01 30 16436 97.0 96\n",
"9180 2015-01-01 80 16436 97.0 96\n",
"\n",
"[9044 rows x 5 columns]"
]
},
"execution_count": 13,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"# Show where the calculation doesn't match.\n",
"print(\"Where reference ('ref') calculation of number of unique people doesn't match 'test':\")\n",
"df['ref'] = ref\n",
"df['test'] = test\n",
"df.loc[df['ref'] != df['test']]"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Speed test 4"
]
},
{
"cell_type": "code",
"execution_count": 14,
"metadata": {},
"outputs": [],
"source": [
"# Define a custom function and compile it with numba's just-in-time compiler.\n",
"@numba.jit(nopython=True)\n",
"def windowed_nunique2(dates, pids, window):\n",
"    r\"\"\"Track number of unique persons in window,\n",
"    reading through arrays only once.\n",
"\n",
"    Args:\n",
"        dates (numpy.ndarray): Array of dates as number of days since epoch,\n",
"            sorted in ascending order.\n",
"        pids (numpy.ndarray): Array of non-negative integer person identifiers.\n",
"        window (int): Width of window in units of difference of `dates`.\n",
"\n",
"    Returns:\n",
"        ucts (numpy.ndarray): Array of unique counts\n",
"\n",
"    Raises:\n",
"        AssertionError: Raised if `dates.shape != pids.shape`\n",
"\n",
"    Notes:\n",
"        * Decrements when `date - date_min >= window`, so the window covers\n",
"          `(date - window, date]`. This is the interval that a pandas offset\n",
"          window (`pandas.core.window.Rolling` with a time series offset\n",
"          alias) uses with the default `closed='right'`.\n",
"\n",
"    \"\"\"\n",
"\n",
"    # Check arguments.\n",
"    assert dates.shape == pids.shape\n",
"\n",
"    # Initialize counters.\n",
"    idx_min = 0\n",
"    idx_max = dates.shape[0]\n",
"    date_min = dates[idx_min]\n",
"    pid_min = pids[idx_min]\n",
"    pid_max = np.max(pids)\n",
"    # Allocate pid_max + 1 slots so that index pid_max is in bounds; numba's\n",
"    # nopython mode does not bounds-check array access by default, so an\n",
"    # array of length pid_max silently miscounts the person with the largest ID.\n",
"    pid_cts = np.zeros(pid_max + 1, dtype=np.int64)\n",
"    pid_cts[pid_min] = 1\n",
"    uct = 1\n",
"    ucts = np.zeros(idx_max, dtype=np.int64)\n",
"    ucts[idx_min] = uct\n",
"    idx = 1\n",
"\n",
"    # For each (date, person)...\n",
"    while idx < idx_max:\n",
"\n",
"        # If person count went from 0 to 1, increment unique person count.\n",
"        date = dates[idx]\n",
"        pid = pids[idx]\n",
"        pid_cts[pid] += 1\n",
"        if pid_cts[pid] == 1:\n",
"            uct += 1\n",
"\n",
"        # For past dates outside of window...\n",
"        while (date - date_min) >= window:\n",
"\n",
"            # If person count went from 1 to 0, decrement unique person count.\n",
"            pid_cts[pid_min] -= 1\n",
"            if pid_cts[pid_min] == 0:\n",
"                uct -= 1\n",
"            idx_min += 1\n",
"            date_min = dates[idx_min]\n",
"            pid_min = pids[idx_min]\n",
"\n",
"        # Record unique person count.\n",
"        ucts[idx] = uct\n",
"        idx += 1\n",
"\n",
"    return ucts"
]
},
{
"cell_type": "code",
"execution_count": 15,
"metadata": {},
"outputs": [],
"source": [
"# Format dates and person IDs.\n",
"df['DateEpoch'] = (df['Date'] - pd.to_datetime('1970-01-01'))/pd.to_timedelta(1, unit='D')\n",
"df['DateEpoch'] = df['DateEpoch'].astype(int)"
]
},
{
"cell_type": "code",
"execution_count": 16,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"183 µs ± 12.5 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)\n"
]
}
],
"source": [
"%%timeit\n",
"windowed_nunique2(\n",
" dates=df['DateEpoch'].astype(int).values,\n",
" pids=df['PersonId'].astype(int).values,\n",
" window=window)"
]
},
{
"cell_type": "code",
"execution_count": 17,
"metadata": {},
"outputs": [],
"source": [
"# Check accuracy of results.\n",
"test = windowed_nunique2(\n",
" dates=df['DateEpoch'].values,\n",
" pids=df['PersonId'].values,\n",
" window=window)\n",
"# Note: Method may be off by 1.\n",
"assert all(np.isclose(ref, np.asarray(test), atol=1))"
]
},
{
"cell_type": "code",
"execution_count": 18,
"metadata": {
"scrolled": true
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Where reference ('ref') calculation of number of unique people doesn't match 'test':\n"
]
},
{
"data": {
"text/html": [
"<div>\n",
"<style>\n",
" .dataframe thead tr:only-child th {\n",
" text-align: right;\n",
" }\n",
"\n",
" .dataframe thead th {\n",
" text-align: left;\n",
" }\n",
"\n",
" .dataframe tbody tr th {\n",
" vertical-align: top;\n",
" }\n",
"</style>\n",
"<table border=\"1\" class=\"dataframe\">\n",
" <thead>\n",
" <tr style=\"text-align: right;\">\n",
" <th></th>\n",
" <th>Date</th>\n",
" <th>PersonId</th>\n",
" <th>DateEpoch</th>\n",
" <th>ref</th>\n",
" <th>test</th>\n",
" </tr>\n",
" </thead>\n",
" <tbody>\n",
" <tr>\n",
" <th>78</th>\n",
" <td>2010-01-19</td>\n",
" <td>99</td>\n",
" <td>14628</td>\n",
" <td>56.0</td>\n",
" <td>55</td>\n",
" </tr>\n",
" <tr>\n",
" <th>79</th>\n",
" <td>2010-01-19</td>\n",
" <td>96</td>\n",
" <td>14628</td>\n",
" <td>56.0</td>\n",
" <td>55</td>\n",
" </tr>\n",
" <tr>\n",
" <th>80</th>\n",
" <td>2010-01-19</td>\n",
" <td>88</td>\n",
" <td>14628</td>\n",
" <td>56.0</td>\n",
" <td>55</td>\n",
" </tr>\n",
" <tr>\n",
" <th>81</th>\n",
" <td>2010-01-20</td>\n",
" <td>94</td>\n",
" <td>14629</td>\n",
" <td>56.0</td>\n",
" <td>55</td>\n",
" </tr>\n",
" <tr>\n",
" <th>82</th>\n",
" <td>2010-01-20</td>\n",
" <td>48</td>\n",
" <td>14629</td>\n",
" <td>57.0</td>\n",
" <td>56</td>\n",
" </tr>\n",
" <tr>\n",
" <th>83</th>\n",
" <td>2010-01-20</td>\n",
" <td>74</td>\n",
" <td>14629</td>\n",
" <td>57.0</td>\n",
" <td>56</td>\n",
" </tr>\n",
" <tr>\n",
" <th>84</th>\n",
" <td>2010-01-20</td>\n",
" <td>95</td>\n",
" <td>14629</td>\n",
" <td>57.0</td>\n",
" <td>56</td>\n",
" </tr>\n",
" <tr>\n",
" <th>85</th>\n",
" <td>2010-01-20</td>\n",
" <td>70</td>\n",
" <td>14629</td>\n",
" <td>57.0</td>\n",
" <td>56</td>\n",
" </tr>\n",
" <tr>\n",
" <th>86</th>\n",
" <td>2010-01-21</td>\n",
" <td>71</td>\n",
" <td>14630</td>\n",
" <td>57.0</td>\n",
" <td>56</td>\n",
" </tr>\n",
" <tr>\n",
" <th>87</th>\n",
" <td>2010-01-21</td>\n",
" <td>62</td>\n",
" <td>14630</td>\n",
" <td>57.0</td>\n",
" <td>56</td>\n",
" </tr>\n",
" <tr>\n",
" <th>88</th>\n",
" <td>2010-01-21</td>\n",
" <td>77</td>\n",
" <td>14630</td>\n",
" <td>57.0</td>\n",
" <td>56</td>\n",
" </tr>\n",
" <tr>\n",
" <th>89</th>\n",
" <td>2010-01-21</td>\n",
" <td>65</td>\n",
" <td>14630</td>\n",
" <td>57.0</td>\n",
" <td>56</td>\n",
" </tr>\n",
" <tr>\n",
" <th>90</th>\n",
" <td>2010-01-21</td>\n",
" <td>63</td>\n",
" <td>14630</td>\n",
" <td>57.0</td>\n",
" <td>56</td>\n",
" </tr>\n",
" <tr>\n",
" <th>91</th>\n",
" <td>2010-01-21</td>\n",
" <td>74</td>\n",
" <td>14630</td>\n",
" <td>57.0</td>\n",
" <td>56</td>\n",
" </tr>\n",
" <tr>\n",
" <th>92</th>\n",
" <td>2010-01-21</td>\n",
" <td>54</td>\n",
" <td>14630</td>\n",
" <td>57.0</td>\n",
" <td>56</td>\n",
" </tr>\n",
" <tr>\n",
" <th>93</th>\n",
" <td>2010-01-21</td>\n",
" <td>86</td>\n",
" <td>14630</td>\n",
" <td>57.0</td>\n",
" <td>56</td>\n",
" </tr>\n",
" <tr>\n",
" <th>94</th>\n",
" <td>2010-01-22</td>\n",
" <td>32</td>\n",
" <td>14631</td>\n",
" <td>58.0</td>\n",
" <td>57</td>\n",
" </tr>\n",
" <tr>\n",
" <th>95</th>\n",
" <td>2010-01-22</td>\n",
" <td>85</td>\n",
" <td>14631</td>\n",
" <td>58.0</td>\n",
" <td>57</td>\n",
" </tr>\n",
" <tr>\n",
" <th>96</th>\n",
" <td>2010-01-22</td>\n",
" <td>80</td>\n",
" <td>14631</td>\n",
" <td>59.0</td>\n",
" <td>58</td>\n",
" </tr>\n",
" <tr>\n",
" <th>97</th>\n",
" <td>2010-01-22</td>\n",
" <td>72</td>\n",
" <td>14631</td>\n",
" <td>59.0</td>\n",
" <td>58</td>\n",
" </tr>\n",
" <tr>\n",
" <th>98</th>\n",
" <td>2010-01-22</td>\n",
" <td>97</td>\n",
" <td>14631</td>\n",
" <td>59.0</td>\n",
" <td>58</td>\n",
" </tr>\n",
" <tr>\n",
" <th>99</th>\n",
" <td>2010-01-22</td>\n",
" <td>57</td>\n",
" <td>14631</td>\n",
" <td>59.0</td>\n",
" <td>58</td>\n",
" </tr>\n",
" <tr>\n",
" <th>100</th>\n",
" <td>2010-01-23</td>\n",
" <td>50</td>\n",
" <td>14632</td>\n",
" <td>60.0</td>\n",
" <td>59</td>\n",
" </tr>\n",
" <tr>\n",
" <th>101</th>\n",
" <td>2010-01-23</td>\n",
" <td>96</td>\n",
" <td>14632</td>\n",
" <td>60.0</td>\n",
" <td>59</td>\n",
" </tr>\n",
" <tr>\n",
" <th>102</th>\n",
" <td>2010-01-23</td>\n",
" <td>57</td>\n",
" <td>14632</td>\n",
" <td>60.0</td>\n",
" <td>59</td>\n",
" </tr>\n",
" <tr>\n",
" <th>103</th>\n",
" <td>2010-01-23</td>\n",
" <td>30</td>\n",
" <td>14632</td>\n",
" <td>60.0</td>\n",
" <td>59</td>\n",
" </tr>\n",
" <tr>\n",
" <th>104</th>\n",
" <td>2010-01-23</td>\n",
" <td>92</td>\n",
" <td>14632</td>\n",
" <td>60.0</td>\n",
" <td>59</td>\n",
" </tr>\n",
" <tr>\n",
" <th>105</th>\n",
" <td>2010-01-23</td>\n",
" <td>61</td>\n",
" <td>14632</td>\n",
" <td>61.0</td>\n",
" <td>60</td>\n",
" </tr>\n",
" <tr>\n",
" <th>106</th>\n",
" <td>2010-01-23</td>\n",
" <td>52</td>\n",
" <td>14632</td>\n",
" <td>61.0</td>\n",
" <td>60</td>\n",
" </tr>\n",
" <tr>\n",
" <th>107</th>\n",
" <td>2010-01-24</td>\n",
" <td>67</td>\n",
" <td>14633</td>\n",
" <td>61.0</td>\n",
" <td>60</td>\n",
" </tr>\n",
" <tr>\n",
" <th>...</th>\n",
" <td>...</td>\n",
" <td>...</td>\n",
" <td>...</td>\n",
" <td>...</td>\n",
" <td>...</td>\n",
" </tr>\n",
" <tr>\n",
" <th>9151</th>\n",
" <td>2014-12-26</td>\n",
" <td>89</td>\n",
" <td>16430</td>\n",
" <td>97.0</td>\n",
" <td>96</td>\n",
" </tr>\n",
" <tr>\n",
" <th>9152</th>\n",
" <td>2014-12-26</td>\n",
" <td>54</td>\n",
" <td>16430</td>\n",
" <td>97.0</td>\n",
" <td>96</td>\n",
" </tr>\n",
" <tr>\n",
" <th>9153</th>\n",
" <td>2014-12-26</td>\n",
" <td>56</td>\n",
" <td>16430</td>\n",
" <td>97.0</td>\n",
" <td>96</td>\n",
" </tr>\n",
" <tr>\n",
" <th>9154</th>\n",
" <td>2014-12-26</td>\n",
" <td>76</td>\n",
" <td>16430</td>\n",
" <td>97.0</td>\n",
" <td>96</td>\n",
" </tr>\n",
" <tr>\n",
" <th>9155</th>\n",
" <td>2014-12-26</td>\n",
" <td>95</td>\n",
" <td>16430</td>\n",
" <td>97.0</td>\n",
" <td>96</td>\n",
" </tr>\n",
" <tr>\n",
" <th>9156</th>\n",
" <td>2014-12-26</td>\n",
" <td>32</td>\n",
" <td>16430</td>\n",
" <td>97.0</td>\n",
" <td>96</td>\n",
" </tr>\n",
" <tr>\n",
" <th>9157</th>\n",
" <td>2014-12-27</td>\n",
" <td>90</td>\n",
" <td>16431</td>\n",
" <td>97.0</td>\n",
" <td>96</td>\n",
" </tr>\n",
" <tr>\n",
" <th>9158</th>\n",
" <td>2014-12-27</td>\n",
" <td>73</td>\n",
" <td>16431</td>\n",
" <td>97.0</td>\n",
" <td>96</td>\n",
" </tr>\n",
" <tr>\n",
" <th>9159</th>\n",
" <td>2014-12-28</td>\n",
" <td>90</td>\n",
" <td>16432</td>\n",
" <td>97.0</td>\n",
" <td>96</td>\n",
" </tr>\n",
" <tr>\n",
" <th>9160</th>\n",
" <td>2014-12-28</td>\n",
" <td>55</td>\n",
" <td>16432</td>\n",
" <td>97.0</td>\n",
" <td>96</td>\n",
" </tr>\n",
" <tr>\n",
" <th>9161</th>\n",
" <td>2014-12-28</td>\n",
" <td>88</td>\n",
" <td>16432</td>\n",
" <td>97.0</td>\n",
" <td>96</td>\n",
" </tr>\n",
" <tr>\n",
" <th>9162</th>\n",
" <td>2014-12-28</td>\n",
" <td>49</td>\n",
" <td>16432</td>\n",
" <td>97.0</td>\n",
" <td>96</td>\n",
" </tr>\n",
" <tr>\n",
" <th>9163</th>\n",
" <td>2014-12-28</td>\n",
" <td>93</td>\n",
" <td>16432</td>\n",
" <td>97.0</td>\n",
" <td>96</td>\n",
" </tr>\n",
" <tr>\n",
" <th>9164</th>\n",
" <td>2014-12-29</td>\n",
" <td>51</td>\n",
" <td>16433</td>\n",
" <td>97.0</td>\n",
" <td>96</td>\n",
" </tr>\n",
" <tr>\n",
" <th>9165</th>\n",
" <td>2014-12-29</td>\n",
" <td>63</td>\n",
" <td>16433</td>\n",
" <td>97.0</td>\n",
" <td>96</td>\n",
" </tr>\n",
" <tr>\n",
" <th>9166</th>\n",
" <td>2014-12-29</td>\n",
" <td>27</td>\n",
" <td>16433</td>\n",
" <td>97.0</td>\n",
" <td>96</td>\n",
" </tr>\n",
" <tr>\n",
" <th>9167</th>\n",
" <td>2014-12-29</td>\n",
" <td>92</td>\n",
" <td>16433</td>\n",
" <td>97.0</td>\n",
" <td>96</td>\n",
" </tr>\n",
" <tr>\n",
" <th>9168</th>\n",
" <td>2014-12-29</td>\n",
" <td>53</td>\n",
" <td>16433</td>\n",
" <td>97.0</td>\n",
" <td>96</td>\n",
" </tr>\n",
" <tr>\n",
" <th>9169</th>\n",
" <td>2014-12-30</td>\n",
" <td>66</td>\n",
" <td>16434</td>\n",
" <td>97.0</td>\n",
" <td>96</td>\n",
" </tr>\n",
" <tr>\n",
" <th>9170</th>\n",
" <td>2014-12-30</td>\n",
" <td>92</td>\n",
" <td>16434</td>\n",
" <td>97.0</td>\n",
" <td>96</td>\n",
" </tr>\n",
" <tr>\n",
" <th>9171</th>\n",
" <td>2014-12-30</td>\n",
" <td>94</td>\n",
" <td>16434</td>\n",
" <td>97.0</td>\n",
" <td>96</td>\n",
" </tr>\n",
" <tr>\n",
" <th>9172</th>\n",
" <td>2014-12-30</td>\n",
" <td>75</td>\n",
" <td>16434</td>\n",
" <td>97.0</td>\n",
" <td>96</td>\n",
" </tr>\n",
" <tr>\n",
" <th>9173</th>\n",
" <td>2014-12-30</td>\n",
" <td>27</td>\n",
" <td>16434</td>\n",
" <td>97.0</td>\n",
" <td>96</td>\n",
" </tr>\n",
" <tr>\n",
" <th>9174</th>\n",
" <td>2014-12-30</td>\n",
" <td>99</td>\n",
" <td>16434</td>\n",
" <td>97.0</td>\n",
" <td>96</td>\n",
" </tr>\n",
" <tr>\n",
" <th>9175</th>\n",
" <td>2014-12-31</td>\n",
" <td>83</td>\n",
" <td>16435</td>\n",
" <td>97.0</td>\n",
" <td>96</td>\n",
" </tr>\n",
" <tr>\n",
" <th>9176</th>\n",
" <td>2014-12-31</td>\n",
" <td>42</td>\n",
" <td>16435</td>\n",
" <td>97.0</td>\n",
" <td>96</td>\n",
" </tr>\n",
" <tr>\n",
" <th>9177</th>\n",
" <td>2014-12-31</td>\n",
" <td>44</td>\n",
" <td>16435</td>\n",
" <td>97.0</td>\n",
" <td>96</td>\n",
" </tr>\n",
" <tr>\n",
" <th>9178</th>\n",
" <td>2015-01-01</td>\n",
" <td>93</td>\n",
" <td>16436</td>\n",
" <td>97.0</td>\n",
" <td>96</td>\n",
" </tr>\n",
" <tr>\n",
" <th>9179</th>\n",
" <td>2015-01-01</td>\n",
" <td>30</td>\n",
" <td>16436</td>\n",
" <td>97.0</td>\n",
" <td>96</td>\n",
" </tr>\n",
" <tr>\n",
" <th>9180</th>\n",
" <td>2015-01-01</td>\n",
" <td>80</td>\n",
" <td>16436</td>\n",
" <td>97.0</td>\n",
" <td>96</td>\n",
" </tr>\n",
" </tbody>\n",
"</table>\n",
"<p>9103 rows × 5 columns</p>\n",
"</div>"
],
"text/plain": [
" Date PersonId DateEpoch ref test\n",
"78 2010-01-19 99 14628 56.0 55\n",
"79 2010-01-19 96 14628 56.0 55\n",
"80 2010-01-19 88 14628 56.0 55\n",
"81 2010-01-20 94 14629 56.0 55\n",
"82 2010-01-20 48 14629 57.0 56\n",
"83 2010-01-20 74 14629 57.0 56\n",
"84 2010-01-20 95 14629 57.0 56\n",
"85 2010-01-20 70 14629 57.0 56\n",
"86 2010-01-21 71 14630 57.0 56\n",
"87 2010-01-21 62 14630 57.0 56\n",
"88 2010-01-21 77 14630 57.0 56\n",
"89 2010-01-21 65 14630 57.0 56\n",
"90 2010-01-21 63 14630 57.0 56\n",
"91 2010-01-21 74 14630 57.0 56\n",
"92 2010-01-21 54 14630 57.0 56\n",
"93 2010-01-21 86 14630 57.0 56\n",
"94 2010-01-22 32 14631 58.0 57\n",
"95 2010-01-22 85 14631 58.0 57\n",
"96 2010-01-22 80 14631 59.0 58\n",
"97 2010-01-22 72 14631 59.0 58\n",
"98 2010-01-22 97 14631 59.0 58\n",
"99 2010-01-22 57 14631 59.0 58\n",
"100 2010-01-23 50 14632 60.0 59\n",
"101 2010-01-23 96 14632 60.0 59\n",
"102 2010-01-23 57 14632 60.0 59\n",
"103 2010-01-23 30 14632 60.0 59\n",
"104 2010-01-23 92 14632 60.0 59\n",
"105 2010-01-23 61 14632 61.0 60\n",
"106 2010-01-23 52 14632 61.0 60\n",
"107 2010-01-24 67 14633 61.0 60\n",
"... ... ... ... ... ...\n",
"9151 2014-12-26 89 16430 97.0 96\n",
"9152 2014-12-26 54 16430 97.0 96\n",
"9153 2014-12-26 56 16430 97.0 96\n",
"9154 2014-12-26 76 16430 97.0 96\n",
"9155 2014-12-26 95 16430 97.0 96\n",
"9156 2014-12-26 32 16430 97.0 96\n",
"9157 2014-12-27 90 16431 97.0 96\n",
"9158 2014-12-27 73 16431 97.0 96\n",
"9159 2014-12-28 90 16432 97.0 96\n",
"9160 2014-12-28 55 16432 97.0 96\n",
"9161 2014-12-28 88 16432 97.0 96\n",
"9162 2014-12-28 49 16432 97.0 96\n",
"9163 2014-12-28 93 16432 97.0 96\n",
"9164 2014-12-29 51 16433 97.0 96\n",
"9165 2014-12-29 63 16433 97.0 96\n",
"9166 2014-12-29 27 16433 97.0 96\n",
"9167 2014-12-29 92 16433 97.0 96\n",
"9168 2014-12-29 53 16433 97.0 96\n",
"9169 2014-12-30 66 16434 97.0 96\n",
"9170 2014-12-30 92 16434 97.0 96\n",
"9171 2014-12-30 94 16434 97.0 96\n",
"9172 2014-12-30 75 16434 97.0 96\n",
"9173 2014-12-30 27 16434 97.0 96\n",
"9174 2014-12-30 99 16434 97.0 96\n",
"9175 2014-12-31 83 16435 97.0 96\n",
"9176 2014-12-31 42 16435 97.0 96\n",
"9177 2014-12-31 44 16435 97.0 96\n",
"9178 2015-01-01 93 16436 97.0 96\n",
"9179 2015-01-01 30 16436 97.0 96\n",
"9180 2015-01-01 80 16436 97.0 96\n",
"\n",
"[9103 rows x 5 columns]"
]
},
"execution_count": 18,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"# Show where the calculation doesn't match.\n",
"print(\"Where reference ('ref') calculation of number of unique people doesn't match 'test':\")\n",
"df['ref'] = ref\n",
"df['test'] = test\n",
"df.loc[df['ref'] != df['test']]"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": []
}
],
"metadata": {
"kernelspec": {
"display_name": "Python 3",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.6.0"
}
},
"nbformat": 4,
"nbformat_minor": 2
}