Skip to content

Instantly share code, notes, and snippets.

Show Gist options
  • Save stharrold/17589e6809d249942debe3a5c43d38cc to your computer and use it in GitHub Desktop.
Save stharrold/17589e6809d249942debe3a5c43d38cc to your computer and use it in GitHub Desktop.
20170928_pandas_timeseries_rolling_large-window_unique-count.ipynb
{
"cells": [
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# How to efficiently compute a rolling unique count in a `pandas` time series?\n",
"\n",
"I have a time series of people visiting a building. Each person has a unique ID. For every record in the time series, I want to know the number of unique people visiting the building in the last 365 days (i.e. a rolling unique count with a window of 365 days).\n",
"\n",
"`pandas` does not seem to have a built-in method for this calculation. The calculation becomes computationally intensive when there are a large number of unique visitors and/or a large window. (The actual data is larger than this example.)\n",
"\n",
"Is there a better way to calculate than what I've done below? I'm not sure why the fast method I made, `windowed_nunique` (under \"Speed test 3\"), is off by 1.\n",
"\n",
"Thanks for any help!\n",
"\n",
"Related links:\n",
"* https://github.com/pandas-dev/pandas/issues/14336"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Initialization"
]
},
{
"cell_type": "code",
"execution_count": 1,
"metadata": {},
"outputs": [],
"source": [
"# Import libraries.\n",
"import pandas as pd\n",
"import numba\n",
"import numpy as np"
]
},
{
"cell_type": "code",
"execution_count": 2,
"metadata": {
"scrolled": true
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Created data of people visiting a building:\n"
]
},
{
"data": {
"text/html": [
"<div>\n",
"<style>\n",
" .dataframe thead tr:only-child th {\n",
" text-align: right;\n",
" }\n",
"\n",
" .dataframe thead th {\n",
" text-align: left;\n",
" }\n",
"\n",
" .dataframe tbody tr th {\n",
" vertical-align: top;\n",
" }\n",
"</style>\n",
"<table border=\"1\" class=\"dataframe\">\n",
" <thead>\n",
" <tr style=\"text-align: right;\">\n",
" <th></th>\n",
" <th>Date</th>\n",
" <th>PersonId</th>\n",
" </tr>\n",
" </thead>\n",
" <tbody>\n",
" <tr>\n",
" <th>0</th>\n",
" <td>2010-01-01</td>\n",
" <td>76</td>\n",
" </tr>\n",
" <tr>\n",
" <th>1</th>\n",
" <td>2010-01-01</td>\n",
" <td>63</td>\n",
" </tr>\n",
" <tr>\n",
" <th>2</th>\n",
" <td>2010-01-01</td>\n",
" <td>89</td>\n",
" </tr>\n",
" <tr>\n",
" <th>3</th>\n",
" <td>2010-01-01</td>\n",
" <td>81</td>\n",
" </tr>\n",
" <tr>\n",
" <th>4</th>\n",
" <td>2010-01-01</td>\n",
" <td>7</td>\n",
" </tr>\n",
" <tr>\n",
" <th>5</th>\n",
" <td>2010-01-02</td>\n",
" <td>22</td>\n",
" </tr>\n",
" <tr>\n",
" <th>6</th>\n",
" <td>2010-01-02</td>\n",
" <td>83</td>\n",
" </tr>\n",
" <tr>\n",
" <th>7</th>\n",
" <td>2010-01-02</td>\n",
" <td>78</td>\n",
" </tr>\n",
" <tr>\n",
" <th>8</th>\n",
" <td>2010-01-02</td>\n",
" <td>47</td>\n",
" </tr>\n",
" <tr>\n",
" <th>9</th>\n",
" <td>2010-01-02</td>\n",
" <td>68</td>\n",
" </tr>\n",
" <tr>\n",
" <th>10</th>\n",
" <td>2010-01-02</td>\n",
" <td>72</td>\n",
" </tr>\n",
" <tr>\n",
" <th>11</th>\n",
" <td>2010-01-03</td>\n",
" <td>89</td>\n",
" </tr>\n",
" <tr>\n",
" <th>12</th>\n",
" <td>2010-01-03</td>\n",
" <td>94</td>\n",
" </tr>\n",
" <tr>\n",
" <th>13</th>\n",
" <td>2010-01-03</td>\n",
" <td>44</td>\n",
" </tr>\n",
" <tr>\n",
" <th>14</th>\n",
" <td>2010-01-04</td>\n",
" <td>67</td>\n",
" </tr>\n",
" <tr>\n",
" <th>15</th>\n",
" <td>2010-01-04</td>\n",
" <td>88</td>\n",
" </tr>\n",
" <tr>\n",
" <th>16</th>\n",
" <td>2010-01-04</td>\n",
" <td>90</td>\n",
" </tr>\n",
" <tr>\n",
" <th>17</th>\n",
" <td>2010-01-05</td>\n",
" <td>30</td>\n",
" </tr>\n",
" <tr>\n",
" <th>18</th>\n",
" <td>2010-01-05</td>\n",
" <td>90</td>\n",
" </tr>\n",
" <tr>\n",
" <th>19</th>\n",
" <td>2010-01-05</td>\n",
" <td>70</td>\n",
" </tr>\n",
" <tr>\n",
" <th>20</th>\n",
" <td>2010-01-06</td>\n",
" <td>10</td>\n",
" </tr>\n",
" <tr>\n",
" <th>21</th>\n",
" <td>2010-01-06</td>\n",
" <td>77</td>\n",
" </tr>\n",
" <tr>\n",
" <th>22</th>\n",
" <td>2010-01-07</td>\n",
" <td>15</td>\n",
" </tr>\n",
" <tr>\n",
" <th>23</th>\n",
" <td>2010-01-08</td>\n",
" <td>78</td>\n",
" </tr>\n",
" <tr>\n",
" <th>24</th>\n",
" <td>2010-01-08</td>\n",
" <td>81</td>\n",
" </tr>\n",
" <tr>\n",
" <th>25</th>\n",
" <td>2010-01-08</td>\n",
" <td>49</td>\n",
" </tr>\n",
" <tr>\n",
" <th>26</th>\n",
" <td>2010-01-08</td>\n",
" <td>96</td>\n",
" </tr>\n",
" <tr>\n",
" <th>27</th>\n",
" <td>2010-01-08</td>\n",
" <td>92</td>\n",
" </tr>\n",
" <tr>\n",
" <th>28</th>\n",
" <td>2010-01-09</td>\n",
" <td>35</td>\n",
" </tr>\n",
" <tr>\n",
" <th>29</th>\n",
" <td>2010-01-09</td>\n",
" <td>69</td>\n",
" </tr>\n",
" <tr>\n",
" <th>...</th>\n",
" <td>...</td>\n",
" <td>...</td>\n",
" </tr>\n",
" <tr>\n",
" <th>9151</th>\n",
" <td>2014-12-26</td>\n",
" <td>89</td>\n",
" </tr>\n",
" <tr>\n",
" <th>9152</th>\n",
" <td>2014-12-26</td>\n",
" <td>54</td>\n",
" </tr>\n",
" <tr>\n",
" <th>9153</th>\n",
" <td>2014-12-26</td>\n",
" <td>56</td>\n",
" </tr>\n",
" <tr>\n",
" <th>9154</th>\n",
" <td>2014-12-26</td>\n",
" <td>76</td>\n",
" </tr>\n",
" <tr>\n",
" <th>9155</th>\n",
" <td>2014-12-26</td>\n",
" <td>95</td>\n",
" </tr>\n",
" <tr>\n",
" <th>9156</th>\n",
" <td>2014-12-26</td>\n",
" <td>32</td>\n",
" </tr>\n",
" <tr>\n",
" <th>9157</th>\n",
" <td>2014-12-27</td>\n",
" <td>90</td>\n",
" </tr>\n",
" <tr>\n",
" <th>9158</th>\n",
" <td>2014-12-27</td>\n",
" <td>73</td>\n",
" </tr>\n",
" <tr>\n",
" <th>9159</th>\n",
" <td>2014-12-28</td>\n",
" <td>90</td>\n",
" </tr>\n",
" <tr>\n",
" <th>9160</th>\n",
" <td>2014-12-28</td>\n",
" <td>55</td>\n",
" </tr>\n",
" <tr>\n",
" <th>9161</th>\n",
" <td>2014-12-28</td>\n",
" <td>88</td>\n",
" </tr>\n",
" <tr>\n",
" <th>9162</th>\n",
" <td>2014-12-28</td>\n",
" <td>49</td>\n",
" </tr>\n",
" <tr>\n",
" <th>9163</th>\n",
" <td>2014-12-28</td>\n",
" <td>93</td>\n",
" </tr>\n",
" <tr>\n",
" <th>9164</th>\n",
" <td>2014-12-29</td>\n",
" <td>51</td>\n",
" </tr>\n",
" <tr>\n",
" <th>9165</th>\n",
" <td>2014-12-29</td>\n",
" <td>63</td>\n",
" </tr>\n",
" <tr>\n",
" <th>9166</th>\n",
" <td>2014-12-29</td>\n",
" <td>27</td>\n",
" </tr>\n",
" <tr>\n",
" <th>9167</th>\n",
" <td>2014-12-29</td>\n",
" <td>92</td>\n",
" </tr>\n",
" <tr>\n",
" <th>9168</th>\n",
" <td>2014-12-29</td>\n",
" <td>53</td>\n",
" </tr>\n",
" <tr>\n",
" <th>9169</th>\n",
" <td>2014-12-30</td>\n",
" <td>66</td>\n",
" </tr>\n",
" <tr>\n",
" <th>9170</th>\n",
" <td>2014-12-30</td>\n",
" <td>92</td>\n",
" </tr>\n",
" <tr>\n",
" <th>9171</th>\n",
" <td>2014-12-30</td>\n",
" <td>94</td>\n",
" </tr>\n",
" <tr>\n",
" <th>9172</th>\n",
" <td>2014-12-30</td>\n",
" <td>75</td>\n",
" </tr>\n",
" <tr>\n",
" <th>9173</th>\n",
" <td>2014-12-30</td>\n",
" <td>27</td>\n",
" </tr>\n",
" <tr>\n",
" <th>9174</th>\n",
" <td>2014-12-30</td>\n",
" <td>99</td>\n",
" </tr>\n",
" <tr>\n",
" <th>9175</th>\n",
" <td>2014-12-31</td>\n",
" <td>83</td>\n",
" </tr>\n",
" <tr>\n",
" <th>9176</th>\n",
" <td>2014-12-31</td>\n",
" <td>42</td>\n",
" </tr>\n",
" <tr>\n",
" <th>9177</th>\n",
" <td>2014-12-31</td>\n",
" <td>44</td>\n",
" </tr>\n",
" <tr>\n",
" <th>9178</th>\n",
" <td>2015-01-01</td>\n",
" <td>93</td>\n",
" </tr>\n",
" <tr>\n",
" <th>9179</th>\n",
" <td>2015-01-01</td>\n",
" <td>30</td>\n",
" </tr>\n",
" <tr>\n",
" <th>9180</th>\n",
" <td>2015-01-01</td>\n",
" <td>80</td>\n",
" </tr>\n",
" </tbody>\n",
"</table>\n",
"<p>9181 rows × 2 columns</p>\n",
"</div>"
],
"text/plain": [
" Date PersonId\n",
"0 2010-01-01 76\n",
"1 2010-01-01 63\n",
"2 2010-01-01 89\n",
"3 2010-01-01 81\n",
"4 2010-01-01 7\n",
"5 2010-01-02 22\n",
"6 2010-01-02 83\n",
"7 2010-01-02 78\n",
"8 2010-01-02 47\n",
"9 2010-01-02 68\n",
"10 2010-01-02 72\n",
"11 2010-01-03 89\n",
"12 2010-01-03 94\n",
"13 2010-01-03 44\n",
"14 2010-01-04 67\n",
"15 2010-01-04 88\n",
"16 2010-01-04 90\n",
"17 2010-01-05 30\n",
"18 2010-01-05 90\n",
"19 2010-01-05 70\n",
"20 2010-01-06 10\n",
"21 2010-01-06 77\n",
"22 2010-01-07 15\n",
"23 2010-01-08 78\n",
"24 2010-01-08 81\n",
"25 2010-01-08 49\n",
"26 2010-01-08 96\n",
"27 2010-01-08 92\n",
"28 2010-01-09 35\n",
"29 2010-01-09 69\n",
"... ... ...\n",
"9151 2014-12-26 89\n",
"9152 2014-12-26 54\n",
"9153 2014-12-26 56\n",
"9154 2014-12-26 76\n",
"9155 2014-12-26 95\n",
"9156 2014-12-26 32\n",
"9157 2014-12-27 90\n",
"9158 2014-12-27 73\n",
"9159 2014-12-28 90\n",
"9160 2014-12-28 55\n",
"9161 2014-12-28 88\n",
"9162 2014-12-28 49\n",
"9163 2014-12-28 93\n",
"9164 2014-12-29 51\n",
"9165 2014-12-29 63\n",
"9166 2014-12-29 27\n",
"9167 2014-12-29 92\n",
"9168 2014-12-29 53\n",
"9169 2014-12-30 66\n",
"9170 2014-12-30 92\n",
"9171 2014-12-30 94\n",
"9172 2014-12-30 75\n",
"9173 2014-12-30 27\n",
"9174 2014-12-30 99\n",
"9175 2014-12-31 83\n",
"9176 2014-12-31 42\n",
"9177 2014-12-31 44\n",
"9178 2015-01-01 93\n",
"9179 2015-01-01 30\n",
"9180 2015-01-01 80\n",
"\n",
"[9181 rows x 2 columns]"
]
},
"execution_count": 2,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"# Create data of people visiting a building.\n",
"\n",
"np.random.seed(seed=0)\n",
"dates = pd.date_range(start='2010-01-01', end='2015-01-01', freq='D')\n",
"window = 365 # days\n",
"num_pids = 100\n",
"probs = np.linspace(start=0.001, stop=0.1, num=num_pids)\n",
"\n",
"df = pd\\\n",
" .DataFrame(\n",
" data=[(date, pid)\n",
" for (pid, prob) in zip(range(num_pids), probs)\n",
" for date in np.compress(np.random.binomial(n=1, p=prob, size=len(dates)), dates)],\n",
" columns=['Date', 'PersonId'])\\\n",
" .sort_values(by='Date')\\\n",
" .reset_index(drop=True)\n",
"\n",
"print(\"Created data of people visiting a building:\")\n",
"df"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Speed reference"
]
},
{
"cell_type": "code",
"execution_count": 3,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"3.37 ms ± 82.8 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)\n"
]
}
],
"source": [
"%%timeit\n",
"# This counts the number of people visiting the building, not the number of unique people.\n",
"# Provided as a speed reference.\n",
"df.rolling(window='{:d}D'.format(window), on='Date').count()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Speed test 1"
]
},
{
"cell_type": "code",
"execution_count": 4,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"2.49 s ± 275 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)\n"
]
}
],
"source": [
"%%timeit\n",
"df.rolling(window='{:d}D'.format(window), on='Date').apply(lambda arr: pd.Series(arr).nunique())"
]
},
{
"cell_type": "code",
"execution_count": 5,
"metadata": {},
"outputs": [],
"source": [
"# Save results as a reference to check calculation accuracy.\n",
"ref = df.rolling(window='{:d}D'.format(window), on='Date').apply(lambda arr: pd.Series(arr).nunique())['PersonId'].values"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Speed test 2"
]
},
{
"cell_type": "code",
"execution_count": 6,
"metadata": {},
"outputs": [],
"source": [
"# Define a custom function and implement a just-in-time compiler.\n",
"@numba.jit(nopython=True)\n",
"def nunique(arr):\n",
" return len(set(arr))"
]
},
{
"cell_type": "code",
"execution_count": 7,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"443 ms ± 25.9 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)\n"
]
}
],
"source": [
"%%timeit\n",
"df.rolling(window='{:d}D'.format(window), on='Date').apply(nunique)"
]
},
{
"cell_type": "code",
"execution_count": 8,
"metadata": {},
"outputs": [],
"source": [
"# Check accuracy of results.\n",
"test = df.rolling(window='{:d}D'.format(window), on='Date').apply(nunique)['PersonId'].values\n",
"assert all(ref == test)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Speed test 3"
]
},
{
"cell_type": "code",
"execution_count": 9,
"metadata": {},
"outputs": [],
"source": [
"# Define a custom function and implement a just-in-time compiler.\n",
"@numba.jit(nopython=True)\n",
"def windowed_nunique(dates, pids, window):\n",
" r\"\"\"Track number of unique persons in window,\n",
" reading through arrays only once.\n",
"\n",
" Args:\n",
" dates (numpy.ndarray): Array of dates as number of days since epoch.\n",
" pids (numpy.ndarray): Array of integer person identifiers.\n",
" window (int): Width of window in units of difference of `dates`.\n",
"\n",
" Returns:\n",
" ucts (numpy.ndarray): Array of unique counts.\n",
"\n",
" Raises:\n",
" AssertionError: Raised if `len(dates) != len(pids)`\n",
"\n",
" Notes:\n",
" * May be off by 1 compared to `pandas.core.window.Rolling`\n",
" with a time series alias offset.\n",
"\n",
" \"\"\"\n",
"\n",
" # Check arguments.\n",
" assert dates.shape == pids.shape\n",
"\n",
" # Initialize counters.\n",
" idx_min = 0\n",
" idx_max = dates.shape[0]\n",
" date_min = dates[idx_min]\n",
" pid_min = pids[idx_min]\n",
" pid_max = np.max(pids)\n",
" pid_cts = np.zeros(pid_max, dtype=np.int64)\n",
" pid_cts[pid_min] = 1\n",
" uct = 1\n",
" ucts = np.zeros(idx_max, dtype=np.int64)\n",
" ucts[idx_min] = uct\n",
" idx = 1\n",
"\n",
" # For each (date, person)...\n",
" while idx < idx_max:\n",
"\n",
" # If person count went from 0 to 1, increment unique person count.\n",
" date = dates[idx]\n",
" pid = pids[idx]\n",
" pid_cts[pid] += 1\n",
" if pid_cts[pid] == 1:\n",
" uct += 1\n",
"\n",
" # For past dates outside of window...\n",
" while (date - date_min) > window:\n",
"\n",
" # If person count went from 1 to 0, decrement unique person count.\n",
" pid_cts[pid_min] -= 1\n",
" if pid_cts[pid_min] == 0:\n",
" uct -= 1\n",
" idx_min += 1\n",
" date_min = dates[idx_min]\n",
" pid_min = pids[idx_min]\n",
"\n",
" # Record unique person count.\n",
" ucts[idx] = uct\n",
" idx += 1\n",
"\n",
" return ucts"
]
},
{
"cell_type": "code",
"execution_count": 10,
"metadata": {},
"outputs": [],
"source": [
"# Cast dates to integers.\n",
"df['DateEpoch'] = (df['Date'] - pd.to_datetime('1970-01-01'))/pd.to_timedelta(1, unit='D')\n",
"df['DateEpoch'] = df['DateEpoch'].astype(int)"
]
},
{
"cell_type": "code",
"execution_count": 11,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"110 µs ± 56.9 µs per loop (mean ± std. dev. of 7 runs, 1 loop each)\n"
]
}
],
"source": [
"%%timeit\n",
"windowed_nunique(\n",
" dates=df['DateEpoch'].values,\n",
" pids=df['PersonId'].values,\n",
" window=window)"
]
},
{
"cell_type": "code",
"execution_count": 12,
"metadata": {},
"outputs": [],
"source": [
"# Check accuracy of results.\n",
"test = windowed_nunique(\n",
" dates=df['DateEpoch'].values,\n",
" pids=df['PersonId'].values,\n",
" window=window)\n",
"# Note: Method may be off by 1.\n",
"assert all(np.isclose(ref, np.asarray(test), atol=1))"
]
},
{
"cell_type": "code",
"execution_count": 13,
"metadata": {
"scrolled": true
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Where reference ('ref') calculation of number of unique people doesn't match 'test':\n"
]
},
{
"data": {
"text/html": [
"<div>\n",
"<style>\n",
" .dataframe thead tr:only-child th {\n",
" text-align: right;\n",
" }\n",
"\n",
" .dataframe thead th {\n",
" text-align: left;\n",
" }\n",
"\n",
" .dataframe tbody tr th {\n",
" vertical-align: top;\n",
" }\n",
"</style>\n",
"<table border=\"1\" class=\"dataframe\">\n",
" <thead>\n",
" <tr style=\"text-align: right;\">\n",
" <th></th>\n",
" <th>Date</th>\n",
" <th>PersonId</th>\n",
" <th>DateEpoch</th>\n",
" <th>ref</th>\n",
" <th>test</th>\n",
" </tr>\n",
" </thead>\n",
" <tbody>\n",
" <tr>\n",
" <th>2331</th>\n",
" <td>2011-04-14</td>\n",
" <td>83</td>\n",
" <td>15078</td>\n",
" <td>96.0</td>\n",
" <td>97</td>\n",
" </tr>\n",
" <tr>\n",
" <th>2332</th>\n",
" <td>2011-04-14</td>\n",
" <td>91</td>\n",
" <td>15078</td>\n",
" <td>96.0</td>\n",
" <td>97</td>\n",
" </tr>\n",
" <tr>\n",
" <th>2333</th>\n",
" <td>2011-04-14</td>\n",
" <td>33</td>\n",
" <td>15078</td>\n",
" <td>96.0</td>\n",
" <td>97</td>\n",
" </tr>\n",
" <tr>\n",
" <th>2393</th>\n",
" <td>2011-04-29</td>\n",
" <td>95</td>\n",
" <td>15093</td>\n",
" <td>95.0</td>\n",
" <td>96</td>\n",
" </tr>\n",
" <tr>\n",
" <th>2394</th>\n",
" <td>2011-04-29</td>\n",
" <td>70</td>\n",
" <td>15093</td>\n",
" <td>95.0</td>\n",
" <td>96</td>\n",
" </tr>\n",
" <tr>\n",
" <th>2395</th>\n",
" <td>2011-04-29</td>\n",
" <td>65</td>\n",
" <td>15093</td>\n",
" <td>95.0</td>\n",
" <td>96</td>\n",
" </tr>\n",
" <tr>\n",
" <th>2396</th>\n",
" <td>2011-04-29</td>\n",
" <td>59</td>\n",
" <td>15093</td>\n",
" <td>95.0</td>\n",
" <td>96</td>\n",
" </tr>\n",
" <tr>\n",
" <th>2397</th>\n",
" <td>2011-04-29</td>\n",
" <td>98</td>\n",
" <td>15093</td>\n",
" <td>95.0</td>\n",
" <td>96</td>\n",
" </tr>\n",
" <tr>\n",
" <th>2464</th>\n",
" <td>2011-05-11</td>\n",
" <td>58</td>\n",
" <td>15105</td>\n",
" <td>94.0</td>\n",
" <td>95</td>\n",
" </tr>\n",
" <tr>\n",
" <th>2465</th>\n",
" <td>2011-05-11</td>\n",
" <td>35</td>\n",
" <td>15105</td>\n",
" <td>94.0</td>\n",
" <td>95</td>\n",
" </tr>\n",
" <tr>\n",
" <th>2466</th>\n",
" <td>2011-05-11</td>\n",
" <td>92</td>\n",
" <td>15105</td>\n",
" <td>94.0</td>\n",
" <td>95</td>\n",
" </tr>\n",
" <tr>\n",
" <th>2467</th>\n",
" <td>2011-05-11</td>\n",
" <td>28</td>\n",
" <td>15105</td>\n",
" <td>94.0</td>\n",
" <td>95</td>\n",
" </tr>\n",
" <tr>\n",
" <th>3337</th>\n",
" <td>2011-10-29</td>\n",
" <td>84</td>\n",
" <td>15276</td>\n",
" <td>97.0</td>\n",
" <td>98</td>\n",
" </tr>\n",
" <tr>\n",
" <th>3338</th>\n",
" <td>2011-10-29</td>\n",
" <td>30</td>\n",
" <td>15276</td>\n",
" <td>97.0</td>\n",
" <td>98</td>\n",
" </tr>\n",
" <tr>\n",
" <th>3339</th>\n",
" <td>2011-10-29</td>\n",
" <td>91</td>\n",
" <td>15276</td>\n",
" <td>97.0</td>\n",
" <td>98</td>\n",
" </tr>\n",
" <tr>\n",
" <th>3340</th>\n",
" <td>2011-10-29</td>\n",
" <td>81</td>\n",
" <td>15276</td>\n",
" <td>97.0</td>\n",
" <td>98</td>\n",
" </tr>\n",
" <tr>\n",
" <th>3383</th>\n",
" <td>2011-11-07</td>\n",
" <td>78</td>\n",
" <td>15285</td>\n",
" <td>96.0</td>\n",
" <td>97</td>\n",
" </tr>\n",
" <tr>\n",
" <th>3384</th>\n",
" <td>2011-11-07</td>\n",
" <td>53</td>\n",
" <td>15285</td>\n",
" <td>96.0</td>\n",
" <td>97</td>\n",
" </tr>\n",
" <tr>\n",
" <th>3385</th>\n",
" <td>2011-11-07</td>\n",
" <td>42</td>\n",
" <td>15285</td>\n",
" <td>96.0</td>\n",
" <td>97</td>\n",
" </tr>\n",
" <tr>\n",
" <th>3386</th>\n",
" <td>2011-11-07</td>\n",
" <td>99</td>\n",
" <td>15285</td>\n",
" <td>96.0</td>\n",
" <td>97</td>\n",
" </tr>\n",
" <tr>\n",
" <th>3387</th>\n",
" <td>2011-11-07</td>\n",
" <td>83</td>\n",
" <td>15285</td>\n",
" <td>96.0</td>\n",
" <td>97</td>\n",
" </tr>\n",
" <tr>\n",
" <th>3388</th>\n",
" <td>2011-11-07</td>\n",
" <td>69</td>\n",
" <td>15285</td>\n",
" <td>96.0</td>\n",
" <td>97</td>\n",
" </tr>\n",
" <tr>\n",
" <th>3389</th>\n",
" <td>2011-11-07</td>\n",
" <td>26</td>\n",
" <td>15285</td>\n",
" <td>96.0</td>\n",
" <td>97</td>\n",
" </tr>\n",
" <tr>\n",
" <th>3390</th>\n",
" <td>2011-11-07</td>\n",
" <td>63</td>\n",
" <td>15285</td>\n",
" <td>96.0</td>\n",
" <td>97</td>\n",
" </tr>\n",
" <tr>\n",
" <th>6017</th>\n",
" <td>2013-04-19</td>\n",
" <td>70</td>\n",
" <td>15814</td>\n",
" <td>98.0</td>\n",
" <td>99</td>\n",
" </tr>\n",
" <tr>\n",
" <th>6018</th>\n",
" <td>2013-04-19</td>\n",
" <td>92</td>\n",
" <td>15814</td>\n",
" <td>98.0</td>\n",
" <td>99</td>\n",
" </tr>\n",
" <tr>\n",
" <th>6019</th>\n",
" <td>2013-04-19</td>\n",
" <td>79</td>\n",
" <td>15814</td>\n",
" <td>98.0</td>\n",
" <td>99</td>\n",
" </tr>\n",
" <tr>\n",
" <th>6020</th>\n",
" <td>2013-04-19</td>\n",
" <td>60</td>\n",
" <td>15814</td>\n",
" <td>98.0</td>\n",
" <td>99</td>\n",
" </tr>\n",
" <tr>\n",
" <th>6021</th>\n",
" <td>2013-04-19</td>\n",
" <td>51</td>\n",
" <td>15814</td>\n",
" <td>98.0</td>\n",
" <td>99</td>\n",
" </tr>\n",
" <tr>\n",
" <th>6022</th>\n",
" <td>2013-04-19</td>\n",
" <td>80</td>\n",
" <td>15814</td>\n",
" <td>98.0</td>\n",
" <td>99</td>\n",
" </tr>\n",
" <tr>\n",
" <th>6107</th>\n",
" <td>2013-05-05</td>\n",
" <td>88</td>\n",
" <td>15830</td>\n",
" <td>97.0</td>\n",
" <td>98</td>\n",
" </tr>\n",
" <tr>\n",
" <th>6108</th>\n",
" <td>2013-05-05</td>\n",
" <td>97</td>\n",
" <td>15830</td>\n",
" <td>97.0</td>\n",
" <td>98</td>\n",
" </tr>\n",
" <tr>\n",
" <th>6109</th>\n",
" <td>2013-05-05</td>\n",
" <td>98</td>\n",
" <td>15830</td>\n",
" <td>97.0</td>\n",
" <td>98</td>\n",
" </tr>\n",
" <tr>\n",
" <th>6698</th>\n",
" <td>2013-09-01</td>\n",
" <td>93</td>\n",
" <td>15949</td>\n",
" <td>97.0</td>\n",
" <td>98</td>\n",
" </tr>\n",
" <tr>\n",
" <th>6699</th>\n",
" <td>2013-09-01</td>\n",
" <td>49</td>\n",
" <td>15949</td>\n",
" <td>97.0</td>\n",
" <td>98</td>\n",
" </tr>\n",
" <tr>\n",
" <th>6700</th>\n",
" <td>2013-09-01</td>\n",
" <td>61</td>\n",
" <td>15949</td>\n",
" <td>97.0</td>\n",
" <td>98</td>\n",
" </tr>\n",
" <tr>\n",
" <th>6701</th>\n",
" <td>2013-09-01</td>\n",
" <td>35</td>\n",
" <td>15949</td>\n",
" <td>97.0</td>\n",
" <td>98</td>\n",
" </tr>\n",
" <tr>\n",
" <th>6702</th>\n",
" <td>2013-09-01</td>\n",
" <td>70</td>\n",
" <td>15949</td>\n",
" <td>97.0</td>\n",
" <td>98</td>\n",
" </tr>\n",
" <tr>\n",
" <th>6703</th>\n",
" <td>2013-09-01</td>\n",
" <td>83</td>\n",
" <td>15949</td>\n",
" <td>97.0</td>\n",
" <td>98</td>\n",
" </tr>\n",
" <tr>\n",
" <th>6704</th>\n",
" <td>2013-09-01</td>\n",
" <td>30</td>\n",
" <td>15949</td>\n",
" <td>97.0</td>\n",
" <td>98</td>\n",
" </tr>\n",
" <tr>\n",
" <th>6882</th>\n",
" <td>2013-10-06</td>\n",
" <td>80</td>\n",
" <td>15984</td>\n",
" <td>96.0</td>\n",
" <td>97</td>\n",
" </tr>\n",
" <tr>\n",
" <th>6883</th>\n",
" <td>2013-10-06</td>\n",
" <td>71</td>\n",
" <td>15984</td>\n",
" <td>96.0</td>\n",
" <td>97</td>\n",
" </tr>\n",
" <tr>\n",
" <th>6884</th>\n",
" <td>2013-10-06</td>\n",
" <td>31</td>\n",
" <td>15984</td>\n",
" <td>96.0</td>\n",
" <td>97</td>\n",
" </tr>\n",
" <tr>\n",
" <th>6885</th>\n",
" <td>2013-10-06</td>\n",
" <td>53</td>\n",
" <td>15984</td>\n",
" <td>96.0</td>\n",
" <td>97</td>\n",
" </tr>\n",
" <tr>\n",
" <th>6886</th>\n",
" <td>2013-10-06</td>\n",
" <td>76</td>\n",
" <td>15984</td>\n",
" <td>96.0</td>\n",
" <td>97</td>\n",
" </tr>\n",
" <tr>\n",
" <th>6887</th>\n",
" <td>2013-10-06</td>\n",
" <td>99</td>\n",
" <td>15984</td>\n",
" <td>96.0</td>\n",
" <td>97</td>\n",
" </tr>\n",
" <tr>\n",
" <th>6888</th>\n",
" <td>2013-10-06</td>\n",
" <td>85</td>\n",
" <td>15984</td>\n",
" <td>96.0</td>\n",
" <td>97</td>\n",
" </tr>\n",
" <tr>\n",
" <th>6889</th>\n",
" <td>2013-10-06</td>\n",
" <td>82</td>\n",
" <td>15984</td>\n",
" <td>96.0</td>\n",
" <td>97</td>\n",
" </tr>\n",
" <tr>\n",
" <th>7998</th>\n",
" <td>2014-05-12</td>\n",
" <td>96</td>\n",
" <td>16202</td>\n",
" <td>97.0</td>\n",
" <td>98</td>\n",
" </tr>\n",
" <tr>\n",
" <th>7999</th>\n",
" <td>2014-05-12</td>\n",
" <td>32</td>\n",
" <td>16202</td>\n",
" <td>97.0</td>\n",
" <td>98</td>\n",
" </tr>\n",
" <tr>\n",
" <th>8000</th>\n",
" <td>2014-05-12</td>\n",
" <td>24</td>\n",
" <td>16202</td>\n",
" <td>97.0</td>\n",
" <td>98</td>\n",
" </tr>\n",
" <tr>\n",
" <th>8001</th>\n",
" <td>2014-05-12</td>\n",
" <td>29</td>\n",
" <td>16202</td>\n",
" <td>97.0</td>\n",
" <td>98</td>\n",
" </tr>\n",
" <tr>\n",
" <th>9057</th>\n",
" <td>2014-12-08</td>\n",
" <td>73</td>\n",
" <td>16412</td>\n",
" <td>97.0</td>\n",
" <td>98</td>\n",
" </tr>\n",
" <tr>\n",
" <th>9058</th>\n",
" <td>2014-12-08</td>\n",
" <td>66</td>\n",
" <td>16412</td>\n",
" <td>97.0</td>\n",
" <td>98</td>\n",
" </tr>\n",
" <tr>\n",
" <th>9059</th>\n",
" <td>2014-12-08</td>\n",
" <td>37</td>\n",
" <td>16412</td>\n",
" <td>97.0</td>\n",
" <td>98</td>\n",
" </tr>\n",
" <tr>\n",
" <th>9060</th>\n",
" <td>2014-12-08</td>\n",
" <td>48</td>\n",
" <td>16412</td>\n",
" <td>97.0</td>\n",
" <td>98</td>\n",
" </tr>\n",
" <tr>\n",
" <th>9061</th>\n",
" <td>2014-12-08</td>\n",
" <td>99</td>\n",
" <td>16412</td>\n",
" <td>97.0</td>\n",
" <td>98</td>\n",
" </tr>\n",
" <tr>\n",
" <th>9062</th>\n",
" <td>2014-12-08</td>\n",
" <td>81</td>\n",
" <td>16412</td>\n",
" <td>97.0</td>\n",
" <td>98</td>\n",
" </tr>\n",
" <tr>\n",
" <th>9063</th>\n",
" <td>2014-12-08</td>\n",
" <td>40</td>\n",
" <td>16412</td>\n",
" <td>97.0</td>\n",
" <td>98</td>\n",
" </tr>\n",
" </tbody>\n",
"</table>\n",
"</div>"
],
"text/plain": [
" Date PersonId DateEpoch ref test\n",
"2331 2011-04-14 83 15078 96.0 97\n",
"2332 2011-04-14 91 15078 96.0 97\n",
"2333 2011-04-14 33 15078 96.0 97\n",
"2393 2011-04-29 95 15093 95.0 96\n",
"2394 2011-04-29 70 15093 95.0 96\n",
"2395 2011-04-29 65 15093 95.0 96\n",
"2396 2011-04-29 59 15093 95.0 96\n",
"2397 2011-04-29 98 15093 95.0 96\n",
"2464 2011-05-11 58 15105 94.0 95\n",
"2465 2011-05-11 35 15105 94.0 95\n",
"2466 2011-05-11 92 15105 94.0 95\n",
"2467 2011-05-11 28 15105 94.0 95\n",
"3337 2011-10-29 84 15276 97.0 98\n",
"3338 2011-10-29 30 15276 97.0 98\n",
"3339 2011-10-29 91 15276 97.0 98\n",
"3340 2011-10-29 81 15276 97.0 98\n",
"3383 2011-11-07 78 15285 96.0 97\n",
"3384 2011-11-07 53 15285 96.0 97\n",
"3385 2011-11-07 42 15285 96.0 97\n",
"3386 2011-11-07 99 15285 96.0 97\n",
"3387 2011-11-07 83 15285 96.0 97\n",
"3388 2011-11-07 69 15285 96.0 97\n",
"3389 2011-11-07 26 15285 96.0 97\n",
"3390 2011-11-07 63 15285 96.0 97\n",
"6017 2013-04-19 70 15814 98.0 99\n",
"6018 2013-04-19 92 15814 98.0 99\n",
"6019 2013-04-19 79 15814 98.0 99\n",
"6020 2013-04-19 60 15814 98.0 99\n",
"6021 2013-04-19 51 15814 98.0 99\n",
"6022 2013-04-19 80 15814 98.0 99\n",
"6107 2013-05-05 88 15830 97.0 98\n",
"6108 2013-05-05 97 15830 97.0 98\n",
"6109 2013-05-05 98 15830 97.0 98\n",
"6698 2013-09-01 93 15949 97.0 98\n",
"6699 2013-09-01 49 15949 97.0 98\n",
"6700 2013-09-01 61 15949 97.0 98\n",
"6701 2013-09-01 35 15949 97.0 98\n",
"6702 2013-09-01 70 15949 97.0 98\n",
"6703 2013-09-01 83 15949 97.0 98\n",
"6704 2013-09-01 30 15949 97.0 98\n",
"6882 2013-10-06 80 15984 96.0 97\n",
"6883 2013-10-06 71 15984 96.0 97\n",
"6884 2013-10-06 31 15984 96.0 97\n",
"6885 2013-10-06 53 15984 96.0 97\n",
"6886 2013-10-06 76 15984 96.0 97\n",
"6887 2013-10-06 99 15984 96.0 97\n",
"6888 2013-10-06 85 15984 96.0 97\n",
"6889 2013-10-06 82 15984 96.0 97\n",
"7998 2014-05-12 96 16202 97.0 98\n",
"7999 2014-05-12 32 16202 97.0 98\n",
"8000 2014-05-12 24 16202 97.0 98\n",
"8001 2014-05-12 29 16202 97.0 98\n",
"9057 2014-12-08 73 16412 97.0 98\n",
"9058 2014-12-08 66 16412 97.0 98\n",
"9059 2014-12-08 37 16412 97.0 98\n",
"9060 2014-12-08 48 16412 97.0 98\n",
"9061 2014-12-08 99 16412 97.0 98\n",
"9062 2014-12-08 81 16412 97.0 98\n",
"9063 2014-12-08 40 16412 97.0 98"
]
},
"execution_count": 13,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"# Show where the calculation doesn't match.\n",
"print(\"Where reference ('ref') calculation of number of unique people doesn't match 'test':\")\n",
"df['ref'] = ref\n",
"df['test'] = test\n",
"df.loc[df['ref'] != df['test']]"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Solution"
]
},
{
"cell_type": "code",
"execution_count": 14,
"metadata": {},
"outputs": [],
"source": [
"# Define a custom function and implement a just-in-time compiler.\n",
"@numba.jit(nopython=True)\n",
"def windowed_nunique_corrected(dates, pids, window):\n",
" r\"\"\"Track number of unique persons in window,\n",
" reading through arrays only once.\n",
"\n",
" Args:\n",
" dates (numpy.ndarray): Array of dates as number of days since epoch.\n",
" pids (numpy.ndarray): Array of integer person identifiers.\n",
" Required: min(pids) >= 0\n",
" window (int): Width of window in units of difference of `dates`.\n",
" Required: window >= 1\n",
"\n",
" Returns:\n",
" ucts (numpy.ndarray): Array of unique counts.\n",
"\n",
" Raises:\n",
" AssertionError: Raised if not...\n",
" * len(dates) == len(pids)\n",
" * min(pids) >= 0\n",
" * window >= 1\n",
"\n",
" Notes:\n",
" * Matches `pandas.core.window.Rolling`\n",
" with a time series alias offset.\n",
"\n",
" \"\"\"\n",
"\n",
" # Check arguments.\n",
" assert len(dates) == len(pids)\n",
" assert np.min(pids) >= 0\n",
" assert window >= 1\n",
"\n",
" # Initialize counters.\n",
" idx_min = 0\n",
" idx_max = dates.shape[0]\n",
" date_min = dates[idx_min]\n",
" pid_min = pids[idx_min]\n",
" pid_max = np.max(pids) + 1\n",
" pid_cts = np.zeros(pid_max, dtype=np.int64)\n",
" pid_cts[pid_min] = 1\n",
" uct = 1\n",
" ucts = np.zeros(idx_max, dtype=np.int64)\n",
" ucts[idx_min] = uct\n",
" idx = 1\n",
"\n",
" # For each (date, person)...\n",
" while idx < idx_max:\n",
"\n",
" # Lookup date, person.\n",
" date = dates[idx]\n",
" pid = pids[idx]\n",
"\n",
" # If person count went from 0 to 1, increment unique person count.\n",
" pid_cts[pid] += 1\n",
" if pid_cts[pid] == 1:\n",
" uct += 1\n",
"\n",
" # For past dates outside of window...\n",
" # Note: If window=3, it includes day0,day1,day2.\n",
" while (date - date_min + 1) > window:\n",
"\n",
" # If person count went from 1 to 0, decrement unique person count.\n",
" pid_cts[pid_min] -= 1\n",
" if pid_cts[pid_min] == 0:\n",
" uct -= 1\n",
" idx_min += 1\n",
" date_min = dates[idx_min]\n",
" pid_min = pids[idx_min]\n",
"\n",
" # Record unique person count.\n",
" ucts[idx] = uct\n",
" idx += 1\n",
"\n",
" return ucts"
]
},
{
"cell_type": "code",
"execution_count": 15,
"metadata": {},
"outputs": [],
"source": [
"# Cast dates to integers.\n",
"df['DateEpoch'] = (df['Date'] - pd.to_datetime('1970-01-01'))/pd.to_timedelta(1, unit='D')\n",
"df['DateEpoch'] = df['DateEpoch'].astype(int)"
]
},
{
"cell_type": "code",
"execution_count": 16,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"98.8 µs ± 41.3 µs per loop (mean ± std. dev. of 7 runs, 1 loop each)\n"
]
}
],
"source": [
"%%timeit\n",
"windowed_nunique_corrected(\n",
" dates=df['DateEpoch'].values,\n",
" pids=df['PersonId'].values,\n",
" window=window)"
]
},
{
"cell_type": "code",
"execution_count": 17,
"metadata": {},
"outputs": [],
"source": [
"# Check accuracy of results.\n",
"test = windowed_nunique_corrected(\n",
" dates=df['DateEpoch'].values,\n",
" pids=df['PersonId'].values,\n",
" window=window)\n",
"assert all(ref == test)"
]
}
],
"metadata": {
"kernelspec": {
"display_name": "Python 3",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.6.0"
}
},
"nbformat": 4,
"nbformat_minor": 2
}
@stharrold
Copy link
Author

Updated with solution.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment