Skip to content

Instantly share code, notes, and snippets.

@drorata
Created August 11, 2022 06:25
Show Gist options
  • Star 0 You must be signed in to star a gist
  • Fork 0 You must be signed in to fork a gist
  • Save drorata/bb63e87c9baa3d5bc1fc088d43bee8f3 to your computer and use it in GitHub Desktop.
Save drorata/bb63e87c9baa3d5bc1fc088d43bee8f3 to your computer and use it in GitHub Desktop.
Original notebook of avg-means+counts-with-groupby.md
Display the source blob
Display the rendered blob
Raw
{
"cells": [
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# Averages, sums and counts when grouping by"
]
},
{
"cell_type": "code",
"execution_count": 1,
"metadata": {
"collapsed": true
},
"outputs": [],
"source": [
"import pandas as pd\n",
"import numpy as np"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Assume we have a table where each row corresponds to a transaction and indexed by timestamp.\n",
"By re-sampling this table by `1H` frequency and counting the number of transaction per hour, we get a new aggregated view when each row corresponds to an hour and the value is the count of transaction happened within that hour.\n",
"\n",
"As an example, we start with the following table.\n",
"This demonstrates the result of the re-sampling mentioned above."
]
},
{
"cell_type": "code",
"execution_count": 2,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"<div>\n",
"<style>\n",
" .dataframe thead tr:only-child th {\n",
" text-align: right;\n",
" }\n",
"\n",
" .dataframe thead th {\n",
" text-align: left;\n",
" }\n",
"\n",
" .dataframe tbody tr th {\n",
" vertical-align: top;\n",
" }\n",
"</style>\n",
"<table border=\"1\" class=\"dataframe\">\n",
" <thead>\n",
" <tr style=\"text-align: right;\">\n",
" <th></th>\n",
" <th>count</th>\n",
" </tr>\n",
" </thead>\n",
" <tbody>\n",
" <tr>\n",
" <th>2017-01-01 00:00:00</th>\n",
" <td>4</td>\n",
" </tr>\n",
" <tr>\n",
" <th>2017-01-01 01:00:00</th>\n",
" <td>5</td>\n",
" </tr>\n",
" <tr>\n",
" <th>2017-01-01 02:00:00</th>\n",
" <td>3</td>\n",
" </tr>\n",
" <tr>\n",
" <th>2017-01-01 03:00:00</th>\n",
" <td>5</td>\n",
" </tr>\n",
" <tr>\n",
" <th>2017-01-01 04:00:00</th>\n",
" <td>5</td>\n",
" </tr>\n",
" </tbody>\n",
"</table>\n",
"</div>"
],
"text/plain": [
" count\n",
"2017-01-01 00:00:00 4\n",
"2017-01-01 01:00:00 5\n",
"2017-01-01 02:00:00 3\n",
"2017-01-01 03:00:00 5\n",
"2017-01-01 04:00:00 5"
]
},
"execution_count": 2,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"np.random.seed(42)\n",
"idx = pd.date_range('2017-01-01', '2017-01-14', freq='1H')\n",
"df = pd.DataFrame(np.random.choice([1,2,3,4,5,6], size=idx.shape[0]), index=idx, columns=['count'])\n",
"df.head()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"The objective is to measure behavior as depending on the day of the week (Mo., Tue. etc.)\n",
"First, let us count how many events happened during each weekday.\n",
"This can be achieved in couple of ways:"
]
},
{
"cell_type": "code",
"execution_count": 3,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"<div>\n",
"<style>\n",
" .dataframe thead tr:only-child th {\n",
" text-align: right;\n",
" }\n",
"\n",
" .dataframe thead th {\n",
" text-align: left;\n",
" }\n",
"\n",
" .dataframe tbody tr th {\n",
" vertical-align: top;\n",
" }\n",
"</style>\n",
"<table border=\"1\" class=\"dataframe\">\n",
" <thead>\n",
" <tr style=\"text-align: right;\">\n",
" <th></th>\n",
" <th>count</th>\n",
" </tr>\n",
" </thead>\n",
" <tbody>\n",
" <tr>\n",
" <th>0</th>\n",
" <td>161</td>\n",
" </tr>\n",
" <tr>\n",
" <th>1</th>\n",
" <td>170</td>\n",
" </tr>\n",
" <tr>\n",
" <th>2</th>\n",
" <td>164</td>\n",
" </tr>\n",
" <tr>\n",
" <th>3</th>\n",
" <td>133</td>\n",
" </tr>\n",
" <tr>\n",
" <th>4</th>\n",
" <td>169</td>\n",
" </tr>\n",
" <tr>\n",
" <th>5</th>\n",
" <td>98</td>\n",
" </tr>\n",
" <tr>\n",
" <th>6</th>\n",
" <td>172</td>\n",
" </tr>\n",
" </tbody>\n",
"</table>\n",
"</div>"
],
"text/plain": [
" count\n",
"0 161\n",
"1 170\n",
"2 164\n",
"3 133\n",
"4 169\n",
"5 98\n",
"6 172"
]
},
"execution_count": 3,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"df.pivot_table('count', index=df.index.dayofweek, aggfunc='sum')"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"or:"
]
},
{
"cell_type": "code",
"execution_count": 4,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"<div>\n",
"<style>\n",
" .dataframe thead tr:only-child th {\n",
" text-align: right;\n",
" }\n",
"\n",
" .dataframe thead th {\n",
" text-align: left;\n",
" }\n",
"\n",
" .dataframe tbody tr th {\n",
" vertical-align: top;\n",
" }\n",
"</style>\n",
"<table border=\"1\" class=\"dataframe\">\n",
" <thead>\n",
" <tr style=\"text-align: right;\">\n",
" <th></th>\n",
" <th>count</th>\n",
" </tr>\n",
" </thead>\n",
" <tbody>\n",
" <tr>\n",
" <th>0</th>\n",
" <td>161</td>\n",
" </tr>\n",
" <tr>\n",
" <th>1</th>\n",
" <td>170</td>\n",
" </tr>\n",
" <tr>\n",
" <th>2</th>\n",
" <td>164</td>\n",
" </tr>\n",
" <tr>\n",
" <th>3</th>\n",
" <td>133</td>\n",
" </tr>\n",
" <tr>\n",
" <th>4</th>\n",
" <td>169</td>\n",
" </tr>\n",
" <tr>\n",
" <th>5</th>\n",
" <td>98</td>\n",
" </tr>\n",
" <tr>\n",
" <th>6</th>\n",
" <td>172</td>\n",
" </tr>\n",
" </tbody>\n",
"</table>\n",
"</div>"
],
"text/plain": [
" count\n",
"0 161\n",
"1 170\n",
"2 164\n",
"3 133\n",
"4 169\n",
"5 98\n",
"6 172"
]
},
"execution_count": 4,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"df.groupby(df.index.dayofweek).sum()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Lastly, being more explicit (and less Pythonic):"
]
},
{
"cell_type": "code",
"execution_count": 5,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"<div>\n",
"<style>\n",
" .dataframe thead tr:only-child th {\n",
" text-align: right;\n",
" }\n",
"\n",
" .dataframe thead th {\n",
" text-align: left;\n",
" }\n",
"\n",
" .dataframe tbody tr th {\n",
" vertical-align: top;\n",
" }\n",
"</style>\n",
"<table border=\"1\" class=\"dataframe\">\n",
" <thead>\n",
" <tr style=\"text-align: right;\">\n",
" <th></th>\n",
" <th>count</th>\n",
" </tr>\n",
" </thead>\n",
" <tbody>\n",
" <tr>\n",
" <th>0</th>\n",
" <td>161</td>\n",
" </tr>\n",
" <tr>\n",
" <th>1</th>\n",
" <td>170</td>\n",
" </tr>\n",
" <tr>\n",
" <th>2</th>\n",
" <td>164</td>\n",
" </tr>\n",
" <tr>\n",
" <th>3</th>\n",
" <td>133</td>\n",
" </tr>\n",
" <tr>\n",
" <th>4</th>\n",
" <td>169</td>\n",
" </tr>\n",
" <tr>\n",
" <th>5</th>\n",
" <td>98</td>\n",
" </tr>\n",
" <tr>\n",
" <th>6</th>\n",
" <td>172</td>\n",
" </tr>\n",
" </tbody>\n",
"</table>\n",
"</div>"
],
"text/plain": [
" count\n",
"0 161\n",
"1 170\n",
"2 164\n",
"3 133\n",
"4 169\n",
"5 98\n",
"6 172"
]
},
"execution_count": 5,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"pd.DataFrame(\n",
" [\n",
" df[df.index.dayofweek==i].sum()[0] for i in range(7)\n",
" ], \n",
" columns=['count']\n",
")"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Next, we want to compute the *average* number of transaction per weekday.\n",
"First, let's do it explicitly, for Monday (day 0).\n",
"The total number of transactions is:"
]
},
{
"cell_type": "code",
"execution_count": 6,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"count 161\n",
"dtype: int64"
]
},
"execution_count": 6,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"df[df.index.dayofweek == 0].sum()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"In the dates range used, there are two Mondays, therefore, the average is:"
]
},
{
"cell_type": "code",
"execution_count": 7,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"2"
]
},
"execution_count": 7,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"((df.resample('1d').sum()).index.dayofweek == 0).sum()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Therefore, the average is:"
]
},
{
"cell_type": "code",
"execution_count": 8,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"count 80.5\n",
"dtype: float64"
]
},
"execution_count": 8,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"df[df.index.dayofweek == 0].sum() / ((df.resample('1d').sum()).index.dayofweek == 0).sum()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"One may think the following syntax is a quicker way:"
]
},
{
"cell_type": "code",
"execution_count": 9,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"<div>\n",
"<style>\n",
" .dataframe thead tr:only-child th {\n",
" text-align: right;\n",
" }\n",
"\n",
" .dataframe thead th {\n",
" text-align: left;\n",
" }\n",
"\n",
" .dataframe tbody tr th {\n",
" vertical-align: top;\n",
" }\n",
"</style>\n",
"<table border=\"1\" class=\"dataframe\">\n",
" <thead>\n",
" <tr style=\"text-align: right;\">\n",
" <th></th>\n",
" <th>count</th>\n",
" </tr>\n",
" </thead>\n",
" <tbody>\n",
" <tr>\n",
" <th>0</th>\n",
" <td>3.354167</td>\n",
" </tr>\n",
" <tr>\n",
" <th>1</th>\n",
" <td>3.541667</td>\n",
" </tr>\n",
" <tr>\n",
" <th>2</th>\n",
" <td>3.416667</td>\n",
" </tr>\n",
" <tr>\n",
" <th>3</th>\n",
" <td>2.770833</td>\n",
" </tr>\n",
" <tr>\n",
" <th>4</th>\n",
" <td>3.520833</td>\n",
" </tr>\n",
" <tr>\n",
" <th>5</th>\n",
" <td>3.920000</td>\n",
" </tr>\n",
" <tr>\n",
" <th>6</th>\n",
" <td>3.583333</td>\n",
" </tr>\n",
" </tbody>\n",
"</table>\n",
"</div>"
],
"text/plain": [
" count\n",
"0 3.354167\n",
"1 3.541667\n",
"2 3.416667\n",
"3 2.770833\n",
"4 3.520833\n",
"5 3.920000\n",
"6 3.583333"
]
},
"execution_count": 9,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"df.pivot_table('count', index=df.index.dayofweek, aggfunc='mean')"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"But this returns the wrong average. \n",
"To be more precise, it divides the number of transactions per day-of-week by the number of hours that occurred in this day-of-week in the time range, given by:"
]
},
{
"cell_type": "code",
"execution_count": 10,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"<div>\n",
"<style>\n",
" .dataframe thead tr:only-child th {\n",
" text-align: right;\n",
" }\n",
"\n",
" .dataframe thead th {\n",
" text-align: left;\n",
" }\n",
"\n",
" .dataframe tbody tr th {\n",
" vertical-align: top;\n",
" }\n",
"</style>\n",
"<table border=\"1\" class=\"dataframe\">\n",
" <thead>\n",
" <tr style=\"text-align: right;\">\n",
" <th></th>\n",
" <th>count</th>\n",
" </tr>\n",
" </thead>\n",
" <tbody>\n",
" <tr>\n",
" <th>0</th>\n",
" <td>48</td>\n",
" </tr>\n",
" <tr>\n",
" <th>1</th>\n",
" <td>48</td>\n",
" </tr>\n",
" <tr>\n",
" <th>2</th>\n",
" <td>48</td>\n",
" </tr>\n",
" <tr>\n",
" <th>3</th>\n",
" <td>48</td>\n",
" </tr>\n",
" <tr>\n",
" <th>4</th>\n",
" <td>48</td>\n",
" </tr>\n",
" <tr>\n",
" <th>5</th>\n",
" <td>25</td>\n",
" </tr>\n",
" <tr>\n",
" <th>6</th>\n",
" <td>48</td>\n",
" </tr>\n",
" </tbody>\n",
"</table>\n",
"</div>"
],
"text/plain": [
" count\n",
"0 48\n",
"1 48\n",
"2 48\n",
"3 48\n",
"4 48\n",
"5 25\n",
"6 48"
]
},
"execution_count": 10,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"df.groupby(df.index.dayofweek).count()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Or, put together:"
]
},
{
"cell_type": "code",
"execution_count": 11,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"<div>\n",
"<style>\n",
" .dataframe thead tr:only-child th {\n",
" text-align: right;\n",
" }\n",
"\n",
" .dataframe thead th {\n",
" text-align: left;\n",
" }\n",
"\n",
" .dataframe tbody tr th {\n",
" vertical-align: top;\n",
" }\n",
"</style>\n",
"<table border=\"1\" class=\"dataframe\">\n",
" <thead>\n",
" <tr style=\"text-align: right;\">\n",
" <th></th>\n",
" <th>count</th>\n",
" </tr>\n",
" </thead>\n",
" <tbody>\n",
" <tr>\n",
" <th>0</th>\n",
" <td>3.354167</td>\n",
" </tr>\n",
" <tr>\n",
" <th>1</th>\n",
" <td>3.541667</td>\n",
" </tr>\n",
" <tr>\n",
" <th>2</th>\n",
" <td>3.416667</td>\n",
" </tr>\n",
" <tr>\n",
" <th>3</th>\n",
" <td>2.770833</td>\n",
" </tr>\n",
" <tr>\n",
" <th>4</th>\n",
" <td>3.520833</td>\n",
" </tr>\n",
" <tr>\n",
" <th>5</th>\n",
" <td>3.920000</td>\n",
" </tr>\n",
" <tr>\n",
" <th>6</th>\n",
" <td>3.583333</td>\n",
" </tr>\n",
" </tbody>\n",
"</table>\n",
"</div>"
],
"text/plain": [
" count\n",
"0 3.354167\n",
"1 3.541667\n",
"2 3.416667\n",
"3 2.770833\n",
"4 3.520833\n",
"5 3.920000\n",
"6 3.583333"
]
},
"execution_count": 11,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"df.groupby(df.index.dayofweek).sum() / df.groupby(df.index.dayofweek).count()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"A way around it, would be:"
]
},
{
"cell_type": "code",
"execution_count": 12,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"<div>\n",
"<style>\n",
" .dataframe thead tr:only-child th {\n",
" text-align: right;\n",
" }\n",
"\n",
" .dataframe thead th {\n",
" text-align: left;\n",
" }\n",
"\n",
" .dataframe tbody tr th {\n",
" vertical-align: top;\n",
" }\n",
"</style>\n",
"<table border=\"1\" class=\"dataframe\">\n",
" <thead>\n",
" <tr style=\"text-align: right;\">\n",
" <th></th>\n",
" <th>count</th>\n",
" </tr>\n",
" </thead>\n",
" <tbody>\n",
" <tr>\n",
" <th>0</th>\n",
" <td>80.5</td>\n",
" </tr>\n",
" <tr>\n",
" <th>1</th>\n",
" <td>85.0</td>\n",
" </tr>\n",
" <tr>\n",
" <th>2</th>\n",
" <td>82.0</td>\n",
" </tr>\n",
" <tr>\n",
" <th>3</th>\n",
" <td>66.5</td>\n",
" </tr>\n",
" <tr>\n",
" <th>4</th>\n",
" <td>84.5</td>\n",
" </tr>\n",
" <tr>\n",
" <th>5</th>\n",
" <td>49.0</td>\n",
" </tr>\n",
" <tr>\n",
" <th>6</th>\n",
" <td>86.0</td>\n",
" </tr>\n",
" </tbody>\n",
"</table>\n",
"</div>"
],
"text/plain": [
" count\n",
"0 80.5\n",
"1 85.0\n",
"2 82.0\n",
"3 66.5\n",
"4 84.5\n",
"5 49.0\n",
"6 86.0"
]
},
"execution_count": 12,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"count_per_day_df = df.resample('1d').sum()\n",
"count_per_day_df.groupby(count_per_day_df.index.dayofweek).mean()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"I mentioned this maybe counter intuitive behavior on [SO](https://stackoverflow.com/q/45922291/671013)."
]
}
],
"metadata": {
"kernelspec": {
"display_name": "Python [conda root]",
"language": "python",
"name": "conda-root-py"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.6.1"
}
},
"nbformat": 4,
"nbformat_minor": 2
}
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment