Skip to content

Instantly share code, notes, and snippets.

@tubone24
Created August 31, 2020 16:51
Show Gist options
  • Save tubone24/2a8356d58e80efa086f17c394e2e5f1f to your computer and use it in GitHub Desktop.
Save tubone24/2a8356d58e80efa086f17c394e2e5f1f to your computer and use it in GitHub Desktop.
groupby_test.ipynb
Display the source blob
Display the rendered blob
Raw
{
"cells": [
{
"metadata": {
"ExecuteTime": {
"end_time": "2020-08-31T16:47:03.923696Z",
"start_time": "2020-08-31T16:47:03.394154Z"
},
"trusted": true
},
"cell_type": "code",
"source": "import pandas as pd\nimport numpy as np",
"execution_count": 1,
"outputs": []
},
{
"metadata": {},
"cell_type": "markdown",
"source": "# Dataframe groupby検証\n\n## 概要\n\nDataframeのgroupbyの挙動がいまいちわからなかったのでまとめてみた。\n\n## 結論\n\n### applyの挙動\n- apply実行時はgroup_key分(のはずだがなぜか+1)回る(pandasコード読んで要検証)\n- as_index=Falseはapply後に出力されるdfで適用されるため、apply実行時には評価されない\n- as_index=Falseにしたからと言って何らかのエラーを回避するものではない。\n - set_indexしてないdfに対しては、as_index=Trueの場合は、groupbyのキーがindexに。\n - set_indexしてないdfに対しては、as_index=Falseの場合は、groupbyのキーに対応するRangeIndexが付与。\n - set_indexしている場合は、indexがgroupbyのキーと同一のとき、as_index=Falseでindex自体が消えて、対応するRangeIndexが付与される\n- applyで渡す関数の戻り値がスカラー、Series、DataFrameいずれでも試してみたが、差はなかった。\n- Groupbyでapplyした場合はSeriesが返ることが知られているが、groupby、apply関数内でのset_indexがindexの形で付与されることがわかった。\n \n### aggの検証\n- groupby.aggの場合はSeriesではなくDFが返る\n- 複数のgroupbyキーを用いるなど、複数のindexが貼られるケースは複合indexを持ったSeriesが返却される\n - indexが設定されていないDFに対しては、as_index=Falseするとgroupbyのキーはカラムとして連携される\n - ただし、indexとgroupbyが同カラムの場合は問答無用でas_index=Falseするとキーが消えちゃうので注意\n \n### observedの検証\n\n#### カテゴリー型\n- カテゴリー型というR言語のFactor型のような性質を持ったデータ型が存在する。\n - 宣言はseriesやdfに対して、astype='category' か、pd.categoricalで可能\n - カテゴリ型は、データとカテゴリ(label)を持つ\n- 統計で言うところの名義尺度や順序尺度を表す。\n - enumのような使い方ができるかと思ったが、そうでもないらしい...\n - 名義尺度を示すときはpd.categorical(orderd=False)、順序尺度は orderd=Trueで表現\n- カテゴリに存在しないデータは、宣言時np.NaNになる\n - 機械学習の特定のラベリングデータを抽出するとき、カテゴリーに取得したいカテゴリーだけ宣言して、np.NaNにしてdropnaとかするのかしら?\n- observedはカテゴリー型のcolumnsやindexに対してgroupbyしたときの挙動を示す\n\n#### observed=True\n- カテゴりー型で宣言したDFのSeriesに対してGroupbyをかけると、カテゴリー分、groupingされるが、observedをすると、観測値(データとして存在しているもののみ)出力される\n"
},
{
"metadata": {
"ExecuteTime": {
"end_time": "2020-08-31T16:47:03.949269Z",
"start_time": "2020-08-31T16:47:03.925755Z"
},
"trusted": true
},
"cell_type": "code",
"source": "# DF定義\ndf = pd.DataFrame({\n 'city': ['osaka', 'osaka', 'osaka', 'osaka', 'tokyo', 'tokyo', 'tokyo'],\n 'food': ['apple', 'orange', 'banana', 'banana', 'apple', 'apple', 'banana'],\n 'price': [100, 200, 250, 300, 150, 200, 400],\n 'quantity': [1, 2, 3, 4, 5, 6, 7]\n})\ndf",
"execution_count": 2,
"outputs": [
{
"data": {
"text/html": "<div>\n<style scoped>\n .dataframe tbody tr th:only-of-type {\n vertical-align: middle;\n }\n\n .dataframe tbody tr th {\n vertical-align: top;\n }\n\n .dataframe thead th {\n text-align: right;\n }\n</style>\n<table border=\"1\" class=\"dataframe\">\n <thead>\n <tr style=\"text-align: right;\">\n <th></th>\n <th>city</th>\n <th>food</th>\n <th>price</th>\n <th>quantity</th>\n </tr>\n </thead>\n <tbody>\n <tr>\n <th>0</th>\n <td>osaka</td>\n <td>apple</td>\n <td>100</td>\n <td>1</td>\n </tr>\n <tr>\n <th>1</th>\n <td>osaka</td>\n <td>orange</td>\n <td>200</td>\n <td>2</td>\n </tr>\n <tr>\n <th>2</th>\n <td>osaka</td>\n <td>banana</td>\n <td>250</td>\n <td>3</td>\n </tr>\n <tr>\n <th>3</th>\n <td>osaka</td>\n <td>banana</td>\n <td>300</td>\n <td>4</td>\n </tr>\n <tr>\n <th>4</th>\n <td>tokyo</td>\n <td>apple</td>\n <td>150</td>\n <td>5</td>\n </tr>\n <tr>\n <th>5</th>\n <td>tokyo</td>\n <td>apple</td>\n <td>200</td>\n <td>6</td>\n </tr>\n <tr>\n <th>6</th>\n <td>tokyo</td>\n <td>banana</td>\n <td>400</td>\n <td>7</td>\n </tr>\n </tbody>\n</table>\n</div>",
"text/plain": " city food price quantity\n0 osaka apple 100 1\n1 osaka orange 200 2\n2 osaka banana 250 3\n3 osaka banana 300 4\n4 tokyo apple 150 5\n5 tokyo apple 200 6\n6 tokyo banana 400 7"
},
"execution_count": 2,
"metadata": {},
"output_type": "execute_result"
}
]
},
{
"metadata": {
"ExecuteTime": {
"end_time": "2020-08-31T16:47:03.960378Z",
"start_time": "2020-08-31T16:47:03.951651Z"
},
"trusted": true
},
"cell_type": "code",
"source": "# set_index('city')\ndf_city_index = df.set_index('city')\ndf_city_index",
"execution_count": 3,
"outputs": [
{
"data": {
"text/html": "<div>\n<style scoped>\n .dataframe tbody tr th:only-of-type {\n vertical-align: middle;\n }\n\n .dataframe tbody tr th {\n vertical-align: top;\n }\n\n .dataframe thead th {\n text-align: right;\n }\n</style>\n<table border=\"1\" class=\"dataframe\">\n <thead>\n <tr style=\"text-align: right;\">\n <th></th>\n <th>food</th>\n <th>price</th>\n <th>quantity</th>\n </tr>\n <tr>\n <th>city</th>\n <th></th>\n <th></th>\n <th></th>\n </tr>\n </thead>\n <tbody>\n <tr>\n <th>osaka</th>\n <td>apple</td>\n <td>100</td>\n <td>1</td>\n </tr>\n <tr>\n <th>osaka</th>\n <td>orange</td>\n <td>200</td>\n <td>2</td>\n </tr>\n <tr>\n <th>osaka</th>\n <td>banana</td>\n <td>250</td>\n <td>3</td>\n </tr>\n <tr>\n <th>osaka</th>\n <td>banana</td>\n <td>300</td>\n <td>4</td>\n </tr>\n <tr>\n <th>tokyo</th>\n <td>apple</td>\n <td>150</td>\n <td>5</td>\n </tr>\n <tr>\n <th>tokyo</th>\n <td>apple</td>\n <td>200</td>\n <td>6</td>\n </tr>\n <tr>\n <th>tokyo</th>\n <td>banana</td>\n <td>400</td>\n <td>7</td>\n </tr>\n </tbody>\n</table>\n</div>",
"text/plain": " food price quantity\ncity \nosaka apple 100 1\nosaka orange 200 2\nosaka banana 250 3\nosaka banana 300 4\ntokyo apple 150 5\ntokyo apple 200 6\ntokyo banana 400 7"
},
"execution_count": 3,
"metadata": {},
"output_type": "execute_result"
}
]
},
{
"metadata": {
"ExecuteTime": {
"end_time": "2020-08-31T16:47:03.969361Z",
"start_time": "2020-08-31T16:47:03.962107Z"
},
"trusted": true
},
"cell_type": "code",
"source": "# set_index('food')\ndf_food_index = df.set_index('food')\ndf_food_index",
"execution_count": 4,
"outputs": [
{
"data": {
"text/html": "<div>\n<style scoped>\n .dataframe tbody tr th:only-of-type {\n vertical-align: middle;\n }\n\n .dataframe tbody tr th {\n vertical-align: top;\n }\n\n .dataframe thead th {\n text-align: right;\n }\n</style>\n<table border=\"1\" class=\"dataframe\">\n <thead>\n <tr style=\"text-align: right;\">\n <th></th>\n <th>city</th>\n <th>price</th>\n <th>quantity</th>\n </tr>\n <tr>\n <th>food</th>\n <th></th>\n <th></th>\n <th></th>\n </tr>\n </thead>\n <tbody>\n <tr>\n <th>apple</th>\n <td>osaka</td>\n <td>100</td>\n <td>1</td>\n </tr>\n <tr>\n <th>orange</th>\n <td>osaka</td>\n <td>200</td>\n <td>2</td>\n </tr>\n <tr>\n <th>banana</th>\n <td>osaka</td>\n <td>250</td>\n <td>3</td>\n </tr>\n <tr>\n <th>banana</th>\n <td>osaka</td>\n <td>300</td>\n <td>4</td>\n </tr>\n <tr>\n <th>apple</th>\n <td>tokyo</td>\n <td>150</td>\n <td>5</td>\n </tr>\n <tr>\n <th>apple</th>\n <td>tokyo</td>\n <td>200</td>\n <td>6</td>\n </tr>\n <tr>\n <th>banana</th>\n <td>tokyo</td>\n <td>400</td>\n <td>7</td>\n </tr>\n </tbody>\n</table>\n</div>",
"text/plain": " city price quantity\nfood \napple osaka 100 1\norange osaka 200 2\nbanana osaka 250 3\nbanana osaka 300 4\napple tokyo 150 5\napple tokyo 200 6\nbanana tokyo 400 7"
},
"execution_count": 4,
"metadata": {},
"output_type": "execute_result"
}
]
},
{
"metadata": {
"ExecuteTime": {
"end_time": "2020-08-31T16:47:03.978825Z",
"start_time": "2020-08-31T16:47:03.971007Z"
},
"trusted": true
},
"cell_type": "code",
"source": "# set_index: マルチインデックス\ndf_multi_index = df.set_index(['city', 'food'])\ndf_multi_index",
"execution_count": 5,
"outputs": [
{
"data": {
"text/html": "<div>\n<style scoped>\n .dataframe tbody tr th:only-of-type {\n vertical-align: middle;\n }\n\n .dataframe tbody tr th {\n vertical-align: top;\n }\n\n .dataframe thead th {\n text-align: right;\n }\n</style>\n<table border=\"1\" class=\"dataframe\">\n <thead>\n <tr style=\"text-align: right;\">\n <th></th>\n <th></th>\n <th>price</th>\n <th>quantity</th>\n </tr>\n <tr>\n <th>city</th>\n <th>food</th>\n <th></th>\n <th></th>\n </tr>\n </thead>\n <tbody>\n <tr>\n <th rowspan=\"4\" valign=\"top\">osaka</th>\n <th>apple</th>\n <td>100</td>\n <td>1</td>\n </tr>\n <tr>\n <th>orange</th>\n <td>200</td>\n <td>2</td>\n </tr>\n <tr>\n <th>banana</th>\n <td>250</td>\n <td>3</td>\n </tr>\n <tr>\n <th>banana</th>\n <td>300</td>\n <td>4</td>\n </tr>\n <tr>\n <th rowspan=\"3\" valign=\"top\">tokyo</th>\n <th>apple</th>\n <td>150</td>\n <td>5</td>\n </tr>\n <tr>\n <th>apple</th>\n <td>200</td>\n <td>6</td>\n </tr>\n <tr>\n <th>banana</th>\n <td>400</td>\n <td>7</td>\n </tr>\n </tbody>\n</table>\n</div>",
"text/plain": " price quantity\ncity food \nosaka apple 100 1\n orange 200 2\n banana 250 3\n banana 300 4\ntokyo apple 150 5\n apple 200 6\n banana 400 7"
},
"execution_count": 5,
"metadata": {},
"output_type": "execute_result"
}
]
},
{
"metadata": {
"ExecuteTime": {
"end_time": "2020-08-31T16:47:03.984959Z",
"start_time": "2020-08-31T16:47:03.981377Z"
},
"trusted": true
},
"cell_type": "code",
"source": "# 各dfのcolumns\nprint(df.columns)\nprint(df_city_index.columns)\nprint(df_food_index.columns)\nprint(df_multi_index.columns)",
"execution_count": 6,
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": "Index(['city', 'food', 'price', 'quantity'], dtype='object')\nIndex(['food', 'price', 'quantity'], dtype='object')\nIndex(['city', 'price', 'quantity'], dtype='object')\nIndex(['price', 'quantity'], dtype='object')\n"
}
]
},
{
"metadata": {
"ExecuteTime": {
"end_time": "2020-08-31T16:47:03.991575Z",
"start_time": "2020-08-31T16:47:03.987812Z"
},
"trusted": true
},
"cell_type": "code",
"source": "# 各dfのindex\nprint(df.index)\nprint(df_city_index.index)\nprint(df_food_index.index)\nprint(df_multi_index.index)",
"execution_count": 7,
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": "RangeIndex(start=0, stop=7, step=1)\nIndex(['osaka', 'osaka', 'osaka', 'osaka', 'tokyo', 'tokyo', 'tokyo'], dtype='object', name='city')\nIndex(['apple', 'orange', 'banana', 'banana', 'apple', 'apple', 'banana'], dtype='object', name='food')\nMultiIndex(levels=[['osaka', 'tokyo'], ['apple', 'banana', 'orange']],\n codes=[[0, 0, 0, 0, 1, 1, 1], [0, 2, 1, 1, 0, 0, 1]],\n names=['city', 'food'])\n"
}
]
},
{
"metadata": {
"ExecuteTime": {
"end_time": "2020-08-31T16:47:03.996785Z",
"start_time": "2020-08-31T16:47:03.993007Z"
},
"trusted": true
},
"cell_type": "code",
"source": "# df.groupbyしたときの型 pandas.core.groupby.generic.DataFrameGroupBy\n# cityでgroupbyした状態では、group_nameでまとめられたRangeIndexがosaka, tokyoでそれぞれ**レコード分**作成される\nprint(type(df.groupby('city')))\nprint(df.groupby('city').groups)",
"execution_count": 8,
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": "<class 'pandas.core.groupby.generic.DataFrameGroupBy'>\n{'osaka': Int64Index([0, 1, 2, 3], dtype='int64'), 'tokyo': Int64Index([4, 5, 6], dtype='int64')}\n"
}
]
},
{
"metadata": {
"ExecuteTime": {
"end_time": "2020-08-31T16:47:04.002589Z",
"start_time": "2020-08-31T16:47:03.998219Z"
},
"trusted": true
},
"cell_type": "code",
"source": "# group byのキーと同columsをset_indexしたときの型は変化せず\n# group byのキーと同columsをset_indexした状態では、group_nameでまとめられた各index(city)がosaka, tokyoでそれぞれ**レコード分**作成される\nprint(type(df_city_index.groupby('city')))\nprint(df_city_index.groupby('city').groups)",
"execution_count": 9,
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": "<class 'pandas.core.groupby.generic.DataFrameGroupBy'>\n{'osaka': Index(['osaka', 'osaka', 'osaka', 'osaka'], dtype='object', name='city'), 'tokyo': Index(['tokyo', 'tokyo', 'tokyo'], dtype='object', name='city')}\n"
}
]
},
{
"metadata": {
"ExecuteTime": {
"end_time": "2020-08-31T16:47:04.009092Z",
"start_time": "2020-08-31T16:47:04.004339Z"
},
"trusted": true
},
"cell_type": "code",
"source": "# group byのキーと別columsをset_indexしたときの型は変化せず\n# group byのキーと別columsをset_indexした状態では、group_nameでまとめられた各index(food)がosaka, tokyoでそれぞれ**レコード分**作成される\n# 特にエラーはない。\nprint(type(df_food_index.groupby('city')))\nprint(df_food_index.groupby('city').groups)",
"execution_count": 10,
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": "<class 'pandas.core.groupby.generic.DataFrameGroupBy'>\n{'osaka': Index(['apple', 'orange', 'banana', 'banana'], dtype='object', name='food'), 'tokyo': Index(['apple', 'apple', 'banana'], dtype='object', name='food')}\n"
}
]
},
{
"metadata": {
"ExecuteTime": {
"end_time": "2020-08-31T16:47:04.020142Z",
"start_time": "2020-08-31T16:47:04.011588Z"
},
"trusted": true
},
"cell_type": "code",
"source": "# マルチインデックスでset_indexしたときのgroupby型は変化せず\n# group_nameでまとめられた各multi_index(city, food)がosaka, tokyoでそれぞれ**レコード分**作成される\n# 特にエラーはない。\nprint(type(df_multi_index.groupby('city')))\nprint(df_multi_index.groupby('city').groups)",
"execution_count": 11,
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": "<class 'pandas.core.groupby.generic.DataFrameGroupBy'>\n{'osaka': MultiIndex(levels=[['osaka', 'tokyo'], ['apple', 'banana', 'orange']],\n codes=[[0, 0, 0, 0], [0, 2, 1, 1]],\n names=['city', 'food']), 'tokyo': MultiIndex(levels=[['osaka', 'tokyo'], ['apple', 'banana', 'orange']],\n codes=[[1, 1, 1], [0, 0, 1]],\n names=['city', 'food'])}\n"
}
]
},
{
"metadata": {
"ExecuteTime": {
"end_time": "2020-08-31T16:47:04.036496Z",
"start_time": "2020-08-31T16:47:04.022485Z"
},
"trusted": true
},
"cell_type": "code",
"source": "# as_index=Falseした場合でもDataFrameGroupByのままでは特に変化がない\n# 当然エラーもでない\nprint(type(df.groupby('city', as_index=False)))\nprint(df.groupby('city', as_index=False).groups)\nprint(type(df_city_index.groupby('city', as_index=False)))\nprint(df_city_index.groupby('city', as_index=False).groups)\nprint(type(df_food_index.groupby('city', as_index=False)))\nprint(df_food_index.groupby('city', as_index=False).groups)\nprint(type(df_multi_index.groupby('city', as_index=False)))\nprint(df_multi_index.groupby('city', as_index=False).groups)",
"execution_count": 12,
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": "<class 'pandas.core.groupby.generic.DataFrameGroupBy'>\n{'osaka': Int64Index([0, 1, 2, 3], dtype='int64'), 'tokyo': Int64Index([4, 5, 6], dtype='int64')}\n<class 'pandas.core.groupby.generic.DataFrameGroupBy'>\n{'osaka': Index(['osaka', 'osaka', 'osaka', 'osaka'], dtype='object', name='city'), 'tokyo': Index(['tokyo', 'tokyo', 'tokyo'], dtype='object', name='city')}\n<class 'pandas.core.groupby.generic.DataFrameGroupBy'>\n{'osaka': Index(['apple', 'orange', 'banana', 'banana'], dtype='object', name='food'), 'tokyo': Index(['apple', 'apple', 'banana'], dtype='object', name='food')}\n<class 'pandas.core.groupby.generic.DataFrameGroupBy'>\n{'osaka': MultiIndex(levels=[['osaka', 'tokyo'], ['apple', 'banana', 'orange']],\n codes=[[0, 0, 0, 0], [0, 2, 1, 1]],\n names=['city', 'food']), 'tokyo': MultiIndex(levels=[['osaka', 'tokyo'], ['apple', 'banana', 'orange']],\n codes=[[1, 1, 1], [0, 0, 1]],\n names=['city', 'food'])}\n"
}
]
},
{
"metadata": {
"ExecuteTime": {
"end_time": "2020-08-31T16:47:04.052924Z",
"start_time": "2020-08-31T16:47:04.039283Z"
},
"trusted": true
},
"cell_type": "code",
"source": "# meanなどのaggrigation処理に書けたときに、as_indexは機能するらしい\n# set_indexがないdfではgroupbyのキーがindexになる。\ndf.groupby('city').mean()",
"execution_count": 13,
"outputs": [
{
"data": {
"text/html": "<div>\n<style scoped>\n .dataframe tbody tr th:only-of-type {\n vertical-align: middle;\n }\n\n .dataframe tbody tr th {\n vertical-align: top;\n }\n\n .dataframe thead th {\n text-align: right;\n }\n</style>\n<table border=\"1\" class=\"dataframe\">\n <thead>\n <tr style=\"text-align: right;\">\n <th></th>\n <th>price</th>\n <th>quantity</th>\n </tr>\n <tr>\n <th>city</th>\n <th></th>\n <th></th>\n </tr>\n </thead>\n <tbody>\n <tr>\n <th>osaka</th>\n <td>212.5</td>\n <td>2.5</td>\n </tr>\n <tr>\n <th>tokyo</th>\n <td>250.0</td>\n <td>6.0</td>\n </tr>\n </tbody>\n</table>\n</div>",
"text/plain": " price quantity\ncity \nosaka 212.5 2.5\ntokyo 250.0 6.0"
},
"execution_count": 13,
"metadata": {},
"output_type": "execute_result"
}
]
},
{
"metadata": {
"ExecuteTime": {
"end_time": "2020-08-31T16:47:04.067155Z",
"start_time": "2020-08-31T16:47:04.055659Z"
},
"trusted": true
},
"cell_type": "code",
"source": "# meanなどのaggrigation処理に書けたときに、as_indexは機能するらしい\n# set_indexがないdfでas_index=Falseした場合、はRangeIndexが追加で付与される\ndf.groupby('city', as_index=False).mean()",
"execution_count": 14,
"outputs": [
{
"data": {
"text/html": "<div>\n<style scoped>\n .dataframe tbody tr th:only-of-type {\n vertical-align: middle;\n }\n\n .dataframe tbody tr th {\n vertical-align: top;\n }\n\n .dataframe thead th {\n text-align: right;\n }\n</style>\n<table border=\"1\" class=\"dataframe\">\n <thead>\n <tr style=\"text-align: right;\">\n <th></th>\n <th>city</th>\n <th>price</th>\n <th>quantity</th>\n </tr>\n </thead>\n <tbody>\n <tr>\n <th>0</th>\n <td>osaka</td>\n <td>212.5</td>\n <td>2.5</td>\n </tr>\n <tr>\n <th>1</th>\n <td>tokyo</td>\n <td>250.0</td>\n <td>6.0</td>\n </tr>\n </tbody>\n</table>\n</div>",
"text/plain": " city price quantity\n0 osaka 212.5 2.5\n1 tokyo 250.0 6.0"
},
"execution_count": 14,
"metadata": {},
"output_type": "execute_result"
}
]
},
{
"metadata": {
"ExecuteTime": {
"end_time": "2020-08-31T16:47:04.075858Z",
"start_time": "2020-08-31T16:47:04.069129Z"
},
"trusted": true
},
"cell_type": "code",
"source": "# set_index済みのdfに対して、aggrigationした場合、セットしたindexがそのまま表現される\ndf_city_index.groupby('city').mean()",
"execution_count": 15,
"outputs": [
{
"data": {
"text/html": "<div>\n<style scoped>\n .dataframe tbody tr th:only-of-type {\n vertical-align: middle;\n }\n\n .dataframe tbody tr th {\n vertical-align: top;\n }\n\n .dataframe thead th {\n text-align: right;\n }\n</style>\n<table border=\"1\" class=\"dataframe\">\n <thead>\n <tr style=\"text-align: right;\">\n <th></th>\n <th>price</th>\n <th>quantity</th>\n </tr>\n <tr>\n <th>city</th>\n <th></th>\n <th></th>\n </tr>\n </thead>\n <tbody>\n <tr>\n <th>osaka</th>\n <td>212.5</td>\n <td>2.5</td>\n </tr>\n <tr>\n <th>tokyo</th>\n <td>250.0</td>\n <td>6.0</td>\n </tr>\n </tbody>\n</table>\n</div>",
"text/plain": " price quantity\ncity \nosaka 212.5 2.5\ntokyo 250.0 6.0"
},
"execution_count": 15,
"metadata": {},
"output_type": "execute_result"
}
]
},
{
"metadata": {
"ExecuteTime": {
"end_time": "2020-08-31T16:47:04.084765Z",
"start_time": "2020-08-31T16:47:04.077927Z"
},
"trusted": true
},
"cell_type": "code",
"source": "# set_index済みのdfに対して、as_index=Falseしてaggrigationした場合、index自体がいなくなるので、\n# あとでreset_indexしたときに、あれれってなると思う。\ndf_city_index.groupby('city', as_index=False).mean()",
"execution_count": 16,
"outputs": [
{
"data": {
"text/html": "<div>\n<style scoped>\n .dataframe tbody tr th:only-of-type {\n vertical-align: middle;\n }\n\n .dataframe tbody tr th {\n vertical-align: top;\n }\n\n .dataframe thead th {\n text-align: right;\n }\n</style>\n<table border=\"1\" class=\"dataframe\">\n <thead>\n <tr style=\"text-align: right;\">\n <th></th>\n <th>price</th>\n <th>quantity</th>\n </tr>\n </thead>\n <tbody>\n <tr>\n <th>0</th>\n <td>212.5</td>\n <td>2.5</td>\n </tr>\n <tr>\n <th>1</th>\n <td>250.0</td>\n <td>6.0</td>\n </tr>\n </tbody>\n</table>\n</div>",
"text/plain": " price quantity\n0 212.5 2.5\n1 250.0 6.0"
},
"execution_count": 16,
"metadata": {},
"output_type": "execute_result"
}
]
},
{
"metadata": {
"ExecuteTime": {
"end_time": "2020-08-31T16:47:04.095315Z",
"start_time": "2020-08-31T16:47:04.086788Z"
},
"trusted": true
},
"cell_type": "code",
"source": "# set_indexのキーとgroupbyのキーが不一致の場合は、indexなしと同じ挙動\ndf_food_index.groupby('city').mean()",
"execution_count": 17,
"outputs": [
{
"data": {
"text/html": "<div>\n<style scoped>\n .dataframe tbody tr th:only-of-type {\n vertical-align: middle;\n }\n\n .dataframe tbody tr th {\n vertical-align: top;\n }\n\n .dataframe thead th {\n text-align: right;\n }\n</style>\n<table border=\"1\" class=\"dataframe\">\n <thead>\n <tr style=\"text-align: right;\">\n <th></th>\n <th>price</th>\n <th>quantity</th>\n </tr>\n <tr>\n <th>city</th>\n <th></th>\n <th></th>\n </tr>\n </thead>\n <tbody>\n <tr>\n <th>osaka</th>\n <td>212.5</td>\n <td>2.5</td>\n </tr>\n <tr>\n <th>tokyo</th>\n <td>250.0</td>\n <td>6.0</td>\n </tr>\n </tbody>\n</table>\n</div>",
"text/plain": " price quantity\ncity \nosaka 212.5 2.5\ntokyo 250.0 6.0"
},
"execution_count": 17,
"metadata": {},
"output_type": "execute_result"
}
]
},
{
"metadata": {
"ExecuteTime": {
"end_time": "2020-08-31T16:47:04.110305Z",
"start_time": "2020-08-31T16:47:04.101507Z"
},
"trusted": true
},
"cell_type": "code",
"source": "df_food_index.groupby('city', as_index=False).mean()",
"execution_count": 18,
"outputs": [
{
"data": {
"text/html": "<div>\n<style scoped>\n .dataframe tbody tr th:only-of-type {\n vertical-align: middle;\n }\n\n .dataframe tbody tr th {\n vertical-align: top;\n }\n\n .dataframe thead th {\n text-align: right;\n }\n</style>\n<table border=\"1\" class=\"dataframe\">\n <thead>\n <tr style=\"text-align: right;\">\n <th></th>\n <th>city</th>\n <th>price</th>\n <th>quantity</th>\n </tr>\n </thead>\n <tbody>\n <tr>\n <th>0</th>\n <td>osaka</td>\n <td>212.5</td>\n <td>2.5</td>\n </tr>\n <tr>\n <th>1</th>\n <td>tokyo</td>\n <td>250.0</td>\n <td>6.0</td>\n </tr>\n </tbody>\n</table>\n</div>",
"text/plain": " city price quantity\n0 osaka 212.5 2.5\n1 tokyo 250.0 6.0"
},
"execution_count": 18,
"metadata": {},
"output_type": "execute_result"
}
]
},
{
"metadata": {
"ExecuteTime": {
"end_time": "2020-08-31T16:47:04.126243Z",
"start_time": "2020-08-31T16:47:04.116158Z"
},
"trusted": true
},
"cell_type": "code",
"source": "# マルチインデックスの場合も挙動変化なし\ndf_multi_index.groupby('city').mean()",
"execution_count": 19,
"outputs": [
{
"data": {
"text/html": "<div>\n<style scoped>\n .dataframe tbody tr th:only-of-type {\n vertical-align: middle;\n }\n\n .dataframe tbody tr th {\n vertical-align: top;\n }\n\n .dataframe thead th {\n text-align: right;\n }\n</style>\n<table border=\"1\" class=\"dataframe\">\n <thead>\n <tr style=\"text-align: right;\">\n <th></th>\n <th>price</th>\n <th>quantity</th>\n </tr>\n <tr>\n <th>city</th>\n <th></th>\n <th></th>\n </tr>\n </thead>\n <tbody>\n <tr>\n <th>osaka</th>\n <td>212.5</td>\n <td>2.5</td>\n </tr>\n <tr>\n <th>tokyo</th>\n <td>250.0</td>\n <td>6.0</td>\n </tr>\n </tbody>\n</table>\n</div>",
"text/plain": " price quantity\ncity \nosaka 212.5 2.5\ntokyo 250.0 6.0"
},
"execution_count": 19,
"metadata": {},
"output_type": "execute_result"
}
]
},
{
"metadata": {
"ExecuteTime": {
"end_time": "2020-08-31T16:47:04.139539Z",
"start_time": "2020-08-31T16:47:04.128470Z"
},
"trusted": true
},
"cell_type": "code",
"source": "df_food_index.groupby('city', as_index=False).mean()",
"execution_count": 20,
"outputs": [
{
"data": {
"text/html": "<div>\n<style scoped>\n .dataframe tbody tr th:only-of-type {\n vertical-align: middle;\n }\n\n .dataframe tbody tr th {\n vertical-align: top;\n }\n\n .dataframe thead th {\n text-align: right;\n }\n</style>\n<table border=\"1\" class=\"dataframe\">\n <thead>\n <tr style=\"text-align: right;\">\n <th></th>\n <th>city</th>\n <th>price</th>\n <th>quantity</th>\n </tr>\n </thead>\n <tbody>\n <tr>\n <th>0</th>\n <td>osaka</td>\n <td>212.5</td>\n <td>2.5</td>\n </tr>\n <tr>\n <th>1</th>\n <td>tokyo</td>\n <td>250.0</td>\n <td>6.0</td>\n </tr>\n </tbody>\n</table>\n</div>",
"text/plain": " city price quantity\n0 osaka 212.5 2.5\n1 tokyo 250.0 6.0"
},
"execution_count": 20,
"metadata": {},
"output_type": "execute_result"
}
]
},
{
"metadata": {
"ExecuteTime": {
"end_time": "2020-08-31T16:47:04.152831Z",
"start_time": "2020-08-31T16:47:04.141878Z"
},
"trusted": true
},
"cell_type": "code",
"source": "# マルチインデックスの複合キーとgroupbyのキーが完全に重なったときは\ndf_multi_index.groupby(['city', 'food']).mean()",
"execution_count": 21,
"outputs": [
{
"data": {
"text/html": "<div>\n<style scoped>\n .dataframe tbody tr th:only-of-type {\n vertical-align: middle;\n }\n\n .dataframe tbody tr th {\n vertical-align: top;\n }\n\n .dataframe thead th {\n text-align: right;\n }\n</style>\n<table border=\"1\" class=\"dataframe\">\n <thead>\n <tr style=\"text-align: right;\">\n <th></th>\n <th></th>\n <th>price</th>\n <th>quantity</th>\n </tr>\n <tr>\n <th>city</th>\n <th>food</th>\n <th></th>\n <th></th>\n </tr>\n </thead>\n <tbody>\n <tr>\n <th rowspan=\"3\" valign=\"top\">osaka</th>\n <th>apple</th>\n <td>100.0</td>\n <td>1.0</td>\n </tr>\n <tr>\n <th>banana</th>\n <td>275.0</td>\n <td>3.5</td>\n </tr>\n <tr>\n <th>orange</th>\n <td>200.0</td>\n <td>2.0</td>\n </tr>\n <tr>\n <th rowspan=\"2\" valign=\"top\">tokyo</th>\n <th>apple</th>\n <td>175.0</td>\n <td>5.5</td>\n </tr>\n <tr>\n <th>banana</th>\n <td>400.0</td>\n <td>7.0</td>\n </tr>\n </tbody>\n</table>\n</div>",
"text/plain": " price quantity\ncity food \nosaka apple 100.0 1.0\n banana 275.0 3.5\n orange 200.0 2.0\ntokyo apple 175.0 5.5\n banana 400.0 7.0"
},
"execution_count": 21,
"metadata": {},
"output_type": "execute_result"
}
]
},
{
"metadata": {
"ExecuteTime": {
"end_time": "2020-08-31T16:47:04.164576Z",
"start_time": "2020-08-31T16:47:04.154762Z"
},
"trusted": true
},
"cell_type": "code",
"source": "# as_index=Falseでindexが消失する\ndf_multi_index.groupby(['city', 'food'], as_index=False).mean()",
"execution_count": 22,
"outputs": [
{
"data": {
"text/html": "<div>\n<style scoped>\n .dataframe tbody tr th:only-of-type {\n vertical-align: middle;\n }\n\n .dataframe tbody tr th {\n vertical-align: top;\n }\n\n .dataframe thead th {\n text-align: right;\n }\n</style>\n<table border=\"1\" class=\"dataframe\">\n <thead>\n <tr style=\"text-align: right;\">\n <th></th>\n <th>price</th>\n <th>quantity</th>\n </tr>\n </thead>\n <tbody>\n <tr>\n <th>0</th>\n <td>100.0</td>\n <td>1.0</td>\n </tr>\n <tr>\n <th>1</th>\n <td>275.0</td>\n <td>3.5</td>\n </tr>\n <tr>\n <th>2</th>\n <td>200.0</td>\n <td>2.0</td>\n </tr>\n <tr>\n <th>3</th>\n <td>175.0</td>\n <td>5.5</td>\n </tr>\n <tr>\n <th>4</th>\n <td>400.0</td>\n <td>7.0</td>\n </tr>\n </tbody>\n</table>\n</div>",
"text/plain": " price quantity\n0 100.0 1.0\n1 275.0 3.5\n2 200.0 2.0\n3 175.0 5.5\n4 400.0 7.0"
},
"execution_count": 22,
"metadata": {},
"output_type": "execute_result"
}
]
},
{
"metadata": {
"ExecuteTime": {
"end_time": "2020-08-31T16:47:04.169802Z",
"start_time": "2020-08-31T16:47:04.166607Z"
},
"trusted": true
},
"cell_type": "code",
"source": "# apply用関数\n# apply関数から受け取った値、型、group_nameをprintし、スカラ値を返却\ndef lambda_apply_scalar(x):\n print(\"====TEST=====\")\n print(x)\n print(type(x))\n print(x.name) # x.nameでgroupbyDataframeのgroup_nameが取れる\n print(\"=============\")\n return x.price * 100",
"execution_count": 23,
"outputs": []
},
{
"metadata": {
"ExecuteTime": {
"end_time": "2020-08-31T16:47:04.187195Z",
"start_time": "2020-08-31T16:47:04.171983Z"
},
"trusted": true
},
"cell_type": "code",
"source": "# group_nameは2つなのになぜか3回動いている ToDo: ソース見る\n# 最初の一つを除けば、group_nameごとにapplyがかかっていると思う\ndf_lambda_apply_scalar = df.groupby('city').apply(lambda_apply_scalar)",
"execution_count": 24,
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": "====TEST=====\n city food price quantity\n0 osaka apple 100 1\n1 osaka orange 200 2\n2 osaka banana 250 3\n3 osaka banana 300 4\n<class 'pandas.core.frame.DataFrame'>\nosaka\n=============\n====TEST=====\n city food price quantity\n0 osaka apple 100 1\n1 osaka orange 200 2\n2 osaka banana 250 3\n3 osaka banana 300 4\n<class 'pandas.core.frame.DataFrame'>\nosaka\n=============\n====TEST=====\n city food price quantity\n4 tokyo apple 150 5\n5 tokyo apple 200 6\n6 tokyo banana 400 7\n<class 'pandas.core.frame.DataFrame'>\ntokyo\n=============\n"
}
]
},
{
"metadata": {
"ExecuteTime": {
"end_time": "2020-08-31T16:47:04.193376Z",
"start_time": "2020-08-31T16:47:04.189209Z"
},
"trusted": true
},
"cell_type": "code",
"source": "# スカラ値を返却する関数をapplyに食わせると、返却されるのはSeries\n# そしてなんと、indexは、groupbyのキーとRangeindexのマルチインデックス\nprint(df_lambda_apply_scalar)\ntype(df_lambda_apply_scalar)\nprint(df_lambda_apply_scalar.index)",
"execution_count": 25,
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": "city \nosaka 0 10000\n 1 20000\n 2 25000\n 3 30000\ntokyo 4 15000\n 5 20000\n 6 40000\nName: price, dtype: int64\nMultiIndex(levels=[['osaka', 'tokyo'], [0, 1, 2, 3, 4, 5, 6]],\n codes=[[0, 0, 0, 0, 1, 1, 1], [0, 1, 2, 3, 4, 5, 6]],\n names=['city', None])\n"
}
]
},
{
"metadata": {
"ExecuteTime": {
"end_time": "2020-08-31T16:47:36.388879Z",
"start_time": "2020-08-31T16:47:36.373479Z"
},
"trusted": true
},
"cell_type": "code",
"source": "df_lambda_apply_scalar_index_false = df.groupby('city', as_index=False).apply(lambda_apply_scalar)",
"execution_count": 27,
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": "====TEST=====\n city food price quantity\n0 osaka apple 100 1\n1 osaka orange 200 2\n2 osaka banana 250 3\n3 osaka banana 300 4\n<class 'pandas.core.frame.DataFrame'>\nosaka\n=============\n====TEST=====\n city food price quantity\n0 osaka apple 100 1\n1 osaka orange 200 2\n2 osaka banana 250 3\n3 osaka banana 300 4\n<class 'pandas.core.frame.DataFrame'>\nosaka\n=============\n====TEST=====\n city food price quantity\n4 tokyo apple 150 5\n5 tokyo apple 200 6\n6 tokyo banana 400 7\n<class 'pandas.core.frame.DataFrame'>\ntokyo\n=============\n"
}
]
},
{
"metadata": {
"ExecuteTime": {
"end_time": "2020-08-31T16:47:39.162436Z",
"start_time": "2020-08-31T16:47:39.153209Z"
},
"trusted": true
},
"cell_type": "code",
"source": "# as_index=Falseの場合、基本的挙動は上と同じだが、\n# マルチインデックスのキーからcityがぬけ、代わりに対応するRangeIndexキーが作成されている。\nprint(df_lambda_apply_scalar_index_false)\nprint(type(df_lambda_apply_scalar_index_false))\nprint(df_lambda_apply_scalar_index_false.index)",
"execution_count": 28,
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": "0 0 10000\n 1 20000\n 2 25000\n 3 30000\n1 4 15000\n 5 20000\n 6 40000\nName: price, dtype: int64\n<class 'pandas.core.series.Series'>\nMultiIndex(levels=[[0, 1], [0, 1, 2, 3, 4, 5, 6]],\n codes=[[0, 0, 0, 0, 1, 1, 1], [0, 1, 2, 3, 4, 5, 6]])\n"
}
]
},
{
"metadata": {
"ExecuteTime": {
"end_time": "2020-08-31T16:47:42.238957Z",
"start_time": "2020-08-31T16:47:42.222634Z"
},
"trusted": true
},
"cell_type": "code",
"source": "# set_indexしたdfでもapply関数内は挙動に大きな変化はない\ndf_set_index_lambda_apply_scalar_index = df_city_index.groupby('city').apply(lambda_apply_scalar)",
"execution_count": 29,
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": "====TEST=====\n food price quantity\ncity \nosaka apple 100 1\nosaka orange 200 2\nosaka banana 250 3\nosaka banana 300 4\n<class 'pandas.core.frame.DataFrame'>\nosaka\n=============\n====TEST=====\n food price quantity\ncity \nosaka apple 100 1\nosaka orange 200 2\nosaka banana 250 3\nosaka banana 300 4\n<class 'pandas.core.frame.DataFrame'>\nosaka\n=============\n====TEST=====\n food price quantity\ncity \ntokyo apple 150 5\ntokyo apple 200 6\ntokyo banana 400 7\n<class 'pandas.core.frame.DataFrame'>\ntokyo\n=============\n"
}
]
},
{
"metadata": {
"ExecuteTime": {
"end_time": "2020-08-31T16:47:44.206482Z",
"start_time": "2020-08-31T16:47:44.201165Z"
},
"trusted": true
},
"cell_type": "code",
"source": "# groupbyのキーともともとのキーを複合したマルチインデックスを持ったSeriesができた。\nprint(df_set_index_lambda_apply_scalar_index)\nprint(type(df_set_index_lambda_apply_scalar_index))\nprint(df_set_index_lambda_apply_scalar_index.index)",
"execution_count": 30,
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": "city city \nosaka osaka 10000\n osaka 20000\n osaka 25000\n osaka 30000\ntokyo tokyo 15000\n tokyo 20000\n tokyo 40000\nName: price, dtype: int64\n<class 'pandas.core.series.Series'>\nMultiIndex(levels=[['osaka', 'tokyo'], ['osaka', 'tokyo']],\n codes=[[0, 0, 0, 0, 1, 1, 1], [0, 0, 0, 0, 1, 1, 1]],\n names=['city', 'city'])\n"
}
]
},
{
"metadata": {
"ExecuteTime": {
"end_time": "2020-08-31T16:47:46.672837Z",
"start_time": "2020-08-31T16:47:46.658281Z"
},
"trusted": true
},
"cell_type": "code",
"source": "df_set_index_lambda_apply_scalar_index_false = df_city_index.groupby('city', as_index=False).apply(lambda_apply_scalar)",
"execution_count": 31,
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": "====TEST=====\n food price quantity\ncity \nosaka apple 100 1\nosaka orange 200 2\nosaka banana 250 3\nosaka banana 300 4\n<class 'pandas.core.frame.DataFrame'>\nosaka\n=============\n====TEST=====\n food price quantity\ncity \nosaka apple 100 1\nosaka orange 200 2\nosaka banana 250 3\nosaka banana 300 4\n<class 'pandas.core.frame.DataFrame'>\nosaka\n=============\n====TEST=====\n food price quantity\ncity \ntokyo apple 150 5\ntokyo apple 200 6\ntokyo banana 400 7\n<class 'pandas.core.frame.DataFrame'>\ntokyo\n=============\n"
}
]
},
{
"metadata": {
"ExecuteTime": {
"end_time": "2020-08-31T16:47:48.135393Z",
"start_time": "2020-08-31T16:47:48.130323Z"
},
"trusted": true
},
"cell_type": "code",
"source": "# なるほど。as_index=Falseの場合はgroupbyのキーがぬけて、もともとのindexとRangeIndexのマルチインデックスになった。\n# どうやら、groupbyにapplyをカマスと、Index付きSeriesが返るらしい。\n# ただし、複数のindexが返却されうるケースは複合インデックスを用いて、Seriesを返しているということのようだ。\n# また、as_index=Falseにした場合でもgroupingを表現するために、groupbyのキーに対応したRangeIndexは付与される。\nprint(df_set_index_lambda_apply_scalar_index_false)\nprint(type(df_set_index_lambda_apply_scalar_index_false))\nprint(df_set_index_lambda_apply_scalar_index_false.index)",
"execution_count": 32,
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": " city \n0 osaka 10000\n osaka 20000\n osaka 25000\n osaka 30000\n1 tokyo 15000\n tokyo 20000\n tokyo 40000\nName: price, dtype: int64\n<class 'pandas.core.series.Series'>\nMultiIndex(levels=[[0, 1], ['osaka', 'tokyo']],\n codes=[[0, 0, 0, 0, 1, 1, 1], [0, 0, 0, 0, 1, 1, 1]],\n names=[None, 'city'])\n"
}
]
},
{
"metadata": {
"ExecuteTime": {
"end_time": "2020-08-31T16:47:48.967232Z",
"start_time": "2020-08-31T16:47:48.942707Z"
},
"trusted": true
},
"cell_type": "code",
"source": "# なので、groupbyのキーが複数の際は、複合キーを持ったSeriesが返される。\nprint(df.groupby(['city', 'food']).apply(lambda_apply_scalar))",
"execution_count": 33,
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": "====TEST=====\n city food price quantity\n0 osaka apple 100 1\n<class 'pandas.core.frame.DataFrame'>\n('osaka', 'apple')\n=============\n====TEST=====\n city food price quantity\n0 osaka apple 100 1\n<class 'pandas.core.frame.DataFrame'>\n('osaka', 'apple')\n=============\n====TEST=====\n city food price quantity\n2 osaka banana 250 3\n3 osaka banana 300 4\n<class 'pandas.core.frame.DataFrame'>\n('osaka', 'banana')\n=============\n====TEST=====\n city food price quantity\n1 osaka orange 200 2\n<class 'pandas.core.frame.DataFrame'>\n('osaka', 'orange')\n=============\n====TEST=====\n city food price quantity\n4 tokyo apple 150 5\n5 tokyo apple 200 6\n<class 'pandas.core.frame.DataFrame'>\n('tokyo', 'apple')\n=============\n====TEST=====\n city food price quantity\n6 tokyo banana 400 7\n<class 'pandas.core.frame.DataFrame'>\n('tokyo', 'banana')\n=============\ncity food \nosaka apple 0 10000\n banana 2 25000\n 3 30000\n orange 1 20000\ntokyo apple 4 15000\n 5 20000\n banana 6 40000\nName: price, dtype: int64\n"
}
]
},
{
"metadata": {
"ExecuteTime": {
"end_time": "2020-08-31T16:47:49.728917Z",
"start_time": "2020-08-31T16:47:49.709947Z"
},
"trusted": true
},
"cell_type": "code",
"source": "# as_indexの挙動も他の同じっぽい。\nprint(df.groupby(['city', 'food'], as_index=False).apply(lambda_apply_scalar))",
"execution_count": 34,
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": "====TEST=====\n city food price quantity\n0 osaka apple 100 1\n<class 'pandas.core.frame.DataFrame'>\n('osaka', 'apple')\n=============\n====TEST=====\n city food price quantity\n0 osaka apple 100 1\n<class 'pandas.core.frame.DataFrame'>\n('osaka', 'apple')\n=============\n====TEST=====\n city food price quantity\n2 osaka banana 250 3\n3 osaka banana 300 4\n<class 'pandas.core.frame.DataFrame'>\n('osaka', 'banana')\n=============\n====TEST=====\n city food price quantity\n1 osaka orange 200 2\n<class 'pandas.core.frame.DataFrame'>\n('osaka', 'orange')\n=============\n====TEST=====\n city food price quantity\n4 tokyo apple 150 5\n5 tokyo apple 200 6\n<class 'pandas.core.frame.DataFrame'>\n('tokyo', 'apple')\n=============\n====TEST=====\n city food price quantity\n6 tokyo banana 400 7\n<class 'pandas.core.frame.DataFrame'>\n('tokyo', 'banana')\n=============\n0 0 10000\n1 2 25000\n 3 30000\n2 1 20000\n3 4 15000\n 5 20000\n4 6 40000\nName: price, dtype: int64\n"
}
]
},
{
"metadata": {
"ExecuteTime": {
"end_time": "2020-08-31T16:47:50.389995Z",
"start_time": "2020-08-31T16:47:50.377392Z"
},
"trusted": true
},
"cell_type": "code",
"source": "# 疑問: applyとaggの違い\n\n# 結論: applyの場合と異なり、aggはDFが返る。\n# indexが設定されていないDFに対しては、as_index=Falseするとgroupbyのキーはカラムとして連携される。\n# ただし、indexとgroupbyが同カラムの場合は問答無用でas_index=Falseするとキーが消えちゃうので注意\n\ndf_agg = df.groupby('city').agg({'price': 'max'})\ndf_agg_index_false = df.groupby('city', as_index=False).agg({'price': 'max'})\ndf_city_agg = df_city_index.groupby('city').agg({'price': 'max'})\ndf_city_agg_index_false = df_city_index.groupby('city', as_index=False).agg({'price': 'max'})",
"execution_count": 35,
"outputs": []
},
{
"metadata": {
"ExecuteTime": {
"end_time": "2020-08-31T16:47:51.112227Z",
"start_time": "2020-08-31T16:47:51.106107Z"
},
"trusted": true
},
"cell_type": "code",
"source": "df_agg",
"execution_count": 36,
"outputs": [
{
"data": {
"text/html": "<div>\n<style scoped>\n .dataframe tbody tr th:only-of-type {\n vertical-align: middle;\n }\n\n .dataframe tbody tr th {\n vertical-align: top;\n }\n\n .dataframe thead th {\n text-align: right;\n }\n</style>\n<table border=\"1\" class=\"dataframe\">\n <thead>\n <tr style=\"text-align: right;\">\n <th></th>\n <th>price</th>\n </tr>\n <tr>\n <th>city</th>\n <th></th>\n </tr>\n </thead>\n <tbody>\n <tr>\n <th>osaka</th>\n <td>300</td>\n </tr>\n <tr>\n <th>tokyo</th>\n <td>400</td>\n </tr>\n </tbody>\n</table>\n</div>",
"text/plain": " price\ncity \nosaka 300\ntokyo 400"
},
"execution_count": 36,
"metadata": {},
"output_type": "execute_result"
}
]
},
{
"metadata": {
"ExecuteTime": {
"end_time": "2020-08-31T16:47:51.749543Z",
"start_time": "2020-08-31T16:47:51.742636Z"
},
"trusted": true
},
"cell_type": "code",
"source": "df_agg_index_false",
"execution_count": 37,
"outputs": [
{
"data": {
"text/html": "<div>\n<style scoped>\n .dataframe tbody tr th:only-of-type {\n vertical-align: middle;\n }\n\n .dataframe tbody tr th {\n vertical-align: top;\n }\n\n .dataframe thead th {\n text-align: right;\n }\n</style>\n<table border=\"1\" class=\"dataframe\">\n <thead>\n <tr style=\"text-align: right;\">\n <th></th>\n <th>city</th>\n <th>price</th>\n </tr>\n </thead>\n <tbody>\n <tr>\n <th>0</th>\n <td>osaka</td>\n <td>300</td>\n </tr>\n <tr>\n <th>1</th>\n <td>tokyo</td>\n <td>400</td>\n </tr>\n </tbody>\n</table>\n</div>",
"text/plain": " city price\n0 osaka 300\n1 tokyo 400"
},
"execution_count": 37,
"metadata": {},
"output_type": "execute_result"
}
]
},
{
"metadata": {
"ExecuteTime": {
"end_time": "2020-08-31T16:47:52.380996Z",
"start_time": "2020-08-31T16:47:52.374746Z"
},
"trusted": true
},
"cell_type": "code",
"source": "df_city_agg",
"execution_count": 38,
"outputs": [
{
"data": {
"text/html": "<div>\n<style scoped>\n .dataframe tbody tr th:only-of-type {\n vertical-align: middle;\n }\n\n .dataframe tbody tr th {\n vertical-align: top;\n }\n\n .dataframe thead th {\n text-align: right;\n }\n</style>\n<table border=\"1\" class=\"dataframe\">\n <thead>\n <tr style=\"text-align: right;\">\n <th></th>\n <th>price</th>\n </tr>\n <tr>\n <th>city</th>\n <th></th>\n </tr>\n </thead>\n <tbody>\n <tr>\n <th>osaka</th>\n <td>300</td>\n </tr>\n <tr>\n <th>tokyo</th>\n <td>400</td>\n </tr>\n </tbody>\n</table>\n</div>",
"text/plain": " price\ncity \nosaka 300\ntokyo 400"
},
"execution_count": 38,
"metadata": {},
"output_type": "execute_result"
}
]
},
{
"metadata": {
"ExecuteTime": {
"end_time": "2020-08-31T16:47:52.976737Z",
"start_time": "2020-08-31T16:47:52.970849Z"
},
"trusted": true
},
"cell_type": "code",
"source": "df_city_agg_index_false",
"execution_count": 39,
"outputs": [
{
"data": {
"text/html": "<div>\n<style scoped>\n .dataframe tbody tr th:only-of-type {\n vertical-align: middle;\n }\n\n .dataframe tbody tr th {\n vertical-align: top;\n }\n\n .dataframe thead th {\n text-align: right;\n }\n</style>\n<table border=\"1\" class=\"dataframe\">\n <thead>\n <tr style=\"text-align: right;\">\n <th></th>\n <th>price</th>\n </tr>\n </thead>\n <tbody>\n <tr>\n <th>0</th>\n <td>300</td>\n </tr>\n <tr>\n <th>1</th>\n <td>400</td>\n </tr>\n </tbody>\n</table>\n</div>",
"text/plain": " price\n0 300\n1 400"
},
"execution_count": 39,
"metadata": {},
"output_type": "execute_result"
}
]
},
{
"metadata": {
"ExecuteTime": {
"end_time": "2020-08-31T16:47:53.593262Z",
"start_time": "2020-08-31T16:47:53.589575Z"
},
"trusted": true
},
"cell_type": "code",
"source": "# apply用関数\n# apply関数から受け取った値、型、group_nameをprintし、Seriesを返却\ndef lambda_apply_series(x):\n print(\"====TEST=====\")\n print(x)\n print(type(x))\n print(x.name) # x.nameでgroupbyDataframeのgroup_nameが取れる\n print(\"=============\")\n tmp_dict = {'test': 100 * x.price.values[0], 'test2': 200}\n result = pd.Series(tmp_dict)\n print(result)\n print(type(result))\n print(result.index)\n return result",
"execution_count": 40,
"outputs": []
},
{
"metadata": {
"ExecuteTime": {
"end_time": "2020-08-31T16:47:54.353204Z",
"start_time": "2020-08-31T16:47:54.318471Z"
},
"trusted": true
},
"cell_type": "code",
"source": "# Seriesも基本おなじだった。\ndf_lambda_apply_series = df.groupby(['city']).apply(lambda_apply_series)\ndf_lambda_apply_series_index_false = df.groupby(['city'], as_index=False).apply(lambda_apply_series)",
"execution_count": 41,
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": "====TEST=====\n city food price quantity\n0 osaka apple 100 1\n1 osaka orange 200 2\n2 osaka banana 250 3\n3 osaka banana 300 4\n<class 'pandas.core.frame.DataFrame'>\nosaka\n=============\ntest 10000\ntest2 200\ndtype: int64\n<class 'pandas.core.series.Series'>\nIndex(['test', 'test2'], dtype='object')\n====TEST=====\n city food price quantity\n0 osaka apple 100 1\n1 osaka orange 200 2\n2 osaka banana 250 3\n3 osaka banana 300 4\n<class 'pandas.core.frame.DataFrame'>\nosaka\n=============\ntest 10000\ntest2 200\ndtype: int64\n<class 'pandas.core.series.Series'>\nIndex(['test', 'test2'], dtype='object')\n====TEST=====\n city food price quantity\n4 tokyo apple 150 5\n5 tokyo apple 200 6\n6 tokyo banana 400 7\n<class 'pandas.core.frame.DataFrame'>\ntokyo\n=============\ntest 15000\ntest2 200\ndtype: int64\n<class 'pandas.core.series.Series'>\nIndex(['test', 'test2'], dtype='object')\n====TEST=====\n city food price quantity\n0 osaka apple 100 1\n1 osaka orange 200 2\n2 osaka banana 250 3\n3 osaka banana 300 4\n<class 'pandas.core.frame.DataFrame'>\nosaka\n=============\ntest 10000\ntest2 200\ndtype: int64\n<class 'pandas.core.series.Series'>\nIndex(['test', 'test2'], dtype='object')\n====TEST=====\n city food price quantity\n0 osaka apple 100 1\n1 osaka orange 200 2\n2 osaka banana 250 3\n3 osaka banana 300 4\n<class 'pandas.core.frame.DataFrame'>\nosaka\n=============\ntest 10000\ntest2 200\ndtype: int64\n<class 'pandas.core.series.Series'>\nIndex(['test', 'test2'], dtype='object')\n====TEST=====\n city food price quantity\n4 tokyo apple 150 5\n5 tokyo apple 200 6\n6 tokyo banana 400 7\n<class 'pandas.core.frame.DataFrame'>\ntokyo\n=============\ntest 15000\ntest2 200\ndtype: int64\n<class 'pandas.core.series.Series'>\nIndex(['test', 'test2'], dtype='object')\n"
}
]
},
{
"metadata": {
"ExecuteTime": {
"end_time": "2020-08-31T16:47:55.236931Z",
"start_time": "2020-08-31T16:47:55.230395Z"
},
"trusted": true
},
"cell_type": "code",
"source": "df_lambda_apply_series",
"execution_count": 42,
"outputs": [
{
"data": {
"text/html": "<div>\n<style scoped>\n .dataframe tbody tr th:only-of-type {\n vertical-align: middle;\n }\n\n .dataframe tbody tr th {\n vertical-align: top;\n }\n\n .dataframe thead th {\n text-align: right;\n }\n</style>\n<table border=\"1\" class=\"dataframe\">\n <thead>\n <tr style=\"text-align: right;\">\n <th></th>\n <th>test</th>\n <th>test2</th>\n </tr>\n <tr>\n <th>city</th>\n <th></th>\n <th></th>\n </tr>\n </thead>\n <tbody>\n <tr>\n <th>osaka</th>\n <td>10000</td>\n <td>200</td>\n </tr>\n <tr>\n <th>tokyo</th>\n <td>15000</td>\n <td>200</td>\n </tr>\n </tbody>\n</table>\n</div>",
"text/plain": " test test2\ncity \nosaka 10000 200\ntokyo 15000 200"
},
"execution_count": 42,
"metadata": {},
"output_type": "execute_result"
}
]
},
{
"metadata": {
"ExecuteTime": {
"end_time": "2020-08-31T16:47:55.751378Z",
"start_time": "2020-08-31T16:47:55.744671Z"
},
"trusted": true
},
"cell_type": "code",
"source": "df_lambda_apply_series_index_false",
"execution_count": 43,
"outputs": [
{
"data": {
"text/html": "<div>\n<style scoped>\n .dataframe tbody tr th:only-of-type {\n vertical-align: middle;\n }\n\n .dataframe tbody tr th {\n vertical-align: top;\n }\n\n .dataframe thead th {\n text-align: right;\n }\n</style>\n<table border=\"1\" class=\"dataframe\">\n <thead>\n <tr style=\"text-align: right;\">\n <th></th>\n <th>test</th>\n <th>test2</th>\n </tr>\n </thead>\n <tbody>\n <tr>\n <th>0</th>\n <td>10000</td>\n <td>200</td>\n </tr>\n <tr>\n <th>1</th>\n <td>15000</td>\n <td>200</td>\n </tr>\n </tbody>\n</table>\n</div>",
"text/plain": " test test2\n0 10000 200\n1 15000 200"
},
"execution_count": 43,
"metadata": {},
"output_type": "execute_result"
}
]
},
{
"metadata": {
"ExecuteTime": {
"end_time": "2020-08-31T16:47:56.287429Z",
"start_time": "2020-08-31T16:47:56.281298Z"
},
"trusted": true
},
"cell_type": "code",
"source": "# apply用関数\n# apply関数から受け取った値、型、group_nameをprintし、DFを返却\ndef lambda_apply_df(x):\n print(\"====TEST=====\")\n print(x)\n print(type(x))\n print(x.name) # x.nameでgroupbyDataframeのgroup_nameが取れる\n print(\"=============\")\n tmp_dict = pd.DataFrame({\n 'city2': ['osaka', 'osaka', 'osaka', 'osaka', 'tokyo', 'tokyo', 'tokyo'],\n 'food2': ['apple', 'orange', 'banana', 'banana', 'apple', 'apple', 'banana'],\n 'price2': [100, 200, 250, 300, 150, 200, 400],\n 'quantity2': [1, 2, 3, 4, 5, 6, 7]})\n result = pd.DataFrame(tmp_dict)\n print(result)\n print(type(result))\n print(result.index)\n return result",
"execution_count": 44,
"outputs": []
},
{
"metadata": {
"ExecuteTime": {
"end_time": "2020-08-31T16:47:56.947599Z",
"start_time": "2020-08-31T16:47:56.906022Z"
},
"trusted": true
},
"cell_type": "code",
"source": "# DFも基本おなじだった。\ndf_lambda_apply_df = df.groupby(['city']).apply(lambda_apply_df)\ndf_lambda_apply_df_index_false = df.groupby(['city'], as_index=False).apply(lambda_apply_df)",
"execution_count": 45,
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": "====TEST=====\n city food price quantity\n0 osaka apple 100 1\n1 osaka orange 200 2\n2 osaka banana 250 3\n3 osaka banana 300 4\n<class 'pandas.core.frame.DataFrame'>\nosaka\n=============\n city2 food2 price2 quantity2\n0 osaka apple 100 1\n1 osaka orange 200 2\n2 osaka banana 250 3\n3 osaka banana 300 4\n4 tokyo apple 150 5\n5 tokyo apple 200 6\n6 tokyo banana 400 7\n<class 'pandas.core.frame.DataFrame'>\nRangeIndex(start=0, stop=7, step=1)\n====TEST=====\n city food price quantity\n0 osaka apple 100 1\n1 osaka orange 200 2\n2 osaka banana 250 3\n3 osaka banana 300 4\n<class 'pandas.core.frame.DataFrame'>\nosaka\n=============\n city2 food2 price2 quantity2\n0 osaka apple 100 1\n1 osaka orange 200 2\n2 osaka banana 250 3\n3 osaka banana 300 4\n4 tokyo apple 150 5\n5 tokyo apple 200 6\n6 tokyo banana 400 7\n<class 'pandas.core.frame.DataFrame'>\nRangeIndex(start=0, stop=7, step=1)\n====TEST=====\n city food price quantity\n4 tokyo apple 150 5\n5 tokyo apple 200 6\n6 tokyo banana 400 7\n<class 'pandas.core.frame.DataFrame'>\ntokyo\n=============\n city2 food2 price2 quantity2\n0 osaka apple 100 1\n1 osaka orange 200 2\n2 osaka banana 250 3\n3 osaka banana 300 4\n4 tokyo apple 150 5\n5 tokyo apple 200 6\n6 tokyo banana 400 7\n<class 'pandas.core.frame.DataFrame'>\nRangeIndex(start=0, stop=7, step=1)\n====TEST=====\n city food price quantity\n0 osaka apple 100 1\n1 osaka orange 200 2\n2 osaka banana 250 3\n3 osaka banana 300 4\n<class 'pandas.core.frame.DataFrame'>\nosaka\n=============\n city2 food2 price2 quantity2\n0 osaka apple 100 1\n1 osaka orange 200 2\n2 osaka banana 250 3\n3 osaka banana 300 4\n4 tokyo apple 150 5\n5 tokyo apple 200 6\n6 tokyo banana 400 7\n<class 'pandas.core.frame.DataFrame'>\nRangeIndex(start=0, stop=7, step=1)\n====TEST=====\n city food price quantity\n0 osaka apple 100 1\n1 osaka orange 200 2\n2 osaka banana 250 3\n3 osaka banana 300 4\n<class 'pandas.core.frame.DataFrame'>\nosaka\n=============\n city2 food2 price2 quantity2\n0 osaka apple 100 1\n1 osaka orange 200 2\n2 osaka banana 250 3\n3 osaka banana 300 4\n4 tokyo apple 150 5\n5 tokyo apple 200 6\n6 tokyo banana 400 7\n<class 'pandas.core.frame.DataFrame'>\nRangeIndex(start=0, stop=7, step=1)\n====TEST=====\n city food price quantity\n4 tokyo apple 150 5\n5 tokyo apple 200 6\n6 tokyo banana 400 7\n<class 'pandas.core.frame.DataFrame'>\ntokyo\n=============\n city2 food2 price2 quantity2\n0 osaka apple 100 1\n1 osaka orange 200 2\n2 osaka banana 250 3\n3 osaka banana 300 4\n4 tokyo apple 150 5\n5 tokyo apple 200 6\n6 tokyo banana 400 7\n<class 'pandas.core.frame.DataFrame'>\nRangeIndex(start=0, stop=7, step=1)\n"
}
]
},
{
"metadata": {
"ExecuteTime": {
"end_time": "2020-08-31T16:47:57.636298Z",
"start_time": "2020-08-31T16:47:57.625837Z"
},
"trusted": true
},
"cell_type": "code",
"source": "df_lambda_apply_df",
"execution_count": 46,
"outputs": [
{
"data": {
"text/html": "<div>\n<style scoped>\n .dataframe tbody tr th:only-of-type {\n vertical-align: middle;\n }\n\n .dataframe tbody tr th {\n vertical-align: top;\n }\n\n .dataframe thead th {\n text-align: right;\n }\n</style>\n<table border=\"1\" class=\"dataframe\">\n <thead>\n <tr style=\"text-align: right;\">\n <th></th>\n <th></th>\n <th>city2</th>\n <th>food2</th>\n <th>price2</th>\n <th>quantity2</th>\n </tr>\n <tr>\n <th>city</th>\n <th></th>\n <th></th>\n <th></th>\n <th></th>\n <th></th>\n </tr>\n </thead>\n <tbody>\n <tr>\n <th rowspan=\"7\" valign=\"top\">osaka</th>\n <th>0</th>\n <td>osaka</td>\n <td>apple</td>\n <td>100</td>\n <td>1</td>\n </tr>\n <tr>\n <th>1</th>\n <td>osaka</td>\n <td>orange</td>\n <td>200</td>\n <td>2</td>\n </tr>\n <tr>\n <th>2</th>\n <td>osaka</td>\n <td>banana</td>\n <td>250</td>\n <td>3</td>\n </tr>\n <tr>\n <th>3</th>\n <td>osaka</td>\n <td>banana</td>\n <td>300</td>\n <td>4</td>\n </tr>\n <tr>\n <th>4</th>\n <td>tokyo</td>\n <td>apple</td>\n <td>150</td>\n <td>5</td>\n </tr>\n <tr>\n <th>5</th>\n <td>tokyo</td>\n <td>apple</td>\n <td>200</td>\n <td>6</td>\n </tr>\n <tr>\n <th>6</th>\n <td>tokyo</td>\n <td>banana</td>\n <td>400</td>\n <td>7</td>\n </tr>\n <tr>\n <th rowspan=\"7\" valign=\"top\">tokyo</th>\n <th>0</th>\n <td>osaka</td>\n <td>apple</td>\n <td>100</td>\n <td>1</td>\n </tr>\n <tr>\n <th>1</th>\n <td>osaka</td>\n <td>orange</td>\n <td>200</td>\n <td>2</td>\n </tr>\n <tr>\n <th>2</th>\n <td>osaka</td>\n <td>banana</td>\n <td>250</td>\n <td>3</td>\n </tr>\n <tr>\n <th>3</th>\n <td>osaka</td>\n <td>banana</td>\n <td>300</td>\n <td>4</td>\n </tr>\n <tr>\n <th>4</th>\n <td>tokyo</td>\n <td>apple</td>\n <td>150</td>\n <td>5</td>\n </tr>\n <tr>\n <th>5</th>\n <td>tokyo</td>\n <td>apple</td>\n <td>200</td>\n <td>6</td>\n </tr>\n <tr>\n <th>6</th>\n <td>tokyo</td>\n <td>banana</td>\n <td>400</td>\n <td>7</td>\n </tr>\n </tbody>\n</table>\n</div>",
"text/plain": " city2 food2 price2 quantity2\ncity \nosaka 0 osaka apple 100 1\n 1 osaka orange 200 2\n 2 osaka banana 250 3\n 3 osaka banana 300 4\n 4 tokyo apple 150 5\n 5 tokyo apple 200 6\n 6 tokyo banana 400 7\ntokyo 0 osaka apple 100 1\n 1 osaka orange 200 2\n 2 osaka banana 250 3\n 3 osaka banana 300 4\n 4 tokyo apple 150 5\n 5 tokyo apple 200 6\n 6 tokyo banana 400 7"
},
"execution_count": 46,
"metadata": {},
"output_type": "execute_result"
}
]
},
{
"metadata": {
"ExecuteTime": {
"end_time": "2020-08-31T16:47:58.301417Z",
"start_time": "2020-08-31T16:47:58.290138Z"
},
"trusted": true
},
"cell_type": "code",
"source": "df_lambda_apply_df_index_false",
"execution_count": 47,
"outputs": [
{
"data": {
"text/html": "<div>\n<style scoped>\n .dataframe tbody tr th:only-of-type {\n vertical-align: middle;\n }\n\n .dataframe tbody tr th {\n vertical-align: top;\n }\n\n .dataframe thead th {\n text-align: right;\n }\n</style>\n<table border=\"1\" class=\"dataframe\">\n <thead>\n <tr style=\"text-align: right;\">\n <th></th>\n <th></th>\n <th>city2</th>\n <th>food2</th>\n <th>price2</th>\n <th>quantity2</th>\n </tr>\n </thead>\n <tbody>\n <tr>\n <th rowspan=\"7\" valign=\"top\">0</th>\n <th>0</th>\n <td>osaka</td>\n <td>apple</td>\n <td>100</td>\n <td>1</td>\n </tr>\n <tr>\n <th>1</th>\n <td>osaka</td>\n <td>orange</td>\n <td>200</td>\n <td>2</td>\n </tr>\n <tr>\n <th>2</th>\n <td>osaka</td>\n <td>banana</td>\n <td>250</td>\n <td>3</td>\n </tr>\n <tr>\n <th>3</th>\n <td>osaka</td>\n <td>banana</td>\n <td>300</td>\n <td>4</td>\n </tr>\n <tr>\n <th>4</th>\n <td>tokyo</td>\n <td>apple</td>\n <td>150</td>\n <td>5</td>\n </tr>\n <tr>\n <th>5</th>\n <td>tokyo</td>\n <td>apple</td>\n <td>200</td>\n <td>6</td>\n </tr>\n <tr>\n <th>6</th>\n <td>tokyo</td>\n <td>banana</td>\n <td>400</td>\n <td>7</td>\n </tr>\n <tr>\n <th rowspan=\"7\" valign=\"top\">1</th>\n <th>0</th>\n <td>osaka</td>\n <td>apple</td>\n <td>100</td>\n <td>1</td>\n </tr>\n <tr>\n <th>1</th>\n <td>osaka</td>\n <td>orange</td>\n <td>200</td>\n <td>2</td>\n </tr>\n <tr>\n <th>2</th>\n <td>osaka</td>\n <td>banana</td>\n <td>250</td>\n <td>3</td>\n </tr>\n <tr>\n <th>3</th>\n <td>osaka</td>\n <td>banana</td>\n <td>300</td>\n <td>4</td>\n </tr>\n <tr>\n <th>4</th>\n <td>tokyo</td>\n <td>apple</td>\n <td>150</td>\n <td>5</td>\n </tr>\n <tr>\n <th>5</th>\n <td>tokyo</td>\n <td>apple</td>\n <td>200</td>\n <td>6</td>\n </tr>\n <tr>\n <th>6</th>\n <td>tokyo</td>\n <td>banana</td>\n <td>400</td>\n <td>7</td>\n </tr>\n </tbody>\n</table>\n</div>",
"text/plain": " city2 food2 price2 quantity2\n0 0 osaka apple 100 1\n 1 osaka orange 200 2\n 2 osaka banana 250 3\n 3 osaka banana 300 4\n 4 tokyo apple 150 5\n 5 tokyo apple 200 6\n 6 tokyo banana 400 7\n1 0 osaka apple 100 1\n 1 osaka orange 200 2\n 2 osaka banana 250 3\n 3 osaka banana 300 4\n 4 tokyo apple 150 5\n 5 tokyo apple 200 6\n 6 tokyo banana 400 7"
},
"execution_count": 47,
"metadata": {},
"output_type": "execute_result"
}
]
},
{
"metadata": {
"ExecuteTime": {
"end_time": "2020-08-31T16:47:59.023833Z",
"start_time": "2020-08-31T16:47:59.017821Z"
},
"trusted": true
},
"cell_type": "code",
"source": "# apply用関数\n# apply関数から受け取った値、型、group_nameをprintし、DFを返却\ndef lambda_apply_df_reindex(x):\n print(\"====TEST=====\")\n print(x)\n print(type(x))\n print(x.name) # x.nameでgroupbyDataframeのgroup_nameが取れる\n print(\"=============\")\n tmp_dict = pd.DataFrame({\n 'city2': ['osaka', 'osaka', 'osaka', 'osaka', 'tokyo', 'tokyo', 'tokyo'],\n 'food2': ['apple', 'orange', 'banana', 'banana', 'apple', 'apple', 'banana'],\n 'price2': [100, 200, 250, 300, 150, 200, 400],\n 'quantity2': [1, 2, 3, 4, 5, 6, 7]})\n result = pd.DataFrame(tmp_dict).set_index('city2')\n print(result)\n print(type(result))\n print(result.index)\n return result",
"execution_count": 48,
"outputs": []
},
{
"metadata": {
"ExecuteTime": {
"end_time": "2020-08-31T16:47:59.839367Z",
"start_time": "2020-08-31T16:47:59.791101Z"
},
"trusted": true
},
"cell_type": "code",
"source": "# index済みDFも基本おなじだった。\ndf_lambda_apply_reindex_df = df.groupby(['city']).apply(lambda_apply_df_reindex)\ndf_lambda_apply_reindex_df_index_false = df.groupby(['city'], as_index=False).apply(lambda_apply_df_reindex)",
"execution_count": 49,
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": "====TEST=====\n city food price quantity\n0 osaka apple 100 1\n1 osaka orange 200 2\n2 osaka banana 250 3\n3 osaka banana 300 4\n<class 'pandas.core.frame.DataFrame'>\nosaka\n=============\n food2 price2 quantity2\ncity2 \nosaka apple 100 1\nosaka orange 200 2\nosaka banana 250 3\nosaka banana 300 4\ntokyo apple 150 5\ntokyo apple 200 6\ntokyo banana 400 7\n<class 'pandas.core.frame.DataFrame'>\nIndex(['osaka', 'osaka', 'osaka', 'osaka', 'tokyo', 'tokyo', 'tokyo'], dtype='object', name='city2')\n====TEST=====\n city food price quantity\n0 osaka apple 100 1\n1 osaka orange 200 2\n2 osaka banana 250 3\n3 osaka banana 300 4\n<class 'pandas.core.frame.DataFrame'>\nosaka\n=============\n food2 price2 quantity2\ncity2 \nosaka apple 100 1\nosaka orange 200 2\nosaka banana 250 3\nosaka banana 300 4\ntokyo apple 150 5\ntokyo apple 200 6\ntokyo banana 400 7\n<class 'pandas.core.frame.DataFrame'>\nIndex(['osaka', 'osaka', 'osaka', 'osaka', 'tokyo', 'tokyo', 'tokyo'], dtype='object', name='city2')\n====TEST=====\n city food price quantity\n4 tokyo apple 150 5\n5 tokyo apple 200 6\n6 tokyo banana 400 7\n<class 'pandas.core.frame.DataFrame'>\ntokyo\n=============\n food2 price2 quantity2\ncity2 \nosaka apple 100 1\nosaka orange 200 2\nosaka banana 250 3\nosaka banana 300 4\ntokyo apple 150 5\ntokyo apple 200 6\ntokyo banana 400 7\n<class 'pandas.core.frame.DataFrame'>\nIndex(['osaka', 'osaka', 'osaka', 'osaka', 'tokyo', 'tokyo', 'tokyo'], dtype='object', name='city2')\n====TEST=====\n city food price quantity\n0 osaka apple 100 1\n1 osaka orange 200 2\n2 osaka banana 250 3\n3 osaka banana 300 4\n<class 'pandas.core.frame.DataFrame'>\nosaka\n=============\n food2 price2 quantity2\ncity2 \nosaka apple 100 1\nosaka orange 200 2\nosaka banana 250 3\nosaka banana 300 4\ntokyo apple 150 5\ntokyo apple 200 6\ntokyo banana 400 7\n<class 'pandas.core.frame.DataFrame'>\nIndex(['osaka', 'osaka', 'osaka', 'osaka', 'tokyo', 'tokyo', 'tokyo'], dtype='object', name='city2')\n====TEST=====\n city food price quantity\n0 osaka apple 100 1\n1 osaka orange 200 2\n2 osaka banana 250 3\n3 osaka banana 300 4\n<class 'pandas.core.frame.DataFrame'>\nosaka\n=============\n food2 price2 quantity2\ncity2 \nosaka apple 100 1\nosaka orange 200 2\nosaka banana 250 3\nosaka banana 300 4\ntokyo apple 150 5\ntokyo apple 200 6\ntokyo banana 400 7\n<class 'pandas.core.frame.DataFrame'>\nIndex(['osaka', 'osaka', 'osaka', 'osaka', 'tokyo', 'tokyo', 'tokyo'], dtype='object', name='city2')\n====TEST=====\n city food price quantity\n4 tokyo apple 150 5\n5 tokyo apple 200 6\n6 tokyo banana 400 7\n<class 'pandas.core.frame.DataFrame'>\ntokyo\n=============\n food2 price2 quantity2\ncity2 \nosaka apple 100 1\nosaka orange 200 2\nosaka banana 250 3\nosaka banana 300 4\ntokyo apple 150 5\ntokyo apple 200 6\ntokyo banana 400 7\n<class 'pandas.core.frame.DataFrame'>\nIndex(['osaka', 'osaka', 'osaka', 'osaka', 'tokyo', 'tokyo', 'tokyo'], dtype='object', name='city2')\n"
}
]
},
{
"metadata": {
"ExecuteTime": {
"end_time": "2020-08-31T16:48:00.367211Z",
"start_time": "2020-08-31T16:48:00.357616Z"
},
"trusted": true
},
"cell_type": "code",
"source": "df_lambda_apply_reindex_df",
"execution_count": 50,
"outputs": [
{
"data": {
"text/html": "<div>\n<style scoped>\n .dataframe tbody tr th:only-of-type {\n vertical-align: middle;\n }\n\n .dataframe tbody tr th {\n vertical-align: top;\n }\n\n .dataframe thead th {\n text-align: right;\n }\n</style>\n<table border=\"1\" class=\"dataframe\">\n <thead>\n <tr style=\"text-align: right;\">\n <th></th>\n <th></th>\n <th>food2</th>\n <th>price2</th>\n <th>quantity2</th>\n </tr>\n <tr>\n <th>city</th>\n <th>city2</th>\n <th></th>\n <th></th>\n <th></th>\n </tr>\n </thead>\n <tbody>\n <tr>\n <th rowspan=\"7\" valign=\"top\">osaka</th>\n <th>osaka</th>\n <td>apple</td>\n <td>100</td>\n <td>1</td>\n </tr>\n <tr>\n <th>osaka</th>\n <td>orange</td>\n <td>200</td>\n <td>2</td>\n </tr>\n <tr>\n <th>osaka</th>\n <td>banana</td>\n <td>250</td>\n <td>3</td>\n </tr>\n <tr>\n <th>osaka</th>\n <td>banana</td>\n <td>300</td>\n <td>4</td>\n </tr>\n <tr>\n <th>tokyo</th>\n <td>apple</td>\n <td>150</td>\n <td>5</td>\n </tr>\n <tr>\n <th>tokyo</th>\n <td>apple</td>\n <td>200</td>\n <td>6</td>\n </tr>\n <tr>\n <th>tokyo</th>\n <td>banana</td>\n <td>400</td>\n <td>7</td>\n </tr>\n <tr>\n <th rowspan=\"7\" valign=\"top\">tokyo</th>\n <th>osaka</th>\n <td>apple</td>\n <td>100</td>\n <td>1</td>\n </tr>\n <tr>\n <th>osaka</th>\n <td>orange</td>\n <td>200</td>\n <td>2</td>\n </tr>\n <tr>\n <th>osaka</th>\n <td>banana</td>\n <td>250</td>\n <td>3</td>\n </tr>\n <tr>\n <th>osaka</th>\n <td>banana</td>\n <td>300</td>\n <td>4</td>\n </tr>\n <tr>\n <th>tokyo</th>\n <td>apple</td>\n <td>150</td>\n <td>5</td>\n </tr>\n <tr>\n <th>tokyo</th>\n <td>apple</td>\n <td>200</td>\n <td>6</td>\n </tr>\n <tr>\n <th>tokyo</th>\n <td>banana</td>\n <td>400</td>\n <td>7</td>\n </tr>\n </tbody>\n</table>\n</div>",
"text/plain": " food2 price2 quantity2\ncity city2 \nosaka osaka apple 100 1\n osaka orange 200 2\n osaka banana 250 3\n osaka banana 300 4\n tokyo apple 150 5\n tokyo apple 200 6\n tokyo banana 400 7\ntokyo osaka apple 100 1\n osaka orange 200 2\n osaka banana 250 3\n osaka banana 300 4\n tokyo apple 150 5\n tokyo apple 200 6\n tokyo banana 400 7"
},
"execution_count": 50,
"metadata": {},
"output_type": "execute_result"
}
]
},
{
"metadata": {
"ExecuteTime": {
"end_time": "2020-08-31T16:48:00.844937Z",
"start_time": "2020-08-31T16:48:00.832731Z"
},
"trusted": true
},
"cell_type": "code",
"source": "df_lambda_apply_reindex_df_index_false",
"execution_count": 51,
"outputs": [
{
"data": {
"text/html": "<div>\n<style scoped>\n .dataframe tbody tr th:only-of-type {\n vertical-align: middle;\n }\n\n .dataframe tbody tr th {\n vertical-align: top;\n }\n\n .dataframe thead th {\n text-align: right;\n }\n</style>\n<table border=\"1\" class=\"dataframe\">\n <thead>\n <tr style=\"text-align: right;\">\n <th></th>\n <th></th>\n <th>food2</th>\n <th>price2</th>\n <th>quantity2</th>\n </tr>\n <tr>\n <th></th>\n <th>city2</th>\n <th></th>\n <th></th>\n <th></th>\n </tr>\n </thead>\n <tbody>\n <tr>\n <th rowspan=\"7\" valign=\"top\">0</th>\n <th>osaka</th>\n <td>apple</td>\n <td>100</td>\n <td>1</td>\n </tr>\n <tr>\n <th>osaka</th>\n <td>orange</td>\n <td>200</td>\n <td>2</td>\n </tr>\n <tr>\n <th>osaka</th>\n <td>banana</td>\n <td>250</td>\n <td>3</td>\n </tr>\n <tr>\n <th>osaka</th>\n <td>banana</td>\n <td>300</td>\n <td>4</td>\n </tr>\n <tr>\n <th>tokyo</th>\n <td>apple</td>\n <td>150</td>\n <td>5</td>\n </tr>\n <tr>\n <th>tokyo</th>\n <td>apple</td>\n <td>200</td>\n <td>6</td>\n </tr>\n <tr>\n <th>tokyo</th>\n <td>banana</td>\n <td>400</td>\n <td>7</td>\n </tr>\n <tr>\n <th rowspan=\"7\" valign=\"top\">1</th>\n <th>osaka</th>\n <td>apple</td>\n <td>100</td>\n <td>1</td>\n </tr>\n <tr>\n <th>osaka</th>\n <td>orange</td>\n <td>200</td>\n <td>2</td>\n </tr>\n <tr>\n <th>osaka</th>\n <td>banana</td>\n <td>250</td>\n <td>3</td>\n </tr>\n <tr>\n <th>osaka</th>\n <td>banana</td>\n <td>300</td>\n <td>4</td>\n </tr>\n <tr>\n <th>tokyo</th>\n <td>apple</td>\n <td>150</td>\n <td>5</td>\n </tr>\n <tr>\n <th>tokyo</th>\n <td>apple</td>\n <td>200</td>\n <td>6</td>\n </tr>\n <tr>\n <th>tokyo</th>\n <td>banana</td>\n <td>400</td>\n <td>7</td>\n </tr>\n </tbody>\n</table>\n</div>",
"text/plain": " food2 price2 quantity2\n city2 \n0 osaka apple 100 1\n osaka orange 200 2\n osaka banana 250 3\n osaka banana 300 4\n tokyo apple 150 5\n tokyo apple 200 6\n tokyo banana 400 7\n1 osaka apple 100 1\n osaka orange 200 2\n osaka banana 250 3\n osaka banana 300 4\n tokyo apple 150 5\n tokyo apple 200 6\n tokyo banana 400 7"
},
"execution_count": 51,
"metadata": {},
"output_type": "execute_result"
}
]
},
{
"metadata": {
"ExecuteTime": {
"end_time": "2020-08-31T16:48:01.466533Z",
"start_time": "2020-08-31T16:48:01.450203Z"
},
"trusted": true
},
"cell_type": "code",
"source": "# observed検証\n# カテゴリー型というR言語のFactor型のような性質を持ったデータ型が存在する。\n# 統計で言うところの名義尺度や順序尺度を表す。\n# カテゴりー型で宣言したDFのSeriesに対してGroupbyをかけると、カテゴリー分、groupingされるが、\n# observedをすると、観測値(データとして存在しているもののみ)出力される\n\n# DF定義\ndf2 = pd.DataFrame({\n 'city': ['osaka', 'osaka', 'osaka', 'osaka', 'tokyo', 'tokyo', 'tokyo'],\n 'food': ['apple', 'orange', 'banana', 'banana', 'apple', 'apple', 'banana'],\n 'price': [100, 200, 250, 300, 150, 200, 400],\n 'quantity': [1, 2, 3, 4, 5, 6, 7]\n})\n# カテゴリー型は観測値とカテゴリーから生成される\ncategory = pd.Categorical([\"a\", \"b\", \"c\", \"a\", \"b\", \"c\"], categories=[\"b\", \"c\", \"d\"])\nlabel = pd.Series(category)\ndf2[\"label\"] = label\ndf2",
"execution_count": 52,
"outputs": [
{
"data": {
"text/html": "<div>\n<style scoped>\n .dataframe tbody tr th:only-of-type {\n vertical-align: middle;\n }\n\n .dataframe tbody tr th {\n vertical-align: top;\n }\n\n .dataframe thead th {\n text-align: right;\n }\n</style>\n<table border=\"1\" class=\"dataframe\">\n <thead>\n <tr style=\"text-align: right;\">\n <th></th>\n <th>city</th>\n <th>food</th>\n <th>price</th>\n <th>quantity</th>\n <th>label</th>\n </tr>\n </thead>\n <tbody>\n <tr>\n <th>0</th>\n <td>osaka</td>\n <td>apple</td>\n <td>100</td>\n <td>1</td>\n <td>NaN</td>\n </tr>\n <tr>\n <th>1</th>\n <td>osaka</td>\n <td>orange</td>\n <td>200</td>\n <td>2</td>\n <td>b</td>\n </tr>\n <tr>\n <th>2</th>\n <td>osaka</td>\n <td>banana</td>\n <td>250</td>\n <td>3</td>\n <td>c</td>\n </tr>\n <tr>\n <th>3</th>\n <td>osaka</td>\n <td>banana</td>\n <td>300</td>\n <td>4</td>\n <td>NaN</td>\n </tr>\n <tr>\n <th>4</th>\n <td>tokyo</td>\n <td>apple</td>\n <td>150</td>\n <td>5</td>\n <td>b</td>\n </tr>\n <tr>\n <th>5</th>\n <td>tokyo</td>\n <td>apple</td>\n <td>200</td>\n <td>6</td>\n <td>c</td>\n </tr>\n <tr>\n <th>6</th>\n <td>tokyo</td>\n <td>banana</td>\n <td>400</td>\n <td>7</td>\n <td>NaN</td>\n </tr>\n </tbody>\n</table>\n</div>",
"text/plain": " city food price quantity label\n0 osaka apple 100 1 NaN\n1 osaka orange 200 2 b\n2 osaka banana 250 3 c\n3 osaka banana 300 4 NaN\n4 tokyo apple 150 5 b\n5 tokyo apple 200 6 c\n6 tokyo banana 400 7 NaN"
},
"execution_count": 52,
"metadata": {},
"output_type": "execute_result"
}
]
},
{
"metadata": {
"ExecuteTime": {
"end_time": "2020-08-31T16:48:02.184490Z",
"start_time": "2020-08-31T16:48:02.125876Z"
},
"trusted": true
},
"cell_type": "code",
"source": "# observed=Falseの場合: カテゴリーに存在する \"d\" のレコードが出力される。\ndf2.groupby('label').describe() ",
"execution_count": 53,
"outputs": [
{
"data": {
"text/html": "<div>\n<style scoped>\n .dataframe tbody tr th:only-of-type {\n vertical-align: middle;\n }\n\n .dataframe tbody tr th {\n vertical-align: top;\n }\n\n .dataframe thead tr th {\n text-align: left;\n }\n\n .dataframe thead tr:last-of-type th {\n text-align: right;\n }\n</style>\n<table border=\"1\" class=\"dataframe\">\n <thead>\n <tr>\n <th></th>\n <th colspan=\"8\" halign=\"left\">price</th>\n <th colspan=\"8\" halign=\"left\">quantity</th>\n </tr>\n <tr>\n <th></th>\n <th>count</th>\n <th>mean</th>\n <th>std</th>\n <th>min</th>\n <th>25%</th>\n <th>50%</th>\n <th>75%</th>\n <th>max</th>\n <th>count</th>\n <th>mean</th>\n <th>std</th>\n <th>min</th>\n <th>25%</th>\n <th>50%</th>\n <th>75%</th>\n <th>max</th>\n </tr>\n <tr>\n <th>label</th>\n <th></th>\n <th></th>\n <th></th>\n <th></th>\n <th></th>\n <th></th>\n <th></th>\n <th></th>\n <th></th>\n <th></th>\n <th></th>\n <th></th>\n <th></th>\n <th></th>\n <th></th>\n <th></th>\n </tr>\n </thead>\n <tbody>\n <tr>\n <th>b</th>\n <td>2.0</td>\n <td>175.0</td>\n <td>35.355339</td>\n <td>150.0</td>\n <td>162.5</td>\n <td>175.0</td>\n <td>187.5</td>\n <td>200.0</td>\n <td>2.0</td>\n <td>3.5</td>\n <td>2.12132</td>\n <td>2.0</td>\n <td>2.75</td>\n <td>3.5</td>\n <td>4.25</td>\n <td>5.0</td>\n </tr>\n <tr>\n <th>c</th>\n <td>2.0</td>\n <td>225.0</td>\n <td>35.355339</td>\n <td>200.0</td>\n <td>212.5</td>\n <td>225.0</td>\n <td>237.5</td>\n <td>250.0</td>\n <td>2.0</td>\n <td>4.5</td>\n <td>2.12132</td>\n <td>3.0</td>\n <td>3.75</td>\n <td>4.5</td>\n <td>5.25</td>\n <td>6.0</td>\n </tr>\n <tr>\n <th>d</th>\n <td>0.0</td>\n <td>NaN</td>\n <td>NaN</td>\n <td>NaN</td>\n <td>NaN</td>\n <td>NaN</td>\n <td>NaN</td>\n <td>NaN</td>\n <td>0.0</td>\n <td>NaN</td>\n <td>NaN</td>\n <td>NaN</td>\n <td>NaN</td>\n <td>NaN</td>\n <td>NaN</td>\n <td>NaN</td>\n </tr>\n </tbody>\n</table>\n</div>",
"text/plain": " price quantity \\\n count mean std min 25% 50% 75% max count \nlabel \nb 2.0 175.0 35.355339 150.0 162.5 175.0 187.5 200.0 2.0 \nc 2.0 225.0 35.355339 200.0 212.5 225.0 237.5 250.0 2.0 \nd 0.0 NaN NaN NaN NaN NaN NaN NaN 0.0 \n\n \n mean std min 25% 50% 75% max \nlabel \nb 3.5 2.12132 2.0 2.75 3.5 4.25 5.0 \nc 4.5 2.12132 3.0 3.75 4.5 5.25 6.0 \nd NaN NaN NaN NaN NaN NaN NaN "
},
"execution_count": 53,
"metadata": {},
"output_type": "execute_result"
}
]
},
{
"metadata": {
"ExecuteTime": {
"end_time": "2020-08-31T16:48:02.723174Z",
"start_time": "2020-08-31T16:48:02.665737Z"
},
"trusted": true
},
"cell_type": "code",
"source": "# observed=Trueの場合: カテゴリーに存在する \"d\" のレコードはデータに存在しないため、出力されない。\ndf2.groupby('label', observed=True).describe()",
"execution_count": 54,
"outputs": [
{
"data": {
"text/html": "<div>\n<style scoped>\n .dataframe tbody tr th:only-of-type {\n vertical-align: middle;\n }\n\n .dataframe tbody tr th {\n vertical-align: top;\n }\n\n .dataframe thead tr th {\n text-align: left;\n }\n\n .dataframe thead tr:last-of-type th {\n text-align: right;\n }\n</style>\n<table border=\"1\" class=\"dataframe\">\n <thead>\n <tr>\n <th></th>\n <th colspan=\"8\" halign=\"left\">price</th>\n <th colspan=\"8\" halign=\"left\">quantity</th>\n </tr>\n <tr>\n <th></th>\n <th>count</th>\n <th>mean</th>\n <th>std</th>\n <th>min</th>\n <th>25%</th>\n <th>50%</th>\n <th>75%</th>\n <th>max</th>\n <th>count</th>\n <th>mean</th>\n <th>std</th>\n <th>min</th>\n <th>25%</th>\n <th>50%</th>\n <th>75%</th>\n <th>max</th>\n </tr>\n <tr>\n <th>label</th>\n <th></th>\n <th></th>\n <th></th>\n <th></th>\n <th></th>\n <th></th>\n <th></th>\n <th></th>\n <th></th>\n <th></th>\n <th></th>\n <th></th>\n <th></th>\n <th></th>\n <th></th>\n <th></th>\n </tr>\n </thead>\n <tbody>\n <tr>\n <th>b</th>\n <td>2.0</td>\n <td>175.0</td>\n <td>35.355339</td>\n <td>150.0</td>\n <td>162.5</td>\n <td>175.0</td>\n <td>187.5</td>\n <td>200.0</td>\n <td>2.0</td>\n <td>3.5</td>\n <td>2.12132</td>\n <td>2.0</td>\n <td>2.75</td>\n <td>3.5</td>\n <td>4.25</td>\n <td>5.0</td>\n </tr>\n <tr>\n <th>c</th>\n <td>2.0</td>\n <td>225.0</td>\n <td>35.355339</td>\n <td>200.0</td>\n <td>212.5</td>\n <td>225.0</td>\n <td>237.5</td>\n <td>250.0</td>\n <td>2.0</td>\n <td>4.5</td>\n <td>2.12132</td>\n <td>3.0</td>\n <td>3.75</td>\n <td>4.5</td>\n <td>5.25</td>\n <td>6.0</td>\n </tr>\n </tbody>\n</table>\n</div>",
"text/plain": " price quantity \\\n count mean std min 25% 50% 75% max count \nlabel \nb 2.0 175.0 35.355339 150.0 162.5 175.0 187.5 200.0 2.0 \nc 2.0 225.0 35.355339 200.0 212.5 225.0 237.5 250.0 2.0 \n\n \n mean std min 25% 50% 75% max \nlabel \nb 3.5 2.12132 2.0 2.75 3.5 4.25 5.0 \nc 4.5 2.12132 3.0 3.75 4.5 5.25 6.0 "
},
"execution_count": 54,
"metadata": {},
"output_type": "execute_result"
}
]
}
],
"metadata": {
"celltoolbar": "Raw Cell Format",
"gist": {
"id": "",
"data": {
"description": "groupby_test.ipynb",
"public": true
}
},
"kernelspec": {
"name": "python3",
"display_name": "Python 3",
"language": "python"
},
"language_info": {
"name": "python",
"version": "3.7.3",
"mimetype": "text/x-python",
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"pygments_lexer": "ipython3",
"nbconvert_exporter": "python",
"file_extension": ".py"
},
"toc": {
"nav_menu": {},
"number_sections": true,
"sideBar": true,
"skip_h1_title": false,
"base_numbering": 1,
"title_cell": "Table of Contents",
"title_sidebar": "Contents",
"toc_cell": false,
"toc_position": {},
"toc_section_display": true,
"toc_window_display": false
}
},
"nbformat": 4,
"nbformat_minor": 4
}
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment