Skip to content

Instantly share code, notes, and snippets.

@jshirius
Created January 8, 2021 12:07
Show Gist options
  • Save jshirius/b1346d5170807a1e4c617b75fe2e58b7 to your computer and use it in GitHub Desktop.
Save jshirius/b1346d5170807a1e4c617b75fe2e58b7 to your computer and use it in GitHub Desktop.
pandasの速度最適化サンプル
Display the source blob
Display the rendered blob
Raw
{
"cells": [
{
"cell_type": "markdown",
"metadata": {},
"source": [
" # Pandasのデータ処理の速度比較\n",
" \n",
"- pandasのデータ参照の速度、更新の速度を比較する\n",
"- データは100万行のデータを使用する(kaggleのriiidコンペのデータより)\n",
"- 結論は、データ量が多いときは、pandasのままで処理をするのではなく、一度配列にするなど対応したほうが早くなる\n",
"- 更新に関しては、ここでは取り上げていないが、map関数使うのも効果的だ。\n"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## pandas速度比較結果は以下の通り\n",
"\n",
"| やりたいこと | 方法 | かかった時間 |\n",
"|--------------------|---------------------------------------------------------------------------------------------------------|--------------|\n",
"| pandasのデータ参照 | for文にてpandasのiterrowsで参照 | 1分1秒 |\n",
"| pandasのデータ参照 | for文にてpandasから配列に変換して参照 | 753ミリ秒 |\n",
"| pandasのデータ更新 | for文にてpandasのiterrowsで参照 更新は、pandas.atで更新する | 1分25秒 |\n",
"| pandasのデータ更新 | for文にてpandasから配列に変換して参照 更新は、pandas.atで更新する | 16秒 |\n",
"| pandasのデータ更新 | for文にてpandasから配列に変換して参照 更新は、一旦更新元のデータを配列に格納して、for文抜けたあとで更新 | 979ミリ秒 |"
]
},
{
"cell_type": "code",
"execution_count": 1,
"metadata": {
"_cell_guid": "79c7e3d0-c299-4dcb-8224-4455121ee9b0",
"_uuid": "d629ff2d2480ee46fbb7e2d37f6b5fab8052498a"
},
"outputs": [],
"source": [
"import datetime\n",
"\n",
"try:\n",
" import pandas as pd\n",
" import time\n",
" from tqdm import tqdm\n",
" import psutil\n",
" \n",
"\n",
"except ImportError as e:\n",
" print(\"{} Import Processed finished with error:{}\".format(str(datetime.datetime.now()), e))\n",
"\n",
"\n"
]
},
{
"cell_type": "code",
"execution_count": 2,
"metadata": {},
"outputs": [],
"source": [
"data_types_dict = {\n",
" 'timestamp': 'int64',\n",
" 'user_id': 'int32', \n",
" 'content_id': 'int16', \n",
" 'content_type_id':'int8', \n",
" 'task_container_id': 'int16',\n",
" 'user_answer': 'int8',\n",
" 'answered_correctly': 'int8', \n",
" 'prior_question_elapsed_time': 'float32', \n",
" 'prior_question_had_explanation': 'bool'\n",
"}\n",
"target = 'answered_correctly'\n",
"\n",
"#データは以下からダウンロードしてください\n",
"#https://www.kaggle.com/c/riiid-test-answer-prediction/data\n",
"train_df = pd.read_csv('./riiid-test-answer-prediction/train.csv',nrows = 1000000)\n",
"\n",
"#型を宣言しないと、無駄にメモリを使う(例:本来int8で足りるのに、int64で確保される)のでしっかり型宣言しておく\n",
"train_df = train_df.astype(data_types_dict)\n",
"\n"
]
},
{
"cell_type": "code",
"execution_count": 3,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"<div>\n",
"<style scoped>\n",
" .dataframe tbody tr th:only-of-type {\n",
" vertical-align: middle;\n",
" }\n",
"\n",
" .dataframe tbody tr th {\n",
" vertical-align: top;\n",
" }\n",
"\n",
" .dataframe thead th {\n",
" text-align: right;\n",
" }\n",
"</style>\n",
"<table border=\"1\" class=\"dataframe\">\n",
" <thead>\n",
" <tr style=\"text-align: right;\">\n",
" <th></th>\n",
" <th>row_id</th>\n",
" <th>timestamp</th>\n",
" <th>user_id</th>\n",
" <th>content_id</th>\n",
" <th>content_type_id</th>\n",
" <th>task_container_id</th>\n",
" <th>user_answer</th>\n",
" <th>answered_correctly</th>\n",
" <th>prior_question_elapsed_time</th>\n",
" <th>prior_question_had_explanation</th>\n",
" </tr>\n",
" </thead>\n",
" <tbody>\n",
" <tr>\n",
" <th>0</th>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>115</td>\n",
" <td>5692</td>\n",
" <td>0</td>\n",
" <td>1</td>\n",
" <td>3</td>\n",
" <td>1</td>\n",
" <td>NaN</td>\n",
" <td>True</td>\n",
" </tr>\n",
" <tr>\n",
" <th>1</th>\n",
" <td>1</td>\n",
" <td>56943</td>\n",
" <td>115</td>\n",
" <td>5716</td>\n",
" <td>0</td>\n",
" <td>2</td>\n",
" <td>2</td>\n",
" <td>1</td>\n",
" <td>37000.0</td>\n",
" <td>False</td>\n",
" </tr>\n",
" <tr>\n",
" <th>2</th>\n",
" <td>2</td>\n",
" <td>118363</td>\n",
" <td>115</td>\n",
" <td>128</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>1</td>\n",
" <td>55000.0</td>\n",
" <td>False</td>\n",
" </tr>\n",
" <tr>\n",
" <th>3</th>\n",
" <td>3</td>\n",
" <td>131167</td>\n",
" <td>115</td>\n",
" <td>7860</td>\n",
" <td>0</td>\n",
" <td>3</td>\n",
" <td>0</td>\n",
" <td>1</td>\n",
" <td>19000.0</td>\n",
" <td>False</td>\n",
" </tr>\n",
" <tr>\n",
" <th>4</th>\n",
" <td>4</td>\n",
" <td>137965</td>\n",
" <td>115</td>\n",
" <td>7922</td>\n",
" <td>0</td>\n",
" <td>4</td>\n",
" <td>1</td>\n",
" <td>1</td>\n",
" <td>11000.0</td>\n",
" <td>False</td>\n",
" </tr>\n",
" <tr>\n",
" <th>...</th>\n",
" <td>...</td>\n",
" <td>...</td>\n",
" <td>...</td>\n",
" <td>...</td>\n",
" <td>...</td>\n",
" <td>...</td>\n",
" <td>...</td>\n",
" <td>...</td>\n",
" <td>...</td>\n",
" <td>...</td>\n",
" </tr>\n",
" <tr>\n",
" <th>999995</th>\n",
" <td>999995</td>\n",
" <td>26482248</td>\n",
" <td>20949024</td>\n",
" <td>8803</td>\n",
" <td>0</td>\n",
" <td>29</td>\n",
" <td>1</td>\n",
" <td>1</td>\n",
" <td>14000.0</td>\n",
" <td>True</td>\n",
" </tr>\n",
" <tr>\n",
" <th>999996</th>\n",
" <td>999996</td>\n",
" <td>26516686</td>\n",
" <td>20949024</td>\n",
" <td>4664</td>\n",
" <td>0</td>\n",
" <td>30</td>\n",
" <td>3</td>\n",
" <td>1</td>\n",
" <td>17000.0</td>\n",
" <td>True</td>\n",
" </tr>\n",
" <tr>\n",
" <th>999997</th>\n",
" <td>999997</td>\n",
" <td>26537967</td>\n",
" <td>20949024</td>\n",
" <td>4108</td>\n",
" <td>0</td>\n",
" <td>31</td>\n",
" <td>1</td>\n",
" <td>0</td>\n",
" <td>18000.0</td>\n",
" <td>True</td>\n",
" </tr>\n",
" <tr>\n",
" <th>999998</th>\n",
" <td>999998</td>\n",
" <td>26590240</td>\n",
" <td>20949024</td>\n",
" <td>5014</td>\n",
" <td>0</td>\n",
" <td>32</td>\n",
" <td>3</td>\n",
" <td>0</td>\n",
" <td>6000.0</td>\n",
" <td>True</td>\n",
" </tr>\n",
" <tr>\n",
" <th>999999</th>\n",
" <td>999999</td>\n",
" <td>26605316</td>\n",
" <td>20949024</td>\n",
" <td>9828</td>\n",
" <td>0</td>\n",
" <td>33</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>46000.0</td>\n",
" <td>True</td>\n",
" </tr>\n",
" </tbody>\n",
"</table>\n",
"<p>1000000 rows × 10 columns</p>\n",
"</div>"
],
"text/plain": [
" row_id timestamp user_id content_id content_type_id \\\n",
"0 0 0 115 5692 0 \n",
"1 1 56943 115 5716 0 \n",
"2 2 118363 115 128 0 \n",
"3 3 131167 115 7860 0 \n",
"4 4 137965 115 7922 0 \n",
"... ... ... ... ... ... \n",
"999995 999995 26482248 20949024 8803 0 \n",
"999996 999996 26516686 20949024 4664 0 \n",
"999997 999997 26537967 20949024 4108 0 \n",
"999998 999998 26590240 20949024 5014 0 \n",
"999999 999999 26605316 20949024 9828 0 \n",
"\n",
" task_container_id user_answer answered_correctly \\\n",
"0 1 3 1 \n",
"1 2 2 1 \n",
"2 0 0 1 \n",
"3 3 0 1 \n",
"4 4 1 1 \n",
"... ... ... ... \n",
"999995 29 1 1 \n",
"999996 30 3 1 \n",
"999997 31 1 0 \n",
"999998 32 3 0 \n",
"999999 33 0 0 \n",
"\n",
" prior_question_elapsed_time prior_question_had_explanation \n",
"0 NaN True \n",
"1 37000.0 False \n",
"2 55000.0 False \n",
"3 19000.0 False \n",
"4 11000.0 False \n",
"... ... ... \n",
"999995 14000.0 True \n",
"999996 17000.0 True \n",
"999997 18000.0 True \n",
"999998 6000.0 True \n",
"999999 46000.0 True \n",
"\n",
"[1000000 rows x 10 columns]"
]
},
"execution_count": 3,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"#データ構成は、以下のような感じ\n",
"train_df"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# pandasの高速化比較"
]
},
{
"cell_type": "code",
"execution_count": 4,
"metadata": {},
"outputs": [
{
"name": "stderr",
"output_type": "stream",
"text": [
"1000000it [01:01, 16357.97it/s]"
]
},
{
"name": "stdout",
"output_type": "stream",
"text": [
"CPU times: user 59.9 s, sys: 920 ms, total: 1min\n",
"Wall time: 1min 1s\n"
]
},
{
"name": "stderr",
"output_type": "stream",
"text": [
"\n"
]
}
],
"source": [
"%%time\n",
"#pandasのiterrowsによる繰り返し\n",
"#マジで遅い!\n",
"\n",
"for index , row in tqdm(train_df.iterrows()):\n",
" a = 1\n"
]
},
{
"cell_type": "code",
"execution_count": 5,
"metadata": {},
"outputs": [
{
"name": "stderr",
"output_type": "stream",
"text": [
"100%|██████████| 1000000/1000000 [00:00<00:00, 2161721.54it/s]"
]
},
{
"name": "stdout",
"output_type": "stream",
"text": [
"CPU times: user 632 ms, sys: 121 ms, total: 753 ms\n",
"Wall time: 753 ms\n"
]
},
{
"name": "stderr",
"output_type": "stream",
"text": [
"\n"
]
}
],
"source": [
"%%time\n",
"#pandasを配列にしてから処理する\n",
"#かなり速い! 少なくともtrain_df.iterrowsの60倍以上速い\n",
"for row in tqdm(train_df.values):\n",
" a = 1\n"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# Pandasのデータ更新"
]
},
{
"cell_type": "code",
"execution_count": 6,
"metadata": {},
"outputs": [
{
"name": "stderr",
"output_type": "stream",
"text": [
"1000000it [01:25, 11756.64it/s]"
]
},
{
"name": "stdout",
"output_type": "stream",
"text": [
"CPU times: user 1min 23s, sys: 1.28 s, total: 1min 24s\n",
"Wall time: 1min 25s\n"
]
},
{
"name": "stderr",
"output_type": "stream",
"text": [
"\n"
]
}
],
"source": [
"%%time\n",
"#pandasのiterrowsによる繰り返しで「user_answer」を更新してみる\n",
"#遅い!\n",
"for index , row in tqdm(train_df.iterrows()):\n",
" \n",
" train_df.at[index, 'user_answer'] = 11"
]
},
{
"cell_type": "code",
"execution_count": 7,
"metadata": {},
"outputs": [
{
"name": "stderr",
"output_type": "stream",
"text": [
"100%|██████████| 1000000/1000000 [00:15<00:00, 63742.74it/s]"
]
},
{
"name": "stdout",
"output_type": "stream",
"text": [
"CPU times: user 15.6 s, sys: 318 ms, total: 15.9 s\n",
"Wall time: 16 s\n"
]
},
{
"name": "stderr",
"output_type": "stream",
"text": [
"\n"
]
}
],
"source": [
"%%time\n",
"#for文は、配列に変換するが、更新はpandasで実施\n",
"#まだまだ遅い\n",
"for row in tqdm(train_df.values):\n",
" train_df.at[index, 'user_answer'] = 11"
]
},
{
"cell_type": "code",
"execution_count": 8,
"metadata": {},
"outputs": [
{
"name": "stderr",
"output_type": "stream",
"text": [
"100%|██████████| 1000000/1000000 [00:00<00:00, 1911388.02it/s]\n"
]
},
{
"name": "stdout",
"output_type": "stream",
"text": [
"CPU times: user 847 ms, sys: 130 ms, total: 977 ms\n",
"Wall time: 979 ms\n"
]
}
],
"source": [
"%%time\n",
"#for文は配列、更新は一旦配列を用意して、for文を抜けたタイミングで更新する\n",
"user_answer = []\n",
"for row in tqdm(train_df.values):\n",
" user_answer.append(11)\n",
" \n",
"train_df['user_answer'] = user_answer\n",
"\n",
"del user_answer\n"
]
}
],
"metadata": {
"kernelspec": {
"display_name": "Python 3",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.7.3"
}
},
"nbformat": 4,
"nbformat_minor": 4
}
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment