Last active
February 21, 2020 19:57
{ | |
"cells": [ | |
{ | |
"cell_type": "code", | |
"execution_count": 1, | |
"metadata": {}, | |
"outputs": [], | |
"source": [ | |
"import numpy as np\n", | |
"import pandas as pd" | |
] | |
}, | |
{ | |
"cell_type": "code", | |
"execution_count": 29, | |
"metadata": {}, | |
"outputs": [ | |
{ | |
"data": { | |
"text/plain": [ | |
"0 63\n", | |
"1 80\n", | |
"2 61\n", | |
"3 28\n", | |
"4 4\n", | |
" ..\n", | |
"999995 45\n", | |
"999996 64\n", | |
"999997 34\n", | |
"999998 32\n", | |
"999999 92\n", | |
"Length: 1000000, dtype: int64" | |
] | |
}, | |
"execution_count": 29, | |
"metadata": {}, | |
"output_type": "execute_result" | |
} | |
], | |
"source": [ | |
"rand_series = pd.Series(np.random.randint(0, 100, size=1_000_000))\n", | |
"rand_series" | |
] | |
}, | |
{ | |
"cell_type": "code", | |
"execution_count": 30, | |
"metadata": {}, | |
"outputs": [], | |
"source": [ | |
"def custom_func(n):\n", | |
"    return n ** 3\n", | |
"\n", | |
"def process_series_2loc(series):\n", | |
"    for idx in series.index:\n", | |
"        series.loc[idx] = custom_func(series.loc[idx])\n", | |
"\n", | |
"def process_series_2at(series):\n", | |
"    for idx in series.index:\n", | |
"        series.at[idx] = custom_func(series.at[idx])\n", | |
"\n", | |
"def process_series_iteritems(series):\n", | |
"    # Series.iteritems() was removed in pandas 2.0; items() is the equivalent.\n", | |
"    for idx, n in series.items():\n", | |
"        series.at[idx] = custom_func(n)\n", | |
"\n", | |
"def process_series_iat(series):\n", | |
"    # items() yields labels; they double as positions only with the default RangeIndex.\n", | |
"    for pos, n in series.items():\n", | |
"        series.iat[pos] = custom_func(n)\n", | |
"\n", | |
"def process_series_itervalues_plain(series):\n", | |
"    # Match the input dtype so the result is not silently cast to float64.\n", | |
"    arr = np.zeros(len(series), dtype=series.dtype)\n", | |
"    for i, val in enumerate(series.values):\n", | |
"        arr[i] = custom_func(val)\n", | |
"\n", | |
"    return pd.Series(arr)\n", | |
"\n", | |
"def process_series_apply(series):\n", | |
"    return series.apply(custom_func)\n" | |
] | |
}, | |
{ | |
"cell_type": "code", | |
"execution_count": 31, | |
"metadata": {}, | |
"outputs": [ | |
{ | |
"name": "stdout", | |
"output_type": "stream", | |
"text": [ | |
"1min 38s ± 188 ms per loop (mean ± std. dev. of 2 runs, 2 loops each)\n" | |
] | |
} | |
], | |
"source": [ | |
"%timeit -n 2 -r 2 process_series_2loc(rand_series.copy())" | |
] | |
}, | |
{ | |
"cell_type": "code", | |
"execution_count": 32, | |
"metadata": {}, | |
"outputs": [ | |
{ | |
"name": "stdout", | |
"output_type": "stream", | |
"text": [ | |
"32.5 s ± 107 ms per loop (mean ± std. dev. of 2 runs, 2 loops each)\n" | |
] | |
} | |
], | |
"source": [ | |
"%timeit -n 2 -r 2 process_series_2at(rand_series.copy())" | |
] | |
}, | |
{ | |
"cell_type": "code", | |
"execution_count": 33, | |
"metadata": {}, | |
"outputs": [ | |
{ | |
"name": "stdout", | |
"output_type": "stream", | |
"text": [ | |
"13.9 s ± 59.5 ms per loop (mean ± std. dev. of 2 runs, 2 loops each)\n" | |
] | |
} | |
], | |
"source": [ | |
"%timeit -n 2 -r 2 process_series_iteritems(rand_series.copy())" | |
] | |
}, | |
{ | |
"cell_type": "code", | |
"execution_count": 34, | |
"metadata": {}, | |
"outputs": [ | |
{ | |
"name": "stdout", | |
"output_type": "stream", | |
"text": [ | |
"4.45 s ± 7.85 ms per loop (mean ± std. dev. of 2 runs, 2 loops each)\n" | |
] | |
} | |
], | |
"source": [ | |
"%timeit -n 2 -r 2 process_series_iat(rand_series.copy())" | |
] | |
}, | |
{ | |
"cell_type": "code", | |
"execution_count": 35, | |
"metadata": {}, | |
"outputs": [ | |
{ | |
"name": "stdout", | |
"output_type": "stream", | |
"text": [ | |
"474 ms ± 14.8 ms per loop (mean ± std. dev. of 2 runs, 2 loops each)\n" | |
] | |
} | |
], | |
"source": [ | |
"%timeit -n 2 -r 2 process_series_itervalues_plain(rand_series.copy())" | |
] | |
}, | |
{ | |
"cell_type": "code", | |
"execution_count": 36, | |
"metadata": {}, | |
"outputs": [ | |
{ | |
"name": "stdout", | |
"output_type": "stream", | |
"text": [ | |
"509 ms ± 10.2 ms per loop (mean ± std. dev. of 2 runs, 2 loops each)\n" | |
] | |
} | |
], | |
"source": [ | |
"%timeit -n 2 -r 2 process_series_apply(rand_series.copy())" | |
] | |
}, | |
{ | |
"cell_type": "markdown", | |
"metadata": {}, | |
"source": [ | |
"### And finally, the vectorized ways" | |
] | |
}, | |
{ | |
"cell_type": "code", | |
"execution_count": 40, | |
"metadata": {}, | |
"outputs": [ | |
{ | |
"name": "stdout", | |
"output_type": "stream", | |
"text": [ | |
"4.8 ms ± 365 µs per loop (mean ± std. dev. of 2 runs, 2 loops each)\n" | |
] | |
} | |
], | |
"source": [ | |
"%timeit -n 2 -r 2 rand_series ** 3" | |
] | |
}, | |
{ | |
"cell_type": "code", | |
"execution_count": 41, | |
"metadata": {}, | |
"outputs": [ | |
{ | |
"name": "stdout", | |
"output_type": "stream", | |
"text": [ | |
"4.36 ms ± 77.2 µs per loop (mean ± std. dev. of 2 runs, 2 loops each)\n" | |
] | |
} | |
], | |
"source": [ | |
"%timeit -n 2 -r 2 rand_series.pow(3)" | |
] | |
}, | |
{ | |
"cell_type": "markdown", | |
"metadata": {}, | |
"source": [ | |
"## Conclusions" | |
] | |
}, | |
{ | |
"cell_type": "markdown", | |
"metadata": {}, | |
"source": [ | |
"### Main\n", | |
"**1.** Prefer vectorized methods whenever you can: above, `rand_series ** 3` and `rand_series.pow(3)` ran in under 5 ms, roughly 100 times faster than `apply` (509 ms). If you have a custom function, try to split it into independent vectorized steps.\n", | |
"\n", | |
"**2.** If you absolutely have to use a custom function, use `apply` to run it over the Series/DataFrame entries.\n", | |
"\n", | |
"**3.** Avoid \"manually\" iterating over a Series/DataFrame with per-element accessors. Every access adds overhead, and over a million entries it adds up: the loops above took between 4.45 s and 1 min 38 s.\n", | |
"\n", | |
"### Secondary\n", | |
"\n", | |
"**1.** `Series.loc` is slow for single entries: the `.loc` loop took 1 min 38 s versus 32.5 s for the same loop with `.at`. To read or modify a single entry by index label, use `Series.at` instead.\n", | |
"\n", | |
"**2.** Modifying an entry by position (`Series.iat`) was about 3 times faster than doing it by index label (`Series.at`): 4.45 s versus 13.9 s." | |
] | |
} | |
], | |
"metadata": { | |
"kernelspec": { | |
"display_name": "Python 3", | |
"language": "python", | |
"name": "python3" | |
}, | |
"language_info": { | |
"codemirror_mode": { | |
"name": "ipython", | |
"version": 3 | |
}, | |
"file_extension": ".py", | |
"mimetype": "text/x-python", | |
"name": "python", | |
"nbconvert_exporter": "python", | |
"pygments_lexer": "ipython3", | |
"version": "3.8.1" | |
} | |
}, | |
"nbformat": 4, | |
"nbformat_minor": 4 | |
} |