
@lucahammer
Created February 26, 2020 14:54
Performance test: Intel i5-4670K at 4.5 GHz, Windows 10
{
"cells": [
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# Python Performance Tests\n",
"A small collection of operations that are typical for my daily work with real data to compare different setups."
]
},
{
"cell_type": "code",
"execution_count": 1,
"metadata": {
"scrolled": true
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Created `%t` as an alias for `%timeit`.\n",
"Created `%%t` as an alias for `%%timeit`.\n"
]
}
],
"source": [
"%alias_magic t timeit\n",
"\n",
"import pandas as pd\n",
"import dask.dataframe as dd\n",
"\n",
"df = pd.read_json('test-tweets.jsonl', lines=True)\n",
"dask_df = dd.from_pandas(df, npartitions=5)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Let's start with loading data. About half a million Tweet objects as json lines."
]
},
{
"cell_type": "code",
"execution_count": 2,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"1min 21s ± 2.98 s per loop (mean ± std. dev. of 7 runs, 1 loop each)\n"
]
}
],
"source": [
"%t df = pd.read_json('test-tweets.jsonl', lines=True)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"For some multi-threaded processing I want the dataframe as a dask dataframe as well."
]
},
{
"cell_type": "code",
"execution_count": 3,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"39.6 s ± 454 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)\n"
]
}
],
"source": [
"%t dask_df = dd.from_pandas(df, npartitions=5)"
]
},
{
"cell_type": "code",
"execution_count": 4,
"metadata": {},
"outputs": [],
"source": [
"def get_full_text(row):\n",
" if row['truncated']:\n",
" return(row['extended_tweet']['full_text'])\n",
" return (row['text'])"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Twitter hides the full text of Tweets with more than 140 characters in a sub-field. I want one column that has always the complete text."
]
},
{
"cell_type": "code",
"execution_count": 5,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"30.8 s ± 51.6 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)\n"
]
}
],
"source": [
"%t df.apply(get_full_text, axis=1)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Can this be done faster with Dask? "
]
},
{
"cell_type": "code",
"execution_count": 6,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"32.1 s ± 99.6 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)\n"
]
}
],
"source": [
"%t dask_df.apply(get_full_text, axis=1, meta=('string')).compute()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Maybe with processes instead of threads?"
]
},
{
"cell_type": "code",
"execution_count": 7,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"1min 5s ± 5.27 s per loop (mean ± std. dev. of 7 runs, 1 loop each)\n"
]
}
],
"source": [
"%t dask_df.apply(get_full_text, axis=1, meta=('string')).compute(scheduler='processes')"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Or by computing it partition wise?"
]
},
{
"cell_type": "code",
"execution_count": 8,
"metadata": {
"scrolled": false
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"31.9 s ± 280 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)\n"
]
}
],
"source": [
"%t dask_df.map_partitions(lambda ldf: ldf.apply(get_full_text, axis=1), meta=('string')).compute()"
]
},
{
"cell_type": "code",
"execution_count": 9,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"1min 6s ± 7.88 s per loop (mean ± std. dev. of 7 runs, 1 loop each)\n"
]
}
],
"source": [
"%t dask_df.map_partitions(lambda ldf: ldf.apply(get_full_text, axis=1), meta=('string')).compute(scheduler='processes')"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"That didn't work. It's the first time I tried Dask as I hoped that it would make better use of the high core count. I will have to find a better approach.\n",
"\n",
"But grouping is faster with Dask. For example to show which apps were used and for how many Tweets."
]
},
{
"cell_type": "code",
"execution_count": 10,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"912 ms ± 3.09 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)\n"
]
}
],
"source": [
"%t len(df.groupby('source').count())"
]
},
{
"cell_type": "code",
"execution_count": 11,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"856 ms ± 4.96 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)\n"
]
}
],
"source": [
"%t len(dask_df.groupby('source').count().compute())"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Finally, I want to store some data…"
]
},
{
"cell_type": "code",
"execution_count": 12,
"metadata": {},
"outputs": [
{
"name": "stderr",
"output_type": "stream",
"text": [
"C:\\Users\\lucah\\Anaconda3\\envs\\Performance Test\\lib\\site-packages\\pyarrow\\feather.py:83: FutureWarning: The SparseDataFrame class is removed from pandas. Accessing it from the top-level namespace will also be removed in the next version\n",
" if isinstance(df, _pandas_api.pd.SparseDataFrame):\n"
]
},
{
"name": "stdout",
"output_type": "stream",
"text": [
"222 ms ± 2.2 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)\n"
]
}
],
"source": [
"%t df[['created_at', 'id', 'text']].to_feather('perf_test.feather')"
]
},
{
"cell_type": "code",
"execution_count": 13,
"metadata": {},
"outputs": [
{
"name": "stderr",
"output_type": "stream",
"text": [
"C:\\Users\\lucah\\Anaconda3\\envs\\Performance Test\\lib\\site-packages\\pyarrow\\pandas_compat.py:383: FutureWarning: RangeIndex._start is deprecated and will be removed in a future version. Use RangeIndex.start instead\n",
" 'start': level._start,\n",
"C:\\Users\\lucah\\Anaconda3\\envs\\Performance Test\\lib\\site-packages\\pyarrow\\pandas_compat.py:384: FutureWarning: RangeIndex._stop is deprecated and will be removed in a future version. Use RangeIndex.stop instead\n",
" 'stop': level._stop,\n",
"C:\\Users\\lucah\\Anaconda3\\envs\\Performance Test\\lib\\site-packages\\pyarrow\\pandas_compat.py:385: FutureWarning: RangeIndex._step is deprecated and will be removed in a future version. Use RangeIndex.step instead\n",
" 'step': level._step\n"
]
},
{
"name": "stdout",
"output_type": "stream",
"text": [
"308 ms ± 1.04 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)\n"
]
}
],
"source": [
"%t df[['created_at', 'id', 'text']].to_parquet('perf_test.parquet')"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"…and read it again."
]
},
{
"cell_type": "code",
"execution_count": 14,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"63.4 ms ± 2.09 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)\n"
]
}
],
"source": [
"%t pd.read_feather('perf_test.feather')"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Save the IDs to share them with other researchers."
]
},
{
"cell_type": "code",
"execution_count": 16,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"2.55 ms ± 174 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)\n"
]
}
],
"source": [
"%t df.iloc[0:100]['id'].to_csv('perf_test.csv', index=False)"
]
}
],
"metadata": {
"kernelspec": {
"display_name": "Python 3",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.8.1"
}
},
"nbformat": 4,
"nbformat_minor": 4
}
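The row-wise `apply` in the notebook was not sped up by Dask. A likely reason is that `get_full_text` is plain Python, so the threaded scheduler is limited by the interpreter lock, while the process scheduler has to copy the data between workers. The following is a minimal sketch of a different approach, assuming only the three columns that `get_full_text` reads: it loops over those columns directly instead of building a row object for every Tweet. The small dataframe and the helper name `full_text_column` are made up for illustration, and no timing is claimed for the real data.

```python
import pandas as pd

# Tiny stand-in for the Tweet dataframe; the real test-tweets.jsonl is not used here.
df = pd.DataFrame({
    "truncated": [False, True],
    "text": ["short tweet", "long tweet, cut off"],
    "extended_tweet": [None, {"full_text": "long tweet, now complete"}],
})


def full_text_column(frame):
    """Return the complete text per row (hypothetical helper, not from the notebook)."""
    # zip() walks the three columns in parallel, so pandas never has to
    # construct a Series object for each row as apply(axis=1) does.
    values = [
        extended["full_text"] if truncated else text
        for truncated, text, extended in zip(
            frame["truncated"], frame["text"], frame["extended_tweet"]
        )
    ]
    return pd.Series(values, index=frame.index, name="full_text")


df["full_text"] = full_text_column(df)
print(df["full_text"].tolist())
```

On the real data this could be timed the same way as the other cells, for example with `%t full_text_column(df)`, to check whether it actually helps on a given setup.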