lucahammer/perf_4670k_windows10.ipynb

## perf_4670k_windows10.ipynb
{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Python Performance Tests\n",
    "A small collection of operations that are typical for my daily work with real data to compare different setups."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 1,
   "metadata": {
    "scrolled": true
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Created `%t` as an alias for `%timeit`.\n",
      "Created `%%t` as an alias for `%%timeit`.\n"
     ]
    }
   ],
   "source": [
    "%alias_magic t timeit\n",
    "\n",
    "import pandas as pd\n",
    "import dask.dataframe as dd\n",
    "\n",
    "df = pd.read_json('test-tweets.jsonl', lines=True)\n",
    "dask_df = dd.from_pandas(df, npartitions=5)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Let's start with loading data. About half a million Tweet objects as json lines."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 2,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "1min 21s ± 2.98 s per loop (mean ± std. dev. of 7 runs, 1 loop each)\n"
     ]
    }
   ],
   "source": [
    "%t df = pd.read_json('test-tweets.jsonl', lines=True)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "For some multi-threaded processing I want the dataframe as a dask dataframe as well."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 3,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "39.6 s ± 454 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)\n"
     ]
    }
   ],
   "source": [
    "%t dask_df = dd.from_pandas(df, npartitions=5)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 4,
   "metadata": {},
   "outputs": [],
   "source": [
    "def get_full_text(row):\n",
    "    if row['truncated']:\n",
    "        return(row['extended_tweet']['full_text'])\n",
    "    return (row['text'])"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Twitter hides the full text of Tweets with more than 140 characters in a sub-field. I want one column that has always the complete text."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 5,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "30.8 s ± 51.6 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)\n"
     ]
    }
   ],
   "source": [
    "%t df.apply(get_full_text, axis=1)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Can this be done faster with Dask? "
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 6,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "32.1 s ± 99.6 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)\n"
     ]
    }
   ],
   "source": [
    "%t dask_df.apply(get_full_text, axis=1, meta=('string')).compute()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Maybe with processes instead of threads?"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 7,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "1min 5s ± 5.27 s per loop (mean ± std. dev. of 7 runs, 1 loop each)\n"
     ]
    }
   ],
   "source": [
    "%t dask_df.apply(get_full_text, axis=1, meta=('string')).compute(scheduler='processes')"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Or by computing it partition wise?"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 8,
   "metadata": {
    "scrolled": false
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "31.9 s ± 280 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)\n"
     ]
    }
   ],
   "source": [
    "%t dask_df.map_partitions(lambda ldf: ldf.apply(get_full_text, axis=1), meta=('string')).compute()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 9,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "1min 6s ± 7.88 s per loop (mean ± std. dev. of 7 runs, 1 loop each)\n"
     ]
    }
   ],
   "source": [
    "%t dask_df.map_partitions(lambda ldf: ldf.apply(get_full_text, axis=1), meta=('string')).compute(scheduler='processes')"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "That didn't work. It's the first time I tried Dask as I hoped that it would make better use of the high core count. I will have to find a better approach.\n",
    "\n",
    "But grouping is faster with Dask. For example to show which apps were used and for how many Tweets."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 10,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "912 ms ± 3.09 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)\n"
     ]
    }
   ],
   "source": [
    "%t len(df.groupby('source').count())"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 11,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "856 ms ± 4.96 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)\n"
     ]
    }
   ],
   "source": [
    "%t len(dask_df.groupby('source').count().compute())"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Finally, I want to store some data…"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 12,
   "metadata": {},
   "outputs": [
    {
     "name": "stderr",
     "output_type": "stream",
     "text": [
      "C:\\Users\\lucah\\Anaconda3\\envs\\Performance Test\\lib\\site-packages\\pyarrow\\feather.py:83: FutureWarning: The SparseDataFrame class is removed from pandas. Accessing it from the top-level namespace will also be removed in the next version\n",
      "  if isinstance(df, _pandas_api.pd.SparseDataFrame):\n"
     ]
    },
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "222 ms ± 2.2 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)\n"
     ]
    }
   ],
   "source": [
    "%t df[['created_at', 'id', 'text']].to_feather('perf_test.feather')"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 13,
   "metadata": {},
   "outputs": [
    {
     "name": "stderr",
     "output_type": "stream",
     "text": [
      "C:\\Users\\lucah\\Anaconda3\\envs\\Performance Test\\lib\\site-packages\\pyarrow\\pandas_compat.py:383: FutureWarning: RangeIndex._start is deprecated and will be removed in a future version. Use RangeIndex.start instead\n",
      "  'start': level._start,\n",
      "C:\\Users\\lucah\\Anaconda3\\envs\\Performance Test\\lib\\site-packages\\pyarrow\\pandas_compat.py:384: FutureWarning: RangeIndex._stop is deprecated and will be removed in a future version. Use RangeIndex.stop instead\n",
      "  'stop': level._stop,\n",
      "C:\\Users\\lucah\\Anaconda3\\envs\\Performance Test\\lib\\site-packages\\pyarrow\\pandas_compat.py:385: FutureWarning: RangeIndex._step is deprecated and will be removed in a future version. Use RangeIndex.step instead\n",
      "  'step': level._step\n"
     ]
    },
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "308 ms ± 1.04 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)\n"
     ]
    }
   ],
   "source": [
    "%t df[['created_at', 'id', 'text']].to_parquet('perf_test.parquet')"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "…and read it again."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 14,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "63.4 ms ± 2.09 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)\n"
     ]
    }
   ],
   "source": [
    "%t pd.read_feather('perf_test.feather')"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Save the IDs to share them with other researchers."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 16,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "2.55 ms ± 174 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)\n"
     ]
    }
   ],
   "source": [
    "%t df.iloc[0:100]['id'].to_csv('perf_test.csv', index=False)"
   ]
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.8.1"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 4
}
	{
	"cells": [
	{
	"cell_type": "markdown",
	"metadata": {},
	"source": [
	"# Python Performance Tests\n",
	"A small collection of operations that are typical for my daily work with real data to compare different setups."
	]
	},
	{
	"cell_type": "code",
	"execution_count": 1,
	"metadata": {
	"scrolled": true
	},
	"outputs": [
	{
	"name": "stdout",
	"output_type": "stream",
	"text": [
	"Created `%t` as an alias for `%timeit`.\n",
	"Created `%%t` as an alias for `%%timeit`.\n"
	]
	}
	],
	"source": [
	"%alias_magic t timeit\n",
	"\n",
	"import pandas as pd\n",
	"import dask.dataframe as dd\n",
	"\n",
	"df = pd.read_json('test-tweets.jsonl', lines=True)\n",
	"dask_df = dd.from_pandas(df, npartitions=5)"
	]
	},
	{
	"cell_type": "markdown",
	"metadata": {},
	"source": [
	"Let's start with loading data. About half a million Tweet objects as json lines."
	]
	},
	{
	"cell_type": "code",
	"execution_count": 2,
	"metadata": {},
	"outputs": [
	{
	"name": "stdout",
	"output_type": "stream",
	"text": [
	"1min 21s ± 2.98 s per loop (mean ± std. dev. of 7 runs, 1 loop each)\n"
	]
	}
	],
	"source": [
	"%t df = pd.read_json('test-tweets.jsonl', lines=True)"
	]
	},
	{
	"cell_type": "markdown",
	"metadata": {},
	"source": [
	"For some multi-threaded processing I want the dataframe as a dask dataframe as well."
	]
	},
	{
	"cell_type": "code",
	"execution_count": 3,
	"metadata": {},
	"outputs": [
	{
	"name": "stdout",
	"output_type": "stream",
	"text": [
	"39.6 s ± 454 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)\n"
	]
	}
	],
	"source": [
	"%t dask_df = dd.from_pandas(df, npartitions=5)"
	]
	},
	{
	"cell_type": "code",
	"execution_count": 4,
	"metadata": {},
	"outputs": [],
	"source": [
	"def get_full_text(row):\n",
	" if row['truncated']:\n",
	" return(row['extended_tweet']['full_text'])\n",
	" return (row['text'])"
	]
	},
	{
	"cell_type": "markdown",
	"metadata": {},
	"source": [
	"Twitter hides the full text of Tweets with more than 140 characters in a sub-field. I want one column that has always the complete text."
	]
	},
	{
	"cell_type": "code",
	"execution_count": 5,
	"metadata": {},
	"outputs": [
	{
	"name": "stdout",
	"output_type": "stream",
	"text": [
	"30.8 s ± 51.6 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)\n"
	]
	}
	],
	"source": [
	"%t df.apply(get_full_text, axis=1)"
	]
	},
	{
	"cell_type": "markdown",
	"metadata": {},
	"source": [
	"Can this be done faster with Dask? "
	]
	},
	{
	"cell_type": "code",
	"execution_count": 6,
	"metadata": {},
	"outputs": [
	{
	"name": "stdout",
	"output_type": "stream",
	"text": [
	"32.1 s ± 99.6 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)\n"
	]
	}
	],
	"source": [
	"%t dask_df.apply(get_full_text, axis=1, meta=('string')).compute()"
	]
	},
	{
	"cell_type": "markdown",
	"metadata": {},
	"source": [
	"Maybe with processes instead of threads?"
	]
	},
	{
	"cell_type": "code",
	"execution_count": 7,
	"metadata": {},
	"outputs": [
	{
	"name": "stdout",
	"output_type": "stream",
	"text": [
	"1min 5s ± 5.27 s per loop (mean ± std. dev. of 7 runs, 1 loop each)\n"
	]
	}
	],
	"source": [
	"%t dask_df.apply(get_full_text, axis=1, meta=('string')).compute(scheduler='processes')"
	]
	},
	{
	"cell_type": "markdown",
	"metadata": {},
	"source": [
	"Or by computing it partition wise?"
	]
	},
	{
	"cell_type": "code",
	"execution_count": 8,
	"metadata": {
	"scrolled": false
	},
	"outputs": [
	{
	"name": "stdout",
	"output_type": "stream",
	"text": [
	"31.9 s ± 280 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)\n"
	]
	}
	],
	"source": [
	"%t dask_df.map_partitions(lambda ldf: ldf.apply(get_full_text, axis=1), meta=('string')).compute()"
	]
	},
	{
	"cell_type": "code",
	"execution_count": 9,
	"metadata": {},
	"outputs": [
	{
	"name": "stdout",
	"output_type": "stream",
	"text": [
	"1min 6s ± 7.88 s per loop (mean ± std. dev. of 7 runs, 1 loop each)\n"
	]
	}
	],
	"source": [
	"%t dask_df.map_partitions(lambda ldf: ldf.apply(get_full_text, axis=1), meta=('string')).compute(scheduler='processes')"
	]
	},
	{
	"cell_type": "markdown",
	"metadata": {},
	"source": [
	"That didn't work. It's the first time I tried Dask as I hoped that it would make better use of the high core count. I will have to find a better approach.\n",
	"\n",
	"But grouping is faster with Dask. For example to show which apps were used and for how many Tweets."
	]
	},
	{
	"cell_type": "code",
	"execution_count": 10,
	"metadata": {},
	"outputs": [
	{
	"name": "stdout",
	"output_type": "stream",
	"text": [
	"912 ms ± 3.09 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)\n"
	]
	}
	],
	"source": [
	"%t len(df.groupby('source').count())"
	]
	},
	{
	"cell_type": "code",
	"execution_count": 11,
	"metadata": {},
	"outputs": [
	{
	"name": "stdout",
	"output_type": "stream",
	"text": [
	"856 ms ± 4.96 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)\n"
	]
	}
	],
	"source": [
	"%t len(dask_df.groupby('source').count().compute())"
	]
	},
	{
	"cell_type": "markdown",
	"metadata": {},
	"source": [
	"Finally, I want to store some data…"
	]
	},
	{
	"cell_type": "code",
	"execution_count": 12,
	"metadata": {},
	"outputs": [
	{
	"name": "stderr",
	"output_type": "stream",
	"text": [
	"C:\\Users\\lucah\\Anaconda3\\envs\\Performance Test\\lib\\site-packages\\pyarrow\\feather.py:83: FutureWarning: The SparseDataFrame class is removed from pandas. Accessing it from the top-level namespace will also be removed in the next version\n",
	" if isinstance(df, _pandas_api.pd.SparseDataFrame):\n"
	]
	},
	{
	"name": "stdout",
	"output_type": "stream",
	"text": [
	"222 ms ± 2.2 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)\n"
	]
	}
	],
	"source": [
	"%t df[['created_at', 'id', 'text']].to_feather('perf_test.feather')"
	]
	},
	{
	"cell_type": "code",
	"execution_count": 13,
	"metadata": {},
	"outputs": [
	{
	"name": "stderr",
	"output_type": "stream",
	"text": [
	"C:\\Users\\lucah\\Anaconda3\\envs\\Performance Test\\lib\\site-packages\\pyarrow\\pandas_compat.py:383: FutureWarning: RangeIndex._start is deprecated and will be removed in a future version. Use RangeIndex.start instead\n",
	" 'start': level._start,\n",
	"C:\\Users\\lucah\\Anaconda3\\envs\\Performance Test\\lib\\site-packages\\pyarrow\\pandas_compat.py:384: FutureWarning: RangeIndex._stop is deprecated and will be removed in a future version. Use RangeIndex.stop instead\n",
	" 'stop': level._stop,\n",
	"C:\\Users\\lucah\\Anaconda3\\envs\\Performance Test\\lib\\site-packages\\pyarrow\\pandas_compat.py:385: FutureWarning: RangeIndex._step is deprecated and will be removed in a future version. Use RangeIndex.step instead\n",
	" 'step': level._step\n"
	]
	},
	{
	"name": "stdout",
	"output_type": "stream",
	"text": [
	"308 ms ± 1.04 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)\n"
	]
	}
	],
	"source": [
	"%t df[['created_at', 'id', 'text']].to_parquet('perf_test.parquet')"
	]
	},
	{
	"cell_type": "markdown",
	"metadata": {},
	"source": [
	"…and read it again."
	]
	},
	{
	"cell_type": "code",
	"execution_count": 14,
	"metadata": {},
	"outputs": [
	{
	"name": "stdout",
	"output_type": "stream",
	"text": [
	"63.4 ms ± 2.09 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)\n"
	]
	}
	],
	"source": [
	"%t pd.read_feather('perf_test.feather')"
	]
	},
	{
	"cell_type": "markdown",
	"metadata": {},
	"source": [
	"Save the IDs to share them with other researchers."
	]
	},
	{
	"cell_type": "code",
	"execution_count": 16,
	"metadata": {},
	"outputs": [
	{
	"name": "stdout",
	"output_type": "stream",
	"text": [
	"2.55 ms ± 174 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)\n"
	]
	}
	],
	"source": [
	"%t df.iloc[0:100]['id'].to_csv('perf_test.csv', index=False)"
	]
	}
	],
	"metadata": {
	"kernelspec": {
	"display_name": "Python 3",
	"language": "python",
	"name": "python3"
	},
	"language_info": {
	"codemirror_mode": {
	"name": "ipython",
	"version": 3
	},
	"file_extension": ".py",
	"mimetype": "text/x-python",
	"name": "python",
	"nbconvert_exporter": "python",
	"pygments_lexer": "ipython3",
	"version": "3.8.1"
	}
	},
	"nbformat": 4,
	"nbformat_minor": 4
	}