@jfpuget
Created March 19, 2016 09:39
{
"cells": [
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# Pandas slice and copy\n",
"\n",
"## Author: [Jean-François Puget](https://www.ibm.com/developerworks/community/blogs/jfp/?lang=en)\n",
"\n",
"This notebook was motivated by a Reddit discussion on how to suppress the infamous pandas SettingWithCopyWarning."
]
},
{
"cell_type": "code",
"execution_count": 1,
"metadata": {
"collapsed": false
},
"outputs": [],
"source": [
"import numpy as np\n",
"import pandas as pd\n",
"\n",
"tb = pd.DataFrame(data=np.random.random((100000,3)), columns=['year','total','value'])"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Here is how to get the warning. We benchmark an operation on the resulting dataframe to see how efficient its memory representation is."
]
},
{
"cell_type": "code",
"execution_count": 2,
"metadata": {
"collapsed": false
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"1000 loops, best of 3: 462 µs per loop\n"
]
},
{
"name": "stderr",
"output_type": "stream",
"text": [
"C:\\Users\\IBM_ADMIN\\Anaconda3\\lib\\site-packages\\ipykernel\\__main__.py:2: SettingWithCopyWarning: \n",
"A value is trying to be set on a copy of a slice from a DataFrame\n",
"\n",
"See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy\n",
" from ipykernel import kernelapp as app\n"
]
}
],
"source": [
"messy = tb[tb['year'] >= 0.5]\n",
"messy.drop(['total'], axis=1, inplace=True)\n",
"\n",
"%timeit messy['value'].min()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"The warning appears because we try to modify a slice of the tb dataframe in place. We don't want to modify the original dataframe, hence pandas has to work on a copy. It warns us that the modification applies to that copy, even if the message isn't explicit about it.\n",
"\n",
"One way to remove the warning is to not try to modify the slice. We remove the inplace argument. In that case, the call to drop creates a new dataframe."
]
},
{
"cell_type": "code",
"execution_count": 3,
"metadata": {
"collapsed": false
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"1000 loops, best of 3: 469 µs per loop\n"
]
}
],
"source": [
"messy = tb[tb['year'] >= 0.5]\n",
"messy.drop(['total'], axis=1)\n",
"\n",
"%timeit messy['value'].min()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
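"The warning is gone, and the running time is unchanged. Note that the result of drop is not assigned here, so messy is still the filtered slice, including its total column.\n",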
"\n",
"I proposed another solution to fix the problem: an explicit call to copy()."
]
},
{
"cell_type": "code",
"execution_count": 4,
"metadata": {
"collapsed": false
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"1000 loops, best of 3: 215 µs per loop\n"
]
}
],
"source": [
"messy = tb[tb['year'] >= 0.5].copy()\n",
"messy.drop(['total'], axis=1, inplace=True)\n",
"\n",
"%timeit messy['value'].min()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Not only is the warning gone, but the code also runs more than twice as fast.\n",
"\n",
"We could also remove the column first, which creates a copy anyway, and then slice that copy."
]
},
{
"cell_type": "code",
"execution_count": 5,
"metadata": {
"collapsed": false
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"1000 loops, best of 3: 465 µs per loop\n"
]
}
],
"source": [
"messy = tb.drop(['total'], axis=1)\n",
"messy = messy[messy['year'] >= 0.5]\n",
"\n",
"%timeit messy['value'].min()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"The warning is gone too, but the running time is as slow as in the first attempt.\n",
"\n",
"Let's see if a copy of the slice helps."
]
},
{
"cell_type": "code",
"execution_count": 6,
"metadata": {
"collapsed": false
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"1000 loops, best of 3: 220 µs per loop\n"
]
}
],
"source": [
"messy = tb.drop(['total'], axis=1)\n",
"messy = messy[messy['year'] >= 0.5].copy()\n",
"\n",
"%timeit messy['value'].min()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"It does help! However, in this case we make two copies of the data: a first one when we drop the column, and a second one when we call copy() explicitly."
]
},
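{
"cell_type": "markdown",
"metadata": {},
"source": [
"The explicit-copy variants above are meant to leave tb untouched. The cell below is a minimal sketch of a sanity check, not part of the original benchmark. It uses a smaller, seeded frame, and the names small_tb and before are introduced here for illustration only."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": false
},
"outputs": [],
"source": [
"import numpy as np\n",
"import pandas as pd\n",
"\n",
"# Assumption: a smaller, seeded frame than the one benchmarked above.\n",
"rng = np.random.default_rng(0)\n",
"small_tb = pd.DataFrame(data=rng.random((1000, 3)), columns=['year', 'total', 'value'])\n",
"before = small_tb.copy()  # snapshot used only for the comparison below\n",
"\n",
"# Copy the slice, then modify the copy in place.\n",
"messy = small_tb[small_tb['year'] >= 0.5].copy()\n",
"messy.drop(['total'], axis=1, inplace=True)\n",
"messy['value'] = 0.0\n",
"\n",
"# The original frame keeps its three columns and its values.\n",
"unchanged = small_tb.equals(before)\n",
"print(unchanged, list(small_tb.columns), list(messy.columns))"
]
},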
{
"cell_type": "markdown",
"metadata": {
"collapsed": true
},
"source": [
"The lesson is clear:\n",
"\n",
"- If you care about speed, then you should copy slices before processing them.\n",
"- If you care about memory, then you should copy the slice before you do physical transformations in place."
]
},
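{
"cell_type": "markdown",
"metadata": {},
"source": [
"The %timeit magic used above only works in IPython. A rough sketch of the same comparison with the standard timeit module could look like the cell below. The helper bench and the names bench_tb and timings are hypothetical, and the numbers depend on the pandas version and the machine, so the factor of two measured above may not reproduce."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": false
},
"outputs": [],
"source": [
"import timeit\n",
"\n",
"import numpy as np\n",
"import pandas as pd\n",
"\n",
"# Assumption: same shape as tb above, rebuilt here so the cell runs on its own.\n",
"rng = np.random.default_rng(0)\n",
"bench_tb = pd.DataFrame(data=rng.random((100000, 3)), columns=['year', 'total', 'value'])\n",
"\n",
"def bench(frame, number=200, repeat=3):\n",
"    # Best time per call, in microseconds, of the operation timed above.\n",
"    best = min(timeit.repeat(lambda: frame['value'].min(), number=number, repeat=repeat))\n",
"    return best / number * 1e6\n",
"\n",
"timings = {\n",
"    'slice only': bench(bench_tb[bench_tb['year'] >= 0.5]),\n",
"    'slice then copy': bench(bench_tb[bench_tb['year'] >= 0.5].copy()),\n",
"}\n",
"\n",
"for label, micros in timings.items():\n",
"    print('{}: {:.0f} µs per loop'.format(label, micros))"
]
},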
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": true
},
"outputs": [],
"source": []
}
],
"metadata": {
"kernelspec": {
"display_name": "Python 3",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.5.1"
}
},
"nbformat": 4,
"nbformat_minor": 0
}