@Thibauth
Created June 22, 2015 01:54
{
"cells": [
{
"cell_type": "code",
"execution_count": 123,
"metadata": {
"collapsed": false
},
"outputs": [],
"source": [
"import numpy as np\n",
"import pandas as pd\n"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# Numpy views"
]
},
{
"cell_type": "code",
"execution_count": 124,
"metadata": {
"collapsed": false
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"[[ 0.18006592 0.24649676 0.35523893 0.26081841 0.66503203 0.22323551]\n",
" [ 0.8542038 0.69471479 0.24648256 0.38164091 0.77835498 0.56024023]\n",
" [ 0.32579923 0.23128576 0.76174749 0.61810437 0.14399669 0.95828189]\n",
" [ 0.92587986 0.80262456 0.29651969 0.37356372 0.71052469 0.98255504]\n",
" [ 0.38672804 0.05414583 0.62509552 0.68511417 0.30447966 0.9284551 ]\n",
" [ 0.86886512 0.034306 0.14237126 0.21981988 0.71984728 0.21833643]] \n",
"\n",
"[ 0.76174749 0.61810437]\n"
]
}
],
"source": [
"a = np.random.rand(6, 6)\n",
"print a,\"\\n\"\n",
"b = a[2][2:4] # b is a view on the \"center\" of a\n",
"print b"
]
},
{
"cell_type": "code",
"execution_count": 125,
"metadata": {
"collapsed": false
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"[ 0.76174749 0. ]\n",
"[[ 0.18006592 0.24649676 0.35523893 0.26081841 0.66503203 0.22323551]\n",
" [ 0.8542038 0.69471479 0.24648256 0.38164091 0.77835498 0.56024023]\n",
" [ 0.32579923 0.23128576 0.76174749 1. 0.14399669 0.95828189]\n",
" [ 0.92587986 0.80262456 0.29651969 0.37356372 0.71052469 0.98255504]\n",
" [ 0.38672804 0.05414583 0.62509552 0.68511417 0.30447966 0.9284551 ]\n",
" [ 0.86886512 0.034306 0.14237126 0.21981988 0.71984728 0.21833643]]\n"
]
}
],
"source": [
"a[2][3] = 0\n",
"print b # the change we have just made to a is reflected on b\n",
"b[1] = 1\n",
"print a # vice versa changes to b are reflected on a"
]
},
{
"cell_type": "code",
"execution_count": 126,
"metadata": {
"collapsed": false
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"[ 0.76174749 9. ]\n",
"[[ 0.18006592 0.24649676 0.35523893 0.26081841 0.66503203 0.22323551]\n",
" [ 0.8542038 0.69471479 0.24648256 0.38164091 0.77835498 0.56024023]\n",
" [ 0.32579923 0.23128576 0.76174749 9. 0.14399669 0.95828189]\n",
" [ 0.92587986 0.80262456 0.29651969 0.37356372 0.71052469 0.98255504]\n",
" [ 0.38672804 0.05414583 0.62509552 0.68511417 0.30447966 0.9284551 ]\n",
" [ 0.86886512 0.034306 0.14237126 0.21981988 0.71984728 0.21833643]]\n"
]
}
],
"source": [
"def f(x):\n",
" x[1] = 9\n",
" \n",
"f(b) # this modifies b and also a\n",
"print b\n",
"print a"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"However:"
]
},
{
"cell_type": "code",
"execution_count": 127,
"metadata": {
"collapsed": false
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"0\n",
"[[ 0.18006592 0.24649676 0.35523893 0.26081841 0.66503203 0.22323551]\n",
" [ 0.8542038 0.69471479 0.24648256 0.38164091 0.77835498 0.56024023]\n",
" [ 0.32579923 0.23128576 0.76174749 9. 0.14399669 0.95828189]\n",
" [ 0.92587986 0.80262456 0.29651969 0.37356372 0.71052469 0.98255504]\n",
" [ 0.38672804 0.05414583 0.62509552 0.68511417 0.30447966 0.9284551 ]\n",
" [ 0.86886512 0.034306 0.14237126 0.21981988 0.71984728 0.21833643]]\n"
]
}
],
"source": [
"b = 0\n",
"print b\n",
"print a"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Why did the above fail? The line `b=0` is interpreted by Python as: *create a new integer constant equal to 0 and attach the name `b` to it*; but what we meant was *assign the value 0 to the entire array b*. For this we need to make it explicit that we are assigning a value *inside* b (and not simply attaching the name `b` to a new object). This is done by using the slice operator."
]
},
{
"cell_type": "code",
"execution_count": 128,
"metadata": {
"collapsed": false
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"[ 0.76174749 9. ]\n",
"[ 2. 2.]\n",
"[[ 0.18006592 0.24649676 0.35523893 0.26081841 0.66503203 0.22323551]\n",
" [ 0.8542038 0.69471479 0.24648256 0.38164091 0.77835498 0.56024023]\n",
" [ 0.32579923 0.23128576 2. 2. 0.14399669 0.95828189]\n",
" [ 0.92587986 0.80262456 0.29651969 0.37356372 0.71052469 0.98255504]\n",
" [ 0.38672804 0.05414583 0.62509552 0.68511417 0.30447966 0.9284551 ]\n",
" [ 0.86886512 0.034306 0.14237126 0.21981988 0.71984728 0.21833643]]\n"
]
}
],
"source": [
"b = a[2][2:4] # first reassign b to be a view to the center of a\n",
"print b\n",
"b[:] = 2\n",
"print b\n",
"print a"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"In C terms, *b* is a pointer to the memory location of the cells in a. By assigning the constant 0 to `b`, we are simply losing the connection between a and b. b[:] is a way to *de-reference* the pointer b and assign *inside* b (in C we would simply use the * operator)."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"This also works with functions:"
]
},
{
"cell_type": "code",
"execution_count": 129,
"metadata": {
"collapsed": false
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"[ 3. 3.]\n",
"[[ 0.18006592 0.24649676 0.35523893 0.26081841 0.66503203 0.22323551]\n",
" [ 0.8542038 0.69471479 0.24648256 0.38164091 0.77835498 0.56024023]\n",
" [ 0.32579923 0.23128576 3. 3. 0.14399669 0.95828189]\n",
" [ 0.92587986 0.80262456 0.29651969 0.37356372 0.71052469 0.98255504]\n",
" [ 0.38672804 0.05414583 0.62509552 0.68511417 0.30447966 0.9284551 ]\n",
" [ 0.86886512 0.034306 0.14237126 0.21981988 0.71984728 0.21833643]]\n"
]
}
],
"source": [
"def g(x):\n",
" x[:] = 3\n",
"g(b)\n",
"print b\n",
"print a"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"#Pandas"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"The exact same principles apply to Pandas, except that consecutive slicing will create copies (and generally pandas will give you a warning if you try to do so)."
]
},
{
"cell_type": "code",
"execution_count": 183,
"metadata": {
"collapsed": false
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
" label value\n",
"0 B 0.813125\n",
"1 A 0.787954\n",
"2 A 0.033471\n",
"3 B 0.049730\n",
"4 B 0.980805\n",
"5 B 0.513889\n",
"6 B 0.122313\n",
"7 A 0.742209\n",
"8 B 0.935604\n",
"9 B 0.488160\n",
" label value\n",
"0 B 0.813125\n",
"3 B 0.049730\n",
"4 B 0.980805\n",
"5 B 0.513889\n",
"6 B 0.122313\n",
"8 B 0.935604\n",
"9 B 0.488160\n"
]
}
],
"source": [
"c = np.random.choice([\"A\", \"B\"], 10)\n",
"d = np.random.rand(10)\n",
"df = pd.DataFrame({\"label\": c, \"value\": d})\n",
"dfb = df[df.label == 'B']\n",
"print df\n",
"print dfb # dfb is a view on a subset of df"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"The problem is that at this point, we can no longer assign anything to `dfb`, even using the `[:]` as in the numpy case, without creating a copy. This would indeed by considered a *chain assignment*. Note that the problem also exists in Numpy when using boolean indexing: boolean indexing create copies in Numpy. The only difference is that Pandas needs to create copies more often than Numpy.\n",
"\n",
"There are two options at this point. The first one is to try as much as possible to select and assign at the same place:"
]
},
{
"cell_type": "code",
"execution_count": 184,
"metadata": {
"collapsed": false
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
" label value\n",
"0 B 0.000000\n",
"1 A 0.787954\n",
"2 A 0.033471\n",
"3 B 0.000000\n",
"4 B 0.000000\n",
"5 B 0.000000\n",
"6 B 0.000000\n",
"7 A 0.742209\n",
"8 B 0.000000\n",
"9 B 0.000000\n"
]
}
],
"source": [
"df.loc[df.label == 'B','value'] = 0\n",
"print df"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"The above option does not work in cases where we want to make the selection and the assigment at two different places (for example if we want to make the assigment inside a function). The second option is the *masking* method: instead of selecting a sub-dataframe, just compute the mask (that is, the indices of the rows we want to modify):"
]
},
{
"cell_type": "code",
"execution_count": 185,
"metadata": {
"collapsed": false
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"0 True\n",
"1 False\n",
"2 False\n",
"3 True\n",
"4 True\n",
"5 True\n",
"6 True\n",
"7 False\n",
"8 True\n",
"9 True\n",
"Name: label, dtype: bool\n",
" label value\n",
"0 B 0\n",
"3 B 0\n",
"4 B 0\n",
"5 B 0\n",
"6 B 0\n",
"8 B 0\n",
"9 B 0\n"
]
}
],
"source": [
"b_rows = df.label == 'B'\n",
"print b_rows # b_rows is a boolean dataframe, a mask telling us which rows to select\n",
"dfb = df[b_rows] # b_rows can be used to select a sub-dataframe\n",
"print dfb"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"we can now treat the selection part (the selection of the mask b) and the assignment part separately:"
]
},
{
"cell_type": "code",
"execution_count": 188,
"metadata": {
"collapsed": false
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
" label value\n",
"0 B 1.000000\n",
"1 A 0.787954\n",
"2 A 0.033471\n",
"3 B 1.000000\n",
"4 B 1.000000\n",
"5 B 1.000000\n",
"6 B 1.000000\n",
"7 A 0.742209\n",
"8 B 1.000000\n",
"9 B 1.000000\n"
]
}
],
"source": [
"df.loc[b_rows,'value'] = 1\n",
"print df"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"This also work with functions:"
]
},
{
"cell_type": "code",
"execution_count": 190,
"metadata": {
"collapsed": false
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
" label value\n",
"0 B 2.000000\n",
"1 A 0.787954\n",
"2 A 0.033471\n",
"3 B 2.000000\n",
"4 B 2.000000\n",
"5 B 2.000000\n",
"6 B 2.000000\n",
"7 A 0.742209\n",
"8 B 2.000000\n",
"9 B 2.000000\n"
]
}
],
"source": [
"def h(dataframe, mask):\n",
" dataframe.loc[mask, 'value'] = 2\n",
"h(df, b_rows)\n",
"print df"
]
}
],
"metadata": {
"kernelspec": {
"display_name": "Python 2",
"language": "python",
"name": "python2"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 2
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython2",
"version": "2.7.9"
}
},
"nbformat": 4,
"nbformat_minor": 0
}