Skip to content

Instantly share code, notes, and snippets.

@jiffyclub
Created August 19, 2014 19:05
Show Gist options
  • Star 0 You must be signed in to star a gist
  • Fork 0 You must be signed in to fork a gist
  • Save jiffyclub/0aa5aadcc69ec8576f0b to your computer and use it in GitHub Desktop.
Save jiffyclub/0aa5aadcc69ec8576f0b to your computer and use it in GitHub Desktop.
Using indexing vs. merging to broadcast and reindex a pandas Series. Setting up two tables and doing a pandas merge can often be faster.
Display the source blob
Display the rendered blob
Raw
{
"metadata": {
"name": "",
"signature": "sha256:6210b6c09f651d70330b603622ca2f97a42fcea7c29dd564e048fa64204e1f68"
},
"nbformat": 3,
"nbformat_minor": 0,
"worksheets": [
{
"cells": [
{
"cell_type": "markdown",
"metadata": {},
"source": [
"At Synthicity we frequently want to do an operation we call \"reindexing\".\n",
"(Though it's different than the operation done by pandas Series.reindex.)\n",
"Imagine that you have two columns from separate tables, col1 and col2.\n",
"The values of col2 correspond to the index of col1.\n",
"We want to broadcast col1 so it can have the index of col2.\n",
"This is what you'd get doing a table merge, but in this situation it's\n",
"like both of our tables have only one column.\n",
"\n",
"There are (at least) two ways to do this, shown below:"
]
},
{
"cell_type": "code",
"collapsed": false,
"input": [
"import numpy as np\n",
"import pandas as pd"
],
"language": "python",
"metadata": {},
"outputs": [],
"prompt_number": 5
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"One way to \"reindex\" is to use the col2 values to index col1,\n",
"then replace the index of the resulting Series."
]
},
{
"cell_type": "code",
"collapsed": false,
"input": [
"def reindex1(series1, series2):\n",
" return pd.Series(series1.loc[series2.values].values, index=series2.index)"
],
"language": "python",
"metadata": {},
"outputs": [],
"prompt_number": 9
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Another way to \"reindex\" is to construct DataFrames from the columns\n",
"use the pandas merge function, returning only one column from the resulting DataFrame."
]
},
{
"cell_type": "code",
"collapsed": false,
"input": [
"def reindex2(series1, series2):\n",
" df = pd.merge(pd.DataFrame({\"left\": series2}),\n",
" pd.DataFrame({\"right\": series1}),\n",
" left_on=\"left\",\n",
" right_index=True,\n",
" how=\"left\")\n",
" return df.right"
],
"language": "python",
"metadata": {},
"outputs": [],
"prompt_number": 8
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"It turns out that these two methods have very different performance.\n",
"On smallish columns using `.loc` indexing is faster:"
]
},
{
"cell_type": "code",
"collapsed": false,
"input": [
"s1 = pd.Series(range(100))"
],
"language": "python",
"metadata": {},
"outputs": [],
"prompt_number": 16
},
{
"cell_type": "code",
"collapsed": false,
"input": [
"s2 = pd.Series(np.random.choice(s1.index, 1000))"
],
"language": "python",
"metadata": {},
"outputs": [],
"prompt_number": 17
},
{
"cell_type": "code",
"collapsed": false,
"input": [
"%timeit reindex1(s1, s2)"
],
"language": "python",
"metadata": {},
"outputs": [
{
"output_type": "stream",
"stream": "stdout",
"text": [
"1000 loops, best of 3: 624 \u00b5s per loop\n"
]
}
],
"prompt_number": 18
},
{
"cell_type": "code",
"collapsed": false,
"input": [
"%timeit reindex2(s1, s2)"
],
"language": "python",
"metadata": {},
"outputs": [
{
"output_type": "stream",
"stream": "stdout",
"text": [
"1000 loops, best of 3: 1.15 ms per loop\n"
]
}
],
"prompt_number": 19
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"But as the size of the columns grows the merge becomes significantly faster:"
]
},
{
"cell_type": "code",
"collapsed": false,
"input": [
"s1 = pd.Series(range(100000))"
],
"language": "python",
"metadata": {},
"outputs": [],
"prompt_number": 23
},
{
"cell_type": "code",
"collapsed": false,
"input": [
"s2 = pd.Series(np.random.choice(s1.index, 1000000))"
],
"language": "python",
"metadata": {},
"outputs": [],
"prompt_number": 24
},
{
"cell_type": "code",
"collapsed": false,
"input": [
"%timeit reindex1(s1, s2)"
],
"language": "python",
"metadata": {},
"outputs": [
{
"output_type": "stream",
"stream": "stdout",
"text": [
"1 loops, best of 3: 1.3 s per loop\n"
]
}
],
"prompt_number": 25
},
{
"cell_type": "code",
"collapsed": false,
"input": [
"%timeit reindex2(s1, s2)"
],
"language": "python",
"metadata": {},
"outputs": [
{
"output_type": "stream",
"stream": "stdout",
"text": [
"10 loops, best of 3: 22.2 ms per loop\n"
]
}
],
"prompt_number": 26
},
{
"cell_type": "code",
"collapsed": false,
"input": [],
"language": "python",
"metadata": {},
"outputs": []
}
],
"metadata": {}
}
]
}
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment