Last active September 4, 2020 09:15
Gotchas of Pandas Hierarchical indexing (MultiIndex)
Fun with Pandas Hierarchical indexing (MultiIndex)
The examples below show various gotchas and corner cases when working with a Pandas MultiIndex.
All examples use the following sample data:
>>> import pandas as pd
>>> df = pd.DataFrame({
... "size": ["m", "m", "l"],
... "color": ["red", "green", "blue"],
... "category": ["M", "W", "M"],
... "price": [10, 11, 12],
... }).set_index(["size", "color", "category"])
>>> df
                     price
size color category
m    red   M            10
     green W            11
l    blue  M            12
DataFrame.xs() selects by key in a given level and drops that level:
>>> df.xs("red", level="color")
               price
size category
m    M            10
You can chain DataFrame.xs() calls:
>>> df.xs("m", level="size").xs("red", level="color")
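          price
category
M            10
But you cannot select the last remaining level this way, because the intermediate result has a plain Index rather than a MultiIndex: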
>>> df.xs("m", level="size").xs("red", level="color").xs("M", level="category")
Traceback (most recent call last):
TypeError: Index must be a MultiIndex
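One way around this is to check the index type before calling xs(). The following is a minimal sketch, not a pandas feature: `xs_any` is a hypothetical helper name, and on a plain Index it falls back to DataFrame.loc[], which means the level is kept rather than dropped.

```python
import pandas as pd

df = pd.DataFrame({
    "size": ["m", "m", "l"],
    "color": ["red", "green", "blue"],
    "category": ["M", "W", "M"],
    "price": [10, 11, 12],
}).set_index(["size", "color", "category"])


def xs_any(frame, key, level):
    """Hypothetical helper: xs() that also accepts a single-level index."""
    if isinstance(frame.index, pd.MultiIndex):
        return frame.xs(key, level=level)
    # Plain Index: xs(level=...) raises TypeError, so select by label.
    # The `level` argument is ignored here and the index level is kept.
    return frame.loc[[key]]


# Chain three selections; the last one hits the plain-Index branch.
result = xs_any(xs_any(xs_any(df, "m", "size"), "red", "color"), "M", "category")
print(result)
```

Passing the key as a one-element list to loc[] keeps the result a DataFrame, so the helper returns the same type in both branches.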
You can select multiple levels at once, and Pandas drops all selected levels:
>>> df.xs(("m", "red"), level=("size", "color"))
          price
category
M            10
But if you select all levels, Pandas does not drop any of them:
>>> df.xs(("m", "red", "M"), level=("size", "color", "category"))
                     price
size color category
m    red   M            10
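If what you want is the matching row without any index levels, one option is DataFrame.loc[] with the full key, which returns the row as a Series. A minimal sketch, rebuilding the sample data so it runs on its own:

```python
import pandas as pd

df = pd.DataFrame({
    "size": ["m", "m", "l"],
    "color": ["red", "green", "blue"],
    "category": ["M", "W", "M"],
    "price": [10, 11, 12],
}).set_index(["size", "color", "category"])

# A full key (one label per level) identifies a single row here,
# so loc[] returns that row as a Series indexed by column name.
row = df.loc[("m", "red", "M"), :]
print(row)
```

This relies on the full key being unique in the index, which holds for the sample data.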
Do not try to apply an empty filter:
>>> df.xs((), level=())
Traceback (most recent call last):
ValueError: The truth value of an array with more than one element is ambiguous. Use a.any() or a.all()
DataFrame.xs() is not able to select multiple keys:
>>> df.xs(["red", "green"], level="color")
Traceback (most recent call last):
KeyError: ('red', 'green')
When DataFrame.xs() is not sufficient you can fall back to DataFrame.loc[]. You can select a single key in the first level:
>>> df.loc["m"]
                price
color category
red   M            10
green W            11
Or you can select multiple keys in the first level:
>>> df.loc[["m", "l"]]
                     price
size color category
m    red   M            10
     green W            11
l    blue  M            12
But if you want to select by an arbitrary index level, the syntax becomes ugly:
>>> df.loc[(slice(None), "red"), :]
                     price
size color category
m    red   M            10
Note that the "color" level in the example above was not dropped, even though we selected only a single value.
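pd.IndexSlice can make the same selection a little easier to read, and DataFrame.droplevel() removes the level afterwards if you do not want it. A minimal sketch, rebuilding the sample data so it runs on its own:

```python
import pandas as pd

df = pd.DataFrame({
    "size": ["m", "m", "l"],
    "color": ["red", "green", "blue"],
    "category": ["M", "W", "M"],
    "price": [10, 11, 12],
}).set_index(["size", "color", "category"])

# idx[:, "red"] builds the same tuple as (slice(None), "red").
idx = pd.IndexSlice
selected = df.loc[idx[:, "red"], :]

# loc[] keeps the "color" level; drop it explicitly if it is not needed.
dropped = selected.droplevel("color")
print(dropped)
```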
Surprisingly, the most convenient method for selecting multiple values by arbitrary level can be DataFrame.reindex():
>>> df.reindex(["red", "green"], level="color")
                     price
size color category
m    red   M            10
     green W            11
Conclusions from the above examples:
- Use DataFrame.loc[] if the index has only one level (that is, it is not a MultiIndex).
- Use DataFrame.xs() if you want to select a single value in an arbitrary level of a MultiIndex.
- Use DataFrame.reindex() to select multiple values in an arbitrary level of a MultiIndex.
And good luck if your code has to be generic and work with any DataFrame :)
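For generic code, a boolean mask built from Index.get_level_values() is one option, because that method exists on both a plain Index and a MultiIndex. The following is a minimal sketch; `select_in_level` is a hypothetical helper name, not a pandas function, and unlike xs() it never drops the level.

```python
import pandas as pd

df = pd.DataFrame({
    "size": ["m", "m", "l"],
    "color": ["red", "green", "blue"],
    "category": ["M", "W", "M"],
    "price": [10, 11, 12],
}).set_index(["size", "color", "category"])


def select_in_level(frame, values, level):
    """Hypothetical helper: keep rows whose label in `level` is in `values`."""
    labels = frame.index.get_level_values(level)
    # isin() returns a boolean array; rows keep their original order.
    return frame[labels.isin(values)]


# Works on the three-level MultiIndex...
multi = select_in_level(df, ["red", "green"], "color")

# ...and on a frame whose index has only the "color" level.
flat = df.reset_index(["size", "category"])
single = select_in_level(flat, ["red", "green"], "color")
print(multi)
print(single)
```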
import doctest
doctest.testfile(__file__, optionflags=doctest.NORMALIZE_WHITESPACE)