Gotchas of Pandas Hierarchical indexing (MultiIndex)
"""
Fun with Pandas Hierarchical indexing (MultiIndex)
==================================================

The examples below show various gotchas and corner cases when working with a Pandas MultiIndex.
All examples use the following sample data:

>>> import pandas as pd
>>> df = pd.DataFrame({
...     "size": ["m", "m", "l"],
...     "color": ["red", "green", "blue"],
...     "category": ["M", "W", "M"],
...     "price": [10, 11, 12],
... }).set_index(["size", "color", "category"])
>>> df
                     price
size color category
m    red   M            10
     green W            11
l    blue  M            12

DataFrame.xs() selects by key in a given level and drops that level:

>>> df.xs("red", level="color")
               price
size category
m    M            10

You can chain DataFrame.xs() calls:

>>> df.xs("m", level="size").xs("red", level="color")
          price
category
M            10

But you cannot select by the last remaining level this way, because the index is no longer a MultiIndex:

>>> df.xs("m", level="size").xs("red", level="color").xs("M", level="category")
Traceback (most recent call last):
    ...
TypeError: Index must be a MultiIndex

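One possible way around this is to check whether the index is still a MultiIndex and fall back to a plain label lookup with DataFrame.loc[] when it is not. This is only a sketch; the variable names `sub` and `last` are made up for illustration.

```python
import pandas as pd

df = pd.DataFrame({
    "size": ["m", "m", "l"],
    "color": ["red", "green", "blue"],
    "category": ["M", "W", "M"],
    "price": [10, 11, 12],
}).set_index(["size", "color", "category"])

# After two xs() calls only the "category" level is left, as a plain Index.
sub = df.xs("m", level="size").xs("red", level="color")

if isinstance(sub.index, pd.MultiIndex):
    # Still hierarchical: xs() with level= is allowed.
    last = sub.xs("M", level="category")
else:
    # Plain Index: a list key keeps the result a DataFrame instead of a Series.
    last = sub.loc[["M"]]
```

Note that, unlike xs(), this lookup keeps the "category" level in the result.
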
You can select by multiple levels at once, and Pandas drops all selected levels:

>>> df.xs(("m", "red"), level=("size", "color"))
          price
category
M            10

But if you select by all levels, Pandas does not drop any of them:

>>> df.xs(("m", "red", "M"), level=("size", "color", "category"))
                     price
size color category
m    red   M            10

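If the changing shape of the result is a problem, one option is the drop_level parameter of DataFrame.xs(): with drop_level=False the selected levels are kept. The sketch below assumes the same sample data; `one_level` and `two_levels` are illustrative names.

```python
import pandas as pd

df = pd.DataFrame({
    "size": ["m", "m", "l"],
    "color": ["red", "green", "blue"],
    "category": ["M", "W", "M"],
    "price": [10, 11, 12],
}).set_index(["size", "color", "category"])

# Select by one level but keep all three levels in the result.
one_level = df.xs("red", level="color", drop_level=False)

# Select by two levels, again without dropping them.
two_levels = df.xs(("m", "red"), level=("size", "color"), drop_level=False)
```

Both results keep the full three-level index, so code that consumes them does not need to know how many levels were used for the selection.
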
Do not try to apply an empty filter:

>>> df.xs((), level=())
Traceback (most recent call last):
    ...
ValueError: The truth value of an array with more than one element is ambiguous. Use a.any() or a.all()

DataFrame.xs() is not able to select multiple keys:

>>> df.xs(["red", "green"], level="color")
Traceback (most recent call last):
    ...
KeyError: ('red', 'green')

When DataFrame.xs() is not sufficient, you can fall back to DataFrame.loc[]. You can select a single key in the first level:

>>> df.loc["m"]
                price
color category
red   M            10
green W            11

Or you can select multiple keys in the first level:

>>> df.loc[["m", "l"]]
                     price
size color category
m    red   M            10
     green W            11
l    blue  M            12

But if you want to select by an arbitrary index level, the syntax becomes ugly:

>>> df.loc[(slice(None), "red"), :]
                     price
size color category
m    red   M            10

Note that the "color" level in the above example was not dropped, even though we selected only a single value.

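A slightly more readable spelling of the same selection is pd.IndexSlice, which builds the tuple of slices for you. The sketch below assumes the same sample data; `verbose` and `concise` are illustrative names.

```python
import pandas as pd

df = pd.DataFrame({
    "size": ["m", "m", "l"],
    "color": ["red", "green", "blue"],
    "category": ["M", "W", "M"],
    "price": [10, 11, 12],
}).set_index(["size", "color", "category"])

idx = pd.IndexSlice

# idx[:, "red"] evaluates to the tuple (slice(None), "red").
verbose = df.loc[(slice(None), "red"), :]
concise = df.loc[idx[:, "red"], :]
```

Both spellings build the same key, so the "color" level is kept in either case.
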
Surprisingly, the most convenient method for selecting multiple values by arbitrary level can be DataFrame.reindex():

>>> df.reindex(["red", "green"], level="color")
                     price
size color category
m    red   M            10
     green W            11

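Another option, if reindex() feels too implicit for a pure selection, is a boolean mask built from Index.get_level_values() and Index.isin(). The sketch below assumes the same sample data; `mask` and `selected` are illustrative names.

```python
import pandas as pd

df = pd.DataFrame({
    "size": ["m", "m", "l"],
    "color": ["red", "green", "blue"],
    "category": ["M", "W", "M"],
    "price": [10, 11, 12],
}).set_index(["size", "color", "category"])

# One boolean per row: True where the "color" level is one of the wanted values.
mask = df.index.get_level_values("color").isin(["red", "green"])

# Boolean indexing keeps the matching rows in their original order
# and leaves all index levels in place.
selected = df[mask]
```

A mask only filters existing rows, so values that are missing from the index are silently ignored.
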
Conclusions from the above examples:

- Use DataFrame.loc[] if the index has one level only (it is not a MultiIndex).
- Use DataFrame.xs() if you want to select a single value in an arbitrary level of a MultiIndex.
- Use DataFrame.reindex() to select multiple values in an arbitrary level of a MultiIndex.

And good luck if your code has to be generic and work with any DataFrame :)

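For the generic case, one possible approach is a small helper that works on the level values directly, so it does not care whether the index is a MultiIndex. The function `select_by_level` below is hypothetical (not part of Pandas), and treating a scalar as "select one value and drop the level" is an assumption made here to mimic DataFrame.xs().

```python
import pandas as pd


def select_by_level(frame, level, values):
    # Hypothetical helper: select rows whose index level matches `values`.
    # A scalar selects one value and drops the level when other levels remain;
    # a list-like selects several values and keeps every level.
    single = pd.api.types.is_scalar(values)
    wanted = [values] if single else list(values)
    # get_level_values() works on a plain Index as well as on a MultiIndex.
    mask = frame.index.get_level_values(level).isin(wanted)
    result = frame[mask]
    if single and result.index.nlevels > 1:
        result = result.droplevel(level)
    return result


df = pd.DataFrame({
    "size": ["m", "m", "l"],
    "color": ["red", "green", "blue"],
    "category": ["M", "W", "M"],
    "price": [10, 11, 12],
}).set_index(["size", "color", "category"])

# The same data with a single-level index, to exercise the non-MultiIndex path.
flat = df.reset_index().set_index("color")

one = select_by_level(df, "color", "red")
many = select_by_level(df, "color", ["red", "green"])
plain = select_by_level(flat, "color", "red")
```

The level is never dropped from a single-level index, because a DataFrame cannot lose its only index level.
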
"""
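import doctest

# module_relative=False treats __file__ as an OS path, so an absolute path also works.
doctest.testfile(__file__, module_relative=False, optionflags=doctest.NORMALIZE_WHITESPACE)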