mila/pandas_multiindex_gotchas.py

"""

Fun with Pandas Hierarchical indexing (MultiIndex)
==================================================

The examples below show various gotchas and corner cases when working with a Pandas MultiIndex.

All examples use the following sample data:

>>> import pandas as pd
>>> df = pd.DataFrame({
...     "size": ["m", "m", "l"],
...     "color": ["red", "green", "blue"],
...     "category": ["M", "W", "M"],
...     "price": [10, 11, 12],
... }).set_index(["size", "color", "category"])
>>> df
                     price
size color category
m    red   M            10
     green W            11
l    blue  M            12


DataFrame.xs() selects by key in a given level and drops that level:

>>> df.xs("red", level="color")
               price
size category
m    M            10


You can chain DataFrame.xs() calls:

>>> df.xs("m", level="size").xs("red", level="color")
          price
category
M            10


But you cannot select the last remaining level this way, because the index is then a plain Index and no longer a MultiIndex:

>>> df.xs("m", level="size").xs("red", level="color").xs("M", level="category")
Traceback (most recent call last):
...
TypeError: Index must be a MultiIndex
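
When the levels to filter are not known in advance, a small wrapper can hide the difference between the two index types. The following is a minimal sketch, not a Pandas API: xs_or_loc is a hypothetical helper, and falling back to loc[] with a one-element list is an assumption about the desired result (it keeps the last level instead of dropping it).

```python
import pandas as pd

df = pd.DataFrame({
    "size": ["m", "m", "l"],
    "color": ["red", "green", "blue"],
    "category": ["M", "W", "M"],
    "price": [10, 11, 12],
}).set_index(["size", "color", "category"])


def xs_or_loc(frame, key, level):
    # Hypothetical helper: xs() on a MultiIndex, loc[] on a plain Index.
    if isinstance(frame.index, pd.MultiIndex):
        return frame.xs(key, level=level)
    if frame.index.name != level:
        raise KeyError(level)
    # A one-element list keeps the result a DataFrame; the level is not dropped.
    return frame.loc[[key]]


step = df
for level, key in [("size", "m"), ("color", "red"), ("category", "M")]:
    step = xs_or_loc(step, key, level)
```

After the loop, step is a one-row DataFrame that is still indexed by "category".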


You can select multiple levels at once, and Pandas drops all selected levels:

>>> df.xs(("m", "red"), level=("size", "color"))
          price
category
M            10


But if you select all levels, Pandas does not drop them:

>>> df.xs(("m", "red", "M"), level=("size", "color", "category"))
                     price
size color category
m    red   M            10
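
If the goal is the single matching row or value rather than a one-row frame, DataFrame.loc[] with the full key tuple is an alternative. A short sketch under that assumption, repeating the sample data so that it runs on its own:

```python
import pandas as pd

df = pd.DataFrame({
    "size": ["m", "m", "l"],
    "color": ["red", "green", "blue"],
    "category": ["M", "W", "M"],
    "price": [10, 11, 12],
}).set_index(["size", "color", "category"])

# A full-depth key addresses exactly one row; the result is a Series
# indexed by the column names.
row = df.loc[("m", "red", "M"), :]

# Adding a column label returns the scalar value.
price = df.loc[("m", "red", "M"), "price"]
```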


Do not try to apply an empty filter:

>>> df.xs((), level=())
Traceback (most recent call last):
...
ValueError: The truth value of an array with more than one element is ambiguous. Use a.any() or a.all()
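
This matters when the filter is built dynamically and may turn out to be empty. One way to guard against it is sketched below; xs_filters and its dict argument are hypothetical names chosen for illustration, and returning the frame unchanged for an empty filter is an assumption about the intended semantics.

```python
import pandas as pd

df = pd.DataFrame({
    "size": ["m", "m", "l"],
    "color": ["red", "green", "blue"],
    "category": ["M", "W", "M"],
    "price": [10, 11, 12],
}).set_index(["size", "color", "category"])


def xs_filters(frame, filters):
    # Hypothetical helper: filters maps level name -> key, e.g. {"size": "m"}.
    if not filters:
        # Nothing to filter on, so skip xs() entirely.
        return frame
    return frame.xs(tuple(filters.values()), level=list(filters.keys()))


unfiltered = xs_filters(df, {})
filtered = xs_filters(df, {"size": "m", "color": "red"})
```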


DataFrame.xs() is not able to select multiple keys:

>>> df.xs(["red", "green"], level="color")
Traceback (most recent call last):
...
KeyError: ('red', 'green')


When DataFrame.xs() is not sufficient, you can fall back to DataFrame.loc[]. You can select a single key in the first level:
>>> df.loc["m"]
                price
color category
red   M            10
green W            11


Or you can select multiple keys in the first level:
>>> df.loc[["m", "l"]]
                     price
size color category
m    red   M            10
     green W            11
l    blue  M            12


But if you want to select by an arbitrary index level, the syntax becomes ugly:

>>> df.loc[(slice(None), "red"), :]
                     price
size color category
m    red   M            10

Note that the "color" level in the above example was not dropped, even though we selected only a single value.
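
pandas.IndexSlice builds the same tuple with ordinary slicing syntax, and DataFrame.droplevel() removes the level afterwards if that is what you want. A short sketch, repeating the sample data so that it runs on its own:

```python
import pandas as pd

df = pd.DataFrame({
    "size": ["m", "m", "l"],
    "color": ["red", "green", "blue"],
    "category": ["M", "W", "M"],
    "price": [10, 11, 12],
}).set_index(["size", "color", "category"])

idx = pd.IndexSlice

# idx[:, "red"] is the tuple (slice(None), "red") used above.
selected = df.loc[idx[:, "red"], :]

# loc[] keeps every level; drop "color" explicitly.
dropped = selected.droplevel("color")
```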


Surprisingly, the most convenient method for selecting multiple values in an arbitrary level can be DataFrame.reindex():
>>> df.reindex(["red", "green"], level="color")
                     price
size color category
m    red   M            10
     green W            11


Conclusions from the above examples:

 - Use DataFrame.loc[] if the index has only one level (that is, it is not a MultiIndex).
 - Use DataFrame.xs() if you want to select a single value in an arbitrary level of a MultiIndex.
 - Use DataFrame.reindex() to select multiple values in an arbitrary level of a MultiIndex.

And good luck if your code has to be generic and work with any DataFrame :)
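
For generic code, one option is a boolean mask built with Index.get_level_values(), which is available on both a plain Index and a MultiIndex. The sketch below is an illustration only: select_in_level is a hypothetical helper, and unlike DataFrame.xs() it never drops the level it filters on.

```python
import pandas as pd

df = pd.DataFrame({
    "size": ["m", "m", "l"],
    "color": ["red", "green", "blue"],
    "category": ["M", "W", "M"],
    "price": [10, 11, 12],
}).set_index(["size", "color", "category"])


def select_in_level(frame, level, values):
    # Hypothetical helper: keep rows whose label in `level` is one of `values`.
    mask = frame.index.get_level_values(level).isin(list(values))
    return frame[mask]


# Works on the three-level MultiIndex ...
multi = select_in_level(df, "color", ["red", "green"])

# ... and on a frame whose index is a plain Index named "category".
flat = select_in_level(df.reset_index(["size", "color"]), "category", ["M"])
```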

"""

import doctest

if __name__ == "__main__":
    doctest.testmod(optionflags=doctest.NORMALIZE_WHITESPACE)