bgromov/pandas-query.md

## pandas-query.md

      
    Raw
  

              pandas-query.md
            
          
    Pandas: DataFrame.query() for complex data types

Consider the following pandas DataFrame:
df = pd.DataFrame([[[1, 2, 3]], [[4, 5, 6]], [[7, 8, 9]], [[0, 1, 2]]], columns=['a'])
           a
0  [1, 2, 3]
1  [4, 5, 6]
2  [7, 8, 9]
3  [0, 1, 2]

How do we query all rows where the first element is greater than 5?
One of the ways would be to use DataFrame.query(). Unfortunatelly, there is no way to access individual elements of cell values, i.e., we can't do this:
df.query('a[0] > 5') # TypeError: '>' not supported between instances of 'list' and 'int'
Clearly, that happens because the item accessor selects a row number 0 which is then compared to a scalar number.
Another approach would be to extract necessary data using a lambda function within apply() method; however, it is not supported yet:
df.query('a.apply(lambda x: x[0]) > 5') # NotImplementedError: 'Lambda' nodes are not implemented
Luckily, we can solve this problem with custom accessors. We can extend the pandas.Series and extract necessary information from values of complex types.
First, we have to register an accessor:
import pandas as pd

@pd.api.extensions.register_series_accessor('list')
class ListAccessor:
    def __init__(self, pandas_obj):
        self._obj = pandas_obj

    def __getitem__(self, index):
        return self._obj.apply(lambda x: x[index])
Then, we can immediately use it with query() or eval():
Eval:
df.eval('a.list[1]')
0    2
1    5
2    8
3    1
Name: a, dtype: int64

Query:
df.query('a.list[2] > 4')
           a
1  [4, 5, 6]
2  [7, 8, 9]