Skip to content

Instantly share code, notes, and snippets.

@bgromov
Last active November 3, 2022 10:10
Show Gist options
  • Save bgromov/ff624ad1fda63b47b75596e8c13dc3a5 to your computer and use it in GitHub Desktop.
Save bgromov/ff624ad1fda63b47b75596e8c13dc3a5 to your computer and use it in GitHub Desktop.
Pandas: DataFrame.query() for complex data types

Pandas: DataFrame.query() for complex data types

Consider the following pandas DataFrame:

df = pd.DataFrame([[[1, 2, 3]], [[4, 5, 6]], [[7, 8, 9]], [[0, 1, 2]]], columns=['a'])
           a
0  [1, 2, 3]
1  [4, 5, 6]
2  [7, 8, 9]
3  [0, 1, 2]

How do we query all rows where the first element is greater than 5?

One of the ways would be to use DataFrame.query(). Unfortunatelly, there is no way to access individual elements of cell values, i.e., we can't do this:

df.query('a[0] > 5') # TypeError: '>' not supported between instances of 'list' and 'int'

Clearly, that happens because the item accessor selects a row number 0 which is then compared to a scalar number.

Another approach would be to extract necessary data using a lambda function within apply() method; however, it is not supported yet:

df.query('a.apply(lambda x: x[0]) > 5') # NotImplementedError: 'Lambda' nodes are not implemented

Luckily, we can solve this problem with custom accessors. We can extend the pandas.Series and extract necessary information from values of complex types.

First, we have to register an accessor:

import pandas as pd

@pd.api.extensions.register_series_accessor('list')
class ListAccessor:
    def __init__(self, pandas_obj):
        self._obj = pandas_obj

    def __getitem__(self, index):
        return self._obj.apply(lambda x: x[index])

Then, we can immediately use it with query() or eval():

Eval:

df.eval('a.list[1]')
0    2
1    5
2    8
3    1
Name: a, dtype: int64

Query:

df.query('a.list[2] > 4')
           a
1  [4, 5, 6]
2  [7, 8, 9]
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment