@iwanbolzern
Last active March 14, 2022 11:12
Have you ever wished that others would hand you a DataFrame with the right columns? If not, you're a lucky one. 🤩 If so, check out this gist 🧐

First step in the direction of a really typed DataFrame

I come across many great libraries every day, but unfortunately most of them are not well suited for enterprise or medical projects, because they lack proper interface definitions or don't play well with the refactoring capabilities of your IDE.

This is especially true when it comes to libraries in the data science area. Just remember your last df['my_feature'] = df['my_feature'] * 2 😉 And unfortunately, exactly these libraries are also the ones that are written for super fast computations.

Well, it seems that we have the choice between the super fast untyped option and a bunch of slow, crappy other implementations...

Hard life... 😥

But wait, I have heard about a small wrapper around a pandas DataFrame, which allows for something like typing?

Right, it's called a TypedDataFrame. 😀😀😀

Let's dive into it!

A TypedDataFrame builds on top of pandas and dataclasses. To create one, we simply define our own class and inherit from TypedDataFrame. In the example below we want to have a table of flight delays:

from .typed_df import TypedDataFrame

class FlightDelayDf(TypedDataFrame):
  origin: str
  destination: str
  delay_sec: int
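
If you are curious what such a class body gives Python to work with, here is a minimal sketch. It assumes nothing from the gist: a plain class stands in for the TypedDataFrame subclass so the snippet runs without pandas, and the variable names are only illustrative.

```python
# Stand-in for the gist's class: a plain class with the same annotations,
# so the snippet runs without typed_df or pandas.
class FlightDelayDf:
    origin: str
    destination: str
    delay_sec: int


# Annotations are collected in declaration order as a name -> type mapping.
annotations = FlightDelayDf.__annotations__
column_names = list(annotations)
print(column_names)  # ['origin', 'destination', 'delay_sec']

# An annotation without a value does not create a class attribute.
print(hasattr(FlightDelayDf, "delay_sec"))  # False
```

The annotated names are the expected columns. Because they are not attributes of the class itself, the wrapper sets them on each instance when it is constructed.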

We now have a custom type that we can use for method annotations like:

import pandas as pd

def load_flight_delays() -> FlightDelayDf:
  flight_delays = [
      ['BRN', 'BXO', 10],
      ['GVA', 'LUG', 22],
      ['SIR', 'LUG', 34],
      ['ZRH', 'BRN', 65]
  ]
  df = pd.DataFrame(flight_delays, columns=['origin', 'destination', 'delay_sec'])
  return FlightDelayDf(df)

Yay, have you seen it? No? We have just defined an interface between our load_flight_delays() function and the outer world and still retained the advantages of a pandas DataFrame. You don't believe me? See for yourself...

flight_delays = load_flight_delays()
# The column names are set as attributes on the instance, not on the class.
flight_delays.df[flight_delays.delay_sec].hist()

Convinced? And the best thing about it is that it would have raised an AssertionError if the original DataFrame had been missing any of the columns declared in FlightDelayDf. Want to see another amazing feature? You can return your underlying DataFrame as a list of dataclass objects. See here...

flight_delay_obj = flight_delays.objects[0]
print(f'From {flight_delay_obj.origin} to {flight_delay_obj.destination} with {flight_delay_obj.delay_sec} sec delay.')

This is sometimes super helpful for debugging.
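
To get a feeling for how rows can become objects, here is a rough, pandas-free sketch of the idea using make_dataclass from the standard library. The annotations dict and the list of row dicts are stand-ins I made up for illustration; in the wrapper they would come from the class annotations and the DataFrame rows.

```python
from dataclasses import asdict, make_dataclass

# Hypothetical stand-in for the annotations of FlightDelayDf.
annotations = {"origin": str, "destination": str, "delay_sec": int}

# Build the row type dynamically, one field per annotated column.
FlightDelay = make_dataclass("FlightDelay", list(annotations.items()))

# Plain dicts stand in for DataFrame rows here.
rows = [
    {"origin": "BRN", "destination": "BXO", "delay_sec": 10},
    {"origin": "GVA", "destination": "LUG", "delay_sec": 22},
]
objects = [FlightDelay(**row) for row in rows]

print(objects[0])  # FlightDelay(origin='BRN', destination='BXO', delay_sec=10)
```

Each object gets a readable repr and attribute access for free, which is what makes this handy in a debugger.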

Now let's have fun with it! If you have questions or you're interested in developing this concept further, please get in touch at any time.

Cheers,

Iwan

import abc
from dataclasses import make_dataclass

import pandas as pd


class TypedDataFrame(abc.ABC):
    def __init__(self, df: pd.DataFrame):
        self._df = df
        self.annotations = self.__class__.__annotations__
        # Keep the columns as a list: declaration order is preserved and
        # recent pandas versions reject a set as a column indexer.
        self.columns = list(self.annotations.keys())
        assert set(self.columns).issubset(set(self._df.columns)), \
            'Your DataFrame does not have all expected columns.\n' \
            f'Expected: {self.columns}\n' \
            f'Given: {list(self._df.columns)}'
        self._init_annotations()
        self._init_data_class()

    def _init_annotations(self):
        # Expose each column name as an instance attribute,
        # e.g. self.delay_sec == 'delay_sec'.
        for c in self.columns:
            setattr(self, c, c)

    def _init_data_class(self):
        # Derive the row class name by stripping DataFrame-like parts,
        # e.g. FlightDelayDf -> FlightDelay.
        self.data_class_name = self.__class__.__name__
        for remove in ['df', 'DF', 'Df', 'DataFrame', 'Dataframe', 'dataframe']:
            self.data_class_name = self.data_class_name.replace(remove, '')
        self.data_class = make_dataclass(
            self.data_class_name,
            [(key, type_) for key, type_ in self.annotations.items()])

    @property
    def df(self):
        # Only the declared columns, in declaration order.
        return self._df[self.columns]

    @property
    def objects(self):
        object_list = []
        for _, row in self.df.iterrows():
            value_dict = {c: row[c] for c in self.columns}
            object_list.append(self.data_class(**value_dict))
        return object_list
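
One thing to keep in mind: the column check above relies on assert, and assert statements are skipped when Python runs with the -O flag. If you want the check to always fire, an explicit exception is one option. The following is only a sketch of that idea, not part of the gist; missing_columns and check_columns are hypothetical helper names, and they work on plain lists of column names so the snippet runs without pandas.

```python
from typing import Iterable, List


def missing_columns(expected: Iterable[str], given: Iterable[str]) -> List[str]:
    """Return the expected column names that are absent from `given`, in order."""
    given_set = set(given)
    return [name for name in expected if name not in given_set]


def check_columns(expected: Iterable[str], given: Iterable[str]) -> None:
    """Raise ValueError if any expected column is missing (extra columns are fine)."""
    missing = missing_columns(expected, given)
    if missing:
        raise ValueError(f"DataFrame is missing expected columns: {missing}")


expected = ["origin", "destination", "delay_sec"]

# Extra columns are allowed, just like with the subset check in the wrapper.
check_columns(expected, ["origin", "destination", "delay_sec", "airline"])

try:
    check_columns(expected, ["origin", "destination"])
except ValueError as err:
    print(err)  # DataFrame is missing expected columns: ['delay_sec']
```

Inside the wrapper, such a helper could be called with the annotated names and df.columns in place of the assert.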