@bgweber
Last active May 19, 2022 09:19
Distributing Feature Generation with Pandas UDFs
import featuretools as ft
from pyspark.sql.functions import pandas_udf, PandasUDFType

# `schema` (the output schema of the UDF), `saved_features`, and `sparkInputDF`
# are not defined in this snippet and must be created beforehand.
@pandas_udf(schema, PandasUDFType.GROUPED_MAP)
def apply_feature_generation(pandasInputDF):
    # create Entity Set representation
    es = ft.EntitySet(id="events")
    es = es.entity_from_dataframe(entity_id="events", dataframe=pandasInputDF)
    es = es.normalize_entity(base_entity_id="events", new_entity_id="users", index="user_id")

    # apply the feature calculation and return the result
    return ft.calculate_feature_matrix(saved_features, es)

# apply the UDF to each user_group; Spark distributes the groups across the cluster
sparkFeatureDF = sparkInputDF.groupby('user_group').apply(apply_feature_generation)
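
The `schema` passed to `pandas_udf` above is not shown in the gist. For a grouped map UDF it has to describe the columns of the pandas DataFrame that the function returns, and `pandas_udf` accepts it either as a `StructType` or as a DDL-formatted string. The sketch below is one possible way to build such a string from column dtypes. The helper `ddl_schema_from_dtypes`, the mapping table, and the example column names are hypothetical and not part of the original code.

```python
# Hypothetical mapping from pandas dtype names to Spark SQL DDL type names.
# It covers only a few common dtypes; extend it as needed.
PANDAS_TO_SPARK_DDL = {
    "int64": "bigint",
    "int32": "int",
    "float64": "double",
    "float32": "float",
    "bool": "boolean",
    "object": "string",
    "datetime64[ns]": "timestamp",
}


def ddl_schema_from_dtypes(dtypes):
    """Build a DDL schema string from a mapping of column name to pandas dtype.

    `dtypes` can be a plain dict or anything with an `.items()` method that
    yields (column, dtype) pairs, such as the `dtypes` attribute of a pandas
    DataFrame.
    """
    fields = []
    for column, dtype in dtypes.items():
        dtype_name = str(dtype)
        if dtype_name not in PANDAS_TO_SPARK_DDL:
            raise ValueError(
                f"no Spark type mapping for column {column!r} with dtype {dtype_name!r}"
            )
        # Backticks quote the column name so that names containing
        # characters such as parentheses remain valid identifiers.
        fields.append(f"`{column}` {PANDAS_TO_SPARK_DDL[dtype_name]}")
    return ", ".join(fields)


# Example column names are assumptions, not taken from the gist.
example_schema = ddl_schema_from_dtypes(
    {"user_id": "int64", "event_count": "int64", "avg_value": "float64"}
)
print(example_schema)
```

With a helper like this, one option is to run the feature generation on a small pandas sample first and pass the `dtypes` of the result to derive the schema string for the UDF.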
@lvjiujin

What is the data type of the schema? Is it determined by the return value of the apply_feature_generation function?