@1ambda
Created December 21, 2021 23:28
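from pyspark.sql.functions import col, collect_set

# `df` is assumed to be an existing DataFrame with event_time, brand,
# product_id, category_code and category_id columns.
dfRaw = (
    df
    # Filter first, while brand and category_code are still plain input columns.
    .where(col("brand").isNotNull() & col("category_code").isNotNull())
    .selectExpr(
        "CAST(event_time AS DATE) AS event_date",
        "brand",
        "product_id",
        "ARRAY(category_code, category_id) AS category",
    )
    # One row per day, with the distinct products and categories seen that day.
    .groupBy("event_date")
    .agg(
        collect_set("product_id").alias("product_id_set"),
        collect_set("category").alias("category_set"),
    )
)
dfRaw.printSchema()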