@aialenti
Last active September 20, 2020 14:07
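from pyspark.sql import SparkSession
from pyspark.sql.functions import col, collect_list, collect_set

# Reuse the active Spark session, or create one if none exists
spark = SparkSession.builder.getOrCreate()

# Read the source table in Parquet format
sales_table = spark.read.parquet("./data/sales_parquet")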
'''
SELECT COLLECT_SET(num_pieces_sold) AS num_pieces_sold_set,
       COLLECT_LIST(num_pieces_sold) AS num_pieces_sold_list,
       seller_id
FROM sales_table
GROUP BY seller_id
'''
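# Group by seller and gather the values of num_pieces_sold for each one
sales_table_execution_plan = sales_table.groupBy(col("seller_id")).agg(
    # collect_set drops duplicate values
    collect_set(col("num_pieces_sold")).alias("num_pieces_sold_set"),
    # collect_list keeps duplicates
    collect_list(col("num_pieces_sold")).alias("num_pieces_sold_list"),
)
# Print the first 10 rows, truncating long cell values
sales_table_execution_plan.show(10, True)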