@melissakou
Last active October 4, 2021 13:47
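from pyspark.sql import SparkSession, functions as F

# reuse the active session (e.g. in a notebook) or create one
spark = SparkSession.builder.getOrCreate()

# read the sales CSV; with no schema given, every column is loaded as a string
sales = spark.read.option("header", True).csv("sales_train_evaluation.csv")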
# select d_1~d_100 and turn into long format
# range's upper bound is exclusive, so 101 is needed to include d_100
cols = ["d_" + str(i) for i in range(1, 101)]
sales = sales \
    .selectExpr("id", "item_id", "dept_id", "cat_id", "store_id", "state_id",
                "stack({}, {}) as (d, amount)".format(
                    len(cols), ", ".join("'{}', {}".format(i, i) for i in cols))) \
    .cache()
# group by state_id: total amount per state, largest first
groupby_state = sales \
    .groupBy("state_id") \
    .agg(F.sum("amount").alias("amt_tot")) \
    .orderBy(F.col("amt_tot").desc())
groupby_state.show()
# group by store_id: total amount per store, largest first
groupby_store = sales \
    .groupBy("store_id") \
    .agg(F.sum("amount").alias("amt_tot")) \
    .orderBy(F.col("amt_tot").desc())
groupby_store.show()
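The `stack(...)` expression passed to `selectExpr` above is built by string formatting and is hard to read inline. As a minimal sketch, the same string can be produced by a small helper and printed before handing it to Spark. The helper name `build_stack_expr` and its `key_name` / `value_name` parameters are hypothetical and not part of the original snippet; it needs only plain Python, no Spark session.

```python
def build_stack_expr(cols, key_name="d", value_name="amount"):
    """Build a Spark SQL stack() expression that unpivots `cols` to long format.

    Each column contributes a pair: its name as a string literal (the key)
    followed by the column reference itself (the value).
    Hypothetical helper, shown for illustration only.
    """
    pairs = ", ".join("'{0}', {0}".format(c) for c in cols)
    return "stack({}, {}) as ({}, {})".format(len(cols), pairs, key_name, value_name)


# Three day columns are enough to see the shape of the generated expression.
day_cols = ["d_{}".format(i) for i in range(1, 4)]
expr = build_stack_expr(day_cols)
print(expr)
# stack(3, 'd_1', d_1, 'd_2', d_2, 'd_3', d_3) as (d, amount)
```

With the full column list, the result could be passed as the last argument to `selectExpr` in place of the inline `.format(...)` call. Printing it first makes it easy to confirm that the row count given to `stack` matches the number of name/value pairs.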