Skip to content

Instantly share code, notes, and snippets.

@shivaram
Last active August 29, 2015 14:16
Show Gist options
  • Star 0 You must be signed in to star a gist
  • Fork 0 You must be signed in to fork a gist
  • Save shivaram/4ff0a9c226dda2030507 to your computer and use it in GitHub Desktop.
Save shivaram/4ff0a9c226dda2030507 to your computer and use it in GitHub Desktop.
Chris DataFrame demo
val txnsRaw = sqlContext.load("./Customer_Transactions.parquet", "parquet")
val demo = sqlContext.load("./Customer_Demographics.parquet", "parquet")
val sample = sqlContext.load("./DM_Sample.parquet", "parquet")
val spendPerCust = txnsRaw.groupBy("cust_id").agg(txnsRaw.col("cust_id"), sum(txnsRaw.col("extended_price").as("spend")))
val txnsPerCust = txnsRaw.groupBy("cust_id", "day_num").agg(
count(txnsRaw.col("sku").as("txns")), txnsRaw.col("cust_id"), txnsRaw.col("day_num"))
val joined = txnsPerCust.join(spendPerCust, txnsPerCust("cust_id") === spendPerCust("cust_id"), "inner")
// This creates 40k tasks with over a gigabyte of data and takes forever
joined.count()
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment