@koushikmln
Created July 7, 2018 19:25
Process order items using Spark to get (order_id, sub_total) tuples, the total amount for a given order_id, and the revenue per order as a collection
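# Use map to create an RDD of (order_id, sub_total) tuples.
# Assumes `sc` is an existing SparkContext (e.g. the one provided by the pyspark shell).
rdd = sc.textFile("/public/retail_db/order_items/part-00000")
# Split each line once, then pick order_id (field 1) and sub_total (field 4).
orderItemTuple = rdd.map(lambda x: x.split(",")).map(lambda f: (int(f[1]), float(f[4])))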
orderItemTuple.take(10)
# Get the total for a particular order_id (here order_id 2); the result is an (order_id, total) tuple.
orderItemTuple.filter(lambda x: x[0] == 2).reduce(lambda x, y: (x[0], x[1] + y[1]))
# Get (order_id, total) tuples for every order.
orderItemTuple.reduceByKey(lambda x, y: x + y).take(10)
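
To check the logic without a cluster, the same per-key sum can be sketched in plain Python. This is only an illustration of what `reduceByKey(lambda x, y: x + y)` computes here, not how Spark executes it; the sample rows and the helper names `parse_order_item` and `total_by_order` are hypothetical and only assume the comma-separated layout used above (order_id in field 1, sub_total in field 4).

```python
from collections import defaultdict

# Hypothetical sample rows; only fields 1 (order_id) and 4 (sub_total) matter.
sample_lines = [
    "1,1,957,1,299.98,299.98",
    "2,2,1073,1,199.99,199.99",
    "3,2,502,5,250.0,50.0",
    "4,2,403,1,129.99,129.99",
]


def parse_order_item(line):
    """Split the line once and return an (order_id, sub_total) tuple."""
    fields = line.split(",")
    return int(fields[1]), float(fields[4])


def total_by_order(lines):
    """Sum sub-totals per order_id, mirroring the per-key addition above."""
    totals = defaultdict(float)
    for order_id, sub_total in map(parse_order_item, lines):
        totals[order_id] += sub_total
    return dict(totals)


totals = total_by_order(sample_lines)
print(sorted(totals.items()))
```

Because the totals are floating-point sums, compare them with a tolerance rather than exact equality when checking results.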