Skip to content

Instantly share code, notes, and snippets.

Embed
What would you like to do?
Spark SQL Amazon Reviews Dataset - Small file size impact

https://s3.amazonaws.com/amazon-reviews-pds/readme.html

vmacs:amazon-reviews vs$ find . -type f | cut -d/ -f2 | sort | uniq -c
  10 product_category=Apparel
  10 product_category=Automotive
  10 product_category=Baby
  10 product_category=Beauty
  10 product_category=Books
  10 product_category=Camera
  10 product_category=Digital_Ebook_Purchase
  10 product_category=Digital_Music_Purchase
  10 product_category=Digital_Software
  10 product_category=Digital_Video_Download
  10 product_category=Digital_Video_Games
  10 product_category=Electronics
  10 product_category=Furniture
  10 product_category=Gift_Card
  10 product_category=Grocery
  10 product_category=Health_&_Personal_Care
  10 product_category=Home
  10 product_category=Home_Entertainment
  10 product_category=Home_Improvement
  10 product_category=Jewelry
  10 product_category=Kitchen
  10 product_category=Lawn_and_Garden
  10 product_category=Luggage
  10 product_category=Major_Appliances
  10 product_category=Mobile_Apps
  10 product_category=Mobile_Electronics
  10 product_category=Music
  10 product_category=Musical_Instruments
  10 product_category=Office_Products
  10 product_category=Outdoors
  10 product_category=PC
  10 product_category=Personal_Care_Appliances
  10 product_category=Pet_Products
  10 product_category=Shoes
  10 product_category=Software
  10 product_category=Sports
  10 product_category=Tools
  10 product_category=Toys
  10 product_category=Video
  10 product_category=Video_DVD
  10 product_category=Video_Games
  10 product_category=Watches
  10 product_category=Wireless

Sample query

select sum(total_votes), product_category from amazon_reviews where review_date > '2007' and review_date < '2009' group by product_category 

@vinothchandar

This comment has been minimized.

Copy link
Owner Author

@vinothchandar vinothchandar commented Jul 18, 2020

As-is (10 partitions)

val df = spark.read.parquet("file:///Volumes/HUDIDATA/input-data/amazon-reviews")
df.registerTempTable("amazon_reviews")

Takes like 17 seconds, but the data has some natural date based proximity in files.

image

image

@vinothchandar

This comment has been minimized.

Copy link
Owner Author

@vinothchandar vinothchandar commented Jul 18, 2020

100 Partitions

val part100Path = "file:///Volumes/HUDIDATA/input-data/amazon-reviews-100-parts"
df.repartition(100).write.mode("overwrite").partitionBy("product_category").parquet(part100Path)
val df100 = spark.read.parquet(part100Path)
df100.registerTempTable("amazon_reviews_100_parts")

image

image

image

@vinothchandar

This comment has been minimized.

Copy link
Owner Author

@vinothchandar vinothchandar commented Jul 18, 2020

5 partitions

val part5Path = "file:///Volumes/HUDIDATA/input-data/amazon-reviews-5-parts"
df.repartition(5).write.mode("overwrite").partitionBy("product_category").parquet(part5Path)
val df5 = spark.read.parquet(part5Path)
df5.registerTempTable("amazon_reviews_5_parts")

image
image
image

@vinothchandar

This comment has been minimized.

Copy link
Owner Author

@vinothchandar vinothchandar commented Jul 19, 2020

Conclusion : Again, we spend a lot of effort in sizing files.. and this is so amazing that it saves tons of compute time. You can see that both stages roughly have similar sizes tasks. but the overhead of small files kills performance. 3-4x, for a file reduction of about 20x

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment