https://s3.amazonaws.com/amazon-reviews-pds/readme.html
vmacs:amazon-reviews vs$ find . -type f | cut -d/ -f2 | sort | uniq -c
10 product_category=Apparel
10 product_category=Automotive
10 product_category=Baby
10 product_category=Beauty
10 product_category=Books
10 product_category=Camera
10 product_category=Digital_Ebook_Purchase
10 product_category=Digital_Music_Purchase
10 product_category=Digital_Software
10 product_category=Digital_Video_Download
10 product_category=Digital_Video_Games
10 product_category=Electronics
10 product_category=Furniture
10 product_category=Gift_Card
10 product_category=Grocery
10 product_category=Health_&_Personal_Care
10 product_category=Home
10 product_category=Home_Entertainment
10 product_category=Home_Improvement
10 product_category=Jewelry
10 product_category=Kitchen
10 product_category=Lawn_and_Garden
10 product_category=Luggage
10 product_category=Major_Appliances
10 product_category=Mobile_Apps
10 product_category=Mobile_Electronics
10 product_category=Music
10 product_category=Musical_Instruments
10 product_category=Office_Products
10 product_category=Outdoors
10 product_category=PC
10 product_category=Personal_Care_Appliances
10 product_category=Pet_Products
10 product_category=Shoes
10 product_category=Software
10 product_category=Sports
10 product_category=Tools
10 product_category=Toys
10 product_category=Video
10 product_category=Video_DVD
10 product_category=Video_Games
10 product_category=Watches
10 product_category=Wireless
Sample query
select sum(total_votes), product_category from amazon_reviews where review_date > '2007' and review_date < '2009' group by product_category
Conclusion : Again, we spend a lot of effort in sizing files.. and this is so amazing that it saves tons of compute time. You can see that both stages roughly have similar sizes tasks. but the overhead of small files kills performance. 3-4x, for a file reduction of about 20x