@atiaxi
Created January 15, 2015 19:33

importer.py finefoods.txt

Parses the specific format of finefoods.txt and populates the 'reviews' table.
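As a rough illustration of that parsing step, here is a minimal sketch. It assumes the file is made of "key: value" lines (for example "product/productId: ..." and "review/score: ...") with a blank line between reviews; that layout and the name parse_reviews are assumptions for illustration, not taken from importer.py.

```python
from typing import Dict, Iterable, Iterator


def parse_reviews(lines: Iterable[str]) -> Iterator[Dict[str, str]]:
    """Yield one dict per review from 'key: value' lines.

    Assumes records are separated by blank lines (hypothetical layout).
    """
    record: Dict[str, str] = {}
    for raw in lines:
        line = raw.strip()
        if not line:
            # A blank line closes the current record, if there is one.
            if record:
                yield record
                record = {}
            continue
        # Split on the first colon only, so values may contain colons.
        key, sep, value = line.partition(":")
        if sep:
            record[key.strip()] = value.strip()
    # Emit the last record when the input does not end with a blank line.
    if record:
        yield record
```

Each yielded dict could then be bound to an INSERT into the 'reviews' table; passing an open file handle as `lines` streams the file rather than loading it whole.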

products_scaper.py finefoods.txt

The finefoods.txt file doesn't include product information, only ASINs. This script spawns 10 threads to scrape data from a service that takes an ASIN and returns product information (name, description, and images). It writes a .csv file formatted for the CQL "COPY" command, which is then used to populate the 'products' table.
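The threaded scrape might be sketched as follows. The lookup service is not named here, so it is represented by a caller-supplied `fetch` function; `fetch`, `scrape_to_csv`, and the column order in FIELDS are all hypothetical. A thread pool stands in for the 10 hand-spawned threads.

```python
import csv
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, Dict, Iterable, TextIO

# Assumed column order; it would have to match the column list given to COPY.
FIELDS = ["asin", "name", "description", "img_url"]


def scrape_to_csv(asins: Iterable[str],
                  fetch: Callable[[str], Dict[str, str]],
                  out: TextIO,
                  workers: int = 10) -> int:
    """Fetch each distinct ASIN on a worker thread; write one CSV row per product."""
    writer = csv.writer(out)
    count = 0
    with ThreadPoolExecutor(max_workers=workers) as pool:
        # pool.map runs fetch concurrently but yields results in input order,
        # and only this thread touches the writer, so no lock is needed.
        for product in pool.map(fetch, sorted(set(asins))):
            writer.writerow([product.get(field, "") for field in FIELDS])
            count += 1
    return count
```

Deduplicating the ASINs first matters because many reviews share a product, and each product only needs to be looked up once.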

product_backfill.py

The 'reviews' table has static fields such as "img_url" that did not exist in the original finefoods.txt file that populated the table. This script goes through the 'products' table and uses the information there to update the 'reviews' table.
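One way to picture the backfill is as a stream of UPDATE statements, one per product. This sketch only builds the statements; the column names (product_id, img_url), the dict key "asin", and the assumption that product_id is the partition key of 'reviews' are guesses for illustration, not the actual schema.

```python
from typing import Dict, Iterable, Iterator, Tuple

# Hypothetical statement: a static column can be updated by partition key alone.
UPDATE_CQL = "UPDATE reviews SET img_url = %s WHERE product_id = %s"


def backfill_statements(
    products: Iterable[Dict[str, str]],
) -> Iterator[Tuple[str, Tuple[str, str]]]:
    """Yield (cql, params) pairs for every product that has an image URL."""
    for product in products:
        img_url = product.get("img_url")
        if img_url:
            # Products without an image are skipped rather than nulled out.
            yield UPDATE_CQL, (img_url, product["asin"])
```

With the DataStax Python driver, each pair could be passed to `session.execute(cql, params)`.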

calculate_reviews.py

This script creates the 'products_by_score' table. It goes through the 'reviews' table and averages the reviews for each product.
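The averaging step can be sketched in plain Python. It assumes the reviews have already been read as (product_id, score) pairs; the function name and input shape are illustrative only.

```python
from collections import defaultdict
from typing import Dict, Iterable, Tuple


def average_scores(reviews: Iterable[Tuple[str, float]]) -> Dict[str, float]:
    """Return the mean review score for each product id."""
    # Keep a running [sum, count] per product so the input is read only once.
    totals = defaultdict(lambda: [0.0, 0])
    for product_id, score in reviews:
        entry = totals[product_id]
        entry[0] += score
        entry[1] += 1
    return {pid: total / count for pid, (total, count) in totals.items()}
```

For example, `average_scores([("A", 5.0), ("A", 3.0), ("B", 4.0)])` gives `{"A": 4.0, "B": 4.0}`; each entry would become one row of 'products_by_score'.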

populate_score_images.py

The 'products_by_score' table has an img_url field, but the images in the 'reviews' table are far too big for this purpose. This script goes through the 'products' table and uses the small_img_url stored there to populate the img_url field of the 'products_by_score' table.
