@carlaiau
Created February 9, 2017 23:05
MyWebGrocer Readme

All scrapers within this folder are for scraping MyWebGrocer-powered platforms.

MWG sites reside on the mywebgrocer domain, whereas Curbside Express, Harris Teeter, and Shoprite each have their own domain. The three external sites expose a JSON API layer, so they are much quicker to scrape than the XPath scraping required for the mywebgrocer sites. There are structural similarities across the sites, but not enough to build one spider that covers multiple domains.
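The difference between the two approaches can be sketched with stdlib tools (the field names and markup below are illustrative placeholders, not the actual site responses):

```python
# Hedged sketch: extracting the same product from a JSON API response
# versus an HTML page scraped with XPath-style queries.
import json
import xml.etree.ElementTree as ET

# JSON API response (Curbside Express / Harris Teeter / Shoprite style):
api_body = '{"Name": "Whole Milk", "CurrentPrice": "$3.49"}'
item = json.loads(api_body)
product_from_api = {"name": item["Name"], "price": item["CurrentPrice"]}

# HTML response (mywebgrocer-domain style), parsed with XPath-like queries:
html_body = """
<div class="product">
  <span class="name">Whole Milk</span>
  <span class="price">$3.49</span>
</div>
"""
root = ET.fromstring(html_body)
product_from_html = {
    "name": root.find(".//span[@class='name']").text,
    "price": root.find(".//span[@class='price']").text,
}
```

The JSON path is one parse and two key lookups; the HTML path depends on markup that can change under you, which is why the XPath-scraped sites are slower and more fragile.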

The complete scraping, "middleware", and "post-processing" is done by calling the Python files in the root prefixed with "run_". Examining each of these scripts will give you an understanding of how each spider functions.
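As a rough sketch, a run_* driver orchestrates three phases: store scrape, per-catalog product scrape, and post-processing. The spider names, flags, and post_process.py script below are assumptions for illustration, not the actual code:

```python
# Hedged sketch of the commands a hypothetical run_<chain>.py driver
# would issue, in order.
def build_commands(chain, stores):
    """Return the shell commands for one chain's full pipeline."""
    # 1. Store scrape (where the chain has one)
    cmds = [["scrapy", "crawl", f"{chain}_stores", "-o", "stores.json"]]
    # 2. Product scrape, one catalog at a time, driven by stores.json
    for store in stores:
        sid = store["store_id"]
        cmds.append(["scrapy", "crawl", f"{chain}_products",
                     "-a", f"store_id={sid}",
                     "-o", f"products_{sid}.json"])
    # 3. Post-processing: dedupe/merge categories, tar, upload
    cmds.append(["python", "post_process.py", chain])
    return cmds
```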

I have chosen to write stores and catalogs to separate external files rather than keeping the data in memory, but the scrapers could be amended to use an item system.

Curbside Express

Has a store scraper that obtains the stores.json file; this file is then used as input for the actual product scraper. We use proxy middleware, but route all traffic through a single proxy for each catalog's entire scrape. After all catalogs are scraped, we recursively dedupe and merge categories for products available in multiple categories, then tar the concatenated products.json and stores.json and send them to AWS.
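The dedupe/merge step can be sketched as follows, using the field names from the Product.json structure documented below; treating (source, itemIdInSource) as the merge key is my assumption, not necessarily what the actual post-processing code does:

```python
# Hedged sketch: collapse JSON Lines product records scraped from
# multiple categories into one record per item, merging and
# deduplicating their "categories" lists.
import json

def merge_products(lines):
    merged = {}
    for line in lines:
        product = json.loads(line)
        key = (product["source"], product["itemIdInSource"])  # assumed key
        if key not in merged:
            merged[key] = product
        else:
            seen = merged[key]["categories"]
            for cat in product["categories"]:
                if cat not in seen:
                    seen.append(cat)
    return list(merged.values())
```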

Harris Teeter

Works the same way as Curbside Express: a store scraper obtains stores.json, which is then used as input for the product scraper; all traffic is routed through one proxy per catalog's entire scrape; and after all catalogs are scraped we recursively dedupe and merge categories, then tar and send the concatenated products.json and stores.json to AWS.

MyWebGrocer

Does not have a store scraper; we create the stores.json file manually. Otherwise the process matches the above: proxy middleware with one proxy per catalog's entire scrape, then recursive dedupe and category merging, and the concatenated products.json and stores.json are tarred and sent to AWS.

Shoprite

Has a store scraper, but multiple stores can reference the same catalog, so we must derive the unique list of catalogs. This is why there are both stores.json and catalogs.json outputs. Shoprite uses proxy middleware with all traffic routed through random proxies.
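Deriving catalogs.json from stores.json might look like the sketch below. The `catalog_id` field is hypothetical (the actual catalog identifier is not documented here); the output fields mirror the "Shoprite catalogs.json" structure listed below:

```python
# Hedged sketch: keep the first store seen for each catalog, producing
# a de-duplicated catalog list from the full store list.
def unique_catalogs(stores, catalog_key="catalog_id"):
    seen = set()
    catalogs = []
    for store in stores:
        cid = store[catalog_key]  # hypothetical catalog identifier
        if cid not in seen:
            seen.add(cid)
            catalogs.append({"state": store["state"],
                             "store_id": store["store_id"],
                             "chain": store["chain"]})
    return catalogs
```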

JSON Structures

Product.json (JSON Lines)

"name"
"description"
"timeStamp"
"source"
"itemIdInSource"
"upc"
"brandName"
"categories": [{
    "rootCategoryId"
    "rootCategoryName"
    "parentCategoryName"
    "subCategoryName"
    }]
"location": {
  "storeId"
  "chain"
},
"price"
"size": {
    "sizeRaw",
    "sizeField",
    "sizeValue"
}
"mainImage"

Prettified samples of the structures below are in the samples folder.

Curbside Express stores.json

"name"
"chain"
"url"
"zipcode"
"source"
"state"
"address_2"
"address"
"lat"
"lng"
"id"

Harris Teeter stores.json

"store_id"
"chain"
"address1"
"online_Id"
"zipcode"
"long"
"lat"
"address2"
"name"

My Web Grocer stores.json

"store_id"
"chain"
"address"
"address_2"
"state"
"zipcode"

Shoprite stores.json

"store_id"
"chain"
"zipcode"
"state"
"address_2"
"address"

Shoprite catalogs.json (essentially a de-duplicated list of stores)

"state"
"store_id"
"chain"
@personalt2:

This looks really interesting. Did you ever post the Python files that go with this?

@quinnpertuit:

Hi, same question: did you ever post the files to go with this?

@carlaiau (Author):

Sorry, I only just saw these. I have them in a private repo somewhere, mainly built on Scrapy a long time ago; I am unsure if they'll still work.

@wexTeam (Jul 29, 2022):

Hello carlaiau, I want to get the Instacart scraper that you created. Can you email me? Thanks.
