All scrapers within this folder are for scraping MyWebGrocer-powered platforms.
mwg sites reside on the mywebgrocer domain, whereas Curbside Express, Harris Teeter, and Shoprite have their own domains. The three external sites expose a JSON API layer, so they are much quicker to scrape than the XPath scraping required for the mywebgrocer sites. There are similarities in the structure of the sites, but not enough to allow a single multi-domain spider.
The complete scraping, "middleware", and "post-processing" are run by calling the Python files in the root prefixed with "run_". Examining each of these scripts will give you an understanding of how each spider functions.
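As an illustration only, a run_ script might look roughly like the sketch below, assuming the project is built on Scrapy; the module path and spider class are hypothetical, not the repo's actual names:

```python
# Hypothetical sketch of a run_ script: load stores.json, then run one
# product crawl per store. Names are assumptions, not the repo's code.
import json

from scrapy.crawler import CrawlerProcess
from scrapy.utils.project import get_project_settings

from spiders.product_spider import ProductSpider  # hypothetical module path


def main():
    # Each store record drives one product crawl.
    with open("stores.json") as f:
        stores = json.load(f)

    process = CrawlerProcess(get_project_settings())
    for store in stores:
        process.crawl(ProductSpider, store=store)
    process.start()  # blocks until all crawls finish


if __name__ == "__main__":
    main()
```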
I have chosen to write stores and catalogs to separate external files rather than keeping the data in memory, but the scrapers could be amended to use an item system.
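If we did move to an item system, a Scrapy Item along these lines would be the natural shape; this is a sketch with field names taken from the products.json schema below, not code from the repo:

```python
# Sketch of a Scrapy Item mirroring the products.json fields.
import scrapy


class ProductItem(scrapy.Item):
    name = scrapy.Field()
    description = scrapy.Field()
    upc = scrapy.Field()
    brandName = scrapy.Field()
    price = scrapy.Field()
    categories = scrapy.Field()
    location = scrapy.Field()
    size = scrapy.Field()
    mainImage = scrapy.Field()
```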
Curbside Express and Harris Teeter each have a store scraper that obtains a store.json file; this file is then used as the input for the actual product scraper. We use proxy middleware, but route all traffic through one proxy for each catalog's entire scrape. After all catalogs are scraped, we recursively dedupe and merge categories for products available in multiple categories, then tar the concatenated products.json and stores.json and send the archive to AWS.
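As a sketch, the "one proxy per catalog scrape" behaviour could be a Scrapy downloader middleware that picks its proxy once per crawl and reuses it for every request; the PROXY_POOL setting and the class name here are assumptions:

```python
# Hedged sketch of a downloader middleware that pins a single proxy for
# the whole crawl, so one catalog's scrape stays on one proxy.
import random


class SingleProxyPerCrawlMiddleware:
    def __init__(self, proxy_pool):
        # Choose one proxy when the crawl starts and keep it.
        self.proxy = random.choice(proxy_pool)

    @classmethod
    def from_crawler(cls, crawler):
        # PROXY_POOL is an assumed settings key, e.g. a list of proxy URLs.
        return cls(crawler.settings.getlist("PROXY_POOL"))

    def process_request(self, request, spider):
        request.meta["proxy"] = self.proxy
```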
The mwg sites do not have a store scraper; we create the store.json file manually. The rest of the pipeline is the same as above: proxy middleware with all traffic routed through one proxy for each catalog's entire scrape, then, after all catalogs are scraped, we recursively dedupe and merge categories for products available in multiple categories, tar the concatenated products.json and stores.json, and send the archive to AWS.
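The dedupe/merge step these pipelines share could look something like this sketch; the file names and the choice of itemIdInSource as the merge key are assumptions:

```python
# Hedged sketch of the post-processing merge: products that appear in
# several categories collapse to one record whose "categories" list is
# the union of everything seen for that product.
import json


def merge_products(products):
    merged = {}
    for p in products:
        key = p["itemIdInSource"]  # assumed merge key
        if key not in merged:
            merged[key] = p
        else:
            # Append any category entry not already recorded for this product.
            seen = merged[key]["categories"]
            for cat in p["categories"]:
                if cat not in seen:
                    seen.append(cat)
    return list(merged.values())


with open("products_raw.json") as f:  # hypothetical input file
    deduped = merge_products(json.load(f))
with open("products.json", "w") as f:
    json.dump(deduped, f)
```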
Shoprite has a store scraper, but multiple stores can reference the same catalog, so we must derive the unique list of catalogs; this is why there are both stores.json and catalogs.json outputs. Shoprite uses proxy middleware with all traffic routed through random proxies.
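The catalog de-duplication could be as simple as the following sketch, assuming each store record carries a catalog identifier (the catalog_id field name is invented for illustration):

```python
# Hedged sketch: collapse Shoprite stores down to one entry per catalog.
import json

with open("stores.json") as f:
    stores = json.load(f)

catalogs = {}
for store in stores:
    # First store seen for a catalog wins; later duplicates are skipped.
    catalogs.setdefault(store["catalog_id"], store)

with open("catalogs.json", "w") as f:
    json.dump(list(catalogs.values()), f)
```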
"name"
"description"
"timeStamp"
"source"
"itemIdInSource"
"upc"
"brandName"
"categories": [{
"rootCategoryId"
"rootCategoryName"
"parentCategoryName"
"subCategoryName"
}]
"location": {
"storeId"
"chain"
},
"price"
"size": {
"sizeRaw",
"sizeField",
"sizeValue"
}
"mainImage"
There are prettified samples of the store.json formats below within the samples folder.
"name"
"chain"
"url"
"zipcode"
"source"
"state"
"address_2"
"address"
"lat"
"lng"
"id"
"store_id"
"chain"
"address1"
"online_Id"
"zipcode"
"long"
"lat"
"address2"
"name"
"store_id"
"chain"
"address"
"address_2"
"state"
"zipcode"
"store_id"
"chain"
"zipcode"
"state"
"address_2"
"address"
"state"
"store_id"
"chain"