
@carlaiau
Created February 9, 2017 23:05
Instacart Readme

Instacart

The complete scraping, "middleware", and "post-processing" pipeline is run by calling the master.sh bash script. Examining this file, along with the three Python files that are called from within it, will give you an understanding of how this spider works.

I have chosen to write separate external files for zones, stores, warehouses, categories, and products rather than keeping the data in memory, but we could amend the scrapers to use an item system.

Instacart is tricky: it does not have a direct relationship between stores and products. Products are scraped from an API endpoint similar to https://www.instacart.com/api/v2/items?source=web&warehouse_id=%s&zone_id=%s (the two %s placeholders take the warehouse ID and the zone ID). Catalogs exist for specific warehouses (chains) within specific zones. The stores.json file is scraped via XPath directly from the instacart.com/locations URL. The other files are produced by various API endpoints within the application.
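As a rough illustration of how the endpoint template above gets filled in, here is a minimal sketch. The function name items_url and the example IDs are hypothetical; only the base URL and the three query parameters come from the endpoint shown above.

```python
from urllib.parse import urlencode

# Base URL taken from the endpoint quoted above.
ITEMS_ENDPOINT = "https://www.instacart.com/api/v2/items"


def items_url(warehouse_id, zone_id):
    """Build the catalog URL for one (warehouse, zone) pair.

    Hypothetical helper: the real spider may build this URL differently.
    """
    query = urlencode({
        "source": "web",
        "warehouse_id": warehouse_id,
        "zone_id": zone_id,
    })
    return f"{ITEMS_ENDPOINT}?{query}"


# Example IDs are made up for illustration.
print(items_url(5, 12))
```

Because a catalog is keyed by warehouse and zone together, a scraper would presumably call something like this once per (warehouse_id, zone_id) pair found in warehouses.json.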

A proposed solution for identifying which stores are "linked" to which catalogs is to designate them by lat/lng coordinates: for every store in the stores.json file, assign the zone_id of the closest zone by lat/lng. The warehouse_id will be based on the chain name, but we may need to create a nickname relationship if there are discrepancies between the backend and frontend of Instacart.

The above paragraph is something I'm currently working on, as I attempt to get all data into flat tables.
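The nearest-zone idea could be sketched roughly as follows. This is not the finished implementation: it assumes store records carry "lat" and "lng" and zone records carry a "zone_center" with "zone_lat" and "zone_lng" (as in the structures listed further down), it uses great-circle (haversine) distance as one reasonable choice of "closest", and the function names and sample coordinates are hypothetical.

```python
import math


def haversine_km(lat1, lng1, lat2, lng2):
    """Great-circle distance in kilometres between two lat/lng points."""
    earth_radius_km = 6371.0
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    d_phi = phi2 - phi1
    d_lambda = math.radians(lng2 - lng1)
    a = (math.sin(d_phi / 2) ** 2
         + math.cos(phi1) * math.cos(phi2) * math.sin(d_lambda / 2) ** 2)
    return 2 * earth_radius_km * math.asin(math.sqrt(a))


def nearest_zone_id(store, zones):
    """Return the zone_id of the zone whose center is closest to the store."""
    def distance(zone):
        center = zone["zone_center"]
        return haversine_km(
            float(store["lat"]), float(store["lng"]),
            float(center["zone_lat"]), float(center["zone_lng"]),
        )
    return min(zones, key=distance)["zone_id"]


# Illustrative data only; these are not real Instacart zones or stores.
zones = [
    {"zone_id": 1, "zone_center": {"zone_lat": 37.77, "zone_lng": -122.42}},
    {"zone_id": 2, "zone_center": {"zone_lat": 40.71, "zone_lng": -74.01}},
]
store = {"chain_name": "Example Market", "lat": 37.80, "lng": -122.27}

print(nearest_zone_id(store, zones))
```

A linear scan over all zones per store is simple and likely fast enough for a few thousand stores; a spatial index would only matter at much larger scale.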

JSON Structures

Product.json (JSON Lines)

      "name",
      "timeStamp",
      "source",
      "itemIdInSource",
      "description",
      "brandName",
      "categories": [{
        "parentCategoryId",
        "parentCategoryName",
        "subCategoryId",
        "subCategoryName"
      }],
      "location": {
        "warehouseId",
        "instacartWarehouse",
        "zoneId",
        "zoneState",
        "zoneName"
      },
      "price",
      "size": {
        "sizeRaw",
        "sizeField",
        "sizeValue",
        "servingSize",
        "servingsPerContainer",
        "quantity"
      },
      "imageUrls": [
        "image_url"
      ],
      "mainImage",
      "relationships": [
        {"parent","child","relationshipType"}
      ],
      "nutritional": {
        "fat",
        "saturatedFat",
        "transFat",
        "polyunsaturatedFat",
        "monounsaturatedFat",
        "calories",
        "fatCalories",
        "cholesterol",
        "sodium",
        "potassium",
        "carbohydrate",
        "fiber",
        "sugars",
        "protein",
        "organic",
        "kosher",
        "glutenFree",
        "lowFat",
        "fatFree",
        "sugarFree",
        "vegan",
        "vegetarian"
      }
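Since Product.json is in JSON Lines format (one JSON object per line), it can be consumed one record at a time without loading the whole file. The sketch below is only an illustration: the helper names are hypothetical, and the sample records contain just the "name" and "location" fields from the structure above with made-up values.

```python
import json
from collections import Counter


def iter_products(lines):
    """Yield one product dict per non-empty line of JSON Lines input."""
    for line in lines:
        line = line.strip()
        if line:
            yield json.loads(line)


def count_by_catalog(products):
    """Count products per (warehouseId, zoneId) pair."""
    counts = Counter()
    for product in products:
        location = product.get("location", {})
        counts[(location.get("warehouseId"), location.get("zoneId"))] += 1
    return counts


# Made-up sample lines; a real run would pass an open file handle instead.
sample_lines = [
    '{"name": "Example A", "location": {"warehouseId": 1, "zoneId": 10}}',
    '',
    '{"name": "Example B", "location": {"warehouseId": 1, "zoneId": 10}}',
    '{"name": "Example C", "location": {"warehouseId": 2, "zoneId": 10}}',
]
print(count_by_catalog(iter_products(sample_lines)))
```

Any iterable of strings works as input, so an open file handle on Product.json can be passed to iter_products directly.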

There are prettified samples of the files below in the samples folder.

stores.json

"city"
"chain_name"
"zipcode"
"state"
"lat"
"lng"
"street_address"

warehouses.json

"zone_lat",
"zone_id",
"warehouses": [
  {
    "warehouse_slug",
    "warehouse_transparency",
    "warehouse_name",
    "warehouse_id"
  }
],
"zone_state",
"zone_name",
"zone_lng",
"zone_slug"
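Toward the flat-table goal mentioned earlier, the nested warehouses list could be unrolled into one row per (zone, warehouse) pair. This is a sketch under assumptions: it treats warehouses.json as a list of zone records shaped like the structure above, and the function name and sample values are hypothetical.

```python
def flatten_warehouses(zone_records):
    """Turn nested zone records into flat rows, one per warehouse."""
    rows = []
    for zone in zone_records:
        for warehouse in zone.get("warehouses", []):
            rows.append({
                "zone_id": zone["zone_id"],
                "zone_name": zone["zone_name"],
                "zone_state": zone["zone_state"],
                "warehouse_id": warehouse["warehouse_id"],
                "warehouse_name": warehouse["warehouse_name"],
                "warehouse_slug": warehouse["warehouse_slug"],
            })
    return rows


# Made-up sample record for illustration only.
sample = [{
    "zone_id": 10,
    "zone_name": "Example Zone",
    "zone_state": "CA",
    "warehouses": [
        {"warehouse_id": 1, "warehouse_name": "Example Market",
         "warehouse_slug": "example-market"},
        {"warehouse_id": 2, "warehouse_name": "Sample Foods",
         "warehouse_slug": "sample-foods"},
    ],
}]

for row in flatten_warehouses(sample):
    print(row["zone_id"], row["warehouse_id"], row["warehouse_name"])
```

Each resulting row corresponds to one catalog, which is also the set of (warehouse_id, zone_id) pairs the items endpoint would need to be queried with.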

zones.json

"zone_name",
"zone_slug",
"zone_id",
"zone_state",
"zone_center": {
  "zone_lat",
  "zone_lng"
}