The Instacart Online Grocery Shopping Dataset 2017 Data Descriptions

orders (3.4m rows, 206k users):

  • order_id: order identifier
  • user_id: customer identifier
  • eval_set: which evaluation set this order belongs in (see SET described below)
  • order_number: the order sequence number for this user (1 = first, n = nth)
  • order_dow: the day of the week the order was placed on
  • order_hour_of_day: the hour of the day the order was placed on
  • days_since_prior: days since the last order, capped at 30 (with NAs for order_number = 1)
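
A minimal sketch of reading these fields with Python's standard library. It assumes the table is stored as CSV with a header row matching the field names above, and that missing days_since_prior values appear as empty strings or "NA"; the sample rows and the helper names parse_days_since_prior and load_orders are made up for illustration.

```python
import csv
import io

# Hypothetical sample rows; the real file layout is an assumption.
SAMPLE_ORDERS = """order_id,user_id,eval_set,order_number,order_dow,order_hour_of_day,days_since_prior
1,10,prior,1,2,8,
2,10,prior,2,3,7,15.0
3,10,train,3,3,12,30.0
4,11,prior,1,0,9,NA
5,11,test,2,6,17,30.0
"""


def parse_days_since_prior(raw):
    """Return the gap in days as a float, or None for a missing value."""
    if raw is None or raw.strip() in ("", "NA"):
        return None
    return float(raw)


def load_orders(text):
    """Parse CSV text into a list of dicts with typed fields."""
    orders = []
    for row in csv.DictReader(io.StringIO(text)):
        orders.append({
            "order_id": int(row["order_id"]),
            "user_id": int(row["user_id"]),
            "eval_set": row["eval_set"],
            "order_number": int(row["order_number"]),
            "order_dow": int(row["order_dow"]),
            "order_hour_of_day": int(row["order_hour_of_day"]),
            "days_since_prior": parse_days_since_prior(row["days_since_prior"]),
        })
    return orders


orders = load_orders(SAMPLE_ORDERS)

# A user's first order has no previous order, so its gap should be missing.
first_order_gaps = [o["days_since_prior"] for o in orders if o["order_number"] == 1]

# Count the orders that sit exactly at the 30-day cap.
at_cap = sum(1 for o in orders if o["days_since_prior"] == 30.0)
```

Counting the rows that sit exactly at the cap is one quick way to see how much of the distribution the 30-day limit affects.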

products (50k rows):

  • product_id: product identifier
  • product_name: name of the product
  • aisle_id: foreign key
  • department_id: foreign key

aisles (134 rows):

  • aisle_id: aisle identifier
  • aisle: the name of the aisle

departments (21 rows):

  • department_id: department identifier
  • department: the name of the department
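
The foreign keys in products can be resolved against aisles and departments with plain dictionary lookups. This is only a sketch: the rows below are invented, and describe_product is a hypothetical helper, not part of the dataset.

```python
# Hypothetical rows; the ids and names below are illustrative only.
aisles = {1: "fresh fruits", 2: "yogurt"}              # aisle_id -> aisle
departments = {1: "produce", 2: "dairy eggs"}          # department_id -> department
products = [
    {"product_id": 100, "product_name": "Banana", "aisle_id": 1, "department_id": 1},
    {"product_id": 101, "product_name": "Plain Yogurt", "aisle_id": 2, "department_id": 2},
]


def describe_product(product, aisles, departments):
    """Return a copy of the product with its foreign keys replaced by names."""
    return {
        "product_id": product["product_id"],
        "product_name": product["product_name"],
        "aisle": aisles[product["aisle_id"]],
        "department": departments[product["department_id"]],
    }


# Index the resolved products by product_id for later lookups.
catalog = {p["product_id"]: describe_product(p, aisles, departments) for p in products}
```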

order_products__SET (30m+ rows):

  • order_id: foreign key
  • product_id: foreign key
  • add_to_cart_order: order in which each product was added to cart
  • reordered: 1 if this product has been ordered by this user in the past, 0 otherwise

where SET is one of the four following evaluation sets (eval_set in orders):

  • "prior": orders prior to that user's most recent order (~3.2m orders)
  • "train": training data supplied to participants (~131k orders)
  • "test": test data reserved for machine learning competitions (~75k orders)
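
As a rough illustration of how the sets relate, the sketch below groups each user's orders into a "prior" history and a single final order. It assumes every user has exactly one "train" or "test" order and that it is their most recent one, which is what the "prior" definition above suggests but does not state outright. The rows and the function names split_by_user and final_order_is_latest are hypothetical.

```python
from collections import defaultdict

# Hypothetical (order_id, user_id, eval_set, order_number) rows.
orders = [
    (1, 10, "prior", 1),
    (2, 10, "prior", 2),
    (3, 10, "train", 3),
    (4, 11, "prior", 1),
    (5, 11, "test", 2),
]


def split_by_user(orders):
    """Group each user's orders into prior history and the final (train/test) order."""
    history = defaultdict(list)
    final = {}
    for order_id, user_id, eval_set, order_number in orders:
        if eval_set == "prior":
            history[user_id].append((order_number, order_id))
        else:
            final[user_id] = (order_number, order_id, eval_set)
    return history, final


def final_order_is_latest(history, final):
    """Check that each train/test order comes after all of that user's prior orders."""
    return all(
        order_number > max((n for n, _ in history.get(user_id, [])), default=0)
        for user_id, (order_number, _, _) in final.items()
    )


history, final = split_by_user(orders)
```

Running final_order_is_latest on the real tables would be one way to test the assumption rather than take it on faith.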

MichaelChirico commented May 23, 2017

Three questions:

  1. order_dow = 0 corresponds to Sunday?
  2. When you say days_since_prior is capped at 30, do you mean it's censored or truncated? Censored means all values >= 30 are coerced to 30; truncated means all values above 30 were removed. It appears to be the former.
  3. Is there truncation going on with respect to the number of orders included for some users? There's a big mass point of users with exactly 99 orders.

croach commented May 31, 2017

I believe the line describing the types of SET should be "where SET is one of the three following evaluation sets (eval_set in orders):" instead of "four".

Any updates on the question: order_dow = 0 corresponds to Sunday?
To me, 0 seems to represent Monday.

Can somebody explain what prior, train, and test flags mean exactly? Test means a test data set, but prior and train data sets are kind of confusing.

magaton commented Oct 18, 2017

Hello, there is no quantity field in order_products__SET.
Maybe a stupid question, but how can you ask for a product purchase forecast if you don't take the quantity in previous orders into account?

Thanks
