In 2017, Instacart open sourced a dataset comprised of over 3 million orders from more than 200,000 users on about 50,000 products. It can be downloaded from https://www.instacart.com/datasets/grocery-shopping-2017.
After uncompressing it, one can see it contains 6 CSV files
orders.csv
: description of the 3.4 million ordersproducts.csv
: product description of around 50,000 productsorder_products__prior.csv
andorder_products__train.csv
: respectively products placed in previous orders and most recent orders of customersdepartments.csv
: department the products belong toaisles.csv
: aisles where the product can be found
We are only interested in this example by the 4 files orders.csv
, orders_products__prior.csv
, order_products__train.csv
and products.csv
.
Let’s create a user orders and a database orders: createuser orders -s createdb orders -O orders
Let's create the table orders
and order_products
in Postgres:
$ psql -U orders orders
drop table if exists orders;
create table orders
(
order_id integer not null primary key,
user_id integer,
eval_set text,
order_number integer,
order_dow integer,
order_hour_of_day integer,
days_since_prior_order float
);
drop table if exists order_products;
create table order_products
(
order_id integer not null,
product_id integer not null,
add_to_cart_order integer,
reordered boolean,
constraint order_products_pkey
primary key (order_id, product_id)
);
create table products
(
product_id integer not null primary key,
product_name text,
aisle_id integer,
department_id integer
);
Then let's import the data into Postgres:
\copy orders FROM 'orders.csv' WITH CSV HEADER;
\copy order_products FROM 'order_products__prior.csv' WITH CSV HEADER;
\copy order_products FROM 'order_products__train.csv' WITH CSV HEADER;
\copy products FROM 'products.csv' WITH CSV HEADER;