Skip to content

Instantly share code, notes, and snippets.

@darthbear
Last active June 1, 2020 13:18
Show Gist options
  • Save darthbear/332318706ba7ac56e2a19bdf5f4daf1e to your computer and use it in GitHub Desktop.
Save darthbear/332318706ba7ac56e2a19bdf5f4daf1e to your computer and use it in GitHub Desktop.
Data Preparation for the simulmedia post on roaring bitmap

Data preparation

In 2017, Instacart open sourced a dataset comprised of over 3 million orders from more than 200,000 users on about 50,000 products. It can be downloaded from https://www.instacart.com/datasets/grocery-shopping-2017.

After uncompressing it, one can see it contains 6 CSV files

  • orders.csv: description of the 3.4 million orders
  • products.csv: product description of around 50,000 products
  • order_products__prior.csv and order_products__train.csv: respectively products placed in previous orders and most recent orders of customers
  • departments.csv: department the products belong to
  • aisles.csv: aisles where the product can be found

Instacart schema

We are only interested in this example by the 4 files orders.csv, orders_products__prior.csv, order_products__train.csv and products.csv.

Let’s create a user orders and a database orders: createuser orders -s createdb orders -O orders

Let's create the table orders and order_products in Postgres:

$ psql -U orders orders

drop table if exists orders;
create table orders
(
    order_id integer not null primary key,
    user_id integer,
    eval_set text,
    order_number integer,
    order_dow integer,
    order_hour_of_day integer,
    days_since_prior_order float
);

drop table if exists order_products;
create table order_products
(
    order_id integer not null,
    product_id integer not null,
    add_to_cart_order integer,
    reordered boolean,
    constraint order_products_pkey
        primary key (order_id, product_id)
);

create table products
(
    product_id integer not null primary key,
    product_name text,
    aisle_id integer,
    department_id integer
);

Then let's import the data into Postgres:

\copy orders FROM 'orders.csv' WITH CSV HEADER;
\copy order_products FROM 'order_products__prior.csv' WITH CSV HEADER;
\copy order_products FROM 'order_products__train.csv' WITH CSV HEADER;
\copy products FROM 'products.csv' WITH CSV HEADER;
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment