@hunan-rostomyan
Last active April 9, 2019 21:06
Introduction to Quilt

Quilt is being branded as "Docker for data".

Install

Follow the official installation instructions, or simply run:

pip install quilt

Login

Once you create an account, you'll have a profile page. It's intended to resemble your DockerHub profile page.

Part 1 --- Create a package

Prepare data

Let's begin with an empty directory under ~:

  • cd; mkdir quilt-host; cd quilt-host

Let's add some data files:

  • echo 'Hello, World' > hello.txt
  • echo '{"text": "Sentence 1"}' >> hello.jsonl
  • echo '{"text": "Sentence 2"}' >> hello.jsonl

Let's make the directory structure more interesting:

  • mkdir tables
  • curl https://bit.ly/2KVKEfk -Lo tables/esoph.csv

Your directory tree should look like this:

.
├── hello.jsonl
├── hello.txt
└── tables
    └── esoph.csv

Create the build file

Let's generate the build file (similar to a Dockerfile) to describe our data contents:

  • quilt generate .

This will create a build.yml file with the following contents:

contents:
  hello_jsonl:
    file: hello.jsonl
  hello_txt:
    file: hello.txt
  tables:
    esoph:
      file: tables/esoph.csv

It's just a mapping from meaningful (inferred) names to the actual data on disk. Quilt packages this mapping along with the files themselves and deploys them to its "hub".
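To make the name inference concrete, here is a minimal Python sketch that produces the same mapping for the directory above. It is not Quilt's actual implementation: `node_name` and `infer_contents` are hypothetical helpers, and the rule that only `.csv` files lose their extension is an assumption read off the build.yml shown here.

```python
import os
import re

# Assumption (inferred from the build.yml above): files that become tables
# lose their extension in the node name (esoph.csv -> esoph), while every
# other file keeps it (hello.txt -> hello_txt).
TABLE_EXTENSIONS = {".csv"}


def node_name(filename):
    """Turn a file or directory name into an identifier-like node name."""
    stem, ext = os.path.splitext(filename)
    raw = stem if ext.lower() in TABLE_EXTENSIONS else filename
    # Replace every character that is not a letter, digit or underscore.
    return re.sub(r"\W", "_", raw)


def infer_contents(root):
    """Recursively map inferred node names to file paths relative to root."""
    def walk(directory):
        contents = {}
        for entry in sorted(os.listdir(directory)):
            path = os.path.join(directory, entry)
            if os.path.isdir(path):
                contents[node_name(entry)] = walk(path)
            else:
                rel = os.path.relpath(path, root).replace(os.sep, "/")
                contents[node_name(entry)] = {"file": rel}
        return contents

    return {"contents": walk(root)}


print(node_name("hello.jsonl"))  # hello_jsonl
print(node_name("esoph.csv"))    # esoph
```

Calling `infer_contents(".")` in quilt-host before build.yml exists should yield a dictionary with the same shape as the YAML above.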

Build

To actually build the package, we issue:

  • quilt build hunan131/demo build.yml

Feel free to name your package whatever you want. Just make sure you change "hunan131" to your username and "demo" to your package name.

This is the analogue of docker build.

List

To make sure the package is built, get the list of all packages available with:

  • quilt ls
hunan131/demo                  latest               c33f6610444ea1e925159d271d00099c29bad45418e8c9758459f71fdc50cf90

The second column is the tag and the third is the hash. The latest build is tagged with "latest" (duh!), but you can create an explicit tag after pushing to the hub.
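If you ever want to script against this listing, a line can be split into its three columns. This is just a sketch: `PackageEntry` and `parse_ls_line` are hypothetical names, and it assumes every line has exactly the three whitespace-separated columns shown above.

```python
from typing import NamedTuple


class PackageEntry(NamedTuple):
    name: str      # "user/package"
    tag: str       # e.g. "latest"
    pkg_hash: str  # full hash of the build


def parse_ls_line(line):
    """Split one line of the listing into name, tag and hash."""
    name, tag, pkg_hash = line.split()
    return PackageEntry(name, tag, pkg_hash)


line = ("hunan131/demo                  latest               "
        "c33f6610444ea1e925159d271d00099c29bad45418e8c9758459f71fdc50cf90")
entry = parse_ls_line(line)
print(entry.name, entry.tag, entry.pkg_hash[:6])  # hunan131/demo latest c33f66
```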

Push

We push the build with:

  • quilt push hunan131/demo --public

If it succeeds, the last line of stdout will contain a link to the package page. That page is, again, intended to look like DockerHub.

Tag

It can be a bit of a pain to keep copying and pasting hash codes around, so at critical points, you might want to give your "commits" meaningful tags. Since this is our first stable data package, we'll give it the tag 0.0.1.

Since we haven't tagged anything yet, we won't see much when we type:

  • quilt tag list hunan131/demo

But we'll get the hash of the "latest" dataset version:

latest: c33f6610444ea1e925159d271d00099c29bad45418e8c9758459f71fdc50cf90

We can tag it like this:

  • quilt tag add hunan131/demo 0.0.1 c33f66

Btw, this worked because only one hash has the prefix c33f66.
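The prefix rule is easy to picture in code. The sketch below is not how Quilt does it internally; `resolve_hash` is a hypothetical helper that accepts a prefix only when exactly one known hash starts with it. The second hash in the list is the one this walkthrough produces after the rebuild further down.

```python
def resolve_hash(prefix, known_hashes):
    """Return the single full hash starting with prefix, or raise ValueError."""
    matches = [h for h in known_hashes if h.startswith(prefix)]
    if not matches:
        raise ValueError(f"no hash starts with {prefix!r}")
    if len(matches) > 1:
        raise ValueError(f"prefix {prefix!r} is ambiguous ({len(matches)} matches)")
    return matches[0]


hashes = [
    "c33f6610444ea1e925159d271d00099c29bad45418e8c9758459f71fdc50cf90",
    "e3eace0228863fd5da607065426f0862a0d5cd59d5f6e03df90a446ea37b992b",
]

# A unique prefix resolves to the full hash.
print(resolve_hash("c33f66", hashes))
```

A prefix shared by several hashes (or by none) raises an error here instead of picking one silently.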

To create some variety, let's modify our data a bit:

  • echo '{"text": "Not available in version 0.0.1"}' >> hello.jsonl

Let's now re-build, push and bump the version:

  • quilt build hunan131/demo build.yml
  • quilt push hunan131/demo
  • quilt tag list hunan131/demo

Copy a prefix of the "latest" hash (in my case it's e3eace) and paste it as the hash in this command:

  • quilt tag add hunan131/demo 0.0.1b e3eace

Now there should be three tags for the package:

$ quilt tag list hunan131/demo

0.0.1: c33f6610444ea1e925159d271d00099c29bad45418e8c9758459f71fdc50cf90
0.0.1b: e3eace0228863fd5da607065426f0862a0d5cd59d5f6e03df90a446ea37b992b
latest: e3eace0228863fd5da607065426f0862a0d5cd59d5f6e03df90a446ea37b992b
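Note that tags are just names pointing at hashes, so several tags can point at the same build. A small sketch that parses the listing above and groups tags by hash makes this visible. `parse_tag_list` and `tags_by_hash` are hypothetical helpers, and the `tag: hash` line format is assumed from the output shown here.

```python
from collections import defaultdict

TAG_LIST_OUTPUT = """\
0.0.1: c33f6610444ea1e925159d271d00099c29bad45418e8c9758459f71fdc50cf90
0.0.1b: e3eace0228863fd5da607065426f0862a0d5cd59d5f6e03df90a446ea37b992b
latest: e3eace0228863fd5da607065426f0862a0d5cd59d5f6e03df90a446ea37b992b
"""


def parse_tag_list(text):
    """Parse 'tag: hash' lines into a {tag: hash} dictionary."""
    tags = {}
    for line in text.splitlines():
        if not line.strip():
            continue
        tag, _, pkg_hash = line.partition(":")
        tags[tag.strip()] = pkg_hash.strip()
    return tags


def tags_by_hash(tags):
    """Invert {tag: hash} into {hash: [tags...]}."""
    grouped = defaultdict(list)
    for tag, pkg_hash in tags.items():
        grouped[pkg_hash].append(tag)
    return dict(grouped)


tags = parse_tag_list(TAG_LIST_OUTPUT)
for pkg_hash, names in tags_by_hash(tags).items():
    print(pkg_hash[:6], names)
```

Here "0.0.1b" and "latest" end up in the same group, while "0.0.1" still points at the first build.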

Part 2 --- Use the package

Here's what a user of the package does. Let's start from a fresh directory:

  • cd; mkdir quilt-client; cd quilt-client

Prepare the environment:

  • python3.6 -m venv venv
  • . venv/bin/activate
  • pip install quilt
  • pip freeze | grep quilt >> requirements.txt

The directory should resemble this:

.
├── requirements.txt
└── venv

Declare data dependencies

We create a quilt.yml file where we declare which data packages we need. It should look like this:

packages:
  - hunan131/demo

Install dependencies

We've already declared the dependencies in the yaml file, so we just install them with:

  • quilt install -f

Use the latest contents

Enter the python shell:

  • python

Paste this helper for reading jsonl files:

import json

def read_jsonl(path):
    objects = []
    with open(path) as fp:
        for line in fp:
            objects.append(json.loads(line))
    return objects

Now let's use the dataset:

from quilt.data.hunan131 import demo

read_jsonl(demo.hello_jsonl())

The output should be:

[
    {'text': 'Sentence 1'},
    {'text': 'Sentence 2'},
    {'text': 'Not available in version 0.0.1'}
]

Use the contents of another version

Change quilt.yml to specify a particular tag, say 0.0.1:

packages:
  - hunan131/demo:t:0.0.1

Then reinstall with:

  • quilt install -f
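The entry in quilt.yml packs three things into one string: the owner, the package name and the tag. Here is a rough sketch of how such a specifier could be taken apart. `PackageSpec` and `parse_spec` are hypothetical, this is not Quilt's own parser, and only the `:t:` form appears in this walkthrough; the other qualifier spellings in `KINDS` are assumptions, and subpaths are not handled.

```python
from typing import NamedTuple, Optional

# Only "t" is confirmed by the example above; the rest are assumed spellings.
KINDS = {
    "t": "tag", "tag": "tag",
    "h": "hash", "hash": "hash",
    "v": "version", "version": "version",
}


class PackageSpec(NamedTuple):
    owner: str
    name: str
    kind: Optional[str]   # "tag", "hash", "version" or None
    value: Optional[str]  # e.g. "0.0.1", or None when unqualified


def parse_spec(spec):
    """Split 'owner/name[:kind:value]' into its parts."""
    path, _, qualifier = spec.partition(":")
    owner, _, name = path.partition("/")
    if not owner or not name:
        raise ValueError(f"expected 'owner/name', got {path!r}")
    if not qualifier:
        return PackageSpec(owner, name, None, None)
    kind, _, value = qualifier.partition(":")
    if kind not in KINDS or not value:
        raise ValueError(f"bad qualifier {qualifier!r}")
    return PackageSpec(owner, name, KINDS[kind], value)


print(parse_spec("hunan131/demo:t:0.0.1"))
print(parse_spec("hunan131/demo"))
```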

Diversion: Pandas!

What Quilt does with csv files is interesting:

from quilt.data.hunan131 import demo

demo.tables.esoph()

It returns a Pandas DataFrame, which we can have fun with!

Okay, for now we just inspect:

demo.tables.esoph().head()

which should look something like this:

   Unnamed: 0  agegp      alcgp     tobgp  ncases  ncontrols
0           1  25-34  0-39g/day  0-9g/day       0         40
1           2  25-34  0-39g/day     10-19       0         10
2           3  25-34  0-39g/day     20-29       0          6
3           4  25-34  0-39g/day       30+       0          5
4           5  25-34      40-79  0-9g/day       0         27

Feel free to spin up a Jupyter Notebook session and use the dataframe to build some models to predict cases of esophageal carcinoma.

Use the contents of different versions simultaneously

First we import Quilt:

import quilt

Next we list the available versions:

quilt.log('hunan131/demo')

which shows:

27d6dc...  2018-07-06 14:59:13  hunan131  ['0.0.1b', 'latest']  None
e3eace...  2018-07-06 14:44:07  hunan131  None                  None
c33f66...  2018-07-06 14:34:23  hunan131  ['0.0.1']             None

Now we can grab both the new and the old versions:

OLD_HASH = 'c33f6610444ea1e925159d271d00099c29bad45418e8c9758459f71fdc50cf90'
NEW_HASH = '27d6dcd8a2baca2584eb59dbba90f6af5d06adebb00af88cd51264ea342c77b2'

new_data = quilt.load('hunan131/demo', hash=NEW_HASH)
old_data = quilt.load('hunan131/demo', hash=OLD_HASH)

We read the old contents with:

read_jsonl(old_data.hello_jsonl())

which should look like this:

[
    {'text': 'Sentence 1'},
    {'text': 'Sentence 2'}
]

and the new contents with:

read_jsonl(new_data.hello_jsonl())

which should look like this:

[
    {'text': 'Sentence 1'},
    {'text': 'Sentence 2'},
    {'text': 'Not available in version 0.0.1'}
]
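With both versions loaded side by side, it's natural to ask what changed between them. A minimal sketch, assuming the records are plain JSON-style dictionaries: `added_records` is a hypothetical helper, and the two lists below are stand-ins for what `read_jsonl` returns for the old and new versions.

```python
import json


def added_records(old, new):
    """Return the records present in new but not in old, keeping their order."""
    # Dictionaries are not hashable, so compare canonical JSON strings instead.
    seen = {json.dumps(record, sort_keys=True) for record in old}
    return [r for r in new if json.dumps(r, sort_keys=True) not in seen]


# Stand-ins for read_jsonl(old_data.hello_jsonl()) and
# read_jsonl(new_data.hello_jsonl()).
old_records = [{"text": "Sentence 1"}, {"text": "Sentence 2"}]
new_records = old_records + [{"text": "Not available in version 0.0.1"}]

print(added_records(old_records, new_records))
# [{'text': 'Not available in version 0.0.1'}]
```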