Quilt is being branded as "Docker for data".
Follow the instructions here, or simply run:
pip install quilt
Once you create an account, you'll have a page like this:
It's intended to resemble your DockerHub profile page.
Let's begin with an empty directory under ~:
cd; mkdir quilt-host; cd quilt-host
Let's add some data files:
echo 'Hello, World' > hello.txt
echo '{"text": "Sentence 1"}' >> hello.jsonl
echo '{"text": "Sentence 2"}' >> hello.jsonl
Let's make the directory structure more interesting:
mkdir tables
curl https://bit.ly/2KVKEfk -Lo tables/esoph.csv
Your directory tree should look like this:
.
├── hello.jsonl
├── hello.txt
└── tables
    └── esoph.csv
Let's generate the build file (similar to Dockerfile) to describe our data contents:
quilt generate .
This will create a build.yml file with the following contents:
contents:
  hello_jsonl:
    file: hello.jsonl
  hello_txt:
    file: hello.txt
  tables:
    esoph:
      file: tables/esoph.csv
It's just a mapping from meaningful (inferred) names to actual data on disk. Quilt will package this mapping along with the actual files and deploy on its "hub".
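To make the "inferred names" idea concrete, here is a minimal sketch that walks a directory and produces a nested mapping shaped like the build.yml above. It is not Quilt's actual implementation: the naming rule (CSV files drop their extension, other files turn the dot into an underscore) is only an assumption read off the generated file, and infer_name and sketch_contents are hypothetical helpers.

```python
import os
import re
import tempfile


def infer_name(filename):
    """Turn a file or directory name into an identifier-like node name.

    Assumption taken from the build.yml shown above: CSV files lose their
    extension, everything else keeps it with the dot replaced by "_".
    """
    stem, ext = os.path.splitext(filename)
    base = stem if ext.lower() == ".csv" else filename
    return re.sub(r"\W", "_", base)


def sketch_contents(root, prefix=""):
    """Build a nested dict that mirrors the `contents` section of build.yml."""
    contents = {}
    for entry in sorted(os.listdir(os.path.join(root, prefix))):
        rel = os.path.join(prefix, entry) if prefix else entry
        if os.path.isdir(os.path.join(root, rel)):
            # Directories become nested groups.
            contents[infer_name(entry)] = sketch_contents(root, rel)
        else:
            # Files become leaves that point at their relative path.
            contents[infer_name(entry)] = {"file": rel.replace(os.sep, "/")}
    return contents


# Recreate the demo layout in a throwaway directory and inspect the mapping.
with tempfile.TemporaryDirectory() as root:
    os.mkdir(os.path.join(root, "tables"))
    for rel in ("hello.jsonl", "hello.txt", "tables/esoph.csv"):
        open(os.path.join(root, *rel.split("/")), "w").close()
    mapping = sketch_contents(root)

print(mapping)
```

The real generator may apply further rules (for example for names that start with a digit), so treat this only as an illustration of the mapping's shape.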
To actually build the package, we issue:
quilt build hunan131/demo build.yml
Feel free to name your package whatever you want. Just make sure you change "hunan131" to your username and "demo" to your package name.
This is the analogue of docker build.
To make sure the package is built, get the list of all packages available with:
quilt ls
hunan131/demo latest c33f6610444ea1e925159d271d00099c29bad45418e8c9758459f71fdc50cf90
The second column is the tag and the third is the hash. The latest build is tagged with "latest" (duh!), but you can create an explicit tag after pushing to the hub.
We push the built package with:
quilt push hunan131/demo --public
If it succeeds, you'll see a link to the package page at the bottom line of the stdout. Visiting that page will take you to a place like this:
Again, it's intended to look like DockerHub.
It can be a bit of a pain to keep copying and pasting hash codes around, so at critical points you might want to give your "commits" meaningful tags. Since this is our first stable data package, we'll give it the tag 0.0.1.
Since we haven't tagged anything yet, we won't see much when we type:
quilt tag list hunan131/demo
But we'll get the hash of the "latest" dataset version:
latest: c33f6610444ea1e925159d271d00099c29bad45418e8c9758459f71fdc50cf90
We can tag it like this:
quilt tag add hunan131/demo 0.0.1 c33f66
Btw, this worked because only one hash has the prefix c33f66.
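The prefix rule can be sketched in a few lines of Python. This is only an illustration of the idea, not how Quilt resolves hashes internally; resolve_prefix is a hypothetical helper, and the second hash in the list is made up purely to show the ambiguous case.

```python
def resolve_prefix(prefix, known_hashes):
    """Return the single full hash that starts with `prefix`.

    Raises LookupError when no hash matches or when several do.
    """
    matches = [h for h in known_hashes if h.startswith(prefix)]
    if not matches:
        raise LookupError(f"no hash starts with {prefix!r}")
    if len(matches) > 1:
        raise LookupError(f"{prefix!r} is ambiguous: {len(matches)} hashes match")
    return matches[0]


hashes = [
    # The real hash of our first build.
    "c33f6610444ea1e925159d271d00099c29bad45418e8c9758459f71fdc50cf90",
    # Hypothetical hash, invented only to demonstrate an ambiguous prefix.
    "c3" + "0" * 62,
]

full = resolve_prefix("c33f66", hashes)
print(full)

try:
    resolve_prefix("c3", hashes)  # both hashes start with "c3"
except LookupError as err:
    print(err)
```

A six-character prefix is usually enough in practice, but the longer your history, the more characters you may need.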
To create some variety, let's modify our data a bit:
echo '{"text": "Not available in version 0.0.1"}' >> hello.jsonl
Let's now re-build, push and bump the version:
quilt build hunan131/demo build.yml
quilt push hunan131/demo
quilt tag list hunan131/demo
Copy the first few characters of the hash of "latest" (in my case it's e3eace) and paste them as the hash in this command:
quilt tag add hunan131/demo 0.0.1b e3eace
Now there should be three tags for the package:
$ quilt tag list hunan131/demo
0.0.1: c33f6610444ea1e925159d271d00099c29bad45418e8c9758459f71fdc50cf90
0.0.1b: e3eace0228863fd5da607065426f0862a0d5cd59d5f6e03df90a446ea37b992b
latest: e3eace0228863fd5da607065426f0862a0d5cd59d5f6e03df90a446ea37b992b
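Tags are just names that point at hashes, so two tags can share one build. As a small sketch, the output above can be parsed and grouped by hash to make that visible. The "tag: hash" line format is assumed from the listing above, and parse_tag_list and tags_by_hash are hypothetical helpers, not part of Quilt.

```python
from collections import defaultdict

# The listing printed above, pasted as a string.
TAG_LIST_OUTPUT = """\
0.0.1: c33f6610444ea1e925159d271d00099c29bad45418e8c9758459f71fdc50cf90
0.0.1b: e3eace0228863fd5da607065426f0862a0d5cd59d5f6e03df90a446ea37b992b
latest: e3eace0228863fd5da607065426f0862a0d5cd59d5f6e03df90a446ea37b992b
"""


def parse_tag_list(text):
    """Map each tag to its hash, assuming one 'tag: hash' pair per line."""
    tags = {}
    for line in text.splitlines():
        if not line.strip():
            continue
        tag, _, pkg_hash = line.partition(": ")
        tags[tag.strip()] = pkg_hash.strip()
    return tags


def tags_by_hash(tags):
    """Invert the mapping: which tags point at each hash?"""
    grouped = defaultdict(list)
    for tag, pkg_hash in tags.items():
        grouped[pkg_hash].append(tag)
    return dict(grouped)


tags = parse_tag_list(TAG_LIST_OUTPUT)
for pkg_hash, names in tags_by_hash(tags).items():
    print(pkg_hash[:6], names)
```

Here 0.0.1b and latest land in the same group, which is why "latest" moves on with the next push while 0.0.1b stays put.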
Here's what a user of the package does. Let's start from a fresh directory:
cd; mkdir quilt-client; cd quilt-client
Prepare the environment:
python3.6 -m venv venv
. venv/bin/activate
pip install quilt
pip freeze | grep quilt >> requirements.txt
The directory should resemble this:
.
├── requirements.txt
└── venv
We create a quilt.yml file where we declare which data packages we need. It should look like this:
packages:
- hunan131/demo
We've already declared the dependencies in the YAML file, so we just install them with:
quilt install -f
Enter the Python shell:
python
Paste this helper for reading jsonl files:
import json

def read_jsonl(path):
    objects = []
    with open(path) as fp:
        for line in fp:
            objects.append(json.loads(line))
    return objects
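If you want to sanity-check the helper without touching Quilt, a quick round trip through a temporary file does the job. The write_jsonl function below is a hypothetical counterpart added only for this check; it is not part of the original walkthrough.

```python
import json
import os
import tempfile


def read_jsonl(path):
    """Same helper as above: parse one JSON object per line."""
    objects = []
    with open(path) as fp:
        for line in fp:
            objects.append(json.loads(line))
    return objects


def write_jsonl(path, objects):
    """Hypothetical counterpart: write one JSON object per line."""
    with open(path, "w") as fp:
        for obj in objects:
            fp.write(json.dumps(obj) + "\n")


records = [{"text": "Sentence 1"}, {"text": "Sentence 2"}]

# Write the records to a throwaway file and read them back.
with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "hello.jsonl")
    write_jsonl(path, records)
    round_tripped = read_jsonl(path)

print(round_tripped == records)
```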
Now let's use the dataset:
from quilt.data.hunan131 import demo
read_jsonl(demo.hello_jsonl())
The output should be:
[
{'text': 'Hello'},
{'text': 'Sentence 1'},
{'text': 'Sentence 2'},
{'text': 'Not available in version 0.0.1'}
]
Change quilt.yml to specify a particular tag, say 0.0.1:
packages:
- hunan131/demo:t:0.0.1
Then reinstall with:
quilt install -f
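To see how such an entry breaks down, here is a rough parser for the two forms used in this post: a bare owner/name and owner/name:t:TAG. It deliberately handles nothing else, and PackageSpec and parse_spec are hypothetical names for illustration, not Quilt's own code.

```python
from typing import NamedTuple, Optional


class PackageSpec(NamedTuple):
    owner: str
    name: str
    tag: Optional[str]  # None means "whatever is latest"


def parse_spec(spec):
    """Split 'owner/name' or 'owner/name:t:TAG' into its parts."""
    package, colon, rest = spec.partition(":")
    owner, slash, name = package.partition("/")
    if not slash or not owner or not name:
        raise ValueError(f"expected 'owner/name', got {spec!r}")
    tag = None
    if colon:
        marker, _, value = rest.partition(":")
        if marker != "t" or not value:
            # Only the ':t:TAG' form from this post is handled here.
            raise ValueError(f"unsupported suffix in {spec!r}")
        tag = value
    return PackageSpec(owner, name, tag)


print(parse_spec("hunan131/demo"))
print(parse_spec("hunan131/demo:t:0.0.1"))
```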
What Quilt does with CSV files is interesting:
from quilt.data.hunan131 import demo
demo.tables.esoph().head()
It returns a Pandas DataFrame, which we can have fun with!
Okay, for now we just inspect:
demo.tables.esoph().head()
which should look something like this:
   Unnamed: 0  agegp      alcgp     tobgp  ncases  ncontrols
0           1  25-34  0-39g/day  0-9g/day       0         40
1           2  25-34  0-39g/day     10-19       0         10
2           3  25-34  0-39g/day     20-29       0          6
3           4  25-34  0-39g/day       30+       0          5
4           5  25-34      40-79  0-9g/day       0         27
Feel free to spin up a Jupyter Notebook session and use the dataframe to build some models to predict cases of esophageal carcinoma.
First we import Quilt:
import quilt
Next we list the available versions:
quilt.log('hunan131/demo')
which shows:
27d6dc... 2018-07-06 14:59:13 hunan131 ['0.0.1b', 'latest'] None
e3eace... 2018-07-06 14:44:07 hunan131 None None
c33f66... 2018-07-06 14:34:23 hunan131 ['0.0.1'] None
Now we can grab both the new and the old versions:
OLD_HASH = 'c33f6610444ea1e925159d271d00099c29bad45418e8c9758459f71fdc50cf90'
NEW_HASH = '27d6dcd8a2baca2584eb59dbba90f6af5d06adebb00af88cd51264ea342c77b2'
new_data = quilt.load('hunan131/demo', hash=NEW_HASH)
old_data = quilt.load('hunan131/demo', hash=OLD_HASH)
We read the old contents with:
read_jsonl(old_data.hello_jsonl())
which should look like this:
[
{'text': 'Hello'},
{'text': 'Sentence 1'},
{'text': 'Sentence 2'}
]
and the new contents with:
read_jsonl(new_data.hello_jsonl())
which should look like this:
[
{'text': 'Hello'},
{'text': 'Sentence 1'},
{'text': 'Sentence 2'},
{'text': 'Not available in version 0.0.1'}
]
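Once both versions are loaded, comparing them is ordinary Python. As a minimal sketch, the two outputs above are pasted in as literals here so the snippet runs without Quilt; added_records is a hypothetical helper.

```python
def added_records(old, new):
    """Return the records in `new` that are missing from `old`, in order."""
    return [rec for rec in new if rec not in old]


# The two listings shown above, copied as plain Python data.
old_records = [
    {"text": "Hello"},
    {"text": "Sentence 1"},
    {"text": "Sentence 2"},
]
new_records = old_records + [{"text": "Not available in version 0.0.1"}]

added = added_records(old_records, new_records)
removed = added_records(new_records, old_records)

print("added:", added)
print("removed:", removed)
```

With real packages you would feed it read_jsonl(old_data.hello_jsonl()) and read_jsonl(new_data.hello_jsonl()) instead of the literals.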