@armandmcqueen
Created November 15, 2019 20:36

Code Samples

Acquire a benchmarking dataset

COCO 2017 is an image benchmarking dataset and GLUE is a collection of natural language processing datasets.

$ quilt install quilt-ml-data/glue --to /datasets/glue
Downloading......
The package "quilt-ml-data/glue" was successfully downloaded to /datasets/glue
$ quilt install quilt-ml-data/coco2017 --to ~/code/detectron2/datasets/coco
Downloading......
The package "quilt-ml-data/coco2017" was successfully downloaded to ~/code/detectron2/datasets/coco

NOTE: --to is a working name for that argument, but I'm pretty sure we need an argument that does this. There is an alternative world where the user cannot specify where the data is downloaded to. In that world, however, there must be an API that lets the user ask Quilt where the data is located. I'm not a big fan of that approach, but if there is a technical reason it would be much easier, I could be convinced.
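
To make that alternative concrete, here is a minimal sketch of what such a lookup API could look like. The quilt module and locate function are hypothetical names used for illustration, not an existing API:

import quilt  # hypothetical module name

# Ask Quilt where a previously installed package was materialized.
# locate() is a hypothetical function illustrating the alternative design.
data_dir = quilt.locate("quilt-ml-data/coco2017")
print(data_dir)  # e.g. a Quilt-managed cache path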

Training with a benchmark dataset

Two examples of how training would be run using the benchmark data downloaded above:

Huggingface (NLP)

https://github.com/huggingface/transformers#run_gluepy-fine-tuning-on-glue-tasks-for-sequence-classification

$ export GLUE_DIR=/datasets/glue/
$ export TASK_NAME=MRPC

$ python ./examples/run_glue.py \
    --model_type bert \
    --model_name_or_path bert-base-uncased \
    --task_name $TASK_NAME \
    --do_train \
    --do_eval \
    --do_lower_case \
    --data_dir $GLUE_DIR/$TASK_NAME \
    --max_seq_length 128 \
    --per_gpu_eval_batch_size=8 \
    --per_gpu_train_batch_size=8 \
    --learning_rate 2e-5 \
    --num_train_epochs 3.0 \
    --output_dir /tmp/$TASK_NAME/

Detectron2 (image)

The dataset downloaded above is then used in training like this:

https://github.com/facebookresearch/detectron2/blob/master/GETTING_STARTED.md#train-a-standard-model

Detectron2's code assumes that the data lives in a folder called datasets in the directory you launch training from, which is why the coco2017 package was installed to ~/code/detectron2/datasets/coco above.

$ cd ~/code/detectron2

$ python tools/train_net.py \
    --num-gpus 8 \
    --config-file configs/COCO-InstanceSegmentation/mask_rcnn_R_50_FPN_1x.yaml

Acquire a specific hash of a benchmarking dataset

Undecided between the two options below:

$ quilt install quilt-ml-data/coco2017 --to /datasets/coco --hash=ae35249fd
$ quilt install quilt-ml-data/coco2017@ae35249fd --to /datasets/coco
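
Either option would presumably have a Python equivalent. A sketch, assuming an install function whose keyword arguments mirror the CLI flags (the names are guesses, not a confirmed API):

import quilt  # hypothetical module name

# Install a package pinned to a specific hash; the to= and hash=
# keyword arguments are assumptions mirroring the CLI flags above.
quilt.install("quilt-ml-data/coco2017", to="/datasets/coco", hash="ae35249fd")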

Create a dataset

Here we are going to create a labelled image dataset, similar to the COCO dataset. It is currently common to have a small number of files holding the annotations for a dataset or dataset split. This example shows how that can be done with Quilt:

pkg = Package()
pkg.set("images/", "s3://quilt-ml-data/example-dataset/images/")
pkg.set("annotations.json", "s3://quilt-ml-data/example-dataset/annotations.json")

Alternatively, each object in a Quilt Package has associated metadata. You can use this to organize your labels per object and avoid relying on one very large and sometimes unwieldy annotations file.

pkg = Package()
# get_annotations is a stand-in for loading the annotations JSON from S3,
# assumed to return a dict mapping image_id to that image's annotations.
annotations = get_annotations("s3://quilt-ml-data/example-dataset/annotations.json")
for image_id, image_annotations in annotations.items():
    pkg.set(f"images/{image_id}",
            f"s3://quilt-ml-data/example-dataset/images/{image_id}",
            metadata={"annotations": image_annotations})

This has some benefits, such as being able to trivially update the annotations for a single datapoint. It also makes it much easier to slice up the dataset - for example, experimenting with new train/val/test splits or creating a mini-dataset for quicker experimentation.
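
As a sketch of that slicing, assuming the package exposes its entries and their metadata (walk() and entry.meta are assumptions about the package API, and "split" is a hypothetical annotation field):

train = Package()
for logical_key, entry in pkg.walk():
    # "split" is a hypothetical field inside each image's annotations.
    if entry.meta["annotations"].get("split") == "train":
        train.set(logical_key, entry)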
