COCO 2017 is an image benchmark dataset (object detection and segmentation), and GLUE is a collection of natural language processing benchmark datasets.
$ quilt install quilt-ml-data/glue --to /datasets/glue
Downloading......
The package "quilt-ml-data/glue" was successfully downloaded to /datasets/glue
$ quilt install quilt-ml-data/coco2017 --to ~/code/detectron2/datasets/coco
Downloading......
The package "quilt-ml-data/coco2017" was successfully downloaded to ~/code/detectron2/datasets/coco
NOTE: --to is a working name for that argument, but I'm pretty sure we need an argument that does this. There is an alternative design in which the user cannot specify where the data is downloaded to. In that world, however, there must be an API that lets the user ask quilt where the data is located. I'm not a big fan of this, but if there is a technical reason that makes the alternative much easier, I could be convinced.
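To make that alternative concrete, here is a minimal sketch of what a "where is my data" lookup could look like. Everything here is hypothetical: the function name `installed_path`, the in-memory registry, and its layout are assumptions for illustration, not existing quilt APIs (real quilt would persist the mapping somewhere like a local registry).

```python
from pathlib import Path

# Hypothetical record of installed packages -> local paths.
# Illustrative only; quilt would maintain this itself on install.
_INSTALLED = {
    "quilt-ml-data/glue": Path("/datasets/glue"),
    "quilt-ml-data/coco2017": Path("/datasets/coco"),
}

def installed_path(package_name: str) -> Path:
    """Return the local directory a package was installed to."""
    try:
        return _INSTALLED[package_name]
    except KeyError:
        raise LookupError(f"package {package_name!r} is not installed")
```

With this, training scripts would call `installed_path("quilt-ml-data/glue")` instead of taking a data directory flag.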
Here are two examples of how training would be run using that benchmark data:
$ export GLUE_DIR=/datasets/glue/
$ export TASK_NAME=MRPC
$ python ./examples/run_glue.py \
--model_type bert \
--model_name_or_path bert-base-uncased \
--task_name $TASK_NAME \
--do_train \
--do_eval \
--do_lower_case \
--data_dir $GLUE_DIR/$TASK_NAME \
--max_seq_length 128 \
--per_gpu_eval_batch_size=8 \
--per_gpu_train_batch_size=8 \
--learning_rate 2e-5 \
--num_train_epochs 3.0 \
--output_dir /tmp/$TASK_NAME/
The COCO data will then be used in training as described here:
https://github.com/facebookresearch/detectron2/blob/master/GETTING_STARTED.md#train-a-standard-model
In detectron2, the code assumes that the data lives in a folder called datasets inside the directory you launch the training from.
$ cd ~/code/detectron2
$ python tools/train_net.py \
--num-gpus 8 \
--config-file configs/COCO-InstanceSegmentation/mask_rcnn_R_50_FPN_1x.yaml
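If the COCO package were instead installed to a shared location such as /datasets/coco, one way to satisfy detectron2's ./datasets convention without downloading a second copy is a symlink. This is a common filesystem workaround, not something quilt itself provides, and the paths are the ones assumed in the examples above:

```shell
# Link the shared install into the location detectron2 expects.
mkdir -p ~/code/detectron2/datasets
ln -s /datasets/coco ~/code/detectron2/datasets/coco
```

This keeps a single copy of the data on disk while letting multiple projects point at it.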
Undecided between the two options below for pinning an install to a specific package hash:
$ quilt install quilt-ml-data/coco2017 --to /datasets/coco --hash=ae35249fd
$ quilt install quilt-ml-data/coco2017@ae35249fd --to /datasets/coco
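For the second option, the CLI has to split the package spec on "@". A minimal parsing sketch (purely illustrative; the function name and error handling are assumptions, not existing quilt code):

```python
def parse_package_spec(spec: str):
    """Split 'user/package[@hash]' into (name, hash_or_None)."""
    name, sep, top_hash = spec.partition("@")
    if "/" not in name:
        raise ValueError(f"invalid package spec: {spec!r}")
    # sep is empty when no '@' was present, i.e. no hash was pinned.
    return name, (top_hash if sep else None)
```

For example, `parse_package_spec("quilt-ml-data/coco2017@ae35249fd")` yields `("quilt-ml-data/coco2017", "ae35249fd")`, while an unpinned spec yields a `None` hash. The `--hash=` flag variant avoids this parsing but makes the pin easier to omit accidentally when copy-pasting a command.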
Here we are going to create a labelled image dataset, similar to COCO. It is currently common to store the annotations for a dataset (or dataset split) in a small number of files. This example shows how that can be done with quilt:
pkg = Package()
pkg.set("images/", "s3://quilt-ml-data/example-dataset/images/")
pkg.set("annotations.json", "s3://quilt-ml-data/example-dataset/annotations.json")
However, in a Quilt Package each object in a dataset has associated metadata. You can use this to better organize your labels and avoid relying on one very large and sometimes unwieldy annotations file.
pkg = Package()
# get_annotations is a stand-in for whatever loads the annotations file,
# assumed here to return a mapping of image_id -> per-image annotations.
annotations = get_annotations("s3://quilt-ml-data/example-dataset/annotations.json")
for image_id, image_annotations in annotations.items():
    pkg.set(f"images/{image_id}", f"s3://quilt-ml-data/example-dataset/images/{image_id}", metadata={"annotations": image_annotations})
This has some benefits, such as being able to trivially update the annotations for a single datapoint. It also makes it much easier to slice up the dataset - for example, to experiment with new train/val/test splits or to create a mini-dataset for quicker experimentation.
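With per-object metadata, a new split or mini-dataset is just a selection over entries. Here is a sketch of the idea using plain dicts in place of package entries; the image ids, the `make_splits` helper, and the 80/20 split rule are all made up for illustration:

```python
import random

# Per-image annotations keyed by image id, as produced by the loop above.
annotations = {f"img_{i:04d}.jpg": {"boxes": []} for i in range(100)}

def make_splits(image_ids, val_fraction=0.2, seed=0):
    """Deterministically shuffle ids and split them into train/val."""
    ids = sorted(image_ids)
    random.Random(seed).shuffle(ids)
    n_val = int(len(ids) * val_fraction)
    return {"val": ids[:n_val], "train": ids[n_val:]}

splits = make_splits(annotations)
mini = splits["train"][:10]  # a 10-image mini-dataset for quick experiments
```

Because each entry carries its own annotations, building the corresponding package subset is just re-adding the selected ids; no monolithic annotations file has to be rewritten.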