@armandmcqueen
Created November 15, 2019 20:36

Code Samples

Acquire a benchmarking dataset

COCO 2017 is an image benchmarking dataset and GLUE is a collection of natural language processing datasets.

$ quilt install quilt-ml-data/glue --to /datasets/glue
Downloading......
The package "quilt-ml-data/glue" was successfully downloaded to /datasets/glue
$ quilt install quilt-ml-data/coco2017 --to ~/code/detectron2/datasets/coco
Downloading......
The package "quilt-ml-data/coco2017" was successfully downloaded to ~/code/detectron2/datasets/coco

NOTE: --to is a working name for that argument, but I'm pretty sure we need an argument that does this. There is an alternative world where the user cannot specify where the data is downloaded to. In that world, however, there must be an API that lets the user ask Quilt where the data is located. I'm not a big fan of that approach, but if there is a technical reason it would be much easier, I could be convinced.
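
To make that alternative concrete, here is a minimal sketch of what such a lookup API could look like. The quilt module and locate function are hypothetical names used for illustration, not an existing API:

import quilt  # hypothetical module name

# Ask Quilt where a previously installed package was materialized.
# locate() is a hypothetical function illustrating the alternative design.
data_dir = quilt.locate("quilt-ml-data/coco2017")
print(data_dir)  # e.g. a Quilt-managed cache path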

Training with a benchmark dataset

Two examples of how training would be run using the benchmark data downloaded above:

Huggingface (NLP)

https://github.com/huggingface/transformers#run_gluepy-fine-tuning-on-glue-tasks-for-sequence-classification

$ export GLUE_DIR=/datasets/glue/
$ export TASK_NAME=MRPC

$ python ./examples/run_glue.py \
    --model_type bert \
    --model_name_or_path bert-base-uncased \
    --task_name $TASK_NAME \
    --do_train \
    --do_eval \
    --do_lower_case \
    --data_dir $GLUE_DIR/$TASK_NAME \
    --max_seq_length 128 \
    --per_gpu_eval_batch_size=8 \
    --per_gpu_train_batch_size=8 \
    --learning_rate 2e-5 \
    --num_train_epochs 3.0 \
    --output_dir /tmp/$TASK_NAME/

Detectron2 (image)

The dataset downloaded above is then used in training like this:

https://github.com/facebookresearch/detectron2/blob/master/GETTING_STARTED.md#train-a-standard-model

Detectron2's code assumes that the data lives in a folder called datasets in the directory you launch training from, which is why the coco2017 package was installed to ~/code/detectron2/datasets/coco above.

$ cd ~/code/detectron2

$ python tools/train_net.py \
    --num-gpus 8 \
    --config-file configs/COCO-InstanceSegmentation/mask_rcnn_R_50_FPN_1x.yaml

Acquire a specific hash of a benchmarking dataset

Undecided between the two options below:

$ quilt install quilt-ml-data/coco2017 --to /datasets/coco --hash=ae35249fd
$ quilt install quilt-ml-data/coco2017@ae35249fd --to /datasets/coco
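
Either option would presumably have a Python equivalent. A sketch, assuming an install function whose keyword arguments mirror the CLI flags (the names are guesses, not a confirmed API):

import quilt  # hypothetical module name

# Install a package pinned to a specific hash; the to= and hash=
# keyword arguments are assumptions mirroring the CLI flags above.
quilt.install("quilt-ml-data/coco2017", to="/datasets/coco", hash="ae35249fd")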

Create a dataset

Here we are going to create a labelled image dataset, similar to the COCO dataset. It is currently common to have a small number of files holding the annotations for a dataset or dataset split. This example shows how that can be done with Quilt:

pkg = Package()
pkg.set("images/", "s3://quilt-ml-data/example-dataset/images/")
pkg.set("annotations.json", "s3://quilt-ml-data/example-dataset/annotations.json")

Alternatively, each object in a Quilt Package has associated metadata. You can use this to organize your labels per object and avoid relying on one very large and sometimes unwieldy annotations file.

pkg = Package()
# get_annotations is a stand-in for loading the annotations JSON from S3,
# assumed to return a dict mapping image_id to that image's annotations.
annotations = get_annotations("s3://quilt-ml-data/example-dataset/annotations.json")
for image_id, image_annotations in annotations.items():
    pkg.set(f"images/{image_id}",
            f"s3://quilt-ml-data/example-dataset/images/{image_id}",
            metadata={"annotations": image_annotations})

This has some benefits, such as being able to trivially update the annotations for a single datapoint. It also makes it much easier to slice up the dataset - for example, experimenting with new train/val/test splits or creating a mini-dataset for quicker experimentation.
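
As a sketch of that slicing, assuming the package exposes its entries and their metadata (walk() and entry.meta are assumptions about the package API, and "split" is a hypothetical annotation field):

train = Package()
for logical_key, entry in pkg.walk():
    # "split" is a hypothetical field inside each image's annotations.
    if entry.meta["annotations"].get("split") == "train":
        train.set(logical_key, entry)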
