First, download the COCO 2014 validation split and its annotations:
wget http://images.cocodataset.org/zips/val2014.zip
wget http://images.cocodataset.org/annotations/annotations_trainval2014.zip
unzip val2014.zip
unzip annotations_trainval2014.zip
Then run the notebook to get a CSV file containing 30k randomly selected image filenames and their captions.
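For readers who want to see roughly what that step involves, here is a minimal sketch of sampling images and captions from the annotation file. It is not the notebook's actual code. It assumes the captions live in annotations/captions_val2014.json (extracted from the zip above) and keeps one randomly chosen caption per image. The function names, the output name coco_30k.csv, the column names, and the seed are all hypothetical.

```python
import csv
import json
import os
import random


def sample_image_captions(coco, num_samples=30_000, seed=0):
    """Return (file_name, caption) pairs for randomly chosen images."""
    # Map image id -> file name, e.g. "COCO_val2014_000000000042.jpg".
    id_to_file = {img["id"]: img["file_name"] for img in coco["images"]}

    # Group captions by image; COCO provides several captions per image.
    captions = {}
    for ann in coco["annotations"]:
        captions.setdefault(ann["image_id"], []).append(ann["caption"].strip())

    rng = random.Random(seed)
    # Sort the ids so the sample is reproducible for a given seed.
    image_ids = sorted(i for i in captions if i in id_to_file)
    chosen = rng.sample(image_ids, min(num_samples, len(image_ids)))
    # Assumption: keep a single randomly chosen caption per image.
    return [(id_to_file[i], rng.choice(captions[i])) for i in chosen]


def write_csv(rows, path):
    """Write the sampled pairs to a CSV file with a header row."""
    with open(path, "w", newline="", encoding="utf-8") as f:
        writer = csv.writer(f)
        writer.writerow(["file_name", "caption"])  # hypothetical column names
        writer.writerows(rows)


if __name__ == "__main__":
    ann_path = "annotations/captions_val2014.json"
    # Only run when the annotations have actually been downloaded.
    if os.path.exists(ann_path):
        with open(ann_path, encoding="utf-8") as f:
            coco = json.load(f)
        write_csv(sample_image_captions(coco), "coco_30k.csv")
```

Fixing the seed keeps the selected subset stable across runs, which matters if the 30k images are used as a shared evaluation set.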
Then, optionally, run python coco_30k_hf_datasets.py
to have the dataset stored on the HF Hub 🤗
Once the dataset is pushed, it can be loaded in two lines of code with the 🤗 Datasets library:
from datasets import load_dataset
dataset = load_dataset("sayakpaul/coco-30-val-2014", split="train")