@sayakpaul
Created January 31, 2024 10:34
Randomly samples 30k image-caption pairs from the COCO 2014 validation set.
from datasets import Dataset, Features
from datasets import Image as ImageFeature
from datasets import Value
import pandas as pd
import os

# CSV comes from the notebook above.
df = pd.read_csv("coco_30k_randomly_sampled_2014_val.csv")
root_path = "val2014"


def gen_fn():
    # Yield one record per CSV row; the image is referenced by its path on disk.
    for i, row in df.iterrows():
        path = os.path.join(root_path, row["file_name"])
        caption = row["caption"]
        yield {"image": path, "caption": caption}


if __name__ == "__main__":
    ds = Dataset.from_generator(
        gen_fn,
        features=Features(image=ImageFeature(), caption=Value("string")),
    )
    ds_id = "sayakpaul/coco-30-val-2014"  # Change this.
    # To be able to push, you need to run `huggingface-cli login`.
    ds.push_to_hub(ds_id)

First, download the COCO 2014 val split:

wget http://images.cocodataset.org/zips/val2014.zip
wget http://images.cocodataset.org/annotations/annotations_trainval2014.zip
unzip val2014.zip
unzip annotations_trainval2014.zip

Then run the notebook to get a CSV file containing 30k randomly selected image filenames and their captions. Optionally, run python coco_30k_hf_datasets.py afterwards to store the dataset on the HF Hub 🤗
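
The notebook itself is not reproduced in this text, so the following is only a rough sketch of how such a CSV could be produced, not the notebook's actual code. It assumes the standard COCO captions layout (an "images" list with "id" and "file_name", and an "annotations" list with "image_id" and "caption") and assumes one randomly chosen caption is kept per sampled image. The helper names sample_captions and write_csv, and the seed parameter, are hypothetical; only the column names file_name and caption come from the script above.

```python
import csv
import random


def sample_captions(coco, n, seed=0):
    """Pick n distinct images at random and one caption for each (assumed policy)."""
    file_names = {img["id"]: img["file_name"] for img in coco["images"]}

    # Group all captions by the image they describe.
    captions = {}
    for ann in coco["annotations"]:
        captions.setdefault(ann["image_id"], []).append(ann["caption"].strip())

    rng = random.Random(seed)
    # Sort the ids first so the sample is reproducible for a given seed.
    image_ids = rng.sample(sorted(captions), n)
    return [
        {"file_name": file_names[i], "caption": rng.choice(captions[i])}
        for i in image_ids
    ]


def write_csv(rows, path):
    """Write the rows with the two columns the upload script reads."""
    with open(path, "w", newline="", encoding="utf-8") as f:
        writer = csv.DictWriter(f, fieldnames=["file_name", "caption"])
        writer.writeheader()
        writer.writerows(rows)
```

With the annotations unzipped, one would load annotations/captions_val2014.json with json.load, call sample_captions(coco, n=30_000), and pass the result to write_csv with the path coco_30k_randomly_sampled_2014_val.csv.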

Once the dataset is pushed, it can be loaded in two lines of code with the 🤗 Datasets library:

from datasets import load_dataset 

dataset = load_dataset("sayakpaul/coco-30-val-2014", split="train")