@antoinebrl
Last active November 8, 2024 03:09
Prepare ImageNet

Preparation of ImageNet (ILSVRC2012)

The dataset can be downloaded from the official website if you are affiliated with a research organization. It is also available on Academic Torrents.

This script extracts all the images and groups them so that each folder contains the images that belong to one class.

  1. Download ILSVRC2012_img_train.tar and ILSVRC2012_img_val.tar.
  2. Download the script: wget https://gist.githubusercontent.com/antoinebrl/7d00d5cb6c95ef194c737392ef7e476a/raw/dc53ad5fcb69dcde2b3e0b9d6f8f99d000ead696/prepare.sh
  3. Make it executable and run it from the folder that contains both archives: chmod +x prepare.sh && ./prepare.sh
  4. If the archives are not in the current folder, pass their paths as arguments: ./prepare.sh ~/Dataset/imagenet/ILSVRC2012_img_train.tar ~/Dataset/imagenet/ILSVRC2012_img_val.tar
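
Because the extraction takes a long time, it can help to confirm that both archives are readable before starting. The sketch below is a hypothetical helper, not part of prepare.sh; the function name `check_archives` is an assumption introduced here for illustration.

```shell
#!/usr/bin/env bash
# Hypothetical pre-flight helper (not part of prepare.sh).
# Returns 0 when every archive passed as an argument exists and is readable,
# and 1 otherwise, printing the offending paths to stderr.
check_archives() {
  local archive missing=0
  for archive in "$@"; do
    if [ ! -r "${archive}" ]; then
      echo "Missing or unreadable archive: ${archive}" >&2
      missing=1
    fi
  done
  return "${missing}"
}

# Example usage with the same default file names as prepare.sh:
#   check_archives ILSVRC2012_img_train.tar ILSVRC2012_img_val.tar && ./prepare.sh
```

The check only tests that the files can be read; it does not validate their contents.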

After the script finishes, the current folder should contain:

train/
├── n01440764
│   ├── n01440764_10026.JPEG
│   ├── n01440764_10027.JPEG
│   ├── n01440764_10029.JPEG
│   └── ...
├── n01443537
│   ├── n01443537_10007.JPEG
│   ├── n01443537_10014.JPEG
│   ├── n01443537_10025.JPEG
│   └── ...
├── ...
└── ...

val/
├── n01440764
│   ├── ILSVRC2012_val_00000946.JPEG
│   ├── ILSVRC2012_val_00001684.JPEG
│   └── ...
├── n01443537
│   ├── ILSVRC2012_val_00001269.JPEG
│   ├── ILSVRC2012_val_00002327.JPEG
│   ├── ILSVRC2012_val_00003510.JPEG
│   └── ...
├── ...
└── ...
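
One way to check the layout above is to count the class folders and images in each split. This is a minimal sketch; the helper names `count_classes` and `count_images` are assumptions made for this example. For ILSVRC2012, each split should have 1000 class folders, with 1,281,167 training images and 50,000 validation images.

```shell
#!/usr/bin/env bash
# Hypothetical sanity-check helpers (not part of prepare.sh).

# Count the class folders, i.e. the direct subdirectories of a split directory.
count_classes() {
  find "$1" -mindepth 1 -maxdepth 1 -type d | wc -l | tr -d ' '
}

# Count the JPEG images anywhere below a split directory.
count_images() {
  find "$1" -type f -name "*.JPEG" | wc -l | tr -d ' '
}

# Print a summary only when both split folders exist in the current directory.
if [ -d train ] && [ -d val ]; then
  for split in train val; do
    echo "${split}: $(count_classes "${split}") classes, $(count_images "${split}") images"
  done
fi
```

A lower count in train/ usually means that some per-class archives were not unpacked; a lower count in val/ means that some images were not moved into a class folder.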
#!/usr/bin/env bash
# Extract ILSVRC2012 into train/<class>/ and val/<class>/ folders.
# Stop at the first failing command, e.g. a missing archive or a failed download.
set -euo pipefail

train_tar="${1:-ILSVRC2012_img_train.tar}"
val_tar="${2:-ILSVRC2012_img_val.tar}"
mkdir -p train val

echo "Extracting training set ... (might take a while)"
tar -xf "${train_tar}" -C train

# The training archive holds one tar per class; unpack each into its own folder, 8 at a time.
echo "Extracting training categories ..."
cd train
find . -name "*.tar" -print0 | xargs -0 -P8 -I {} bash -c 'mkdir -p "${1%.tar}"; tar -xf "${1}" -C "${1%.tar}"; rm -f "${1}"' -- {}
cd ..

echo "Extracting validation set ..."
tar -xf "${val_tar}" -C val

echo "Restructuring validation ..."
cd val
# Python-like zip of two streams: pair line N read from fd 3 with line N read from fd 4.
function zip34() { while read -r word3 <&3; do read -r word4 <&4; echo "${word3}" "${word4}"; done; }
# The label file has one class ID per line, matched below with the sorted image names.
wget https://raw.githubusercontent.com/tensorflow/models/master/research/slim/datasets/imagenet_2012_validation_synset_labels.txt
find . -name "*.JPEG" | sort > images.txt
zip34 3<images.txt 4<imagenet_2012_validation_synset_labels.txt | xargs -n2 -P8 bash -c 'mkdir -p "$2"; mv "$1" "$2"' argv0
rm *.txt
cd ..

echo "train:" $(find train -name "*.JPEG" | wc -l) "images"
echo "val:" $(find val -name "*.JPEG" | wc -l) "images"
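
The validation step depends on pairing line N of the sorted image list with line N of the label file. The sketch below runs the same zip34 idea on two toy files to show the "image class" pairs that are then fed to xargs. The file contents reuse names from the tree above purely as illustration, and the comparison with `paste` is an addition for this example, not something prepare.sh uses.

```shell
#!/usr/bin/env bash
# Same idea as zip34 in prepare.sh: read one line from fd 3 and one from fd 4,
# then print them side by side.
zip34() { while read -r word3 <&3; do read -r word4 <&4; echo "${word3} ${word4}"; done; }

# Toy inputs (illustrative only): two image names and two class IDs.
demo_dir="$(mktemp -d)"
printf '%s\n' ./ILSVRC2012_val_00000946.JPEG ./ILSVRC2012_val_00001269.JPEG > "${demo_dir}/images.txt"
printf '%s\n' n01440764 n01443537 > "${demo_dir}/labels.txt"

# Pair the two files line by line.
pairs="$(zip34 3<"${demo_dir}/images.txt" 4<"${demo_dir}/labels.txt")"
echo "${pairs}"
# ./ILSVRC2012_val_00000946.JPEG n01440764
# ./ILSVRC2012_val_00001269.JPEG n01443537

# The standard paste utility produces the same pairs for these inputs.
pasted="$(paste -d' ' "${demo_dir}/images.txt" "${demo_dir}/labels.txt")"

rm -rf "${demo_dir}"
```

The pairing is purely positional, so it is only correct when the image list is sorted and has as many lines as the label file.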
@jiahaolu97
Thanks for sharing this, it really helped me a lot. As a supplement, one can download the dataset files with
wget https://image-net.org/data/ILSVRC/2012/ILSVRC2012_img_train.tar --no-check-certificate and wget https://image-net.org/data/ILSVRC/2012/ILSVRC2012_img_val.tar --no-check-certificate.