@BIGBALLON
Created May 13, 2018 20:09
Script for extracting the ImageNet dataset.
#!/bin/bash
#
# script to extract ImageNet dataset
# ILSVRC2012_img_train.tar (about 138 GB)
# ILSVRC2012_img_val.tar (about 6.3 GB)
# make sure ILSVRC2012_img_train.tar and ILSVRC2012_img_val.tar are in your current directory
#
# https://github.com/facebook/fb.resnet.torch/blob/master/INSTALL.md
#
# train/
# ├── n01440764
# │ ├── n01440764_10026.JPEG
# │ ├── n01440764_10027.JPEG
# │ ├── ......
# ├── ......
# val/
# ├── n01440764
# │ ├── ILSVRC2012_val_00000293.JPEG
# │ ├── ILSVRC2012_val_00002138.JPEG
# │ ├── ......
# ├── ......
#
#
# Extract the training data:
#
mkdir train && mv ILSVRC2012_img_train.tar train/ && cd train
tar -xvf ILSVRC2012_img_train.tar && rm -f ILSVRC2012_img_train.tar
find . -name "*.tar" | while read NAME ; do mkdir -p "${NAME%.tar}"; tar -xvf "${NAME}" -C "${NAME%.tar}"; rm -f "${NAME}"; done
cd ..
#
# Extract the validation data and move images to subfolders:
#
mkdir val && mv ILSVRC2012_img_val.tar val/ && cd val && tar -xvf ILSVRC2012_img_val.tar
wget -qO- https://raw.githubusercontent.com/soumith/imagenetloader.torch/master/valprep.sh | bash
cd ..
#
# Check the total number of files after extraction (run from the parent directory)
#
# $ find train/ -name "*.JPEG" | wc -l
# 1281167
# $ find val/ -name "*.JPEG" | wc -l
# 50000
#
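
The `find ... | wc -l` totals above do not show whether the images ended up in per-class subfolders. A minimal Python sketch of a per-class check follows, assuming the `train/` and `val/` layout drawn at the top of the script. The helper names `count_images_per_class` and `summarize` are hypothetical and not part of the original gist; the expected totals are the ones quoted in the script.

```python
import os


def count_images_per_class(root):
    """Return {class_dir_name: number of .JPEG files directly inside it}."""
    counts = {}
    for entry in sorted(os.listdir(root)):
        class_dir = os.path.join(root, entry)
        if not os.path.isdir(class_dir):
            continue  # ignore stray files such as a leftover .tar
        counts[entry] = sum(
            1 for name in os.listdir(class_dir) if name.endswith('.JPEG')
        )
    return counts


def summarize(root):
    """Return (number of class directories, total number of images)."""
    counts = count_images_per_class(root)
    return len(counts), sum(counts.values())


if __name__ == '__main__':
    # Expected totals are taken from the comments in the script above.
    for split, expected_total in (('train', 1281167), ('val', 50000)):
        if not os.path.isdir(split):
            print(split + '/: not found, skipping')
            continue
        n_classes, n_images = summarize(split)
        status = 'OK' if n_images == expected_total else 'MISMATCH'
        print(f'{split}/: {n_classes} class dirs, {n_images} images ({status})')
```

Run from the directory that contains `train/` and `val/`. ILSVRC2012 has 1000 classes, so each split should report 1000 class directories, and every `val/` class should hold 50 images.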
@LucasZhan

thank you!

@suisuilianwyj

thank you!

@aielabwangzhen

Many thanks

@LeavesLei

Many thanks

@Francis-Ferri-personal

I found the download links in another repo:

@JoEarl

JoEarl commented Jan 30, 2024

Thanks a lot!

@Walnutes

Walnutes commented Apr 2, 2024

Many thanks!

@msjun23

msjun23 commented Apr 19, 2024

Thanks a lot!

@Hprairie

Thanks!

@FalsoMoralista

For those downloading from Hugging Face:

Train dataset

Untar the tarballs into the "train" and "val" directories, then use the following:

import os
import shutil

# Move each flat train/ file into train/<cls_name>/
for image in os.listdir('train/'):
    src = os.path.join('train', image)
    if not os.path.isfile(src):
        continue  # skip class directories left by a previous run
    # The class name is the first underscore-separated field of the file name
    cls_name = image.split('_')[0]
    destination = os.path.join('train', cls_name)
    os.makedirs(destination, exist_ok=True)
    shutil.move(src, destination)

Val dataset

A minor adjustment is needed here: the class name is the fourth underscore-separated field of the file name, so it still carries the file extension, and the val directories would otherwise be named like "cls_name.JPG", which might cause issues.

I haven't tried it on the full dataset, but stripping the extension with os.path.splitext, as in the script below, should solve the issue.

import os
import shutil

# Move each flat val/ file into val/<cls_name>/
for image in os.listdir('val/'):
    src = os.path.join('val', image)
    fields = image.split('_')
    if not os.path.isfile(src) or len(fields) < 4:
        continue  # skip class directories and files that do not match the pattern
    # Drop the extension so the directory is named after the class only
    cls_name = os.path.splitext(fields[3])[0]
    destination = os.path.join('val', cls_name)
    os.makedirs(destination, exist_ok=True)
    shutil.move(src, destination)
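
If a `val/` tree was already built with directory names that still carry the file extension (the issue described above), it may be easier to repair it in place than to extract everything again. The following is a minimal sketch under that assumption. The function name `strip_extension_from_class_dirs` and the list of extensions it checks are hypothetical and not part of the original comment.

```python
import os
import shutil


def strip_extension_from_class_dirs(root, extensions=('.JPEG', '.JPG')):
    """Rename root/<cls_name><ext>/ to root/<cls_name>/, merging if the target exists.

    Returns a list of (old_name, new_name) pairs for the directories it changed.
    """
    renamed = []
    for entry in sorted(os.listdir(root)):
        src_dir = os.path.join(root, entry)
        stem, ext = os.path.splitext(entry)
        if not os.path.isdir(src_dir) or ext not in extensions:
            continue  # leave plain files and correctly named directories alone
        dst_dir = os.path.join(root, stem)
        os.makedirs(dst_dir, exist_ok=True)
        # Move the images one by one so an existing target directory is merged into.
        for name in os.listdir(src_dir):
            shutil.move(os.path.join(src_dir, name), os.path.join(dst_dir, name))
        os.rmdir(src_dir)  # now empty
        renamed.append((entry, stem))
    return renamed


if __name__ == '__main__':
    if os.path.isdir('val'):
        for old, new in strip_extension_from_class_dirs('val'):
            print('val/' + old + ' -> val/' + new)
```

Only directories whose names end in one of the listed extensions are touched, so running it on a tree that is already correct changes nothing.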
