Download Google Conceptual Captions Data
# Download split TSV files here
# create split folders val/ and trn/
# run as follows
# cd val/; bash ../ ../val.tsv
# cd trn/; bash ../ ../trn.tsv
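split_file=$1  # path to the split TSV, taken as the first argument per the usage above
idx=0          # running counter used to name the downloaded files
rm -f .img_file
cut -f2 "$split_file" > .img_file  # column 2 of the TSV holds the image URL
while read -r datum
do
    idx=$((idx + 1))
    echo "wget $datum -O ${idx}.gcc --tries=2" # chose gcc file extension randomly
    # There are tons of different file extensions, see:
    # cut -f2 ../Train_GCC-training.tsv | grep -o '....$' | sort | uniq -c | sort -nk1
    wget "$datum" -O "${idx}.gcc" --tries=2
done < .img_file
rm -f .img_file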
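
The script gives every download the arbitrary `.gcc` suffix because the URLs end in many different extensions. If keeping a guessed extension is preferable, a minimal sketch could look like the following. The helper name `url_ext` is hypothetical and not part of the original script. It only inspects the URL text, so the result is a guess rather than a check of the actual file contents.

```shell
# Hypothetical helper (not in the original script): guess a file
# extension from an image URL. Falls back to "gcc", the placeholder
# suffix used by the download loop, when no plausible one is found.
url_ext() {
    path=${1%%[?#]*}   # drop any query string or fragment
    name=${path##*/}   # keep only the last path component
    case $name in
        *.*) ext=${name##*.} ;;   # text after the final dot
        *)   ext= ;;              # no dot at all
    esac
    # Accept only short alphanumeric suffixes (1 to 5 characters).
    case $ext in
        ''|*[!A-Za-z0-9]*|??????*) echo gcc ;;
        *) printf '%s\n' "$ext" | tr 'A-Z' 'a-z' ;;
    esac
}
```

Inside the loop, the helper could replace the fixed suffix, for example `wget "$datum" -O "${idx}.$(url_ext "$datum")" --tries=2`. Downloaded files would then no longer share one extension, so any later step that globs for `*.gcc` would need adjusting.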