@volkancirik
Last active April 21, 2019 19:46
Download Google Conceptual Captions Data
#!/usr/bin/env bash
# Download the split TSV files from https://ai.google.com/research/ConceptualCaptions/download
# Create the split folders val/ and trn/
# Run as follows:
# cd val/; bash ../download_gcc.sh ../val.tsv
# cd trn/; bash ../download_gcc.sh ../trn.tsv
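# Abort with a usage message if no TSV file is given.
split_file=${1:?usage: bash download_gcc.sh <split.tsv>}
idx=0
rm -f .img_file
# The image URL is the second tab-separated column of the split file.
cut -f2 "$split_file" > .img_file
# IFS= and -r keep each URL exactly as written (no trimming, no backslash handling).
while IFS= read -r datum
do
  idx=$((idx + 1))
  echo "wget $datum -O ${idx}.gcc --tries=2" # the .gcc file extension is arbitrary
  # There are tons of different file extensions, see:
  # cut -f2 ../Train_GCC-training.tsv | grep -o '....$' | sort | uniq -c | sort -nk1
  # Quote the URL so characters such as ? and * are not expanded by the shell.
  wget "$datum" -O "${idx}.gcc" --tries=2
done < .img_file
rm -f .img_file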
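
Because the script names each image after its line number in the split file, the captions can be matched back to the downloaded files afterwards. The following is a minimal sketch of one way to do that, not part of the original script. It assumes that column 1 of the TSV holds the caption (the script only confirms that column 2 holds the URL), and `caption_index` is a hypothetical helper name.

```shell
#!/usr/bin/env bash
# Hypothetical helper (not in the original gist): print "<idx>.gcc<TAB><caption>"
# for every row of a split TSV, using the same 1-based numbering as the
# download loop. Assumes column 1 is the caption and column 2 is the URL.
caption_index() {
  awk -F'\t' -v OFS='\t' '{ print NR ".gcc", $1 }' "$1"
}

# Example (run inside val/ after downloading):
# caption_index ../val.tsv > captions.tsv
```

The index has one row per TSV line, including rows whose download failed, so a row here does not guarantee that the matching `.gcc` file contains a valid image.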