
@rom1504
Last active August 1, 2021 20:57
cah_download_from_theeye.py

This describes how to download http://the-eye.eu/eleuther_staging/cah/, a large dataset of image/text pairs filtered from Common Crawl.

  1. Run get_links.sh; this produces a to_aria.txt file containing all the URLs to download and where to put them.
  2. Run download.sh; it uses aria2c to download the files quickly (takes about 1h).

Note: if you only want one type of file, you can change this part: grep 'csv\|txt\|pkl\|tfrecord'
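The same kind of filtering can be sketched in Python, assuming the URLs are already available as a list of strings. The function name `filter_urls`, its `extensions` parameter, and the example URLs are hypothetical and not part of the gist's scripts. Note that grep matches the pattern anywhere in the line, while this sketch is stricter and only looks at the file extension.

```python
from urllib.parse import urlparse


def filter_urls(urls, extensions):
    """Keep only URLs whose path ends with one of the given extensions."""
    # Normalise "csv" and ".csv" to the same suffix form.
    suffixes = tuple("." + ext.lstrip(".") for ext in extensions)
    return [u for u in urls if urlparse(u).path.endswith(suffixes)]


# Hypothetical example URLs, for illustration only.
urls = [
    "http://the-eye.eu/eleuther_staging/cah/example/a.csv",
    "http://the-eye.eu/eleuther_staging/cah/example/b.tfrecord",
    "http://the-eye.eu/eleuther_staging/cah/example/c.pkl",
]

csv_only = filter_urls(urls, ["csv"])
print(csv_only)
```

Passing several extensions, for example `["csv", "pkl"]`, keeps every URL that matches any of them.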

aria2c --dir=output --auto-file-renaming=false --continue=true -i for_aria.txt -x 16 -s 16 -j 16
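For readers who prefer to launch the download from Python, here is a minimal sketch that assembles the same aria2c invocation as an argument list. The helper name `build_aria2c_command` and its parameters are hypothetical; the input file name is passed in explicitly, and `for_aria.txt` is used here because that is the name written by the Python script in this gist.

```python
import shlex


def build_aria2c_command(input_file, out_dir="output", connections=16):
    """Return the aria2c command line as a list of arguments."""
    return [
        "aria2c",
        f"--dir={out_dir}",            # default download directory
        "--auto-file-renaming=false",  # do not rename files that already exist
        "--continue=true",             # resume partially downloaded files
        "-i", input_file,              # read URLs and per-URL options from this file
        "-x", str(connections),        # max connections per server
        "-s", str(connections),        # number of connections used per file
        "-j", str(connections),        # max concurrent downloads
    ]


cmd = build_aria2c_command("for_aria.txt")
print(shlex.join(cmd))
# To actually start the download (requires aria2c to be installed):
#   import subprocess
#   subprocess.run(cmd, check=True)
```

Building the command as a list avoids shell quoting problems if the input file name contains spaces.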
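import os

# Create the top-level download directory if it does not exist yet.
os.makedirs("output", exist_ok=True)

# download_urls.txt is expected to contain one URL per line.
with open("download_urls.txt", "r") as f:
    filelist = f.read().split("\n")

with open("for_aria.txt", "w") as f:
    for fil in filelist:
        if fil == "":
            continue
        # Mirror the remote path under output/: drop the scheme, the host and
        # the first path component, as well as the file name itself.
        output_dir = "output" + "/" + "/".join(fil.split("/")[4:][:-1])
        # aria2c input file format: the URL, then indented per-URL options.
        f.write(fil + "\n")
        f.write(" dir=" + output_dir + "\n")
        f.write(" continue=true\n")
        f.write(" max-connection-per-server=16\n")
        f.write(" split=16\n")
        f.write(" min-split-size=20M\n\n")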
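To make the path handling in the script concrete, here is a small worked example of how the output directory is derived from a URL and what the resulting aria2c input entry looks like. The helper names `output_dir_for` and `aria_entry`, and the example URL, are hypothetical and used only for illustration.

```python
def output_dir_for(url, root="output"):
    """Derive the local directory for a URL, as the script above does."""
    parts = url.split("/")
    # parts[0:3] are "http:", "" and the host; parts[3] is the first path
    # component; parts[-1] is the file name. Keep what lies in between.
    return root + "/" + "/".join(parts[4:-1])


def aria_entry(url):
    """Format one entry of the aria2c input file for the given URL."""
    options = [
        "dir=" + output_dir_for(url),
        "continue=true",
        "max-connection-per-server=16",
        "split=16",
        "min-split-size=20M",
    ]
    # Option lines must start with whitespace in an aria2c input file.
    return url + "\n" + "".join(" " + opt + "\n" for opt in options) + "\n"


# Hypothetical example URL, for illustration only.
url = "http://the-eye.eu/eleuther_staging/cah/example_dir/example.csv"
print(output_dir_for(url))
print(aria_entry(url))
```

With this layout, files keep their remote directory structure under output/, so two files with the same name in different remote directories do not overwrite each other.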