Sign up (for example with your GitHub account) at https://gl.kwarc.info/SIGMathLing/dataset-arxmliv-2020
Install git and git-lfs:
sudo apt-get install git git-lfs
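Before cloning, it may help to confirm that both tools are actually on the PATH. The following is only a minimal sketch; the helper name require_tools is hypothetical and not part of the dataset instructions.

```shell
# Hypothetical helper: report every required command that is not on PATH.
# Returns 0 if all commands are found, 1 otherwise.
require_tools() {
    missing=0
    for tool in "$@"; do
        if ! command -v "$tool" >/dev/null 2>&1; then
            echo "missing: $tool" >&2
            missing=1
        fi
    done
    return "$missing"
}

# Usage:
#   require_tools git git-lfs && git lfs install
```

git lfs install sets up the Git LFS filters for your user account and only needs to be run once.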
Since the process might take a while, start a screen session and follow the instructions in https://gl.kwarc.info/SIGMathLing/dataset-arxmliv-2020/-/blob/master/README.md
PS: I used git clone -n git@gl.kwarc.info:SIGMathLing/dataset-arxmliv-2020.git (the SSH URL) rather than the HTTP link, but that is a matter of preference.
PPS: I was also curious how long it would take to download the dataset: time git lfs fetch reported about 39 minutes.
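The two commands from the PS and PPS can be combined roughly as follows. This is a sketch only: the wrapper name fetch_arxmliv is hypothetical, the SSH URL assumes an SSH key is registered with the GitLab instance, and the README remains the authoritative sequence of steps.

```shell
# Hypothetical wrapper around the clone and the timed LFS download.
fetch_arxmliv() {
    # -n (--no-checkout) clones the repository without checking out files.
    git clone -n git@gl.kwarc.info:SIGMathLing/dataset-arxmliv-2020.git || return 1
    # Note: this leaves the calling shell inside the repository directory.
    cd dataset-arxmliv-2020 || return 1
    # time reports how long the LFS download takes.
    time git lfs fetch
}

# Usage (ideally inside the screen session):
#   fetch_arxmliv
```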
Unzipping with the docker image sha256:12502cdfcb4581110f0438d7661c3e4c40707a2db2bf79ec2559dd6b557b43da
took about 5h. The location of the download is mounted as the volume /unzip.
unzip:
  image: dockerqa/unzip
  volumes:
    - /hdd/datasets/arxiv/dataset-arxmliv-2020/data:/unzip:ro
    - arxiv-data:/data
  command: ['-d', '/data', '*.zip']
One can now search the docker log for a given filename.
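A minimal sketch of such a search, assuming the compose service is called unzip as in the fragment above and that the docker compose command is available. The helper name find_in_log and the file name in the usage line are placeholders.

```shell
# Hypothetical helper: keep only the log lines (read from stdin) that
# contain the given fixed string. -F disables regex interpretation, so
# dots in file names match literally.
find_in_log() {
    grep -F -- "$1"
}

# Usage:
#   docker compose logs --no-color unzip | find_in_log "some-file.html"
```

Reading from stdin keeps the helper independent of how the log is obtained, so it works equally well with docker logs on a container name.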