@physikerwelt
Last active February 14, 2021 08:29
Exploring the arxiv dump of @dginev

Get access to the data

Sign up (for example with your GitHub account) at https://gl.kwarc.info/SIGMathLing/dataset-arxmliv-2020

Install software to download data

Install git and git-lfs

sudo apt-get install git git-lfs

Download the data

Since the process might take a while, start a screen session and follow the instructions in https://gl.kwarc.info/SIGMathLing/dataset-arxmliv-2020/-/blob/master/README.md

PS: I was using git clone -n git@gl.kwarc.info:SIGMathLing/dataset-arxmliv-2020.git and not the http link, but that depends on your preferences.

PPS: I was also curious how long it would take to download the dataset, so I ran time git lfs fetch. Result: 39 minutes.
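The commands mentioned in the PS and PPS can be put together roughly as follows. This is only a sketch: the function name fetch_dataset and the screen session name are made up for illustration, the SSH URL assumes an SSH key is registered with the GitLab instance, and the README linked above remains the authoritative reference (a checkout step is probably still needed afterwards to materialize the files).

```shell
# Repository URL as quoted in the PS above (SSH variant).
REPO="git@gl.kwarc.info:SIGMathLing/dataset-arxmliv-2020.git"

# Hypothetical helper: clone without checking out files, then fetch the
# LFS objects and report how long the fetch took.
# The body runs in a subshell so the caller's working directory is unchanged.
fetch_dataset() (
  git clone -n "$REPO" &&
  cd dataset-arxmliv-2020 &&
  time git lfs fetch
)

# Suggested use, inside a screen session so the transfer keeps running
# after a disconnect (the session name is an arbitrary choice):
#   screen -S arxmliv
#   fetch_dataset
```

Nothing is downloaded until fetch_dataset is called explicitly.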

Extract the data

With the docker image (sha256:12502cdfcb4581110f0438d7661c3e4c40707a2db2bf79ec2559dd6b557b43da), extraction took about 5h. The volume /unzip is mounted to the location of the download.

  unzip:
    image: dockerqa/unzip
    volumes:
      - /hdd/datasets/arxiv/dataset-arxmliv-2020/data:/unzip:ro
      - arxiv-data:/data
    command: ['-d', '/data', '*.zip']

One can now search the docker log for a given filename.
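A minimal sketch of such a search, assuming the service is started via docker-compose under the name unzip as in the snippet above. The helper name find_in_log and the example file name are hypothetical; the helper simply filters whatever log text is piped into it.

```shell
# Hypothetical helper: keep only log lines that contain the given
# file name. -F treats the name as a fixed string, so dots in file
# names are not interpreted as regex wildcards.
find_in_log() {
  grep -F -- "$1"
}

# Example (assumes the compose service is called "unzip"):
#   docker-compose logs --no-color unzip | find_in_log "some-paper.html"
```

Because the helper reads from standard input, it works equally well on a log that was saved to a file beforehand.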
