@pniedzielski
Created November 1, 2020 17:45
How to download the ETCSL Corpus

ETCSL Corpus

I can never remember how to download the ETCSL corpus, because it’s hosted on the Oxford Text Archive (handle: 20.500.12024/2518). Here are the instructions to download a local copy to the current directory.

curl -L -o etcsl-download.zip https://ota.bodleian.ox.ac.uk/repository/xmlui/handle/20.500.12024/2518/allzip
unzip etcsl-download.zip
rm etcsl-download.zip
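
If the server answers with an error page, curl may still save that response under the .zip name, and unzip then fails with a confusing message. As a minimal sketch, a check like the one below could be run between the curl and unzip steps. The function name looks_like_zip is hypothetical, and the check only inspects the first two bytes of the file; it does not validate the archive.

```shell
# Hypothetical helper (not part of the original instructions): succeed only
# if the file starts with the two bytes "PK" that open every zip archive.
# This is a cheap sanity check, not a full integrity test.
looks_like_zip() {
    [ "$(head -c 2 "$1" 2>/dev/null)" = "PK" ]
}

# Possible usage, placed before the unzip step:
#   looks_like_zip etcsl-download.zip || echo "download is not a zip file" >&2
```

For a thorough check, unzip -t etcsl-download.zip tests every member of the archive, at the cost of reading the whole file.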

The texts themselves are contained within a nested zip archive, etcsl.zip, which we also want to unzip.

unzip etcsl.zip
rm etcsl.zip

The current directory should now have a copy of the ETCSL corpus.

tree
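
Besides eyeballing the tree output, one could check for a few expected entries by name. The sketch below assumes the entry names that appear in the .gitignore list in this note; the function name missing_entries is hypothetical.

```shell
# Hypothetical helper: print every expected entry that does not exist in the
# given directory. Prints nothing when all entries are present.
missing_entries() {
    dir="$1"
    shift
    for name in "$@"; do
        [ -e "$dir/$name" ] || printf '%s\n' "$name"
    done
}

# Possible usage from the download directory:
#   missing_entries . etcsl corphdr.xml etcsl.xml readme.txt
```

Empty output means every listed entry was found; any name that is printed is missing.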

You might also want a .gitignore file listing the files we unzipped, so they stay out of version control:

/etcsl/
/contents.txt
/corphdr.xml
/etcsl-extensions.dtd
/etcsl-extensions.ent
/etcslfullcat.html
/etcslmanual.html
/etcsl-sux.ent
/etcsl.xml
/header2518.xml
/readme.txt
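
Rather than typing this list by hand, it could probably be derived from the archive listings. The sketch below is one way to do that, assuming unzip's zipinfo mode (unzip -Z1) is available to print member paths one per line. The function name top_level_ignores is hypothetical, and it would have to run before the rm steps delete the archives.

```shell
# Hypothetical helper: read archive member paths (one per line) on stdin and
# print root-anchored .gitignore patterns for their top-level entries.
# Directories get a trailing slash; duplicates are removed.
top_level_ignores() {
    awk -F/ '{ print "/" $1 (NF > 1 ? "/" : "") }' | LC_ALL=C sort -u
}

# Possible usage, run once per archive before deleting it:
#   unzip -Z1 etcsl-download.zip | top_level_ignores >> .gitignore
#   unzip -Z1 etcsl.zip | top_level_ignores >> .gitignore
```

The outer archive's listing would include etcsl.zip itself, which is deleted afterwards, so that line can be dropped from the result.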