Install corpora from Amazon S3

This is a brief explanation of how to use credentials for downloading data from Amazon S3. The PolMine Project issues these credentials on demand.

For R users, we recommend using the packages developed in the cloudyr project (aws.signature and aws.s3) for managing credentials and downloading data from S3.

Please note that putting hard-coded credentials into an R or Rmd file with your analysis is bad practice. Credentials needed for data access should be put into the default file applicable for your system:

  • ~/.aws/credentials (Linux, macOS)
  • C:\Users\USERNAME\.aws\credentials (Windows)

The content of the file should look as follows (for further explanations, see this AWS Security Blog):

[default]
aws_access_key_id = ABCDEFGHIJKLMNOPQRSTUVZ
aws_secret_access_key = 12345667890

The directory for the credentials file does not exist by default. Use the following code to create the .aws directory and the credentials file if necessary.

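library(aws.signature)
credentials_file <- aws.signature::default_credentials_file()
# Create the .aws directory if it does not exist yet
if (!dir.exists(dirname(credentials_file))){
  dir.create(dirname(credentials_file), recursive = TRUE)
}
# Create a credentials file with an empty default profile if none exists
if (!file.exists(credentials_file)){
  writeLines("[default]", con = credentials_file)
}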
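Because the credentials file holds a secret key, it may be worth restricting access to it. The following is a minimal sketch using base R only. It works on a temporary stand-in file (demo_file is a name chosen for illustration) so that it does not touch real credentials. Note that on Windows, Sys.chmod() only toggles the read-only attribute, so this is mainly relevant on Linux and macOS.

```r
# Temporary stand-in for the credentials file (illustration only)
demo_file <- tempfile()
writeLines("[default]", con = demo_file)

# Allow read/write access for the owner only
Sys.chmod(demo_file, mode = "0600", use_umask = FALSE)

# Inspect the resulting permission bits (returned as an octal mode)
file_mode <- file.info(demo_file)$mode
format(file_mode)
```

To apply the same idea to the real file, pass credentials_file instead of demo_file.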

Note that directories starting with a dot (".") are hidden by default, so you will not necessarily see the .aws directory in the file browser of your text editor. We recommend opening the credentials file in RStudio by calling rstudioapi::navigateToFile().

rstudioapi::navigateToFile(credentials_file)

Edit the file, save the result and close the file when you are finished. To check that credentials are present and can be processed, run the following line of code:

aws.signature::read_credentials()
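If you want a check that does not depend on aws.signature, the file format shown above can be inspected with base R. The following is a sketch only: has_default_profile() is a hypothetical helper, not part of any package, and it merely tests that a [default] profile with both key names is present. It does not validate the credentials themselves.

```r
# Hypothetical helper: TRUE if the file at `path` has a [default] profile
# that contains both expected key names.
has_default_profile <- function(path) {
  if (!file.exists(path)) return(FALSE)
  lines <- trimws(readLines(path, warn = FALSE))
  start <- which(lines == "[default]")
  if (length(start) == 0L) return(FALSE)
  # The profile ends where the next "[...]" header begins (or at end of file)
  headers <- grep("^\\[.*\\]$", lines)
  nxt <- headers[headers > start[1]]
  end <- if (length(nxt) > 0L) nxt[1] - 1L else length(lines)
  body <- if (end > start[1]) lines[(start[1] + 1L):end] else character(0)
  # Key names are everything before the first "="
  keys <- trimws(sub("=.*$", "", body))
  all(c("aws_access_key_id", "aws_secret_access_key") %in% keys)
}

# Demonstration on a temporary file with placeholder values
demo_credentials <- tempfile()
writeLines(
  c("[default]",
    "aws_access_key_id = PLACEHOLDER",
    "aws_secret_access_key = PLACEHOLDER"),
  con = demo_credentials
)
has_default_profile(demo_credentials)
```

For the real file, the call would be has_default_profile(aws.signature::default_credentials_file()).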

Once credentials are available, the following code downloads and installs a corpus on your system using functionality of the cwbtools package. This may involve creating the directory structure required for CWB corpora; a user dialogue will guide you through this step. Make sure you insert the correct corpus ID and version number in the code. Since all corpora are shipped as compressed tarballs, the file name at the end of the URI has to carry the file extension "tar.gz".

library(aws.s3)
library(cwbtools)

corpus_uri <- "s3://PATH/YOU/HAVE/BEEN/PROVIDED/WITH.tar.gz"
cwbtools::corpus_install(tarball = corpus_uri)
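A typo in the URI is an easy mistake to make, so it can help to check its shape before starting the download. The sketch below uses base R only. is_corpus_tarball_uri() is a hypothetical helper introduced for illustration, and the pattern it tests (an "s3://" prefix, a bucket, a path, and a ".tar.gz" ending) is an assumption derived from the placeholder URI above.

```r
# Hypothetical helper: TRUE if `uri` is a single string that looks like
# an S3 URI pointing to a compressed tarball.
is_corpus_tarball_uri <- function(uri) {
  is.character(uri) && length(uri) == 1L &&
    grepl("^s3://[^/]+/.+\\.tar\\.gz$", uri)
}

corpus_uri <- "s3://PATH/YOU/HAVE/BEEN/PROVIDED/WITH.tar.gz"

# Stop early with an error if the URI does not have the expected shape
stopifnot(is_corpus_tarball_uri(corpus_uri))
```

The check only looks at the form of the string. Whether the object exists and whether your credentials grant access to it is only known once the download is attempted.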

To verify the installation, run the following two commands and check whether the new corpus shows up in the list of corpora.

library(polmineR)
corpus()