Getting the text content of articles from the Australian Women's Weekly

The TroveHarvester makes it easy to download articles in bulk from Trove's digitised newspapers. Using the --text option you can also save the full-text content of every article.

However, this doesn't work for the Australian Women's Weekly, as the full text is not available through the Trove API. Fortunately, the article text can be downloaded from the web interface.

The one-line script below uses wget, so make sure you have it installed before you go any further. (You can install it with Homebrew if you're using a Mac.)
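A quick way to check is below. This is just a sketch -- the brew command assumes you're on a Mac with Homebrew already installed:

command -v wget || brew install wget   # prints wget's path if it's installed, otherwise installs it via Homebrew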

Instructions

  • Run the TroveHarvester as normal to harvest the article metadata as a CSV file (don't use the --text option)
  • From the command line, cd into the directory that contains the results.csv file created by your harvest (see the quick check below)
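For example (a minimal sketch -- the directory name here is hypothetical and will depend on your harvest):

cd data/1526911871   # hypothetical harvest directory, use your own
head -n 2 results.csv   # sanity check: the header row plus the first article id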

Now you have a choice. Although the full text downloaded from the web interface says it's a text file, it's not -- it's an HTML file. If you don't mind having <p>s and <div>s messing up your text you can just copy and paste this into the command line and hit enter:

for id in $(cut -d , -f 1 results.csv | sed "1 d"); do wget "http://trove.nla.gov.au/newspaper/rendition/nla.news-article${id}.txt"; done
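If your harvest contains a lot of articles, it might be polite to pause between requests so you don't hammer Trove's servers. A sketch of the same loop with a one-second sleep added after each download:

for id in $(cut -d , -f 1 results.csv | sed "1 d"); do wget "http://trove.nla.gov.au/newspaper/rendition/nla.news-article${id}.txt"; sleep 1; done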

If you want to strip the HTML tags, use:

for id in $(cut -d , -f 1 results.csv | sed "1 d"); do wget -qO- "http://trove.nla.gov.au/newspaper/rendition/nla.news-article${id}.txt" | sed -e 's/<[^>]*>/ /g' > "${id}.txt"; done

This might result in extra spaces, but I'm assuming that won't matter too much.
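If the stray spaces do bother you, one option is to squeeze them with tr before writing the file (a sketch -- tr -s ' ' replaces each run of repeated spaces with a single space):

for id in $(cut -d , -f 1 results.csv | sed "1 d"); do wget -qO- "http://trove.nla.gov.au/newspaper/rendition/nla.news-article${id}.txt" | sed -e 's/<[^>]*>/ /g' | tr -s ' ' > "${id}.txt"; done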

Either way you'll end up with lots of little text files -- one per article.

Explanation

This is what happens:

  • cut -d , -f 1 gets the first column of the results.csv file, which contains the article ids
  • sed "1 d" removes the header row
  • for id in... feeds the list of article ids into a loop
  • wget -qO- "http://trove.nla.gov.au/newspaper/rendition/nla.news-article${id}.txt" retrieves the article text
  • sed -e 's/<[^>]*>/ /g' gets rid of the HTML tags
  • > "${id}.txt" writes the text to a file named with the article id
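If the one-liner is hard to read, here's the same pipeline unrolled into a commented multi-line script (just a sketch -- it does exactly what the tag-stripping one-liner above does):

#!/usr/bin/env bash
# Download the full text of every article listed in results.csv.
for id in $(cut -d , -f 1 results.csv | sed "1 d"); do
    # Fetch the 'text' rendition (really HTML) for this article,
    # strip anything that looks like a tag, and save one file per article.
    wget -qO- "http://trove.nla.gov.au/newspaper/rendition/nla.news-article${id}.txt" \
        | sed -e 's/<[^>]*>/ /g' \
        > "${id}.txt"
done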

Pretty cool, huh? I'm sure there are neater ways of doing this, but I was pleased to find a one-line solution.
