Getting the text content of articles from the Australian Women's Weekly

The TroveHarvester makes it easy to download articles in bulk from Trove's digitised newspapers. Using the --text option you can also save the full-text content of every article.

However, this doesn't work for the Australian Women's Weekly, as the full text is not available through the Trove API. Fortunately, the article text can be downloaded from the web interface.

The one-line script below uses wget, so make sure you have it installed before you go any further. (You can install it with Homebrew if you're using a Mac.)
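A quick way to check is below. This is just a sketch -- the brew command assumes you're on a Mac with Homebrew already installed:

command -v wget || brew install wget   # prints wget's path if it's installed, otherwise installs it via Homebrew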

Instructions

  • Run the TroveHarvester as normal to harvest the article metadata as a CSV file (don't use the --text option)
  • From the command line, cd into the directory that contains the results.csv file created by your harvest (see the quick check below)
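For example (a minimal sketch -- the directory name here is hypothetical and will depend on your harvest):

cd data/1526911871   # hypothetical harvest directory, use your own
head -n 2 results.csv   # sanity check: the header row plus the first article id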

Now you have a choice. Although the full text downloaded from the web interface says it's a text file, it's not -- it's an HTML file. If you don't mind having <p>s and <div>s messing up your text you can just copy and paste this into the command line and hit enter:

for id in $(cut -d , -f 1 results.csv | sed "1 d"); do wget "http://trove.nla.gov.au/newspaper/rendition/nla.news-article${id}.txt"; done
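If your harvest contains a lot of articles, it might be polite to pause between requests so you don't hammer Trove's servers. A sketch of the same loop with a one-second sleep added after each download:

for id in $(cut -d , -f 1 results.csv | sed "1 d"); do wget "http://trove.nla.gov.au/newspaper/rendition/nla.news-article${id}.txt"; sleep 1; done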

If you want to strip the HTML tags, use:

for id in $(cut -d , -f 1 results.csv | sed "1 d"); do wget -qO- "http://trove.nla.gov.au/newspaper/rendition/nla.news-article${id}.txt" | sed -e 's/<[^>]*>/ /g' > "${id}.txt"; done

This might result in extra spaces, but I'm assuming that won't matter too much.
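If the stray spaces do bother you, one option is to squeeze them with tr before writing the file (a sketch -- tr -s ' ' replaces each run of repeated spaces with a single space):

for id in $(cut -d , -f 1 results.csv | sed "1 d"); do wget -qO- "http://trove.nla.gov.au/newspaper/rendition/nla.news-article${id}.txt" | sed -e 's/<[^>]*>/ /g' | tr -s ' ' > "${id}.txt"; done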

Either way you'll end up with lots of little text files -- one per article.

Explanation

This is what happens:

  • cut -d , -f 1 gets the first column of the results.csv file, which contains the article ids
  • sed "1 d" removes the header row
  • for id in... feeds the list of article ids into a loop
  • wget -qO- "http://trove.nla.gov.au/newspaper/rendition/nla.news-article${id}.txt" retrieves the article text
  • sed -e 's/<[^>]*>/ /g' gets rid of the HTML tags
  • > "${id}.txt" writes the text to a file named with the article id
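If the one-liner is hard to read, here's the same pipeline unrolled into a commented multi-line script (just a sketch -- it does exactly what the tag-stripping one-liner above does):

#!/usr/bin/env bash
# Download the full text of every article listed in results.csv.
for id in $(cut -d , -f 1 results.csv | sed "1 d"); do
    # Fetch the 'text' rendition (really HTML) for this article,
    # strip anything that looks like a tag, and save one file per article.
    wget -qO- "http://trove.nla.gov.au/newspaper/rendition/nla.news-article${id}.txt" \
        | sed -e 's/<[^>]*>/ /g' \
        > "${id}.txt"
done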

Pretty cool, huh? I'm sure there are neater ways of doing this, but I was pleased to find a one-line solution.
