
A basic example of scraping a website

It's very common to find content online that you'd like to repurpose, but first you need to capture it in some structured way…maybe there's a Flickr album of hundreds of pictures you don't want to save manually, maybe there's a series of webpages, each of which has an article and an image you want to grab, etc.

Unfortunately, when those web properties or services don't offer an API, you need to do what's called 'scraping' the page: basically, you download the page yourself and parse it to pull out the information you need.

This gist is a small example of scraping one of my very favorite archives—The Paris Review's series of interviews. We're going to grab all the interviews and the photos of the interviewees from the 1950s.

Each step in this bash file walks through doing that, but it's hard to make sense of what's going on without the screencast, which you can find here…note that the screencast doesn't follow precisely the same process as this script.


Further Doing

  1. Try scraping the 1960s section of The Paris Review, but grab only the image files.
  2. Try using the image files you scrape in (1) to generate a web page that shows each image with a <span> caption underneath it displaying the name of the author interviewed, along with a link to the original Paris Review interview.
  3. Try programmatically scraping all the interviews, not just the 1960s (see the sketch after this list for one way to start).
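
As a starting point for (3), here's a minimal sketch that loops over the decade pages. It assumes every decade's listing follows the same URL pattern as the 1950s page, so adjust the list of decades to whatever the site actually offers:

for decade in 1950s 1960s 1970s 1980s 1990s 2000s 2010s; do
  curl "http://www.theparisreview.org/interviews/${decade}#list" > "${decade}.html"
done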

Further Resources

# First we want to download the page
curl http://www.theparisreview.org/interviews/1950s#list > 1950s.html
# Then we want to install `pup`
brew install pup
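# (Optional) confirm that pup installed correctly and is on your PATH
which pup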
# Then we want to use `pup` to parse out the hrefs for the interview links and save them to a file, links.txt
# pup also lets us grab the href attribute of the <a> tag using the attr{href} syntax; you can read more about pup's syntax in its README
# You can see how we inspected the links to choose the selector in this image: http://cl.ly/image/3S291o3n3p20
# Here we're using what's called a 'pipe' (the | character) to redirect the file into pup. You can learn more about pipes from this mediocre video: https://www.youtube.com/watch?v=jbzrz0aSgEY
cat 1950s.html | pup '.archive-interview h3 a attr{href}' > links.txt
# This gives us a file, links.txt, containing the links to the individual interviews
cat links.txt # You can see the file in your terminal using the command cat (https://en.wikipedia.org/wiki/Cat_(Unix))
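# (Optional) a quick sanity check on how many interview links we captured
wc -l links.txt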
# Note that these are relative links (you can read more about relative vs. absolute links at http://www.coffeecup.com/help/articles/absolute-vs-relative-pathslinks/)
# To transform them into absolute URLs, we can use Sublime Text's find and replace to put the base URL at the beginning of each line (http://sublime-text-unofficial-documentation.readthedocs.org/en/latest/search_and_replace/search_and_replace.html)
# which gives us `full-links.txt`
cat full-links.txt
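# If you'd rather not leave the terminal, sed can do the same prepend for us;
# this is just a sketch, and it assumes every link in links.txt starts with a leading slash
sed 's|^|http://www.theparisreview.org|' links.txt > full-links.txt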
# Now, we want to download all of the URLs listed in `full-links.txt`
# using wget, a utility we can hand a list of URLs to download (via the `-i` option)
# but we also have to tell it to be smart and add `.html` to the files it downloads
# using the `--adjust-extension` option
wget --adjust-extension -i full-links.txt
# This leaves us with a bunch of downloaded `the-art-of-fiction…` files
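# (Optional) list the HTML files we now have locally
ls *.html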
# The interview itself can be found with the `.detail-interviews` selector,
# and the image can be found with `.detail-interviews-description img`
# So now we can grab all those image links and dump them into a file imgs.txt
# We're using * as a wildcard to `cat` all the HTML files into pup
# You can see our inspector to highlight the selector we need in this image: http://cl.ly/image/0f3B1T0A3b3B
cat *.html | pup '.detail-interviews-description img attr{src}' > imgs.txt
# and we can grab all the interviews and dump them into a single file, interviews.txt
cat *.html | pup '.detail-interviews' > interviews.txt
# This leaves us in a position where we might be able to save the interviews into separate files, download our own images, and otherwise repurpose this content for our own project.
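# For example, a sketch of those next steps; it assumes the URLs in imgs.txt are absolute
# (if they are relative, prepend the base URL just as we did for links.txt) and that the
# downloaded pages match the-art-of-fiction*.html (adjust the glob to whatever wget saved)
wget -i imgs.txt                       # download every image listed in imgs.txt
for f in the-art-of-fiction*.html; do  # write each interview out to its own file
  cat "$f" | pup '.detail-interviews' > "${f%.html}-interview.txt"
done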