It's very common to find content online that you'd like to repurpose, but that you need to capture in some structured way: maybe there's a Flickr album of hundreds of pictures you don't want to save manually, or a series of web pages, each of which has an article and an image you want to grab.
Unfortunately, when those sites or services don't offer an API, you need to do what's called "scraping" the page: you download the page yourself and parse it to pull out the information you need.
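At its core, that download-and-parse step can be as small as two commands. Here's a hypothetical sketch; the URL and the `src=` pattern are placeholders, not anything taken from the script below:

```bash
# Hypothetical: fetch a page, then pull every .jpg URL out of its markup.
wget -q -O page.html "https://example.com/album"
grep -o 'src="[^"]*\.jpg"' page.html | sed 's/src="//; s/"$//' > image-urls.txt
```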
This gist is a small example of scraping one of my very favorite archives, The Paris Review's series of interviews. We're going to grab all of the interviews from the 1950s, along with the photos of the interviewees.
Each step in this bash script gets us closer to that goal, but it's hard to make sense of what's going on without the screencast, which you can find here…note that the screencast doesn't follow precisely the same process as this script.
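If you'd rather skim than watch, here's a minimal sketch of the overall flow under stated assumptions: the decade URL and the `href` pattern are guesses about the site's markup, not copied from the actual script.

```bash
#!/usr/bin/env bash
# Sketch only: the listing URL and the HTML patterns below are assumptions.
BASE="https://www.theparisreview.org"

# 1. Grab the listing page for the decade.
wget -q -O 1950s.html "$BASE/interviews/1950s"

# 2. Pull the interview links out of the listing, deduplicated.
grep -o 'href="[^"]*interviews[^"]*"' 1950s.html |
  sed 's/href="//; s/"$//' | sort -u > interview-urls.txt

# 3. Download each interview page into pages/.
mkdir -p pages
while read -r url; do
  case "$url" in
    /*) url="$BASE$url" ;;  # make site-relative links absolute
  esac
  wget -q -P pages "$url"
done < interview-urls.txt
```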
- Try scraping the 1960s section of The Paris Review, but just grabbing all of the image files.
- Try using the image files you scraped in (1) to generate a web page with each image and a caption underneath it displaying the name of the author interviewed, along with a link to the original Paris Review interview (see the sketch after this list).
- Try programmatically scraping all the interviews, not just the 1960s.
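For exercise 2, one approach is to template the HTML directly in a shell loop. This sketch assumes an `index.tsv` file of tab-separated image-URL, author, interview-URL rows; that file and its layout are invented for the example:

```bash
#!/usr/bin/env bash
# Assumes index.tsv with rows of: image-url <TAB> author <TAB> interview-url
{
  echo "<!DOCTYPE html><html><body>"
  while IFS=$'\t' read -r img author link; do
    echo "  <figure>"
    echo "    <img src=\"$img\" alt=\"$author\">"
    echo "    <figcaption><a href=\"$link\">$author</a></figcaption>"
    echo "  </figure>"
  done < index.tsv
  echo "</body></html>"
} > gallery.html
```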
- "Using CaspeJS to scrape an infinite scroll page"
- "Easy Web Scraping"
- "Screen Scraping with Node.js"
- "Scraping with Node"
- This tutorial on `grep`, a powerful command-line search tool.
- `wget`'s manual