
A basic example of scraping a website

It's very common to find content online that you'd like to repurpose, but first you need to capture it in some structured way…maybe there's a Flickr album of hundreds of pictures you don't want to save manually, maybe there's a series of webpages, each of which has an article and an image you want to grab, etc.

Unfortunately, when those web properties or services don't offer an API, you need to do what's called 'scraping' the page: basically, you download the page yourself and parse it to pull out the information you need.

This gist is a small example of scraping one of my very favorite archives—The Paris Review's series of interviews. We're going to grab all the interviews and the photos of the interviewees from the 1950s.

Each step in this bash file walks through doing that, but it's hard to make sense of what's going on without the screencast, which you can find here…note that the screencast doesn't follow precisely the same process as this script.


Further Doing

  1. Try scraping the 1960s section of The Paris Review, but grab only the image files.
  2. Try using the image files you scrape in (1) to generate a web page that shows each image with a <span> caption underneath it displaying the name of the author interviewed, along with a link to the original Paris Review interview.
  3. Try programmatically scraping all the interviews, not just the 1960s (see the sketch after this list for one way to start).
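
As a starting point for (3), here's a minimal sketch that loops over the decade pages. It assumes every decade's listing follows the same URL pattern as the 1950s page, so adjust the list of decades to whatever the site actually offers:

for decade in 1950s 1960s 1970s 1980s 1990s 2000s 2010s; do
  curl "http://www.theparisreview.org/interviews/${decade}#list" > "${decade}.html"
done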

Further Resources

# First we want to download the page
curl http://www.theparisreview.org/interviews/1950s#list > 1950s.html
# Then we want to install `pup`
brew install pup
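# (Optional) confirm that pup installed correctly and is on your PATH
which pup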
# Then we want to use `pup` to parse out the hrefs for the interview links and save them to a file, links.txt
# pup also lets us grab the href attribute of the <a> tag using the attr{href} syntax; you can read more about pup's syntax in its README
# You can see how we inspected the links to choose the selector in this image: http://cl.ly/image/3S291o3n3p20
# Here we're using what's called a 'pipe' (the | character) to redirect the file into pup. You can learn more about pipes from this mediocre video: https://www.youtube.com/watch?v=jbzrz0aSgEY
cat 1950s.html | pup '.archive-interview h3 a attr{href}' > links.txt
# This gives us a file, links.txt, containing the links to the individual interviews
cat links.txt # You can see the file in your terminal using the command cat (https://en.wikipedia.org/wiki/Cat_(Unix))
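# (Optional) a quick sanity check on how many interview links we captured
wc -l links.txt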
# Note that these are relative links (you can read more about relative vs. absolute links at http://www.coffeecup.com/help/articles/absolute-vs-relative-pathslinks/)
# To transform them into absolute URLs, we can use Sublime Text's find and replace to put the base URL at the beginning of each line (http://sublime-text-unofficial-documentation.readthedocs.org/en/latest/search_and_replace/search_and_replace.html)
# which gives us `full-links.txt`
cat full-links.txt
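# If you'd rather not leave the terminal, sed can do the same prepend for us;
# this is just a sketch, and it assumes every link in links.txt starts with a leading slash
sed 's|^|http://www.theparisreview.org|' links.txt > full-links.txt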
# Now, we want to download all of the URLs listed in `full-links.txt`
# using wget, a utility we can hand a list of URLs to download (via the `-i` option)
# but we also have to tell it to be smart and add `.html` to the files it downloads
# using the `--adjust-extension` option
wget --adjust-extension -i full-links.txt
# This leaves us with a bunch of downloaded `the-art-of-fiction…` files
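# (Optional) list the HTML files we now have locally
ls *.html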
# The interview itself can be found with the `.detail-interviews` selector,
# and the image can be found with `.detail-interviews-description img`
# So now we can grab all those image links and dump them into a file imgs.txt
# We're using * as a wildcard to `cat` all the HTML files into pup
# You can see our inspector to highlight the selector we need in this image: http://cl.ly/image/0f3B1T0A3b3B
cat *.html | pup '.detail-interviews-description img attr{src}' > imgs.txt
# and we can grab all the interviews and dump them into a single file, interviews.txt
cat *.html | pup '.detail-interviews' > interviews.txt
# This leaves us in a position where we might be able to save the interviews into separate files, download our own images, and otherwise repurpose this content for our own project.
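# For example, a sketch of those next steps; it assumes the URLs in imgs.txt are absolute
# (if they are relative, prepend the base URL just as we did for links.txt) and that the
# downloaded pages match the-art-of-fiction*.html (adjust the glob to whatever wget saved)
wget -i imgs.txt                       # download every image listed in imgs.txt
for f in the-art-of-fiction*.html; do  # write each interview out to its own file
  cat "$f" | pup '.detail-interviews' > "${f%.html}-interview.txt"
done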