bradmontgomery/ShortIntroToScraping.rst

Web Scraping Workshop

Using Requests and Beautiful Soup, with the most recent Beautiful Soup 4 docs.

Getting Started

Install our tools (preferably in a new virtualenv):
pip install beautifulsoup4
pip install requests


Start Scraping!

Let's grab the Free Book Samplers from O'Reilly: http://oreilly.com/store/samplers.html.
>>> import requests
>>>
>>> result = requests.get("http://oreilly.com/store/samplers.html")

Make sure we got a result.
>>> result.status_code
200
>>> result.headers
...
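
A status code of 200 is the usual sign of success, but the headers are worth a look too, since they tell you what kind of content came back. The sketch below assumes a hypothetical helper named `is_html` (it is not part of Requests) that accepts a response only when the request succeeded and the `Content-Type` header mentions HTML.

```python
import requests


def is_html(response):
    """Return True if the response succeeded and looks like HTML.

    NOTE: is_html is a hypothetical helper for illustration only;
    it is not part of the Requests library.
    """
    # Header lookups on a Response are case-insensitive.
    content_type = response.headers.get("Content-Type", "")
    # response.ok is False for 4xx and 5xx status codes.
    return response.ok and "html" in content_type.lower()
```

With the `result` from above, `is_html(result)` would be a quick guard before handing the content to a parser.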

Store your content in an easy-to-type variable!
>>> c = result.content

Start parsing with Beautiful Soup.  NOTE: Beautiful Soup 4 (what pip installs as beautifulsoup4) is imported from bs4. The older Beautiful Soup 3 was imported from BeautifulSoup, so you may still see that import in older examples. Naming the parser explicitly ("html.parser" ships with Python) avoids a warning in current versions.
>>> from bs4 import BeautifulSoup
>>> soup = BeautifulSoup(c, "html.parser")
>>> samples = soup.find_all("a", "item-title")
>>> samples[0]
<a class="item-title" href="http://cdn.oreilly.com/oreilly/booksamplers/9780596004927_sampler.pdf">
Programming Perl
</a>
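
The second positional argument to `find_all` is treated as a CSS class filter, so there are a few equivalent ways to write the search above. This is a small sketch on made-up markup (the HTML and the example.com URLs are assumptions that only mimic the structure of the samplers page), so it runs without a network connection.

```python
from bs4 import BeautifulSoup

# Made-up markup that mimics the structure of the samplers page.
html = """
<ul>
  <li><a class="item-title" href="http://example.com/perl_sampler.pdf">
  Programming Perl
  </a></li>
  <li><a class="item-title" href="http://example.com/python_sampler.pdf">
  Learning Python
  </a></li>
  <li><a class="nav" href="/store/">Store</a></li>
</ul>
"""

soup = BeautifulSoup(html, "html.parser")

# Three ways to ask for <a> tags whose class is "item-title".
by_position = soup.find_all("a", "item-title")
by_keyword = soup.find_all("a", class_="item-title")
by_selector = soup.select("a.item-title")

# The navigation link has a different class, so it is left out.
print(len(by_position), len(by_keyword), len(by_selector))
```

The keyword form is probably the easiest to read later; the CSS selector form is handy once the search involves nesting.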

Now, pick apart individual links.
>>> data = {}
>>> for a in samples:
...     title = a.string.strip()
...     data[title] = a.attrs['href']
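
The loop above works when every link holds a single piece of text and has an href. On messier pages, `a.string` is `None` when a tag has more than one child (a nested `<em>`, for example), and indexing a missing attribute raises `KeyError`. Here is a more defensive sketch; `collect_links` is a hypothetical helper name, and the markup is made up for illustration.

```python
from bs4 import BeautifulSoup


def collect_links(html, css_class="item-title"):
    """Map link text to href for every <a> with the given class.

    NOTE: collect_links is a hypothetical helper, not part of Beautiful Soup.
    """
    soup = BeautifulSoup(html, "html.parser")
    data = {}
    for a in soup.find_all("a", class_=css_class):
        # get_text() gathers text from nested tags too, unlike .string.
        title = a.get_text(" ", strip=True)
        # .get() returns None instead of raising when href is missing.
        href = a.get("href")
        if title and href:
            data[title] = href
    return data


# Made-up markup: one link with nested markup, one link with no href.
html = """
<a class="item-title" href="http://example.com/perl.pdf">
Programming <em>Perl</em>
</a>
<a class="item-title">No link here</a>
"""
print(collect_links(html))
```

Skipping entries without a title or href is one possible policy; logging them instead may be more useful while you are still exploring a page.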

Check out the keys/values in the data dict. Rejoice!
Now go scrape some stuff!