Skip to content

Instantly share code, notes, and snippets.

Show Gist options
  • Star 0 You must be signed in to star a gist
  • Fork 0 You must be signed in to fork a gist
  • Save wilcorrea/999dd135014127d9eb481a7c320438d0 to your computer and use it in GitHub Desktop.
Save wilcorrea/999dd135014127d9eb481a7c320438d0 to your computer and use it in GitHub Desktop.
Really short intro to scraping with Beautiful Soup and Requests

Web Scraping Workshop

Using Requests and Beautiful Soup, with the most recent Beautiful Soup 4 docs.

Getting Started

Install our tools (preferably in a new virtualenv):

pip install beautifulsoup4
pip install requests

Start Scraping!

Lets grab the Free Book Samplers from O'Reilly: http://oreilly.com/store/samplers.html.

>>> import requests
>>> 
>>> result = requests.get("http://oreilly.com/store/samplers.html")

Make sure we got a result.

>>> result.status_code
200
>>> result.headers
...

Store your content in an easy-to-type variable!

>>> c = result.content

Start parsing with Beautiful Soup. NOTE: If you installed with pip, you'll need to import from bs4. If you download the source, you'll need to import from BeautifulSoup (which is what they do in the online docs).

>>> from bs4 import BeautifulSoup
>>> soup = BeautifulSoup(c)
>>> samples = soup.find_all("a", "item-title")
>>> samples[0]
<a class="item-title" href="http://cdn.oreilly.com/oreilly/booksamplers/9780596004927_sampler.pdf">
Programming Perl
</a>

Now, pick apart individual links.

>>> data = {}
>>> for a in samples:
...     title = a.string.strip()
...     data[title] = a.attrs['href']

Check out the keys/values in the data dict. Rejoice!

Now go scrape some stuff!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment