@cosme12
Forked from bradmontgomery/ShortIntroToScraping.rst
Last active January 5, 2018 03:31
Really short intro to scraping with Beautiful Soup and Requests

Web Scraping Workshop

Using Requests and Beautiful Soup, with the most recent Beautiful Soup 4 docs.

Getting Started

Install our tools (preferably in a new virtualenv):

pip install beautifulsoup4
pip install requests

Start Scraping!

Let's grab the Free Book Samplers from O'Reilly: http://oreilly.com/store/samplers.html.

>>> import requests
>>> # Added: send a browser-like User-Agent header with the request
>>> headers = {'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_10_1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/39.0.2171.95 Safari/537.36'}
>>> 
>>> result = requests.get("http://oreilly.com/store/samplers.html", headers=headers)
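For context on the added header: by default Requests identifies itself with a User-Agent of the form python-requests/<version>, and some sites treat that differently from a browser. Whether O'Reilly does so is an assumption here, not something this workshop states. The sketch below builds the request without sending it, so it runs offline; the name `prepared` is only illustrative.

```python
import requests
from requests.utils import default_user_agent

# What Requests sends when no User-Agent is supplied, e.g. "python-requests/2.x.y".
default_ua = default_user_agent()
print(default_ua)

# Same header as in the session above.
headers = {'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_10_1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/39.0.2171.95 Safari/537.36'}

# Prepare (but do not send) the request to inspect what would go out.
prepared = requests.Request(
    "GET", "http://oreilly.com/store/samplers.html", headers=headers
).prepare()
print(prepared.headers["User-Agent"])
```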

Make sure we got a result.

>>> result.status_code
200
>>> result.headers
...
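In a script, as opposed to the interactive session, the same check can be automated. This is a minimal sketch, assuming you want failures reported rather than raised; `fetch_page`, the shortened header value, and the 10-second timeout are all hypothetical choices, not part of the workshop.

```python
import requests

# Hypothetical, shortened header value; use the full string from the session if needed.
HEADERS = {"User-Agent": "Mozilla/5.0 (workshop example)"}


def fetch_page(url, timeout=10):
    """Return the response body as bytes, or None if the request failed."""
    try:
        response = requests.get(url, headers=HEADERS, timeout=timeout)
        # Raises requests.exceptions.HTTPError for 4xx/5xx status codes.
        response.raise_for_status()
    except requests.exceptions.RequestException as exc:
        # Covers connection errors, timeouts, bad URLs, and HTTP errors.
        print(f"Request failed: {exc}")
        return None
    return response.content
```

Calling `fetch_page("http://oreilly.com/store/samplers.html")` would then give you the same bytes as `result.content`, or None if something went wrong.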

Store your content in an easy-to-type variable!

>>> c = result.content

Start parsing with Beautiful Soup. NOTE: Beautiful Soup 4, which is what pip installs as beautifulsoup4, is imported from bs4. The older Beautiful Soup 3 was imported from BeautifulSoup, so you may still see that form in older examples.

>>> from bs4 import BeautifulSoup
>>> soup = BeautifulSoup(c, "html.parser")  # name the parser explicitly
>>> samples = soup.find_all("a", "item-title")
>>> samples[0]
<a class="item-title" href="http://cdn.oreilly.com/oreilly/booksamplers/9780596004927_sampler.pdf">
Programming Perl
</a>
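To experiment with find_all without hitting the network, you can parse a small inline snippet. This is a sketch on made-up markup: only the first link mirrors the output above, and the example.com entries are hypothetical.

```python
from bs4 import BeautifulSoup

# Made-up markup shaped like the sampler links; only the first entry
# comes from the output shown above.
html = """
<div>
  <a class="item-title" href="http://cdn.oreilly.com/oreilly/booksamplers/9780596004927_sampler.pdf">
  Programming Perl
  </a>
  <a class="item-title" href="http://example.com/hypothetical_sampler.pdf">
  A Hypothetical Title
  </a>
  <a class="other" href="http://example.com/ignored">Not a sampler</a>
</div>
"""

soup = BeautifulSoup(html, "html.parser")

# A string as the second positional argument filters on CSS class,
# so these two calls select the same tags.
by_position = soup.find_all("a", "item-title")
by_keyword = soup.find_all("a", class_="item-title")

for a in by_keyword:
    print(a.get_text(strip=True), "->", a["href"])
```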

Now, pick apart individual links.

>>> data = {}
>>> for a in samples:
...     title = a.get_text(strip=True)  # also works when the link contains nested tags
...     data[title] = a["href"]

Check out the keys/values in the data dict. Rejoice!
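One way to look through the results, sketched with a stand-in dict so it runs on its own. The second entry is made up, and JSON is just one possible output format.

```python
import json

# Stand-in for the dict built in the loop above; the second entry is made up.
data = {
    "Programming Perl": "http://cdn.oreilly.com/oreilly/booksamplers/9780596004927_sampler.pdf",
    "A Hypothetical Title": "http://example.com/hypothetical_sampler.pdf",
}

# Walk the titles alphabetically and print each with its link.
for title in sorted(data):
    print(f"{title}: {data[title]}")

# Serialize the results; write this string to a file to keep them.
as_json = json.dumps(data, indent=2, sort_keys=True)
print(as_json)
```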

Now go scrape some stuff!
