Really short intro to scraping with Beautiful Soup and Requests

Web Scraping Workshop

Using Requests and Beautiful Soup, with the most recent Beautiful Soup 4 docs.

Getting Started

Install our tools (preferably in a new virtualenv):

pip install beautifulsoup4
pip install requests

Start Scraping!

Let's grab the Free Book Samplers from O'Reilly: http://oreilly.com/store/samplers.html.

>>> import requests
>>>
>>> result = requests.get("http://oreilly.com/store/samplers.html")

Make sure we got a result.

>>> result.status_code
200
>>> result.headers
...
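
If you would rather fail loudly than check status_code by hand, a small wrapper can do the check for you. This is only a sketch: fetch_html is a hypothetical helper name, and the 10 second timeout is an arbitrary assumption, not something the workshop prescribes.

```python
import requests


def fetch_html(url, timeout=10.0):
    """Return the raw response body for url, raising on HTTP errors.

    fetch_html is a hypothetical helper; the timeout default is an
    arbitrary choice for illustration.
    """
    response = requests.get(url, timeout=timeout)
    # raise_for_status() raises requests.HTTPError for 4xx/5xx responses
    # and does nothing for successful ones.
    response.raise_for_status()
    return response.content
```

Calling fetch_html with the samplers URL would give you the same bytes as result.content, or an exception if the server answers with an error status.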

Store your content in an easy-to-type variable!

>>> c = result.content
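
A note on what is in that variable: result.content is the undecoded body as bytes, while result.text is the same body decoded to str. Beautiful Soup accepts either. The sketch below uses a made-up one-line page instead of a live request, so html_bytes and html_text are stand-ins, and passing from_encoding is just one way to tell Beautiful Soup what the bytes are.

```python
from bs4 import BeautifulSoup

# Stand-in for result.content: the page as undecoded bytes (made-up markup).
html_bytes = "<p>café</p>".encode("utf-8")
# Stand-in for result.text: the same page already decoded to str.
html_text = html_bytes.decode("utf-8")

# With bytes, Beautiful Soup has to work out the encoding itself;
# from_encoding tells it which one to try first.
soup_from_bytes = BeautifulSoup(html_bytes, "html.parser", from_encoding="utf-8")
# With str, no decoding step is needed.
soup_from_text = BeautifulSoup(html_text, "html.parser")

print(soup_from_bytes.p.string)
print(soup_from_text.p.string)
```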

Start parsing with Beautiful Soup. NOTE: If you installed Beautiful Soup 4 with pip, you'll need to import from bs4. The older Beautiful Soup 3 is imported from BeautifulSoup instead, so examples written for it use that name.

>>> from bs4 import BeautifulSoup
>>> soup = BeautifulSoup(c, "html.parser")
>>> samples = soup.find_all("a", "item-title")
>>> samples[0]
<a class="item-title" href="http://cdn.oreilly.com/oreilly/booksamplers/9780596004927_sampler.pdf">
Programming Perl
</a>
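
The second positional argument to find_all is matched against the CSS class. As a rough illustration, here are three ways to ask for the same links. The HTML is a small inline stand-in for the samplers page, and the second link (class "other") is invented for contrast.

```python
from bs4 import BeautifulSoup

# Inline stand-in for the page; the "other" link is made up for contrast.
html = """
<ul>
  <li><a class="item-title" href="http://cdn.oreilly.com/oreilly/booksamplers/9780596004927_sampler.pdf">
  Programming Perl
  </a></li>
  <li><a class="other" href="/about">About</a></li>
</ul>
"""
soup = BeautifulSoup(html, "html.parser")

# A string in second position is treated as a CSS class to match.
by_position = soup.find_all("a", "item-title")
# The same filter spelled out with the class_ keyword.
by_keyword = soup.find_all("a", class_="item-title")
# The same filter written as a CSS selector.
by_selector = soup.select("a.item-title")

print(len(by_position), len(by_keyword), len(by_selector))
```

The keyword and selector forms are longer, but they make it obvious that "item-title" is a class and not, say, an id.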

Now, pick apart individual links.

>>> data = {}
>>> for a in samples:
...     title = a.string.strip()
...     data[title] = a.attrs['href']

Check out the keys/values in the data dict. Rejoice!
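
Here is a slightly more defensive version of that loop, as a sketch. It runs on inline HTML rather than the live page, and the second link (no href, nested tag) is invented to show two things that can go wrong: a.string is None when a tag has more than one child, and a.attrs['href'] raises KeyError when the attribute is missing.

```python
from bs4 import BeautifulSoup

# Inline stand-in for the page; the second link is made up to show edge cases.
html = """
<a class="item-title" href="http://cdn.oreilly.com/oreilly/booksamplers/9780596004927_sampler.pdf">
Programming Perl
</a>
<a class="item-title"><b>No</b> link here</a>
"""
soup = BeautifulSoup(html, "html.parser")

data = {}
for a in soup.find_all("a", class_="item-title"):
    # get_text() works even when the tag has nested children,
    # where a.string would be None.
    title = a.get_text(strip=True)
    # a.attrs is the dict of the tag's HTML attributes; a.get("href") is
    # like a.attrs.get("href") and returns None if the attribute is missing.
    href = a.get("href")
    if href is not None:
        data[title] = href

# Walk the keys and values collected above.
for title, url in data.items():
    print(title, "->", url)
```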

Now go scrape some stuff!

@danilodimoia

commented Apr 9, 2013

Very nice introduction, thanks!

@qhuang872

commented Dec 14, 2016

nice short intro

@whitecat

commented Dec 14, 2016

One question: what if the page loads part of its content after the initial request? How can I get the loaded content?

For example, request "https://github.com/aptana/Pydev"
and use lines = soup.findAll("span", { "class" : "num text-emphasized" }).
The problem is that the contributors count shows "fetching contributors".

@jasminecjc

commented Apr 6, 2017

Good job! It helps me

@redfast00

commented May 14, 2017

For some reason, using result.content is way slower when parsing in BeautifulSoup than using result.text. Any idea why?

@trey

commented Jun 24, 2017

Thank you, that helped me!

@ebartan

commented Jul 7, 2017

Thanks for sharing, a very useful start for soup.

@danhamill

commented Jul 27, 2017

Thanks

@Mutungi

commented Aug 10, 2017

Great introduction thanks

@michaelfangyao

commented Sep 18, 2017

Good job bro

@Renzo1

commented Oct 30, 2017

Please, what is the function of 'attrs[]' in the last line of the code above?

@kevinprakasa

commented Nov 9, 2017

Thanks, dude.

@hMutzner

commented Dec 15, 2017

Very good introduction. Thank you.
The sample site http://oreilly.com/store/samplers.html does not exist any more.

@saif017

commented Jun 8, 2018

Thanks, big bro.

@LeeJobs4Med

commented Oct 24, 2018

The link is down. You can see a previous version at https://web.archive.org/web/20130209050253/http://oreilly.com/store/samplers.html.

@davidxbuck

commented Nov 24, 2018

Thanks. It works as intended if you change to the current sampler page: https://www.oreilly.com/free/
