@bradmontgomery
Created February 21, 2012 02:00
Really short intro to scraping with Beautiful Soup and Requests

Web Scraping Workshop

Using Requests and Beautiful Soup, with the most recent Beautiful Soup 4 docs.

Getting Started

Install our tools (preferably in a new virtualenv):

pip install beautifulsoup4
pip install requests
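
After installing, a quick sanity check confirms both packages import cleanly (version numbers will vary with your install):

```python
# Verify that both libraries are importable and report their versions
import bs4
import requests

print(bs4.__version__)
print(requests.__version__)
```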

Start Scraping!

Let's grab the Free Book Samplers from O'Reilly: http://oreilly.com/store/samplers.html.

>>> import requests
>>> 
>>> result = requests.get("http://oreilly.com/store/samplers.html")

Make sure we got a result.

>>> result.status_code
200
>>> result.headers
...

Store your content in an easy-to-type variable!

>>> c = result.content

Start parsing with Beautiful Soup. NOTE: If you installed with pip, you'll need to import from bs4. If you downloaded the source, you'll need to import from BeautifulSoup (which is what the online docs do).

>>> from bs4 import BeautifulSoup
>>> soup = BeautifulSoup(c, "html.parser")
>>> samples = soup.find_all("a", "item-title")
>>> samples[0]
<a class="item-title" href="http://cdn.oreilly.com/oreilly/booksamplers/9780596004927_sampler.pdf">
Programming Perl
</a>
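
The second positional argument to find_all() is shorthand for filtering by CSS class, so only the anchors tagged item-title come back. A minimal sketch using made-up markup (not the live O'Reilly page):

```python
from bs4 import BeautifulSoup

# Hypothetical snippet standing in for the samplers page
html = """
<ul>
  <li><a class="item-title" href="http://example.com/a.pdf">Book A</a></li>
  <li><a class="item-title" href="http://example.com/b.pdf">Book B</a></li>
  <li><a class="nav-link" href="/home">Home</a></li>
</ul>
"""

soup = BeautifulSoup(html, "html.parser")

# A string as the second argument matches on class, so the
# nav-link anchor is filtered out
samples = soup.find_all("a", "item-title")
print(len(samples))       # 2
print(samples[0].string)  # Book A
```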

Now, pick apart individual links.

>>> data = {}
>>> for a in samples:
...     title = a.string.strip()
...     data[title] = a.attrs['href']
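
The loop above collapses to a dict comprehension. A self-contained sketch with hypothetical markup (the live page's titles and URLs will differ):

```python
from bs4 import BeautifulSoup

# Made-up anchors mimicking the samplers page
html = (
    '<a class="item-title" href="http://example.com/a.pdf"> Book A </a>'
    '<a class="item-title" href="http://example.com/b.pdf"> Book B </a>'
)

soup = BeautifulSoup(html, "html.parser")
samples = soup.find_all("a", "item-title")

# Same title -> href mapping as the for loop, in one expression
data = {a.string.strip(): a.attrs["href"] for a in samples}
print(data)
```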

Check out the keys/values in the data dict. Rejoice!

Now go scrape some stuff!

@hsheikha1429

Well explained.
Thank you.

@Jogwums

Jogwums commented Mar 6, 2020

Thanks, nice!
Found a way to export the dict created to a CSV file.

```python
import csv

import requests
from bs4 import BeautifulSoup

results = requests.get("https://www.oreilly.com/free/")

# Check that the request succeeded
print(results.status_code)

# View the response headers
print(results.headers)

c = results.content

# Parse the HTML with Beautiful Soup
soup = BeautifulSoup(c, "html.parser")

# Use the class name to point at the exact elements we want
samples = soup.find_all("a", "item-title")

# Loop over each link and store title -> href in an empty dict
data = {}
for a in samples:
    title = a.string.strip()
    data[title] = a.attrs['href']

print(data)

# Export to CSV; csv.writer quotes titles that contain commas
with open('books.csv', 'w', newline='') as f:
    writer = csv.writer(f)
    for key, value in data.items():
        writer.writerow([key, value])
```

@kannankumar

great short intro with just the required pieces. 👍
