Scraping Workshop by Golan
Ben Fry on the Infovis Pipeline:
http://golancourses.net/2014/wp-content/uploads/2014/01/Screen-Shot-2014-01-23-at-6.43.44-AM-620x608.png

Some Options for Scraping:
Temboo: https://temboo.com/library/
Kimono Labs: http://www.kimonolabs.com/
Beautiful Soup: http://www.crummy.com/software/BeautifulSoup/bs4/doc/
Processing (XML, JSON)

Much more information:
http://schoolofdata.org/handbook/recipes/scraping-beyond-the-basics/
http://www.pythonforbeginners.com/python-on-the-web/web-scraping-with-beautifulsoup/
http://www.gregreda.com/2013/03/03/web-scraping-101-with-python/
======================================================
>>> import requests
>>> result = requests.get("http://www.ikea.com/us/en/catalog/categories/departments/bathroom/20723/")
>>> result.status_code
200
>>> c = result.content
>>> from bs4 import BeautifulSoup
>>> soup = BeautifulSoup(c, "html.parser")
>>> samples = soup.find_all("a", "productLink")
>>> samples[0]
>>> for a in samples:
...     print(a)
...
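
The same steps as a single runnable script (a minimal sketch; the URL and the "productLink" class are taken from the session above, and the IKEA page markup may have changed since):

# Sketch of the IKEA scrape above, assuming the page still marks
# product links with class "productLink".
import requests
from bs4 import BeautifulSoup

url = "http://www.ikea.com/us/en/catalog/categories/departments/bathroom/20723/"
result = requests.get(url)
print(result.status_code)  # expect 200

soup = BeautifulSoup(result.content, "html.parser")
for link in soup.find_all("a", "productLink"):
    # Print each product link's text and href (if present).
    print(link.get_text(strip=True), link.get("href"))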
======================================================
# For Beautiful Soup, see https://gist.github.com/bradmontgomery/1872970
pip install beautifulsoup4
pip install requests
>>> import requests
>>> result = requests.get("http://shop.oreilly.com/category/new.do")
>>> result.status_code
200
>>> result.headers
>>> c = result.content
>>>
>>> from bs4 import BeautifulSoup
>>> soup = BeautifulSoup(c, "html.parser")
>>> samples = soup.find_all("div", "AuthorName")
>>> samples[0]
<div class="AuthorName">By Kevin Sitto, Marshall Presser</div>
>>>
>>> for a in samples:
...     authors = a.string.strip()
...     print(authors)
...
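
And the O'Reilly session as one script (again a sketch; the "AuthorName" div class comes from the session above, and the shop.oreilly.com layout may have changed):

# Sketch of the O'Reilly scrape above, assuming author credits are
# still wrapped in <div class="AuthorName">.
import requests
from bs4 import BeautifulSoup

url = "http://shop.oreilly.com/category/new.do"
result = requests.get(url)
print(result.status_code)  # expect 200

soup = BeautifulSoup(result.content, "html.parser")
for div in soup.find_all("div", "AuthorName"):
    # get_text() is a bit safer than .string if the div ever contains nested tags.
    authors = div.get_text(strip=True)
    print(authors)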