Created
March 8, 2015 23:19
Scraping Workshop by Golan
Ben Fry on the Infovis Pipeline:
http://golancourses.net/2014/wp-content/uploads/2014/01/Screen-Shot-2014-01-23-at-6.43.44-AM-620x608.png

Some Options for Scraping:
Temboo: https://temboo.com/library/
Kimono Labs: http://www.kimonolabs.com/
Beautiful Soup: http://www.crummy.com/software/BeautifulSoup/bs4/doc/
Processing (XML, JSON)

Much more information:
http://schoolofdata.org/handbook/recipes/scraping-beyond-the-basics/
http://www.pythonforbeginners.com/python-on-the-web/web-scraping-with-beautifulsoup/
http://www.gregreda.com/2013/03/03/web-scraping-101-with-python/
======================================================
>>> import requests
>>> result = requests.get("http://www.ikea.com/us/en/catalog/categories/departments/bathroom/20723/")
>>> result.status_code
200
>>> c = result.content
>>> from bs4 import BeautifulSoup
>>> soup = BeautifulSoup(c, "html.parser")
>>> samples = soup.find_all("a", "productLink")
>>> samples[0]
>>> for a in samples:
...     a
======================================================
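The session above stops at the matched `<a>` tags. A minimal sketch of a possible next step, pulling each link's `href` and resolving it against the page URL, might look like the following. The sample markup, the `product_links` helper, and the `example-1`/`example-2` paths are hypothetical stand-ins; the real page's structure may differ.

```python
from urllib.parse import urljoin

from bs4 import BeautifulSoup

# Hypothetical markup standing in for result.content from the session above.
SAMPLE_HTML = """
<div>
  <a class="productLink" href="/us/en/catalog/products/example-1/">First</a>
  <a class="otherLink" href="/us/en/ignored/">Ignored</a>
  <a class="productLink" href="/us/en/catalog/products/example-2/">Second</a>
</div>
"""

BASE_URL = "http://www.ikea.com/us/en/catalog/categories/departments/bathroom/20723/"


def product_links(html, base_url):
    """Return absolute URLs for every <a class="productLink"> found in html."""
    soup = BeautifulSoup(html, "html.parser")
    anchors = soup.find_all("a", class_="productLink")
    # Skip anchors without an href; resolve relative paths against the page URL.
    return [urljoin(base_url, a["href"]) for a in anchors if a.has_attr("href")]


for url in product_links(SAMPLE_HTML, BASE_URL):
    print(url)
```

With a live page, `result.content` from `requests.get(...)` would take the place of `SAMPLE_HTML`.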
// For Beautiful Soup, see https://gist.github.com/bradmontgomery/1872970
pip install beautifulsoup4
pip install requests
>>> import requests
>>> result = requests.get("http://shop.oreilly.com/category/new.do")
>>> result.status_code
200
>>> result.headers
>>> c = result.content
>>>
>>> from bs4 import BeautifulSoup
>>> soup = BeautifulSoup(c, "html.parser")
>>> samples = soup.find_all("div", "AuthorName")
>>> samples[0]
<div class="AuthorName">By Kevin Sitto, Marshall Presser</div>
>>>
>>> for a in samples:
...     authors = a.string.strip()
...     authors
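The loop above yields byline strings such as "By Kevin Sitto, Marshall Presser". As a rough sketch of one way to break such a string into individual names, the hypothetical helper `split_authors` below assumes a leading "By " and comma-separated names; bylines in other formats would need different handling.

```python
def split_authors(byline):
    """Split a byline such as 'By A, B' into a list of names.

    Assumes (hypothetically) an optional leading 'By ' and names
    separated by commas.
    """
    text = byline.strip()
    if text.lower().startswith("by "):
        text = text[3:]
    # Drop empty pieces left by stray or trailing commas.
    return [name.strip() for name in text.split(",") if name.strip()]


print(split_authors("By Kevin Sitto, Marshall Presser"))
```

Inside the loop, this could be applied as `split_authors(a.string)`.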