@kaidokert
Created August 2, 2015 16:45
Simple page scraping with html5lib
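import html5lib
import requests


def get_packages():
    """Return (link text, integer) pairs from the first two columns of the table."""
    page = 'https://python3wos.appspot.com/'
    # Parse without XHTML namespaces so plain tag names work in find() paths.
    doc = html5lib.parse(requests.get(page).content,
                         namespaceHTMLElements=False)
    table = doc.find('body/div/div/table/tbody')
    # find() returns None on a miss; an Element's truth value depends on its
    # children, so compare against None explicitly.
    if table is None:
        raise LookupError("Page didn't match expected structure: {}".format(page))
    # Keep only rows whose first cell holds a link (skips header and empty rows).
    packages = [(x.find('td[1]/a').text, int(x.find('td[2]').text))
                for x in table if x.find('td[1]/a') is not None]
    return packages


packs = get_packages()
print(packs)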
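
The path lookups in the script are ordinary ElementTree queries, so they can be tried without touching the network. Below is a minimal sketch that assumes a small, well-formed sample document (the `SAMPLE` markup and the `extract_rows` helper are hypothetical, not taken from the real site) and uses the standard-library `xml.etree.ElementTree` parser in place of html5lib. It only illustrates how the `find()` paths and the `None` checks behave.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample markup standing in for the parsed page (not the real site).
SAMPLE = """
<html><body><div><div><table><tbody>
<tr><th>Package</th><th>Count</th></tr>
<tr><td><a href="#">alpha</a></td><td>12</td></tr>
<tr><td><a href="#">beta</a></td><td>7</td></tr>
</tbody></table></div></div></body></html>
"""


def extract_rows(root):
    """Collect (link text, integer) pairs from the first two cells of each row."""
    table = root.find('body/div/div/table/tbody')
    if table is None:  # find() returns None when the path does not match
        raise LookupError('unexpected structure')
    return [(row.find('td[1]/a').text, int(row.find('td[2]').text))
            for row in table
            if row.find('td[1]/a') is not None]  # header row has no td/a, so it is skipped


rows = extract_rows(ET.fromstring(SAMPLE.strip()))
print(rows)
```

Real pages are rarely well-formed XML, which is why the script itself goes through html5lib. The sample here is kept strict only so that the standard-library parser accepts it.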