@grapeot
Created September 17, 2012 13:48
Use Python to extract info from HTML
# Extract fields from an HTML snippet with regular expressions.
# The first attempt used re.match(), which only matches at the start of the
# string and therefore returned None; re.search() scans the whole string.
import re
content = '<html><p class="name">John</p><p class="profile">http://facebook.com/john</p></html>'
# More info about re: http://docs.python.org/library/re.html
name_result = re.search(r'class="name">(\w+)<', content).group(1)
# Non-greedy (.+?) stops at the first '<' instead of running to the last one.
profile_result = re.search(r'class="profile">(.+?)<', content).group(1)
print(name_result)
print(profile_result)
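
For reference, a small sketch of how `re.match` and `re.search` behave on this input. The `pattern`, `at_start`, and `anywhere` names are only for illustration and are not part of the gist:

```python
import re

content = '<html><p class="name">John</p><p class="profile">http://facebook.com/john</p></html>'
pattern = r'class="name">(\w+)<'

# re.match only tries the pattern at position 0, where the string begins with '<html>'.
at_start = re.match(pattern, content)
# re.search scans forward until the pattern matches somewhere in the string.
anywhere = re.search(pattern, content)

print(at_start)            # None
print(anywhere.group(1))   # John
```

Calling `.group(1)` on the `None` returned by `re.match` raises an `AttributeError`, which is probably the failure the snippet ran into.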

bjou commented Sep 17, 2012

Got the last portion too -- use '.+?' or '.*?' to make the search non-greedy.

profile_result = re.match('^.*class="profile">(.+?)<', content).group(1)
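
A minimal sketch of the greedy versus non-greedy difference, reusing the `content` string from the gist. It uses `re.search` rather than `re.match` with a leading `^.*`, and the `greedy` and `lazy` names are just for illustration:

```python
import re

content = '<html><p class="name">John</p><p class="profile">http://facebook.com/john</p></html>'

# Greedy: .+ grabs as much as possible, so the group runs up to the last '<'.
greedy = re.search(r'class="profile">(.+)<', content).group(1)
# Non-greedy: .+? takes as little as possible, stopping at the first '<'.
lazy = re.search(r'class="profile">(.+?)<', content).group(1)

print(greedy)  # http://facebook.com/john</p>
print(lazy)    # http://facebook.com/john
```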


grapeot commented Sep 20, 2012

Thanks a lot @bjou!
