@grapeot
Created September 17, 2012 13:48
Use Python to extract info from HTML
# This example does not work as written (re.match returns None), but the overall approach is as follows.
# Please point out the bug if you spot it. Thanks in advance!
import re
content = '<html><p class="name">John</p><p class="profile">http://facebook.com/john</p></html>'
# More info about re at http://docs.python.org/library/re.html
name_result = re.match(r'class="name">(\w+)<', content).group(1)
profile_result = re.match(r'class="profile">(.+)<', content).group(1)
print(name_result)
print(profile_result)

bjou commented Sep 17, 2012

Thanks again, Yan. To get it to work, the pattern also has to consume the beginning of the string, because re.match only matches at the start of the string, e.g.

name_result = re.match(r'^.*class="name">(\w+)<', content).group(1)
profile_result = re.match(r'^.*class="profile">(.+)<', content).group(1)
# The last line also captures the closing tag </p>; not sure how to exclude that with re.match.
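
As a possible alternative to prefixing the pattern with '^.*', here is a minimal sketch that uses re.search, which scans the whole string for the first position where the pattern matches. It reuses the content string from the gist; the name_match variable and the None check are additions for illustration and are not part of the original code.

```python
import re

content = '<html><p class="name">John</p><p class="profile">http://facebook.com/john</p></html>'

# re.match only tries the pattern at position 0, so it finds nothing here.
anchored = re.match(r'class="name">(\w+)<', content)
print(anchored)  # None

# re.search scans forward and returns the first match anywhere in the string.
name_match = re.search(r'class="name">(\w+)<', content)
# Guard against None so a failed search does not raise AttributeError.
name_result = name_match.group(1) if name_match else None
print(name_result)  # John
```

Checking for None before calling .group(1) avoids the AttributeError that the original snippet raises when the pattern does not match.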


bjou commented Sep 17, 2012

Got the last part too: use '.+?' or '.*?' to make the match non-greedy, so it stops at the first '<' instead of the last one.

profile_result = re.match(r'^.*class="profile">(.+?)<', content).group(1)
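
To make the greedy versus non-greedy difference concrete, the following sketch runs both quantifiers on the content string from the gist. It uses re.search rather than re.match, and the negated character class [^<]+ is a third variant that is not mentioned in the thread; the variable names greedy, lazy, and no_angle are hypothetical.

```python
import re

content = '<html><p class="name">John</p><p class="profile">http://facebook.com/john</p></html>'

# Greedy: .+ grabs as much as possible, then backs off to the LAST '<' in the string.
greedy = re.search(r'class="profile">(.+)<', content).group(1)

# Non-greedy: .+? grabs as little as possible, stopping at the FIRST '<'.
lazy = re.search(r'class="profile">(.+?)<', content).group(1)

# Negated class: [^<]+ can never cross a '<', so it also stops at the first one.
no_angle = re.search(r'class="profile">([^<]+)<', content).group(1)

print(greedy)    # http://facebook.com/john</p>
print(lazy)      # http://facebook.com/john
print(no_angle)  # http://facebook.com/john
```

For anything beyond a small, fixed snippet like this one, an HTML parser is usually a safer choice than regular expressions.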


grapeot commented Sep 20, 2012

Thanks a lot @bjou!
