Created September 17, 2012 13:48
Use Python to extract info from HTML
# Not sure why this example doesn't work, but the general framework looks like this.
# Feel free to point out the bug. Thanks in advance!
import re
content = '<html><p class="name">John</p><p class="profile">http://facebook.com/john</p></html>'
# More info about re at http://docs.python.org/library/re.html
name_result = re.match(r'class="name">(\w+)<', content).group(1)
profile_result = re.match(r'class="profile">(.+)<', content).group(1)
print(name_result)
print(profile_result)
Got the last portion working too: use '.+?' or '.*?' to make the match non-greedy.
profile_result = re.match('^.*class="profile">(.+?)<', content).group(1)
Thanks a lot @bjou!
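To see why the non-greedy quantifier matters here, the following is a minimal sketch, assuming Python 3 and the same `content` string as the gist. It uses `re.search` rather than `re.match` purely to keep the patterns short; the variable names `greedy` and `lazy` are made up for illustration.

```python
import re

content = '<html><p class="name">John</p><p class="profile">http://facebook.com/john</p></html>'

# Greedy: .+ grabs as much as it can and backs off only to the LAST '<'
# in the string, so the capture still contains the closing </p> tag.
greedy = re.search(r'class="profile">(.+)<', content).group(1)

# Non-greedy: .+? grabs as little as it can, so the capture ends at the
# FIRST '<' after the profile URL.
lazy = re.search(r'class="profile">(.+?)<', content).group(1)

print(greedy)
print(lazy)
```

The greedy capture comes out as `http://facebook.com/john</p>`, while the non-greedy one is just `http://facebook.com/john`. The name pattern did not show this problem because `\w+` cannot match `<` in the first place.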
Thanks again, Yan. To get it to work, you also need to skip over the beginning of the string, because re.match only matches at the start of the string, e.g.
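The commenter's own example is not preserved here. As a stand-in, this is a hedged sketch (Python 3, same `content` string as the gist) of two ways to deal with the anchoring: prefix the pattern with `.*` so `re.match` can consume the leading text, or switch to `re.search`, which scans the whole string. The variable names are hypothetical.

```python
import re

content = '<html><p class="name">John</p><p class="profile">http://facebook.com/john</p></html>'

# re.match only tries position 0. The string starts with '<html>', not
# 'class=', so this returns None, and calling .group(1) on it would raise
# AttributeError.
anchored = re.match(r'class="name">(\w+)<', content)

# Option 1: keep re.match and let a leading .* consume everything before
# the part we care about.
name_via_match = re.match(r'.*class="name">(\w+)<', content).group(1)

# Option 2: use re.search, which looks for the first match anywhere in
# the string.
name_via_search = re.search(r'class="name">(\w+)<', content).group(1)

print(anchored)
print(name_via_match)
print(name_via_search)
```

Both options give `John`. `re.search` is usually the simpler choice when the pattern is not meant to be tied to the start of the string.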