@rossmounce
Created May 28, 2014 09:37
python regex
I know I'm doing all types of wrong here:
Source HTML file here: http://mdpi.com/1420-3049/19/4/5150/htm
I want the text for the dc.source:
Molecules 2014, Vol. 19, Pages 5150-5162
I'm using Beautiful Soup, so it's probably best to do it with that, BUT it should also be regex-able. I can do this in bash no problem!
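import re

originalurl = 'http://mdpi.com/1420-3049/19/4/5150/htm'
hand = open('1420-3049.19.4.5150.htm')
for ling in hand:
    ling = ling.rstrip()
    if re.search('name="dc.source"', ling):
        # strip() only trims characters at the ends of the string
        bibinfo = ling.strip('\<').strip('>')
        print(bibinfo + " " + originalurl)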
output:
<meta name="dc.source" content="Molecules 2014, Vol. 19, Pages 5150-5162" http://mdpi.com/1420-3049/19/4/5150/htm
#NotWhatIWanted / nor expected
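A minimal regex sketch of what the snippet seems to be after: capture only the value of the content attribute with a group, rather than trimming characters off the ends of the line. It assumes the tag sits on one line with name before content and double-quoted attributes, as in the output shown. The helper name extract_dc_source and the closing " />" on the sample line are assumptions for illustration, not taken from the real file.

```python
import re

# Assumes name="dc.source" comes before content="..." inside a single tag.
# extract_dc_source is a hypothetical helper, not part of any library.
DC_SOURCE = re.compile(r'<meta\s+name="dc\.source"\s+content="([^"]*)"')

def extract_dc_source(html):
    """Return the dc.source content string, or None if no tag matches."""
    match = DC_SOURCE.search(html)
    return match.group(1) if match else None

# Sample line reconstructed from the output; the tag ending is assumed.
line = '<meta name="dc.source" content="Molecules 2014, Vol. 19, Pages 5150-5162" />'
print(extract_dc_source(line))  # Molecules 2014, Vol. 19, Pages 5150-5162
```

The pattern is deliberately strict: if the attributes appear in a different order or use single quotes, it returns None instead of a wrong value.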

rsnape commented May 28, 2014

Oops - didn't notice you'd already put the beautiful soup version up. A better way to skin this particular cat :)
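For comparison, a parser-based sketch of the same lookup. It uses the standard library's html.parser as a stand-in for Beautiful Soup, purely to keep the example dependency-free, so it is not the Beautiful Soup version mentioned here. The names DcSourceParser and find_dc_source are illustrative, and the sample markup is made up to show that attribute order does not matter.

```python
from html.parser import HTMLParser

class DcSourceParser(HTMLParser):
    """Collect the content attribute of the first <meta name="dc.source">."""

    def __init__(self):
        super().__init__()
        self.dc_source = None

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs, so attribute order is irrelevant.
        if tag == "meta" and self.dc_source is None:
            attributes = dict(attrs)
            if attributes.get("name") == "dc.source":
                self.dc_source = attributes.get("content")

def find_dc_source(html):
    """Return the dc.source content string, or None if the tag is absent."""
    parser = DcSourceParser()
    parser.feed(html)
    return parser.dc_source

# Hypothetical markup with content placed before name.
sample = '<head><meta content="Molecules 2014, Vol. 19, Pages 5150-5162" name="dc.source" /></head>'
print(find_dc_source(sample))  # Molecules 2014, Vol. 19, Pages 5150-5162
```

A parser copes with reordered attributes and either quote style, which is where a line-based regex tends to break.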
