I know I'm doing all types of wrong here:
Source HTML file here: http://mdpi.com/1420-3049/19/4/5150/htm
I want the text for the dc.source:
Molecules 2014, Vol. 19, Pages 5150-5162
I'm using Beautiful Soup, so it's probably best to do it in that, BUT it should also be regex-able. I can do this in bash no problem!
hand = open('1420-3049.19.4.5150.htm')
for ling in hand:
    ling = ling.rstrip()
    if re.search('name="dc.source"', ling):
        bibinfo = ling.strip('\<').strip('>')
        print bibinfo+" "+originalurl
Output:
<meta name="dc.source" content="Molecules 2014, Vol. 19, Pages 5150-5162" http://mdpi.com/1420-3049/19/4/5150/htm
#NotWhatIWanted / nor expected
Ah, okay. Here's the elegant Beautiful Soup solution (I hadn't used that syntax before):
desc = soup.findAll(attrs={"name":"dc.source"})
print desc[0]['content'].encode('utf-8')
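For anyone without Beautiful Soup installed, the same attribute lookup can be sketched with the standard library's `html.parser` instead. This is a swapped-in technique, not the code above: it targets Python 3, and `MetaContentParser` and `find_meta_content` are hypothetical names made up for illustration.

```python
from html.parser import HTMLParser


class MetaContentParser(HTMLParser):
    """Collects the content attribute of <meta> tags with a given name."""

    def __init__(self, meta_name):
        HTMLParser.__init__(self)
        self.meta_name = meta_name
        self.contents = []

    def handle_starttag(self, tag, attrs):
        # HTMLParser lowercases tag and attribute names before this call,
        # so <META NAME=...> is handled the same as <meta name=...>.
        attrs = dict(attrs)
        if tag == "meta" and attrs.get("name") == self.meta_name:
            self.contents.append(attrs.get("content"))


def find_meta_content(html, meta_name="dc.source"):
    """Return the content values of all matching <meta> tags, in order."""
    parser = MetaContentParser(meta_name)
    parser.feed(html)
    return parser.contents
```

Like `findAll`, this returns a list, so the first match is `find_meta_content(html)[0]`.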
For the record (Hacky McHack), this should get the string you want in bibinfo. I think you might also want to pass the re.I flag to re.search, as HTML is supposed to be case-insensitive in tag and attribute names.
for ling in hand:
    match = re.search('<.*meta.*dc\.source.*content\=[\"\'](.*)[\"\']', ling)
    if match:
        print ling, match.group(1)
        bibinfo = match.group(1)
Output with that test file:
<meta name="dc.source" content="Molecules 2014, Vol. 19, Pages 5150-5162">
Molecules 2014, Vol. 19, Pages 5150-5162
>>> bibinfo
'Molecules 2014, Vol. 19, Pages 5150-5162'
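The case-insensitivity point can be sketched as follows. This is a minimal variant rather than the exact pattern above: it assumes the `name` attribute comes before `content` within the tag, and `DC_SOURCE` and `extract_dc_source` are hypothetical names.

```python
import re

# re.I makes the match case-insensitive, so <META NAME=...> also matches.
# The backreference \1 requires the closing quote to match the opening one.
DC_SOURCE = re.compile(
    r'<meta[^>]*name=["\']dc\.source["\'][^>]*content=(["\'])(.*?)\1',
    re.I,
)


def extract_dc_source(line):
    """Return the dc.source content value found in line, or None."""
    match = DC_SOURCE.search(line)
    return match.group(2) if match else None
```

Using `[^>]*` rather than `.*` keeps the match inside a single tag, which matters if several tags share one line.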
Oops - didn't notice you'd already put the beautiful soup version up. A better way to skin this particular cat :)
That does not surprise me. (I'm running all this in an IPython Notebook cell, btw.)
It must be something within
hand = open('1420-3049.19.4.5150.htm')
for ling in hand:
    ling = ling.rstrip()
    if re.search('name="dc.source"', ling):
but I'll give up on this quirk for the moment and go back to implementing it properly with Beautiful Soup.
Thanks!
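One possible explanation for the quirk, assuming the `<meta>` line is indented in the downloaded file (the thread never confirms this): `rstrip()` removes only trailing whitespace, and `strip('<')` removes characters only from the very ends of the string, so leading spaces would shield the `<` while the trailing `>` still gets stripped. The sample line below is hypothetical.

```python
# Hypothetical indented line, as it might appear in the HTML file.
line = '    <meta name="dc.source" content="Molecules 2014, Vol. 19, Pages 5150-5162">\n'

# What the original loop does: trailing whitespace goes, leading stays.
trailing_only = line.rstrip()
as_in_question = trailing_only.strip('<').strip('>')
# The leading spaces mean '<' is not at the start, so only '>' is removed.

# Stripping whitespace on both sides first exposes '<' and '>' at the ends.
both_ends = line.strip().strip('<>')
```

Even with the whitespace handled, `strip` only trims the angle brackets; it would still leave the whole tag body rather than the `content` value, which is why parsing the attribute is the more reliable route.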