@rossmounce
Created May 28, 2014 09:37
python regex
I know I'm doing all types of wrong here:
Source HTML file here: http://mdpi.com/1420-3049/19/4/5150/htm
I want the text for the dc.source:
Molecules 2014, Vol. 19, Pages 5150-5162
I'm using Beautiful Soup, so it's probably best to do it with that, but it should also be regex-able. I can do this in bash no problem!
hand = open('1420-3049.19.4.5150.htm')
for ling in hand:
    ling = ling.rstrip()
    if re.search('name="dc.source"', ling):
        bibinfo = ling.strip('\<').strip('>')
        print bibinfo+" "+originalurl
output:
<meta name="dc.source" content="Molecules 2014, Vol. 19, Pages 5150-5162" http://mdpi.com/1420-3049/19/4/5150/htm
#NotWhatIWanted / nor expected
@rossmounce
Author

.strip('>') behaves as expected, .strip('<') doesn't appear to be behaving as I think it should. But then it's probably something else higher up?

@rsnape

rsnape commented May 28, 2014

This works in my Python console as a minimal example. Try that; if it doesn't work, I'm confused.

>>> st = '<meta name="dc.source" content="Molecules 2014, Vol. 19, Pages 5150-5162">'
>>> st.strip("<").strip(">")
'meta name="dc.source" content="Molecules 2014, Vol. 19, Pages 5150-5162"'
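
One possible explanation for the difference, not confirmed anywhere in this thread, is that the line read from the file starts with indentation. `rstrip()` only removes trailing whitespace, so the first character would be a space rather than `<`, and `strip('<')` would have nothing to remove on the left. The sample line below is hypothetical (the real file's indentation is an assumption), and the sketch is written for Python 3:

```python
# Hypothetical line: assumes the HTML source indents the <meta> tag.
line = '    <meta name="dc.source" content="Molecules 2014, Vol. 19, Pages 5150-5162">\n'

# Same steps as the loop in the question: only trailing whitespace is removed.
stripped_right = line.rstrip()
# The string still starts with a space, so "<" is never at the left edge.
naive = stripped_right.strip("<").strip(">")

# Stripping whitespace on both sides first exposes "<" and ">" at the edges.
cleaned = line.strip().strip("<>")

print(repr(naive))
print(repr(cleaned))
```

If the real file has no leading whitespace on that line, this is not the cause and something else is going on.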

@rossmounce
Author

That does not surprise me. (I'm running all this in an IPython Notebook cell btw)

It must be something within

hand = open('1420-3049.19.4.5150.htm')
for ling in hand:
    ling = ling.rstrip()
    if re.search('name="dc.source"', ling):

but I'll give up on this quirk for the moment and go back to implementing it properly with Beautiful Soup.

Thanks!

@rossmounce
Author

ah, okay. Here's the elegant beautiful soup solution (hadn't used that syntax before):

desc = soup.findAll(attrs={"name":"dc.source"})
print desc[0]['content'].encode('utf-8')
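
For comparison, the same lookup can be sketched without Beautiful Soup, using only the standard library's `html.parser` (Python 3). This is a rough equivalent, not the approach used in this thread; the class name `MetaContentParser` and the inline `sample_html` string are made up for illustration:

```python
from html.parser import HTMLParser


class MetaContentParser(HTMLParser):
    """Collects the content attribute of <meta> tags with a given name."""

    def __init__(self, wanted_name):
        super().__init__()
        self.wanted_name = wanted_name.lower()
        self.contents = []

    def handle_starttag(self, tag, attrs):
        # HTMLParser lowercases tag and attribute names for us.
        if tag != "meta":
            return
        attr_map = dict(attrs)
        name = attr_map.get("name") or ""
        if name.lower() == self.wanted_name:
            self.contents.append(attr_map.get("content"))


# Hypothetical stand-in for the downloaded page.
sample_html = (
    '<html><head>'
    '<meta name="dc.source" content="Molecules 2014, Vol. 19, Pages 5150-5162">'
    '</head></html>'
)

parser = MetaContentParser("dc.source")
parser.feed(sample_html)
print(parser.contents[0])
```

A real parser copes with attributes in any order and with tags split across lines, which the line-by-line approaches do not.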

@rsnape

rsnape commented May 28, 2014

For the record (Hacky McHack) - this should get the string you want in bibinfo. I think you might also want to pass the re.I flag to re.search, as HTML is supposed to be case insensitive in tag and attribute names.

for ling in hand:
    match = re.search('<.*meta.*dc\.source.*content\=[\"\'](.*)[\"\']',ling)
    if match:
        print ling, match.group(1)
        bibinfo = match.group(1)

Output with that test file:
<meta name="dc.source" content="Molecules 2014, Vol. 19, Pages 5150-5162">
Molecules 2014, Vol. 19, Pages 5150-5162
>>> bibinfo
'Molecules 2014, Vol. 19, Pages 5150-5162'
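
A slightly stricter variant of this regex, as a sketch in Python 3: it passes `re.IGNORECASE` (the long form of `re.I`), uses a non-greedy capture, and requires the closing quote to match the opening one. It assumes `name` appears before `content` inside the tag, which holds for the line shown here but is not guaranteed in general. The names `META_SOURCE` and `extract_source` are made up for illustration:

```python
import re

# \1 and \2 are backreferences: each closing quote must match its opening quote.
META_SOURCE = re.compile(
    r'<meta\b[^>]*\bname=(["\'])dc\.source\1[^>]*\bcontent=(["\'])(.*?)\2',
    re.IGNORECASE,
)


def extract_source(line):
    """Return the dc.source content value from a line, or None if absent."""
    match = META_SOURCE.search(line)
    return match.group(3) if match else None


line = '<META NAME="dc.source" CONTENT="Molecules 2014, Vol. 19, Pages 5150-5162">'
print(extract_source(line))
```

The `[^>]*` pieces keep the match inside a single tag, so other quoted attributes later on the same line are not swallowed by the capture.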

@rsnape

rsnape commented May 28, 2014

Oops - didn't notice you'd already put the beautiful soup version up. A better way to skin this particular cat :)
