@rmax
Created January 7, 2010 22:37
import lxml.html
import re
src = """
lksjdfls <div id ='amazon_345343'> kdjff lsdfs </div> sdjfls <div id
= "amazon_35343433">sdfsd</div><div id='amazon_8898'>welcome</div>
hello, my age is 86 years old and I was born in 1945. Do you know
that
PI is roughly 3.1443534534534534534 """
regex = re.compile(r'amazon_(\d+)')
doc = lxml.html.document_fromstring(src)
for div in doc.xpath('//div[starts-with(@id, "amazon_")]'):
    # Keep only the numeric part that follows the "amazon_" prefix.
    match = regex.match(div.get('id'))
    if match:
        print(match.group(1))
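
For environments where lxml is not installed, the same idea can be sketched with only the standard library. This is a minimal sketch, not part of the original snippet: the names `AmazonIdCollector` and `extract_amazon_ids` are hypothetical, and it assumes the goal is the same as above, namely collecting the digits after `amazon_` from the `id` attribute of every `div`.

```python
import re
from html.parser import HTMLParser

# Same pattern as the snippet above: "amazon_" followed by digits.
ID_PATTERN = re.compile(r'amazon_(\d+)')


class AmazonIdCollector(HTMLParser):
    """Hypothetical helper: collects numeric suffixes of div ids like amazon_123."""

    def __init__(self):
        super().__init__()
        self.ids = []

    def handle_starttag(self, tag, attrs):
        # Only <div> elements are of interest, mirroring the //div XPath above.
        if tag != 'div':
            return
        for name, value in attrs:
            if name == 'id' and value:
                match = ID_PATTERN.match(value)
                if match:
                    self.ids.append(match.group(1))


def extract_amazon_ids(html):
    """Return the numeric id suffixes in document order (hypothetical helper)."""
    collector = AmazonIdCollector()
    collector.feed(html)
    collector.close()
    return collector.ids
```

Calling `extract_amazon_ids(src)` on the string defined above should yield the same ids that the lxml loop prints, but as a list rather than printed output. The lxml version is probably still the better choice when the markup is badly broken, since lxml repairs the tree before the XPath query runs.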