@rosstex
Last active June 16, 2023 18:59
deterministic xpath generation of BeautifulSoup elements for web crawling
import html

# Attribute values containing these characters cannot be embedded safely in the
# single-quoted XPath literals built below, so such attributes are skipped.
BAD_CHARS = set(["\"", "'", "[", "]"])


# Generates an XPath string for a given BeautifulSoup element.
def soup_xpath_gen(element):
    xpath = ""
    # Walk up the tree; the BeautifulSoup root object is named "[document]".
    while element is not None and element.name != '[document]':
        if element.name == 'html':
            xpath = "/html/" + xpath
            break  # <html> is the top-level tag, so stop here
        predicates = []
        for k, v in element.attrs.items():
            # Multi-valued attributes such as class come back as lists.
            if isinstance(v, list):
                v = " ".join(v)
            if k == "title" or any(char in BAD_CHARS for char in v):
                continue
            if "/" in v:  # URL matching is wonky, so only require a non-empty attribute
                predicates.append("normalize-space(@%s)" % k)
            else:
                # html.escape only alters &, < and > here (quotes are filtered above).
                predicates.append("normalize-space(@%s)=normalize-space('%s')" % (k, html.escape(v)))
        el_xpath = str(element.name)
        if predicates:  # avoid emitting an empty "[]", which is not valid XPath
            el_xpath += "[" + " and ".join(predicates) + "]"
        xpath = el_xpath + "/" + xpath
        element = element.parent
    return xpath.rstrip("/")
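
The generated predicates compare attributes through normalize-space(...), which makes the match insensitive to stray whitespace in the page's markup. As a rough illustration only, the sketch below approximates XPath 1.0's normalize-space() in plain Python; the helper name `normalize_space` and the sample values are assumptions for the example and are not part of the gist.

```python
import re

# XPath 1.0 treats space, tab, carriage return and newline as whitespace.
_XPATH_WS = " \t\r\n"


def normalize_space(value: str) -> str:
    """Approximate XPath normalize-space(): collapse whitespace runs, then trim."""
    return re.sub(r"[ \t\r\n]+", " ", value).strip(_XPATH_WS)


# BeautifulSoup returns multi-valued attributes such as class as a list,
# which the generator joins with single spaces before building the literal.
class_tokens = ["btn", "btn-primary"]      # hypothetical parsed class value
generated_literal = " ".join(class_tokens)

# The raw attribute in the live DOM may carry extra whitespace.
raw_attribute = "  btn \n  btn-primary "   # hypothetical markup value

print(normalize_space(raw_attribute) == normalize_space(generated_literal))
```

Because both sides are normalized, a class attribute written with irregular spacing should still satisfy the generated predicate.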

rosstex commented Feb 5, 2021

This builds the XPath from element names and attributes rather than positions, so it should still work if the DOM is updated after the page content is pulled but before the element is accessed (useful for web crawling). Sorry if it's not clean!
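
To make that concrete, here is a minimal sketch using only the standard library's xml.etree.ElementTree, which supports a small XPath subset. The two documents and both paths are made up for illustration; they are simplified stand-ins, not output of the gist's function (ElementTree does not support normalize-space or `and`).

```python
import xml.etree.ElementTree as ET

# Page as first fetched.
before = ET.fromstring(
    "<html><body>"
    "<div class='nav'>menu</div>"
    "<div class='content'>article</div>"
    "</body></html>"
)

# Same page after a script has injected a banner at the top.
after = ET.fromstring(
    "<html><body>"
    "<div class='banner'>ad</div>"
    "<div class='nav'>menu</div>"
    "<div class='content'>article</div>"
    "</body></html>"
)

positional = "./body/div[2]"                  # depends on sibling order
by_attribute = "./body/div[@class='content']" # depends only on the attribute


def text_at(root, path):
    """Return the text of the first match for path, or None if nothing matches."""
    node = root.find(path)
    return None if node is None else node.text


for label, root in (("before", before), ("after", after)):
    print(label, text_at(root, positional), text_at(root, by_attribute))
```

The positional path lands on a different element once the banner is inserted, while the attribute-based path keeps resolving to the same one.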

akinolawilson commented Mar 16, 2021

How useful, thank you for sharing!

One suggestion: add this as line 1:
import html
Otherwise, this call will not work:
html.escape(v)

rosstex commented Mar 16, 2021

Good catch, thanks! Fixed.

joespr commented Jun 16, 2023

Thank you for this code.

Question: what should I do when I get this error?
AttributeError: 'WebElement' object has no attribute 'name'