kaleidoescape / capture_tags_regex.py
Last active March 19, 2021 13:57
A regex to capture XML tags from a string. Useful for those times when accessing the DOM or using a full XML parser is not feasible (e.g. dirty scraped web data).
import re
#capture a tag that opens with < followed by one or more ASCII letters or
#a /, then anything except <, any number of times, non-greedy, up to the
#closing >, plus any whitespace around the tag
STANDALONE_TAG_REGEX = re.compile(r'(\s*<(?:[A-Za-z]+|/)[^<]*?>\s*)')
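#hypothetical helper (not part of the original gist): a minimal sketch of
#how the pattern above might be applied, assuming plain str input; it
#returns the captured tags in order, with surrounding whitespace stripped
def find_standalone_tags(text, tag_regex):
    return [tag.strip() for tag in tag_regex.findall(text)]
#e.g. find_standalone_tags('<b>hi</b> there', STANDALONE_TAG_REGEX)
#would return ['<b>', '</b>']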
#a dictionary where keys are strings with fake xml tags in them,
#and values are the ordered list of tags that the regex should capture
REGEX_TESTS = {
#wrapped