@dg
Last active January 10, 2024 08:30
Regular expression for parsing HTML
~
(?(DEFINE)
    (?<entity>
        &
        (
            [a-z][a-z0-9]+ # named entity
            |
            \#\d+ # decimal number
            |
            \#x[0-9a-f]+ # hexadecimal number
        )
        ;
    )
    (?<attribute>
        \s+ # at least one whitespace character before the attribute
        [^\s"'<>=`/]+ # attribute name
        (
            \s*=\s* # equals sign before the value
            (
                " # value enclosed in double quotes
                    (
                        [^"] # any character except double quote
                        |
                        (?&entity) # or HTML entity
                    )*
                "
                |
                ' # value enclosed in single quotes
                    (
                        [^'] # any character except single quote
                        |
                        (?&entity) # or HTML entity
                    )*
                '
                |
                [^\s"'<>=`]+ # value without quotes
            )
        )? # value is optional
    )
    (?<void_element>
        < # start of tag
        ( # element name
            img|hr|br|input|meta|area|embed|keygen|source|base|col
            |link|param|basefont|frame|isindex|wbr|command|track
        )
        (?&attribute)* # optional attributes
        \s*
        /? # optional /
        > # end of tag
    )
    (?<special_element>
        < # start tag
        (?<special_element_name>
            script|style|textarea|title # element name
        )
        (?&attribute)* # optional attributes
        \s*
        > # end of start tag
        (?> # atomic group
            .*? # smallest possible number of any characters including new lines
            </ # end tag
            (?P=special_element_name)
        )
        \s*
        > # end of end tag
    )
    (?<element>
        < # start tag
        (?<element_name>
            [a-z][^\s/>]* # element name
        )
        (?&attribute)* # optional attributes
        \s*
        > # end of start tag
        (?&content)*
        </ # end tag
        (?P=element_name)
        \s*
        > # end of end tag
    )
    (?<comment>
        <!--
        (?> # atomic group
            .*? # smallest possible number of any characters including new lines
            -->
        )
    )
    (?<doctype>
        <!doctype
        \s
        [^>]* # any characters except '>'
        >
    )
)
\s*
(?&doctype)? # optional doctype
(?<content>
    (?&void_element) # void element
    |
    (?&special_element) # special element
    |
    (?&element) # paired element
    |
    (?&comment) # comment
    |
    (?&entity) # entity
    |
    [^<] # character
)*
~xis
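
The pattern above depends on PCRE features: the (?(DEFINE)...) block and (?&name) subroutine calls. Python's standard re module supports neither, so the full pattern cannot be run there as written. The following is only a minimal sketch of how the two non-recursive building blocks, entity and attribute, could be ported to Python in order to try them in isolation. The helper names is_entity and is_attribute are hypothetical, and the (?&entity) call is replaced by plain string interpolation, which is an assumption of this sketch and not part of the original pattern.

```python
import re

# Flags mirroring the original `xis` modifiers:
# x = extended (verbose) mode, i = case-insensitive, s = dot matches newlines.
FLAGS = re.VERBOSE | re.IGNORECASE | re.DOTALL

# Port of the `entity` subpattern.
ENTITY = r"""
    &
    (?:
        [a-z][a-z0-9]+   # named entity
        |
        \#\d+            # decimal number
        |
        \#x[0-9a-f]+     # hexadecimal number
    )
    ;
"""

# Port of the `attribute` subpattern. Python's `re` has no (?&entity)
# subroutine call, so the entity source text is interpolated instead.
ATTRIBUTE = rf"""
    \s+                  # at least one whitespace character before the attribute
    [^\s"'<>=`/]+        # attribute name
    (?:
        \s*=\s*          # equals sign before the value
        (?:
            " (?: [^"] | {ENTITY} )* "   # value enclosed in double quotes
            |
            ' (?: [^'] | {ENTITY} )* '   # value enclosed in single quotes
            |
            [^\s"'<>=`]+                 # value without quotes
        )
    )?                   # value is optional
"""

ENTITY_RE = re.compile(ENTITY, FLAGS)
ATTRIBUTE_RE = re.compile(ATTRIBUTE, FLAGS)


def is_entity(text: str) -> bool:
    """Return True if the whole string is a single HTML entity (hypothetical helper)."""
    return ENTITY_RE.fullmatch(text) is not None


def is_attribute(text: str) -> bool:
    """Return True if the whole string is one attribute, including its leading whitespace."""
    return ATTRIBUTE_RE.fullmatch(text) is not None
```

Note that the attribute subpattern requires leading whitespace, so is_attribute(' href="x"') matches while is_attribute('href="x"') does not. The recursive parts (element and content) would need an engine with subroutine support, such as PCRE itself.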