Forked from gruber/Liberal Regex Pattern for Web URLs
Last active May 4, 2022 05:44
Updated @gruber's regex with a modified version that looks for 2-13 letters rather than trying to look for specific TLDs, and many other improvements. (UPDATE 2018-07-30: Support for IPv4 addresses, bare hostnames, naked domains, xn-- internationalized domains, and more... see comments for BREAKING CHANGE.)
# Single-line version:
(?i)\b(https?:\/{1,3})?((?:(?:[\w.\-]+\.(?:[a-z]{2,13})|(?<=http:\/\/|https:\/\/)[\w.\-]+)\/)(?:[^\s()<>{}\[\]]+|\([^\s()]*?\([^\s()]+\)[^\s()]*?\)|\([^\s]+?\))+(?:\([^\s()]*?\([^\s()]+\)[^\s()]*?\)|\([^\s]+?\)|[^\s`!()\[\]{};:'\".,<>?«»“”‘’])|(?:(?<!@)(?:\w+(?:[.\-]+\w+)*\.(?:[a-z]{2,13})|(?:(?:[0-9](?!\d)|[1-9][0-9](?!\d)|1[0-9]{2}(?!\d)|2[0-4][0-9](?!\d)|25[0-5](?!\d))[.]?){4})\b\/?(?!@)(?:[^\s()<>{}\[\]]+|\([^\s()]*?\([^\s()]+\)[^\s()]*?\)|\([^\s]+?\))*(?:\([^\s()]*?\([^\s()]+\)[^\s()]*?\)|\([^\s]+?\)|[^\s`!()\[\]{};:'\".,<>?«»“”‘’])?))
# Commented multi-line version:
(?xi)
\b
(https?:\/{1,3})?  # Capture $1: (optional) URL scheme, colon, and slashes
(  # Capture $2: Entire matched URL (other than optional protocol://)
  (?:
    (?:
      [\w.\-]+\.  # looks like domain name
      (?:[a-z]{2,13})  # ending in 2-13 letters (TLD-like, not a fixed list of gTLDs)
      |
      (?<=http:\/\/|https:\/\/)[\w.\-]+  # hostname preceded by http:// or https://
    )
    \/  # followed by a slash
  )
  (?:  # One or more:
    [^\s()<>{}\[\]]+  # Run of non-space, non-()<>{}[]
    |  # or
    \([^\s()]*?\([^\s()]+\)[^\s()]*?\)  # balanced parens, one level deep: (…(…)…)
    |
    \([^\s]+?\)  # balanced parens, non-recursive: (…)
  )+
  (?:  # End with:
    \([^\s()]*?\([^\s()]+\)[^\s()]*?\)  # balanced parens, one level deep: (…(…)…)
    |
    \([^\s]+?\)  # balanced parens, non-recursive: (…)
    |  # or
    [^\s`!()\[\]{};:'\".,<>?«»“”‘’]  # not a space or one of these punct chars
  )
  |  # OR, the following to match naked domains:
  (?:
    (?<!@)  # not preceded by a @, avoid matching foo@_gmail.com_(?<![@.])
    (?:
      \w+
      (?:[.\-]+\w+)*
      \.  # avoid matching the last two parts of an email domain like co.uk in person@amazon.co.uk
      (?:[a-z]{2,13})  # ending in 2-13 letters (TLD-like, not a fixed list of gTLDs)
      |  # or
      (?:(?:[0-9](?!\d)|[1-9][0-9](?!\d)|1[0-9]{2}(?!\d)|2[0-4][0-9](?!\d)|25[0-5](?!\d))[.]?){4}  # IPv4 address, as seen in https://stackoverflow.com/a/13166657/650558
    )
    \b
    \/?
    (?!@)  # not succeeded by a @, avoid matching "foo.na" in "foo.na@example.com"
    (?:  # Zero or more:
      [^\s()<>{}\[\]]+  # Run of non-space, non-()<>{}[]
      |  # or
      \([^\s()]*?\([^\s()]+\)[^\s()]*?\)  # balanced parens, one level deep: (…(…)…)
      |
      \([^\s]+?\)  # balanced parens, non-recursive: (…)
    )*
    (?:  # End with:
      \([^\s()]*?\([^\s()]+\)[^\s()]*?\)  # balanced parens, one level deep: (…(…)…)
      |
      \([^\s]+?\)  # balanced parens, non-recursive: (…)
      |  # or
      [^\s`!()\[\]{};:'\".,<>?«»“”‘’]  # not a space or one of these punct chars
    )?
  )
)
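To see what two of the building blocks above do in isolation, here is a minimal Python sketch. It copies only the "one or more / end with" path pieces and the IPv4 alternation from the commented version; the names `BODY`, `END`, `PATH_TAIL`, `IPV4`, `path_tail` and `is_ipv4` are made up for this example. The full pattern is deliberately not compiled here, because its `(?<=http:\/\/|https:\/\/)` lookbehind has alternatives of different lengths, which Python's `re` module rejects.

```python
import re

# Sub-patterns copied from the commented version above. The variable and
# function names are hypothetical labels for this sketch only.
BODY = r"""(?:[^\s()<>{}\[\]]+|\([^\s()]*?\([^\s()]+\)[^\s()]*?\)|\([^\s]+?\))"""
END = r"""(?:\([^\s()]*?\([^\s()]+\)[^\s()]*?\)|\([^\s]+?\)|[^\s`!()\[\]{};:'\".,<>?«»“”‘’])"""

# "One or more" body pieces, then the "End with" piece.
PATH_TAIL = re.compile(BODY + "+" + END)

# The IPv4 alternative: four octets in 0-255, each optionally followed by a dot.
IPV4 = re.compile(
    r"(?:(?:[0-9](?!\d)|[1-9][0-9](?!\d)|1[0-9]{2}(?!\d)"
    r"|2[0-4][0-9](?!\d)|25[0-5](?!\d))[.]?){4}"
)


def path_tail(text):
    """Return the part of text the path sub-pattern matches from position 0."""
    m = PATH_TAIL.match(text)
    return m.group(0) if m else None


def is_ipv4(text):
    """True if the whole string is accepted by the IPv4 sub-pattern."""
    return IPV4.fullmatch(text) is not None


# A closing paren that balances an opening one is kept; the trailing comma is not.
print(path_tail("Rust_(programming_language),"))
# Sentence-ending punctuation is left out of the match.
print(path_tail("page.html."))
print(is_ipv4("192.168.0.1"), is_ipv4("256.1.1.1"))
```

Because every dot is optional (`[.]?`), the IPv4 piece on its own also accepts a trailing dot such as `1.2.3.4.`; in the full pattern the `\b` that follows it trims that dot off again.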
Possibly, but I'm able to run it in an environment (.NET) where I'm able to specify a timeout
for my regex, to handle edge cases like this that have never come up for me.
That being said, if you solve the backtracking issue, definitely let me know.
To put it in context: I just tested your URL on regex101. When I end your URL with 12 question marks, it executes in TWELVE MILLISECONDS. When I add the 13th question mark, regex101 complains about catastrophic backtracking...
"But is it illegal though."
It seems the regex still has a catastrophic backtracking issue when the string ends with several trailing punctuation characters, e.g.
https://www.google.co.jp/search?q=hello&client=safari?????????????
You can check it on https://regex101.com
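A rough Python sketch of why this input is expensive, under stated assumptions: `NESTED` is a simplified stand-in for the path part of the regex (a `+` run inside a `+` group, followed by a required final character that may not be `?`), not the full pattern, and `FLAT` is one possible mitigation that drops the inner `+`. All names here are hypothetical. In a backtracking engine such as Python's `re`, the nested form can split a run of trailing `?` characters between group iterations in exponentially many ways before it gives up on them, so the work roughly doubles with each extra `?`; the flat form has only one way to split the run.

```python
import re

# Simplified stand-ins (assumptions for this sketch, not the gist's full regex).
CHARS = r"[^\s()<>{}\[\]]"   # same "run" class as in the gist
FINAL = r"[^\s?]"            # hypothetical shortened "end with" class

# A "+" run inside a "+" group: many ways to split the same text.
NESTED = re.compile("(?:" + CHARS + "+)+" + FINAL)
# Same set of strings, but one character per iteration: a single way to split.
FLAT = re.compile("(?:" + CHARS + ")+" + FINAL)

URL = "https://www.google.co.jp/search?q=hello&client=safari"


def trimmed(pattern, text):
    """Return what pattern matches from the start of text, or None."""
    m = pattern.match(text)
    return m.group(0) if m else None


# With few trailing "?" both variants agree and drop the punctuation.
print(trimmed(NESTED, URL + "?" * 5) == URL)
print(trimmed(FLAT, URL + "?" * 5) == URL)
# The flat variant backtracks one character at a time, so a long run is cheap.
print(trimmed(FLAT, URL + "?" * 200) == URL)
```

Whether the same change is safe in the full pattern is untested here: the parenthesised alternatives can still overlap with each other, so treat this as a starting point for experiments on regex101 rather than a fix.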