# Single-line version:
(?i)\b(https?:\/{1,3})?((?:(?:[\w.\-]+\.(?:[a-z]{2,13})|(?<=http:\/\/|https:\/\/)[\w.\-]+)\/)(?:[^\s()<>{}\[\]]+|\([^\s()]*?\([^\s()]+\)[^\s()]*?\)|\([^\s]+?\))+(?:\([^\s()]*?\([^\s()]+\)[^\s()]*?\)|\([^\s]+?\)|[^\s`!()\[\]{};:'\".,<>?«»“”‘’])|(?:(?<!@)(?:\w+(?:[.\-]+\w+)*\.(?:[a-z]{2,13})|(?:(?:[0-9](?!\d)|[1-9][0-9](?!\d)|1[0-9]{2}(?!\d)|2[0-4][0-9](?!\d)|25[0-5](?!\d))[.]?){4})\b\/?(?!@)(?:[^\s()<>{}\[\]]+|\([^\s()]*?\([^\s()]+\)[^\s()]*?\)|\([^\s]+?\))*(?:\([^\s()]*?\([^\s()]+\)[^\s()]*?\)|\([^\s]+?\)|[^\s`!()\[\]{};:'\".,<>?«»“”‘’])?))
# Commented multi-line version:
(?xi)
\b
(https?:\/{1,3})? # Capture $1: (optional) URL scheme, colon, and slashes
( # Capture $2: Entire matched URL (other than optional protocol://)
  (?:
    (?:
      [\w.\-]+\. # looks like domain name
      (?:[a-z]{2,13}) # ending in common popular gTLDs
      | #
      (?<=http:\/\/|https:\/\/)[\w.\-]+ # hostname preceded by http:// or https://
    )
    \/ # followed by a slash
  )
  (?: # One or more:
    [^\s()<>{}\[\]]+ # Run of non-space, non-()<>{}[]
    | # or
    \([^\s()]*?\([^\s()]+\)[^\s()]*?\) # balanced parens, one level deep: (…(…)…)
    |
    \([^\s]+?\) # balanced parens, non-recursive: (…)
  )+
  (?: # End with:
    \([^\s()]*?\([^\s()]+\)[^\s()]*?\) # balanced parens, one level deep: (…(…)…)
    |
    \([^\s]+?\) # balanced parens, non-recursive: (…)
    | # or
    [^\s`!()\[\]{};:'\".,<>?«»“”‘’] # not a space or one of these punct chars
  )
  | # OR, the following to match naked domains:
  (?:
    (?<!@) # not preceded by a @, avoid matching foo@_gmail.com_(?<![@.])
    (?:
      \w+
      (?:[.\-]+\w+)*
      \. # avoid matching the last two parts of an email domain like co.uk in person@amazon.co.uk
      (?:[a-z]{2,13}) # ending in common popular gTLDs
      | # or
      (?:(?:[0-9](?!\d)|[1-9][0-9](?!\d)|1[0-9]{2}(?!\d)|2[0-4][0-9](?!\d)|25[0-5](?!\d))[.]?){4} # IPv4 address, as seen in https://stackoverflow.com/a/13166657/650558
    )
    \b
    \/?
    (?!@) # not succeeded by a @, avoid matching "foo.na" in "foo.na@example.com"
    (?: # One or more:
      [^\s()<>{}\[\]]+ # Run of non-space, non-()<>{}[]
      | # or
      \([^\s()]*?\([^\s()]+\)[^\s()]*?\) # balanced parens, one level deep: (…(…)…)
      |
      \([^\s]+?\) # balanced parens, non-recursive: (…)
    )*
    (?: # End with:
      \([^\s()]*?\([^\s()]+\)[^\s()]*?\) # balanced parens, one level deep: (…(…)…)
      |
      \([^\s]+?\) # balanced parens, non-recursive: (…)
      | # or
      [^\s`!()\[\]{};:'\".,<>?«»“”‘’] # not a space or one of these punct chars
    )?
  )
)
I also just backed out an accidental edit I did 12 days ago, re-ordering the gTLD code. oops!
Just made a tweak so that this regex can match IP address URLs, such as http://127.0.0.1/
BREAKING CHANGE
$1 now returns the scheme:// (e.g. https:// or http://), and $2 returns the remainder of the URL (everything after the scheme://). Previously $1 returned the entire URL, so if you're swapping this updated regex into your code, you'll want to change references of "$1" to "$1$2".
Other Improvements
- Returning the scheme:// separately from the rest of the URL, which can be useful (see above)
- IPv4 addresses (I'll likely be dead and buried before anyone wants to link an IPv6 URL)
- Bare (single) hostnames with no periods, e.g. http://hostname/whatever now works
- Naked domain URLs, e.g. domain.com (this was somewhat working, but now should work better)
- Internationalized domains, i.e. domains beginning with xn--
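To see what the IPv4 alternative accepts on its own, the sub-pattern can be lifted out of the regex and tried in isolation. The sketch below does that in Python; `looks_like_ipv4` is a hypothetical helper name, and `fullmatch` is used here only for testing the fragment, whereas inside the full regex it is followed by `\b` and the path handling.

```python
import re

# IPv4 alternative copied from the regex: four octets in the range 0-255,
# each optionally followed by a dot. The (?!\d) lookaheads stop an octet
# from matching just the start of a longer digit run, so the dot is in
# practice only optional after the last octet.
IPV4 = re.compile(
    r"(?:(?:[0-9](?!\d)|[1-9][0-9](?!\d)|1[0-9]{2}(?!\d)"
    r"|2[0-4][0-9](?!\d)|25[0-5](?!\d))[.]?){4}"
)

def looks_like_ipv4(host):
    """Return True if the whole string is matched by the IPv4 fragment."""
    return IPV4.fullmatch(host) is not None
```

For example, `looks_like_ipv4("127.0.0.1")` is true, while `"256.0.0.1"` and `"1.2.3"` are rejected.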
Line 26 should probably be )*
@rmalouf Can you give me an example of where this fails to identify a URL without that change to line 26? (I hate to make adjustments based on hypotheticals because this dang regex is so complex.)
License for this code?
Thanks!
Public domain is fine by me. I published this to be freely used.
❤️
Thanks!
Thank you both for the prompt reply!
It seems the regex still has a catastrophic backtracking issue when the string ends with multiple trailing punctuation characters:
e.g. https://www.google.co.jp/search?q=hello&client=safari?????????????
Check https://regex101.com
Possibly, but I'm able to run it in an environment (.NET) where I'm able to specify a timeout for my regex, to handle edge cases like this that have never come up for me.
That being said, if you solve the backtracking issue, definitely let me know.
To put it in context: I just tested your URL on regex101. When I end your URL with 12 question marks, it executes in TWELVE MILLISECONDS. When I add the 13th question mark, regex101 complains about catastrophic backtracking...
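For anyone who wants to poke at this locally, here is a small Python sketch. It is not the full regex: `FRAGMENT` is a cut-down pattern modeled on the path section (a repeated group wrapping a greedy run, followed by the end-character class that excludes `?`), and `time_match` is a hypothetical helper. The trailing counts are kept small on purpose, since the work is expected to grow quickly as more question marks are added.

```python
import re
import time

# Fragment modeled on the path section of the regex (not the full pattern):
# a repeated group containing a greedy run, then an end character that
# excludes trailing punctuation such as "?".
RUN = r"(?:[^\s()<>{}\[\]]+)+"
END = r"""[^\s`!()\[\]{};:'\".,<>?«»“”‘’]"""
FRAGMENT = re.compile(RUN + END)

BASE = "https://www.google.co.jp/search?q=hello&client=safari"

def time_match(trailing):
    """Match BASE plus `trailing` question marks; return (matched text, seconds)."""
    text = BASE + "?" * trailing
    start = time.perf_counter()
    m = FRAGMENT.match(text)
    return m.group(0), time.perf_counter() - start

if __name__ == "__main__":
    # Keep the counts small; larger values may take a very long time.
    for n in (4, 8, 12):
        matched, seconds = time_match(n)
        print(n, matched == BASE, f"{seconds:.6f}s")
```

The match still ends at "safari" in each case; what changes is how many ways the engine tries to split the trailing question marks between the inner run and the outer repetition before it gets there.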
"But is it illegal though."
UPDATE: Modified so that naked URLs without a protocol prefix are now capable of matching more advanced URLs. Also escaped / as \/ and " as \" so it's easier to copy+paste this regex into more code.