Updated @gruber's regex with a modified version that looks for 2-13 letters or digits rather than trying to match specific TLDs. Given the recent addition of ~1400 gTLDs, it may be time to give up on that front. (UPDATE 2018-05-15: Naked URLs without a protocol prefix can now match more advanced URLs. Also escaped / and " so it's easier to copy…
# Single-line version:
(?i)\b((?:https?:(?:\/{1,3}|[a-z0-9%])|[a-z0-9.\-]+\.(?:[a-z0-9]{2,13})\/)(?:[^\s()<>{}\[\]]+|\([^\s()]*?\([^\s()]+\)[^\s()]*?\)|\([^\s]+?\))+(?:\([^\s()]*?\([^\s()]+\)[^\s()]*?\)|\([^\s]+?\)|[^\s`!()\[\]{};:'\".,<>?«»“”‘’])|(?:(?<!@)[a-z0-9]+(?:[.\-][a-z0-9]+)*[.](?:[a-z0-9]{2,13})\b\/?(?!@)(?:[^\s()<>{}\[\]]+|\([^\s()]*?\([^\s()]+\)[^\s()]*?\)|\([^\s]+?\))+(?:\([^\s()]*?\([^\s()]+\)[^\s()]*?\)|\([^\s]+?\)|[^\s`!()\[\]{};:'\".,<>?«»“”‘’])))
# Commented multi-line version:
(?xi)
\b
(                                       # Capture 1: entire matched URL
  (?:
    https?:                             # URL protocol and colon
    (?:
      \/{1,3}                           # 1-3 slashes
      |                                 # or
      [a-z0-9%]                         # Single letter or digit or '%'
                                        # (Trying not to match e.g. "URI::Escape")
    )
    |                                   # or
    [a-z0-9.\-]+\.                      # looks like domain name
    (?:[a-z0-9]{2,13})                  # ending in a TLD-like label of 2-13 letters or digits
                                        # (or final octet of an IPv4 address)
    \/                                  # followed by a slash
  )
  (?:                                   # One or more:
    [^\s()<>{}\[\]]+                    # Run of non-space, non-()<>{}[]
    |                                   # or
    \([^\s()]*?\([^\s()]+\)[^\s()]*?\)  # balanced parens, one level deep: (…(…)…)
    |                                   # or
    \([^\s]+?\)                         # balanced parens, non-recursive: (…)
  )+
  (?:                                   # End with:
    \([^\s()]*?\([^\s()]+\)[^\s()]*?\)  # balanced parens, one level deep: (…(…)…)
    |                                   # or
    \([^\s]+?\)                         # balanced parens, non-recursive: (…)
    |                                   # or
    [^\s`!()\[\]{};:'\".,<>?«»“”‘’]     # not a space or one of these punct chars
  )
  |                                     # OR, the following to match naked domains:
  (?:
    (?<!@)                              # not preceded by a @, to avoid matching "gmail.com" in "foo@gmail.com"
                                        # (the stricter (?<![@.]) would also avoid matching the last two
                                        # parts of an email domain, like co.uk in person@amazon.co.uk)
    [a-z0-9]+
    (?:[.\-][a-z0-9]+)*
    \.                                  # dot before the final label
    (?:[a-z0-9]{2,13})                  # ending in a TLD-like label of 2-13 letters or digits
                                        # (or final octet of an IPv4 address)
    \b
    \/?
    (?!@)                               # not followed by a @, to avoid matching "foo.na" in "foo.na@example.com"
    (?:                                   # One or more:
      [^\s()<>{}\[\]]+                    # Run of non-space, non-()<>{}[]
      |                                   # or
      \([^\s()]*?\([^\s()]+\)[^\s()]*?\)  # balanced parens, one level deep: (…(…)…)
      |                                   # or
      \([^\s]+?\)                         # balanced parens, non-recursive: (…)
    )+
    (?:                                   # End with:
      \([^\s()]*?\([^\s()]+\)[^\s()]*?\)  # balanced parens, one level deep: (…(…)…)
      |                                   # or
      \([^\s]+?\)                         # balanced parens, non-recursive: (…)
      |                                   # or
      [^\s`!()\[\]{};:'\".,<>?«»“”‘’]     # not a space or one of these punct chars
    )
  )
)
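
A minimal usage sketch, assuming Python's standard `re` module (the gist itself does not name a target language or engine). It compiles the single-line version verbatim and returns capture group 1 for each match. The names `URL_PATTERN`, `extract_urls`, and `sample` are hypothetical and exist only for this illustration.

```python
import re

# Single-line pattern copied verbatim from the gist. The raw triple-quoted
# string keeps every backslash and quote character intact.
URL_PATTERN = re.compile(
    r"""(?i)\b((?:https?:(?:\/{1,3}|[a-z0-9%])|[a-z0-9.\-]+\.(?:[a-z0-9]{2,13})\/)(?:[^\s()<>{}\[\]]+|\([^\s()]*?\([^\s()]+\)[^\s()]*?\)|\([^\s]+?\))+(?:\([^\s()]*?\([^\s()]+\)[^\s()]*?\)|\([^\s]+?\)|[^\s`!()\[\]{};:'\".,<>?«»“”‘’])|(?:(?<!@)[a-z0-9]+(?:[.\-][a-z0-9]+)*[.](?:[a-z0-9]{2,13})\b\/?(?!@)(?:[^\s()<>{}\[\]]+|\([^\s()]*?\([^\s()]+\)[^\s()]*?\)|\([^\s]+?\))+(?:\([^\s()]*?\([^\s()]+\)[^\s()]*?\)|\([^\s]+?\)|[^\s`!()\[\]{};:'\".,<>?«»“”‘’])))"""
)


def extract_urls(text):
    """Return capture group 1 (the entire matched URL) for every match."""
    return [m.group(1) for m in URL_PATTERN.finditer(text)]


# Hypothetical sample text, chosen only for illustration.
sample = (
    "Docs at https://example.com/path. Also see "
    "en.wikipedia.org/wiki/Rust_(programming_language), "
    "or mail bob@example.com."
)

for url in extract_urls(sample):
    print(url)
```

On this sample, the loop should print the two URLs without the trailing period or comma, keep the balanced parentheses in the second one, and skip the email address. Other regex engines may treat the lookbehind or the inline `(?i)` flag differently, so the pattern is worth re-testing before porting it.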