Updated @gruber's regex with a modified version that looks for 2-13 letters rather than trying to look for specific TLDs. Given the recent addition of ~1400 gTLDs, it may be time to give up on that front. (UPDATE 2018-05-15: Naked URLs without protocol prefix now capable of matching more advanced URLs. Also escaped / and " so it's easier to copy…
# Single-line version:
(?i)\b((?:https?:(?:\/{1,3}|[a-z0-9%])|[a-z0-9.\-]+\.(?:[a-z0-9]{2,13})\/)(?:[^\s()<>{}\[\]]+|\([^\s()]*?\([^\s()]+\)[^\s()]*?\)|\([^\s]+?\))+(?:\([^\s()]*?\([^\s()]+\)[^\s()]*?\)|\([^\s]+?\)|[^\s`!()\[\]{};:'\".,<>?«»“”‘’])|(?:(?<!@)[a-z0-9]+(?:[.\-][a-z0-9]+)*[.](?:[a-z0-9]{2,13})\b\/?(?!@)(?:[^\s()<>{}\[\]]+|\([^\s()]*?\([^\s()]+\)[^\s()]*?\)|\([^\s]+?\))+(?:\([^\s()]*?\([^\s()]+\)[^\s()]*?\)|\([^\s]+?\)|[^\s`!()\[\]{};:'\".,<>?«»“”‘’])))
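A minimal sketch of how the single-line pattern might be used from Python, assuming the `re` module accepts it unchanged. The pattern is pasted verbatim into a raw string; `URL_PATTERN` and `extract_urls` are illustrative names, not part of the gist.

```python
import re

# The single-line pattern from above, copied verbatim into a raw string.
# The already-escaped \" lets it sit inside double quotes without changes.
URL_PATTERN = re.compile(
    r"(?i)\b((?:https?:(?:\/{1,3}|[a-z0-9%])|[a-z0-9.\-]+\.(?:[a-z0-9]{2,13})\/)(?:[^\s()<>{}\[\]]+|\([^\s()]*?\([^\s()]+\)[^\s()]*?\)|\([^\s]+?\))+(?:\([^\s()]*?\([^\s()]+\)[^\s()]*?\)|\([^\s]+?\)|[^\s`!()\[\]{};:'\".,<>?«»“”‘’])|(?:(?<!@)[a-z0-9]+(?:[.\-][a-z0-9]+)*[.](?:[a-z0-9]{2,13})\b\/?(?!@)(?:[^\s()<>{}\[\]]+|\([^\s()]*?\([^\s()]+\)[^\s()]*?\)|\([^\s]+?\))+(?:\([^\s()]*?\([^\s()]+\)[^\s()]*?\)|\([^\s]+?\)|[^\s`!()\[\]{};:'\".,<>?«»“”‘’])))"
)


def extract_urls(text):
    """Return capture group 1 (the whole URL) for every match in text."""
    return [m.group(1) for m in URL_PATTERN.finditer(text)]


if __name__ == "__main__":
    # Trailing punctuation is left out of the match.
    print(extract_urls("See https://example.com/path, then stop."))
    # -> ['https://example.com/path']

    # A protocol-less host is picked up by the naked-domain branch.
    print(extract_urls("visit www.example.com today"))
    # -> ['www.example.com']

    # The domain part of an email address is skipped.
    print(extract_urls("mail bob@example.com now"))
    # -> []
```

Other engines (PCRE, Perl, Ruby) may need the inline flag or the quoting adjusted; this sketch only covers Python.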
# Commented multi-line version:
(?xi)
\b
(                                       # Capture 1: entire matched URL
  (?:
    https?:                             # URL protocol and colon
    (?:
      \/{1,3}                           # 1-3 slashes
      |                                 # or
      [a-z0-9%]                         # Single letter or digit or '%'
                                        # (Trying not to match e.g. "URI::Escape")
    )
    |                                   # or
    [a-z0-9.\-]+\.                      # looks like domain name
    (?:[a-z0-9]{2,13})                  # ending in 2-13 letters or digits standing in for a TLD
                                        # (or the final octet of an IPv4 address)
    \/                                  # followed by a slash
  )
  (?:                                   # One or more:
    [^\s()<>{}\[\]]+                    # Run of non-space, non-()<>{}[]
    |                                   # or
    \([^\s()]*?\([^\s()]+\)[^\s()]*?\)  # balanced parens, one level deep: (…(…)…)
    |                                   # or
    \([^\s]+?\)                         # balanced parens, non-recursive: (…)
  )+
  (?:                                   # End with:
    \([^\s()]*?\([^\s()]+\)[^\s()]*?\)  # balanced parens, one level deep: (…(…)…)
    |                                   # or
    \([^\s]+?\)                         # balanced parens, non-recursive: (…)
    |                                   # or
    [^\s`!()\[\]{};:'\".,<>?«»“”‘’]     # not a space or one of these punct chars
  )
  |                                     # OR, the following to match naked domains:
  (?:
    (?<!@)                              # not preceded by a @; avoids matching "gmail.com" in "foo@gmail.com"
                                        # (a stricter lookbehind, (?<![@.]), would also reject a preceding dot,
                                        # to avoid matching the last two parts of an email domain,
                                        # like "co.uk" in "person@amazon.co.uk")
    [a-z0-9]+
    (?:[.\-][a-z0-9]+)*
    \.                                  # literal dot before the final label
    (?:[a-z0-9]{2,13})                  # ending in 2-13 letters or digits standing in for a TLD
                                        # (or the final octet of an IPv4 address)
    \b
    \/?
    (?!@)                               # not followed by a @; avoids matching "foo.na" in "foo.na@example.com"
    (?:                                 # One or more:
      [^\s()<>{}\[\]]+                  # Run of non-space, non-()<>{}[]
      |                                 # or
      \([^\s()]*?\([^\s()]+\)[^\s()]*?\)  # balanced parens, one level deep: (…(…)…)
      |                                 # or
      \([^\s]+?\)                       # balanced parens, non-recursive: (…)
    )+
    (?:                                 # End with:
      \([^\s()]*?\([^\s()]+\)[^\s()]*?\)  # balanced parens, one level deep: (…(…)…)
      |                                 # or
      \([^\s]+?\)                       # balanced parens, non-recursive: (…)
      |                                 # or
      [^\s`!()\[\]{};:'\".,<>?«»“”‘’]   # not a space or one of these punct chars
    )
  )
)
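
The two parenthesis alternatives that recur throughout the commented version can be tried on their own. The sketch below copies both fragments verbatim and wraps them in a standalone pattern; `PAREN_GROUP` and `is_paren_group` are hypothetical names introduced only for this illustration.

```python
import re

# Fragments copied verbatim from the commented pattern above.
NESTED_PARENS = r"\([^\s()]*?\([^\s()]+\)[^\s()]*?\)"  # one level deep: (…(…)…)
FLAT_PARENS = r"\([^\s]+?\)"                           # non-recursive: (…)

# Hypothetical standalone pattern: either paren form and nothing else.
PAREN_GROUP = re.compile("(?:" + NESTED_PARENS + "|" + FLAT_PARENS + ")")


def is_paren_group(s):
    """True if s consists entirely of one parenthesized, space-free group."""
    return PAREN_GROUP.fullmatch(s) is not None


if __name__ == "__main__":
    print(is_paren_group("(a(b)c)"))  # -> True  (nested alternative)
    print(is_paren_group("(abc)"))    # -> True  (flat alternative)
    print(is_paren_group("(a b)"))    # -> False (whitespace ends a URL)
```

Neither fragment counts parentheses, so this only approximates balancing; that matches the "one level deep" and "non-recursive" wording in the comments.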