Updated @gruber's regex with a modified version that looks for 2-13 letters rather than trying to look for specific TLDs, and many other improvements. (UPDATE 2018-07-30: Support for IPv4 addresses, bare hostnames, naked domains, xn-- internationalized domains, and more... see comments for BREAKING CHANGE.)
# Single-line version:
(?i)\b(https?:\/{1,3})?((?:(?:[\w.\-]+\.(?:[a-z]{2,13})|(?<=http:\/\/|https:\/\/)[\w.\-]+)\/)(?:[^\s()<>{}\[\]]+|\([^\s()]*?\([^\s()]+\)[^\s()]*?\)|\([^\s]+?\))+(?:\([^\s()]*?\([^\s()]+\)[^\s()]*?\)|\([^\s]+?\)|[^\s`!()\[\]{};:'\".,<>?«»“”‘’])|(?:(?<!@)(?:\w+(?:[.\-]+\w+)*\.(?:[a-z]{2,13})|(?:(?:[0-9](?!\d)|[1-9][0-9](?!\d)|1[0-9]{2}(?!\d)|2[0-4][0-9](?!\d)|25[0-5](?!\d))[.]?){4})\b\/?(?!@)(?:[^\s()<>{}\[\]]+|\([^\s()]*?\([^\s()]+\)[^\s()]*?\)|\([^\s]+?\))*(?:\([^\s()]*?\([^\s()]+\)[^\s()]*?\)|\([^\s]+?\)|[^\s`!()\[\]{};:'\".,<>?«»“”‘’])?))
# Commented multi-line version:
(?xi)
\b
(https?:\/{1,3})? # Capture $1: (optional) URL scheme, colon, and slashes
( # Capture $2: Entire matched URL (other than optional protocol://)
(?:
(?:
[\w.\-]+\. # looks like domain name
(?:[a-z]{2,13}) # ending in a TLD-like run of 2-13 letters (no specific TLD list)
| #
(?<=http:\/\/|https:\/\/)[\w.\-]+ # hostname preceded by http:// or https://
)
\/ # followed by a slash
)
(?: # One or more:
[^\s()<>{}\[\]]+ # Run of non-space, non-()<>{}[]
| # or
\([^\s()]*?\([^\s()]+\)[^\s()]*?\) # balanced parens, one level deep: (…(…)…)
|
\([^\s]+?\) # balanced parens, non-recursive: (…)
)+
(?: # End with:
\([^\s()]*?\([^\s()]+\)[^\s()]*?\) # balanced parens, one level deep: (…(…)…)
|
\([^\s]+?\) # balanced parens, non-recursive: (…)
| # or
[^\s`!()\[\]{};:'\".,<>?«»“”‘’] # not a space or one of these punct chars
)
| # OR, the following to match naked domains:
(?:
(?<!@) # not preceded by an @, to avoid matching gmail.com in foo@gmail.com
(?:
\w+
(?:[.\-]+\w+)*
\. # avoid matching the last two parts of an email domain like co.uk in person@amazon.co.uk
(?:[a-z]{2,13}) # ending in a TLD-like run of 2-13 letters (no specific TLD list)
| # or
(?:(?:[0-9](?!\d)|[1-9][0-9](?!\d)|1[0-9]{2}(?!\d)|2[0-4][0-9](?!\d)|25[0-5](?!\d))[.]?){4} # IPv4 address, as seen in https://stackoverflow.com/a/13166657/650558
)
\b
\/?
(?!@) # not followed by an @, to avoid matching "foo.na" in "foo.na@example.com"
(?: # Zero or more:
[^\s()<>{}\[\]]+ # Run of non-space, non-()<>{}[]
| # or
\([^\s()]*?\([^\s()]+\)[^\s()]*?\) # balanced parens, one level deep: (…(…)…)
|
\([^\s]+?\) # balanced parens, non-recursive: (…)
)*
(?: # Optionally end with:
\([^\s()]*?\([^\s()]+\)[^\s()]*?\) # balanced parens, one level deep: (…(…)…)
|
\([^\s]+?\) # balanced parens, non-recursive: (…)
| # or
[^\s`!()\[\]{};:'\".,<>?«»“”‘’] # not a space or one of these punct chars
)?
)
)
@winzig
Author

winzig commented Jul 31, 2018

BREAKING CHANGE

$1 now returns the scheme:// e.g. https:// or http://. And $2 returns the remainder of the URL (everything after the scheme://). Previously $1 returned the entire URL, so if you're swapping this updated regex into your code, you'll want to change references of "$1" to "$1$2".
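As a rough illustration of the `$1` / `$2` split, here is a minimal Python sketch. It is not the author's code (the thread mentions .NET). The names `URL_RE` and `find_urls` are made up for the example, and the pattern differs from the gist in one place: Python's `re` only accepts fixed-width lookbehind, so `(?<=http:\/\/|https:\/\/)` is rewritten as two separate lookbehinds, which should be equivalent.

```python
import re

# The gist's single-line regex with one change, which is an assumption of this
# sketch: the variable-width lookbehind (?<=http:\/\/|https:\/\/) is written as
# (?:(?<=http:\/\/)|(?<=https:\/\/)) so that Python's `re` will compile it.
URL_RE = re.compile(
    r"""(?i)\b(https?:\/{1,3})?((?:(?:[\w.\-]+\.(?:[a-z]{2,13})|(?:(?<=http:\/\/)|(?<=https:\/\/))[\w.\-]+)\/)(?:[^\s()<>{}\[\]]+|\([^\s()]*?\([^\s()]+\)[^\s()]*?\)|\([^\s]+?\))+(?:\([^\s()]*?\([^\s()]+\)[^\s()]*?\)|\([^\s]+?\)|[^\s`!()\[\]{};:'\".,<>?«»“”‘’])|(?:(?<!@)(?:\w+(?:[.\-]+\w+)*\.(?:[a-z]{2,13})|(?:(?:[0-9](?!\d)|[1-9][0-9](?!\d)|1[0-9]{2}(?!\d)|2[0-4][0-9](?!\d)|25[0-5](?!\d))[.]?){4})\b\/?(?!@)(?:[^\s()<>{}\[\]]+|\([^\s()]*?\([^\s()]+\)[^\s()]*?\)|\([^\s]+?\))*(?:\([^\s()]*?\([^\s()]+\)[^\s()]*?\)|\([^\s]+?\)|[^\s`!()\[\]{};:'\".,<>?«»“”‘’])?))"""
)


def find_urls(text):
    """Return (scheme, rest, full_url) for each match; scheme is '' when absent."""
    results = []
    for m in URL_RE.finditer(text):
        scheme = m.group(1) or ""  # $1 is unset when the URL has no scheme
        rest = m.group(2)          # $2: everything after the scheme
        # What used to be "$1" (the whole URL) is now "$1$2".
        results.append((scheme, rest, scheme + rest))
    return results


if __name__ == "__main__":
    for scheme, rest, full in find_urls("See https://example.com/path. Also example.com today"):
        print(repr(scheme), repr(rest), repr(full))
```

Because group 1 is optional, code that concatenates the two groups has to tolerate an unset `$1`; the sketch maps it to an empty string.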

Other Improvements

  • Returning the scheme:// separately from the rest of the URL, which can be useful (see above)
  • IPv4 addresses (I'll likely be dead and buried before anyone wants to link an IPv6 URL)
  • Bare (single) hostnames with no periods, e.g. http://hostname/whatever now works
  • Naked domain URLs, e.g. domain.com (this was somewhat working, but now should work better)
  • Internationalized domains, i.e. domains beginning with xn--
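The IPv4 support can be tried in isolation. The following Python sketch copies only the IPv4 branch from the regex above; `IPV4_RE` and `looks_like_ipv4` are hypothetical names, and applying the branch with `fullmatch` is a choice made for this example, not something the gist does.

```python
import re

# IPv4 branch copied from the gist's regex. Each octet alternative ends with
# (?!\d), so an octet cannot be cut short in front of another digit.
IPV4_RE = re.compile(
    r"(?:(?:[0-9](?!\d)|[1-9][0-9](?!\d)|1[0-9]{2}(?!\d)|2[0-4][0-9](?!\d)|25[0-5](?!\d))[.]?){4}"
)


def looks_like_ipv4(s: str) -> bool:
    """True if the whole string is matched by the gist's IPv4 branch."""
    return IPV4_RE.fullmatch(s) is not None


if __name__ == "__main__":
    for candidate in ["192.168.0.1", "255.255.255.255", "256.1.1.1", "10.0.0.999"]:
        print(candidate, looks_like_ipv4(candidate))
```

One quirk worth knowing: the dot after each octet is optional (`[.]?`), so this branch on its own also accepts a trailing dot, as in `1.2.3.4.`.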

@rmalouf

rmalouf commented Jan 31, 2019

Line 26 should probably be )*

@winzig
Author

winzig commented Jun 23, 2019

@rmalouf Can you give me an example of where this fails to identify a URL without that change to line 26? (I hate to make adjustments based on hypotheticals because this dang regex is so complex.)

@hrieke

hrieke commented Jan 10, 2022

License for this code?
Thanks!

@winzig
Author

winzig commented Jan 10, 2022

@hrieke You’d have to ask @gruber since I just forked his regex, but for my changes, public domain is fine.

@gruber

gruber commented Jan 10, 2022

Public domain is fine by me. I published this to be freely used.

@winzig
Author

winzig commented Jan 10, 2022

❤️

@gruber

gruber commented Jan 10, 2022 via email

@winzig
Author

winzig commented Jan 10, 2022

Thanks!

@hrieke

hrieke commented Jan 11, 2022

Thank you both for the prompt reply!

@b96705008

It seems the regex still suffers from catastrophic backtracking when the string ends with several punctuation characters:
e.g. https://www.google.co.jp/search?q=hello&client=safari?????????????
Check https://regex101.com
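A possible explanation, sketched in Python: the path portion of the regex is a repeated group whose first alternative is itself a greedy run, followed by a required final character that may not be `?`. When the string ends in a run of `?`, the engine appears to try every way of splitting that run between iterations before giving the characters back, so each extra mark roughly doubles the work. The fragment below is copied from the gist's first branch; `PATH_TAIL_RE` and `time_match` are names invented for this sketch, and the timings it prints depend on the machine and regex engine.

```python
import re
import time

# Path portion of the gist's first branch: "(one or more runs/paren groups)"
# followed by the required "end with" item. Copied from the gist unchanged.
PATH_TAIL_RE = re.compile(
    r"""(?:[^\s()<>{}\[\]]+|\([^\s()]*?\([^\s()]+\)[^\s()]*?\)|\([^\s]+?\))+"""
    r"""(?:\([^\s()]*?\([^\s()]+\)[^\s()]*?\)|\([^\s]+?\)|[^\s`!()\[\]{};:'\".,<>?«»“”‘’])"""
)


def time_match(n_marks: int):
    """Match 'safari' followed by n_marks question marks; return (text, seconds)."""
    subject = "safari" + "?" * n_marks
    start = time.perf_counter()
    m = PATH_TAIL_RE.match(subject)
    elapsed = time.perf_counter() - start
    return (m.group(0) if m else None), elapsed


if __name__ == "__main__":
    # Kept small on purpose: the cost grows quickly with the number of marks.
    for n in (4, 8, 12, 16):
        text, seconds = time_match(n)
        print(f"{n:2d} trailing '?': matched {text!r} in {seconds:.6f}s")
```

The match itself still succeeds and excludes the trailing marks; only the time taken to get there changes.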

@winzig
Author

winzig commented May 4, 2022

Possibly, but I run it in an environment (.NET) that lets me specify a timeout for my regex, which covers edge cases like this one. They have never come up for me.

That being said, if you solve the backtracking issue, definitely let me know. :trollface:

@winzig
Author

winzig commented May 4, 2022

To put it in context: I just tested your URL on regex101. When I end your URL with 12 question marks, it executes in TWELVE MILLISECONDS. When I add the 13th question mark, regex101 complains about catastrophic backtracking...

"But is it illegal though."

https://www.youtube.com/watch?v=kH6QJzmLYtw
