Updated @gruber's regex with a modified version that looks for 2-13 letters rather than trying to look for specific TLDs, and many other improvements. (UPDATE 2018-07-30: Support for IPv4 addresses, bare hostnames, naked domains, xn-- internationalized domains, and more... see comments for BREAKING CHANGE.)
# Single-line version:
(?i)\b(https?:\/{1,3})?((?:(?:[\w.\-]+\.(?:[a-z]{2,13})|(?<=http:\/\/|https:\/\/)[\w.\-]+)\/)(?:[^\s()<>{}\[\]]+|\([^\s()]*?\([^\s()]+\)[^\s()]*?\)|\([^\s]+?\))+(?:\([^\s()]*?\([^\s()]+\)[^\s()]*?\)|\([^\s]+?\)|[^\s`!()\[\]{};:'\".,<>?«»“”‘’])|(?:(?<!@)(?:\w+(?:[.\-]+\w+)*\.(?:[a-z]{2,13})|(?:(?:[0-9](?!\d)|[1-9][0-9](?!\d)|1[0-9]{2}(?!\d)|2[0-4][0-9](?!\d)|25[0-5](?!\d))[.]?){4})\b\/?(?!@)(?:[^\s()<>{}\[\]]+|\([^\s()]*?\([^\s()]+\)[^\s()]*?\)|\([^\s]+?\))*(?:\([^\s()]*?\([^\s()]+\)[^\s()]*?\)|\([^\s]+?\)|[^\s`!()\[\]{};:'\".,<>?«»“”‘’])?))
# Commented multi-line version:
(?xi)
\b
(https?:\/{1,3})? # Capture $1: (optional) URL scheme, colon, and slashes
( # Capture $2: Entire matched URL (other than optional protocol://)
(?:
(?:
[\w.\-]+\. # looks like domain name
(?:[a-z]{2,13}) # ending in 2-13 letters (any TLD-like suffix, not a fixed list)
| #
(?<=http:\/\/|https:\/\/)[\w.\-]+ # hostname preceded by http:// or https://
)
\/ # followed by a slash
)
(?: # One or more:
[^\s()<>{}\[\]]+ # Run of non-space, non-()<>{}[]
| # or
\([^\s()]*?\([^\s()]+\)[^\s()]*?\) # balanced parens, one level deep: (…(…)…)
|
\([^\s]+?\) # balanced parens, non-recursive: (…)
)+
(?: # End with:
\([^\s()]*?\([^\s()]+\)[^\s()]*?\) # balanced parens, one level deep: (…(…)…)
|
\([^\s]+?\) # balanced parens, non-recursive: (…)
| # or
[^\s`!()\[\]{};:'\".,<>?«»“”‘’] # not a space or one of these punct chars
)
| # OR, the following to match naked domains:
(?:
(?<!@) # not preceded by a @, avoid matching foo@_gmail.com_(?<![@.])
(?:
\w+
(?:[.\-]+\w+)*
\. # avoid matching the last two parts of an email domain like co.uk in person@amazon.co.uk
(?:[a-z]{2,13}) # ending in 2-13 letters (any TLD-like suffix, not a fixed list)
| # or
(?:(?:[0-9](?!\d)|[1-9][0-9](?!\d)|1[0-9]{2}(?!\d)|2[0-4][0-9](?!\d)|25[0-5](?!\d))[.]?){4} # IPv4 address, as seen in https://stackoverflow.com/a/13166657/650558
)
\b
\/?
(?!@) # not followed by an @, avoid matching "foo.na" in "foo.na@example.com"
(?: # One or more:
[^\s()<>{}\[\]]+ # Run of non-space, non-()<>{}[]
| # or
\([^\s()]*?\([^\s()]+\)[^\s()]*?\) # balanced parens, one level deep: (…(…)…)
|
\([^\s]+?\) # balanced parens, non-recursive: (…)
)*
(?: # End with:
\([^\s()]*?\([^\s()]+\)[^\s()]*?\) # balanced parens, one level deep: (…(…)…)
|
\([^\s]+?\) # balanced parens, non-recursive: (…)
| # or
[^\s`!()\[\]{};:'\".,<>?«»“”‘’] # not a space or one of these punct chars
)?
)
)
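
A portability note for anyone pasting this into Python: the lookbehind (?<=http:\/\/|https:\/\/) has alternatives of different lengths, and Python's built-in re module only accepts fixed-width lookbehinds. A minimal sketch of one workaround, assuming you are free to edit the pattern, is to split it into one lookbehind per scheme. The helper name compiles() and the sample text are made up for illustration, and only the hostname branch is shown, not the full regex.

```python
import re


def compiles(pattern):
    """Return True if Python's re module accepts the pattern."""
    try:
        re.compile(pattern)
        return True
    except re.error:
        return False


# The hostname branch as written in the gist: alternatives of 7 and 8 characters.
ORIGINAL = r"(?<=http:\/\/|https:\/\/)[\w.\-]+"

# Rewrite assumed by this sketch (not part of the gist):
# one fixed-width lookbehind per scheme, joined by alternation.
SPLIT = r"(?:(?<=http:\/\/)|(?<=https:\/\/))[\w.\-]+"

print(compiles(ORIGINAL))  # False: look-behind requires fixed-width pattern
print(compiles(SPLIT))     # True
print(re.search(SPLIT, "see https://intranet/wiki").group(0))  # intranet
```

Engines such as .NET accept the original form as-is, so this rewrite only matters when porting.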
@HenkPoley

Any source on the 2 - 13 characters limit for the TLD?

@stuntbox

stuntbox commented Feb 9, 2014

Theoretically, would 63 be a reasonably future-proof max for the TLD, since that's the upper limit for a label in the DNS? http://en.wikipedia.org/wiki/Domain_Name_System#cite_ref-rfc1034_1-2

@HenkPoley

Yes, it's 63 bytes (64 with null byte), for each "label". A label in DNS is basically the parts separated by the dots.

www.google.com -> labels = {www, google, com}

The total max size, after punycoding, for the domain is 255 bytes, afaik.
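
To make the label arithmetic concrete, here is a small sketch that splits a hostname into labels and checks the two limits quoted above (63 per label, 255 overall). It assumes the hostname is already ASCII (i.e. already punycoded), and the function names are hypothetical, chosen only for this example.

```python
MAX_LABEL_LEN = 63   # per-label limit quoted in this thread
MAX_NAME_LEN = 255   # whole-name limit quoted in this thread


def split_labels(hostname):
    """Split a hostname into its dot-separated labels, ignoring a trailing dot."""
    return hostname.rstrip(".").split(".")


def fits_dns_limits(hostname):
    """Check label and total length for an ASCII (already punycoded) hostname."""
    labels = split_labels(hostname)
    if len(hostname) > MAX_NAME_LEN:
        return False
    return all(1 <= len(label) <= MAX_LABEL_LEN for label in labels)


print(split_labels("www.google.com"))       # ['www', 'google', 'com']
print(fits_dns_limits("www.google.com"))    # True
print(fits_dns_limits("a" * 64 + ".com"))   # False: first label is 64 characters
```

By this measure the regex's {2,13} cap on the last label is a heuristic, well below what the DNS itself would allow.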

btw, I think the regex I worked on is better ;) :: https://gist.github.com/HenkPoley/8899766

@mgmort

mgmort commented May 12, 2014

There is a typo in the single-line version: it is missing a /) (lines 22-23 in the expanded version).
Here is the corrected single line:
(?xi)\b((?:https?:(?:/{1,3}|[a-z0-9%])|[a-z0-9.-]+./)(?:[^\s()<>{}[]]+|([^\s()]?([^\s()]+)[^\s()]?)|([^\s]+?))+(?:([^\s()]?([^\s()]+)[^\s()]?)|([^\s]+?)|[^\s`!()[]{};:'".,<>?«»“”‘’])|(?:(?<!@)[a-z0-9]+(?:[.-][a-z0-9]+)*.\b/?(?!@)))

@mgmort

mgmort commented Apr 6, 2016

I just made another change in the second option for naked domains:
changed line 39 to
(?:\b
changed line 40 to
(?<![@.])
These two changes prevent the double suffixes of email addresses from being identified as URLs (for example, test@google.co.il and person@amazon.co.uk were returning co.il and co.uk as results).

@winzig
Author

winzig commented May 3, 2018

@mgmort I incorporated your bug fix from May 11, thanks. I tried out your proposals from Apr 6, but they result in overmatching, e.g. I was seeing matches on random words like "this" in the middle of sentences.

@winzig
Author

winzig commented May 16, 2018

UPDATE Modified so that naked URLs without a protocol prefix are now capable of matching more advanced URLs. Also escaped / as \/ and " as \" so it's easier to copy+paste this regex into more code.

@winzig
Author

winzig commented May 16, 2018

I also just backed out an accidental edit I did 12 days ago, re-ordering the gTLD code. oops!

@winzig
Author

winzig commented May 16, 2018

Just made a tweak so that this regex can match IP address URLs, such as http://127.0.0.1/
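
For anyone curious how the IPv4 part behaves on its own, here is a rough sketch that lifts just that fragment out of the regex and tries it against a few strings. The wrapper name looks_like_ipv4() is invented for this example, and fullmatch is used only to test the fragment in isolation; the full regex embeds it differently.

```python
import re

# The IPv4 fragment copied from the regex above: four octets in 0-255,
# each followed by an optional dot, with (?!\d) stopping an octet from
# ending in the middle of a longer run of digits.
IPV4 = re.compile(
    r"(?:(?:[0-9](?!\d)|[1-9][0-9](?!\d)|1[0-9]{2}(?!\d)"
    r"|2[0-4][0-9](?!\d)|25[0-5](?!\d))[.]?){4}"
)


def looks_like_ipv4(text):
    """True if the whole string matches the gist's IPv4 fragment."""
    return IPV4.fullmatch(text) is not None


print(looks_like_ipv4("127.0.0.1"))      # True
print(looks_like_ipv4("192.168.1.255"))  # True
print(looks_like_ipv4("256.0.0.1"))      # False: 256 is not a valid octet
print(looks_like_ipv4("10.0.0"))         # False: only three octets
```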

@winzig
Author

winzig commented Jul 31, 2018

BREAKING CHANGE

$1 now returns the scheme:// e.g. https:// or http://. And $2 returns the remainder of the URL (everything after the scheme://). Previously $1 returned the entire URL, so if you're swapping this updated regex into your code, you'll want to change references of "$1" to "$1$2".
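
A quick illustration of the "$1$2" change, sketched in Python. The pattern here is a deliberately simplified stand-in with the same two-group shape (optional scheme in group 1, the rest in group 2); it is not the full regex, and the names URL_SHAPE, full_urls() and linkify() are hypothetical.

```python
import re

# Simplified stand-in with the same capture layout as the updated regex:
# group 1 = optional scheme://, group 2 = everything after it.
URL_SHAPE = re.compile(
    r"\b(https?:/{1,3})?([\w.\-]+\.[a-z]{2,13}(?:/\S*)?)",
    re.IGNORECASE,
)


def full_urls(text):
    """Rebuild each whole URL as group 1 + group 2 (group 1 may be missing)."""
    return [(m.group(1) or "") + m.group(2) for m in URL_SHAPE.finditer(text)]


def linkify(text):
    """Wrap each URL in angle brackets using both groups in the replacement."""
    # In Python 3.5+ an unmatched group in the template becomes an empty string.
    return URL_SHAPE.sub(r"<\1\2>", text)


sample = "see https://example.com/a and example.org"
print(full_urls(sample))  # ['https://example.com/a', 'example.org']
print(linkify(sample))    # see <https://example.com/a> and <example.org>
```

The point is only that code which used to read group 1 alone now has to concatenate both groups, and has to cope with group 1 being absent for naked domains.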

Other Improvements

  • Returning the scheme:// separately from the rest of the URL, which can be useful (see above)
  • IPv4 addresses (I'll likely be dead and buried before anyone wants to link an IPv6 URL)
  • Bare (single) hostnames with no periods, e.g. http://hostname/whatever now works
  • Naked domain URLs, e.g. domain.com (this was somewhat working, but now should work better)
  • Internationalized domains, i.e. domains beginning with xn--
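
On the xn-- point: the naked-domain branch allows runs of dots and hyphens between word characters, which is what lets a punycoded label through. A small sketch, assuming Python's built-in idna codec for the conversion and using a made-up example domain:

```python
import re

# The naked-domain host fragment from the regex above, tested in isolation.
NAKED_HOST = re.compile(r"\w+(?:[.\-]+\w+)*\.(?:[a-z]{2,13})", re.IGNORECASE)

# Convert a Unicode hostname to its ASCII (punycode) form.
ascii_host = "bücher.example".encode("idna").decode("ascii")

print(ascii_host)                                     # xn--bcher-kva.example
print(NAKED_HOST.fullmatch(ascii_host) is not None)   # True
```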

@rmalouf

rmalouf commented Jan 31, 2019

Line 26 should probably be )*

@winzig
Author

winzig commented Jun 23, 2019

@rmalouf Can you give me an example of where this fails to identify a URL without that change to line 26? (I hate to make adjustments based on hypotheticals because this dang regex is so complex.)

@hrieke

hrieke commented Jan 10, 2022

License for this code?
Thanks!

@winzig
Author

winzig commented Jan 10, 2022

@hrieke You’d have to ask @gruber since I just forked his regex, but for my changes, public domain is fine.

@gruber

gruber commented Jan 10, 2022

Public domain is fine by me. I published this to be freely used.

@winzig
Author

winzig commented Jan 10, 2022

❤️

@gruber

gruber commented Jan 10, 2022 via email

@winzig
Author

winzig commented Jan 10, 2022

Thanks!

@hrieke

hrieke commented Jan 11, 2022

Thank you both for the prompt reply!

@b96705008

It seems like the regex still has a catastrophic backtracking issue when the string has multiple trailing punctuation characters:
e.g. https://www.google.co.jp/search?q=hello&client=safari?????????????
Check https://regex101.com
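
One possible workaround, if your engine has no timeout: shorten long runs of trailing punctuation before the text ever reaches the URL regex, since the slowdown in this example grows with the number of trailing ? characters. This is only a sketch of a pre-processing step, not something the gist does; the name cap_trailing_punct() and the exact punctuation set are assumptions, and it does alter the text being scanned.

```python
import re

# Runs of 2+ punctuation marks that sit right before whitespace or end of input.
# The character set is an illustrative subset, not the regex's full list.
TRAILING_RUN = re.compile(r"([?!.,;:]{2,})(?=\s|$)")


def cap_trailing_punct(text):
    """Shorten each trailing punctuation run to its first character."""
    return TRAILING_RUN.sub(lambda m: m.group(1)[0], text)


url = "https://www.google.co.jp/search?q=hello&client=safari" + "?" * 13
print(cap_trailing_punct(url))
# https://www.google.co.jp/search?q=hello&client=safari?
```

A single ? inside the URL (as in search?q=...) is left alone because it is not followed by whitespace or the end of the input.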

@winzig
Author

winzig commented May 4, 2022

Possibly, but I'm able to run it in an environment (.NET) where I'm able to specify a timeout for my regex, to handle edge cases like this that have never come up for me.

That being said, if you solve the backtracking issue, definitely let me know. :trollface:

@winzig
Author

winzig commented May 4, 2022

To put it in context: I just tested your URL on regex101. When I end your URL with 12 question marks, it executes in TWELVE MILLISECONDS. When I add the 13th question mark, regex101 complains about catastrophic backtracking...

"But is it illegal though."

https://www.youtube.com/watch?v=kH6QJzmLYtw
