Updated @gruber's regex with a modified version that looks for 2-13 letters rather than trying to look for specific TLDs, and many other improvements. (UPDATE 2018-07-30: Support for IPv4 addresses, bare hostnames, naked domains, xn-- internationalized domains, and more... see comments for BREAKING CHANGE.)
# Single-line version:
(?i)\b(https?:\/{1,3})?((?:(?:[\w.\-]+\.(?:[a-z]{2,13})|(?<=http:\/\/|https:\/\/)[\w.\-]+)\/)(?:[^\s()<>{}\[\]]+|\([^\s()]*?\([^\s()]+\)[^\s()]*?\)|\([^\s]+?\))+(?:\([^\s()]*?\([^\s()]+\)[^\s()]*?\)|\([^\s]+?\)|[^\s`!()\[\]{};:'\".,<>?«»“”‘’])|(?:(?<!@)(?:\w+(?:[.\-]+\w+)*\.(?:[a-z]{2,13})|(?:(?:[0-9](?!\d)|[1-9][0-9](?!\d)|1[0-9]{2}(?!\d)|2[0-4][0-9](?!\d)|25[0-5](?!\d))[.]?){4})\b\/?(?!@)(?:[^\s()<>{}\[\]]+|\([^\s()]*?\([^\s()]+\)[^\s()]*?\)|\([^\s]+?\))*(?:\([^\s()]*?\([^\s()]+\)[^\s()]*?\)|\([^\s]+?\)|[^\s`!()\[\]{};:'\".,<>?«»“”‘’])?))
# Commented multi-line version:
(?xi)
\b
(https?:\/{1,3})?  # Capture $1: (optional) URL scheme, colon, and slashes
(  # Capture $2: Entire matched URL (other than optional protocol://)
  (?:
    (?:
      [\w.\-]+\.  # looks like domain name
      (?:[a-z]{2,13})  # ending in a TLD of 2-13 letters
      |  # or
      (?<=http:\/\/|https:\/\/)[\w.\-]+  # hostname preceded by http:// or https://
    )
    \/  # followed by a slash
  )
  (?:  # One or more:
    [^\s()<>{}\[\]]+  # Run of non-space, non-()<>{}[]
    |  # or
    \([^\s()]*?\([^\s()]+\)[^\s()]*?\)  # balanced parens, one level deep: (…(…)…)
    |
    \([^\s]+?\)  # balanced parens, non-recursive: (…)
  )+
  (?:  # End with:
    \([^\s()]*?\([^\s()]+\)[^\s()]*?\)  # balanced parens, one level deep: (…(…)…)
    |
    \([^\s]+?\)  # balanced parens, non-recursive: (…)
    |  # or
    [^\s`!()\[\]{};:'\".,<>?«»“”‘’]  # not a space or one of these punct chars
  )
  |  # OR, the following to match naked domains:
  (?:
    (?<!@)  # not preceded by a @, avoid matching foo@_gmail.com_
    (?:
      \w+
      (?:[.\-]+\w+)*
      \.  # avoid matching the last two parts of an email domain like co.uk in person@amazon.co.uk
      (?:[a-z]{2,13})  # ending in a TLD of 2-13 letters
      |  # or
      (?:(?:[0-9](?!\d)|[1-9][0-9](?!\d)|1[0-9]{2}(?!\d)|2[0-4][0-9](?!\d)|25[0-5](?!\d))[.]?){4}  # IPv4 address, as seen in https://stackoverflow.com/a/13166657/650558
    )
    \b
    \/?
    (?!@)  # not followed by a @, avoid matching "foo.na" in "foo.na@example.com"
    (?:  # Zero or more:
      [^\s()<>{}\[\]]+  # Run of non-space, non-()<>{}[]
      |  # or
      \([^\s()]*?\([^\s()]+\)[^\s()]*?\)  # balanced parens, one level deep: (…(…)…)
      |
      \([^\s]+?\)  # balanced parens, non-recursive: (…)
    )*
    (?:  # Optionally end with:
      \([^\s()]*?\([^\s()]+\)[^\s()]*?\)  # balanced parens, one level deep: (…(…)…)
      |
      \([^\s]+?\)  # balanced parens, non-recursive: (…)
      |  # or
      [^\s`!()\[\]{};:'\".,<>?«»“”‘’]  # not a space or one of these punct chars
    )?
  )
)

HenkPoley commented Feb 9, 2014

Any source on the 2 - 13 characters limit for the TLD?

stuntbox commented Feb 9, 2014

Theoretically, would 63 be a reasonably future-proof max for the TLD, since that's the upper limit for a label in the DNS? http://en.wikipedia.org/wiki/Domain_Name_System#cite_ref-rfc1034_1-2

HenkPoley commented Feb 10, 2014

Yes, it's 63 bytes (64 with the length byte) for each "label". A label in DNS is basically each of the parts separated by the dots.

www.google.com -> labels = {www, google, com}

The total max size, after punycoding, for the domain is 255 bytes, afaik.

btw, I think the regex I worked on is better ;) :: https://gist.github.com/HenkPoley/8899766
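
The label limits discussed above can be checked mechanically. The following is a minimal sketch, not part of the gist: the function names `split_labels` and `within_dns_limits` are hypothetical, the hostname is assumed to be ASCII already (that is, punycoded), and the 63 and 255 byte limits are taken from the comments above rather than verified against the RFCs here.

```python
def split_labels(hostname):
    """Split a hostname into its DNS labels (the parts between the dots)."""
    return hostname.rstrip(".").split(".")


def within_dns_limits(hostname, max_label_bytes=63, max_total_bytes=255):
    """Return True if every label and the whole name fit the given limits.

    Assumption: hostname is already ASCII (punycoded); non-ASCII input
    raises UnicodeEncodeError instead of being converted.
    """
    encoded = hostname.rstrip(".").encode("ascii")
    labels = encoded.split(b".")
    labels_ok = all(1 <= len(label) <= max_label_bytes for label in labels)
    return labels_ok and len(encoded) <= max_total_bytes


print(split_labels("www.google.com"))
print(within_dns_limits("www.google.com"))
print(within_dns_limits("a" * 64 + ".com"))
```

A check like this could run after the regex has matched, so the TLD quantifier in the pattern can stay short while over-long labels are still rejected.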

mgmort commented May 12, 2014

There is a typo in the single-line version: it is missing the /) from lines 22-23 of the expanded version.
Here is the corrected single-line version:
(?xi)\b((?:https?:(?:/{1,3}|[a-z0-9%])|[a-z0-9.\-]+[.](?:[a-z]{2,13})/)(?:[^\s()<>{}\[\]]+|\([^\s()]*?\([^\s()]+\)[^\s()]*?\)|\([^\s]+?\))+(?:\([^\s()]*?\([^\s()]+\)[^\s()]*?\)|\([^\s]+?\)|[^\s`!()\[\]{};:'".,<>?«»“”‘’])|(?:(?<!@)[a-z0-9]+(?:[.\-][a-z0-9]+)*[.](?:[a-z]{2,13})\b/?(?!@)))

mgmort commented Apr 6, 2016

I just made another change in the second option, for naked domains:
changed line 39 to
(?:\b
changed line 40 to
(?<![@.])
These two changes prevent the double suffixes of email addresses from being identified as URLs (for example, test@google.co.il and person@amazon.co.uk were returning co.il and co.uk as results).
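
To illustrate the lookbehind change proposed above, here is a small sketch in Python. The two patterns are simplified stand-ins for the naked-domain branch, written only for this example; they are not the gist's full regex, so the full regex may behave differently (including the overmatching that can appear once the rest of the pattern is involved).

```python
import re

# Simplified stand-in for the naked-domain branch (an assumption for
# illustration, not the full gist regex).
TAIL = r"\b\w+(?:[.\-]+\w+)*\.[a-z]{2,13}\b(?!@)"

# Variant 1: only refuses a match that starts right after "@".
NOT_AFTER_AT = re.compile(r"(?i)(?<!@)" + TAIL)

# Variant 2: also refuses a match that starts right after ".".
NOT_AFTER_AT_OR_DOT = re.compile(r"(?i)(?<![@.])" + TAIL)

email = "mail person@amazon.co.uk"
plain = "visit www.example.com now"

print(NOT_AFTER_AT.findall(email))         # picks up the trailing "co.uk"
print(NOT_AFTER_AT_OR_DOT.findall(email))  # finds nothing in the email address
print(NOT_AFTER_AT_OR_DOT.findall(plain))  # still finds the naked domain
```

With the first variant the engine fails right after the @, then retries at "co", where the preceding character is a dot rather than an @, so "co.uk" slips through. Adding the dot to the lookbehind closes that retry position.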

Owner Author

winzig commented May 3, 2018

@mgmort I incorporated your bug fix from May 11, thanks. I tried out your proposals from Apr 6, but they result in overmatching, e.g. I was seeing matches on random words like "this" in the middle of sentences.

winzig commented May 16, 2018

UPDATE: Modified so that naked URLs without a protocol prefix can now match more advanced URLs. Also escaped / as \/ and " as \" so it's easier to copy+paste this regex into more code.

winzig commented May 16, 2018

I also just backed out an accidental edit I made 12 days ago that re-ordered the gTLD code. Oops!

winzig commented May 16, 2018

Just made a tweak so that this regex can match IP address URLs, such as http://127.0.0.1/
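
As a rough illustration of how the IPv4 part behaves, the sketch below lifts that sub-pattern out of the regex and tests it on its own in Python. The names `OCTET` and `IPV4` are made up for this example, and using `fullmatch` on an isolated string is an assumption; inside the full regex the sub-pattern is surrounded by other conditions.

```python
import re

# One octet, 0-255; each alternative refuses to stop in the middle of a
# longer run of digits thanks to the (?!\d) lookahead.
OCTET = r"(?:[0-9](?!\d)|[1-9][0-9](?!\d)|1[0-9]{2}(?!\d)|2[0-4][0-9](?!\d)|25[0-5](?!\d))"

# Four octets, each optionally followed by a dot, as in the gist's regex.
IPV4 = re.compile(r"(?:" + OCTET + r"[.]?){4}")

for candidate in ["127.0.0.1", "256.1.1.1", "1.2.3", "1.2.3.4."]:
    print(candidate, bool(IPV4.fullmatch(candidate)))
```

Because every octet must be followed by a non-digit, the optional dot is effectively required between octets. The one visible side effect of `[.]?` is that a trailing dot after the fourth octet is also accepted.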

winzig commented Jul 31, 2018

BREAKING CHANGE

$1 now returns the scheme://, e.g. https:// or http://, and $2 returns the remainder of the URL (everything after the scheme://). Previously $1 returned the entire URL, so if you're swapping this updated regex into your code, you'll want to change references to "$1" into "$1$2".
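
A minimal Python sketch of what the new group layout means for calling code follows. `SIMPLE_URL` is a deliberately simplified, hypothetical stand-in that only mirrors the two-group layout (group 1 is the optional scheme, group 2 is the rest); it is not the gist's regex, and `full_url` and `find_urls` are names invented for this example.

```python
import re

# Hypothetical stand-in with the same group layout as the updated regex:
# group 1 = optional scheme://, group 2 = everything after it.
SIMPLE_URL = re.compile(
    r"(?i)\b(https?:/{1,3})?((?:[\w\-]+\.)+[a-z]{2,13}(?:/[^\s()<>]*)?)"
)


def full_url(match):
    """Rebuild the whole URL, i.e. what "$1$2" gives after the change."""
    # Group 1 is None when the URL has no scheme, so fall back to "".
    return (match.group(1) or "") + match.group(2)


def find_urls(text):
    """Return every matched URL, with its scheme when one was present."""
    return [full_url(m) for m in SIMPLE_URL.finditer(text)]


print(find_urls("see https://example.com/a and example.org today"))
```

Code that only needs the host and path can read group 2 directly, while code that previously used the whole match via $1 needs the concatenation shown in `full_url`.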

Other Improvements

  • Returning the scheme:// separately from the rest of the URL, which can be useful (see above)
  • IPv4 addresses (I'll likely be dead and buried before anyone wants to link an IPv6 URL)
  • Bare (single) hostnames with no periods, e.g. http://hostname/whatever now works
  • Naked domain URLs, e.g. domain.com (this was somewhat working, but now should work better)
  • Internationalized domains, i.e. domains beginning with xn--

rmalouf commented Jan 31, 2019

Line 26 should probably be )*

winzig commented Jun 23, 2019

@rmalouf Can you give me an example of where this fails to identify a URL without that change to line 26? (I hate to make adjustments based on hypotheticals because this dang regex is so complex.)
