Liberal, Accurate Regex Pattern for Matching Web URLs

The regex patterns in this gist are intended only to match web URLs -- http,
https, and naked domains like "example.com". For a pattern that attempts to
match all URLs, regardless of protocol, see: https://gist.github.com/gruber/249502
 
 
# Single-line version:
 
(?i)\b((?:https?:(?:/{1,3}|[a-z0-9%])|[a-z0-9.\-]+[.](?:com|net|org|edu|gov|mil|aero|asia|biz|cat|coop|info|int|jobs|mobi|museum|name|post|pro|tel|travel|xxx|ac|ad|ae|af|ag|ai|al|am|an|ao|aq|ar|as|at|au|aw|ax|az|ba|bb|bd|be|bf|bg|bh|bi|bj|bm|bn|bo|br|bs|bt|bv|bw|by|bz|ca|cc|cd|cf|cg|ch|ci|ck|cl|cm|cn|co|cr|cs|cu|cv|cx|cy|cz|dd|de|dj|dk|dm|do|dz|ec|ee|eg|eh|er|es|et|eu|fi|fj|fk|fm|fo|fr|ga|gb|gd|ge|gf|gg|gh|gi|gl|gm|gn|gp|gq|gr|gs|gt|gu|gw|gy|hk|hm|hn|hr|ht|hu|id|ie|il|im|in|io|iq|ir|is|it|je|jm|jo|jp|ke|kg|kh|ki|km|kn|kp|kr|kw|ky|kz|la|lb|lc|li|lk|lr|ls|lt|lu|lv|ly|ma|mc|md|me|mg|mh|mk|ml|mm|mn|mo|mp|mq|mr|ms|mt|mu|mv|mw|mx|my|mz|na|nc|ne|nf|ng|ni|nl|no|np|nr|nu|nz|om|pa|pe|pf|pg|ph|pk|pl|pm|pn|pr|ps|pt|pw|py|qa|re|ro|rs|ru|rw|sa|sb|sc|sd|se|sg|sh|si|sj|sk|sl|sm|sn|so|sr|ss|st|su|sv|sx|sy|sz|tc|td|tf|tg|th|tj|tk|tl|tm|tn|to|tp|tr|tt|tv|tw|tz|ua|ug|uk|us|uy|uz|va|vc|ve|vg|vi|vn|vu|wf|ws|ye|yt|yu|za|zm|zw)/)(?:[^\s()<>{}\[\]]+|\([^\s()]*?\([^\s()]+\)[^\s()]*?\)|\([^\s]+?\))+(?:\([^\s()]*?\([^\s()]+\)[^\s()]*?\)|\([^\s]+?\)|[^\s`!()\[\]{};:'".,<>?«»“”‘’])|(?:(?<!@)[a-z0-9]+(?:[.\-][a-z0-9]+)*[.](?:com|net|org|edu|gov|mil|aero|asia|biz|cat|coop|info|int|jobs|mobi|museum|name|post|pro|tel|travel|xxx|ac|ad|ae|af|ag|ai|al|am|an|ao|aq|ar|as|at|au|aw|ax|az|ba|bb|bd|be|bf|bg|bh|bi|bj|bm|bn|bo|br|bs|bt|bv|bw|by|bz|ca|cc|cd|cf|cg|ch|ci|ck|cl|cm|cn|co|cr|cs|cu|cv|cx|cy|cz|dd|de|dj|dk|dm|do|dz|ec|ee|eg|eh|er|es|et|eu|fi|fj|fk|fm|fo|fr|ga|gb|gd|ge|gf|gg|gh|gi|gl|gm|gn|gp|gq|gr|gs|gt|gu|gw|gy|hk|hm|hn|hr|ht|hu|id|ie|il|im|in|io|iq|ir|is|it|je|jm|jo|jp|ke|kg|kh|ki|km|kn|kp|kr|kw|ky|kz|la|lb|lc|li|lk|lr|ls|lt|lu|lv|ly|ma|mc|md|me|mg|mh|mk|ml|mm|mn|mo|mp|mq|mr|ms|mt|mu|mv|mw|mx|my|mz|na|nc|ne|nf|ng|ni|nl|no|np|nr|nu|nz|om|pa|pe|pf|pg|ph|pk|pl|pm|pn|pr|ps|pt|pw|py|qa|re|ro|rs|ru|rw|sa|sb|sc|sd|se|sg|sh|si|sj|sk|sl|sm|sn|so|sr|ss|st|su|sv|sx|sy|sz|tc|td|tf|tg|th|tj|tk|tl|tm|tn|to|tp|tr|tt|tv|tw|tz|ua|ug|uk|us|uy|uz|va|vc|ve|vg|vi|vn|vu|wf|ws|ye|yt|yu|za|zm|zw)\b/?(?!@)))
 
 
# Commented multi-line version:
 
(?xi)
\b
( # Capture 1: entire matched URL
(?:
https?: # URL protocol and colon
(?:
/{1,3} # 1-3 slashes
| # or
[a-z0-9%] # Single letter or digit or '%'
# (Trying not to match e.g. "URI::Escape")
)
| # or
# looks like domain name followed by a slash:
[a-z0-9.\-]+[.]
(?:com|net|org|edu|gov|mil|aero|asia|biz|cat|coop|info|int|jobs|mobi|museum|name|post|pro|tel|travel|xxx|ac|ad|ae|af|ag|ai|al|am|an|ao|aq|ar|as|at|au|aw|ax|az|ba|bb|bd|be|bf|bg|bh|bi|bj|bm|bn|bo|br|bs|bt|bv|bw|by|bz|ca|cc|cd|cf|cg|ch|ci|ck|cl|cm|cn|co|cr|cs|cu|cv|cx|cy|cz|dd|de|dj|dk|dm|do|dz|ec|ee|eg|eh|er|es|et|eu|fi|fj|fk|fm|fo|fr|ga|gb|gd|ge|gf|gg|gh|gi|gl|gm|gn|gp|gq|gr|gs|gt|gu|gw|gy|hk|hm|hn|hr|ht|hu|id|ie|il|im|in|io|iq|ir|is|it|je|jm|jo|jp|ke|kg|kh|ki|km|kn|kp|kr|kw|ky|kz|la|lb|lc|li|lk|lr|ls|lt|lu|lv|ly|ma|mc|md|me|mg|mh|mk|ml|mm|mn|mo|mp|mq|mr|ms|mt|mu|mv|mw|mx|my|mz|na|nc|ne|nf|ng|ni|nl|no|np|nr|nu|nz|om|pa|pe|pf|pg|ph|pk|pl|pm|pn|pr|ps|pt|pw|py|qa|re|ro|rs|ru|rw|sa|sb|sc|sd|se|sg|sh|si|sj|sk|sl|sm|sn|so|sr|ss|st|su|sv|sx|sy|sz|tc|td|tf|tg|th|tj|tk|tl|tm|tn|to|tp|tr|tt|tv|tw|tz|ua|ug|uk|us|uy|uz|va|vc|ve|vg|vi|vn|vu|wf|ws|ye|yt|yu|za|zm|zw)
/
)
(?: # One or more:
[^\s()<>{}\[\]]+ # Run of non-space, non-()<>{}[]
| # or
\([^\s()]*?\([^\s()]+\)[^\s()]*?\) # balanced parens, one level deep: (…(…)…)
|
\([^\s]+?\) # balanced parens, non-recursive: (…)
)+
(?: # End with:
\([^\s()]*?\([^\s()]+\)[^\s()]*?\) # balanced parens, one level deep: (…(…)…)
|
\([^\s]+?\) # balanced parens, non-recursive: (…)
| # or
[^\s`!()\[\]{};:'".,<>?«»“”‘’] # not a space or one of these punct chars
)
| # OR, the following to match naked domains:
(?:
(?<!@) # not preceded by a @, avoid matching foo@_gmail.com_
[a-z0-9]+
(?:[.\-][a-z0-9]+)*
[.]
(?:com|net|org|edu|gov|mil|aero|asia|biz|cat|coop|info|int|jobs|mobi|museum|name|post|pro|tel|travel|xxx|ac|ad|ae|af|ag|ai|al|am|an|ao|aq|ar|as|at|au|aw|ax|az|ba|bb|bd|be|bf|bg|bh|bi|bj|bm|bn|bo|br|bs|bt|bv|bw|by|bz|ca|cc|cd|cf|cg|ch|ci|ck|cl|cm|cn|co|cr|cs|cu|cv|cx|cy|cz|dd|de|dj|dk|dm|do|dz|ec|ee|eg|eh|er|es|et|eu|fi|fj|fk|fm|fo|fr|ga|gb|gd|ge|gf|gg|gh|gi|gl|gm|gn|gp|gq|gr|gs|gt|gu|gw|gy|hk|hm|hn|hr|ht|hu|id|ie|il|im|in|io|iq|ir|is|it|je|jm|jo|jp|ke|kg|kh|ki|km|kn|kp|kr|kw|ky|kz|la|lb|lc|li|lk|lr|ls|lt|lu|lv|ly|ma|mc|md|me|mg|mh|mk|ml|mm|mn|mo|mp|mq|mr|ms|mt|mu|mv|mw|mx|my|mz|na|nc|ne|nf|ng|ni|nl|no|np|nr|nu|nz|om|pa|pe|pf|pg|ph|pk|pl|pm|pn|pr|ps|pt|pw|py|qa|re|ro|rs|ru|rw|sa|sb|sc|sd|se|sg|sh|si|sj|sk|sl|sm|sn|so|sr|ss|st|su|sv|sx|sy|sz|tc|td|tf|tg|th|tj|tk|tl|tm|tn|to|tp|tr|tt|tv|tw|tz|ua|ug|uk|us|uy|uz|va|vc|ve|vg|vi|vn|vu|wf|ws|ye|yt|yu|za|zm|zw)
\b
/?
(?!@) # not succeeded by a @, avoid matching "foo.na" in "foo.na@example.com"
)
)
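
For anyone wanting to try the pattern quickly, here's a minimal usage sketch in JavaScript (my addition, not part of the gist): JavaScript doesn't understand the inline (?i) flag, so strip it and pass i as a flag instead, and the (?<!@) lookbehinds need an engine with lookbehind support (ES2018+).

// Minimal usage sketch. Assumes an engine with lookbehind support and
// the single-line pattern pasted in with its leading "(?i)" removed
// (double every backslash if you store it in a string literal).
var pattern = '...'; // paste the single-line pattern here, minus "(?i)"
var urlRegex = new RegExp(pattern, 'gi');

var text = 'See https://example.com/foo (or example.org) for details.';
console.log(text.match(urlRegex));
// expected, per the pattern's intent: ["https://example.com/foo", "example.org"]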

@gruber thanks for sharing. You might need to add 1400 new gTLDs to line 24? ;-(

http://www.101domain.com/new_gtld_extensions.htm

And line 44. Or it might be at the point where you just give up and look for this instead:

(?:[a-z]{2,13})

I.e. 2 to 13 letters (the longest gTLD I found was .information, so I went with 13 on the high end).

And [a-z] may not be enough. Consider the new gTLD .移动, which is Chinese for "mobile." Hmmm...

Looking forward to winzig’s improvements… hint*hint

Aren't all domain names ultimately encoded as ASCII?

e.g. punycode: https://en.wikipedia.org/wiki/Punycode

.移动 --> .xn--6frz82g
.游戏 --> .xn--unup4y
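
(For what it's worth, Node's url module can do this conversion; a quick sketch, assuming a Node build with ICU support:)

// Sketch: converting an IDN label to its ASCII (punycode) form with
// Node's built-in url module.
var domainToASCII = require('url').domainToASCII;
console.log(domainToASCII('移动')); // 'xn--6frz82g'
console.log(domainToASCII('游戏')); // 'xn--unup4y'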

A list of 'all' TLDs: https://data.iana.org/TLD/tlds-alpha-by-domain.txt
ICANN notes: "maintained by the IANA and is updated from time to time" (ref: http://www.icann.org/en/resources/registries/tlds )

Personally I wouldn't encode a changing list into a fixed regex.
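
One way to avoid that (a sketch of my own, assuming Node 18+ for the global fetch): build the TLD alternation from IANA's published list at startup instead of freezing it into the pattern.

// Sketch: fetch IANA's TLD list and build the alternation at runtime.
// Entries are plain letters/digits/dashes, so no regex escaping needed.
async function buildTldAlternation() {
  var res = await fetch('https://data.iana.org/TLD/tlds-alpha-by-domain.txt');
  var lines = (await res.text()).split('\n');
  var tlds = lines
    .map(function (line) { return line.trim().toLowerCase(); })
    .filter(function (line) { return line && line.charAt(0) !== '#'; }); // drop header + blanks
  return '(?:' + tlds.join('|') + ')'; // splice into the pattern above
}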

A Chinese (or other-language) forum post, RSS feed, or blog post is not going to have users who type the ASCII representation of a Chinese domain name! It's difficult to see a use case for this.

@gruber @mcritz Well, here's my fork with the updates I mentioned:

https://gist.github.com/winzig/8894715

I haven't tackled handling Unicode TLDs, but given that John has not yet attempted matching them in the domain-name-matching part itself, I don't feel too bad. :-)

Another issue with supporting Unicode domains: I think you'd have to make a decision that ties this regex to being either PCRE-compliant (using \x{1234} Unicode escapes) or JavaScript-compliant (using \u1234-style escapes). I don't think there's currently a way to specify Unicode characters that is acceptable everywhere...
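
For illustration, here's the 移动 example from above under each syntax (移 is U+79FB, 动 is U+52A8):

\x{79FB}\x{52A8}   # PCRE-style Unicode escapes
\u79FB\u52A8       # JavaScript-style Unicode escapes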

Someone looked deeper into the problem a while back: http://mathiasbynens.be/demo/url-regex

Also check the link to the gist over there: https://gist.github.com/dperini/729294

It fails a couple of (arguably fishy) URLs:
http://nic.xn--unup4y <-- valid equivalent of http://nic.游戏
http://xn--h32b13vza.xn--3e0b707e/ <-- equiv: http://이메일.한국/
https://localhost/
http://2915201185/search?q=hello

The test.html harness over at https://gist.github.com/michaelpigg/4001961 needs a meta charset="utf-8" tag inside the head.

I added support for punycode in the domain to dperini's regex here: https://gist.github.com/HenkPoley/8899766
(dperini's version already supported IDNs / UTF-8 domains)

Basically all the domain parts check that xn-- is followed by one or more numbers, letters or dashes.
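
In other words, each domain-label pattern gains an alternative along these lines (my paraphrase of the change, not the exact diff):

(?:xn--[a-z0-9\-]+|[a-z0-9]+)   # punycode label, or a plain ASCII label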

I have no clue about arcane UTF encodings of the resource path.

@HenkPoley in what world do users of URLs actually type the "xn--*" version of a URL? I'm having trouble trying to work out what you are not understanding here; please help. It sounds like you expect that Chinese people sit around memorising the ASCII version of their favourite Chinese website?

@corydoras you convert the URL to punycode internally before running it through the validator.

@jmesterh To convert a URL to punycode, don't you first need a regular expression to find the URL? So you still need a regexp that can match Unicode domain names. :D And at this point, if you have a regexp to find the URL, there is no need to convert it to punycode anyway, right?

@corydoras – So convert the whole text into Punycode, process it however you wish (i.e. replace URLs with anchors), convert it back.

@millimoose Not only is that inefficient, but it means that by default, sites, projects, and systems that use this regular expression don't support non-American users.

@corydoras

I call bullshit on "inefficient". You're running a nontrivial regex on the whole body of the text already – the O(1) ship has sailed, and my gut tells me the O(n) one has as well. (Pro tip: whenever you say "inefficient" without mentioning O-notation, you're probably wrong.) Besides, it's a straightforward fix for the situation, as opposed to devising a Unicode-based one from scratch. (And good luck with how your RE engine handles Unicode in that case; that's besides the issue that @winzig points out, about Unicode escapes for string literals being unportable.)

I call bullshit on "non-American". I'm Slovak, i.e. neither American nor born or residing in a country that has English as its primary language. I have yet to see a site with an IDN – by now it'd probably be confusing to users here that they're supposed to use diacritics in a URL. Just to drive my point home, not even Yandex, the major Russian search engine, owns Яндекс.com or .ru.

I have no idea what the situation is elsewhere – say, in the CJK region – but I'd wager that in countries that use essentially the Latin alphabet, or an alphabet that is easily transliterated to Latin, IDN support is a nonissue. Those countries are a non-negligible target market, and the connotation of ignorant cultural imperialism I assume you were going for vanishes.

You have to do a whole goddamn lot of work besides "using a different regex for URLs" to fully support non-Latin users. The sites that choose not to do all that work don't support foreign users anyway. And, given what I've said above about the size of that target market, that choice might make perfect sense. It doesn't necessarily have to be about dismissing a culture so much as about a cost/benefit decision. To provide an example: Stack Overflow is a deliberately English-only site, to reduce community fragmentation. (I.e. it's better for everyone if they participate in English, possibly somewhat broken English, rather than having foreign-language content inaccessible to the majority of its users.) Why would they ever bother supporting other languages?

All the while, this regex, with the modifications proposed in the comments, is a useful tool in that process. For some reason, you're arguing with the people making those constructive proposals. (And, as I've noticed, without making any yourself.) What, exactly, are you trying to accomplish here?

I have not run the code, but it looks like there's a small typo on lines 21 and 44, in the OR statements of two-letter country codes.

..si|sj| Ja|sk..

I don't know if Ja is actually supposed to be capitalized, but it appears that way (capitalized) in the single-line version on line 2 as well, though without the preceding whitespace character.

@millimoose What am I trying to accomplish? Well, at some point the world is going to have to make the transition to natively supporting multiple character sets. It's attitudes like yours that hold up this sort of progress. Sure, there are all of the types of problems you suggest, but they are problems that we should seek to resolve. Of course there are English-only sites, and sites that avoid resolving the problem for business reasons. That's not an excuse not to aim to support the best and widest range of character sets.

@corydoras – you're not answering the question I asked. To phrase it in a single sentence: What are you trying to accomplish arguing with the people trying to make this RE useful in supporting IDNs?

Also: "the world is going to have to make the transition", "hold up […] progress", "we should seek to resolve", "that's not an excuse"? By this point it seems you're arguing for the sake of arguing, and veering off-topic. The reason nobody is resolving the problem you mention is because they don't have that problem, or have to make any transition – you do / want them to. How about you a) give a solid economical argument for supporting IDNs in my code; b) give a solid technical issue with the regex or the solutions people gave you; or c) make a constructive contribution?

You were given the proposal to take the above regex, change it to allow punycode escape sequences as well as ASCII, and use the encode-process-decode approach to find URLs. Do you have an actual use case where this fails? Can you give some example code that reproduces the problem? Basically, can we go on to talking about real problems people actually using the above RE are having, instead of preaching and hypotheticals?

wonder what @slevithan could do with this :wink:

Regarding the part commented "# not preceded by a @, avoid matching foo@_gmail.com_": how can we make this work in JavaScript, which doesn't have negative-lookbehind syntax?
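
A common workaround (a sketch; the URL pattern here is deliberately simplified and hypothetical, not the full pattern above) is to consume one extra leading character in a group instead of looking behind, then read the URL out of the second group:

// Emulate (?<!@) without lookbehind: also match the preceding character
// (or start of string) and keep it out of the URL group.
var re = /(^|[^@\w])((?:https?:\/\/)?[a-z0-9.-]+\.[a-z]{2,13}\b[^\s]*)/gi;
var text = 'visit example.com but not foo@gmail.com';
var m, urls = [];
while ((m = re.exec(text)) !== null) {
  urls.push(m[2]); // group 1 holds the guard character, group 2 the URL
}
console.log(urls); // ["example.com"]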

Perhaps I am missing something, but is the closed set of valid TLDs a good idea? TLDs can now be bought like normal domain names. Will this regex work in 3-5 years when new TLDs emerge, or will it need to be updated?

I'm not massively familiar with regex, so when the above didn't work in Ruby, I discovered I needed to use the following syntax for it to work.

reg = %r{YourSingleLineRegCutAndPasted}

This was to allow the use of / within the regex itself.


Note that this regex may be vulnerable to a denial of service attack if used on untrusted input with an NFA engine. Ref: http://en.wikipedia.org/wiki/ReDoS

I can't really recommend using a Regex for this.

Try something like golang's net/url Parse: http://golang.org/pkg/net/url/#Parse
http://golang.org/src/pkg/net/url/url.go?s=8497:8544#L323
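
For what it's worth, the same parse-rather-than-pattern idea exists in JavaScript via the URL constructor (a sketch; note that it validates a single candidate string, it doesn't find URLs inside free-form text):

// Sketch: validate by parsing instead of pattern matching.
function isWebUrl(s) {
  try {
    var u = new URL(s);
    return u.protocol === 'http:' || u.protocol === 'https:';
  } catch (e) {
    return false; // URL() throws on anything it can't parse
  }
}
console.log(isWebUrl('https://example.com/foo')); // true
console.log(isWebUrl('example.com'));             // false: no scheme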

@fujin: "The problem the pattern attempts to solve: identify the URLs in an arbitrary string of text, where by “arbitrary” let’s agree we mean something unstructured such as an email message or a tweet."

The regex gets stuck in an infinite loop (JavaScript) when you have many trailing dots:

var text = 'http://www.google.com............................................';
regex.exec(text); // stuck in infinite loop

Here is a simple JavaScript regex I tested on several variants listed here. I'm still not up to speed with Go's patterns, so I tried to keep this simple and not use any lookaheads or anything.

(([a-z]{3,6}://)|(^|\s))([a-zA-Z0-9\-]+\.)+[a-z]{2,13}[\.\?\=\&\%\/\w\-]*\b([^@]|$)
domain.com
www.domain.com
thisisareallylongdomainnamewithunder62parts.co
node-1.www4.example.com.jp
something .com
email@domain.com
@example.com
user.name@example.com
http://domain.com
ftp://foo.1.example.com.uk
url::ecode.foo()
hi there John.Bob
example.com?foo=bar
example.com/foo/bar?baz=true&something=%20alsotrue

I wrote it assuming the m (multiline) modifier, for a list like this that ends with a newline. You can fiddle with it yourself.

It's not actually an infinite loop with lots of trailing dots; it just takes twice as long to run for each additional dot (tested in Python). Exclamation marks, commas, and semicolons also cause the same problem.
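
You can watch the doubling with a toy pattern that has the same nested-quantifier shape (my sketch, deliberately not the full URL regex):

// Toy catastrophic-backtracking demo: each extra 'a' roughly doubles
// the time the match takes to fail.
function timeIt(n) {
  var s = new Array(n + 1).join('a') + 'b'; // n a's then a 'b', so it can never match
  var t0 = Date.now();
  /^(a+)+$/.test(s); // nested quantifiers force exponential retries
  return Date.now() - t0;
}
for (var n = 18; n <= 26; n += 2) {
  console.log(n + ' chars: ' + timeIt(n) + ' ms');
}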

It gets stuck in Ruby as well.

This one works for me for 99% of cases, which is what I needed:

((?<=[^a-zA-Z0-9])(?:https?\:\/\/|[a-zA-Z0-9]{1,}\.{1}|\b)(?:\w{1,}\.{1}){1,5}(?:com|org|edu|gov|uk|net|ca|de|jp|fr|au|us|ru|ch|it|nl|se|no|es|mil|iq|io|ac|ly|sm){1}(?:\/[a-zA-Z0-9]{1,})*)

https://regex101.com/r/fO6mX3/2

Can't seem to get this to work in PHP, even using a <<<EOD ... EOD; heredoc to load the pattern into the variable for preg; it keeps getting hung up on an unknown modifier '\'.

Edit: had to use a delimiter that was not a part of the string at all:
$webpattern = <<<EOD
~pattern here~
EOD;
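// '~' works as the preg delimiter because it never occurs in the pattern,
// unlike '/', which the pattern is full of. Then ($text being your input):
preg_match_all($webpattern, $text, $matches);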
