Liberal, Accurate Regex Pattern for Matching Web URLs
The regex patterns in this gist are intended only to match web URLs -- http,
https, and naked domains like "example.com". For a pattern that attempts to
match all URLs, regardless of protocol, see: https://gist.github.com/gruber/249502
# Single-line version:
(?i)\b((?:https?:(?:/{1,3}|[a-z0-9%])|[a-z0-9.\-]+[.](?:com|net|org|edu|gov|mil|aero|asia|biz|cat|coop|info|int|jobs|mobi|museum|name|post|pro|tel|travel|xxx|ac|ad|ae|af|ag|ai|al|am|an|ao|aq|ar|as|at|au|aw|ax|az|ba|bb|bd|be|bf|bg|bh|bi|bj|bm|bn|bo|br|bs|bt|bv|bw|by|bz|ca|cc|cd|cf|cg|ch|ci|ck|cl|cm|cn|co|cr|cs|cu|cv|cx|cy|cz|dd|de|dj|dk|dm|do|dz|ec|ee|eg|eh|er|es|et|eu|fi|fj|fk|fm|fo|fr|ga|gb|gd|ge|gf|gg|gh|gi|gl|gm|gn|gp|gq|gr|gs|gt|gu|gw|gy|hk|hm|hn|hr|ht|hu|id|ie|il|im|in|io|iq|ir|is|it|je|jm|jo|jp|ke|kg|kh|ki|km|kn|kp|kr|kw|ky|kz|la|lb|lc|li|lk|lr|ls|lt|lu|lv|ly|ma|mc|md|me|mg|mh|mk|ml|mm|mn|mo|mp|mq|mr|ms|mt|mu|mv|mw|mx|my|mz|na|nc|ne|nf|ng|ni|nl|no|np|nr|nu|nz|om|pa|pe|pf|pg|ph|pk|pl|pm|pn|pr|ps|pt|pw|py|qa|re|ro|rs|ru|rw|sa|sb|sc|sd|se|sg|sh|si|sj|Ja|sk|sl|sm|sn|so|sr|ss|st|su|sv|sx|sy|sz|tc|td|tf|tg|th|tj|tk|tl|tm|tn|to|tp|tr|tt|tv|tw|tz|ua|ug|uk|us|uy|uz|va|vc|ve|vg|vi|vn|vu|wf|ws|ye|yt|yu|za|zm|zw)/)(?:[^\s()<>{}\[\]]+|\([^\s()]*?\([^\s()]+\)[^\s()]*?\)|\([^\s]+?\))+(?:\([^\s()]*?\([^\s()]+\)[^\s()]*?\)|\([^\s]+?\)|[^\s`!()\[\]{};:'".,<>?«»“”‘’])|(?:(?<!@)[a-z0-9]+(?:[.\-][a-z0-9]+)*[.](?:com|net|org|edu|gov|mil|aero|asia|biz|cat|coop|info|int|jobs|mobi|museum|name|post|pro|tel|travel|xxx|ac|ad|ae|af|ag|ai|al|am|an|ao|aq|ar|as|at|au|aw|ax|az|ba|bb|bd|be|bf|bg|bh|bi|bj|bm|bn|bo|br|bs|bt|bv|bw|by|bz|ca|cc|cd|cf|cg|ch|ci|ck|cl|cm|cn|co|cr|cs|cu|cv|cx|cy|cz|dd|de|dj|dk|dm|do|dz|ec|ee|eg|eh|er|es|et|eu|fi|fj|fk|fm|fo|fr|ga|gb|gd|ge|gf|gg|gh|gi|gl|gm|gn|gp|gq|gr|gs|gt|gu|gw|gy|hk|hm|hn|hr|ht|hu|id|ie|il|im|in|io|iq|ir|is|it|je|jm|jo|jp|ke|kg|kh|ki|km|kn|kp|kr|kw|ky|kz|la|lb|lc|li|lk|lr|ls|lt|lu|lv|ly|ma|mc|md|me|mg|mh|mk|ml|mm|mn|mo|mp|mq|mr|ms|mt|mu|mv|mw|mx|my|mz|na|nc|ne|nf|ng|ni|nl|no|np|nr|nu|nz|om|pa|pe|pf|pg|ph|pk|pl|pm|pn|pr|ps|pt|pw|py|qa|re|ro|rs|ru|rw|sa|sb|sc|sd|se|sg|sh|si|sj|Ja|sk|sl|sm|sn|so|sr|ss|st|su|sv|sx|sy|sz|tc|td|tf|tg|th|tj|tk|tl|tm|tn|to|tp|tr|tt|tv|tw|tz|ua|ug|uk|us|uy|uz|va|vc|ve|vg|vi|vn|vu|wf|ws|ye|yt|yu|za|zm|zw)\b/?(?!@)))
# Commented multi-line version:
(?xi)
\b
( # Capture 1: entire matched URL
(?:
https?: # URL protocol and colon
(?:
/{1,3} # 1-3 slashes
| # or
[a-z0-9%] # Single letter or digit or '%'
# (Trying not to match e.g. "URI::Escape")
)
| # or
# looks like domain name followed by a slash:
[a-z0-9.\-]+[.]
(?:com|net|org|edu|gov|mil|aero|asia|biz|cat|coop|info|int|jobs|mobi|museum|name|post|pro|tel|travel|xxx|ac|ad|ae|af|ag|ai|al|am|an|ao|aq|ar|as|at|au|aw|ax|az|ba|bb|bd|be|bf|bg|bh|bi|bj|bm|bn|bo|br|bs|bt|bv|bw|by|bz|ca|cc|cd|cf|cg|ch|ci|ck|cl|cm|cn|co|cr|cs|cu|cv|cx|cy|cz|dd|de|dj|dk|dm|do|dz|ec|ee|eg|eh|er|es|et|eu|fi|fj|fk|fm|fo|fr|ga|gb|gd|ge|gf|gg|gh|gi|gl|gm|gn|gp|gq|gr|gs|gt|gu|gw|gy|hk|hm|hn|hr|ht|hu|id|ie|il|im|in|io|iq|ir|is|it|je|jm|jo|jp|ke|kg|kh|ki|km|kn|kp|kr|kw|ky|kz|la|lb|lc|li|lk|lr|ls|lt|lu|lv|ly|ma|mc|md|me|mg|mh|mk|ml|mm|mn|mo|mp|mq|mr|ms|mt|mu|mv|mw|mx|my|mz|na|nc|ne|nf|ng|ni|nl|no|np|nr|nu|nz|om|pa|pe|pf|pg|ph|pk|pl|pm|pn|pr|ps|pt|pw|py|qa|re|ro|rs|ru|rw|sa|sb|sc|sd|se|sg|sh|si|sj| Ja|sk|sl|sm|sn|so|sr|ss|st|su|sv|sx|sy|sz|tc|td|tf|tg|th|tj|tk|tl|tm|tn|to|tp|tr|tt|tv|tw|tz|ua|ug|uk|us|uy|uz|va|vc|ve|vg|vi|vn|vu|wf|ws|ye|yt|yu|za|zm|zw)
/
)
(?: # One or more:
[^\s()<>{}\[\]]+ # Run of non-space, non-()<>{}[]
| # or
\([^\s()]*?\([^\s()]+\)[^\s()]*?\) # balanced parens, one level deep: (…(…)…)
|
\([^\s]+?\) # balanced parens, non-recursive: (…)
)+
(?: # End with:
\([^\s()]*?\([^\s()]+\)[^\s()]*?\) # balanced parens, one level deep: (…(…)…)
|
\([^\s]+?\) # balanced parens, non-recursive: (…)
| # or
[^\s`!()\[\]{};:'".,<>?«»“”‘’] # not a space or one of these punct chars
)
| # OR, the following to match naked domains:
(?:
(?<!@) # not preceded by a @, avoid matching foo@_gmail.com_
[a-z0-9]+
(?:[.\-][a-z0-9]+)*
[.]
(?:com|net|org|edu|gov|mil|aero|asia|biz|cat|coop|info|int|jobs|mobi|museum|name|post|pro|tel|travel|xxx|ac|ad|ae|af|ag|ai|al|am|an|ao|aq|ar|as|at|au|aw|ax|az|ba|bb|bd|be|bf|bg|bh|bi|bj|bm|bn|bo|br|bs|bt|bv|bw|by|bz|ca|cc|cd|cf|cg|ch|ci|ck|cl|cm|cn|co|cr|cs|cu|cv|cx|cy|cz|dd|de|dj|dk|dm|do|dz|ec|ee|eg|eh|er|es|et|eu|fi|fj|fk|fm|fo|fr|ga|gb|gd|ge|gf|gg|gh|gi|gl|gm|gn|gp|gq|gr|gs|gt|gu|gw|gy|hk|hm|hn|hr|ht|hu|id|ie|il|im|in|io|iq|ir|is|it|je|jm|jo|jp|ke|kg|kh|ki|km|kn|kp|kr|kw|ky|kz|la|lb|lc|li|lk|lr|ls|lt|lu|lv|ly|ma|mc|md|me|mg|mh|mk|ml|mm|mn|mo|mp|mq|mr|ms|mt|mu|mv|mw|mx|my|mz|na|nc|ne|nf|ng|ni|nl|no|np|nr|nu|nz|om|pa|pe|pf|pg|ph|pk|pl|pm|pn|pr|ps|pt|pw|py|qa|re|ro|rs|ru|rw|sa|sb|sc|sd|se|sg|sh|si|sj| Ja|sk|sl|sm|sn|so|sr|ss|st|su|sv|sx|sy|sz|tc|td|tf|tg|th|tj|tk|tl|tm|tn|to|tp|tr|tt|tv|tw|tz|ua|ug|uk|us|uy|uz|va|vc|ve|vg|vi|vn|vu|wf|ws|ye|yt|yu|za|zm|zw)
\b
/?
(?!@) # not succeeded by a @, avoid matching "foo.na" in "foo.na@example.com"
)
)
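
For reference, a minimal sketch of using the pattern from Python. The TLD alternation below is abbreviated to four gTLDs purely for readability; paste the full single-line pattern above in real use.

import re

# Abbreviated sketch of the pattern above -- same structure, tiny TLD list.
URL_RE = re.compile(r"""(?xi)
\b
(
  (?:
    https?:                                     # URL protocol and colon
    (?: /{1,3} | [a-z0-9%] )                    # slashes, or a bare char
    |
    [a-z0-9.\-]+ [.] (?:com|net|org|edu) /      # or domain name plus slash
  )
  [^\s()<>{}\[\]]+                              # then a run of URL characters
  |
  (?<!@)                                        # or a naked domain, not after @
  [a-z0-9]+ (?:[.\-][a-z0-9]+)* [.] (?:com|net|org|edu) \b /? (?!@)
)
""")

text = "Read http://daringfireball.net/2010/07 and example.com today."
for m in URL_RE.finditer(text):
    print(m.group(1))
# http://daringfireball.net/2010/07
# example.com
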
winzig commented Feb 8, 2014

@gruber thanks for sharing. You might need to add 1400 new gTLDs to line 24? ;-(

http://www.101domain.com/new_gtld_extensions.htm

winzig commented Feb 8, 2014

And line 44. Or it might be at the point where you just give up and look for this instead:

(?:[a-z]{2,13})

I.e. 2 to 13 letters (the longest gTLD I found was .information, so I went with 13 on the high end).
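
A sketch of that relaxed-TLD idea in Python, using a deliberately simplified naked-domain matcher rather than the full pattern above:

import re

# Instead of enumerating TLDs, accept any run of 2-13 ASCII letters
# after the final dot (simplified: no protocol or path handling).
naked_domain_re = re.compile(
    r"(?i)\b(?<!@)[a-z0-9]+(?:[.\-][a-z0-9]+)*\.[a-z]{2,13}\b(?!@)"
)

text = "See example.com and docs.example.technology, but not foo@gmail.com."
print(naked_domain_re.findall(text))
# ['example.com', 'docs.example.technology']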

winzig commented Feb 8, 2014

And [a-z] may not be enough. Consider the new gTLD .移动, which is Chinese for "mobile." Hmmm...

mcritz commented Feb 9, 2014

Looking forward to winzig’s improvements… _hint_hint*

HenkPoley commented Feb 9, 2014

Aren't all domain names ultimately encoded as ASCII?

e.g. punycode: https://en.wikipedia.org/wiki/Punycode

.移动 --> .xn--6frz82g
.游戏 --> .xn--unup4y

A list of 'all' TLDs: https://data.iana.org/TLD/tlds-alpha-by-domain.txt
ICANN notes: "maintained by the IANA and is updated from time to time" (ref: http://www.icann.org/en/resources/registries/tlds )

Personally I wouldn't encode a changing list into a fixed regex.
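
In Python, for instance, the built-in "idna" codec performs exactly this mapping (a one-line sketch; IDNA 2003 semantics):

# Each IDN label maps deterministically to an ASCII "xn--" form.
print("移动".encode("idna").decode("ascii"))  # xn--6frz82g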


ghost commented Feb 9, 2014

A Chinese or other-language forum post, RSS feed, or blog post is not going to have users who type the ASCII representation of a Chinese domain name! It's difficult to see a use case for this.

winzig commented Feb 9, 2014

@gruber @mcritz Well, here's my fork with the updates I mentioned:

https://gist.github.com/winzig/8894715

I haven't tackled handling Unicode TLDs, but given that John has not yet attempted matching them in the domain-name-matching part itself, I don't feel too bad. :-)

winzig commented Feb 9, 2014

Another issue with supporting Unicode domains: I think you'd have to make a decision that would tie this regex to being either PCRE-compliant (using \x{1234} Unicode characters) or JavaScript-compliant, with \u1234-style Unicode. I don't think there's currently a way to specify Unicode characters that is globally acceptable...

HenkPoley commented Feb 9, 2014

Someone has looked deeper into the problem a while back: http://mathiasbynens.be/demo/url-regex

Also check the link to the gist over there: https://gist.github.com/dperini/729294

It fails a couple of (arguably fishy) URLs:
http://nic.xn--unup4y <-- valid equivalent of http://nic.游戏
http://xn--h32b13vza.xn--3e0b707e/ <-- equiv: http://이메일.한국/
https://localhost/
http://2915201185/search?q=hello

The test.html harness over at https://gist.github.com/michaelpigg/4001961 needs a <meta charset="utf-8"> inside the <head>.


HenkPoley commented Feb 9, 2014

I added support for punycode in the domain to dperini's regex here: https://gist.github.com/HenkPoley/8899766
(dperini's version already supported IDNs / UTF-8 domains)

Basically all the domain parts check that xn-- is followed by one or more numbers, letters or dashes.

I have no clue about arcane UTF encodings of the resource path.
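
For illustration, a label pattern along those lines (a Python sketch, not dperini's actual regex):

import re

# A domain label is either a Punycode ("xn--") label or a plain ASCII label.
label = r"(?:xn--[a-z0-9-]+|[a-z0-9](?:[a-z0-9-]*[a-z0-9])?)"
domain_re = re.compile(rf"(?i)\b(?:{label}\.)+{label}\b")

print(domain_re.search("visit nic.xn--unup4y for details").group())
# nic.xn--unup4y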


ghost commented Feb 9, 2014

@HenkPoley in what world do users of URLs actually type the "xn--*" version of a URL? I'm having trouble trying to work out what you are not understanding here, please help. It sounds like you expect that Chinese people sit around memorising the ASCII version of their favourite Chinese website?

jmesterh commented Feb 9, 2014

@corydoras you convert the URL to punycode internally before running it through the validator.
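
A sketch of that step in Python (the helper name is hypothetical; port and userinfo handling are omitted for brevity):

from urllib.parse import urlsplit, urlunsplit

# Normalize an IDN host to its Punycode form so an ASCII-only
# validator can check it. (Hypothetical helper; ignores port/userinfo.)
def to_ascii_url(url: str) -> str:
    parts = urlsplit(url)
    host = parts.hostname.encode("idna").decode("ascii")
    return urlunsplit(parts._replace(netloc=host))

print(to_ascii_url("http://nic.游戏/path"))  # http://nic.xn--unup4y/path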

ghost commented Feb 9, 2014

@jmesterh To convert a URL to punycode, don't you first need a regular expression to find the URL? So you still need a regexp that can match Unicode domain names. :D And at that point, if you have a regexp to find the URL, there is no need to convert it to punycode anyway, right?

millimoose commented Feb 9, 2014

@corydoras – So convert the whole text into Punycode, process it however you wish (e.g. replace URLs with anchors), convert it back.


ghost commented Feb 9, 2014

@millimoose Not only is that inefficient, but it means that by default, sites, projects, systems, that use this regular expression don't support non-American users.

millimoose commented Feb 10, 2014

@corydoras

I call bullshit on "inefficient". You're running a nontrivial regex on the whole body of the text already – the O(1) ship has sailed, and my gut tells me the O(n) one has as well. (Pro tip: whenever you say "inefficient" without mentioning O-notation, you're probably wrong.) Besides, it's a straightforward fix for the situation, as opposed to devising a Unicode-based one from scratch. (And good luck with how your RE engine handles Unicode in that case; that's beside the issue that @winzig points out, about Unicode escapes for string literals being unportable.)

I call bullshit on "non-American". I'm Slovak, i.e. not American or born or residing in a country that has English as its primary language. I have yet to see a site with an IDN – by now it'd probably be confusing to users here that they're supposed to use diacritics in an URL. Just to drive my point home, not even Yandex, the major Russian search engine, owns Яндекс.com or .ru.

I have no idea what the situation is elsewhere – say, in the CJK region – but I'd wager that in countries that use essentially the Latin alphabet, or an alphabet that is easily transliterated to Latin, IDN support is a nonissue. Those countries are a non-negligible target-market, and the connotation of ignorant cultural imperialism I assume you were going for vanishes.

You have to do a whole goddamn lot of work besides "using a different regex for URLs" to fully support non-Latin users. The sites that don't choose to do all that work don't support foreign users anyway. And, given what I've said above about the size of that target market, that choice might make perfect sense. It doesn't necessarily have to be about dismissing a culture, as much as about a cost/benefit decision. To provide an example: Stack Overflow is a deliberately English-only site, to reduce community fragmentation. (I.e. it's better for everyone if they participate in English, possibly somewhat broken English, rather than having foreign-language content inaccessible to the majority of its users.) Why would they ever bother supporting other languages?

All the while, this RE with the modifications proposed in the comments is a useful tool in that process. For some reason, you're arguing with the people making those constructive proposals. (And as I've noticed, without making any yourself.) What, exactly, are you trying to accomplish here?


mattduffy commented Feb 10, 2014

I have not run the code, but it looks like a small typo on lines 21 and 44, in the OR statements of two-letter country codes.

..si|sj| Ja|sk..

I don't know if Ja is actually supposed to be capitalized, but it appears that way (capitalized) in the single-line version on line 2, but without the preceding whitespace character.


ghost commented Feb 10, 2014

@millimoose What am I trying to accomplish? Well, at some point the world is going to have to make the transition to natively supporting multiple character sets. It's attitudes like yours that hold up this sort of progress. Sure, there are all of the types of problems you suggest, but they are problems that we should seek to resolve. Of course there are English-only sites, and sites that avoid resolving the problem for business reasons. That's not an excuse not to aim to support the best and widest range of character sets.

millimoose commented Feb 10, 2014

@corydoras – you're not answering the question I asked. To phrase it in a single sentence: What are you trying to accomplish arguing with the people trying to make this RE useful in supporting IDNs?

Also: "the world is going to have to make the transition", "hold up […] progress", "we should seek to resolve", "that's not an excuse"? By this point it seems you're arguing for the sake of arguing, and veering off-topic. The reason nobody is resolving the problem you mention is because they don't have that problem, or have to make any transition – you do / want them to. How about you a) give a solid economic argument for supporting IDNs in my code; b) give a solid technical issue with the regex or the solutions people gave you; or c) make a constructive contribution?

You were given the proposal to take the above regex, change it to allow punycode escape sequences as well as ASCII, and use the encode-process-decode approach to find URLs. Do you have an actual use case where this fails? Can you give some example code that reproduces the problem? Basically, can we go on to talking about real problems people actually using the above RE are having, instead of preaching and hypotheticals?


mikaelkaron commented Feb 11, 2014

wonder what @slevithan could do with this 😉


heycalmdown commented Feb 12, 2014

Regarding the part commented "# not preceded by a @, avoid matching foo@_gmail.com_": how can we make this work in JavaScript, which doesn't have negative-lookbehind syntax?


tmesser commented Mar 9, 2014

Perhaps I am missing something, but is the closed set of valid TLDs a good idea? TLDs can now be bought like normal domain names. Will this regex work in 3-5 years when new TLDs emerge, or will it need to be updated?

emileswain commented Mar 18, 2014

I'm not massively familiar with regex, so when the above didn't work in Ruby I discovered I needed to use the following syntax for it to work.

reg = %r{YourSingleLineRegCutAndPasted}

This was to allow the use of / within the regex itself.


alricb commented Jun 25, 2014

Note that this regex may be vulnerable to a denial of service attack if used on untrusted input with an NFA engine. Ref: http://en.wikipedia.org/wiki/ReDoS

fujin commented Sep 12, 2014

I can't really recommend using a regex for this.

Try something like golang's net/url Parse: http://golang.org/pkg/net/url/#Parse
http://golang.org/src/pkg/net/url/url.go?s=8497:8544#L323

hugows commented Sep 13, 2014

@fujin: "The problem the pattern attempts to solve: identify the URLs in an arbitrary string of text, where by “arbitrary” let’s agree we mean something unstructured such as an email message or a tweet."

berzniz commented Oct 1, 2014

The regex gets stuck in an infinite loop (JavaScript) when you have many trailing dots:

var text = 'http://www.google.com............................................';
regex.exec(text); // stuck in infinite loop

Xeoncross commented Oct 1, 2014

Here is a simple JavaScript regex I tested on several variants listed here. I'm still not up to speed with Go's patterns, so I tried to keep this simple and not use any lookaheads or anything.

(([a-z]{3,6}://)|(^|\s))([a-zA-Z0-9\-]+\.)+[a-z]{2,13}[\.\?\=\&\%\/\w\-]*\b([^@]|$)
domain.com
www.domain.com
thisisareallylongdomainnamewithunder62parts.co
node-1.www4.example.com.jp
something .com
email@domain.com
@example.com
user.name@example.com
http://domain.com
ftp://foo.1.example.com.uk
url::ecode.foo()
hi there John.Bob
example.com?foo=bar
example.com/foo/bar?baz=true&something=%20alsotrue

I wrote it assuming the m (multiline) modifier, for a list like this that ends with a newline. You can fiddle with it yourself.


scrivener commented Oct 3, 2014

It's not actually an infinite loop with lots of trailing dots, it's just that it takes twice as long to run for each additional dot (tested in Python). Exclamation marks, commas and semicolons also cause the same problem.
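
A quick Python sketch of that growth, using a simplified fragment with the same repeated-group structure as the pattern above (not the full pattern itself):

import re, time

# Simplified fragment with the same "(?: run | parens )+ final-char" shape;
# trailing dots force the engine to try exponentially many ways to split
# the run, so each extra dot roughly doubles the time.
frag = re.compile(r"""(?xi)
https?://
(?: [^\s()<>]+ | \([^\s]+?\) )+
(?: \([^\s]+?\) | [^\s`!()\[\].,;:] )
""")

for n in (18, 20, 22):
    s = "http://x.co" + "." * n
    t0 = time.perf_counter()
    frag.search(s)
    print(n, "dots:", round(time.perf_counter() - t0, 3), "seconds")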


AvnerCohen commented Jan 1, 2015

It gets stuck on Ruby as well.

This one works for me for 99% of cases, which is what I needed:

((?<=[^a-zA-Z0-9])(?:https?\:\/\/|[a-zA-Z0-9]{1,}\.{1}|\b)(?:\w{1,}\.{1}){1,5}(?:com|org|edu|gov|uk|net|ca|de|jp|fr|au|us|ru|ch|it|nl|se|no|es|mil|iq|io|ac|ly|sm){1}(?:\/[a-zA-Z0-9]{1,})*)

https://regex101.com/r/fO6mX3/2


kbeezie commented Jan 10, 2015

Can't seem to get this to work in PHP, even using a <<<EOD ... EOD; type of input into the PHP variable for the preg pattern; it keeps getting hung up on an unknown modifier ''.

Edit: had to use a delimiter that was not a part of the string at all:
$webpattern = <<<EOD
pattern here
EOD;

dpk commented May 10, 2015

Those experiencing hanging problems should try the pattern in a real regular expression engine, that is, one which does not backtrack.

tavor commented Sep 21, 2015

http://schema.org","@type
is being matched. Generally, whenever there is a " after a seemingly valid URL, the match includes lots of characters after the URL.

CodingNinjaOctocat commented Nov 9, 2015

How do I match web image URIs in JSON? Such as:
-- http://p1.xxx.com/45/b9/45b9f057fc1957ed2c946814342c0f02.jpg OR
-- http://pic1.xxx.com/4766e0648_m.jpg OR ETC.
And I want to replace the URIs with
-- C://MyFolder/45b9f057fc1957ed2c946814342c0f02.jpg OR
-- C://MyFolder/4766e0648_m.jpg OR ETC.
I tried:
-- [a-zA-z]+://[^\s]/[^\s].jpg
but if the JSON is not pretty-printed, it can only match one result:
-- http://xxx.com/xxx.jpg...OTHER JSON CODE ... http://xxx.com/xxx.jpg
which means it matches from the first http:// all the way to the last .jpg.
I tried another:
-- (http|ftp|https)://[\w-]+(.[\w-]+)+([\w-.,@?^=%&:/+#]*[\w-@?^=%&/+#])?.jpg
That works OK, but I cannot use $1, $2 in the replacement...
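
One way to do this (a Python sketch; the lazy \S+? keeps each match from spanning two URLs, and the capture group holds the basename -- the paths and names are just the examples above):

import re

# Non-greedy matcher: scheme, then a minimal non-space run, then capture
# the basename so the replacement can keep just the file name.
img_re = re.compile(r"https?://\S+?/([\w\-]+\.jpg)")

blob = ('{"a":"http://p1.xxx.com/45/b9/45b9f057fc1957ed2c946814342c0f02.jpg",'
        '"b":"http://pic1.xxx.com/4766e0648_m.jpg"}')
print(img_re.sub(r"C://MyFolder/\1", blob))
# {"a":"C://MyFolder/45b9f057fc1957ed2c946814342c0f02.jpg","b":"C://MyFolder/4766e0648_m.jpg"}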


ethaniel commented Aug 22, 2017

@gruber I think this thread needs your attention :)


scryba commented Oct 14, 2017

in PHP

$regex ="
(?i)\b((?:https?:(?:/{1,3}|[a-z0-9%])|[a-z0-9.-]+./)(?:[^\s()<>{}[]]+|([^\s()]?([^\s()]+)[^\s()]?)|([^\s]+?))+(?:([^\s()]?([^\s()]+)[^\s()]?)|([^\s]+?)|[^\s`!()[]{};:'".,<>?«»“”‘’])|(?:(?<!@)[a-z0-9]+(?:[.-][a-z0-9]+)*.\b/?(?!@)))

"
$string = "google.com Inquiry www.yahoo.com test for: http://www.bing.com";
preg_match_all($regex, $string, $match);

var_dump(match[0]);

But I get the error => [error] [php] preg_match_all(): Unknown modifier ' \ '

Any way to work around this?


lukapaunovic commented Oct 20, 2017

clement-analogue commented Nov 24, 2017

@lukapaunovic This PHP code does not work with ( and ).

