The regex patterns in this gist are intended only to match web URLs -- http,
https, and naked domains like "example.com". For a pattern that attempts to
match all URLs, regardless of protocol, see: https://gist.github.com/gruber/249502
# Single-line version:
(?i)\b((?:https?:(?:/{1,3}|[a-z0-9%])|[a-z0-9.\-]+[.](?:com|net|org|edu|gov|mil|aero|asia|biz|cat|coop|info|int|jobs|mobi|museum|name|post|pro|tel|travel|xxx|ac|ad|ae|af|ag|ai|al|am|an|ao|aq|ar|as|at|au|aw|ax|az|ba|bb|bd|be|bf|bg|bh|bi|bj|bm|bn|bo|br|bs|bt|bv|bw|by|bz|ca|cc|cd|cf|cg|ch|ci|ck|cl|cm|cn|co|cr|cs|cu|cv|cx|cy|cz|dd|de|dj|dk|dm|do|dz|ec|ee|eg|eh|er|es|et|eu|fi|fj|fk|fm|fo|fr|ga|gb|gd|ge|gf|gg|gh|gi|gl|gm|gn|gp|gq|gr|gs|gt|gu|gw|gy|hk|hm|hn|hr|ht|hu|id|ie|il|im|in|io|iq|ir|is|it|je|jm|jo|jp|ke|kg|kh|ki|km|kn|kp|kr|kw|ky|kz|la|lb|lc|li|lk|lr|ls|lt|lu|lv|ly|ma|mc|md|me|mg|mh|mk|ml|mm|mn|mo|mp|mq|mr|ms|mt|mu|mv|mw|mx|my|mz|na|nc|ne|nf|ng|ni|nl|no|np|nr|nu|nz|om|pa|pe|pf|pg|ph|pk|pl|pm|pn|pr|ps|pt|pw|py|qa|re|ro|rs|ru|rw|sa|sb|sc|sd|se|sg|sh|si|sj|Ja|sk|sl|sm|sn|so|sr|ss|st|su|sv|sx|sy|sz|tc|td|tf|tg|th|tj|tk|tl|tm|tn|to|tp|tr|tt|tv|tw|tz|ua|ug|uk|us|uy|uz|va|vc|ve|vg|vi|vn|vu|wf|ws|ye|yt|yu|za|zm|zw)/)(?:[^\s()<>{}\[\]]+|\([^\s()]*?\([^\s()]+\)[^\s()]*?\)|\([^\s]+?\))+(?:\([^\s()]*?\([^\s()]+\)[^\s()]*?\)|\([^\s]+?\)|[^\s`!()\[\]{};:'".,<>?«»“”‘’])|(?:(?<!@)[a-z0-9]+(?:[.\-][a-z0-9]+)*[.](?:com|net|org|edu|gov|mil|aero|asia|biz|cat|coop|info|int|jobs|mobi|museum|name|post|pro|tel|travel|xxx|ac|ad|ae|af|ag|ai|al|am|an|ao|aq|ar|as|at|au|aw|ax|az|ba|bb|bd|be|bf|bg|bh|bi|bj|bm|bn|bo|br|bs|bt|bv|bw|by|bz|ca|cc|cd|cf|cg|ch|ci|ck|cl|cm|cn|co|cr|cs|cu|cv|cx|cy|cz|dd|de|dj|dk|dm|do|dz|ec|ee|eg|eh|er|es|et|eu|fi|fj|fk|fm|fo|fr|ga|gb|gd|ge|gf|gg|gh|gi|gl|gm|gn|gp|gq|gr|gs|gt|gu|gw|gy|hk|hm|hn|hr|ht|hu|id|ie|il|im|in|io|iq|ir|is|it|je|jm|jo|jp|ke|kg|kh|ki|km|kn|kp|kr|kw|ky|kz|la|lb|lc|li|lk|lr|ls|lt|lu|lv|ly|ma|mc|md|me|mg|mh|mk|ml|mm|mn|mo|mp|mq|mr|ms|mt|mu|mv|mw|mx|my|mz|na|nc|ne|nf|ng|ni|nl|no|np|nr|nu|nz|om|pa|pe|pf|pg|ph|pk|pl|pm|pn|pr|ps|pt|pw|py|qa|re|ro|rs|ru|rw|sa|sb|sc|sd|se|sg|sh|si|sj|Ja|sk|sl|sm|sn|so|sr|ss|st|su|sv|sx|sy|sz|tc|td|tf|tg|th|tj|tk|tl|tm|tn|to|tp|tr|tt|tv|tw|tz|ua|ug|uk|us|uy|uz|va|vc|ve|vg|vi|vn|vu|wf|ws|ye|yt|yu|za|zm|zw)\b/?(?!@)))
# Commented multi-line version:
(?xi)
\b
(                                       # Capture 1: entire matched URL
  (?:
    https?:                             # URL protocol and colon
    (?:
      /{1,3}                            # 1-3 slashes
      |                                 #   or
      [a-z0-9%]                         # Single letter or digit or '%'
                                        # (Trying not to match e.g. "URI::Escape")
    )
    |                                   #   or
                                        # looks like domain name followed by a slash:
    [a-z0-9.\-]+[.]
    (?:com|net|org|edu|gov|mil|aero|asia|biz|cat|coop|info|int|jobs|mobi|museum|name|post|pro|tel|travel|xxx|ac|ad|ae|af|ag|ai|al|am|an|ao|aq|ar|as|at|au|aw|ax|az|ba|bb|bd|be|bf|bg|bh|bi|bj|bm|bn|bo|br|bs|bt|bv|bw|by|bz|ca|cc|cd|cf|cg|ch|ci|ck|cl|cm|cn|co|cr|cs|cu|cv|cx|cy|cz|dd|de|dj|dk|dm|do|dz|ec|ee|eg|eh|er|es|et|eu|fi|fj|fk|fm|fo|fr|ga|gb|gd|ge|gf|gg|gh|gi|gl|gm|gn|gp|gq|gr|gs|gt|gu|gw|gy|hk|hm|hn|hr|ht|hu|id|ie|il|im|in|io|iq|ir|is|it|je|jm|jo|jp|ke|kg|kh|ki|km|kn|kp|kr|kw|ky|kz|la|lb|lc|li|lk|lr|ls|lt|lu|lv|ly|ma|mc|md|me|mg|mh|mk|ml|mm|mn|mo|mp|mq|mr|ms|mt|mu|mv|mw|mx|my|mz|na|nc|ne|nf|ng|ni|nl|no|np|nr|nu|nz|om|pa|pe|pf|pg|ph|pk|pl|pm|pn|pr|ps|pt|pw|py|qa|re|ro|rs|ru|rw|sa|sb|sc|sd|se|sg|sh|si|sj| Ja|sk|sl|sm|sn|so|sr|ss|st|su|sv|sx|sy|sz|tc|td|tf|tg|th|tj|tk|tl|tm|tn|to|tp|tr|tt|tv|tw|tz|ua|ug|uk|us|uy|uz|va|vc|ve|vg|vi|vn|vu|wf|ws|ye|yt|yu|za|zm|zw)
    /
  )
  (?:                                   # One or more:
    [^\s()<>{}\[\]]+                    # Run of non-space, non-()<>{}[]
    |                                   #   or
    \([^\s()]*?\([^\s()]+\)[^\s()]*?\)  # balanced parens, one level deep: (…(…)…)
    |
    \([^\s]+?\)                         # balanced parens, non-recursive: (…)
  )+
  (?:                                   # End with:
    \([^\s()]*?\([^\s()]+\)[^\s()]*?\)  # balanced parens, one level deep: (…(…)…)
    |
    \([^\s]+?\)                         # balanced parens, non-recursive: (…)
    |                                   #   or
    [^\s`!()\[\]{};:'".,<>?«»“”‘’]      # not a space or one of these punct chars
  )
  |                                     # OR, the following to match naked domains:
  (?:
    (?<!@)                              # not preceded by a @, avoid matching foo@_gmail.com_
    [a-z0-9]+
    (?:[.\-][a-z0-9]+)*
    [.]
    (?:com|net|org|edu|gov|mil|aero|asia|biz|cat|coop|info|int|jobs|mobi|museum|name|post|pro|tel|travel|xxx|ac|ad|ae|af|ag|ai|al|am|an|ao|aq|ar|as|at|au|aw|ax|az|ba|bb|bd|be|bf|bg|bh|bi|bj|bm|bn|bo|br|bs|bt|bv|bw|by|bz|ca|cc|cd|cf|cg|ch|ci|ck|cl|cm|cn|co|cr|cs|cu|cv|cx|cy|cz|dd|de|dj|dk|dm|do|dz|ec|ee|eg|eh|er|es|et|eu|fi|fj|fk|fm|fo|fr|ga|gb|gd|ge|gf|gg|gh|gi|gl|gm|gn|gp|gq|gr|gs|gt|gu|gw|gy|hk|hm|hn|hr|ht|hu|id|ie|il|im|in|io|iq|ir|is|it|je|jm|jo|jp|ke|kg|kh|ki|km|kn|kp|kr|kw|ky|kz|la|lb|lc|li|lk|lr|ls|lt|lu|lv|ly|ma|mc|md|me|mg|mh|mk|ml|mm|mn|mo|mp|mq|mr|ms|mt|mu|mv|mw|mx|my|mz|na|nc|ne|nf|ng|ni|nl|no|np|nr|nu|nz|om|pa|pe|pf|pg|ph|pk|pl|pm|pn|pr|ps|pt|pw|py|qa|re|ro|rs|ru|rw|sa|sb|sc|sd|se|sg|sh|si|sj| Ja|sk|sl|sm|sn|so|sr|ss|st|su|sv|sx|sy|sz|tc|td|tf|tg|th|tj|tk|tl|tm|tn|to|tp|tr|tt|tv|tw|tz|ua|ug|uk|us|uy|uz|va|vc|ve|vg|vi|vn|vu|wf|ws|ye|yt|yu|za|zm|zw)
    \b
    /?
    (?!@)                               # not succeeded by a @, avoid matching "foo.na" in "foo.na@example.com"
  )
)
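For anyone who just wants to try the pattern, here is a minimal usage sketch in Python. The file name "url_pattern.txt" and the sample text are my own assumptions, not part of the gist; the single-line pattern already carries its own (?i) flag, so no extra compile flags are needed.

```python
import re

# Assumes the single-line pattern above has been saved verbatim to "url_pattern.txt"
# (a hypothetical file name; paste the pattern however suits your project).
with open("url_pattern.txt", encoding="utf-8") as f:
    url_pattern = re.compile(f.read().strip())

text = "Docs live at https://example.com/docs (see www.example.org/faq) or just example.net."
for match in url_pattern.finditer(text):
    print(match.group(0))
# -> https://example.com/docs
# -> www.example.org/faq
# -> example.net
```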
And line 44. Or it might be at the point where you just give up and look for this instead: (?:[a-z]{2,13}), i.e. 2 to 13 letters (the longest gTLD I found was .information, so I went with 13 on the high end).
And [a-z] may not be enough. Consider the new gTLD .移动, which is Chinese for "mobile." Hmmm...
Looking forward to winzig’s improvements… *hint hint*
Aren't all domain names ultimately encoded as ASCII? E.g. Punycode (https://en.wikipedia.org/wiki/Punycode): .移动 --> .xn--6frz82g. A list of 'all' TLDs: https://data.iana.org/TLD/tlds-alpha-by-domain.txt. Personally, I wouldn't encode a changing list into a fixed regex.
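For what it's worth, that mapping is easy to check from Python, which ships an IDNA 2003 codec (a quick sketch; the third-party idna package implements the stricter IDNA 2008 rules):

```python
# Round-trip the .移动 example from the comment above through Python's built-in codec.
print("移动".encode("idna"))          # b'xn--6frz82g'
print(b"xn--6frz82g".decode("idna"))  # 移动
```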
A Chinese or other-language forum post, RSS feed, or blog post is not going to have users who type the ASCII representation of a Chinese domain name! It's difficult to see a use case for this.
@gruber @mcritz Well, here's my fork with the updates I mentioned: https://gist.github.com/winzig/8894715. I haven't tackled handling Unicode TLDs, but given that John has not yet attempted matching them in the domain-name-matching part itself, I don't feel too bad. :-)
Another issue with supporting Unicode domains is that I think you'd have to make a decision that would tie this regex to either being PCRE-compliant (using \x{1234} Unicode characters) or JavaScript-compliant (using \u1234-style Unicode). I don't think there's currently a way to specify Unicode characters that is globally acceptable...
Someone looked deeper into the problem a while back: http://mathiasbynens.be/demo/url-regex. Also check the link to the gist over there: https://gist.github.com/dperini/729294. It fails a couple of (arguably fishy) URLs. The test.html harness is over at https://gist.github.com/michaelpigg/4001961
I added support for Punycode in the domain to dperini's regex here: https://gist.github.com/HenkPoley/8899766. Basically, all the domain parts check that xn-- is followed by one or more numbers, letters, or dashes. I have no clue about arcane UTF encodings of the resource path.
@HenkPoley In what world do users of URLs actually type the "xn--*" version of a URL? I'm having trouble trying to work out what you are not understanding here; please help. It sounds like you expect that Chinese people sit around memorising the ASCII version of their favourite Chinese website?
@corydoras You convert the URL to Punycode internally before running it through the validator.
@jmesterh To convert a URL to Punycode, don't you first need a regular expression to find the URL? So you still need a regexp that can match Unicode domain names. :D And at this point, if you have a regexp to find the URL, there is no need to convert it to Punycode anyway, right?
@corydoras – So convert the whole text into Punycode, process it however you wish (i.e. replace URLs with anchors), and convert it back.
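A minimal sketch of that encode-then-validate idea in Python. The VALIDATOR pattern below is a simplified hypothetical stand-in, not the gist's regex; it accepts xn-- labels along the lines suggested above.

```python
import re

# Hypothetical ASCII-only domain validator: plain labels, plus punycoded xn-- TLDs.
VALIDATOR = re.compile(
    r"^[a-z0-9](?:[a-z0-9.-]*[a-z0-9])?\.(?:xn--[a-z0-9-]+|[a-z]{2,13})$",
    re.IGNORECASE,
)

def domain_is_valid(domain: str) -> bool:
    """Convert a (possibly international) domain to ASCII, then validate that form."""
    try:
        ascii_form = domain.encode("idna").decode("ascii")  # built-in IDNA 2003 codec
    except UnicodeError:
        return False
    return bool(VALIDATOR.match(ascii_form))

print(domain_is_valid("example.com"))        # True
print(domain_is_valid("пример.испытание"))   # True: both labels become xn-- forms
print(domain_is_valid("not a domain"))       # False
```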
@millimoose Not only is that inefficient, but it means that, by default, sites, projects, and systems that use this regular expression don't support non-American users.
I call bullshit on "inefficient". You're running a nontrivial regex on the whole body of the text already.

I call bullshit on "non-American". I'm Slovak, i.e. not American, nor born or residing in a country that has English as its primary language. I have yet to see a site with an IDN – by now it'd probably be confusing to users here that they're supposed to use diacritics in an URL. Just to drive my point home, not even Yandex, the major Russian search engine, owns Яндекс.com or .ru. I have no idea what the situation is elsewhere – say, in the CJK region – but I'd wager that in countries that use essentially the Latin alphabet, or an alphabet that is easily transliterated to Latin, IDN support is a non-issue. Those countries are a non-negligible target market, and the connotation of ignorant cultural imperialism I assume you were going for vanishes.

You have to do a whole goddamn lot of work besides "using a different regex for URLs" to fully support non-Latin users. The sites that don't choose to do all that work don't support foreign users anyway. And, given what I've said above about the size of that target market, that choice might make perfect sense. It doesn't necessarily have to be about dismissing a culture so much as about a cost/benefit decision. To provide an example: Stack Overflow is a deliberately English-only site, to reduce community fragmentation. (I.e. it's better for everyone if they participate in English, possibly somewhat broken English, rather than having foreign-language content inaccessible to the majority of its users.) Why would they ever bother supporting other languages?

All the while, this URL regex, with the modifications proposed in the comments, is a useful tool in that process. For some reason, you're arguing with the people making those constructive proposals. (And, as I've noticed, without making any yourself.) What, exactly, are you trying to accomplish here?
I have not run the code, but it looks like a small typo on lines 21 and 44, in the OR statements of 2-letter country codes.
I don't know if "Ja" is actually supposed to be capitalized, but it appears that way (capitalized) in the single-line version on line 2 as well, but without the preceding whitespace character.
@millimoose What am I trying to accomplish? Well, at some point the world is going to have to make the transition to natively supporting multiple character sets. It's attitudes like yours that hold up this sort of progress. Sure, there are all the kinds of problems you suggest, but they are problems that we should seek to resolve. Of course there are English-only sites, and sites that avoid resolving the problem for business reasons. That's not an excuse not to aim to support the widest possible range of character sets.
@corydoras – you're not answering the question I asked. To phrase it in a single sentence: what are you trying to accomplish by arguing with the people trying to make this RE useful in supporting IDNs? Also: "the world is going to have to make the transition", "hold up […] progress", "we should seek to resolve", "that's not an excuse"? By this point it seems you're arguing for the sake of arguing, and veering off-topic. The reason nobody is resolving the problem you mention is because they don't have that problem, or have to make any transition – you do / want them to.

How about you a) give a solid economic argument for supporting IDNs in my code; b) give a solid technical issue with the regex or the solutions people gave you; or c) make a constructive contribution?

You were given the proposal to take the above regex, change it to allow Punycode escape sequences as well as ASCII, and use the encode-process-decode approach to find URLs. Do you have an actual use case where this fails? Can you give some example code that reproduces the problem? Basically, can we go on to talking about real problems people actually using the above RE are having, instead of preaching and hypotheticals?
Wonder what @slevithan could do with this.
At the part for
Perhaps I am missing something, but is the closed set of valid TLDs a good idea? TLDs can now be bought like normal domain names. Will this regex work in 3-5 years when new TLDs emerge, or will it need to be updated?
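One way to avoid baking a snapshot of the TLD list into the pattern is to build the alternation at startup from the IANA file linked earlier in this thread. A sketch; real code would want caching and error handling, and offline use would need a bundled copy of the file.

```python
import re
import urllib.request

IANA_TLD_LIST = "https://data.iana.org/TLD/tlds-alpha-by-domain.txt"

# The file is one TLD per line, uppercase, with a leading "#" comment line.
with urllib.request.urlopen(IANA_TLD_LIST) as resp:
    lines = resp.read().decode("ascii").splitlines()
tlds = [line.lower() for line in lines if line and not line.startswith("#")]

# Longest-first keeps the alternation from preferring a shorter prefix of a longer TLD.
tld_alternation = "|".join(sorted(tlds, key=len, reverse=True))
tld_re = re.compile(r"\.(?:%s)\b" % tld_alternation, re.IGNORECASE)

print(len(tlds), "TLDs loaded")
print(bool(tld_re.search("example.museum")))  # True
```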
I'm not massively familiar with regex, so when the above didn't work in Ruby, I discovered I needed to use the following syntax for it to work: reg = %r{YourSingleLineRegCutAndPasted}. This was to allow the use of / within the regex itself.
Note that this regex may be vulnerable to a denial-of-service attack if used on untrusted input with an NFA engine. Ref: http://en.wikipedia.org/wiki/ReDoS
I can't really recommend using a regex for this. Try something like Go's net/url Parse: http://golang.org/pkg/net/url/#Parse
@fujin: "The problem the pattern attempts to solve: identify the URLs in an arbitrary string of text, where by “arbitrary” let’s agree we mean something unstructured such as an email message or a tweet."
The regex gets stuck in an infinite loop (JavaScript) when you have many trailing dots:
var text = 'http://www.google.com............................................';
regex.exec(text); // stuck in infinite loop
Here is a simple JavaScript regex I tested on several variants listed here. I'm still not up-to-speed with Go's patterns, so I tried to keep this simple and not use any lookaheads or anything.
I wrote it assuming the
It's not actually an infinite loop with lots of trailing dots; it just takes twice as long to run for each additional dot (tested in Python). Exclamation marks, commas, and semicolons also cause the same problem.
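A rough way to see that doubling in Python (a sketch; it assumes the single-line pattern was saved to "url_pattern.txt" as in the earlier example, and the absolute timings will vary by machine):

```python
import re
import time

with open("url_pattern.txt", encoding="utf-8") as f:
    url_pattern = re.compile(f.read().strip())

# Each extra trailing dot forces the backtracking engine to try roughly twice as
# many ways to carve up the run of dots, which is why long runs look like a hang.
for dots in (16, 18, 20, 22, 24):
    text = "http://www.google.com" + "." * dots
    start = time.perf_counter()
    url_pattern.search(text)
    print(f"{dots} dots: {time.perf_counter() - start:.3f}s")
```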
It gets stuck on Ruby as well. This one works for me for 99% of cases, which is what I needed: ((?<=[^a-zA-Z0-9])(?:https?\:\/\/|[a-zA-Z0-9]{1,}\.{1}|\b)(?:\w{1,}\.{1}){1,5}(?:com|org|edu|gov|uk|net|ca|de|jp|fr|au|us|ru|ch|it|nl|se|no|es|mil|iq|io|ac|ly|sm){1}(?:\/[a-zA-Z0-9]{1,})*)
Can't seem to get this to work in PHP, even using a <<<EOD ... EOD; type of input into the PHP variable for the preg pattern; it keeps getting hung up on an unknown modifier ''. Edit: I had to use a delimiter that was not part of the string at all.
Those experiencing hanging problems should try the pattern in a real regular expression engine, that is, one which does not backtrack.
http://schema.org","@type
How to match Web image URIs in JSON? Such as:
@gruber I think this thread needs your attention :)
In PHP: $regex = " " var_dump($match[0]); But I get the error => [error] [php] preg_match_all(): Unknown modifier ' \ '. Any workaround for this?
@lukapaunovic This PHP code does not work with
@AvnerCohen
I suggest replacing the initial \b with (?<![a-z0-9@]) (and removing the now-redundant (?<!@) line). The word boundary is easier to read but leads to more wasted work for the rest of the expression.
The outermost ( ) group also seems redundant? Most (all?) flavors capture the whole match as group #0 or with some other API.
Now, this part:
is a recipe for seriously catastrophic runtime. I adapted it to the pattern below, which runs fast. More importantly, it completes quickly for the dozens of cases I tested that made the expression above run for over an hour (until being aborted):
Avoid nesting quantifiers in this way unless you add atomic grouping or make the inner ones possessive. In my engine (Java), you can also change most (not all) of the
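For readers who haven't met those terms, here is a tiny illustration of the difference, using a deliberately toy pattern rather than Andrew's actual rewrite (which isn't shown above). Atomic groups and possessive quantifiers need Python 3.11+ in the standard re module, or the third-party regex module.

```python
import re
import time

text = "a" * 24 + "b"  # the trailing "b" guarantees the overall match fails

patterns = {
    "nested greedy, backtracks": re.compile(r"(?:a+)+!"),
    "atomic group, fails fast":  re.compile(r"(?>(?:a+)+)!"),  # Python 3.11+
}

# The nested-quantifier version retries every way of splitting the run of "a"s
# before giving up; the atomic group commits to the first way and fails at once.
for name, pattern in patterns.items():
    start = time.perf_counter()
    pattern.search(text)
    print(f"{name}: {time.perf_counter() - start:.3f}s")
```

Add a few more "a"s to the test string and the backtracking version's runtime roughly doubles each time.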
On Jun 30, 2020, at 6:26 PM, Andrew Fowler wrote:
> I suggest replacing the initial \b with (?<![a-z0-9@]) (and removing the now-redundant (?<!@) line). The word boundary is easier to read but leads to more wasted work for the rest of the expression.
I think “easier to read” is a huge win. And do you really know that it’s wasting an appreciable amount of work in the rest of the pattern? Unless something can be measured to be slow, I think legibility always wins.
> The outermost ( ) group also seems redundant? Most (all?) flavors capture the whole match as group #0 or with some other API.
That I’d have to think about, but I know in some engines, like Perl, accessing group $0 does take a hit performance-wise.
—J.G.
Why bother?
Thank you, guys, for maintaining & updating this page to this day! Landed here via Google > "url regex python" > Stack Overflow > here.
.pub and .ski domains are ignored without an initial http.
@gruber thanks for sharing. You might need to add 1400 new gTLDs to line 24? ;-(
http://www.101domain.com/new_gtld_extensions.htm