Liberal, Accurate Regex Pattern for Matching Web URLs
The regex patterns in this gist are intended only to match web URLs -- http,
https, and naked domains like "example.com". For a pattern that attempts to
match all URLs, regardless of protocol, see: https://gist.github.com/gruber/249502
# Single-line version:
(?i)\b((?:https?:(?:/{1,3}|[a-z0-9%])|[a-z0-9.\-]+[.](?:com|net|org|edu|gov|mil|aero|asia|biz|cat|coop|info|int|jobs|mobi|museum|name|post|pro|tel|travel|xxx|ac|ad|ae|af|ag|ai|al|am|an|ao|aq|ar|as|at|au|aw|ax|az|ba|bb|bd|be|bf|bg|bh|bi|bj|bm|bn|bo|br|bs|bt|bv|bw|by|bz|ca|cc|cd|cf|cg|ch|ci|ck|cl|cm|cn|co|cr|cs|cu|cv|cx|cy|cz|dd|de|dj|dk|dm|do|dz|ec|ee|eg|eh|er|es|et|eu|fi|fj|fk|fm|fo|fr|ga|gb|gd|ge|gf|gg|gh|gi|gl|gm|gn|gp|gq|gr|gs|gt|gu|gw|gy|hk|hm|hn|hr|ht|hu|id|ie|il|im|in|io|iq|ir|is|it|je|jm|jo|jp|ke|kg|kh|ki|km|kn|kp|kr|kw|ky|kz|la|lb|lc|li|lk|lr|ls|lt|lu|lv|ly|ma|mc|md|me|mg|mh|mk|ml|mm|mn|mo|mp|mq|mr|ms|mt|mu|mv|mw|mx|my|mz|na|nc|ne|nf|ng|ni|nl|no|np|nr|nu|nz|om|pa|pe|pf|pg|ph|pk|pl|pm|pn|pr|ps|pt|pw|py|qa|re|ro|rs|ru|rw|sa|sb|sc|sd|se|sg|sh|si|sj|Ja|sk|sl|sm|sn|so|sr|ss|st|su|sv|sx|sy|sz|tc|td|tf|tg|th|tj|tk|tl|tm|tn|to|tp|tr|tt|tv|tw|tz|ua|ug|uk|us|uy|uz|va|vc|ve|vg|vi|vn|vu|wf|ws|ye|yt|yu|za|zm|zw)/)(?:[^\s()<>{}\[\]]+|\([^\s()]*?\([^\s()]+\)[^\s()]*?\)|\([^\s]+?\))+(?:\([^\s()]*?\([^\s()]+\)[^\s()]*?\)|\([^\s]+?\)|[^\s`!()\[\]{};:'".,<>?«»“”‘’])|(?:(?<!@)[a-z0-9]+(?:[.\-][a-z0-9]+)*[.](?:com|net|org|edu|gov|mil|aero|asia|biz|cat|coop|info|int|jobs|mobi|museum|name|post|pro|tel|travel|xxx|ac|ad|ae|af|ag|ai|al|am|an|ao|aq|ar|as|at|au|aw|ax|az|ba|bb|bd|be|bf|bg|bh|bi|bj|bm|bn|bo|br|bs|bt|bv|bw|by|bz|ca|cc|cd|cf|cg|ch|ci|ck|cl|cm|cn|co|cr|cs|cu|cv|cx|cy|cz|dd|de|dj|dk|dm|do|dz|ec|ee|eg|eh|er|es|et|eu|fi|fj|fk|fm|fo|fr|ga|gb|gd|ge|gf|gg|gh|gi|gl|gm|gn|gp|gq|gr|gs|gt|gu|gw|gy|hk|hm|hn|hr|ht|hu|id|ie|il|im|in|io|iq|ir|is|it|je|jm|jo|jp|ke|kg|kh|ki|km|kn|kp|kr|kw|ky|kz|la|lb|lc|li|lk|lr|ls|lt|lu|lv|ly|ma|mc|md|me|mg|mh|mk|ml|mm|mn|mo|mp|mq|mr|ms|mt|mu|mv|mw|mx|my|mz|na|nc|ne|nf|ng|ni|nl|no|np|nr|nu|nz|om|pa|pe|pf|pg|ph|pk|pl|pm|pn|pr|ps|pt|pw|py|qa|re|ro|rs|ru|rw|sa|sb|sc|sd|se|sg|sh|si|sj|Ja|sk|sl|sm|sn|so|sr|ss|st|su|sv|sx|sy|sz|tc|td|tf|tg|th|tj|tk|tl|tm|tn|to|tp|tr|tt|tv|tw|tz|ua|ug|uk|us|uy|uz|va|vc|ve|vg|vi|vn|vu|wf|ws|ye|yt|yu|za|zm|zw)\b/?(?!@)))
# Commented multi-line version:
(?xi)
\b
( # Capture 1: entire matched URL
(?:
https?: # URL protocol and colon
(?:
/{1,3} # 1-3 slashes
| # or
[a-z0-9%] # Single letter or digit or '%'
# (Trying not to match e.g. "URI::Escape")
)
| # or
# looks like domain name followed by a slash:
[a-z0-9.\-]+[.]
(?:com|net|org|edu|gov|mil|aero|asia|biz|cat|coop|info|int|jobs|mobi|museum|name|post|pro|tel|travel|xxx|ac|ad|ae|af|ag|ai|al|am|an|ao|aq|ar|as|at|au|aw|ax|az|ba|bb|bd|be|bf|bg|bh|bi|bj|bm|bn|bo|br|bs|bt|bv|bw|by|bz|ca|cc|cd|cf|cg|ch|ci|ck|cl|cm|cn|co|cr|cs|cu|cv|cx|cy|cz|dd|de|dj|dk|dm|do|dz|ec|ee|eg|eh|er|es|et|eu|fi|fj|fk|fm|fo|fr|ga|gb|gd|ge|gf|gg|gh|gi|gl|gm|gn|gp|gq|gr|gs|gt|gu|gw|gy|hk|hm|hn|hr|ht|hu|id|ie|il|im|in|io|iq|ir|is|it|je|jm|jo|jp|ke|kg|kh|ki|km|kn|kp|kr|kw|ky|kz|la|lb|lc|li|lk|lr|ls|lt|lu|lv|ly|ma|mc|md|me|mg|mh|mk|ml|mm|mn|mo|mp|mq|mr|ms|mt|mu|mv|mw|mx|my|mz|na|nc|ne|nf|ng|ni|nl|no|np|nr|nu|nz|om|pa|pe|pf|pg|ph|pk|pl|pm|pn|pr|ps|pt|pw|py|qa|re|ro|rs|ru|rw|sa|sb|sc|sd|se|sg|sh|si|sj| Ja|sk|sl|sm|sn|so|sr|ss|st|su|sv|sx|sy|sz|tc|td|tf|tg|th|tj|tk|tl|tm|tn|to|tp|tr|tt|tv|tw|tz|ua|ug|uk|us|uy|uz|va|vc|ve|vg|vi|vn|vu|wf|ws|ye|yt|yu|za|zm|zw)
/
)
(?: # One or more:
[^\s()<>{}\[\]]+ # Run of non-space, non-()<>{}[]
| # or
\([^\s()]*?\([^\s()]+\)[^\s()]*?\) # balanced parens, one level deep: (…(…)…)
|
\([^\s]+?\) # balanced parens, non-recursive: (…)
)+
(?: # End with:
\([^\s()]*?\([^\s()]+\)[^\s()]*?\) # balanced parens, one level deep: (…(…)…)
|
\([^\s]+?\) # balanced parens, non-recursive: (…)
| # or
[^\s`!()\[\]{};:'".,<>?«»“”‘’] # not a space or one of these punct chars
)
| # OR, the following to match naked domains:
(?:
(?<!@) # not preceded by a @, avoid matching foo@_gmail.com_
[a-z0-9]+
(?:[.\-][a-z0-9]+)*
[.]
(?:com|net|org|edu|gov|mil|aero|asia|biz|cat|coop|info|int|jobs|mobi|museum|name|post|pro|tel|travel|xxx|ac|ad|ae|af|ag|ai|al|am|an|ao|aq|ar|as|at|au|aw|ax|az|ba|bb|bd|be|bf|bg|bh|bi|bj|bm|bn|bo|br|bs|bt|bv|bw|by|bz|ca|cc|cd|cf|cg|ch|ci|ck|cl|cm|cn|co|cr|cs|cu|cv|cx|cy|cz|dd|de|dj|dk|dm|do|dz|ec|ee|eg|eh|er|es|et|eu|fi|fj|fk|fm|fo|fr|ga|gb|gd|ge|gf|gg|gh|gi|gl|gm|gn|gp|gq|gr|gs|gt|gu|gw|gy|hk|hm|hn|hr|ht|hu|id|ie|il|im|in|io|iq|ir|is|it|je|jm|jo|jp|ke|kg|kh|ki|km|kn|kp|kr|kw|ky|kz|la|lb|lc|li|lk|lr|ls|lt|lu|lv|ly|ma|mc|md|me|mg|mh|mk|ml|mm|mn|mo|mp|mq|mr|ms|mt|mu|mv|mw|mx|my|mz|na|nc|ne|nf|ng|ni|nl|no|np|nr|nu|nz|om|pa|pe|pf|pg|ph|pk|pl|pm|pn|pr|ps|pt|pw|py|qa|re|ro|rs|ru|rw|sa|sb|sc|sd|se|sg|sh|si|sj| Ja|sk|sl|sm|sn|so|sr|ss|st|su|sv|sx|sy|sz|tc|td|tf|tg|th|tj|tk|tl|tm|tn|to|tp|tr|tt|tv|tw|tz|ua|ug|uk|us|uy|uz|va|vc|ve|vg|vi|vn|vu|wf|ws|ye|yt|yu|za|zm|zw)
\b
/?
(?!@) # not succeeded by a @, avoid matching "foo.na" in "foo.na@example.com"
)
)
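To show how a pattern in this family is typically applied, here is a minimal Python sketch. The pattern below is a hypothetical, much-simplified stand-in for the gist's full expression — it only illustrates the findall workflow and the "don't end on trailing punctuation" idea:

```python
import re

# Hypothetical, much-simplified stand-in for the gist's pattern: a scheme or
# "www." prefix, a liberal run of URL characters, and a final character that
# is not trailing punctuation.
URL_RE = re.compile(
    r"""\b
    (
      (?:https?://|www\.)       # scheme, or bare "www."
      [^\s<>()]+                # liberal run of URL characters
      [^\s<>()`!.,;:'"?]        # last char must not be trailing punctuation
    )""",
    re.IGNORECASE | re.VERBOSE,
)

text = "See http://example.com/foo, and (www.example.org/bar) too."
print(URL_RE.findall(text))  # ['http://example.com/foo', 'www.example.org/bar']
```

The trailing comma and closing paren are excluded from the matches because the final character class refuses them, forcing the liberal run to backtrack one character.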
@winzig winzig commented Feb 8, 2014

@gruber thanks for sharing. You might need to add 1400 new gTLDs to line 24? ;-(

http://www.101domain.com/new_gtld_extensions.htm

@winzig winzig commented Feb 8, 2014

And line 44. Or it might be at the point where you just give up and look for this instead:

(?:[a-z]{2,13})

I.e. 2 to 13 letters (the longest gTLD I found was .information, so I went with 13 on the high end).
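winzig's length-based alternative can be sketched like this in Python; the surrounding naked-domain pattern is my own simplification, not the gist's, and only the `[a-z]{2,13}` part is his suggestion:

```python
import re

# Naked-domain matcher using winzig's 2-13 letter TLD range instead of a
# hard-coded TLD list (the rest of the pattern is a simplified illustration).
NAKED = re.compile(r'\b[a-z0-9-]+(?:\.[a-z0-9-]+)*\.([a-z]{2,13})\b')

m = NAKED.search('see example.photography for details')
print(m.group(0), m.group(1))  # example.photography photography
```

This accepts new long gTLDs like .photography without maintaining a list, at the cost of also matching nonsense "TLDs" of the right length.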

@winzig winzig commented Feb 8, 2014

And [a-z] may not be enough. Consider the new gTLD .移动, which is Chinese for "mobile." Hmmm...

@mcritz mcritz commented Feb 9, 2014

Looking forward to winzig’s improvements… _hint_hint*

@HenkPoley HenkPoley commented Feb 9, 2014

Aren't all domain names ultimately encoded as ASCII?

e.g. punycode: https://en.wikipedia.org/wiki/Punycode

.移动 --> .xn--6frz82g
.游戏 --> .xn--unup4y

A list of 'all' TLDs: https://data.iana.org/TLD/tlds-alpha-by-domain.txt
ICANN notes: "maintained by the IANA and is updated from time to time" (ref: http://www.icann.org/en/resources/registries/tlds )

Personally I wouldn't encode a changing list into a fixed regex.
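The equivalence quoted above can be checked with Python's built-in idna codec (IDNA 2003), which does the nameprep-plus-punycode conversion in one step — a quick sanity check, assuming a standard CPython:

```python
# Convert an internationalized label to its ASCII (punycode) form and back,
# matching the .移动 <-> .xn--6frz82g equivalence listed above.
ascii_label = '移动'.encode('idna')
print(ascii_label)  # b'xn--6frz82g'
assert ascii_label.decode('idna') == '移动'
```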

@ghost ghost commented Feb 9, 2014

A Chinese (or other-language) forum post, RSS feed, blog post, or whatever is not going to have users who type the ASCII representation of a Chinese domain name! It's difficult to see a use case for this.

@winzig winzig commented Feb 9, 2014

@gruber @mcritz Well, here's my fork with the updates I mentioned:

https://gist.github.com/winzig/8894715

I haven't tackled handling unicode TLDs, but given that John has not yet attempted matching them in the domain name matching part itself, I don't feel too bad. :-)

@winzig winzig commented Feb 9, 2014

Another issue with supporting Unicode domains is that I think you'd have to make a decision that would tie this regex to either being PCRE compliant (using \x{1234} Unicode characters) or JavaScript compliant, with \u1234-style Unicode. I don't think there's currently a way to specify Unicode characters that is globally acceptable...

@HenkPoley HenkPoley commented Feb 9, 2014

Someone has looked deeper into the problem a while back: http://mathiasbynens.be/demo/url-regex

Also check the link to the gist over there: https://gist.github.com/dperini/729294

It fails a couple of (arguably fishy) URLs:
http://nic.xn--unup4y <-- valid equivalent of http://nic.游戏
http://xn--h32b13vza.xn--3e0b707e/ <-- equiv: http://이메일.한국/
https://localhost/
http://2915201185/search?q=hello

The test.html harness over at https://gist.github.com/michaelpigg/4001961
.. needs a <meta charset="utf-8"> just inside the head

@HenkPoley HenkPoley commented Feb 9, 2014

I added support for punycode in the domain to dperini's regex here: https://gist.github.com/HenkPoley/8899766
(dperini's version already supported IDNs / UTF-8 domains)

Basically all the domain parts check that xn-- is followed by one or more numbers, letters or dashes.

I have no clue about arcane UTF encodings of the resource path.
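A sketch of that label check in Python — a hypothetical pattern in the spirit of the linked fork, not copied from it:

```python
import re

# Accept either a punycode label ("xn--" plus letters/digits/hyphens) or an
# ordinary LDH label that neither starts nor ends with a hyphen.
LABEL = re.compile(r'(?:xn--[a-z0-9-]+|[a-z0-9](?:[a-z0-9-]*[a-z0-9])?)$',
                   re.IGNORECASE)

print(bool(LABEL.match('xn--6frz82g')))  # True
print(bool(LABEL.match('example')))      # True
print(bool(LABEL.match('-bad')))         # False
```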

@ghost ghost commented Feb 9, 2014

@HenkPoley in what world do users of URLs actually type the "xn--*" version of a URL? I'm having trouble trying to work out what you are not understanding here, please help. It sounds like you expect that Chinese people sit around memorising the ASCII version of their favourite Chinese website?

@jmesterh jmesterh commented Feb 9, 2014

@corydoras you convert the URL to punycode internally before running it through the validator.

@ghost ghost commented Feb 9, 2014

@jmesterh To convert a URL to punycode, don't you first need a regular expression to find the URL? So you still need a regexp that can match Unicode domain names. :D And at that point, if you have a regexp that finds the URL, there is no need to convert it to punycode anyway, right?

@millimoose millimoose commented Feb 9, 2014

@corydoras – So convert the whole text into Punycode, process it however you wish (i.e. replace URLs with anchors), convert it back.

@ghost ghost commented Feb 9, 2014

@millimoose Not only is that inefficient, but it means that by default, sites, projects, systems, that use this regular expression don't support non-American users.

@millimoose millimoose commented Feb 10, 2014

@corydoras

I call bullshit on "inefficient". You're running a nontrivial regex on the whole body of the text already – the O(1) ship has sailed, and my gut tells me the O(n) one has as well. (Pro tip: whenever you say "inefficient" without mentioning O-notation, you're probably wrong.) Besides, it's a straightforward fix for the situation, as opposed to devising a Unicode-based one from scratch. (And good luck with how your RE engine handles Unicode in that case; that's beside the issue that @winzig points out, about Unicode escapes for string literals being unportable.)

I call bullshit on "non-American". I'm Slovak, i.e. not American, nor born or residing in a country that has English as its primary language. I have yet to see a site with an IDN – by now it'd probably be confusing to users here that they're supposed to use diacritics in a URL. Just to drive my point home, not even Yandex, the major Russian search engine, owns Яндекс.com or .ru.

I have no idea what the situation is elsewhere – say, in the CJK region – but I'd wager that in countries that use essentially the Latin alphabet, or an alphabet that is easily transliterated to Latin, IDN support is a nonissue. Those countries are a non-negligible target-market, and the connotation of ignorant cultural imperialism I assume you were going for vanishes.

You have to do a whole goddamn lot of work besides "using a different regex for URLs" to fully support non-Latin users. The sites that don't choose to do all that work don't support foreign users anyway. And, given what I've said above about the size of that target market, that choice might make perfect sense. It doesn't necessarily have to be about dismissing a culture, as much as about a cost/benefit decision. To provide an example: Stack Overflow is a deliberately English-only site, to reduce community fragmentation. (I.e. it's better for everyone if they participate in English, possibly somewhat broken English, rather than having foreign-language content inaccessible to the majority of its users.) Why would they ever bother supporting other languages?

All the while, this regex with the modifications proposed in the comments is a useful tool in that process. For some reason, you're arguing with the people making those constructive proposals. (And, as I've noticed, without making any yourself.) What, exactly, are you trying to accomplish here?

@mattduffy mattduffy commented Feb 10, 2014

I have not run the code, but it looks like a small typo on lines 21 and 44, in the OR statements of 2 letter country codes.

..si|sj| Ja|sk..

I don't know if Ja is actually supposed to be capitalized, but it appears that way (capitalized) in the single-line version too (on line 2), though without the preceding whitespace character.

@ghost ghost commented Feb 10, 2014

@millimoose What am I trying to accomplish? Well, at some point the world is going to have to make the transition to natively supporting multiple character sets. It's attitudes like yours that hold up this sort of progress. Sure, there are all of the types of problems you suggest, but they are problems that we should seek to resolve. Of course there are English-only sites, and sites that avoid resolving the problem for business reasons. That's not an excuse not to aim to support the widest range of character sets.

@millimoose millimoose commented Feb 10, 2014

@corydoras – you're not answering the question I asked. To phrase it in a single sentence: What are you trying to accomplish by arguing with the people trying to make this RE useful in supporting IDNs?

Also: "the world is going to have to make the transition", "hold up […] progress", "we should seek to resolve", "that's not an excuse"? By this point it seems you're arguing for the sake of arguing, and veering off-topic. The reason nobody is resolving the problem you mention is that they don't have that problem, nor any transition to make – you do, and want them to. How about you a) give a solid economic argument for supporting IDNs in my code; b) raise a solid technical issue with the regex or the solutions people gave you; or c) make a constructive contribution?

You were given the proposal to take the above regex, change it to allow punycode escape sequences as well as ASCII, and use the encode-process-decode approach to find URLs. Do you have an actual use case where this fails? Can you give some example code that reproduces the problem? Basically, can we go on to talking about real problems people actually using the above RE are having, instead of preaching and hypotheticals?

@mikaelkaron mikaelkaron commented Feb 11, 2014

wonder what @slevithan could do with this 😉

@heycalmdown heycalmdown commented Feb 12, 2014

At the part commented # not preceded by a @, avoid matching foo@_gmail.com_: how can we make this work with JavaScript, which doesn't have negative lookbehind syntax?
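One common workaround, sketched here in Python with a hypothetical simplified pattern (the same trick ports directly to engines without lookbehind, such as JavaScript at the time): consume the preceding character in a capture group instead of a lookbehind, and pull the URL out of its own group.

```python
import re

# Instead of (?<!@), match the preceding context explicitly (start of string,
# whitespace, or an opening paren) and keep the URL part in group 2.
pat = re.compile(r'(^|[\s(])([a-z0-9.-]+\.com\b)')  # hypothetical simplified pattern

def find_urls(text):
    return [m.group(2) for m in pat.finditer(text)]

print(find_urls('visit example.com today'))    # ['example.com']
print(find_urls('mail foo@gmail.com please'))  # []
```

Since "@" is neither start-of-string, whitespace, nor "(", the domain inside an email address can never begin a match.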

@tmesser tmesser commented Mar 9, 2014

Perhaps I am missing something, but is a closed set of valid TLDs a good idea? TLDs can now be bought like normal domain names. Will this regex still work in 3-5 years when new TLDs emerge, or will it need to be updated?

@emileswain emileswain commented Mar 18, 2014

I'm not massively familiar with regex, so when the above didn't work in Ruby, I discovered I needed to use the following syntax for it to work.

reg = %r{YourSingleLineRegCutAndPasted}

This was to allow the use of / within the regex itself.

@alricb alricb commented Jun 25, 2014

Note that this regex may be vulnerable to a denial of service attack if used on untrusted input with an NFA engine. Ref: http://en.wikipedia.org/wiki/ReDoS
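A toy illustration of that failure mode, assuming a backtracking engine like Python's re (the pattern below is the classic textbook example, not the gist's):

```python
import re

# Nested quantifiers: (a+)+ can partition a run of "a"s in exponentially
# many ways, and a failing match forces the engine to try them all.
evil = re.compile(r'(a+)+$')

# Each extra "a" roughly doubles the time taken on a non-matching input.
print(evil.match('a' * 18 + '!'))  # None, after ~2**18 backtracking steps
```

An 18-character input is still quick; pushing it to 30+ characters would stall a backtracking engine for minutes, which is exactly the attack surface on untrusted input.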

@fujin fujin commented Sep 12, 2014

I can't really recommend using a Regex for this.

Try something like golang's net/url Parse: http://golang.org/pkg/net/url/#Parse
http://golang.org/src/pkg/net/url/url.go?s=8497:8544#L323

@hugows hugows commented Sep 13, 2014

@fujin: "The problem the pattern attempts to solve: identify the URLs in an arbitrary string of text, where by “arbitrary” let’s agree we mean something unstructured such as an email message or a tweet."

@berzniz berzniz commented Oct 1, 2014

The regex gets stuck in an infinite loop (JavaScript) when you have many trailing dots:

var text = 'http://www.google.com............................................';
regex.exec(text); // stuck in infinite loop
@Xeoncross Xeoncross commented Oct 1, 2014

Here is a simple JavaScript regex I tested on several variants listed here. I'm still not up to speed with Go's patterns, so I tried to keep this simple and not use any lookaheads or anything.

(([a-z]{3,6}://)|(^|\s))([a-zA-Z0-9\-]+\.)+[a-z]{2,13}[\.\?\=\&\%\/\w\-]*\b([^@]|$)
domain.com
www.domain.com
thisisareallylongdomainnamewithunder62parts.co
node-1.www4.example.com.jp
something .com
email@domain.com
@example.com
user.name@example.com
http://domain.com
ftp://foo.1.example.com.uk
url::ecode.foo()
hi there John.Bob
example.com?foo=bar
example.com/foo/bar?baz=true&something=%20alsotrue

I wrote it assuming the m (multiline) modifier for a list like this that ends with a newline. You can fiddle with it yourself
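For reference, the pattern as posted also runs under Python's re; a quick check against a few of the cases above (compiled with re.MULTILINE, since it assumes the multiline flag):

```python
import re

# Xeoncross's pattern, verbatim, compiled with the multiline flag it assumes.
pat = re.compile(
    r'(([a-z]{3,6}://)|(^|\s))([a-zA-Z0-9\-]+\.)+[a-z]{2,13}'
    r'[\.\?\=\&\%\/\w\-]*\b([^@]|$)',
    re.MULTILINE,
)

print(bool(pat.search('http://domain.com')))         # True
print(bool(pat.search('visit www.domain.com now')))  # True
print(bool(pat.search('email@domain.com')))          # False: "@" blocks it
</imports>```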

@scrivener scrivener commented Oct 3, 2014

It's not actually an infinite loop with lots of trailing dots, it's just that it takes twice as long to run for each additional dot (tested in Python). Exclamation marks, commas and semicolons also cause the same problem.
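A pragmatic mitigation (my suggestion, not from the gist) is to trim runs of trailing punctuation before handing the text to the big pattern, so the doubling-per-character backtracking case never arises:

```python
import re

def pretrim(text):
    # Strip the run of trailing dots/commas/etc. that triggers the blowup.
    return re.sub(r'[.!,;]+$', '', text)

print(pretrim('http://www.google.com' + '.' * 60))  # 'http://www.google.com'
```

This is lossy for the rare URL that legitimately ends in such punctuation, but those are excluded by the pattern's final character class anyway.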

@AvnerCohen AvnerCohen commented Jan 1, 2015

It gets stuck on Ruby as well.

This one works for me for 99% of cases, which is what I needed:

((?<=[^a-zA-Z0-9])(?:https?\:\/\/|[a-zA-Z0-9]{1,}\.{1}|\b)(?:\w{1,}\.{1}){1,5}(?:com|org|edu|gov|uk|net|ca|de|jp|fr|au|us|ru|ch|it|nl|se|no|es|mil|iq|io|ac|ly|sm){1}(?:\/[a-zA-Z0-9]{1,})*)

https://regex101.com/r/fO6mX3/2

@kbeezie kbeezie commented Jan 10, 2015

Can't seem to get this to work in PHP; even using a <<<EOD ... EOD; heredoc to get the pattern into the PHP variable for the preg pattern, it keeps getting hung up on an unknown modifier.

Edit: had to use a delimiter that was not a part of the string at all:
$webpattern = <<<EOD
pattern here
EOD;

@dpk dpk commented May 10, 2015

Those experiencing hanging problems should try the pattern in a real regular expression engine, that is, one which does not backtrack.

@tavor tavor commented Sep 21, 2015

http://schema.org","@type
is being matched. Generally, whenever there is a " after a seemingly valid URL, the match includes lots of characters after the URL.

@CodingOctocat CodingOctocat commented Nov 9, 2015

How do I match web image URIs in JSON? Such as:
-- http://p1.xxx.com/45/b9/45b9f057fc1957ed2c946814342c0f02.jpg OR
-- http://pic1.xxx.com/4766e0648_m.jpg OR ETC.
And I want to replace the URIs with
-- C://MyFolder/45b9f057fc1957ed2c946814342c0f02.jpg OR
-- C://MyFolder/4766e0648_m.jpg OR ETC.
I tried:
-- [a-zA-z]+://[^\s]/[^\s].jpg
but if the JSON is not formatted (all on one line), it can only match one result:
-- http://xxx.com/xxx.jpg...OTHER JSON CODE ... http://xxx.com/xxx.jpg
which means it greedily matches from the first http:// to the last .jpg as a single result.
I tried another:
-- (http|ftp|https)://[\w-]+(.[\w-]+)+([\w-.,@?^=%&:/+#]*[\w-@?^=%&/+#])?.jpg
It works OK, but I cannot use $1, $2 in the replacement...
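One way to do that rewrite, sketched in Python (the sample JSON and the MyFolder target are made up for illustration): capture only the filename and let the substitution drop the rest of the URL. Non-greedy tricks aren't needed because the capture class excludes "/" and the quote character stops each match.

```python
import re

# Match an http(s) image URL and capture just the final filename component.
pat = re.compile(r'https?://[^\s"]+/([A-Za-z0-9_.-]+\.jpg)')

text = ('{"a":"http://p1.xxx.com/45/b9/45b9f057fc1957ed2c946814342c0f02.jpg",'
        '"b":"http://pic1.xxx.com/4766e0648_m.jpg"}')
out = pat.sub(r'C://MyFolder/\1', text)
print(out)
```

Both URLs are replaced in one pass, even in unformatted single-line JSON, because the `"` in the excluded class keeps each match inside its own JSON string value.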

@ethaniel ethaniel commented Aug 22, 2017

@gruber I think this thread needs your attention :)

@scryba scryba commented Oct 14, 2017

in PHP

$regex ="
(?i)\b((?:https?:(?:/{1,3}|[a-z0-9%])|[a-z0-9.-]+./)(?:[^\s()<>{}[]]+|([^\s()]?([^\s()]+)[^\s()]?)|([^\s]+?))+(?:([^\s()]?([^\s()]+)[^\s()]?)|([^\s]+?)|[^\s`!()[]{};:'".,<>?«»“”‘’])|(?:(?<!@)[a-z0-9]+(?:[.-][a-z0-9]+)*.\b/?(?!@)))

"
$string = "google.com Inquiry www.yahoo.com test for: http://www.bing.com";
preg_match_all($regex, $string, $match);

var_dump($match[0]);

But I get the error => [error] [php] preg_match_all(): Unknown modifier ' \ '

Any way to work around this?

@lukapaunovic lukapaunovic commented Oct 20, 2017

@clement-analogue clement-analogue commented Nov 24, 2017

@lukapaunovic This PHP code does not work with ( and ).

@deadbits deadbits commented Dec 26, 2018

@AvnerCohen
Your example is missing subdomains with dashes in them. For example, your pattern won't catch all of this:
x-foo.bar.subdomain.com/index.php?q=123

@li-a li-a commented Jun 30, 2020

I suggest replacing the initial \b with (?<![a-z0-9_@-]) (and removing the now-redundant (?<!@) line). The word boundary is easier to read but leads to more wasted work for the rest of the expression. The a-z0-9 ensure a proper boundary, and the _@- prevent matches in the middle of something like a typical filename or email address. Others may feel free to add more prefix symbols that indicate a URL should be ignored.

The outermost ( ) group also seems redundant? Most (all?) flavors capture the whole match as group #0 or with some other API.
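A quick check of li-a's point, using a hypothetical naked-domain fragment rather than the full pattern:

```python
import re

# With \b, the match happily starts right after the "@" of an email address:
assert re.search(r'\b[a-z0-9.-]+\.com\b', 'foo@gmail.com')

# With li-a's prefix lookbehind, the same text is rejected outright:
assert re.search(r'(?<![a-z0-9_@-])[a-z0-9.-]+\.com\b', 'foo@gmail.com') is None
```

The lookbehind also fails faster: the engine rejects candidate start positions in one character test instead of running the whole domain sub-pattern and discarding the match later.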

Now, this part:

  (?:							# One or more:
    [^\s()<>{}\[\]]+						# Run of non-space, non-()<>{}[]
    |								#   or
    \([^\s()]*?\([^\s()]+\)[^\s()]*?\)  # balanced parens, one level deep: (…(…)…)
    |
    \([^\s]+?\)							# balanced parens, non-recursive: (…)
  )+
  (?:							# End with:
    \([^\s()]*?\([^\s()]+\)[^\s()]*?\)  # balanced parens, one level deep: (…(…)…)
    |
    \([^\s]+?\)							# balanced parens, non-recursive: (…)
    |									#   or
    [^\s`!()\[\]{};:'".,<>?«»“”‘’]		# not a space or one of these punct chars
  )

is a recipe for seriously catastrophic runtime. I adapted it to the pattern below, which runs fast. More importantly, it completes quickly for the dozens of cases I tested that made the expression above run for over an hour (until being aborted):

(?x:                               # Scoping the COMMENT flag; can be removed if compressed or concatenated with more commented expression
  [^\s()<>{}\[\]]*                 # 0+ non-space, non-()<>{}[]
  (?:                              # 0+ times:
    \(                             #   Balanced parens containing:
    [^\s()]*                       #   0+ non-paren chars
    (?:                            #   0+ times:
      \([^\s()]*\)                 #     Inner balanced parens containing 0+ non-paren chars
      [^\s()]*                     #     0+ non-paren chars
    )*
    \)
    [^\s()<>{}\[\]]*               # 0+ non-space, non-()<>{}[]
  )*
  (?:                              # End with:
    \(                             #   Balanced parens containing:
    [^\s()]*                       #   0+ non-paren chars
    (?:                            #   0+ times:
      \([^\s()]*\)                 #     Inner balanced parens containing 0+ non-paren chars
      [^\s()]*                     #     0+ non-paren chars
    )*
    \)
    |                              #   or
    [^\s`!()\[\]{};:'".,<>?«»“”‘’] # not a space or one of these punct chars
  )
)

Avoid nesting quantifiers in this way unless you add atomic grouping or make the inner ones possessive. In my engine (Java), you can also change most (not all) of the * to *+ to match the same inputs with better performance.

Owner Author
@gruber gruber commented Jul 1, 2020

@jackdeguest jackdeguest commented Jul 26, 2020

Why bother?

use Regexp::Common qw( URI );
if( $str =~ /$RE{URI}{HTTP}/ )
{
    # something here
}
# For https
elsif( $str =~ /$RE{URI}{HTTP}{-scheme => 'https'}/ )
{
    # something else
}
@XtremePwnership XtremePwnership commented Aug 21, 2020

Thank you guys for maintaining & updating this page to this day! Landed here via Google > "url regex python" > StackOverflow > here.

@vellrya vellrya commented Jan 27, 2021

.pub and .ski domains are ignored without an initial http

@sjosegarcia sjosegarcia commented May 3, 2021

I am glad I found this, through Stack Overflow.

@fariello fariello commented Jun 9, 2021

This was very helpful, but in some very rare edge cases, using either the one-liner or the longer, more readable version will "hang" in Python. The long version hangs on this:

"[https://www.surveymonkey.com/r/9RSNDDT](https://www.surveymonkey.com/r/9RSNDDT)"

The short version handles that but subsequently hangs somewhere in the following text; I cannot determine where:

"**Disclaimer: All the accused are innocent until proven guilty by the court of law, even if they may sound as being guilty. Currency in Armenian Drams unless specified otherwise.**\n\n---\n\nMarch 1st murder victims' representatives sent a request to the Constitutional Court and demanded 7 member judges to recuse themselves from hearing the Kocharyan case this August, citing the reasons as political past, lack of trust, and a conflict of interest. \n\nPetition says the judges were involved with making pro-Kocharyan and pro-Serj election result verdicts in 2008, at the time when it is known that Kocharyan regime was directly pressuring the courts, as exposed by US embassy cables. \n\nOther reasons were also mentioned, for example some judges holding extraordinary sessions to approve Kocharyan's now-disputed state of emergency declaration; a judge still being a party member at the time of being elected; judges being elected by a president who came into power controversially after pressing the court; the chief judge being an active HHK member and MP at the time of appointed, who aided the party agenda. \n\nRead the rest of the complaint here... \n\nhttps://armtimes.com/hy/article/165446\n\n---\n\nHHK party is being sued by the family of a businessman who used to own the building where one of HHK's headquarter is (or was) located. The businessman was a wine maker in the 20th century and owned the building. The family alleges that in 2001, HHK used it political powers to rapidly force the 460 sq/m building to be sold to HHK for only 5.5mln Drams. Next year, the state paid HHK 93mln to obtain the same area, which later presumably came under HHK's possession again. The plaintiff alleges that property registration agency was in such a hurry that they didn't wait for ownership names to be properly changed before authorizing the transactions, thus making the process illegal. The court will hear the case this September. 
\n \nhttps://www.armtimes.com/hy/article/165369\n\n---\n\nThe government session took place. \n\n375mln of the excess tax money will be spent on rebuilding 5 high schools. Hundreds of millions to renovate dozens of others. 1.7bln was dedicated to repair universities. https://armtimes.com/hy/article/165433 -- https://factor.am/165207.html\n\n-\n\n837mln will be dedicated to create 330 robotics labs in schools, in addition to 350 that already exist. https://armtimes.com/hy/article/165417\n\n-\n\nThe plan to double the premium pensions for 397 WW2 vets has been approved. They'll receive 100k/month. It'll probably be sent to Parliament for a final approval. https://armtimes.com/hy/article/165407\n\n-\n\n4bln will be issued to State Water Committee to solve local water debt and management issues. Minister of Infrastructure complained that a poor management for many years has led to a 7bln debt. https://armtimes.com/hy/article/165415\n\n\n-\n\nThe government approved a draft version of a QP bill that regulates contractor work terms to prevent certain types of ""unfair and arbitrary terminations"". Details inside... https://armtimes.com/hy/article/165411\n\n-\n\nThe government approved the agreement to remove double taxation with Singapore, and to prevent tax avoidance 🇸🇬 \n\nhttps://armenpress.am/arm/news/980749.html\n\n-\n\n---\n\nThe hospital healthcare becomes free for anyone below 18, beginning today. Parents of soldiers will also qualify for this free care. Some treatments with expensive equipment will become free for disabled people. https://www.armtimes.com/hy/article/165479\n\n---\n\nSevan lake is greener than usual. Last time it was this green in 1960s and in 2018. Smaller level greentifications have happened frequently for many years. Why does the lake become green?\n\n It's due to flora and other organic processes, which can place the quality of the lake in danger if nothing is done. 
When the water level rises but the trees near the shore aren't cut beforehand, it cases problems when the trees become submerged. When large quantities of water is drained for irrigation, it causes more problems because it disturbs the organic process that the lake does to clean itself. Add to this the fact that phosphorus, sewage and other materials are added to lake, it results in the lake changing its color.\n\n-\n\nMinister of Nature Protection said the current greenification is caused by a combination of algae (jrimur) growth, phosphorus-nitrogen and other chemical buildup, dirty water flowing into the lake, submerged trees near the shores, record high temperatures. \n\n-\n\nThe government begun an examination to find and treat the causes. They have already identified the areas that need to be cleared of submerged trees. The cleanup can last 2-3 years, and the lake's water levels will be brought to 1901.5 meters afterwards. To identify the areas that need to be cleaned, the ministry used satellite and drones. 770 hectares of problematic areas were identified. In the past years, 70 hectares were cleaned each year. They will increase that number to 300 hectares for 2020 and 2021.\n\nThis season's water drainage is already low compared to past years, which should help. Minister says they're working with the law enforcement and Agricultural ministry to reduce irrigation abuse.\n\n-\n\nGerman institutes are involved with identifying certain pollution sources. The Ministry sent a petition to UNESCO to add Sevan to Bioshpere Reserve program, which will allow better cleanup management and socioeconomic improvements for residents living near the lake. Belarus has offered help with the cleanup.\n\n-\n\nMinistry is working with PM's office to create a plan to clean up 30 rivers that pour into the lake. After the cleanup, the lake's water levels will be raised because more fresh water automatically translates into better quality in some regards. 
\n\n-\n\nWithin the next 2 weeks, the large deposits of phosphorus will erode away, and the green color, which is ""still safe"", should be significantly reduced, said the Minister. This greenifcation is also happening in the Black Sea and Baikal Lake due to climate change and temperature rise. \n\n\nhttps://hetq.am/hy/article/105250 --- https://www.youtube.com/watch?v=0cDwuFXB2Mc --- https://armtimes.com/hy/article/165424 --- TLDR https://www.youtube.com/watch?v=ORD7jJClIRk\n\n---\n\nPashinyan's government earlier approved raising minimum wage from 55k to at least 63k. However, they want it raised even higher, to 68k. \n\nQP party in Parliament doesn't fully agree with the government. Says extending it to 68k will have ""side effects"". The co-author of the bill QP MP Babken Tunyan says 63k number was initially chosen because it's the minimum necessary to ""survive"", aka the food basket number. ""It makes sense for the minimum wage to equal to it"". \n\nSetting minimum wage above that number will cause government official salaries to also go up, because these salaries are calculated by multiplying the base salary by a number. MP doesn't want this to happen. \n\nThe MP is in favor of raising it to 68k only if the government agrees to make changes on how official salaries are calculated so they'll continue to earn the old salaries, and if the government submits a report on how much extra burden the extended raise will be on the budget. \n\nThe MP says 80,000 workers who earn exactly the current minimum wage will see their wages raised, plus many more workers will see their close-to-minimum wage raised to the new minimum wage. The number of ""affected"" workers is significantly higher than the 80,000 that was initially reported. \n\nhttps://armenpress.am/arm/news/980746.html\n\n......................................... \n\nMinistry of Labor says the 68k is the better number. They examined 1580 businesses and found that the raise wouldn't be a big burden on them. 
The businesses said they could go up to 70k. The minimum wage should be above the survival food basket, said the Deputy Minister.

-

To survive you need: 63k

Parliament: minimum wage should be 63k

Government: minimum wage should be 68k

Business: anything below 70k won't hurt us (per government study)

https://armenpress.am/arm/news/980734.html --- https://www.youtube.com/watch?v=NOjSGdKjeLk

---

29 PACE representatives urged the Armenian government to do more to protect LGBTI people *(a random new letter attached to LGBT every day?)*, calling the current actions insufficient and failed. They cite death threats against an LGBTI forum in Armenia as an example.

The report criticizes that there are no new laws to protect LGBTI people, and that the existing ones don't respect gender identity and gender choice. They urged the government to publicly denounce hate speech against LGBTI people, and to better train public officials and judges to respect them.

Today's comment section will have 50 comments, and 49 will be about this topic.

https://armtimes.com/hy/article/165445

---

California will dedicate $5mln to build an Armenian-US museum in Glendale. It'll be about the history of Armenia, the diaspora, cultural cooperation, etc.

https://www.foxnews.com/us/ocasio-cortez-accused-of-stunt-with-claims-at-the-border-acquitted-navy-seal-to-speak-to-fox-news

---

The Artsakh and Armenia Human Rights Ombudsmen met Los Angeles municipality officials and legislators, thanked them for the continuous support, and discussed new plans.
https://armtimes.com/hy/article/165448

---

The World Customs Organization re-elects the Armenian Customs Service as a member of its audit committee. 182 countries, 12 of which are audit members; Armenia and the Netherlands represent Europe.
It's about the classification of goods, customs valuation, border cooperation, etc.

http://arka.am/en/news/business/world_customs_organization_re_elects_armenian_customs_service_as_member_of_its_audit_committee/

---

Everyone: installs a trampoline for a pool

Armenian nibbas: moves the pool under an ancient historical bridge to use it as a trampoline

Smart nibbas: just stop, get some help

https://armenpress.am/arm/news/980785.html

---

Fireworks explosion in Belarus leaves several injured and one dead. SFW https://youtu.be/OhuyA2BBZDc?t=4 -- https://youtu.be/C4MF2XI_raw?t=84

---

Singer Hayko has a message for the haters https://www.youtube.com/watch?v=xZocs8XQUe4


@wanghaisheng wanghaisheng commented Sep 12, 2021

If I try to define a variable in Python like this:
a = r' Single-line version'
it gives me invalid syntax.
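A note on the question above: a line like `a = r'...'` is valid Python on its own, so the `SyntaxError` most likely comes from pasting the full gist pattern into a single-quoted literal. The pattern contains both `'` and `"` characters (e.g. `«»“”‘’`), which terminate an ordinary string literal early. A triple-quoted raw string sidesteps this. The sketch below uses a heavily simplified stand-in pattern for illustration only, not the gist's full regex, but the full pattern can be embedded the same way:

```python
import re

# Triple-quoted raw string: the pattern can freely contain ' and "
# without ending the literal, and backslashes need no doubling.
# NOTE: this is a simplified illustrative pattern, not the gist's.
URL_PATTERN = re.compile(r"""(?i)\bhttps?://[^\s<>"']+""")

text = "See https://gist.github.com/gruber/8891611 for details."
match = URL_PATTERN.search(text)
print(match.group(0))  # → https://gist.github.com/gruber/8891611
```

If the pattern ever needs to contain three consecutive quote characters, switching between `"""..."""` and `'''...'''` (or concatenating adjacent string literals) avoids the remaining edge case.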
