The regex patterns in this gist are intended to match any URLs,
including "mailto:foo@example.com", "x-whatever://foo", etc. For a
pattern that attempts only to match web URLs (http, https), see:
https://gist.github.com/gruber/8891611
# Single-line version of pattern:
(?i)\b((?:[a-z][\w-]+:(?:/{1,3}|[a-z0-9%])|www\d{0,3}[.]|[a-z0-9.\-]+[.][a-z]{2,4}/)(?:[^\s()<>]+|\(([^\s()<>]+|(\([^\s()<>]+\)))*\))+(?:\(([^\s()<>]+|(\([^\s()<>]+\)))*\)|[^\s`!()\[\]{};:'".,<>?«»“”‘’]))
# Multi-line commented version of same pattern:
(?xi)
\b
(  # Capture 1: entire matched URL
  (?:
    [a-z][\w-]+:  # URL protocol and colon
    (?:
      /{1,3}  # 1-3 slashes
      |  # or
      [a-z0-9%]  # Single letter or digit or '%'
      # (Trying not to match e.g. "URI::Escape")
    )
    |  # or
    www\d{0,3}[.]  # "www.", "www1.", "www2." … "www999."
    |  # or
    [a-z0-9.\-]+[.][a-z]{2,4}/  # looks like domain name followed by a slash
  )
  (?:  # One or more:
    [^\s()<>]+  # Run of non-space, non-()<>
    |  # or
    \(([^\s()<>]+|(\([^\s()<>]+\)))*\)  # balanced parens, up to 2 levels
  )+
  (?:  # End with:
    \(([^\s()<>]+|(\([^\s()<>]+\)))*\)  # balanced parens, up to 2 levels
    |  # or
    [^\s`!()\[\]{};:'".,<>?«»“”‘’]  # not a space or one of these punct char
  )
)
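A minimal sketch of how the single-line pattern might be applied, assuming Python's `re` module, which accepts the leading `(?i)` flag as written. The sample text and the `find_urls` helper are made up for illustration. Because the pattern contains five capture groups, the sketch reads group 1 from `finditer` rather than calling `findall`.

```python
import re

# The single-line pattern from above, unchanged. A raw triple-quoted string
# keeps the backslashes and both kinds of quote characters intact.
URL_PATTERN = re.compile(
    r"""(?i)\b((?:[a-z][\w-]+:(?:/{1,3}|[a-z0-9%])|www\d{0,3}[.]|[a-z0-9.\-]+[.][a-z]{2,4}/)(?:[^\s()<>]+|\(([^\s()<>]+|(\([^\s()<>]+\)))*\))+(?:\(([^\s()<>]+|(\([^\s()<>]+\)))*\)|[^\s`!()\[\]{};:'".,<>?«»“”‘’]))"""
)


def find_urls(text):
    """Return every URL-like substring (capture group 1) found in text."""
    return [m.group(1) for m in URL_PATTERN.finditer(text)]


# Hypothetical sample text, not taken from the gist.
sample = "Write to mailto:foo@example.com or see http://example.com/path, then www.example.org."
print(find_urls(sample))
```

The comma after the second URL and the final full stop are left out of the matches, which is what the last character class in the pattern is for.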
KJD: Have you actually tried it? The pattern matches "http://موقع.وزارة-الاتصالات.مصر/" in both PCRE and Perl. What makes you think it doesn't work?
You're of course correct; I jumped the gun in scanning through the expression. I guess the [a-z]{2,4} pattern fails in the following case, which is not a legal URL and which you could therefore argue is less important to catch:
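The exchange above can be checked directly. The sketch below assumes Python's `re` module and uses the single-line pattern as posted. The IDN URL is the one quoted in the thread; the two schemeless addresses are made up for illustration and are not the example the commenter originally attached.

```python
import re

# The single-line pattern from the gist, unchanged.
URL_PATTERN = re.compile(
    r"""(?i)\b((?:[a-z][\w-]+:(?:/{1,3}|[a-z0-9%])|www\d{0,3}[.]|[a-z0-9.\-]+[.][a-z]{2,4}/)(?:[^\s()<>]+|\(([^\s()<>]+|(\([^\s()<>]+\)))*\))+(?:\(([^\s()<>]+|(\([^\s()<>]+\)))*\)|[^\s`!()\[\]{};:'".,<>?«»“”‘’]))"""
)


def first_url(text):
    """Return the first match (capture group 1), or None if nothing matches."""
    m = URL_PATTERN.search(text)
    return m.group(1) if m else None


# With a scheme, the first branch matches "http://" and the rest is consumed
# by [^\s()<>]+, so the [a-z]{2,4} branch is never consulted.
idn = "http://موقع.وزارة-الاتصالات.مصر/"
print(first_url(idn) == idn)

# Without a scheme or "www.", only the third branch can start a match, and it
# needs a 2-4 letter ASCII suffix followed by a slash.
print(first_url("example.com/exhibits"))     # hypothetical address
print(first_url("example.museum/exhibits"))  # hypothetical address
```

Under these assumptions the `{2,4}` limit only matters for addresses written without a scheme and without a leading `www.`.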
John, since the last revision, your regex will also match stuff like
Hi.
This link will burn my server :)
This script hangs in Chrome and Node.js at 100% CPU usage:
"http://www.ghislainproulx.net/Blog/2014/09/contributing-to-a-github-open-source-project-(from-a-visual-studio-developer-perspective)".replace(/\b((?:[a-z][\w-]+:(?:\/{1,3}|[a-z0-9%])|www\d{0,3}[.]|[a-z0-9.\-]+[.][a-z]{2,4}\/)(?:[^\s()<>]+|\(([^\s()<>]+|(\([^\s()<>]+\)))*\))+(?:\(([^\s()<>]+|(\([^\s()<>]+\)))*\)|[^\s`!()\[\]{};:'".,<>?«»“”‘’]))/gi, function(url){
  // this will never be executed on Chrome/Node
  console.log(url);
});
Yeah, I'm seeing a hang with the input:
as well. The balanced-paren rules seem to be blowing up. The following version doesn't have the performance problem:
/\b((?:[a-z][\w\-]+:(?:\/{1,3}|[a-z0-9%])|www\d{0,3}[.]|[a-z0-9.\-]+[.][a-z]{2,4}\/)(?:[^\s()<>]|\((?:[^\s()<>]|(?:\([^\s()<>]+\)))*\))+(?:\((?:[^\s()<>]|(?:\([^\s()<>]+\)))*\)|[^\s`!()\[\]{};:'".,<>?«»“”‘’]))/i
Note that I've just removed three unnecessary/redundant
See https://gerrit.wikimedia.org/r/#/c/170329/1/lib/index.js for a colorized diff. I've tested that this change fixes the problems noted by @FGRibreau and @chemist777 above.
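As far as a side-by-side reading of the two patterns shows, the revised version drops the `+` after `[^\s()<>]` in three places where it sat inside an already-repeated group, and makes the inner groups non-capturing. A minimal sketch of the revised pattern applied to the URL reported earlier in the thread, assuming Python's `re` module (also a backtracking engine) and a translation of the JavaScript literal into Python syntax:

```python
import re

# The revised pattern from the comment above, translated from a JavaScript
# literal: the /i flag becomes (?i) and "\/" becomes "/".
FIXED_PATTERN = re.compile(
    r"""(?i)\b((?:[a-z][\w\-]+:(?:/{1,3}|[a-z0-9%])|www\d{0,3}[.]|[a-z0-9.\-]+[.][a-z]{2,4}/)(?:[^\s()<>]|\((?:[^\s()<>]|(?:\([^\s()<>]+\)))*\))+(?:\((?:[^\s()<>]|(?:\([^\s()<>]+\)))*\)|[^\s`!()\[\]{};:'".,<>?«»“”‘’]))"""
)

# The URL reported earlier in the thread as hanging Chrome and Node.js.
url = "http://www.ghislainproulx.net/Blog/2014/09/contributing-to-a-github-open-source-project-(from-a-visual-studio-developer-perspective)"

match = FIXED_PATTERN.search(url)
print(match is not None and match.group(1) == url)
```

With the inner `+` gone, each ordinary character can be consumed in only one way, so a failed attempt no longer multiplies into many equivalent retries.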
I've come up with the following version. It requires that the URL begin with a protocol such as http://, https://, or even mailto:. No, I'm not a regex genius, but I've been plugging away at testing this variation, and it seems to work so far.
Those experiencing hanging problems should try the pattern in a real regular expression engine, that is, one which does not backtrack.
Could anybody help me out with a version of this that also optionally allows the URL to be enclosed like so: URL:thefullurl This is a format I still come across rather often in an old forum.
To add support for URIs with schemes like
Belated @mattauckland - guessing the reason is for URLs like
The two balanced-parens parts use capturing groups, while the rest of the regex uses non-capturing groups (except for the outermost match, obviously). May I suggest changing
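To illustrate the point about capturing groups, here is a small sketch, assuming Python's `re` module, that isolates the "balanced parens" fragment of the pattern. The non-capturing rewrite shown is one plausible reading of the suggestion, not necessarily the commenter's exact change, and the sample string is made up.

```python
import re

# The balanced-parens fragment as it appears in the gist (two capture groups).
capturing = re.compile(r"\(([^\s()<>]+|(\([^\s()<>]+\)))*\)")

# The same fragment with both groups made non-capturing via (?:...).
non_capturing = re.compile(r"\((?:[^\s()<>]+|(?:\([^\s()<>]+\)))*\)")

sample = "(foo(bar)baz)"  # hypothetical input with one nested level

# Both variants accept the same text; only the number of groups differs.
print(capturing.fullmatch(sample) is not None, capturing.groups)
print(non_capturing.fullmatch(sample) is not None, non_capturing.groups)
```

Applied to both occurrences in the full pattern, this would bring the group count from five down to one, so that in Python `findall` returns plain strings instead of tuples.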
Hi. To make the regex match against
but unfortunately also against
I added {0,1} at the end of the 'balanced parens, up to 2 levels' group. My final regex is:
I have the same problem. I used but it returns some URLs like: I need your help, please!
What does the actual code look like? I have the string $str = "Blaa lorem ipsum domain-name.studio blaa blaa another.com blaa blaa"; and I want to get the output: Yes it contains one or more domains: Thanks if you have time to help!
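A minimal sketch of the check being asked about, written in Python rather than PHP and assuming the single-line pattern as posted. It suggests the pattern will not help with this particular string: every branch that can start a match needs a scheme and colon, a leading `www.`, or a slash after the domain, and the bare domains here have none of these. The `contains_url` helper and the second test string are hypothetical.

```python
import re

# The single-line pattern from the gist, unchanged.
URL_PATTERN = re.compile(
    r"""(?i)\b((?:[a-z][\w-]+:(?:/{1,3}|[a-z0-9%])|www\d{0,3}[.]|[a-z0-9.\-]+[.][a-z]{2,4}/)(?:[^\s()<>]+|\(([^\s()<>]+|(\([^\s()<>]+\)))*\))+(?:\(([^\s()<>]+|(\([^\s()<>]+\)))*\)|[^\s`!()\[\]{};:'".,<>?«»“”‘’]))"""
)


def contains_url(text):
    """True if the gist pattern finds at least one match in text."""
    return URL_PATTERN.search(text) is not None


# The string from the question: bare domains, no scheme, no "www.", no slash.
text = "Blaa lorem ipsum domain-name.studio blaa blaa another.com blaa blaa"
print(contains_url(text))

# Hypothetical variation: the same domain written with a leading "www.".
print(contains_url("blaa www.another.com blaa"))
```

Detecting bare domain names would need a different pattern from the one in this gist.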
I tried: $regex = "(?i)\b((?:[a-z][\w-]+:(?:/{1,3}|[a-z0-9%])|www\d{0,3}[.]|[a-z0-9.-]+[.][a-z]{2,4}/)(?:[^\s()<>]+|(([^\s()<>]+|(([^\s()<>]+)))))+(?:(([^\s()<>]+|(([^\s()<>]+))))|[^\s`!()[]{};:'".,<>?«»“”‘’]))"; // SCHEME
But I got this error: PHP Parse error: syntax error, unexpected ','
Hi. Sorry for asking, but regexes like this are a bit over my head :-) I was trying to parse some WSDL files (basically XML) and I was wondering: is there any way to avoid matching things like
Using Node 14.2, it hangs when I try to match the string https://en.wikipedia.org/wiki/Learning_to_Fly_(Tom_Petty_and_the_Heartbreakers) It looks like some kind of catastrophic backtracking in the balanced-parens clauses, but I'm not sure how to fix it.
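One way to see why this input is so slow: inside the parentheses, `([^\s()<>]+|(\([^\s()<>]+\)))*` can divide a run of n ordinary characters into consecutive non-empty pieces in 2^(n-1) different ways, and a backtracking engine may try them all before giving up on that branch. A rough sketch in Python, with a hypothetical helper `count_splits`, checks that count recursively on short runs and then applies the formula to the parenthesised text in the URL above. It is an estimate of the search space, not a measurement of any particular engine.

```python
from functools import lru_cache


@lru_cache(maxsize=None)
def count_splits(n):
    """Number of ways to cut a run of n characters into non-empty pieces."""
    if n == 0:
        return 1  # nothing left to cut: one (empty) way
    # Choose the length of the first piece, then split the remainder.
    return sum(count_splits(n - first) for first in range(1, n + 1))


# The recursive counts agree with the closed form 2**(n - 1) for short runs.
for n in range(1, 12):
    assert count_splits(n) == 2 ** (n - 1)

inner = "Tom_Petty_and_the_Heartbreakers"  # text between the parentheses
print(len(inner), 2 ** (len(inner) - 1))
```

For the 31 characters here that is about a billion candidate splits, which fits the hang described.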
It seems the expression doesn't deal with a number of real-life updates to URI patterns in recent years, such as internationalised domains, new top-level domains that are not between 2 and 4 characters long, IRIs, etc. Some references: http://www.icann.org/en/topics/TLD-acceptance/ and http://www.ietf.org/rfc/rfc3987.txt. A domain like http://موقع.وزارة-الاتصالات.مصر/ is legal and functional today.