The regex patterns in this gist are intended to match any URLs,
including "mailto:foo@example.com", "x-whatever://foo", etc. For a
pattern that attempts only to match web URLs (http, https), see:
https://gist.github.com/gruber/8891611
# Single-line version of pattern:
(?i)\b((?:[a-z][\w-]+:(?:/{1,3}|[a-z0-9%])|www\d{0,3}[.]|[a-z0-9.\-]+[.][a-z]{2,4}/)(?:[^\s()<>]+|\(([^\s()<>]+|(\([^\s()<>]+\)))*\))+(?:\(([^\s()<>]+|(\([^\s()<>]+\)))*\)|[^\s`!()\[\]{};:'".,<>?«»“”‘’]))
# Multi-line commented version of same pattern:
(?xi)
\b
(                                       # Capture 1: entire matched URL
  (?:
    [a-z][\w-]+:                        # URL protocol and colon
    (?:
      /{1,3}                            # 1-3 slashes
      |                                 #   or
      [a-z0-9%]                         # Single letter or digit or '%'
                                        # (Trying not to match e.g. "URI::Escape")
    )
    |                                   #   or
    www\d{0,3}[.]                       # "www.", "www1.", "www2." … "www999."
    |                                   #   or
    [a-z0-9.\-]+[.][a-z]{2,4}/          # looks like domain name followed by a slash
  )
  (?:                                   # One or more:
    [^\s()<>]+                          # Run of non-space, non-()<>
    |                                   #   or
    \(([^\s()<>]+|(\([^\s()<>]+\)))*\)  # balanced parens, up to 2 levels
  )+
  (?:                                   # End with:
    \(([^\s()<>]+|(\([^\s()<>]+\)))*\)  # balanced parens, up to 2 levels
    |                                   #   or
    [^\s`!()\[\]{};:'".,<>?«»“”‘’]      # not a space or one of these punct char
  )
)
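As a rough usage sketch (not part of the gist itself), the single-line pattern can be dropped into Python's `re` module to pull URL-like strings out of running text. The sample sentence and the helper name `find_urls` are made up for illustration.

```python
import re

# Single-line pattern from the gist, copied into a raw triple-quoted string
# so that the quote characters inside the final character class need no escaping.
URL_PATTERN = re.compile(
    r"""(?i)\b((?:[a-z][\w-]+:(?:/{1,3}|[a-z0-9%])|www\d{0,3}[.]|[a-z0-9.\-]+[.][a-z]{2,4}/)(?:[^\s()<>]+|\(([^\s()<>]+|(\([^\s()<>]+\)))*\))+(?:\(([^\s()<>]+|(\([^\s()<>]+\)))*\)|[^\s`!()\[\]{};:'".,<>?«»“”‘’]))"""
)


def find_urls(text):
    """Return capture group 1 (the entire matched URL) for every match in text."""
    return [m.group(1) for m in URL_PATTERN.finditer(text)]


# Hypothetical sample text: a URL with balanced parens, a mailto: URL
# followed by a comma, and a bare "www." host followed by a full stop.
sample = "See http://example.com/foo_(bar) or mailto:foo@example.com, then www.example.org."
for url in find_urls(sample):
    print(url)
```

On this sample the balanced parentheses stay inside the first match, while the trailing comma and full stop are left out of the other two, which is what the "End with" clause is for.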
KJD: Have you actually tried it? The pattern matches "http://موقع.وزارة-الاتصالات.مصر/" in both PCRE and Perl. What makes you think it doesn't work?
You're of course correct; I jumped the gun in scanning through the expression. I guess the case where the [a-z]{2,4} pattern fails is the following one, which is not a legal URL and is therefore arguably less important to catch:
#!/usr/bin/env python3
import re

def matches(s):
    # re.match only tries the pattern at the start of the string.
    if re.match(r'(?i)\b((?:[a-z][\w-]+:(?:/{1,3}|[a-z0-9%])|www\d{0,3}[.]|[a-z0-9.\-]+[.][a-z]{2,4}/)(?:[^\s()<>]+|\(([^\s()<>]+|(\([^\s()<>]+\)))*\))+(?:\(([^\s()<>]+|(\([^\s()<>]+\)))*\)|[^\s`!()\[\]{};:\'".,<>?«»“”‘’]))', s):
        print("%s matches" % s)
    else:
        print("%s doesn't match" % s)

matches('موقع.وزارة-الاتصالات.مصر/ar/default.aspx')
matches('example.com/index.html')
John, since the last revision, your regex will also match stuff like http://#, http://## and http://## /. The previous version didn’t have that problem. I made a quick test case here: http://mathiasbynens.be/demo/url-regex
Hi.
Please check your regex with the link below.
http://ddos-link.com/[test.......................................]
This link will burn my server :)
The following script hangs Chrome and Node.js at 100% CPU usage:
"http://www.ghislainproulx.net/Blog/2014/09/contributing-to-a-github-open-source-project-(from-a-visual-studio-developer-perspective)".replace(/\b((?:[a-z][\w-]+:(?:\/{1,3}|[a-z0-9%])|www\d{0,3}[.]|[a-z0-9.\-]+[.][a-z]{2,4}\/)(?:[^\s()<>]+|\(([^\s()<>]+|(\([^\s()<>]+\)))*\))+(?:\(([^\s()<>]+|(\([^\s()<>]+\)))*\)|[^\s`!()\[\]{};:'".,<>?«»“”‘’]))/gi, function (url) {
    // this will never be executed on Chrome/Node
    console.log(url);
});
Yeah, I'm seeing a hang with the input:
"Ficheiro:Joseph_Ducreux_(French_-_Self-Portrait,_Yawning_-_Google_Art_Project.jpg"
as well. The balanced-paren rules seem to be blowing up.
The following version doesn't have the performance problem:
/\b((?:[a-z][\w\-]+:(?:\/{1,3}|[a-z0-9%])|www\d{0,3}[.]|[a-z0-9.\-]+[.][a-z]{2,4}\/)(?:[^\s()<>]|\((?:[^\s()<>]|(?:\([^\s()<>]+\)))*\))+(?:\((?:[^\s()<>]|(?:\([^\s()<>]+\)))*\)|[^\s`!()\[\]{};:'".,<>?«»“”‘’]))/i
Note that I've just removed three unnecessary/redundant + operators, so that the regexp ends with:
(?: # One or more:
[^\s()<>] # Non-space, non-()<> (removed a + here)
| # or
\(([^\s()<>]|(\([^\s()<>]+\)))*\) # balanced parens, up to 2 levels (removed a + here)
)+
(?: # End with:
\(([^\s()<>]|(\([^\s()<>]+\)))*\) # balanced parens, up to 2 levels (removed a + here)
| # or
[^\s`!()\[\]{};:'".,<>?«»“”‘’] # not a space or one of these punct char
)
)
See https://gerrit.wikimedia.org/r/#/c/170329/1/lib/index.js for a colorized diff.
I've tested that this change fixes the problems noted by @FGRibreau and @chemist777 above.
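For anyone who wants to repeat that check outside JavaScript, here is a minimal sketch in Python, whose `re` module is also a backtracking engine. It only transliterates the de-nested pattern above (`\/` written as `/`, the `i` flag passed as `re.IGNORECASE`) and runs it on the two inputs reported earlier in the thread; the helper name `first_match` is made up.

```python
import re

# De-nested variant from the comment above, transliterated for Python.
FIXED = re.compile(
    r"""\b((?:[a-z][\w\-]+:(?:/{1,3}|[a-z0-9%])|www\d{0,3}[.]|[a-z0-9.\-]+[.][a-z]{2,4}/)(?:[^\s()<>]|\((?:[^\s()<>]|(?:\([^\s()<>]+\)))*\))+(?:\((?:[^\s()<>]|(?:\([^\s()<>]+\)))*\)|[^\s`!()\[\]{};:'".,<>?«»“”‘’]))""",
    re.IGNORECASE,
)

# The two inputs reported as hanging with the original pattern.
REPORTED_INPUTS = [
    "http://www.ghislainproulx.net/Blog/2014/09/contributing-to-a-github-open-source-project-(from-a-visual-studio-developer-perspective)",
    "Ficheiro:Joseph_Ducreux_(French_-_Self-Portrait,_Yawning_-_Google_Art_Project.jpg",
]


def first_match(text):
    """Return capture group 1 of the first match, or None if nothing matches."""
    m = FIXED.search(text)
    return m.group(1) if m else None


for text in REPORTED_INPUTS:
    print(repr(first_match(text)))
```

The first input is matched in full, including the balanced parentheses. The second has an opening parenthesis that is never closed, so the match stops just before it.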
I've come up with the following version. It requires that the URL begin with a protocol like http:// or https://, or even mailto:
No, I'm not a regex genius, but I've been plugging away at testing this variation, and it seems to work so far.
_(?i)\b((?:(?:https?|ftps?)://|ftp\.|ftps\.|mailto:|www\d{0,3}[.])(?:[^\s()<>]|\((?:[^\s()<>]|(?:\([^\s()<>]+\)))*\))+(?:\((?:[^\s()<>]|(?:\([^\s()<>]+\)))*\)|[^\s`!()\[\]{};:'".,<>?«»“”‘’]))_iuS
Those experiencing hanging problems should try the pattern in a real regular expression engine, that is, one which does not backtrack.
Could anybody help me out with a version of this that also optionally allows the URL to be enclosed like so: URL:thefullurl? This is a format I still come across rather often in an old forum.
To add support for URIs with schemes like stratum+tcp and xmlrpc.beep, or paths starting with + or ? (e.g. sms:, magnet:), I'm using a version with [a-z][\w.+-]+:(?:/{1,3}|[?+]?[a-z0-9%]) as the first section.
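A small sketch of what that change does, looking at the first section only and not at the full URL pattern. The example URIs (pool host, phone number, magnet hash) are invented placeholders, and `prefix` is a hypothetical helper.

```python
import re

# First section of the original pattern: the scheme may contain only \w and "-".
ORIGINAL_SECTION = re.compile(r"(?i)\b[a-z][\w-]+:(?:/{1,3}|[a-z0-9%])")
# First section as quoted in the comment above: the scheme may also contain
# "." and "+", and the part after the colon may start with "?" or "+".
EXTENDED_SECTION = re.compile(r"(?i)\b[a-z][\w.+-]+:(?:/{1,3}|[?+]?[a-z0-9%])")

EXAMPLES = [
    "stratum+tcp://pool.example.com:3333",
    "xmlrpc.beep://example.com/",
    "sms:+15555550100",
    "magnet:?xt=urn:btih:abc123",
]


def prefix(pattern, text):
    """Return the text matched at the start of the string, or None."""
    m = pattern.match(text)
    return m.group(0) if m else None


for uri in EXAMPLES:
    print(uri, "| original:", prefix(ORIGINAL_SECTION, uri),
          "| extended:", prefix(EXTENDED_SECTION, uri))
```

Anchored at the start of each string, the original section rejects all four examples, while the extended one accepts a prefix of each.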
You can omit the domain extension, and it is still considered a valid URL. Surely that's a bit of a hole in the logic, or is there a reason for that?
Belated @mattauckland - guessing the reason is for URLs like http://localhost/ to be matched.
The two balanced parens parts use capturing groups, while the rest of the regex uses non-capturing groups (except for the outermost match, obviously). May I suggest changing ( into (?: in those four places?
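To make the suggestion concrete, here is a sketch that assembles the pattern twice in Python, once with the capturing balanced-parens clause from the gist and once with a non-capturing version of it, and compares the number of groups. The names `HEAD`, `TAIL_CLASS` and `build` are only scaffolding for this illustration.

```python
import re

# Everything up to and including the scheme / "www." / domain alternatives.
HEAD = r"(?i)\b((?:[a-z][\w-]+:(?:/{1,3}|[a-z0-9%])|www\d{0,3}[.]|[a-z0-9.\-]+[.][a-z]{2,4}/)"
# Final character class: not a space and not trailing punctuation.
TAIL_CLASS = r"""[^\s`!()\[\]{};:'".,<>?«»“”‘’]"""

# Balanced-parens clause as written in the gist (two capturing groups) ...
PARENS_CAPTURING = r"\(([^\s()<>]+|(\([^\s()<>]+\)))*\)"
# ... and with "(" changed to "(?:" as suggested above.
PARENS_NON_CAPTURING = r"\((?:[^\s()<>]+|(?:\([^\s()<>]+\)))*\)"


def build(parens):
    """Assemble the full pattern around the given balanced-parens clause."""
    return re.compile(
        HEAD + r"(?:[^\s()<>]+|" + parens + r")+(?:" + parens + "|" + TAIL_CLASS + "))"
    )


capturing = build(PARENS_CAPTURING)
non_capturing = build(PARENS_NON_CAPTURING)

print(capturing.groups, non_capturing.groups)
print(non_capturing.search("see http://example.com/foo_(bar) now").group(1))
```

With the change, group 1 is the only capture left, so APIs that return every group (such as Python's `re.findall`) give back plain strings instead of tuples.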
Hi,
To make the regex also match bare domains such as
example.com
abv.bg
google.com
(unfortunately it then also matches names like
filename.txt)
I added {0,1} at the end of the 'balanced parens, up to 2 levels' groups
and made the trailing slash optional
(the last two groups were eating characters from the matched string when matching against domain.[a-z]{2,4}, so {2,4} became incorrect in that case).
My final regex is:
\b((?:[a-z][\w-]+:(?:\/{1,3}|[a-z0-9%])|www\d{0,3}[.]|[a-z0-9.\-]+[.][a-z]{2,4}\/?)(?:[^\s()<>]+|\(([^\s()<>]+|(\([^\s()<>]+\)))*\)){0,}(?:\(([^\s()<>]+|(\([^\s()<>]+\)))*\)|[^\s\!()\[\]{};:\'\"\.\,<>?«»“”‘’]){0,})
I have the same problem. I used
GRUBER_URLINTEXT_PAT = re.compile(ur'(?i)\b((?:https?://|www\d{0,3}[.]|[a-z0-9.-]+[.][a-z]{2,4}/)(?:[^\s()<>]+|(([^\s()<>]+|(([^\s()<>]+)))))+(?:(([^\s()<>]+|(([^\s()<>]+))))|[^\s`!()[]{};:'".,<>?\xab\xbb\u201c\u201d\u2018\u2019]))')
but it returns some URLs like:
https://t.co/h…
I need your help, please!
What does the actual code look like? I have the string $str = "Blaa lorem ipsum domain-name.studio blaa blaa another.com blaa blaa"; and I want to get this output:
Yes it contains one or more domains:
domain-name.studio
another.com
Thanks if you have time to help!
I tried:
$regex = "(?i)\b((?:[a-z][\w-]+:(?:/{1,3}|[a-z0-9%])|www\d{0,3}[.]|[a-z0-9.-]+[.][a-z]{2,4}/)(?:[^\s()<>]+|(([^\s()<>]+|(([^\s()<>]+)))))+(?:(([^\s()<>]+|(([^\s()<>]+))))|[^\s`!()\[\]{};:'".,<>?«»“”‘’]))"; // SCHEME
$found_url = "";
if(preg_match("~^$regex$~i", $description, $m)) $found_url = $m;
if(preg_match("~^$regex$~i", $description, $m)) $found_url .= $m;
But I got this error: PHP Parse error: syntax error, unexpected ','
Hi,
Sorry for asking, but regexes like this are a bit over my head :-)
I was trying to parse some WSDL files (basically XML) and I was wondering: is there any way to avoid matching things like ab:1234, xs:complexType or this:isnotanurl?
Using node 14.2, it hangs when I try to match the string
https://en.wikipedia.org/wiki/Learning_to_Fly_(Tom_Petty_and_the_Heartbreakers)
Looks like some kind of catastrophic backtracking in the balanced parens clauses, but I'm not sure how to fix it.
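A simplified sketch of why that clause can blow up, under the assumption that the cost comes from the `+` run nested inside the `*` loop: when the closing parenthesis never matches, a backtracking engine can end up trying every way of splitting the run into pieces. The two patterns below are cut-down stand-ins, not the full URL regex, and the timings are only indicative.

```python
import re
import time

# Cut-down "balanced parens" clause: a "+" run nested inside a "*" loop.
NESTED = re.compile(r"\(([^\s()<>]+)*\)")
# Accepts the same strings, but each loop iteration consumes exactly one character.
FLAT = re.compile(r"\((?:[^\s()<>])*\)")


def time_match(pattern, text):
    """Return (match result, elapsed seconds) for a match attempt at the start of text."""
    start = time.perf_counter()
    result = pattern.match(text)
    return result, time.perf_counter() - start


for n in (10, 14, 18):
    text = "(" + "a" * n  # an opening paren that is never closed
    _, nested_secs = time_match(NESTED, text)
    _, flat_secs = time_match(FLAT, text)
    print(f"n={n:2d}  nested={nested_secs:.4f}s  flat={flat_secs:.6f}s")
```

Both patterns fail on these inputs, but on a backtracking engine the nested form is expected to take roughly exponentially longer as `n` grows, which is the behaviour the de-nested variant posted earlier in this thread avoids.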
My version of this:
(?i)\b(?:[a-z][\w.+-]+:(?:/{1,3}|[?+]?[a-z0-9%]))(?:[^\s()<>]+|\(([^\s()<>]+|(\([^\s()<>]+\)))*\))+(?:\(([^\s()<>]+|(\([^\s()<>]+\)))*\)|[^\s\x60!()\[\]{};:'".,<>?«»“”‘’])
Changes:
- Supports Go (changed backtick to \x60)
- Non-URLs like bit.com/test aren't recognized - Protocol section is required
- Applied change mentioned above
Putting wide characters (Unicode characters encoded in more than 1 byte) into a bracket expression ([ ]) is incorrect:
[^\s`!()\[\]{};:'".,<>?«»“”‘’] # not a space or one of these punct char
« is two bytes: "\xc2\xab", which means the pattern will accept \xc2 and \xab anywhere in the sequence, not in a specific order and not even close to each other!
php -r '$s="\xab \xc2 \xc2 \xab"; $v=preg_match_all("/[«]/", $s, $m); var_dump([$v, $m, $s]);' > foo.txt
You need to open foo.txt with a program that can display raw bytes.
putting wide characters (Unicode of more than 1 byte) is incorrect into a bracket expression ([ ])
It depends on the language/library. Works fine in Python and node.js
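Both observations can be reproduced in Python, as a sketch: a text (`str`) pattern treats « as a single code point, while a byte pattern built from its UTF-8 encoding behaves like the PHP example above and accepts either byte on its own. The variable names are illustrative only.

```python
import re

# Text (str) pattern: "«" is one code point, so the class matches it as a unit.
text_hits = re.findall("[«]", "a « b")

# Byte pattern: the UTF-8 encoding of "«" is the two bytes C2 AB, so the
# class ends up with two independent single-byte members.
byte_class = "[«]".encode("utf-8")  # b'[\xc2\xab]'
byte_hits = re.findall(byte_class, b"\xab \xc2 \xc2 \xab")

print(text_hits)
print(byte_hits)
```

So whether the bracket expression is safe depends on whether the engine is matching characters or raw bytes.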
It seems the expression doesn't deal with a number of real-life updates to URI patterns in recent years, like internationalised domains, new top-level domains that are not between 2 and 4 characters long, IRIs etc. Some references: http://www.icann.org/en/topics/TLD-acceptance/ and http://www.ietf.org/rfc/rfc3987.txt. A domain like http://موقع.وزارة-الاتصالات.مصر/ is legal and functional today.
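A quick sketch of where the `[a-z]{2,4}` limit bites, using Python and the single-line pattern from the gist. The host names are illustrative, and `match_url` is a made-up helper. The limit sits in the scheme-less "domain name followed by a slash" alternative, so it should only affect input without a protocol or "www." prefix.

```python
import re

# Single-line pattern from the gist.
URL_PATTERN = re.compile(
    r"""(?i)\b((?:[a-z][\w-]+:(?:/{1,3}|[a-z0-9%])|www\d{0,3}[.]|[a-z0-9.\-]+[.][a-z]{2,4}/)(?:[^\s()<>]+|\(([^\s()<>]+|(\([^\s()<>]+\)))*\))+(?:\(([^\s()<>]+|(\([^\s()<>]+\)))*\)|[^\s`!()\[\]{};:'".,<>?«»“”‘’]))"""
)


def match_url(text):
    """Return capture group 1 of the first match in text, or None."""
    m = URL_PATTERN.search(text)
    return m.group(1) if m else None


print(match_url("example.com/index.html"))     # 3-letter TLD, no scheme
print(match_url("example.museum/index.html"))  # 6-letter TLD, no scheme
print(match_url("http://example.museum/"))     # 6-letter TLD, with scheme
```

The scheme-less `.museum` input is the only one of the three that is not matched; with a scheme in front, the length of the top-level domain is never checked.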