The regex patterns in this gist are intended to match any URLs,
including "mailto:foo@example.com", "x-whatever://foo", etc. For a
pattern that attempts only to match web URLs (http, https), see:
https://gist.github.com/gruber/8891611
# Single-line version of pattern:
(?i)\b((?:[a-z][\w-]+:(?:/{1,3}|[a-z0-9%])|www\d{0,3}[.]|[a-z0-9.\-]+[.][a-z]{2,4}/)(?:[^\s()<>]+|\(([^\s()<>]+|(\([^\s()<>]+\)))*\))+(?:\(([^\s()<>]+|(\([^\s()<>]+\)))*\)|[^\s`!()\[\]{};:'".,<>?«»“”‘’]))
# Multi-line commented version of same pattern:
(?xi)
\b
(                           # Capture 1: entire matched URL
  (?:
    [a-z][\w-]+:                # URL protocol and colon
    (?:
      /{1,3}                        # 1-3 slashes
      |                             #   or
      [a-z0-9%]                     # Single letter or digit or '%'
                                    # (Trying not to match e.g. "URI::Escape")
    )
    |                           #   or
    www\d{0,3}[.]               # "www.", "www1.", "www2." … "www999."
    |                           #   or
    [a-z0-9.\-]+[.][a-z]{2,4}/  # looks like domain name followed by a slash
  )
  (?:                           # One or more:
    [^\s()<>]+                      # Run of non-space, non-()<>
    |                               #   or
    \(([^\s()<>]+|(\([^\s()<>]+\)))*\)  # balanced parens, up to 2 levels
  )+
  (?:                           # End with:
    \(([^\s()<>]+|(\([^\s()<>]+\)))*\)  # balanced parens, up to 2 levels
    |                                   #   or
    [^\s`!()\[\]{};:'".,<>?«»“”‘’]        # not a space or one of these punct char
  )
)
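As a rough illustration of how the single-line pattern might be used, here is a minimal Python sketch. The function name `find_urls` and the sample text are made up for this example; only the pattern itself comes from the gist.

```python
import re

# Single-line pattern copied verbatim from the gist.
GRUBER_URL = re.compile(
    r"""(?i)\b((?:[a-z][\w-]+:(?:/{1,3}|[a-z0-9%])|www\d{0,3}[.]|[a-z0-9.\-]+[.][a-z]{2,4}/)(?:[^\s()<>]+|\(([^\s()<>]+|(\([^\s()<>]+\)))*\))+(?:\(([^\s()<>]+|(\([^\s()<>]+\)))*\)|[^\s`!()\[\]{};:'".,<>?«»“”‘’]))"""
)


def find_urls(text):
    """Return capture group 1 (the entire matched URL) for every match."""
    return [m.group(1) for m in GRUBER_URL.finditer(text)]


# Hypothetical sample text; the addresses are placeholders.
sample = "See http://example.com/path, mailto:foo@example.com and www.example.org/a_(b)."
print(find_urls(sample))
```

Trailing punctuation such as the comma and the final full stop is left out of the matches, while the balanced parentheses at the end of the last URL are kept.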
You can omit the domain extension and the result is still considered a valid URL. Surely that's a hole in the logic, or is there a reason for it?
Belated reply, @mattauckland: my guess is that this is so URLs like http://localhost/ are matched.
The two balanced parens parts use capturing groups, while the rest of the regex uses non-capturing groups (except for the outermost match, obviously). May I suggest changing ( into (?: in those four places?
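To show what the suggestion changes in practice, here is a small Python sketch that isolates the balanced parens fragment. The names `BALANCED` and `BALANCED_NC` and the sample string are invented for illustration; the second fragment is the first one with ( replaced by (?: as proposed.

```python
import re

# Balanced-parens fragment as written in the gist (two capturing groups).
BALANCED = r"\(([^\s()<>]+|(\([^\s()<>]+\)))*\)"

# Same fragment with both groups made non-capturing, as suggested above.
BALANCED_NC = r"\((?:[^\s()<>]+|(?:\([^\s()<>]+\)))*\)"

text = "f(a) g(b(c))"

# With capturing groups, findall returns one tuple of sub-captures per match.
print(re.findall(BALANCED, text))

# Without them, findall returns the matched text itself.
print(re.findall(BALANCED_NC, text))

# Group counts: 2 for the original fragment, 0 for the rewritten one.
print(re.compile(BALANCED).groups, re.compile(BALANCED_NC).groups)
```

The fragment appears twice in the full pattern, which accounts for the four places mentioned. With the change, capture 1 would be the only group left.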
Hi,
I wanted the regex to also match bare domains such as
example.com
abv.bg
google.com
although it then unfortunately also matches
filename.txt
To do that, I added {0,1} at the end of the 'balanced parens, up to 2 levels' groups
and made the trailing slash optional
(the last two groups were eating characters from the matched string when matching against domain.[a-z]{2,4}, so {2,4} became incorrect in that case).
My final regex is:
\b((?:[a-z][\w-]+:(?:\/{1,3}|[a-z0-9%])|www\d{0,3}[.]|[a-z0-9.\-]+[.][a-z]{2,4}\/?)(?:[^\s()<>]+|\(([^\s()<>]+|(\([^\s()<>]+\)))*\)){0,}(?:\(([^\s()<>]+|(\([^\s()<>]+\)))*\)|[^\s\!()\[\]{};:\'\"\.\,<>?«»“”‘’]){0,})
I have the same problem. I used:
GRUBER_URLINTEXT_PAT = re.compile(r"""(?i)\b((?:https?://|www\d{0,3}[.]|[a-z0-9.-]+[.][a-z]{2,4}/)(?:[^\s()<>]+|\(([^\s()<>]+|(\([^\s()<>]+\)))*\))+(?:\(([^\s()<>]+|(\([^\s()<>]+\)))*\)|[^\s`!()\[\]{};:'".,<>?\xab\xbb\u201c\u201d\u2018\u2019]))""")
but it returns some URLs like:
https://t.co/h…
I need your help, please!
What does the actual code look like? I have the string $str = "Blaa lorem ipsum domain-name.studio blaa blaa another.com blaa blaa"; and I want to get this output:
Yes it contains one or more domains:
domain-name.studio
another.com
Thanks if you have time to help!
I tried:
$regex = "(?i)\b((?:[a-z][\w-]+:(?:/{1,3}|[a-z0-9%])|www\d{0,3}[.]|[a-z0-9.-]+[.][a-z]{2,4}/)(?:[^\s()<>]+|(([^\s()<>]+|(([^\s()<>]+)))))+(?:(([^\s()<>]+|(([^\s()<>]+))))|[^\s`!()\[\]{};:'".,<>?«»“”‘’]))"; // SCHEME
$found_url = "";
if(preg_match("~^$regex$~i", $description, $m)) $found_url = $m;
if(preg_match("~^$regex$~i", $description, $m)) $found_url .= $m;
But I got an error: PHP Parse error: syntax error, unexpected ','
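A note on the parse error: the pattern contains a literal " inside a double-quoted PHP string, so the string ends early at that point and the following ., is read as PHP code. Separately, the gist pattern only matches a bare domain when a slash follows it, so it would not find domain-name.studio or another.com anyway. The following is a minimal sketch of one way to get the requested output, written in Python rather than PHP. The pattern `BARE_DOMAIN` and the function `find_domains` are assumptions made for this example and are not part of the gist.

```python
import re

# Hypothetical pattern (not from the gist): dot-separated labels followed by
# an alphabetic top-level part of two or more letters.
BARE_DOMAIN = re.compile(
    r"\b(?:[a-z0-9](?:[a-z0-9-]*[a-z0-9])?\.)+[a-z]{2,}\b",
    re.IGNORECASE,
)


def find_domains(text):
    """Return every substring that looks like a bare domain name."""
    return BARE_DOMAIN.findall(text)


text = "Blaa lorem ipsum domain-name.studio blaa blaa another.com blaa blaa"
domains = find_domains(text)
if domains:
    print("Yes it contains one or more domains:")
    for domain in domains:
        print(domain)
```

As noted earlier in the thread, anything of this shape also matches file names such as filename.txt, so a check against a list of known top-level domains may be needed on top of it.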
Hi,
Sorry for asking, but regexes like this are a bit over my head :-)
I was trying to parse some WSDL files (basically XML) and I was wondering: is there any way to avoid matching things like ab:1234, xs:complexType or this:isnotanurl?
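Those strings match because the first branch accepts any scheme-like prefix followed by a colon and a letter or digit. The gist header points to a web-only pattern for this situation. Another option is to replace the generic scheme part with an explicit allowlist, as in this Python sketch. The scheme list and the name `SCHEME_LIMITED_URL` are assumptions chosen for illustration, and the www. and bare-domain branches are left out to keep it short.

```python
import re

# Assumption: only these schemes should count as URLs; adjust the list as needed.
SCHEME_LIMITED_URL = re.compile(
    r"""
    \b
    (?:https?|ftp|mailto):        # allowlisted scheme and colon
    (?:/{1,3}|[a-z0-9%])          # 1-3 slashes, or a letter, digit or percent sign
    (?:                           # one or more:
        [^\s()<>]                 #   a non-space, non-paren, non-angle character
      | \((?:[^\s()<>]|\([^\s()<>]+\))*\)    # balanced parens, up to 2 levels
    )+
    (?:                           # end with:
        \((?:[^\s()<>]|\([^\s()<>]+\))*\)    # balanced parens, up to 2 levels
      | [^\s`!()\[\]{};:'".,<>?«»“”‘’]         # not a space or trailing punctuation
    )
    """,
    re.IGNORECASE | re.VERBOSE,
)

# Hypothetical sample mixing namespace-style tokens with real URLs.
text = (
    "ab:1234 xs:complexType this:isnotanurl "
    "http://www.w3.org/2001/XMLSchema mailto:foo@example.com."
)
print(SCHEME_LIMITED_URL.findall(text))
```

Since every group is non-capturing, `findall` returns the matched URLs directly, and the namespace-style tokens are skipped because their prefixes are not in the allowlist.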
Using Node 14.2, the regex hangs when I try to match the string
https://en.wikipedia.org/wiki/Learning_to_Fly_(Tom_Petty_and_the_Heartbreakers)
It looks like some kind of catastrophic backtracking in the balanced parens clauses, but I'm not sure how to fix it.
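The likely culprit is the nested quantifier: inside the parens, `[^\s()<>]+` sits under a `*`, so a long run such as Tom_Petty_and_the_Heartbreakers can be split in exponentially many ways once the engine has to backtrack. One possible fix, which is not taken from the gist and is only a sketch, is to match plain characters one at a time so that each run can be split in just one way. The names `HEAD`, `PARENS`, `TAIL_CHAR` and `SAFER_URL` are invented for this example, and it is written in Python rather than JavaScript.

```python
import re

# First section of the gist pattern, unchanged.
HEAD = r"(?:[a-z][\w-]+:(?:/{1,3}|[a-z0-9%])|www\d{0,3}[.]|[a-z0-9.\-]+[.][a-z]{2,4}/)"

# Balanced parens, up to 2 levels. Assumed rewrite: the inner run matches a
# single character per iteration instead of [^\s()<>]+ under a star.
PARENS = r"\((?:[^\s()<>]|\([^\s()<>]+\))*\)"

# Final character class, unchanged from the gist.
TAIL_CHAR = r"""[^\s`!()\[\]{};:'".,<>?«»“”‘’]"""

SAFER_URL = re.compile(
    r"\b(" + HEAD + r"(?:[^\s()<>]|" + PARENS + r")+(?:" + PARENS + "|" + TAIL_CHAR + "))",
    re.IGNORECASE,
)

url = "https://en.wikipedia.org/wiki/Learning_to_Fly_(Tom_Petty_and_the_Heartbreakers)"
match = SAFER_URL.search(url)
print(match.group(1) == url)
```

The rewritten groups should accept the same strings as the originals, since a repeated run of single characters covers the same text as a repeated run of `+` chunks. The balanced groups are also non-capturing here, so capture 1 is the only group.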
My version of this:
(?i)\b(?:[a-z][\w.+-]+:(?:/{1,3}|[?+]?[a-z0-9%]))(?:[^\s()<>]+|\(([^\s()<>]+|(\([^\s()<>]+\)))*\))+(?:\(([^\s()<>]+|(\([^\s()<>]+\)))*\)|[^\s\x60!()\[\]{};:'".,<>?«»“”‘’])
Changes:
- Supports Go (changed backtick to \x60)
- Non-URLs like bit.com/test aren't recognized
- Protocol section is required
- Applied change mentioned above
Putting wide characters (Unicode characters encoded in more than 1 byte) into a bracket expression ([ ]) is incorrect:
[^\s`!()\[\]{};:'".,<>?«»“”‘’] # not a space or one of these punct char
« is two bytes: "\xc2\xab", which means the bracket expression treats \xc2 and \xab as two independent bytes, so they match anywhere in the sequence, in any order and not even next to each other!
php -r '$s="\xab \xc2 \xc2 \xab"; $v=preg_match_all("/[«]/", $s, $m); var_dump([$v, $m, $s]);' > foo.txt
You need to open foo.txt with a program that can display raw bytes.
putting wide characters (Unicode of more than 1 byte) is incorrect into a bracket expression ([ ])
It depends on the language/library. Works fine in Python and Node.js.
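A short Python sketch of the difference being discussed: a `str` pattern works on code points, while a `bytes` pattern works on raw bytes and reproduces the behaviour from the PHP test. The variable names and sample inputs are made up for the example. In PHP, the usual remedy is the `u` pattern modifier, which makes PCRE treat pattern and subject as UTF-8.

```python
import re

# str pattern: each quote character is a single member of the set.
str_hits = re.findall("[«»]", "«quoted»")
print(str_hits)

# bytes pattern: the UTF-8 encoding of « contributes two separate members.
laquo = "«".encode("utf-8")  # the two bytes 0xC2 0xAB
byte_class = re.compile(b"[" + laquo + b"]")

# Same input as the PHP test above: the bytes appear out of order and apart.
byte_hits = byte_class.findall(b"\xab \xc2 \xc2 \xab")
print(byte_hits)
```

The first call finds the two quote characters, while the byte-level class matches all four stray bytes individually.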
To add support for URIs with schemes like stratum+tcp and xmlrpc.beep, or paths starting with + or ? (e.g. sms:, magnet:), I'm using a version with [a-z][\w.+-]+:(?:/{1,3}|[?+]?[a-z0-9%]) as the first section.
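To make the effect of that change concrete, this Python sketch compares the original first section with the extended one on a few sample URIs. The sample values (host name, phone number, hash) are placeholders invented for the example, as are the names `ORIGINAL_HEAD` and `EXTENDED_HEAD`.

```python
import re

# First section as written in the gist.
ORIGINAL_HEAD = re.compile(r"[a-z][\w-]+:(?:/{1,3}|[a-z0-9%])", re.IGNORECASE)

# Extended first section from the comment above.
EXTENDED_HEAD = re.compile(r"[a-z][\w.+-]+:(?:/{1,3}|[?+]?[a-z0-9%])", re.IGNORECASE)

# Hypothetical sample URIs; all values are placeholders.
samples = [
    "stratum+tcp://pool.example.com:3333",
    "xmlrpc.beep://example.com/",
    "sms:+15105550101",
    "magnet:?xt=urn:btih:abc123",
]

for uri in samples:
    original_ok = ORIGINAL_HEAD.match(uri) is not None
    extended_ok = EXTENDED_HEAD.match(uri) is not None
    print(f"{uri}: original={original_ok} extended={extended_ok}")
```

None of the four samples gets past the original first section when matched from the start of the string, while the extended version accepts all of them.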