@gruber
Last active December 9, 2024 14:41
Liberal, Accurate Regex Pattern for Matching All URLs
The regex patterns in this gist are intended to match any URL,
including "mailto:foo@example.com", "x-whatever://foo", etc. For a
pattern that attempts only to match web URLs (http, https), see:
https://gist.github.com/gruber/8891611
# Single-line version of pattern:
(?i)\b((?:[a-z][\w-]+:(?:/{1,3}|[a-z0-9%])|www\d{0,3}[.]|[a-z0-9.\-]+[.][a-z]{2,4}/)(?:[^\s()<>]+|\(([^\s()<>]+|(\([^\s()<>]+\)))*\))+(?:\(([^\s()<>]+|(\([^\s()<>]+\)))*\)|[^\s`!()\[\]{};:'".,<>?«»“”‘’]))
# Multi-line commented version of same pattern:
(?xi)
\b
(                                       # Capture 1: entire matched URL
  (?:
    [a-z][\w-]+:                        # URL protocol and colon
    (?:
      /{1,3}                            # 1-3 slashes
      |                                 #   or
      [a-z0-9%]                         # Single letter or digit or '%'
                                        # (Trying not to match e.g. "URI::Escape")
    )
    |                                   #   or
    www\d{0,3}[.]                       # "www.", "www1.", "www2." … "www999."
    |                                   #   or
    [a-z0-9.\-]+[.][a-z]{2,4}/          # looks like domain name followed by a slash
  )
  (?:                                   # One or more:
    [^\s()<>]+                          # Run of non-space, non-()<>
    |                                   #   or
    \(([^\s()<>]+|(\([^\s()<>]+\)))*\)  # balanced parens, up to 2 levels
  )+
  (?:                                   # End with:
    \(([^\s()<>]+|(\([^\s()<>]+\)))*\)  # balanced parens, up to 2 levels
    |                                   #   or
    [^\s`!()\[\]{};:'".,<>?«»“”‘’]      # not a space or one of these punct char
  )
)
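
As a rough usage sketch, the single-line pattern can be dropped into Python's re module unchanged. The helper name extract_urls and the sample text are made up for illustration. Because the pattern contains several capture groups, the sketch reads group 1 (the whole URL) from finditer rather than using findall, which would return tuples.

```python
import re

# The single-line pattern from this gist, copied verbatim into a raw string.
URL_PATTERN = re.compile(
    r"""(?i)\b((?:[a-z][\w-]+:(?:/{1,3}|[a-z0-9%])|www\d{0,3}[.]|[a-z0-9.\-]+[.][a-z]{2,4}/)(?:[^\s()<>]+|\(([^\s()<>]+|(\([^\s()<>]+\)))*\))+(?:\(([^\s()<>]+|(\([^\s()<>]+\)))*\)|[^\s`!()\[\]{};:'".,<>?«»“”‘’]))"""
)


def extract_urls(text):
    """Return capture group 1 (the entire URL) for every match in text."""
    return [m.group(1) for m in URL_PATTERN.finditer(text)]


# Hypothetical sample text: trailing "," and "." should be left out of the matches.
sample = "See http://example.com/foo_(bar), or write to mailto:foo@example.com."
urls = extract_urls(sample)
print(urls)
```

The balanced parentheses stay inside the first URL, while the comma and the final period are excluded by the "End with" clause. Keep parenthesized parts short when trying this, since long ones can trigger the backtracking problem reported in the comments.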
@DanieleQ97

Hi

Sorry for asking, but regexes like this are a bit over my head :-)

I was trying to parse some WSDL files (basically XML) and was wondering: is there any way to avoid matching things like ab:1234, xs:complexType or this:isnotanurl?
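
One possible approach, sketched here as an assumption rather than anything the gist itself provides, is to leave the pattern alone and filter its matches afterwards: keep a match only if the scheme is followed by "//" or appears in an allowlist. The names ALLOWED_SCHEMES and looks_like_url are hypothetical, and the allowlist contents are only an example.

```python
# Hypothetical allowlist; extend it with the schemes your data really contains.
ALLOWED_SCHEMES = {"http", "https", "ftp", "mailto"}


def looks_like_url(candidate):
    """Keep a regex match only if it has '//' after the scheme
    or the scheme is in the allowlist."""
    scheme, sep, rest = candidate.partition(":")
    if not sep:
        # No colon at all, e.g. "www.example.com/page": keep it.
        return True
    return rest.startswith("//") or scheme.lower() in ALLOWED_SCHEMES


candidates = [
    "ab:1234",
    "xs:complexType",
    "this:isnotanurl",
    "http://example.com",
    "mailto:foo@example.com",
]
kept = [c for c in candidates if looks_like_url(c)]
print(kept)
```

This drops XML-style prefixed names while keeping schemes such as "x-whatever://foo". A host with a port and no scheme (for example www.example.com:8080/) would also be rejected by this sketch, so it would need an extra rule if that case matters.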

@jonpincus

Using Node 14.2, it hangs when I try to match the string

https://en.wikipedia.org/wiki/Learning_to_Fly_(Tom_Petty_and_the_Heartbreakers)

Looks like some kind of catastrophic backtracking in the balanced-parens clauses, but I'm not sure how to fix it.
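
A likely cause is the nested quantifier: inside the parentheses, [^\s()<>]+ sits within a starred group, so a long run such as Tom_Petty_and_the_Heartbreakers can be split in exponentially many ways before the engine gives up on that branch. One commonly used mitigation, shown below as an untested-in-Node sketch and not as the author's fix, is to drop the inner + so each repetition consumes a single character. The name LINEAR_PATTERN is made up, and the inner groups are made non-capturing, so only group 1 (the whole URL) is available.

```python
import re

# Variant of the gist pattern with the "+" removed inside the repeated groups.
# Assumption: this matches the same URLs as the original for ordinary input;
# it has only been reasoned through for the example below.
LINEAR_PATTERN = re.compile(
    r"""(?i)\b((?:[a-z][\w-]+:(?:/{1,3}|[a-z0-9%])|www\d{0,3}[.]|[a-z0-9.\-]+[.][a-z]{2,4}/)(?:[^\s()<>]|\((?:[^\s()<>]|\([^\s()<>]+\))*\))+(?:\((?:[^\s()<>]|\([^\s()<>]+\))*\)|[^\s`!()\[\]{};:'".,<>?«»“”‘’]))"""
)

url = "https://en.wikipedia.org/wiki/Learning_to_Fly_(Tom_Petty_and_the_Heartbreakers)"
match = LINEAR_PATTERN.search(url)
print(match.group(1))
```

With one character per repetition there is only one way to consume a given run, so backing out of the parenthesized part takes a number of steps proportional to its length instead of doubling with every extra character.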

@makew0rld

My version of this:

(?i)\b(?:[a-z][\w.+-]+:(?:/{1,3}|[?+]?[a-z0-9%]))(?:[^\s()<>]+|\(([^\s()<>]+|(\([^\s()<>]+\)))*\))+(?:\(([^\s()<>]+|(\([^\s()<>]+\)))*\)|[^\s\x60!()\[\]{};:'".,<>?«»“”‘’])

Changes:

  • Supports Go (Changed backtick to \x60)
  • Non-URLs like bit.com/test aren't recognized
  • Protocol section is required
  • Applied change mentioned above
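
A small Python check of the behaviour listed above, using the pattern exactly as posted. The name SCHEME_ONLY and the sample sentences are hypothetical. Since this version has no outer capture group, the sketch reads the whole match with group(0).

```python
import re

# The scheme-required variant from this comment, copied verbatim.
SCHEME_ONLY = re.compile(
    r"""(?i)\b(?:[a-z][\w.+-]+:(?:/{1,3}|[?+]?[a-z0-9%]))(?:[^\s()<>]+|\(([^\s()<>]+|(\([^\s()<>]+\)))*\))+(?:\(([^\s()<>]+|(\([^\s()<>]+\)))*\)|[^\s\x60!()\[\]{};:'".,<>?«»“”‘’])"""
)

# A URL with a scheme is found; a bare "domain/path" string is not.
with_scheme = SCHEME_ONLY.search("Visit https://example.com/a today")
without_scheme = SCHEME_ONLY.search("Visit bit.com/test today")

print(with_scheme.group(0))
print(without_scheme)
```

The \x60 escape is understood by Python's re module as a backtick, so the change made for Go does not alter what the class excludes here.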

@glensc

glensc commented Dec 27, 2021

Putting wide characters (Unicode characters that encode to more than one byte) into a bracket expression ([ ]) is incorrect:

    [^\s`!()\[\]{};:'".,<>?«»“”‘’]		# not a space or one of these punct char

« is two bytes in UTF-8: "\xc2\xab", which means the pattern will accept \xc2 and \xab anywhere in the sequence, in any order and not even adjacent to each other!

php -r '$s="\xab \xc2 \xc2 \xab"; $v=preg_match_all("/[«]/", $s, $m); var_dump([$v, $m, $s]);' > foo.txt

You need to open foo.txt with a program that can display raw bytes.

@solaluset

Putting wide characters (Unicode characters that encode to more than one byte) into a bracket expression ([ ]) is incorrect

It depends on the language/library. It works fine in Python and Node.js.
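
Both observations can be reproduced in Python, which makes the difference easy to see: a str pattern works on code points, while a bytes pattern works on individual bytes. The sketch below is only an illustration, and the sample strings are made up.

```python
import re

# str pattern: the class contains a single code point, U+00AB («).
# "Â" (U+00C2) is not matched, even though 0xC2 is the first UTF-8 byte of «.
str_hits = re.findall("[«]", "Â « ")

# bytes pattern: the same class becomes the two separate bytes 0xC2 and 0xAB,
# so either byte matches on its own, in any order.
byte_pattern = re.compile("[«]".encode("utf-8"))
byte_hits = byte_pattern.findall(b"\xab \xc2 \xc2 \xab")

print(str_hits)
print(byte_hits)
```

So the character class is safe when the engine is matching decoded text, and it degrades into a class of loose bytes when the engine is matching raw UTF-8 bytes.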
