Liberal, Accurate Regex Pattern for Matching All URLs

The regex patterns in this gist are intended to match any URLs,
including "", "x-whatever://foo", etc. For a
pattern that attempts only to match web URLs (http, https), see:
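# Single-line version of pattern:
(?i)\b((?:[a-z][\w-]+:(?:/{1,3}|[a-z0-9%])|www\d{0,3}[.]|[a-z0-9.\-]+[.][a-z]{2,4}/)(?:[^\s()<>]+|\(([^\s()<>]+|(\([^\s()<>]+\)))*\))+(?:\(([^\s()<>]+|(\([^\s()<>]+\)))*\)|[^\s`!()\[\]{};:'".,<>?«»“”‘’]))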
# Multi-line commented version of same pattern:
(?xi)
\b
(                                       # Capture 1: entire matched URL
  (?:
    [a-z][\w-]+:                        # URL protocol and colon
    (?:
      /{1,3}                            # 1-3 slashes
      |                                 #   or
      [a-z0-9%]                         # Single letter or digit or '%'
                                        # (Trying not to match e.g. "URI::Escape")
    )
    |                                   #   or
    www\d{0,3}[.]                       # "www.", "www1.", "www2." … "www999."
    |                                   #   or
    [a-z0-9.\-]+[.][a-z]{2,4}/          # looks like domain name followed by a slash
  )
  (?:                                   # One or more:
    [^\s()<>]+                          # Run of non-space, non-()<>
    |                                   #   or
    \(([^\s()<>]+|(\([^\s()<>]+\)))*\)  # balanced parens, up to 2 levels
  )+
  (?:                                   # End with:
    \(([^\s()<>]+|(\([^\s()<>]+\)))*\)  # balanced parens, up to 2 levels
    |                                   #   or
    [^\s`!()\[\]{};:'".,<>?«»“”‘’]      # not a space or one of these punct char
  )
)
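
As a rough illustration of how the pattern might be used, here is a minimal Python sketch that compiles the single-line form (as quoted in a comment further down) and collects capture group 1 from some text. The `find_urls` helper and the sample sentence are made up for this example and are not part of the gist.

```python
import re

# Single-line form of the pattern, split across lines only for readability.
# The inline (?i) flag is replaced by re.IGNORECASE.
URL_RE = re.compile(
    r'\b((?:[a-z][\w-]+:(?:/{1,3}|[a-z0-9%])|www\d{0,3}[.]|[a-z0-9.\-]+[.][a-z]{2,4}/)'
    r'(?:[^\s()<>]+|\(([^\s()<>]+|(\([^\s()<>]+\)))*\))+'
    r'(?:\(([^\s()<>]+|(\([^\s()<>]+\)))*\)|[^\s`!()\[\]{};:\'".,<>?«»“”‘’]))',
    re.IGNORECASE,
)


def find_urls(text):
    """Return capture group 1 (the entire matched URL) for each match in text."""
    return [m.group(1) for m in URL_RE.finditer(text)]


# Hypothetical sample text: trailing punctuation should be left out of the matches.
sample = 'See http://example.com/wiki/Foo_(bar), or www.example.org/path. Also x-whatever://foo!'
print(find_urls(sample))
```

Using `finditer` with group 1 (rather than `findall`) avoids getting tuples back, since the pattern contains several inner capture groups for the balanced parentheses.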
kjd commented

It seems the expression doesn't deal with a number of real-life updates to URI patterns in recent years, like internationalised domains, new top-level domains that are not between 2 and 4 characters long, IRIs, etc. Some references: and A domain like http://موقع.وزارة-الاتصالات.مصر/ is legal and functional today.

gruber commented

KJD: Have you actually tried it? The pattern matches "http://موقع.وزارة-الاتصالات.مصر/" in both PCRE and Perl. What makes you think it doesn't work?

kjd commented

You're of course correct; I jumped the gun in scanning through the expression. I guess the [a-z]{2,4} pattern only fails in the following case, which is not a legal URL, so you could definitely argue it is less important to catch:

#!/usr/bin/env python
# -*- coding: utf-8 -*-

import re

def matches(s):
    # re.match only tries the pattern at the start of s.
    if re.match(r'(?i)\b((?:[a-z][\w-]+:(?:/{1,3}|[a-z0-9%])|www\d{0,3}[.]|[a-z0-9.\-]+[.][a-z]{2,4}/)(?:[^\s()<>]+|\(([^\s()<>]+|(\([^\s()<>]+\)))*\))+(?:\(([^\s()<>]+|(\([^\s()<>]+\)))*\)|[^\s`!()\[\]{};:\'".,<>?«»“”‘’]))', s):
        print("%s matches" % (s))
    else:
        print("%s doesn't match" % (s))
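
To make the `[a-z]{2,4}` limit concrete, here is a small sketch using inputs of my own choosing (they are not the strings from the original test). It assumes Python 3 and checks only whether the pattern matches at the start of each string; the `matches_at_start` name is hypothetical.

```python
import re

# Same pattern as in the snippet above, with (?i) expressed as re.IGNORECASE.
URL_RE = re.compile(
    r'\b((?:[a-z][\w-]+:(?:/{1,3}|[a-z0-9%])|www\d{0,3}[.]|[a-z0-9.\-]+[.][a-z]{2,4}/)'
    r'(?:[^\s()<>]+|\(([^\s()<>]+|(\([^\s()<>]+\)))*\))+'
    r'(?:\(([^\s()<>]+|(\([^\s()<>]+\)))*\)|[^\s`!()\[\]{};:\'".,<>?«»“”‘’]))',
    re.IGNORECASE,
)


def matches_at_start(s):
    """True if the pattern matches beginning at the first character of s."""
    return URL_RE.match(s) is not None


# Scheme-less input relies on the "looks like a domain name" branch,
# which only accepts a 2-4 letter ASCII extension before the slash.
print(matches_at_start('example.com/path'))            # "com" fits {2,4}
print(matches_at_start('example.museum/path'))         # "museum" is 6 letters
# With an explicit scheme, the protocol branch is used and the extension is not checked.
print(matches_at_start('http://example.museum/path'))
```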


John, since the last revision, your regex will also match stuff like http://#, http://## and http://## /. The previous version didn’t have that problem. I made a quick test case here:
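
Independently of the linked test case, the behaviour can be checked with a short Python sketch. It uses the pattern as quoted earlier in this thread, and the loop and variable names are only for illustration.

```python
import re

# Pattern as quoted in this thread, with (?i) expressed as re.IGNORECASE.
URL_RE = re.compile(
    r'\b((?:[a-z][\w-]+:(?:/{1,3}|[a-z0-9%])|www\d{0,3}[.]|[a-z0-9.\-]+[.][a-z]{2,4}/)'
    r'(?:[^\s()<>]+|\(([^\s()<>]+|(\([^\s()<>]+\)))*\))+'
    r'(?:\(([^\s()<>]+|(\([^\s()<>]+\)))*\)|[^\s`!()\[\]{};:\'".,<>?«»“”‘’]))',
    re.IGNORECASE,
)

# The three degenerate inputs mentioned in the comment above.
for s in ('http://#', 'http://##', 'http://## /'):
    m = URL_RE.search(s)
    print(repr(s), '->', repr(m.group(1)) if m else None)
```

The match succeeds because `/{1,3}` can give back one slash, which the "one or more" group then consumes, leaving `#` as an acceptable final character.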

Please check your regex with the link below. [test.......................................]

This link will burn my server :)

The script below hangs in Chrome and Node.js at 100% CPU usage.

"".replace(/\b((?:[a-z][\w-]+:(?:\/{1,3}|[a-z0-9%])|www\d{0,3}[.]|[a-z0-9.\-]+[.][a-z]{2,4}\/)(?:[^\s()<>]+|\(([^\s()<>]+|(\([^\s()<>]+\)))*\))+(?:\(([^\s()<>]+|(\([^\s()<>]+\)))*\)|[^\s`!()\[\]{};:'".,<>?«»“”‘’]))/gi, function(url){
  // this will never be executed on Chrome/Node
});

Yeah, I'm seeing a hang with the input:


as well. The balanced-paren rules seem to be blowing up.

The following version doesn't have the performance problem:


Note that I've just removed three unnecessary/redundant + operators, so that the regexp ends with:

  (?:                                   # One or more:
    [^\s()<>]                           # Non-space, non-()<>  (removed a + here)
    |                                   #   or
    \(([^\s()<>]|(\([^\s()<>]+\)))*\)   # balanced parens, up to 2 levels (removed a + here)
  )+
  (?:                                   # End with:
    \(([^\s()<>]|(\([^\s()<>]+\)))*\)   # balanced parens, up to 2 levels (removed a + here)
    |                                   #   or
    [^\s`!()\[\]{};:'".,<>?«»“”‘’]      # not a space or one of these punct char
  )
)

See for a colorized diff.

I've tested that this change fixes the problems noted by @FGRibreau and @chemist777 above.
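
For readers who want to try the change, here is a sketch in Python that assembles both variants from the fragments shown in this thread. The names `HEAD`, `LAST`, `ORIGINAL`, `REVISED` and the `stress` input are my own; the stress string is a guess at the kind of input that triggers heavy backtracking (a long run inside parentheses), not the exact input reported above.

```python
import re

# Shared opening section and final punctuation class, copied from the quoted pattern.
HEAD = r'\b((?:[a-z][\w-]+:(?:/{1,3}|[a-z0-9%])|www\d{0,3}[.]|[a-z0-9.\-]+[.][a-z]{2,4}/)'
LAST = r'[^\s`!()\[\]{};:\'".,<>?«»“”‘’]'

# Original body: a "+" inside a repeated group, e.g. ([^\s()<>]+|...)*
ORIGINAL = re.compile(
    HEAD
    + r'(?:[^\s()<>]+|\(([^\s()<>]+|(\([^\s()<>]+\)))*\))+'
    + r'(?:\(([^\s()<>]+|(\([^\s()<>]+\)))*\)|' + LAST + r'))',
    re.IGNORECASE,
)

# Revised body: the three "+" operators described above are removed.
REVISED = re.compile(
    HEAD
    + r'(?:[^\s()<>]|\(([^\s()<>]|(\([^\s()<>]+\)))*\))+'
    + r'(?:\(([^\s()<>]|(\([^\s()<>]+\)))*\)|' + LAST + r'))',
    re.IGNORECASE,
)

# On a short input, both variants return the same match.
short = 'http://example.com/wiki/Foo_(bar)'
print(ORIGINAL.search(short).group(1) == REVISED.search(short).group(1))

# Assumed stress input: a long parenthesised run at the end of the URL.
# ORIGINAL is deliberately not run on it, since it may backtrack for a very long time.
stress = 'http://example.com/page_(' + 'a' * 64 + ')'
print(REVISED.search(stress).group(1) == stress)
```

The intuition is that `([^\s()<>]+|...)*` can split a run of n characters in exponentially many ways before giving up, whereas `([^\s()<>]|...)*` has only one way to consume each character.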

After a little testing of both the original @gruber version and the @cscott version, I've found that you can omit the domain extension and the input is still considered a valid URL. Surely that's a bit of a hole in the logic, or is there a reason for that?

I've come up with the following version. This version requires that you begin with a protocol like http://, https://, and even mailto:

No, I'm not a regex genius, but I've been plugging away at testing this variation, and it seems to work so far.


Those experiencing hanging problems should try the pattern in a real regular expression engine, that is, one which does not backtrack.

Could anybody help me out with a version of this that also optionally allows the URL to be enclosed like so: <URL:thefullurl>? This is a format I still come across rather often in an old forum.

To add support for URIs with schemes like stratum+tcp and xmlrpc.beep or paths starting with + or ? (e.g. sms:, magnet:), I'm using a version with [a-z][\w.+-]+:(?:/{1,3}|[?+]?[a-z0-9%]) as the first section.
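
A sketch of how that substitution might look in Python, comparing the original first section with the extended one. The `build` and `whole` helpers and the example URIs (pool host, magnet hash, phone number) are invented for illustration; the rest of the pattern is taken from the version quoted earlier in the thread.

```python
import re

ORIGINAL_SCHEME = r'[a-z][\w-]+:(?:/{1,3}|[a-z0-9%])'
EXTENDED_SCHEME = r'[a-z][\w.+-]+:(?:/{1,3}|[?+]?[a-z0-9%])'  # from the comment above

# Remaining alternatives and the tail of the pattern, unchanged.
REST = r'|www\d{0,3}[.]|[a-z0-9.\-]+[.][a-z]{2,4}/'
TAIL = (r'(?:[^\s()<>]+|\(([^\s()<>]+|(\([^\s()<>]+\)))*\))+'
        r'(?:\(([^\s()<>]+|(\([^\s()<>]+\)))*\)|[^\s`!()\[\]{};:\'".,<>?«»“”‘’])')


def build(scheme):
    """Assemble the full pattern around the given first section."""
    return re.compile(r'\b((?:' + scheme + REST + r')' + TAIL + r')', re.IGNORECASE)


def whole(pattern, s):
    """True if the pattern matches at the start of s and capture 1 is all of s."""
    m = pattern.match(s)
    return m is not None and m.group(1) == s


ORIGINAL = build(ORIGINAL_SCHEME)
EXTENDED = build(EXTENDED_SCHEME)

for uri in ('stratum+tcp://pool.example.com:3333',
            'xmlrpc.beep://example.com/x',
            'magnet:?xt=urn:btih:abc123',
            'sms:+15555550100'):
    print(uri, whole(ORIGINAL, uri), whole(EXTENDED, uri))
```

The extra `.` and `+` in the scheme class cover names like `stratum+tcp` and `xmlrpc.beep`, while the optional `[?+]` lets the character right after the colon be `?` or `+`.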
