Liberal, Accurate Regex Pattern for Matching All URLs
The regex patterns in this gist are intended to match any URLs,
including "mailto:foo@example.com", "x-whatever://foo", etc. For a
pattern that attempts only to match web URLs (http, https), see:
https://gist.github.com/gruber/8891611
# Single-line version of pattern:
(?i)\b((?:[a-z][\w-]+:(?:/{1,3}|[a-z0-9%])|www\d{0,3}[.]|[a-z0-9.\-]+[.][a-z]{2,4}/)(?:[^\s()<>]+|\(([^\s()<>]+|(\([^\s()<>]+\)))*\))+(?:\(([^\s()<>]+|(\([^\s()<>]+\)))*\)|[^\s`!()\[\]{};:'".,<>?«»“”‘’]))
# Multi-line commented version of same pattern:
(?xi)
\b
( # Capture 1: entire matched URL
(?:
[a-z][\w-]+: # URL protocol and colon
(?:
/{1,3} # 1-3 slashes
| # or
[a-z0-9%] # Single letter or digit or '%'
# (Trying not to match e.g. "URI::Escape")
)
| # or
www\d{0,3}[.] # "www.", "www1.", "www2." … "www999."
| # or
[a-z0-9.\-]+[.][a-z]{2,4}/ # looks like domain name followed by a slash
)
(?: # One or more:
[^\s()<>]+ # Run of non-space, non-()<>
| # or
\(([^\s()<>]+|(\([^\s()<>]+\)))*\) # balanced parens, up to 2 levels
)+
(?: # End with:
\(([^\s()<>]+|(\([^\s()<>]+\)))*\) # balanced parens, up to 2 levels
| # or
[^\s`!()\[\]{};:'".,<>?«»“”‘’] # not a space or one of these punct char
)
)
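
As a rough illustration of how the single-line pattern might be applied, here is a minimal Python sketch. The names GRUBER_URL_RE and find_urls and the sample sentence are assumptions made for this example and are not part of the gist; the gist itself does not say which engine it targets, and Python's re module is just one possible host.

```python
import re

# Single-line pattern copied verbatim from the gist above.
GRUBER_URL_RE = re.compile(
    r"""(?i)\b((?:[a-z][\w-]+:(?:/{1,3}|[a-z0-9%])|www\d{0,3}[.]|[a-z0-9.\-]+[.][a-z]{2,4}/)(?:[^\s()<>]+|\(([^\s()<>]+|(\([^\s()<>]+\)))*\))+(?:\(([^\s()<>]+|(\([^\s()<>]+\)))*\)|[^\s`!()\[\]{};:'".,<>?«»“”‘’]))"""
)


def find_urls(text):
    """Return every substring matched by capture group 1 (the whole URL)."""
    return [m.group(1) for m in GRUBER_URL_RE.finditer(text)]


# Hypothetical sample text: a URL with balanced parens followed by a comma,
# and a mailto: URL followed by a full stop.
sample = "See http://example.com/foo_(bar), or write to mailto:foo@example.com."
print(find_urls(sample))
```

The trailing comma and full stop are left out of the matches because the final character class rejects them, while the balanced "(bar)" is kept.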

kjd Jul 27, 2010

It seems the expression doesn't handle a number of real-world changes to URI patterns in recent years, such as internationalised domain names, new top-level domains that are not between 2 and 4 characters long, IRIs, etc. Some references: http://www.icann.org/en/topics/TLD-acceptance/ and http://www.ietf.org/rfc/rfc3987.txt. A domain like http://موقع.وزارة-الاتصالات.مصر/ is legal and functional today.

gruber Jul 27, 2010

KJD: Have you actually tried it? The pattern matches "http://موقع.وزارة-الاتصالات.مصر/" in both PCRE and Perl. What makes you think it doesn't work?

kjd Jul 27, 2010

You're of course correct; I jumped the gun in scanning through the expression. I guess the case where the [a-z]{2,4} pattern fails is the following one, which is not a legal URL, so you could definitely argue it is less important to catch:

#!/usr/bin/env python3
# -*- coding: utf-8 -*-

import re

def matches(s):
    # re.match only tries the pattern at the start of the string
    if re.match(r'(?i)\b((?:[a-z][\w-]+:(?:/{1,3}|[a-z0-9%])|www\d{0,3}[.]|[a-z0-9.\-]+[.][a-z]{2,4}/)(?:[^\s()<>]+|\(([^\s()<>]+|(\([^\s()<>]+\)))*\))+(?:\(([^\s()<>]+|(\([^\s()<>]+\)))*\)|[^\s`!()\[\]{};:\'".,<>?«»“”‘’]))', s):
        print("%s matches" % s)
    else:
        print("%s doesn't match" % s)

# No scheme and non-ASCII labels: the [a-z0-9.\-]+[.][a-z]{2,4}/ branch cannot match
matches('موقع.وزارة-الاتصالات.مصر/ar/default.aspx')
matches('example.com/index.html')
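
To see where the [a-z]{2,4} limit actually bites, the following Python 3 sketch runs the gist's single-line pattern over a few strings. The helper name starts_with_url and the two example.museum strings are assumptions added for illustration; the other strings come from this thread.

```python
import re

# Single-line pattern copied verbatim from the gist.
GRUBER_URL_RE = re.compile(
    r"""(?i)\b((?:[a-z][\w-]+:(?:/{1,3}|[a-z0-9%])|www\d{0,3}[.]|[a-z0-9.\-]+[.][a-z]{2,4}/)(?:[^\s()<>]+|\(([^\s()<>]+|(\([^\s()<>]+\)))*\))+(?:\(([^\s()<>]+|(\([^\s()<>]+\)))*\)|[^\s`!()\[\]{};:'".,<>?«»“”‘’]))"""
)


def starts_with_url(s):
    """True if the pattern matches at the very start of s."""
    return GRUBER_URL_RE.match(s) is not None


samples = [
    "http://موقع.وزارة-الاتصالات.مصر/",          # scheme present: matched via the scheme branch
    "موقع.وزارة-الاتصالات.مصر/ar/default.aspx",  # no scheme, non-ASCII labels
    "example.com/index.html",                    # no scheme, 3-letter TLD
    "example.museum/index.html",                 # hypothetical: no scheme, 6-letter TLD
    "http://example.museum/index.html",          # hypothetical: same host with a scheme
]
for s in samples:
    print(starts_with_url(s), s)
```

In this sketch the TLD length only matters for scheme-less input; once a scheme such as http:// is present, the domain-name branch is not needed for the match.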

mathiasbynens Dec 3, 2010

John, since the last revision, your regex will also match stuff like http://#, http://## and http://## /. The previous version didn’t have that problem. I made a quick test case here: http://mathiasbynens.be/demo/url-regex
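
The behaviour described in this comment can be reproduced without the linked test page. The sketch below uses Python and assumes the single-line pattern as it currently stands in the gist. first_match and has_host are hypothetical helper names; has_host is a post-filter built on the standard library's urllib.parse.urlsplit, offered as one possible workaround rather than as a fix to the pattern itself.

```python
import re
from urllib.parse import urlsplit

# Single-line pattern copied verbatim from the gist.
GRUBER_URL_RE = re.compile(
    r"""(?i)\b((?:[a-z][\w-]+:(?:/{1,3}|[a-z0-9%])|www\d{0,3}[.]|[a-z0-9.\-]+[.][a-z]{2,4}/)(?:[^\s()<>]+|\(([^\s()<>]+|(\([^\s()<>]+\)))*\))+(?:\(([^\s()<>]+|(\([^\s()<>]+\)))*\)|[^\s`!()\[\]{};:'".,<>?«»“”‘’]))"""
)


def first_match(text):
    """Return the first URL-like substring in text, or None."""
    m = GRUBER_URL_RE.search(text)
    return m.group(1) if m else None


def has_host(url):
    """Hypothetical post-filter: True only if the URL parses with a non-empty host."""
    return bool(urlsplit(url).netloc)


for s in ["http://#", "http://##", "http://## /", "http://example.com/"]:
    url = first_match(s)
    print(repr(s), "->", repr(url), "| host:", url is not None and has_host(url))
```

A filter like this only helps for schemes that carry a host; mailto: and similar URLs would need a different check.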

chemist777 Aug 25, 2013

Hi.
Please check your regex against the link below.

http://ddos-link.com/[test.......................................]

This link will burn my server :)

FGRibreau Sep 17, 2014

The script below hangs in Chrome and Node.js at 100% CPU usage.

"http://www.ghislainproulx.net/Blog/2014/09/contributing-to-a-github-open-source-project-(from-a-visual-studio-developer-perspective)".replace(/\b((?:[a-z][\w-]+:(?:\/{1,3}|[a-z0-9%])|www\d{0,3}[.]|[a-z0-9.\-]+[.][a-z]{2,4}\/)(?:[^\s()<>]+|\(([^\s()<>]+|(\([^\s()<>]+\)))*\))+(?:\(([^\s()<>]+|(\([^\s()<>]+\)))*\)|[^\s`!()\[\]{};:'".,<>?«»“”‘’]))/gi, function(url){
  // this will never be executed on Chrome/Node
  console.log(url);
});

cscott Oct 31, 2014

Yeah, I'm seeing a hang with the input:

"Ficheiro:Joseph_Ducreux_(French_-_Self-Portrait,_Yawning_-_Google_Art_Project.jpg"

as well. The balanced-paren rules seem to be blowing up.

The following version doesn't have the performance problem:

/\b((?:[a-z][\w\-]+:(?:\/{1,3}|[a-z0-9%])|www\d{0,3}[.]|[a-z0-9.\-]+[.][a-z]{2,4}\/)(?:[^\s()<>]|\((?:[^\s()<>]|(?:\([^\s()<>]+\)))*\))+(?:\((?:[^\s()<>]|(?:\([^\s()<>]+\)))*\)|[^\s`!()\[\]{};:'".,<>?«»“”‘’]))/i

Note that I've just removed three unnecessary/redundant + operators, so that the regexp ends with:

  (?:                           # One or more:
    [^\s()<>]                       # Non-space, non-()<>  (removed a + here)
    |                               #   or
    \(([^\s()<>]|(\([^\s()<>]+\)))*\)   # balanced parens, up to 2 levels (removed a + here)
  )+
  (?:                           # End with:
    \(([^\s()<>]|(\([^\s()<>]+\)))*\)   # balanced parens, up to 2 levels (removed a + here)
    |                                   #   or
    [^\s`!()\[\]{};:'".,<>?«»“”‘’]        # not a space or one of these punct char
  )
)

See https://gerrit.wikimedia.org/r/#/c/170329/1/lib/index.js for a colorized diff.

I've tested that this change fixes the problems noted by @FGRibreau and @chemist777 above.
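
For anyone who wants to repeat that check outside JavaScript, here is a small Python sketch that runs the revised pattern over the three inputs reported in this thread. The name CSCOTT_URL_RE is an assumption, and the JavaScript /.../i delimiters are replaced by re.IGNORECASE. Because each alternative in the repeated groups now consumes a single character or a whole parenthesised run, there is only one way to split the input, which is why the backtracking stays cheap here.

```python
import re

# Revised pattern from the comment above, without the JavaScript delimiters.
CSCOTT_URL_RE = re.compile(
    r"""\b((?:[a-z][\w\-]+:(?:\/{1,3}|[a-z0-9%])|www\d{0,3}[.]|[a-z0-9.\-]+[.][a-z]{2,4}\/)(?:[^\s()<>]|\((?:[^\s()<>]|(?:\([^\s()<>]+\)))*\))+(?:\((?:[^\s()<>]|(?:\([^\s()<>]+\)))*\)|[^\s`!()\[\]{};:'".,<>?«»“”‘’]))""",
    re.IGNORECASE,
)

# The three inputs reported as slow or hanging earlier in the thread.
inputs = [
    "http://www.ghislainproulx.net/Blog/2014/09/contributing-to-a-github-open-source-project-(from-a-visual-studio-developer-perspective)",
    "Ficheiro:Joseph_Ducreux_(French_-_Self-Portrait,_Yawning_-_Google_Art_Project.jpg",
    "http://ddos-link.com/[test.......................................]",
]

for s in inputs:
    m = CSCOTT_URL_RE.search(s)
    print(m.group(1) if m else None)
```

Note that the second input has an unbalanced "(", so the match stops just before it.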

mattauckland Nov 3, 2014

After a little testing of both the original @gruber version and the @cscott version, I've found that you can omit the domain extension and the result is still considered a valid URL. Surely that's a bit of a hole in the logic, or is there a reason for that?

mattauckland Nov 3, 2014

I've come up with the following version. This version requires that you begin with a protocol like http:// or https://, or even mailto:

No, I'm not a regex genius, but I've been plugging away at testing this variation, and it seems to work so far.

_(?i)\b((?:(?:https?|ftps?)://|ftp\.|ftps\.|mailto:|www\d{0,3}[.])(?:[^\s()<>]|\((?:[^\s()<>]|(?:\([^\s()<>]+\)))*\))+(?:\((?:[^\s()<>]|(?:\([^\s()<>]+\)))*\)|[^\s`!()\[\]{};:'".,<>?«»“”‘’]))_iuS
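
As a quick way to see which prefixes this variation accepts, here is a Python sketch. It assumes the leading and trailing underscores are pattern delimiters and drops them together with the trailing iuS modifiers, keeping the inline (?i); the name SCHEME_REQUIRED_RE and the sample strings are illustrative only.

```python
import re

# Pattern body from the comment above, without the "_" delimiters and "iuS" modifiers.
SCHEME_REQUIRED_RE = re.compile(
    r"""(?i)\b((?:(?:https?|ftps?)://|ftp\.|ftps\.|mailto:|www\d{0,3}[.])(?:[^\s()<>]|\((?:[^\s()<>]|(?:\([^\s()<>]+\)))*\))+(?:\((?:[^\s()<>]|(?:\([^\s()<>]+\)))*\)|[^\s`!()\[\]{};:'".,<>?«»“”‘’]))"""
)

samples = [
    "http://localhost/",
    "mailto:foo@example.com",
    "www.example.com",
    "example.com/index.html",  # no accepted prefix
]
for s in samples:
    m = SCHEME_REQUIRED_RE.search(s)
    print(s, "->", m.group(1) if m else None)
```

Besides explicit schemes, the pattern also accepts the www. and ftp. host prefixes, so a scheme is not strictly required.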

dpk Dec 3, 2014

Those experiencing hanging problems should try the pattern in a real regular expression engine, that is, one which does not backtrack.

quite Jan 15, 2015

Could anybody help me out with a version of this that also optionally allows the URL to be enclosed like so: URL:thefullurl
This is a format I still come across rather often in an old forum.

EricFromCanada Jan 28, 2015

To add support for URIs with schemes like stratum+tcp and xmlrpc.beep or paths starting with + or ? (e.g. sms:, magnet:), I'm using a version with [a-z][\w.+-]+:(?:/{1,3}|[?+]?[a-z0-9%]) as the first section.
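
The effect of that change can be sketched in Python by swapping only the first section and leaving the rest of the gist's single-line pattern untouched. Combining it with the gist's original remainder is an assumption, since the comment does not say which variant it was applied to; the helper build, the constant names, and the sample URIs are likewise illustrative.

```python
import re

# First section of the gist's pattern, and the replacement from the comment above.
ORIGINAL_SCHEME = r"""[a-z][\w-]+:(?:/{1,3}|[a-z0-9%])"""
EXTENDED_SCHEME = r"""[a-z][\w.+-]+:(?:/{1,3}|[?+]?[a-z0-9%])"""

# Remaining alternatives of the head, and the tail, copied from the gist.
REST_OF_HEAD = r"""|www\d{0,3}[.]|[a-z0-9.\-]+[.][a-z]{2,4}/"""
TAIL = r"""(?:[^\s()<>]+|\(([^\s()<>]+|(\([^\s()<>]+\)))*\))+(?:\(([^\s()<>]+|(\([^\s()<>]+\)))*\)|[^\s`!()\[\]{};:'".,<>?«»“”‘’])"""


def build(scheme_section):
    """Assemble \\b((?:scheme|www|domain)tail) with the given first section."""
    return re.compile(r"\b((?:" + scheme_section + REST_OF_HEAD + ")" + TAIL + ")", re.IGNORECASE)


original = build(ORIGINAL_SCHEME)
extended = build(EXTENDED_SCHEME)

# Hypothetical sample URIs for the schemes named in the comment.
samples = [
    "stratum+tcp://pool.example.com:3333",
    "xmlrpc.beep://example.com/",
    "magnet:?xt=urn:btih:abc",
    "sms:+15551234567",
]
for s in samples:
    old = original.search(s)
    new = extended.search(s)
    print(s, "| original:", old.group(1) if old else None, "| extended:", new.group(1) if new else None)
```

With the original first section, the search either finds nothing or picks up a fragment that begins later in the string; with the extended one, each sample is matched from its first character.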

eliotsykes Jun 18, 2015

"you can omit the domain extension, and it is still considered a valid URL. Surely that's a bit of a hole in the logic, or is there a reason for that?"

A belated reply, @mattauckland: I'm guessing the reason is so that URLs like http://localhost/ are matched.

wbolster Nov 9, 2015

The two balanced-parens parts use capturing groups, while the rest of the regex uses non-capturing groups (except for the outermost match, obviously). May I suggest changing ( into (?: in those four places?
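
The practical difference is easy to see in Python, where the number of capturing groups changes what re.findall returns. The sketch below compares the gist's single-line pattern with a copy in which those four groups are made non-capturing; the names ORIGINAL and NON_CAPTURING and the sample text are assumptions for illustration.

```python
import re

# Gist pattern as published: one outer group plus four inner capturing groups.
ORIGINAL = re.compile(
    r"""(?i)\b((?:[a-z][\w-]+:(?:/{1,3}|[a-z0-9%])|www\d{0,3}[.]|[a-z0-9.\-]+[.][a-z]{2,4}/)(?:[^\s()<>]+|\(([^\s()<>]+|(\([^\s()<>]+\)))*\))+(?:\(([^\s()<>]+|(\([^\s()<>]+\)))*\)|[^\s`!()\[\]{};:'".,<>?«»“”‘’]))"""
)

# Same pattern with "(" changed to "(?:" in the four balanced-parens groups.
NON_CAPTURING = re.compile(
    r"""(?i)\b((?:[a-z][\w-]+:(?:/{1,3}|[a-z0-9%])|www\d{0,3}[.]|[a-z0-9.\-]+[.][a-z]{2,4}/)(?:[^\s()<>]+|\((?:[^\s()<>]+|(?:\([^\s()<>]+\)))*\))+(?:\((?:[^\s()<>]+|(?:\([^\s()<>]+\)))*\)|[^\s`!()\[\]{};:'".,<>?«»“”‘’]))"""
)

text = "see http://example.com/a_(b) now"

# With several groups, findall returns tuples; with one group, plain strings.
print(ORIGINAL.groups, ORIGINAL.findall(text))
print(NON_CAPTURING.groups, NON_CAPTURING.findall(text))
```

Both versions match the same text; only the bookkeeping of captured groups differs.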

vaderdan Dec 13, 2016

Hi

I wanted the regex to also match bare domains such as

example.com
abv.bg
google.com

Unfortunately, it then also matches file names such as

filename.txt

To do this, I replaced the + after the 'One or more' group with {0,}, added {0,} after the 'End with' group, and made the slash after the domain optional.
(The last two groups were eating characters from the matched string when matching against domain.[a-z]{2,4}, so {2,4} became incorrect in that case.)

My final regex is:
\b((?:[a-z][\w-]+:(?:\/{1,3}|[a-z0-9%])|www\d{0,3}[.]|[a-z0-9.\-]+[.][a-z]{2,4}\/?)(?:[^\s()<>]+|\(([^\s()<>]+|(\([^\s()<>]+\)))*\)){0,}(?:\(([^\s()<>]+|(\([^\s()<>]+\)))*\)|[^\s\!()\[\]{};:\'\"\.\,<>?«»“”‘’]){0,})
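
A short Python sketch of what this variant accepts follows. The posted regex carries no case-insensitivity flag, so passing re.IGNORECASE is an assumption, as is the name BARE_DOMAIN_RE.

```python
import re

# Regex copied from the comment above; re.IGNORECASE is added as an assumption.
BARE_DOMAIN_RE = re.compile(
    r"""\b((?:[a-z][\w-]+:(?:\/{1,3}|[a-z0-9%])|www\d{0,3}[.]|[a-z0-9.\-]+[.][a-z]{2,4}\/?)(?:[^\s()<>]+|\(([^\s()<>]+|(\([^\s()<>]+\)))*\)){0,}(?:\(([^\s()<>]+|(\([^\s()<>]+\)))*\)|[^\s\!()\[\]{};:\'\"\.\,<>?«»“”‘’]){0,})""",
    re.IGNORECASE,
)

# The four strings listed in the comment: three bare domains and one file name.
for s in ["example.com", "abv.bg", "google.com", "filename.txt"]:
    m = BARE_DOMAIN_RE.search(s)
    print(s, "->", m.group(1) if m else None)
```

As the comment notes, the pattern cannot tell a bare domain from a file name, because both have the shape name.ext.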

takwaIMR Jul 4, 2017

I have the same problem. I used:
GRUBER_URLINTEXT_PAT = re.compile(ur'(?i)\b((?:https?://|www\d{0,3}[.]|[a-z0-9.-]+[.][a-z]{2,4}/)(?:[^\s()<>]+|(([^\s()<>]+|(([^\s()<>]+)))))+(?:(([^\s()<>]+|(([^\s()<>]+))))|[^\s`!()[]{};:'".,<>?\xab\xbb\u201c\u201d\u2018\u2019]))')

but it returns some URLs like:
https://t.co/h…

I need your help, please!
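
One likely explanation is that the ellipsis character (U+2026) is not among the characters the final class rejects, so it is kept as the last character of the match. The Python 3 sketch below illustrates this under a stated assumption: the backslash escapes in the pattern above appear to have been lost when the comment was posted, so WEB_URL_RE pairs the head from that comment with the tail of the gist's single-line pattern. drop_truncated is a hypothetical post-filter, not something from this thread.

```python
import re

# Assumed reconstruction: head from the comment above, tail from the gist's pattern.
WEB_URL_RE = re.compile(
    r"""(?i)\b((?:https?://|www\d{0,3}[.]|[a-z0-9.\-]+[.][a-z]{2,4}/)(?:[^\s()<>]+|\(([^\s()<>]+|(\([^\s()<>]+\)))*\))+(?:\(([^\s()<>]+|(\([^\s()<>]+\)))*\)|[^\s`!()\[\]{};:'".,<>?«»“”‘’]))"""
)

ELLIPSIS = "\u2026"  # the single "…" character, not three dots


def find_urls(text):
    """Return every substring matched by capture group 1."""
    return [m.group(1) for m in WEB_URL_RE.finditer(text)]


def drop_truncated(urls):
    """Hypothetical filter: discard matches that end in an ellipsis."""
    return [u for u in urls if not u.endswith(ELLIPSIS)]


# Hypothetical sample text containing one truncated and one complete URL.
text = "RT https://t.co/h\u2026 see https://example.com/page"
urls = find_urls(text)
print(urls)
print(drop_truncated(urls))
```

Adding the ellipsis to the final character class would instead shorten the match to https://t.co/h, which is still a truncated link, so filtering after the fact may be the simpler option.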
