Liberal, Accurate Regex Pattern for Matching All URLs

The regex patterns in this gist are intended to match any URLs,
including "mailto:foo@example.com", "x-whatever://foo", etc. For a
pattern that attempts only to match web URLs (http, https), see:
https://gist.github.com/gruber/8891611
 
 
# Single-line version of pattern:
 
(?i)\b((?:[a-z][\w-]+:(?:/{1,3}|[a-z0-9%])|www\d{0,3}[.]|[a-z0-9.\-]+[.][a-z]{2,4}/)(?:[^\s()<>]+|\(([^\s()<>]+|(\([^\s()<>]+\)))*\))+(?:\(([^\s()<>]+|(\([^\s()<>]+\)))*\)|[^\s`!()\[\]{};:'".,<>?«»“”‘’]))
 
 
# Multi-line commented version of same pattern:
 
(?xi)
\b
(                                       # Capture 1: entire matched URL
  (?:
    [a-z][\w-]+:                        # URL protocol and colon
    (?:
      /{1,3}                            # 1-3 slashes
      |                                 #   or
      [a-z0-9%]                         # Single letter or digit or '%'
                                        # (Trying not to match e.g. "URI::Escape")
    )
    |                                   #   or
    www\d{0,3}[.]                       # "www.", "www1.", "www2." … "www999."
    |                                   #   or
    [a-z0-9.\-]+[.][a-z]{2,4}/          # looks like domain name followed by a slash
  )
  (?:                                   # One or more:
    [^\s()<>]+                          # Run of non-space, non-()<>
    |                                   #   or
    \(([^\s()<>]+|(\([^\s()<>]+\)))*\)  # balanced parens, up to 2 levels
  )+
  (?:                                   # End with:
    \(([^\s()<>]+|(\([^\s()<>]+\)))*\)  # balanced parens, up to 2 levels
    |                                   #   or
    [^\s`!()\[\]{};:'".,<>?«»“”‘’]      # not a space or one of these punct char
  )
)
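
As a rough illustration of how the pattern might be used, here is a minimal Python 3 sketch that pulls URLs out of running text. The helper name `extract_urls` and the sample sentence are made up for this example and are not part of the gist; the pattern itself is the single-line version copied verbatim.

```python
import re

# Single-line pattern from the gist, split across three raw strings for readability.
URL_PATTERN = re.compile(
    r'(?i)\b((?:[a-z][\w-]+:(?:/{1,3}|[a-z0-9%])|www\d{0,3}[.]|[a-z0-9.\-]+[.][a-z]{2,4}/)'
    r'(?:[^\s()<>]+|\(([^\s()<>]+|(\([^\s()<>]+\)))*\))+'
    r'(?:\(([^\s()<>]+|(\([^\s()<>]+\)))*\)|[^\s`!()\[\]{};:\'".,<>?«»“”‘’]))'
)

def extract_urls(text):
    """Return capture group 1 (the whole URL) for every match in text.

    Hypothetical helper for illustration only.
    """
    return [m.group(1) for m in URL_PATTERN.finditer(text)]

# Made-up sample: a URL ending in balanced parens, a mailto: URL followed
# by a full stop, and a bare "www." host followed by an exclamation mark.
SAMPLE = ('See http://example.com/foo_(bar), or mail '
          'mailto:foo@example.com. Also www.example.org!')

for url in extract_urls(SAMPLE):
    print(url)
```

In this sample the trailing comma, full stop and exclamation mark are left out of the matches, while the balanced "(bar)" stays attached to the first URL.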

It seems the expression doesn't account for several real-life changes to URI patterns in recent years, such as internationalised domain names, new top-level domains that are not between 2 and 4 characters long, IRIs, etc. Some references: http://www.icann.org/en/topics/TLD-acceptance/ and http://www.ietf.org/rfc/rfc3987.txt. A domain like http://موقع.وزارة-الاتصالات.مصر/ is legal and functional today.

KJD: Have you actually tried it? The pattern matches "http://موقع.وزارة-الاتصالات.مصر/" in both PCRE and Perl. What makes you think it doesn't work?
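
To make the exchange above concrete, here is a small sketch using Python 3's `re` module (an assumption on my part; KJD mentions PCRE and Perl). The helper `whole_string_matches` and the `example.museum` test strings are hypothetical, chosen only to show that the `[a-z]{2,4}` restriction applies to the scheme-less branch and not to URLs that start with a scheme.

```python
import re

# Single-line pattern from the gist, split across three raw strings for readability.
URL_PATTERN = re.compile(
    r'(?i)\b((?:[a-z][\w-]+:(?:/{1,3}|[a-z0-9%])|www\d{0,3}[.]|[a-z0-9.\-]+[.][a-z]{2,4}/)'
    r'(?:[^\s()<>]+|\(([^\s()<>]+|(\([^\s()<>]+\)))*\))+'
    r'(?:\(([^\s()<>]+|(\([^\s()<>]+\)))*\)|[^\s`!()\[\]{};:\'".,<>?«»“”‘’]))'
)

def whole_string_matches(s):
    """True if the pattern matches at the start of s and covers all of s."""
    m = URL_PATTERN.match(s)
    return m is not None and m.group(1) == s

CASES = [
    'http://موقع.وزارة-الاتصالات.مصر/',  # IDN with a scheme
    'http://example.museum/',            # 6-letter TLD with a scheme
    'example.museum/index.html',         # 6-letter TLD, no scheme
    'example.com/index.html',            # 3-letter TLD, no scheme
]

for case in CASES:
    print(whole_string_matches(case), case)
```

With a scheme present, the first branch of the pattern accepts both the IDN and the long TLD. Only the scheme-less `example.museum/...` string is rejected, because it has to go through the `[a-z]{2,4}/` branch.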

You're of course correct; I jumped the gun in scanning through the expression. I guess the [a-z]{2,4} part only fails in a case like the following, where the scheme is missing. That is not a legal URL, so you could definitely argue it is less important to catch:

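#!/usr/bin/env python3
# -*- coding: utf-8 -*-

import re

# Single-line pattern from the gist, compiled once.
URL_RE = re.compile(r'(?i)\b((?:[a-z][\w-]+:(?:/{1,3}|[a-z0-9%])|www\d{0,3}[.]|[a-z0-9.\-]+[.][a-z]{2,4}/)(?:[^\s()<>]+|\(([^\s()<>]+|(\([^\s()<>]+\)))*\))+(?:\(([^\s()<>]+|(\([^\s()<>]+\)))*\)|[^\s`!()\[\]{};:\'".,<>?«»“”‘’]))')

def matches(s):
    # re.match only tries the pattern at the start of the string.
    if URL_RE.match(s):
        print("%s matches" % s)
    else:
        print("%s doesn't match" % s)

# Scheme-less internationalised domain: not matched by the [a-z] domain branch.
matches('موقع.وزارة-الاتصالات.مصر/ar/default.aspx')
# Scheme-less ASCII domain with a 3-letter TLD: matched.
matches('example.com/index.html')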
John, since the last revision, your regex will also match stuff like http://#, http://## and http://## /. The previous version didn’t have that problem. I made a quick test case here: http://mathiasbynens.be/demo/url-regex
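
The three strings mentioned above can be checked with a short Python 3 sketch. The helper name `first_match` is hypothetical, and the pattern is the current single-line version from the gist; this does not reproduce the linked test page, only these three inputs.

```python
import re

# Single-line pattern from the gist, split across three raw strings for readability.
URL_PATTERN = re.compile(
    r'(?i)\b((?:[a-z][\w-]+:(?:/{1,3}|[a-z0-9%])|www\d{0,3}[.]|[a-z0-9.\-]+[.][a-z]{2,4}/)'
    r'(?:[^\s()<>]+|\(([^\s()<>]+|(\([^\s()<>]+\)))*\))+'
    r'(?:\(([^\s()<>]+|(\([^\s()<>]+\)))*\)|[^\s`!()\[\]{};:\'".,<>?«»“”‘’]))'
)

def first_match(s):
    """Return the text of the first match in s, or None if there is none."""
    m = URL_PATTERN.search(s)
    return m.group(1) if m else None

for candidate in ('http://#', 'http://##', 'http://## /'):
    print(repr(candidate), '->', repr(first_match(candidate)))
```

All three inputs produce a match. For `http://#` the engine backtracks so that `/{1,3}` takes only one slash, the second slash is consumed by the "one or more" group, and `#` satisfies the final character class.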

Hi.
Please check your regex against the link below.

http://ddos-link.com/[test.......................................]

This link will burn my server :)
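
A likely explanation, offered as my reading and not something stated in the thread: the "one or more" group nests a `+` inside a `+`, and the final character class rejects both `.` and `]`, so a backtracking engine may try a very large number of ways to split the trailing run of dots before it settles on a shorter match. The sketch below assumes Python 3's `re` module and uses a hypothetical helper `probe`. It deliberately keeps the number of dots small; running it on the full link above is not recommended.

```python
import re
import time

# Single-line pattern from the gist, split across three raw strings for readability.
URL_PATTERN = re.compile(
    r'(?i)\b((?:[a-z][\w-]+:(?:/{1,3}|[a-z0-9%])|www\d{0,3}[.]|[a-z0-9.\-]+[.][a-z]{2,4}/)'
    r'(?:[^\s()<>]+|\(([^\s()<>]+|(\([^\s()<>]+\)))*\))+'
    r'(?:\(([^\s()<>]+|(\([^\s()<>]+\)))*\)|[^\s`!()\[\]{};:\'".,<>?«»“”‘’]))'
)

def probe(dots):
    """Match a shortened version of the link with `dots` dots before the ']'.

    Returns (matched_text, elapsed_seconds). Hypothetical helper.
    """
    url = 'http://ddos-link.com/[test' + '.' * dots + ']'
    start = time.perf_counter()
    m = URL_PATTERN.search(url)
    elapsed = time.perf_counter() - start
    return (m.group(1) if m else None), elapsed

# Small dot counts only: the cost is expected to climb steeply as dots are added.
for n in (6, 10, 14):
    matched, seconds = probe(n)
    print('%2d dots: matched %r in %.4fs' % (n, matched, seconds))
```

The matched text stays the same in each run, since the dots and the closing bracket can never end a match. Only the time spent discovering that should change, and the exact timings will depend on the engine and the machine.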
