View regex-weburl.js
//
// Regular Expression for URL validation
//
// Author: Diego Perini
// Updated: 2010/12/05
// License: MIT
//
// Copyright (c) 2010-2013 Diego Perini (http://www.iport.it)
//
// Permission is hereby granted, free of charge, to any person
// obtaining a copy of this software and associated documentation
// files (the "Software"), to deal in the Software without
// restriction, including without limitation the rights to use,
// copy, modify, merge, publish, distribute, sublicense, and/or sell
// copies of the Software, and to permit persons to whom the
// Software is furnished to do so, subject to the following
// conditions:
//
// The above copyright notice and this permission notice shall be
// included in all copies or substantial portions of the Software.
//
// THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND,
// EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES
// OF MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND
// NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT
// HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY,
// WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING
// FROM, OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR
// OTHER DEALINGS IN THE SOFTWARE.
//
// the regular expression, composed & commented,
// could easily be tweaked for RFC compliance;
// it was expressly modified to fit & satisfy
// these tests for a URL shortener:
//
// http://mathiasbynens.be/demo/url-regex
//
// Notes on possible differences from a standard/generic validation:
//
// - utf-8 char class takes the full Unicode range into consideration
// - TLDs have been made mandatory so single names like "localhost" fail
// - protocols have been restricted to ftp, http and https only as requested
//
// Changes:
//
// - IP address dotted notation validation, range: 1.0.0.0 - 223.255.255.255
// first and last IP address of each class is considered invalid
// (since they are broadcast/network addresses)
//
// - Added exclusion of private, reserved and/or local networks ranges
//
// Compressed one-line versions:
//
// Javascript version
//
// /^(?:(?:https?|ftp):\/\/)(?:\S+(?::\S*)?@)?(?:(?!(?:10|127)(?:\.\d{1,3}){3})(?!(?:169\.254|192\.168)(?:\.\d{1,3}){2})(?!172\.(?:1[6-9]|2\d|3[0-1])(?:\.\d{1,3}){2})(?:[1-9]\d?|1\d\d|2[01]\d|22[0-3])(?:\.(?:1?\d{1,2}|2[0-4]\d|25[0-5])){2}(?:\.(?:[1-9]\d?|1\d\d|2[0-4]\d|25[0-4]))|(?:(?:[a-z\u00a1-\uffff0-9]-*)*[a-z\u00a1-\uffff0-9]+)(?:\.(?:[a-z\u00a1-\uffff0-9]-*)*[a-z\u00a1-\uffff0-9]+)*(?:\.(?:[a-z\u00a1-\uffff]{2,})))(?::\d{2,5})?(?:\/\S*)?$/i
//
// PHP version
//
// _^(?:(?:https?|ftp)://)(?:\S+(?::\S*)?@)?(?:(?!(?:10|127)(?:\.\d{1,3}){3})(?!(?:169\.254|192\.168)(?:\.\d{1,3}){2})(?!172\.(?:1[6-9]|2\d|3[0-1])(?:\.\d{1,3}){2})(?:[1-9]\d?|1\d\d|2[01]\d|22[0-3])(?:\.(?:1?\d{1,2}|2[0-4]\d|25[0-5])){2}(?:\.(?:[1-9]\d?|1\d\d|2[0-4]\d|25[0-4]))|(?:(?:[a-z\x{00a1}-\x{ffff}0-9]-*)*[a-z\x{00a1}-\x{ffff}0-9]+)(?:\.(?:[a-z\x{00a1}-\x{ffff}0-9]-*)*[a-z\x{00a1}-\x{ffff}0-9]+)*(?:\.(?:[a-z\x{00a1}-\x{ffff}]{2,})))(?::\d{2,5})?(?:/\S*)?$_iuS
//
var re_weburl = new RegExp(
"^" +
// protocol identifier
"(?:(?:https?|ftp)://)" +
// user:pass authentication
"(?:\\S+(?::\\S*)?@)?" +
"(?:" +
// IP address exclusion
// private & local networks
"(?!(?:10|127)(?:\\.\\d{1,3}){3})" +
"(?!(?:169\\.254|192\\.168)(?:\\.\\d{1,3}){2})" +
"(?!172\\.(?:1[6-9]|2\\d|3[0-1])(?:\\.\\d{1,3}){2})" +
// IP address dotted notation octets
// excludes loopback network 0.0.0.0
// excludes reserved space >= 224.0.0.0
// excludes network & broadcast addresses
// (first & last IP address of each class)
"(?:[1-9]\\d?|1\\d\\d|2[01]\\d|22[0-3])" +
"(?:\\.(?:1?\\d{1,2}|2[0-4]\\d|25[0-5])){2}" +
"(?:\\.(?:[1-9]\\d?|1\\d\\d|2[0-4]\\d|25[0-4]))" +
"|" +
// host name
"(?:(?:[a-z\\u00a1-\\uffff0-9]-*)*[a-z\\u00a1-\\uffff0-9]+)" +
// domain name
"(?:\\.(?:[a-z\\u00a1-\\uffff0-9]-*)*[a-z\\u00a1-\\uffff0-9]+)*" +
// TLD identifier
"(?:\\.(?:[a-z\\u00a1-\\uffff]{2,}))" +
")" +
// port number
"(?::\\d{2,5})?" +
// resource path
"(?:/\\S*)?" +
"$", "i"
);
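
A quick way to try the pattern, as a minimal sketch: the one-line JavaScript version from the header comment is pasted below and run against a handful of URLs taken from the test lists in this thread. The helper name `isWebUrl` is an assumption made for illustration and is not part of the gist.

```javascript
// One-line JavaScript version, copied from the header comment of the gist.
var re_weburl = /^(?:(?:https?|ftp):\/\/)(?:\S+(?::\S*)?@)?(?:(?!(?:10|127)(?:\.\d{1,3}){3})(?!(?:169\.254|192\.168)(?:\.\d{1,3}){2})(?!172\.(?:1[6-9]|2\d|3[0-1])(?:\.\d{1,3}){2})(?:[1-9]\d?|1\d\d|2[01]\d|22[0-3])(?:\.(?:1?\d{1,2}|2[0-4]\d|25[0-5])){2}(?:\.(?:[1-9]\d?|1\d\d|2[0-4]\d|25[0-4]))|(?:(?:[a-z\u00a1-\uffff0-9]-*)*[a-z\u00a1-\uffff0-9]+)(?:\.(?:[a-z\u00a1-\uffff0-9]-*)*[a-z\u00a1-\uffff0-9]+)*(?:\.(?:[a-z\u00a1-\uffff]{2,})))(?::\d{2,5})?(?:\/\S*)?$/i;

// Hypothetical convenience wrapper (not part of the gist).
function isWebUrl(candidate) {
  return re_weburl.test(candidate);
}

var samples = [
  "http://foo.com/blah_(wikipedia)#cite-1",   // ordinary host name
  "http://userid:password@example.com:8080/", // user:pass and port
  "http://142.42.1.1/",                       // public dotted-quad address
  "http://10.1.1.1",                          // private network, excluded
  "http://224.1.1.1",                         // reserved space, excluded
  "http://localhost:8080",                    // no TLD, excluded
  "foo.com"                                   // no protocol, excluded
];

samples.forEach(function (url) {
  console.log(isWebUrl(url), url);
});
```

The first three samples should be accepted and the last four rejected, which matches the notes in the header comment.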

In PHP (for use with preg_match), this becomes:

'%^(?:(?:https?|ftp)://)(?:\S+(?::\S*)?@|\d{1,3}(?:\.\d{1,3}){3}|(?:(?:[a-z\d\x{00a1}-\x{ffff}]+-?)*[a-z\d\x{00a1}-\x{ffff}]+)(?:\.(?:[a-z\d\x{00a1}-\x{ffff}]+-?)*[a-z\d\x{00a1}-\x{ffff}]+)*(?:\.[a-z\x{00a1}-\x{ffff}]{2,6}))(?::\d+)?(?:[^\s]*)?$%iu'

Thanks for the regex Diego, I’ve added it to the test case and it seems to pass all the tests :) Nice job!

Owner

I have added simple network ranges validation, the rules I used are:
- valid range 1.0.0.0 - 223.255.255.255; network addresses from 224.0.0.0 upward are reserved
- the first and last IP addresses of each class are excluded since they are used as network and broadcast addresses
Since I don't think this is worth implementing completely in a regular expression, a following pass should exclude the Intranet address space:
10.0.0.0 - 10.255.255.255
172.16.0.0 - 172.31.255.255
192.168.0.0 - 192.168.255.255
the loopback and the automatic configuration address space:
127.0.0.0 - 127.255.255.255
169.254.0.0 - 169.254.255.255
while the local, multicast and reserved address spaces:
0.0.0.0 - 0.255.255.255 (SPECIAL-IPV4-LOCAL-ID-IANA-RESERVED)
224.0.0.0 - 239.255.255.255 (MCAST-NET)
240.0.0.0 - 255.255.255.255 (SPECIAL-IPV4-FUTURE-USE-IANA-RESERVED)
should already be excluded by the above regular expression.
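
The "following pass" mentioned above could look roughly like this. It is only a sketch: the function name `isExcludedIPv4` is hypothetical, and it covers just the five ranges listed in this comment (Intranet, loopback and automatic configuration).

```javascript
// Hypothetical helper: returns true when a dotted-quad IPv4 string falls in
// one of the ranges listed above, false otherwise (including bad input).
function isExcludedIPv4(address) {
  var parts = address.split(".");
  if (parts.length !== 4) {
    return false; // not a dotted-quad address
  }
  var octets = [];
  for (var i = 0; i < 4; i++) {
    // each part must be 1-3 digits and no larger than 255
    if (!/^\d{1,3}$/.test(parts[i]) || Number(parts[i]) > 255) {
      return false;
    }
    octets.push(Number(parts[i]));
  }
  var a = octets[0];
  var b = octets[1];
  return a === 10 ||                      // 10.0.0.0 - 10.255.255.255
    (a === 172 && b >= 16 && b <= 31) ||  // 172.16.0.0 - 172.31.255.255
    (a === 192 && b === 168) ||           // 192.168.0.0 - 192.168.255.255
    a === 127 ||                          // 127.0.0.0 - 127.255.255.255
    (a === 169 && b === 254);             // 169.254.0.0 - 169.254.255.255
}

console.log(isExcludedIPv4("10.1.1.1"));   // inside the Intranet space
console.log(isExcludedIPv4("142.42.1.1")); // public address
```

Doing the range check numerically keeps the regular expression shorter, at the cost of a second step after the match.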

This is a very minimal list of tests to add to your test suite:

PASS
"http://10.1.1.1",
"http://10.1.1.254",
"http://223.255.255.254"

FAIL
"http://0.0.0.0",
"http://10.1.1.0",
"http://10.1.1.255",
"http://224.1.1.1",
"http://1.1.1.1.1"

Need testing :)

Owner

Need to mention I took the idea of validating the possible IP address ranges in the URL while looking at other developers regular expressions I have seen in your tests, especially the one from @scottgonzales. He also sliced up the Unicode ranges :=), that's the reason his one is so long :)

Awesome stuff Diego!!

Owner

Added IP address validation tweaking and optimizations suggested by @abozhilov

Owner

Added exclusion of private, reserved, auto-configuration and local network ranges as described in the previous message.
Network 0.0.0.0/8 and all networks >= 224.0.0.0/8 are excluded by the second validation block.
The second validation block also takes care of excluding IP address terminating with 0 or 255 (non usable network and broadcast addresses of each class C network).

It is easy to just remove the unwanted parts of the validation to fit different scopes (length, precision) so I will probably add more options like the list of existing TLD (possibly grouped), the list of existing protocols and/or a fall back for a more generic protocol match too.

Hey, just randomly came across this... my JavaScript URI parsing library does strict URI validation as per RFC 3986. It uses a much larger regular expression than this one. Code can be found at: https://github.com/garycourt/uri-js

I changed it a little bit so that it's valid in Ruby. Here it is:

/\A(?:(?:https?|ftp):\/\/)(?:\S+(?::\S*)?@)?(?:(?!10(?:.\d{1,3}){3})(?!127(?:.\d{1,3}){3})(?!169.254(?:.\d{1,3}){2})(?!192.168(?:.\d{1,3}){2})(?!172.(?:1[6-9]|2\d|3[0-1])(?:.\d{1,3}){2})(?:[1-9]\d?|1\d\d|2[01]\d|22[0-3])(?:.(?:1?\d{1,2}|2[0-4]\d|25[0-5])){2}(?:.(?:[1-9]\d?|1\d\d|2[0-4]\d|25[0-4]))|(?:(?:[a-z\u00a1-\uffff0-9]+-?)[a-z\u00a1-\uffff0-9]+)(?:.(?:[a-z\u00a1-\uffff0-9]+-?)[a-z\u00a1-\uffff0-9]+)(?:.(?:[a-z\u00a1-\uffff]{2,})))(?::\d{2,5})?(?:\/[^\s])?\z/i

Hi Diego,

Just came across this awesome code. I'd like to use this as a basis, and I'm hoping you can help me with a simple tweak. I'd like to let through URL's without the protocol specified (HTTP(S) or FTP). For some reason I can't seem to get it to work.

Thanks,
NMMM

Hey Diego, nice work. You can make it a bit shorter though. These two lookaheads:

(?!10(?:\\.\\d{1,3}){3})
(?!127(?:\\.\\d{1,3}){3})

can be merged into one:

(?!(10|127)(?:\\.\\d{1,3}){3})

Similarly with the 0.0.255.255 subnets

@dperini Can you assign a license to this? MIT or BSD?

+1 for the license information

+1 for the license information from me, too

+infinity on the license Diego

Owner

I have added the MIT License to the gist as requested.

Thank you all for the support.

@dperini: Could you add support for url such this?

//dc8hdnsmzapvm.cloudfront.net/assets/styles/application.css

thanks

Is there a Java version of the regex available? That would be great for my android app!

@mparodi Ruby version untouched by markdown

/\A(?:(?:https?|ftp):\/\/)(?:\S+(?::\S*)?@)?(?:(?!10(?:\.\d{1,3}){3})(?!127(?:\.\d{1,3}){3})(?!169\.254(?:\.\d{1,3}){2})(?!192\.168(?:\.\d{1,3}){2})(?!172\.(?:1[6-9]|2\d|3[0-1])(?:\.\d{1,3}){2})(?:[1-9]\d?|1\d\d|2[01]\d|22[0-3])(?:\.(?:1?\d{1,2}|2[0-4]\d|25[0-5])){2}(?:\.(?:[1-9]\d?|1\d\d|2[0-4]\d|25[0-4]))|(?:(?:[a-z\u00a1-\uffff0-9]+-?)*[a-z\u00a1-\uffff0-9]+)(?:\.(?:[a-z\u00a1-\uffff0-9]+-?)*[a-z\u00a1-\uffff0-9]+)*(?:\.(?:[a-z\u00a1-\uffff]{2,})))(?::\d{2,5})?(?:\/[^\s]*)?\z/i

Ruby port:

class Regexp

  PERFECT_URL_PATTERN = %r{
    \A

    # protocol identifier
    (?:(?:https?|ftp)://)

    # user:pass authentication
    (?:\S+(?::\S*)?@)?

    (?:
      # IP address exclusion
      # private & local networks
      (?!10(?:\.\d{1,3}){3})
      (?!127(?:\.\d{1,3}){3})
      (?!169\.254(?:\.\d{1,3}){2})
      (?!192\.168(?:\.\d{1,3}){2})
      (?!172\.(?:1[6-9]|2\d|3[0-1])(?:\.\d{1,3}){2})

      # IP address dotted notation octets
      # excludes loopback network 0.0.0.0
      # excludes reserved space >= 224.0.0.0
      # excludes network & broadcast addresses
      # (first & last IP address of each class)
      (?:[1-9]\d?|1\d\d|2[01]\d|22[0-3])
      (?:\.(?:1?\d{1,2}|2[0-4]\d|25[0-5])){2}
      (?:\.(?:[1-9]\d?|1\d\d|2[0-4]\d|25[0-4]))
    |
      # host name
      (?:(?:[a-z\u00a1-\uffff0-9]+-?)*[a-z\u00a1-\uffff0-9]+)

      # domain name
      (?:\.(?:[a-z\u00a1-\uffff0-9]+-?)*[a-z\u00a1-\uffff0-9]+)*

      # TLD identifier
      (?:\.(?:[a-z\u00a1-\uffff]{2,}))
    )

    # port number
    (?::\d{2,5})?

    # resource path
    (?:/[^\s]*)?

    \z
  }xi

end

And specs:

# encoding: utf-8


require "spec_helper"


describe "Regexp::PERFECT_URL_PATTERN" do

  [
    "http://✪df.ws/123",
    "http://userid:password@example.com:8080",
    "http://userid:password@example.com:8080/",
    "http://userid@example.com",
    "http://userid@example.com/",
    "http://userid@example.com:8080",
    "http://userid@example.com:8080/",
    "http://userid:password@example.com",
    "http://userid:password@example.com/",
    "http://142.42.1.1/",
    "http://142.42.1.1:8080/",
    "http://➡.ws/䨹",
    "http://⌘.ws",
    "http://⌘.ws/",
    "http://foo.com/blah_(wikipedia)#cite-1",
    "http://foo.com/blah_(wikipedia)_blah#cite-1",
    "http://foo.com/unicode_(✪)_in_parens",
    "http://foo.com/(something)?after=parens",
    "http://☺.damowmow.com/",
    "http://code.google.com/events/#&product=browser",
    "http://j.mp",
    "ftp://foo.bar/baz",
    "http://foo.bar/?q=Test%20URL-encoded%20stuff",
    "http://مثال.إختبار",
    "http://例子.测试"
  ].each do |valid_url|
    it "matches #{valid_url}" do
      expect(Regexp::PERFECT_URL_PATTERN =~ valid_url).to eq 0
    end
  end



  [
    "http://",
    "http://.",
    "http://..",
    "http://../",
    "http://?",
    "http://??",
    "http://??/",
    "http://#",
    "http://##",
    "http://##/",
    "http://foo.bar?q=Spaces should be encoded",
    "//",
    "//a",
    "///a",
    "///",
    "http:///a",
    "foo.com",
    "rdar://1234",
    "h://test",
    "http:// shouldfail.com",
    ":// should fail",
    "http://foo.bar/foo(bar)baz quux",
    "ftps://foo.bar/",
    "http://-error-.invalid/",
    "http://a.b--c.de/",
    "http://-a.b.co",
    "http://a.b-.co",
    "http://0.0.0.0",
    "http://10.1.1.0",
    "http://10.1.1.255",
    "http://224.1.1.1",
    "http://1.1.1.1.1",
    "http://123.123.123",
    "http://3628126748",
    "http://.www.foo.bar/",
    "http://www.foo.bar./",
    "http://.www.foo.bar./",
    "http://10.1.1.1",
    "http://10.1.1.254"
  ].each do |invalid_url|
    it "does not match #{invalid_url}" do
      expect(Regexp::PERFECT_URL_PATTERN =~ invalid_url).to be_nil
    end
  end

end

Very good, thank you for sharing.

I added support for punycoded domain names: https://gist.github.com/HenkPoley/8899766

Owner

Updated the gist with reductions/shortenings suggested by "jpillora".

Thank you !

Owner

raitucarp,

to do that you can change line 65 from:

"(?:(?:https?|ftp)://)" +

to

"(?:(?:(?:https?|ftp):)?//)" +

this way the protocol and colon become an optional match.

You can also just leave the double slash on that line if no URLs have the protocol prefix:

"//" +
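
To see what the change does, here is a small sketch that looks at the protocol part only, in isolation from the rest of the pattern. The variable names `strictPrefix` and `optionalProtocol` are made up for this illustration.

```javascript
// Only the protocol part of the pattern, anchored at the start of the string.
var strictPrefix = /^(?:(?:https?|ftp):\/\/)/i;          // original line
var optionalProtocol = /^(?:(?:(?:https?|ftp):)?\/\/)/i; // suggested change

var samples = [
  "http://example.com/",
  "//dc8hdnsmzapvm.cloudfront.net/assets/styles/application.css",
  "foo.com"
];

samples.forEach(function (url) {
  console.log(strictPrefix.test(url), optionalProtocol.test(url), url);
});
```

The protocol-relative URL passes only the second prefix, while "foo.com" still fails both because the double slash remains mandatory.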

Why can't the maximum range for Unicode strings extend to U0010ffff (instead of uffff)?

What about relative URLs?

../
./
/

@stevenvachon relatives wouldn't be URLs they would be paths, which wouldn't need this validation at that point.

jkj commented

I recently needed this but have a dumb question. In the very last part for the resource path, why do you use [^\\s] rather than \\S ? To my understanding they are equivalent, with the latter being a bit shorter.

dimroc commented

For the following Regex and the one pasted by ixti:

    URL = /\A(?:(?:https?):\/\/)?(?:\S+(?::\S*)?@)?(?:(?:(?:[a-z0-9][a-z0-9\-]+)*[a-z0-9]+)(?:\.(?:[a-z0-9\-])*[a-z0-9]+)*(?:\.(?:[a-z]{2,})(:\d{1,5})?))(?:\/[^\s]*)?\z/i

You will end up with extremely slow matching, to the point where you suspect an infinite loop, if you have a long subdomain for a URL ending with a period:

ie:

it { should_not match "http://aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa.randomstring." }

The longer the subdomain "aaa....", the longer it'll take.

Fixed the URL Regex to make the subdomain match non-recursive thereby improving performance. Long story short: it passed our existing test suite and improved performance dramatically.

    URL = /\A(?:(?:https?):\/\/)?(?:\S+(?::\S*)?@)?(?:(?:([a-z0-9][a-z0-9\-]*)?[a-z0-9]+)(?:\.(?:[a-z0-9\-])*[a-z0-9]+)*(?:\.(?:[a-z]{2,})(:\d{1,5})?))(?:\/[^\s]*)?\z/i

Anyone have a python port? My recollection was that the python regexp engine does have some differences.

@dperini you should add support for 32bit addresses and ipv6 addresses.

https://news.ycombinator.com/item?id=7928990

I vote that this should be turned into a git repository with multi-language ports.

I'm also using the top of the page gist regex in JS and finding it very slow to process long invalid URLs such as:
http://qweqweqweqwesadasdqweeqweqwsd

The more letters added there the slower the response.

It sounds like what @phiyangt is referring to above.

Is there any solution for this for JS?

Thanks.

Owner

@dmroc @Feendish try running the same code you say is slow in Firefox; maybe you are just using the wrong browser for your tests/objectives. You didn't specify any code or environment info to replicate the problem.
However I guess you are using Chrome :9) if not, please provide more info.

Owner

Well, after a few tests I can say the slowdown and subsequent browser crash is a Chrome-only problem.
I tried the same in Firefox and everything works correctly with these REGEXP, no problem or slowdown.

I have reduced the original REGEXP to a minimal to be able to show the problem.
Try the following line in Chrome console, it will crash the browser:

/^(?:\w+)(?:.(?:[\w]+-?)[\w]+)(?:.[a-z]{2,})$/i.test('www.isjdfofjasodfjsodifjosadifjsdoiafjaisdjfisdfjs');

So I believe this is just a bug in Chrome RE engine.

Hi Diego,

Yeah I'm on latest stable Chrome (Version 35.0.1916.153 m).

This is the "bad" url I'm checking http://qweqweqweqwesadasdqweeqweqwsdqweqweqweqwesadasdqweeqweqwsd

The original regex I'm using (the one from the Gist on top - 1 liner or full version) locks the browser in Chrome as you say. It also locks up IE11.

In Firefox 29 it gave this error:
InternalError: an error occurred while executing regular expression

I updated to latest Firefox v30. The regex runs and gives false which is correct.

From some research online it appears Chrome does not halt execution when there is catastrophic backtracking in a regex. Safari, Firefox and IE could just report 'no match' after some arbitrary number of backtracks.

I also tried your recent regex above and it doesn't lock any browsers.

However it returns true for 'isjdfofjasodfjsodifjosadifjsdoiafjaisdjfisdfjs' which is invalid.
It also returns false for 'http://isjdfofjasodfjsodifjosadifjsdoiafjaisdjfisdfjs.com' which is incorrect.

Are you sure there isn't a runaway loop in there somewhere?

Owner

@Feendish
I don't know why copying and pasting the above RE in Chrome console mangles some character, it actually doesn't crash the console of the browser window.

Please try to cut and paste the RE from this tweet:
https://twitter.com/diegoperini/status/481449088270229504

I retested it and it actually crashes the console in that it doesn't answer to commands anymore after running that RE test that you can find in the above tweet.

The fact that the original RE also works on Safari pushes me to believe it's a Chrome problem but I need to do more tests. The "weburl" RE also work in PHP and other environments.

I am testing on the same Chrome Version 35.0.1916.153 under OS X 10.9.3.

Suggestion and help on this matter are welcome !

@dperini This seems to be a V8 issue. Relevant bug ticket: https://code.google.com/p/v8/issues/detail?id=430

@dperini I ran the RE from the tweet in RegexBuddy analyser and it says "Your regular expression leads to "catastrophic backtracking", making it too complex to be run to completion."

It locks up Chrome & Opera but not Firefox. As the ticket @mathiasbynens linked to suggests, certain browsers are more lenient when catastrophic backtracking happens. Chrome V8 seems to not have any fail limit for this and puts the onus on the regex format.

Owner

@Feendish
can you contact me via email ?
I have a newer version of the RE that doesn't crash Chrome.
Maybe you can try it and give me some feedback before I push it to a new gist.

Sure sent it there now. Thanks.

EtaiG commented

@dperini, we've found this issue too... looks like there's a highly exponential recursion into infinity on simple strings.

I've managed to reduce this to the way the hostname check is written (since it's followed later (eventually) by TLD).
It's this simple format that will cause the problem:

var regx = new RegExp('^(\\w+)*[^\\w]$');
regx.test('aaaaaaaaaaaaaaaaaaaaaaaaaa');  //chrome will crash

In other words, when a group that repeats something 1 to infinity times is itself repeated 0 to infinity times, and the next token matches anything not in the group (obviously... but I put [^\w] just to illustrate), then Chrome will keep recursing in search of some arrangement of groups (1 to n characters each, repeated 0 to m times) that lets that final token match.

Of course, internally, the regex should first be run 'greedily' to check if there's a possible match by making sure required letters are there..

Essentially, if I were to write the implementation for a regex, when encountering such a group, I would internally be doing this:

var regx = new RegExp('^(?=\\w*[^\\w])(?:\\w+)*[^\\w]$');
regx.test('aaaaaaaaaaaaaaaaaaaaaaaaaa');  //chrome will not crash

because first I'm doing a positive lookahead to check if this is even possible... though the complexity for this rises as the nested groups become more complex

Finally, I think this can be fixed here, by changing the host name from:

(?:(?:[a-z\\u00a1-\\uffff0-9]+-?)*[a-z\\u00a1-\\uffff0-9]+)

to:

(?:(?:[a-z\\u00a1-\\uffff0-9]-?)*[a-z\\u00a1-\\uffff0-9]+)

which is really the same thing, if you think about it.
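
A small check of that claim, as a sketch only: the two host name patterns are reduced to ASCII-only stand-ins (the \u00a1-\uffff range is left out for brevity) and compared on a few labels. The names `nested` and `flat` are assumptions made for this example.

```javascript
// ASCII-only stand-ins for the two host name patterns.
var nested = /^(?:[a-z0-9]+-?)*[a-z0-9]+$/i; // quantifier inside a quantifier
var flat = /^(?:[a-z0-9]-?)*[a-z0-9]+$/i;    // one character per iteration

var labels = ["foo", "foo-bar", "a-b-c", "-foo", "foo-", "a--b", ""];

// Collect every label on which the two patterns give different answers.
var disagreements = labels.filter(function (label) {
  return nested.test(label) !== flat.test(label);
});

console.log(disagreements.length);
```

Both patterns accept the same labels here; the difference is that the second one leaves the engine far fewer ways to carve up a run of letters when the overall match fails.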

EtaiG commented

In fact, I believe the whole host-domain-TLD identifier is the same as this (but this should be more performant and not crash):

      // host name
      "(?:[a-z\\u00a1-\\uffff0-9]-?)*[a-z\\u00a1-\\uffff0-9]+" +
      // domain name
      "(?:\\.(?:[a-z\\u00a1-\\uffff0-9]-?)*[a-z\\u00a1-\\uffff0-9])*" +
      // TLD identifier
      "\\.[a-z\\u00a1-\\uffff]{2,}" +

There's no need to add non-capturing groups if you're not doing anything with the group... if you plan to modify a group with a repeater, lookahead or just use an OR operator in it, then use a group, but otherwise there's really no point (since all you want, is to make sure everything in the group is present... which you don't need to use a group for!)

Owner

Thank you @EtaiG,
your expression looks good too.

However I have been pushed to "re-read" the specifications thoroughly and was answered on a V8 ticket here: https://code.google.com/p/v8/issues/detail?id=430
In post #21 @erik suggested I consider rewriting the labels matching parts using lookahead.

Since most people wanted a Javascript pattern for checking inputs, I ran the tests in Javascript only.

This is the result of following his advice, no ftp protocol no special IP handling, only the minimal:

var re_weburl = new RegExp(
    "^" +
        // protocol identifier (optional) + //
        "(?:(?:https?:)?//)?" +
        // user:pass authentication (optional)
        "(?:\\S+(?::\\S*)?@)?" +
        // host (optional) + domain + tld
        "(?:(?!-)[-a-z0-9\\u00a1-\\uffff]*[a-z0-9\\u00a1-\\uffff]+(?!./|\\.$)\\.?){2,}" +
        // server port number (optional)
        "(?::\\d{2,5})?" +
        // resource path (optional)
        "(?:/\\S*)?" +
    "$", "i"
);

This RE fits in a tweet ! But let's see how it works for you.

I also changed [^\s] with a \S as suggested by @jkj and relaxed the match on protocol identifiers.

Consecutive hyphens are allowed by the specifications, but they must not appear in both the 3rd and 4th positions; those sequences are reserved for "xn--" and similar ASCII Compatible Encodings. If that exclusion were necessary, maybe a simple negative lookahead (?!..--) would help there too.

EtaiG commented

@dperini , thanks for responding.
I read all the specifications too last week (RFC's 5890 - 5894 and RFC 3492, several times), due to this issue. I'm also poster #24 in the google v8 thread.

Please note that I will be analysing this issue in depth below, and if I come off critical - that is not my intent, so I apologize in advance.

I disagree with the negative lookaheads. There are rare cases when they are truly useful.
I believe in minimizing them whenever possible, especially when repeating something up to an 'infinite' amount of times, since they can cause dreadful performance for complicated matches..

I like being more explicit about the regex- which may make it more verbose, but it's very clear what the javascript engine needs to do to match it.

For example, when you have:

// host (optional) + domain + tld
        "(?:(?!-)[-a-z0-9\\u00a1-\\uffff]*[a-z0-9\\u00a1-\\uffff]+(?!./|\\.$)\\.?){2,}" +

This part can match long strings in too many different ways, and the regex is too general, so for characters which would match both the first character group and the second (namely, almost anything except for a dot and a hyphen), it can match an exponential number of times.

For example, it can match 'ab' as:
a b | ab
and it can match 'abc' as:
a b c | a bc | abc | ab c
and it can match 'abcd' as:
a b c d | a b cd | a bc d | a bcd | abcd | ab c d | ab cd | abc d

It's easy to see that for a string of length n, it has 2^(n-1) possible matches.
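
The count can be checked by brute force. The sketch below enumerates every way of cutting a non-empty string into consecutive non-empty pieces; the function name `splits` is made up for this illustration and has nothing to do with how a regex engine is implemented.

```javascript
// Enumerate every way of cutting a non-empty string into consecutive
// non-empty pieces.
function splits(text) {
  if (text.length <= 1) {
    return [[text]];
  }
  // Either the whole string is a single piece...
  var result = [[text]];
  // ...or a proper prefix is followed by any split of the remainder.
  for (var cut = 1; cut < text.length; cut++) {
    var head = text.slice(0, cut);
    splits(text.slice(cut)).forEach(function (tail) {
      result.push([head].concat(tail));
    });
  }
  return result;
}

console.log(splits("abc").length);  // 4 = 2^(3-1)
console.log(splits("abcd").length); // 8 = 2^(4-1)
```

Each of the n-1 gaps between characters is either a cut or not, which is where the 2^(n-1) comes from.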

The way a greedy quantifier works is that it will stop as soon as it finds a possible match - otherwise it will try the next possibility in order to continue matching the regular expression.
This means that a sufficiently long string (i.e n = 21) which would result in a non-match, such as:
'aaaaaaaaaaaaaaaaaaaa.' (note the period at the end)
can cause it to take extremely long, and possibly crash (2^20 > 1,000,000)
Ignoring what's actually placed in memory and checked during a regex, by putting this in console, you can see what I mean:

var i=0, len = 2<<20;
console.time('test');
while(i<len){i++}
console.timeEnd('test');
// approximately 8s

You can test out your regex against that string (the one with the period at the end) and you'll see what I mean.

Also, note that 'aaaaaaaaaaaaaaaaaaaaaaaaaa' will match your regex although it's invalid.

This is because of the generalization of the check using greedy quantifiers, enabled by the negative lookahead (?!.\/|.$) (or by both of them?)

This is why I don't like negative lookaheads and prefer to be more declarative. You're almost forced to be more declarative when you don't use the negative lookaheads... but in the end, you are giving 'better instructions' to the javascript engine.

That's why I liked this better (for the host/domain/tld):

/(?:[a-z\u00a1-\uffff0-9]-*)*[a-z\u00a1-\uffff0-9]+(?:\.(?:[a-z\u00a1-\uffff0-9]-*)*[a-z\u00a1-\uffff0-9])*\.[a-z\u00a1-\uffff]{2,}/

Note that this is the same as what I posted above, with the exception of switching out the -? for -* (in both host and domain) to allow for as many hyphens in between letters.

This doesn't take care of the xn-- and 3rd/4th position issue, but unless you're allowing someone to register a domain by you, this is less of an issue (since for most cases, it's for a link, and people only need to link to something that is allowed and exists)... and even then, serverside validation would be necessary.

Owner

@EtaiG many thanks for the review and the good suggestions.
After trying myself your tweaks I have to completely agree with your points.
I still believe that by moving the dot matching to the end of the RE the host/domain/tld part can be reduced to only two main groups (since the only label we don't want followed by a dot is the TLD):

// host (optional) + domain + tld
"(?:(?:[a-z\\u00a1-\\uffff0-9]-*)*[a-z\\u00a1-\\uffff0-9]+\\.)+" +
   "(?:[a-z\\u00a1-\\uffff0-9]-*)*[a-z\\u00a1-\\uffff0-9]+" +

I am not sure I should consider digits as valid in the TLD group (also it is considered a label itself).

Now the tests do not lock up Chrome and it also seems the overall speed for URL validation is faster.

Owner

The gist has been corrected/updated so it doesn't lock up Chrome Javascript.
I haven't reduced the host / domain / tld matching groups but I will do after testing.
Many thanks to @EtaiG for the help and the suggestions to resolve the problem.

I believe the slash before query params is optional. http://www.example.com?a=1&b=2 should pass, but it currently does not.

Changing line 93 to

"(?:/?\\S*)?" +

solves that issue, but might break other query-parameter specifications that aren't covered in the test cases.
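
A sketch of the side effect hinted at above. The host part is a simplified ASCII-only stand-in for the full host/domain/TLD groups of the gist, and the `tight` variant (tail must start with "/", "?" or "#") is an alternative assumed for this illustration, not something proposed in the thread.

```javascript
// Simplified ASCII-only host pattern, used only as a stand-in for the
// full host/domain/TLD part of the gist.
var host = "(?:[a-z0-9-]+\\.)+[a-z]{2,}";
var port = "(?::\\d{2,5})?";

// Suggested change: optional slash, then any run of non-space characters.
var loose = new RegExp("^https?://" + host + port + "(?:/?\\S*)?$", "i");

// Assumed alternative: the tail must begin with "/", "?" or "#".
var tight = new RegExp("^https?://" + host + port + "(?:[/?#]\\S*)?$", "i");

var samples = [
  "http://www.example.com?a=1&b=2",  // query without a slash
  "http://www.example.com:8080junk"  // junk glued to the port
];

samples.forEach(function (url) {
  console.log(loose.test(url), tight.test(url), url);
});
```

With the optional slash, anything non-blank may follow the port, so the second sample is accepted too; restricting the first character of the tail keeps the query case working while rejecting it.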

Owner

@schbetsy I am not sure it is optional either.
Anyway, your change fixes that if it becomes necessary for some readers.
What I can see is that browsers accept that but then they insert a slash in it when finished.
I am curious to try the effects of this change on my current tests.
Thank you for pointing that out.

Hey @dperini,

Thanks for your great work! Please note that this regex fails on the following url: http://localhost:8080

Owner

@eluck,
it is written in the comments: 'TLDs have been made mandatory so single names like "localhost" fails'.
The regex was built to match URLs having a real domain name (at least 2 labels separated by a dot).
However it will be very easy to add 'localhost' as an acceptable exception.
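
One possible way to add that exception without touching the main pattern, as a sketch: a second, much smaller pattern for localhost is tried when the base pattern fails. The names `re_localhost` and `allowLocalhost` are hypothetical, and the base pattern used below is only a stand-in; in real use `re_weburl` would be passed instead.

```javascript
// http(s)://localhost with an optional port and path.
var re_localhost = /^https?:\/\/localhost(?::\d{2,5})?(?:\/\S*)?$/i;

// Hypothetical wrapper: accept anything the base pattern accepts,
// plus localhost URLs. baseRegex should not carry the "g" flag,
// because test() is stateful on global patterns.
function allowLocalhost(baseRegex) {
  return function (candidate) {
    return baseRegex.test(candidate) || re_localhost.test(candidate);
  };
}

// Stand-in base pattern for the demonstration only.
var isUrl = allowLocalhost(/^https?:\/\/example\.com\/?$/i);

console.log(isUrl("http://localhost:8080"));
console.log(isUrl("http://localhostx"));
```

Keeping the exception in a separate pattern leaves the original expression untouched.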

Hey!

can you help me make this URI valid "foo.com"

thanks ahead!

PYTHON PORT (cc @brifordwylie):

import re

# Raw strings keep the regex escapes (\S, \d, \.) intact under Python 3;
# the re module itself understands \uXXXX inside patterns (Python 3.3+).
URL_REGEX = re.compile(
    r"^"
    # protocol identifier
    r"(?:(?:https?|ftp)://)"
    # user:pass authentication
    r"(?:\S+(?::\S*)?@)?"
    r"(?:"
    # IP address exclusion
    # private & local networks
    r"(?!(?:10|127)(?:\.\d{1,3}){3})"
    r"(?!(?:169\.254|192\.168)(?:\.\d{1,3}){2})"
    r"(?!172\.(?:1[6-9]|2\d|3[0-1])(?:\.\d{1,3}){2})"
    # IP address dotted notation octets
    # excludes loopback network 0.0.0.0
    # excludes reserved space >= 224.0.0.0
    # excludes network & broadcast addresses
    # (first & last IP address of each class)
    r"(?:[1-9]\d?|1\d\d|2[01]\d|22[0-3])"
    r"(?:\.(?:1?\d{1,2}|2[0-4]\d|25[0-5])){2}"
    r"(?:\.(?:[1-9]\d?|1\d\d|2[0-4]\d|25[0-4]))"
    r"|"
    # host name
    r"(?:(?:[a-z\u00a1-\uffff0-9]-?)*[a-z\u00a1-\uffff0-9]+)"
    # domain name
    r"(?:\.(?:[a-z\u00a1-\uffff0-9]-?)*[a-z\u00a1-\uffff0-9]+)*"
    # TLD identifier
    r"(?:\.(?:[a-z\u00a1-\uffff]{2,}))"
    r")"
    # port number
    r"(?::\d{2,5})?"
    # resource path
    r"(?:/\S*)?"
    r"$",
    # IGNORECASE mirrors the "i" flag of the JavaScript original
    re.IGNORECASE | re.UNICODE)

I did make one change: the "-*" in both domain and host was (incorrectly) succeeding against "http://a.b--c.de/" so I changed it to "-?" - I'm not sure why that's in the gist above, I'd think it would fail on a JS unit test also.

Owner

@adamrofer,
it seems the URL "http://a.b--c.de/" you are testing against is actually a valid URL.
As is "http://g--a.com/". Just test it, it exists and resolves correctly to a Georgia State page.
I have been directed to read the relevant specs here:
http://url.spec.whatwg.org/#concept-host-parser
and the validity criteria are here:
http://www.unicode.org/reports/tr46/#Validity_Criteria
Thank you for the Python port !

@dperini
Can you support international URLs?
For example: http://xn--80aaxitdbjk.xn--p1ai

Owner

@nghuuphuoc,
the regexp already supports international URLs, just write them using natural UTF-8 encoding.
The following is the UTF-8 version of the URL you typed above:
http://папироска.рф
It would be hard to type or remember IDN URLs like the one you typed; nobody would do that.
This has been written to validate URLs typed by users and/or found in log files.

@dperini thanks for sharing :+1:

@dperini,
I'm using the chai.js assert library to write a simple test for a js object in my rails app. This is for initial client side form validation. Some of the uri formats tested in the @ixti spec above are failing to return false; here's the list.

"http://a.b--c.de/",
"http://-a.b.co",
"http://a.b-.co",
"http://0.0.0.0",
"http://10.1.1.0",
"http://10.1.1.255",
"http://224.1.1.1",
"http://1.1.1.1.1",
"http://123.123.123",
"http://3628126748",
"http://.www.foo.bar/",
"http://www.foo.bar./",
"http://.www.foo.bar./",
"http://10.1.1.1",
"http://10.1.1.254"

Here's my code

form_validators.coffee
#= require regex-weburl
class @FormValidators
  uri: (uri)->
    re_weburl.test(uri)
form_validators.js.coffee
#= require ../spec_helper
describe 'FormValidators', ->
  describe '#uri', ->
    beforeEach ->
      @formValidators = new FormValidators()
    it 'returns false for invalid urls', ->
      assert.notOk @formValidators.uri("http://")
      assert.notOk @formValidators.uri("http://.")
      assert.notOk @formValidators.uri("http://..")
      assert.notOk @formValidators.uri("http://../")
      assert.notOk @formValidators.uri("http://?")
      assert.notOk @formValidators.uri("http://??")
      assert.notOk @formValidators.uri("http://??/")
      assert.notOk @formValidators.uri("http://#")
      assert.notOk @formValidators.uri("http://##")
      assert.notOk @formValidators.uri("http://##/")
      assert.notOk @formValidators.uri("http://foo.bar?q=Spaces should be encoded")
      assert.notOk @formValidators.uri("//")
      assert.notOk @formValidators.uri("//a")
      assert.notOk @formValidators.uri("///a")
      assert.notOk @formValidators.uri("///")
      assert.notOk @formValidators.uri("http:///a")
      assert.notOk @formValidators.uri("foo.com")
      assert.notOk @formValidators.uri("rdar://1234")
      assert.notOk @formValidators.uri("http:// shouldfail.com")
      assert.notOk @formValidators.uri(":// should fail")
      assert.notOk @formValidators.uri("http://foo.bar/foo(bar)baz quux")
      assert.notOk @formValidators.uri("http://-error-.invalid/")
      assert.notOk @formValidators.uri("http://a.b--c.de/")
      assert.notOk @formValidators.uri("http://-a.b.co")
      assert.notOk @formValidators.uri("http://a.b-.co")
      assert.notOk @formValidators.uri("http://0.0.0.0")
      assert.notOk @formValidators.uri("http://10.1.1.0")
      assert.notOk @formValidators.uri("http://10.1.1.255")
      assert.notOk @formValidators.uri("http://224.1.1.1")
      assert.notOk @formValidators.uri("http://1.1.1.1.1")
      assert.notOk @formValidators.uri("http://123.123.123")
      assert.notOk @formValidators.uri("http://3628126748")
      assert.notOk @formValidators.uri("http://.www.foo.bar/")
      assert.notOk @formValidators.uri("http://www.foo.bar./")
      assert.notOk @formValidators.uri("http://.www.foo.bar./")
      assert.notOk @formValidators.uri("http://10.1.1.1")
      assert.notOk @formValidators.uri("http://10.1.1.254")

Just thought I would take the time to let you know. I'm not sure if something changed recently, or if you are even supporting this script anymore. Good work by the way, it saved me a ton of time.

@adamrofer's fix of changing ( -* ) to ( -? ) in the host and domain name sections fixed the JS unit test for me.

Owner

@dsgn1graphics,
I suggest you check your tests and/or the port of the regular expression you are currently using.
In the list of URLs failing validation that you sent above, only the first one ("http://a.b--c.de/") is a valid URL; all the others do not validate against the regex.

I tested them once more within my environment (JavaScript) and everything works as expected.

Thanks Diego for your hard work! :+1: to @CMCDragonkai's comment, though: IPv6 support and a Git repo with ports to multiple languages are both really great ideas.

Hi @dperini

I love the expression, but I'm wondering what modification I would need to make so that the pattern ignores a URL if it is preceded by either a " or = or ] or > and followed by either a " or [/ or </

It is so that the following won't be validated:

[link=http://www.google.com]google.com[/link]

and

<a href="http://www.google.com">google.com</a>

The reason is that I currently use a modified version of Gruber's regex as part of a PHP auto-URL function in the following manner, but I would like to use yours instead:

// Regular expression for URLs
// Based on http://daringfireball.net/2010/07/improved_regex_for_matching_urls
// Improved to only pick up links beginning with http, https, ftp, ftps, mailto and www
$regex = "_(?i)\b((?:https?|ftps?|mailto|www\d{0,3}[.]|[a-z0-9.\-]+[.][a-z]{2,4}/)(?:[^\s()<>]+|\(([^\s()<>]+|(\([^\s()<>]+\)))*\))+(?:\(([^\s()<>]+|(\([^\s()<>]+\)))*\)|[^\s`!()\[\]{};:'\".,<>?«»“”‘’]))_iuS";

// If markup is TRUE, convert URLs to html markup
if ($markup == TRUE) $string = preg_replace_callback($regex, array(&$this, 'auto_url'), $string);

Thanks, Matt

Additionally, my thinking behind this question is to allow the manual coding of links using HTML or BBCode.

Owner

Matt,
just saw this ... as a quick suggestion you can try something like:

(?:\x22|\x3d|\x5d|\x3e)(?:regex-weburl)(?:\x22|\x5b\x2f|\x3c\x2f)

haven't tried it, not sure it does exactly what you asked/depicted.
It's a start anyway :smile:

Owner

Matt,
a better approach to match corresponding open/close brackets and quotes would require more work:

(?:\x5d(?:regex-weburl)\x5b\x2f)|
(?:\x3e(?:regex-weburl)\x3c\x2f)|
(?:\x22(?:regex-weburl)\x22)|
(?:\x3d(?:regex-weburl))

again, I haven't tested it.
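
For the "ignore URLs that are already inside markup" case, one possible alternative is a negative lookbehind in front of the URL pattern. The following is only a rough sketch in JavaScript: `urlSource` is a simplified, hypothetical stand-in for the real regex-weburl source, `findBareUrls` is an illustrative name, and it assumes an engine with lookbehind support (recent Node.js and browsers). Checking the preceding character alone covers both examples above, because a trailing negative lookahead could be defeated by the engine backtracking to a shorter match.

```javascript
// Hypothetical stand-in for the full regex-weburl source; swap in the real pattern.
const urlSource = String.raw`https?:\/\/[^\s"<>\[\]]+`;

// Negative lookbehind: reject a URL directly preceded by ", =, ] or >.
const bareUrl = new RegExp(String.raw`(?<!["=\]>])(?:${urlSource})`, "gi");

// Return every URL in `text` that is not already part of HTML or BBCode markup.
function findBareUrls(text) {
  return text.match(bareUrl) || [];
}

console.log(findBareUrls("see http://www.google.com for details"));
console.log(findBareUrls("[link=http://www.google.com]google.com[/link]"));
console.log(findBareUrls('<a href="http://www.google.com">google.com</a>'));
```

The first call should report the bare URL, while the two markup examples should come back empty.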

Owner

Oire,
yes I believe it would be a good idea to move this to a Git repo.

However, I disagree about supporting patterns that users will never type, like IPv6 and Punycode. I am even inclined to remove IPv4 validation from the base regex: nobody remembers these numbers, and they will most likely change over time.

Nobody will type or remember Punycode URLs, and the regex already supports international UTF-8 URLs.
The same is true for decimal notations, the various forms of IPv6 URLs, and other "non-human" URLs.

Thanks for sharing, Diego.
I put this in a repo: https://github.com/MarQuisKnox/regex-weburl.js

Thanks @MarQuisKnox, @dperini and @mathiasbynens, it is really helpful!

Hey guys, here is my extended version https://github.com/Fleshgrinder/php-url-validator
It builds upon your regular expression @dperini but has support for more features:

  • IPv6 addresses (actual validation via filter_var).
  • Punycode support.
  • URLs which are not in NFC form are invalid.
  • URLs with a dash on the third and fourth position are invalid.

Would you mind if I release my code with the Unlicense license? I used MIT because you used MIT, but I'm more into total freedom.

Hi,
http://example.com./ is a valid URL, but by convention the last dot is usually not written. See http://tools.ietf.org/html/rfc1035 Section 3.1.
http://en.wikipedia.org./wiki/Domain_name#Domain_name_syntax works in Firefox and IE

Just a small comment about broadcast and network addresses: these addresses can be valid with CIDR (classless) addressing. For example, if a provider has two blocks such as 205.151.128.0/24 and 205.151.129.0/24, it can combine the two into one classless network: 205.151.128.0/23. In that network, 205.151.128.255 and 205.151.129.0 are both valid, usable addresses.
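
The arithmetic behind the /23 example can be checked with a few lines of JavaScript. This is a minimal sketch, not part of the regex: `ipToInt`, `intToIp` and `cidrRange` are hypothetical helper names, and the input is assumed to be well-formed dotted-quad text.

```javascript
// Convert dotted-quad text to an unsigned 32-bit integer.
function ipToInt(ip) {
  return ip.split(".").reduce((acc, octet) => ((acc << 8) | Number(octet)) >>> 0, 0);
}

// Convert an unsigned 32-bit integer back to dotted-quad text.
function intToIp(n) {
  return [24, 16, 8, 0].map((shift) => (n >>> shift) & 255).join(".");
}

// Compute the network and broadcast addresses of a CIDR block such as "a.b.c.d/23".
function cidrRange(cidr) {
  const [base, bits] = cidr.split("/");
  const prefix = Number(bits);
  const mask = prefix === 0 ? 0 : (0xffffffff << (32 - prefix)) >>> 0;
  const network = (ipToInt(base) & mask) >>> 0;
  const broadcast = (network | (~mask >>> 0)) >>> 0;
  return { network: intToIp(network), broadcast: intToIp(broadcast) };
}

console.log(cidrRange("205.151.128.0/23"));
// → { network: '205.151.128.0', broadcast: '205.151.129.255' }
```

Only those two addresses are reserved in the combined block, so 205.151.128.255 and 205.151.129.0 fall strictly between them and are ordinary hosts.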

Is there a regex that can extract URLs from the cases below?

"http://google.com" (string contains double quotes)
'http://google.com' (string contains single quote)
[http://google.com] (string contains brackets)
<br>http://google.com</br> (string contains html tags)

Here is a JS demo with full Unicode support, including astral characters: http://markdown-it.github.io/linkify-it/

The final regexp is ~6K and is generated automatically. The source is here: https://github.com/markdown-it/linkify-it/blob/master/lib/re_url_parts.js . Since astral characters take two positions (a surrogate pair), a negated [^...] character class is impossible; a negative lookahead is used instead.

NOTE: that package does fuzzy search, not strict validation. For strict validation, anchoring with (^...$) is required.
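
To illustrate the difference between fuzzy search and strict validation, here is a small sketch in JavaScript. It does not use linkify-it itself: `urlSource` is a simplified, hypothetical pattern, and the only point is what the `^...$` anchors change.

```javascript
// Hypothetical, simplified URL pattern; not the linkify-it or regex-weburl source.
const urlSource = String.raw`https?:\/\/[^\s/?#]+(?:[/?#]\S*)?`;

// Unanchored: finds a URL anywhere inside the text (search).
const search = new RegExp(urlSource, "i");

// Anchored: the whole string must be a URL (validation).
const strict = new RegExp(`^(?:${urlSource})$`, "i");

console.log(search.test("visit http://example.com today")); // true
console.log(strict.test("visit http://example.com today")); // false
console.log(strict.test("http://example.com"));             // true
```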

I changed the last block for the resource path to look like this:

(?:[/?#]\\S*)?

This will allow URLs like http://test.com#MyAnchor or http://test.com/whatever or http://test.com?some=query

While they may not technically be valid, they are something I could see a user typing, and most browsers will fix them automatically. If users copy such a URL out and paste it back into a browser, they may not know what's wrong with it upon visual inspection.
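
The effect of that change can be seen in isolation with a small JavaScript sketch. The `host` prefix here is a deliberately simplified, hypothetical stand-in for the rest of the regex; only the trailing resource-path group differs between the two patterns.

```javascript
// Hypothetical simplified prefix; the real regex has a much stricter host section.
const host = String.raw`^https?:\/\/[a-z0-9.-]+`;

// Original resource path: must start with a slash.
const slashOnly = new RegExp(host + String.raw`(?:\/\S*)?$`, "i");

// Modified resource path: may start with /, ? or #.
const relaxed = new RegExp(host + String.raw`(?:[/?#]\S*)?$`, "i");

for (const url of ["http://test.com#MyAnchor", "http://test.com?some=query", "http://test.com/whatever"]) {
  console.log(url, slashOnly.test(url), relaxed.test(url));
}
```

Under these assumptions, only the `/whatever` URL passes both patterns; the `#` and `?` forms pass the relaxed one alone.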

This is exactly what I've been looking for.
Thank you. The only pattern it won't match for me (I'm using it as a Java regex) is one where the IP address octets are zero-padded, like:

http://096.004.012.125/index.html

Which I get as input from other tools.

Thanks again for the GREAT regex!!
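
One way to handle such input without touching the regex might be to normalize the octets before validating. The sketch below is JavaScript and `stripOctetPadding` is a hypothetical helper name. It assumes the upstream tools mean decimal values; some URL parsers and network libraries treat a leading zero as an octal marker, so this is only safe when the source of the padding is known.

```javascript
// Strip leading zeros from each octet of a dotted-quad host, e.g. 096.004.012.125.
// Assumption: the padded octets are decimal, not octal.
function stripOctetPadding(url) {
  return url.replace(
    /^(https?:\/\/)(\d{1,3})\.(\d{1,3})\.(\d{1,3})\.(\d{1,3})(?=[:/?#]|$)/i,
    (_, scheme, a, b, c, d) =>
      scheme + [a, b, c, d].map((octet) => String(Number(octet))).join(".")
  );
}

console.log(stripOctetPadding("http://096.004.012.125/index.html"));
// → http://96.4.12.125/index.html
```

URLs whose host is not a dotted quad are returned unchanged.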

Does anyone have a VB.NET port?

'VB port that handles domains with or without a hostname
'Requires: Imports System.Text.RegularExpressions

Public Sub MatchUrl(url As String)
    Dim rxs As String = ""
    'protocol identifier
    rxs = rxs + "(?:(?:https?)://)"
    'user:pass authentication
    rxs = rxs + "(?:\S+(?::\S*)?@)?"
    rxs = rxs + "(?:"
    'IP address exclusion
    'private & local networks
    rxs = rxs + "(?!(?:10|127)(?:\.\d{1,3}){3})"
    rxs = rxs + "(?!(?:169\.254|192\.168)(?:\.\d{1,3}){2})"
    rxs = rxs + "(?!172\.(?:1[6-9]|2\d|3[0-1])(?:\.\d{1,3}){2})"
    'IP address dotted notation octets
    'excludes loopback network 0.0.0.0
    'excludes reserved space >= 224.0.0.0
    'excludes network & broadcast addresses
    '(first & last IP address of each class)
    rxs = rxs + "(?:[1-9]\d?|1\d\d|2[01]\d|22[0-3])"
    rxs = rxs + "(?:\.(?:1?\d{1,2}|2[0-4]\d|25[0-5])){2}"
    rxs = rxs + "(?:\.(?:[1-9]\d?|1\d\d|2[0-4]\d|25[0-4]))"
    rxs = rxs + "|"
    'host name
    rxs = rxs + "(?:(?:[a-z\u00a1-\uffff0-9]-?)*[a-z\u00a1-\uffff0-9]+)"
    'domain name
    rxs = rxs + "(?:(?:\.[a-z\u00a1-\uffff0-9]-?)*[a-z\u00a1-\uffff0-9]+)"
    'TLD identifier
    rxs = rxs + "(?:\.(?:[a-z\u00a1-\uffff]{2,}))"
    rxs = rxs + ")"
    'port number
    rxs = rxs + "(?::\d{2,5})?"
    'resource path
    rxs = rxs + "(?:/\S*)?"

    'match case-insensitively and print the first match, if any
    Dim rx As Regex = New Regex(rxs, RegexOptions.IgnoreCase)
    Dim match As Match = rx.Match(url)
    If match.Success Then
        Console.WriteLine(match.Value)
    Else
        Console.WriteLine("not a match")
    End If
End Sub

I also discovered that underscores are not valid if you follow this RegExp.
e.g:

The URL

http://a_b.c.com

will fail.

Here's a link to a relevant StackOverflow question:

http://stackoverflow.com/questions/2180465/can-hostname-subdomains-have-an-underscore-in-it

This is my PHP port...

I added (?=\s|$) to the end to prevent matches like http://foo.bar?param=meter (no path-slash).

I added (?<=^|\s) at the beginning to use it within text.

Additionally, I reordered the hostname parts to get it working with preg_replace_callback (I had some BACKTRACK LIMIT EXCEEDED errors).

[a-z\x{00a1}-\x{ffff}0-9]+
(?:-[a-z\x{00a1}-\x{ffff}0-9]+)*
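
As a rough check that this reordering keeps the same set of accepted labels, the two shapes can be compared side by side. The sketch below is JavaScript rather than PHP, and it uses an ASCII-only `[a-z0-9]` class as a simplification of `[a-z\x{00a1}-\x{ffff}0-9]`; the "original" shape is assumed to be the `-?` form seen in the VB port above. The reordered form has only one way to split a label into pieces, which is presumably why it backtracks less.

```javascript
// ASCII-only simplification of the label character class (an assumption for brevity).
const label = "[a-z0-9]";

// Shape assumed for the original label: many ways to split a run of letters.
const original = new RegExp(`^(?:${label}-?)*${label}+$`);

// Reordered shape from the PHP port: runs of letters separated by single hyphens.
const reordered = new RegExp(`^${label}+(?:-${label}+)*$`);

const samples = ["foo", "foo-bar", "a-b-c", "-foo", "foo-", "a--b", ""];
for (const s of samples) {
  console.log(JSON.stringify(s), original.test(s), reordered.test(s));
}
```

On these samples both patterns agree: the first three are accepted and the rest are rejected.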

The full expression:

const RX_LINK_ALL = '#
    (?<=^|\s)
    (?:(?:https?|ftp)://)?
    (?:\S+(?::\S*)?@)?
    (?:
        (?!10(?:\.\d{1,3}){3})
        (?!127(?:\.\d{1,3}){3})
        (?!169\.254(?:\.\d{1,3}){2})
        (?!192\.168(?:\.\d{1,3}){2})
        (?!172\.(?:1[6-9]|2\d|3[0-1])(?:\.\d{1,3}){2})
        (?:[1-9]\d?|1\d\d|2[01]\d|22[0-3])
        (?:\.(?:1?\d{1,2}|2[0-4]\d|25[0-5])){2}(?:\.(?:[1-9]\d?|1\d\d|2[0-4]\d|25[0-4]))
    |
        (?:[a-z\x{00a1}-\x{ffff}0-9]+(?:-[a-z\x{00a1}-\x{ffff}0-9]+)*)
        (?:\.[a-z\x{00a1}-\x{ffff}0-9]+(?:-[a-z\x{00a1}-\x{ffff}0-9]+)*)*
        (?:\.(?:[a-z\x{00a1}-\x{ffff}]{2,}))
    )
    (?::\d{2,5})?
    (?:/\S*)?
    (?=\s|$)
#ux';
jnovack commented

10.1.1.255 is a VALID HOST IP for a host within a 10.1.0.0/22 subnet or larger.

  • First IP 10.1.0.1
  • Last IP 10.1.3.254

http://www.adminsub.net/ipv4-subnet-calculator/10.1.0.0/22

At a minimum, there are only two always-invalid IPs in the 10. subnet. I suggest only testing the following:

  • 10.0.0.0 - Subnet address in 10.0.0.0/8 (largest possible 10. subnet)
  • 10.255.255.255 - Broadcast address in 10.0.0.0/8 (largest possible 10. subnet)
  • 10.1.1.256 - For validation testing.
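
The three suggested test cases, and the /22 example, can be reproduced with a short JavaScript sketch. `parseIPv4` and `classify` are hypothetical names, the prefix length has to be supplied by the caller (a URL alone does not carry it), and the /31 and /32 special cases are ignored here.

```javascript
// Parse dotted-quad text; returns null when any octet is not an integer in 0..255.
function parseIPv4(ip) {
  const parts = ip.split(".");
  if (parts.length !== 4) return null;
  let value = 0;
  for (const part of parts) {
    if (!/^\d{1,3}$/.test(part)) return null;
    const octet = Number(part);
    if (octet > 255) return null;
    value = value * 256 + octet;
  }
  return value;
}

// Classify an address relative to a subnet with the given prefix length.
function classify(ip, prefix) {
  const value = parseIPv4(ip);
  if (value === null) return "invalid";
  const size = 2 ** (32 - prefix);                 // number of addresses in the subnet
  const network = Math.floor(value / size) * size; // first address of the subnet
  if (value === network) return "network";
  if (value === network + size - 1) return "broadcast";
  return "host";
}

console.log(classify("10.1.1.255", 22));    // host
console.log(classify("10.0.0.0", 8));       // network
console.log(classify("10.255.255.255", 8)); // broadcast
console.log(classify("10.1.1.256", 8));     // invalid
```

The same address, 10.1.1.255, is a broadcast address in a /24 but an ordinary host in a /22, which is why a regex without subnet information cannot reliably reject it.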