// Regular Expression for URL validation
// Author: Diego Perini
// Updated: 2010/12/05
// License: MIT
// Copyright (c) 2010-2013 Diego Perini (
// Permission is hereby granted, free of charge, to any person
// obtaining a copy of this software and associated documentation
// files (the "Software"), to deal in the Software without
// restriction, including without limitation the rights to use,
// copy, modify, merge, publish, distribute, sublicense, and/or sell
// copies of the Software, and to permit persons to whom the
// Software is furnished to do so, subject to the following
// conditions:
// The above copyright notice and this permission notice shall be
// included in all copies or substantial portions of the Software.
// the regular expression composed & commented
// could be easily tweaked for RFC compliance,
// it was expressly modified to fit & satisfy
// these tests for a URL shortener:
// Notes on possible differences from a standard/generic validation:
// - the UTF-8 char class takes into consideration the full Unicode range
// - TLDs have been made mandatory so single names like "localhost" fail
// - protocols have been restricted to ftp, http and https only as requested
// Changes:
// - IP address dotted notation validation, range: -
// first and last IP address of each class is considered invalid
// (since they are broadcast/network addresses)
// - Added exclusion of private, reserved and/or local networks ranges
// - Made starting path slash optional (
// - Allow a dot (.) at the end of hostnames (
// Compressed one-line versions:
// Javascript version
// /^(?:(?:https?|ftp):\/\/)(?:\S+(?::\S*)?@)?(?:(?!(?:10|127)(?:\.\d{1,3}){3})(?!(?:169\.254|192\.168)(?:\.\d{1,3}){2})(?!172\.(?:1[6-9]|2\d|3[0-1])(?:\.\d{1,3}){2})(?:[1-9]\d?|1\d\d|2[01]\d|22[0-3])(?:\.(?:1?\d{1,2}|2[0-4]\d|25[0-5])){2}(?:\.(?:[1-9]\d?|1\d\d|2[0-4]\d|25[0-4]))|(?:(?:[a-z\u00a1-\uffff0-9]-*)*[a-z\u00a1-\uffff0-9]+)(?:\.(?:[a-z\u00a1-\uffff0-9]-*)*[a-z\u00a1-\uffff0-9]+)*(?:\.(?:[a-z\u00a1-\uffff]{2,}))\.?)(?::\d{2,5})?(?:[/?#]\S*)?$/i
// PHP version
// _^(?:(?:https?|ftp)://)(?:\S+(?::\S*)?@)?(?:(?!(?:10|127)(?:\.\d{1,3}){3})(?!(?:169\.254|192\.168)(?:\.\d{1,3}){2})(?!172\.(?:1[6-9]|2\d|3[0-1])(?:\.\d{1,3}){2})(?:[1-9]\d?|1\d\d|2[01]\d|22[0-3])(?:\.(?:1?\d{1,2}|2[0-4]\d|25[0-5])){2}(?:\.(?:[1-9]\d?|1\d\d|2[0-4]\d|25[0-4]))|(?:(?:[a-z\x{00a1}-\x{ffff}0-9]-*)*[a-z\x{00a1}-\x{ffff}0-9]+)(?:\.(?:[a-z\x{00a1}-\x{ffff}0-9]-*)*[a-z\x{00a1}-\x{ffff}0-9]+)*(?:\.(?:[a-z\x{00a1}-\x{ffff}]{2,}))\.?)(?::\d{2,5})?(?:[/?#]\S*)?$_iuS
var re_weburl = new RegExp(
"^" +
// protocol identifier
"(?:(?:https?|ftp)://)" +
// user:pass authentication
"(?:\\S+(?::\\S*)?@)?" +
"(?:" +
// IP address exclusion
// private & local networks
"(?!(?:10|127)(?:\\.\\d{1,3}){3})" +
"(?!(?:169\\.254|192\\.168)(?:\\.\\d{1,3}){2})" +
"(?!172\\.(?:1[6-9]|2\\d|3[0-1])(?:\\.\\d{1,3}){2})" +
// IP address dotted notation octets
// excludes loopback network
// excludes reserved space >=
// excludes network & broadcast addresses
// (first & last IP address of each class)
"(?:[1-9]\\d?|1\\d\\d|2[01]\\d|22[0-3])" +
"(?:\\.(?:1?\\d{1,2}|2[0-4]\\d|25[0-5])){2}" +
"(?:\\.(?:[1-9]\\d?|1\\d\\d|2[0-4]\\d|25[0-4]))" +
"|" +
// host name
"(?:(?:[a-z\\u00a1-\\uffff0-9]-*)*[a-z\\u00a1-\\uffff0-9]+)" +
// domain name
"(?:\\.(?:[a-z\\u00a1-\\uffff0-9]-*)*[a-z\\u00a1-\\uffff0-9]+)*" +
// TLD identifier
"(?:\\.(?:[a-z\\u00a1-\\uffff]{2,}))" +
// TLD may end with dot
"\\.?" +
")" +
// port number
"(?::\\d{2,5})?" +
// resource path
"(?:[/?#]\\S*)?" +
"$", "i"

In PHP (for use with preg_match), this becomes:


Thanks for the regex Diego, I’ve added it to the test case and it seems to pass all the tests :) Nice job!


I have added simple network ranges validation, the rules I used are:
- valid range -, network addresses above and including are reserved addresses
- first and last IP address of each class are excluded since they are used as network broadcast addresses
Since I don't think this is worth implementing completely in a regular expression, a following pass should exclude the Intranet address space: - - -
the loopback and the automatic configuration address space: - -
while the local, multicast and the reserved address spaces: - (SPECIAL-IPV4-LOCAL-ID-IANA-RESERVED) - 239.255.255 (MCAST-NET) - (SPECIAL-IPV4-FUTURE-USE-IANA-RESERVED)
should already be excluded by the above regular expression.

This is a very minimal list of tests to add to your testing:



Need testing :)


Need to mention I took the idea of validating the possible IP address ranges in the URL while looking at other developers' regular expressions I have seen in your tests, especially the one from @scottgonzales. He also sliced up the Unicode ranges :=), that's the reason his is so long :)


Awesome stuff Diego!!


Added IP address validation tweaking and optimizations suggested by @abozhilov


Added exclusion of private, reserved, auto-configuration and local network ranges as described in the previous message.
Network and all networks >= are excluded by the second validation block.
The second validation block also takes care of excluding IP address terminating with 0 or 255 (non usable network and broadcast addresses of each class C network).

It is easy to just remove the unwanted parts of the validation to fit different scopes (length, precision), so I will probably add more options like the list of existing TLDs (possibly grouped), the list of existing protocols and/or a fallback for a more generic protocol match too.


Hey, just randomly came across this... my JavaScript URI parsing library does strict URI validation as per RFC 3986. It uses a much larger regular expression than this one. Code can be found at:


I changed it a little bit so that it's valid in Ruby. Here it is:



Hi Diego,

Just came across this awesome code. I'd like to use this as a basis, and I'm hoping you can help me with a simple tweak. I'd like to let through URLs without the protocol specified (HTTP(S) or FTP). For some reason I can't seem to get it to work.



Hey Diego, nice work. You can make it a bit shorter though:


Similarly with the subnets


@dperini Can you assign a license to this? MIT or BSD?


+1 for the license information


+1 for the license information from me, too


+infinity on the license Diego


I have added the MIT License to the gist as requested.

Thank you all for the support.


@dperini: Could you add support for url such this?




Is there a Java version of the regex available? That would be great for my android app!


@mparodi Ruby version untouched by markdown



Ruby port:

class Regexp


    # protocol identifier

    # user:pass authentication

      # IP address exclusion
      # private & local networks

      # IP address dotted notation octets
      # excludes loopback network
      # excludes reserved space >=
      # excludes network & broacast addresses
      # (first & last IP address of each class)
      # host name

      # domain name

      # TLD identifier

    # port number

    # resource path



And specs:

# encoding: utf-8

require "spec_helper"

describe "Regexp::PERFECT_URL_PATTERN" do

  ].each do |valid_url|
    it "matches #{valid_url}" do
      expect(Regexp::PERFECT_URL_PATTERN =~ valid_url).to eq 0

    " should be encoded",
    ":// should fail",
    " quux",
  ].each do |invalid_url|
    it "does not match #{invalid_url}" do
      expect(Regexp::PERFECT_URL_PATTERN =~ invalid_url).to be_nil


very good, thank you for sharing


I added support for punycoded domain names:


Updated the gist with reductions/shortenings suggested by "jpillora".

Thank you !



to do that you can change line 65 from:

"(?:(?:https?|ftp)://)" +


"(?:(?:(?:https?|ftp):)?//)" +

this way the protocol and colon become an optional match.

You can also just leave the double slash on that line if no URLs have the protocol prefix:

"//" +

Why can't the maximum range for Unicode strings extend to U+10FFFF (instead of U+FFFF)?


What about relative URLs?


@stevenvachon relatives wouldn't be URLs, they would be paths, which wouldn't need this validation at that point.

jkj commented

I recently needed this but have a dumb question. In the very last part for the resource path, why do you use [^\\s] rather than \\S ? To my understanding they are equivalent, with the latter being a bit shorter.


For the following Regex and the one pasted by ixti:

    URL = /\A(?:(?:https?):\/\/)?(?:\S+(?::\S*)?@)?(?:(?:(?:[a-z0-9][a-z0-9\-]+)*[a-z0-9]+)(?:\.(?:[a-z0-9\-])*[a-z0-9]+)*(?:\.(?:[a-z]{2,})(:\d{1,5})?))(?:\/[^\s]*)?\z/i

You will end up with extremely slow matching, to the point where you suspect an infinite loop, if you have a long subdomain for a URL ending with a period:


it { should_not match "http://aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa.randomstring." }

The longer the subdomain "aaa....", the longer it'll take.


Fixed the URL Regex to make the subdomain match non-recursive thereby improving performance. Long story short: it passed our existing test suite and improved performance dramatically.

    URL = /\A(?:(?:https?):\/\/)?(?:\S+(?::\S*)?@)?(?:(?:([a-z0-9][a-z0-9\-]*)?[a-z0-9]+)(?:\.(?:[a-z0-9\-])*[a-z0-9]+)*(?:\.(?:[a-z]{2,})(:\d{1,5})?))(?:\/[^\s]*)?\z/i

Anyone have a python port? My recollection was that the python regexp engine does have some differences.


@dperini you should add support for 32-bit addresses and IPv6 addresses.

I vote that this should be turned into a git repository with multi-language ports.


I'm also using the top of the page gist regex in JS and finding it very slow to process long invalid URLs such as:

The more letters added there the slower the response.

It sounds like what @phiyangt is referring to above.

Is there any solution for this for JS?



@dmroc @Feendish try using Firefox with the same code you say is slow ... maybe you are just using the wrong browser for your testing/objectives; you didn't specify any code or environment info to replicate with.
However I guess you are using Chrome :9) if not, please provide more info.


Well, after a few tests I can say the slowdown and subsequent browser crash is a Chrome-only problem.
I tried the same in Firefox and everything works correctly with these REGEXP, no problem or slowdown.

I have reduced the original REGEXP to a minimal case to be able to show the problem.
Try the following line in Chrome console, it will crash the browser:


So I believe this is just a bug in Chrome RE engine.


Hi Diego,

Yeah I'm on latest stable Chrome (Version 35.0.1916.153 m).

This is the "bad" url I'm checking http://qweqweqweqwesadasdqweeqweqwsdqweqweqweqwesadasdqweeqweqwsd

The original regex I'm using (the one from the Gist on top - 1 liner or full version) locks the browser in Chrome as you say. It also locks up IE11.

In Firefox 29 it gave this error:
InternalError: an error occurred while executing regular expression

I updated to latest Firefox v30. The regex runs and gives false which is correct.

From some research online it appears Chrome does not halt execution when there is catastrophic backtracking in a regex. Safari, Firefox and IE could just report 'no match' after some arbitrary number of backtracks.

I also tried your recent regex above and it doesn't lock any browsers.

However it returns true for 'isjdfofjasodfjsodifjosadifjsdoiafjaisdjfisdfjs' which is invalid.
It also returns false for '' which is incorrect.

Are you sure there isn't a runaway loop in there somewhere?


I don't know why copying and pasting the above RE in the Chrome console mangles some characters; it actually doesn't crash the console of the browser window.

Please try to cut and paste the RE from this tweet:

I retested it and it actually crashes the console, in that it doesn't respond to commands anymore after running the RE test that you can find in the above tweet.

The fact that the original RE also works on Safari pushes me to believe it's a Chrome problem but I need to do more tests. The "weburl" RE also works in PHP and other environments.

I am testing on the same Chrome Version 35.0.1916.153 under OS X 10.9.3.

Suggestions and help on this matter are welcome!


@dperini This seems to be a V8 issue. Relevant bug ticket:


@dperini I ran the RE from the tweet in RegexBuddy analyser and it says "Your regular expression leads to "catastrophic backtracking", making it too complex to be run to completion."

It locks up Chrome & Opera but not Firefox. As the ticket @mathiasbynens linked to suggests, certain browsers are more lenient when catastrophic backtracking happens. Chrome V8 seems to not have any fail limit for this and puts the onus on the regex format.


can you contact me via email ?
I have a newer version of the RE that doesn't crash Chrome.
Maybe you can try it and give me some feedback before I push it to a new gist.


Sure sent it there now. Thanks.


@dperini, we've found this issue too... looks like there's a highly exponential recursion into infinity on simple strings.

I've managed to reduce this to the way the hostname check is written (since it's followed later (eventually) by TLD).
It's this simple format that will cause the problem:

var regx = new RegExp('^(\\w+)*[^\\w]$');
regx.test('aaaaaaaaaaaaaaaaaaaaaaaaaa');  //chrome will crash

In other words, when you have a repeat of something 1 -> infinity times, and this group is repeated 0 -> infinity times, and the next match is for anything not in the group (obviously... but I put [^\w] just to illustrate), then Chrome will keep recursing to search for a possible group of (1->n) which repeats (0->m) times which has that letter matching.

Of course, internally, the regex should first be run 'greedily' to check if there's a possible match by making sure required letters are there..

Essentially, if I were to write the implementation for a regex, when encountering such a group, I would internally be doing this:

var regx = new RegExp('^(?=\\w*[^\\w])(?:\\w+)*[^\\w]$');
regx.test('aaaaaaaaaaaaaaaaaaaaaaaaaa');  //chrome will not crash

because first I'm doing a positive lookahead to check if this is even possible... though the complexity for this rises as the nested groups become more complex

Finally, I think this can be fixed here, by changing the host name from:




which is really the same thing, if you think about it.


In fact, I believe the whole host-domain-TLD identifier is the same as this (but this should be more performant and not crash):

      // host name
      "(?:[a-z\\u00a1-\\uffff0-9]-?)*[a-z\\u00a1-\\uffff0-9]+" +
      // domain name
      "(?:\\.(?:[a-z\\u00a1-\\uffff0-9]-?)*[a-z\\u00a1-\\uffff0-9])*" +
      // TLD identifier
      "\\.[a-z\\u00a1-\\uffff]{2,}" +

There's no need to add non-capturing groups if you're not doing anything with the group... if you plan to modify a group with a repeater, lookahead or just use an OR operator in it, then use a group, but otherwise there's really no point (since all you want is to make sure everything in the group is present... which you don't need a group for!)
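A small standalone sketch of the host/domain/TLD suggestion above, anchored on its own just for illustration (protocol, IP-address and port handling omitted; the test strings are illustrative):

    var re_hostname = new RegExp(
      "^" +
      // host name
      "(?:[a-z\\u00a1-\\uffff0-9]-?)*[a-z\\u00a1-\\uffff0-9]+" +
      // domain name
      "(?:\\.(?:[a-z\\u00a1-\\uffff0-9]-?)*[a-z\\u00a1-\\uffff0-9])*" +
      // TLD identifier
      "\\.[a-z\\u00a1-\\uffff]{2,}" +
      "$", "i"
    );

    re_hostname.test("");                // true
    re_hostname.test("aaaaaaaaaaaaaaaaaaaaaa.");  // false - no match, and it returns promptly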


Thank you @EtaiG,
your expression looks good too.

However I have been pushed to "re-read" the specifications thoroughly and was answered on a V8 ticket here:
In post #21 @erik suggested I consider rewriting the label-matching parts using lookahead.

Since most wanted a Javascript version to use as a pattern for checking inputs, I did tests in Javascript only.

This is the result of following his advice: no ftp protocol, no special IP handling, only the minimal:

var re_weburl = new RegExp(
    "^" +
        // protocol identifier (optional) + //
        "(?:(?:https?:)?//)?" +
        // user:pass authentication (optional)
        "(?:\\S+(?::\\S*)?@)?" +
        // host (optional) + domain + tld
        "(?:(?!-)[-a-z0-9\\u00a1-\\uffff]*[a-z0-9\\u00a1-\\uffff]+(?!./|\\.$)\\.?){2,}" +
        // server port number (optional)
        "(?::\\d{2,5})?" +
        // resource path (optional)
        "(?:/\\S*)?" +
    "$", "i"

This RE fits in a tweet! But let's see how it works for you.

I also replaced [^\s] with \S as suggested by @jkj and relaxed the match on protocol identifiers.

Consecutive hyphens are allowed by the specifications but they must not be found in both the 3rd and 4th positions; those sequences are reserved for "xn--" and similar ASCII Compatible Encodings. If that exclusion were necessary, maybe a simple lookahead (?!..--) would help there too.
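For what it's worth, a hypothetical per-label sketch of that exclusion idea (not part of the gist): reject "--" in the 3rd and 4th positions of a label via a negative lookahead. Note that this naive form also rejects legitimate "xn--" ACE labels, so a real version would need an extra allowance for those.

    // hypothetical label check, not the gist's pattern
    var re_label = /^(?!..--)[a-z0-9](?:[-a-z0-9]*[a-z0-9])?$/i;

    re_label.test("example");          // true
    re_label.test("ab--cd");           // false - hyphens in the 3rd and 4th positions
    re_label.test("xn--80ak6aa92e");   // false - this simple check also rejects ACE labels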


@dperini , thanks for responding.
I read all the specifications too last week (RFCs 5890-5894 and RFC 3492, several times), due to this issue. I'm also poster #24 in the Google V8 thread.

Please note that I will be analysing this issue in depth below, and if I come off critical - that is not my intent, so I apologize in advance.

I disagree with the negative lookaheads. There are rare cases when they are truly useful.
I believe in minimizing them whenever possible, especially when repeating something up to an 'infinite' amount of times, since they can cause dreadful performance for complicated matches..

I like being more explicit about the regex, which may make it more verbose, but it's very clear what the JavaScript engine needs to do to match it.

For example, when you have:

// host (optional) + domain + tld
        "(?:(?!-)[-a-z0-9\\u00a1-\\uffff]*[a-z0-9\\u00a1-\\uffff]+(?!./|\\.$)\\.?){2,}" +

This part can match long strings in too many different ways, and the regex is too general, so for characters which would match both the first character group and the second (namely, almost anything except for a dot and a hyphen), it can match an exponential number of times.

For example, it can match 'ab' as:
a b | ab
and it can match 'abc' as:
a b c | a bc | abc | ab c
and it can match 'abcd' as:
a b c d | a b cd | a bc d | a bcd | abcd | ab c d | ab cd | abc d

It's easy to see that for a string of length n, it has 2^(n-1) possible matches.

The way a greedy quantifier works is that it will stop as soon as it finds a possible match - otherwise it will try the next possibility in order to continue matching the regular expression.
This means that a sufficiently long string (i.e. n = 21) which would result in a non-match, such as:
'aaaaaaaaaaaaaaaaaaaa.' (note the period at the end)
can cause it to take extremely long, and possibly crash (2^20 > 1,000,000)
Ignoring what's actually placed in memory and checked during a regex, by putting this in console, you can see what I mean:

var i=0, len = 2<<20;
// approximately 8s

You can test out your regex against that string (the one with the period at the end) and you'll see what I mean.

Also, note that 'aaaaaaaaaaaaaaaaaaaaaaaaaa' will match your regex although it's invalid.

This is because of the generalization of the check using greedy quantifiers, enabled by the negative lookahead (?!.\/|.$) (or by both of them?)

This is why I don't like negative lookaheads and prefer to be more declarative. You're almost forced to be more declarative when you don't use the negative lookaheads... but in the end, you are giving 'better instructions' to the javascript engine.

That's why I liked this better (for the host/domain/tld):


Note that this is the same as what I posted above, with the exception of switching out the -? for -* (in both host and domain) to allow for as many hyphens in between letters.

This doesn't take care of the xn-- and 3rd/4th position issue, but unless you're allowing someone to register a domain by you, this is less of an issue (since for most cases, it's for a link, and people only need to link to something that is allowed and exists)... and even then, serverside validation would be necessary.


@EtaiG many thanks for the review and the good suggestions.
After trying your tweaks myself I have to completely agree with your points.
I still believe that by moving the dot matching to the end of the RE the host/domain/tld part can be reduced to only two main groups (since the only label we don't want followed by a dot is the TLD):

// host (optional) + domain + tld
"(?:(?:[a-z\\u00a1-\\uffff0-9]-*)*[a-z\\u00a1-\\uffff0-9]+\\.)+" +
   "(?:[a-z\\u00a1-\\uffff0-9]-*)*[a-z\\u00a1-\\uffff0-9]+" +

I am not sure I should consider digits as valid in the TLD group (also it is considered a label itself).

Now the tests do not lock up Chrome and it also seems the overall speed for URL validation is faster.
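As an illustration, a standalone sketch of that reduced two-group form, anchored on its own (protocol, IP-address and port handling omitted; test strings are illustrative):

    var re_host_reduced = new RegExp(
      "^" +
      // one or more labels, each followed by a dot
      "(?:(?:[a-z\\u00a1-\\uffff0-9]-*)*[a-z\\u00a1-\\uffff0-9]+\\.)+" +
      // final label (the TLD)
      "(?:[a-z\\u00a1-\\uffff0-9]-*)*[a-z\\u00a1-\\uffff0-9]+" +
      "$", "i"
    );

    re_host_reduced.test("");  // true
    re_host_reduced.test("example");      // false - at least two dot-separated labels are required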


The gist has been corrected/updated so it doesn't lock up Chrome's Javascript engine.
I haven't reduced the host / domain / tld matching groups but I will do that after testing.
Many thanks to @EtaiG for the help and the suggestions to resolve the problem.


I believe the slash before query params is optional. should pass, but it currently does not.

Changing line 93 to

"(?:/?\\S*)?" +

solves that issue, but might break other query-parameter specifications that aren't covered in the test cases.


@schbetsy I am not sure it is optional either.
Anyway, your change fixes that if it becomes necessary for some reader.
What I can see is that browsers accept that, but then they insert a slash in it when finished.
I am curious to try the effects of this change on my current tests.
Thank you for pointing that out.


Hey @dperini,

Thanks for your great work! Please note that this regex fails on the following url: http://localhost:8080


it is written in the comments: 'TLDs have been made mandatory so single names like "localhost" fails'.
The regex was built to match URLs having a real domain name (at least 2 labels separated by a dot).
However it will be very easy to add 'localhost' as an acceptable exception.
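One possible way to sketch that exception (not the gist itself; the IP-address branch and trailing-dot handling are left out here for brevity) is to add "localhost" as an extra alternative inside the main host group:

    var re_weburl_localhost = new RegExp(
      "^" +
      "(?:(?:https?|ftp)://)" +
      "(?:\\S+(?::\\S*)?@)?" +
      "(?:" +
        // the added exception
        "localhost" +
        "|" +
        // ...the gist's IP-address alternative would also go here...
        // host name + domain name + TLD
        "(?:(?:[a-z\\u00a1-\\uffff0-9]-*)*[a-z\\u00a1-\\uffff0-9]+)" +
        "(?:\\.(?:[a-z\\u00a1-\\uffff0-9]-*)*[a-z\\u00a1-\\uffff0-9]+)*" +
        "(?:\\.(?:[a-z\\u00a1-\\uffff]{2,}))" +
      ")" +
      "(?::\\d{2,5})?" +
      "(?:[/?#]\\S*)?" +
      "$", "i"
    );

    re_weburl_localhost.test("http://localhost:8080");  // true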



can you help me make this URI valid ""

thanks ahead!


PYTHON PORT (cc @brifordwylie):

import re
URL_REGEX = re.compile(
    # protocol identifier
    # user:pass authentication
    # IP address exclusion
    # private & local networks
    # IP address dotted notation octets
    # excludes loopback network
    # excludes reserved space >=
    # excludes network & broadcast addresses
    # (first & last IP address of each class)
    # host name
    # domain name
    # TLD identifier
    # port number
    # resource path
    , re.UNICODE)

I did make one change: the "-*" in both domain and host was (incorrectly) succeeding against "" so I changed it to "-?" - I'm not sure why that's in the gist above, I'd think it would fail on a JS unit test also.


it seems the URL "" you are testing against is actually a valid URL.
As is """. Just test it, it exists and resolves correctly to a Georgia State page.
I have been directed to read the relevant specs here:
and the validity criteria are here:
Thank you for the Python port !


Can you support international URLs?
For example: http://xn--80aaxitdbjk.xn--p1ai


the regexp already supports international URLs, just write them using natural UTF-8 encoding.
The following is the UTF-8 version of the URL you typed above:
It would be hard to type or remember IDN URLs like the one you typed; nobody will do that.
This has been written to validate URLs typed by users and/or found in log files.


@dperini thanks for sharing :+1:


I'm using the chai.js assert library to write a simple test for a JS object in my Rails app. This is for initial client-side form validation. Some of the URI formats as tested in @ixti's spec above are failing to return false; here's the list.


Here's my code:
#= require regex-weburl
class @FormValidators
  uri: (uri)->
#= require ../spec_helper
describe 'FormValidators', ->
  describe '#uri', ->
    beforeEach ->
      @formValidators = new FormValidators()
    it 'returns false for invalid urls', ->
      assert.notOk @formValidators.uri("http://")
      assert.notOk @formValidators.uri("http://.")
      assert.notOk @formValidators.uri("http://..")
      assert.notOk @formValidators.uri("http://../")
      assert.notOk @formValidators.uri("http://?")
      assert.notOk @formValidators.uri("http://??")
      assert.notOk @formValidators.uri("http://??/")
      assert.notOk @formValidators.uri("http://#")
      assert.notOk @formValidators.uri("http://##")
      assert.notOk @formValidators.uri("http://##/")
      assert.notOk @formValidators.uri(" should be encoded")
      assert.notOk @formValidators.uri("//")
      assert.notOk @formValidators.uri("//a")
      assert.notOk @formValidators.uri("///a")
      assert.notOk @formValidators.uri("///")
      assert.notOk @formValidators.uri("http:///a")
      assert.notOk @formValidators.uri("")
      assert.notOk @formValidators.uri("rdar://1234")
      assert.notOk @formValidators.uri("http://")
      assert.notOk @formValidators.uri(":// should fail")
      assert.notOk @formValidators.uri(" quux")
      assert.notOk @formValidators.uri("http://-error-.invalid/")
      assert.notOk @formValidators.uri("")
      assert.notOk @formValidators.uri("")
      assert.notOk @formValidators.uri("")
      assert.notOk @formValidators.uri("")
      assert.notOk @formValidators.uri("")
      assert.notOk @formValidators.uri("")
      assert.notOk @formValidators.uri("")
      assert.notOk @formValidators.uri("")
      assert.notOk @formValidators.uri("http://123.123.123")
      assert.notOk @formValidators.uri("http://3628126748")
      assert.notOk @formValidators.uri("")
      assert.notOk @formValidators.uri("")
      assert.notOk @formValidators.uri("")
      assert.notOk @formValidators.uri("")
      assert.notOk @formValidators.uri("")

Just thought I would take the time out to let you know. I'm not sure if something changed recently, or if you are even supporting this script anymore. Good work by the way, saved me a ton of time.


@adamrofer's fix of changing ( -* ) to ( -? ) in the host and domain name section fixed the JS unit test for me.


I suggest you check your tests and/or the port of the Regular Expression you are currently using.
In the list of URLs failing validation that you sent above, only the first one is a valid URL (""); all the others do not validate against the regex.

I tested them once more within my environment (Javascript) and everything works as expected.


Thanks Diego for your hard work! :+1: to @CMCDragonkai's comment, though: IPv6 support and a Git repo with ports to multiple languages are both really great ideas.


Hi @dperini

I love the expression, but I'm wondering what modification I would need to make, to make the pattern ignore a URL if it is preceded by either a " or = or ] or > and succeeded with either a " or [/ or </

It is so that the following won't be validated:



<a href=""></a>

Reason is I currently use a modified version of Gruber's regex as part of a PHP auto-URL function in the following manner, but I would like to use yours instead:

// Regular expression for URLs
// Based on
// Improved to only pickup links begining with http https ftp ftps mailto and www
$regex = "_(?i)\b((?:https?|ftps?|mailto|www\d{0,3}[.]|[a-z0-9.\-]+[.][a-z]{2,4}/)(?:[^\s()<>]+|\(([^\s()<>]+|(\([^\s()<>]+\)))*\))+(?:\(([^\s()<>]+|(\([^\s()<>]+\)))*\)|[^\s`!()\[\]{};:'\".,<>?«»“”‘’]))_iuS";

// If markup is TRUE, convert URLs to html markup
if ($markup == TRUE) $string = preg_replace_callback($regex, array(&$this, 'auto_url'), $string);

Thanks, Matt


Additionally, my thinking behind this question is to be able to allow the manual coding of links, using HTML or BBCode.


just saw this ... as a quick suggestion you can try something like:


haven't tried it, not sure it does exactly what you asked/depicted.
It's a start anyway :smile:


a better approach to match corresponding open/close brackets and quotes would require more work:


again, I haven't tested it.


yes I believe it would be a good idea to move this to a Git repo.

However I disagree about having patterns that will never be typed by users like "IPV6" and "PunyCode". I am most likely inclined to also remove IPV4 validation from the base regex; nobody remembers these numbers and they will most likely change in time.

Nobody will type/remember "PunyCode" URLs and the regex already supports international UTF-8 URLs.
The above is also true for decimal notations, various forms of IPV6 URLs and other "non-human" URLs.


Thanks for sharing, Diego.
I put this in a repo:


Thanks @MarQuisKnox, @dperini and @mathiasbynens, it is really helpful!


Hey guys, here is my extended version
It builds upon your regular expression @dperini but has support for more features:

  • IPv6 addresses (actual validation via filter_var).
  • Punycode support.
  • URLs which are not in NFC form are invalid.
  • URLs with a dash on the third and fourth position are invalid.

Would you mind if I release my code with the Unlicense license? I used MIT because you used MIT, but I'm more into total freedom.


Hi, is a valid URL but the last dot is usually not written by convention. See Paragraph 3.1. works in Firefox and IE


Just a small comment about broadcast and network addresses. These addresses can be valid in a CIDR class. Ex: If a provider has two classes like and, they can combine the two in a classless network: In that network, and are two valid and usable addresses.


Can any regex extract URLs from the cases below?

"" (string contains double quotes)
'' (string contains single quote)
[\] (string contains brackets)
<br>\<\/br> (string contains html tags)

@puzrin here is a JS demo with full Unicode support, including astral characters.

The final regexp is ~6K and generated automatically. Src is here: . Since astral characters take 2 positions, a [^negative] class is impossible; a negative lookahead is used instead.

NOTE that the package does fuzzy search, not strict validation. For strict validation, anchors (^...$) are required.


I changed the last block for the resource path to look like this:


This will allow URLs like or or

While they may not technically be valid, it is something I could see a user typing, and most browsers will fix it for them if they copy it out and back into a browser, so they may not know what's wrong with it upon visual inspection.


This is exactly what I've been looking for.
Thank you. The only pattern it won't match for me (Using it in a Java Regex) is where the IP address is '0'(ZERO) padded, like:

Which I get as input from other tools.

Thanks again for the GREAT regex!!


anyone have a port?


'VB Port that handles domains with or without a hostname

    Public Sub MatchUrl(url As String)
    Dim rxs As String = ""
    'protocol identifier

    rxs = rxs + "(?:(?:https?)://)"
    ' user:pass authentication
    rxs = rxs + "(?:\S+(?::\S*)?@)?"
    rxs = rxs + "(?:"
    'IP address exclusion
    'private & local networks
    rxs = rxs + "(?!(?:10|127)(?:\.\d{1,3}){3})"
    rxs = rxs + "(?!(?:169\.254|192\.168)(?:\.\d{1,3}){2})"
    rxs = rxs + "(?!172\.(?:1[6-9]|2\d|3[0-1])(?:\.\d{1,3}){2})"
    'IP address dotted notation octets
    'excludes loopback network
    'excludes reserved space >=
    'excludes network & broadcast addresses
    '(first & last IP address of each class)
    rxs = rxs + "(?:[1-9]\d?|1\d\d|2[01]\d|22[0-3])"
    rxs = rxs + "(?:\.(?:1?\d{1,2}|2[0-4]\d|25[0-5])){2}"
    rxs = rxs + "(?:\.(?:[1-9]\d?|1\d\d|2[0-4]\d|25[0-4]))"
    rxs = rxs + "|"
    'host name
    rxs = rxs + "(?:(?:[a-z\u00a1-\uffff0-9]-?)*[a-z\u00a1-\uffff0-9]+)"
    'domain name
    rxs = rxs + "(?:(?:\.[a-z\u00a1-\uffff0-9]-?)*[a-z\u00a1-\uffff0-9]+)"
    ' TLD identifier
    rxs = rxs + "(?:\.(?:[a-z\u00a1-\uffff]{2,}))"
    rxs = rxs + ")"
    ' port number
    rxs = rxs + "(?::\d{2,5})?"
    ' resource path
    rxs = rxs + "(?:/\S*)?"

    Dim rx As Regex = New Regex(rxs, RegexOptions.IgnoreCase)
    Dim match As Match = rx.Match(url)
    If match.Success Then
        Console.WriteLine("match")
    Else
        Console.WriteLine("not a match")
    End If

End Sub

I also discovered that underscores are not valid if you follow this RegExp.


will fail.

Here's a link to a relevant StackOverflow question:


This is my PHP port...

I added (?=\s|$) to the end to prevent matches like (no path-slash).

I added (?<=^|\s) at the beginning to use it within text.

Additionally I reordered the hostname parts to get it working with preg_replace_callback (I had some BACKTRACK LIMIT EXCEEDED errors).


The full expression:

const RX_LINK_ALL = '#

@jnovack is a VALID HOST IP for a host within a subnet or larger.

  • First IP
  • Last IP

At a minimum, there are only two always-invalid IPs in the 10. subnet. I suggest only testing the following:

  • - Subnet address in (largest possible 10. subnet)
  • - Broadcast address in (largest possible 10. subnet)
  • - For validation testing.

(I'm French.)
I don't understand why it doesn't match my string, using the JavaScript version of the regex?

function fTest() {

var str = "aaa bbb ccc aaa bbb eee";

var res = str.match("/^(?:(?:https?|ftp):\/\/)(?:\S+(?::\S*)?@)?(?:(?!(?:10|127)(?:\.\d{1,3}){3})(?!(?:169\.254|192\.168)(?:\.\d{1,3}){2})(?!172\.(?:1[6-9]|2\d|3[0-1])(?:\.\d{1,3}){2})(?:[1-9]\d?|1\d\d|2[01]\d|22[0-3])(?:\.(?:1?\d{1,2}|2[0-4]\d|25[0-5])){2}(?:\.(?:[1-9]\d?|1\d\d|2[0-4]\d|25[0-4]))|(?:(?:[a-z\u00a1-\uffff0-9]-*)*[a-z\u00a1-\uffff0-9]+)(?:\.(?:[a-z\u00a1-\uffff0-9]-*)*[a-z\u00a1-\uffff0-9]+)*(?:\.(?:[a-z\u00a1-\uffff]{2,})))(?::\d{2,5})?(?:\/\S*)?$/i");


--> res is empty

Could anybody explain to me why it doesn't work?

Thx !


@danyboy85 This is because the RegExp is conceived to validate strings and not to match URLs in a string. The ^ at the start of the RegExp means that the string should start with the URL protocol and the $ at the end of the RegExp means that the string should end with the URL pathname.
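A minimal sketch of the difference (illustrative strings; the extraction pattern below is deliberately simplified and far less strict than re_weburl):

    // validation: the whole string must be a URL (this is what re_weburl is for)
    re_weburl.test("");     // true
    re_weburl.test("aaa bbb ccc");      // false - the ^ and $ anchors reject surrounding text

    // extraction from free text needs an unanchored pattern with the global flag
    var re_extract = /(?:https?|ftp):\/\/\S+/gi;
    "aaa bbb ccc".match(re_extract);    // [""]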


I am not sure if anybody mentioned it before, but some of the "invalid" URLs are in fact valid!
So I'll make an example here: must parse as valid. could parse as valid.
If you disagree, read the actual specifications. The domain should actually be suffixed by a dot.


Shouldn't this be valid?


Just noted the workaround provided by @johnjaylward worked.


This regex and everyone's comments have been really informative! Thanks for writing this.

I'm confused about this regex's handling of UTF-8 characters. The RFC spec does not allow "\" characters, so why does the regex use "\" to match UTF-8 characters? From the spec:

" URI producing applications must not use percent-encoding in host unless it is used
to represent a UTF-8 character sequence. When a non-ASCII registered
name represents an internationalized domain name intended for
resolution via the DNS, the name must be transformed to the IDNA
encoding [RFC3490] prior to name lookup. URI producers should
provide these registered names in the IDNA encoding, rather than a
percent-encoding, if they wish to maximize interoperability with
legacy URI resolvers."

So, UTF-8 characters other than alphanumeric characters should be represented using % encoding and IDNA encoding. I'll post the regex I have in mind later on.

I answered my own question. Browsers reduce UTF-8 in URIs to punycode now, so from the perspective of the RFC spec, the URI actually sent over the wire will be valid.


Many thanks to everybody for the comments and the suggestions.

I have updated the gist:

- Made starting path slash optional (
- Allow a dot (.) at the end of hostnames (

This is an answer to @halloamt's & @muessigb's questions.
They are related to having/allowing a trailing dot at the end of the hostname.
I answered to this question previously on Twitter, here is an interesting link with additional info:

The title of the article says it all: "The danger of the trailing dot in the domain name".
As you can see from the previous message I recently allowed it in my regular expression.
So be careful if you use a trailing dot at the end of the domain name, it may not work in all situations.


Looks like the "allowed a trailing dot" clause is missing a backslash in front of the dot, so it in fact allows a trailing character of any type, including whitespace, since that is the semantics of the . character in a RegExp.


You are correct @dmose, thank you for noticing that.
I just fixed that both in the Javascript and in the PHP versions.


I added the following example URLs to my tests:

All the above URLs are now passing the tests correctly!


@dperini: I don't believe your javascript one liner will match against the period in front of the TLD without two backslashes. I found this out the hard way when I put a question mark after the protocol match, making it optional.... and discovered it was passing any word ex: sethnewton

I forked and made the change here: ... hopefully it's of some use to you.


I did a cut&paste of the one liner in your gist inside my tests and most of the tests fail.
It seems you have added the double backslash in the wrong place (not after the TLD block).
If you look at the one-liner regular expression there is no place where a backslash needs to be escaped.
It is only inside the new RegExp() constructor that it is necessary to double the backslashes (escape them).


There is a subtle inefficiency in this construct:


On a string without any -, the regex degenerates to [a-z\\u00a1-\\uffff0-9]*[a-z\\u00a1-\\uffff0-9]+, which is of the form A*A*. It will cause quadratic complexity in the worst case. The effect is not very visible until the length of the non-matching string goes up to a few thousand to tens of thousands of characters.

This is my suggested fix:


It can only start and end with [a-z\\u00a1-\\uffff0-9], and any stretch of - or [a-z\\u00a1-\\uffff0-9] is still allowed. Likewise, the minimum matching length is still 1.
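The suggested replacement itself is not shown above; as a hypothetical reconstruction matching that description (starts and ends with the letter/digit class, hyphens allowed in between, minimum length 1), not the commenter's exact expression:

    // hypothetical reconstruction of such a label pattern
    var label = "[a-z\\u00a1-\\uffff0-9](?:[-a-z\\u00a1-\\uffff0-9]*[a-z\\u00a1-\\uffff0-9])?";
    var re_label_fixed = new RegExp("^" + label + "$", "i");

    re_label_fixed.test("a");         // true  - minimum length 1
    re_label_fixed.test("foo-bar");   // true  - internal hyphens allowed
    re_label_fixed.test("-foo");      // false - cannot start with a hyphen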


Control-F "Perl" ... nothing.

A Perl version is the one-line Javascript version with \x{00a1}-\x{ffff} instead of \u00a1-\uffff.
Tested against the test-case list and it passed.


This doesn't seem to allow http://3628126748

It is a decimal address which resolves to an IP owned by The Coca Cola Corp (not an internal IP).


The patterns for username/password are overly lax and allow you to put in almost anything as a URL, if you finish with something that looks like e.g. re_weburl.test(""), or re_weburl.test("http://???/")


@gburtini Actually although browsers allow and resolve URLs with IP addresses that are in hexadecimal, octal or without a dot-notation, these formats are made invalid in a URL by RFC 3986: section 7.4 Rare IP Address Formats


'' check failed

o5 commented

A few changes are visible here.

1) Line 85 - - is also a valid address
2) Line 97 - - port could be < 10


Thanks for the great regex! I am trying to use it within a custom validation rule in Laravel.

But it's not validating anything at all...

My code below:

$regex = '_^(?:(?:https?|ftp)://)(?:\S+(?::\S*)?@)?(?:(?!(?:10|127)(?:.\d{1,3}){3})(?!(?:169.254|192.168)(?:.\d{1,3}){2})(?!172.(?:1[6-9]|2\d|3[0-1])(?:.\d{1,3}){2})(?:[1-9]\d?|1\d\d|2[01]\d|22[0-3])(?:.(?:1?\d{1,2}|2[0-4]\d|25[0-5])){2}(?:.(?:[1-9]\d?|1\d\d|2[0-4]\d|25[0-4]))|(?:(?:[a-z\x{00a1}-\x{ffff}0-9]-)[a-z\x{00a1}-\x{ffff}0-9]+)(?:.(?:[a-z\x{00a1}-\x{ffff}0-9]-)[a-z\x{00a1}-\x{ffff}0-9]+)(?:.(?:[a-z\x{00a1}-\x{ffff}]{2,})).?)(?::\d{2,5})?(?:[/?#]\S)?$_iuS';

$rules = array('hewit' => array('required', 'regex:'. $regex));

Am I doing something wrong?


Hello Diego,
Awesome work!!!

In reference to the link:
These 2 test cases (should return false) are returning true:


Hi @dperini, very nice RegEx!
I ran the regex on thousands of user-input text data and it seems that URLs like are recognized by the regex; I'm not sure whether this is intended or not.


Thanks for the great work.
I also found that someone created an npm package for this gist:


Hi @dperini !
I have a question.
Your regexp for JS:
var urlReStr = '^(?:(?:https?|ftp):\/\/)(?:\S+(?::\S)?@)?(?:(?!(?:10|127)(?:.\d{1,3}){3})(?!(?:169.254|192.168)(?:.\d{1,3}){2})(?!172.(?:1[6-9]|2\d|3[0-1])(?:.\d{1,3}){2})(?:[1-9]\d?|1\d\d|2[01]\d|22[0-3])(?:.(?:1?\d{1,2}|2[0-4]\d|25[0-5])){2}(?:.(?:[1-9]\d?|1\d\d|2[0-4]\d|25[0-4]))|(?:(?:[a-z\u00a1-\uffff0-9]-)[a-z\u00a1-\uffff0-9]+)(?:.(?:[a-z\u00a1-\uffff0-9]-)[a-z\u00a1-\uffff0-9]+)(?:.(?:[a-z\u00a1-\uffff]{2,})).?)(?::\d{2,5})?(?:[/?#]\S*)?$';

var urlRe = new RegExp(urlReStr, 'i');

It works perfectly for some tricky cases, but doesn't fail for such a simple case as http://dddddddddddddd.
Maybe I'm doing something wrong. Please give me advice.


Unfortunately... - Fails
http://google - Passes

@goa - Fails, although it is a valid YouTube url.
