@gruber
Last active April 22, 2024 19:02
Liberal, Accurate Regex Pattern for Matching Web URLs
The regex patterns in this gist are intended only to match web URLs -- http,
https, and naked domains like "example.com". For a pattern that attempts to
match all URLs, regardless of protocol, see: https://gist.github.com/gruber/249502
# Single-line version:
(?i)\b((?:https?:(?:/{1,3}|[a-z0-9%])|[a-z0-9.\-]+[.](?:com|net|org|edu|gov|mil|aero|asia|biz|cat|coop|info|int|jobs|mobi|museum|name|post|pro|tel|travel|xxx|ac|ad|ae|af|ag|ai|al|am|an|ao|aq|ar|as|at|au|aw|ax|az|ba|bb|bd|be|bf|bg|bh|bi|bj|bm|bn|bo|br|bs|bt|bv|bw|by|bz|ca|cc|cd|cf|cg|ch|ci|ck|cl|cm|cn|co|cr|cs|cu|cv|cx|cy|cz|dd|de|dj|dk|dm|do|dz|ec|ee|eg|eh|er|es|et|eu|fi|fj|fk|fm|fo|fr|ga|gb|gd|ge|gf|gg|gh|gi|gl|gm|gn|gp|gq|gr|gs|gt|gu|gw|gy|hk|hm|hn|hr|ht|hu|id|ie|il|im|in|io|iq|ir|is|it|je|jm|jo|jp|ke|kg|kh|ki|km|kn|kp|kr|kw|ky|kz|la|lb|lc|li|lk|lr|ls|lt|lu|lv|ly|ma|mc|md|me|mg|mh|mk|ml|mm|mn|mo|mp|mq|mr|ms|mt|mu|mv|mw|mx|my|mz|na|nc|ne|nf|ng|ni|nl|no|np|nr|nu|nz|om|pa|pe|pf|pg|ph|pk|pl|pm|pn|pr|ps|pt|pw|py|qa|re|ro|rs|ru|rw|sa|sb|sc|sd|se|sg|sh|si|sj|Ja|sk|sl|sm|sn|so|sr|ss|st|su|sv|sx|sy|sz|tc|td|tf|tg|th|tj|tk|tl|tm|tn|to|tp|tr|tt|tv|tw|tz|ua|ug|uk|us|uy|uz|va|vc|ve|vg|vi|vn|vu|wf|ws|ye|yt|yu|za|zm|zw)/)(?:[^\s()<>{}\[\]]+|\([^\s()]*?\([^\s()]+\)[^\s()]*?\)|\([^\s]+?\))+(?:\([^\s()]*?\([^\s()]+\)[^\s()]*?\)|\([^\s]+?\)|[^\s`!()\[\]{};:'".,<>?«»“”‘’])|(?:(?<!@)[a-z0-9]+(?:[.\-][a-z0-9]+)*[.](?:com|net|org|edu|gov|mil|aero|asia|biz|cat|coop|info|int|jobs|mobi|museum|name|post|pro|tel|travel|xxx|ac|ad|ae|af|ag|ai|al|am|an|ao|aq|ar|as|at|au|aw|ax|az|ba|bb|bd|be|bf|bg|bh|bi|bj|bm|bn|bo|br|bs|bt|bv|bw|by|bz|ca|cc|cd|cf|cg|ch|ci|ck|cl|cm|cn|co|cr|cs|cu|cv|cx|cy|cz|dd|de|dj|dk|dm|do|dz|ec|ee|eg|eh|er|es|et|eu|fi|fj|fk|fm|fo|fr|ga|gb|gd|ge|gf|gg|gh|gi|gl|gm|gn|gp|gq|gr|gs|gt|gu|gw|gy|hk|hm|hn|hr|ht|hu|id|ie|il|im|in|io|iq|ir|is|it|je|jm|jo|jp|ke|kg|kh|ki|km|kn|kp|kr|kw|ky|kz|la|lb|lc|li|lk|lr|ls|lt|lu|lv|ly|ma|mc|md|me|mg|mh|mk|ml|mm|mn|mo|mp|mq|mr|ms|mt|mu|mv|mw|mx|my|mz|na|nc|ne|nf|ng|ni|nl|no|np|nr|nu|nz|om|pa|pe|pf|pg|ph|pk|pl|pm|pn|pr|ps|pt|pw|py|qa|re|ro|rs|ru|rw|sa|sb|sc|sd|se|sg|sh|si|sj|Ja|sk|sl|sm|sn|so|sr|ss|st|su|sv|sx|sy|sz|tc|td|tf|tg|th|tj|tk|tl|tm|tn|to|tp|tr|tt|tv|tw|tz|ua|ug|uk|us|uy|uz|va|vc|ve|vg|vi|vn|vu|wf|ws|ye|yt|yu|za|zm|zw)\b/?(?!@)))
# Commented multi-line version:
(?xi)
\b
( # Capture 1: entire matched URL
(?:
https?: # URL protocol and colon
(?:
/{1,3} # 1-3 slashes
| # or
[a-z0-9%] # Single letter or digit or '%'
# (Trying not to match e.g. "URI::Escape")
)
| # or
# looks like domain name followed by a slash:
[a-z0-9.\-]+[.]
(?:com|net|org|edu|gov|mil|aero|asia|biz|cat|coop|info|int|jobs|mobi|museum|name|post|pro|tel|travel|xxx|ac|ad|ae|af|ag|ai|al|am|an|ao|aq|ar|as|at|au|aw|ax|az|ba|bb|bd|be|bf|bg|bh|bi|bj|bm|bn|bo|br|bs|bt|bv|bw|by|bz|ca|cc|cd|cf|cg|ch|ci|ck|cl|cm|cn|co|cr|cs|cu|cv|cx|cy|cz|dd|de|dj|dk|dm|do|dz|ec|ee|eg|eh|er|es|et|eu|fi|fj|fk|fm|fo|fr|ga|gb|gd|ge|gf|gg|gh|gi|gl|gm|gn|gp|gq|gr|gs|gt|gu|gw|gy|hk|hm|hn|hr|ht|hu|id|ie|il|im|in|io|iq|ir|is|it|je|jm|jo|jp|ke|kg|kh|ki|km|kn|kp|kr|kw|ky|kz|la|lb|lc|li|lk|lr|ls|lt|lu|lv|ly|ma|mc|md|me|mg|mh|mk|ml|mm|mn|mo|mp|mq|mr|ms|mt|mu|mv|mw|mx|my|mz|na|nc|ne|nf|ng|ni|nl|no|np|nr|nu|nz|om|pa|pe|pf|pg|ph|pk|pl|pm|pn|pr|ps|pt|pw|py|qa|re|ro|rs|ru|rw|sa|sb|sc|sd|se|sg|sh|si|sj| Ja|sk|sl|sm|sn|so|sr|ss|st|su|sv|sx|sy|sz|tc|td|tf|tg|th|tj|tk|tl|tm|tn|to|tp|tr|tt|tv|tw|tz|ua|ug|uk|us|uy|uz|va|vc|ve|vg|vi|vn|vu|wf|ws|ye|yt|yu|za|zm|zw)
/
)
(?: # One or more:
[^\s()<>{}\[\]]+ # Run of non-space, non-()<>{}[]
| # or
\([^\s()]*?\([^\s()]+\)[^\s()]*?\) # balanced parens, one level deep: (…(…)…)
|
\([^\s]+?\) # balanced parens, non-recursive: (…)
)+
(?: # End with:
\([^\s()]*?\([^\s()]+\)[^\s()]*?\) # balanced parens, one level deep: (…(…)…)
|
\([^\s]+?\) # balanced parens, non-recursive: (…)
| # or
[^\s`!()\[\]{};:'".,<>?«»“”‘’] # not a space or one of these punct chars
)
| # OR, the following to match naked domains:
(?:
(?<!@) # not preceded by a @, avoid matching foo@_gmail.com_
[a-z0-9]+
(?:[.\-][a-z0-9]+)*
[.]
(?:com|net|org|edu|gov|mil|aero|asia|biz|cat|coop|info|int|jobs|mobi|museum|name|post|pro|tel|travel|xxx|ac|ad|ae|af|ag|ai|al|am|an|ao|aq|ar|as|at|au|aw|ax|az|ba|bb|bd|be|bf|bg|bh|bi|bj|bm|bn|bo|br|bs|bt|bv|bw|by|bz|ca|cc|cd|cf|cg|ch|ci|ck|cl|cm|cn|co|cr|cs|cu|cv|cx|cy|cz|dd|de|dj|dk|dm|do|dz|ec|ee|eg|eh|er|es|et|eu|fi|fj|fk|fm|fo|fr|ga|gb|gd|ge|gf|gg|gh|gi|gl|gm|gn|gp|gq|gr|gs|gt|gu|gw|gy|hk|hm|hn|hr|ht|hu|id|ie|il|im|in|io|iq|ir|is|it|je|jm|jo|jp|ke|kg|kh|ki|km|kn|kp|kr|kw|ky|kz|la|lb|lc|li|lk|lr|ls|lt|lu|lv|ly|ma|mc|md|me|mg|mh|mk|ml|mm|mn|mo|mp|mq|mr|ms|mt|mu|mv|mw|mx|my|mz|na|nc|ne|nf|ng|ni|nl|no|np|nr|nu|nz|om|pa|pe|pf|pg|ph|pk|pl|pm|pn|pr|ps|pt|pw|py|qa|re|ro|rs|ru|rw|sa|sb|sc|sd|se|sg|sh|si|sj| Ja|sk|sl|sm|sn|so|sr|ss|st|su|sv|sx|sy|sz|tc|td|tf|tg|th|tj|tk|tl|tm|tn|to|tp|tr|tt|tv|tw|tz|ua|ug|uk|us|uy|uz|va|vc|ve|vg|vi|vn|vu|wf|ws|ye|yt|yu|za|zm|zw)
\b
/?
(?!@) # not succeeded by a @, avoid matching "foo.na" in "foo.na@example.com"
)
)
@wanghaisheng

If I try to define a variable in Python like this:
a = r' Single-line version'
it gives me an invalid syntax error.
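
One likely cause, assuming the whole single-line pattern was pasted between the single quotes: the pattern contains both ' and " in its trailing punctuation class, so a raw string delimited by a single quote character ends early. Below is a minimal sketch that uses a triple-quoted raw string instead. The TLD list is shortened for readability, and the `__TLDS__` marker and `find_urls` helper are hypothetical names introduced only for this illustration.

```python
import re

# Abbreviated TLD list, for illustration only (assumption); substitute the
# gist's full list in real use.
TLDS = "(?:com|net|org|ru|io)"

# Triple-quoted raw string, because the pattern contains both ' and ".
# __TLDS__ is a hypothetical placeholder that is swapped for the list above.
PATTERN = r"""(?i)\b((?:https?:(?:/{1,3}|[a-z0-9%])|[a-z0-9.\-]+[.]__TLDS__/)(?:[^\s()<>{}\[\]]+|\([^\s()]*?\([^\s()]+\)[^\s()]*?\)|\([^\s]+?\))+(?:\([^\s()]*?\([^\s()]+\)[^\s()]*?\)|\([^\s]+?\)|[^\s`!()\[\]{};:'".,<>?«»“”‘’])|(?:(?<!@)[a-z0-9]+(?:[.\-][a-z0-9]+)*[.]__TLDS__\b/?(?!@)))""".replace("__TLDS__", TLDS)

URL_RE = re.compile(PATTERN)


def find_urls(text):
    """Return every URL-like substring (capture group 1) found in text."""
    return [m.group(1) for m in URL_RE.finditer(text)]


print(find_urls("see https://example.com/a_(b) and foo.org, not me@gmail.com"))
```

The inline `(?i)` flag has to stay at the very start of the string; recent Python versions reject global flags that appear later in the pattern.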

@Traumatizn

I just created a regex pattern that aims to help with this... If you feel so inclined to do so, give this a try:
https://github.com/Traumatizn/RegEx

@AfonsoAbreu

AfonsoAbreu commented Dec 6, 2021

This regex helped me a lot, but it crashed my React project when used to check the following string:

https://avatars.githubusercontent.com/u/65315866?

It gives the following error:
Unhandled Rejection (InternalError): too much recursion test C:/source/front-end/node_modules/yup/es/string.js:113 validate C:/source/front-end/node_modules/yup/es/util/createValidation.js:59 runTests C:/source/front-end/node_modules/yup/es/util/runTests.js:30 _validate/< C:/source/front-end/node_modules/yup/es/schema.js:229 once/< C:/source/front-end/node_modules/yup/es/util/runTests.js:8 finishTestRun C:/source/front-end/node_modules/yup/es/util/runTests.js:58 validate/< C:/source/front-end/node_modules/yup/es/util/createValidation.js:60 promise callback*validate C:/source/front-end/node_modules/yup/es/util/createValidation.js:59 runTests C:/source/front-end/node_modules/yup/es/util/runTests.js:30 _validate C:/source/front-end/node_modules/yup/es/schema.js:220 validate C:/source/front-end/node_modules/yup/es/schema.js:245 _validate/</tests</< C:/source/front-end/node_modules/yup/es/object.js:160 runTests C:/source/front-end/node_modules/yup/es/util/runTests.js:30 _validate/< C:/source/front-end/node_modules/yup/es/object.js:177 once/< C:/source/front-end/node_modules/yup/es/util/runTests.js:8 runTests C:/source/front-end/node_modules/yup/es/util/runTests.js:26 _validate/< C:/source/front-end/node_modules/yup/es/schema.js:229 once/< C:/source/front-end/node_modules/yup/es/util/runTests.js:8 finishTestRun C:/source/front-end/node_modules/yup/es/util/runTests.js:58 validate/< C:/source/front-end/node_modules/yup/es/util/createValidation.js:60 promise callback*validate C:/source/front-end/node_modules/yup/es/util/createValidation.js:59 runTests C:/source/front-end/node_modules/yup/es/util/runTests.js:30 _validate C:/source/front-end/node_modules/yup/es/schema.js:220 _validate C:/source/front-end/node_modules/yup/es/object.js:139 validate C:/source/front-end/node_modules/yup/es/schema.js:245 _validate/</tests[idx] C:/source/front-end/node_modules/yup/es/array.js:102 runTests C:/source/front-end/node_modules/yup/es/util/runTests.js:30 _validate/< 
C:/source/front-end/node_modules/yup/es/array.js:105 once/< C:/source/front-end/node_modules/yup/es/util/runTests.js:8 finishTestRun C:/source/front-end/node_modules/yup/es/util/runTests.js:58 validate/< C:/source/front-end/node_modules/yup/es/util/createValidation.js:60 promise callback*validate C:/source/front-end/node_modules/yup/es/util/createValidation.js:59 runTests C:/source/front-end/node_modules/yup/es/util/runTests.js:30 _validate/< C:/source/front-end/node_modules/yup/es/schema.js:229 once/< C:/source/front-end/node_modules/yup/es/util/runTests.js:8 finishTestRun C:/source/front-end/node_modules/yup/es/util/runTests.js:58 validate/< C:/source/front-end/node_modules/yup/es/util/createValidation.js:60 promise callback*validate C:/source/front-end/node_modules/yup/es/util/createValidation.js:59 runTests C:/source/front-end/node_modules/yup/es/util/runTests.js:30 _validate C:/source/front-end/node_modules/yup/es/schema.js:220 _validate C:/source/front-end/node_modules/yup/es/array.js:72 validate C:/source/front-end/node_modules/yup/es/schema.js:245 _validate/</tests</< C:/source/front-end/node_modules/yup/es/object.js:160 runTests C:/source/front-end/node_modules/yup/es/util/runTests.js:30 _validate/< C:/source/front-end/node_modules/yup/es/object.js:177 once/< C:/source/front-end/node_modules/yup/es/util/runTests.js:8 runTests C:/source/front-end/node_modules/yup/es/util/runTests.js:26 _validate/< C:/source/front-end/node_modules/yup/es/schema.js:229 once/< C:/source/front-end/node_modules/yup/es/util/runTests.js:8 finishTestRun C:/source/front-end/node_modules/yup/es/util/runTests.js:58 validate/< C:/source/front-end/node_modules/yup/es/util/createValidation.js:60 promise callback*validate C:/source/front-end/node_modules/yup/es/util/createValidation.js:59 runTests C:/source/front-end/node_modules/yup/es/util/runTests.js:30 _validate C:/source/front-end/node_modules/yup/es/schema.js:220 _validate C:/source/front-end/node_modules/yup/es/object.js:139 
validate/< C:/source/front-end/node_modules/yup/es/schema.js:245 validate C:/source/front-end/node_modules/yup/es/schema.js:245

I have absolutely no idea whether the problem lies in my own code or in this regex, but I was only able to fix it by replacing this regex with a simpler, less complete one that I wrote myself.

P.S. In this project, I'm using React, Formik and yup.

@tony

tony commented Jun 25, 2022

Does anybody have an example with Python named groups, e.g. (?P<tld>...), and so on?
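
As a rough sketch of what named groups could look like, here is the naked-domain branch of the pattern rewritten with Python's `(?P<name>...)` syntax. This is not from the gist: the group names `host` and `tld`, the `parse_domain` helper, and the shortened TLD list are all assumptions made for the example.

```python
import re

# Naked-domain branch only, with named groups added (assumed names).
NAKED_DOMAIN = re.compile(
    r"""(?xi)
    (?<!@)                                   # not preceded by a @
    \b
    (?P<host>[a-z0-9]+(?:[.\-][a-z0-9]+)*)   # labels before the TLD
    [.]
    (?P<tld>com|net|org|io)                  # abbreviated TLD list (assumption)
    \b
    /?
    (?!@)                                    # not followed by a @
    """
)


def parse_domain(text):
    """Return {'host': ..., 'tld': ...} for the first match, or None."""
    m = NAKED_DOMAIN.search(text)
    return m.groupdict() if m else None


print(parse_domain("visit www.example.org today"))
```

The same idea should carry over to the full pattern by naming the groups of interest, keeping in mind that Python does not allow the same group name to appear twice in one pattern, so the two TLD lists would need distinct names.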

@thatrandomperson5

Does not work with Python.

@c0dezer019

I feel the TLDs should just be generalized, given how many new ones keep popping up. The explicit list is not maintainable and is hard to read, so IMO it would be better to match any run of two or more alphabetic characters. That way some poor sap with a bizarre TLD doesn't run into issues just because the regex doesn't match the domain.
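
A minimal sketch of that suggestion, applied only to the naked-domain branch and assuming `[a-z]{2,}` as the generic TLD. The names `GENERIC_DOMAIN` and `find_domains` are made up for this example. It also shows the trade-off: without a TLD list, file names such as notes.txt match as well.

```python
import re

# Naked-domain branch with the TLD list replaced by "two or more letters".
GENERIC_DOMAIN = re.compile(
    r"(?i)(?<!@)\b[a-z0-9]+(?:[.\-][a-z0-9]+)*[.][a-z]{2,}\b/?(?!@)"
)


def find_domains(text):
    """Return every substring that looks like a scheme-less domain."""
    return [m.group(0) for m in GENERIC_DOMAIN.finditer(text)]


# A newer TLD is accepted, but so is an ordinary file name.
print(find_domains("see shop.photography and notes.txt"))
```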

@danila-schelkov

It doesn't work for URLs with Russian letters. I guess any letter may appear in a plain URL without Punycode. Can anyone provide another regexp?

@winzig

winzig commented Jun 8, 2023

@danila-schelkov Can you provide an example URL that isn't matching?

@danila-schelkov

Ofc, url: https://www.bagandwallet.ru/collection/sumki-bellroy/product/sumka-bellroy-venture-hip-pack-15l-kupit?variant_id=617976795&utm_source=pnn&utm_medium=email&utm_campaign=Сумка%20Bellroy%20Venture%20Hip%20Pack%201.5L

@winzig

winzig commented Jun 8, 2023

@danila-schelkov When I try that URL in regex101 along with Gruber's one-liner, it seems to match it correctly?

[screenshot, 2023-06-08 at 10:15:19]

Is it possible that your code is not treating the URL string as Unicode (e.g. UTF-8), and therefore might not be handling the Cyrillic correctly? (I'm not that familiar with the Cyrillic alphabet.)

@danila-schelkov

Oh it really works now, thank you! But I have another question. What is the purpose of those domain names?
[image]

I have checked, and the regexp also works for any other domain name, such as "www.yandex".

@gruber
Author

gruber commented Jun 12, 2023 via email

@KOUISAmine

Thanks, this works for me in both JS and PCRE. Here is a demo.
