@gruber
Last active April 22, 2024 19:02
Liberal, Accurate Regex Pattern for Matching Web URLs
The regex patterns in this gist are intended only to match web URLs -- http,
https, and naked domains like "example.com". For a pattern that attempts to
match all URLs, regardless of protocol, see: https://gist.github.com/gruber/249502
# Single-line version:
(?i)\b((?:https?:(?:/{1,3}|[a-z0-9%])|[a-z0-9.\-]+[.](?:com|net|org|edu|gov|mil|aero|asia|biz|cat|coop|info|int|jobs|mobi|museum|name|post|pro|tel|travel|xxx|ac|ad|ae|af|ag|ai|al|am|an|ao|aq|ar|as|at|au|aw|ax|az|ba|bb|bd|be|bf|bg|bh|bi|bj|bm|bn|bo|br|bs|bt|bv|bw|by|bz|ca|cc|cd|cf|cg|ch|ci|ck|cl|cm|cn|co|cr|cs|cu|cv|cx|cy|cz|dd|de|dj|dk|dm|do|dz|ec|ee|eg|eh|er|es|et|eu|fi|fj|fk|fm|fo|fr|ga|gb|gd|ge|gf|gg|gh|gi|gl|gm|gn|gp|gq|gr|gs|gt|gu|gw|gy|hk|hm|hn|hr|ht|hu|id|ie|il|im|in|io|iq|ir|is|it|je|jm|jo|jp|ke|kg|kh|ki|km|kn|kp|kr|kw|ky|kz|la|lb|lc|li|lk|lr|ls|lt|lu|lv|ly|ma|mc|md|me|mg|mh|mk|ml|mm|mn|mo|mp|mq|mr|ms|mt|mu|mv|mw|mx|my|mz|na|nc|ne|nf|ng|ni|nl|no|np|nr|nu|nz|om|pa|pe|pf|pg|ph|pk|pl|pm|pn|pr|ps|pt|pw|py|qa|re|ro|rs|ru|rw|sa|sb|sc|sd|se|sg|sh|si|sj|Ja|sk|sl|sm|sn|so|sr|ss|st|su|sv|sx|sy|sz|tc|td|tf|tg|th|tj|tk|tl|tm|tn|to|tp|tr|tt|tv|tw|tz|ua|ug|uk|us|uy|uz|va|vc|ve|vg|vi|vn|vu|wf|ws|ye|yt|yu|za|zm|zw)/)(?:[^\s()<>{}\[\]]+|\([^\s()]*?\([^\s()]+\)[^\s()]*?\)|\([^\s]+?\))+(?:\([^\s()]*?\([^\s()]+\)[^\s()]*?\)|\([^\s]+?\)|[^\s`!()\[\]{};:'".,<>?«»“”‘’])|(?:(?<!@)[a-z0-9]+(?:[.\-][a-z0-9]+)*[.](?:com|net|org|edu|gov|mil|aero|asia|biz|cat|coop|info|int|jobs|mobi|museum|name|post|pro|tel|travel|xxx|ac|ad|ae|af|ag|ai|al|am|an|ao|aq|ar|as|at|au|aw|ax|az|ba|bb|bd|be|bf|bg|bh|bi|bj|bm|bn|bo|br|bs|bt|bv|bw|by|bz|ca|cc|cd|cf|cg|ch|ci|ck|cl|cm|cn|co|cr|cs|cu|cv|cx|cy|cz|dd|de|dj|dk|dm|do|dz|ec|ee|eg|eh|er|es|et|eu|fi|fj|fk|fm|fo|fr|ga|gb|gd|ge|gf|gg|gh|gi|gl|gm|gn|gp|gq|gr|gs|gt|gu|gw|gy|hk|hm|hn|hr|ht|hu|id|ie|il|im|in|io|iq|ir|is|it|je|jm|jo|jp|ke|kg|kh|ki|km|kn|kp|kr|kw|ky|kz|la|lb|lc|li|lk|lr|ls|lt|lu|lv|ly|ma|mc|md|me|mg|mh|mk|ml|mm|mn|mo|mp|mq|mr|ms|mt|mu|mv|mw|mx|my|mz|na|nc|ne|nf|ng|ni|nl|no|np|nr|nu|nz|om|pa|pe|pf|pg|ph|pk|pl|pm|pn|pr|ps|pt|pw|py|qa|re|ro|rs|ru|rw|sa|sb|sc|sd|se|sg|sh|si|sj|Ja|sk|sl|sm|sn|so|sr|ss|st|su|sv|sx|sy|sz|tc|td|tf|tg|th|tj|tk|tl|tm|tn|to|tp|tr|tt|tv|tw|tz|ua|ug|uk|us|uy|uz|va|vc|ve|vg|vi|vn|vu|wf|ws|ye|yt|yu|za|zm|zw)\b/?(?!@)))
# Commented multi-line version:
(?xi)
\b
(                           # Capture 1: entire matched URL
  (?:
    https?:                 # URL protocol and colon
    (?:
      /{1,3}                # 1-3 slashes
      |                     #   or
      [a-z0-9%]             # Single letter or digit or '%'
                            # (Trying not to match e.g. "URI::Escape")
    )
    |                       #   or
                            # looks like domain name followed by a slash:
    [a-z0-9.\-]+[.]
    (?:com|net|org|edu|gov|mil|aero|asia|biz|cat|coop|info|int|jobs|mobi|museum|name|post|pro|tel|travel|xxx|ac|ad|ae|af|ag|ai|al|am|an|ao|aq|ar|as|at|au|aw|ax|az|ba|bb|bd|be|bf|bg|bh|bi|bj|bm|bn|bo|br|bs|bt|bv|bw|by|bz|ca|cc|cd|cf|cg|ch|ci|ck|cl|cm|cn|co|cr|cs|cu|cv|cx|cy|cz|dd|de|dj|dk|dm|do|dz|ec|ee|eg|eh|er|es|et|eu|fi|fj|fk|fm|fo|fr|ga|gb|gd|ge|gf|gg|gh|gi|gl|gm|gn|gp|gq|gr|gs|gt|gu|gw|gy|hk|hm|hn|hr|ht|hu|id|ie|il|im|in|io|iq|ir|is|it|je|jm|jo|jp|ke|kg|kh|ki|km|kn|kp|kr|kw|ky|kz|la|lb|lc|li|lk|lr|ls|lt|lu|lv|ly|ma|mc|md|me|mg|mh|mk|ml|mm|mn|mo|mp|mq|mr|ms|mt|mu|mv|mw|mx|my|mz|na|nc|ne|nf|ng|ni|nl|no|np|nr|nu|nz|om|pa|pe|pf|pg|ph|pk|pl|pm|pn|pr|ps|pt|pw|py|qa|re|ro|rs|ru|rw|sa|sb|sc|sd|se|sg|sh|si|sj| Ja|sk|sl|sm|sn|so|sr|ss|st|su|sv|sx|sy|sz|tc|td|tf|tg|th|tj|tk|tl|tm|tn|to|tp|tr|tt|tv|tw|tz|ua|ug|uk|us|uy|uz|va|vc|ve|vg|vi|vn|vu|wf|ws|ye|yt|yu|za|zm|zw)
    /
  )
  (?:                       # One or more:
    [^\s()<>{}\[\]]+        # Run of non-space, non-()<>{}[]
    |                       #   or
    \([^\s()]*?\([^\s()]+\)[^\s()]*?\)  # balanced parens, one level deep: (…(…)…)
    |
    \([^\s]+?\)             # balanced parens, non-recursive: (…)
  )+
  (?:                       # End with:
    \([^\s()]*?\([^\s()]+\)[^\s()]*?\)  # balanced parens, one level deep: (…(…)…)
    |
    \([^\s]+?\)             # balanced parens, non-recursive: (…)
    |                       #   or
    [^\s`!()\[\]{};:'".,<>?«»“”‘’]      # not a space or one of these punct chars
  )
  |                         # OR, the following to match naked domains:
  (?:
    (?<!@)                  # not preceded by a @, avoid matching foo@_gmail.com_
    [a-z0-9]+
    (?:[.\-][a-z0-9]+)*
    [.]
    (?:com|net|org|edu|gov|mil|aero|asia|biz|cat|coop|info|int|jobs|mobi|museum|name|post|pro|tel|travel|xxx|ac|ad|ae|af|ag|ai|al|am|an|ao|aq|ar|as|at|au|aw|ax|az|ba|bb|bd|be|bf|bg|bh|bi|bj|bm|bn|bo|br|bs|bt|bv|bw|by|bz|ca|cc|cd|cf|cg|ch|ci|ck|cl|cm|cn|co|cr|cs|cu|cv|cx|cy|cz|dd|de|dj|dk|dm|do|dz|ec|ee|eg|eh|er|es|et|eu|fi|fj|fk|fm|fo|fr|ga|gb|gd|ge|gf|gg|gh|gi|gl|gm|gn|gp|gq|gr|gs|gt|gu|gw|gy|hk|hm|hn|hr|ht|hu|id|ie|il|im|in|io|iq|ir|is|it|je|jm|jo|jp|ke|kg|kh|ki|km|kn|kp|kr|kw|ky|kz|la|lb|lc|li|lk|lr|ls|lt|lu|lv|ly|ma|mc|md|me|mg|mh|mk|ml|mm|mn|mo|mp|mq|mr|ms|mt|mu|mv|mw|mx|my|mz|na|nc|ne|nf|ng|ni|nl|no|np|nr|nu|nz|om|pa|pe|pf|pg|ph|pk|pl|pm|pn|pr|ps|pt|pw|py|qa|re|ro|rs|ru|rw|sa|sb|sc|sd|se|sg|sh|si|sj| Ja|sk|sl|sm|sn|so|sr|ss|st|su|sv|sx|sy|sz|tc|td|tf|tg|th|tj|tk|tl|tm|tn|to|tp|tr|tt|tv|tw|tz|ua|ug|uk|us|uy|uz|va|vc|ve|vg|vi|vn|vu|wf|ws|ye|yt|yu|za|zm|zw)
    \b
    /?
    (?!@)                   # not succeeded by a @, avoid matching "foo.na" in "foo.na@example.com"
  )
)
@AvnerCohen

It gets stuck in Ruby as well.

This one works for me for 99% of cases, which is what I needed:

((?<=[^a-zA-Z0-9])(?:https?\:\/\/|[a-zA-Z0-9]{1,}\.{1}|\b)(?:\w{1,}\.{1}){1,5}(?:com|org|edu|gov|uk|net|ca|de|jp|fr|au|us|ru|ch|it|nl|se|no|es|mil|iq|io|ac|ly|sm){1}(?:\/[a-zA-Z0-9]{1,})*)

https://regex101.com/r/fO6mX3/2

@kbeezie

kbeezie commented Jan 10, 2015

I can't seem to get this to work in PHP. Even when I use a <<<EOD ... EOD; heredoc to put the pattern into the PHP variable, preg keeps getting hung up on an unknown modifier ''.

Edit: I had to use a delimiter that was not a part of the string at all:
$webpattern = <<<EOD
pattern here
EOD;

@dpk

dpk commented May 10, 2015

Those experiencing hanging problems should try the pattern in a real regular expression engine, that is, one which does not backtrack.

@marko-galesic

http://schema.org","@type
Is being matched. Generally, whenever there is a " after a seemingly valid url, the match includes lots of characters after the url.
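
A reduced Python sketch of why this happens, using only the `https?://` branch of the pattern (the TLD branch is left out for brevity, and the names `ORIGINAL_STYLE` and `STOP_AT_QUOTE` are made up for this example). The middle run `[^\s()<>{}\[\]]+` does not exclude `"`, so the match carries on through the JSON punctuation until it reaches whitespace or a bracket. Adding `"` to that class is one possible workaround, not something the gist itself does:

```python
import re

# Reduced form of the gist's scheme branch: a run of non-space, non-bracket
# characters, ending on a character that is not trailing punctuation.
ORIGINAL_STYLE = re.compile(
    r"""https?://[^\s()<>{}\[\]]+[^\s`!()\[\]{};:'".,<>?«»“”‘’]""",
    re.IGNORECASE,
)

# Assumed workaround: also stop the middle run at a double quote.
STOP_AT_QUOTE = re.compile(
    r"""https?://[^\s()<>{}\[\]"]+[^\s`!()\[\]{};:'".,<>?«»“”‘’]""",
    re.IGNORECASE,
)

sample = '{"@context":"http://schema.org","@type":"Person"}'

print(ORIGINAL_STYLE.search(sample).group(0))  # runs on past the closing quote
print(STOP_AT_QUOTE.search(sample).group(0))   # stops at http://schema.org
```

The trade-off is that a URL that legitimately contains an unencoded `"` would be cut short, which is rare in practice but worth knowing.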

@CodingOctocat

How can I match web image URIs in JSON? For example:
-- http://p1.xxx.com/45/b9/45b9f057fc1957ed2c946814342c0f02.jpg OR
-- http://pic1.xxx.com/4766e0648_m.jpg OR ETC.
And I want to replace the URIs with
-- C://MyFolder/45b9f057fc1957ed2c946814342c0f02.jpg OR
-- C://MyFolder/4766e0648_m.jpg OR ETC.
I tried:
-- [a-zA-z]+://[^\s]/[^\s].jpg
but if the JSON is not formatted (everything on one line), it matches only one result:
-- http://xxx.com/xxx.jpg...OTHER JSON CODE ... http://xxx.com/xxx.jpg
That is, the match runs from the first http:// to the last .jpg.
Then I tried another:
-- (http|ftp|https)://[\w-]+(.[\w-]+)+([\w-.,@?^=%&:/+#]*[\w-@?^=%&/+#])?.jpg
It works, but I cannot use $1 or $2 in the replacement...
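
One way to approach the replacement, sketched in Python under a few assumptions: the URLs sit inside JSON strings (so a `"` never appears inside a URL), only `.jpg` files matter, and the target folder is the `C://MyFolder/` from the question. The names `IMAGE_URL`, `LOCAL_PREFIX`, and `localize_image_urls` are hypothetical. Capturing just the file name in group 1 gives the replacement something to refer to:

```python
import re

# Hypothetical target folder, taken from the example paths in the question.
LOCAL_PREFIX = "C://MyFolder/"

# Group 1 captures only the file name. Excluding '"' and whitespace keeps each
# match inside a single JSON string, even when the JSON is all on one line.
IMAGE_URL = re.compile(r'https?://[^\s"]*/([^\s"/]+\.jpg)', re.IGNORECASE)


def localize_image_urls(text: str) -> str:
    """Replace each remote .jpg URL with LOCAL_PREFIX plus its file name."""
    return IMAGE_URL.sub(lambda m: LOCAL_PREFIX + m.group(1), text)


raw = (
    '{"a":"http://p1.xxx.com/45/b9/45b9f057fc1957ed2c946814342c0f02.jpg",'
    '"b":"http://pic1.xxx.com/4766e0648_m.jpg"}'
)
print(localize_image_urls(raw))
```

In engines that use `$1` in replacement strings, the same capture group would be referenced that way; the lambda here simply avoids escaping issues in Python replacement templates.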

@ethaniel

@gruber I think this thread needs your attention :)

@scryba

scryba commented Oct 14, 2017

in PHP

$regex ="
(?i)\b((?:https?:(?:/{1,3}|[a-z0-9%])|[a-z0-9.-]+./)(?:[^\s()<>{}\[\]]+|([^\s()]?([^\s()]+)[^\s()]?)|([^\s]+?))+(?:([^\s()]?([^\s()]+)[^\s()]?)|([^\s]+?)|[^\s`!()\[\]{};:'".,<>?«»“”‘’])|(?:(?<!@)[a-z0-9]+(?:[.-][a-z0-9]+)*.\b/?(?!@)))

"
$string = "google.com Inquiry www.yahoo.com test for: http://www.bing.com";
preg_match_all($regex, $string, $match);

var_dump(match[0]);

But I get the error => [error] [php] preg_match_all(): Unknown modifier ' \ '

Any work around this?

@lukapaunovic

lukapaunovic commented Oct 20, 2017

@clement-analogue

@lukapaunovic This PHP code does not work with ( and ).

@deadbits

@AvnerCohen
Your example is missing subdomains with dashes in them. For example, your pattern won't catch all of this:
x-foo.bar.subdomain.com/index.php?q=123

@li-a

li-a commented Jun 30, 2020

I suggest replacing the initial \b with (?<![a-z0-9_@-]) (and removing the now-redundant (?<!@) line). The word boundary is easier to read but leads to more wasted work for the rest of the expression. The a-z0-9 ensure a proper boundary, and the _@- prevent matches in the middle of something like a typical filename or email address. Others may feel free to add more prefix symbols that indicate a URL should be ignored.
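
To see what this substitution changes in terms of matches, here is a minimal Python sketch. It assumes a three-entry TLD list in place of the full one, and the names `DOMAIN`, `WITH_WORD_BOUNDARY`, and `WITH_LOOKBEHIND` are made up for the example. It illustrates only the matching difference, not the performance claim:

```python
import re

# Tiny TLD list for illustration only; the gist's full list would go here.
DOMAIN = r"[a-z0-9]+(?:[.\-][a-z0-9]+)*[.](?:com|net|org)\b"

# Gist style: word boundary plus the separate "not after @" check.
WITH_WORD_BOUNDARY = re.compile(r"\b(?<!@)" + DOMAIN, re.IGNORECASE)

# Suggested style: one lookbehind that covers letters, digits, and _ @ -
WITH_LOOKBEHIND = re.compile(r"(?<![a-z0-9_@-])" + DOMAIN, re.IGNORECASE)

text = "see example.com, bob@example.org, x -dash.net"

print(WITH_WORD_BOUNDARY.findall(text))  # ['example.com', 'dash.net']
print(WITH_LOOKBEHIND.findall(text))     # ['example.com']
```

Both forms skip the email address. The only difference on this input is `dash.net`, which `\b` accepts right after a hyphen and the lookbehind rejects.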

The outermost ( ) group also seems redundant? Most (all?) flavors capture the whole match as group #0 or with some other API.
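
In Python, for instance, the whole match is always available as group 0. The sketch below uses a much-reduced stand-in for the gist's pattern (the names `WITH_OUTER_GROUP` and `WITHOUT_OUTER_GROUP` are hypothetical) to show that dropping the outer parentheses loses nothing:

```python
import re

# Reduced stand-in for the gist's pattern, with and without the outer group.
WITH_OUTER_GROUP = re.compile(r"\b(https?://[^\s<>]+[^\s<>.,;:!?])", re.IGNORECASE)
WITHOUT_OUTER_GROUP = re.compile(r"\bhttps?://[^\s<>]+[^\s<>.,;:!?]", re.IGNORECASE)

text = "Docs live at https://example.com/docs."

with_group = WITH_OUTER_GROUP.search(text)
without_group = WITHOUT_OUTER_GROUP.search(text)

print(with_group.group(1))     # explicit capture group
print(without_group.group(0))  # whole match, no group needed
```

`re.findall` returns the contents of group 1 when a single capture group is present and the whole match otherwise, so both forms yield the same list here as well.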

Now, this part:

  (?:							# One or more:
    [^\s()<>{}\[\]]+						# Run of non-space, non-()<>{}[]
    |								#   or
    \([^\s()]*?\([^\s()]+\)[^\s()]*?\)  # balanced parens, one level deep: (…(…)…)
    |
    \([^\s]+?\)							# balanced parens, non-recursive: (…)
  )+
  (?:							# End with:
    \([^\s()]*?\([^\s()]+\)[^\s()]*?\)  # balanced parens, one level deep: (…(…)…)
    |
    \([^\s]+?\)							# balanced parens, non-recursive: (…)
    |									#   or
    [^\s`!()\[\]{};:'".,<>?«»“”‘’]		# not a space or one of these punct chars
  )

is a recipe for seriously catastrophic runtime. I adapted it to the pattern below, which runs fast. More importantly, it completes quickly for the dozens of cases I tested that made the expression above run for over an hour (until being aborted):

(?x:                               # Scoping the COMMENT flag; can be removed if compressed or concatenated with more commented expression
  [^\s()<>{}\[\]]*                 # 0+ non-space, non-()<>{}[]
  (?:                              # 0+ times:
    \(                             #   Balanced parens containing:
    [^\s()]*                       #   0+ non-paren chars
    (?:                            #   0+ times:
      \([^\s()]*\)                 #     Inner balanced parens containing 0+ non-paren chars
      [^\s()]*                     #     0+ non-paren chars
    )*
    \)
    [^\s()<>{}\[\]]*               # 0+ non-space, non-()<>{}[]
  )*
  (?:                              # End with:
    \(                             #   Balanced parens containing:
    [^\s()]*                       #   0+ non-paren chars
    (?:                            #   0+ times:
      \([^\s()]*\)                 #     Inner balanced parens containing 0+ non-paren chars
      [^\s()]*                     #     0+ non-paren chars
    )*
    \)
    |                              #   or
    [^\s`!()\[\]{};:'".,<>?«»“”‘’] # not a space or one of these punct chars
  )
)

Avoid nesting quantifiers in this way unless you add atomic grouping or make the inner ones possessive. In my engine (Java), you can also change most (not all) of the * to *+ to match the same inputs with better performance.
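
A sketch of the rewritten tail in Python, assuming it is attached to a bare `https?://` prefix rather than the gist's full scheme-or-domain alternation (the name `URL_TAIL` is hypothetical). Python's `re` is also a backtracking engine, so this only shows that the unrolled form compiles there and handles balanced parentheses; it is not a benchmark:

```python
import re

URL_TAIL = re.compile(
    r"""
    https?://                          # assumed prefix: scheme branch only
    [^\s()<>{}\[\]]*                   # 0+ non-space, non-bracket chars
    (?:                                # 0+ times:
      \( [^\s()]*                      #   balanced parens with non-paren chars
      (?: \([^\s()]*\) [^\s()]* )*     #   and 0+ inner balanced parens
      \)
      [^\s()<>{}\[\]]*                 #   0+ non-space, non-bracket chars
    )*
    (?:                                # end with:
      \( [^\s()]* (?: \([^\s()]*\) [^\s()]* )* \)
      |                                #   or
      [^\s`!()\[\]{};:'".,<>?«»“”‘’]   #   not a space or trailing punctuation
    )
    """,
    re.VERBOSE | re.IGNORECASE,
)

wiki = "https://en.wikipedia.org/wiki/Rust_(programming_language)"
print(URL_TAIL.search(wiki).group(0))

print(URL_TAIL.search("(see https://example.com/foo).").group(0))

markdown_link = "[https://www.surveymonkey.com/r/9RSNDDT](https://www.surveymonkey.com/r/9RSNDDT)"
print(URL_TAIL.findall(markdown_link))
```

The first input keeps its trailing parenthesized part, the second drops the enclosing `)` and `.`, and the Markdown link yields the same URL twice. Python's `re` accepts possessive quantifiers such as `*+` from version 3.11 on, so the possessive variant mentioned above should be expressible there too.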

@gruber
Author

gruber commented Jul 1, 2020 via email

@jackdeguest

Why bother ?

use Regexp::Common qw( URI );
if( $str =~ /$RE{URI}{HTTP}/ )
{
    # something here
}
# For https
elsif( $str =~ /$RE{URI}{HTTP}{-scheme => 'https'}/ )
{
    # something else
}

@firozzer

Thank you guys, for maintaining & updating this page till date! Landed here via Google > "url regex python" > StackOverflow > here.

@vellrya

vellrya commented Jan 27, 2021

.pub and .ski domains are ignored without an initial http.

@sjosegarcia

I am glad I found this, through Stack Overflow.

@fariello

fariello commented Jun 9, 2021

This was very helpful, but in some very rare edge cases both the one-liner and the longer, more readable version will "hang" in Python. The long version hangs on this:

"[https://www.surveymonkey.com/r/9RSNDDT](https://www.surveymonkey.com/r/9RSNDDT)"

The short version handles that input, but then hangs on the text below, and I cannot determine which part of it is responsible:

"**Disclaimer: All the accused are innocent until proven guilty by the court of law, even if they may sound as being guilty. Currency in Armenian Drams unless specified otherwise.**\n\n---\n\nMarch 1st murder victims' representatives sent a request to the Constitutional Court and demanded 7 member judges to recuse themselves from hearing the Kocharyan case this August, citing the reasons as political past, lack of trust, and a conflict of interest. \n\nPetition says the judges were involved with making pro-Kocharyan and pro-Serj election result verdicts in 2008, at the time when it is known that Kocharyan regime was directly pressuring the courts, as exposed by US embassy cables. \n\nOther reasons were also mentioned, for example some judges holding extraordinary sessions to approve Kocharyan's now-disputed state of emergency declaration; a judge still being a party member at the time of being elected; judges being elected by a president who came into power controversially after pressing the court; the chief judge being an active HHK member and MP at the time of appointed, who aided the party agenda. \n\nRead the rest of the complaint here... \n\nhttps://armtimes.com/hy/article/165446\n\n---\n\nHHK party is being sued by the family of a businessman who used to own the building where one of HHK's headquarter is (or was) located. The businessman was a wine maker in the 20th century and owned the building. The family alleges that in 2001, HHK used it political powers to rapidly force the 460 sq/m building to be sold to HHK for only 5.5mln Drams. Next year, the state paid HHK 93mln to obtain the same area, which later presumably came under HHK's possession again. The plaintiff alleges that property registration agency was in such a hurry that they didn't wait for ownership names to be properly changed before authorizing the transactions, thus making the process illegal. The court will hear the case this September. 
\n \nhttps://www.armtimes.com/hy/article/165369\n\n---\n\nThe government session took place. \n\n375mln of the excess tax money will be spent on rebuilding 5 high schools. Hundreds of millions to renovate dozens of others. 1.7bln was dedicated to repair universities. https://armtimes.com/hy/article/165433 -- https://factor.am/165207.html\n\n-\n\n837mln will be dedicated to create 330 robotics labs in schools, in addition to 350 that already exist. https://armtimes.com/hy/article/165417\n\n-\n\nThe plan to double the premium pensions for 397 WW2 vets has been approved. They'll receive 100k/month. It'll probably be sent to Parliament for a final approval. https://armtimes.com/hy/article/165407\n\n-\n\n4bln will be issued to State Water Committee to solve local water debt and management issues. Minister of Infrastructure complained that a poor management for many years has led to a 7bln debt. https://armtimes.com/hy/article/165415\n\n\n-\n\nThe government approved a draft version of a QP bill that regulates contractor work terms to prevent certain types of ""unfair and arbitrary terminations"". Details inside... https://armtimes.com/hy/article/165411\n\n-\n\nThe government approved the agreement to remove double taxation with Singapore, and to prevent tax avoidance 🇸🇬 \n\nhttps://armenpress.am/arm/news/980749.html\n\n-\n\n---\n\nThe hospital healthcare becomes free for anyone below 18, beginning today. Parents of soldiers will also qualify for this free care. Some treatments with expensive equipment will become free for disabled people. https://www.armtimes.com/hy/article/165479\n\n---\n\nSevan lake is greener than usual. Last time it was this green in 1960s and in 2018. Smaller level greentifications have happened frequently for many years. Why does the lake become green?\n\n It's due to flora and other organic processes, which can place the quality of the lake in danger if nothing is done. 
When the water level rises but the trees near the shore aren't cut beforehand, it cases problems when the trees become submerged. When large quantities of water is drained for irrigation, it causes more problems because it disturbs the organic process that the lake does to clean itself. Add to this the fact that phosphorus, sewage and other materials are added to lake, it results in the lake changing its color.\n\n-\n\nMinister of Nature Protection said the current greenification is caused by a combination of algae (jrimur) growth, phosphorus-nitrogen and other chemical buildup, dirty water flowing into the lake, submerged trees near the shores, record high temperatures. \n\n-\n\nThe government begun an examination to find and treat the causes. They have already identified the areas that need to be cleared of submerged trees. The cleanup can last 2-3 years, and the lake's water levels will be brought to 1901.5 meters afterwards. To identify the areas that need to be cleaned, the ministry used satellite and drones. 770 hectares of problematic areas were identified. In the past years, 70 hectares were cleaned each year. They will increase that number to 300 hectares for 2020 and 2021.\n\nThis season's water drainage is already low compared to past years, which should help. Minister says they're working with the law enforcement and Agricultural ministry to reduce irrigation abuse.\n\n-\n\nGerman institutes are involved with identifying certain pollution sources. The Ministry sent a petition to UNESCO to add Sevan to Bioshpere Reserve program, which will allow better cleanup management and socioeconomic improvements for residents living near the lake. Belarus has offered help with the cleanup.\n\n-\n\nMinistry is working with PM's office to create a plan to clean up 30 rivers that pour into the lake. After the cleanup, the lake's water levels will be raised because more fresh water automatically translates into better quality in some regards. 
\n\n-\n\nWithin the next 2 weeks, the large deposits of phosphorus will erode away, and the green color, which is ""still safe"", should be significantly reduced, said the Minister. This greenifcation is also happening in the Black Sea and Baikal Lake due to climate change and temperature rise. \n\n\nhttps://hetq.am/hy/article/105250 --- https://www.youtube.com/watch?v=0cDwuFXB2Mc --- https://armtimes.com/hy/article/165424 --- TLDR https://www.youtube.com/watch?v=ORD7jJClIRk\n\n---\n\nPashinyan's government earlier approved raising minimum wage from 55k to at least 63k. However, they want it raised even higher, to 68k. \n\nQP party in Parliament doesn't fully agree with the government. Says extending it to 68k will have ""side effects"". The co-author of the bill QP MP Babken Tunyan says 63k number was initially chosen because it's the minimum necessary to ""survive"", aka the food basket number. ""It makes sense for the minimum wage to equal to it"". \n\nSetting minimum wage above that number will cause government official salaries to also go up, because these salaries are calculated by multiplying the base salary by a number. MP doesn't want this to happen. \n\nThe MP is in favor of raising it to 68k only if the government agrees to make changes on how official salaries are calculated so they'll continue to earn the old salaries, and if the government submits a report on how much extra burden the extended raise will be on the budget. \n\nThe MP says 80,000 workers who earn exactly the current minimum wage will see their wages raised, plus many more workers will see their close-to-minimum wage raised to the new minimum wage. The number of ""affected"" workers is significantly higher than the 80,000 that was initially reported. \n\nhttps://armenpress.am/arm/news/980746.html\n\n......................................... \n\nMinistry of Labor says the 68k is the better number. They examined 1580 businesses and found that the raise wouldn't be a big burden on them. 
The businesses said they could go up to 70k. The minimum wage should be above the survival food basket, said the deputy Minister. \n\n-\n\nTo survive you need: 63k\n\nParliament: minimum wage should be 63k\n\nGovernment: minimum wage should be 68k\n\nBusiness: anything below 70k won't hurt us (per government study) \n\nhttps://armenpress.am/arm/news/980734.html --- https://www.youtube.com/watch?v=NOjSGdKjeLk\n\n---\n\n29 PACE representatives urged the Armenian government to do more to protect LGBTI *(a random new letter attached to LGBT every day?)*, calling the current actions as insufficient and failed. They cite death threats towards an LGBTI forum in Armenia as an example. \n\nThe report criticizes that there are no new laws to protect LGBTI, and the existing ones don't respect the gender identity and gender choice. They urged the government to publicly denounce hate speech against LGBTI, and to better train public officials and judges to respect the LGBTI. \n\nToday's comment section will have 50 comments, and 49 will be about this topic.\n\nhttps://armtimes.com/hy/article/165445\n\n---\n\nCalifornia will dedicate $5mln to build a Armenian-US museum in Glendale. It'll be about the history of Armenia, diaspora, culture cooperaation, etc. \n\nhttps://www.foxnews.com/us/ocasio-cortez-accused-of-stunt-with-claims-at-the-border-acquitted-navy-seal-to-speak-to-fox-news\n\n---\n\nArtsakh and Armenia Human Rights Ombudsmen met Los Angeles municipality officials and legislators, thanked them for the continuous support, discussed new plans. \nhttps://armtimes.com/hy/article/165448\n\n---\n\nWorld Customs Organization re-elects Armenian Customs Service as member of its audit committee. 182 countries, 12 of which are audit members; Armenia and Netherlands represent Europe. 
It's about the classification of goods, customs valuation, border cooperation etc.\n\nhttp://arka.am/en/news/business/world_customs_organization_re_elects_armenian_customs_service_as_member_of_its_audit_committee/\n\n---\n\nEveryone: installs a trampoline for a pool\n\nArmenian nibbas: moves the pool under an ancient historical bridge to use it as a trampoline\n\nSmart nibbas: just stop, get some help\n\nhttps://armenpress.am/arm/news/980785.html\n\n---\n\nFireworks explosion in Belarus leaves several injured and one dead. SFW https://youtu.be/OhuyA2BBZDc?t=4 --https://youtu.be/C4MF2XI_raw?t=84\n\n---\n\nSinger Hayko has a message for the haters https://www.youtube.com/watch?v=xZocs8XQUe4"

@wanghaisheng

If I try to define a variable in Python like this:
a = r' Single-line version'
it gives me an invalid syntax error.
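
A likely cause is that the pattern contains both `'` and `"`, so a single-quoted `r'...'` literal ends at the first `'` inside the character class. A minimal sketch of one way around it, using a triple-quoted raw string; only the final character class of the gist's pattern is reproduced here, and the `\S*` in front of it is a placeholder for the rest (the names `TRAILING_CHAR` and `pattern` are hypothetical):

```python
import re

# Only the final character class of the gist's pattern is used here; it holds
# both ' and ", which is what ends a single-quoted r'...' literal too early.
TRAILING_CHAR = r"""[^\s`!()\[\]{};:'".,<>?«»“”‘’]"""

# Triple-quoted raw strings accept both quote characters unescaped, as long as
# the text never contains three double quotes in a row.
pattern = re.compile(r"""(?i)\bhttps?://\S*""" + TRAILING_CHAR)

print(pattern.search('He said "see https://example.com/a?b=1".').group(0))
```

The same quoting should carry over to the full single-line version, since it does not appear to contain three consecutive double quotes anywhere.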

@Traumatizn

I just created a regex pattern that aims to help with this... If you feel so inclined to do so, give this a try:
https://github.com/Traumatizn/RegEx

@AfonsoAbreu

AfonsoAbreu commented Dec 6, 2021

This regex helped me a lot, but it crashed my React project when used to check the following string:

https://avatars.githubusercontent.com/u/65315866?

It gives the following error:
Unhandled Rejection (InternalError): too much recursion test C:/source/front-end/node_modules/yup/es/string.js:113 validate C:/source/front-end/node_modules/yup/es/util/createValidation.js:59 runTests C:/source/front-end/node_modules/yup/es/util/runTests.js:30 _validate/< C:/source/front-end/node_modules/yup/es/schema.js:229 once/< C:/source/front-end/node_modules/yup/es/util/runTests.js:8 finishTestRun C:/source/front-end/node_modules/yup/es/util/runTests.js:58 validate/< C:/source/front-end/node_modules/yup/es/util/createValidation.js:60 promise callback*validate C:/source/front-end/node_modules/yup/es/util/createValidation.js:59 runTests C:/source/front-end/node_modules/yup/es/util/runTests.js:30 _validate C:/source/front-end/node_modules/yup/es/schema.js:220 validate C:/source/front-end/node_modules/yup/es/schema.js:245 _validate/</tests</< C:/source/front-end/node_modules/yup/es/object.js:160 runTests C:/source/front-end/node_modules/yup/es/util/runTests.js:30 _validate/< C:/source/front-end/node_modules/yup/es/object.js:177 once/< C:/source/front-end/node_modules/yup/es/util/runTests.js:8 runTests C:/source/front-end/node_modules/yup/es/util/runTests.js:26 _validate/< C:/source/front-end/node_modules/yup/es/schema.js:229 once/< C:/source/front-end/node_modules/yup/es/util/runTests.js:8 finishTestRun C:/source/front-end/node_modules/yup/es/util/runTests.js:58 validate/< C:/source/front-end/node_modules/yup/es/util/createValidation.js:60 promise callback*validate C:/source/front-end/node_modules/yup/es/util/createValidation.js:59 runTests C:/source/front-end/node_modules/yup/es/util/runTests.js:30 _validate C:/source/front-end/node_modules/yup/es/schema.js:220 _validate C:/source/front-end/node_modules/yup/es/object.js:139 validate C:/source/front-end/node_modules/yup/es/schema.js:245 _validate/</tests[idx] C:/source/front-end/node_modules/yup/es/array.js:102 runTests C:/source/front-end/node_modules/yup/es/util/runTests.js:30 _validate/< 
C:/source/front-end/node_modules/yup/es/array.js:105 once/< C:/source/front-end/node_modules/yup/es/util/runTests.js:8 finishTestRun C:/source/front-end/node_modules/yup/es/util/runTests.js:58 validate/< C:/source/front-end/node_modules/yup/es/util/createValidation.js:60 promise callback*validate C:/source/front-end/node_modules/yup/es/util/createValidation.js:59 runTests C:/source/front-end/node_modules/yup/es/util/runTests.js:30 _validate/< C:/source/front-end/node_modules/yup/es/schema.js:229 once/< C:/source/front-end/node_modules/yup/es/util/runTests.js:8 finishTestRun C:/source/front-end/node_modules/yup/es/util/runTests.js:58 validate/< C:/source/front-end/node_modules/yup/es/util/createValidation.js:60 promise callback*validate C:/source/front-end/node_modules/yup/es/util/createValidation.js:59 runTests C:/source/front-end/node_modules/yup/es/util/runTests.js:30 _validate C:/source/front-end/node_modules/yup/es/schema.js:220 _validate C:/source/front-end/node_modules/yup/es/array.js:72 validate C:/source/front-end/node_modules/yup/es/schema.js:245 _validate/</tests</< C:/source/front-end/node_modules/yup/es/object.js:160 runTests C:/source/front-end/node_modules/yup/es/util/runTests.js:30 _validate/< C:/source/front-end/node_modules/yup/es/object.js:177 once/< C:/source/front-end/node_modules/yup/es/util/runTests.js:8 runTests C:/source/front-end/node_modules/yup/es/util/runTests.js:26 _validate/< C:/source/front-end/node_modules/yup/es/schema.js:229 once/< C:/source/front-end/node_modules/yup/es/util/runTests.js:8 finishTestRun C:/source/front-end/node_modules/yup/es/util/runTests.js:58 validate/< C:/source/front-end/node_modules/yup/es/util/createValidation.js:60 promise callback*validate C:/source/front-end/node_modules/yup/es/util/createValidation.js:59 runTests C:/source/front-end/node_modules/yup/es/util/runTests.js:30 _validate C:/source/front-end/node_modules/yup/es/schema.js:220 _validate C:/source/front-end/node_modules/yup/es/object.js:139 
validate/< C:/source/front-end/node_modules/yup/es/schema.js:245 validate C:/source/front-end/node_modules/yup/es/schema.js:245

I have absolutely no idea whether the problem lies in my own code or in this regex, but I was only able to fix it by replacing this regex with a simpler, less complete one that I made.

P.S. In this project, I'm using React, Formik and yup.

@tony

tony commented Jun 25, 2022

Anybody have an example with python named groupings? e.g. (?P<tld>...), so on?
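
A minimal sketch of what named groups could look like in Python. This is a simplified pattern, not the gist's: the scheme is optional, the TLD is assumed to be any run of two or more letters instead of the explicit list, and the group names `scheme`, `host`, `tld`, and `path` are made up for the example:

```python
import re

# Hypothetical simplification: scheme optional, TLD taken as any 2+ letters
# instead of the gist's explicit list.
NAMED_URL = re.compile(
    r"""
    \b
    (?:(?P<scheme>https?)://)?            # optional scheme
    (?P<host>
        [a-z0-9]+(?:[.\-][a-z0-9]+)*      # labels before the TLD
        [.](?P<tld>[a-z]{2,})             # top-level domain
    )
    \b
    (?P<path>                             # optional path, ending on a
        /[^\s()<>{}\[\]]*                 # character that is not trailing
        [^\s`!()\[\]{};:'".,<>?«»“”‘’]    # punctuation
    )?
    """,
    re.VERBOSE | re.IGNORECASE,
)

m = NAMED_URL.search("Try https://www.example.co.uk/a/b?x=1, it works.")
print(m.groupdict())
```

To keep the gist's behavior, the `[a-z]{2,}` inside `(?P<tld>...)` could be swapped for the full TLD alternation.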

@thatrandomperson5

Does not work with python

@c0dezer019

I feel the TLDs should just be generalized, given how many new ones keep popping up. The list is not maintainable and is hard to read, so IMO it would be better to just match any alphabetic run of 2 or more characters. That way some poor sap with a bizarre TLD doesn't run into issues just because the regex doesn't know the domain.
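
A rough Python sketch of that idea, applied only to the naked-domain branch. The `[a-z]{2,}` stand-in for the TLD list is an assumption, and `GENERIC_NAKED_DOMAIN` is a hypothetical name. It picks up newer TLDs such as `.pub` and `.ski`, at the cost of also matching things that merely look like domains:

```python
import re

# Assumed generalization: any run of 2+ letters counts as a TLD.
GENERIC_NAKED_DOMAIN = re.compile(
    r"\b(?<!@)[a-z0-9]+(?:[.\-][a-z0-9]+)*[.][a-z]{2,}\b/?(?!@)",
    re.IGNORECASE,
)

text = "Visit example.pub or snow.ski, then open notes.txt"
print(GENERIC_NAKED_DOMAIN.findall(text))
```

The file name `notes.txt` comes back as a match too, which is the kind of false positive the explicit TLD list is there to limit.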

@danila-schelkov

It doesn't work for URLs with Russian letters. I guess any letter may appear in a plain URL without Punycode. Can anyone provide another regexp?

@winzig

winzig commented Jun 8, 2023

@danila-schelkov Can you provide an example URL that isn't matching?

@danila-schelkov

Ofc, url: https://www.bagandwallet.ru/collection/sumki-bellroy/product/sumka-bellroy-venture-hip-pack-15l-kupit?variant_id=617976795&utm_source=pnn&utm_medium=email&utm_campaign=Сумка%20Bellroy%20Venture%20Hip%20Pack%201.5L

@winzig

winzig commented Jun 8, 2023

@danila-schelkov When I try that URL in regex101 along with Gruber's one-liner, it seems to match it correctly?

screenshot 2023-06-08 at 10 15 19

Is it possible that your code is not treating the URL string as unicode (e.g. utf-8), and therefore might not be handling the Cyrillic correctly? (I'm not that familiar with Cyrillic alphabet.)

@danila-schelkov

Oh it really works now, thank you! But I have another question. What is the purpose of those domain names?
image

I have checked and the regexp is working for any other domain name like "www.yandex"

@gruber
Author

gruber commented Jun 12, 2023 via email

@KOUISAmine

Thanks, this works for me in both JS and PCRE; here is a demo.
