@uogbuji
Created November 18, 2010 18:28
John Gruber's regex to find URLs in plain text, converted to Python/Unicode
#See: http://daringfireball.net/2010/07/improved_regex_for_matching_urls
import re, urllib
GRUBER_URLINTEXT_PAT = re.compile(ur'(?i)\b((?:https?://|www\d{0,3}[.]|[a-z0-9.\-]+[.][a-z]{2,4}/)(?:[^\s()<>]+|\(([^\s()<>]+|(\([^\s()<>]+\)))*\))+(?:\(([^\s()<>]+|(\([^\s()<>]+\)))*\)|[^\s`!()\[\]{};:\'".,<>?\xab\xbb\u201c\u201d\u2018\u2019]))')
for line in urllib.urlopen("http://daringfireball.net/misc/2010/07/url-matching-regex-test-data.text"):
    print [ mgroups[0] for mgroups in GRUBER_URLINTEXT_PAT.findall(line) ]
@arunchaganty

arunchaganty commented Jul 13, 2016

The pattern can lead to catastrophic backtracking because the second group, (?:[^\s()<>]+|\(([^\s()<>]+|(\([^\s()<>]+\)))*\))+, is essentially (x+|y)+ where x = [^\s()<>].

This caused the pattern to take an inordinate amount of time on the (real-life) input New laws in 2015 to benefit undocumented immigrants https://t.co/EKICT9YCwu..................................

A simple fix is to remove the + quantifier from the first [^\s()<>] in that group:

GRUBER_URLINTEXT_PAT = re.compile(ur'(?i)\b((?:https?://|www\d{0,3}[.]|[a-z0-9.\-]+[.][a-z]{2,4}/)(?:[^\s()<>]|\(([^\s()<>]+|(\([^\s()<>]+\)))*\))+(?:\(([^\s()<>]+|(\([^\s()<>]+\)))*\)|[^\s`!()[]{};:\'".,<>?\xab\xbb\u201c\u201d\u2018\u2019]))')

Note: The earlier version of this comment did not remove the + and had a backtick that caused formatting errors. The one above should have fixed these errors!
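
To make the (x+|y)+ problem concrete, here is a minimal Python 3 sketch. It does not use the full URL pattern: TAIL is an assumed, much smaller stand-in for the final exclusion class, y is reduced to a literal "()", and the test string is made up. It only illustrates why dropping the inner + matters when the text ends in a run of dots.

```python
import re
import time

X = r"[^\s()<>]"   # the class called x in the comment above
TAIL = r"[^\s.,]"  # assumption: simplified stand-in for the final exclusion class

# Shape (x+|y)+ : the inner + lets a run of x be split up in exponentially many ways.
NESTED = re.compile(r"(?:" + X + r"+|\(\))+" + TAIL)
# Shape (x|y)+ : each character is consumed by exactly one iteration of the outer +.
FLAT = re.compile(r"(?:" + X + r"|\(\))+" + TAIL)


def timed_match(pattern, text):
    """Return (matched text or None, elapsed seconds) for pattern.match(text)."""
    start = time.perf_counter()
    m = pattern.match(text)
    elapsed = time.perf_counter() - start
    return (m.group(0) if m else None), elapsed


# Hypothetical input: trailing dots cannot satisfy TAIL, so the engine must backtrack.
text = "t.co/abc" + "." * 16

nested_result, nested_seconds = timed_match(NESTED, text)
flat_result, flat_seconds = timed_match(FLAT, text)

print("nested:", nested_result, "%.6f s" % nested_seconds)
print("flat:  ", flat_result, "%.6f s" % flat_seconds)
```

Both patterns find the same match; only the time differs, and with the nested form the time grows roughly twofold for every extra trailing dot.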

@maurice-g

@arunchaganty nice find! But the regex you posted as a fix is exactly the same as the original, as far as I can see. How did you mean to modify it?

@christinazavou

Can you provide the correct answer, please?

@takwaIMR

takwaIMR commented Jul 4, 2017

I have the same problem. I used
GRUBER_URLINTEXT_PAT = re.compile(ur'(?i)\b((?:https?://|www\d{0,3}[.]|[a-z0-9.-]+[.][a-z]{2,4}/)(?:[^\s()<>]+|(([^\s()<>]+|(([^\s()<>]+)))))+(?:(([^\s()<>]+|(([^\s()<>]+))))|[^\s`!()\[\]{};:\'".,<>?\xab\xbb\u201c\u201d\u2018\u2019]))')

but it returns some URLs like:
https://t.co/h…

I need the correct answer, please!

@samtatasurya

The fix provided by @arunchaganty is missing the escaping backslashes in the very last exclusion group. The code below adds them back.

GRUBER_URLINTEXT_PAT = re.compile(ur'(?i)\b((?:https?://|www\d{0,3}[.]|[a-z0-9.\-]+[.][a-z]{2,4}/)(?:[^\s()<>]|\(([^\s()<>]+|(\([^\s()<>]+\)))*\))+(?:\(([^\s()<>]+|(\([^\s()<>]+\)))*\)|[^\s`!()\[\]{};:\'".,<>?\xab\xbb\u201c\u201d\u2018\u2019]))')
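
The snippets in this thread are Python 2 (the ur'' prefix is a syntax error in Python 3). A minimal sketch of how the pattern above might be used under Python 3, assuming nothing changes except the string prefix; find_urls and the sample sentence are hypothetical names and data added for illustration:

```python
import re

# Same pattern as above; only the Python 2 `ur` prefix is replaced by `r`.
# Python 3's re module interprets the \xab and \u201c style escapes itself.
GRUBER_URLINTEXT_PAT = re.compile(r'(?i)\b((?:https?://|www\d{0,3}[.]|[a-z0-9.\-]+[.][a-z]{2,4}/)(?:[^\s()<>]|\(([^\s()<>]+|(\([^\s()<>]+\)))*\))+(?:\(([^\s()<>]+|(\([^\s()<>]+\)))*\)|[^\s`!()\[\]{};:\'".,<>?\xab\xbb\u201c\u201d\u2018\u2019]))')


def find_urls(text):
    """Return the full URL (capture group 1) of every match in text."""
    return [m.group(1) for m in GRUBER_URLINTEXT_PAT.finditer(text)]


# Hypothetical sample text, not taken from Gruber's test data.
sample = "See http://example.com/foo_(bar) and www.example.org."
print(find_urls(sample))
```

Using finditer with group(1) avoids having to pick element 0 out of the tuples that findall returns when a pattern has several capture groups.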

@glubsy

glubsy commented May 1, 2018

This regex is awesome. I had to modify it slightly, but I found a rare catastrophic backtracking bug with the string below (even without my modifications) :(

http://Download%20(Album%20of%20six%20wallpapers)

The problem is the %20(

Tested here: https://regex101.com/r/DAA8ww/1

I had to resort to Google's RE2 (through the pyre2 wrapper) instead of Python's native re in order to avoid the problem (I had to remove the \u-escaped Unicode characters, though).
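
For anyone who wants to stay with Python's built-in re, one possible mitigation is to also drop the + on the character class inside the parenthesized alternatives, since ([^\s()<>]+|...)* has the same (x+)* shape that causes the blow-up between the parentheses. This is a Python 3 sketch under assumptions, not a fix confirmed by anyone in this thread: the inner groups are made non-capturing, the pattern is assembled from pieces for readability, and it has not been checked against Gruber's full test data. The names X, PAREN, PREFIX, TAIL, URL_PAT and find_urls are hypothetical.

```python
import re

X = r"[^\s()<>]"
# Balanced parentheses, one nesting level deep. The class directly inside the
# outer parentheses is matched one character per iteration (no inner +).
PAREN = r"\((?:" + X + r"|\(" + X + r"+\))*\)"
PREFIX = r"(?:https?://|www\d{0,3}[.]|[a-z0-9.\-]+[.][a-z]{2,4}/)"
# Final character: anything except whitespace and trailing punctuation.
TAIL = r"""[^\s`!()\[\]{};:'".,<>?\xab\xbb\u201c\u201d\u2018\u2019]"""

URL_PAT = re.compile(
    r"\b(" + PREFIX + r"(?:" + X + r"|" + PAREN + r")+(?:" + PAREN + r"|" + TAIL + r"))",
    re.IGNORECASE,
)


def find_urls(text):
    """Return the full URL (capture group 1) of every match in text."""
    return [m.group(1) for m in URL_PAT.finditer(text)]


# The string reported above as triggering catastrophic backtracking.
print(find_urls("http://Download%20(Album%20of%20six%20wallpapers)"))
```

The variant accepts the same strings as the original alternatives, because a run of [^\s()<>] characters can still be consumed by repeated iterations of the enclosing *; it just leaves the engine only one way to do so.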
