#See: http://daringfireball.net/2010/07/improved_regex_for_matching_urls
import re, urllib
GRUBER_URLINTEXT_PAT = re.compile(ur'(?i)\b((?:https?://|www\d{0,3}[.]|[a-z0-9.\-]+[.][a-z]{2,4}/)(?:[^\s()<>]+|\(([^\s()<>]+|(\([^\s()<>]+\)))*\))+(?:\(([^\s()<>]+|(\([^\s()<>]+\)))*\)|[^\s`!()\[\]{};:\'".,<>?\xab\xbb\u201c\u201d\u2018\u2019]))')
for line in urllib.urlopen("http://daringfireball.net/misc/2010/07/url-matching-regex-test-data.text"):
    print [ mgroups[0] for mgroups in GRUBER_URLINTEXT_PAT.findall(line) ]
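The snippet above is Python 2 (`ur''` literal, `print` statement, `urllib.urlopen`). A minimal Python 3 sketch of the same idea might look like the following. The helper name `find_urls` and the sample string are hypothetical, added only for illustration, and the pattern is copied unchanged from the snippet above.

```python
import re

# Same pattern as the Python 2 snippet. A plain raw string is enough in
# Python 3, because the re module itself interprets \xNN and \uNNNN escapes.
GRUBER_URLINTEXT_PAT = re.compile(r'(?i)\b((?:https?://|www\d{0,3}[.]|[a-z0-9.\-]+[.][a-z]{2,4}/)(?:[^\s()<>]+|\(([^\s()<>]+|(\([^\s()<>]+\)))*\))+(?:\(([^\s()<>]+|(\([^\s()<>]+\)))*\)|[^\s`!()\[\]{};:\'".,<>?\xab\xbb\u201c\u201d\u2018\u2019]))')


def find_urls(text):
    """Return capture group 1 (the whole URL) of every match in text."""
    return [m.group(1) for m in GRUBER_URLINTEXT_PAT.finditer(text)]


# Hypothetical sample input, not taken from Gruber's test data.
# To run against the test data as the original does, read lines from
# urllib.request.urlopen(...) and decode each one to str first.
sample = "See http://example.com/path, or http://foo.com/blah_(wikipedia)"
print(find_urls(sample))
```

Using `finditer` with `group(1)` avoids the tuple indexing (`mgroups[0]`) that `findall` requires when a pattern has several capture groups.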
Can you provide the correct answer, please?
I have the same problem. I used:
GRUBER_URLINTEXT_PAT = re.compile(ur'(?i)\b((?:https?://|www\d{0,3}[.]|[a-z0-9.-]+[.][a-z]{2,4}/)(?:[^\s()<>]+|(([^\s()<>]+|(([^\s()<>]+)))))+(?:(([^\s()<>]+|(([^\s()<>]+))))|[^\s`!()\[\]{};:\'".,<>?\xab\xbb\u201c\u201d\u2018\u2019]))')
but it returns some URLs like:
https://t.co/h…
I need the correct answer, please!
The fix provided by @arunchaganty is missing the escaping backslashes in the very last exclusion group. The code below adds them back.
GRUBER_URLINTEXT_PAT = re.compile(ur'(?i)\b((?:https?://|www\d{0,3}[.]|[a-z0-9.\-]+[.][a-z]{2,4}/)(?:[^\s()<>]|\(([^\s()<>]+|(\([^\s()<>]+\)))*\))+(?:\(([^\s()<>]+|(\([^\s()<>]+\)))*\)|[^\s`!()\[\]{};:\'".,<>?\xab\xbb\u201c\u201d\u2018\u2019]))')
This regex is awesome. I had to modify it slightly, but I found a rare catastrophic backtracking bug with the string below (even without my modifications) :(
http://Download%20(Album%20of%20six%20wallpapers)
The problem is the %20(
Tested here: https://regex101.com/r/DAA8ww/1
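For anyone unfamiliar with catastrophic backtracking, here is a toy illustration. It is deliberately NOT the Gruber pattern: `NESTED`, `FLAT` and `time_failed_match` are made-up names, and the patterns only mimic the general shape of a quantified group whose body is itself quantified, which the gist's pattern also contains (for example `([^\s()<>]+|...)*`). Whether that shape is the exact cause of the hang reported here is an assumption.

```python
import re
import time

# Toy patterns, not the Gruber regex.
NESTED = re.compile(r'(?:a+)+b')  # '+' inside a group that is itself repeated
FLAT = re.compile(r'(?:a)+b')     # same language, inner '+' removed


def time_failed_match(pattern, n):
    """Match n 'a' characters with no trailing 'b'; return (result, seconds)."""
    text = 'a' * n
    start = time.perf_counter()
    result = pattern.match(text)
    return result, time.perf_counter() - start


# On a backtracking engine such as CPython's re, the nested version is
# expected to slow down sharply as n grows, while the flat one stays quick.
for n in (10, 14, 18):
    _, nested_s = time_failed_match(NESTED, n)
    _, flat_s = time_failed_match(FLAT, n)
    print(n, 'nested: %.4fs' % nested_s, 'flat: %.4fs' % flat_s)
```

Both patterns accept the same strings; only the failing inputs behave differently, because the engine has to try every way of splitting the run of characters between the inner and outer repetition before giving up.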
Had to resort to using Google's re2 (through the pyre2 wrapper) instead of Python's native re in order to avoid the problem (had to remove the \u escaped Unicode characters though).
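A rough sketch of that workaround, under some assumptions: that the wrapper is importable as `re2` and exposes an `re`-like `compile`/`finditer` API, and that the rest of the pattern is accepted once the trailing `\xab\xbb\u201c\u201d\u2018\u2019` characters are dropped, as described above. The helper name `find_urls` is hypothetical.

```python
try:
    # Assumption: a pyre2-style wrapper is installed and importable as re2.
    import re2 as re
except ImportError:
    # Fallback to the standard library; still a backtracking engine.
    import re

# Gist pattern with the \x / \u escaped quote characters removed from the
# final exclusion class, as the comment above describes.
URL_PAT = re.compile(r'(?i)\b((?:https?://|www\d{0,3}[.]|[a-z0-9.\-]+[.][a-z]{2,4}/)(?:[^\s()<>]+|\(([^\s()<>]+|(\([^\s()<>]+\)))*\))+(?:\(([^\s()<>]+|(\([^\s()<>]+\)))*\)|[^\s`!()\[\]{};:\'".,<>?]))')


def find_urls(text):
    """Return capture group 1 (the whole URL) of every match in text."""
    return [m.group(1) for m in URL_PAT.finditer(text)]


print(find_urls("See http://example.com/path, ok."))
```

With the fallback branch the problematic string above can still hang, so this only helps when the `re2` import actually succeeds.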
@arunchaganty nice find! But as far as I can see, the regex you posted as a fix is exactly the same as the original. How did you mean to modify it?