@uogbuji
Created November 18, 2010 18:28
John Gruber's regex to find URLs in plain text, converted to Python/Unicode
# See: http://daringfireball.net/2010/07/improved_regex_for_matching_urls
import re
import urllib.request

GRUBER_URLINTEXT_PAT = re.compile(r'(?i)\b((?:https?://|www\d{0,3}[.]|[a-z0-9.\-]+[.][a-z]{2,4}/)(?:[^\s()<>]+|\(([^\s()<>]+|(\([^\s()<>]+\)))*\))+(?:\(([^\s()<>]+|(\([^\s()<>]+\)))*\)|[^\s`!()\[\]{};:\'".,<>?\xab\xbb\u201c\u201d\u2018\u2019]))')
with urllib.request.urlopen("http://daringfireball.net/misc/2010/07/url-matching-regex-test-data.text") as response:
    for line in response:
        # findall yields one tuple of groups per match; the first group is the whole URL
        print([mgroups[0] for mgroups in GRUBER_URLINTEXT_PAT.findall(line.decode("utf-8"))])

glubsy commented May 1, 2018

This regex is awesome. I had to modify it slightly, but I found a rare catastrophic backtracking bug with the string below (it happens even without my modifications) :(

http://Download%20(Album%20of%20six%20wallpapers)

The problem is the %20(

Tested here: https://regex101.com/r/DAA8ww/1

I had to resort to Google's RE2 (through the pyre2 wrapper) instead of Python's native re module to avoid the problem, although I had to remove the \u-escaped Unicode characters to do so.
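For anyone who wants to stay on Python's built-in re module, here is a minimal sketch of one possible workaround. It is not Gruber's pattern and not the gist author's: `SAFER_URL_PAT` is a hypothetical name for a variant that drops the nested quantifiers, on the assumption that `(?:X+|Y)+` and `(?:X|Y)+` accept the same strings when `X` is a single character class. The backtracking seems to come from the many ways `X+` inside a repeated group can split one run of characters, so matching one character per iteration should remove that ambiguity. The inner capturing groups are also made non-capturing, so `findall` returns plain strings instead of tuples.

```python
import re

# Hypothetical de-nested variant of the pattern above (an assumption, not the
# original): the inner "+" on [^\s()<>] is removed wherever that class sits
# inside a repeated group, and the inner groups no longer capture.
SAFER_URL_PAT = re.compile(
    r'(?i)\b((?:https?://|www\d{0,3}[.]|[a-z0-9.\-]+[.][a-z]{2,4}/)'
    r'(?:[^\s()<>]|\((?:[^\s()<>]|\([^\s()<>]+\))*\))+'
    r'(?:\((?:[^\s()<>]|\([^\s()<>]+\))*\)'
    r'|[^\s`!()\[\]{};:\'".,<>?\xab\xbb\u201c\u201d\u2018\u2019]))'
)

# The string reported above as triggering catastrophic backtracking.
pathological = "http://Download%20(Album%20of%20six%20wallpapers)"
print(SAFER_URL_PAT.findall(pathological))

# An ordinary sentence with trailing punctuation around the URLs.
ordinary = "see http://example.com/path, and www.example.org."
print(SAFER_URL_PAT.findall(ordinary))
```

This has only been reasoned through on the two inputs shown, so it is worth running against Gruber's full test data before relying on it.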
