@faisalmahmud
Created October 21, 2015 16:58
import re
import HTMLParser


def get_urls_from_text(text):
    # See: http://daringfireball.net/2010/07/improved_regex_for_matching_urls
    # Copied from http://copia.posthaven.com/finding-urls-in-plain-text
    parser = HTMLParser.HTMLParser()
    GRUBER_URLINTEXT_PAT = re.compile(ur'(?i)\b((?:https?://|www\d{0,3}[.]|[a-z0-9.\-]+[.][a-z]{2,4}/)(?:[^\s()<>]+|\(([^\s()<>]+|(\([^\s()<>]+\)))*\))+(?:\(([^\s()<>]+|(\([^\s()<>]+\)))*\)|[^\s`!()\[\]{};:\'".,<>?\xab\xbb\u201c\u201d\u2018\u2019]))')
    # findall() returns one tuple of groups per match; group 0 is the full URL.
    # Decode HTML entities (e.g. &amp;) in each matched URL.
    urls = [parser.unescape(mgroups[0]) for mgroups in GRUBER_URLINTEXT_PAT.findall(text)]
    return urls
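
The function above targets Python 2: it relies on the `HTMLParser` module and a `ur''` literal, neither of which exists in Python 3. What follows is a minimal sketch of a Python 3 equivalent, not the author's code. It assumes `html.unescape` is an acceptable stand-in for `HTMLParser.unescape`, splits the same pattern across several raw-string pieces for readability, and uses made-up sample text.

```python
import html
import re

# Same Gruber pattern as in the gist, written as Python 3 raw strings.
# Adjacent string literals are concatenated into one pattern.
GRUBER_URLINTEXT_PAT = re.compile(
    r'(?i)\b((?:https?://|www\d{0,3}[.]|[a-z0-9.\-]+[.][a-z]{2,4}/)'
    r'(?:[^\s()<>]+|\(([^\s()<>]+|(\([^\s()<>]+\)))*\))+'
    r'(?:\(([^\s()<>]+|(\([^\s()<>]+\)))*\)'
    r'|[^\s`!()\[\]{};:\'".,<>?\xab\xbb\u201c\u201d\u2018\u2019]))'
)


def get_urls_from_text(text):
    """Return every URL found in text, with HTML entities decoded."""
    # findall() yields one tuple of groups per match; group 0 is the full URL.
    return [html.unescape(groups[0])
            for groups in GRUBER_URLINTEXT_PAT.findall(text)]


# Hypothetical sample input, for illustration only.
sample = "Docs: http://example.com/a?x=1&amp;y=2, also www.example.org."
urls = get_urls_from_text(sample)
print(urls)
```

Trailing punctuation such as the comma and the final period is left out of the matches because the last character class in the pattern excludes it. Compiling the pattern once at module level also avoids rebuilding it on every call.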