@Hispar
Last active August 29, 2015 14:28
Script to extract domain names from a text. Based on http://stackoverflow.com/questions/21211572/extract-all-domains-from-text, but with a different regular expression.
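# Python 2/3 compatibility imports
from __future__ import unicode_literals, print_function
# imports
import re
# vars
text = " A long text with url like www.google.com and www.twitter.com and lorem ipsum dolor sit amet www.tabga.es"
# A domain label, a dot, then a 2-4 letter TLD or a second-level .uk suffix
# (e.g. co.uk). The two lookaheads reject a match that is followed by more
# label characters or by another dot-separated label, so only the last label
# before the TLD is captured: www.google.com yields google.com.
regex = r'([a-z0-9][-a-z0-9]*[a-z0-9]|[a-z0-9])\.(([a-z]{2,4}|[a-z]{2,3}\.uk))(?![-0-9a-z])(?!\.[a-z0-9])'
# findall returns one tuple of captured groups per match; the set drops duplicates
urls = set(re.findall(regex, text))
for url in urls:
    print('{}.{}'.format(url[0], url[1]))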
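The same idea can be wrapped in a reusable function. This is only a sketch under a few assumptions: the name `extract_domains` is hypothetical, the pattern is the one above with the dot before `uk` escaped, and the input is lowercased before matching because the pattern only covers lowercase letters (the original script does not lowercase anything).

```python
import re

# Pattern from the script above, with the dot before "uk" escaped so that it
# only matches a literal dot (as in "co.uk").
DOMAIN_RE = re.compile(
    r'([a-z0-9][-a-z0-9]*[a-z0-9]|[a-z0-9])'   # last label before the TLD
    r'\.([a-z]{2,4}|[a-z]{2,3}\.uk)'           # TLD, or a second-level .uk suffix
    r'(?![-0-9a-z])(?!\.[a-z0-9])'             # must not continue into another label
)


def extract_domains(text):
    """Return the sorted, de-duplicated domains found in text.

    Hypothetical helper, not part of the original gist. Lowercasing the
    input is an added assumption so that "WWW.Google.com" is also found.
    """
    return sorted({m.group(0) for m in DOMAIN_RE.finditer(text.lower())})


sample = "Visit www.google.com, WWW.Google.com and www.bbc.co.uk today."
print(extract_domains(sample))  # prints ['bbc.co.uk', 'google.com']
```

Using `finditer` with `group(0)` avoids rebuilding the domain from the captured groups, and sorting makes the output order predictable, which iterating over a plain set does not guarantee.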