@hallvors
Created March 21, 2016 08:36
Using tldextract to remove www. safely and extract the domain name and its public suffix
import tldextract


def extract_domain_name(url):
    '''Extract the domain name from a given URL'''
    prefix_blacklist = ['www']
    parts = tldextract.extract(url)
    # We want to drop any prefixes mentioned in the blacklist.
    # They typically do not add information that's useful to
    # distinguish the "identity" of a specific site.
    # Sometimes the blacklisted prefix is only part of the subdomain,
    # for example when parsing www.mail.example.com, so compare whole
    # labels instead of doing a substring replace.
    labels = [label for label in parts.subdomain.split('.') if label]
    while labels and labels[0] in prefix_blacklist:
        labels.pop(0)
    # Skip empty parts (no subdomain left, or no public suffix found)
    # so the result never has a leading or trailing dot.
    names = labels + [parts.domain, parts.suffix]
    return '.'.join(name for name in names if name)
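
A minimal sketch of the prefix-dropping step on its own, using only the standard library so it can be run without tldextract installed. The helper name `strip_prefix_labels` and its `blacklist` parameter are hypothetical and not part of this gist or of tldextract. It assumes the subdomain arrives as a dotted string, such as the `subdomain` field returned by `tldextract.extract`.

```python
def strip_prefix_labels(subdomain, blacklist=('www',)):
    """Drop leading labels of `subdomain` that appear in `blacklist`.

    Hypothetical helper for illustration. It compares whole labels, so a
    subdomain like 'wwwmail' is left untouched.
    """
    labels = [label for label in subdomain.split('.') if label]
    while labels and labels[0] in blacklist:
        labels.pop(0)
    return '.'.join(labels)


print(strip_prefix_labels('www'))        # whole subdomain is blacklisted
print(strip_prefix_labels('www.mail'))   # only the leading label is dropped
print(strip_prefix_labels('wwwmail'))    # not a separate label, so it is kept
```

Comparing labels rather than calling `str.replace` avoids mangling subdomains that merely contain the blacklisted text, and it leaves a blacklisted label alone when it is not at the front.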