@ohld
Created February 4, 2021 12:03
Transform any URL to a standard form (useful for joining tables)
# Want to join your data based on URLs (links)?
# You need to convert all URLs to one format first,
# e.g. remove "www.", remove "https://", remove URL params.
# This is how I do it:
def prettify(url):
    if not url or not isinstance(url, str):
        return None  # not sure that this is the best approach
    url = url.lower().strip()
    # remove garbage: the scheme first, then a leading "www."
    # (prefix checks, so a "www." later in the URL is left alone)
    for prefix in ("http://", "https://", "www."):
        if url.startswith(prefix):
            url = url[len(prefix):]
    # remove url params
    q = url.find("?")
    if q != -1:
        url = url[:q]
    # remove trailing slash
    url = url.rstrip("/")
    return url
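For comparison, here is a minimal sketch of the same idea built on the standard library's `urllib.parse.urlsplit` instead of string replacement. The name `normalize_url` and the sample `visits` / `leads` tables are hypothetical, made up for illustration. Unlike the function above, this variant also drops `#fragment` parts, which is an assumption about what you want in a join key.

```python
from urllib.parse import urlsplit


def normalize_url(url):
    """Hypothetical alternative to prettify() based on urllib.parse."""
    if not url or not isinstance(url, str):
        return None
    url = url.strip().lower()
    # urlsplit only recognizes the host when a scheme or "//" is present,
    # so prepend "//" to bare inputs such as "example.com/page".
    if "://" not in url and not url.startswith("//"):
        url = "//" + url
    parts = urlsplit(url)
    host = parts.netloc
    if host.startswith("www."):
        host = host[len("www."):]
    # Query string and fragment are dropped; trailing slashes are removed.
    return (host + parts.path).rstrip("/")


# Made-up sample data: two "tables" keyed by differently formatted URLs.
visits = {"https://www.example.com/pricing/?utm_source=mail": 42}
leads = {"example.com/pricing": "Acme"}

# Normalize the keys on both sides, then join on the normalized key.
normalized_visits = {normalize_url(k): v for k, v in visits.items()}
joined = {
    normalize_url(k): (normalized_visits[normalize_url(k)], name)
    for k, name in leads.items()
    if normalize_url(k) in normalized_visits
}
```

Both spellings of the pricing page collapse to the key `example.com/pricing`, so the two records line up in `joined`.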
# Sometimes a LinkedIn link has a useless subdomain
def prettify_linkedin(url):
    url = prettify(url)
    if not url:
        return None
    # cut everything before the domain
    start = url.rfind("linkedin.com")
    if start != -1:
        url = url[start:]
    return url
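The `rfind` approach cuts at the last occurrence of "linkedin.com" anywhere in the string, including inside a path. If that matters for your data, the sketch below only looks at the host part. `strip_subdomain` is a hypothetical helper, not part of the original gist, and it assumes the input has already been through `prettify` (no scheme, no "www.", no query string).

```python
def strip_subdomain(url, domain):
    """Hypothetical generalization of prettify_linkedin to any domain.

    Expects an already normalized URL such as "ru.linkedin.com/in/someone".
    """
    if not url:
        return None
    # Split the host from the rest at the first "/".
    host, sep, path = url.partition("/")
    # Replace the host only if it is the domain itself or one of its
    # subdomains; a domain that merely appears in the path is left alone.
    if host == domain or host.endswith("." + domain):
        host = domain
    return host + sep + path


example = strip_subdomain("ru.linkedin.com/in/someone", "linkedin.com")
```

Matching on `"." + domain` rather than on the bare domain keeps look-alike hosts such as `notlinkedin.com` from being rewritten.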