@franchb
Created August 30, 2018 07:44
Check typos in domain names based on Levenshtein distance
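import Levenshtein as lev  # third-party package (python-Levenshtein on PyPI)
import pandas as pd

# Known-good domains to compare against.
white_domains = [
    'gmail.com',
    'yahoo.com',
    'icloud.com',
    'mail.ru',
    'yandex.ru',
]


def has_domain_typo(domain, max_distance=2):
    """Return True if domain is within max_distance edits of a known domain."""
    distances = [lev.distance(domain, d) for d in white_domains]
    # An exact match is never a typo, even if another known domain is close.
    if 0 in distances:
        return False
    return min(distances) <= max_distance


df = pd.DataFrame()
# Note: the sample values are bare domains, not full email addresses.
df['email'] = ['yandex.ru', 'yandax.ru', 'mail.ru', 'maik.ru']
df['typo_in_email_domain_flag'] = df['email'].apply(has_domain_typo)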
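
The snippet above compares bare domains, although the column is named email. A minimal sketch of how the same check might be applied to full addresses follows, assuming the domain is everything after the last '@'. It uses a pure-Python edit distance so it runs without the Levenshtein package. The names KNOWN_DOMAINS, edit_distance and suggest_domain are hypothetical and not part of the gist.

```python
# Hypothetical helpers, not part of the original gist.
KNOWN_DOMAINS = ['gmail.com', 'yahoo.com', 'icloud.com', 'mail.ru', 'yandex.ru']


def edit_distance(a, b):
    """Levenshtein distance computed with a two-row dynamic programme."""
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            current.append(min(
                previous[j] + 1,               # delete ca
                current[j - 1] + 1,            # insert cb
                previous[j - 1] + (ca != cb),  # substitute (free if equal)
            ))
        previous = current
    return previous[-1]


def suggest_domain(email, max_distance=2):
    """Return the closest known domain if the address looks mistyped, else None."""
    # Assumption: the domain is whatever follows the last '@'.
    domain = email.rsplit('@', 1)[-1].strip().lower()
    if domain in KNOWN_DOMAINS:
        return None  # exact match, nothing to correct
    best = min(KNOWN_DOMAINS, key=lambda d: edit_distance(domain, d))
    return best if edit_distance(domain, best) <= max_distance else None
```

Returning the nearest domain rather than a boolean lets the caller offer a correction, for example suggest_domain('user@yandax.ru') gives 'yandex.ru', while an unrelated domain gives None.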