Created
December 3, 2019 20:02
import re

import pandas as pd

# Load the URL sets into data frames
df_404s = pd.read_csv("404-urls.csv")
df_canonicals = pd.read_csv("canonical-urls.csv")

# Replace /, -, _ and .html with spaces
df_404s["phrase"] = df_404s["404 url"].apply(lambda x: re.sub(r"[/_-]|\.html", " ", x))
df_canonicals["phrase"] = df_canonicals["canonical url"].apply(lambda x: re.sub(r"[/_-]|\.html", " ", x))
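
To see what the "phrase" column ends up holding, here is a minimal sketch that applies the same regular expression to two made-up paths. The helper name `url_to_phrase` and the sample URLs are assumptions for illustration and do not come from the original CSV files.

```python
import re

# Same pattern as above: slashes, underscores, hyphens and ".html" become spaces.
PATTERN = re.compile(r"[/_-]|\.html")


def url_to_phrase(url):
    """Hypothetical helper mirroring the lambda applied to each URL column."""
    return PATTERN.sub(" ", url)


# Made-up sample paths, used only to illustrate the substitution.
samples = ["/mens-shoes/running_shoes.html", "/blog/2019/seo-tips"]
phrases = [url_to_phrase(u) for u in samples]

for url, phrase in zip(samples, phrases):
    print(repr(url), "->", repr(phrase))
```

Each matched character is replaced one for one, so a leading slash or a trailing ".html" leaves a space at the edge of the phrase. Calling `.strip()` on the result is one possible way to trim them if they matter downstream.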