@hamletbatista
Created December 3, 2019 20:02
import re

import pandas as pd

# Load the two URL sets into data frames.
df_404s = pd.read_csv("404-urls.csv")
df_canonicals = pd.read_csv("canonical-urls.csv")

# Replace "/", "-", "_" and ".html" with spaces to turn each URL into a phrase.
df_404s["phrase"] = df_404s["404 url"].apply(lambda x: re.sub(r"[/_-]|\.html", " ", x))
df_canonicals["phrase"] = df_canonicals["canonical url"].apply(lambda x: re.sub(r"[/_-]|\.html", " ", x))
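To see what the substitution produces without the CSV files, here is a minimal sketch that applies the same pattern to two made-up paths. The helper name `url_to_phrase` and the sample URLs are assumptions for illustration and are not part of the original gist. The whitespace cleanup is an optional extra step that the gist does not perform.

```python
import re

# Same pattern as the snippet above: "/", "_", "-" or a literal ".html".
SEPARATORS = re.compile(r"[/_-]|\.html")


def url_to_phrase(url: str) -> str:
    """Hypothetical helper: swap URL separators for spaces, then tidy whitespace."""
    phrase = SEPARATORS.sub(" ", url)
    # split() + join() trims the ends and collapses runs of spaces.
    return " ".join(phrase.split())


# Made-up sample paths, not taken from the original CSV files.
samples = ["/mens-shoes/running_shoes.html", "/blog/seo-tips/"]
phrases = [url_to_phrase(url) for url in samples]
print(phrases)
```

Without the cleanup step, the substitution leaves a leading space wherever a path starts with `/` and a trailing space where it ends in `.html` or `/`. That may or may not matter, depending on how the phrases are compared later.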