Created
November 16, 2019 13:22
-
-
Save Robin-Lord/8088d3968b2238a87577d8b3be750855 to your computer and use it in GitHub Desktop.
Fixing Broken Templated Links and Redirect Chains
This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
{ | |
"nbformat": 4, | |
"nbformat_minor": 0, | |
"metadata": { | |
"colab": { | |
"name": "Fixing Broken Templated Links and Redirect Chains", | |
"provenance": [], | |
"collapsed_sections": [], | |
"include_colab_link": true | |
}, | |
"kernelspec": { | |
"name": "python3", | |
"display_name": "Python 3" | |
} | |
}, | |
"cells": [ | |
{ | |
"cell_type": "markdown", | |
"metadata": { | |
"id": "view-in-github", | |
"colab_type": "text" | |
}, | |
"source": [ | |
"<a href=\"https://colab.research.google.com/gist/Robin-Lord/8088d3968b2238a87577d8b3be750855/fixing-broken-templated-links-and-redirect-chains.ipynb\" target=\"_parent\"><img src=\"https://colab.research.google.com/assets/colab-badge.svg\" alt=\"Open In Colab\"/></a>" | |
] | |
}, | |
{ | |
"cell_type": "markdown", | |
"metadata": { | |
"id": "UJvRkX5bQPrQ", | |
"colab_type": "text" | |
}, | |
"source": [ | |
"# Fixing internal redirects and redirect chains\n", | |
"\n", | |
"This notebook is designed to take exports from Screaming Frog - find instances where your site is linking to redirecting pages, as well as redirect chains, and give instructions of how to fix them.\n", | |
"\n", | |
"The most direct way to fix internal redirect problems is to fix the internal links, so we'll do that first. However - often we can't change external links, and <i>sometimes</i> it's still easier to fix redirect chains than to change certain templated links, so this notebook will identify both problems.\n", | |
"\n", | |
"For more information on <strong>why</strong> <a href=\"http://www.therobinlord.com/how-to-fix-on-site-redirects-and-why-you-should/\">we should fix redirects and broken links</a> or <a href=\"http://www.therobinlord.com/how-to-fix-broken-or-redirecting-links/\">how to read the output from this notebook</a> go to my blog posts." | |
] | |
}, | |
{ | |
"cell_type": "markdown", | |
"metadata": { | |
"id": "A846i-M36nua", | |
"colab_type": "text" | |
}, | |
"source": [ | |
"# How to use this notebook with no coding knowledge\n", | |
"\n", | |
"Below I've written out pretty detailed instructions about what each step is doing in case you want to learn. <strong>However</strong> there's no shame in just wanting to use this without learning code or more about Google Colab.\n", | |
"\n", | |
"If you just want to get the results, just do the following;\n", | |
"\n", | |
"- Open up the left-hand pane by clicking the arrow in the top-left\n", | |
"\n", | |
"- Click the \"Files\" tab\n", | |
"\n", | |
"- Click UPLOAD and upload your Screaming Frog Inlinks and Redirect Chains reports (they need to be csv files)\n", | |
"\n", | |
"- When they are uploaded, make sure they are called <strong>redirect_and_canonical_chains.csv</strong> and <strong>all_inlinks.csv</strong> respectively. If they aren't, either rename them here by right-clicking on them or rename them on your desktop and reupload.\n", | |
"\n", | |
"- In the box below this one, write out the domain of your website. If you're not sure what that means, the domain of http://www.therobinlord.com/ is just \"therobinlord.com\". \n", | |
"\n", | |
"- At the top of this window, click \"Runtime\" then \"Run All\"\n", | |
"\n", | |
"- Leave this for a little while to do its thing, it should make a <strong>\"OH YEAH\"</strong> sound when it's done but your browser might block that so just check back after a little while. When it's finished it will save two [Complete] files which you can find in the same place where you uploaded the files to begin with. \n", | |
"\n", | |
"- Download the files by right-clicking on them and selecting Download. If you want information on what the different columns mean, <a href=\"http://www.therobinlord.com/how-to-fix-broken-or-redirecting-links/\">read my blog post.</a>\n", | |
"\n" | |
] | |
}, | |
{ | |
"cell_type": "code", | |
"metadata": { | |
"id": "xHwxIoWfDOJL", | |
"colab_type": "code", | |
"colab": {} | |
}, | |
"source": [ | |
"my_domain=\"lego.com\"" | |
], | |
"execution_count": 0, | |
"outputs": [] | |
}, | |
{ | |
"cell_type": "markdown", | |
"metadata": { | |
"id": "m9iYb6d1XFEq", | |
"colab_type": "text" | |
}, | |
"source": [ | |
"## Make sure to update the cell above this one\n", | |
"\n", | |
"\n", | |
"\n", | |
"---\n", | |
"\n" | |
] | |
}, | |
{ | |
"cell_type": "markdown", | |
"metadata": { | |
"id": "jgMWgcpIZpvx", | |
"colab_type": "text" | |
}, | |
"source": [ | |
"# How to use this notebook if you want to learn\n", | |
"\n", | |
"I've written out quite a lot of explanation with each of the steps of this notebook so if you want to try to learn how this is working my advice would be to go through step-by-step, read what I've written, and then run the cell.\n", | |
"\n", | |
"If you haven't used Google Colab before, and you're not sure what to do, Google have produced a lot of helpful example notebooks which I recommend you check out. For instance <a href=\"https://colab.research.google.com/notebooks/io.ipynb#scrollTo=vz-jH8T_Uk2c\">this one</a>." | |
] | |
}, | |
{ | |
"cell_type": "markdown", | |
"metadata": { | |
"id": "caNi8PBTX24O", | |
"colab_type": "text" | |
}, | |
"source": [ | |
"# Importing libraries we need" | |
] | |
}, | |
{ | |
"cell_type": "code", | |
"metadata": { | |
"id": "DbsxLuVWQDP-", | |
"colab_type": "code", | |
"colab": {} | |
}, | |
"source": [ | |
"# The main thing we need is pandas - it handles Excel-type stuff and we're going \n", | |
"# to use it to work with our Screaming Frog exports.\n", | |
"\n", | |
"import pandas as pd\n", | |
"\n", | |
"# We also need numpy at one point to make specific recommendations based on\n", | |
"# what we've found in our sheet\n", | |
"\n", | |
"import numpy as np\n", | |
"\n", | |
"\n", | |
"# We need to use xlswriter to create an Excel file later on. Colab doesn't have\n", | |
"# it installed by default so we'll try just importing it to begin with but if\n", | |
"# that doesn't work we'll have to install then import\n", | |
"\n", | |
"try:\n", | |
" import xlsxwriter\n", | |
"except:\n", | |
" !pip install xlsxwriter\n", | |
" import xlsxwriter\n", | |
"\n", | |
"\n", | |
"# Importing time to we can measure how long this process takes\n", | |
"import time\n", | |
"start_time=time.time()\n", | |
"\n", | |
"\n", | |
"# This is so that Google Colab can play a \"success\" sound and show a success image\n", | |
"try:\n", | |
" from google.colab import output\n", | |
" from IPython.display import HTML\n", | |
"\n", | |
"except:\n", | |
" print (\"Presumably you're using this locally, process for importing the files and saving them should still work but if you run into problems with them that might be why\")" | |
], | |
"execution_count": 0, | |
"outputs": [] | |
}, | |
{ | |
"cell_type": "markdown", | |
"metadata": { | |
"id": "phBz-IVXYEVh", | |
"colab_type": "text" | |
}, | |
"source": [ | |
"# Getting our data\n", | |
"\n", | |
"You should have already run a Screaming Frog crawl of your site and exported the \"All Inlinks\" and \"Redirect Chains\" reports.\n", | |
"\n", | |
"<strong>Your file will need to be a .csv file</strong>\n", | |
"\n", | |
"Upload your files using the left-hand Google Colab pane, then write out the filenames in the cell below.\n", | |
"\n", | |
"If you are using this notebook locally in something like Jupyter, or if you're using files in your Google Drive these steps are the same, you just have to give the full file name.\n", | |
"\n", | |
"Next, run the cell below. It will use pandas to \"open\" each of the files you've told it to and call the redirect data <i>redirect_chains</i> and the inlinks data <i>inlinks</i>\n" | |
] | |
}, | |
{ | |
"cell_type": "code", | |
"metadata": { | |
"id": "p9L8Hu7BZTtT", | |
"colab_type": "code", | |
"outputId": "60c69be3-9830-4e66-9d65-a5da441d3c2f", | |
"colab": { | |
"base_uri": "https://localhost:8080/", | |
"height": 70 | |
} | |
}, | |
"source": [ | |
"redirect_chains_file=\"redirect_and_canonical_chains.csv\"\n", | |
"inlinks_file=\"all_inlinks.csv\"\n", | |
"\n", | |
"\n", | |
"# Screaming Frog has changed the format of their reports to \n", | |
"# remove the top line so we're just starting with a check\n", | |
"# of whether your reports include that top line\n", | |
"redirect_chains=pd.read_csv(redirect_chains_file, nrows=5)\n", | |
"\n", | |
"if len(list(redirect_chains))>1:\n", | |
" rows_to_skip=0\n", | |
"else:\n", | |
" rows_to_skip=1\n", | |
"\n", | |
"\n", | |
"# Now importing our data\n", | |
"redirect_chains=pd.read_csv(redirect_chains_file, skiprows=rows_to_skip)\n", | |
"inlinks=pd.read_csv(inlinks_file, skiprows=rows_to_skip)" | |
], | |
"execution_count": 14, | |
"outputs": [ | |
{ | |
"output_type": "stream", | |
"text": [ | |
"/usr/local/lib/python3.6/dist-packages/IPython/core/interactiveshell.py:2718: DtypeWarning: Columns (25,26,27,29) have mixed types. Specify dtype option on import or set low_memory=False.\n", | |
" interactivity=interactivity, compiler=compiler, result=result)\n" | |
], | |
"name": "stderr" | |
} | |
] | |
}, | |
{ | |
"cell_type": "markdown", | |
"metadata": { | |
"id": "4cZLKS60dBqA", | |
"colab_type": "text" | |
}, | |
"source": [ | |
"# Previewing our data\n", | |
"\n", | |
"Running the cells below will let you see the top five rows of each set of data so you can make sure it's all being read as you expect." | |
] | |
}, | |
{ | |
"cell_type": "code", | |
"metadata": { | |
"id": "lPpQSmC3dLAE", | |
"colab_type": "code", | |
"outputId": "360aa23e-8d6d-47cd-f58f-f1ee3c374bb0", | |
"colab": { | |
"base_uri": "https://localhost:8080/", | |
"height": 316 | |
} | |
}, | |
"source": [ | |
"# This will show us the top five rows of our inlinks data\n", | |
"inlinks.head(5)" | |
], | |
"execution_count": 15, | |
"outputs": [ | |
{ | |
"output_type": "execute_result", | |
"data": { | |
"text/html": [ | |
"<div>\n", | |
"<style scoped>\n", | |
" .dataframe tbody tr th:only-of-type {\n", | |
" vertical-align: middle;\n", | |
" }\n", | |
"\n", | |
" .dataframe tbody tr th {\n", | |
" vertical-align: top;\n", | |
" }\n", | |
"\n", | |
" .dataframe thead th {\n", | |
" text-align: right;\n", | |
" }\n", | |
"</style>\n", | |
"<table border=\"1\" class=\"dataframe\">\n", | |
" <thead>\n", | |
" <tr style=\"text-align: right;\">\n", | |
" <th></th>\n", | |
" <th>Type</th>\n", | |
" <th>Source</th>\n", | |
" <th>Destination</th>\n", | |
" <th>Size (Bytes)</th>\n", | |
" <th>Alt Text</th>\n", | |
" <th>Anchor</th>\n", | |
" <th>Status Code</th>\n", | |
" <th>Status</th>\n", | |
" <th>Follow</th>\n", | |
" </tr>\n", | |
" </thead>\n", | |
" <tbody>\n", | |
" <tr>\n", | |
" <th>0</th>\n", | |
" <td>AHREF</td>\n", | |
" <td>https://www.lego.com/en-gb</td>\n", | |
" <td>https://www.lego.com/en-gb</td>\n", | |
" <td>693502</td>\n", | |
" <td>Lego</td>\n", | |
" <td>NaN</td>\n", | |
" <td>200</td>\n", | |
" <td>OK</td>\n", | |
" <td>True</td>\n", | |
" </tr>\n", | |
" <tr>\n", | |
" <th>1</th>\n", | |
" <td>AHREF</td>\n", | |
" <td>https://www.lego.com/en-gb</td>\n", | |
" <td>https://www.lego.com/en-gb</td>\n", | |
" <td>693502</td>\n", | |
" <td>Lego</td>\n", | |
" <td>NaN</td>\n", | |
" <td>200</td>\n", | |
" <td>OK</td>\n", | |
" <td>True</td>\n", | |
" </tr>\n", | |
" <tr>\n", | |
" <th>2</th>\n", | |
" <td>AHREF</td>\n", | |
" <td>https://www.lego.com/en-gb/page/static/gift-ca...</td>\n", | |
" <td>https://www.lego.com/en-gb</td>\n", | |
" <td>693502</td>\n", | |
" <td>Lego</td>\n", | |
" <td>NaN</td>\n", | |
" <td>200</td>\n", | |
" <td>OK</td>\n", | |
" <td>True</td>\n", | |
" </tr>\n", | |
" <tr>\n", | |
" <th>3</th>\n", | |
" <td>AHREF</td>\n", | |
" <td>https://www.lego.com/en-gb/page/static/gift-ca...</td>\n", | |
" <td>https://www.lego.com/en-gb</td>\n", | |
" <td>693502</td>\n", | |
" <td>Lego</td>\n", | |
" <td>NaN</td>\n", | |
" <td>200</td>\n", | |
" <td>OK</td>\n", | |
" <td>True</td>\n", | |
" </tr>\n", | |
" <tr>\n", | |
" <th>4</th>\n", | |
" <td>AHREF</td>\n", | |
" <td>https://www.lego.com/en-gb/page/static/gift-ca...</td>\n", | |
" <td>https://www.lego.com/en-gb</td>\n", | |
" <td>693502</td>\n", | |
" <td>NaN</td>\n", | |
" <td>Home</td>\n", | |
" <td>200</td>\n", | |
" <td>OK</td>\n", | |
" <td>True</td>\n", | |
" </tr>\n", | |
" </tbody>\n", | |
"</table>\n", | |
"</div>" | |
], | |
"text/plain": [ | |
" Type Source ... Status Follow\n", | |
"0 AHREF https://www.lego.com/en-gb ... OK True\n", | |
"1 AHREF https://www.lego.com/en-gb ... OK True\n", | |
"2 AHREF https://www.lego.com/en-gb/page/static/gift-ca... ... OK True\n", | |
"3 AHREF https://www.lego.com/en-gb/page/static/gift-ca... ... OK True\n", | |
"4 AHREF https://www.lego.com/en-gb/page/static/gift-ca... ... OK True\n", | |
"\n", | |
"[5 rows x 9 columns]" | |
] | |
}, | |
"metadata": { | |
"tags": [] | |
}, | |
"execution_count": 15 | |
} | |
] | |
}, | |
{ | |
"cell_type": "code", | |
"metadata": { | |
"id": "SJLSOprddL_q", | |
"colab_type": "code", | |
"outputId": "c9a81019-ec85-4dff-9b5d-f61bcb415dd3", | |
"colab": { | |
"base_uri": "https://localhost:8080/", | |
"height": 417 | |
} | |
}, | |
"source": [ | |
"# This will show us the top five rows of our redirect chains data\n", | |
"redirect_chains.head(5)" | |
], | |
"execution_count": 16, | |
"outputs": [ | |
{ | |
"output_type": "execute_result", | |
"data": { | |
"text/html": [ | |
"<div>\n", | |
"<style scoped>\n", | |
" .dataframe tbody tr th:only-of-type {\n", | |
" vertical-align: middle;\n", | |
" }\n", | |
"\n", | |
" .dataframe tbody tr th {\n", | |
" vertical-align: top;\n", | |
" }\n", | |
"\n", | |
" .dataframe thead th {\n", | |
" text-align: right;\n", | |
" }\n", | |
"</style>\n", | |
"<table border=\"1\" class=\"dataframe\">\n", | |
" <thead>\n", | |
" <tr style=\"text-align: right;\">\n", | |
" <th></th>\n", | |
" <th>Chain Type</th>\n", | |
" <th>Number of Redirects</th>\n", | |
" <th>Redirect Loop</th>\n", | |
" <th>Source</th>\n", | |
" <th>Temp Redirect In Chain</th>\n", | |
" <th>Address</th>\n", | |
" <th>Final Address</th>\n", | |
" <th>Indexability</th>\n", | |
" <th>Indexability Status</th>\n", | |
" <th>Final Content</th>\n", | |
" <th>Final Status Code</th>\n", | |
" <th>Final Status</th>\n", | |
" <th>Content 1</th>\n", | |
" <th>Status Code 1</th>\n", | |
" <th>Status 1</th>\n", | |
" <th>Redirect Type 1</th>\n", | |
" <th>Redirect URL 1</th>\n", | |
" <th>Content 2</th>\n", | |
" <th>Status Code 2</th>\n", | |
" <th>Status 2</th>\n", | |
" <th>Redirect Type 2</th>\n", | |
" <th>Redirect URL 2</th>\n", | |
" <th>Content 3</th>\n", | |
" <th>Status Code 3</th>\n", | |
" <th>Status 3</th>\n", | |
" <th>Redirect Type 3</th>\n", | |
" <th>Redirect URL 3</th>\n", | |
" <th>Content 4</th>\n", | |
" <th>Status Code 4</th>\n", | |
" <th>Status 4</th>\n", | |
" <th>Redirect Type 4</th>\n", | |
" <th>Redirect URL 4</th>\n", | |
" <th>Content 5</th>\n", | |
" <th>Status Code 5</th>\n", | |
" <th>Status 5</th>\n", | |
" <th>Redirect Type 5</th>\n", | |
" <th>Redirect URL 5</th>\n", | |
" <th>Content 6</th>\n", | |
" <th>Status Code 6</th>\n", | |
" <th>Status 6</th>\n", | |
" <th>Redirect Type 6</th>\n", | |
" <th>Redirect URL 6</th>\n", | |
" <th>Content 7</th>\n", | |
" <th>Status Code 7</th>\n", | |
" <th>Status 7</th>\n", | |
" <th>Redirect Type 7</th>\n", | |
" <th>Redirect URL 7</th>\n", | |
" <th>Content 8</th>\n", | |
" <th>Status Code 8</th>\n", | |
" <th>Status 8</th>\n", | |
" <th>Redirect Type 8</th>\n", | |
" <th>Redirect URL 8</th>\n", | |
" <th>Content 9</th>\n", | |
" <th>Status Code 9</th>\n", | |
" <th>Status 9</th>\n", | |
" <th>Redirect Type 9</th>\n", | |
" <th>Redirect URL 9</th>\n", | |
" <th>Content 10</th>\n", | |
" <th>Status Code 10</th>\n", | |
" <th>Status 10</th>\n", | |
" <th>Redirect Type 10</th>\n", | |
" <th>Redirect URL 10</th>\n", | |
" </tr>\n", | |
" </thead>\n", | |
" <tbody>\n", | |
" <tr>\n", | |
" <th>0</th>\n", | |
" <td>HTTP Redirect</td>\n", | |
" <td>1</td>\n", | |
" <td>False</td>\n", | |
" <td>https://www.lego.com/en-gb</td>\n", | |
" <td>False</td>\n", | |
" <td>https://www.lego.com/en-gb/catalogues</td>\n", | |
" <td>https://www.lego.com/en-gb/page/static/catalogues</td>\n", | |
" <td>Indexable</td>\n", | |
" <td>NaN</td>\n", | |
" <td>text/html</td>\n", | |
" <td>200.0</td>\n", | |
" <td>OK</td>\n", | |
" <td>text/html; charset=utf-8</td>\n", | |
" <td>301</td>\n", | |
" <td>Moved Permanently</td>\n", | |
" <td>HTTP Redirect</td>\n", | |
" <td>https://www.lego.com/en-gb/page/static/catalogues</td>\n", | |
" <td>text/html</td>\n", | |
" <td>200.0</td>\n", | |
" <td>OK</td>\n", | |
" <td>NaN</td>\n", | |
" <td>NaN</td>\n", | |
" <td>NaN</td>\n", | |
" <td>NaN</td>\n", | |
" <td>NaN</td>\n", | |
" <td>NaN</td>\n", | |
" <td>NaN</td>\n", | |
" <td>NaN</td>\n", | |
" <td>NaN</td>\n", | |
" <td>NaN</td>\n", | |
" <td>NaN</td>\n", | |
" <td>NaN</td>\n", | |
" <td>NaN</td>\n", | |
" <td>NaN</td>\n", | |
" <td>NaN</td>\n", | |
" <td>NaN</td>\n", | |
" <td>NaN</td>\n", | |
" <td>NaN</td>\n", | |
" <td>NaN</td>\n", | |
" <td>NaN</td>\n", | |
" <td>NaN</td>\n", | |
" <td>NaN</td>\n", | |
" <td>NaN</td>\n", | |
" <td>NaN</td>\n", | |
" <td>NaN</td>\n", | |
" <td>NaN</td>\n", | |
" <td>NaN</td>\n", | |
" <td>NaN</td>\n", | |
" <td>NaN</td>\n", | |
" <td>NaN</td>\n", | |
" <td>NaN</td>\n", | |
" <td>NaN</td>\n", | |
" <td>NaN</td>\n", | |
" <td>NaN</td>\n", | |
" <td>NaN</td>\n", | |
" <td>NaN</td>\n", | |
" <td>NaN</td>\n", | |
" <td>NaN</td>\n", | |
" <td>NaN</td>\n", | |
" <td>NaN</td>\n", | |
" <td>NaN</td>\n", | |
" <td>NaN</td>\n", | |
" </tr>\n", | |
" <tr>\n", | |
" <th>1</th>\n", | |
" <td>HTTP Redirect</td>\n", | |
" <td>1</td>\n", | |
" <td>False</td>\n", | |
" <td>https://www.lego.com/en-gb</td>\n", | |
" <td>False</td>\n", | |
" <td>https://www.lego.com/en-gb/catalogues</td>\n", | |
" <td>https://www.lego.com/en-gb/page/static/catalogues</td>\n", | |
" <td>Indexable</td>\n", | |
" <td>NaN</td>\n", | |
" <td>text/html</td>\n", | |
" <td>200.0</td>\n", | |
" <td>OK</td>\n", | |
" <td>text/html; charset=utf-8</td>\n", | |
" <td>301</td>\n", | |
" <td>Moved Permanently</td>\n", | |
" <td>HTTP Redirect</td>\n", | |
" <td>https://www.lego.com/en-gb/page/static/catalogues</td>\n", | |
" <td>text/html</td>\n", | |
" <td>200.0</td>\n", | |
" <td>OK</td>\n", | |
" <td>NaN</td>\n", | |
" <td>NaN</td>\n", | |
" <td>NaN</td>\n", | |
" <td>NaN</td>\n", | |
" <td>NaN</td>\n", | |
" <td>NaN</td>\n", | |
" <td>NaN</td>\n", | |
" <td>NaN</td>\n", | |
" <td>NaN</td>\n", | |
" <td>NaN</td>\n", | |
" <td>NaN</td>\n", | |
" <td>NaN</td>\n", | |
" <td>NaN</td>\n", | |
" <td>NaN</td>\n", | |
" <td>NaN</td>\n", | |
" <td>NaN</td>\n", | |
" <td>NaN</td>\n", | |
" <td>NaN</td>\n", | |
" <td>NaN</td>\n", | |
" <td>NaN</td>\n", | |
" <td>NaN</td>\n", | |
" <td>NaN</td>\n", | |
" <td>NaN</td>\n", | |
" <td>NaN</td>\n", | |
" <td>NaN</td>\n", | |
" <td>NaN</td>\n", | |
" <td>NaN</td>\n", | |
" <td>NaN</td>\n", | |
" <td>NaN</td>\n", | |
" <td>NaN</td>\n", | |
" <td>NaN</td>\n", | |
" <td>NaN</td>\n", | |
" <td>NaN</td>\n", | |
" <td>NaN</td>\n", | |
" <td>NaN</td>\n", | |
" <td>NaN</td>\n", | |
" <td>NaN</td>\n", | |
" <td>NaN</td>\n", | |
" <td>NaN</td>\n", | |
" <td>NaN</td>\n", | |
" <td>NaN</td>\n", | |
" <td>NaN</td>\n", | |
" </tr>\n", | |
" <tr>\n", | |
" <th>2</th>\n", | |
" <td>HTTP Redirect</td>\n", | |
" <td>1</td>\n", | |
" <td>False</td>\n", | |
" <td>https://www.lego.com/en-gb/page/static/gift-ca...</td>\n", | |
" <td>False</td>\n", | |
" <td>https://www.lego.com/en-gb/catalogues</td>\n", | |
" <td>https://www.lego.com/en-gb/page/static/catalogues</td>\n", | |
" <td>Indexable</td>\n", | |
" <td>NaN</td>\n", | |
" <td>text/html</td>\n", | |
" <td>200.0</td>\n", | |
" <td>OK</td>\n", | |
" <td>text/html; charset=utf-8</td>\n", | |
" <td>301</td>\n", | |
" <td>Moved Permanently</td>\n", | |
" <td>HTTP Redirect</td>\n", | |
" <td>https://www.lego.com/en-gb/page/static/catalogues</td>\n", | |
" <td>text/html</td>\n", | |
" <td>200.0</td>\n", | |
" <td>OK</td>\n", | |
" <td>NaN</td>\n", | |
" <td>NaN</td>\n", | |
" <td>NaN</td>\n", | |
" <td>NaN</td>\n", | |
" <td>NaN</td>\n", | |
" <td>NaN</td>\n", | |
" <td>NaN</td>\n", | |
" <td>NaN</td>\n", | |
" <td>NaN</td>\n", | |
" <td>NaN</td>\n", | |
" <td>NaN</td>\n", | |
" <td>NaN</td>\n", | |
" <td>NaN</td>\n", | |
" <td>NaN</td>\n", | |
" <td>NaN</td>\n", | |
" <td>NaN</td>\n", | |
" <td>NaN</td>\n", | |
" <td>NaN</td>\n", | |
" <td>NaN</td>\n", | |
" <td>NaN</td>\n", | |
" <td>NaN</td>\n", | |
" <td>NaN</td>\n", | |
" <td>NaN</td>\n", | |
" <td>NaN</td>\n", | |
" <td>NaN</td>\n", | |
" <td>NaN</td>\n", | |
" <td>NaN</td>\n", | |
" <td>NaN</td>\n", | |
" <td>NaN</td>\n", | |
" <td>NaN</td>\n", | |
" <td>NaN</td>\n", | |
" <td>NaN</td>\n", | |
" <td>NaN</td>\n", | |
" <td>NaN</td>\n", | |
" <td>NaN</td>\n", | |
" <td>NaN</td>\n", | |
" <td>NaN</td>\n", | |
" <td>NaN</td>\n", | |
" <td>NaN</td>\n", | |
" <td>NaN</td>\n", | |
" <td>NaN</td>\n", | |
" <td>NaN</td>\n", | |
" </tr>\n", | |
" <tr>\n", | |
" <th>3</th>\n", | |
" <td>HTTP Redirect</td>\n", | |
" <td>1</td>\n", | |
" <td>False</td>\n", | |
" <td>https://www.lego.com/en-gb/page/static/gift-ca...</td>\n", | |
" <td>False</td>\n", | |
" <td>https://www.lego.com/en-gb/catalogues</td>\n", | |
" <td>https://www.lego.com/en-gb/page/static/catalogues</td>\n", | |
" <td>Indexable</td>\n", | |
" <td>NaN</td>\n", | |
" <td>text/html</td>\n", | |
" <td>200.0</td>\n", | |
" <td>OK</td>\n", | |
" <td>text/html; charset=utf-8</td>\n", | |
" <td>301</td>\n", | |
" <td>Moved Permanently</td>\n", | |
" <td>HTTP Redirect</td>\n", | |
" <td>https://www.lego.com/en-gb/page/static/catalogues</td>\n", | |
" <td>text/html</td>\n", | |
" <td>200.0</td>\n", | |
" <td>OK</td>\n", | |
" <td>NaN</td>\n", | |
" <td>NaN</td>\n", | |
" <td>NaN</td>\n", | |
" <td>NaN</td>\n", | |
" <td>NaN</td>\n", | |
" <td>NaN</td>\n", | |
" <td>NaN</td>\n", | |
" <td>NaN</td>\n", | |
" <td>NaN</td>\n", | |
" <td>NaN</td>\n", | |
" <td>NaN</td>\n", | |
" <td>NaN</td>\n", | |
" <td>NaN</td>\n", | |
" <td>NaN</td>\n", | |
" <td>NaN</td>\n", | |
" <td>NaN</td>\n", | |
" <td>NaN</td>\n", | |
" <td>NaN</td>\n", | |
" <td>NaN</td>\n", | |
" <td>NaN</td>\n", | |
" <td>NaN</td>\n", | |
" <td>NaN</td>\n", | |
" <td>NaN</td>\n", | |
" <td>NaN</td>\n", | |
" <td>NaN</td>\n", | |
" <td>NaN</td>\n", | |
" <td>NaN</td>\n", | |
" <td>NaN</td>\n", | |
" <td>NaN</td>\n", | |
" <td>NaN</td>\n", | |
" <td>NaN</td>\n", | |
" <td>NaN</td>\n", | |
" <td>NaN</td>\n", | |
" <td>NaN</td>\n", | |
" <td>NaN</td>\n", | |
" <td>NaN</td>\n", | |
" <td>NaN</td>\n", | |
" <td>NaN</td>\n", | |
" <td>NaN</td>\n", | |
" <td>NaN</td>\n", | |
" <td>NaN</td>\n", | |
" <td>NaN</td>\n", | |
" </tr>\n", | |
" <tr>\n", | |
" <th>4</th>\n", | |
" <td>HTTP Redirect</td>\n", | |
" <td>1</td>\n", | |
" <td>False</td>\n", | |
" <td>https://www.lego.com/en-gb/themes/duplo</td>\n", | |
" <td>False</td>\n", | |
" <td>https://www.lego.com/en-gb/catalogues</td>\n", | |
" <td>https://www.lego.com/en-gb/page/static/catalogues</td>\n", | |
" <td>Indexable</td>\n", | |
" <td>NaN</td>\n", | |
" <td>text/html</td>\n", | |
" <td>200.0</td>\n", | |
" <td>OK</td>\n", | |
" <td>text/html; charset=utf-8</td>\n", | |
" <td>301</td>\n", | |
" <td>Moved Permanently</td>\n", | |
" <td>HTTP Redirect</td>\n", | |
" <td>https://www.lego.com/en-gb/page/static/catalogues</td>\n", | |
" <td>text/html</td>\n", | |
" <td>200.0</td>\n", | |
" <td>OK</td>\n", | |
" <td>NaN</td>\n", | |
" <td>NaN</td>\n", | |
" <td>NaN</td>\n", | |
" <td>NaN</td>\n", | |
" <td>NaN</td>\n", | |
" <td>NaN</td>\n", | |
" <td>NaN</td>\n", | |
" <td>NaN</td>\n", | |
" <td>NaN</td>\n", | |
" <td>NaN</td>\n", | |
" <td>NaN</td>\n", | |
" <td>NaN</td>\n", | |
" <td>NaN</td>\n", | |
" <td>NaN</td>\n", | |
" <td>NaN</td>\n", | |
" <td>NaN</td>\n", | |
" <td>NaN</td>\n", | |
" <td>NaN</td>\n", | |
" <td>NaN</td>\n", | |
" <td>NaN</td>\n", | |
" <td>NaN</td>\n", | |
" <td>NaN</td>\n", | |
" <td>NaN</td>\n", | |
" <td>NaN</td>\n", | |
" <td>NaN</td>\n", | |
" <td>NaN</td>\n", | |
" <td>NaN</td>\n", | |
" <td>NaN</td>\n", | |
" <td>NaN</td>\n", | |
" <td>NaN</td>\n", | |
" <td>NaN</td>\n", | |
" <td>NaN</td>\n", | |
" <td>NaN</td>\n", | |
" <td>NaN</td>\n", | |
" <td>NaN</td>\n", | |
" <td>NaN</td>\n", | |
" <td>NaN</td>\n", | |
" <td>NaN</td>\n", | |
" <td>NaN</td>\n", | |
" <td>NaN</td>\n", | |
" <td>NaN</td>\n", | |
" <td>NaN</td>\n", | |
" </tr>\n", | |
" </tbody>\n", | |
"</table>\n", | |
"</div>" | |
], | |
"text/plain": [ | |
" Chain Type Number of Redirects ... Redirect Type 10 Redirect URL 10\n", | |
"0 HTTP Redirect 1 ... NaN NaN\n", | |
"1 HTTP Redirect 1 ... NaN NaN\n", | |
"2 HTTP Redirect 1 ... NaN NaN\n", | |
"3 HTTP Redirect 1 ... NaN NaN\n", | |
"4 HTTP Redirect 1 ... NaN NaN\n", | |
"\n", | |
"[5 rows x 62 columns]" | |
] | |
}, | |
"metadata": { | |
"tags": [] | |
}, | |
"execution_count": 16 | |
} | |
] | |
}, | |
{ | |
"cell_type": "markdown", | |
"metadata": { | |
"id": "7AO2IzVXcL3D", | |
"colab_type": "text" | |
}, | |
"source": [ | |
"# Filtering the data to just our site\n", | |
"We don't care so much about us linking to 301s on other sites so we start by getting only the links which are going to our own pages.\n", | |
"\n", | |
"First write out your own domain below, then run the cell." | |
] | |
}, | |
{ | |
"cell_type": "code", | |
"metadata": { | |
"id": "YdWXNdzJcpdZ", | |
"colab_type": "code", | |
"colab": {} | |
}, | |
"source": [ | |
"inlinks_data=inlinks[inlinks[\"Destination\"].str.contains(my_domain)]" | |
], | |
"execution_count": 0, | |
"outputs": [] | |
}, | |
{ | |
"cell_type": "markdown", | |
"metadata": { | |
"id": "eTkpeHSjc_GN", | |
"colab_type": "text" | |
}, | |
"source": [ | |
"# Finding number of links broken down by status code\n", | |
"Now we group our inlinks data by status code to get an idea of how often we are linking to different status codes. If you see a lot of links to numbers that aren't \"200\" then that's something you need to look into.\n", | |
"\n", | |
"First we filter the data to just the status code and page being linked to then we group by status code and count the total number of rows. Then we sort from high to low so we see the most common link types first." | |
] | |
}, | |
{ | |
"cell_type": "code", | |
"metadata": { | |
"id": "0bFGZS5XdlM0", | |
"colab_type": "code", | |
"outputId": "3c6153b2-1f34-4c8c-8ac6-af0df12185bc", | |
"colab": { | |
"base_uri": "https://localhost:8080/", | |
"height": 314 | |
} | |
}, | |
"source": [ | |
"discovered_pages_by_status_code=inlinks_data[[\"Destination\", \"Status Code\"]].groupby(\"Status Code\").count().rename(columns={\"Destination\":\"Number of links\"}).sort_values(\"Number of links\", ascending=False)\n", | |
"discovered_pages_by_status_code" | |
], | |
"execution_count": 18, | |
"outputs": [ | |
{ | |
"output_type": "execute_result", | |
"data": { | |
"text/html": [ | |
"<div>\n", | |
"<style scoped>\n", | |
" .dataframe tbody tr th:only-of-type {\n", | |
" vertical-align: middle;\n", | |
" }\n", | |
"\n", | |
" .dataframe tbody tr th {\n", | |
" vertical-align: top;\n", | |
" }\n", | |
"\n", | |
" .dataframe thead th {\n", | |
" text-align: right;\n", | |
" }\n", | |
"</style>\n", | |
"<table border=\"1\" class=\"dataframe\">\n", | |
" <thead>\n", | |
" <tr style=\"text-align: right;\">\n", | |
" <th></th>\n", | |
" <th>Number of links</th>\n", | |
" </tr>\n", | |
" <tr>\n", | |
" <th>Status Code</th>\n", | |
" <th></th>\n", | |
" </tr>\n", | |
" </thead>\n", | |
" <tbody>\n", | |
" <tr>\n", | |
" <th>200</th>\n", | |
" <td>348031</td>\n", | |
" </tr>\n", | |
" <tr>\n", | |
" <th>301</th>\n", | |
" <td>20762</td>\n", | |
" </tr>\n", | |
" <tr>\n", | |
" <th>303</th>\n", | |
" <td>7183</td>\n", | |
" </tr>\n", | |
" <tr>\n", | |
" <th>0</th>\n", | |
" <td>6599</td>\n", | |
" </tr>\n", | |
" <tr>\n", | |
" <th>307</th>\n", | |
" <td>1062</td>\n", | |
" </tr>\n", | |
" <tr>\n", | |
" <th>302</th>\n", | |
" <td>638</td>\n", | |
" </tr>\n", | |
" <tr>\n", | |
" <th>404</th>\n", | |
" <td>23</td>\n", | |
" </tr>\n", | |
" <tr>\n", | |
" <th>500</th>\n", | |
" <td>6</td>\n", | |
" </tr>\n", | |
" </tbody>\n", | |
"</table>\n", | |
"</div>" | |
], | |
"text/plain": [ | |
" Number of links\n", | |
"Status Code \n", | |
"200 348031\n", | |
"301 20762\n", | |
"303 7183\n", | |
"0 6599\n", | |
"307 1062\n", | |
"302 638\n", | |
"404 23\n", | |
"500 6" | |
] | |
}, | |
"metadata": { | |
"tags": [] | |
}, | |
"execution_count": 18 | |
} | |
] | |
}, | |
{ | |
"cell_type": "markdown", | |
"metadata": { | |
"id": "WgIbSALrh7ol", | |
"colab_type": "text" | |
}, | |
"source": [ | |
"# Filter to just the status codes we care about\n", | |
"\n", | |
"Now that we've seen the breakdown, we're going to remove the common status codes whic are nothing to worry about. i.e. we don't care about links to code 200 pages, we also don't care about things like 307 redirects because they are normally a http->https redirect working properly." | |
] | |
}, | |
{ | |
"cell_type": "code", | |
"metadata": { | |
"id": "l-xaJnz7iD4x", | |
"colab_type": "code", | |
"colab": {} | |
}, | |
"source": [ | |
"inlinks_data=inlinks_data[(inlinks_data[\"Status Code\"]!=200)&(inlinks_data[\"Status Code\"]!=0)]" | |
], | |
"execution_count": 0, | |
"outputs": [] | |
}, | |
{ | |
"cell_type": "markdown", | |
"metadata": { | |
"id": "19GMk3ocdxn9", | |
"colab_type": "text" | |
}, | |
"source": [ | |
"# Finding patterns in broken internal links\n", | |
"\n", | |
"Now we take our data and this time group by <i>Destination, Anchor text, and Status Code.</i>\n", | |
"\n", | |
"The reason we do this is templated links will always be pointing to the same page (Destination) and will use the same text in the link (Anchor text). This way we can identify patterns in internal links to pages which aren't code 200.\n", | |
"\n", | |
"We've sorted by our inlinks count from highest to lowest so in our final output the prime offenders will be easily identified." | |
] | |
}, | |
{ | |
"cell_type": "code", | |
"metadata": { | |
"id": "Iq1L9o3Cdo_z", | |
"colab_type": "code", | |
"outputId": "a6415a95-0288-46ca-c95e-c59ff84defe1", | |
"colab": { | |
"base_uri": "https://localhost:8080/", | |
"height": 330 | |
} | |
}, | |
"source": [ | |
"# Giving ourselves a column where the count is always 1, for consistency\n", | |
"inlinks_data[\"Number of Linking Pages\"]=1\n", | |
"\n", | |
"# Filter the table to just the columns we need. We are going to use \"Source Count\"\n", | |
"# As the number of inlinks and we'll take an example url from the total list of \n", | |
"# sources using \"min\" so that we have a page to go check.\n", | |
"count_by_link_format=inlinks_data[[\"Source\", \"Destination\", \"Anchor\", \"Number of Linking Pages\", \"Status Code\"]].groupby([\"Destination\", \"Anchor\", \"Status Code\"]).agg({\"Number of Linking Pages\":\"count\",\"Source\":\"min\"}).reset_index().sort_values(\"Number of Linking Pages\", ascending=False)\n", | |
"count_by_link_format=count_by_link_format.rename(columns={\"Source\":\"Example Source Page\"})\n", | |
"count_by_link_format.head(5)" | |
], | |
"execution_count": 20, | |
"outputs": [ | |
{ | |
"output_type": "execute_result", | |
"data": { | |
"text/html": [ | |
"<div>\n", | |
"<style scoped>\n", | |
" .dataframe tbody tr th:only-of-type {\n", | |
" vertical-align: middle;\n", | |
" }\n", | |
"\n", | |
" .dataframe tbody tr th {\n", | |
" vertical-align: top;\n", | |
" }\n", | |
"\n", | |
" .dataframe thead th {\n", | |
" text-align: right;\n", | |
" }\n", | |
"</style>\n", | |
"<table border=\"1\" class=\"dataframe\">\n", | |
" <thead>\n", | |
" <tr style=\"text-align: right;\">\n", | |
" <th></th>\n", | |
" <th>Destination</th>\n", | |
" <th>Anchor</th>\n", | |
" <th>Status Code</th>\n", | |
" <th>Number of Linking Pages</th>\n", | |
" <th>Example Source Page</th>\n", | |
" </tr>\n", | |
" </thead>\n", | |
" <tbody>\n", | |
" <tr>\n", | |
" <th>84</th>\n", | |
" <td>https://www.lego.com/life</td>\n", | |
" <td>LEGO® LIFE</td>\n", | |
" <td>303</td>\n", | |
" <td>3179</td>\n", | |
" <td>https://www.lego.com/en-gb</td>\n", | |
" </tr>\n", | |
" <tr>\n", | |
" <th>106</th>\n", | |
" <td>https://www.lego.com/stores</td>\n", | |
" <td>Find a store</td>\n", | |
" <td>303</td>\n", | |
" <td>2120</td>\n", | |
" <td>https://www.lego.com/en-gb</td>\n", | |
" </tr>\n", | |
" <tr>\n", | |
" <th>79</th>\n", | |
" <td>https://www.lego.com/en-gb/service/help-topics</td>\n", | |
" <td>Common Questions</td>\n", | |
" <td>301</td>\n", | |
" <td>2110</td>\n", | |
" <td>https://www.lego.com/en-gb</td>\n", | |
" </tr>\n", | |
" <tr>\n", | |
" <th>75</th>\n", | |
" <td>https://www.lego.com/en-gb/liebherr-r-9800-exc...</td>\n", | |
" <td>Liebherr R 9800 Excavator</td>\n", | |
" <td>301</td>\n", | |
" <td>2110</td>\n", | |
" <td>https://www.lego.com/en-gb</td>\n", | |
" </tr>\n", | |
" <tr>\n", | |
" <th>61</th>\n", | |
" <td>https://www.lego.com/en-gb/dinosaur-fossils-21320</td>\n", | |
" <td>Dinosaur Fossils</td>\n", | |
" <td>301</td>\n", | |
" <td>2110</td>\n", | |
" <td>https://www.lego.com/en-gb</td>\n", | |
" </tr>\n", | |
" </tbody>\n", | |
"</table>\n", | |
"</div>" | |
], | |
"text/plain": [ | |
" Destination ... Example Source Page\n", | |
"84 https://www.lego.com/life ... https://www.lego.com/en-gb\n", | |
"106 https://www.lego.com/stores ... https://www.lego.com/en-gb\n", | |
"79 https://www.lego.com/en-gb/service/help-topics ... https://www.lego.com/en-gb\n", | |
"75 https://www.lego.com/en-gb/liebherr-r-9800-exc... ... https://www.lego.com/en-gb\n", | |
"61 https://www.lego.com/en-gb/dinosaur-fossils-21320 ... https://www.lego.com/en-gb\n", | |
"\n", | |
"[5 rows x 5 columns]" | |
] | |
}, | |
"metadata": { | |
"tags": [] | |
}, | |
"execution_count": 20 | |
} | |
] | |
}, | |
{ | |
"cell_type": "markdown", | |
"metadata": { | |
"id": "cFW1U2PKh1MN", | |
"colab_type": "text" | |
}, | |
"source": [ | |
"# Identifying where we should change those links to\n", | |
"\n", | |
"Now we know what links are wrong, we want to identify where those pages should be redirected to. We won't be able to do that with 404 pages but we <strong>can</strong> do that with redirected pages by merging this data with our redirect chains report but first we need to <strong>melt</strong> our redirect chains report.\n", | |
"\n", | |
"We can think of <strong>melt</strong> like <strong>unpivot</strong>. For more information on how <strong>melt</strong> works and why I think it's awesome, check out <a href=\"http://www.therobinlord.com/why-melt-unpivot-is-the-most-powerful-function-in-pandas/\"> this post</a>.\n", | |
"\n", | |
"In brief, what we're doing here is taking all of the url columns in this sheet and stacking them one on top of the other so we only have to check one column instead of, like, 20.\n" | |
] | |
}, | |
{ | |
"cell_type": "code", | |
"metadata": { | |
"id": "MpaFkKMziwdQ", | |
"colab_type": "code", | |
"colab": {} | |
}, | |
"source": [ | |
"# The redirect chains report is spread out over a number of columns based on \n", | |
"# how many steps there are in a chain, so first we need to identify all of\n", | |
"# the columns which contain urls - we do that with this list comprehension.\n", | |
"address_columns=[col for col in list(redirect_chains) if 'Address'in col or 'URL' in col]\n", | |
"filter_columns=[\"Number of Redirects\", \"Temp Redirect In Chain\", \"Final Status Code\", \"Redirect Loop\"]+address_columns\n", | |
"\n", | |
"# Then we use \"melt\" which I've explained above\n", | |
"chain_melted=pd.melt(redirect_chains[filter_columns], id_vars=[\"Final Address\",\"Number of Redirects\", \"Final Status Code\", \"Redirect Loop\", \"Temp Redirect In Chain\"])\n", | |
"chain_melted=chain_melted.drop_duplicates(\"value\")\n", | |
"chain_melted=chain_melted.rename(columns={\"value\":\"Destination\"})\n", | |
"\n", | |
"# Removing all the rows where the url we're looking at is actually the final url\n", | |
"chain_melted=chain_melted[(chain_melted[\"Destination\"]!=chain_melted[\"Final Address\"])]" | |
], | |
"execution_count": 0, | |
"outputs": [] | |
}, | |
{ | |
"cell_type": "markdown", | |
"metadata": { | |
"id": "-TwqqqUNExMC", | |
"colab_type": "text" | |
}, | |
"source": [ | |
"Next we combine our redirect chains report with the inlinks report we've been using. To do this, we use the pandas function <strong>merge</strong>. <strong>merge</strong> is like Excel's vlookup on absolutely legal performance enhancers which would still ruin an Olympian's career.\n", | |
"\n", | |
"When we use <merge> below we'll take our inlinks file and say, whenever the Destination column matches the Destination column in our newly melted redirect chains sheet, add the data from the redirects sheet to the inlinks sheet.\n", | |
"\n", | |
"What we end up with should be very similar to if we manually did vlookup on each of the rows of our inlinks sheet, only difference is we get to do this just once and it's <i>super</i> quick in comparison.\n", | |
"\n", | |
"So here we merge count_by_link_format with chain_melted but we're only taking the \"Final Address\" and \"Destination\" columns from chain_melted. So wherever the inlinks report and the redirects report are talking about the same page, we'll add in where that page eventually redirects to into our inlinks sheet.\n", | |
"\n", | |
"You can find more details about merge in the <a href=\"https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.merge.html\">pandas documentation</a> but we choose a \"left\" merge because we want to keep all of our inlinks, not only the ones which are in the redirects report (because there's things like 404s in there too." | |
] | |
}, | |
{ | |
"cell_type": "code", | |
"metadata": { | |
"id": "IDPXA6zYEptA", | |
"colab_type": "code", | |
"colab": {} | |
}, | |
"source": [ | |
"# Merge our count_by_link_format sheet with our chain_melted sheet\n", | |
"inlinks_and_redirects=count_by_link_format.merge(chain_melted[[\"Final Address\", \"Destination\", \"Redirect Loop\", \"Temp Redirect In Chain\", \"Final Status Code\", \"Number of Redirects\"]], how=\"left\", on=\"Destination\")\n", | |
"\n", | |
"\n", | |
"# Make sure that when the merged columns are empty, they are filled in a \n", | |
"# consistent way we can check\n", | |
"\n", | |
"inlinks_and_redirects[\"Final Address\"]=inlinks_and_redirects[\n", | |
" \"Final Address\"].fillna(\"\")\n" | |
], | |
"execution_count": 0, | |
"outputs": [] | |
}, | |
{ | |
"cell_type": "markdown", | |
"metadata": { | |
"id": "g0NoRnxsAxCL", | |
"colab_type": "text" | |
}, | |
"source": [ | |
"# Defining our recommendations\n", | |
"\n", | |
"Next we write out our recommendations based on what we find in various columns of our file.\n", | |
"\n", | |
"We'll use numpy <strong>where</strong>. If <strong>merge</strong> is like a better version of vlookup, <strong>where</strong> is like a better version of Excel's =if() formula.\n", | |
"\n", | |
"Each time we use np.where, inside the brackets we write;\n", | |
"\n", | |
"({our condition},{what to do if it's true}, {what to do if it's false})\n", | |
"\n", | |
"In this case, we're working in the same column multiple times so if the condition is true we need to change the value in that column, otherwise we want to keep the value of that column the same.\n", | |
"\n", | |
"When we are overwriting the same column multiple times, we need to start with the <i>most common</i> and <i>least problematic</i> things first, then focus on the more specific and more problematic things so that our output will focus on the biggest issues.\n", | |
"\n", | |
"When we are using any conditions in Pandas, we can nest those conditions within brackets, we can use & to mean \"both these things\" and | to mean \"either of these things\". So for example the below means both <strong>a <i>and</i> b</strong> have to be true <i>or</i> <strong>c</strong> has to be true.\n", | |
"\n", | |
"```\n", | |
"((a & b)|c)\n", | |
"```\n" | |
] | |
}, | |
{ | |
"cell_type": "code", | |
"metadata": { | |
"id": "B0_iM9trAnwZ", | |
"colab_type": "code", | |
"colab": {} | |
}, | |
"source": [ | |
"# First we create our Recommendation column, the default is going to be\n", | |
"# \"Update link\" because if we find a reason we SHOULDN'T do\n", | |
"# that then we'll be changing our recommendation\n", | |
"inlinks_and_redirects[\"Recommendation\"]=\"Update link\"\n", | |
"\n", | |
"# Next we highlight the times when we're linking to a redirect chain, which \n", | |
"# we'll need to update using the separate redirect chains sheet.\n", | |
"inlinks_and_redirects[\"Recommendation\"]=np.where(\n", | |
" (inlinks_and_redirects[\"Number of Redirects\"]>1),\n", | |
" \"Update link, fix redirect chain\",\n", | |
" inlinks_and_redirects[\"Recommendation\"])\n", | |
"\n", | |
"# Next we check if there is a temporary link involved. Partially because what\n", | |
"# we need to do is different, but mainly because how we prioritise it might be\n", | |
"inlinks_and_redirects[\"Recommendation\"]=np.where((\n", | |
" (inlinks_and_redirects[\"Status Code\"]==302)|\n", | |
" (inlinks_and_redirects[\"Temp Redirect In Chain\"]==True)),\n", | |
" \"Update link, fix temp redirect\",\n", | |
" inlinks_and_redirects[\"Recommendation\"])\n", | |
"\n", | |
"\n", | |
"# If the page is a 404 or the redirect chain results in a 404 then we need \n", | |
"# to choose a new page and fix the 404.\n", | |
"inlinks_and_redirects[\"Recommendation\"]=np.where(\n", | |
" ((inlinks_and_redirects[\"Status Code\"]==404)|\n", | |
" (inlinks_and_redirects[\"Final Status Code\"]==404)|\n", | |
" (inlinks_and_redirects[\"Final Address\"]==\"\")),\n", | |
" \"Redirect 404, update link\",\n", | |
" inlinks_and_redirects[\"Recommendation\"])\n", | |
"\n", | |
"# If we're linking to a redirect LOOP, that's pretty similar to a 404 for our\n", | |
"# purposes so the result is the same\n", | |
"inlinks_and_redirects[\"Recommendation\"]=np.where(\n", | |
" inlinks_and_redirects[\"Redirect Loop\"]==True,\n", | |
" \"Fix redirect loop, update link\",\n", | |
" inlinks_and_redirects[\"Recommendation\"])\n", | |
"\n", | |
"\n", | |
"# We also want to get an idea of when it makes sense to include a \"change to\"\n", | |
"# column (which shows the url we should update the link to point to) and when\n", | |
"# that name would be misleading. For instance, if we're fixing 404s or redirect\n", | |
"# loops, the \"change to\" column is just where we end up, not where the link\n", | |
"# should actually be pointing to\n", | |
"\n", | |
"recommendations_where_we_should_rename_column=[\n", | |
" \"Update link\",\n", | |
" \"Update link, fix temp redirect\"]" | |
], | |
"execution_count": 0, | |
"outputs": [] | |
}, | |
{ | |
"cell_type": "markdown", | |
"metadata": { | |
"id": "IlFUdC4uF4HO", | |
"colab_type": "text" | |
}, | |
"source": [ | |
"# Cleaning up our sheet\n", | |
"\n", | |
"Now because people have to read this output, we just clarify some of the names of our columns and reorder the sheet so it's in a logical structure.\n", | |
"\n", | |
"Pandas rename takes a list of columns and what you want to change them to.\n", | |
"\n", | |
"When you're using Pandas, if you filter it to a list of existing columns, but in a different order, that'll reorder the columns for you." | |
] | |
}, | |
{ | |
"cell_type": "code", | |
"metadata": { | |
"id": "TTPHuZNRAtTE", | |
"colab_type": "code", | |
"colab": { | |
"base_uri": "https://localhost:8080/", | |
"height": 758 | |
}, | |
"outputId": "dc4abcc7-bd2d-471a-bb8f-5d63a9c86853" | |
}, | |
"source": [ | |
"# Renaming columns to make the output clearer\n", | |
"inlinks_and_redirects=inlinks_and_redirects.rename(columns={\"Destination\":\"Link URL\", \"Status Code\":\"Linked-page Status Code\"})\n", | |
"\n", | |
"# Restructuring sheet to make clearer\n", | |
"inlinks_and_redirects=inlinks_and_redirects[[\"Recommendation\",\"Link URL\",\"Anchor\",\"Final Address\",\"Example Source Page\", \"Linked-page Status Code\", \"Number of Linking Pages\", \"Temp Redirect In Chain\", \"Redirect Loop\"]]\n", | |
"inlinks_and_redirects" | |
], | |
"execution_count": 28, | |
"outputs": [ | |
{ | |
"output_type": "execute_result", | |
"data": { | |
"text/html": [ | |
"<div>\n", | |
"<style scoped>\n", | |
" .dataframe tbody tr th:only-of-type {\n", | |
" vertical-align: middle;\n", | |
" }\n", | |
"\n", | |
" .dataframe tbody tr th {\n", | |
" vertical-align: top;\n", | |
" }\n", | |
"\n", | |
" .dataframe thead th {\n", | |
" text-align: right;\n", | |
" }\n", | |
"</style>\n", | |
"<table border=\"1\" class=\"dataframe\">\n", | |
" <thead>\n", | |
" <tr style=\"text-align: right;\">\n", | |
" <th></th>\n", | |
" <th>Recommendation</th>\n", | |
" <th>Link URL</th>\n", | |
" <th>Anchor</th>\n", | |
" <th>Final Address</th>\n", | |
" <th>Example Source Page</th>\n", | |
" <th>Linked-page Status Code</th>\n", | |
" <th>Number of Linking Pages</th>\n", | |
" <th>Temp Redirect In Chain</th>\n", | |
" <th>Redirect Loop</th>\n", | |
" </tr>\n", | |
" </thead>\n", | |
" <tbody>\n", | |
" <tr>\n", | |
" <th>0</th>\n", | |
" <td>Update link</td>\n", | |
" <td>https://www.lego.com/life</td>\n", | |
" <td>LEGO® LIFE</td>\n", | |
" <td>https://www.lego.com/en-gb/life</td>\n", | |
" <td>https://www.lego.com/en-gb</td>\n", | |
" <td>303</td>\n", | |
" <td>3179</td>\n", | |
" <td>False</td>\n", | |
" <td>False</td>\n", | |
" </tr>\n", | |
" <tr>\n", | |
" <th>1</th>\n", | |
" <td>Update link</td>\n", | |
" <td>https://www.lego.com/stores</td>\n", | |
" <td>Find a store</td>\n", | |
" <td>https://www.lego.com/en-gb/stores</td>\n", | |
" <td>https://www.lego.com/en-gb</td>\n", | |
" <td>303</td>\n", | |
" <td>2120</td>\n", | |
" <td>False</td>\n", | |
" <td>False</td>\n", | |
" </tr>\n", | |
" <tr>\n", | |
" <th>2</th>\n", | |
" <td>Update link</td>\n", | |
" <td>https://www.lego.com/en-gb/service/help-topics</td>\n", | |
" <td>Common Questions</td>\n", | |
" <td>https://www.lego.com/en-gb/service/help</td>\n", | |
" <td>https://www.lego.com/en-gb</td>\n", | |
" <td>301</td>\n", | |
" <td>2110</td>\n", | |
" <td>False</td>\n", | |
" <td>False</td>\n", | |
" </tr>\n", | |
" <tr>\n", | |
" <th>3</th>\n", | |
" <td>Update link</td>\n", | |
" <td>https://www.lego.com/en-gb/liebherr-r-9800-exc...</td>\n", | |
" <td>Liebherr R 9800 Excavator</td>\n", | |
" <td>https://www.lego.com/en-gb/product/liebherr-r-...</td>\n", | |
" <td>https://www.lego.com/en-gb</td>\n", | |
" <td>301</td>\n", | |
" <td>2110</td>\n", | |
" <td>False</td>\n", | |
" <td>False</td>\n", | |
" </tr>\n", | |
" <tr>\n", | |
" <th>4</th>\n", | |
" <td>Update link</td>\n", | |
" <td>https://www.lego.com/en-gb/dinosaur-fossils-21320</td>\n", | |
" <td>Dinosaur Fossils</td>\n", | |
" <td>https://www.lego.com/en-gb/product/dinosaur-fo...</td>\n", | |
" <td>https://www.lego.com/en-gb</td>\n", | |
" <td>301</td>\n", | |
" <td>2110</td>\n", | |
" <td>False</td>\n", | |
" <td>False</td>\n", | |
" </tr>\n", | |
" <tr>\n", | |
" <th>...</th>\n", | |
" <td>...</td>\n", | |
" <td>...</td>\n", | |
" <td>...</td>\n", | |
" <td>...</td>\n", | |
" <td>...</td>\n", | |
" <td>...</td>\n", | |
" <td>...</td>\n", | |
" <td>...</td>\n", | |
" <td>...</td>\n", | |
" </tr>\n", | |
" <tr>\n", | |
" <th>107</th>\n", | |
" <td>Update link</td>\n", | |
" <td>http://shop.lego.com/Shipping-Handling</td>\n", | |
" <td>Shipping and Handling</td>\n", | |
" <td>https://shop.lego.com/Shipping-Handling</td>\n", | |
" <td>https://www.lego.com/en-gb/page/terms-and-cond...</td>\n", | |
" <td>301</td>\n", | |
" <td>1</td>\n", | |
" <td>False</td>\n", | |
" <td>False</td>\n", | |
" </tr>\n", | |
" <tr>\n", | |
" <th>108</th>\n", | |
" <td>Update link</td>\n", | |
" <td>http://shop.lego.com/Privacy-Policy</td>\n", | |
" <td>Privacy Policy</td>\n", | |
" <td>https://shop.lego.com/Privacy-Policy</td>\n", | |
" <td>https://www.lego.com/en-gb/page/terms-and-cond...</td>\n", | |
" <td>301</td>\n", | |
" <td>1</td>\n", | |
" <td>False</td>\n", | |
" <td>False</td>\n", | |
" </tr>\n", | |
" <tr>\n", | |
" <th>109</th>\n", | |
" <td>Update link</td>\n", | |
" <td>http://shop.lego.com/</td>\n", | |
" <td>LEGO Shop</td>\n", | |
" <td>https://shop.lego.com/</td>\n", | |
" <td>https://www.lego.com/en-gb/service/product-rec...</td>\n", | |
" <td>301</td>\n", | |
" <td>1</td>\n", | |
" <td>False</td>\n", | |
" <td>False</td>\n", | |
" </tr>\n", | |
" <tr>\n", | |
" <th>110</th>\n", | |
" <td>Update link</td>\n", | |
" <td>https://www.lego.com/en-gb/product/the-minifig...</td>\n", | |
" <td>Minifigure to the max Grow your collection wit...</td>\n", | |
" <td>https://www.lego.com/en-gb/product/lego-minfig...</td>\n", | |
" <td>https://www.lego.com/en-gb/themes/lego-origina...</td>\n", | |
" <td>301</td>\n", | |
" <td>1</td>\n", | |
" <td>False</td>\n", | |
" <td>False</td>\n", | |
" </tr>\n", | |
" <tr>\n", | |
" <th>111</th>\n", | |
" <td>Update link</td>\n", | |
" <td>http://shop.lego.com/en-GB/Site-Information</td>\n", | |
" <td>Site Information</td>\n", | |
" <td>https://shop.lego.com/en-GB/Site-Information</td>\n", | |
" <td>https://www.lego.com/en-gb/page/payment-methods</td>\n", | |
" <td>301</td>\n", | |
" <td>1</td>\n", | |
" <td>False</td>\n", | |
" <td>False</td>\n", | |
" </tr>\n", | |
" </tbody>\n", | |
"</table>\n", | |
"<p>112 rows × 9 columns</p>\n", | |
"</div>" | |
], | |
"text/plain": [ | |
" Recommendation ... Redirect Loop\n", | |
"0 Update link ... False\n", | |
"1 Update link ... False\n", | |
"2 Update link ... False\n", | |
"3 Update link ... False\n", | |
"4 Update link ... False\n", | |
".. ... ... ...\n", | |
"107 Update link ... False\n", | |
"108 Update link ... False\n", | |
"109 Update link ... False\n", | |
"110 Update link ... False\n", | |
"111 Update link ... False\n", | |
"\n", | |
"[112 rows x 9 columns]" | |
] | |
}, | |
"metadata": { | |
"tags": [] | |
}, | |
"execution_count": 28 | |
} | |
] | |
}, | |
{ | |
"cell_type": "markdown", | |
"metadata": { | |
"id": "bu4ILy4lAs7d", | |
"colab_type": "text" | |
}, | |
"source": [ | |
"# Creating filtered versions of our output\n", | |
"\n", | |
"Now we create different versions of our output so that when people are using our results they can easily check for different things.\n", | |
"\n", | |
"For instance, we'll probably want to treat the \"fix 404\" stuff differently to the \"fix temporary redirect\" stuff and all of that differently again to the general \"fix this link\" stuff.\n", | |
"\n", | |
"Fortunately we've already categorised all of our data based on the specific problems so we just need to create an Excel sheet where it's all separated into different tabs.\n", | |
"\n", | |
"We can do this easily by getting a list of the unique categories we've created, then using a <strong>for</strong> loop." | |
] | |
}, | |
{ | |
"cell_type": "code", | |
"metadata": { | |
"id": "2lXy6-JqAnX7", | |
"colab_type": "code", | |
"colab": {} | |
}, | |
"source": [ | |
"# Getting the unique recommendations we've listed, then turning that\n", | |
"# into a list. unique() is something Pandas just does for us.\n", | |
"recommendations=list(inlinks_and_redirects[\"Recommendation\"].unique())\n", | |
"\n", | |
"# Creating something which will help us add our results into one Excel file\n", | |
"# We use the xlsxwriter engine (which we installed at the start) because\n", | |
"# sometimes the standard writer gets confused with unusual characters.\n", | |
"\n", | |
"# We have to set strings_to_urls as False because Excel has a limit of 40k\n", | |
"# urls which it will allow in one work sheet so if we don't do this it'll\n", | |
"# stop us from saving all of our data\n", | |
"writer = pd.ExcelWriter(\"[Complete] redirects and broken links patterns.xlsx\", \n", | |
" engine=\"xlsxwriter\", \n", | |
" options={'strings_to_urls': False})\n", | |
"\n", | |
"\n", | |
"# Putting all of the general data into the first sheet so people can look\n", | |
"# at everything and sort by number of inlinks to find the most important stuff\n", | |
"# we say index=False because that stops Pandas from writing out line numbers\n", | |
"# in the first column of our sheets\n", | |
"inlinks_and_redirects.to_excel(writer, \n", | |
" sheet_name=\"All internal links\", \n", | |
" index=False)\n", | |
"\n", | |
"\n", | |
"# Going through each recommendation we've given and creating a separate sheet \n", | |
"# for each\n", | |
"for recommendation in recommendations:\n", | |
"\n", | |
" # Creating a filtered version of our data for just this recommendation\n", | |
" recommendation_data=inlinks_and_redirects[\n", | |
" inlinks_and_redirects[\"Recommendation\"]==recommendation]\n", | |
"\n", | |
" # Only where\n", | |
" if recommendation in recommendations_where_we_should_rename_column:\n", | |
" recommendation_data=recommendation_data.rename(columns={\n", | |
" \"Final Address\":\"Change Link To\"})\n", | |
"\n", | |
" # Putting that filtered version of our data in its own Excel tab\n", | |
" recommendation_data.to_excel(writer, sheet_name=recommendation, index=False)\n", | |
"\n", | |
" # Having constructed our data we now need to save it\n", | |
"writer.save()" | |
], | |
"execution_count": 0, | |
"outputs": [] | |
}, | |
{ | |
"cell_type": "markdown", | |
"metadata": { | |
"id": "x81k08nMHj8M", | |
"colab_type": "text" | |
}, | |
"source": [ | |
"Next we turn this data into Excel-readable files and download them.\n", | |
"\n", | |
"We'll turn the redirect chains data into one .csv file because that's more widely readable, and simpler, and it's only one set of data.\n", | |
"\n", | |
"We'll turn the broken link patterns into an Excel file because there's a few ways we might want to see the data. For example, it matters more if links are pointing to 404 pages <i>but</i> there could be <i>way</i> more templated 301 links so we need to be able to see both in the same place <i>and</i> the two separately, easily. Same goes for 302 internal links etc." | |
] | |
}, | |
{ | |
"cell_type": "markdown", | |
"metadata": { | |
"id": "jdKwjtvCNnho", | |
"colab_type": "text" | |
}, | |
"source": [ | |
"# Fixing redirect chains\n", | |
"\n", | |
"While updating internal links helps us avoid linking to things we don't want to, we can't stop <i>other sites</i> from doing so, so it's a good idea to fix redirect chains too anyway (particularly as we've already done most of the work by reformatting our redirect chains file).\n", | |
"\n", | |
"Because we're just fixing redirect chains we don't care so much about the last step in the chain (we're not going to make it any less than 1) so we start by going through each of the redirect steps we have found and filtering our everything which isn't the final hop." | |
] | |
}, | |
{ | |
"cell_type": "code", | |
"metadata": { | |
"id": "1sbI-EFqQGBH", | |
"colab_type": "code", | |
"colab": {} | |
}, | |
"source": [ | |
"# Focusing on redirect chains which end on our own site\n", | |
"chain_melted=chain_melted[(chain_melted[\"Final Address\"].str.contains(my_domain))]\n", | |
"\n", | |
"\n", | |
"# Extract the number from the label we have for this redirect step\n", | |
"# That will be a string so we have to convert it to a number.\n", | |
"chain_melted[\"Redirect number\"]=chain_melted['variable'].str.extract(\"(\\d)\").fillna(0)\n", | |
"chain_melted[\"Redirect number\"]=pd.to_numeric(chain_melted[\"Redirect number\"])\n", | |
"\n", | |
"# Calculate the difference between this redirect step and the total number of\n", | |
"# redirects in this chain, to work out how many redirect hops there are after this one.\n", | |
"chain_melted[\"Redirects left\"]=chain_melted[\"Number of Redirects\"]-chain_melted[\"Redirect number\"]\n", | |
"\n", | |
"# If there is more than one hop after this one, then this is part of a redirect\n", | |
"# chain and we need to change it to redirect directly to the final address\n", | |
"chain_melted[\"Not direct\"]=chain_melted[\"Redirects left\"]>1\n", | |
"\n", | |
"# Filter our redirect chains data to just be the ones which we need to make direct\n", | |
"melted_chains_to_redirect=chain_melted[chain_melted['Not direct']==True]\n", | |
"\n", | |
"# Renaming columns to make our output clear\n", | |
"melted_chains_to_redirect=melted_chains_to_redirect.rename(columns={\"Destination\":\"URL to redirect\", \"Final Address\":\"Change redirect to\"})" | |
], | |
"execution_count": 0, | |
"outputs": [] | |
}, | |
{ | |
"cell_type": "markdown", | |
"metadata": { | |
"id": "9SXKhSzNL23X", | |
"colab_type": "text" | |
}, | |
"source": [ | |
"# Downloading our redirect chains data\n", | |
"\n", | |
"Now we save the redirect chains data we worked with into a csv.\n", | |
"\n", | |
"We won't split this into separate tabs the way we did with the main data because that's already covered a bunch of different actions for us." | |
] | |
}, | |
{ | |
"cell_type": "code", | |
"metadata": { | |
"id": "et4yacvXMXN4", | |
"colab_type": "code", | |
"outputId": "adbe90de-40da-41a4-99b4-30df65fc7661", | |
"colab": { | |
"base_uri": "https://localhost:8080/", | |
"height": 281 | |
} | |
}, | |
"source": [ | |
"# Converting our redirect chains data to csv\n", | |
"melted_chains_to_redirect[[\"URL to redirect\", \"Change redirect to\"]].to_csv(\"\"\"[Complete] redirect chains to fix.csv\"\"\", index=False)\n", | |
"\n", | |
"\n", | |
"end_time=time.time()\n", | |
"\n", | |
"execution_time=end_time-start_time\n", | |
"\n", | |
"print (\"Took {} seconds\".format(execution_time))\n", | |
"\n", | |
"time.sleep(5)\n", | |
"\n", | |
"try:\n", | |
" output.eval_js('new Audio(\"https://www.myinstants.com/media/sounds/kool-aid-oh-yeah.mp3\").play()')\n", | |
" display(HTML(\"\"\"<div align=\"center\" width=100%><img src=\"https://media.giphy.com/media/OHZ1gSUThmEso/giphy.gif\" width=30% align=middle>\n", | |
" <p></br></br></br></br>\n", | |
" Now look in the files pane for the [Complete] files and click on them to download. If they haven't appeared, click \"Refresh\".</p></div>\"\"\"))\n", | |
"\n", | |
"except:\n", | |
" print (\"\"\"\n", | |
" Couldn't play sound\n", | |
"\n", | |
"\n", | |
"\n", | |
" Now look in the files pane for the [Complete] files and click on them to download. If they haven't appeared, click \"Refresh\".\"\"\")\n", | |
"\n", | |
" \n" | |
], | |
"execution_count": 27, | |
"outputs": [ | |
{ | |
"output_type": "stream", | |
"text": [ | |
"Took 2.0636372566223145 seconds\n" | |
], | |
"name": "stdout" | |
}, | |
{ | |
"output_type": "display_data", | |
"data": { | |
"text/html": [ | |
"<div align=\"center\" width=100%><img src=\"https://media.giphy.com/media/OHZ1gSUThmEso/giphy.gif\" width=30% align=middle>\n", | |
" <p></br></br></br></br>\n", | |
" Now look in the files pane for the [Complete] files and click on them to download. If they haven't appeared, click \"Refresh\".</p></div>" | |
], | |
"text/plain": [ | |
"<IPython.core.display.HTML object>" | |
] | |
}, | |
"metadata": { | |
"tags": [] | |
} | |
} | |
] | |
}, | |
{ | |
"cell_type": "markdown", | |
"metadata": { | |
"id": "gHLBP6zoAoYx", | |
"colab_type": "text" | |
}, | |
"source": [ | |
"# Once you've set up the notebook and uploaded your files\n", | |
".\n", | |
"\n", | |
"\n", | |
" - Click \"Runtime\" then \"Run All\" to start the process\n", | |
" - Wait a bit\n", | |
" - Download the [Complete] files for your results\n", | |
"\n", | |
"\n" | |
] | |
} | |
] | |
} |
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment