@seoutopico
Last active June 30, 2022 07:10
Sprint python Screamingfrog.ipynb
{
"cells": [
{
"cell_type": "markdown",
"metadata": {
"id": "view-in-github",
"colab_type": "text"
},
"source": [
"<a href=\"https://colab.research.google.com/gist/seoutopico/1dceef1150e82c3427ee71482f25b158/sprint-python-screamingfrog.ipynb\" target=\"_parent\"><img src=\"https://colab.research.google.com/assets/colab-badge.svg\" alt=\"Open In Colab\"/></a>"
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "nRqiCc_tXLri"
},
"source": [
"/// SOLUCIÓN AL SPRINT DE LA SEMANA 3 \n",
"\n",
"Preparamos el entorno para subir y leer nuestra exportación de Screaming Frog\n",
"\n",
"1. Instalamos pandas\n",
"2. Subimos el crawl de screamingFrog:\n",
"3. Leemos el archivo\n",
"\n",
"\n",
"\n"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"id": "G07Klwk0XJa6"
},
"outputs": [],
"source": [
"#instalamos pandas\n",
"!pip install pandas"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"id": "vi84a7UDX7kW"
},
"outputs": [],
"source": [
"#subir arvhicos\n",
"# fuente: https://colab.research.google.com/notebooks/io.ipynb\n",
"\n",
"from google.colab import files\n",
"\n",
"uploaded = files.upload()"
]
},
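{
"cell_type": "markdown",
"metadata": {},
"source": [
"`files.upload()` returns a dict keyed by filename, so before hard-coding a name in the read step we can confirm what was actually uploaded. A minimal check; the exact filename (`internos_todo.csv` below) depends on your own export.\n"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# List the uploaded filenames and their sizes (uploaded maps filename -> file content in bytes)\n",
"for name, content in uploaded.items():\n",
"    print(name, len(content), 'bytes')"
]
},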
{
"cell_type": "code",
"source": [
"#Leer el archivo\n",
"import pandas as pd \n",
"df = pd.read_csv('internos_todo.csv') #metodo p.read_extensión\n",
"print('Listo')"
],
"metadata": {
"id": "CfaZrjmlViOL"
},
"execution_count": null,
"outputs": []
},
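{
"cell_type": "markdown",
"metadata": {},
"source": [
"A quick sanity check of the import, assuming the export keeps Screaming Frog's Spanish column names (`Dirección`, `Código de respuesta`):\n"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Inspect size, column names and the first rows to confirm the CSV loaded as expected\n",
"print(df.shape)\n",
"print(df.columns.tolist())\n",
"df.head()"
]
},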
{
"cell_type": "markdown",
"source": [
"/// QUÉ DATOS QUEREMOS\n",
"\n",
"# ¿Qué queremos saber?\n",
"\n",
"* Total de URL con los diferentes Status code\n",
"* Exportar un csv\n"
],
"metadata": {
"id": "cMQVzCn3n-JR"
}
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"id": "VbPPUwPnwKxl"
},
"outputs": [],
"source": [
"# Total URLs según su Status Code (404, 200, 301)\n",
"# https://pandas.pydata.org/docs/reference/api/pandas.Series.value_counts.html\n",
"status_df = df['Código de respuesta'].value_counts()\n",
"status_df"
]
},
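{
"cell_type": "markdown",
"metadata": {},
"source": [
"Optionally, the same `value_counts` call can return proportions instead of absolute counts, which is handy for reporting; a minimal variation on the cell above:\n"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Share of URLs per status code (normalize=True returns fractions instead of counts)\n",
"df['Código de respuesta'].value_counts(normalize=True).round(3)"
]
},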
{
"cell_type": "code",
"source": [
"# URL por status code, otra forma de obtener lo mismo\n",
"status_code_df = df[['Código de respuesta', 'Dirección']].groupby(['Código de respuesta']).agg('count')\n",
"status_code_df\n"
],
"metadata": {
"id": "hA6wd06sm4yr"
},
"execution_count": null,
"outputs": []
},
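{
"cell_type": "markdown",
"metadata": {},
"source": [
"In `status_code_df` the count column is still called `Dirección`, which can be confusing in the exported CSV. An optional rename into a new variable (the column name `Total URLs` is just an illustrative choice and does not affect the cells below):\n"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Rename the aggregated column so the export reads more clearly\n",
"status_code_renamed = status_code_df.rename(columns={'Dirección': 'Total URLs'})\n",
"status_code_renamed"
]
},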
{
"cell_type": "code",
"source": [
"# Descargar el CSV nombre la variable donde hemos guardado los datos del data frame seguido de to.csv\n",
"status_code_df.to_csv('filename.csv') \n",
"files.download('filename.csv')"
],
"metadata": {
"id": "0aYDwJr9YsQL"
},
"execution_count": null,
"outputs": []
},
{
"cell_type": "markdown",
"source": [
"//BONUS\n",
"\n",
"Sacamos un listado solo con las URLs que devuelven un 404. \n",
"\n",
"Para ello utilizamos el filtrado visto en el módulo"
],
"metadata": {
"id": "92WIw0IqYtbz"
}
},
{
"cell_type": "code",
"source": [
"# URL por status code 400\n",
"url_404 = df[df['Código de respuesta'] == 404 ] #indicamos dnd estan los datos , luego la condición : nombre de la columna, valor de la celda\n",
"\n",
"#printamos solo las los campos Dirección y Código de respuesta, para quitar ruido al Excel\n",
"total404 = url_404.filter(['Dirección','Código de respuesta']) #filtramos los datos que queremos\n",
"total404\n"
],
"metadata": {
"id": "Hh1zPN6iXLLi"
},
"execution_count": null,
"outputs": []
},
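{
"cell_type": "markdown",
"metadata": {},
"source": [
"The same column selection can also be written with double-bracket indexing, the more common pandas idiom for picking columns; an equivalent sketch:\n"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Equivalent to .filter(): select the two columns by name with double brackets\n",
"total404_alt = url_404[['Dirección', 'Código de respuesta']]\n",
"total404_alt.head()"
]
},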
{
"cell_type": "code",
"source": [
"#Descargamos el listado de URL que dan 404\n",
"total404.to_csv('404.csv') \n",
"files.download('404.csv')"
],
"metadata": {
"id": "pjmx23QKuOqj"
},
"execution_count": null,
"outputs": []
}
],
"metadata": {
"colab": {
"name": "Sprint python Screamingfrog.ipynb",
"provenance": [],
"collapsed_sections": [],
"authorship_tag": "ABX9TyMxgVnFFq+rjvufC8Men/yG",
"include_colab_link": true
},
"kernelspec": {
"display_name": "Python 3",
"name": "python3"
},
"language_info": {
"name": "python"
}
},
"nbformat": 4,
"nbformat_minor": 0
}
@seoutopico (Author)

Use Google Colab to read, analyze, and filter the ScreamingFrog data.
