{ | |
"cells": [ | |
{ | |
"cell_type": "markdown", | |
"metadata": {}, | |
"source": [ | |
"<h3>Semantic crosslinking article recommender</h3><p>@author: Natzir Turrado: Technical SEO / Data Scientist. <a href=\"https://twitter.com/natzir9\">Twitter > @natzir9</a></p>\n" | |
] | |
}, | |
{
"cell_type": "code",
"execution_count": 1,
"metadata": {},
"outputs": [],
"source": [
"import pandas as pd\n",
"import matplotlib.pyplot as plt\n",
"import nltk\n",
"from sklearn.feature_extraction.text import TfidfVectorizer\n",
"from sklearn.decomposition import NMF\n",
"from sklearn.preprocessing import normalize\n",
"from nltk.corpus import stopwords\n",
"from IPython.display import display\n",
"\n",
"# One-time download of the stop word lists used by stopwords.words('spanish')\n",
"_ = nltk.download('stopwords', quiet=True)"
] | |
}, | |
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Crawl your site and get the title and the content of every post. Create an Excel file named \"data.xlsx\" with the columns \"title\" and \"article\", and save it in the same working directory as this notebook.\n",
"<ul>\n", | |
"<li><a href=\"https://builtvisible.com/seo-guide-to-xpath/\">XPATH for SEO's</a></li>\n", | |
"<li><a href=\"https://www.screamingfrog.co.uk/web-scraping/\">Web Scraping & Data Extraction</a></li>\n", | |
"</ul>" | |
] | |
}, | |
{ | |
"cell_type": "code", | |
"execution_count": 2, | |
"metadata": {}, | |
"outputs": [ | |
{ | |
"data": { | |
"text/html": [ | |
"<div>\n", | |
"<style scoped>\n", | |
" .dataframe tbody tr th:only-of-type {\n", | |
" vertical-align: middle;\n", | |
" }\n", | |
"\n", | |
" .dataframe tbody tr th {\n", | |
" vertical-align: top;\n", | |
" }\n", | |
"\n", | |
" .dataframe thead th {\n", | |
" text-align: right;\n", | |
" }\n", | |
"</style>\n", | |
"<table border=\"1\" class=\"dataframe\">\n", | |
" <thead>\n", | |
" <tr style=\"text-align: right;\">\n", | |
" <th></th>\n", | |
" <th>title</th>\n", | |
" <th>article</th>\n", | |
" </tr>\n", | |
" </thead>\n", | |
" <tbody>\n", | |
" <tr>\n", | |
" <td>0</td>\n", | |
" <td>SEO para Progressive Web APPs (PWA) y JavaScript</td>\n", | |
" <td>Este artículo es un resumen que hemos hecho Ch...</td>\n", | |
" </tr>\n", | |
" <tr>\n", | |
" <td>1</td>\n", | |
" <td>Cómo mejorar el SEO incrementando la frecuenci...</td>\n", | |
" <td>Siempre digo que es mejor remar a favor que ir...</td>\n", | |
" </tr>\n", | |
" <tr>\n", | |
" <td>2</td>\n", | |
" <td>CTR y SEO: Manipulación del CTR para influir e...</td>\n", | |
" <td>Últimamente se han puesto pesaditos con el tem...</td>\n", | |
" </tr>\n", | |
" <tr>\n", | |
" <td>3</td>\n", | |
" <td>Las mejores herramientas de visualización de d...</td>\n", | |
" <td>Artículo publicado en Doctor Metrics sobre her...</td>\n", | |
" </tr>\n", | |
" <tr>\n", | |
" <td>4</td>\n", | |
" <td>Cómo hacer un Heatmap de las visitas de tu web...</td>\n", | |
" <td>Visualizar la información en forma de mapa de ...</td>\n", | |
" </tr>\n", | |
" </tbody>\n", | |
"</table>\n", | |
"</div>" | |
], | |
"text/plain": [ | |
" title \\\n", | |
"0 SEO para Progressive Web APPs (PWA) y JavaScript \n", | |
"1 Cómo mejorar el SEO incrementando la frecuenci... \n", | |
"2 CTR y SEO: Manipulación del CTR para influir e... \n", | |
"3 Las mejores herramientas de visualización de d... \n", | |
"4 Cómo hacer un Heatmap de las visitas de tu web... \n", | |
"\n", | |
" article \n", | |
"0 Este artículo es un resumen que hemos hecho Ch... \n", | |
"1 Siempre digo que es mejor remar a favor que ir... \n", | |
"2 Últimamente se han puesto pesaditos con el tem... \n", | |
"3 Artículo publicado en Doctor Metrics sobre her... \n", | |
"4 Visualizar la información en forma de mapa de ... " | |
] | |
}, | |
"metadata": {}, | |
"output_type": "display_data" | |
} | |
], | |
"source": [ | |
"df = pd.read_excel('data.xlsx')\n", | |
"display(df.head())" | |
] | |
}, | |
{ | |
"cell_type": "code", | |
"execution_count": 3, | |
"metadata": {}, | |
"outputs": [], | |
"source": [ | |
"articles = df.article.tolist()\n", | |
"titles = df.title.tolist()" | |
] | |
}, | |
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Transforming the list of documents into a TF-IDF matrix (word frequencies weighted by how rare each word is across documents). I'm only accepting purely alphabetic tokens, A-Z and a-z (token_pattern), ignoring terms that appear in more than 60% of the documents (max_df), and keeping every term that appears in at least 1 document (min_df = 1, which filters nothing). You should test these thresholds on a bigger website than my blog. <a href=\"http://dfrancis.co/2017/10/06/tf-idf-vectorizer-fit-and-transform/\">More info on min_df & max_df here</a>."
] | |
}, | |
{ | |
"cell_type": "code", | |
"execution_count": 4, | |
"metadata": { | |
"scrolled": true | |
}, | |
"outputs": [ | |
{ | |
"data": { | |
"text/plain": [ | |
"['aarrr',\n", | |
" 'abajo',\n", | |
" 'abandonamos',\n", | |
" 'abandonar',\n", | |
" 'abandone',\n", | |
" 'abandono',\n", | |
" 'abandonos',\n", | |
" 'abarcan',\n", | |
" 'abc',\n", | |
" 'abierta',\n", | |
" 'abiertas',\n", | |
" 'ability',\n", | |
" 'abismales',\n", | |
" 'about',\n", | |
" 'above',\n", | |
" 'abrazando',\n", | |
" 'abril',\n", | |
" 'abrimos',\n", | |
" 'abrir',\n", | |
" 'abrirlo',\n", | |
" 'absoluta',\n", | |
" 'absolutamente',\n", | |
" 'absolutas',\n", | |
" 'absoluto',\n", | |
" 'abuelo',\n", | |
" 'abundancia',\n", | |
" 'aburriros',\n", | |
" 'acaba',\n", | |
" 'acababan',\n", | |
" 'acabado',\n", | |
" 'acaban',\n", | |
" 'acabar',\n", | |
" 'acabaremos',\n", | |
" 'acabe',\n", | |
" 'acaben',\n", | |
" 'acabo',\n", | |
" 'acceda',\n", | |
" 'accede',\n", | |
" 'acceder',\n", | |
" 'accedido',\n", | |
" 'accesibilidad',\n", | |
" 'accesible',\n", | |
" 'accesibles',\n", | |
" 'accesiblesanaliza',\n", | |
" 'accesiblesporcentaje',\n", | |
" 'acceso',\n", | |
" 'accesos',\n", | |
" 'accionable',\n", | |
" 'accionables',\n", | |
" 'acciones',\n", | |
" 'acento',\n", | |
" 'acepta',\n", | |
" 'aceptada',\n", | |
" 'aceptadas',\n", | |
" 'aceptados',\n", | |
" 'aceptar',\n", | |
" 'aceptarlas',\n", | |
" 'acerca',\n", | |
" 'acercarse',\n", | |
" 'acertar',\n", | |
" 'acierto',\n", | |
" 'aconsejar',\n", | |
" 'aconsejo',\n", | |
" 'acordamos',\n", | |
" 'acordaros',\n", | |
" 'acortar',\n", | |
" 'acostumbrados',\n", | |
" 'acotar',\n", | |
" 'acquisition',\n", | |
" 'across',\n", | |
" 'action',\n", | |
" 'actionsitelinksrich',\n", | |
" 'actitud',\n", | |
" 'actitudes',\n", | |
" 'activa',\n", | |
" 'activado',\n", | |
" 'activador',\n", | |
" 'activan',\n", | |
" 'activar',\n", | |
" 'activas',\n", | |
" 'activation',\n", | |
" 'actividad',\n", | |
" 'actividades',\n", | |
" 'activity',\n", | |
" 'activo',\n", | |
" 'activos',\n", | |
" 'actual',\n", | |
" 'actuales',\n", | |
" 'actualesa',\n", | |
" 'actualidad',\n", | |
" 'actualiza',\n", | |
" 'actualizada',\n", | |
" 'actualizar',\n", | |
" 'actualizarse',\n", | |
" 'actualizo',\n", | |
" 'actualmente',\n", | |
" 'actuamos',\n", | |
" 'actuar',\n", | |
" 'acuerda',\n", | |
" 'acuerdo',\n", | |
" 'ad',\n", | |
" 'adams',\n", | |
" 'adaptado',\n", | |
" 'adaptan',\n", | |
" 'adaptar',\n", | |
" 'adaptarlos',\n", | |
" 'adaptarse',\n", | |
" 'adapten',\n", | |
" 'addeventlistener',\n", | |
" 'adding',\n", | |
" 'addon',\n", | |
" 'adecuada',\n", | |
" 'adecuadas',\n", | |
" 'adecuado',\n", | |
" 'adecuados',\n", | |
" 'adelantado',\n", | |
" 'adelante',\n", | |
" 'adelantee',\n", | |
" 'adelanto',\n", | |
" 'ademas',\n", | |
" 'adjuntado',\n", | |
" 'adjust',\n", | |
" 'admin',\n", | |
" 'administrador',\n", | |
" 'administrativas',\n", | |
" 'admitan',\n", | |
" 'adora',\n", | |
" 'adquirido',\n", | |
" 'adquiriendo',\n", | |
" 'adquirir',\n", | |
" 'adquisitivo',\n", | |
" 'adri',\n", | |
" 'ads',\n", | |
" 'adsense',\n", | |
" 'advance',\n", | |
" 'adversario',\n", | |
" 'adversidad',\n", | |
" 'adverstising',\n", | |
" 'advertisign',\n", | |
" 'advertising',\n", | |
" 'advocate',\n", | |
" 'adwords',\n", | |
" 'adwordsgoogle',\n", | |
" 'aede',\n", | |
" 'afecta',\n", | |
" 'afectadas',\n", | |
" 'afectado',\n", | |
" 'afectan',\n", | |
" 'afectando',\n", | |
" 'afectar',\n", | |
" 'afectarnos',\n", | |
" 'afiliado',\n", | |
" 'afiliados',\n", | |
" 'afines',\n", | |
" 'afirma',\n", | |
" 'afirmaciones',\n", | |
" 'afirman',\n", | |
" 'afirme',\n", | |
" 'afrontar',\n", | |
" 'after',\n", | |
" 'again',\n", | |
" 'against',\n", | |
" 'agencia',\n", | |
" 'agencias',\n", | |
" 'agents',\n", | |
" 'agilidad',\n", | |
" 'agilizar',\n", | |
" 'agosto',\n", | |
" 'agotados',\n", | |
" 'agradece',\n", | |
" 'agradecer',\n", | |
" 'agradecido',\n", | |
" 'agrega',\n", | |
" 'agregado',\n", | |
" 'agregados',\n", | |
" 'agregas',\n", | |
" 'agrupados',\n", | |
" 'agrupar',\n", | |
" 'agruparlos',\n", | |
" 'agua',\n", | |
" 'agujeros',\n", | |
" 'ah',\n", | |
" 'ahora',\n", | |
" 'ahorrar',\n", | |
" 'ahrefs',\n", | |
" 'ai',\n", | |
" 'aida',\n", | |
" 'aidas',\n", | |
" 'airways',\n", | |
" 'aisladas',\n", | |
" 'aislar',\n", | |
" 'alberto',\n", | |
" 'alcachofas',\n", | |
" 'alcance',\n", | |
" 'alcanzables',\n", | |
" 'alcanzablesderivado',\n", | |
" 'alcanzar',\n", | |
" 'alchemyapi',\n", | |
" 'alemania',\n", | |
" 'alerta',\n", | |
" 'alerting',\n", | |
" 'alexa',\n", | |
" 'aleyda',\n", | |
" 'algoritmia',\n", | |
" 'algoritmo',\n", | |
" 'algoritmos',\n", | |
" 'alguien',\n", | |
" 'alguna',\n", | |
" 'alguno',\n", | |
" 'aliada',\n", | |
" 'alignment',\n", | |
" 'alinea',\n", | |
" 'alineadas',\n", | |
" 'alinean',\n", | |
" 'alinearlo',\n", | |
" 'alineen',\n", | |
" 'all',\n", | |
" 'allows',\n", | |
" 'allowscriptaccess',\n", | |
" 'almacenado',\n", | |
" 'almacene',\n", | |
" 'almost',\n", | |
" 'alone',\n", | |
" 'along',\n", | |
" 'alpha',\n", | |
" 'alphami',\n", | |
" 'alphaprobar',\n", | |
" 'alquiler',\n", | |
" 'already',\n", | |
" 'alrededor',\n", | |
" 'also',\n", | |
" 'alta',\n", | |
" 'altas',\n", | |
" 'alterando',\n", | |
" 'alterativas',\n", | |
" 'alternate',\n", | |
" 'alternativa',\n", | |
" 'alternativas',\n", | |
" 'alto',\n", | |
" 'altruismo',\n", | |
" 'altura',\n", | |
" 'alturas',\n", | |
" 'alucinar',\n", | |
" 'alumnos',\n", | |
" 'always',\n", | |
" 'amateur',\n", | |
" 'amateurs',\n", | |
" 'amazoncart',\n", | |
" 'amb',\n", | |
" 'ambas',\n", | |
" 'ambientadas',\n", | |
" 'ambiente',\n", | |
" 'ambiguaciones',\n", | |
" 'ambos',\n", | |
" 'ameno',\n", | |
" 'americano',\n", | |
" 'amigable',\n", | |
" 'amigo',\n", | |
" 'amigos',\n", | |
" 'amiguetes',\n", | |
" 'amoldarnos',\n", | |
" 'amos',\n", | |
" 'amp',\n", | |
" 'amphtml',\n", | |
" 'amplia',\n", | |
" 'ampliando',\n", | |
" 'amplio',\n", | |
" 'an',\n", | |
" 'analicemos',\n", | |
" 'analista',\n", | |
" 'analistas',\n", | |
" 'analistaseo',\n", | |
" 'analitica',\n", | |
" 'analiza',\n", | |
" 'analizada',\n", | |
" 'analizado',\n", | |
" 'analizados',\n", | |
" 'analizamos',\n", | |
" 'analizando',\n", | |
" 'analizar',\n", | |
" 'analizarlos',\n", | |
" 'analizaron',\n", | |
" 'analizo',\n", | |
" 'analyst',\n", | |
" 'analyticator',\n", | |
" 'analytics',\n", | |
" 'analyticsdescubre',\n", | |
" 'analyticseste',\n", | |
" 'analyticsgracias',\n", | |
" 'analyticsla',\n", | |
" 'analyticslas',\n", | |
" 'analyticslo',\n", | |
" 'analyticspasadas',\n", | |
" 'analyticspero',\n", | |
" 'analyticspor',\n", | |
" 'analyticstracking',\n", | |
" 'ancho',\n", | |
" 'anchor',\n", | |
" 'anchoring',\n", | |
" 'anchors',\n", | |
" 'anchos',\n", | |
" 'ancla',\n", | |
" 'and',\n", | |
" 'andorra',\n", | |
" 'andrew',\n", | |
" 'android',\n", | |
" 'angular',\n", | |
" 'anillo',\n", | |
" 'anima',\n", | |
" 'animado',\n", | |
" 'animales',\n", | |
" 'animar',\n", | |
" 'animo',\n", | |
" 'anne',\n", | |
" 'ano',\n", | |
" 'anotaciones',\n", | |
" 'another',\n", | |
" 'ansias',\n", | |
" 'ansidad',\n", | |
" 'ansiedad',\n", | |
" 'answering',\n", | |
" 'antecedentes',\n", | |
" 'antecedentesos',\n", | |
" 'anterior',\n", | |
" 'anteriores',\n", | |
" 'anterioridad',\n", | |
" 'anteriormente',\n", | |
" 'anti',\n", | |
" 'antiguo',\n", | |
" 'antiguos',\n", | |
" 'antivirus',\n", | |
" 'antojo',\n", | |
" 'anual',\n", | |
" 'anunciado',\n", | |
" 'anunciante',\n", | |
" 'anunciar',\n", | |
" 'anuncio',\n", | |
" 'anuncios',\n", | |
" 'anyone',\n", | |
" 'aol',\n", | |
" 'aov',\n", | |
" 'apagar',\n", | |
" 'aparatejo',\n", | |
" 'aparece',\n", | |
" 'aparecen',\n", | |
" 'aparecer',\n", | |
" 'apareces',\n", | |
" 'aparecido',\n", | |
" 'apareciendo',\n", | |
" 'aparecieran',\n", | |
" 'aparezca',\n", | |
" 'aparezcan',\n", | |
" 'apartado',\n", | |
" 'aparte',\n", | |
" 'apelar',\n", | |
" 'apenas',\n", | |
" 'apertura',\n", | |
" 'api',\n", | |
" 'apis',\n", | |
" 'aplica',\n", | |
" 'aplicables',\n", | |
" 'aplicaciones',\n", | |
" 'aplicada',\n", | |
" 'aplicadas',\n", | |
" 'aplicado',\n", | |
" 'aplicados',\n", | |
" 'aplicamos',\n", | |
" 'aplicando',\n", | |
" 'aplicar',\n", | |
" 'aplicarlo',\n", | |
" 'aplicarlos',\n", | |
" 'aplicas',\n", | |
" 'aplication',\n", | |
" 'aplico',\n", | |
" 'aplique',\n", | |
" 'aporta',\n", | |
" 'aportado',\n", | |
" 'aportar',\n", | |
" 'aporten',\n", | |
" 'apostado',\n", | |
" 'apostar',\n", | |
" 'apoyar',\n", | |
" 'apoyas',\n", | |
" 'apoyo',\n", | |
" 'app',\n", | |
" 'appanalytics',\n", | |
" 'apple',\n", | |
" 'approved',\n", | |
" 'apps',\n", | |
" 'appuna',\n", | |
" 'aprendan',\n", | |
" 'aprende',\n", | |
" 'aprender',\n", | |
" 'aprendido',\n", | |
" 'aprendiendo',\n", | |
" 'aprendizaje',\n", | |
" 'aprendizajes',\n", | |
" 'aprendo',\n", | |
" 'aprobados',\n", | |
" 'aprovechado',\n", | |
" 'aprovechando',\n", | |
" 'aprovechar',\n", | |
" 'aprovecharlo',\n", | |
" 'aprovecho',\n", | |
" 'aprox',\n", | |
" 'aproxima',\n", | |
" 'aproximada',\n", | |
" 'aproximadamente',\n", | |
" 'aproximado',\n", | |
" 'apunte',\n", | |
" 'aquel',\n", | |
" 'aquella',\n", | |
" 'aquellas',\n", | |
" 'aquello',\n", | |
" 'aquellos',\n", | |
" 'arcas',\n", | |
" 'archidemostrada',\n", | |
" 'archivada',\n", | |
" 'archive',\n", | |
" 'archivo',\n", | |
" 'archivos',\n", | |
" 'are',\n", | |
" 'argumentos',\n", | |
" 'armar',\n", | |
" 'armoniosa',\n", | |
" 'arquitectura',\n", | |
" 'arquitecturaa',\n", | |
" 'arreglado',\n", | |
" 'arreglan',\n", | |
" 'arreglando',\n", | |
" 'arreglar',\n", | |
" 'arriba',\n", | |
" 'arruinar',\n", | |
" 'arte',\n", | |
" 'artesanales',\n", | |
" 'articles',\n", | |
" 'artificial',\n", | |
" 'artificialmente',\n", | |
" 'arturo',\n", | |
" 'arturomarimon',\n", | |
" 'as',\n", | |
" 'asegura',\n", | |
" 'asegurar',\n", | |
" 'asegurarnos',\n", | |
" 'asegurarse',\n", | |
" 'aseguras',\n", | |
" 'asesorados',\n", | |
" 'asignado',\n", | |
" 'asignando',\n", | |
" 'asignar',\n", | |
" 'asignarles',\n", | |
" 'asimilado',\n", | |
" 'asimilar',\n", | |
" 'asistencial',\n", | |
" 'asistentes',\n", | |
" 'asistidas',\n", | |
" 'asistiendo',\n", | |
" 'ask',\n", | |
" 'asociada',\n", | |
" 'asociados',\n", | |
" 'asociar',\n", | |
" 'asociarse',\n", | |
" 'asociativo',\n", | |
" 'asombrados',\n", | |
" 'asp',\n", | |
" 'aspecto',\n", | |
" 'aspectos',\n", | |
" 'assesor',\n", | |
" 'assets',\n", | |
" 'assisted',\n", | |
" 'asumidas',\n", | |
" 'asumir',\n", | |
" 'asunciones',\n", | |
" 'asunto',\n", | |
" 'asustarnos',\n", | |
" 'async',\n", | |
" 'atacante',\n", | |
" 'atacaremos',\n", | |
" 'atado',\n", | |
" 'ataque',\n", | |
" 'atenta',\n", | |
" 'aterrice',\n", | |
" 'aterriza',\n", | |
" 'aterrizado',\n", | |
" 'aterrizaje',\n", | |
" 'aterrizan',\n", | |
" 'aterrizaron',\n", | |
" 'aterrize',\n", | |
" 'atractivas',\n", | |
" 'atrae',\n", | |
" 'atraen',\n", | |
" 'atraer',\n", | |
" 'atraigan',\n", | |
" 'atribuir',\n", | |
" 'atributos',\n", | |
" 'atts',\n", | |
" 'aturar',\n", | |
" 'auc',\n", | |
" 'audiencia',\n", | |
" 'auditadas',\n", | |
" 'auditado',\n", | |
" 'auditar',\n", | |
" 'auditivo',\n", | |
" 'augure',\n", | |
" 'august',\n", | |
" 'aumenta',\n", | |
" 'aumentado',\n", | |
" 'aumentan',\n", | |
" 'aumentando',\n", | |
" 'aumentar',\n", | |
" 'aumentaremos',\n", | |
" 'aumentaron',\n", | |
" 'aumentas',\n", | |
" 'aumente',\n", | |
" 'aumento',\n", | |
" 'aun',\n", | |
" 'aunque',\n", | |
" 'author',\n", | |
" 'authority',\n", | |
" 'authorrank',\n", | |
" 'auto',\n", | |
" 'autor',\n", | |
" 'autores',\n", | |
" 'autoridad',\n", | |
" 'autoridadpara',\n", | |
" 'autoritario',\n", | |
" 'av',\n", | |
" 'avanza',\n", | |
" 'avanzadas',\n", | |
" 'avanzado',\n", | |
" 'avanzados',\n", | |
" 'avanzar',\n", | |
" 'aveces',\n", | |
" 'avecinaba',\n", | |
" 'aventurero',\n", | |
" 'average',\n", | |
" 'avg',\n", | |
" 'avinash',\n", | |
" 'avisa',\n", | |
" 'avisando',\n", | |
" 'avisarte',\n", | |
" 'avises',\n", | |
" 'aviso',\n", | |
" 'awareness',\n", | |
" 'awarness',\n", | |
" 'ayer',\n", | |
" 'ayuda',\n", | |
" 'ayudado',\n", | |
" 'ayudamos',\n", | |
" 'ayudan',\n", | |
" 'ayudar',\n", | |
" 'ayudaron',\n", | |
" 'ayuden',\n", | |
" 'azar',\n", | |
" 'azul',\n", | |
" 'azules',\n", | |
" 'b',\n", | |
" 'back',\n", | |
" 'backend',\n", | |
" 'backlinks',\n", | |
" 'backlinkwatches',\n", | |
" 'baeza',\n", | |
" 'baezaeste',\n", | |
" 'baidu',\n", | |
" 'baja',\n", | |
" 'bajada',\n", | |
" 'bajadas',\n", | |
" 'bajamos',\n", | |
" 'bajar',\n", | |
" 'baje',\n", | |
" 'bajo',\n", | |
" 'balas',\n", | |
" 'banca',\n", | |
" 'bandera',\n", | |
" 'banderas',\n", | |
" 'bandwidth',\n", | |
" 'baneadas',\n", | |
" 'banner',\n", | |
" 'bar',\n", | |
" 'barcelona',\n", | |
" 'barra',\n", | |
" 'barreras',\n", | |
" 'barrio',\n", | |
" 'barry',\n", | |
" 'basaba',\n", | |
" 'basadas',\n", | |
" 'basado',\n", | |
" 'basados',\n", | |
" 'basamos',\n", | |
" 'basan',\n", | |
" 'basarse',\n", | |
" 'base',\n", | |
" 'bases',\n", | |
" 'basis',\n", | |
" 'basta',\n", | |
" 'bastante',\n", | |
" 'basura',\n", | |
" 'batch',\n", | |
" 'baymard',\n", | |
" 'bbdd',\n", | |
" 'be',\n", | |
" 'beefeater',\n", | |
" 'before',\n", | |
" 'behavior',\n", | |
" 'behavioral',\n", | |
" 'benchmark',\n", | |
" 'benedict',\n", | |
" 'beneficiado',\n", | |
" 'beneficie',\n", | |
" 'beneficio',\n", | |
" 'beneficios',\n", | |
" 'beneficiosa',\n", | |
" 'beneficiosas',\n", | |
" 'best',\n", | |
" 'bestia',\n", | |
" 'beta',\n", | |
" 'better',\n", | |
" 'bettman',\n", | |
" 'bichotoblog',\n", | |
" 'bien',\n", | |
" 'big',\n", | |
" 'bigbadlondon',\n", | |
" 'bill',\n", | |
" 'billete',\n", | |
" 'billion',\n", | |
" 'billones',\n", | |
" 'binario',\n", | |
" 'bing',\n", | |
" 'binocular',\n", | |
" 'biology',\n", | |
" 'bits',\n", | |
" 'bj',\n", | |
" 'black',\n", | |
" 'blackhat',\n", | |
" 'blanca',\n", | |
" 'blanco',\n", | |
" 'blog',\n", | |
" 'blogactualmente',\n", | |
" 'blogs',\n", | |
" 'bloguismo',\n", | |
" 'bloqueado',\n", | |
" 'bloqueados',\n", | |
" 'bloquean',\n", | |
" 'bloqueando',\n", | |
" 'bloquedos',\n", | |
" 'bobbink',\n", | |
" 'body',\n", | |
" 'bola',\n", | |
" 'bonito',\n", | |
" 'booksquery',\n", | |
" 'boom',\n", | |
" 'boost',\n", | |
" 'booster',\n", | |
" 'borra',\n", | |
" 'borrado',\n", | |
" 'borran',\n", | |
" 'borrar',\n", | |
" 'borre',\n", | |
" 'bot',\n", | |
" 'botella',\n", | |
" 'botones',\n", | |
" 'bots',\n", | |
" 'bottom',\n", | |
" 'bounce',\n", | |
" 'brainsins',\n", | |
" 'brand',\n", | |
" 'branding',\n", | |
" 'branzai',\n", | |
" 'breadcrumb',\n", | |
" 'break',\n", | |
" 'breve',\n", | |
" 'brexit',\n", | |
" 'brillantes',\n", | |
" 'brin',\n", | |
" 'brindan',\n", | |
" 'british',\n", | |
" 'brock',\n", | |
" 'broken',\n", | |
" 'broma',\n", | |
" 'browser',\n", | |
" 'brqhcbfwsg',\n", | |
" 'brutal',\n", | |
" 'brutos',\n", | |
" 'budget',\n", | |
" 'buen',\n", | |
" 'buena',\n", | |
" 'buenas',\n", | |
" 'bueno',\n", | |
" 'buenodos',\n", | |
" 'buenos',\n", | |
" 'buffer',\n", | |
" 'bug',\n", | |
" 'builder',\n", | |
" 'building',\n", | |
" 'builing',\n", | |
" 'bulo',\n", | |
" 'bumm',\n", | |
" 'busca',\n", | |
" 'buscaba',\n", | |
" 'buscaban',\n", | |
" 'buscado',\n", | |
" 'buscador',\n", | |
" 'buscadoren',\n", | |
" 'buscadores',\n", | |
" 'buscadoreslos',\n", | |
" 'buscadorespero',\n", | |
" 'buscadorporque',\n", | |
" 'buscamos',\n", | |
" 'buscan',\n", | |
" 'buscando',\n", | |
" 'buscar',\n", | |
" 'buscarlo',\n", | |
" 'buscaron',\n", | |
" 'buscas',\n", | |
" 'busquen',\n", | |
" 'busques',\n", | |
" 'by',\n", | |
" 'byte',\n", | |
" 'c',\n", | |
" 'cabeceras',\n", | |
" 'cabecerasse',\n", | |
" 'cabeza',\n", | |
" 'cabezas',\n", | |
" 'cabo',\n", | |
" 'cache',\n", | |
" 'cacheada',\n", | |
" 'cacheo',\n", | |
" 'cacioppo',\n", | |
" 'cada',\n", | |
" 'cadenas',\n", | |
" 'caduco',\n", | |
" 'cae',\n", | |
" 'caer',\n", | |
" 'caffeine',\n", | |
" 'caja',\n", | |
" 'calcula',\n", | |
" 'calculadora',\n", | |
" 'calculamos',\n", | |
" 'calcular',\n", | |
" 'calcularlo',\n", | |
" 'caldeado',\n", | |
" 'calendario',\n", | |
" 'calidad',\n", | |
" 'calidadal',\n", | |
" 'calientes',\n", | |
" 'california',\n", | |
" 'call',\n", | |
" 'calma',\n", | |
" 'calor',\n", | |
" 'cambia',\n", | |
" 'cambiado',\n", | |
" 'cambian',\n", | |
" 'cambiando',\n", | |
" 'cambiar',\n", | |
" 'cambiarle',\n", | |
" 'cambiarlo',\n", | |
" 'cambie',\n", | |
" 'cambies',\n", | |
" 'cambio',\n", | |
" 'cambiopagina',\n", | |
" 'cambios',\n", | |
" 'camello',\n", | |
" 'camino',\n", | |
" 'caminos',\n", | |
" 'camisetas',\n", | |
" 'campo',\n", | |
" 'campos',\n", | |
" 'camposresponsable',\n", | |
" 'can',\n", | |
" 'canal',\n", | |
" 'canales',\n", | |
" 'canary',\n", | |
" 'canibalices',\n", | |
" 'canonical',\n", | |
" 'canonicalitis',\n", | |
" 'cansa',\n", | |
" 'cansado',\n", | |
" 'cantando',\n", | |
" 'cantidad',\n", | |
" 'capaces',\n", | |
" 'capacidad',\n", | |
" 'capadas',\n", | |
" 'capando',\n", | |
" 'capas',\n", | |
" 'capaz',\n", | |
" 'captados',\n", | |
" 'captar',\n", | |
" 'captcha',\n", | |
" 'captchas',\n", | |
" 'captchaslos',\n", | |
" 'captology',\n", | |
" 'capturar',\n", | |
" 'capturas',\n", | |
" 'cara',\n", | |
" 'caracteres',\n", | |
" 'caramelo',\n", | |
" 'caras',\n", | |
" 'card',\n", | |
" 'cardinal',\n", | |
" 'cards',\n", | |
" 'carencia',\n", | |
" 'carga',\n", | |
" 'cargando',\n", | |
" 'cargar',\n", | |
" 'cargarlos',\n", | |
" 'cargarse',\n", | |
" 'cargarte',\n", | |
" 'carguemos',\n", | |
" 'carguen',\n", | |
" 'carlos',\n", | |
" 'carlosredondo',\n", | |
" 'caro',\n", | |
" 'carrito',\n", | |
" 'carritos',\n", | |
" 'carrusel',\n", | |
" 'casa',\n", | |
" 'casas',\n", | |
" 'case',\n", | |
" 'casi',\n", | |
" 'casilla',\n", | |
" 'caso',\n", | |
" 'casos',\n", | |
" 'casta',\n", | |
" 'castellano',\n", | |
" 'casualidad',\n", | |
" 'casualmente',\n", | |
" 'cat',\n", | |
" 'catal',\n", | |
" 'catalan',\n", | |
" 'catalunya',\n", | |
" 'categorizan',\n", | |
" 'categorizar',\n", | |
" 'categorizarlo',\n", | |
" 'category',\n", | |
" 'categorydelegating',\n", | |
" 'causa',\n", | |
" 'causado',\n", | |
" 'causalidad',\n", | |
" 'causan',\n", | |
" 'cause',\n", | |
" 'caused',\n", | |
" 'cc',\n", | |
" 'cd',\n", | |
" 'cdn',\n", | |
" 'cedido',\n", | |
" 'cegados',\n", | |
" 'cegaron',\n", | |
" 'cenando',\n", | |
" 'centenares',\n", | |
" 'centrada',\n", | |
" 'centrado',\n", | |
" 'centrados',\n", | |
" 'central',\n", | |
" 'centralizada',\n", | |
" 'centramos',\n", | |
" 'centran',\n", | |
" 'centrando',\n", | |
" 'centrar',\n", | |
" 'centrarnos',\n", | |
" 'centrarse',\n", | |
" 'centras',\n", | |
" 'centre',\n", | |
" 'centremos',\n", | |
" 'centric',\n", | |
" 'centro',\n", | |
" 'ceo',\n", | |
" 'cercana',\n", | |
" 'cercanas',\n", | |
" 'cercano',\n", | |
" 'cerebro',\n", | |
" 'cerebros',\n", | |
" 'cereto',\n", | |
" 'cero',\n", | |
" 'cerrado',\n", | |
" 'cerrados',\n", | |
" 'cerrar',\n", | |
" 'certeza',\n", | |
" 'certificadas',\n", | |
" 'certificados',\n", | |
" 'chacras',\n", | |
" 'chaiken',\n", | |
" 'change',\n", | |
" 'channel',\n", | |
" 'chapter',\n", | |
" 'charla',\n", | |
" 'charlas',\n", | |
" 'chat',\n", | |
" 'chats',\n", | |
" 'checkbox',\n", | |
" 'checkeo',\n", | |
" 'checker',\n", | |
" 'checklist',\n", | |
" 'checklisty',\n", | |
" 'checkout',\n", | |
" 'checkoutestas',\n", | |
" 'chica',\n", | |
" 'chico',\n", | |
" 'chicos',\n", | |
" 'china',\n", | |
" 'choose',\n", | |
" 'chorrada',\n", | |
" 'christian',\n", | |
" 'chrome',\n", | |
" 'chulas',\n", | |
" 'chuleta',\n", | |
" 'chulos',\n", | |
" 'chute',\n", | |
" 'cialdini',\n", | |
" 'ciclo',\n", | |
" 'cid',\n", | |
" 'ciegas',\n", | |
" 'ciegos',\n", | |
" 'cientos',\n", | |
" 'cierre',\n", | |
" 'cierta',\n", | |
" 'ciertas',\n", | |
" 'cierto',\n", | |
" 'ciertos',\n", | |
" 'cifrado',\n", | |
" 'cifrados',\n", | |
" 'cinco',\n", | |
" 'cingulado',\n", | |
" 'circulito',\n", | |
" 'circunstancias',\n", | |
" 'cirujano',\n", | |
" 'cita',\n", | |
" 'citadas',\n", | |
" 'citado',\n", | |
" 'citarlo',\n", | |
" 'cito',\n", | |
" 'ciudad',\n", | |
" 'ciudadanos',\n", | |
" 'claim',\n", | |
" 'clara',\n", | |
" 'claraincluye',\n", | |
" 'claramente',\n", | |
" 'claras',\n", | |
" 'claro',\n", | |
" 'claros',\n", | |
" 'clases',\n", | |
" 'clasificado',\n", | |
" 'clasificados',\n", | |
" 'clasificar',\n", | |
" 'classic',\n", | |
" 'classification',\n", | |
" 'classificationmining',\n", | |
" 'clave',\n", | |
" 'claveestos',\n", | |
" 'claves',\n", | |
" 'cleaner',\n", | |
" 'clic',\n", | |
" 'clica',\n", | |
" 'clicamos',\n", | |
" 'clicar',\n", | |
" 'clicarse',\n", | |
" 'click',\n", | |
" 'clicks',\n", | |
" 'clickworker',\n", | |
" 'clics',\n", | |
" 'client',\n", | |
" 'cliente',\n", | |
" 'clientes',\n", | |
" 'clienteun',\n", | |
" 'clinic',\n", | |
" 'clinicseo',\n", | |
" 'cliquen',\n", | |
" 'cloaking',\n", | |
" 'clusters',\n", | |
" 'cms',\n", | |
" 'cnn',\n", | |
" 'co',\n", | |
" 'coach',\n", | |
" 'cobren',\n", | |
" 'cocacola',\n", | |
" 'code',\n", | |
" 'codificado',\n", | |
" 'coge',\n", | |
" 'cogen',\n", | |
" 'coger',\n", | |
" 'cogerlos',\n", | |
" 'cogido',\n", | |
" 'cogiendo',\n", | |
" 'cognitiva',\n", | |
" 'cognitivala',\n", | |
" 'cognitivamente',\n", | |
" 'cognitivas',\n", | |
" 'cognitivos',\n", | |
" 'coherencia',\n", | |
" 'coherente',\n", | |
" 'cohort',\n", | |
" 'coincida',\n", | |
" 'coincidence',\n", | |
" 'coincidencias',\n", | |
" 'coincidiendo',\n", | |
" 'coincidir',\n", | |
" 'cojones',\n", | |
" 'cola',\n", | |
" 'colabora',\n", | |
" 'colaborativa',\n", | |
" 'colarse',\n", | |
" ...]" | |
] | |
}, | |
"execution_count": 4, | |
"metadata": {}, | |
"output_type": "execute_result" | |
} | |
],
"source": [
"tfidf = TfidfVectorizer(stop_words=stopwords.words('spanish'),\n",
"                        max_df=0.6, min_df=1, token_pattern=r'(?u)\\b[A-Za-z]+\\b')\n",
"csr_mat = tfidf.fit_transform(articles)\n",
"# get_feature_names() was removed in scikit-learn 1.2; on versions before 1.0 use it instead\n",
"words = tfidf.get_feature_names_out().tolist()\n",
"words" | |
] | |
}, | |
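{
"cell_type": "markdown",
"metadata": {},
"source": [
"A minimal sketch of how max_df prunes the vocabulary, using a tiny made-up corpus. The names toy_docs and toy_vec are hypothetical and not part of the original analysis; the thresholds match the cell above."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"from sklearn.feature_extraction.text import TfidfVectorizer\n",
"\n",
"# Illustrative corpus (made up): 'seo' appears in 3 of 4 documents.\n",
"toy_docs = ['seo rastreo google', 'seo datos heatmap', 'seo ctr rankings', 'analytics datos']\n",
"toy_vec = TfidfVectorizer(max_df=0.6, min_df=1, token_pattern=r'(?u)\\b[A-Za-z]+\\b')\n",
"toy_vec.fit(toy_docs)\n",
"\n",
"# 'seo' has a document frequency of 75% (> 60%), so max_df drops it;\n",
"# 'datos' (50%) and every single-document term are kept because min_df=1.\n",
"print('seo' in toy_vec.vocabulary_)    # False\n",
"print('datos' in toy_vec.vocabulary_)  # True\n",
"print(sorted(toy_vec.vocabulary_))"
]
},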
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Creating an <a href=\"https://mlexplained.com/2017/12/28/a-practical-introduction-to-nmf-nonnegative-matrix-factorization/\">NMF model (nonnegative matrix factorization)</a>. Here we are <a href=\"https://blog.exploratory.io/demystifying-text-analytics-part-4-dimensionality-reduction-and-clustering-in-r-cbc8c90e0b14\">reducing the dimensionality</a> of the sparse TF-IDF matrix created above."
] | |
}, | |
{
"cell_type": "code",
"execution_count": 5,
"metadata": {},
"outputs": [],
"source": [
"# Tune n_components for your site (roughly the number of 'topics' you expect)\n",
"model = NMF(n_components=10)\n",
"model.fit(csr_mat)\n", | |
"nmf_features = model.transform(csr_mat)" | |
] | |
}, | |
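{
"cell_type": "markdown",
"metadata": {},
"source": [
"To judge whether n_components is reasonable, it can help to look at the highest-weighted words of each component. A minimal sketch, assuming the model and words variables from the cells above; top_terms is a hypothetical helper, not part of scikit-learn."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"import numpy as np\n",
"\n",
"def top_terms(nmf_model, vocabulary, n_terms=8):\n",
"    \"\"\"Return the n_terms highest-weighted words for each NMF component.\"\"\"\n",
"    vocabulary = np.asarray(vocabulary)\n",
"    # components_ has shape (n_components, n_features); sort each row by weight, descending\n",
"    return [vocabulary[np.argsort(row)[::-1][:n_terms]].tolist()\n",
"            for row in nmf_model.components_]\n",
"\n",
"# Uses `model` and `words` defined in the cells above.\n",
"for i, terms in enumerate(top_terms(model, words)):\n",
"    print(i, ', '.join(terms))"
]
},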
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Normalizing the NMF features so that each article's vector has unit length (L2 norm). The dot product of two rows is then their cosine similarity."
] | |
}, | |
{ | |
"cell_type": "code", | |
"execution_count": 6, | |
"metadata": {}, | |
"outputs": [ | |
{ | |
"data": { | |
"text/html": [ | |
"<div>\n", | |
"<style scoped>\n", | |
" .dataframe tbody tr th:only-of-type {\n", | |
" vertical-align: middle;\n", | |
" }\n", | |
"\n", | |
" .dataframe tbody tr th {\n", | |
" vertical-align: top;\n", | |
" }\n", | |
"\n", | |
" .dataframe thead th {\n", | |
" text-align: right;\n", | |
" }\n", | |
"</style>\n", | |
"<table border=\"1\" class=\"dataframe\">\n", | |
" <thead>\n", | |
" <tr style=\"text-align: right;\">\n", | |
" <th></th>\n", | |
" <th>0</th>\n", | |
" <th>1</th>\n", | |
" <th>2</th>\n", | |
" <th>3</th>\n", | |
" <th>4</th>\n", | |
" <th>5</th>\n", | |
" <th>6</th>\n", | |
" <th>7</th>\n", | |
" <th>8</th>\n", | |
" <th>9</th>\n", | |
" </tr>\n", | |
" </thead>\n", | |
" <tbody>\n", | |
" <tr>\n", | |
" <td>SEO para Progressive Web APPs (PWA) y JavaScript</td>\n", | |
" <td>0.000000</td>\n", | |
" <td>0.040243</td>\n", | |
" <td>0.0</td>\n", | |
" <td>0.052272</td>\n", | |
" <td>0.379103</td>\n", | |
" <td>0.000000</td>\n", | |
" <td>0.072834</td>\n", | |
" <td>0.138624</td>\n", | |
" <td>0.045398</td>\n", | |
" <td>0.908486</td>\n", | |
" </tr>\n", | |
" <tr>\n", | |
" <td>Cómo mejorar el SEO incrementando la frecuencia de rastreo</td>\n", | |
" <td>0.056258</td>\n", | |
" <td>0.000000</td>\n", | |
" <td>0.0</td>\n", | |
" <td>0.000000</td>\n", | |
" <td>0.349612</td>\n", | |
" <td>0.000000</td>\n", | |
" <td>0.018862</td>\n", | |
" <td>0.030825</td>\n", | |
" <td>0.005401</td>\n", | |
" <td>0.934490</td>\n", | |
" </tr>\n", | |
" <tr>\n", | |
" <td>CTR y SEO: Manipulación del CTR para influir en los rankings</td>\n", | |
" <td>0.946162</td>\n", | |
" <td>0.000000</td>\n", | |
" <td>0.0</td>\n", | |
" <td>0.000000</td>\n", | |
" <td>0.066596</td>\n", | |
" <td>0.000000</td>\n", | |
" <td>0.000000</td>\n", | |
" <td>0.316768</td>\n", | |
" <td>0.000000</td>\n", | |
" <td>0.000000</td>\n", | |
" </tr>\n", | |
" <tr>\n", | |
" <td>Las mejores herramientas de visualización de datos gratuitas</td>\n", | |
" <td>0.000000</td>\n", | |
" <td>0.973158</td>\n", | |
" <td>0.0</td>\n", | |
" <td>0.000000</td>\n", | |
" <td>0.000000</td>\n", | |
" <td>0.000000</td>\n", | |
" <td>0.000000</td>\n", | |
" <td>0.230139</td>\n", | |
" <td>0.000000</td>\n", | |
" <td>0.000000</td>\n", | |
" </tr>\n", | |
" <tr>\n", | |
" <td>Cómo hacer un Heatmap de las visitas de tu web por día y hora</td>\n", | |
" <td>0.352149</td>\n", | |
" <td>0.654728</td>\n", | |
" <td>0.0</td>\n", | |
" <td>0.136502</td>\n", | |
" <td>0.079182</td>\n", | |
" <td>0.119675</td>\n", | |
" <td>0.000000</td>\n", | |
" <td>0.000000</td>\n", | |
" <td>0.627579</td>\n", | |
" <td>0.119344</td>\n", | |
" </tr>\n", | |
" </tbody>\n", | |
"</table>\n", | |
"</div>" | |
], | |
"text/plain": [ | |
" 0 1 2 \\\n", | |
"SEO para Progressive Web APPs (PWA) y JavaScript 0.000000 0.040243 0.0 \n", | |
"Cómo mejorar el SEO incrementando la frecuencia... 0.056258 0.000000 0.0 \n", | |
"CTR y SEO: Manipulación del CTR para influir en... 0.946162 0.000000 0.0 \n", | |
"Las mejores herramientas de visualización de da... 0.000000 0.973158 0.0 \n", | |
"Cómo hacer un Heatmap de las visitas de tu web ... 0.352149 0.654728 0.0 \n", | |
"\n", | |
" 3 4 \\\n", | |
"SEO para Progressive Web APPs (PWA) y JavaScript 0.052272 0.379103 \n", | |
"Cómo mejorar el SEO incrementando la frecuencia... 0.000000 0.349612 \n", | |
"CTR y SEO: Manipulación del CTR para influir en... 0.000000 0.066596 \n", | |
"Las mejores herramientas de visualización de da... 0.000000 0.000000 \n", | |
"Cómo hacer un Heatmap de las visitas de tu web ... 0.136502 0.079182 \n", | |
"\n", | |
" 5 6 \\\n", | |
"SEO para Progressive Web APPs (PWA) y JavaScript 0.000000 0.072834 \n", | |
"Cómo mejorar el SEO incrementando la frecuencia... 0.000000 0.018862 \n", | |
"CTR y SEO: Manipulación del CTR para influir en... 0.000000 0.000000 \n", | |
"Las mejores herramientas de visualización de da... 0.000000 0.000000 \n", | |
"Cómo hacer un Heatmap de las visitas de tu web ... 0.119675 0.000000 \n", | |
"\n", | |
" 7 8 \\\n", | |
"SEO para Progressive Web APPs (PWA) y JavaScript 0.138624 0.045398 \n", | |
"Cómo mejorar el SEO incrementando la frecuencia... 0.030825 0.005401 \n", | |
"CTR y SEO: Manipulación del CTR para influir en... 0.316768 0.000000 \n", | |
"Las mejores herramientas de visualización de da... 0.230139 0.000000 \n", | |
"Cómo hacer un Heatmap de las visitas de tu web ... 0.000000 0.627579 \n", | |
"\n", | |
" 9 \n", | |
"SEO para Progressive Web APPs (PWA) y JavaScript 0.908486 \n", | |
"Cómo mejorar el SEO incrementando la frecuencia... 0.934490 \n", | |
"CTR y SEO: Manipulación del CTR para influir en... 0.000000 \n", | |
"Las mejores herramientas de visualización de da... 0.000000 \n", | |
"Cómo hacer un Heatmap de las visitas de tu web ... 0.119344 " | |
] | |
}, | |
"metadata": {}, | |
"output_type": "display_data" | |
} | |
], | |
"source": [ | |
"norm_features = normalize(nmf_features)\n", | |
"\n", | |
"df = pd.DataFrame(norm_features,index=titles)\n", | |
"display(df.head())" | |
] | |
}, | |
{ | |
"cell_type": "code", | |
"execution_count": 7, | |
"metadata": {}, | |
"outputs": [], | |
"source": [ | |
"article = df.loc['Qué es un buscador semántico']" | |
] | |
}, | |
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Displaying the 5 articles with the highest cosine similarity to the selected one (nlargest() returns 5 rows by default; the article itself comes first with a score of 1.0)"
] | |
}, | |
{ | |
"cell_type": "code", | |
"execution_count": 8, | |
"metadata": {}, | |
"outputs": [ | |
{ | |
"data": { | |
"text/plain": [ | |
"Qué es un buscador semántico 1.000000\n", | |
"Qué son las entidades y su implicación en el SEO 0.999439\n", | |
"SEO Semántico para la Web Semántica 0.998715\n", | |
"Keyword Research con Google Refine [Vídeo Tutorial] 0.734699\n", | |
"SEO, rankings y conversión 0.430864\n", | |
"dtype: float64" | |
] | |
}, | |
"execution_count": 8, | |
"metadata": {}, | |
"output_type": "execute_result" | |
} | |
], | |
"source": [ | |
"similarities = df.dot(article)\n", | |
"similarities.nlargest()" | |
] | |
}, | |
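{
"cell_type": "markdown",
"metadata": {},
"source": [
"The lookup above can be wrapped in a small helper that also drops the query article from its own recommendations. A sketch, assuming titles are unique in df; recommend is a hypothetical name, not part of the original notebook."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"def recommend(title, features, n=5):\n",
"    \"\"\"Return the n articles most similar to `title`, excluding the article itself.\"\"\"\n",
"    # Rows are L2-normalized, so the dot product equals the cosine similarity.\n",
"    scores = features.dot(features.loc[title])\n",
"    return scores.drop(title).nlargest(n)\n",
"\n",
"# `df` is the DataFrame of normalized NMF features built above (assumes unique titles).\n",
"recommend('Qué es un buscador semántico', df, n=5)"
]
},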
{ | |
"cell_type": "code", | |
"execution_count": null, | |
"metadata": {}, | |
"outputs": [], | |
"source": [] | |
} | |
], | |
"metadata": { | |
"kernelspec": { | |
"display_name": "Python 3", | |
"language": "python", | |
"name": "python3" | |
}, | |
"language_info": { | |
"codemirror_mode": { | |
"name": "ipython", | |
"version": 3 | |
}, | |
"file_extension": ".py", | |
"mimetype": "text/x-python", | |
"name": "python", | |
"nbconvert_exporter": "python", | |
"pygments_lexer": "ipython3", | |
"version": "3.7.3" | |
} | |
}, | |
"nbformat": 4, | |
"nbformat_minor": 2 | |
} |