@natzir
Created October 7, 2019 13:51
{
"cells": [
{
"cell_type": "markdown",
"metadata": {},
"source": [
"<h3>Semantic crosslinking article recommender</h3><p>@author: Natzir Turrado: Technical SEO / Data Scientist. <a href=\"https://twitter.com/natzir9\">Twitter > @natzir9</a></p>\n"
]
},
{
"cell_type": "code",
"execution_count": 1,
"metadata": {},
"outputs": [],
"source": [
"import pandas as pd\n",
"import matplotlib.pyplot as plt\n",
"import nltk\n",
"from sklearn.feature_extraction.text import TfidfVectorizer\n",
"from sklearn.decomposition import NMF\n",
"from sklearn.preprocessing import normalize\n",
"from nltk.corpus import stopwords\n",
"from IPython.display import display"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Crawl your site an get the title and the content of every post. Create an excel file named \"data.xlsx\" and save it in the same work space as this file.\n",
"<ul>\n",
"<li><a href=\"https://builtvisible.com/seo-guide-to-xpath/\">XPATH for SEO's</a></li>\n",
"<li><a href=\"https://www.screamingfrog.co.uk/web-scraping/\">Web Scraping & Data Extraction</a></li>\n",
"</ul>"
]
},
{
"cell_type": "code",
"execution_count": 2,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"<div>\n",
"<style scoped>\n",
" .dataframe tbody tr th:only-of-type {\n",
" vertical-align: middle;\n",
" }\n",
"\n",
" .dataframe tbody tr th {\n",
" vertical-align: top;\n",
" }\n",
"\n",
" .dataframe thead th {\n",
" text-align: right;\n",
" }\n",
"</style>\n",
"<table border=\"1\" class=\"dataframe\">\n",
" <thead>\n",
" <tr style=\"text-align: right;\">\n",
" <th></th>\n",
" <th>title</th>\n",
" <th>article</th>\n",
" </tr>\n",
" </thead>\n",
" <tbody>\n",
" <tr>\n",
" <td>0</td>\n",
" <td>SEO para Progressive Web APPs (PWA) y JavaScript</td>\n",
" <td>Este artículo es un resumen que hemos hecho Ch...</td>\n",
" </tr>\n",
" <tr>\n",
" <td>1</td>\n",
" <td>Cómo mejorar el SEO incrementando la frecuenci...</td>\n",
" <td>Siempre digo que es mejor remar a favor que ir...</td>\n",
" </tr>\n",
" <tr>\n",
" <td>2</td>\n",
" <td>CTR y SEO: Manipulación del CTR para influir e...</td>\n",
" <td>Últimamente se han puesto pesaditos con el tem...</td>\n",
" </tr>\n",
" <tr>\n",
" <td>3</td>\n",
" <td>Las mejores herramientas de visualización de d...</td>\n",
" <td>Artículo publicado en Doctor Metrics sobre her...</td>\n",
" </tr>\n",
" <tr>\n",
" <td>4</td>\n",
" <td>Cómo hacer un Heatmap de las visitas de tu web...</td>\n",
" <td>Visualizar la información en forma de mapa de ...</td>\n",
" </tr>\n",
" </tbody>\n",
"</table>\n",
"</div>"
],
"text/plain": [
" title \\\n",
"0 SEO para Progressive Web APPs (PWA) y JavaScript \n",
"1 Cómo mejorar el SEO incrementando la frecuenci... \n",
"2 CTR y SEO: Manipulación del CTR para influir e... \n",
"3 Las mejores herramientas de visualización de d... \n",
"4 Cómo hacer un Heatmap de las visitas de tu web... \n",
"\n",
" article \n",
"0 Este artículo es un resumen que hemos hecho Ch... \n",
"1 Siempre digo que es mejor remar a favor que ir... \n",
"2 Últimamente se han puesto pesaditos con el tem... \n",
"3 Artículo publicado en Doctor Metrics sobre her... \n",
"4 Visualizar la información en forma de mapa de ... "
]
},
"metadata": {},
"output_type": "display_data"
}
],
"source": [
"df = pd.read_excel('data.xlsx')\n",
"display(df.head())"
]
},
{
"cell_type": "code",
"execution_count": 3,
"metadata": {},
"outputs": [],
"source": [
"articles = df.article.tolist()\n",
"titles = df.title.tolist()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Transforming a list of documents into a word frequency array (TF-IDF). I'm only accepting words (token_pattern) and ignoring terms that appear in more than 60% of the documents (max_df) and that appear in less than 1 document (min_df). You should test than on a bigger website than my blog. <a href=\"http://dfrancis.co/2017/10/06/tf-idf-vectorizer-fit-and-transform/\">More info on min_df & max_df here</a>."
]
},
{
"cell_type": "code",
"execution_count": 4,
"metadata": {
"scrolled": true
},
"outputs": [
{
"data": {
"text/plain": [
"['aarrr',\n",
" 'abajo',\n",
" 'abandonamos',\n",
" 'abandonar',\n",
" 'abandone',\n",
" 'abandono',\n",
" 'abandonos',\n",
" 'abarcan',\n",
" 'abc',\n",
" 'abierta',\n",
" 'abiertas',\n",
" 'ability',\n",
" 'abismales',\n",
" 'about',\n",
" 'above',\n",
" 'abrazando',\n",
" 'abril',\n",
" 'abrimos',\n",
" 'abrir',\n",
" 'abrirlo',\n",
" 'absoluta',\n",
" 'absolutamente',\n",
" 'absolutas',\n",
" 'absoluto',\n",
" 'abuelo',\n",
" 'abundancia',\n",
" 'aburriros',\n",
" 'acaba',\n",
" 'acababan',\n",
" 'acabado',\n",
" 'acaban',\n",
" 'acabar',\n",
" 'acabaremos',\n",
" 'acabe',\n",
" 'acaben',\n",
" 'acabo',\n",
" 'acceda',\n",
" 'accede',\n",
" 'acceder',\n",
" 'accedido',\n",
" 'accesibilidad',\n",
" 'accesible',\n",
" 'accesibles',\n",
" 'accesiblesanaliza',\n",
" 'accesiblesporcentaje',\n",
" 'acceso',\n",
" 'accesos',\n",
" 'accionable',\n",
" 'accionables',\n",
" 'acciones',\n",
" 'acento',\n",
" 'acepta',\n",
" 'aceptada',\n",
" 'aceptadas',\n",
" 'aceptados',\n",
" 'aceptar',\n",
" 'aceptarlas',\n",
" 'acerca',\n",
" 'acercarse',\n",
" 'acertar',\n",
" 'acierto',\n",
" 'aconsejar',\n",
" 'aconsejo',\n",
" 'acordamos',\n",
" 'acordaros',\n",
" 'acortar',\n",
" 'acostumbrados',\n",
" 'acotar',\n",
" 'acquisition',\n",
" 'across',\n",
" 'action',\n",
" 'actionsitelinksrich',\n",
" 'actitud',\n",
" 'actitudes',\n",
" 'activa',\n",
" 'activado',\n",
" 'activador',\n",
" 'activan',\n",
" 'activar',\n",
" 'activas',\n",
" 'activation',\n",
" 'actividad',\n",
" 'actividades',\n",
" 'activity',\n",
" 'activo',\n",
" 'activos',\n",
" 'actual',\n",
" 'actuales',\n",
" 'actualesa',\n",
" 'actualidad',\n",
" 'actualiza',\n",
" 'actualizada',\n",
" 'actualizar',\n",
" 'actualizarse',\n",
" 'actualizo',\n",
" 'actualmente',\n",
" 'actuamos',\n",
" 'actuar',\n",
" 'acuerda',\n",
" 'acuerdo',\n",
" 'ad',\n",
" 'adams',\n",
" 'adaptado',\n",
" 'adaptan',\n",
" 'adaptar',\n",
" 'adaptarlos',\n",
" 'adaptarse',\n",
" 'adapten',\n",
" 'addeventlistener',\n",
" 'adding',\n",
" 'addon',\n",
" 'adecuada',\n",
" 'adecuadas',\n",
" 'adecuado',\n",
" 'adecuados',\n",
" 'adelantado',\n",
" 'adelante',\n",
" 'adelantee',\n",
" 'adelanto',\n",
" 'ademas',\n",
" 'adjuntado',\n",
" 'adjust',\n",
" 'admin',\n",
" 'administrador',\n",
" 'administrativas',\n",
" 'admitan',\n",
" 'adora',\n",
" 'adquirido',\n",
" 'adquiriendo',\n",
" 'adquirir',\n",
" 'adquisitivo',\n",
" 'adri',\n",
" 'ads',\n",
" 'adsense',\n",
" 'advance',\n",
" 'adversario',\n",
" 'adversidad',\n",
" 'adverstising',\n",
" 'advertisign',\n",
" 'advertising',\n",
" 'advocate',\n",
" 'adwords',\n",
" 'adwordsgoogle',\n",
" 'aede',\n",
" 'afecta',\n",
" 'afectadas',\n",
" 'afectado',\n",
" 'afectan',\n",
" 'afectando',\n",
" 'afectar',\n",
" 'afectarnos',\n",
" 'afiliado',\n",
" 'afiliados',\n",
" 'afines',\n",
" 'afirma',\n",
" 'afirmaciones',\n",
" 'afirman',\n",
" 'afirme',\n",
" 'afrontar',\n",
" 'after',\n",
" 'again',\n",
" 'against',\n",
" 'agencia',\n",
" 'agencias',\n",
" 'agents',\n",
" 'agilidad',\n",
" 'agilizar',\n",
" 'agosto',\n",
" 'agotados',\n",
" 'agradece',\n",
" 'agradecer',\n",
" 'agradecido',\n",
" 'agrega',\n",
" 'agregado',\n",
" 'agregados',\n",
" 'agregas',\n",
" 'agrupados',\n",
" 'agrupar',\n",
" 'agruparlos',\n",
" 'agua',\n",
" 'agujeros',\n",
" 'ah',\n",
" 'ahora',\n",
" 'ahorrar',\n",
" 'ahrefs',\n",
" 'ai',\n",
" 'aida',\n",
" 'aidas',\n",
" 'airways',\n",
" 'aisladas',\n",
" 'aislar',\n",
" 'alberto',\n",
" 'alcachofas',\n",
" 'alcance',\n",
" 'alcanzables',\n",
" 'alcanzablesderivado',\n",
" 'alcanzar',\n",
" 'alchemyapi',\n",
" 'alemania',\n",
" 'alerta',\n",
" 'alerting',\n",
" 'alexa',\n",
" 'aleyda',\n",
" 'algoritmia',\n",
" 'algoritmo',\n",
" 'algoritmos',\n",
" 'alguien',\n",
" 'alguna',\n",
" 'alguno',\n",
" 'aliada',\n",
" 'alignment',\n",
" 'alinea',\n",
" 'alineadas',\n",
" 'alinean',\n",
" 'alinearlo',\n",
" 'alineen',\n",
" 'all',\n",
" 'allows',\n",
" 'allowscriptaccess',\n",
" 'almacenado',\n",
" 'almacene',\n",
" 'almost',\n",
" 'alone',\n",
" 'along',\n",
" 'alpha',\n",
" 'alphami',\n",
" 'alphaprobar',\n",
" 'alquiler',\n",
" 'already',\n",
" 'alrededor',\n",
" 'also',\n",
" 'alta',\n",
" 'altas',\n",
" 'alterando',\n",
" 'alterativas',\n",
" 'alternate',\n",
" 'alternativa',\n",
" 'alternativas',\n",
" 'alto',\n",
" 'altruismo',\n",
" 'altura',\n",
" 'alturas',\n",
" 'alucinar',\n",
" 'alumnos',\n",
" 'always',\n",
" 'amateur',\n",
" 'amateurs',\n",
" 'amazoncart',\n",
" 'amb',\n",
" 'ambas',\n",
" 'ambientadas',\n",
" 'ambiente',\n",
" 'ambiguaciones',\n",
" 'ambos',\n",
" 'ameno',\n",
" 'americano',\n",
" 'amigable',\n",
" 'amigo',\n",
" 'amigos',\n",
" 'amiguetes',\n",
" 'amoldarnos',\n",
" 'amos',\n",
" 'amp',\n",
" 'amphtml',\n",
" 'amplia',\n",
" 'ampliando',\n",
" 'amplio',\n",
" 'an',\n",
" 'analicemos',\n",
" 'analista',\n",
" 'analistas',\n",
" 'analistaseo',\n",
" 'analitica',\n",
" 'analiza',\n",
" 'analizada',\n",
" 'analizado',\n",
" 'analizados',\n",
" 'analizamos',\n",
" 'analizando',\n",
" 'analizar',\n",
" 'analizarlos',\n",
" 'analizaron',\n",
" 'analizo',\n",
" 'analyst',\n",
" 'analyticator',\n",
" 'analytics',\n",
" 'analyticsdescubre',\n",
" 'analyticseste',\n",
" 'analyticsgracias',\n",
" 'analyticsla',\n",
" 'analyticslas',\n",
" 'analyticslo',\n",
" 'analyticspasadas',\n",
" 'analyticspero',\n",
" 'analyticspor',\n",
" 'analyticstracking',\n",
" 'ancho',\n",
" 'anchor',\n",
" 'anchoring',\n",
" 'anchors',\n",
" 'anchos',\n",
" 'ancla',\n",
" 'and',\n",
" 'andorra',\n",
" 'andrew',\n",
" 'android',\n",
" 'angular',\n",
" 'anillo',\n",
" 'anima',\n",
" 'animado',\n",
" 'animales',\n",
" 'animar',\n",
" 'animo',\n",
" 'anne',\n",
" 'ano',\n",
" 'anotaciones',\n",
" 'another',\n",
" 'ansias',\n",
" 'ansidad',\n",
" 'ansiedad',\n",
" 'answering',\n",
" 'antecedentes',\n",
" 'antecedentesos',\n",
" 'anterior',\n",
" 'anteriores',\n",
" 'anterioridad',\n",
" 'anteriormente',\n",
" 'anti',\n",
" 'antiguo',\n",
" 'antiguos',\n",
" 'antivirus',\n",
" 'antojo',\n",
" 'anual',\n",
" 'anunciado',\n",
" 'anunciante',\n",
" 'anunciar',\n",
" 'anuncio',\n",
" 'anuncios',\n",
" 'anyone',\n",
" 'aol',\n",
" 'aov',\n",
" 'apagar',\n",
" 'aparatejo',\n",
" 'aparece',\n",
" 'aparecen',\n",
" 'aparecer',\n",
" 'apareces',\n",
" 'aparecido',\n",
" 'apareciendo',\n",
" 'aparecieran',\n",
" 'aparezca',\n",
" 'aparezcan',\n",
" 'apartado',\n",
" 'aparte',\n",
" 'apelar',\n",
" 'apenas',\n",
" 'apertura',\n",
" 'api',\n",
" 'apis',\n",
" 'aplica',\n",
" 'aplicables',\n",
" 'aplicaciones',\n",
" 'aplicada',\n",
" 'aplicadas',\n",
" 'aplicado',\n",
" 'aplicados',\n",
" 'aplicamos',\n",
" 'aplicando',\n",
" 'aplicar',\n",
" 'aplicarlo',\n",
" 'aplicarlos',\n",
" 'aplicas',\n",
" 'aplication',\n",
" 'aplico',\n",
" 'aplique',\n",
" 'aporta',\n",
" 'aportado',\n",
" 'aportar',\n",
" 'aporten',\n",
" 'apostado',\n",
" 'apostar',\n",
" 'apoyar',\n",
" 'apoyas',\n",
" 'apoyo',\n",
" 'app',\n",
" 'appanalytics',\n",
" 'apple',\n",
" 'approved',\n",
" 'apps',\n",
" 'appuna',\n",
" 'aprendan',\n",
" 'aprende',\n",
" 'aprender',\n",
" 'aprendido',\n",
" 'aprendiendo',\n",
" 'aprendizaje',\n",
" 'aprendizajes',\n",
" 'aprendo',\n",
" 'aprobados',\n",
" 'aprovechado',\n",
" 'aprovechando',\n",
" 'aprovechar',\n",
" 'aprovecharlo',\n",
" 'aprovecho',\n",
" 'aprox',\n",
" 'aproxima',\n",
" 'aproximada',\n",
" 'aproximadamente',\n",
" 'aproximado',\n",
" 'apunte',\n",
" 'aquel',\n",
" 'aquella',\n",
" 'aquellas',\n",
" 'aquello',\n",
" 'aquellos',\n",
" 'arcas',\n",
" 'archidemostrada',\n",
" 'archivada',\n",
" 'archive',\n",
" 'archivo',\n",
" 'archivos',\n",
" 'are',\n",
" 'argumentos',\n",
" 'armar',\n",
" 'armoniosa',\n",
" 'arquitectura',\n",
" 'arquitecturaa',\n",
" 'arreglado',\n",
" 'arreglan',\n",
" 'arreglando',\n",
" 'arreglar',\n",
" 'arriba',\n",
" 'arruinar',\n",
" 'arte',\n",
" 'artesanales',\n",
" 'articles',\n",
" 'artificial',\n",
" 'artificialmente',\n",
" 'arturo',\n",
" 'arturomarimon',\n",
" 'as',\n",
" 'asegura',\n",
" 'asegurar',\n",
" 'asegurarnos',\n",
" 'asegurarse',\n",
" 'aseguras',\n",
" 'asesorados',\n",
" 'asignado',\n",
" 'asignando',\n",
" 'asignar',\n",
" 'asignarles',\n",
" 'asimilado',\n",
" 'asimilar',\n",
" 'asistencial',\n",
" 'asistentes',\n",
" 'asistidas',\n",
" 'asistiendo',\n",
" 'ask',\n",
" 'asociada',\n",
" 'asociados',\n",
" 'asociar',\n",
" 'asociarse',\n",
" 'asociativo',\n",
" 'asombrados',\n",
" 'asp',\n",
" 'aspecto',\n",
" 'aspectos',\n",
" 'assesor',\n",
" 'assets',\n",
" 'assisted',\n",
" 'asumidas',\n",
" 'asumir',\n",
" 'asunciones',\n",
" 'asunto',\n",
" 'asustarnos',\n",
" 'async',\n",
" 'atacante',\n",
" 'atacaremos',\n",
" 'atado',\n",
" 'ataque',\n",
" 'atenta',\n",
" 'aterrice',\n",
" 'aterriza',\n",
" 'aterrizado',\n",
" 'aterrizaje',\n",
" 'aterrizan',\n",
" 'aterrizaron',\n",
" 'aterrize',\n",
" 'atractivas',\n",
" 'atrae',\n",
" 'atraen',\n",
" 'atraer',\n",
" 'atraigan',\n",
" 'atribuir',\n",
" 'atributos',\n",
" 'atts',\n",
" 'aturar',\n",
" 'auc',\n",
" 'audiencia',\n",
" 'auditadas',\n",
" 'auditado',\n",
" 'auditar',\n",
" 'auditivo',\n",
" 'augure',\n",
" 'august',\n",
" 'aumenta',\n",
" 'aumentado',\n",
" 'aumentan',\n",
" 'aumentando',\n",
" 'aumentar',\n",
" 'aumentaremos',\n",
" 'aumentaron',\n",
" 'aumentas',\n",
" 'aumente',\n",
" 'aumento',\n",
" 'aun',\n",
" 'aunque',\n",
" 'author',\n",
" 'authority',\n",
" 'authorrank',\n",
" 'auto',\n",
" 'autor',\n",
" 'autores',\n",
" 'autoridad',\n",
" 'autoridadpara',\n",
" 'autoritario',\n",
" 'av',\n",
" 'avanza',\n",
" 'avanzadas',\n",
" 'avanzado',\n",
" 'avanzados',\n",
" 'avanzar',\n",
" 'aveces',\n",
" 'avecinaba',\n",
" 'aventurero',\n",
" 'average',\n",
" 'avg',\n",
" 'avinash',\n",
" 'avisa',\n",
" 'avisando',\n",
" 'avisarte',\n",
" 'avises',\n",
" 'aviso',\n",
" 'awareness',\n",
" 'awarness',\n",
" 'ayer',\n",
" 'ayuda',\n",
" 'ayudado',\n",
" 'ayudamos',\n",
" 'ayudan',\n",
" 'ayudar',\n",
" 'ayudaron',\n",
" 'ayuden',\n",
" 'azar',\n",
" 'azul',\n",
" 'azules',\n",
" 'b',\n",
" 'back',\n",
" 'backend',\n",
" 'backlinks',\n",
" 'backlinkwatches',\n",
" 'baeza',\n",
" 'baezaeste',\n",
" 'baidu',\n",
" 'baja',\n",
" 'bajada',\n",
" 'bajadas',\n",
" 'bajamos',\n",
" 'bajar',\n",
" 'baje',\n",
" 'bajo',\n",
" 'balas',\n",
" 'banca',\n",
" 'bandera',\n",
" 'banderas',\n",
" 'bandwidth',\n",
" 'baneadas',\n",
" 'banner',\n",
" 'bar',\n",
" 'barcelona',\n",
" 'barra',\n",
" 'barreras',\n",
" 'barrio',\n",
" 'barry',\n",
" 'basaba',\n",
" 'basadas',\n",
" 'basado',\n",
" 'basados',\n",
" 'basamos',\n",
" 'basan',\n",
" 'basarse',\n",
" 'base',\n",
" 'bases',\n",
" 'basis',\n",
" 'basta',\n",
" 'bastante',\n",
" 'basura',\n",
" 'batch',\n",
" 'baymard',\n",
" 'bbdd',\n",
" 'be',\n",
" 'beefeater',\n",
" 'before',\n",
" 'behavior',\n",
" 'behavioral',\n",
" 'benchmark',\n",
" 'benedict',\n",
" 'beneficiado',\n",
" 'beneficie',\n",
" 'beneficio',\n",
" 'beneficios',\n",
" 'beneficiosa',\n",
" 'beneficiosas',\n",
" 'best',\n",
" 'bestia',\n",
" 'beta',\n",
" 'better',\n",
" 'bettman',\n",
" 'bichotoblog',\n",
" 'bien',\n",
" 'big',\n",
" 'bigbadlondon',\n",
" 'bill',\n",
" 'billete',\n",
" 'billion',\n",
" 'billones',\n",
" 'binario',\n",
" 'bing',\n",
" 'binocular',\n",
" 'biology',\n",
" 'bits',\n",
" 'bj',\n",
" 'black',\n",
" 'blackhat',\n",
" 'blanca',\n",
" 'blanco',\n",
" 'blog',\n",
" 'blogactualmente',\n",
" 'blogs',\n",
" 'bloguismo',\n",
" 'bloqueado',\n",
" 'bloqueados',\n",
" 'bloquean',\n",
" 'bloqueando',\n",
" 'bloquedos',\n",
" 'bobbink',\n",
" 'body',\n",
" 'bola',\n",
" 'bonito',\n",
" 'booksquery',\n",
" 'boom',\n",
" 'boost',\n",
" 'booster',\n",
" 'borra',\n",
" 'borrado',\n",
" 'borran',\n",
" 'borrar',\n",
" 'borre',\n",
" 'bot',\n",
" 'botella',\n",
" 'botones',\n",
" 'bots',\n",
" 'bottom',\n",
" 'bounce',\n",
" 'brainsins',\n",
" 'brand',\n",
" 'branding',\n",
" 'branzai',\n",
" 'breadcrumb',\n",
" 'break',\n",
" 'breve',\n",
" 'brexit',\n",
" 'brillantes',\n",
" 'brin',\n",
" 'brindan',\n",
" 'british',\n",
" 'brock',\n",
" 'broken',\n",
" 'broma',\n",
" 'browser',\n",
" 'brqhcbfwsg',\n",
" 'brutal',\n",
" 'brutos',\n",
" 'budget',\n",
" 'buen',\n",
" 'buena',\n",
" 'buenas',\n",
" 'bueno',\n",
" 'buenodos',\n",
" 'buenos',\n",
" 'buffer',\n",
" 'bug',\n",
" 'builder',\n",
" 'building',\n",
" 'builing',\n",
" 'bulo',\n",
" 'bumm',\n",
" 'busca',\n",
" 'buscaba',\n",
" 'buscaban',\n",
" 'buscado',\n",
" 'buscador',\n",
" 'buscadoren',\n",
" 'buscadores',\n",
" 'buscadoreslos',\n",
" 'buscadorespero',\n",
" 'buscadorporque',\n",
" 'buscamos',\n",
" 'buscan',\n",
" 'buscando',\n",
" 'buscar',\n",
" 'buscarlo',\n",
" 'buscaron',\n",
" 'buscas',\n",
" 'busquen',\n",
" 'busques',\n",
" 'by',\n",
" 'byte',\n",
" 'c',\n",
" 'cabeceras',\n",
" 'cabecerasse',\n",
" 'cabeza',\n",
" 'cabezas',\n",
" 'cabo',\n",
" 'cache',\n",
" 'cacheada',\n",
" 'cacheo',\n",
" 'cacioppo',\n",
" 'cada',\n",
" 'cadenas',\n",
" 'caduco',\n",
" 'cae',\n",
" 'caer',\n",
" 'caffeine',\n",
" 'caja',\n",
" 'calcula',\n",
" 'calculadora',\n",
" 'calculamos',\n",
" 'calcular',\n",
" 'calcularlo',\n",
" 'caldeado',\n",
" 'calendario',\n",
" 'calidad',\n",
" 'calidadal',\n",
" 'calientes',\n",
" 'california',\n",
" 'call',\n",
" 'calma',\n",
" 'calor',\n",
" 'cambia',\n",
" 'cambiado',\n",
" 'cambian',\n",
" 'cambiando',\n",
" 'cambiar',\n",
" 'cambiarle',\n",
" 'cambiarlo',\n",
" 'cambie',\n",
" 'cambies',\n",
" 'cambio',\n",
" 'cambiopagina',\n",
" 'cambios',\n",
" 'camello',\n",
" 'camino',\n",
" 'caminos',\n",
" 'camisetas',\n",
" 'campo',\n",
" 'campos',\n",
" 'camposresponsable',\n",
" 'can',\n",
" 'canal',\n",
" 'canales',\n",
" 'canary',\n",
" 'canibalices',\n",
" 'canonical',\n",
" 'canonicalitis',\n",
" 'cansa',\n",
" 'cansado',\n",
" 'cantando',\n",
" 'cantidad',\n",
" 'capaces',\n",
" 'capacidad',\n",
" 'capadas',\n",
" 'capando',\n",
" 'capas',\n",
" 'capaz',\n",
" 'captados',\n",
" 'captar',\n",
" 'captcha',\n",
" 'captchas',\n",
" 'captchaslos',\n",
" 'captology',\n",
" 'capturar',\n",
" 'capturas',\n",
" 'cara',\n",
" 'caracteres',\n",
" 'caramelo',\n",
" 'caras',\n",
" 'card',\n",
" 'cardinal',\n",
" 'cards',\n",
" 'carencia',\n",
" 'carga',\n",
" 'cargando',\n",
" 'cargar',\n",
" 'cargarlos',\n",
" 'cargarse',\n",
" 'cargarte',\n",
" 'carguemos',\n",
" 'carguen',\n",
" 'carlos',\n",
" 'carlosredondo',\n",
" 'caro',\n",
" 'carrito',\n",
" 'carritos',\n",
" 'carrusel',\n",
" 'casa',\n",
" 'casas',\n",
" 'case',\n",
" 'casi',\n",
" 'casilla',\n",
" 'caso',\n",
" 'casos',\n",
" 'casta',\n",
" 'castellano',\n",
" 'casualidad',\n",
" 'casualmente',\n",
" 'cat',\n",
" 'catal',\n",
" 'catalan',\n",
" 'catalunya',\n",
" 'categorizan',\n",
" 'categorizar',\n",
" 'categorizarlo',\n",
" 'category',\n",
" 'categorydelegating',\n",
" 'causa',\n",
" 'causado',\n",
" 'causalidad',\n",
" 'causan',\n",
" 'cause',\n",
" 'caused',\n",
" 'cc',\n",
" 'cd',\n",
" 'cdn',\n",
" 'cedido',\n",
" 'cegados',\n",
" 'cegaron',\n",
" 'cenando',\n",
" 'centenares',\n",
" 'centrada',\n",
" 'centrado',\n",
" 'centrados',\n",
" 'central',\n",
" 'centralizada',\n",
" 'centramos',\n",
" 'centran',\n",
" 'centrando',\n",
" 'centrar',\n",
" 'centrarnos',\n",
" 'centrarse',\n",
" 'centras',\n",
" 'centre',\n",
" 'centremos',\n",
" 'centric',\n",
" 'centro',\n",
" 'ceo',\n",
" 'cercana',\n",
" 'cercanas',\n",
" 'cercano',\n",
" 'cerebro',\n",
" 'cerebros',\n",
" 'cereto',\n",
" 'cero',\n",
" 'cerrado',\n",
" 'cerrados',\n",
" 'cerrar',\n",
" 'certeza',\n",
" 'certificadas',\n",
" 'certificados',\n",
" 'chacras',\n",
" 'chaiken',\n",
" 'change',\n",
" 'channel',\n",
" 'chapter',\n",
" 'charla',\n",
" 'charlas',\n",
" 'chat',\n",
" 'chats',\n",
" 'checkbox',\n",
" 'checkeo',\n",
" 'checker',\n",
" 'checklist',\n",
" 'checklisty',\n",
" 'checkout',\n",
" 'checkoutestas',\n",
" 'chica',\n",
" 'chico',\n",
" 'chicos',\n",
" 'china',\n",
" 'choose',\n",
" 'chorrada',\n",
" 'christian',\n",
" 'chrome',\n",
" 'chulas',\n",
" 'chuleta',\n",
" 'chulos',\n",
" 'chute',\n",
" 'cialdini',\n",
" 'ciclo',\n",
" 'cid',\n",
" 'ciegas',\n",
" 'ciegos',\n",
" 'cientos',\n",
" 'cierre',\n",
" 'cierta',\n",
" 'ciertas',\n",
" 'cierto',\n",
" 'ciertos',\n",
" 'cifrado',\n",
" 'cifrados',\n",
" 'cinco',\n",
" 'cingulado',\n",
" 'circulito',\n",
" 'circunstancias',\n",
" 'cirujano',\n",
" 'cita',\n",
" 'citadas',\n",
" 'citado',\n",
" 'citarlo',\n",
" 'cito',\n",
" 'ciudad',\n",
" 'ciudadanos',\n",
" 'claim',\n",
" 'clara',\n",
" 'claraincluye',\n",
" 'claramente',\n",
" 'claras',\n",
" 'claro',\n",
" 'claros',\n",
" 'clases',\n",
" 'clasificado',\n",
" 'clasificados',\n",
" 'clasificar',\n",
" 'classic',\n",
" 'classification',\n",
" 'classificationmining',\n",
" 'clave',\n",
" 'claveestos',\n",
" 'claves',\n",
" 'cleaner',\n",
" 'clic',\n",
" 'clica',\n",
" 'clicamos',\n",
" 'clicar',\n",
" 'clicarse',\n",
" 'click',\n",
" 'clicks',\n",
" 'clickworker',\n",
" 'clics',\n",
" 'client',\n",
" 'cliente',\n",
" 'clientes',\n",
" 'clienteun',\n",
" 'clinic',\n",
" 'clinicseo',\n",
" 'cliquen',\n",
" 'cloaking',\n",
" 'clusters',\n",
" 'cms',\n",
" 'cnn',\n",
" 'co',\n",
" 'coach',\n",
" 'cobren',\n",
" 'cocacola',\n",
" 'code',\n",
" 'codificado',\n",
" 'coge',\n",
" 'cogen',\n",
" 'coger',\n",
" 'cogerlos',\n",
" 'cogido',\n",
" 'cogiendo',\n",
" 'cognitiva',\n",
" 'cognitivala',\n",
" 'cognitivamente',\n",
" 'cognitivas',\n",
" 'cognitivos',\n",
" 'coherencia',\n",
" 'coherente',\n",
" 'cohort',\n",
" 'coincida',\n",
" 'coincidence',\n",
" 'coincidencias',\n",
" 'coincidiendo',\n",
" 'coincidir',\n",
" 'cojones',\n",
" 'cola',\n",
" 'colabora',\n",
" 'colaborativa',\n",
" 'colarse',\n",
" ...]"
]
},
"execution_count": 4,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"tfidf = TfidfVectorizer(stop_words = stopwords.words('spanish'), \n",
" max_df = 0.6, min_df = 1, token_pattern=r'(?u)\\b[A-Za-z]+\\b')\n",
"csr_mat = tfidf.fit_transform(articles)\n",
"words = tfidf.get_feature_names()\n",
"words"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Creating a <a href=\"https://mlexplained.com/2017/12/28/a-practical-introduction-to-nmf-nonnegative-matrix-factorization/\">NMF model (nonnegative matrix factorization)</a>. Here we are <a href=\"https://blog.exploratory.io/demystifying-text-analytics-part-4-dimensionality-reduction-and-clustering-in-r-cbc8c90e0b14\">reducing the dimension</a> of the sparse matrix created before."
]
},
{
"cell_type": "code",
"execution_count": 5,
"metadata": {},
"outputs": [],
"source": [
"model = NMF(n_components=10) #test the number of components for your site(similar to the number of 'topics')\n",
"model.fit(csr_mat)\n",
"nmf_features = model.transform(csr_mat)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Normalizing the NMF features"
]
},
{
"cell_type": "code",
"execution_count": 6,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"<div>\n",
"<style scoped>\n",
" .dataframe tbody tr th:only-of-type {\n",
" vertical-align: middle;\n",
" }\n",
"\n",
" .dataframe tbody tr th {\n",
" vertical-align: top;\n",
" }\n",
"\n",
" .dataframe thead th {\n",
" text-align: right;\n",
" }\n",
"</style>\n",
"<table border=\"1\" class=\"dataframe\">\n",
" <thead>\n",
" <tr style=\"text-align: right;\">\n",
" <th></th>\n",
" <th>0</th>\n",
" <th>1</th>\n",
" <th>2</th>\n",
" <th>3</th>\n",
" <th>4</th>\n",
" <th>5</th>\n",
" <th>6</th>\n",
" <th>7</th>\n",
" <th>8</th>\n",
" <th>9</th>\n",
" </tr>\n",
" </thead>\n",
" <tbody>\n",
" <tr>\n",
" <td>SEO para Progressive Web APPs (PWA) y JavaScript</td>\n",
" <td>0.000000</td>\n",
" <td>0.040243</td>\n",
" <td>0.0</td>\n",
" <td>0.052272</td>\n",
" <td>0.379103</td>\n",
" <td>0.000000</td>\n",
" <td>0.072834</td>\n",
" <td>0.138624</td>\n",
" <td>0.045398</td>\n",
" <td>0.908486</td>\n",
" </tr>\n",
" <tr>\n",
" <td>Cómo mejorar el SEO incrementando la frecuencia de rastreo</td>\n",
" <td>0.056258</td>\n",
" <td>0.000000</td>\n",
" <td>0.0</td>\n",
" <td>0.000000</td>\n",
" <td>0.349612</td>\n",
" <td>0.000000</td>\n",
" <td>0.018862</td>\n",
" <td>0.030825</td>\n",
" <td>0.005401</td>\n",
" <td>0.934490</td>\n",
" </tr>\n",
" <tr>\n",
" <td>CTR y SEO: Manipulación del CTR para influir en los rankings</td>\n",
" <td>0.946162</td>\n",
" <td>0.000000</td>\n",
" <td>0.0</td>\n",
" <td>0.000000</td>\n",
" <td>0.066596</td>\n",
" <td>0.000000</td>\n",
" <td>0.000000</td>\n",
" <td>0.316768</td>\n",
" <td>0.000000</td>\n",
" <td>0.000000</td>\n",
" </tr>\n",
" <tr>\n",
" <td>Las mejores herramientas de visualización de datos gratuitas</td>\n",
" <td>0.000000</td>\n",
" <td>0.973158</td>\n",
" <td>0.0</td>\n",
" <td>0.000000</td>\n",
" <td>0.000000</td>\n",
" <td>0.000000</td>\n",
" <td>0.000000</td>\n",
" <td>0.230139</td>\n",
" <td>0.000000</td>\n",
" <td>0.000000</td>\n",
" </tr>\n",
" <tr>\n",
" <td>Cómo hacer un Heatmap de las visitas de tu web por día y hora</td>\n",
" <td>0.352149</td>\n",
" <td>0.654728</td>\n",
" <td>0.0</td>\n",
" <td>0.136502</td>\n",
" <td>0.079182</td>\n",
" <td>0.119675</td>\n",
" <td>0.000000</td>\n",
" <td>0.000000</td>\n",
" <td>0.627579</td>\n",
" <td>0.119344</td>\n",
" </tr>\n",
" </tbody>\n",
"</table>\n",
"</div>"
],
"text/plain": [
" 0 1 2 \\\n",
"SEO para Progressive Web APPs (PWA) y JavaScript 0.000000 0.040243 0.0 \n",
"Cómo mejorar el SEO incrementando la frecuencia... 0.056258 0.000000 0.0 \n",
"CTR y SEO: Manipulación del CTR para influir en... 0.946162 0.000000 0.0 \n",
"Las mejores herramientas de visualización de da... 0.000000 0.973158 0.0 \n",
"Cómo hacer un Heatmap de las visitas de tu web ... 0.352149 0.654728 0.0 \n",
"\n",
" 3 4 \\\n",
"SEO para Progressive Web APPs (PWA) y JavaScript 0.052272 0.379103 \n",
"Cómo mejorar el SEO incrementando la frecuencia... 0.000000 0.349612 \n",
"CTR y SEO: Manipulación del CTR para influir en... 0.000000 0.066596 \n",
"Las mejores herramientas de visualización de da... 0.000000 0.000000 \n",
"Cómo hacer un Heatmap de las visitas de tu web ... 0.136502 0.079182 \n",
"\n",
" 5 6 \\\n",
"SEO para Progressive Web APPs (PWA) y JavaScript 0.000000 0.072834 \n",
"Cómo mejorar el SEO incrementando la frecuencia... 0.000000 0.018862 \n",
"CTR y SEO: Manipulación del CTR para influir en... 0.000000 0.000000 \n",
"Las mejores herramientas de visualización de da... 0.000000 0.000000 \n",
"Cómo hacer un Heatmap de las visitas de tu web ... 0.119675 0.000000 \n",
"\n",
" 7 8 \\\n",
"SEO para Progressive Web APPs (PWA) y JavaScript 0.138624 0.045398 \n",
"Cómo mejorar el SEO incrementando la frecuencia... 0.030825 0.005401 \n",
"CTR y SEO: Manipulación del CTR para influir en... 0.316768 0.000000 \n",
"Las mejores herramientas de visualización de da... 0.230139 0.000000 \n",
"Cómo hacer un Heatmap de las visitas de tu web ... 0.000000 0.627579 \n",
"\n",
" 9 \n",
"SEO para Progressive Web APPs (PWA) y JavaScript 0.908486 \n",
"Cómo mejorar el SEO incrementando la frecuencia... 0.934490 \n",
"CTR y SEO: Manipulación del CTR para influir en... 0.000000 \n",
"Las mejores herramientas de visualización de da... 0.000000 \n",
"Cómo hacer un Heatmap de las visitas de tu web ... 0.119344 "
]
},
"metadata": {},
"output_type": "display_data"
}
],
"source": [
"norm_features = normalize(nmf_features)\n",
"\n",
"df = pd.DataFrame(norm_features,index=titles)\n",
"display(df.head())"
]
},
{
"cell_type": "code",
"execution_count": 7,
"metadata": {},
"outputs": [],
"source": [
"article = df.loc['Qué es un buscador semántico']"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Displaying the 10 articles with highest cosine similarity"
]
},
{
"cell_type": "code",
"execution_count": 8,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"Qué es un buscador semántico 1.000000\n",
"Qué son las entidades y su implicación en el SEO 0.999439\n",
"SEO Semántico para la Web Semántica 0.998715\n",
"Keyword Research con Google Refine [Vídeo Tutorial] 0.734699\n",
"SEO, rankings y conversión 0.430864\n",
"dtype: float64"
]
},
"execution_count": 8,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"similarities = df.dot(article)\n",
"similarities.nlargest()"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": []
}
],
"metadata": {
"kernelspec": {
"display_name": "Python 3",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.7.3"
}
},
"nbformat": 4,
"nbformat_minor": 2
}
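
The notebook's last two steps (L2-normalizing the NMF features, then ranking articles by dot product) can be illustrated on a toy example. This is only a minimal sketch under stated assumptions: the `features` matrix, the English `titles`, and the `recommend` helper are hypothetical stand-ins that do not appear in the notebook. It shows that the dot product of normalized rows matches the textbook cosine formula, and how the selected article can be dropped from its own recommendations.

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import normalize

# Hypothetical topic weights for four articles (a stand-in for nmf_features).
features = np.array([
    [0.9, 0.1, 0.0],
    [0.8, 0.2, 0.1],
    [0.0, 0.1, 0.9],
    [0.1, 0.9, 0.2],
])
# Hypothetical titles, used only for this illustration.
titles = ["semantic search", "entities and SEO", "heatmaps", "crawl budget"]

# L2-normalize each row so that a dot product equals cosine similarity.
norm = pd.DataFrame(normalize(features), index=titles)


def recommend(title, k=2):
    """Return the k articles most similar to `title`, excluding itself."""
    sims = norm.dot(norm.loc[title])
    return sims.drop(title).nlargest(k)


# Cross-check against the cosine formula computed on the raw vectors.
a, b = features[0], features[1]
cosine = a @ b / (np.linalg.norm(a) * np.linalg.norm(b))

top = recommend("semantic search")
print(top)
```

Dropping the queried title before calling `nlargest` avoids recommending a link from an article to itself, which would otherwise always rank first with a similarity of 1.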