@sanchezcarlosjr
Last active June 22, 2023 17:25
MeIA - Sentiment analysis in the TASS corpus (Spanish) using traditional machine learning.ipynb
{
"cells": [
{
"cell_type": "markdown",
"metadata": {
"id": "view-in-github",
"colab_type": "text"
},
"source": [
"<a href=\"https://colab.research.google.com/gist/sanchezcarlosjr/91c8b8588e339381ae2b75fa868ee7e8/clase1-clasificacion.ipynb\" target=\"_parent\"><img src=\"https://colab.research.google.com/assets/colab-badge.svg\" alt=\"Open In Colab\"/></a>"
]
},
{
"cell_type": "markdown",
"source": [
"# MeIA - Sentiment analysis on the TASS corpus (Spanish) using traditional machine learning\n",
"\n",
"Carlos Eduardo Sánchez Torres\n",
"\n",
"Course prepared by:\n",
"Dra. Helena M. Gomez Adorno, UNAM\n",
"\n",
"Miguel Angel Alvarez, CIMAT\n",
"\n",
"Victor Giovanni Morales, BUAP\n"
],
"metadata": {
"id": "aI9ugCiwVfsZ"
}
},
{
"cell_type": "markdown",
"metadata": {
"id": "pV_yAmsSOJJ8"
},
"source": [
"[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://drive.google.com/file/d/18bWyrg9-EJtmgRTdJblML1W2SGZrykAE/view?usp=sharing)\n",
"\n",
"# Feature extraction"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"colab": {
"base_uri": "https://localhost:8080/"
},
"id": "e_JCC22WOJKB",
"outputId": "e6b7d630-8119-415e-d898-a78bdb382bfe"
},
"outputs": [
{
"output_type": "stream",
"name": "stdout",
"text": [
"The matrix has 4 rows and 9 columns\n",
"\n",
"The corpus vocabulary is:\n",
"['azul' 'brilla' 'brillante' 'cielo' 'el' 'en' 'es' 'grande' 'sol']\n",
"\n",
"Term-document matrix\n",
"[[1 0 0 1 1 0 1 0 0]\n",
" [0 0 1 0 1 0 1 1 1]\n",
" [0 0 1 1 2 1 1 0 1]\n",
" [1 1 0 1 2 1 0 0 1]]\n"
]
}
],
"source": [
"from sklearn.feature_extraction.text import CountVectorizer\n",
"corpus = ['El cielo es azul','El sol es brillante y grande', 'El sol en el cielo es brillante', 'El sol brilla en el cielo azul']\n",
"\n",
"# Instantiate CountVectorizer() with default parameters. By default this object extracts every word\n",
"# in the corpus (the bag of words) and counts how often each one appears\n",
"count_vect = CountVectorizer()\n",
"\n",
"# Fit the vectorizer to the corpus to build the term-document matrix\n",
"matrizBOW = count_vect.fit_transform(corpus)\n",
"\n",
"# print the number of rows and columns of the resulting matrix\n",
"print(\"The matrix has\", matrizBOW.shape[0], \"rows and\", matrizBOW.shape[1], \"columns\")\n",
"print(\"\")\n",
"\n",
"# print the list of extracted words\n",
"print(\"The corpus vocabulary is:\")\n",
"print(count_vect.get_feature_names_out())\n",
"print(\"\")\n",
"\n",
"# display the resulting matrix\n",
"print(\"Term-document matrix\")\n",
"print(matrizBOW.toarray())"
]
},
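{
"cell_type": "markdown",
"metadata": {},
"source": [
"As an illustrative sketch (an addition, not part of the original exercise), the fitted vectorizer can also transform unseen documents: any word outside the learned vocabulary is simply ignored."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Sketch: transform a new document with the already-fitted vectorizer.\n",
"# 'nuevo' is not in the vocabulary, so it contributes nothing to the vector.\n",
"nuevo_doc = ['El sol es nuevo']\n",
"print(count_vect.transform(nuevo_doc).toarray())"
]
},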
{
"cell_type": "markdown",
"metadata": {
"id": "4NfDk3bIOJKE"
},
"source": [
"# Word n-grams"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"colab": {
"base_uri": "https://localhost:8080/"
},
"id": "AHya6Q_JOJKE",
"outputId": "d9d26cb7-a8f0-4ede-967f-fc8ea87ed87f"
},
"outputs": [
{
"output_type": "stream",
"name": "stdout",
"text": [
"The matrix has 4 rows and 23 columns\n",
"\n",
"The corpus vocabulary (unigrams and bigrams) is:\n",
"{'el': 8, 'cielo': 5, 'es': 13, 'azul': 0, 'el cielo': 9, 'cielo es': 7, 'es azul': 14, 'sol': 17, 'brillante': 3, 'y': 21, 'grande': 16, 'el sol': 10, 'sol es': 20, 'es brillante': 15, 'brillante y': 4, 'y grande': 22, 'en': 11, 'sol en': 19, 'en el': 12, 'brilla': 1, 'sol brilla': 18, 'brilla en': 2, 'cielo azul': 6}\n",
"\n",
"Term-document matrix\n",
"[[1 0 0 0 0 1 0 1 1 1 0 0 0 1 1 0 0 0 0 0 0 0 0]\n",
" [0 0 0 1 1 0 0 0 1 0 1 0 0 1 0 1 1 1 0 0 1 1 1]\n",
" [0 0 0 1 0 1 0 1 2 1 1 1 1 1 0 1 0 1 0 1 0 0 0]\n",
" [1 1 1 0 0 1 1 0 2 1 1 1 1 0 0 0 0 1 1 0 0 0 0]]\n"
]
}
],
"source": [
"# Instantiate CountVectorizer() with an n-gram range of 1 to 2. This means the object will extract\n",
"# every n-gram of length 1 (unigrams, i.e. words) and of length 2 (bigrams).\n",
"bigram_vectorizer = CountVectorizer(ngram_range=(1, 2), token_pattern=r'\\b\\w+\\b')\n",
"\n",
"# Fit the vectorizer to the corpus to build the term-document matrix\n",
"matriz_bigramas = bigram_vectorizer.fit_transform(corpus)\n",
"\n",
"# print the number of rows and columns of the resulting matrix\n",
"print(\"The matrix has\", matriz_bigramas.shape[0], \"rows and\", matriz_bigramas.shape[1], \"columns\")\n",
"print(\"\")\n",
"\n",
"# print the extracted vocabulary of unigrams and bigrams\n",
"print(\"The corpus vocabulary (unigrams and bigrams) is:\")\n",
"print(bigram_vectorizer.vocabulary_)\n",
"print(\"\")\n",
"\n",
"# display the resulting matrix\n",
"print(\"Term-document matrix\")\n",
"print(matriz_bigramas.toarray())"
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "YAm7t5ZfOJKF"
},
"source": [
"# Tf-idf"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"colab": {
"base_uri": "https://localhost:8080/"
},
"id": "PwkSj5PCOJKF",
"outputId": "78d923c2-8c19-4801-956a-be0c598a0208"
},
"outputs": [
{
"output_type": "stream",
"name": "stdout",
"text": [
"The matrix has 4 rows and 9 columns\n",
"\n",
"The corpus vocabulary is:\n",
"{'el': 4, 'cielo': 3, 'es': 6, 'azul': 0, 'sol': 8, 'brillante': 2, 'grande': 7, 'en': 5, 'brilla': 1}\n",
"\n",
"Term-document matrix\n",
"[[0.60313701 0. 0. 0.48829139 0.39921021 0.\n",
" 0.48829139 0. 0. ]\n",
" [0. 0. 0.47903796 0. 0.31707032 0.\n",
" 0.38782252 0.60759891 0.38782252]\n",
" [0. 0. 0.4181692 0.33854401 0.55356382 0.4181692\n",
" 0.33854401 0. 0.33854401]\n",
" [0.38714286 0.49104163 0. 0.31342551 0.51249178 0.38714286\n",
" 0. 0. 0.31342551]]\n"
]
}
],
"source": [
"from sklearn.feature_extraction.text import TfidfVectorizer\n",
"# Instantiate TfidfVectorizer() with default parameters. Like CountVectorizer it extracts the words\n",
"# of the corpus, but instead of raw counts it computes the tf-idf weight of each term\n",
"tfidf_vect = TfidfVectorizer()\n",
"\n",
"# Fit the vectorizer to the corpus to build the term-document matrix\n",
"matriz_tfidf = tfidf_vect.fit_transform(corpus)\n",
"\n",
"# print the number of rows and columns of the resulting matrix\n",
"print(\"The matrix has\", matriz_tfidf.shape[0], \"rows and\", matriz_tfidf.shape[1], \"columns\")\n",
"print(\"\")\n",
"\n",
"# print the extracted vocabulary (unigrams here, not bigrams)\n",
"print(\"The corpus vocabulary is:\")\n",
"print(tfidf_vect.vocabulary_)\n",
"print(\"\")\n",
"\n",
"# display the resulting matrix\n",
"print(\"Term-document matrix\")\n",
"print(matriz_tfidf.toarray())"
]
},
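{
"cell_type": "markdown",
"metadata": {},
"source": [
"Where do these weights come from? As an illustrative sketch (an addition, assuming scikit-learn's defaults `smooth_idf=True` and `norm='l2'`), the next cell recomputes the tf-idf weight of *azul* in the first document by hand: the smoothed idf is ln((1+n)/(1+df)) + 1, and each row is then L2-normalized."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"import numpy as np\n",
"\n",
"# smoothed idf used by TfidfVectorizer's defaults: ln((1 + n_docs) / (1 + df)) + 1\n",
"n_docs = 4\n",
"idf = lambda df: np.log((1 + n_docs) / (1 + df)) + 1\n",
"\n",
"# raw tf-idf weights of 'El cielo es azul' (each term occurs once, so tf = 1);\n",
"# document frequencies: azul appears in 2 documents, cielo in 3, el in 4, es in 3\n",
"row = np.array([idf(2), idf(3), idf(4), idf(3)])  # azul, cielo, el, es\n",
"print(row[0] / np.linalg.norm(row))  # close to 0.60313701, the first entry of matriz_tfidf"
]
},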
{
"cell_type": "markdown",
"metadata": {
"id": "In-QseH5OJKG"
},
"source": [
"# Cosine similarity\n",
"\n",
"For each document (each row of the matrix), we can compute the cosine similarity between the first document (*\"El cielo es azul\"*) and every other document in the set:"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"colab": {
"base_uri": "https://localhost:8080/"
},
"id": "71BEjldSOJKG",
"outputId": "75d58ea5-2d18-4ba4-fd2b-0d19ff364cb6"
},
"outputs": [
{
"output_type": "stream",
"name": "stdout",
"text": [
"[[1. 0.4472136 0.66666667 0.66666667]]\n",
"[[1. 0.22792115 0.58554004 0.48795004]]\n",
"[[1. 0.3159481 0.55160457 0.59113512]]\n"
]
}
],
"source": [
"from sklearn.metrics.pairwise import cosine_similarity\n",
"sim_cosenoTF=cosine_similarity(matrizBOW[0:1], matrizBOW)\n",
"sim_cosenoTF_bigramas=cosine_similarity(matriz_bigramas[0:1], matriz_bigramas)\n",
"sim_cosenoTFIDF=cosine_similarity(matriz_tfidf[0:1], matriz_tfidf)\n",
"print(sim_cosenoTF)\n",
"print(sim_cosenoTF_bigramas)\n",
"print(sim_cosenoTFIDF)"
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "xkYBlSvfOJKH"
},
"source": [
"**Note:** matrizBOW[0:1] is the SciPy sparse-matrix operation that selects the first row, and the resulting matrix holds the cosine similarity between the first document and every document in the corpus. Note that the first value is 1.0 because it is the cosine similarity of the first document with itself. Also note that, because the fourth document (\"El sol brilla en el cielo azul\") shares several words with the first, it obtains a higher score when the TF-IDF matrix is used."
]
},
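{
"cell_type": "markdown",
"metadata": {},
"source": [
"As a small addition: `cosine_similarity` also accepts a single matrix, in which case it returns the full document-by-document similarity matrix. The sketch below shows all pairwise tf-idf similarities at once."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Full 4x4 pairwise cosine-similarity matrix (diagonal = 1.0: each document with itself)\n",
"print(cosine_similarity(matriz_tfidf))"
]
},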
{
"cell_type": "markdown",
"metadata": {
"id": "60hK0-DcOJKH"
},
"source": [
"# Sentiment analysis on the TASS 2020 corpus"
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "L1eUSUO9OJKI"
},
"source": [
"## Data preparation"
]
},
{
"cell_type": "code",
"source": [
"# if running on Colab\n",
"from google.colab import drive\n",
"\n",
"drive.mount('/content/gdrive')\n",
"\n",
"# install nltk if it is missing\n",
"!pip install nltk"
],
"metadata": {
"colab": {
"base_uri": "https://localhost:8080/"
},
"id": "InWgw5vPOjBM",
"outputId": "718e5eea-fd9c-4bf9-d242-5b9a9944e358"
},
"execution_count": null,
"outputs": [
{
"output_type": "stream",
"name": "stdout",
"text": [
"Mounted at /content/gdrive\n",
"Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/\n",
"Requirement already satisfied: nltk in /usr/local/lib/python3.10/dist-packages (3.8.1)\n",
"Requirement already satisfied: click in /usr/local/lib/python3.10/dist-packages (from nltk) (8.1.3)\n",
"Requirement already satisfied: joblib in /usr/local/lib/python3.10/dist-packages (from nltk) (1.2.0)\n",
"Requirement already satisfied: regex>=2021.8.3 in /usr/local/lib/python3.10/dist-packages (from nltk) (2022.10.31)\n",
"Requirement already satisfied: tqdm in /usr/local/lib/python3.10/dist-packages (from nltk) (4.65.0)\n"
]
}
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"colab": {
"base_uri": "https://localhost:8080/"
},
"id": "kTrmD0EaOJKI",
"outputId": "c47037f4-35df-46e8-98a8-31afd9dd1ce1"
},
"outputs": [
{
"output_type": "stream",
"name": "stderr",
"text": [
"[nltk_data] Downloading package punkt to /root/nltk_data...\n",
"[nltk_data] Package punkt is already up-to-date!\n"
]
}
],
"source": [
"import pandas as pd\n",
"import numpy as np\n",
"import nltk\n",
"nltk.download('punkt')\n",
"\n",
"# Read the files containing the training and validation data\n",
"data = pd.read_csv('/content/gdrive/MyDrive/Meia2023/Modulo2-ClasificacionTextos/corpusTASS-2020/train.tsv', sep='\\t')\n",
"data_dev = pd.read_csv('/content/gdrive/MyDrive/Meia2023/Modulo2-ClasificacionTextos/corpusTASS-2020/dev.tsv', sep='\\t')\n",
"\n",
"# Label-mapping dictionary\n",
"mapeo_etiquetas = {'N': 0, 'NEU': 1, 'P': 2}\n",
"\n",
"# Convert the \"etiqueta\" column to numeric labels (uncomment if needed)\n",
"#data['etiqueta_num'] = data['etiqueta'].map(mapeo_etiquetas)\n",
"#data_dev['etiqueta_num'] = data_dev['etiqueta'].map(mapeo_etiquetas)"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"id": "v4RDNbO_OJKI",
"outputId": "2173ecbb-d24e-4bf3-c84c-8d7a15176ca1",
"colab": {
"base_uri": "https://localhost:8080/",
"height": 411
}
},
"outputs": [
{
"output_type": "execute_result",
"data": {
"text/plain": [
" id texto \\\n",
"0 768512386269638656 @morbosaborealis jajajaja... eso es verdad... ... \n",
"1 768529956162924544 @Adriansoler espero y deseo que el interior te... \n",
"2 768557093955698688 comprendo que te molen mis tattoos, pero no te... \n",
"3 770616744192929792 Mi última partida jugada, con Sona support. La... \n",
"4 769959690092642304 Tranquilos que con el.dinero de Camacho seguro... \n",
"... ... ... \n",
"4797 817849572865273857 @ladelbosque29 acude al próximo llamado que ha... \n",
"4798 800007284491309060 @Dianybony jajajaja claro que no amor!! te amo... \n",
"4799 817236774816718848 Hoy le pedí a Dios una señal realmente obvia, ... \n",
"4800 816175658250420224 El reboot de Jumanji puede romper mi corazón x... \n",
"4801 816354923923111937 @Djrossana que tengan un lindo martes y que to... \n",
"\n",
" etiqueta pais \n",
"0 N es \n",
"1 NEU es \n",
"2 NEU es \n",
"3 P es \n",
"4 P es \n",
"... ... ... \n",
"4797 NEU mx \n",
"4798 P mx \n",
"4799 P mx \n",
"4800 N mx \n",
"4801 P mx \n",
"\n",
"[4802 rows x 4 columns]"
]
},
"metadata": {},
"execution_count": 9
}
],
"source": [
"# display the training dataset\n",
"data"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"id": "mKByFSxxOJKJ",
"outputId": "ce717d1b-d0fa-4761-f7a1-dccd8086727f",
"colab": {
"base_uri": "https://localhost:8080/",
"height": 433
}
},
"outputs": [
{
"output_type": "execute_result",
"data": {
"text/plain": [
" id texto \\\n",
"0 773238965709176832 @chefidiaz no seas muy dura \n",
"1 770702799470489601 @lantoli podemos usar el término.equipo pepino \n",
"2 770238084764041217 Como destrozaba el puto movil ahora mismo \n",
"3 770222346829520896 @YG__GF me ofrecería pero gerald es demasiado ... \n",
"4 770560227531948032 @omixam no creo que hayan diseñado una tipo pr... \n",
"... ... ... \n",
"2438 819221868574085121 @Natsflorees no todo es tan malo \n",
"2439 818686841381601280 @Richo_Amezquita a ver, ya, no seas así \n",
"2440 819012309880360960 ocupo el gym en serio #VideoMTV2016 Abraham Mateo \n",
"2441 819306396378275840 Empezar de nuevo con la dieta es tan difícil \n",
"2442 818801865148235776 Los de alado vienen comiendo pan enfrente de l... \n",
"\n",
" etiqueta pais etiqueta_num \n",
"0 N es 0 \n",
"1 NEU es 1 \n",
"2 N es 0 \n",
"3 NEU es 1 \n",
"4 N es 0 \n",
"... ... ... ... \n",
"2438 NEU mx 1 \n",
"2439 N mx 0 \n",
"2440 NEU mx 1 \n",
"2441 N mx 0 \n",
"2442 N mx 0 \n",
"\n",
"[2443 rows x 5 columns]"
]
},
"metadata": {},
"execution_count": 40
}
],
"source": [
"# display the validation dataset\n",
"data_dev"
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "zxU5EJKNOJKJ"
},
"source": [
"### Splitting the samples into training (train) and validation (dev) sets\n",
"If we do not have a separate validation dataset, we can split the training set into two subsets. We can also make multiple splits to train and validate the model on several partitions of the training set. Below is an example of how to perform the split."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"colab": {
"base_uri": "https://localhost:8080/"
},
"id": "Hx1p6AaUOJKK",
"outputId": "0cd524dc-c8d0-41f5-f5ca-e47731b6c359"
},
"outputs": [
{
"output_type": "stream",
"name": "stdout",
"text": [
"3601\n"
]
}
],
"source": [
"from sklearn.model_selection import train_test_split\n",
"\n",
"\n",
"X_train, X_test, y_train, y_test = train_test_split(data['texto'],\n",
" data['etiqueta'],\n",
" random_state=0)\n",
"print(len(X_train))"
]
},
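{
"cell_type": "markdown",
"metadata": {},
"source": [
"A hedged variant (an addition, with illustrative parameter values): with imbalanced classes it is usually safer to stratify the split so that each subset keeps the label proportions; `test_size` controls the fraction held out."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Stratified split: each subset keeps the N/NEU/P proportions (25% held out here)\n",
"X_tr, X_val, y_tr, y_val = train_test_split(data['texto'], data['etiqueta'],\n",
"                                            test_size=0.25, stratify=data['etiqueta'],\n",
"                                            random_state=0)\n",
"print(len(X_tr), len(X_val))"
]
},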
{
"cell_type": "markdown",
"metadata": {
"id": "RYR9LcHQOJKK"
},
"source": [
"For this example, we will use the predefined training and validation partitions."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"colab": {
"base_uri": "https://localhost:8080/"
},
"id": "vZ3iJPeUOJKK",
"outputId": "b0e0ab47-f3d0-470c-f8e7-a26ac3998c9c"
},
"outputs": [
{
"output_type": "stream",
"name": "stdout",
"text": [
"Training set size: 4802 4802\n",
"Validation set size: 2443 2443\n",
"0 @morbosaborealis jajajaja... eso es verdad... ...\n",
"1 @Adriansoler espero y deseo que el interior te...\n",
"2 comprendo que te molen mis tattoos, pero no te...\n",
"3 Mi última partida jugada, con Sona support. La...\n",
"4 Tranquilos que con el.dinero de Camacho seguro...\n",
" ... \n",
"4797 @ladelbosque29 acude al próximo llamado que ha...\n",
"4798 @Dianybony jajajaja claro que no amor!! te amo...\n",
"4799 Hoy le pedí a Dios una señal realmente obvia, ...\n",
"4800 El reboot de Jumanji puede romper mi corazón x...\n",
"4801 @Djrossana que tengan un lindo martes y que to...\n",
"Name: texto, Length: 4802, dtype: object\n",
"0 N\n",
"1 NEU\n",
"2 NEU\n",
"3 P\n",
"4 P\n",
" ... \n",
"4797 NEU\n",
"4798 P\n",
"4799 P\n",
"4800 N\n",
"4801 P\n",
"Name: etiqueta, Length: 4802, dtype: object\n"
]
}
],
"source": [
"X_train, y_train = data['texto'], data['etiqueta']\n",
"X_test, y_test = data_dev['texto'], data_dev['etiqueta']\n",
"\n",
"print('Training set size:', len(X_train), len(y_train))\n",
"print('Validation set size:', len(X_test), len(y_test))\n",
"print(X_train)\n",
"print(y_train)"
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "rfXH1aUiOJKL"
},
"source": [
"What are the 20 most frequent (unique) tokens in the text, and what are their frequencies?"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"colab": {
"base_uri": "https://localhost:8080/"
},
"id": "Xom3i9pYOJKL",
"outputId": "51d38eba-03d8-4e27-c4de-7fae39c517b9"
},
"outputs": [
{
"output_type": "stream",
"name": "stdout",
"text": [
"[('@', 3171), (',', 2642), ('que', 2458), ('de', 2456), ('y', 1845), ('a', 1578), ('.', 1512), ('la', 1454), ('no', 1441), ('!', 1352), ('me', 1238), ('el', 1198), ('en', 1163), ('es', 956), ('un', 772), ('lo', 706), ('mi', 658), ('por', 609), ('se', 600), ('con', 590)]\n"
]
}
],
"source": [
"from collections import Counter\n",
"def masFrecuentes():\n",
"    text = ' '.join(data['texto'])\n",
"    tokens = nltk.word_tokenize(text)\n",
"    return Counter(tokens).most_common(20)\n",
"\n",
"print(masFrecuentes())"
]
},
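{
"cell_type": "markdown",
"metadata": {},
"source": [
"The top tokens above are dominated by punctuation and stopwords. A quick variant (an illustrative addition) keeps only alphabetic tokens to surface more informative words."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Same count, but keep only alphabetic tokens (drops '@', ',', '!', ...)\n",
"tokens = nltk.word_tokenize(' '.join(data['texto']))\n",
"print(Counter(t.lower() for t in tokens if t.isalpha()).most_common(20))"
]
},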
{
"cell_type": "markdown",
"metadata": {
"id": "cGddn1jNOJKM"
},
"source": [
"What percentage of the documents in the training set are positive, negative, and neutral?"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"colab": {
"base_uri": "https://localhost:8080/"
},
"id": "-Qfq1nB8OJKM",
"outputId": "eeb4b1ee-57ba-4a22-83c1-b12256608c73"
},
"outputs": [
{
"output_type": "stream",
"name": "stdout",
"text": [
"Negative class: 39.25 %\n",
"Neutral class: 31.72 %\n",
"Positive class: 29.03 %\n"
]
}
],
"source": [
"data_neg = len(data[data['etiqueta'] == 'N'])\n",
"data_neu = len(data[data['etiqueta'] == 'NEU'])\n",
"data_pos = len(data[data['etiqueta'] == 'P'])\n",
"# multiply by 100 so the printed value really is a percentage\n",
"print('Negative class:', round(100 * data_neg / len(data), 2), '%')\n",
"print('Neutral class:', round(100 * data_neu / len(data), 2), '%')\n",
"print('Positive class:', round(100 * data_pos / len(data), 2), '%')"
]
},
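{
"cell_type": "markdown",
"metadata": {},
"source": [
"The same proportions can be obtained in one line with pandas (an equivalent alternative, added for reference):"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# value_counts(normalize=True) returns the fraction of each label directly\n",
"print(data['etiqueta'].value_counts(normalize=True))"
]
},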
{
"cell_type": "markdown",
"metadata": {
"id": "5lCDhjszOJKN"
},
"source": [
"What percentage of the documents in the validation (dev) set are positive, negative, and neutral?"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"colab": {
"base_uri": "https://localhost:8080/"
},
"id": "EwzAqRjVOJKN",
"outputId": "f5baf49d-4013-46ff-b407-7c5a49d81f79"
},
"outputs": [
{
"output_type": "stream",
"name": "stdout",
"text": [
"Negative class: 38.93 %\n",
"Neutral class: 32.46 %\n",
"Positive class: 28.61 %\n"
]
}
],
"source": [
"data_dev_neg = len(data_dev[data_dev['etiqueta'] == 'N'])\n",
"data_dev_neu = len(data_dev[data_dev['etiqueta'] == 'NEU'])\n",
"data_dev_pos = len(data_dev[data_dev['etiqueta'] == 'P'])\n",
"# multiply by 100 so the printed value really is a percentage\n",
"print('Negative class:', round(100 * data_dev_neg / len(data_dev), 2), '%')\n",
"print('Neutral class:', round(100 * data_dev_neu / len(data_dev), 2), '%')\n",
"print('Positive class:', round(100 * data_dev_pos / len(data_dev), 2), '%')"
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "PXxj8fRLOJKN"
},
"source": [
"## Text classification"
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "Nri0DtXROJKN"
},
"source": [
"Fit and transform the training data `X_train` using a `count_vectorizer` with default parameters.\n",
"\n",
"Then fit a multinomial Naive Bayes classification model. Compute accuracy, precision, recall, and f1-score on the transformed validation data."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"id": "tFK5v8ugOJKO"
},
"outputs": [],
"source": [
"# import the required libraries\n",
"from sklearn.feature_extraction.text import CountVectorizer\n",
"from sklearn.feature_extraction.text import TfidfVectorizer\n",
"from sklearn.naive_bayes import MultinomialNB\n",
"from sklearn.svm import SVC, LinearSVC  # support vector machines\n",
"from sklearn.linear_model import LogisticRegression\n",
"from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score, classification_report\n",
"from sklearn.metrics import confusion_matrix\n",
"import seaborn as sns\n",
"import matplotlib.pyplot as plt"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"colab": {
"base_uri": "https://localhost:8080/"
},
"id": "ctc10Jj-OJKO",
"outputId": "4759e70f-89d2-4b51-99a8-bb3e47078370"
},
"outputs": [
{
"output_type": "stream",
"name": "stdout",
"text": [
"2443\n",
"['N' 'N' 'N' ... 'NEU' 'N' 'N']\n"
]
}
],
"source": [
"# Vectorize the texts, train a multinomial Naive Bayes model and predict on the validation set\n",
"count_vectorizer = CountVectorizer()\n",
"X_train_vect = count_vectorizer.fit_transform(X_train)\n",
"X_test_vect = count_vectorizer.transform(X_test)\n",
"\n",
"clf = MultinomialNB()\n",
"clf.fit(X_train_vect, y_train)\n",
"predicciones = clf.predict(X_test_vect)\n",
"print(len(predicciones))\n",
"print(predicciones)"
]
}
],
"metadata": {},
"nbformat": 4,
"nbformat_minor": 0
}