Identificando discursos de ódio em redes sociais usando Transformers.ipynb
{
"nbformat": 4,
"nbformat_minor": 0,
"metadata": {
"colab": {
"name": "Identificando discursos de ódio em redes sociais usando Transformers.ipynb",
"provenance": [],
"collapsed_sections": [],
"toc_visible": true,
"authorship_tag": "ABX9TyML3b68QLFamu09MB3dU7W9",
"include_colab_link": true
},
"kernelspec": {
"name": "python3",
"display_name": "Python 3"
},
"language_info": {
"name": "python"
}
},
"cells": [
{
"cell_type": "markdown",
"metadata": {
"id": "view-in-github",
"colab_type": "text"
},
"source": [
"<a href=\"https://colab.research.google.com/gist/jayralencar/72a279db66f0ec8366fd75fcab81a200/identificando-discursos-de-dio-em-redes-sociais-usando-transformers.ipynb\" target=\"_parent\"><img src=\"https://colab.research.google.com/assets/colab-badge.svg\" alt=\"Open In Colab\"/></a>"
]
},
{
"cell_type": "markdown",
"source": [
"Do you know what hate speech is?\n",
"\n",
"It is a hard question to answer, because the term describes a concept that is still not well defined. The Politize website, however, defines it as the expression of “ideas that incite racial, social, or religious discrimination against certain groups, most often minorities”.\n",
"\n",
"We can adopt that definition. Now consider the following scenario: you are a data scientist hired by a company to build a model that indicates whether a job candidate tends to post hate speech on social networks. In that case, the definition of what counts as hate speech comes from the hiring company, and your model/system will be used for decision support, helping human resources professionals identify such behavior.\n",
"\n",
">**NOTE: treat this as a purely hypothetical case, with no connection to any real case at any organization.**\n",
"\n",
"Your goal is solely to analyze the candidates' texts and indicate whether they contain offensive content. Obviously you will not do this by hand or by eye: you need an algorithm, and an intelligent one."
],
"metadata": {
"id": "NbxyyiB2J9l7"
}
},
{
"cell_type": "markdown",
"source": [
"## Classifiers\n",
"\n",
"<p>Note that this is a classification problem, and specifically a binary one, since we must indicate whether or not a post contains hate speech (two classes). For that we can use a classifier, that is, an equation/model/agent that, based on the features of each post, tells us whether it is offensive or not. In other words, we need features.</p>\n",
"\n",
"Extracting features from text is not an easy task: text is unstructured data and, in the domain considered here (i.e., social networks), the quality of the texts is not guaranteed.\n",
"\n",
"There are several ways to extract features from text. I covered some of them in the short course I taught at the Semana da Informática (<a rel=\"noreferrer noopener\" href=\"https://www.even3.com.br/seinfo2021crato/\" target=\"_blank\">SEINFO</a>) of the Instituto Federal de Ciência e Tecnologia do Ceará (IFCE), Crato campus. Among them, the one that performed best (<a href=\"https://colab.research.google.com/drive/1Ke2L6EvgRtLkiqRJj2vBM-7TJKD0swNk?usp=sharing\">see the notebook with the experiments</a>) was the one based on Transformers.\n",
"\n",
"\n",
"\n",
"<p>For that reason, this post shows how to use a Transformer, more specifically <a href=\"https://github.com/neuralmind-ai/portuguese-bert\">BERTimbau</a>, to classify social network posts, following this agenda:</p>\n",
"\n",
"\n",
"\n",
"1. The dataset\n",
"2. BERT and BERTimbau\n",
"3. Sequence classification\n"
],
"metadata": {
"id": "ckJJUsdXKI_0"
}
},
{
"cell_type": "markdown",
"source": [
"## 1. The dataset\n",
"As in most classification problems, we need a set of example social network posts that contain hate speech, as well as a set of examples that do not. Ideally these examples are already annotated, that is, each one is assigned to one of the classes: contains hate speech or does not.\n",
"\n",
"We will use the [Hate Speech Detection Dataset](https://github.com/LaCAfe/Dataset-Hatespeech), a dataset for analyzing offensive language in Portuguese. As the `origin` column in the preview further down indicates, its posts were collected from Twitter and from the 55chan imageboard.\n",
"\n",
"> NASCIMENTO, Gabriel et al. Hate speech detection using Brazilian imageboards. In: Proceedings of the 25th Brazillian Symposium on Multimedia and the Web. 2019. p. 325-328.\n",
"\n",
"The first step is to download the dataset."
],
"metadata": {
"id": "9cYw48GaKlwa"
}
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"colab": {
"base_uri": "https://localhost:8080/"
},
"id": "54yAbxaTJ6Tu",
"outputId": "5439d505-f685-4174-cbd5-3ccb8ba35dd3"
},
"outputs": [
{
"output_type": "stream",
"name": "stdout",
"text": [
"Cloning into 'Dataset-Hatespeech'...\n",
"remote: Enumerating objects: 12, done.\n",
"remote: Counting objects: 100% (6/6), done.\n",
"remote: Compressing objects: 100% (5/5), done.\n",
"remote: Total 12 (delta 1), reused 6 (delta 1), pack-reused 6\n",
"Unpacking objects: 100% (12/12), done.\n"
]
}
],
"source": [
"!git clone https://github.com/LaCAfe/Dataset-Hatespeech"
]
},
{
"cell_type": "markdown",
"source": [
"Let's see how the dataset is organized:"
],
"metadata": {
"id": "_Ia6ZSrBK8-A"
}
},
{
"cell_type": "code",
"source": [
"import pandas as pd\n",
"df = pd.read_excel(\"Dataset-Hatespeech/data/df_dataset.xlsx\")\n",
"df.head()"
],
"metadata": {
"colab": {
"base_uri": "https://localhost:8080/",
"height": 206
},
"id": "6FpU5orXK-v6",
"outputId": "efeaa809-f135-4ff9-ada8-c0e93eec28a7"
},
"execution_count": null,
"outputs": [
{
"output_type": "execute_result",
"data": {
"text/html": [
"\n",
" <div id=\"df-105b1703-3156-4019-b889-fe3df397b608\">\n",
" <div class=\"colab-df-container\">\n",
" <div>\n",
"<style scoped>\n",
" .dataframe tbody tr th:only-of-type {\n",
" vertical-align: middle;\n",
" }\n",
"\n",
" .dataframe tbody tr th {\n",
" vertical-align: top;\n",
" }\n",
"\n",
" .dataframe thead th {\n",
" text-align: right;\n",
" }\n",
"</style>\n",
"<table border=\"1\" class=\"dataframe\">\n",
" <thead>\n",
" <tr style=\"text-align: right;\">\n",
" <th></th>\n",
" <th>Unnamed: 0</th>\n",
" <th>docno</th>\n",
" <th>has_anger</th>\n",
" <th>origin</th>\n",
" <th>txt</th>\n",
" </tr>\n",
" </thead>\n",
" <tbody>\n",
" <tr>\n",
" <th>0</th>\n",
" <td>0</td>\n",
" <td>23310</td>\n",
" <td>S</td>\n",
" <td>55chan/pol</td>\n",
" <td>&gt;&gt;22994apóio o &gt;&gt;22995. um passo de cada vez. ...</td>\n",
" </tr>\n",
" <tr>\n",
" <th>1</th>\n",
" <td>1</td>\n",
" <td>1141756984838054016</td>\n",
" <td>NaN</td>\n",
" <td>twitter</td>\n",
" <td>eu ainda vou surtar com essa fic</td>\n",
" </tr>\n",
" <tr>\n",
" <th>2</th>\n",
" <td>10</td>\n",
" <td>1141783320881241984</td>\n",
" <td>NaN</td>\n",
" <td>twitter</td>\n",
" <td>@flopdani kkkkkkkkkkkk amei e fiquei com vonta...</td>\n",
" </tr>\n",
" <tr>\n",
" <th>3</th>\n",
" <td>100</td>\n",
" <td>60492</td>\n",
" <td>S</td>\n",
" <td>55chan/pol</td>\n",
" <td>&gt;debater com luquistajá cansei de bater palma ...</td>\n",
" </tr>\n",
" <tr>\n",
" <th>4</th>\n",
" <td>1000</td>\n",
" <td>1137392542050393984</td>\n",
" <td>NaN</td>\n",
" <td>twitter</td>\n",
" <td>hoje eu só tô minha irmã ali kkkkkkkkkkkkkk</td>\n",
" </tr>\n",
" </tbody>\n",
"</table>\n",
"</div>\n",
" <button class=\"colab-df-convert\" onclick=\"convertToInteractive('df-105b1703-3156-4019-b889-fe3df397b608')\"\n",
" title=\"Convert this dataframe to an interactive table.\"\n",
" style=\"display:none;\">\n",
" \n",
" <svg xmlns=\"http://www.w3.org/2000/svg\" height=\"24px\"viewBox=\"0 0 24 24\"\n",
" width=\"24px\">\n",
" <path d=\"M0 0h24v24H0V0z\" fill=\"none\"/>\n",
" <path d=\"M18.56 5.44l.94 2.06.94-2.06 2.06-.94-2.06-.94-.94-2.06-.94 2.06-2.06.94zm-11 1L8.5 8.5l.94-2.06 2.06-.94-2.06-.94L8.5 2.5l-.94 2.06-2.06.94zm10 10l.94 2.06.94-2.06 2.06-.94-2.06-.94-.94-2.06-.94 2.06-2.06.94z\"/><path d=\"M17.41 7.96l-1.37-1.37c-.4-.4-.92-.59-1.43-.59-.52 0-1.04.2-1.43.59L10.3 9.45l-7.72 7.72c-.78.78-.78 2.05 0 2.83L4 21.41c.39.39.9.59 1.41.59.51 0 1.02-.2 1.41-.59l7.78-7.78 2.81-2.81c.8-.78.8-2.07 0-2.86zM5.41 20L4 18.59l7.72-7.72 1.47 1.35L5.41 20z\"/>\n",
" </svg>\n",
" </button>\n",
" \n",
" <style>\n",
" .colab-df-container {\n",
" display:flex;\n",
" flex-wrap:wrap;\n",
" gap: 12px;\n",
" }\n",
"\n",
" .colab-df-convert {\n",
" background-color: #E8F0FE;\n",
" border: none;\n",
" border-radius: 50%;\n",
" cursor: pointer;\n",
" display: none;\n",
" fill: #1967D2;\n",
" height: 32px;\n",
" padding: 0 0 0 0;\n",
" width: 32px;\n",
" }\n",
"\n",
" .colab-df-convert:hover {\n",
" background-color: #E2EBFA;\n",
" box-shadow: 0px 1px 2px rgba(60, 64, 67, 0.3), 0px 1px 3px 1px rgba(60, 64, 67, 0.15);\n",
" fill: #174EA6;\n",
" }\n",
"\n",
" [theme=dark] .colab-df-convert {\n",
" background-color: #3B4455;\n",
" fill: #D2E3FC;\n",
" }\n",
"\n",
" [theme=dark] .colab-df-convert:hover {\n",
" background-color: #434B5C;\n",
" box-shadow: 0px 1px 3px 1px rgba(0, 0, 0, 0.15);\n",
" filter: drop-shadow(0px 1px 2px rgba(0, 0, 0, 0.3));\n",
" fill: #FFFFFF;\n",
" }\n",
" </style>\n",
"\n",
" <script>\n",
" const buttonEl =\n",
" document.querySelector('#df-105b1703-3156-4019-b889-fe3df397b608 button.colab-df-convert');\n",
" buttonEl.style.display =\n",
" google.colab.kernel.accessAllowed ? 'block' : 'none';\n",
"\n",
" async function convertToInteractive(key) {\n",
" const element = document.querySelector('#df-105b1703-3156-4019-b889-fe3df397b608');\n",
" const dataTable =\n",
" await google.colab.kernel.invokeFunction('convertToInteractive',\n",
" [key], {});\n",
" if (!dataTable) return;\n",
"\n",
" const docLinkHtml = 'Like what you see? Visit the ' +\n",
" '<a target=\"_blank\" href=https://colab.research.google.com/notebooks/data_table.ipynb>data table notebook</a>'\n",
" + ' to learn more about interactive tables.';\n",
" element.innerHTML = '';\n",
" dataTable['output_type'] = 'display_data';\n",
" await google.colab.output.renderOutput(dataTable, element);\n",
" const docLink = document.createElement('div');\n",
" docLink.innerHTML = docLinkHtml;\n",
" element.appendChild(docLink);\n",
" }\n",
" </script>\n",
" </div>\n",
" </div>\n",
" "
],
"text/plain": [
" Unnamed: 0 ... txt\n",
"0 0 ... >>22994apóio o >>22995. um passo de cada vez. ...\n",
"1 1 ... eu ainda vou surtar com essa fic\n",
"2 10 ... @flopdani kkkkkkkkkkkk amei e fiquei com vonta...\n",
"3 100 ... >debater com luquistajá cansei de bater palma ...\n",
"4 1000 ... hoje eu só tô minha irmã ali kkkkkkkkkkkkkk \n",
"\n",
"[5 rows x 5 columns]"
]
},
"metadata": {},
"execution_count": 2
}
]
},
{
"cell_type": "markdown",
"source": [
"Checking how the data is balanced between the classes, we see that both classes have exactly the same number of examples. Note also that the next cell maps `S` to 1 and missing values to 0, so the dataset ends up with two classes: 1 (contains hate speech) and 0 (does not contain hate speech)."
],
"metadata": {
"id": "cN0kh53_LD3M"
}
},
{
"cell_type": "code",
"source": [
"import seaborn as sns\n",
"\n",
"df['has_anger'] = df['has_anger'].fillna(0)\n",
"df['has_anger'] = df['has_anger'].replace(\"S\",1)\n",
"\n",
"ax = sns.countplot(x=\"has_anger\",data=df)"
],
"metadata": {
"colab": {
"base_uri": "https://localhost:8080/",
"height": 283
},
"id": "OiIlhV-xLGQ6",
"outputId": "4b7cbc74-01fd-4236-e9b8-bbb52c4c16a3"
},
"execution_count": null,
"outputs": [
{
"output_type": "display_data",
"data": {
"image/png": "iVBORw0KGgoAAAANSUhEUgAAAYsAAAEKCAYAAADjDHn2AAAABHNCSVQICAgIfAhkiAAAAAlwSFlzAAALEgAACxIB0t1+/AAAADh0RVh0U29mdHdhcmUAbWF0cGxvdGxpYiB2ZXJzaW9uMy4yLjIsIGh0dHA6Ly9tYXRwbG90bGliLm9yZy+WH4yJAAAVNUlEQVR4nO3df7BfdZ3f8efLgOAqFVhuWTYJDeumdWBbAe8Cu3amLI4k0K7BXXVhdiVSxrhT2NF2ZytsW3FRpjrqUrDIbBwiwbEiq25JHbpsClrXVn4kGn4EZLgFLMkEcpcAQq10Et/94/uJfknuzbnJ3nPvDff5mDlzz3mfzznnfZmQV86P7/mmqpAkaV9eNdsNSJLmPsNCktTJsJAkdTIsJEmdDAtJUifDQpLUqfewSLIgyfeSfL0tn5Dk7iRjSb6c5NWtflhbHmvrlwzt4/JWfyTJsr57liS93EycWXwAeHho+RPA1VX1y8CzwMWtfjHwbKtf3caR5ETgfOAkYDnw2SQLZqBvSVKTPj+Ul2QRsBa4CvhXwG8C48AvVNXOJL8GfKSqliW5vc1/J8khwFPACHAZQFX9+7bPn46b7LjHHHNMLVmypLffS5JeiTZu3Pg3VTUy0bpDej72fwD+NXBEW/554Lmq2tmWtwAL2/xC4EmAFiTPt/ELgbuG9jm8zYSWLFnChg0bpuUXkKT5IskPJlvX22WoJP8M2F5VG/s6xh7HW5VkQ5IN4+PjM3FISZo3+rxn8Rbg7UmeAG4GzgKuAY5sl5kAFgFb2/xWYDFAW/964Jnh+gTb/FRVra6q0aoaHRmZ8CxKknSAeguLqrq8qhZV1RIGN6jvrKrfBb4BvLMNWwnc2ubXtWXa+jtrcENlHXB+e1rqBGApcE9ffUuS9tb3PYuJfAi4OcnHgO8BN7T6DcAXkowBOxgEDFW1OcktwEPATuCSqto1821L0vzV69NQs2V0dLS8wS1J+yfJxqoanWidn+CWJHUyLCRJnQwLSVInw0KS1Gk2noY6KLz5j26a7RY0B2385IWz3QIA//vKfzjbLWgOOv7DD/S2b88sJEmdDAtJUifDQpLUybCQJHUyLCRJnQwLSVInw0KS1MmwkCR1MiwkSZ0MC0lSJ8NCktTJsJAkdTIsJEmdeguLJIcnuSfJfUk2J/mTVr8xyeNJNrXp5FZPkmuTjCW5P8mpQ/tameTRNq3sq2dJ0sT6fEX5S8BZVfVikkOBbyf5r23dH1XVV/YYfw6wtE2nA9cDpyc5GrgCGAUK2JhkXVU922PvkqQhvZ1Z1MCLbfHQNtU+NlkB3NS2uws4MslxwDJgfVXtaAGxHljeV9+SpL31es8iyYIkm4DtDP7Cv7utuqpdaro6yWGtthB4cmjzLa02WV2SNEN6DYuq2lVVJwOLgNOS/ApwOfBG4FeBo4EPTcexkqxKsiHJhvHx8enYpSSpmZGnoarqOeAbwPKq2tYuNb0EfB44rQ3bCiwe2mxRq01W3/MYq6tqtKpGR0ZG+vg1JGne6vNpqJEkR7b51wBvA77f7kOQJMB5wINtk3XAhe2pqDOA56tqG3A7cHaSo5IcBZzdapKkGdLn01DHAWuTLGAQSrdU1deT3JlkBAiwCfj9Nv424FxgDPgRcBFAVe1I8lHg3jbuyqra0WPfkqQ99BYWVXU/cMoE9bMmGV/AJZOsWwOsmdYGJUlT5ie4JUmdDAtJUifDQpLUybCQJHUyLCRJnQwLSVInw0KS1MmwkCR1MiwkSZ0MC0lSJ8NCktTJsJAkdTIsJEmdDAtJUifDQpLUybCQJHUyLCRJnQwLSVKn3sIiyeFJ7klyX5LNSf6k1U9IcneSsSRfTvLqVj+sLY+19UuG9nV5qz+SZFlfPUuSJtbnmcVLwFlV9SbgZGB5kjOATwBXV9UvA88CF7fxFwPPtvrVbRxJTgTOB04ClgOfTbKgx74lSXvoLSxq4MW2eG
ibCjgL+EqrrwXOa/Mr2jJt/VuTpNVvrqqXqupxYAw4ra++JUl76/WeRZIFSTYB24H1wP8CnquqnW3IFmBhm18IPAnQ1j8P/PxwfYJtJEkzoNewqKpdVXUysIjB2cAb+zpWklVJNiTZMD4+3tdhJGlempGnoarqOeAbwK8BRyY5pK1aBGxt81uBxQBt/euBZ4brE2wzfIzVVTVaVaMjIyO9/B6SNF/1+TTUSJIj2/xrgLcBDzMIjXe2YSuBW9v8urZMW39nVVWrn9+eljoBWArc01ffkqS9HdI95IAdB6xtTy69Crilqr6e5CHg5iQfA74H3NDG3wB8IckYsIPBE1BU1eYktwAPATuBS6pqV499S5L20FtYVNX9wCkT1B9jgqeZqurHwLsm2ddVwFXT3aMkaWr8BLckqZNhIUnqZFhIkjoZFpKkToaFJKmTYSFJ6mRYSJI6GRaSpE6GhSSpk2EhSepkWEiSOhkWkqROhoUkqZNhIUnqZFhIkjoZFpKkToaFJKmTYSFJ6tRbWCRZnOQbSR5KsjnJB1r9I0m2JtnUpnOHtrk8yViSR5IsG6ovb7WxJJf11bMkaWK9fQc3sBP4w6r6bpIjgI1J1rd1V1fVp4YHJzkROB84CfhF4L8l+ftt9XXA24AtwL1J1lXVQz32Lkka0ltYVNU2YFubfyHJw8DCfWyyAri5ql4CHk8yBpzW1o1V1WMASW5uYw0LSZohM3LPIskS4BTg7la6NMn9SdYkOarVFgJPDm22pdUmq0uSZkjvYZHkdcBXgQ9W1Q+B64E3ACczOPP49DQdZ1WSDUk2jI+PT8cuJUlNr2GR5FAGQfHFqvoaQFU9XVW7quonwOf42aWmrcDioc0Xtdpk9ZepqtVVNVpVoyMjI9P/y0jSPNbn01ABbgAerqo/HaofNzTsHcCDbX4dcH6Sw5KcACwF7gHuBZYmOSHJqxncBF/XV9+SpL31+TTUW4D3AA8k2dRqfwxckORkoIAngPcDVNXmJLcwuHG9E7ikqnYBJLkUuB1YAKypqs099i1J2kOfT0N9G8gEq27bxzZXAVdNUL9tX9tJkvrlJ7glSZ0MC0lSJ8NCktTJsJAkdTIsJEmdDAtJUifDQpLUybCQJHUyLCRJnaYUFknumEpNkvTKtM/XfSQ5HPg54Jj2vRO7X9/xd/A7JSRp3uh6N9T7gQ8y+JrTjfwsLH4I/Mce+5IkzSH7DIuquga4JskfVNVnZqgnSdIcM6W3zlbVZ5L8OrBkeJuquqmnviRJc8iUwiLJFxh8FeomYFcrF2BYSNI8MNXvsxgFTqyq6rMZSdLcNNXPWTwI/EKfjUiS5q6pnlkcAzyU5B7gpd3Fqnp7L11JkuaUqYbFR/Z3x0kWM7incSyD+xurq+qaJEcDX2Zws/wJ4N1V9WySANcA5wI/At5bVd9t+1oJ/Nu2649V1dr97UeSdOCm+jTUfz+Afe8E/rCqvpvkCGBjkvXAe4E7qurjSS4DLgM+BJwDLG3T6cD1wOktXK5gcN+k2n7WVdWzB9CTJOkATPV1Hy8k+WGbfpxkV5If7mubqtq2+8ygql4AHmbwqe8VwO4zg7XAeW1+BXBTDdwFHJnkOGAZsL6qdrSAWA8s38/fU5L0tzDVM4sjds+3y0UrgDOmepAkS4BTgLuBY6tqW1v1FIPLVDAIkieHNtvSapPVJUkzZL/fOtv+5f+fGfyLv1OS1wFfBT5YVS87G2mP4k7L47hJViXZkGTD+Pj4dOxSktRM9UN5vzW0+CoG9w9+PIXtDmUQFF+sqq+18tNJjquqbe0y0/ZW3wosHtp8UattBc7co/7NPY9VVauB1QCjo6N+HkSSptFUzyx+c2haBrzA4FLUpNrlqhuAh6vqT4dWrQNWtvmVwK1D9QszcAbwfLtcdTtwdpKj2ptvz241SdIMmeo9i4sOYN9vAd4DPJBkU6v9MfBx4JYkFwM/AN7d1t3G4LHZMQaPzl7Ujr0jyUeBe9u4K6tqxw
H0I0k6QFO9DLUI+AyDAAD4a+ADVbVlsm2q6tv87JXme3rrBOMLuGSSfa0B1kylV0nS9JvqZajPM7hM9Itt+i+tJkmaB6YaFiNV9fmq2tmmG4GRHvuSJM0hUw2LZ5L8XpIFbfo94Jk+G5MkzR1TDYt/zuBG9FPANuCdDF7bIUmaB6b6IsErgZW738fU3tf0KQYhIkl6hZvqmcU/Gn5xX3t09ZR+WpIkzTVTDYtXtQ/EAT89s5jqWYkk6SA31b/wPw18J8mft+V3AVf105Ikaa6Z6ie4b0qyATirlX6rqh7qry1J0lwy5UtJLRwMCEmah/b7FeWSpPnHsJAkdTIsJEmdDAtJUifDQpLUybCQJHUyLCRJnQwLSVKn3sIiyZok25M8OFT7SJKtSTa16dyhdZcnGUvySJJlQ/XlrTaW5LK++pUkTa7PM4sbgeUT1K+uqpPbdBtAkhOB84GT2jaf3f1FS8B1wDnAicAFbawkaQb19ubYqvpWkiVTHL4CuLmqXgIeTzIGnNbWjVXVYwBJbm5jfe2IJM2g2bhncWmS+9tlqt2vPV8IPDk0ZkurTVaXJM2gmQ6L64E3ACcz+HrWT0/XjpOsSrIhyYbx8fHp2q0kiRkOi6p6uqp2VdVPgM/xs0tNW4HFQ0MXtdpk9Yn2vbqqRqtqdGRkZPqbl6R5bEbDIslxQ4vvAHY/KbUOOD/JYUlOAJYC9wD3AkuTnJDk1Qxugq+byZ4lST3e4E7yJeBM4JgkW4ArgDOTnAwU8ATwfoCq2pzkFgY3rncCl1TVrrafS4HbgQXAmqra3FfPkqSJ9fk01AUTlG/Yx/irmOCrWtvjtbdNY2uSpP3kJ7glSZ0MC0lSJ8NCktTJsJAkdTIsJEmdDAtJUifDQpLUybCQJHUyLCRJnQwLSVInw0KS1MmwkCR1MiwkSZ0MC0lSJ8NCktTJsJAkdTIsJEmdDAtJUqfewiLJmiTbkzw4VDs6yfokj7afR7V6klybZCzJ/UlOHdpmZRv/aJKVffUrSZpcn2cWNwLL96hdBtxRVUuBO9oywDnA0jatAq6HQbgAVwCnA6cBV+wOGEnSzOktLKrqW8COPcorgLVtfi1w3lD9phq4CzgyyXHAMmB9Ve2oqmeB9ewdQJKkns30PYtjq2pbm38KOLbNLwSeHBq3pdUmq0uSZtCs3eCuqgJquvaXZFWSDUk2jI+PT9duJUnMfFg83S4v0X5ub/WtwOKhcYtabbL6XqpqdVWNVtXoyMjItDcuSfPZTIfFOmD3E00rgVuH6he2p6LOAJ5vl6tuB85OclS7sX12q0mSZtAhfe04yZeAM4Fjkmxh8FTTx4FbklwM/AB4dxt+G3AuMAb8CLgIoKp2JPkocG8bd2VV7XnTXJLUs97CoqoumGTVWycYW8Alk+xnDbBmGluTJO0nP8EtSepkWEiSOhkWkqROhoUkqZNhIUnqZFhIkjoZFpKkToaFJKmTYSFJ6mRYSJI6GRaSpE6GhSSpk2EhSepkWEiSOhkWkqROhoUkqZNhIUnqZFhIkjrNSlgkeSLJA0k2JdnQakcnWZ/k0fbzqFZPkmuTjCW5P8mps9GzJM1ns3lm8RtVdXJVjbbly4A7qmopcEdbBjgHWNqmVcD1M96pJM1zc+ky1ApgbZtfC5w3VL+pBu4Cjkxy3Gw0KEnz1WyFRQF/lWRjklWtdmxVbWvzTwHHtvmFwJND225pNUnSDDlklo77j6tqa5K/C6xP8v3hlVVVSWp/dthCZxXA8ccfP32dSpJm58yiqra2n9uBvwBOA57efXmp/dzehm8FFg9tvqjV9tzn6qoararRkZGRPtuXpHlnxsMiyWuTHLF7HjgbeBBYB6xsw1YCt7b5dcCF7amoM4Dnhy5XSZJmwGxchjoW+Isku4//n6rqL5PcC9yS5GLgB8C72/jbgHOBMeBHwEUz37IkzW8zHhZV9RjwpgnqzwBvnaBewCUz0JokaRJz6dFZSdIcZVhIkjoZFpKkTo
aFJKmTYSFJ6mRYSJI6GRaSpE6GhSSpk2EhSepkWEiSOhkWkqROhoUkqZNhIUnqZFhIkjoZFpKkToaFJKmTYSFJ6mRYSJI6HTRhkWR5kkeSjCW5bLb7kaT55KAIiyQLgOuAc4ATgQuSnDi7XUnS/HFQhAVwGjBWVY9V1f8DbgZWzHJPkjRvHCxhsRB4cmh5S6tJkmbAIbPdwHRJsgpY1RZfTPLIbPbzCnMM8Dez3cRckE+tnO0WtDf/fO52Rf62e/h7k604WMJiK7B4aHlRq/1UVa0GVs9kU/NFkg1VNTrbfUgT8c/nzDhYLkPdCyxNckKSVwPnA+tmuSdJmjcOijOLqtqZ5FLgdmABsKaqNs9yW5I0bxwUYQFQVbcBt812H/OUl/c0l/nncwakqma7B0nSHHew3LOQJM0iw0L75GtWNBclWZNke5IHZ7uX+cKw0KR8zYrmsBuB5bPdxHxiWGhffM2K5qSq+hawY7b7mE8MC+2Lr1mRBBgWkqQpMCy0L52vWZE0PxgW2hdfsyIJMCy0D1W1E9j9mpWHgVt8zYrmgiRfAr4D/IMkW5JcPNs9vdL5CW5JUifPLCRJnQwLSVInw0KS1MmwkCR1MiwkSZ0MC0lSJ8NCGpJkia+9lvZmWEivQEkOmq9M1sHBsJD2tiDJ55JsTvJXSV6T5H1J7k1yX5KvJvk5gCTvSvJgq39rsh22M5a/TvLdNv16q5+Z5JtJvpLk+0m+mCRt3bmttjHJtUm+3uqvbV/+c0+S7yVZ0ervTbIuyZ3AHb3/V9K8YlhIe1sKXFdVJwHPAb8NfK2qfrWq3sTg1Se7Xy/xYWBZq799H/vcDrytqk4Ffge4dmjdKcAHGXzB1C8Bb0lyOPBnwDlV9WZgZGj8vwHurKrTgN8APpnktW3dqcA7q+qfHODvLk3IsJD29nhVbWrzG4ElwK+0M4MHgN8FTmrr/wdwY5L3AQv2sc9Dgc+17f+cQTDsdk9VbamqnwCb2vHeCDxWVY+3MV8aGn82cFmSTcA3gcOB49u69VXllwJp2nldU9rbS0Pzu4DXMPgaz/Oq6r4k7wXOBKiq309yOvBPgY1J3lxVz0ywz38JPA28icE/0n68j+N1/X8Z4Ler6pGXFQd9/J+ObaUD4pmFNDVHANuSHMrgzAKAJG+oqrur6sPAOC///o9hrwe2tbOH97DvsxCAR4BfSrKkLf/O0LrbgT8Yurdxyn7+LtJ+Myykqfl3wN0MLjt9f6j+ySQPtMdt/ydw3yTbfxZYmeQ+BpeY9nkGUFX/F/gXwF8m2Qi8ADzfVn+UwWWt+5NsbstSr3xFuTRHJXldVb3YziCuAx6tqqtnuy/NT55ZSHPX+9pN7M0MLmP92Sz3o3nMMwtpGiVZBnxij/LjVfWO2ehHmi6GhSSpk5ehJEmdDAtJUifDQpLUybCQJHUyLCRJnf4/GpwkCY1/BjQAAAAASUVORK5CYII=\n",
"text/plain": [
"<Figure size 432x288 with 1 Axes>"
]
},
"metadata": {
"needs_background": "light"
}
}
]
},
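{
"cell_type": "markdown",
"source": [
"The balance can also be checked numerically rather than read off the plot. The sketch below is illustrative only: the `toy` DataFrame and its rows are made up, and only the `has_anger` column name and the `S`/missing labels come from the dataset. It converts the labels to 0/1 and counts the examples per class.\n",
"\n",
"```python\n",
"import numpy as np\n",
"import pandas as pd\n",
"\n",
"# Hypothetical toy data: 'S' marks hate speech, NaN marks its absence.\n",
"toy = pd.DataFrame({\n",
"    'txt': ['a', 'b', 'c', 'd'],\n",
"    'has_anger': ['S', np.nan, 'S', np.nan],\n",
"})\n",
"\n",
"# Equivalent label conversion, written as a boolean test to get integer labels.\n",
"toy['has_anger'] = (toy['has_anger'] == 'S').astype(int)\n",
"\n",
"# Absolute and relative number of examples per class.\n",
"counts = toy['has_anger'].value_counts().sort_index()\n",
"proportions = toy['has_anger'].value_counts(normalize=True).sort_index()\n",
"print(counts)\n",
"print(proportions)\n",
"```\n",
"\n",
"On the real data, the same `value_counts` calls on `df['has_anger']` should give the per-class totals that the bar chart displays."
],
"metadata": {}
},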
{
"cell_type": "markdown",
"source": [
"### 1.1. Cleaning the data\n",
"\n",
"Most datasets used for classification problems need some preprocessing, because the data often has problems such as duplicates or content with little discriminative power (i.e., content that adds very little to the decision for a class, or even gets in the way). With text, we may also have very short documents (fewer than 3 tokens).\n",
"\n",
"Here we need to introduce two important terms that we will use from this point on:\n",
"\n",
"**Document:** In Natural Language Processing (NLP), a document can be a full text (e.g., a newspaper article), a paragraph, a sentence, a clause, etc. It is a unit of language, which in our case is a social network post. So no matter how many sentences a post contains, in this notebook it is always a single document.\n",
"\n",
"**Token:** Also a unit of language, but much smaller than a document: it is the word itself. We therefore treat each word as a token. We will see later that subwords can be treated as tokens in some contexts, but for preprocessing purposes in this notebook each word is a token.\n",
"\n",
"Each document, then, is a sequence of tokens."
],
"metadata": {
"id": "o1C68jiMM1b2"
}
},
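{
"cell_type": "markdown",
"source": [
"To make the two terms concrete, here is a minimal sketch that treats each post as a document and each whitespace-separated word as a token. The `tokenize` helper and the example list are assumptions made for illustration, not part of the original notebook; the 3-token threshold is the one mentioned above.\n",
"\n",
"```python\n",
"# Hypothetical posts; each string is one document.\n",
"documents = [\n",
"    'bora assistir!',\n",
"    'eu ainda vou surtar com essa fic',\n",
"    'obrigado',\n",
"]\n",
"\n",
"def tokenize(document):\n",
"    # Naive whitespace tokenization: every whitespace-separated word is one token.\n",
"    return document.split()\n",
"\n",
"# Number of tokens in each document.\n",
"token_counts = [len(tokenize(doc)) for doc in documents]\n",
"\n",
"# Keep only the documents with at least 3 tokens.\n",
"long_enough = [doc for doc in documents if len(tokenize(doc)) >= 3]\n",
"print(token_counts)\n",
"print(long_enough)\n",
"```\n",
"\n",
"Splitting on whitespace leaves punctuation attached to words (`assistir!` is one token here), which is acceptable for counting but is cruder than what a real tokenizer does."
],
"metadata": {}
},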
{
"cell_type": "markdown",
"source": [
"#### 1.1.1. Removing duplicates\n",
"The first step of our cleaning is to drop duplicate posts."
],
"metadata": {
"id": "cQzu_XZqUapp"
}
},
{
"cell_type": "code",
"source": [
"df.drop_duplicates(subset=['txt'], keep='first', inplace=True)\n",
"df.info()"
],
"metadata": {
"id": "oExLFEluUjKO",
"colab": {
"base_uri": "https://localhost:8080/"
},
"outputId": "42fc1169-64dc-4e42-ce10-3cb161f4b067"
},
"execution_count": null,
"outputs": [
{
"output_type": "stream",
"name": "stdout",
"text": [
"<class 'pandas.core.frame.DataFrame'>\n",
"Int64Index: 7636 entries, 0 to 7671\n",
"Data columns (total 5 columns):\n",
" # Column Non-Null Count Dtype \n",
"--- ------ -------------- ----- \n",
" 0 Unnamed: 0 7636 non-null int64 \n",
" 1 docno 7636 non-null int64 \n",
" 2 has_anger 7636 non-null int64 \n",
" 3 origin 7636 non-null object\n",
" 4 txt 7635 non-null object\n",
"dtypes: int64(3), object(2)\n",
"memory usage: 357.9+ KB\n"
]
}
]
},
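{
"cell_type": "markdown",
"source": [
"The effect of `drop_duplicates` is easier to see on a tiny example. The sketch below uses a made-up `toy` DataFrame (its rows are assumptions for illustration) with the same `subset` and `keep` arguments, but without `inplace=True`, so the original frame stays available for comparison.\n",
"\n",
"```python\n",
"import pandas as pd\n",
"\n",
"# Hypothetical toy data with one repeated text.\n",
"toy = pd.DataFrame({\n",
"    'txt': ['bora assistir!', 'obrigado', 'bora assistir!'],\n",
"    'has_anger': [0, 0, 0],\n",
"})\n",
"\n",
"# keep='first' retains the first occurrence of each repeated text and drops the rest.\n",
"deduplicated = toy.drop_duplicates(subset=['txt'], keep='first')\n",
"removed = len(toy) - len(deduplicated)\n",
"print(removed)\n",
"print(deduplicated.index.tolist())\n",
"```\n",
"\n",
"The surviving rows keep their original index labels, which is why `df.info()` above reports 7636 entries with labels running from 0 to 7671. Calling `reset_index(drop=True)` afterwards would renumber them if a contiguous index is needed."
],
"metadata": {}
},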
{
"cell_type": "markdown",
"source": [
"#### 1.1.2. Removing usernames\n",
"\n",
"In this context, usernames carry no discriminative power, so they should not influence the model's choice of one class over the other. Removing them also makes the model more impersonal."
],
"metadata": {
"id": "Q3pBvoOLUj_D"
}
},
{
"cell_type": "code",
"source": [
"import numpy as np\n",
"import re\n",
"\n",
"df['txt'] = df['txt'].apply(str)\n",
"df['txt'] = df['txt'].str.lower()\n",
"\n",
"def remove_users(tweet, pattern1, pattern2):\n",
"    # Delete every match of each pattern in turn. Substituting with the pattern\n",
"    # itself avoids re-using matched text as a regex, which could over-match.\n",
"    tweet = re.sub(pattern1, '', tweet)\n",
"    return re.sub(pattern2, '', tweet)\n",
"\n",
"# pattern1 covers '@ name' (with a space), pattern2 covers '@name'.\n",
"df['tidy_tweet'] = np.vectorize(remove_users)(df['txt'], r\"@ \\w*\", r\"@\\w*\")\n",
"df.head()"
],
"metadata": {
"id": "jTW_hD2fUjSj",
"colab": {
"base_uri": "https://localhost:8080/",
"height": 354
},
"outputId": "a9cc3cbc-613e-459d-a575-0f9ad7d34e66"
},
"execution_count": null,
"outputs": [
{
"output_type": "execute_result",
"data": {
"text/html": [
"\n",
" <div id=\"df-cf4b1c12-3122-4c57-8a05-fb0fdda9a2ad\">\n",
" <div class=\"colab-df-container\">\n",
" <div>\n",
"<style scoped>\n",
" .dataframe tbody tr th:only-of-type {\n",
" vertical-align: middle;\n",
" }\n",
"\n",
" .dataframe tbody tr th {\n",
" vertical-align: top;\n",
" }\n",
"\n",
" .dataframe thead th {\n",
" text-align: right;\n",
" }\n",
"</style>\n",
"<table border=\"1\" class=\"dataframe\">\n",
" <thead>\n",
" <tr style=\"text-align: right;\">\n",
" <th></th>\n",
" <th>Unnamed: 0</th>\n",
" <th>docno</th>\n",
" <th>has_anger</th>\n",
" <th>origin</th>\n",
" <th>txt</th>\n",
" <th>tidy_tweet</th>\n",
" </tr>\n",
" </thead>\n",
" <tbody>\n",
" <tr>\n",
" <th>0</th>\n",
" <td>0</td>\n",
" <td>23310</td>\n",
" <td>1</td>\n",
" <td>55chan/pol</td>\n",
" <td>&gt;&gt;22994apóio o &gt;&gt;22995. um passo de cada vez. ...</td>\n",
" <td>&gt;&gt;22994apóio o &gt;&gt;22995. um passo de cada vez. ...</td>\n",
" </tr>\n",
" <tr>\n",
" <th>1</th>\n",
" <td>1</td>\n",
" <td>1141756984838054016</td>\n",
" <td>0</td>\n",
" <td>twitter</td>\n",
" <td>eu ainda vou surtar com essa fic</td>\n",
" <td>eu ainda vou surtar com essa fic</td>\n",
" </tr>\n",
" <tr>\n",
" <th>2</th>\n",
" <td>10</td>\n",
" <td>1141783320881241984</td>\n",
" <td>0</td>\n",
" <td>twitter</td>\n",
" <td>@flopdani kkkkkkkkkkkk amei e fiquei com vonta...</td>\n",
" <td>kkkkkkkkkkkk amei e fiquei com vontade desse ...</td>\n",
" </tr>\n",
" <tr>\n",
" <th>3</th>\n",
" <td>100</td>\n",
" <td>60492</td>\n",
" <td>1</td>\n",
" <td>55chan/pol</td>\n",
" <td>&gt;debater com luquistajá cansei de bater palma ...</td>\n",
" <td>&gt;debater com luquistajá cansei de bater palma ...</td>\n",
" </tr>\n",
" <tr>\n",
" <th>4</th>\n",
" <td>1000</td>\n",
" <td>1137392542050393984</td>\n",
" <td>0</td>\n",
" <td>twitter</td>\n",
" <td>hoje eu só tô minha irmã ali kkkkkkkkkkkkkk</td>\n",
" <td>hoje eu só tô minha irmã ali kkkkkkkkkkkkkk</td>\n",
" </tr>\n",
" </tbody>\n",
"</table>\n",
"</div>\n",
" <button class=\"colab-df-convert\" onclick=\"convertToInteractive('df-cf4b1c12-3122-4c57-8a05-fb0fdda9a2ad')\"\n",
" title=\"Convert this dataframe to an interactive table.\"\n",
" style=\"display:none;\">\n",
" \n",
" <svg xmlns=\"http://www.w3.org/2000/svg\" height=\"24px\"viewBox=\"0 0 24 24\"\n",
" width=\"24px\">\n",
" <path d=\"M0 0h24v24H0V0z\" fill=\"none\"/>\n",
" <path d=\"M18.56 5.44l.94 2.06.94-2.06 2.06-.94-2.06-.94-.94-2.06-.94 2.06-2.06.94zm-11 1L8.5 8.5l.94-2.06 2.06-.94-2.06-.94L8.5 2.5l-.94 2.06-2.06.94zm10 10l.94 2.06.94-2.06 2.06-.94-2.06-.94-.94-2.06-.94 2.06-2.06.94z\"/><path d=\"M17.41 7.96l-1.37-1.37c-.4-.4-.92-.59-1.43-.59-.52 0-1.04.2-1.43.59L10.3 9.45l-7.72 7.72c-.78.78-.78 2.05 0 2.83L4 21.41c.39.39.9.59 1.41.59.51 0 1.02-.2 1.41-.59l7.78-7.78 2.81-2.81c.8-.78.8-2.07 0-2.86zM5.41 20L4 18.59l7.72-7.72 1.47 1.35L5.41 20z\"/>\n",
" </svg>\n",
" </button>\n",
" \n",
" <style>\n",
" .colab-df-container {\n",
" display:flex;\n",
" flex-wrap:wrap;\n",
" gap: 12px;\n",
" }\n",
"\n",
" .colab-df-convert {\n",
" background-color: #E8F0FE;\n",
" border: none;\n",
" border-radius: 50%;\n",
" cursor: pointer;\n",
" display: none;\n",
" fill: #1967D2;\n",
" height: 32px;\n",
" padding: 0 0 0 0;\n",
" width: 32px;\n",
" }\n",
"\n",
" .colab-df-convert:hover {\n",
" background-color: #E2EBFA;\n",
" box-shadow: 0px 1px 2px rgba(60, 64, 67, 0.3), 0px 1px 3px 1px rgba(60, 64, 67, 0.15);\n",
" fill: #174EA6;\n",
" }\n",
"\n",
" [theme=dark] .colab-df-convert {\n",
" background-color: #3B4455;\n",
" fill: #D2E3FC;\n",
" }\n",
"\n",
" [theme=dark] .colab-df-convert:hover {\n",
" background-color: #434B5C;\n",
" box-shadow: 0px 1px 3px 1px rgba(0, 0, 0, 0.15);\n",
" filter: drop-shadow(0px 1px 2px rgba(0, 0, 0, 0.3));\n",
" fill: #FFFFFF;\n",
" }\n",
" </style>\n",
"\n",
" <script>\n",
" const buttonEl =\n",
" document.querySelector('#df-cf4b1c12-3122-4c57-8a05-fb0fdda9a2ad button.colab-df-convert');\n",
" buttonEl.style.display =\n",
" google.colab.kernel.accessAllowed ? 'block' : 'none';\n",
"\n",
" async function convertToInteractive(key) {\n",
" const element = document.querySelector('#df-cf4b1c12-3122-4c57-8a05-fb0fdda9a2ad');\n",
" const dataTable =\n",
" await google.colab.kernel.invokeFunction('convertToInteractive',\n",
" [key], {});\n",
" if (!dataTable) return;\n",
"\n",
" const docLinkHtml = 'Like what you see? Visit the ' +\n",
" '<a target=\"_blank\" href=https://colab.research.google.com/notebooks/data_table.ipynb>data table notebook</a>'\n",
" + ' to learn more about interactive tables.';\n",
" element.innerHTML = '';\n",
" dataTable['output_type'] = 'display_data';\n",
" await google.colab.output.renderOutput(dataTable, element);\n",
" const docLink = document.createElement('div');\n",
" docLink.innerHTML = docLinkHtml;\n",
" element.appendChild(docLink);\n",
" }\n",
" </script>\n",
" </div>\n",
" </div>\n",
" "
],
"text/plain": [
" Unnamed: 0 ... tidy_tweet\n",
"0 0 ... >>22994apóio o >>22995. um passo de cada vez. ...\n",
"1 1 ... eu ainda vou surtar com essa fic\n",
"2 10 ... kkkkkkkkkkkk amei e fiquei com vontade desse ...\n",
"3 100 ... >debater com luquistajá cansei de bater palma ...\n",
"4 1000 ... hoje eu só tô minha irmã ali kkkkkkkkkkkkkk \n",
"\n",
"[5 rows x 6 columns]"
]
},
"metadata": {},
"execution_count": 5
}
]
},
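{
"cell_type": "markdown",
"source": [
"The mention removal can also be tried in isolation. The following is a minimal sketch, not the notebook's own function: `strip_mentions` and the `MENTION` pattern are hypothetical names, and a single regular expression with an optional space stands in for the two patterns used above.\n",
"\n",
"```python\n",
"import re\n",
"\n",
"# Matches '@', an optional single space, then the handle's word characters.\n",
"MENTION = re.compile(r'@ ?\\w*')\n",
"\n",
"def strip_mentions(text):\n",
"    # Replace every mention with the empty string; surrounding spaces are kept.\n",
"    return MENTION.sub('', text)\n",
"\n",
"print(repr(strip_mentions('@ju_reis_21 obrigado')))\n",
"```\n",
"\n",
"Because the surrounding whitespace is kept, the cleaned text can start with a space, as in the `tidy_tweet` column above. A pattern this simple also strips the part of an e-mail address from the `@` onward, so it is only a reasonable fit for social network posts."
],
"metadata": {}
},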
{
"cell_type": "markdown",
"source": [
"#### 1.1.3. Removing hashtags\n",
"\n",
"The same applies to hashtags.\n",
"\n",
"> **NOTE:** you can skip steps of this notebook or modify it to test whether removing hashtags or usernames makes a difference in the final result.\n",
"\n"
],
"metadata": {
"id": "e5OOeiFW28kC"
}
},
{
"cell_type": "code",
"source": [
"df['tidy_tweet'] = np.vectorize(remove_users)(df['tidy_tweet'], r\"# \\w*\", r\"#\\w*\")\n",
"df.head(10)"
],
"metadata": {
"id": "tOEGO0Kv3Ug8",
"colab": {
"base_uri": "https://localhost:8080/",
"height": 563
},
"outputId": "87bf6dc2-fd30-412d-8046-76fea17afa1f"
},
"execution_count": null,
"outputs": [
{
"output_type": "execute_result",
"data": {
"text/html": [
"\n",
" <div id=\"df-703596d7-3db9-43f1-a5bb-c1cc9d9f1987\">\n",
" <div class=\"colab-df-container\">\n",
" <div>\n",
"<style scoped>\n",
" .dataframe tbody tr th:only-of-type {\n",
" vertical-align: middle;\n",
" }\n",
"\n",
" .dataframe tbody tr th {\n",
" vertical-align: top;\n",
" }\n",
"\n",
" .dataframe thead th {\n",
" text-align: right;\n",
" }\n",
"</style>\n",
"<table border=\"1\" class=\"dataframe\">\n",
" <thead>\n",
" <tr style=\"text-align: right;\">\n",
" <th></th>\n",
" <th>Unnamed: 0</th>\n",
" <th>docno</th>\n",
" <th>has_anger</th>\n",
" <th>origin</th>\n",
" <th>txt</th>\n",
" <th>tidy_tweet</th>\n",
" </tr>\n",
" </thead>\n",
" <tbody>\n",
" <tr>\n",
" <th>0</th>\n",
" <td>0</td>\n",
" <td>23310</td>\n",
" <td>1</td>\n",
" <td>55chan/pol</td>\n",
" <td>&gt;&gt;22994apóio o &gt;&gt;22995. um passo de cada vez. ...</td>\n",
" <td>&gt;&gt;22994apóio o &gt;&gt;22995. um passo de cada vez. ...</td>\n",
" </tr>\n",
" <tr>\n",
" <th>1</th>\n",
" <td>1</td>\n",
" <td>1141756984838054016</td>\n",
" <td>0</td>\n",
" <td>twitter</td>\n",
" <td>eu ainda vou surtar com essa fic</td>\n",
" <td>eu ainda vou surtar com essa fic</td>\n",
" </tr>\n",
" <tr>\n",
" <th>2</th>\n",
" <td>10</td>\n",
" <td>1141783320881241984</td>\n",
" <td>0</td>\n",
" <td>twitter</td>\n",
" <td>@flopdani kkkkkkkkkkkk amei e fiquei com vonta...</td>\n",
" <td>kkkkkkkkkkkk amei e fiquei com vontade desse ...</td>\n",
" </tr>\n",
" <tr>\n",
" <th>3</th>\n",
" <td>100</td>\n",
" <td>60492</td>\n",
" <td>1</td>\n",
" <td>55chan/pol</td>\n",
" <td>&gt;debater com luquistajá cansei de bater palma ...</td>\n",
" <td>&gt;debater com luquistajá cansei de bater palma ...</td>\n",
" </tr>\n",
" <tr>\n",
" <th>4</th>\n",
" <td>1000</td>\n",
" <td>1137392542050393984</td>\n",
" <td>0</td>\n",
" <td>twitter</td>\n",
" <td>hoje eu só tô minha irmã ali kkkkkkkkkkkkkk</td>\n",
" <td>hoje eu só tô minha irmã ali kkkkkkkkkkkkkk</td>\n",
" </tr>\n",
" <tr>\n",
" <th>5</th>\n",
" <td>1001</td>\n",
" <td>58451</td>\n",
" <td>1</td>\n",
" <td>55chan/pol</td>\n",
" <td>&gt;&gt;58447&gt;ela me chamou pra dormir na casa dela ...</td>\n",
" <td>&gt;&gt;58447&gt;ela me chamou pra dormir na casa dela ...</td>\n",
" </tr>\n",
" <tr>\n",
" <th>6</th>\n",
" <td>1002</td>\n",
" <td>1141760017302983040</td>\n",
" <td>0</td>\n",
" <td>twitter</td>\n",
" <td>,,a co delas ve volnem case?’’\\nja:</td>\n",
" <td>,,a co delas ve volnem case?’’\\nja:</td>\n",
" </tr>\n",
" <tr>\n",
" <th>7</th>\n",
" <td>1003</td>\n",
" <td>1141757895014539008</td>\n",
" <td>0</td>\n",
" <td>twitter</td>\n",
" <td>@ju_reis_21 obrigado ❤😎</td>\n",
" <td>obrigado ❤😎</td>\n",
" </tr>\n",
" <tr>\n",
" <th>8</th>\n",
" <td>1004</td>\n",
" <td>58049</td>\n",
" <td>1</td>\n",
" <td>55chan/pol</td>\n",
" <td>&gt;&gt;58025seu psiquiatra parece ter errado. muito...</td>\n",
" <td>&gt;&gt;58025seu psiquiatra parece ter errado. muito...</td>\n",
" </tr>\n",
" <tr>\n",
" <th>9</th>\n",
" <td>1005</td>\n",
" <td>1141779411802439936</td>\n",
" <td>0</td>\n",
" <td>twitter</td>\n",
" <td>bora assistir!</td>\n",
" <td>bora assistir!</td>\n",
" </tr>\n",
" </tbody>\n",
"</table>\n",
"</div>\n",
" <button class=\"colab-df-convert\" onclick=\"convertToInteractive('df-703596d7-3db9-43f1-a5bb-c1cc9d9f1987')\"\n",
" title=\"Convert this dataframe to an interactive table.\"\n",
" style=\"display:none;\">\n",
" \n",
" <svg xmlns=\"http://www.w3.org/2000/svg\" height=\"24px\"viewBox=\"0 0 24 24\"\n",
" width=\"24px\">\n",
" <path d=\"M0 0h24v24H0V0z\" fill=\"none\"/>\n",
" <path d=\"M18.56 5.44l.94 2.06.94-2.06 2.06-.94-2.06-.94-.94-2.06-.94 2.06-2.06.94zm-11 1L8.5 8.5l.94-2.06 2.06-.94-2.06-.94L8.5 2.5l-.94 2.06-2.06.94zm10 10l.94 2.06.94-2.06 2.06-.94-2.06-.94-.94-2.06-.94 2.06-2.06.94z\"/><path d=\"M17.41 7.96l-1.37-1.37c-.4-.4-.92-.59-1.43-.59-.52 0-1.04.2-1.43.59L10.3 9.45l-7.72 7.72c-.78.78-.78 2.05 0 2.83L4 21.41c.39.39.9.59 1.41.59.51 0 1.02-.2 1.41-.59l7.78-7.78 2.81-2.81c.8-.78.8-2.07 0-2.86zM5.41 20L4 18.59l7.72-7.72 1.47 1.35L5.41 20z\"/>\n",
" </svg>\n",
" </button>\n",
" \n",
" <style>\n",
" .colab-df-container {\n",
" display:flex;\n",
" flex-wrap:wrap;\n",
" gap: 12px;\n",
" }\n",
"\n",
" .colab-df-convert {\n",
" background-color: #E8F0FE;\n",
" border: none;\n",
" border-radius: 50%;\n",
" cursor: pointer;\n",
" display: none;\n",
" fill: #1967D2;\n",
" height: 32px;\n",
" padding: 0 0 0 0;\n",
" width: 32px;\n",
" }\n",
"\n",
" .colab-df-convert:hover {\n",
" background-color: #E2EBFA;\n",
" box-shadow: 0px 1px 2px rgba(60, 64, 67, 0.3), 0px 1px 3px 1px rgba(60, 64, 67, 0.15);\n",
" fill: #174EA6;\n",
" }\n",
"\n",
" [theme=dark] .colab-df-convert {\n",
" background-color: #3B4455;\n",
" fill: #D2E3FC;\n",
" }\n",
"\n",
" [theme=dark] .colab-df-convert:hover {\n",
" background-color: #434B5C;\n",
" box-shadow: 0px 1px 3px 1px rgba(0, 0, 0, 0.15);\n",
" filter: drop-shadow(0px 1px 2px rgba(0, 0, 0, 0.3));\n",
" fill: #FFFFFF;\n",
" }\n",
" </style>\n",
"\n",
" <script>\n",
" const buttonEl =\n",
" document.querySelector('#df-703596d7-3db9-43f1-a5bb-c1cc9d9f1987 button.colab-df-convert');\n",
" buttonEl.style.display =\n",
" google.colab.kernel.accessAllowed ? 'block' : 'none';\n",
"\n",
" async function convertToInteractive(key) {\n",
" const element = document.querySelector('#df-703596d7-3db9-43f1-a5bb-c1cc9d9f1987');\n",
" const dataTable =\n",
" await google.colab.kernel.invokeFunction('convertToInteractive',\n",
" [key], {});\n",
" if (!dataTable) return;\n",
"\n",
" const docLinkHtml = 'Like what you see? Visit the ' +\n",
" '<a target=\"_blank\" href=https://colab.research.google.com/notebooks/data_table.ipynb>data table notebook</a>'\n",
" + ' to learn more about interactive tables.';\n",
" element.innerHTML = '';\n",
" dataTable['output_type'] = 'display_data';\n",
" await google.colab.output.renderOutput(dataTable, element);\n",
" const docLink = document.createElement('div');\n",
" docLink.innerHTML = docLinkHtml;\n",
" element.appendChild(docLink);\n",
" }\n",
" </script>\n",
" </div>\n",
" </div>\n",
" "
],
"text/plain": [
" Unnamed: 0 ... tidy_tweet\n",
"0 0 ... >>22994apóio o >>22995. um passo de cada vez. ...\n",
"1 1 ... eu ainda vou surtar com essa fic\n",
"2 10 ... kkkkkkkkkkkk amei e fiquei com vontade desse ...\n",
"3 100 ... >debater com luquistajá cansei de bater palma ...\n",
"4 1000 ... hoje eu só tô minha irmã ali kkkkkkkkkkkkkk \n",
"5 1001 ... >>58447>ela me chamou pra dormir na casa dela ...\n",
"6 1002 ... ,,a co delas ve volnem case?’’\\nja: \n",
"7 1003 ... obrigado ❤😎\n",
"8 1004 ... >>58025seu psiquiatra parece ter errado. muito...\n",
"9 1005 ... bora assistir!\n",
"\n",
"[10 rows x 6 columns]"
]
},
"metadata": {},
"execution_count": 6
}
]
},
{
"cell_type": "markdown",
"source": [
"#### 1.1.4. Removing URLs\n",
"\n",
"The same applies to URLs, so we remove them from the text as well."
],
"metadata": {
"id": "KbQen15T3Zn-"
}
},
{
"cell_type": "code",
"source": [
"def remove_links(tweet):\n",
"    # Drop every run of non-whitespace characters that starts with \"http\".\n",
"    tweet_no_link = re.sub(r\"http\\S+\", \"\", tweet)\n",
"    return tweet_no_link\n",
"\n",
"df['tidy_tweet'] = np.vectorize(remove_links)(df['tidy_tweet'])\n",
"df.head()"
],
"metadata": {
"id": "-g6mvRlv3c4s",
"colab": {
"base_uri": "https://localhost:8080/",
"height": 354
},
"outputId": "a16aa230-e360-4ce5-e45c-b8d09f2a06f5"
},
"execution_count": null,
"outputs": [
{
"output_type": "execute_result",
"data": {
"text/html": [
"\n",
" <div id=\"df-e88f13a1-54d1-4187-a1e0-88c5f7986bbb\">\n",
" <div class=\"colab-df-container\">\n",
" <div>\n",
"<style scoped>\n",
" .dataframe tbody tr th:only-of-type {\n",
" vertical-align: middle;\n",
" }\n",
"\n",
" .dataframe tbody tr th {\n",
" vertical-align: top;\n",
" }\n",
"\n",
" .dataframe thead th {\n",
" text-align: right;\n",
" }\n",
"</style>\n",
"<table border=\"1\" class=\"dataframe\">\n",
" <thead>\n",
" <tr style=\"text-align: right;\">\n",
" <th></th>\n",
" <th>Unnamed: 0</th>\n",
" <th>docno</th>\n",
" <th>has_anger</th>\n",
" <th>origin</th>\n",
" <th>txt</th>\n",
" <th>tidy_tweet</th>\n",
" </tr>\n",
" </thead>\n",
" <tbody>\n",
" <tr>\n",
" <th>0</th>\n",
" <td>0</td>\n",
" <td>23310</td>\n",
" <td>1</td>\n",
" <td>55chan/pol</td>\n",
" <td>&gt;&gt;22994apóio o &gt;&gt;22995. um passo de cada vez. ...</td>\n",
" <td>&gt;&gt;22994apóio o &gt;&gt;22995. um passo de cada vez. ...</td>\n",
" </tr>\n",
" <tr>\n",
" <th>1</th>\n",
" <td>1</td>\n",
" <td>1141756984838054016</td>\n",
" <td>0</td>\n",
" <td>twitter</td>\n",
" <td>eu ainda vou surtar com essa fic</td>\n",
" <td>eu ainda vou surtar com essa fic</td>\n",
" </tr>\n",
" <tr>\n",
" <th>2</th>\n",
" <td>10</td>\n",
" <td>1141783320881241984</td>\n",
" <td>0</td>\n",
" <td>twitter</td>\n",
" <td>@flopdani kkkkkkkkkkkk amei e fiquei com vonta...</td>\n",
" <td>kkkkkkkkkkkk amei e fiquei com vontade desse ...</td>\n",
" </tr>\n",
" <tr>\n",
" <th>3</th>\n",
" <td>100</td>\n",
" <td>60492</td>\n",
" <td>1</td>\n",
" <td>55chan/pol</td>\n",
" <td>&gt;debater com luquistajá cansei de bater palma ...</td>\n",
" <td>&gt;debater com luquistajá cansei de bater palma ...</td>\n",
" </tr>\n",
" <tr>\n",
" <th>4</th>\n",
" <td>1000</td>\n",
" <td>1137392542050393984</td>\n",
" <td>0</td>\n",
" <td>twitter</td>\n",
" <td>hoje eu só tô minha irmã ali kkkkkkkkkkkkkk</td>\n",
" <td>hoje eu só tô minha irmã ali kkkkkkkkkkkkkk</td>\n",
" </tr>\n",
" </tbody>\n",
"</table>\n",
"</div>\n",
" <button class=\"colab-df-convert\" onclick=\"convertToInteractive('df-e88f13a1-54d1-4187-a1e0-88c5f7986bbb')\"\n",
" title=\"Convert this dataframe to an interactive table.\"\n",
" style=\"display:none;\">\n",
" \n",
" <svg xmlns=\"http://www.w3.org/2000/svg\" height=\"24px\"viewBox=\"0 0 24 24\"\n",
" width=\"24px\">\n",
" <path d=\"M0 0h24v24H0V0z\" fill=\"none\"/>\n",
" <path d=\"M18.56 5.44l.94 2.06.94-2.06 2.06-.94-2.06-.94-.94-2.06-.94 2.06-2.06.94zm-11 1L8.5 8.5l.94-2.06 2.06-.94-2.06-.94L8.5 2.5l-.94 2.06-2.06.94zm10 10l.94 2.06.94-2.06 2.06-.94-2.06-.94-.94-2.06-.94 2.06-2.06.94z\"/><path d=\"M17.41 7.96l-1.37-1.37c-.4-.4-.92-.59-1.43-.59-.52 0-1.04.2-1.43.59L10.3 9.45l-7.72 7.72c-.78.78-.78 2.05 0 2.83L4 21.41c.39.39.9.59 1.41.59.51 0 1.02-.2 1.41-.59l7.78-7.78 2.81-2.81c.8-.78.8-2.07 0-2.86zM5.41 20L4 18.59l7.72-7.72 1.47 1.35L5.41 20z\"/>\n",
" </svg>\n",
" </button>\n",
" \n",
" <style>\n",
" .colab-df-container {\n",
" display:flex;\n",
" flex-wrap:wrap;\n",
" gap: 12px;\n",
" }\n",
"\n",
" .colab-df-convert {\n",
" background-color: #E8F0FE;\n",
" border: none;\n",
" border-radius: 50%;\n",
" cursor: pointer;\n",
" display: none;\n",
" fill: #1967D2;\n",
" height: 32px;\n",
" padding: 0 0 0 0;\n",
" width: 32px;\n",
" }\n",
"\n",
" .colab-df-convert:hover {\n",
" background-color: #E2EBFA;\n",
" box-shadow: 0px 1px 2px rgba(60, 64, 67, 0.3), 0px 1px 3px 1px rgba(60, 64, 67, 0.15);\n",
" fill: #174EA6;\n",
" }\n",
"\n",
" [theme=dark] .colab-df-convert {\n",
" background-color: #3B4455;\n",
" fill: #D2E3FC;\n",
" }\n",
"\n",
" [theme=dark] .colab-df-convert:hover {\n",
" background-color: #434B5C;\n",
" box-shadow: 0px 1px 3px 1px rgba(0, 0, 0, 0.15);\n",
" filter: drop-shadow(0px 1px 2px rgba(0, 0, 0, 0.3));\n",
" fill: #FFFFFF;\n",
" }\n",
" </style>\n",
"\n",
" <script>\n",
" const buttonEl =\n",
" document.querySelector('#df-e88f13a1-54d1-4187-a1e0-88c5f7986bbb button.colab-df-convert');\n",
" buttonEl.style.display =\n",
" google.colab.kernel.accessAllowed ? 'block' : 'none';\n",
"\n",
" async function convertToInteractive(key) {\n",
" const element = document.querySelector('#df-e88f13a1-54d1-4187-a1e0-88c5f7986bbb');\n",
" const dataTable =\n",
" await google.colab.kernel.invokeFunction('convertToInteractive',\n",
" [key], {});\n",
" if (!dataTable) return;\n",
"\n",
" const docLinkHtml = 'Like what you see? Visit the ' +\n",
" '<a target=\"_blank\" href=https://colab.research.google.com/notebooks/data_table.ipynb>data table notebook</a>'\n",
" + ' to learn more about interactive tables.';\n",
" element.innerHTML = '';\n",
" dataTable['output_type'] = 'display_data';\n",
" await google.colab.output.renderOutput(dataTable, element);\n",
" const docLink = document.createElement('div');\n",
" docLink.innerHTML = docLinkHtml;\n",
" element.appendChild(docLink);\n",
" }\n",
" </script>\n",
" </div>\n",
" </div>\n",
" "
],
"text/plain": [
" Unnamed: 0 ... tidy_tweet\n",
"0 0 ... >>22994apóio o >>22995. um passo de cada vez. ...\n",
"1 1 ... eu ainda vou surtar com essa fic\n",
"2 10 ... kkkkkkkkkkkk amei e fiquei com vontade desse ...\n",
"3 100 ... >debater com luquistajá cansei de bater palma ...\n",
"4 1000 ... hoje eu só tô minha irmã ali kkkkkkkkkkkkkk \n",
"\n",
"[5 rows x 6 columns]"
]
},
"metadata": {},
"execution_count": 7
}
]
},
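{
"cell_type": "markdown",
"source": [
"To make the effect of the pattern concrete, here is a minimal standalone sketch of the same substitution applied to a few made-up strings (the sample sentences are assumptions for illustration, not rows from the dataset). Note that `http\\S+` only matches links that start with `http`, so a bare `www.` address is left untouched.\n",
"\n",
"```python\n",
"import re\n",
"\n",
"def remove_links(text):\n",
"    # Drop every run of non-whitespace characters that starts with 'http'.\n",
"    return re.sub(r'http\\S+', '', text)\n",
"\n",
"# Hypothetical sample strings, used only for illustration.\n",
"samples = [\n",
"    'see this https://example.com/abc now',\n",
"    'no link here',\n",
"    'www.example.com stays as it is',\n",
"]\n",
"cleaned = [remove_links(s) for s in samples]\n",
"print(cleaned)\n",
"```\n",
"\n",
"The removed link leaves its surrounding spaces behind, which is harmless here because word-level tokenization ignores repeated whitespace."
],
"metadata": {}
},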
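{
"cell_type": "markdown",
"source": [
"#### 1.1.5. Removing punctuation and digits\n",
"\n",
"We replace every character that is neither a word character nor whitespace with a space, and then do the same for digits."
],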
"metadata": {
"id": "R9285Rba3sq9"
}
},
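{
"cell_type": "code",
"source": [
"# regex=True is required: recent pandas versions treat the pattern as a literal string by default.\n",
"df['tidy_tweet'] = df['tidy_tweet'].str.replace(r\"[^\\w\\s]\", \" \", regex=True)  # punctuation and symbols\n",
"df['tidy_tweet'] = df['tidy_tweet'].str.replace(r\"[0-9]\", \" \", regex=True)  # digits\n",
"df.head()"
],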
"metadata": {
"id": "Ub6jxEJO3xWF",
"colab": {
"base_uri": "https://localhost:8080/",
"height": 354
},
"outputId": "86698932-2f00-4e91-a99b-108a8dd1fa1b"
},
"execution_count": null,
"outputs": [
{
"output_type": "execute_result",
"data": {
"text/html": [
"\n",
" <div id=\"df-0e2ad563-e120-4977-a890-76e9342c544f\">\n",
" <div class=\"colab-df-container\">\n",
" <div>\n",
"<style scoped>\n",
" .dataframe tbody tr th:only-of-type {\n",
" vertical-align: middle;\n",
" }\n",
"\n",
" .dataframe tbody tr th {\n",
" vertical-align: top;\n",
" }\n",
"\n",
" .dataframe thead th {\n",
" text-align: right;\n",
" }\n",
"</style>\n",
"<table border=\"1\" class=\"dataframe\">\n",
" <thead>\n",
" <tr style=\"text-align: right;\">\n",
" <th></th>\n",
" <th>Unnamed: 0</th>\n",
" <th>docno</th>\n",
" <th>has_anger</th>\n",
" <th>origin</th>\n",
" <th>txt</th>\n",
" <th>tidy_tweet</th>\n",
" </tr>\n",
" </thead>\n",
" <tbody>\n",
" <tr>\n",
" <th>0</th>\n",
" <td>0</td>\n",
" <td>23310</td>\n",
" <td>1</td>\n",
" <td>55chan/pol</td>\n",
" <td>&gt;&gt;22994apóio o &gt;&gt;22995. um passo de cada vez. ...</td>\n",
" <td>apóio o um passo de cada vez ...</td>\n",
" </tr>\n",
" <tr>\n",
" <th>1</th>\n",
" <td>1</td>\n",
" <td>1141756984838054016</td>\n",
" <td>0</td>\n",
" <td>twitter</td>\n",
" <td>eu ainda vou surtar com essa fic</td>\n",
" <td>eu ainda vou surtar com essa fic</td>\n",
" </tr>\n",
" <tr>\n",
" <th>2</th>\n",
" <td>10</td>\n",
" <td>1141783320881241984</td>\n",
" <td>0</td>\n",
" <td>twitter</td>\n",
" <td>@flopdani kkkkkkkkkkkk amei e fiquei com vonta...</td>\n",
" <td>kkkkkkkkkkkk amei e fiquei com vontade desse ...</td>\n",
" </tr>\n",
" <tr>\n",
" <th>3</th>\n",
" <td>100</td>\n",
" <td>60492</td>\n",
" <td>1</td>\n",
" <td>55chan/pol</td>\n",
" <td>&gt;debater com luquistajá cansei de bater palma ...</td>\n",
" <td>debater com luquistajá cansei de bater palma ...</td>\n",
" </tr>\n",
" <tr>\n",
" <th>4</th>\n",
" <td>1000</td>\n",
" <td>1137392542050393984</td>\n",
" <td>0</td>\n",
" <td>twitter</td>\n",
" <td>hoje eu só tô minha irmã ali kkkkkkkkkkkkkk</td>\n",
" <td>hoje eu só tô minha irmã ali kkkkkkkkkkkkkk</td>\n",
" </tr>\n",
" </tbody>\n",
"</table>\n",
"</div>\n",
" <button class=\"colab-df-convert\" onclick=\"convertToInteractive('df-0e2ad563-e120-4977-a890-76e9342c544f')\"\n",
" title=\"Convert this dataframe to an interactive table.\"\n",
" style=\"display:none;\">\n",
" \n",
" <svg xmlns=\"http://www.w3.org/2000/svg\" height=\"24px\"viewBox=\"0 0 24 24\"\n",
" width=\"24px\">\n",
" <path d=\"M0 0h24v24H0V0z\" fill=\"none\"/>\n",
" <path d=\"M18.56 5.44l.94 2.06.94-2.06 2.06-.94-2.06-.94-.94-2.06-.94 2.06-2.06.94zm-11 1L8.5 8.5l.94-2.06 2.06-.94-2.06-.94L8.5 2.5l-.94 2.06-2.06.94zm10 10l.94 2.06.94-2.06 2.06-.94-2.06-.94-.94-2.06-.94 2.06-2.06.94z\"/><path d=\"M17.41 7.96l-1.37-1.37c-.4-.4-.92-.59-1.43-.59-.52 0-1.04.2-1.43.59L10.3 9.45l-7.72 7.72c-.78.78-.78 2.05 0 2.83L4 21.41c.39.39.9.59 1.41.59.51 0 1.02-.2 1.41-.59l7.78-7.78 2.81-2.81c.8-.78.8-2.07 0-2.86zM5.41 20L4 18.59l7.72-7.72 1.47 1.35L5.41 20z\"/>\n",
" </svg>\n",
" </button>\n",
" \n",
" <style>\n",
" .colab-df-container {\n",
" display:flex;\n",
" flex-wrap:wrap;\n",
" gap: 12px;\n",
" }\n",
"\n",
" .colab-df-convert {\n",
" background-color: #E8F0FE;\n",
" border: none;\n",
" border-radius: 50%;\n",
" cursor: pointer;\n",
" display: none;\n",
" fill: #1967D2;\n",
" height: 32px;\n",
" padding: 0 0 0 0;\n",
" width: 32px;\n",
" }\n",
"\n",
" .colab-df-convert:hover {\n",
" background-color: #E2EBFA;\n",
" box-shadow: 0px 1px 2px rgba(60, 64, 67, 0.3), 0px 1px 3px 1px rgba(60, 64, 67, 0.15);\n",
" fill: #174EA6;\n",
" }\n",
"\n",
" [theme=dark] .colab-df-convert {\n",
" background-color: #3B4455;\n",
" fill: #D2E3FC;\n",
" }\n",
"\n",
" [theme=dark] .colab-df-convert:hover {\n",
" background-color: #434B5C;\n",
" box-shadow: 0px 1px 3px 1px rgba(0, 0, 0, 0.15);\n",
" filter: drop-shadow(0px 1px 2px rgba(0, 0, 0, 0.3));\n",
" fill: #FFFFFF;\n",
" }\n",
" </style>\n",
"\n",
" <script>\n",
" const buttonEl =\n",
" document.querySelector('#df-0e2ad563-e120-4977-a890-76e9342c544f button.colab-df-convert');\n",
" buttonEl.style.display =\n",
" google.colab.kernel.accessAllowed ? 'block' : 'none';\n",
"\n",
" async function convertToInteractive(key) {\n",
" const element = document.querySelector('#df-0e2ad563-e120-4977-a890-76e9342c544f');\n",
" const dataTable =\n",
" await google.colab.kernel.invokeFunction('convertToInteractive',\n",
" [key], {});\n",
" if (!dataTable) return;\n",
"\n",
" const docLinkHtml = 'Like what you see? Visit the ' +\n",
" '<a target=\"_blank\" href=https://colab.research.google.com/notebooks/data_table.ipynb>data table notebook</a>'\n",
" + ' to learn more about interactive tables.';\n",
" element.innerHTML = '';\n",
" dataTable['output_type'] = 'display_data';\n",
" await google.colab.output.renderOutput(dataTable, element);\n",
" const docLink = document.createElement('div');\n",
" docLink.innerHTML = docLinkHtml;\n",
" element.appendChild(docLink);\n",
" }\n",
" </script>\n",
" </div>\n",
" </div>\n",
" "
],
"text/plain": [
" Unnamed: 0 ... tidy_tweet\n",
"0 0 ... apóio o um passo de cada vez ...\n",
"1 1 ... eu ainda vou surtar com essa fic\n",
"2 10 ... kkkkkkkkkkkk amei e fiquei com vontade desse ...\n",
"3 100 ... debater com luquistajá cansei de bater palma ...\n",
"4 1000 ... hoje eu só tô minha irmã ali kkkkkkkkkkkkkk \n",
"\n",
"[5 rows x 6 columns]"
]
},
"metadata": {},
"execution_count": 8
}
]
},
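{
"cell_type": "markdown",
"source": [
"A standalone sketch of the two substitutions, using Python's `re` module on made-up strings instead of the pandas column (the sample strings and the helper name `strip_punct_and_digits` are assumptions for illustration). It highlights two details that are easy to miss: `\\w` matches accented letters and the underscore, so both survive, and every removed character becomes a space, so the length of the text does not change.\n",
"\n",
"```python\n",
"import re\n",
"\n",
"def strip_punct_and_digits(text):\n",
"    # Replace anything that is not a word character or whitespace with a space.\n",
"    text = re.sub(r'[^\\w\\s]', ' ', text)\n",
"    # Replace the ASCII digits 0-9 with a space.\n",
"    return re.sub(r'[0-9]', ' ', text)\n",
"\n",
"# Hypothetical sample strings, used only for illustration.\n",
"examples = ['>>22994 apóio o plano, ok?', 'user_name 100% certo!']\n",
"cleaned = [strip_punct_and_digits(t) for t in examples]\n",
"print(cleaned)\n",
"```\n",
"\n",
"Replacing with a space rather than an empty string keeps neighbouring words from being glued together when a punctuation mark was the only thing separating them."
],
"metadata": {}
},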
{
"cell_type": "markdown",
"source": [
"#### 1.1.6. Tokenization\n",
"\n",
"As discussed above, a token is a unit of language. In this tutorial, we treat a token as the most basic unit of language.\n",
"\n",
"We will use two forms of tokenization. The first splits each document into tokens, where every token is a whole word. We use this first form only as a preprocessing step for cleaning the data."
],
"metadata": {
"id": "gIqDUy_MGBXD"
}
},
{
"cell_type": "code",
"source": [
"import nltk\n",
"nltk.download('punkt')"
],
"metadata": {
"colab": {
"base_uri": "https://localhost:8080/"
},
"id": "QDHMfoLbHjME",
"outputId": "b370b5b4-6517-4a4b-b59a-2be256c51476"
},
"execution_count": null,
"outputs": [
{
"output_type": "stream",
"name": "stdout",
"text": [
"[nltk_data] Downloading package punkt to /root/nltk_data...\n",
"[nltk_data] Unzipping tokenizers/punkt.zip.\n"
]
},
{
"output_type": "execute_result",
"data": {
"text/plain": [
"True"
]
},
"metadata": {},
"execution_count": 11
}
]
},
{
"cell_type": "code",
"source": [
"from nltk.tokenize import word_tokenize\n",
"\n",
"def tokenize_function(items):\n",
"    # Lazily yield one list of word-level tokens per document.\n",
"    for item in items:\n",
"        yield word_tokenize(item)\n",
"\n",
"df['tidy_tweet_tokens'] = list(tokenize_function(df['tidy_tweet']))\n",
"\n",
"df.head()"
],
"metadata": {
"colab": {
"base_uri": "https://localhost:8080/",
"height": 459
},
"id": "OhSox_HsHPkD",
"outputId": "7cd2cf1b-ae78-4a0b-92f8-18159e33c026"
},
"execution_count": null,
"outputs": [
{
"output_type": "execute_result",
"data": {
"text/html": [
"\n",
" <div id=\"df-b0a4cb6c-23a4-4b61-ab8e-8a8a34100395\">\n",
" <div class=\"colab-df-container\">\n",
" <div>\n",
"<style scoped>\n",
" .dataframe tbody tr th:only-of-type {\n",
" vertical-align: middle;\n",
" }\n",
"\n",
" .dataframe tbody tr th {\n",
" vertical-align: top;\n",
" }\n",
"\n",
" .dataframe thead th {\n",
" text-align: right;\n",
" }\n",
"</style>\n",
"<table border=\"1\" class=\"dataframe\">\n",
" <thead>\n",
" <tr style=\"text-align: right;\">\n",
" <th></th>\n",
" <th>Unnamed: 0</th>\n",
" <th>docno</th>\n",
" <th>has_anger</th>\n",
" <th>origin</th>\n",
" <th>txt</th>\n",
" <th>tidy_tweet</th>\n",
" <th>tidy_tweet_tokens</th>\n",
" </tr>\n",
" </thead>\n",
" <tbody>\n",
" <tr>\n",
" <th>0</th>\n",
" <td>0</td>\n",
" <td>23310</td>\n",
" <td>1</td>\n",
" <td>55chan/pol</td>\n",
" <td>&gt;&gt;22994apóio o &gt;&gt;22995. um passo de cada vez. ...</td>\n",
" <td>apóio o um passo de cada vez ...</td>\n",
" <td>[apóio, o, um, passo, de, cada, vez, não, tenh...</td>\n",
" </tr>\n",
" <tr>\n",
" <th>1</th>\n",
" <td>1</td>\n",
" <td>1141756984838054016</td>\n",
" <td>0</td>\n",
" <td>twitter</td>\n",
" <td>eu ainda vou surtar com essa fic</td>\n",
" <td>eu ainda vou surtar com essa fic</td>\n",
" <td>[eu, ainda, vou, surtar, com, essa, fic]</td>\n",
" </tr>\n",
" <tr>\n",
" <th>2</th>\n",
" <td>10</td>\n",
" <td>1141783320881241984</td>\n",
" <td>0</td>\n",
" <td>twitter</td>\n",
" <td>@flopdani kkkkkkkkkkkk amei e fiquei com vonta...</td>\n",
" <td>kkkkkkkkkkkk amei e fiquei com vontade desse ...</td>\n",
" <td>[kkkkkkkkkkkk, amei, e, fiquei, com, vontade, ...</td>\n",
" </tr>\n",
" <tr>\n",
" <th>3</th>\n",
" <td>100</td>\n",
" <td>60492</td>\n",
" <td>1</td>\n",
" <td>55chan/pol</td>\n",
" <td>&gt;debater com luquistajá cansei de bater palma ...</td>\n",
" <td>debater com luquistajá cansei de bater palma ...</td>\n",
" <td>[debater, com, luquistajá, cansei, de, bater, ...</td>\n",
" </tr>\n",
" <tr>\n",
" <th>4</th>\n",
" <td>1000</td>\n",
" <td>1137392542050393984</td>\n",
" <td>0</td>\n",
" <td>twitter</td>\n",
" <td>hoje eu só tô minha irmã ali kkkkkkkkkkkkkk</td>\n",
" <td>hoje eu só tô minha irmã ali kkkkkkkkkkkkkk</td>\n",
" <td>[hoje, eu, só, tô, minha, irmã, ali, kkkkkkkkk...</td>\n",
" </tr>\n",
" </tbody>\n",
"</table>\n",
"</div>\n",
" <button class=\"colab-df-convert\" onclick=\"convertToInteractive('df-b0a4cb6c-23a4-4b61-ab8e-8a8a34100395')\"\n",
" title=\"Convert this dataframe to an interactive table.\"\n",
" style=\"display:none;\">\n",
" \n",
" <svg xmlns=\"http://www.w3.org/2000/svg\" height=\"24px\"viewBox=\"0 0 24 24\"\n",
" width=\"24px\">\n",
" <path d=\"M0 0h24v24H0V0z\" fill=\"none\"/>\n",
" <path d=\"M18.56 5.44l.94 2.06.94-2.06 2.06-.94-2.06-.94-.94-2.06-.94 2.06-2.06.94zm-11 1L8.5 8.5l.94-2.06 2.06-.94-2.06-.94L8.5 2.5l-.94 2.06-2.06.94zm10 10l.94 2.06.94-2.06 2.06-.94-2.06-.94-.94-2.06-.94 2.06-2.06.94z\"/><path d=\"M17.41 7.96l-1.37-1.37c-.4-.4-.92-.59-1.43-.59-.52 0-1.04.2-1.43.59L10.3 9.45l-7.72 7.72c-.78.78-.78 2.05 0 2.83L4 21.41c.39.39.9.59 1.41.59.51 0 1.02-.2 1.41-.59l7.78-7.78 2.81-2.81c.8-.78.8-2.07 0-2.86zM5.41 20L4 18.59l7.72-7.72 1.47 1.35L5.41 20z\"/>\n",
" </svg>\n",
" </button>\n",
" \n",
" <style>\n",
" .colab-df-container {\n",
" display:flex;\n",
" flex-wrap:wrap;\n",
" gap: 12px;\n",
" }\n",
"\n",
" .colab-df-convert {\n",
" background-color: #E8F0FE;\n",
" border: none;\n",
" border-radius: 50%;\n",
" cursor: pointer;\n",
" display: none;\n",
" fill: #1967D2;\n",
" height: 32px;\n",
" padding: 0 0 0 0;\n",
" width: 32px;\n",
" }\n",
"\n",
" .colab-df-convert:hover {\n",
" background-color: #E2EBFA;\n",
" box-shadow: 0px 1px 2px rgba(60, 64, 67, 0.3), 0px 1px 3px 1px rgba(60, 64, 67, 0.15);\n",
" fill: #174EA6;\n",
" }\n",
"\n",
" [theme=dark] .colab-df-convert {\n",
" background-color: #3B4455;\n",
" fill: #D2E3FC;\n",
" }\n",
"\n",
" [theme=dark] .colab-df-convert:hover {\n",
" background-color: #434B5C;\n",
" box-shadow: 0px 1px 3px 1px rgba(0, 0, 0, 0.15);\n",
" filter: drop-shadow(0px 1px 2px rgba(0, 0, 0, 0.3));\n",
" fill: #FFFFFF;\n",
" }\n",
" </style>\n",
"\n",
" <script>\n",
" const buttonEl =\n",
" document.querySelector('#df-b0a4cb6c-23a4-4b61-ab8e-8a8a34100395 button.colab-df-convert');\n",
" buttonEl.style.display =\n",
" google.colab.kernel.accessAllowed ? 'block' : 'none';\n",
"\n",
" async function convertToInteractive(key) {\n",
" const element = document.querySelector('#df-b0a4cb6c-23a4-4b61-ab8e-8a8a34100395');\n",
" const dataTable =\n",
" await google.colab.kernel.invokeFunction('convertToInteractive',\n",
" [key], {});\n",
" if (!dataTable) return;\n",
"\n",
" const docLinkHtml = 'Like what you see? Visit the ' +\n",
" '<a target=\"_blank\" href=https://colab.research.google.com/notebooks/data_table.ipynb>data table notebook</a>'\n",
" + ' to learn more about interactive tables.';\n",
" element.innerHTML = '';\n",
" dataTable['output_type'] = 'display_data';\n",
" await google.colab.output.renderOutput(dataTable, element);\n",
" const docLink = document.createElement('div');\n",
" docLink.innerHTML = docLinkHtml;\n",
" element.appendChild(docLink);\n",
" }\n",
" </script>\n",
" </div>\n",
" </div>\n",
" "
],
"text/plain": [
" Unnamed: 0 ... tidy_tweet_tokens\n",
"0 0 ... [apóio, o, um, passo, de, cada, vez, não, tenh...\n",
"1 1 ... [eu, ainda, vou, surtar, com, essa, fic]\n",
"2 10 ... [kkkkkkkkkkkk, amei, e, fiquei, com, vontade, ...\n",
"3 100 ... [debater, com, luquistajá, cansei, de, bater, ...\n",
"4 1000 ... [hoje, eu, só, tô, minha, irmã, ali, kkkkkkkkk...\n",
"\n",
"[5 rows x 7 columns]"
]
},
"metadata": {},
"execution_count": 13
}
]
},
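{
"cell_type": "markdown",
"source": [
"As a rough illustration of word-level tokenization, the sketch below simply splits text on whitespace. This is only a stand-in for `word_tokenize`, which applies more elaborate rules; on text that has already been stripped of punctuation the two will often agree, but that is not guaranteed. The helper name `simple_word_tokens` is hypothetical.\n",
"\n",
"```python\n",
"def simple_word_tokens(text):\n",
"    # str.split() with no arguments splits on any run of whitespace\n",
"    # and discards leading and trailing whitespace.\n",
"    return text.split()\n",
"\n",
"docs = ['eu ainda vou surtar com essa fic', '  bora   assistir ']\n",
"tokens = [simple_word_tokens(d) for d in docs]\n",
"print(tokens)\n",
"```\n",
"\n",
"For the first document, the result matches the token list shown for it in the table above."
],
"metadata": {}
},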
{
"cell_type": "markdown",
"source": [
"We will look at the other form of tokenization later on."
],
"metadata": {
"id": "g56zQ19QHwsD"
}
},
{
"cell_type": "markdown",
"source": [
"#### 1.1.7. Removing documents with fewer than 3 tokens\n",
"\n",
"Documents with fewer than 3 tokens generally carry too little information to support any meaningful analysis."
],
"metadata": {
"id": "a807ykRbHttY"
}
},
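{
"cell_type": "code",
"source": [
"# Count the tokens in each document and drop the documents with fewer than 3.\n",
"df['length'] = df['tidy_tweet_tokens'].apply(len)\n",
"df = df.drop(df[df['length'] < 3].index)\n",
"df = df.drop(['length'], axis=1)\n",
"# Renumber the rows so the index is contiguous again.\n",
"df.reset_index(drop=True, inplace=True)\n",
"df.info()"
],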
"metadata": {
"colab": {
"base_uri": "https://localhost:8080/"
},
"id": "T0tiM0oqH82G",
"outputId": "e8262db3-19c9-4fb8-ed54-c3191f35792d"
},
"execution_count": null,
"outputs": [
{
"output_type": "stream",
"name": "stdout",
"text": [
"<class 'pandas.core.frame.DataFrame'>\n",
"RangeIndex: 7299 entries, 0 to 7298\n",
"Data columns (total 7 columns):\n",
" # Column Non-Null Count Dtype \n",
"--- ------ -------------- ----- \n",
" 0 Unnamed: 0 7299 non-null int64 \n",
" 1 docno 7299 non-null int64 \n",
" 2 has_anger 7299 non-null int64 \n",
" 3 origin 7299 non-null object\n",
" 4 txt 7299 non-null object\n",
" 5 tidy_tweet 7299 non-null object\n",
" 6 tidy_tweet_tokens 7299 non-null object\n",
"dtypes: int64(3), object(4)\n",
"memory usage: 399.3+ KB\n"
]
}
]
},
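{
"cell_type": "markdown",
"source": [
"The same filter can be sketched without pandas on a plain list of token lists. The sample documents and the names `MIN_TOKENS` and `keep_long_enough` are assumptions for illustration.\n",
"\n",
"```python\n",
"MIN_TOKENS = 3  # documents with fewer tokens are dropped\n",
"\n",
"def keep_long_enough(token_lists, min_tokens=MIN_TOKENS):\n",
"    # Keep only the documents that have at least min_tokens tokens.\n",
"    return [tokens for tokens in token_lists if len(tokens) >= min_tokens]\n",
"\n",
"# Hypothetical tokenized documents, used only for illustration.\n",
"docs = [\n",
"    ['bora', 'assistir'],\n",
"    ['eu', 'ainda', 'vou', 'surtar', 'com', 'essa', 'fic'],\n",
"    ['obrigado'],\n",
"    ['um', 'dois', 'tres'],\n",
"]\n",
"kept = keep_long_enough(docs)\n",
"print(len(docs), '->', len(kept))\n",
"```\n",
"\n",
"A document with exactly 3 tokens is kept, since the rule only removes documents with fewer than 3."
],
"metadata": {}
},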
{
"cell_type": "markdown",
"source": [
"### 1.2. Data analysis\n",
"\n",
"For most binary classification problems, it is essential to analyze the dataset being used. This analysis can help identify factors that may introduce bias into the model."
],
"metadata": {
"id": "Z-ZHycAMIMRj"
}
},
{
"cell_type": "markdown",
"source": [
"#### 1.2.1. Class balance\n",
"\n",
"One of these factors is how evenly the data is distributed across the classes. When the data is imbalanced, the model may tend to favor one class over the other.\n",
"\n",
"As the plot below shows, our data is in fact imbalanced. In this dataset, that happened because we removed some documents in the steps above.\n"
],
"metadata": {
"id": "T7RW1eODIO7m"
}
},
{
"cell_type": "code",
"source": [
"ax = sns.countplot(x=\"has_anger\",data=df)"
],
"metadata": {
"colab": {
"base_uri": "https://localhost:8080/",
"height": 284
},
"id": "msi95FhkIXRV",
"outputId": "bc3aaffa-17a2-4dfc-b70f-3338d412c949"
},
"execution_count": null,
"outputs": [
{
"output_type": "display_data",
"data": {
"image/png": "iVBORw0KGgoAAAANSUhEUgAAAYsAAAELCAYAAAAoUKpTAAAABHNCSVQICAgIfAhkiAAAAAlwSFlzAAALEgAACxIB0t1+/AAAADh0RVh0U29mdHdhcmUAbWF0cGxvdGxpYiB2ZXJzaW9uMy4yLjIsIGh0dHA6Ly9tYXRwbG90bGliLm9yZy+WH4yJAAAVGElEQVR4nO3de5Be9X3f8ffH4ubYxEDYUqJLRRy1HkhrwBtw486UwBgEaSOc2A5MYsuUsZwpZOxOmhrSqXFw6NhjO9SkmIlcZMDjoii+FJVRQ1TAdd2Gi2SLi8AMW8BFGhkpCLCJazqi3/7x/NZ5LO3uWeE9uyv2/Zo5s+d8z++c57uM0Efn8pyTqkKSpKm8Zq4bkCTNf4aFJKmTYSFJ6mRYSJI6GRaSpE6GhSSpU+9hkWRRkm8lub0tn5Tk3iRjSf40yRGtfmRbHmvrlw/t48pWfyzJeX33LEn6cbNxZPFB4NGh5U8A11bVzwPPAZe2+qXAc61+bRtHkpOBi4BTgJXAZ5MsmoW+JUlNr2GRZAnwK8B/aMsBzga+1IbcDFzY5le1Zdr6c9r4VcD6qnqpqp4ExoAz+uxbkvTjDut5//8O+FfA0W35Z4Dnq2pfW94BLG7zi4GnAapqX5IX2vjFwD1D+xzeZkLHH398LV++fCb6l6QFY+vWrX9VVSMTrestLJL8E2B3VW1NclZfnzP0eWuANQDLli1jy5YtfX+kJL2qJPnOZOv6PA31NuBXkzwFrGdw+ukzwDFJxkNqCbCzze8ElgK09W8Anh2uT7DNj1TV2qoararRkZEJg1GS9Ar1FhZVdWVVLamq5QwuUN9VVb8J3A28sw1bDdzW5je2Zdr6u2rwlMONwEXtbqmTgBXAfX31LUk6UN/XLCbyYWB9kj8EvgXc2Oo3Al9IMgbsZRAwVNX2JBuAR4B9wGVV9fLsty1JC1dejY8oHx0dLa9ZSNLBSbK1qkYnWuc3uCVJnQwLSVInw0KS1MmwkCR1MiwkSZ3m4tZZST+h/33135/rFjQPLfvIQ73t2yMLSVInw0KS1MmwkCR1MiwkSZ0MC0lSJ8NCktTJsJAkdTIsJEmdDAtJUifDQpLUybCQJHUyLCRJnXoLiyRHJbkvyQNJtif5g1a/KcmTSba16dRWT5LrkowleTDJ6UP7Wp3k8Tat7qtnSdLE+nzq7EvA2VX1YpLDgW8k+S9t3e9V1Zf2G38+sKJNZwI3AGcmOQ64ChgFCtiaZGNVPddj75KkIb0dWdTAi23x8DbVFJusAm5p290DHJPkROA8YHNV7W0BsRlY2VffkqQD9XrNIsmiJNuA3Qz+wr+3rbqmnWq6NsmRrbYYeHpo8x2tNlldkjRLeg2Lqnq5qk4FlgBnJPkF4ErgTcAvAscBH56Jz0qyJsmWJFv27NkzE7uUJDWz8qa8qno+yd3Ayqr6VCu/lOTzwL9syzuBpUObLWm1ncBZ+9W/NsFnrAXWAoyOjk51umta3vJ7t/yku9Cr0NZPvneuW5DmRJ93Q40kOabNvxZ4O/Dtdh2CJAEuBB5um2wE3tvuinor8EJV7QLuAM5NcmySY4FzW02SNEv6PLI4Ebg5ySIGobShqm5PcleSESDANuC32/hNwAXAGPAD4BKAqtqb5GPA/W3c1VW1t8e+JUn76S0squpB4LQJ6mdPMr6AyyZZtw5YN6MNSpKmzW9wS5I6GRaSpE6GhSSpk2EhSepkWEiSOhkWkqROhoUkqZNhIUnqZFhIkjoZFpKkToaFJKmTYSFJ6mRYSJI6GRaSpE6GhSSpk2EhSepkWEiSOhkWkqROvYVFkqOS3JfkgSTbk/xBq5+U5N4kY0n+NMkRrX5kWx5r65cP7evKVn8syXl99SxJmlifRxYvAWdX1ZuBU4GVSd4KfAK4tqp+HngOuLSNvxR4rtWvbeNIcjJwEXAKsBL4bJJFPfYtSdpPb2FRAy+2xcPbVMDZwJda/W
bgwja/qi3T1p+TJK2+vqpeqqongTHgjL76liQdqNdrFkkWJdkG7AY2A/8LeL6q9rUhO4DFbX4x8DRAW/8C8DPD9Qm2kSTNgl7DoqperqpTgSUMjgbe1NdnJVmTZEuSLXv27OnrYyRpQZqVu6Gq6nngbuAfAsckOaytWgLsbPM7gaUAbf0bgGeH6xNsM/wZa6tqtKpGR0ZGevk9JGmh6vNuqJEkx7T51wJvBx5lEBrvbMNWA7e1+Y1tmbb+rqqqVr+o3S11ErACuK+vviVJBzqse8grdiJwc7tz6TXAhqq6PckjwPokfwh8C7ixjb8R+EKSMWAvgzugqKrtSTYAjwD7gMuq6uUe+5Yk7ae3sKiqB4HTJqg/wQR3M1XVD4F3TbKva4BrZrpHSdL0+A1uSVInw0KS1MmwkCR1MiwkSZ0MC0lSJ8NCktTJsJAkdTIsJEmdDAtJUifDQpLUybCQJHUyLCRJnQwLSVInw0KS1MmwkCR1MiwkSZ0MC0lSJ8NCktTJsJAkdeotLJIsTXJ3kkeSbE/ywVb/aJKdSba16YKhba5MMpbksSTnDdVXttpYkiv66lmSNLHDetz3PuB3q+qbSY4GtibZ3NZdW1WfGh6c5GTgIuAU4GeB/5rk77bV1wNvB3YA9yfZWFWP9Ni7JGlIb2FRVbuAXW3++0keBRZPsckqYH1VvQQ8mWQMOKOtG6uqJwCSrG9jDQtJmiWzcs0iyXLgNODeVro8yYNJ1iU5ttUWA08Pbbaj1Sar7/8Za5JsSbJlz549M/wbSNLC1ntYJHk98GXgQ1X1PeAG4I3AqQyOPD49E59TVWurarSqRkdGRmZil5Kkps9rFiQ5nEFQfLGqvgJQVc8Mrf8ccHtb3AksHdp8SasxRV2SNAv6vBsqwI3Ao1X1R0P1E4eGvQN4uM1vBC5KcmSSk4AVwH3A/cCKJCclOYLBRfCNffUtSTpQn0cWbwPeAzyUZFur/T5wcZJTgQKeAj4AUFXbk2xgcOF6H3BZVb0MkORy4A5gEbCuqrb32LckaT993g31DSATrNo0xTbXANdMUN801XaSpH75DW5JUifDQpLUybCQJHUyLCRJnQwLSVInw0KS1MmwkCR1MiwkSZ0MC0lSJ8NCktRpWmGR5M7p1CRJr05TPhsqyVHATwHHt5cUjT/r6aeZ+q13kqRXka4HCX4A+BCDd2Jv5W/C4nvAv++xL0nSPDJlWFTVZ4DPJPmdqvrjWepJkjTPTOsR5VX1x0l+CVg+vE1V3dJTX5KkeWRaYZHkCwzem70NeLmVCzAsJGkBmO7Lj0aBk6uq+mxGkjQ/Tfd7Fg8Df7vPRiRJ89d0w+J44JEkdyTZOD5NtUGSpUnuTvJIku1JPtjqxyXZnOTx9vPYVk+S65KMJXkwyelD+1rdxj+eZPUr/WUlSa/MdE9DffQV7Hsf8LtV9c0kRwNbk2wG3gfcWVUfT3IFcAXwYeB8YEWbzgRuAM5MchxwFYNTYdX2s7GqnnsFPUmSXoHp3g313w52x1W1C9jV5r+f5FEGX+RbBZzVht0MfI1BWKwCbmnXRe5JckySE9vYzVW1F6AFzkrg1oPtSZL0ykz3bqjvM/hXPcARwOHAX1fVT09z++XAacC9wAktSAC+C5zQ5hcDTw9ttqPVJqvv/xlrgDUAy5Ytm05bkqRpmu6RxdHj80nC4CjgrdPZNsnrgS8DH6qq7w02/9F+K8mM3GFVVWuBtQCjo6PetSVJM+ignzpbA/8JOK9rbJLDGQTFF6vqK638TDu9RPu5u9V3AkuHNl/SapPVJUmzZLqnoX5taPE1DC42/7BjmwA3Ao9W1R8NrdoIrAY+3n7eNlS/PMl6Bhe4X6iqXUnuAP7t+F1TwLnAldPpW5I0M6Z7N9Q/HZrfBzzF4FTUVN4GvAd4KMm2Vvt9BiGxIcmlwHeAd7d1m4ALgDHgB8AlAFW1N8nHgPvbuKvHL3ZLkmbHdK9ZXHKwO66qb/A3T6nd3zkTjC/gsk
n2tQ5Yd7A9SJJmxnRffrQkyVeT7G7Tl5Ms6bs5SdL8MN0L3J9ncE3hZ9v0n1tNkrQATDcsRqrq81W1r003ASM99iVJmkemGxbPJvmtJIva9FvAs302JkmaP6YbFv+MwV1L32XwCI93MnjGkyRpAZjurbNXA6vHH97XHu73KQYhIkl6lZvukcU/GH7Ka/uew2n9tCRJmm+mGxavGfoG9fiRxXSPSiRJh7jp/oX/aeAvk/xZW34XcE0/LUmS5pvpfoP7liRbgLNb6deq6pH+2pIkzSfTPpXUwsGAkKQF6KAfUS5JWngMC0lSJ8NCktTJsJAkdTIsJEmdDAtJUifDQpLUqbewSLKuvVXv4aHaR5PsTLKtTRcMrbsyyViSx5KcN1Rf2WpjSa7oq19J0uT6PLK4CVg5Qf3aqjq1TZsAkpwMXASc0rb57Pi7M4DrgfOBk4GL21hJ0izq7WGAVfX1JMunOXwVsL6qXgKeTDIGnNHWjVXVEwBJ1rexfpNckmbRXFyzuDzJg+001fiTbBcDTw+N2dFqk9UPkGRNki1JtuzZs6ePviVpwZrtsLgBeCNwKoM37n16pnZcVWurarSqRkdGfD24JM2kWX0nRVU9Mz6f5HPA7W1xJ7B0aOiSVmOKuiRplszqkUWSE4cW3wGM3ym1EbgoyZFJTgJWAPcB9wMrkpyU5AgGF8E3zmbPkqQejyyS3AqcBRyfZAdwFXBWklOBAp4CPgBQVduTbGBw4XofcFlVvdz2czlwB7AIWFdV2/vqWZI0sT7vhrp4gvKNU4y/hgnevtdur900g61Jkg6S3+CWJHUyLCRJnQwLSVInw0KS1MmwkCR1MiwkSZ0MC0lSJ8NCktTJsJAkdTIsJEmdDAtJUifDQpLUybCQJHUyLCRJnQwLSVInw0KS1MmwkCR1MiwkSZ16C4sk65LsTvLwUO24JJuTPN5+HtvqSXJdkrEkDyY5fWib1W3840lW99WvJGlyfR5Z3ASs3K92BXBnVa0A7mzLAOcDK9q0BrgBBuECXAWcCZwBXDUeMJKk2dNbWFTV14G9+5VXATe3+ZuBC4fqt9TAPcAxSU4EzgM2V9XeqnoO2MyBASRJ6tlsX7M4oap2tfnvAie0+cXA00PjdrTaZHVJ0iyaswvcVVVAzdT+kqxJsiXJlj179szUbiVJzH5YPNNOL9F+7m71ncDSoXFLWm2y+gGqam1VjVbV6MjIyIw3LkkL2WyHxUZg/I6m1cBtQ/X3trui3gq80E5X3QGcm+TYdmH73FaTJM2iw/racZJbgbOA45PsYHBX08eBDUkuBb4DvLsN3wRcAIwBPwAuAaiqvUk+Btzfxl1dVftfNJck9ay3sKiqiydZdc4EYwu4bJL9rAPWzWBrkqSD5De4JUmdDAtJUifDQpLUybCQJHUyLCRJnQwLSVInw0KS1MmwkCR1MiwkSZ0MC0lSJ8NCktTJsJAkdTIsJEmdDAtJUifDQpLUybCQJHUyLCRJnQwLSVKnOQmLJE8leSjJtiRbWu24JJuTPN5+HtvqSXJdkrEkDyY5fS56lqSFbC6PLH65qk6tqtG2fAVwZ1WtAO5sywDnAyvatAa4YdY7laQFbj6dhloF3NzmbwYuHKrfUgP3AMckOXEuGpSkhWquwqKAv0iyNcmaVjuhqna1+e8CJ7T5xcDTQ9vuaDVJ0iw5bI4+9x9V1c4kfwvYnOTbwyurqpLUweywhc4agGXLls1cp5KkuTmyqKqd7edu4KvAGcAz46eX2s/dbfhOYOnQ5ktabf99rq2q0aoaHRkZ6bN9SVpwZj0skrwuydHj88C5wMPARmB1G7YauK3NbwTe2+6KeivwwtDpKknSLJiL01AnAF9NMv75/7Gq/jzJ/cCGJJcC3wHe3cZvAi4AxoAfAJfMfsuStLDNelhU1RPAmyeoPwucM0G9gMtmoTVJ0iTm062zkqR5yrCQJHUyLCRJnQwLSVInw0KS1MmwkCR1MiwkSZ0MC0lSJ8NCkt
TJsJAkdTIsJEmdDAtJUifDQpLUybCQJHUyLCRJnQwLSVInw0KS1MmwkCR1OmTCIsnKJI8lGUtyxVz3I0kLySERFkkWAdcD5wMnAxcnOXluu5KkheOQCAvgDGCsqp6oqv8LrAdWzXFPkrRgHCphsRh4emh5R6tJkmbBYXPdwExJsgZY0xZfTPLYXPbzKnM88Fdz3cR8kE+tnusWdCD/fI67Kj/pHv7OZCsOlbDYCSwdWl7Saj9SVWuBtbPZ1EKRZEtVjc51H9JE/PM5Ow6V01D3AyuSnJTkCOAiYOMc9yRJC8YhcWRRVfuSXA7cASwC1lXV9jluS5IWjEMiLACqahOwaa77WKA8vaf5zD+fsyBVNdc9SJLmuUPlmoUkaQ4ZFpqSj1nRfJRkXZLdSR6e614WCsNCk/IxK5rHbgJWznUTC4lhoan4mBXNS1X1dWDvXPexkBgWmoqPWZEEGBaSpGkwLDSVzsesSFoYDAtNxcesSAIMC02hqvYB449ZeRTY4GNWNB8kuRX4S+DvJdmR5NK57unVzm9wS5I6eWQhSepkWEiSOhkWkqROhoUkqZNhIUnqZFhIkjoZFtKQJMt97LV0IMNCehVKcsi8MlmHBsNCOtCiJJ9Lsj3JXyR5bZL3J7k/yQNJvpzkpwCSvCvJw63+9cl22I5Y/nuSb7bpl1r9rCRfS/KlJN9O8sUkaesuaLWtSa5Lcnurv669/Oe+JN9KsqrV35dkY5K7gDt7/6+kBcWwkA60Ari+qk4Bngd+HfhKVf1iVb2ZwaNPxh8v8RHgvFb/1Sn2uRt4e1WdDvwGcN3QutOADzF4wdTPAW9LchTwJ8D5VfUWYGRo/L8G7qqqM4BfBj6Z5HVt3enAO6vqH7/C312akGEhHejJqtrW5rcCy4FfaEcGDwG/CZzS1v8P4KYk7wcWTbHPw4HPte3/jEEwjLuvqnZU1f8DtrXPexPwRFU92cbcOjT+XOCKJNuArwFHAcvaus1V5UuBNOM8rykd6KWh+ZeB1zJ4jeeFVfVAkvcBZwFU1W8nORP4FWBrkrdU1bMT7PNfAM8Ab2bwj7QfTvF5Xf9fBvj1qnrsx4qDPv66Y1vpFfHIQpqeo4FdSQ5ncGQBQJI3VtW9VfURYA8//v6PYW8AdrWjh/cw9VEIwGPAzyVZ3pZ/Y2jdHcDvDF3bOO0gfxfpoBkW0vT8G+BeBqedvj1U/2SSh9rttv8TeGCS7T8LrE7yAINTTFMeAVTV/wH+OfDnSbYC3wdeaKs/xuC01oNJtrdlqVc+olyap5K8vqpebEcQ1wOPV9W1c92XFiaPLKT56/3tIvZ2Bqex/mSO+9EC5pGFNIOSnAd8Yr/yk1X1jrnoR5ophoUkqZOnoSRJnQwLSVInw0KS1MmwkCR1MiwkSZ3+P75l1djL4q7HAAAAAElFTkSuQmCC\n",
"text/plain": [
"<Figure size 432x288 with 1 Axes>"
]
},
"metadata": {
"needs_background": "light"
}
}
]
},
{
"cell_type": "code",
"source": [
"df.has_anger.value_counts()"
],
"metadata": {
"colab": {
"base_uri": "https://localhost:8080/"
},
"id": "U3eHKFyNJ7YB",
"outputId": "cb903a5a-bff1-43d9-8bf2-744db4710015"
},
"execution_count": null,
"outputs": [
{
"output_type": "execute_result",
"data": {
"text/plain": [
"1 3815\n",
"0 3484\n",
"Name: has_anger, dtype: int64"
]
},
"metadata": {},
"execution_count": 18
}
]
},
{
"cell_type": "markdown",
"source": [
"Como vemos, a diferença entre as duas classes até que não é tão grande. Pouco mais de 300 tweets. No entanto, mesmo assim vamos remover esse problema. Vamos fazer isso removendo de forma aleatória a quantidade de documentos excedentes da classe majoriária."
],
"metadata": {
"id": "uyyqDvoWJ7lL"
}
},
{
"cell_type": "code",
"source": [
"excluir_n = len(df[df.has_anger == 1]) - len(df[df.has_anger == 0]) # Indica a quantidade que precisa ser removida\n",
"new_df = df.drop(df.query('has_anger == 1').sample(n=excluir_n).index) # Remove de forma aleatória.\n",
"new_df.info()"
],
"metadata": {
"colab": {
"base_uri": "https://localhost:8080/"
},
"id": "vaQKg7T1KHsp",
"outputId": "6c4965b4-9279-4e1e-ff3e-e950c6bf9a56"
},
"execution_count": null,
"outputs": [
{
"output_type": "stream",
"name": "stdout",
"text": [
"<class 'pandas.core.frame.DataFrame'>\n",
"Int64Index: 6968 entries, 0 to 7298\n",
"Data columns (total 7 columns):\n",
" # Column Non-Null Count Dtype \n",
"--- ------ -------------- ----- \n",
" 0 Unnamed: 0 6968 non-null int64 \n",
" 1 docno 6968 non-null int64 \n",
" 2 has_anger 6968 non-null int64 \n",
" 3 origin 6968 non-null object\n",
" 4 txt 6968 non-null object\n",
" 5 tidy_tweet 6968 non-null object\n",
" 6 tidy_tweet_tokens 6968 non-null object\n",
"dtypes: int64(3), object(4)\n",
"memory usage: 435.5+ KB\n"
]
}
]
},
{
"cell_type": "code",
"source": [
"ax = sns.countplot(x=\"has_anger\",data=new_df)"
],
"metadata": {
"colab": {
"base_uri": "https://localhost:8080/",
"height": 280
},
"id": "-S183QD4Kdyn",
"outputId": "29c2d9a5-d7ba-4dab-a0dd-ec6a358fae55"
},
"execution_count": null,
"outputs": [
{
"output_type": "display_data",
"data": {
"image/png": "iVBORw0KGgoAAAANSUhEUgAAAYsAAAEHCAYAAABfkmooAAAABHNCSVQICAgIfAhkiAAAAAlwSFlzAAALEgAACxIB0t1+/AAAADh0RVh0U29mdHdhcmUAbWF0cGxvdGxpYiB2ZXJzaW9uMy4yLjIsIGh0dHA6Ly9tYXRwbG90bGliLm9yZy+WH4yJAAATtElEQVR4nO3df5Bd5X3f8ffHAoxjOwGKSrEkKuyqzUBaBN4CjTtTgscg6DTCie1AGyNTxnKm0Ik7aaY4nRoHl5lkbIcJDWYiDzKQcU2Jf9Sqh4ao2KnrNgZJjvghMMMW7CKNQBuEsYlrOiLf/nEfxdfS7j5X9t7dFft+zZzZc7/Pc879LiP00flxz01VIUnSbF610A1IkhY/w0KS1GVYSJK6DAtJUpdhIUnqOmahGxiHk08+uVavXr3QbUjSUWXHjh1/XlXLpxt7RYbF6tWr2b59+0K3IUlHlSTfmmnM01CSpC7DQpLUZVhIkroMC0lSl2EhSeoaW1gkOT7JA0keTLIryW+2+u1Jnkqysy1rWz1Jbk4ymeShJOcM7WtDkifasmFcPUuSpjfOW2dfAi6sqheTHAt8Ncl/bWO/XlWfOWT+JcCatpwH3Aqcl+Qk4HpgAihgR5ItVfX8GHuXJA0Z25FFDbzYXh7bltmeh74euLNt9zXghCSnAhcDW6tqfwuIrcC6cfUtSTrcWK9ZJFmWZCewj8Ff+Pe3oRvbqaabkry61VYATw9tvrvVZqof+l4bk2xPsn1qamrOfxdJWsrG+gnuqnoZWJvkBODzSX4G+ADwDHAcsAn4N8ANc/Bem9r+mJiY+LG/0enNv37nj7sLvQLt+MiVC90CAP/nhr+70C1oETrtgw+Pbd/zcjdUVX0b+DKwrqr2tlNNLwGfBM5t0/YAq4Y2W9lqM9UlSfNknHdDLW9HFCR5DfA24BvtOgRJAlwGPNI22QJc2e6KOh94oar2AvcCFyU5McmJwEWtJkmaJ+M8DXUqcEeSZQxC6e6q+mKSLyVZDgTYCfxKm38PcCkwCXwPuAqgqvYn+TCwrc27oar2j7FvSdIhxhYWVfUQcPY09QtnmF/ANTOMbQY2z2mDkqSR+QluSVKXYSFJ6jIsJEldhoUkqcuwkCR1GRaSpC7DQpLUZVhIkroMC0lSl2EhSeoyLCRJXYaFJKnLsJAkdRkWkqQuw0KS1GVYSJK6DAtJUpdhIUnqMiwkSV2GhSSpa2xhkeT4JA8keTDJriS/2eqnJ7k/yWSS/5TkuFZ/dXs92cZXD+3rA63+eJKLx9WzJGl64zyyeAm4sKrOAtYC65KcD/w2cFNV/S3geeDqNv9q4PlWv6nNI8kZwOXAmcA64ONJlo2xb0nSIcYWFjXwYnt5bFsKuBD4TKvfAVzW1te317TxtyZJq99VVS9V1VPAJHDuuPqWJB1urNcskixLshPYB2wF/jfw7ao60KbsBla09RXA0wBt/AXgrw3Xp9lGkjQPxhoWVfVyVa0FVjI4Gvjpcb1Xko1JtifZPjU1Na63kaQlaV7uhqqqbwNfBv4BcEKSY9rQSmBPW98DrAJo4z8FPDdcn2ab4ffYVFUTVTWxfPnysfwekrRUjfNuqOVJTmjrrwHeBjzGIDTe0aZtAL7Q1re017TxL1VVtfrl7W6p04E1wAPj6luSdLhj+lN+ZKcCd7Q7l14F3F1VX0zyKHBXkn8P/BlwW5t/G/AHSSaB/QzugKKqdiW5G3gUOABcU1Uvj7FvSdIhxhYWVfUQcPY09SeZ5m6mqvo+8M4Z9nUjcONc9yhJGo2f4JYkdRkWkqQuw0KS1GVYSJK6DAtJUpdhIUnqMiwkSV2GhSSpy7CQJHUZFpKkLsNCktRlWEiSugwLSVKXYSFJ6jIsJEldhoUkqcuwkCR1GRaSpC7DQpLUZVhIkroMC0lS19jCIsmqJF9O8miSXUl+tdU/lGRPkp1tuXRomw8kmUzyeJKLh+rrWm
0yyXXj6lmSNL1jxrjvA8CvVdXXk7we2JFkaxu7qao+Ojw5yRnA5cCZwBuA/5bkb7fhW4C3AbuBbUm2VNWjY+xdkjRkbGFRVXuBvW39u0keA1bMssl64K6qegl4KskkcG4bm6yqJwGS3NXmGhaSNE/m5ZpFktXA2cD9rXRtkoeSbE5yYqutAJ4e2mx3q81UP/Q9NibZnmT71NTUHP8GkrS0jT0skrwO+Czw/qr6DnAr8CZgLYMjj4/NxftU1aaqmqiqieXLl8/FLiVJzTivWZDkWAZB8amq+hxAVT07NP4J4Ivt5R5g1dDmK1uNWeqSpHkwzruhAtwGPFZVvzNUP3Vo2tuBR9r6FuDyJK9OcjqwBngA2AasSXJ6kuMYXATfMq6+JUmHG+eRxVuAdwMPJ9nZar8BXJFkLVDAN4H3AVTVriR3M7hwfQC4pqpeBkhyLXAvsAzYXFW7xti3JOkQ47wb6qtAphm6Z5ZtbgRunKZ+z2zbSZLGy09wS5K6DAtJUpdhIUnqMiwkSV2GhSSpy7CQJHUZFpKkLsNCktRlWEiSugwLSVKXYSFJ6jIsJEldhoUkqcuwkCR1GRaSpC7DQpLUZVhIkroMC0lSl2EhSeoyLCRJXYaFJKlrbGGRZFWSLyd5NMmuJL/a6icl2ZrkifbzxFZPkpuTTCZ5KMk5Q/va0OY/kWTDuHqWJE1vnEcWB4Bfq6ozgPOBa5KcAVwH3FdVa4D72muAS4A1bdkI3AqDcAGuB84DzgWuPxgwkqT5MVJYJLlvlNqwqtpbVV9v698FHgNWAOuBO9q0O4DL2vp64M4a+BpwQpJTgYuBrVW1v6qeB7YC60bpW5I0N46ZbTDJ8cBPACe3f82nDf0kg7/4R5JkNXA2cD9wSlXtbUPPAKe09RXA00Ob7W61meqHvsdGBkcknHbaaaO2JkkawaxhAbwPeD/wBmAHPwiL7wC/N8obJHkd8Fng/VX1nSR/NVZVlaSOtOnpVNUmYBPAxMTEnOxTkjQw62moqvrdqjod+NdV9caqOr0tZ1VVNyySHMsgKD5VVZ9r5Wfb6SXaz32tvgdYNbT5ylabqS5JmicjXbOoqv+Q5GeT/NMkVx5cZtsmg0OI24DHqup3hoa2AAfvaNoAfGGofmW7K+p84IV2uupe4KIkJ7ZTYRe1miRpnvROQwGQ5A+ANwE7gZdbuYA7Z9nsLcC7gYeT7Gy13wB+C7g7ydXAt4B3tbF7gEuBSeB7wFUAVbU/yYeBbW3eDVW1f5S+JUlzY6SwACaAM6pq5GsBVfVVfnCN41BvnWZ+AdfMsK/NwOZR31uSNLdG/ZzFI8DfGGcjkqTFa9Qji5OBR5M8ALx0sFhVPz+WriRJi8qoYfGhcTYhSVrcRgqLqvrv425EkrR4jXo31HcZ3P0EcBxwLPAXVfWT42pMkrR4jHpk8fqD6+3zE+sZPBxQkrQEHPFTZ9uD/v4zgwf8SZKWgFFPQ/3C0MtXMfjcxffH0pEkadEZ9W6ofzK0fgD4JoNTUZKkJWDUaxZXjbsRSdLiNeqXH61M8vkk+9ry2SQrx92cJGlxGPUC9ycZPBX2DW35L60mSVoCRg2L5VX1yao60JbbgeVj7EuStIiMGhbPJfnlJMva8svAc+NsTJK0eIwaFv+cwfdOPAPsBd4BvGdMPUmSFplRb529AdhQVc8DJDkJ+CiDEJEkvcKNemTx9w4GBQy+vQ44ezwtSZIWm1HD4lXt+6+BvzqyGPWoRJJ0lBv1L/yPAX+a5A/b63cCN46nJUnSYjPqJ7jvTLIduLCVfqGqHh1fW5KkxWTkp85W1aNV9Xtt6QZFks3t096PDNU+lGRPkp1tuXRo7ANJJpM8nuTiofq6VptMct2R/HKSpLlxxI8oPwK3A+umqd9UVWvbcg9AkjOAy4Ez2zYfP/iZDuAW4BLgDOCKNleSNI/GdpG6qr6SZPWI09cDd1XVS8BTSSaBc9vYZFU9CZDkrjbXU2CSNI/GeWQxk2uTPN
ROUx28w2oF8PTQnN2tNlNdkjSP5jssbgXeBKxl8Enwj83VjpNsTLI9yfapqam52q0kiXkOi6p6tqperqq/BD7BD0417QFWDU1d2Woz1afb96aqmqiqieXLfcahJM2leQ2LJKcOvXw7cPBOqS3A5UleneR0YA3wALANWJPk9CTHMbgIvmU+e5YkjfECd5JPAxcAJyfZDVwPXJBkLVAMvpr1fQBVtSvJ3QwuXB8Arqmql9t+rgXuBZYBm6tq17h6liRNb5x3Q10xTfm2WebfyDSfCm+3194zh61Jko7QQtwNJUk6yhgWkqQuw0KS1GVYSJK6DAtJUpdhIUnqMiwkSV2GhSSpy7CQJHUZFpKkLsNCktRlWEiSugwLSVKXYSFJ6jIsJEldhoUkqcuwkCR1GRaSpC7DQpLUZVhIkroMC0lS19jCIsnmJPuSPDJUOynJ1iRPtJ8ntnqS3JxkMslDSc4Z2mZDm/9Ekg3j6leSNLNxHlncDqw7pHYdcF9VrQHua68BLgHWtGUjcCsMwgW4HjgPOBe4/mDASJLmz9jCoqq+Auw/pLweuKOt3wFcNlS/swa+BpyQ5FTgYmBrVe2vqueBrRweQJKkMZvvaxanVNXetv4McEpbXwE8PTRvd6vNVD9Mko1JtifZPjU1NbddS9ISt2AXuKuqgJrD/W2qqomqmli+fPlc7VaSxPyHxbPt9BLt575W3wOsGpq3stVmqkuS5tF8h8UW4OAdTRuALwzVr2x3RZ0PvNBOV90LXJTkxHZh+6JWkyTNo2PGteMknwYuAE5OspvBXU2/Bdyd5GrgW8C72vR7gEuBSeB7wFUAVbU/yYeBbW3eDVV16EVzSdKYjS0squqKGYbeOs3cAq6ZYT+bgc1z2Jok6Qj5CW5JUpdhIUnqMiwkSV2GhSSpy7CQJHUZFpKkLsNCktRlWEiSugwLSVKXYSFJ6jIsJEldhoUkqcuwkCR1GRaSpC7DQpLUZVhIkroMC0lSl2EhSeoyLCRJXYaFJKnLsJAkdS1IWCT5ZpKHk+xMsr3VTkqyNckT7eeJrZ4kNyeZTPJQknMWomdJWsoW8sji56pqbVVNtNfXAfdV1RrgvvYa4BJgTVs2ArfOe6eStMQtptNQ64E72vodwGVD9Ttr4GvACUlOXYgGJWmpWqiwKOCPk+xIsrHVTqmqvW39GeCUtr4CeHpo292t9kOSbEyyPcn2qampcfUtSUvSMQv0vv+wqvYk+evA1iTfGB6sqkpSR7LDqtoEbAKYmJg4om0lSbNbkCOLqtrTfu4DPg+cCzx78PRS+7mvTd8DrBrafGWrSZLmybyHRZLXJnn9wXXgIuARYAuwoU3bAHyhrW8Brmx3RZ0PvDB0ukqSNA8W4jTUKcDnkxx8//9YVX+UZBtwd5KrgW8B72rz7wEuBSaB7wFXzX/LkrS0zXtYVNWTwFnT1J8D3jpNvYBr5qE1SdIMFtOts5KkRcqwkCR1GRaSpC7DQpLUZVhIkroMC0lSl2EhSeoyLCRJXYaFJKnLsJAkdRkWkqQuw0KS1GVYSJK6DAtJUpdhIUnqMiwkSV2GhSSpy7CQJHUZFpKkLsNCktRlWEiSuo6asEiyLsnjSSaTXLfQ/UjSUnJUhEWSZcAtwCXAGcAVSc5Y2K4kaek4KsICOBeYrKonq+r/AXcB6xe4J0laMo5Z6AZGtAJ4euj1buC84QlJNgIb28sXkzw+T70tBScDf77QTSwG+eiGhW5Bh/PP50HX58fdw9+caeBoCYuuqtoEbFroPl6JkmyvqomF7kOajn8+58fRchpqD7Bq6PXKVpMkzYOjJSy2AWuSnJ7kOOByYMsC9yRJS8ZRcRqqqg4kuRa4F1gGbK6qXQvc1lLi6T0tZv75nAepqoXuQZK0yB0tp6EkSQvIsJAkdRkWmpWPWdFilGRzkn1JHlnoXpYKw0Iz8jErWsRuB9YtdBNLiWGh2fiYFS1KVfUVYP9C97GUGBaazXSPWVmxQL1IWkCGhS
Spy7DQbHzMiiTAsNDsfMyKJMCw0Cyq6gBw8DErjwF3+5gVLQZJPg38KfB3kuxOcvVC9/RK5+M+JEldHllIkroMC0lSl2EhSeoyLCRJXYaFJKnLsJAkdRkW0pAkq33stXQ4w0J6BUpyzEL3oFcWw0I63LIkn0iyK8kfJ3lNkvcm2ZbkwSSfTfITAEnemeSRVv/KTDtsRyz/I8nX2/KzrX5Bkj9J8pkk30jyqSRpY5e22o4kNyf5Yqu/tn35zwNJ/izJ+lZ/T5ItSb4E3Df2/0paUgwL6XBrgFuq6kzg28AvAp+rqr9fVWcxePTJwcdLfBC4uNV/fpZ97gPeVlXnAL8E3Dw0djbwfgZfMPVG4C1Jjgd+H7ikqt4MLB+a/2+BL1XVucDPAR9J8to2dg7wjqr6Rz/i7y5Ny7CQDvdUVe1s6zuA1cDPtCODh4F/BpzZxv8ncHuS9wLLZtnnscAn2vZ/yCAYDnqgqnZX1V8CO9v7/TTwZFU91eZ8emj+RcB1SXYCfwIcD5zWxrZWlV8KpDnneU3pcC8Nrb8MvIbB13heVlUPJnkPcAFAVf1KkvOAfwzsSPLmqnpumn3+K+BZ4CwG/0j7/izv1/v/MsAvVtXjP1Qc9PEXnW2lH4lHFtJoXg/sTXIsgyMLAJK8qarur6oPAlP88Pd/DPspYG87eng3sx+FADwOvDHJ6vb6l4bG7gX+5dC1jbOP8HeRjphhIY3m3wH3Mzjt9I2h+keSPNxut/1fwIMzbP9xYEOSBxmcYpr1CKCq/i/wL4A/SrID+C7wQhv+MIPTWg8l2dVeS2PlI8qlRSrJ66rqxXYEcQvwRFXdtNB9aWnyyEJavN7bLmLvYnAa6/cXuB8tYR5ZSHMoycXAbx9Sfqqq3r4Q/UhzxbCQJHV5GkqS1GVYSJK6DAtJUpdhIUnq+v9RMXaVZCUGggAAAABJRU5ErkJggg==\n",
"text/plain": [
"<Figure size 432x288 with 1 Axes>"
]
},
"metadata": {
"needs_background": "light"
}
}
]
},
{
"cell_type": "markdown",
"source": [
"Existem ainda outros fatores que podemos abordar aqui neste ponto. Como a redução de dimensionalidade, por exemplo. No entanto, para fins de brevidade, não trataremos deles."
],
"metadata": {
"id": "cg5nSFyTNmyf"
}
},
{
"cell_type": "markdown",
"source": [
"## 2. Transformers\n",
"\n"
],
"metadata": {
"id": "d0nu1Bo3N5Pf"
}
}
]
}