@RodolfoFerro
Created August 5, 2022 17:08
Text Classifier Model
{
"nbformat": 4,
"nbformat_minor": 0,
"metadata": {
"colab": {
"name": "Text Classifier Model",
"private_outputs": true,
"provenance": [],
"collapsed_sections": [],
"authorship_tag": "ABX9TyNMCvUSvmrBTQ6rJd9A9Rpe",
"include_colab_link": true
},
"kernelspec": {
"name": "python3",
"display_name": "Python 3"
},
"language_info": {
"name": "python"
}
},
"cells": [
{
"cell_type": "markdown",
"metadata": {
"id": "view-in-github",
"colab_type": "text"
},
"source": [
"<a href=\"https://colab.research.google.com/gist/RodolfoFerro/5a2880fede2faa0b8c19199707356d12/text-classifier-model.ipynb\" target=\"_parent\"><img src=\"https://colab.research.google.com/assets/colab-badge.svg\" alt=\"Open In Colab\"/></a>"
]
},
{
"cell_type": "markdown",
"source": [
"# Tweet classifier model"
],
"metadata": {
"id": "fSBdg2x1ldjB"
}
},
{
"cell_type": "markdown",
"source": [
"## Structure of our model\n",
"\n",
"For our model we will use a neuron similar to the previous ones, with a sigmoid activation function."
],
"metadata": {
"id": "45Nuhu_7so2z"
}
},
{
"cell_type": "code",
"source": [
"import math\n",
"\n",
"import numpy as np\n",
"\n",
"\n",
"class neuron_model():\n",
"    def __init__(self, X, Y):\n",
"        \"\"\"Constructor of the class.\"\"\"\n",
"\n",
"        # Seed NumPy's generator so the initial weights are reproducible\n",
"        np.random.seed(123)\n",
"\n",
"        # X arrives as (n_samples, n_features); store it as (n_features, n_samples)\n",
"        self.n_data = X.shape[0]\n",
"        self.X = X.T\n",
"        self.Y = Y.T\n",
"\n",
"        # Random weight initialization: multiples of 0.01 in [-0.1, 0.1]\n",
"        self.w = np.random.randint(-10, 11, size=(self.X.shape[0], 1)) * 0.01\n",
"\n",
"        # Bias initialization\n",
"        self.b = 0\n",
"\n",
"\n",
"    def sigmoid(self, x):\n",
"        \"\"\"Sigmoid function.\"\"\"\n",
"\n",
"        return 1.0 / (1.0 + np.exp(-x))\n",
"\n",
"\n",
"    def predict(self, x):\n",
"        \"\"\"Prediction function for x of shape (n_samples, n_features).\"\"\"\n",
"\n",
"        return self.sigmoid(np.dot(x, self.w) + self.b)\n",
"\n",
"\n",
"    def train(self, iterations, learning_rate=0.1):\n",
"        \"\"\"Training function (batch gradient descent).\"\"\"\n",
"\n",
"        # Cost initialization\n",
"        cost = 0\n",
"        for i in range(iterations):\n",
"            # Forward pass and binary cross-entropy cost\n",
"            out = self.sigmoid(np.dot(self.w.T, self.X) + self.b)\n",
"            cost = (-1. / self.n_data) * np.sum((self.Y * np.log(out)) + (1. - self.Y) * np.log(1. - out))\n",
"\n",
"            print(f'[INFO] Iteration: {i + 1}/{iterations}, cost: {cost}')\n",
"\n",
"            # Stop early if the cost is no longer a number\n",
"            if math.isnan(cost):\n",
"                break\n",
"\n",
"            # Gradients of the cost with respect to the weights and the bias\n",
"            dw = (1. / self.n_data) * np.dot(self.X, (out - self.Y).T)\n",
"            db = (1. / self.n_data) * np.sum(out - self.Y)\n",
"\n",
"            # Gradient descent step\n",
"            self.w = self.w - dw * learning_rate\n",
"            self.b = self.b - db * learning_rate\n",
"\n",
"        print('[INFO] Training succeeded!')"
],
"metadata": {
"id": "h9icUJ3ngDY2"
},
"execution_count": null,
"outputs": []
},
{
"cell_type": "markdown",
"source": [
"## Downloading the dataset\n",
"\n",
"The dataset we will use consists of movie reviews and comes from the _Internet Movie Database (IMDB)_: https://www.imdb.com/interfaces/\n",
"\n",
"We will download the dataset from a public repository, which can be done with the following line of code. It contains 50,000 reviews of various well-known movies."
],
"metadata": {
"id": "2Nu2TyfDmfDj"
}
},
{
"cell_type": "code",
"source": [
"!wget https://raw.githubusercontent.com/Ankit152/IMDB-sentiment-analysis/master/IMDB-Dataset.csv"
],
"metadata": {
"id": "jjmQlIh2ksuP"
},
"execution_count": null,
"outputs": []
},
{
"cell_type": "markdown",
"source": [
"### Data transformation\n",
"\n",
"We need to transform the data into numeric form before we can model it."
],
"metadata": {
"id": "zoDeIpU7nRsq"
}
},
{
"cell_type": "code",
"source": [
"import pandas as pd\n",
"\n",
"\n",
"df = pd.read_csv('IMDB-Dataset.csv')\n",
"df.head(10)"
],
"metadata": {
"id": "E1vr_2pZmesI"
},
"execution_count": null,
"outputs": []
},
{
"cell_type": "code",
"source": [
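"# Select single columns (1-D arrays) so that each value is a plain string\n",
"sentiment = df['sentiment'].values\n",
"review = df['review'].values\n",
"\n",
"# Encode the labels: 1 for positive, 0 for negative\n",
"sentiment = [1 if value == 'positive' else 0 for value in sentiment]\n",
"review = [str(text) for text in review]"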
],
"metadata": {
"id": "zHOvifbdnlfY"
},
"execution_count": null,
"outputs": []
},
{
"cell_type": "code",
"source": [
"from sklearn.feature_extraction.text import HashingVectorizer\n",
"\n",
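"# Create the transformation: each review is hashed into a 20-dimensional vector\n",
"vectorizer = HashingVectorizer(n_features=20)\n",
"# HashingVectorizer is stateless, so fit() learns nothing from the data\n",
"vectorizer.fit(review)"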
],
"metadata": {
"id": "2EyWR7K0xmdQ"
},
"execution_count": null,
"outputs": []
},
{
"cell_type": "code",
"source": [
"# encode document\n",
"vector = vectorizer.transform([review[0]])\n",
"\n",
"# summarize encoded vector\n",
"print(vector.shape)\n",
"print(vector.toarray())"
],
"metadata": {
"id": "lMwwpyjgykBA"
},
"execution_count": null,
"outputs": []
},
{
"cell_type": "markdown",
"source": [
"Now we prepare the final data."
],
"metadata": {
"id": "rPkJ6Q_jzCAL"
}
},
{
"cell_type": "code",
"source": [
"X = vectorizer.transform(review).toarray()\n",
"Y = np.array(sentiment)"
],
"metadata": {
"id": "TWYZeH4syv6x"
},
"execution_count": null,
"outputs": []
},
{
"cell_type": "markdown",
"source": [
"## Creating a model\n",
"\n",
"Now that we have loaded our data, we can create a model and train it with that data."
],
"metadata": {
"id": "F4g7opXTs0RR"
}
},
{
"cell_type": "code",
"source": [
"model = neuron_model(X, Y)\n",
"model.train(10000, learning_rate=0.0001)"
],
"metadata": {
"id": "sd-j5OjNr-ow"
},
"execution_count": null,
"outputs": []
},
{
"cell_type": "markdown",
"source": [
"## Prediction with the model\n",
"\n",
"Once the model is trained, we can use it to make predictions. To do so, we can load some sample data for testing."
],
"metadata": {
"id": "Jtwy-gTOtCOy"
}
},
{
"cell_type": "code",
"source": [
"phrases = [\n",
"    'I do not consider this comment as something positive.',\n",
"    'I really love life and people and animals and everything. I enjoy being happy.',\n",
"    'I hate people.'\n",
"]\n",
"\n",
"x = vectorizer.transform(phrases).toarray()\n",
"model.predict(x)\n"
],
"metadata": {
"id": "eeaSISKGrs0b"
},
"execution_count": null,
"outputs": []
},
{
"cell_type": "code",
"source": [
"from sklearn.metrics import accuracy_score\n",
"\n",
"# Predict on the training data and threshold the probabilities at 0.5\n",
"res_X = (model.predict(X).ravel() >= 0.5).astype(int)\n",
"\n",
"accuracy_score(Y, res_X)"
],
"metadata": {
"id": "eHlVRsLH9t6Q"
},
"execution_count": null,
"outputs": []
},
{
"cell_type": "code",
"source": [
"model.w"
],
"metadata": {
"id": "yZRPzb4v-sh4"
},
"execution_count": null,
"outputs": []
},
{
"cell_type": "code",
"source": [
"model.b"
],
"metadata": {
"id": "QydZL8ke-u2t"
},
"execution_count": null,
"outputs": []
},
{
"cell_type": "markdown",
"source": [
"--------\n",
"\n",
"> Content created by **Rodolfo Ferro** for [CdeCMx](https://clubesdeciencia.mx/), 2022. <br>\n",
"> You can contact me via Insta ([@rodo_ferro](https://www.instagram.com/rodo_ferro/)) or Twitter ([@rodo_ferro](https://twitter.com/rodo_ferro))."
],
"metadata": {
"id": "LzaGoXnfuQjC"
}
}
]
}