Sentiment Analysis of Tweets on Indian Prime Ministerial Candidates.ipynb
{
"cells": [
{
"cell_type": "markdown",
"metadata": {
"id": "view-in-github",
"colab_type": "text"
},
"source": [
"<a href=\"https://colab.research.google.com/gist/skilfoy/67d14c41e96d70c463dd0082e300c2f0/sentiment-analysis-of-tweets-on-indian-prime-ministerial-candidates.ipynb\" target=\"_parent\"><img src=\"https://colab.research.google.com/assets/colab-badge.svg\" alt=\"Open In Colab\"/></a>"
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "xUeLhRVJQG83"
},
"source": [
"\n",
"# Sentiment Analysis of Tweets on Indian Prime Ministerial Candidates\n",
"\n",
"**Sean Kilfoy**"
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "0Sc3QlmEQXSs"
},
"source": [
"## Scenario\n",
"\n",
"You are a data scientist working for a Political Consulting Firm. You are given a dataset containing in Twitter_Data.csv. This dataset has the following two columns:\n",
"\n",
"- **clean_text**: Tweets made by the people extracted from Twitter Mainly Focused on tweets Made by People on Modi(2019 Indian Prime Minister candidate) and Other Prime Ministerial Candidates.\n",
"- **category**: It describes the actual sentiment of the respective tweet with three values of -1, 0, and 1."
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "QTbQGvz7ehbJ"
},
"source": [
"## Introduction\n",
"\n",
"In this analysis, we will explore the sentiments of tweets regarding Indian Prime Ministerial candidates, primarily focusing on Narendra Modi during the 2019 elections. The dataset contains tweets and their associated sentiment categories. We will perform various text preprocessing and feature extraction techniques to understand the sentiment better."
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "EUhXK8Eeenb8"
},
"source": [
"## Step 1: Load the Dataset\n",
"\n",
"First, we will load the dataset `Twitter_Data.csv` into memory. This dataset contains two columns: `clean_text` and `category`."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"colab": {
"base_uri": "https://localhost:8080/"
},
"id": "OhF_fGBYRhnc",
"outputId": "83eaa576-6d2d-45bf-c220-3ea393397fa0"
},
"outputs": [
{
"output_type": "stream",
"name": "stdout",
"text": [
"Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount(\"/content/drive\", force_remount=True).\n"
]
}
],
"source": [
"import pandas as pd\n",
"from sklearn.feature_extraction.text import CountVectorizer\n",
"from sklearn.feature_extraction.text import TfidfTransformer\n",
"from sklearn.feature_extraction.text import TfidfVectorizer\n",
"from sklearn.feature_extraction.text import HashingVectorizer\n",
"\n",
"from google.colab import drive\n",
"drive.mount('/content/drive')"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"colab": {
"base_uri": "https://localhost:8080/",
"height": 206
},
"id": "7sFjJ0sNSFAI",
"outputId": "b114c331-1457-4ec0-c45e-27695e8d8a67"
},
"outputs": [
{
"output_type": "execute_result",
"data": {
"text/plain": [
" clean_text category\n",
"0 when modi promised “minimum government maximum... -1\n",
"1 talk all the nonsense and continue all the dra... 0\n",
"2 what did just say vote for modi welcome bjp t... 1\n",
"3 asking his supporters prefix chowkidar their n... 1\n",
"4 answer who among these the most powerful world... 1"
],
"text/html": [
"\n",
" <div id=\"df-c2954525-b984-4395-b86e-a0eac69fe4ca\" class=\"colab-df-container\">\n",
" <div>\n",
"<style scoped>\n",
" .dataframe tbody tr th:only-of-type {\n",
" vertical-align: middle;\n",
" }\n",
"\n",
" .dataframe tbody tr th {\n",
" vertical-align: top;\n",
" }\n",
"\n",
" .dataframe thead th {\n",
" text-align: right;\n",
" }\n",
"</style>\n",
"<table border=\"1\" class=\"dataframe\">\n",
" <thead>\n",
" <tr style=\"text-align: right;\">\n",
" <th></th>\n",
" <th>clean_text</th>\n",
" <th>category</th>\n",
" </tr>\n",
" </thead>\n",
" <tbody>\n",
" <tr>\n",
" <th>0</th>\n",
" <td>when modi promised “minimum government maximum...</td>\n",
" <td>-1</td>\n",
" </tr>\n",
" <tr>\n",
" <th>1</th>\n",
" <td>talk all the nonsense and continue all the dra...</td>\n",
" <td>0</td>\n",
" </tr>\n",
" <tr>\n",
" <th>2</th>\n",
" <td>what did just say vote for modi welcome bjp t...</td>\n",
" <td>1</td>\n",
" </tr>\n",
" <tr>\n",
" <th>3</th>\n",
" <td>asking his supporters prefix chowkidar their n...</td>\n",
" <td>1</td>\n",
" </tr>\n",
" <tr>\n",
" <th>4</th>\n",
" <td>answer who among these the most powerful world...</td>\n",
" <td>1</td>\n",
" </tr>\n",
" </tbody>\n",
"</table>\n",
"</div>\n",
" <div class=\"colab-df-buttons\">\n",
"\n",
" <div class=\"colab-df-container\">\n",
" <button class=\"colab-df-convert\" onclick=\"convertToInteractive('df-c2954525-b984-4395-b86e-a0eac69fe4ca')\"\n",
" title=\"Convert this dataframe to an interactive table.\"\n",
" style=\"display:none;\">\n",
"\n",
" <svg xmlns=\"http://www.w3.org/2000/svg\" height=\"24px\" viewBox=\"0 -960 960 960\">\n",
" <path d=\"M120-120v-720h720v720H120Zm60-500h600v-160H180v160Zm220 220h160v-160H400v160Zm0 220h160v-160H400v160ZM180-400h160v-160H180v160Zm440 0h160v-160H620v160ZM180-180h160v-160H180v160Zm440 0h160v-160H620v160Z\"/>\n",
" </svg>\n",
" </button>\n",
"\n",
" <style>\n",
" .colab-df-container {\n",
" display:flex;\n",
" gap: 12px;\n",
" }\n",
"\n",
" .colab-df-convert {\n",
" background-color: #E8F0FE;\n",
" border: none;\n",
" border-radius: 50%;\n",
" cursor: pointer;\n",
" display: none;\n",
" fill: #1967D2;\n",
" height: 32px;\n",
" padding: 0 0 0 0;\n",
" width: 32px;\n",
" }\n",
"\n",
" .colab-df-convert:hover {\n",
" background-color: #E2EBFA;\n",
" box-shadow: 0px 1px 2px rgba(60, 64, 67, 0.3), 0px 1px 3px 1px rgba(60, 64, 67, 0.15);\n",
" fill: #174EA6;\n",
" }\n",
"\n",
" .colab-df-buttons div {\n",
" margin-bottom: 4px;\n",
" }\n",
"\n",
" [theme=dark] .colab-df-convert {\n",
" background-color: #3B4455;\n",
" fill: #D2E3FC;\n",
" }\n",
"\n",
" [theme=dark] .colab-df-convert:hover {\n",
" background-color: #434B5C;\n",
" box-shadow: 0px 1px 3px 1px rgba(0, 0, 0, 0.15);\n",
" filter: drop-shadow(0px 1px 2px rgba(0, 0, 0, 0.3));\n",
" fill: #FFFFFF;\n",
" }\n",
" </style>\n",
"\n",
" <script>\n",
" const buttonEl =\n",
" document.querySelector('#df-c2954525-b984-4395-b86e-a0eac69fe4ca button.colab-df-convert');\n",
" buttonEl.style.display =\n",
" google.colab.kernel.accessAllowed ? 'block' : 'none';\n",
"\n",
" async function convertToInteractive(key) {\n",
" const element = document.querySelector('#df-c2954525-b984-4395-b86e-a0eac69fe4ca');\n",
" const dataTable =\n",
" await google.colab.kernel.invokeFunction('convertToInteractive',\n",
" [key], {});\n",
" if (!dataTable) return;\n",
"\n",
" const docLinkHtml = 'Like what you see? Visit the ' +\n",
" '<a target=\"_blank\" href=https://colab.research.google.com/notebooks/data_table.ipynb>data table notebook</a>'\n",
" + ' to learn more about interactive tables.';\n",
" element.innerHTML = '';\n",
" dataTable['output_type'] = 'display_data';\n",
" await google.colab.output.renderOutput(dataTable, element);\n",
" const docLink = document.createElement('div');\n",
" docLink.innerHTML = docLinkHtml;\n",
" element.appendChild(docLink);\n",
" }\n",
" </script>\n",
" </div>\n",
"\n",
"\n",
"<div id=\"df-91928bad-5cbd-44a1-829a-e238259ea465\">\n",
" <button class=\"colab-df-quickchart\" onclick=\"quickchart('df-91928bad-5cbd-44a1-829a-e238259ea465')\"\n",
" title=\"Suggest charts\"\n",
" style=\"display:none;\">\n",
"\n",
"<svg xmlns=\"http://www.w3.org/2000/svg\" height=\"24px\"viewBox=\"0 0 24 24\"\n",
" width=\"24px\">\n",
" <g>\n",
" <path d=\"M19 3H5c-1.1 0-2 .9-2 2v14c0 1.1.9 2 2 2h14c1.1 0 2-.9 2-2V5c0-1.1-.9-2-2-2zM9 17H7v-7h2v7zm4 0h-2V7h2v10zm4 0h-2v-4h2v4z\"/>\n",
" </g>\n",
"</svg>\n",
" </button>\n",
"\n",
"<style>\n",
" .colab-df-quickchart {\n",
" --bg-color: #E8F0FE;\n",
" --fill-color: #1967D2;\n",
" --hover-bg-color: #E2EBFA;\n",
" --hover-fill-color: #174EA6;\n",
" --disabled-fill-color: #AAA;\n",
" --disabled-bg-color: #DDD;\n",
" }\n",
"\n",
" [theme=dark] .colab-df-quickchart {\n",
" --bg-color: #3B4455;\n",
" --fill-color: #D2E3FC;\n",
" --hover-bg-color: #434B5C;\n",
" --hover-fill-color: #FFFFFF;\n",
" --disabled-bg-color: #3B4455;\n",
" --disabled-fill-color: #666;\n",
" }\n",
"\n",
" .colab-df-quickchart {\n",
" background-color: var(--bg-color);\n",
" border: none;\n",
" border-radius: 50%;\n",
" cursor: pointer;\n",
" display: none;\n",
" fill: var(--fill-color);\n",
" height: 32px;\n",
" padding: 0;\n",
" width: 32px;\n",
" }\n",
"\n",
" .colab-df-quickchart:hover {\n",
" background-color: var(--hover-bg-color);\n",
" box-shadow: 0 1px 2px rgba(60, 64, 67, 0.3), 0 1px 3px 1px rgba(60, 64, 67, 0.15);\n",
" fill: var(--button-hover-fill-color);\n",
" }\n",
"\n",
" .colab-df-quickchart-complete:disabled,\n",
" .colab-df-quickchart-complete:disabled:hover {\n",
" background-color: var(--disabled-bg-color);\n",
" fill: var(--disabled-fill-color);\n",
" box-shadow: none;\n",
" }\n",
"\n",
" .colab-df-spinner {\n",
" border: 2px solid var(--fill-color);\n",
" border-color: transparent;\n",
" border-bottom-color: var(--fill-color);\n",
" animation:\n",
" spin 1s steps(1) infinite;\n",
" }\n",
"\n",
" @keyframes spin {\n",
" 0% {\n",
" border-color: transparent;\n",
" border-bottom-color: var(--fill-color);\n",
" border-left-color: var(--fill-color);\n",
" }\n",
" 20% {\n",
" border-color: transparent;\n",
" border-left-color: var(--fill-color);\n",
" border-top-color: var(--fill-color);\n",
" }\n",
" 30% {\n",
" border-color: transparent;\n",
" border-left-color: var(--fill-color);\n",
" border-top-color: var(--fill-color);\n",
" border-right-color: var(--fill-color);\n",
" }\n",
" 40% {\n",
" border-color: transparent;\n",
" border-right-color: var(--fill-color);\n",
" border-top-color: var(--fill-color);\n",
" }\n",
" 60% {\n",
" border-color: transparent;\n",
" border-right-color: var(--fill-color);\n",
" }\n",
" 80% {\n",
" border-color: transparent;\n",
" border-right-color: var(--fill-color);\n",
" border-bottom-color: var(--fill-color);\n",
" }\n",
" 90% {\n",
" border-color: transparent;\n",
" border-bottom-color: var(--fill-color);\n",
" }\n",
" }\n",
"</style>\n",
"\n",
" <script>\n",
" async function quickchart(key) {\n",
" const quickchartButtonEl =\n",
" document.querySelector('#' + key + ' button');\n",
" quickchartButtonEl.disabled = true; // To prevent multiple clicks.\n",
" quickchartButtonEl.classList.add('colab-df-spinner');\n",
" try {\n",
" const charts = await google.colab.kernel.invokeFunction(\n",
" 'suggestCharts', [key], {});\n",
" } catch (error) {\n",
" console.error('Error during call to suggestCharts:', error);\n",
" }\n",
" quickchartButtonEl.classList.remove('colab-df-spinner');\n",
" quickchartButtonEl.classList.add('colab-df-quickchart-complete');\n",
" }\n",
" (() => {\n",
" let quickchartButtonEl =\n",
" document.querySelector('#df-91928bad-5cbd-44a1-829a-e238259ea465 button');\n",
" quickchartButtonEl.style.display =\n",
" google.colab.kernel.accessAllowed ? 'block' : 'none';\n",
" })();\n",
" </script>\n",
"</div>\n",
"\n",
" </div>\n",
" </div>\n"
],
"application/vnd.google.colaboratory.intrinsic+json": {
"type": "dataframe",
"variable_name": "data"
}
},
"metadata": {},
"execution_count": 9
}
],
"source": [
"data = pd.read_csv(\"/content/drive/MyDrive/Colab Notebooks/DSCI614_text_mining/Week 4/Twitter_Data.csv\")\n",
"\n",
"# Check for null values in the clean_text column and remove them\n",
"data = data.dropna(subset=['clean_text', 'category'])\n",
"\n",
"# Ensure the category column is of type int\n",
"data['category'] = data['category'].astype(int)\n",
"\n",
"data.head()"
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "IbA2Nq52dHs0"
},
"source": [
"## Step 2: Convert Tweets to a Matrix of Token Counts Using CountVectorizer\n",
"We will use the `CountVectorizer` to transform the `clean_text` column into a matrix of token counts, including both unigrams and bigrams.\n",
"\n"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"colab": {
"base_uri": "https://localhost:8080/"
},
"id": "12McK0R5cOSj",
"outputId": "30f0a14c-7fd8-41e3-ae45-4a5b15d1d335"
},
"outputs": [
{
"output_type": "stream",
"name": "stdout",
"text": [
"(162969, 1199719)\n"
]
}
],
"source": [
"# Initialize the CountVectorizer\n",
"count_vect = CountVectorizer(ngram_range=(1, 2))\n",
"\n",
"# Transform the clean_text to a matrix of token counts\"\n",
"X_counts = count_vect.fit_transform(data['clean_text'])\n",
"\n",
"# Display the shape of the resulting matrix\n",
"print(X_counts.shape)"
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "RJFj5u2jhTEP"
},
"source": [
"## Step 3: Perform TF-IDF Analysis Using CountVectorizer and TfidfTransformer\n",
"\n",
"Next, we will perform TF-IDF analysis by first converting the `clean_text` to token counts using `CountVectorizer` and then applying `TfidfTransformer`."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"colab": {
"base_uri": "https://localhost:8080/"
},
"id": "HDr3NC4-hctW",
"outputId": "68239f0d-f6a1-48dd-e2ef-f570f8a7db6f"
},
"outputs": [
{
"output_type": "stream",
"name": "stdout",
"text": [
"(162969, 1199719)\n"
]
}
],
"source": [
"# Initialize the TfidfTransformer\n",
"tfidf_transformer = TfidfTransformer()\n",
"\n",
"# Transform the token counts to TF-IDF representation\n",
"X_tfidf = tfidf_transformer.fit_transform(X_counts)\n",
"\n",
"# Display the shape of the resulting TF-IDF matrix\n",
"print(X_tfidf.shape)"
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "NiNCpdZ8hfiu"
},
"source": [
"## Step 4: Perform TF-IDF Analysis Using TfidfVectorizer\n",
"\n",
"We will use the `TfidfVectorizer` to directly convert the `clean_text` column into a TF-IDF matrix."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"colab": {
"base_uri": "https://localhost:8080/"
},
"id": "L8tkbshDheUP",
"outputId": "b6c33f8f-ff1a-40bd-e251-3f5c4f6c2fff"
},
"outputs": [
{
"output_type": "stream",
"name": "stdout",
"text": [
"(162969, 1199719)\n"
]
}
],
"source": [
"# Initialize the TfidfVectorizer\n",
"tfidf_vect = TfidfVectorizer(ngram_range=(1, 2))\n",
"\n",
"# Transform the clean_text to a TF-IDF representation\n",
"X_tfidf_vect = tfidf_vect.fit_transform(data['clean_text'])\n",
"\n",
"# Display the shape of the resulting TF-IDF matrix\n",
"print(X_tfidf_vect.shape)"
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "fJ_cirlphpqL"
},
"source": [
"## Step 5: Perform TF-IDF Analysis Using HashingVectorizer and TfidfTransformer\n",
"\n",
"Finally, we will use `HashingVectorizer` to convert the `clean_text` column into a hashed term-frequency matrix, and then apply `TfidfTransformer` to transform it into a TF-IDF representation."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"colab": {
"base_uri": "https://localhost:8080/"
},
"id": "5GrR1rcLhnKH",
"outputId": "fb3e1129-ca8e-45a3-b729-dc50b5ae5fa2"
},
"outputs": [
{
"output_type": "stream",
"name": "stdout",
"text": [
"(162969, 1048576)\n"
]
}
],
"source": [
"# Initialize the HashingVectorizer\n",
"hash_vect = HashingVectorizer(ngram_range=(1, 2), alternate_sign=False)\n",
"\n",
"# Transform the clean_text to a hashed term-frequency matrix\n",
"X_hash = hash_vect.transform(data['clean_text'])\n",
"\n",
"# Apply TfidfTransformer to the hashed term-frequency matrix\n",
"X_hash_tfidf = tfidf_transformer.fit_transform(X_hash)\n",
"\n",
"# Display the shape of the resulting TF-IDF matrix\n",
"print(X_hash_tfidf.shape)"
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "EzbFPTMd28fZ"
},
"source": [
"## Advanced Analytics\n",
"\n",
"Next, I will explore a few additional topics and techniques that TF-IDF allows us to leverage."
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "slmSGWgEi9UN"
},
"source": [
"### 1. Sentiment Classification\n",
"\n",
"With TF-IDF vectors, we can train machine learning models to classify the sentiment of tweets. For instance, we can use algorithms like Logistic Regression, Support Vector Machines, or Random Forests to predict whether a tweet has a positive, neutral, or negative sentiment based on its TF-IDF features."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"colab": {
"base_uri": "https://localhost:8080/"
},
"id": "7G9mrN-Qi8Gv",
"outputId": "2b975e40-441c-48d2-c433-c1c0a6d26180"
},
"outputs": [
{
"output_type": "stream",
"name": "stdout",
"text": [
" precision recall f1-score support\n",
"\n",
" -1 0.91 0.66 0.76 7152\n",
" 0 0.82 0.95 0.88 11067\n",
" 1 0.89 0.90 0.90 14375\n",
"\n",
" accuracy 0.87 32594\n",
" macro avg 0.87 0.84 0.85 32594\n",
"weighted avg 0.87 0.87 0.86 32594\n",
"\n",
"Accuracy: 0.867030741854329\n"
]
}
],
"source": [
"from sklearn.model_selection import train_test_split\n",
"from sklearn.linear_model import LogisticRegression\n",
"from sklearn.metrics import classification_report, accuracy_score\n",
"\n",
"# Split the data into training and testing sets\n",
"X_train, X_test, y_train, y_test = train_test_split(X_tfidf, data['category'], test_size=0.2, random_state=42)\n",
"\n",
"# Train a Logistic Regression model with appropriate solver\n",
"model = LogisticRegression(solver='liblinear')\n",
"model.fit(X_train, y_train)\n",
"\n",
"# Predict on the test set\n",
"y_pred = model.predict(X_test)\n",
"\n",
"# Evaluate the model\n",
"print(classification_report(y_test, y_pred))\n",
"print('Accuracy:', accuracy_score(y_test, y_pred))\n"
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "ww2jqchsjIjc"
},
"source": [
"### 2. Topic Modeling\n",
"\n",
"TF-IDF can be used in topic modeling to discover abstract topics within a collection of tweets. Latent Dirichlet Allocation (LDA) or Non-negative Matrix Factorization (NMF) can be applied to TF-IDF matrices to uncover hidden themes and topics in the data.\n",
"\n"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"colab": {
"base_uri": "https://localhost:8080/"
},
"id": "kYjapWbEi8DI",
"outputId": "c7aeefff-e372-4c80-fee3-f039bf1c1bff"
},
"outputs": [
{
"output_type": "stream",
"name": "stdout",
"text": [
"Topic #0\n",
"['for', 'was', 'has', 'not', 'will', 'that', 'this', 'modi', 'and', 'the']\n",
"Topic #1\n",
"['the', 'nation', 'address', 'modi', 'minister narendra', 'prime minister', 'prime', 'minister', 'narendra modi', 'narendra']\n",
"Topic #2\n",
"['only', 'bjp', 'modi for', 'will vote', 'will', 'modi', 'for modi', 'vote for', 'vote', 'for']\n",
"Topic #3\n",
"['and', 'have', 'not', 'modi you', 'with', 'modi', 'your', 'you are', 'are', 'you']\n",
"Topic #4\n",
"['shot', 'shot down', 'down satellite', 'says', 'space power', 'power', 'down', 'satellite', 'space', 'india']\n"
]
}
],
"source": [
"from sklearn.decomposition import NMF\n",
"\n",
"# Initialize the NMF model\n",
"nmf_model = NMF(n_components=5, random_state=42)\n",
"\n",
"# Fit the model to the TF-IDF matrix\n",
"nmf_model.fit(X_tfidf)\n",
"\n",
"# Extract topics\n",
"for index, topic in enumerate(nmf_model.components_):\n",
" print(f'Topic #{index}')\n",
" print([tfidf_vect.get_feature_names_out()[i] for i in topic.argsort()[-10:]])\n"
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "0_-MDtnfjRic"
},
"source": [
"### 3. Similarity Search\n",
"TF-IDF vectors can be used to find similar tweets or documents. By calculating the cosine similarity between vectors, you can identify tweets that are most similar to a given query tweet."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"colab": {
"base_uri": "https://localhost:8080/"
},
"id": "soSdkFD2hxE3",
"outputId": "25cc8180-c552-482e-f915-9c9baf777c31"
},
"outputs": [
{
"output_type": "stream",
"name": "stdout",
"text": [
"Top 10 important features for sentiment classification:\n",
"more\n",
"corrupt\n",
"stupid\n",
"other\n",
"bad\n",
"failed\n",
"wrong\n",
"fake\n",
"poor\n",
"hate\n"
]
}
],
"source": [
"import numpy as np\n",
"\n",
"# Get feature importance from the trained model\n",
"feature_importance = np.abs(model.coef_[0])\n",
"\n",
"# Get the indices of the top 10 important features\n",
"top_features = feature_importance.argsort()[-10:]\n",
"\n",
"print('Top 10 important features for sentiment classification:')\n",
"for index in top_features:\n",
" print(tfidf_vect.get_feature_names_out()[index])"
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "it-ngJlzjNYi"
},
"source": [
"### 4. Clustering\n",
"You can use clustering algorithms like K-Means to group similar tweets together based on their TF-IDF vectors. This can help identify natural groupings or segments within the data, such as clusters of tweets discussing similar issues or expressing similar sentiments."
]
},
{
"cell_type": "code",
"source": [
"from sklearn.decomposition import TruncatedSVD\n",
"from sklearn.cluster import KMeans\n",
"import matplotlib.pyplot as plt\n",
"\n",
"# Initialize the TruncatedSVD model\n",
"svd = TruncatedSVD(n_components=2, random_state=42)\n",
"\n",
"# Fit and transform the TF-IDF matrix\n",
"X_svd = svd.fit_transform(X_tfidf)\n",
"\n",
"# Initialize the KMeans model\n",
"kmeans = KMeans(n_clusters=3, random_state=11)\n",
"\n",
"# Fit the model to the SVD-transformed TF-IDF matrix\n",
"kmeans.fit(X_svd)\n",
"\n",
"# Predict the clusters\n",
"clusters = kmeans.predict(X_svd)\n",
"\n",
"# Visualize the clusters\n",
"plt.figure(figsize=(10, 6))\n",
"plt.scatter(X_svd[:, 0], X_svd[:, 1], c=clusters, cmap='viridis')\n",
"plt.title('Tweet Clusters after Truncated SVD')\n",
"plt.xlabel('Component 1')\n",
"plt.ylabel('Component 2')\n",
"plt.colorbar(label='Cluster Label')\n",
"plt.show()"
],
"metadata": {
"id": "kzdUuW_LJlDd",
"colab": {
"base_uri": "https://localhost:8080/",
"height": 560
},
"outputId": "a20a78b0-ded7-4356-addd-268c276b9327"
},
"execution_count": null,
"outputs": [
{
"output_type": "stream",
"name": "stderr",
"text": [
"/usr/local/lib/python3.10/dist-packages/sklearn/cluster/_kmeans.py:870: FutureWarning: The default value of `n_init` will change from 10 to 'auto' in 1.4. Set the value of `n_init` explicitly to suppress the warning\n",
" warnings.warn(\n"
]
},
{
"output_type": "display_data",
"data": {
"text/plain": [
"<Figure size 1000x600 with 2 Axes>"
],
"image/png": "\n"
},
"metadata": {}
}
]
}
],
"metadata": {
"accelerator": "TPU",
"colab": {
"gpuType": "V28",
"machine_shape": "hm",
"provenance": [],
"include_colab_link": true
},
"kernelspec": {
"display_name": "Python 3",
"name": "python3"
},
"language_info": {
"name": "python"
}
},
"nbformat": 4,
"nbformat_minor": 0
}