{
"cells": [
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Topic Modeling"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Topic Modeling is performed using Latend Drichlet Allocation **(LDA)**\n",
"\n",
"LDA is a generative model, meaning the the model can be used to generate the corpus. Each document (tweet) in the corpus is a probabilistic mixture of topics and each topic is a probabilistic mixtire of terms. Let there be $M$ tweets each with $N_m$ words ($m=1:M$). Let, the distinct words in corpus be equal to $W$. Let us assume that the total number of topics in the corpus is equal to $K$. $\\alpha$ and $\\eta$ are hyperparameters for Dirichlet distributions.\n",
"\n",
"The Dirichlet distributions produce the $k$-dimensional topic vector represented by $\\theta$ and $W$-dimensional word vector denoted by $\\beta$\n",
"\n",
"$\\theta$ = \\begin{bmatrix}\n",
"k_1 \\\\\n",
"k_2 \\\\\n",
"\\vdots\n",
"k_K\n",
"\\end{bmatrix}\n",
"\n",
"$\\beta$ = \\begin{bmatrix}\n",
"w_1 \\\\\n",
"w_2 \\\\\n",
"\\vdots\n",
"w_W\n",
"\\end{bmatrix}\n",
"\n",
"\n",
"The goal of training **LDA** model is to determine $\\theta$ and $\\beta$ such that probability of generating the actual corpus is maximixed. \n",
"\n",
"\n",
"<br/>\n",
"\n",
"**Pooled LDA**\n",
"\n",
"LDA's preformence suffers in shorter document. Thus in our case, every document in the corpus should include the message of every tweet posted by a particular twitter in Bag of Word format. *We havent done this yet\n",
" \n",
"<br/>\n",
"\n",
"**Python Package To Perform LDA**\n",
"\n",
"There exists a [python package](https://radimrehurek.com/gensim/models/ldamodel.html) to perform LDA\n"
]
},
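{
"cell_type": "markdown",
"metadata": {},
"source": [
"To make the generative view above concrete, the following toy sketch samples a tiny corpus from the LDA process with numpy. It is an illustration only: the sizes $K$, $W$, $M$, $N_m$ and the hyperparameters are made up, not fit to our data.\n"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Illustrative sketch of the LDA generative process on toy sizes.\n",
"import numpy as np\n",
"\n",
"K, W, M, N_m = 3, 8, 4, 6  # topics, vocabulary size, documents, words per document\n",
"alpha, eta = 0.5, 0.1      # Dirichlet hyperparameters\n",
"\n",
"rng = np.random.RandomState(0)\n",
"beta = rng.dirichlet(eta * np.ones(W), size=K)  # K topic-word distributions, rows sum to 1\n",
"\n",
"for m in range(M):\n",
"    theta = rng.dirichlet(alpha * np.ones(K))   # per-document topic mixture\n",
"    words = []\n",
"    for _ in range(N_m):\n",
"        z = rng.choice(K, p=theta)              # choose a topic for this word slot\n",
"        words.append(rng.choice(W, p=beta[z]))  # choose a word id from that topic\n",
"    print('doc', m, 'theta =', np.round(theta, 2), 'word ids =', words)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"A hypothetical sketch of the user-pooling step described above. The column name `user` is an assumption about the CSV schema, not a checked fact.\n"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Hypothetical pooling sketch: concatenate each user's tweets into one document.\n",
"# NOTE: the 'user' column name is an assumption; adjust to the actual schema.\n",
"def pool_tweets_by_user(df, user_col='user', text_col='text'):\n",
"    return df.groupby(user_col)[text_col].apply(lambda s: ' '.join(s.astype(str))).reset_index()\n",
"\n",
"# pooled = pool_tweets_by_user(data)\n",
"# pooled_docs = pooled['text'].map(preprocess)"
]
},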
{
"cell_type": "code",
"execution_count": 1,
"metadata": {},
"outputs": [],
"source": [
"import pandas as pd\n",
"import numpy as np"
]
},
{
"cell_type": "code",
"execution_count": 26,
"metadata": {},
"outputs": [
{
"name": "stderr",
"output_type": "stream",
"text": [
"C:\\ProgramData\\Anaconda3\\lib\\site-packages\\IPython\\core\\interactiveshell.py:2785: DtypeWarning: Columns (6) have mixed types. Specify dtype option on import or set low_memory=False.\n",
" interactivity=interactivity, compiler=compiler, result=result)\n"
]
}
],
"source": [
"data = pd.read_csv('archive/all-delhi-tweets-2019.csv')\n",
"document = data['text']"
]
},
{
"cell_type": "code",
"execution_count": 3,
"metadata": {},
"outputs": [
{
"name": "stderr",
"output_type": "stream",
"text": [
"C:\\ProgramData\\Anaconda3\\lib\\site-packages\\gensim\\utils.py:1197: UserWarning: detected Windows; aliasing chunkize to chunkize_serial\n",
" warnings.warn(\"detected Windows; aliasing chunkize to chunkize_serial\")\n"
]
},
{
"name": "stdout",
"output_type": "stream",
"text": [
"[nltk_data] Downloading package wordnet to\n",
"[nltk_data] C:\\Users\\Rishiraj\\AppData\\Roaming\\nltk_data...\n",
"[nltk_data] Package wordnet is already up-to-date!\n"
]
},
{
"data": {
"text/plain": [
"True"
]
},
"execution_count": 3,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"import gensim\n",
"from gensim.utils import simple_preprocess\n",
"from gensim.parsing.preprocessing import STOPWORDS\n",
"from nltk.stem import WordNetLemmatizer, SnowballStemmer\n",
"from nltk.stem.porter import *\n",
"import numpy as np\n",
"np.random.seed(2018)\n",
"import nltk\n",
"nltk.download('wordnet')"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"#### Routines to Lammatize and remove stopword"
]
},
{
"cell_type": "code",
"execution_count": 27,
"metadata": {},
"outputs": [],
"source": [
"def lemmatize_stemming(text):\n",
" stemmer = SnowballStemmer(\"english\")\n",
" return stemmer.stem(WordNetLemmatizer().lemmatize(text, pos='v'))\n",
"\n",
"def preprocess(text):\n",
" result = []\n",
" for token in gensim.utils.simple_preprocess(text):\n",
" if token not in gensim.parsing.preprocessing.STOPWORDS and len(token) > 3:\n",
" result.append(lemmatize_stemming(token))\n",
" return result"
]
},
{
"cell_type": "code",
"execution_count": 28,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"0 [https, form, pollutionfre, aamaadmiparti]\n",
"1 [join, movement, preserv, environ, environ, de...\n",
"2 [environ, matter, delhi, pollut, join, movemen...\n",
"3 [environ, matter, delhi, pollut, join, movemen...\n",
"4 [environ, matter, delhi, pollut, join, movemen...\n",
"5 [survey, https, form, aisa, delhi, environ, gl...\n",
"6 [survey, https, form, aisa, delhi, environn, g...\n",
"7 [join, movement, creat, onlin, survey, know, f...\n",
"8 [human, chain, pledg, environ, august, ramja, ...\n",
"9 [delhiunivers, massiv, human, chain, student, ...\n",
"Name: text, dtype: object"
]
},
"execution_count": 28,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"processed_docs = document.map(preprocess)\n",
"processed_docs[:10]"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"#### Bag of Words"
]
},
{
"cell_type": "code",
"execution_count": 29,
"metadata": {},
"outputs": [],
"source": [
"dictionary = gensim.corpora.Dictionary(processed_docs)\n"
]
},
{
"cell_type": "code",
"execution_count": 30,
"metadata": {},
"outputs": [],
"source": [
"dictionary.filter_extremes(no_below=15, no_above=0.5, keep_n=100000)"
]
},
{
"cell_type": "code",
"execution_count": 25,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"'\\ncount = 0\\nfor k,v in dictionary.iteritems():\\n count+=1\\n print(k,v)\\n'"
]
},
"execution_count": 25,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"'''\n",
"count = 0\n",
"for k,v in dictionary.iteritems():\n",
" count+=1\n",
" print(k,v)\n",
"'''"
]
},
{
"cell_type": "code",
"execution_count": 31,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"[(2, 1),\n",
" (5, 1),\n",
" (17, 2),\n",
" (152, 1),\n",
" (171, 1),\n",
" (175, 1),\n",
" (180, 1),\n",
" (286, 1),\n",
" (327, 1),\n",
" (348, 2),\n",
" (374, 3),\n",
" (382, 1),\n",
" (394, 1),\n",
" (407, 1),\n",
" (533, 2),\n",
" (539, 1),\n",
" (651, 1),\n",
" (665, 1),\n",
" (1010, 1),\n",
" (1220, 1),\n",
" (1452, 1),\n",
" (1572, 1),\n",
" (1732, 1),\n",
" (2268, 1),\n",
" (2348, 1),\n",
" (2367, 1),\n",
" (2370, 1),\n",
" (3053, 1),\n",
" (3744, 1)]"
]
},
"execution_count": 31,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"bow_corpus = [dictionary.doc2bow(doc) for doc in processed_docs]\n",
"bow_corpus[4310]"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"#### TF-IDF"
]
},
{
"cell_type": "code",
"execution_count": 32,
"metadata": {},
"outputs": [],
"source": [
"from gensim import corpora, models\n",
"\n",
"tfidf = models.TfidfModel(bow_corpus)"
]
},
{
"cell_type": "code",
"execution_count": 36,
"metadata": {},
"outputs": [],
"source": [
"corpus_tfidf = tfidf[bow_corpus]"
]
},
{
"cell_type": "code",
"execution_count": 38,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"[(0, 0.4552169115485739), (1, 0.3289913754784397), (2, 0.1450076156012076), (3, 0.8145643189574624)]\n"
]
}
],
"source": [
"for doc in corpus_tfidf:\n",
" print(doc)\n",
" break"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"#### Running LDA using BoW"
]
},
{
"cell_type": "code",
"execution_count": 39,
"metadata": {},
"outputs": [],
"source": [
"lda_model = gensim.models.LdaMulticore(bow_corpus, num_topics=10, id2word=dictionary, passes=2, workers=2)"
]
},
{
"cell_type": "code",
"execution_count": 40,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Topic: 0 \n",
"Words: 0.052*\"pollut\" + 0.023*\"road\" + 0.021*\"timesofindia\" + 0.019*\"dust\" + 0.018*\"indiatim\" + 0.018*\"utm_sourc\" + 0.017*\"articleshow\" + 0.016*\"pollutionkil\" + 0.016*\"utm_medium\" + 0.014*\"smog\"\n",
"Topic: 1 \n",
"Words: 0.157*\"twitter\" + 0.097*\"status\" + 0.056*\"airpollut\" + 0.052*\"https\" + 0.048*\"delhipollut\" + 0.042*\"oddeven\" + 0.035*\"arvindkejriw\" + 0.029*\"delhiairqu\" + 0.014*\"delhichok\" + 0.011*\"pollut\"\n",
"Topic: 2 \n",
"Words: 0.073*\"india\" + 0.057*\"pollut\" + 0.042*\"https\" + 0.028*\"citi\" + 0.023*\"world\" + 0.022*\"airpollut\" + 0.020*\"embassi\" + 0.020*\"qualiti\" + 0.017*\"health\" + 0.016*\"news\"\n",
"Topic: 3 \n",
"Words: 0.075*\"airpollut\" + 0.069*\"twitter\" + 0.047*\"delhipollut\" + 0.038*\"delhiairqu\" + 0.023*\"delhichok\" + 0.020*\"delhismog\" + 0.019*\"http\" + 0.017*\"smog\" + 0.016*\"arvindkejriw\" + 0.014*\"pmoindia\"\n",
"Topic: 4 \n",
"Words: 0.054*\"water\" + 0.053*\"help\" + 0.052*\"wemeantoclean\" + 0.052*\"swachhbharat\" + 0.052*\"airpollut\" + 0.051*\"cleandelhi\" + 0.050*\"reduc\" + 0.050*\"twitter\" + 0.049*\"wast\" + 0.049*\"mycleanindia\"\n",
"Topic: 5 \n",
"Words: 0.071*\"qualiti\" + 0.040*\"pollut\" + 0.036*\"twitter\" + 0.035*\"airpollut\" + 0.032*\"https\" + 0.025*\"news\" + 0.023*\"sever\" + 0.022*\"poor\" + 0.018*\"airqual\" + 0.016*\"level\"\n",
"Topic: 6 \n",
"Words: 0.060*\"pollut\" + 0.037*\"burn\" + 0.025*\"punjab\" + 0.023*\"govt\" + 0.021*\"stubbl\" + 0.021*\"haryana\" + 0.020*\"state\" + 0.017*\"govern\" + 0.015*\"blame\" + 0.012*\"action\"\n",
"Topic: 7 \n",
"Words: 0.056*\"pollut\" + 0.021*\"twitter\" + 0.020*\"delhibachao\" + 0.019*\"mask\" + 0.019*\"https\" + 0.017*\"airpollut\" + 0.014*\"diwali\" + 0.012*\"smog\" + 0.012*\"delhipollut\" + 0.009*\"peopl\"\n",
"Topic: 8 \n",
"Words: 0.135*\"airqual\" + 0.135*\"like\" + 0.133*\"smoke\" + 0.128*\"cigarett\" + 0.128*\"hand\" + 0.127*\"second\" + 0.126*\"knowyourair\" + 0.003*\"bangladesh\" + 0.002*\"cricket\" + 0.002*\"embassi\"\n",
"Topic: 9 \n",
"Words: 0.043*\"pollut\" + 0.015*\"oddeven\" + 0.012*\"vehicl\" + 0.010*\"need\" + 0.009*\"citi\" + 0.009*\"mumbai\" + 0.009*\"reduc\" + 0.008*\"scheme\" + 0.008*\"minist\" + 0.008*\"work\"\n"
]
}
],
"source": [
"for idx, topic in lda_model.print_topics(-1):\n",
" print('Topic: {} \\nWords: {}'.format(idx, topic))"
]
},
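{
"cell_type": "markdown",
"metadata": {},
"source": [
"As a quick sanity check, the trained BoW model can infer the topic mixture of an unseen tweet. A minimal sketch using the `dictionary`, `preprocess` and `lda_model` defined above; the example sentence is made up.\n"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Infer the topic distribution of an unseen (made-up) tweet.\n",
"unseen = 'Air quality in Delhi is severe today, bring back the odd even scheme'\n",
"bow_vector = dictionary.doc2bow(preprocess(unseen))\n",
"for index, score in sorted(lda_model[bow_vector], key=lambda tup: -tup[1]):\n",
"    print('Score: {:.4f}\\t Topic: {}'.format(score, lda_model.print_topic(index, 5)))"
]
},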
{
"cell_type": "markdown",
"metadata": {},
"source": [
"#### Running LDA using TF-IDF"
]
},
{
"cell_type": "code",
"execution_count": 41,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Topic: 0 Word: 0.007*\"airpollut\" + 0.007*\"pollut\" + 0.006*\"twitter\" + 0.005*\"mask\" + 0.005*\"breath\" + 0.005*\"oxygen\" + 0.005*\"smog\" + 0.005*\"road\" + 0.004*\"peopl\" + 0.004*\"free\"\n",
"Topic: 1 Word: 0.061*\"delhiairqu\" + 0.053*\"oddeven\" + 0.043*\"delhibachao\" + 0.038*\"twitter\" + 0.032*\"delhipollut\" + 0.030*\"delhismog\" + 0.028*\"delhichok\" + 0.028*\"status\" + 0.027*\"arvindkejriw\" + 0.021*\"airpollut\"\n",
"Topic: 2 Word: 0.055*\"wastemanag\" + 0.055*\"segreg\" + 0.054*\"eypj\" + 0.054*\"wslcxtkst\" + 0.054*\"form\" + 0.053*\"mycleanindia\" + 0.053*\"visit\" + 0.053*\"wast\" + 0.052*\"contribut\" + 0.052*\"cleandelhi\"\n",
"Topic: 3 Word: 0.019*\"pollutionkil\" + 0.009*\"status\" + 0.008*\"twitter\" + 0.008*\"bangladesh\" + 0.007*\"narendramodi\" + 0.007*\"arvindkejriw\" + 0.007*\"delhipollut\" + 0.007*\"cricket\" + 0.007*\"pmoindia\" + 0.007*\"airpollut\"\n",
"Topic: 4 Word: 0.035*\"surfac\" + 0.013*\"airpollut\" + 0.011*\"pollutionp\" + 0.011*\"twitter\" + 0.009*\"newdelhi\" + 0.009*\"instagram\" + 0.009*\"chernobyl\" + 0.009*\"smog\" + 0.008*\"igshid\" + 0.008*\"https\"\n",
"Topic: 5 Word: 0.140*\"knowyourair\" + 0.138*\"second\" + 0.137*\"cigarett\" + 0.137*\"hand\" + 0.132*\"smoke\" + 0.120*\"airqual\" + 0.118*\"like\" + 0.004*\"delhip\" + 0.001*\"maxim\" + 0.001*\"delhiwint\"\n",
"Topic: 6 Word: 0.026*\"qualiti\" + 0.013*\"sever\" + 0.012*\"airpollut\" + 0.012*\"poor\" + 0.011*\"news\" + 0.011*\"categori\" + 0.011*\"wors\" + 0.010*\"knowyourairp\" + 0.010*\"level\" + 0.009*\"https\"\n",
"Topic: 7 Word: 0.011*\"india\" + 0.009*\"news\" + 0.009*\"airpollut\" + 0.009*\"https\" + 0.008*\"pollut\" + 0.008*\"world\" + 0.007*\"asia\" + 0.007*\"skymet\" + 0.007*\"level\" + 0.007*\"qualiti\"\n",
"Topic: 8 Word: 0.052*\"embassi\" + 0.040*\"india\" + 0.027*\"aqicn\" + 0.022*\"synop\" + 0.022*\"sourc\" + 0.022*\"observ\" + 0.021*\"organ\" + 0.021*\"synopt\" + 0.018*\"citi\" + 0.017*\"world\"\n",
"Topic: 9 Word: 0.009*\"burn\" + 0.008*\"pollut\" + 0.006*\"meteorolog\" + 0.006*\"india\" + 0.006*\"punjab\" + 0.006*\"peopl\" + 0.006*\"govt\" + 0.006*\"stubbl\" + 0.006*\"govern\" + 0.006*\"blame\"\n"
]
}
],
"source": [
"lda_model_tfidf = gensim.models.LdaMulticore(corpus_tfidf, num_topics=10, id2word=dictionary, passes=2, workers=4)\n",
"\n",
"\n",
"for idx, topic in lda_model_tfidf.print_topics(-1):\n",
" print('Topic: {} Word: {}'.format(idx, topic))"
]
},
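{
"cell_type": "markdown",
"metadata": {},
"source": [
"To compare the two models beyond eyeballing the word lists, topic coherence is one option. A minimal sketch with gensim's `CoherenceModel`; the `c_v` measure is a common default, not the only choice, and this can take a while on the full corpus.\n"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Compare the BoW and TF-IDF models via c_v topic coherence (higher is better).\n",
"from gensim.models import CoherenceModel\n",
"\n",
"texts = processed_docs.tolist()\n",
"for name, model in [('BoW', lda_model), ('TF-IDF', lda_model_tfidf)]:\n",
"    cm = CoherenceModel(model=model, texts=texts, dictionary=dictionary, coherence='c_v')\n",
"    print(name, 'coherence: {:.4f}'.format(cm.get_coherence()))"
]
},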
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Sentiment Analysis\n",
"\n",
"Valence Aware Dictionary and Sentiment Resoner (VADER) sentiment analysis model created for **social media** sentiment analysis. The model determines a prelimary sentiment score by examining the words. A score between -1 (strongly negative) and 1 (strongly positive) is alloted.\n",
"\n",
"The sentiment score is decided on preset of rules\n",
"\n",
"1. Exclamation marks and capitalization are treated as sentiment amplifiers\n",
"2. Conjunction and negations are taken care of\n",
"3. Takes care of degree modifiers to change score (eg. Good, better, best)\n",
"4. The model is also trained on emoticons and slangs\n",
"\n",
"**TODO**:\n",
"Curate 100-200 manual tweets and see how VADER performs"
]
},
{
"cell_type": "code",
"execution_count": 42,
"metadata": {},
"outputs": [],
"source": [
"from vaderSentiment.vaderSentiment import SentimentIntensityAnalyzer\n",
"analyser = SentimentIntensityAnalyzer()"
]
},
{
"cell_type": "code",
"execution_count": 43,
"metadata": {},
"outputs": [],
"source": [
"def sentiment_analyzer_scores(sentence):\n",
" score = analyser.polarity_scores(sentence)\n",
" print(\"{:-<40} {}\".format(sentence, str(score)))"
]
},
{
"cell_type": "code",
"execution_count": 44,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"The phone is super cool.---------------- {'neg': 0.0, 'neu': 0.326, 'pos': 0.674, 'compound': 0.7351}\n"
]
}
],
"source": [
"sentiment_analyzer_scores(\"The phone is super cool.\")"
]
},
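{
"cell_type": "markdown",
"metadata": {},
"source": [
"The preset rules can be seen in action below. The last two lines are a minimal sketch of scoring the whole corpus using the `data` DataFrame loaded above; VADER relies on casing, punctuation and emoticons, so it is applied to the raw tweet text rather than the stemmed tokens.\n"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Rule examples: capitalization/exclamation amplify, negation flips polarity.\n",
"sentiment_analyzer_scores(\"The phone is SUPER cool!!!\")\n",
"sentiment_analyzer_scores(\"The phone is not cool.\")\n",
"\n",
"# Sketch: compound score in [-1, 1] for every raw tweet in the corpus.\n",
"data['sentiment'] = data['text'].astype(str).apply(lambda t: analyser.polarity_scores(t)['compound'])\n",
"data['sentiment'].describe()"
]
},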
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": []
}
],
"metadata": {
"kernelspec": {
"display_name": "Python 3",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.6.5"
}
},
"nbformat": 4,
"nbformat_minor": 2
}