{
"cells": [
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# Compare SentimentAnalysis models - The Julia Programming Language\n",
"### @author : PseudoCodeNerd\n",
"## Task Description \n",
"Use the amazon review data from Kaggle to test the efficiency of our Sentiment Analysis models that live in TextAnalysis.jl. Compare it with models in ScikitLearn and Spacy python libraries. Upload your results as an issue in the TextAnalysis package.\n",
"\n",
"Some basic machine learning knowledge is useful for this task.\n",
"\n",
"### Special thanks to Ayush Kaushal; an exemplary mentor without whom this task wouldn't be possible.\n",
"\n",
"Find below, the `julia` part of the task. The python notebook would be attached too but would have sparse documentation.\n",
"\n",
"> The process of algorithmically identifying and categorizing opinions expressed in text to determine the user’s attitude toward the subject of the document (or post).\n",
"\n",
"This is how I understand it.\n",
"\n",
"## Importing Required Packages"
]
},
{
"cell_type": "code",
"execution_count": 1,
"metadata": {
"collapsed": true
},
"outputs": [],
"source": [
"using TextAnalysis, FileIO"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"I would be working on the test data since the train one is **humongous** and my laptop was unable to render that in Jupyter every single time even when left for about an hour. \n",
"\n",
"So, declaring the test reviews as `review` as evident by the code below."
]
},
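{
"cell_type": "markdown",
"metadata": {},
"source": [
"As an aside: if the humongous training file were ever needed, one way to avoid loading it all into memory would be to stream it line by line with `eachline`. This is just a sketch of the idea (using the same test file path), not what I actually ran.\n",
"\n",
"```julia\n",
"# Sketch: stream the file instead of materializing every line at once.\n",
"for (i, line) in enumerate(eachline(\"text/test.ft.txt\"))\n",
"    label = line[1:10]     # \"__label__1\" or \"__label__2\"\n",
"    review = line[11:end]\n",
"    # ... process one review at a time ...\n",
"    i >= 3 && break        # just peek at the first few lines\n",
"end\n",
"```"
]
},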
{
"cell_type": "code",
"execution_count": 2,
"metadata": {
"collapsed": false
},
"outputs": [
{
"data": {
"text/plain": [
"FileDocument(\"text/test.ft.txt\", TextAnalysis.DocumentMetadata(Languages.English(), \"text/test.ft.txt\", \"Unknown Author\", \"Unknown Time\"))"
]
},
"execution_count": 2,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"reviews = Document(\"text/test.ft.txt\")"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Getting to know some of our data. \n",
"\n",
"We can see that the `.txt` file contains reviews in the form of :\n",
"\n",
"\"**__label__1(/2)** *space* **...the review...**\" \n",
"\n",
"Exploratory data analysis also reveals that reviews beginning with `__label__2` are positive reviews. That means that their sentiment score would also be higher (I'll demonstrate that in a sec...)\n",
"Similarly, reviews beginning with `__label__1` are negative reviews and so their sentiment score should evidently be lower."
]
},
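{
"cell_type": "markdown",
"metadata": {},
"source": [
"To make the format concrete, here's a minimal sketch of splitting one raw line into its label and review text. The `parse_review` helper is my own name for illustration, not part of any package.\n",
"\n",
"```julia\n",
"# Hypothetical helper: split a raw dataset line into (label, text).\n",
"# The label is always the first 10 characters, followed by a space.\n",
"function parse_review(line::AbstractString)\n",
"    label = line[1:10]            # \"__label__1\" or \"__label__2\"\n",
"    text  = lstrip(line[11:end])  # drop the separating space\n",
"    return label, text\n",
"end\n",
"\n",
"parse_review(\"__label__2 Great CD: My lovely Pat has one of the GREAT voices...\")\n",
"# -> (\"__label__2\", \"Great CD: My lovely Pat has one of the GREAT voices...\")\n",
"```"
]
},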
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Getting our pre-trained Sentiment Analyser to check on these lines."
]
},
{
"cell_type": "code",
"execution_count": 3,
"metadata": {
"collapsed": false
},
"outputs": [
{
"name": "stderr",
"output_type": "stream",
"text": [
"┌ Info: CUDAdrv.jl failed to initialize, GPU functionality unavailable (set JULIA_CUDA_SILENT or JULIA_CUDA_VERBOSE to silence or expand this message)\n",
"└ @ CUDAdrv C:\\Users\\shekh\\.julia\\packages\\CUDAdrv\\3EzC1\\src\\CUDAdrv.jl:69\n"
]
},
{
"data": {
"text/plain": [
"Sentiment Analysis Model Trained on IMDB with a 88587 word corpus"
]
},
"execution_count": 3,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"sent = SentimentAnalyzer()"
]
},
{
"cell_type": "code",
"execution_count": 4,
"metadata": {
"collapsed": false
},
"outputs": [
{
"data": {
"text/plain": [
"3-element Array{String,1}:\n",
" \"__label__2 Great CD: My lovely Pat has one of the GREAT voices of her generation. I have listened to this CD for YEARS and I still LOVE IT. When I'm in a good mood it makes me feel better. A bad mood just evaporates like sugar in the rain. This CD just oozes LIFE. Vocals are jusat STUUNNING and lyrics just kill. One of life's hidden gems. This is a desert isle CD in my book. Why she never made it big is just beyond me. Everytime I play this, no matter black, white, young, old, male, female EVERYBODY says one thing \\\"Who was that singing ?\\\"\" \n",
" \"__label__2 One of the best game music soundtracks - for a game I didn't really play: Despite the fact that I have only played a small portion of the game, the music I heard (plus the connection to Chrono Trigger which was great as well) led me to purchase the soundtrack, and it remains one of my favorite albums. There is an incredible mix of fun, epic, and emotional songs. Those sad and beautiful tracks I especially like, as there's not too many of those kinds of songs in my other video game soundtracks. I must admit that one of the songs (Life-A Distant Promise) has brought tears to my eyes on many occasions.My one complaint about this soundtrack is that they use guitar fretting effects in many of the songs, which I find distracting. But even if those weren't included I would still consider the collection worth it.\"\n",
" \"__label__1 Batteries died within a year ...: I bought this charger in Jul 2003 and it worked OK for a while. The design is nice and convenient. However, after about a year, the batteries would not hold a charge. Might as well just get alkaline disposables, or look elsewhere for a charger that comes with batteries that have better staying power.\" "
]
},
"execution_count": 4,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"#seeing how the data is arranged.\n",
"readlines(\"text/test.ft.txt\")[1:3]"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Now, I'll see that for the first 10 reviews in our dataset, what the actual label is and what sentiment score does our model return. From this we'll be able to know that the model isn't perfect and does indeed predict wrong sentiments for some reviews. Thus, developing a need to do text pre-processing to make the reviews comparable and remove unnecessary stuff like urls and other things generally not contributing to the *read/feel* of the review. !"
]
},
{
"cell_type": "code",
"execution_count": 6,
"metadata": {
"collapsed": false
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"SNo. | Label | Prediction Score | Should be | Predicted | Correct/Incorrect \n",
"--------------------------------------------------------------------------------------\n",
"1 | __label__2 | 0.39506337 | +ve | -ve | Incorrect \n",
"2 | __label__2 | 0.5314957 | +ve | +ve | Correct \n",
"3 | __label__1 | 0.52432084 | -ve | +ve | Incorrect \n",
"4 | __label__2 | 0.5501878 | +ve | +ve | Correct \n",
"5 | __label__2 | 0.5919624 | +ve | +ve | Correct \n",
"6 | __label__1 | 0.61544746 | -ve | +ve | Incorrect \n",
"7 | __label__1 | 0.732198 | -ve | +ve | Incorrect \n",
"8 | __label__1 | 0.55473757 | -ve | +ve | Incorrect \n",
"9 | __label__2 | 0.4127747 | +ve | -ve | Incorrect \n",
"10 | __label__1 | 0.58470565 | -ve | +ve | Incorrect \n",
"11 | __label__2 | 0.5855292 | +ve | +ve | Correct \n",
"12 | __label__1 | 0.51694876 | -ve | +ve | Incorrect \n",
"13 | __label__1 | 0.5547061 | -ve | +ve | Incorrect \n",
"14 | __label__2 | 0.45876318 | +ve | -ve | Incorrect \n",
"15 | __label__1 | 0.52366424 | -ve | +ve | Incorrect \n"
]
}
],
"source": [
"tab = \"SNo. | Label | Prediction Score | Should be | Predicted | Correct/Incorrect \"\n",
"println(tab)\n",
"println(\"-\"^(length(tab)+5))\n",
"for i in 1:15\n",
" label = readlines(\"text/test.ft.txt\")[i][1:10];\n",
" review = readlines(\"text/test.ft.txt\")[i][11:end];\n",
" review = StringDocument(review);\n",
" pred = sent(review);\n",
" if label == \"__label__2\"\n",
" should_be = \"+ve\"\n",
" else\n",
" should_be = \"-ve\"\n",
" end\n",
" if pred >= 0.5\n",
" pred_be = \"+ve\"\n",
" elseif pred < 0.5\n",
" pred_be = \"-ve\"\n",
" end\n",
" if pred_be == should_be\n",
" correct = \"Correct\"\n",
" else\n",
" correct = \"Incorrect\"\n",
" end\n",
" println(\"$i | $label | $pred | $should_be | $pred_be | $correct \")\n",
"end"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"It's clear that our model isn't optimal since out of 15 samples, only **4** were correct predictions. However, I went a little too harsh on the model since in some cases, like in\n",
"\n",
"`14 | __label__2 | 0.45876318 | +ve | -ve | Incorrect`\n",
"\n",
"the model was within some limit of correct predictions. So yeah, sorry Mr. Sentiment Analyzer.\n",
"\n",
"Moving on towards trying to improve the accuracy of predcitions by performing some general pre-defined text-processing functions in TextAnalysis package. But first, I want to know the length of our test data set so I can make batches of processing accrodinly to my computational powers."
]
},
{
"cell_type": "code",
"execution_count": 5,
"metadata": {
"collapsed": false
},
"outputs": [
{
"data": {
"text/plain": [
"400000"
]
},
"execution_count": 5,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"test_data = readlines(\"text/test.ft.txt\")\n",
"length(test_data)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Ok, so now we know the size of the data we're dealing with let's get started with the pre-processing."
]
},
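{
"cell_type": "markdown",
"metadata": {},
"source": [
"First, the batching idea mentioned above, as a sketch using `Iterators.partition` from Base. (In the end, the loop below runs in one go; `batch_size` is an arbitrary choice of mine.)\n",
"\n",
"```julia\n",
"# Sketch: process the 400,000 test reviews in batches of 10,000.\n",
"batch_size = 10_000\n",
"for batch in Iterators.partition(test_data, batch_size)\n",
"    for line in batch\n",
"        # score one review here, e.g. sent(StringDocument(line[11:end]))\n",
"    end\n",
"end\n",
"```"
]
},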
{
"cell_type": "markdown",
"metadata": {},
"source": [
"A true positive is an outcome where the model correctly predicts the positive class. Similarly, a true negative is an outcome where the model correctly predicts the negative class.\n",
"\n",
"A false positive is an outcome where the model incorrectly predicts the positive class. And a false negative is an outcome where the model incorrectly predicts the negative class.\n",
"\n"
]
},
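{
"cell_type": "markdown",
"metadata": {},
"source": [
"The loop below tallies exactly these four outcomes with four counters. As a compact illustration, the same bookkeeping can be written as a 2×2 confusion matrix (just a sketch; the notebook sticks with the four counters):\n",
"\n",
"```julia\n",
"# Sketch: a 2×2 confusion matrix, rows = actual, columns = predicted.\n",
"# cm[1,1] = TP, cm[1,2] = FN, cm[2,1] = FP, cm[2,2] = TN\n",
"cm = zeros(Int, 2, 2)\n",
"actual_pos, pred_pos = true, false              # one hypothetical outcome\n",
"cm[actual_pos ? 1 : 2, pred_pos ? 1 : 2] += 1   # here: a false negative\n",
"```"
]
},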
{
"cell_type": "code",
"execution_count": 33,
"metadata": {
"collapsed": false
},
"outputs": [
{
"ename": "BoundsError",
"evalue": "BoundsError: attempt to access 32×5000 Array{Float32,2} at index [Base.Slice(Base.OneTo(32)), 5001]",
"output_type": "error",
"traceback": [
"BoundsError: attempt to access 32×5000 Array{Float32,2} at index [Base.Slice(Base.OneTo(32)), 5001]",
"",
"Stacktrace:",
" [1] throw_boundserror(::Array{Float32,2}, ::Tuple{Base.Slice{Base.OneTo{Int64}},Int64}) at .\\abstractarray.jl:538",
" [2] checkbounds at .\\abstractarray.jl:503 [inlined]",
" [3] _getindex at .\\multidimensional.jl:669 [inlined]",
" [4] getindex at .\\abstractarray.jl:981 [inlined]",
" [5] embedding(::Array{Float32,2}, ::Array{Float64,1}) at C:\\Users\\shekh\\.julia\\packages\\TextAnalysis\\pcFQf\\src\\sentiment.jl:27",
" [6] (::TextAnalysis.var\"#24#25\"{Dict{Symbol,Any}})(::Array{Float64,1}) at C:\\Users\\shekh\\.julia\\packages\\TextAnalysis\\pcFQf\\src\\sentiment.jl:40",
" [7] get_sentiment(::TextAnalysis.var\"#26#27\", ::Array{String,1}, ::Dict{Symbol,Any}, ::Dict{String,Any}) at C:\\Users\\shekh\\.julia\\packages\\TextAnalysis\\pcFQf\\src\\sentiment.jl:59",
" [8] (::TextAnalysis.SentimentModel)(::Function, ::Array{String,1}) at C:\\Users\\shekh\\.julia\\packages\\TextAnalysis\\pcFQf\\src\\sentiment.jl:87",
" [9] SentimentAnalyzer at C:\\Users\\shekh\\.julia\\packages\\TextAnalysis\\pcFQf\\src\\sentiment.jl:103 [inlined] (repeats 2 times)",
" [10] top-level scope at .\\In[33]:28"
]
}
],
"source": [
"test_labels = []\n",
"test_string = []\n",
"fal_pos = 0\n",
"fal_neg = 0\n",
"tru_pos = 0\n",
"tru_neg = 0\n",
"for i in 1:length(test_data)\n",
" label =test_data[i][1:10];\n",
" push!(test_labels, label);\n",
" review = test_data[i][11:end];\n",
" push!(test_string, review)\n",
" #after adding reviews and labels in their respective arrays.\n",
" #I'll perform pre-processing on individual reviews.\n",
" review_sd = StringDocument(review)\n",
" \n",
" remove_corrupt_utf8!(review_sd)\n",
" stem!(review_sd)\n",
" remove_case!(review_sd)\n",
" #remove_indefinite_articles!(review_sd)\n",
" #remove_definite_articles!(review_sd)\n",
" \n",
" if label == \"__label__2\"\n",
" should_be = \"+ve\"\n",
" else\n",
" should_be = \"-ve\"\n",
" end\n",
" \n",
" pred = sent(review_sd)\n",
" \n",
" if pred >= 0.5\n",
" pred_be = \"+ve\"\n",
" elseif pred < 0.5\n",
" pred_be = \"-ve\"\n",
" end\n",
" \n",
" if pred_be == \"+ve\" && should_be == \"+ve\"\n",
" tru_pos += 1\n",
" elseif pred_be == \"-ve\" && should_be == \"-ve\"\n",
" tru_neg += 1\n",
" elseif pred_be == \"-ve\" && should_be == \"+ve\"\n",
" fal_pos += 1\n",
" elseif pred_be == \"+ve\" && should_be == \"-ve\"\n",
" fal_neg += 1\n",
" end\n",
" \n",
"end"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Ahh! Finally it's complete.\n",
"\n",
"We get `BoundsError: attempt to access 32×5000 Array{Float32,2} at index [Base.Slice(Base.OneTo(32)), 5001]` error however on seeing this issue on TextAnalysis package. \n",
"\n",
"> Ref: [BoundsError in sentiment analysis](https://github.com/JuliaText/TextAnalysis.jl/issues/160)\n",
"\n",
"I've decided to ignore it. Let's get on towards calculating our predictions metrices: Precision / F1Score / Recall.\n"
]
},
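{
"cell_type": "markdown",
"metadata": {},
"source": [
"Before the metrics, a quick aside for completeness: one way to keep the loop running past the reviews that trigger this known issue is to wrap the model call in `try`/`catch` and skip the offending review. A sketch of the idea, not what was run above:\n",
"\n",
"```julia\n",
"# Sketch: skip reviews that hit the known BoundsError\n",
"# (JuliaText/TextAnalysis.jl issue #160) instead of aborting the loop.\n",
"skipped = 0\n",
"for line in test_data\n",
"    doc = StringDocument(line[11:end])\n",
"    pred = try\n",
"        sent(doc)\n",
"    catch e\n",
"        e isa BoundsError || rethrow()\n",
"        skipped += 1\n",
"        continue   # leave this review unscored\n",
"    end\n",
"    # ... tally tru_pos / tru_neg / fal_pos / fal_neg as before ...\n",
"end\n",
"println(\"Skipped $skipped reviews due to the known issue.\")\n",
"```"
]
},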
{
"cell_type": "markdown",
"metadata": {},
"source": [
"$$P = \\frac{T_p}{T_p+F_p}$$"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"$$R = \\frac{T_p}{T_p + F_n}$$"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"$$ F1 = \\frac{2 \\cdot P\\cdot R}{P+ R} $$"
]
},
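{
"cell_type": "markdown",
"metadata": {},
"source": [
"The next three cells compute these one at a time. For reference, here are the same three formulas rolled into a single helper (the `prf1` name is mine):\n",
"\n",
"```julia\n",
"# Hypothetical helper implementing the three formulas above.\n",
"function prf1(tp, fp, fn)\n",
"    p  = tp / (tp + fp)\n",
"    r  = tp / (tp + fn)\n",
"    f1 = 2 * p * r / (p + r)\n",
"    return (precision = p, recall = r, f1 = f1)\n",
"end\n",
"\n",
"prf1(100, 80, 60)  # -> (precision ≈ 0.556, recall = 0.625, f1 ≈ 0.588)\n",
"```"
]
},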
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Precision"
]
},
{
"cell_type": "code",
"execution_count": 38,
"metadata": {
"collapsed": false
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Precision is 0.583117838593833\n"
]
}
],
"source": [
"precision = tru_pos / (tru_pos + fal_pos)\n",
"println(\"Precision is $precision\")"
]
},
{
"cell_type": "code",
"execution_count": 39,
"metadata": {
"collapsed": false
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Recall is 0.5144996465068449\n"
]
}
],
"source": [
"recall = tru_pos / (tru_pos + fal_neg)\n",
"println(\"Recall is $recall\")"
]
},
{
"cell_type": "code",
"execution_count": 40,
"metadata": {
"collapsed": false
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"F1Score is 0.5466638895622987.\n"
]
}
],
"source": [
"f1score = (2 * precision * recall) / (precision + recall)\n",
"#f1score is from 0 --> 1\n",
"println(\"F1Score is $f1score.\")"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## End of Report"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": true
},
"outputs": [],
"source": []
}
],
"metadata": {
"kernelspec": {
"display_name": "Julia 1.3.0",
"language": "julia",
"name": "julia-1.3"
},
"language_info": {
"file_extension": ".jl",
"mimetype": "application/julia",
"name": "julia",
"version": "1.3.0"
}
},
"nbformat": 4,
"nbformat_minor": 2
}