Gist by @mdvsh · Last active January 26, 2020
Comparing Sentiment Analysis Models (on the Amazon Review Dataset).
# Compare SentimentAnalysis models - The Julia Programming Language
## Python complement to the Task
### @author : PseudoCodeNerd

## Task Description

Use the Amazon review data from Kaggle to test the efficiency of the Sentiment Analysis models that live in TextAnalysis.jl. Compare them with models from the scikit-learn and spaCy Python libraries. Upload your results as an issue in the TextAnalysis package.

Some basic machine learning knowledge is useful for this task.

### Special thanks to Ayush Kaushal, an exemplary mentor without whom this task wouldn't have been possible.

Find below the `julia` part of the task. The Python notebook is attached too, though with sparser documentation.

> The process of algorithmically identifying and categorizing opinions expressed in text to determine the user's attitude toward the subject of the document (or post).

This is how I understand sentiment analysis.

It will be done here with a simple one-layer neural network. (My earlier NLTK model was lost to a blue screen of death, so let's just forget it!)

## Importing Required Packages

```python
import os
import re
import bz2

import numpy as np
import pandas as pd
from tqdm import tqdm  # for cool progress bars
from sklearn.utils import shuffle  # for shuffling data

import tensorflow as tf
from tensorflow.keras.layers import *
from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.preprocessing.sequence import pad_sequences

print(os.listdir("../input"))  # this notebook runs on Kaggle
```

# 1. Read & Preprocess data

```python
train = bz2.BZ2File('../input/train.ft.txt.bz2')
test = bz2.BZ2File('../input/test.ft.txt.bz2')

train_lines = train.readlines()
test_lines = test.readlines()
```

```python
def review_process(review):
    # drop the "__label__X " prefix, strip the trailing newline, lowercase
    review = review.split(' ', 1)[1][:-1].lower()
    review = re.sub(r'\d', '0', review)  # collapse every digit to 0
    if 'www.' in review or 'http:' in review or 'https:' in review or '.com' in review:
        # predefined regex (I don't like regexes): mask URLs with a token
        review = re.sub(r"([^ ]+(?<=\.[a-z]{3}))", "<url>", review)
    return review
```

```python
# Note: the lines were already decoded into parsable strings in an earlier cell,
# which Kaggle's Jupyter deleted and cannot restore.
def split_labels_rev(data):
    reviews = []
    labels = []
    for review in tqdm(data):
        # one-hot labels: __label__1 -> [1, 0], __label__2 -> [0, 1]
        if review.split(' ')[0] == '__label__1':
            label = [1, 0]
        else:
            label = [0, 1]
        rev = review_process(review)
        reviews.append(rev[:512])  # keep at most 512 characters
        labels.append(label)
    return reviews, labels
```

Splitting into labels and text:

```python
train_rev, y_train = split_labels_rev(train_lines)
test_rev, y_test = split_labels_rev(test_lines)
```

We'll now randomly shuffle our data.

```python
train_rev, y_train = shuffle(train_rev, y_train)
test_rev, y_test = shuffle(test_rev, y_test)
```

Let's convert our labels into arrays now.

```python
y_train = np.array(y_train)
y_test = np.array(y_test)
```

I learnt tokenizers, yeah!

```python
tokenizer = Tokenizer(num_words=8192)  # saw this implementation on a blog somewhere
tokenizer.fit_on_texts(train_rev)
# https://stackoverflow.com/questions/51699001/tokenizer-texts-to-sequences-keras-tokenizer-gives-almost-all-zeros
```

```python
train_tokens = tokenizer.texts_to_sequences(train_rev)
test_tokens = tokenizer.texts_to_sequences(test_rev)
```

### Getting to our neural net

Something else I learnt: a neural network can do almost anything, from character recognition to sentiment analysis to classification. It's so cool and easy to build.

```python
# https://www.tensorflow.org/api_docs/python/tf/keras/preprocessing/sequence/pad_sequences
f_train = pad_sequences(train_tokens, maxlen=128, padding='post')
f_test = pad_sequences(test_tokens, maxlen=128, padding='post')
```

```python
# finally, our model: the clichéd, over-utilised softmax / adam / cross-entropy combo
# 8192 is the maximum number of features (vocabulary size)
model = tf.keras.Sequential([
    Embedding(8192, 1, input_shape=(128,)),
    Flatten(),
    Dense(2, activation='softmax')
])
model.compile(optimizer='adam', loss='categorical_crossentropy', metrics=['acc'])
```

```python
# training for only 2 epochs, with a 0.2 validation split
model.fit(f_train, y_train, batch_size=16384, epochs=2, validation_split=0.2)
```

> # Result

```python
model.evaluate(f_test, y_test)
```

We get an accuracy of 0.87398, which is pretty good compared to our Julia model, especially considering how little preprocessing was involved. (Also because I don't know how to do more of it, lol.)

**END OF REPORT**
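As a quick sanity check (not part of the original notebook), the preprocessing function can be exercised on a synthetic fastText-style line; the sample review text here is made up for illustration:

```python
import re

def review_process(review):
    # drop the "__label__X " prefix, strip the trailing newline, lowercase
    review = review.split(' ', 1)[1][:-1].lower()
    review = re.sub(r'\d', '0', review)  # collapse every digit to 0
    if 'www.' in review or 'http:' in review or 'https:' in review or '.com' in review:
        review = re.sub(r"([^ ]+(?<=\.[a-z]{3}))", "<url>", review)  # mask URLs
    return review

line = "__label__2 Great product, 10/10! See www.example.com for details.\n"
print(review_process(line))  # great product, 00/00! see <url> for details.
```

Note that the digit collapsing runs before URL masking, and the URL regex only matches tokens ending in a dot followed by three lowercase letters (e.g. `.com`, `.org`), so oddball TLDs slip through.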
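For a sense of how tiny this classifier is, its parameter count can be worked out by hand (a back-of-the-envelope check, not from the notebook itself):

```python
# Embedding(8192, 1): one scalar weight per vocabulary entry
embedding_params = 8192 * 1

# Flatten() has no weights; it turns the (128, 1) output into a 128-dim vector
# Dense(2) on a 128-dim input: 128*2 weights plus 2 biases
dense_params = 128 * 2 + 2

total = embedding_params + dense_params
print(total)  # 8450
```

At roughly 8.5k parameters, the 0.87 test accuracy says more about how separable the review data is after tokenization than about the model's capacity.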