Gist by @mdvsh · Last active January 26, 2020
Comparing Sentiment Analysis Models (on the Amazon Review Dataset).
# Compare SentimentAnalysis models - The Julia Programming Language
## Python complement to the Task
### @author : PseudoCodeNerd

## Task Description

Use the Amazon review data from Kaggle to test the efficiency of the Sentiment Analysis models that live in TextAnalysis.jl. Compare them with models from the scikit-learn and spaCy Python libraries. Upload your results as an issue in the TextAnalysis package.

Some basic machine learning knowledge is useful for this task.

### Special thanks to Ayush Kaushal, an exemplary mentor without whom this task wouldn't have been possible.

Find below the `julia` part of the task. The Python notebook is attached too, though with sparser documentation.

> The process of algorithmically identifying and categorizing opinions expressed in text to determine the user's attitude toward the subject of the document (or post).

This is how I understand sentiment analysis.

It will be done here with a simple one-layer neural network. (My earlier NLTK model was lost to a blue screen of death, so let's just forget it!)

## Importing Required Packages

```python
import os
import re
import bz2

import numpy as np
import pandas as pd
from tqdm import tqdm  # for cool progress bars
from sklearn.utils import shuffle  # for shuffling data

import tensorflow as tf
from tensorflow.keras.layers import *
from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.preprocessing.sequence import pad_sequences

print(os.listdir("../input"))  # this notebook runs on Kaggle
```

# 1. Read & Preprocess data

```python
train = bz2.BZ2File('../input/train.ft.txt.bz2')
test = bz2.BZ2File('../input/test.ft.txt.bz2')

train_lines = train.readlines()
test_lines = test.readlines()
```

```python
def review_process(review):
    # drop the "__label__X " prefix, strip the trailing newline, lowercase
    review = review.split(' ', 1)[1][:-1].lower()
    review = re.sub(r'\d', '0', review)  # collapse every digit to 0
    if 'www.' in review or 'http:' in review or 'https:' in review or '.com' in review:
        # predefined regex (I don't like regexes): mask URLs with a token
        review = re.sub(r"([^ ]+(?<=\.[a-z]{3}))", "<url>", review)
    return review
```

```python
# Note: the lines were already decoded into parsable strings in an earlier cell,
# which Kaggle's Jupyter deleted and cannot restore.
def split_labels_rev(data):
    reviews = []
    labels = []
    for review in tqdm(data):
        # one-hot labels: __label__1 -> [1, 0], __label__2 -> [0, 1]
        if review.split(' ')[0] == '__label__1':
            label = [1, 0]
        else:
            label = [0, 1]
        rev = review_process(review)
        reviews.append(rev[:512])  # keep at most 512 characters
        labels.append(label)
    return reviews, labels
```

Splitting into labels and text:

```python
train_rev, y_train = split_labels_rev(train_lines)
test_rev, y_test = split_labels_rev(test_lines)
```

We'll now randomly shuffle our data.

```python
train_rev, y_train = shuffle(train_rev, y_train)
test_rev, y_test = shuffle(test_rev, y_test)
```

Let's convert our labels into arrays now.

```python
y_train = np.array(y_train)
y_test = np.array(y_test)
```

I learnt tokenizers, yeah!

```python
tokenizer = Tokenizer(num_words=8192)  # saw this implementation on a blog somewhere
tokenizer.fit_on_texts(train_rev)
# https://stackoverflow.com/questions/51699001/tokenizer-texts-to-sequences-keras-tokenizer-gives-almost-all-zeros
```

```python
train_tokens = tokenizer.texts_to_sequences(train_rev)
test_tokens = tokenizer.texts_to_sequences(test_rev)
```

### Getting to our neural net

Something else I learnt: a neural network can do almost anything, from character recognition to sentiment analysis to classification. It's so cool and easy to build.

```python
# https://www.tensorflow.org/api_docs/python/tf/keras/preprocessing/sequence/pad_sequences
f_train = pad_sequences(train_tokens, maxlen=128, padding='post')
f_test = pad_sequences(test_tokens, maxlen=128, padding='post')
```

```python
# finally, our model: the clichéd, over-utilised softmax / adam / cross-entropy combo
# 8192 is the maximum number of features (vocabulary size)
model = tf.keras.Sequential([
    Embedding(8192, 1, input_shape=(128,)),
    Flatten(),
    Dense(2, activation='softmax')
])
model.compile(optimizer='adam', loss='categorical_crossentropy', metrics=['acc'])
```

```python
# training for only 2 epochs, with a 0.2 validation split
model.fit(f_train, y_train, batch_size=16384, epochs=2, validation_split=0.2)
```

> # Result

```python
model.evaluate(f_test, y_test)
```

We get an accuracy of 0.87398, which is pretty good compared to our Julia model, especially considering how little preprocessing was involved. (Also because I don't know how to do more of it, lol.)

**END OF REPORT**
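As a quick sanity check (not part of the original notebook), the preprocessing function can be exercised on a synthetic fastText-style line; the sample review text here is made up for illustration:

```python
import re

def review_process(review):
    # drop the "__label__X " prefix, strip the trailing newline, lowercase
    review = review.split(' ', 1)[1][:-1].lower()
    review = re.sub(r'\d', '0', review)  # collapse every digit to 0
    if 'www.' in review or 'http:' in review or 'https:' in review or '.com' in review:
        review = re.sub(r"([^ ]+(?<=\.[a-z]{3}))", "<url>", review)  # mask URLs
    return review

line = "__label__2 Great product, 10/10! See www.example.com for details.\n"
print(review_process(line))  # great product, 00/00! see <url> for details.
```

Note that the digit collapsing runs before URL masking, and the URL regex only matches tokens ending in a dot followed by three lowercase letters (e.g. `.com`, `.org`), so oddball TLDs slip through.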
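For a sense of how tiny this classifier is, its parameter count can be worked out by hand (a back-of-the-envelope check, not from the notebook itself):

```python
# Embedding(8192, 1): one scalar weight per vocabulary entry
embedding_params = 8192 * 1

# Flatten() has no weights; it turns the (128, 1) output into a 128-dim vector
# Dense(2) on a 128-dim input: 128*2 weights plus 2 biases
dense_params = 128 * 2 + 2

total = embedding_params + dense_params
print(total)  # 8450
```

At roughly 8.5k parameters, the 0.87 test accuracy says more about how separable the review data is after tokenization than about the model's capacity.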