sujnesh/6d5a8683-1611-4c6b-8d5f-a0fc9d538053.ipynb

## 6d5a8683-1611-4c6b-8d5f-a0fc9d538053.ipynb
{"nbformat":4,"nbformat_minor":5,"metadata":{"kernelspec":{"display_name":"Python 3","language":"python","name":"python3"},"language_info":{"codemirror_mode":{"name":"ipython","version":3},"file_extension":".py","mimetype":"text/x-python","name":"python","nbconvert_exporter":"python","pygments_lexer":"ipython3","version":"3.8.6"},"toc-autonumbering":false,"toc-showcode":false,"toc-showmarkdowntxt":false,"toc-showtags":false,"colab":{"name":"Starter Notebook.ipynb","provenance":[{"file_id":"15ud38uJb_8bHaHKp8xoSavlJnPoWhQ0u","timestamp":1614347270118}],"toc_visible":true}},"cells":[{"cell_type":"markdown","metadata":{"id":"boolean-morrison"},"source":["<div style=\"text-align: center\">\n","  <a href=\"https://www.aicrowd.com/challenges/dlnlp-note\"><img alt=\"AIcrowd\" src=\"https://gitlab.aicrowd.com/S.Rathi/iit-b-notebook-misc/-/raw/S.Rathi-master-patch-59012/creative_updated%20on%208.2.21_1%20_desktopbanner.jpg\"></a>\n","</div>"],"id":"boolean-morrison"},{"cell_type":"markdown","metadata":{"heading_collapsed":"true","tags":[],"id":"everyday-commerce"},"source":["# Assignment 1, Sentiment Analysis\n","\n","In this assignment, you’re tasked with identifying the rating of a review using sentiment analysis.\n","\n","The training dataset consists of {sentence id, review, rating}. Given a review, your neural network should assign it a rating score from 1 (low) to 5 (high).\n","\n","Do note:\n","\n","* Perform sentiment analysis using a neural network with only an input layer and an output layer (with softmax as the activation function).\n","* You’re not allowed to have hidden layers.\n","\n","# How to use this notebook? 📝\n","\n","<p style=\"text-align: center\"><img src=\"https://gitlab.aicrowd.com/aicrowd/assets/-/raw/master/notebook/aicrowd_notebook_submission_flow.png?inline=false\" alt=\"notebook overview\" style=\"width: 650px;\"/></p>\n","\n","- **Update the config parameters**. 
You can define the common variables here\n","\n","Variable | Description\n","--- | ---\n","`TRAIN_DATA_PATH` | Path to the file containing training data (The data will be available at `/data/`).\n","`TEST_DATA_PATH` | Path to the file containing test data (The data will be available at `/data/`).\n","`PREDICTIONS_PATH` | Path to write the output to.\n","`ASSETS_DIR` | In case your notebook needs additional files (like model weights, etc.), you can add them to a directory and specify the path to the directory here (please specify a relative path). The contents of this directory will be sent to AIcrowd for evaluation.\n","`API_KEY` | In order to submit your code to AIcrowd, you need to provide your account's API key. This key is available at https://www.aicrowd.com/participants/me\n","\n","- **Installing packages**. Please use the [Install packages 🗃](#install-packages-) section to install the packages\n","- **Training your models**. All the code within the [Training phase ⚙️](#training-phase-) section will be skipped during evaluation. **Please make sure to save your model weights in the assets directory and load them in the predictions phase section** "],"id":"everyday-commerce"},{"cell_type":"markdown","metadata":{"id":"LjUlX3D8X-4z"},"source":["# Dataset Specifications 💾\n","\n","* **train.csv**: has 3 columns; the latter two are 'reviews' and the corresponding 'ratings'.\n","* **test.csv**: has 2 columns; the latter is 'reviews'. You will have to predict the corresponding 'ratings'.\n","* 'ratings' in predictions should be integers in the range \\[1,5\\], i.e. {1,2,3,4,5}"],"id":"LjUlX3D8X-4z"},{"cell_type":"markdown","metadata":{"tags":[],"id":"victorian-vegetarian"},"source":["# Setup AIcrowd Utilities 🛠\n","\n","We use this to bundle the files for submission and create a submission on AIcrowd. 
Do not edit this block."],"id":"victorian-vegetarian"},{"cell_type":"code","metadata":{"id":"alike-sally"},"source":["!pip install -U git+https://gitlab.aicrowd.com/aicrowd/aicrowd-cli.git@notebook-submission-v2 > /dev/null"],"id":"alike-sally","execution_count":null,"outputs":[]},{"cell_type":"code","metadata":{"id":"hungarian-wedding"},"source":["%load_ext aicrowd.magic"],"id":"hungarian-wedding","execution_count":null,"outputs":[]},{"cell_type":"markdown","metadata":{"tags":[],"id":"homeless-major"},"source":["# Install packages 🗃\n","\n","Please add all package installations in this section"],"id":"homeless-major"},{"cell_type":"code","metadata":{"id":"interesting-married"},"source":["!pip install numpy pandas"],"id":"interesting-married","execution_count":null,"outputs":[]},{"cell_type":"markdown","metadata":{"id":"0yRp35EtZGZf"},"source":["# Import necessary modules and packages 📚"],"id":"0yRp35EtZGZf"},{"cell_type":"code","metadata":{"id":"0tTmvBpIZEoF"},"source":["import os\n","import pandas as pd\n","import numpy as np\n","\n","#Add your necessary modules & packages here"],"id":"0tTmvBpIZEoF","execution_count":null,"outputs":[]},{"cell_type":"markdown","metadata":{"tags":[],"id":"accomplished-appraisal"},"source":["# AIcrowd Runtime Configuration 🧷\n","\n","Define configuration parameters. Please include any files needed for the notebook to run under `ASSETS_DIR`. 
We will copy the contents of this directory to your final submission file 🙂\n","\n","The dataset is available under `/data` on the workspace."],"id":"accomplished-appraisal"},{"cell_type":"code","metadata":{"id":"norwegian-mystery"},"source":["class AIcrowdConfig:\n","  DATASET_DIR = \"data\"\n","  TEST_DATA_PATH = os.path.join(DATASET_DIR, \"test.csv\")\n","  TRAIN_DATA_PATH = os.path.join(DATASET_DIR, \"train.csv\")\n","  PREDICTIONS_PATH = \"predictions.csv\"\n","  ASSETS_DIR = \"assets\"\n","  API_KEY = \"\" # Get your key from https://www.aicrowd.com/participants/me (ctrl + click the link)\n"],"id":"norwegian-mystery","execution_count":null,"outputs":[]},{"cell_type":"markdown","metadata":{"id":"0ZVSnda9gGOb"},"source":["# Download the dataset 📲\n","AIcrowd magic functions will download the dataset after authenticating your API key."],"id":"0ZVSnda9gGOb"},{"cell_type":"code","metadata":{"id":"_chH9s63gKwd"},"source":["%aicrowd login --api-key \"$AIcrowdConfig.API_KEY\"\n","%aicrowd dataset download -c dlnlp-note"],"id":"_chH9s63gKwd","execution_count":null,"outputs":[]},{"cell_type":"markdown","metadata":{"id":"SOjBQGBogdsO"},"source":["Move the downloaded dataset into the `data` directory"],"id":"SOjBQGBogdsO"},{"cell_type":"code","metadata":{"id":"DuaY7NXUgcFG"},"source":["!mkdir -p $AIcrowdConfig.DATASET_DIR\n","!mv train.csv $AIcrowdConfig.DATASET_DIR\n","!mv test.csv $AIcrowdConfig.DATASET_DIR"],"id":"DuaY7NXUgcFG","execution_count":null,"outputs":[]},{"cell_type":"markdown","metadata":{"tags":[],"id":"intense-formation"},"source":["# Define preprocessing code 💻\n","\n","The code that is common between the training and the predictions sections should be defined here. During evaluation, we completely skip the training section. 
Please make sure to add any common logic between the training and prediction sections here."],"id":"intense-formation"},{"cell_type":"code","metadata":{"id":"comparative-ethics"},"source":["'''\n","About the task:\n","\n","You are provided with a code flow, which consists of functions that must be implemented (MANDATORY).\n","\n","You need to implement each of the functions mentioned below; you may add your own function parameters if needed.\n","'''\n","\n","\n","def encode_data(text):\n","    # This function will be used to encode the reviews using a dictionary (created using the corpus vocabulary)\n","    \n","    # Example of encoding: \"The food was fabulous but pricey\" has a vocabulary of 6 words, each one has to be mapped to an integer like:\n","    # {'The':1, 'food':2, 'was':3, 'fabulous':4, 'but':5, 'pricey':6}; this vocabulary has to be created for the entire corpus and then be used to\n","    # encode the words into integers\n","\n","    # return encoded examples\n","    pass\n","\n","\n","\n","def convert_to_lower(text):\n","    # return the reviews after converting them to lowercase\n","    pass\n","\n","\n","def remove_punctuation(text):\n","    # return the reviews after removing punctuation\n","    pass\n","\n","\n","def remove_stopwords(text):\n","    # return the reviews after removing the stopwords\n","    pass\n","\n","def perform_tokenization(text):\n","    # return the reviews after performing tokenization\n","    pass\n","\n","\n","def perform_padding(data):\n","    # return the reviews after padding the reviews to maximum length\n","    pass\n","\n","def preprocess_data(data):\n","    # make all the following function calls on your data\n","\n","    review = data[\"reviews\"]\n","    review = convert_to_lower(review)\n","    review = remove_punctuation(review)\n","    review = remove_stopwords(review)\n","    review = perform_tokenization(review)\n","    review = encode_data(review)\n","    processed_data = perform_padding(review)\n","\n","    # return 
processed_data # Uncomment this\n","    # Remove this dummy code at the bottom\n","    return np.zeros( (len(data[\"reviews\"]), 100) ) \n"],"id":"comparative-ethics","execution_count":null,"outputs":[]},{"cell_type":"markdown","metadata":{"id":"EIjrOo_WZlEH"},"source":["# Define your Softmax function\n","\n","You have to write your own implementation from scratch and return the softmax values (using a predefined softmax is prohibited)."],"id":"EIjrOo_WZlEH"},{"cell_type":"code","metadata":{"id":"UNhwLc2IZmZ_"},"source":["def softmax_activation(x):\n","    # write your implementation here\n","    pass"],"id":"UNhwLc2IZmZ_","execution_count":null,"outputs":[]},{"cell_type":"markdown","metadata":{"tags":[],"id":"associate-reference"},"source":["# Training phase ⚙️\n","\n","You can define your training code here. This section will be skipped during evaluation."],"id":"associate-reference"},{"cell_type":"markdown","metadata":{"id":"yymek65oZ1zA"},"source":["## Define your model\n","You should define your model-related methods here using the given template"],"id":"yymek65oZ1zA"},{"cell_type":"code","metadata":{"id":"north-organic"},"source":["# Example with tensorflow, but you can replace with pytorch\n","# For better code add all imports to the top cell marked for imports\n","import tensorflow\n","from tensorflow import keras"],"id":"north-organic","execution_count":null,"outputs":[]},{"cell_type":"code","metadata":{"id":"DroMMx9QaS5K"},"source":["class NeuralNet:\n","\n","    def __init__(self, reviews, ratings):\n","\n","        self.reviews = reviews\n","        self.ratings = ratings\n","\n","\n","    def build_nn(self):\n","        #add the input and output layer here; you can use either tensorflow or pytorch\n","        model = keras.models.Sequential()\n","        model.add(keras.layers.Input((100,)))\n","        model.add(keras.layers.Dense(np.max(self.ratings)+1, activation='softmax') )\n","\n","        ####### Use the softmax activation that you wrote code for above 
#####\n","        \n","        model.compile(optimizer='adam', loss='sparse_categorical_crossentropy')\n","\n","        self.model = model\n","\n","    def train_nn(self,batch_size,epochs):\n","        # write the training loop here; you can use either tensorflow or pytorch\n","        # print validation accuracy\n","        self.model.fit(x=self.reviews, y=self.ratings, batch_size=batch_size, epochs=epochs)\n","\n","    def predict(self, reviews):\n","        # return the class probabilities predicted by the trained model (one row per review)\n","\n","        return self.model.predict(reviews)"],"id":"DroMMx9QaS5K","execution_count":null,"outputs":[]},{"cell_type":"markdown","metadata":{"id":"straight-lemon"},"source":["\n","\n","##  Load training data 💻"],"id":"straight-lemon"},{"cell_type":"code","metadata":{"colab":{"base_uri":"https://localhost:8080/","height":204},"id":"bvsIe4NVgsJx","executionInfo":{"status":"ok","timestamp":1614346846278,"user_tz":-330,"elapsed":1720,"user":{"displayName":"Sudarsh Rathi","photoUrl":"https://lh3.googleusercontent.com/a-/AOh14Gj3qELh8V4mCIujE5HqHQHHGpz-1_qYSdp4VG-i=s64","userId":"04293397634187217051"}},"outputId":"91a280f6-8c68-4c6b-97d1-3f998106045c"},"source":["train_data = pd.read_csv(AIcrowdConfig.TRAIN_DATA_PATH)\n","train_data.head()"],"id":"bvsIe4NVgsJx","execution_count":null,"outputs":[{"output_type":"execute_result","data":{"text/html":["<div>\n","<style scoped>\n","    .dataframe tbody tr th:only-of-type {\n","        vertical-align: middle;\n","    }\n","\n","    .dataframe tbody tr th {\n","        vertical-align: top;\n","    }\n","\n","    .dataframe thead th {\n","        text-align: right;\n","    }\n","</style>\n","<table border=\"1\" class=\"dataframe\">\n","  <thead>\n","    <tr style=\"text-align: right;\">\n","      <th></th>\n","      <th>Unnamed: 0</th>\n","      <th>reviews</th>\n","      <th>ratings</th>\n","    </tr>\n","  </thead>\n","  <tbody>\n","    <tr>\n","      <th>0</th>\n","      <td>0</td>\n","      <td>This book was very 
informative, covering all a...</td>\n","      <td>4</td>\n","    </tr>\n","    <tr>\n","      <th>1</th>\n","      <td>1</td>\n","      <td>I am already a baseball fan and knew a bit abo...</td>\n","      <td>5</td>\n","    </tr>\n","    <tr>\n","      <th>2</th>\n","      <td>2</td>\n","      <td>I didn't like this product it smudged all unde...</td>\n","      <td>1</td>\n","    </tr>\n","    <tr>\n","      <th>3</th>\n","      <td>3</td>\n","      <td>I simply love the product. I appreciate print ...</td>\n","      <td>5</td>\n","    </tr>\n","    <tr>\n","      <th>4</th>\n","      <td>4</td>\n","      <td>It goes on very easily and makes my eyes look ...</td>\n","      <td>5</td>\n","    </tr>\n","  </tbody>\n","</table>\n","</div>"],"text/plain":["   Unnamed: 0                                            reviews  ratings\n","0           0  This book was very informative, covering all a...        4\n","1           1  I am already a baseball fan and knew a bit abo...        5\n","2           2  I didn't like this product it smudged all unde...        1\n","3           3  I simply love the product. I appreciate print ...        5\n","4           4  It goes on very easily and makes my eyes look ...        
5"]},"metadata":{"tags":[]},"execution_count":14}]},{"cell_type":"markdown","metadata":{"id":"rapid-integral"},"source":["## Initialize & Train your model"],"id":"rapid-integral"},{"cell_type":"code","metadata":{"id":"sound-lying"},"source":["batch_size, epochs= 1000, 3\n","    \n","train_reviews=preprocess_data(train_data)\n","train_ratings=train_data['ratings'].values - 1\n","\n","model=NeuralNet(train_reviews,train_ratings)\n","model.build_nn()\n","model.train_nn(batch_size,epochs)"],"id":"sound-lying","execution_count":null,"outputs":[]},{"cell_type":"markdown","metadata":{"id":"established-conditions"},"source":["## Save your trained model"],"id":"established-conditions"},{"cell_type":"code","metadata":{"id":"collected-seating"},"source":["if not os.path.isdir(AIcrowdConfig.ASSETS_DIR):\n","  os.mkdir(AIcrowdConfig.ASSETS_DIR)\n","# This is the example for a keras model, save your model according to your framework\n","model.model.save(os.path.join(AIcrowdConfig.ASSETS_DIR, \"dummy_model.h5\"))"],"id":"collected-seating","execution_count":null,"outputs":[]},{"cell_type":"markdown","metadata":{"tags":[],"id":"addressed-ordering"},"source":["# Prediction phase 🔎\n","\n","Please make sure to save the weights from the training section in your assets directory and load them in this section"],"id":"addressed-ordering"},{"cell_type":"code","metadata":{"id":"theoretical-disclosure"},"source":["from tensorflow import keras\n","model = keras.models.load_model(os.path.join(AIcrowdConfig.ASSETS_DIR, \"dummy_model.h5\"))"],"id":"theoretical-disclosure","execution_count":null,"outputs":[]},{"cell_type":"markdown","metadata":{"id":"dense-joyce"},"source":["## Load test data"],"id":"dense-joyce"},{"cell_type":"code","metadata":{"colab":{"base_uri":"https://localhost:8080/","height":204},"id":"contemporary-patio","executionInfo":{"status":"ok","timestamp":1614346875910,"user_tz":-330,"elapsed":894,"user":{"displayName":"Sudarsh 
Rathi","photoUrl":"https://lh3.googleusercontent.com/a-/AOh14Gj3qELh8V4mCIujE5HqHQHHGpz-1_qYSdp4VG-i=s64","userId":"04293397634187217051"}},"outputId":"f74342a8-62ee-4c15-96c2-da467cfac1aa"},"source":["test_data = pd.read_csv(AIcrowdConfig.TEST_DATA_PATH)\n","test_data.head()"],"id":"contemporary-patio","execution_count":null,"outputs":[{"output_type":"execute_result","data":{"text/html":["<div>\n","<style scoped>\n","    .dataframe tbody tr th:only-of-type {\n","        vertical-align: middle;\n","    }\n","\n","    .dataframe tbody tr th {\n","        vertical-align: top;\n","    }\n","\n","    .dataframe thead th {\n","        text-align: right;\n","    }\n","</style>\n","<table border=\"1\" class=\"dataframe\">\n","  <thead>\n","    <tr style=\"text-align: right;\">\n","      <th></th>\n","      <th>Unnamed: 0</th>\n","      <th>reviews</th>\n","    </tr>\n","  </thead>\n","  <tbody>\n","    <tr>\n","      <th>0</th>\n","      <td>0</td>\n","      <td>Doesn't work at ALL. Don't waste your money or...</td>\n","    </tr>\n","    <tr>\n","      <th>1</th>\n","      <td>1</td>\n","      <td>What crap.  Would need a lot more power to do ...</td>\n","    </tr>\n","    <tr>\n","      <th>2</th>\n","      <td>2</td>\n","      <td>Has no suction and didn't work. Not worth trying.</td>\n","    </tr>\n","    <tr>\n","      <th>3</th>\n","      <td>3</td>\n","      <td>That is definitely a trash. Unable to clean an...</td>\n","    </tr>\n","    <tr>\n","      <th>4</th>\n","      <td>4</td>\n","      <td>Didn't even worked on cleaning the ears at all...</td>\n","    </tr>\n","  </tbody>\n","</table>\n","</div>"],"text/plain":["   Unnamed: 0                                            reviews\n","0           0  Doesn't work at ALL. Don't waste your money or...\n","1           1  What crap.  Would need a lot more power to do ...\n","2           2  Has no suction and didn't work. Not worth trying.\n","3           3  That is definitely a trash. 
Unable to clean an...\n","4           4  Didn't even worked on cleaning the ears at all..."]},"metadata":{"tags":[]},"execution_count":18}]},{"cell_type":"markdown","metadata":{"id":"sL0mBtJjjDLA"},"source":["#### Read and preprocess the data"],"id":"sL0mBtJjjDLA"},{"cell_type":"code","metadata":{"id":"KfRDcyrJjEKi"},"source":["test_reviews=preprocess_data(test_data)"],"id":"KfRDcyrJjEKi","execution_count":null,"outputs":[]},{"cell_type":"markdown","metadata":{"id":"successful-daniel"},"source":["## Generate predictions"],"id":"successful-daniel"},{"cell_type":"code","metadata":{"id":"bizarre-documentary"},"source":["#Make your predictions here based on your model\n","raw_predictions = model.predict(test_reviews)\n","# Training subtracted 1 from the ratings, so shift class indices 0-4 back to ratings 1-5\n","predictions = np.argmax(raw_predictions, axis=-1) + 1\n","\n"],"id":"bizarre-documentary","execution_count":null,"outputs":[]},{"cell_type":"markdown","metadata":{"id":"caroline-brooks"},"source":["## Save predictions 📨"],"id":"caroline-brooks"},{"cell_type":"code","metadata":{"id":"checked-gentleman"},"source":["pd.DataFrame(predictions, columns=[\"ratings\"]).to_csv(AIcrowdConfig.PREDICTIONS_PATH)"],"id":"checked-gentleman","execution_count":null,"outputs":[]},{"cell_type":"markdown","metadata":{"tags":[],"id":"beginning-somerset"},"source":["# Submit to AIcrowd 🚀\n","\n","**NOTE: PLEASE SAVE THE NOTEBOOK BEFORE SUBMITTING IT (Ctrl + S)**"],"id":"beginning-somerset"},{"cell_type":"code","metadata":{"id":"latest-throat"},"source":["%aicrowd submission create --jupyter -c dlnlp-note"],"id":"latest-throat","execution_count":null,"outputs":[]}]}