{
"nbformat": 4,
"nbformat_minor": 0,
"metadata": {
"kernelspec": {
"display_name": "Python 3",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.6.9"
},
"colab": {
"name": "nlp.ipynb",
"provenance": [],
"collapsed_sections": [],
"include_colab_link": true
}
},
"cells": [
{
"cell_type": "markdown",
"metadata": {
"id": "view-in-github",
"colab_type": "text"
},
"source": [
"<a href=\"https://colab.research.google.com/gist/talha1503/28f915d19cf770596454ecb8b6d53d99/nlp.ipynb\" target=\"_parent\"><img src=\"https://colab.research.google.com/assets/colab-badge.svg\" alt=\"Open In Colab\"/></a>"
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "f7nB6ee4GPch",
"colab_type": "text"
},
"source": [
"<h1>Approach for solving the problem.</h1>\n",
"\n",
"In the given problem , we were given an input string and we have to parse it into different components like <i>sectors,fundamentals,time period,etc</i>.<br>\n",
"We are also given a pre-defined list of sectors . <br>\n",
"<u>**Note:**</u> <br>The problem has been solved with respect to the given <i>sectors</i> list. Hence, the solution would be to only parse the components on the basis of which sector they are closest to.<br>For example , currently we have a list of sectors ,say [Pharmaceuticals,Medical,Steel,Mining..etc].<br>\n",
"Now , our program will return the <b>most similar sector</b> out of these. <br>\n",
"If we are given with the list of <i>fundamentals</i> , we will compare the cosine similarity score , and if the word for which similairty score is maximum, belongs to <i>sectors</i> , we will parse the input as a part of Sector , OR the word for which similairty score is maximum, belongs to <i>fundamentals</i> , we will parse the input as a part of Fundamental.<br>\n",
"<br>\n",
"If provided with the list of <i>fundamentals</i> , our approach would be similar to the solution , except we would have to create an additional quantity each (i.e. <i>vector_dictionary,fundamentals_list</i>, and at the end, it would just be an `if else ` comparison to chose between which category does the word belong to, i.e. <i>sectors/fundamentals/time-period</i><br>\n",
"\n",
"<u>Our approach would be like this:</u><br>\n",
"a) We would first convert the input string to lowercase.<br>\n",
"b) After that ,we need to tokenize it so that we check each token to which category does it belong.<br>\n",
"c) Now ,to process each token , we have two options :<br>\n",
" <br>\n",
" 1)<b> Contextual Similarity</b> <br>\n",
" 2)<b> Syntactical Similarity</b> <br> \n",
"<br>\n",
"<h3>Contextual Similarity</h3>\n",
"\n",
"For contextual Similarity , we use pre-trained Glove vectors. These are 300 dimensional vectors which have been trained on a large <i>Common Crawl</i> corpus. Here's a link from where I have downloaded these vectors : <a>https://nlp.stanford.edu/projects/glove/</a><br>\n",
"\n",
"For this task , I have used the 840B token , 300 dimensional vectors .<br>\n",
"( In order to run this file , you need to store the vectors in the same directory as that of this notebook.)\n",
"Now , we create a dictionary which contins the sectors as the keys and the vectors for each key as its values.<br>\n",
"After the dictionary for the sectors has been created , we just need to find out the cosine similarity between the input string's vector and the vector for each <i>sector</i> and return the sector for which we get maxium similarity.<br>\n",
"If we do not get any vector **OR** the user has entered the wrong spelling of some word , we use Syntactic Similarity for the procedure.<br><br>\n",
"\n",
"<h3>Syntactic Similarity</h3><br>\n",
"For Syntactic Similarity , we need to find out the edit distance between the two strings for all the sectors. <br>\n",
"Edit distance is the minimum changes which we need to make in one string so that we can convert this string to the other string. <br>\n",
"We do this for all the strings in the sector list with the input string and then , we return the best matching sector syntactically.<br><br>\n",
"<h3><b> Real World Considerations for deployment in prodcution</b> </h3><br>\n",
"For contextual similarity , we have used pre-trained Glove vectors which have a size of 2GB . The advantage of using this is that , since these are pre-trained , we dont need to train our model and dont need to take care about the training accuracy for vector representation as the vectors have been trained on a very large corpus.<br>We can store the dictionary OR the file of vectors in a <i>pkl</i> format .<br>\n",
"The lookup is in O(1) time and is pretty fast for production usage.<br> Also , if we need to update the sector list , we can do so with the help of just adding a key to the stored dictionary.<br>\n",
"For syntactic similarity , we need to run the fucntion and since our list of sectors is pretty small , it can be done in a fraction of second . <br>\n",
"Hence , the solution is fast for production and requires no training on the go .<br> Also , updates can be made to the existing list of categories , with virtually no/minimal changes in the approach.<br><br>\n",
"\n",
"<h3><b>Libraries to be used:</b></h3><br>\n",
"a) numpy<br>\n",
"b) nltk<br><br>\n"
]
},
{
"cell_type": "code",
"metadata": {
"id": "PUqd0L-cX2G6",
"colab_type": "code",
"colab": {
"base_uri": "https://localhost:8080/",
"height": 121
},
"outputId": "1c5055ec-cd8a-4552-c162-8508dac09cd0"
},
"source": [
"\n",
"from google.colab import drive\n",
"drive.mount('/content/drive')"
],
"execution_count": 1,
"outputs": [
{
"output_type": "stream",
"text": [
"Go to this URL in a browser: https://accounts.google.com/o/oauth2/auth?client_id=947318989803-6bn6qk8qdgf4n4g3pfee6491hc0brc4i.apps.googleusercontent.com&redirect_uri=urn%3aietf%3awg%3aoauth%3a2.0%3aoob&response_type=code&scope=email%20https%3a%2f%2fwww.googleapis.com%2fauth%2fdocs.test%20https%3a%2f%2fwww.googleapis.com%2fauth%2fdrive%20https%3a%2f%2fwww.googleapis.com%2fauth%2fdrive.photos.readonly%20https%3a%2f%2fwww.googleapis.com%2fauth%2fpeopleapi.readonly\n",
"\n",
"Enter your authorization code:\n",
"··········\n",
"Mounted at /content/drive\n"
],
"name": "stdout"
}
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "8c1npVju7M9x",
"colab_type": "text"
},
"source": [
"**Importing all the required libraries** .<br>\n",
"a) nltk is used to get stopwords (common words)<br>\n",
"b) re (regex) is used for cleaning the text.<br>\n",
"c) numpy is used for processing the vectors.<br>\n",
"d) pickle is used to store the loaded vectors.(Glove vectors)<br> "
]
},
{
"cell_type": "code",
"metadata": {
"id": "1fMKlglUWxMJ",
"colab_type": "code",
"colab": {}
},
"source": [
"import nltk\n",
"from nltk.corpus import stopwords\n",
"import re\n",
"import numpy as np\n",
"import pickle"
],
"execution_count": 0,
"outputs": []
},
{
"cell_type": "markdown",
"metadata": {
"id": "IBo7Ty_n6_W0",
"colab_type": "text"
},
"source": [
"We define our sectors list as shown below. A sample sector list was taken from the given pre-defined sectors."
]
},
{
"cell_type": "code",
"metadata": {
"id": "7z7kzdxPWxMT",
"colab_type": "code",
"colab": {}
},
"source": [
"sector_list = ['Cement','revenue','Fertilizers','Trading','Castings, Forgings & Fastners','Cement – Products','quarters'] "
],
"execution_count": 0,
"outputs": []
},
{
"cell_type": "markdown",
"metadata": {
"id": "Y0yaDycf7u6j",
"colab_type": "text"
},
"source": [
"The **clean_sector** is used to remove any punctuation , special characters from the sector string. <br>\n",
"If the sector is a bi-gram or an n-gram (consisting of multiple words),the function returns a list of individual words of that sector."
]
},
{
"cell_type": "code",
"metadata": {
"id": "IaE4qdRSWxMb",
"colab_type": "code",
"colab": {}
},
"source": [
"def clean_sector(sector):\n",
" sector = sector.lower()\n",
" sector = re.sub(r'[^A-Za-z0-9]',' ',sector)\n",
" sector = sector.split()\n",
" return sector"
],
"execution_count": 0,
"outputs": []
},
{
"cell_type": "markdown",
"metadata": {
"id": "fQsYH4iL8L4A",
"colab_type": "text"
},
"source": [
"Here, we load the 300 dimensional Glove vectors.Since the vectors were stored in **pkl** format , pickle library is used to load the vectors."
]
},
{
"cell_type": "code",
"metadata": {
"id": "2UCy1mZtYDg9",
"colab_type": "code",
"colab": {}
},
"source": [
"glove_dictionary = pickle.load(open('/content/drive/My Drive/Embeddings/glove.840B.300d.pkl','rb'))"
],
"execution_count": 0,
"outputs": []
},
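{
"cell_type": "markdown",
"metadata": {},
"source": [
"(The **pkl** file above is assumed to have been created once from the raw *glove.840B.300d.txt* download linked earlier. A minimal sketch of that one-time conversion is shown below; the file paths are illustrative placeholders, not inputs used elsewhere in this notebook.)"
]
},
{
"cell_type": "code",
"metadata": {},
"source": [
"# One-time conversion sketch (assumption: the raw .txt file from the GloVe site).\n",
"# Each line of glove.840B.300d.txt is a token followed by 300 float values.\n",
"def convert_glove_txt_to_pkl(glove_txt_path, glove_pkl_path):\n",
"  glove = {}\n",
"  with open(glove_txt_path, 'r', encoding='utf-8') as f:\n",
"    for line in f:\n",
"      # rsplit keeps multi-part tokens intact: the last 300 fields are the vector.\n",
"      parts = line.rstrip().rsplit(' ', 300)\n",
"      glove[parts[0]] = np.asarray(parts[1:], dtype='float32')\n",
"  with open(glove_pkl_path, 'wb') as f:\n",
"    pickle.dump(glove, f)"
],
"execution_count": 0,
"outputs": []
},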
{
"cell_type": "markdown",
"metadata": {
"id": "Nw639CkD8Wd2",
"colab_type": "text"
},
"source": [
"The loaded vectors are in the form of a dictionary .<br> For example:<br>\n",
"The word ***apple*** will have a 300 dimensional vector representing it. <br>\n",
"The word **banana** will have a 300 dimensional vector representing it.<br>\n",
"These vectors will be used later on to compare the contextual similarity between words.<br>\n",
"The **return_sector_vector** function is used to return a vector from the pre-loaded dictionary given an input word."
]
},
{
"cell_type": "code",
"metadata": {
"id": "tpaFkDJiXwCA",
"colab_type": "code",
"colab": {}
},
"source": [
"def return_sector_vector(sector):\n",
" try:\n",
" vector = glove_dictionary[sector]\n",
" return vector\n",
" except Exception as e:\n",
" return np.zeros(300)"
],
"execution_count": 0,
"outputs": []
},
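{
"cell_type": "markdown",
"metadata": {},
"source": [
"A quick sanity check (illustrative; *apple* is assumed to be in the loaded vocabulary). The returned vector should be 300-dimensional:"
]
},
{
"cell_type": "code",
"metadata": {},
"source": [
"return_sector_vector('apple').shape  # expected: (300,)"
],
"execution_count": 0,
"outputs": []
},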
{
"cell_type": "markdown",
"metadata": {
"id": "miaS8CKX9XnD",
"colab_type": "text"
},
"source": [
"The **generate_sector_dictionary** function takes an input as a list and creates a dictionary where *keys* are stored as the given sectors and the *values* as the vector representation of that key . <br>\n",
"If we do not get any vector for the input token , we keep it as *-1* so that the given sector can be checked for syntactic symilarity instead of contextual similarity.<br>\n",
"If the given sector consists of multiple words ,<br> For example: '*Cement – Products*', we take average of the vectors for both '*Cement*' and '*Products*' ."
]
},
{
"cell_type": "code",
"metadata": {
"id": "rK7BxvKMWxM2",
"colab_type": "code",
"colab": {}
},
"source": [
"def generate_sector_dictionary(sector_list): \n",
" vector_dictionary = {}\n",
" cleaned_dictionary = {}\n",
" for sector in sector_list:\n",
" cleaned_sector = clean_sector(sector)\n",
" cleaned_dictionary[sector] = ' '.join(cleaned_sector)\n",
" if len(cleaned_sector) == 1 : \n",
" vector = return_sector_vector(cleaned_sector[0])\n",
" if (vector == np.zeros(300)).all():\n",
" vector_dictionary[sector] = -1 \n",
" else:\n",
" vector_dictionary[sector] = vector \n",
" else:\n",
" vector = sum([return_sector_vector(one_sector) for one_sector in cleaned_sector])\n",
" vector/=len(cleaned_sector)\n",
" if (vector == np.zeros(300)).all():\n",
" vector_dictionary[sector] = -1\n",
" else:\n",
" vector_dictionary[sector] = vector\n",
" return cleaned_dictionary,vector_dictionary"
],
"execution_count": 0,
"outputs": []
},
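{
"cell_type": "markdown",
"metadata": {},
"source": [
"We now call **generate_sector_dictionary** on our *sector_list* to build the *cleaned_dictionary* and *vector_dictionary* that the later cells rely on:"
]
},
{
"cell_type": "code",
"metadata": {},
"source": [
"cleaned_dictionary, vector_dictionary = generate_sector_dictionary(sector_list)"
],
"execution_count": 0,
"outputs": []
},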
{
"cell_type": "markdown",
"metadata": {
"id": "zSoBTJSq-sAY",
"colab_type": "text"
},
"source": [
"The below function call is used to download a list of stopwords which is used for cleaning and pre-processing."
]
},
{
"cell_type": "code",
"metadata": {
"id": "fAJLjNTBhHju",
"colab_type": "code",
"colab": {
"base_uri": "https://localhost:8080/",
"height": 67
},
"outputId": "b595652c-a8a7-45d1-c9db-cae1cbbee4ee"
},
"source": [
"nltk.download('stopwords')"
],
"execution_count": 50,
"outputs": [
{
"output_type": "stream",
"text": [
"[nltk_data] Downloading package stopwords to /root/nltk_data...\n",
"[nltk_data] Package stopwords is already up-to-date!\n"
],
"name": "stdout"
},
{
"output_type": "execute_result",
"data": {
"text/plain": [
"True"
]
},
"metadata": {
"tags": []
},
"execution_count": 50
}
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "6yJqdiGn_Q5A",
"colab_type": "text"
},
"source": [
"<h1>Syntactic Similarity </h1>\n",
"\n",
"The **edit_distance** function is used to find out the syntactic similarity between given two strings . It basically calculates how many changes need to be done to convert one string to the another string(edit distance) . <br>\n",
"**For example** : <br>\n",
"Consider the two strings as <br>\n",
"**a)** rvnu <br>\n",
"**b)** revenue <br>\n",
"In order to correct the first string to the second one , we need to make **3** changes. <br>\n",
"This function is used because the user can make any errors while giving the input string .<br>\n",
"This **edit_distance** is an optimized solution since it used Dynamic Programming and not a recursive solution.<br>\n",
"The running time complexity of the function is *O(length1Xlength2)*."
]
},
{
"cell_type": "code",
"metadata": {
"id": "tDhLM6CDm7h_",
"colab_type": "code",
"colab": {}
},
"source": [
"def edit_distance(string1,string2):\n",
" length1 = len(string1)\n",
" length2 = len(string2)\n",
"\n",
" dp = [[0 for i in range(length2+1)] for i in range(length1+1)]\n",
" \n",
" for i in range(length1+1):\n",
" for j in range(length2+1):\n",
" if i==0:\n",
" dp[i][j] =j\n",
" elif j==0:\n",
" dp[i][j] = i\n",
" elif string1[i-1] == string2[j-1]:\n",
" dp[i][j] = dp[i-1][j-1]\n",
" else:\n",
" dp[i][j] = 1 + min(dp[i-1][j],dp[i][j-1],dp[i-1][j-1])\n",
"\n",
" return dp[length1][length2]"
],
"execution_count": 0,
"outputs": []
},
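{
"cell_type": "markdown",
"metadata": {},
"source": [
"A quick illustrative check of the *rvnu* → *revenue* example above; the expected result is **3**:"
]
},
{
"cell_type": "code",
"metadata": {},
"source": [
"edit_distance('rvnu','revenue')  # expected: 3"
],
"execution_count": 0,
"outputs": []
},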
{
"cell_type": "markdown",
"metadata": {
"id": "rNENLX_6BMxs",
"colab_type": "text"
},
"source": [
"The **best_syntactic_similarity** makes use of the above ***edit_distance*** function to calculate the syntactic similarity of the **input word** with all the sectors in the list and returns the sector for which the edit distance is minimum."
]
},
{
"cell_type": "code",
"metadata": {
"id": "VepRYstCnuRo",
"colab_type": "code",
"colab": {}
},
"source": [
"def best_syntactic_similarity(input_string,sector_list):\n",
" edit_distance_score = {}\n",
" for sector in sector_list:\n",
" cleaned_sector = clean_sector(sector)\n",
" if len(cleaned_sector) == 1:\n",
" edit_distance_score[sector] = edit_distance(input_string,cleaned_sector[0])\n",
" else:\n",
" multi_scores = [edit_distance(input_string,sect) for sect in cleaned_sector]\n",
" min_score = min(multi_scores)\n",
" edit_distance_score[sector] = int(min_score)\n",
" \n",
" best_matching_sector = sorted(edit_distance_score.items(), key=lambda x: x[1])[0][0] \n",
"\n",
" return best_matching_sector"
],
"execution_count": 0,
"outputs": []
},
{
"cell_type": "code",
"metadata": {
"id": "vQiRsh49xu0u",
"colab_type": "code",
"colab": {
"base_uri": "https://localhost:8080/",
"height": 34
},
"outputId": "cc1b5ac7-e9f7-4f0f-f1ae-6d36b0898b3e"
},
"source": [
"best_syntactic_similarity('trade',sector_list)"
],
"execution_count": 54,
"outputs": [
{
"output_type": "execute_result",
"data": {
"text/plain": [
"'Trading'"
]
},
"metadata": {
"tags": []
},
"execution_count": 54
}
]
},
{
"cell_type": "code",
"metadata": {
"id": "Et6R5AJ_x0EB",
"colab_type": "code",
"colab": {
"base_uri": "https://localhost:8080/",
"height": 34
},
"outputId": "62862747-14a3-41b9-e87e-db51573ae478"
},
"source": [
"best_syntactic_similarity('qtrs',sector_list)"
],
"execution_count": 55,
"outputs": [
{
"output_type": "execute_result",
"data": {
"text/plain": [
"'quarters'"
]
},
"metadata": {
"tags": []
},
"execution_count": 55
}
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "YW6cPD8SBpzg",
"colab_type": "text"
},
"source": [
"<h1>Contextual Similarity</h1>\n",
"\n",
"In order to find the contextual similarity , we had loaded the vectors earlier as a dictionary.<br>\n",
"These vectors will now be used to find out how two words are contextually similar.<br>\n",
"\n",
"The **cosine_similarity** is takes an input of two vectors are calculates the cosine similarity between them . <br>\n",
"If the two vectors are ,say, **a** and **b** , then the cosine similarity between them is given by :<br>\n",
"\n",
"<h3>(dot product of <i>a</i> and <i>b</i>)/norm(<i>a</i>,<i>b</i>)</h3>"
]
},
{
"cell_type": "code",
"metadata": {
"id": "45vdv2I8yCK3",
"colab_type": "code",
"colab": {}
},
"source": [
"def cosine_similarity(vector_1,vector_2):\n",
" cosine_similarity = 0\n",
" try:\n",
" cosine_similarity = (np.dot(vector_1,vector_2)/(np.linalg.norm(vector_1)*np.linalg.norm(vector_2)))\n",
" except Exception as e :\n",
" pass\n",
" return cosine_similarity "
],
"execution_count": 0,
"outputs": []
},
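{
"cell_type": "markdown",
"metadata": {},
"source": [
"As an illustration (assuming all three words are present in the loaded dictionary), a related pair such as *cement* and *steel* should score noticeably higher than an unrelated pair:"
]
},
{
"cell_type": "code",
"metadata": {},
"source": [
"# Illustrative comparison; the exact scores depend on the loaded vectors.\n",
"(cosine_similarity(return_sector_vector('cement'), return_sector_vector('steel')),\n",
" cosine_similarity(return_sector_vector('cement'), return_sector_vector('banana')))"
],
"execution_count": 0,
"outputs": []
},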
{
"cell_type": "markdown",
"metadata": {
"id": "gkciAH6ODH2o",
"colab_type": "text"
},
"source": [
"The **find_relevant_sector** function takes an input as a single word or token and returns the best matching sector.<br>\n",
"The Approach would be as follows:<br>\n",
"a) First we find out the vector for the input string.<br>\n",
"b) If the vector is available , we calulate the cosine similarity of the input vector with the vectors of all the sectors and then return the sector which has the highest cosine similarity with the input vector.<br>\n",
"c) If the vector for the input word is not available , we use the **best_syntactic_similarity** to find the most syntactically available sector for the given word."
]
},
{
"cell_type": "code",
"metadata": {
"id": "jyEYZXi6h9yP",
"colab_type": "code",
"colab": {}
},
"source": [
"def find_relevant_sector(input_string,vector_dictionary,sector_list):\n",
" if (return_sector_vector(input_string) == np.zeros(300)).all():\n",
" return best_syntactic_similarity(input_string,sector_list)\n",
" \n",
" input_vector = return_sector_vector(input_string)\n",
" sector_scores = {}\n",
" for key,value in vector_dictionary.items():\n",
" if not isinstance(value,list):\n",
" sector_scores[key] = cosine_similarity(value,input_vector)\n",
" \n",
" best_conextual_similar_sector = sorted(sector_scores.items(),reverse=True, key=lambda x: x[1])[0][0]\n",
" return best_conextual_similar_sector"
],
"execution_count": 0,
"outputs": []
},
{
"cell_type": "code",
"metadata": {
"id": "2_str_QpuCV1",
"colab_type": "code",
"colab": {
"base_uri": "https://localhost:8080/",
"height": 34
},
"outputId": "b1ad4c11-983d-488f-b984-38fc08672d68"
},
"source": [
"print(find_relevant_sector('money',vector_dictionary,cleaned_dictionary,sector_list))"
],
"execution_count": 63,
"outputs": [
{
"output_type": "stream",
"text": [
"revenue\n"
],
"name": "stdout"
}
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "u4oSJo_KF07h",
"colab_type": "text"
},
"source": [
"This is the driver function which is takes the input string as a whole , splits it into words and then , parses it into components using different **best_matching_sector function** ."
]
},
{
"cell_type": "code",
"metadata": {
"id": "fLw0TX7buOmQ",
"colab_type": "code",
"colab": {}
},
"source": [
"def evaluate_user_input(input_data):\n",
" input_data = input_data.lower()\n",
" input_data = input_data.split()\n",
" input_data = [word for word in input_data if word not in stopwords.words('english')]\n",
" for token in input_data:\n",
" best_matching_sector = find_relevant_sector(token,vector_dictionary,sector_list)\n",
" print(best_matching_sector+\": \"+token+\" \")"
],
"execution_count": 0,
"outputs": []
},
{
"cell_type": "code",
"metadata": {
"id": "gMf-aCfnyKkY",
"colab_type": "code",
"colab": {
"base_uri": "https://localhost:8080/",
"height": 185
},
"outputId": "f7e72eea-ed17-47ff-b119-a1f412303355"
},
"source": [
"evaluate_user_input('Output Revenue, EBITDA margin for Steel and Metal stocks for past 10 qtrs')"
],
"execution_count": 72,
"outputs": [
{
"output_type": "stream",
"text": [
"revenue: output \n",
"revenue: revenue, \n",
"revenue: ebitda \n",
"revenue: margin \n",
"Castings, Forgings & Fastners: steel \n",
"Cement – Products: metal \n",
"Trading: stocks \n",
"quarters: past \n",
"revenue: 10 \n",
"quarters: qtrs \n"
],
"name": "stdout"
}
]
}
]
}