{
"nbformat": 4,
"nbformat_minor": 0,
"metadata": {
"kernelspec": {
"display_name": "Python 3",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.6.9"
},
"colab": {
"name": "nlp.ipynb",
"provenance": [],
"collapsed_sections": [],
"include_colab_link": true
}
},
"cells": [
{
"cell_type": "markdown",
"metadata": {
"id": "view-in-github",
"colab_type": "text"
},
"source": [
"<a href=\"https://colab.research.google.com/gist/talha1503/28f915d19cf770596454ecb8b6d53d99/nlp.ipynb\" target=\"_parent\"><img src=\"https://colab.research.google.com/assets/colab-badge.svg\" alt=\"Open In Colab\"/></a>"
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "f7nB6ee4GPch",
"colab_type": "text"
},
"source": [
"<h1>Approach for solving the problem.</h1>\n",
"\n",
"In the given problem , we were given an input string and we have to parse it into different components like <i>sectors,fundamentals,time period,etc</i>.<br>\n",
"We are also given a pre-defined list of sectors . <br>\n",
"<u>**Note:**</u> <br>The problem has been solved with respect to the given <i>sectors</i> list. Hence, the solution would be to only parse the components on the basis of which sector they are closest to.<br>For example , currently we have a list of sectors ,say [Pharmaceuticals,Medical,Steel,Mining..etc].<br>\n",
"Now , our program will return the <b>most similar sector</b> out of these. <br>\n",
"If we are given with the list of <i>fundamentals</i> , we will compare the cosine similarity score , and if the word for which similairty score is maximum, belongs to <i>sectors</i> , we will parse the input as a part of Sector , OR the word for which similairty score is maximum, belongs to <i>fundamentals</i> , we will parse the input as a part of Fundamental.<br>\n",
"<br>\n",
"If provided with the list of <i>fundamentals</i> , our approach would be similar to the solution , except we would have to create an additional quantity each (i.e. <i>vector_dictionary,fundamentals_list</i>, and at the end, it would just be an `if else ` comparison to chose between which category does the word belong to, i.e. <i>sectors/fundamentals/time-period</i><br>\n",
"\n",
"<u>Our approach would be like this:</u><br>\n",
"a) We would first convert the input string to lowercase.<br>\n",
"b) After that ,we need to tokenize it so that we check each token to which category does it belong.<br>\n",
"c) Now ,to process each token , we have two options :<br>\n",
" <br>\n",
" 1)<b> Contextual Similarity</b> <br>\n",
" 2)<b> Syntactical Similarity</b> <br> \n",
"<br>\n",
"<h3>Contextual Similarity</h3>\n",
"\n",
"For contextual Similarity , we use pre-trained Glove vectors. These are 300 dimensional vectors which have been trained on a large <i>Common Crawl</i> corpus. Here's a link from where I have downloaded these vectors : <a>https://nlp.stanford.edu/projects/glove/</a><br>\n",
"\n",
"For this task , I have used the 840B token , 300 dimensional vectors .<br>\n",
"( In order to run this file , you need to store the vectors in the same directory as that of this notebook.)\n",
"Now , we create a dictionary which contins the sectors as the keys and the vectors for each key as its values.<br>\n",
"After the dictionary for the sectors has been created , we just need to find out the cosine similarity between the input string's vector and the vector for each <i>sector</i> and return the sector for which we get maxium similarity.<br>\n",
"If we do not get any vector **OR** the user has entered the wrong spelling of some word , we use Syntactic Similarity for the procedure.<br><br>\n",
"\n",
"<h3>Syntactic Similarity</h3><br>\n",
"For Syntactic Similarity , we need to find out the edit distance between the two strings for all the sectors. <br>\n",
"Edit distance is the minimum changes which we need to make in one string so that we can convert this string to the other string. <br>\n",
"We do this for all the strings in the sector list with the input string and then , we return the best matching sector syntactically.<br><br>\n",
"<h3><b> Real World Considerations for deployment in prodcution</b> </h3><br>\n",
"For contextual similarity , we have used pre-trained Glove vectors which have a size of 2GB . The advantage of using this is that , since these are pre-trained , we dont need to train our model and dont need to take care about the training accuracy for vector representation as the vectors have been trained on a very large corpus.<br>We can store the dictionary OR the file of vectors in a <i>pkl</i> format .<br>\n",
"The lookup is in O(1) time and is pretty fast for production usage.<br> Also , if we need to update the sector list , we can do so with the help of just adding a key to the stored dictionary.<br>\n",
"For syntactic similarity , we need to run the fucntion and since our list of sectors is pretty small , it can be done in a fraction of second . <br>\n",
"Hence , the solution is fast for production and requires no training on the go .<br> Also , updates can be made to the existing list of categories , with virtually no/minimal changes in the approach.<br><br>\n",
"\n",
"<h3><b>Libraries to be used:</b></h3><br>\n",
"a) numpy<br>\n",
"b) nltk<br><br>\n"
]
},
{
"cell_type": "code",
"metadata": {
"id": "PUqd0L-cX2G6",
"colab_type": "code",
"colab": {
"base_uri": "https://localhost:8080/",
"height": 121
},
"outputId": "1c5055ec-cd8a-4552-c162-8508dac09cd0"
},
"source": [
"\n",
"from google.colab import drive\n",
"drive.mount('/content/drive')"
],
"execution_count": 1,
"outputs": [
{
"output_type": "stream",
"text": [
"Go to this URL in a browser: https://accounts.google.com/o/oauth2/auth?client_id=947318989803-6bn6qk8qdgf4n4g3pfee6491hc0brc4i.apps.googleusercontent.com&redirect_uri=urn%3aietf%3awg%3aoauth%3a2.0%3aoob&response_type=code&scope=email%20https%3a%2f%2fwww.googleapis.com%2fauth%2fdocs.test%20https%3a%2f%2fwww.googleapis.com%2fauth%2fdrive%20https%3a%2f%2fwww.googleapis.com%2fauth%2fdrive.photos.readonly%20https%3a%2f%2fwww.googleapis.com%2fauth%2fpeopleapi.readonly\n",
"\n",
"Enter your authorization code:\n",
"··········\n",
"Mounted at /content/drive\n"
],
"name": "stdout"
}
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "8c1npVju7M9x",
"colab_type": "text"
},
"source": [
"**Importing all the required libraries** .<br>\n",
"a) nltk is used to get stopwords (common words)<br>\n",
"b) re (regex) is used for cleaning the text.<br>\n",
"c) numpy is used for processing the vectors.<br>\n",
"d) pickle is used to store the loaded vectors.(Glove vectors)<br> "
]
},
{
"cell_type": "code",
"metadata": {
"id": "1fMKlglUWxMJ",
"colab_type": "code",
"colab": {}
},
"source": [
"import nltk\n",
"from nltk.corpus import stopwords\n",
"import re\n",
"import numpy as np\n",
"import pickle"
],
"execution_count": 0,
"outputs": []
},
{
"cell_type": "markdown",
"metadata": {
"id": "IBo7Ty_n6_W0",
"colab_type": "text"
},
"source": [
"We define our sectors list as shown below. A sample sector list was taken from the given pre-defined sectors."
]
},
{
"cell_type": "code",
"metadata": {
"id": "7z7kzdxPWxMT",
"colab_type": "code",
"colab": {}
},
"source": [
"sector_list = ['Cement','revenue','Fertilizers','Trading','Castings, Forgings & Fastners','Cement – Products','quarters'] "
],
"execution_count": 0,
"outputs": []
},
{
"cell_type": "markdown",
"metadata": {
"id": "Y0yaDycf7u6j",
"colab_type": "text"
},
"source": [
"The **clean_sector** is used to remove any punctuation , special characters from the sector string. <br>\n",
"If the sector is a bi-gram or an n-gram (consisting of multiple words),the function returns a list of individual words of that sector."
]
},
{
"cell_type": "code",
"metadata": {
"id": "IaE4qdRSWxMb",
"colab_type": "code",
"colab": {}
},
"source": [
"def clean_sector(sector):\n",
" sector = sector.lower()\n",
" sector = re.sub(r'[^A-Za-z0-9]',' ',sector)\n",
" sector = sector.split()\n",
" return sector"
],
"execution_count": 0,
"outputs": []
},
{
"cell_type": "markdown",
"metadata": {
"id": "fQsYH4iL8L4A",
"colab_type": "text"
},
"source": [
"Here, we load the 300 dimensional Glove vectors.Since the vectors were stored in **pkl** format , pickle library is used to load the vectors."
]
},
{
"cell_type": "code",
"metadata": {
"id": "2UCy1mZtYDg9",
"colab_type": "code",
"colab": {}
},
"source": [
"glove_dictionary = pickle.load(open('/content/drive/My Drive/Embeddings/glove.840B.300d.pkl','rb'))"
],
"execution_count": 0,
"outputs": []
},
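{
"cell_type": "markdown",
"metadata": {},
"source": [
"(The **pkl** file above is assumed to have been created once from the raw *glove.840B.300d.txt* download linked earlier. A minimal sketch of that one-time conversion is shown below; the file paths are illustrative placeholders, not inputs used elsewhere in this notebook.)"
]
},
{
"cell_type": "code",
"metadata": {},
"source": [
"# One-time conversion sketch (assumption: the raw .txt file from the GloVe site).\n",
"# Each line of glove.840B.300d.txt is a token followed by 300 float values.\n",
"def convert_glove_txt_to_pkl(glove_txt_path, glove_pkl_path):\n",
"  glove = {}\n",
"  with open(glove_txt_path, 'r', encoding='utf-8') as f:\n",
"    for line in f:\n",
"      # rsplit keeps multi-part tokens intact: the last 300 fields are the vector.\n",
"      parts = line.rstrip().rsplit(' ', 300)\n",
"      glove[parts[0]] = np.asarray(parts[1:], dtype='float32')\n",
"  with open(glove_pkl_path, 'wb') as f:\n",
"    pickle.dump(glove, f)"
],
"execution_count": 0,
"outputs": []
},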
{
"cell_type": "markdown",
"metadata": {
"id": "Nw639CkD8Wd2",
"colab_type": "text"
},
"source": [
"The loaded vectors are in the form of a dictionary .<br> For example:<br>\n",
"The word ***apple*** will have a 300 dimensional vector representing it. <br>\n",
"The word **banana** will have a 300 dimensional vector representing it.<br>\n",
"These vectors will be used later on to compare the contextual similarity between words.<br>\n",
"The **return_sector_vector** function is used to return a vector from the pre-loaded dictionary given an input word."
]
},
{
"cell_type": "code",
"metadata": {
"id": "tpaFkDJiXwCA",
"colab_type": "code",
"colab": {}
},
"source": [
"def return_sector_vector(sector):\n",
" try:\n",
" vector = glove_dictionary[sector]\n",
" return vector\n",
" except Exception as e:\n",
" return np.zeros(300)"
],
"execution_count": 0,
"outputs": []
},
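{
"cell_type": "markdown",
"metadata": {},
"source": [
"A quick sanity check (illustrative; *apple* is assumed to be in the loaded vocabulary). The returned vector should be 300-dimensional:"
]
},
{
"cell_type": "code",
"metadata": {},
"source": [
"return_sector_vector('apple').shape  # expected: (300,)"
],
"execution_count": 0,
"outputs": []
},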
{
"cell_type": "markdown",
"metadata": {
"id": "miaS8CKX9XnD",
"colab_type": "text"
},
"source": [
"The **generate_sector_dictionary** function takes an input as a list and creates a dictionary where *keys* are stored as the given sectors and the *values* as the vector representation of that key . <br>\n",
"If we do not get any vector for the input token , we keep it as *-1* so that the given sector can be checked for syntactic symilarity instead of contextual similarity.<br>\n",
"If the given sector consists of multiple words ,<br> For example: '*Cement – Products*', we take average of the vectors for both '*Cement*' and '*Products*' ."
]
},
{
"cell_type": "code",
"metadata": {
"id": "rK7BxvKMWxM2",
"colab_type": "code",
"colab": {}
},
"source": [
"def generate_sector_dictionary(sector_list): \n",
" vector_dictionary = {}\n",
" cleaned_dictionary = {}\n",
" for sector in sector_list:\n",
" cleaned_sector = clean_sector(sector)\n",
" cleaned_dictionary[sector] = ' '.join(cleaned_sector)\n",
" if len(cleaned_sector) == 1 : \n",
" vector = return_sector_vector(cleaned_sector[0])\n",
" if (vector == np.zeros(300)).all():\n",
" vector_dictionary[sector] = -1 \n",
" else:\n",
" vector_dictionary[sector] = vector \n",
" else:\n",
" vector = sum([return_sector_vector(one_sector) for one_sector in cleaned_sector])\n",
" vector/=len(cleaned_sector)\n",
" if (vector == np.zeros(300)).all():\n",
" vector_dictionary[sector] = -1\n",
" else:\n",
" vector_dictionary[sector] = vector\n",
" return cleaned_dictionary,vector_dictionary"
],
"execution_count": 0,
"outputs": []
},
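{
"cell_type": "markdown",
"metadata": {},
"source": [
"We now call **generate_sector_dictionary** on our *sector_list* to build the *cleaned_dictionary* and *vector_dictionary* that the later cells rely on:"
]
},
{
"cell_type": "code",
"metadata": {},
"source": [
"cleaned_dictionary, vector_dictionary = generate_sector_dictionary(sector_list)"
],
"execution_count": 0,
"outputs": []
},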
{
"cell_type": "markdown",
"metadata": {
"id": "zSoBTJSq-sAY",
"colab_type": "text"
},
"source": [
"The below function call is used to download a list of stopwords which is used for cleaning and pre-processing."
]
},
{
"cell_type": "code",
"metadata": {
"id": "fAJLjNTBhHju",
"colab_type": "code",
"colab": {
"base_uri": "https://localhost:8080/",
"height": 67
},
"outputId": "b595652c-a8a7-45d1-c9db-cae1cbbee4ee"
},
"source": [
"nltk.download('stopwords')"
],
"execution_count": 50,
"outputs": [
{
"output_type": "stream",
"text": [
"[nltk_data] Downloading package stopwords to /root/nltk_data...\n",
"[nltk_data] Package stopwords is already up-to-date!\n"
],
"name": "stdout"
},
{
"output_type": "execute_result",
"data": {
"text/plain": [
"True"
]
},
"metadata": {
"tags": []
},
"execution_count": 50
}
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "6yJqdiGn_Q5A",
"colab_type": "text"
},
"source": [
"<h1>Syntactic Similarity </h1>\n",
"\n",
"The **edit_distance** function is used to find out the syntactic similarity between given two strings . It basically calculates how many changes need to be done to convert one string to the another string(edit distance) . <br>\n",
"**For example** : <br>\n",
"Consider the two strings as <br>\n",
"**a)** rvnu <br>\n",
"**b)** revenue <br>\n",
"In order to correct the first string to the second one , we need to make **3** changes. <br>\n",
"This function is used because the user can make any errors while giving the input string .<br>\n",
"This **edit_distance** is an optimized solution since it used Dynamic Programming and not a recursive solution.<br>\n",
"The running time complexity of the function is *O(length1Xlength2)*."
]
},
{
"cell_type": "code",
"metadata": {
"id": "tDhLM6CDm7h_",
"colab_type": "code",
"colab": {}
},
"source": [
"def edit_distance(string1,string2):\n",
" length1 = len(string1)\n",
" length2 = len(string2)\n",
"\n",
" dp = [[0 for i in range(length2+1)] for i in range(length1+1)]\n",
" \n",
" for i in range(length1+1):\n",
" for j in range(length2+1):\n",
" if i==0:\n",
" dp[i][j] =j\n",
" elif j==0:\n",
" dp[i][j] = i\n",
" elif string1[i-1] == string2[j-1]:\n",
" dp[i][j] = dp[i-1][j-1]\n",
" else:\n",
" dp[i][j] = 1 + min(dp[i-1][j],dp[i][j-1],dp[i-1][j-1])\n",
"\n",
" return dp[length1][length2]"
],
"execution_count": 0,
"outputs": []
},
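{
"cell_type": "markdown",
"metadata": {},
"source": [
"A quick illustrative check of the *rvnu* → *revenue* example above; the expected result is **3**:"
]
},
{
"cell_type": "code",
"metadata": {},
"source": [
"edit_distance('rvnu','revenue')  # expected: 3"
],
"execution_count": 0,
"outputs": []
},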
{
"cell_type": "markdown",
"metadata": {
"id": "rNENLX_6BMxs",
"colab_type": "text"
},
"source": [
"The **best_syntactic_similarity** makes use of the above ***edit_distance*** function to calculate the syntactic similarity of the **input word** with all the sectors in the list and returns the sector for which the edit distance is minimum."
]
},
{
"cell_type": "code",
"metadata": {
"id": "VepRYstCnuRo",
"colab_type": "code",
"colab": {}
},
"source": [
"def best_syntactic_similarity(input_string,sector_list):\n",
" edit_distance_score = {}\n",
" for sector in sector_list:\n",
" cleaned_sector = clean_sector(sector)\n",
" if len(cleaned_sector) == 1:\n",
" edit_distance_score[sector] = edit_distance(input_string,cleaned_sector[0])\n",
" else:\n",
" multi_scores = [edit_distance(input_string,sect) for sect in cleaned_sector]\n",
" min_score = min(multi_scores)\n",
" edit_distance_score[sector] = int(min_score)\n",
" \n",
" best_matching_sector = sorted(edit_distance_score.items(), key=lambda x: x[1])[0][0] \n",
"\n",
" return best_matching_sector"
],
"execution_count": 0,
"outputs": []
},
{
"cell_type": "code",
"metadata": {
"id": "vQiRsh49xu0u",
"colab_type": "code",
"colab": {
"base_uri": "https://localhost:8080/",
"height": 34
},
"outputId": "cc1b5ac7-e9f7-4f0f-f1ae-6d36b0898b3e"
},
"source": [
"best_syntactic_similarity('trade',sector_list)"
],
"execution_count": 54,
"outputs": [
{
"output_type": "execute_result",
"data": {
"text/plain": [
"'Trading'"
]
},
"metadata": {
"tags": []
},
"execution_count": 54
}
]
},
{
"cell_type": "code",
"metadata": {
"id": "Et6R5AJ_x0EB",
"colab_type": "code",
"colab": {
"base_uri": "https://localhost:8080/",
"height": 34
},
"outputId": "62862747-14a3-41b9-e87e-db51573ae478"
},
"source": [
"best_syntactic_similarity('qtrs',sector_list)"
],
"execution_count": 55,
"outputs": [
{
"output_type": "execute_result",
"data": {
"text/plain": [
"'quarters'"
]
},
"metadata": {
"tags": []
},
"execution_count": 55
}
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "YW6cPD8SBpzg",
"colab_type": "text"
},
"source": [
"<h1>Contextual Similarity</h1>\n",
"\n",
"In order to find the contextual similarity , we had loaded the vectors earlier as a dictionary.<br>\n",
"These vectors will now be used to find out how two words are contextually similar.<br>\n",
"\n",
"The **cosine_similarity** is takes an input of two vectors are calculates the cosine similarity between them . <br>\n",
"If the two vectors are ,say, **a** and **b** , then the cosine similarity between them is given by :<br>\n",
"\n",
"<h3>(dot product of <i>a</i> and <i>b</i>)/norm(<i>a</i>,<i>b</i>)</h3>"
]
},
{
"cell_type": "code",
"metadata": {
"id": "45vdv2I8yCK3",
"colab_type": "code",
"colab": {}
},
"source": [
"def cosine_similarity(vector_1,vector_2):\n",
" cosine_similarity = 0\n",
" try:\n",
" cosine_similarity = (np.dot(vector_1,vector_2)/(np.linalg.norm(vector_1)*np.linalg.norm(vector_2)))\n",
" except Exception as e :\n",
" pass\n",
" return cosine_similarity "
],
"execution_count": 0,
"outputs": []
},
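{
"cell_type": "markdown",
"metadata": {},
"source": [
"As an illustration (assuming all three words are present in the loaded dictionary), a related pair such as *cement* and *steel* should score noticeably higher than an unrelated pair:"
]
},
{
"cell_type": "code",
"metadata": {},
"source": [
"# Illustrative comparison; the exact scores depend on the loaded vectors.\n",
"(cosine_similarity(return_sector_vector('cement'), return_sector_vector('steel')),\n",
" cosine_similarity(return_sector_vector('cement'), return_sector_vector('banana')))"
],
"execution_count": 0,
"outputs": []
},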
{
"cell_type": "markdown",
"metadata": {
"id": "gkciAH6ODH2o",
"colab_type": "text"
},
"source": [
"The **find_relevant_sector** function takes an input as a single word or token and returns the best matching sector.<br>\n",
"The Approach would be as follows:<br>\n",
"a) First we find out the vector for the input string.<br>\n",
"b) If the vector is available , we calulate the cosine similarity of the input vector with the vectors of all the sectors and then return the sector which has the highest cosine similarity with the input vector.<br>\n",
"c) If the vector for the input word is not available , we use the **best_syntactic_similarity** to find the most syntactically available sector for the given word."
]
},
{
"cell_type": "code",
"metadata": {
"id": "jyEYZXi6h9yP",
"colab_type": "code",
"colab": {}
},
"source": [
"def find_relevant_sector(input_string,vector_dictionary,sector_list):\n",
" if (return_sector_vector(input_string) == np.zeros(300)).all():\n",
" return best_syntactic_similarity(input_string,sector_list)\n",
" \n",
" input_vector = return_sector_vector(input_string)\n",
" sector_scores = {}\n",
" for key,value in vector_dictionary.items():\n",
" if not isinstance(value,list):\n",
" sector_scores[key] = cosine_similarity(value,input_vector)\n",
" \n",
" best_conextual_similar_sector = sorted(sector_scores.items(),reverse=True, key=lambda x: x[1])[0][0]\n",
" return best_conextual_similar_sector"
],
"execution_count": 0,
"outputs": []
},
{
"cell_type": "code",
"metadata": {
"id": "2_str_QpuCV1",
"colab_type": "code",
"colab": {
"base_uri": "https://localhost:8080/",
"height": 34
},
"outputId": "b1ad4c11-983d-488f-b984-38fc08672d68"
},
"source": [
"print(find_relevant_sector('money',vector_dictionary,cleaned_dictionary,sector_list))"
],
"execution_count": 63,
"outputs": [
{
"output_type": "stream",
"text": [
"revenue\n"
],
"name": "stdout"
}
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "u4oSJo_KF07h",
"colab_type": "text"
},
"source": [
"This is the driver function which is takes the input string as a whole , splits it into words and then , parses it into components using different **best_matching_sector function** ."
]
},
{
"cell_type": "code",
"metadata": {
"id": "fLw0TX7buOmQ",
"colab_type": "code",
"colab": {}
},
"source": [
"def evaluate_user_input(input_data):\n",
" input_data = input_data.lower()\n",
" input_data = input_data.split()\n",
" input_data = [word for word in input_data if word not in stopwords.words('english')]\n",
" for token in input_data:\n",
" best_matching_sector = find_relevant_sector(token,vector_dictionary,sector_list)\n",
" print(best_matching_sector+\": \"+token+\" \")"
],
"execution_count": 0,
"outputs": []
},
{
"cell_type": "code",
"metadata": {
"id": "gMf-aCfnyKkY",
"colab_type": "code",
"colab": {
"base_uri": "https://localhost:8080/",
"height": 185
},
"outputId": "f7e72eea-ed17-47ff-b119-a1f412303355"
},
"source": [
"evaluate_user_input('Output Revenue, EBITDA margin for Steel and Metal stocks for past 10 qtrs')"
],
"execution_count": 72,
"outputs": [
{
"output_type": "stream",
"text": [
"revenue: output \n",
"revenue: revenue, \n",
"revenue: ebitda \n",
"revenue: margin \n",
"Castings, Forgings & Fastners: steel \n",
"Cement – Products: metal \n",
"Trading: stocks \n",
"quarters: past \n",
"revenue: 10 \n",
"quarters: qtrs \n"
],
"name": "stdout"
}
]
}
]
}