Created May 26, 2020 18:43
"<h1>Approach for solving the problem.</h1>\n",
We are also given a pre-defined list of sectors.<br>
"We are also given a pre-defined list of sectors . <br>\n",
Now, our program will return the <b>most similar sector</b> out of these.<br>
"Now , our program will return the <b>most similar sector</b> out of these. <br>\n",
If we were also given a list of <i>fundamentals</i>, the approach would stay the same, except that we would create one additional structure of each kind (i.e. <i>vector_dictionary</i> and <i>fundamentals_list</i>). At the end, a simple `if`/`else` comparison would decide which category the word belongs to, i.e. <i>sectors/fundamentals/time-period</i>.<br>
<u>Our approach would be like this:</u><br>
a) We would first convert the input string to lowercase.<br>
"a) We would first convert the input string to lowercase.<br>\n",
c) Now, to process each token, we have two options:<br>
<br>
" <br>\n",
1) <b>Contextual Similarity</b><br>
2) <b>Syntactic Similarity</b><br>
<h3>Contextual Similarity</h3>
"<h3>Contextual Similarity</h3>\n",
For this task, I have used the 300-dimensional GloVe vectors trained on 840B tokens.<br>
(In order to run this file, you need to store the vectors in the same directory as this notebook.)
Now, we create a dictionary that contains the sectors as keys and the vector for each sector as its value.<br>
"Now , we create a dictionary which contins the sectors as the keys and the vectors for each key as its values.<br>\n",
If we do not get any vector **OR** the user has misspelled a word, we fall back to syntactic similarity.<br><br>
<h3>Syntactic Similarity</h3><br>
"<h3>Syntactic Similarity</h3><br>\n",
Edit distance is the minimum number of changes needed to convert one string into another.<br>
We compute it between the input string and every string in the sector list, and then return the sector that matches best syntactically.<br><br>
<h3><b>Real-World Considerations for Deployment in Production</b></h3><br>
"<h3><b> Real World Considerations for deployment in prodcution</b> </h3><br>\n",
Looking up a vector in the stored dictionary takes O(1) time on average, which is fast enough for production use.<br> Also, if we need to update the sector list, we only have to add a key to the stored dictionary.<br>
For syntactic similarity, we need to run the edit-distance function against every sector, but since our list of sectors is small, this takes a fraction of a second.<br>
Hence, the solution is fast enough for production and requires no training on the go.<br> Also, the existing list of categories can be updated with minimal or no changes to the approach.<br><br>
<h3><b>Libraries to be used:</b></h3><br>
a) numpy<br>
b) nltk<br><br>
"b) nltk<br><br>\n"
"**Importing all the required libraries** .<br>\n",
"a) nltk is used to get stopwords (common words)<br>\n",
"b) re (regex) is used for cleaning the text.<br>\n",
"c) numpy is used for processing the vectors.<br>\n",
"d) pickle is used to store the loaded vectors.(Glove vectors)<br> "
We define our sectors list as shown below. A sample sector list was taken from the given pre-defined sectors.
"The **clean_sector** is used to remove any punctuation , special characters from the sector string. <br>\n",
"If the sector is a bi-gram or an n-gram (consisting of multiple words),the function returns a list of individual words of that sector."
Here, we load the 300-dimensional GloVe vectors. Since the vectors were stored in **pkl** format, the pickle library is used to load them.
"The loaded vectors are in the form of a dictionary .<br> For example:<br>\n",
"The word ***apple*** will have a 300 dimensional vector representing it. <br>\n",
"The word **banana** will have a 300 dimensional vector representing it.<br>\n",
"These vectors will be used later on to compare the contextual similarity between words.<br>\n",
"The **return_sector_vector** function is used to return a vector from the pre-loaded dictionary given an input word."
"The **generate_sector_dictionary** function takes an input as a list and creates a dictionary where *keys* are stored as the given sectors and the *values* as the vector representation of that key . <br>\n",
"If we do not get any vector for the input token , we keep it as *-1* so that the given sector can be checked for syntactic symilarity instead of contextual similarity.<br>\n",
"If the given sector consists of multiple words ,<br> For example: '*Cement – Products*', we take average of the vectors for both '*Cement*' and '*Products*' ."
The function call below downloads the list of stopwords used for cleaning and pre-processing.
"<h1>Syntactic Similarity </h1>\n",
"The **edit_distance** function is used to find out the syntactic similarity between given two strings . It basically calculates how many changes need to be done to convert one string to the another string(edit distance) . <br>\n",
"**For example** : <br>\n",
"Consider the two strings as <br>\n",
"**a)** rvnu <br>\n",
"**b)** revenue <br>\n",
"In order to correct the first string to the second one , we need to make **3** changes. <br>\n",
"This function is used because the user can make any errors while giving the input string .<br>\n",
"This **edit_distance** is an optimized solution since it used Dynamic Programming and not a recursive solution.<br>\n",
"The running time complexity of the function is *O(length1Xlength2)*."
The **best_syntactic_similarity** function uses the ***edit_distance*** function above to compute the edit distance between the **input word** and every sector in the list, and returns the sector for which the edit distance is minimum.
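A self-contained sketch of this selection step follows. The sector names are hypothetical samples, the comparison is done in lowercase (an assumption), and the edit distance is a compact Levenshtein variant that keeps only one row of the table at a time.

```python
def edit_distance(source, target):
    """Levenshtein distance keeping only the previous DP row in memory."""
    previous = list(range(len(target) + 1))
    for i, s_char in enumerate(source, start=1):
        current = [i]
        for j, t_char in enumerate(target, start=1):
            cost = 0 if s_char == t_char else 1
            current.append(min(previous[j] + 1,          # deletion
                               current[j - 1] + 1,       # insertion
                               previous[j - 1] + cost))  # match or substitution
        previous = current
    return previous[-1]


def best_syntactic_similarity(word, sectors):
    """Return the sector whose name has the smallest edit distance to `word`."""
    word = word.lower()
    # On ties, min() keeps the first sector in list order.
    return min(sectors, key=lambda sector: edit_distance(word, sector.lower()))


sectors = ["Banking", "Cement", "Pharmaceuticals"]  # hypothetical sample list
print(best_syntactic_similarity("bankng", sectors))  # Banking
```

Note that this version always returns some sector, even for unrelated input; a maximum-distance threshold would be one way to reject poor matches.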
"<h1>Contextual Similarity</h1>\n",
"In order to find the contextual similarity , we had loaded the vectors earlier as a dictionary.<br>\n",
"These vectors will now be used to find out how two words are contextually similar.<br>\n",
"The **cosine_similarity** is takes an input of two vectors are calculates the cosine similarity between them . <br>\n",
"If the two vectors are ,say, **a** and **b** , then the cosine similarity between them is given by :<br>\n",
"<h3>(dot product of <i>a</i> and <i>b</i>)/norm(<i>a</i>,<i>b</i>)</h3>"
"The **find_relevant_sector** function takes an input as a single word or token and returns the best matching sector.<br>\n",
"The Approach would be as follows:<br>\n",
"a) First we find out the vector for the input string.<br>\n",
"b) If the vector is available , we calulate the cosine similarity of the input vector with the vectors of all the sectors and then return the sector which has the highest cosine similarity with the input vector.<br>\n",
"c) If the vector for the input word is not available , we use the **best_syntactic_similarity** to find the most syntactically available sector for the given word."
This is the driver function, which takes the whole input string, splits it into words, and then parses it into components using the different **best_matching_sector** functions.
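A rough sketch of such a driver is given below. The name `parse_query`, the small hard-coded stopword set (standing in for nltk's list so the example runs offline), and the toy matcher are all hypothetical; in the notebook, the matcher would be the sector-matching function described above.

```python
import re

# Small stand-in for nltk's English stopword list (assumption for offline use).
STOPWORDS = {"the", "of", "for", "in", "and", "a", "an", "me", "to"}


def parse_query(query, match_token):
    """Lowercase, clean, and tokenize `query`, then match every non-stopword token."""
    tokens = re.sub(r"[^a-z0-9\s]", " ", query.lower()).split()
    return {token: match_token(token) for token in tokens if token not in STOPWORDS}


# Toy matcher standing in for the real per-token sector lookup (hypothetical).
lookup = {"banks": "Banking", "cement": "Cement"}


def toy_matcher(token):
    return lookup.get(token, "unknown")


print(parse_query("Revenue of the Banks!", toy_matcher))
# {'revenue': 'unknown', 'banks': 'Banking'}
```

Passing the matcher in as an argument keeps the driver independent of how each token is classified, which fits the idea of adding further categories such as fundamentals or time periods later.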
