{
"cells": [
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Preliminaries"
]
},
{
"cell_type": "code",
"execution_count": 1,
"metadata": {},
"outputs": [],
"source": [
"import pandas as pd\n",
"import numpy as np\n",
"import PyPDF2\n",
"import textract\n",
"import re"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Reading Text"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"- Convert the PDF file to plain text for easier pre-processing"
]
},
{
"cell_type": "code",
"execution_count": 2,
"metadata": {},
"outputs": [],
"source": [
"filename = 'JavaBasics-notes.pdf'\n",
"\n",
"pdfFileObj = open(filename, 'rb')  # open the PDF in binary mode\n",
"pdfReader = PyPDF2.PdfFileReader(pdfFileObj)  # readable object that will be parsed\n",
"num_pages = pdfReader.numPages  # needed to iterate over all the pages\n",
"\n",
"count = 0\n",
"text = \"\"\n",
"\n",
"while count < num_pages:  # read each page in turn\n",
"    pageObj = pdfReader.getPage(count)\n",
"    count += 1\n",
"    text += pageObj.extractText()\n",
"\n",
"# PyPDF2 cannot read scanned (image-based) PDFs, so if it returned no text\n",
"# we fall back to the textract OCR pipeline (tesseract) to extract it.\n",
"if text == \"\":\n",
"    text = textract.process('http://bit.ly/epo_keyword_extraction_document', method='tesseract', language='eng')\n",
"\n",
"# text now contains all the text derived from our PDF file."
]
},
{
"cell_type": "code",
"execution_count": 3,
"metadata": {},
"outputs": [],
"source": [
"text = text.encode('ascii', 'ignore').lower()  # drop non-ASCII characters and lowercase the text"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Extracting Keywords"
]
},
{
"cell_type": "code",
"execution_count": 4,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"3410"
]
},
"execution_count": 4,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"keywords = re.findall(r'[a-zA-Z]\\w+',text)\n",
"len(keywords) #Total keywords in document"
]
},
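{
"cell_type": "markdown",
"metadata": {},
"source": [
"A quick sanity check (toy sentence invented for illustration) of what the `[a-zA-Z]\\w+` pattern keeps: tokens that start with a letter and are at least two characters long, so digits and single-letter tokens are dropped."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"sample = 'Java 8 applets run in-browser since 1995'\n",
"re.findall(r'[a-zA-Z]\\w+', sample)"
]
},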
{
"cell_type": "code",
"execution_count": 5,
"metadata": {},
"outputs": [],
"source": [
"df = pd.DataFrame(list(set(keywords)),columns=['keywords']) #Dataframe with unique keywords to avoid repetition in rows"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Calculating Weightage"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"- In information retrieval, tf–idf or TFIDF, short for term frequency–inverse document frequency, is a numerical statistic intended to reflect how important a word is to a document in a collection or corpus. It is often used as a weighting factor in information retrieval searches, text mining, and user modeling."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"- __TF: Term Frequency__ measures how frequently a term occurs in a document. Since documents differ in length, a term may appear many more times in long documents than in shorter ones. Thus, the term frequency is often divided by the document length (i.e. the total number of terms in the document) as a way of normalization:\n",
"\n",
"__TF(t) = (Number of times term t appears in a document) / (Total number of terms in the document).__\n",
"\n",
"- __IDF: Inverse Document Frequency__ measures how important a term is. While computing TF, all terms are considered equally important. However, certain terms, such as \"is\", \"of\", and \"that\", may appear many times yet carry little importance. Thus we need to weigh down the frequent terms while scaling up the rare ones, by computing the following:\n",
"\n",
"__IDF(t) = log_e(Total number of documents / Number of documents with term t in it).__"
]
},
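{
"cell_type": "markdown",
"metadata": {},
"source": [
"A hand-worked sketch of the two formulas on a toy three-document corpus (the corpus and term are invented purely for illustration):"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"import numpy as np\n",
"\n",
"docs = ['java is a language',\n",
"        'java applets run in a browser',\n",
"        'the method returns an object']\n",
"term = 'java'\n",
"\n",
"tokens = docs[0].split()\n",
"tf = tokens.count(term) / float(len(tokens))            # 1 occurrence / 4 terms\n",
"docs_with_term = sum(1 for d in docs if term in d.split())\n",
"idf = np.log(len(docs) / float(docs_with_term))         # log_e(3 / 2)\n",
"tf * idf"
]
},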
{
"cell_type": "code",
"execution_count": 6,
"metadata": {},
"outputs": [],
"source": [
"def weightage(word, text, number_of_documents=1):\n",
"    # Count occurrences of the word (note: this also matches it inside longer words)\n",
"    word_list = re.findall(word, text)\n",
"    number_of_times_word_appeared = len(word_list)\n",
"    # TF is normalised here by the character length of the text\n",
"    tf = number_of_times_word_appeared / float(len(text))\n",
"    # With a single document, IDF is negative whenever the word appears more than once\n",
"    idf = np.log(number_of_documents / float(number_of_times_word_appeared))\n",
"    tf_idf = tf * idf\n",
"    return number_of_times_word_appeared, tf, idf, tf_idf"
]
},
{
"cell_type": "code",
"execution_count": 7,
"metadata": {},
"outputs": [],
"source": [
"df['number_of_times_word_appeared'] = df['keywords'].apply(lambda x: weightage(x,text)[0])\n",
"df['tf'] = df['keywords'].apply(lambda x: weightage(x,text)[1])\n",
"df['idf'] = df['keywords'].apply(lambda x: weightage(x,text)[2])\n",
"df['tf_idf'] = df['keywords'].apply(lambda x: weightage(x,text)[3])"
]
},
{
"cell_type": "code",
"execution_count": 8,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"<div>\n",
"<style scoped>\n",
" .dataframe tbody tr th:only-of-type {\n",
" vertical-align: middle;\n",
" }\n",
"\n",
" .dataframe tbody tr th {\n",
" vertical-align: top;\n",
" }\n",
"\n",
" .dataframe thead th {\n",
" text-align: right;\n",
" }\n",
"</style>\n",
"<table border=\"1\" class=\"dataframe\">\n",
" <thead>\n",
" <tr style=\"text-align: right;\">\n",
" <th></th>\n",
" <th>keywords</th>\n",
" <th>number_of_times_word_appeared</th>\n",
" <th>tf</th>\n",
" <th>idf</th>\n",
" <th>tf_idf</th>\n",
" </tr>\n",
" </thead>\n",
" <tbody>\n",
" <tr>\n",
" <th>194</th>\n",
" <td>in</td>\n",
" <td>369</td>\n",
" <td>0.014913</td>\n",
" <td>-5.910797</td>\n",
" <td>-0.088146</td>\n",
" </tr>\n",
" <tr>\n",
" <th>317</th>\n",
" <td>re</td>\n",
" <td>258</td>\n",
" <td>0.010427</td>\n",
" <td>-5.552960</td>\n",
" <td>-0.057899</td>\n",
" </tr>\n",
" <tr>\n",
" <th>880</th>\n",
" <td>at</td>\n",
" <td>247</td>\n",
" <td>0.009982</td>\n",
" <td>-5.509388</td>\n",
" <td>-0.054996</td>\n",
" </tr>\n",
" <tr>\n",
" <th>783</th>\n",
" <td>on</td>\n",
" <td>243</td>\n",
" <td>0.009821</td>\n",
" <td>-5.493061</td>\n",
" <td>-0.053945</td>\n",
" </tr>\n",
" <tr>\n",
" <th>690</th>\n",
" <td>the</td>\n",
" <td>203</td>\n",
" <td>0.008204</td>\n",
" <td>-5.313206</td>\n",
" <td>-0.043590</td>\n",
" </tr>\n",
" <tr>\n",
" <th>876</th>\n",
" <td>an</td>\n",
" <td>199</td>\n",
" <td>0.008042</td>\n",
" <td>-5.293305</td>\n",
" <td>-0.042571</td>\n",
" </tr>\n",
" <tr>\n",
" <th>25</th>\n",
" <td>to</td>\n",
" <td>190</td>\n",
" <td>0.007679</td>\n",
" <td>-5.247024</td>\n",
" <td>-0.040290</td>\n",
" </tr>\n",
" <tr>\n",
" <th>799</th>\n",
" <td>or</td>\n",
" <td>167</td>\n",
" <td>0.006749</td>\n",
" <td>-5.117994</td>\n",
" <td>-0.034542</td>\n",
" </tr>\n",
" <tr>\n",
" <th>878</th>\n",
" <td>as</td>\n",
" <td>157</td>\n",
" <td>0.006345</td>\n",
" <td>-5.056246</td>\n",
" <td>-0.032082</td>\n",
" </tr>\n",
" <tr>\n",
" <th>588</th>\n",
" <td>java</td>\n",
" <td>135</td>\n",
" <td>0.005456</td>\n",
" <td>-4.905275</td>\n",
" <td>-0.026763</td>\n",
" </tr>\n",
" <tr>\n",
" <th>636</th>\n",
" <td>it</td>\n",
" <td>122</td>\n",
" <td>0.004930</td>\n",
" <td>-4.804021</td>\n",
" <td>-0.023686</td>\n",
" </tr>\n",
" <tr>\n",
" <th>635</th>\n",
" <td>is</td>\n",
" <td>110</td>\n",
" <td>0.004446</td>\n",
" <td>-4.700480</td>\n",
" <td>-0.020896</td>\n",
" </tr>\n",
" <tr>\n",
" <th>345</th>\n",
" <td>int</td>\n",
" <td>104</td>\n",
" <td>0.004203</td>\n",
" <td>-4.644391</td>\n",
" <td>-0.019521</td>\n",
" </tr>\n",
" <tr>\n",
" <th>2</th>\n",
" <td>all</td>\n",
" <td>88</td>\n",
" <td>0.003556</td>\n",
" <td>-4.477337</td>\n",
" <td>-0.015923</td>\n",
" </tr>\n",
" <tr>\n",
" <th>789</th>\n",
" <td>of</td>\n",
" <td>75</td>\n",
" <td>0.003031</td>\n",
" <td>-4.317488</td>\n",
" <td>-0.013086</td>\n",
" </tr>\n",
" <tr>\n",
" <th>570</th>\n",
" <td>and</td>\n",
" <td>71</td>\n",
" <td>0.002869</td>\n",
" <td>-4.262680</td>\n",
" <td>-0.012231</td>\n",
" </tr>\n",
" <tr>\n",
" <th>887</th>\n",
" <td>no</td>\n",
" <td>70</td>\n",
" <td>0.002829</td>\n",
" <td>-4.248495</td>\n",
" <td>-0.012019</td>\n",
" </tr>\n",
" <tr>\n",
" <th>568</th>\n",
" <td>com</td>\n",
" <td>67</td>\n",
" <td>0.002708</td>\n",
" <td>-4.204693</td>\n",
" <td>-0.011385</td>\n",
" </tr>\n",
" <tr>\n",
" <th>761</th>\n",
" <td>for</td>\n",
" <td>65</td>\n",
" <td>0.002627</td>\n",
" <td>-4.174387</td>\n",
" <td>-0.010966</td>\n",
" </tr>\n",
" <tr>\n",
" <th>228</th>\n",
" <td>data</td>\n",
" <td>62</td>\n",
" <td>0.002506</td>\n",
" <td>-4.127134</td>\n",
" <td>-0.010341</td>\n",
" </tr>\n",
" <tr>\n",
" <th>527</th>\n",
" <td>are</td>\n",
" <td>60</td>\n",
" <td>0.002425</td>\n",
" <td>-4.094345</td>\n",
" <td>-0.009928</td>\n",
" </tr>\n",
" <tr>\n",
" <th>404</th>\n",
" <td>applet</td>\n",
" <td>57</td>\n",
" <td>0.002304</td>\n",
" <td>-4.043051</td>\n",
" <td>-0.009314</td>\n",
" </tr>\n",
" <tr>\n",
" <th>846</th>\n",
" <td>hi</td>\n",
" <td>56</td>\n",
" <td>0.002263</td>\n",
" <td>-4.025352</td>\n",
" <td>-0.009110</td>\n",
" </tr>\n",
" <tr>\n",
" <th>168</th>\n",
" <td>obj</td>\n",
" <td>56</td>\n",
" <td>0.002263</td>\n",
" <td>-4.025352</td>\n",
" <td>-0.009110</td>\n",
" </tr>\n",
" <tr>\n",
" <th>844</th>\n",
" <td>but</td>\n",
" <td>55</td>\n",
" <td>0.002223</td>\n",
" <td>-4.007333</td>\n",
" <td>-0.008907</td>\n",
" </tr>\n",
" </tbody>\n",
"</table>\n",
"</div>"
],
"text/plain": [
" keywords number_of_times_word_appeared tf idf tf_idf\n",
"194 in 369 0.014913 -5.910797 -0.088146\n",
"317 re 258 0.010427 -5.552960 -0.057899\n",
"880 at 247 0.009982 -5.509388 -0.054996\n",
"783 on 243 0.009821 -5.493061 -0.053945\n",
"690 the 203 0.008204 -5.313206 -0.043590\n",
"876 an 199 0.008042 -5.293305 -0.042571\n",
"25 to 190 0.007679 -5.247024 -0.040290\n",
"799 or 167 0.006749 -5.117994 -0.034542\n",
"878 as 157 0.006345 -5.056246 -0.032082\n",
"588 java 135 0.005456 -4.905275 -0.026763\n",
"636 it 122 0.004930 -4.804021 -0.023686\n",
"635 is 110 0.004446 -4.700480 -0.020896\n",
"345 int 104 0.004203 -4.644391 -0.019521\n",
"2 all 88 0.003556 -4.477337 -0.015923\n",
"789 of 75 0.003031 -4.317488 -0.013086\n",
"570 and 71 0.002869 -4.262680 -0.012231\n",
"887 no 70 0.002829 -4.248495 -0.012019\n",
"568 com 67 0.002708 -4.204693 -0.011385\n",
"761 for 65 0.002627 -4.174387 -0.010966\n",
"228 data 62 0.002506 -4.127134 -0.010341\n",
"527 are 60 0.002425 -4.094345 -0.009928\n",
"404 applet 57 0.002304 -4.043051 -0.009314\n",
"846 hi 56 0.002263 -4.025352 -0.009110\n",
"168 obj 56 0.002263 -4.025352 -0.009110\n",
"844 but 55 0.002223 -4.007333 -0.008907"
]
},
"execution_count": 8,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"df = df.sort_values('tf_idf',ascending=True)\n",
"df.to_csv('Keywords.csv')\n",
"df.head(25)"
]
},
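{
"cell_type": "markdown",
"metadata": {},
"source": [
"The highest-ranked \"keywords\" above are mostly common English stop words, since with a single document the frequent terms simply get the most negative tf_idf. A minimal sketch, assuming a small hand-picked stop-word list, of filtering them out before ranking:"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Illustrative stop-word list drawn from the table above; a fuller list\n",
"# (e.g. from NLTK) would normally be used instead.\n",
"stop_words = {'in', 're', 'at', 'on', 'the', 'an', 'to', 'or', 'as', 'it',\n",
"              'is', 'of', 'and', 'no', 'for', 'are', 'but'}\n",
"filtered = df[~df['keywords'].isin(stop_words)]\n",
"filtered.head(10)"
]
},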
{
"cell_type": "markdown",
"metadata": {},
"source": [
"***"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Second Method - Using Gensim library"
]
},
{
"cell_type": "code",
"execution_count": 9,
"metadata": {},
"outputs": [
{
"name": "stderr",
"output_type": "stream",
"text": [
"RequestsDependencyWarning: urllib3 (1.22) or chardet (2.3.0) doesn't match a supported version! [__init__.py:80]\n",
"UserWarning: detected Windows; aliasing chunkize to chunkize_serial [utils.py:1197]\n"
]
}
],
"source": [
"from gensim.summarization import keywords\n",
"import warnings\n",
"warnings.filterwarnings(\"ignore\")"
]
},
{
"cell_type": "code",
"execution_count": 10,
"metadata": {},
"outputs": [],
"source": [
"values = keywords(text=text,split='\\n',scores=True)"
]
},
{
"cell_type": "code",
"execution_count": 11,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"<div>\n",
"<style scoped>\n",
" .dataframe tbody tr th:only-of-type {\n",
" vertical-align: middle;\n",
" }\n",
"\n",
" .dataframe tbody tr th {\n",
" vertical-align: top;\n",
" }\n",
"\n",
" .dataframe thead th {\n",
" text-align: right;\n",
" }\n",
"</style>\n",
"<table border=\"1\" class=\"dataframe\">\n",
" <thead>\n",
" <tr style=\"text-align: right;\">\n",
" <th></th>\n",
" <th>keyword</th>\n",
" <th>score</th>\n",
" </tr>\n",
" </thead>\n",
" <tbody>\n",
" <tr>\n",
" <th>0</th>\n",
" <td>java basics</td>\n",
" <td>0.314014</td>\n",
" </tr>\n",
" <tr>\n",
" <th>2</th>\n",
" <td>methods</td>\n",
" <td>0.247325</td>\n",
" </tr>\n",
" <tr>\n",
" <th>1</th>\n",
" <td>method</td>\n",
" <td>0.247325</td>\n",
" </tr>\n",
" <tr>\n",
" <th>3</th>\n",
" <td>applets</td>\n",
" <td>0.241786</td>\n",
" </tr>\n",
" <tr>\n",
" <th>4</th>\n",
" <td>applet</td>\n",
" <td>0.241786</td>\n",
" </tr>\n",
" <tr>\n",
" <th>5</th>\n",
" <td>class</td>\n",
" <td>0.219800</td>\n",
" </tr>\n",
" <tr>\n",
" <th>6</th>\n",
" <td>classes</td>\n",
" <td>0.219800</td>\n",
" </tr>\n",
" <tr>\n",
" <th>7</th>\n",
" <td>objects</td>\n",
" <td>0.190636</td>\n",
" </tr>\n",
" <tr>\n",
" <th>8</th>\n",
" <td>object</td>\n",
" <td>0.190636</td>\n",
" </tr>\n",
" <tr>\n",
" <th>9</th>\n",
" <td>programs</td>\n",
" <td>0.163243</td>\n",
" </tr>\n",
" </tbody>\n",
"</table>\n",
"</div>"
],
"text/plain": [
" keyword score\n",
"0 java basics 0.314014\n",
"2 methods 0.247325\n",
"1 method 0.247325\n",
"3 applets 0.241786\n",
"4 applet 0.241786\n",
"5 class 0.219800\n",
"6 classes 0.219800\n",
"7 objects 0.190636\n",
"8 object 0.190636\n",
"9 programs 0.163243"
]
},
"execution_count": 11,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"data = pd.DataFrame(values,columns=['keyword','score'])\n",
"data = data.sort_values('score',ascending=False)\n",
"data.head(10)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"***"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Third Approach - Using RAKE (Rapid Automatic Keyword Extraction)"
]
},
{
"cell_type": "code",
"execution_count": 12,
"metadata": {},
"outputs": [],
"source": [
"from rake_nltk import Rake"
]
},
{
"cell_type": "code",
"execution_count": 13,
"metadata": {},
"outputs": [],
"source": [
"r = Rake()\n",
"r.extract_keywords_from_text(text)"
]
},
{
"cell_type": "code",
"execution_count": 14,
"metadata": {},
"outputs": [],
"source": [
"phrases = r.get_ranked_phrases_with_scores()"
]
},
{
"cell_type": "code",
"execution_count": 15,
"metadata": {},
"outputs": [],
"source": [
"table = pd.DataFrame(phrases,columns=['score','Phrase'])\n",
"table = table.sort_values('score',ascending=False)\n",
"# table.head(10)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"***"
]
}
],
"metadata": {
"kernelspec": {
"display_name": "Python 2",
"language": "python",
"name": "python2"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 2
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython2",
"version": "2.7.15"
}
},
"nbformat": 4,
"nbformat_minor": 2
}