{
"cells": [
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Preliminaries"
]
},
{
"cell_type": "code",
"execution_count": 1,
"metadata": {},
"outputs": [],
"source": [
"import pandas as pd\n",
"import numpy as np\n",
"import PyPDF2\n",
"import textract\n",
"import re"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Reading Text"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"- Convert the PDF file to plain text for easier pre-processing"
]
},
{
"cell_type": "code",
"execution_count": 2,
"metadata": {},
"outputs": [],
"source": [
"filename = 'JavaBasics-notes.pdf'\n",
"\n",
"pdfFileObj = open(filename, 'rb')  # open the PDF in binary mode\n",
"pdfReader = PyPDF2.PdfFileReader(pdfFileObj)  # readable object that will be parsed\n",
"num_pages = pdfReader.numPages  # needed to iterate over all the pages\n",
"\n",
"count = 0\n",
"text = \"\"\n",
"\n",
"while count < num_pages:  # read each page in turn\n",
"    pageObj = pdfReader.getPage(count)\n",
"    count += 1\n",
"    text += pageObj.extractText()\n",
"\n",
"# PyPDF2 cannot read scanned (image-based) PDFs, so if it returned no text\n",
"# we fall back to the textract OCR pipeline (tesseract) to extract it.\n",
"if text == \"\":\n",
"    text = textract.process('http://bit.ly/epo_keyword_extraction_document', method='tesseract', language='eng')\n",
"\n",
"# text now contains all the text derived from our PDF file."
]
},
{
"cell_type": "code",
"execution_count": 3,
"metadata": {},
"outputs": [],
"source": [
"text = text.encode('ascii', 'ignore').lower()  # drop non-ASCII characters and lowercase the text"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Extracting Keywords"
]
},
{
"cell_type": "code",
"execution_count": 4,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"3410"
]
},
"execution_count": 4,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"keywords = re.findall(r'[a-zA-Z]\\w+',text)\n",
"len(keywords) #Total keywords in document"
]
},
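{
"cell_type": "markdown",
"metadata": {},
"source": [
"A quick sanity check (toy sentence invented for illustration) of what the `[a-zA-Z]\\w+` pattern keeps: tokens that start with a letter and are at least two characters long, so digits and single-letter tokens are dropped."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"sample = 'Java 8 applets run in-browser since 1995'\n",
"re.findall(r'[a-zA-Z]\\w+', sample)"
]
},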
{
"cell_type": "code",
"execution_count": 5,
"metadata": {},
"outputs": [],
"source": [
"df = pd.DataFrame(list(set(keywords)),columns=['keywords']) #Dataframe with unique keywords to avoid repetition in rows"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Calculating Weightage"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"- In information retrieval, tf–idf or TFIDF, short for term frequency–inverse document frequency, is a numerical statistic intended to reflect how important a word is to a document in a collection or corpus. It is often used as a weighting factor in information retrieval searches, text mining, and user modeling."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"- __TF: Term Frequency__ measures how frequently a term occurs in a document. Since documents differ in length, a term may appear many more times in long documents than in shorter ones. Thus, the term frequency is often divided by the document length (i.e. the total number of terms in the document) as a way of normalization:\n",
"\n",
"__TF(t) = (Number of times term t appears in a document) / (Total number of terms in the document).__\n",
"\n",
"- __IDF: Inverse Document Frequency__ measures how important a term is. While computing TF, all terms are considered equally important. However, certain terms, such as \"is\", \"of\", and \"that\", may appear many times yet carry little importance. Thus we need to weigh down the frequent terms while scaling up the rare ones, by computing the following:\n",
"\n",
"__IDF(t) = log_e(Total number of documents / Number of documents with term t in it).__"
]
},
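{
"cell_type": "markdown",
"metadata": {},
"source": [
"A hand-worked sketch of the two formulas on a toy three-document corpus (the corpus and term are invented purely for illustration):"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"import numpy as np\n",
"\n",
"docs = ['java is a language',\n",
"        'java applets run in a browser',\n",
"        'the method returns an object']\n",
"term = 'java'\n",
"\n",
"tokens = docs[0].split()\n",
"tf = tokens.count(term) / float(len(tokens))            # 1 occurrence / 4 terms\n",
"docs_with_term = sum(1 for d in docs if term in d.split())\n",
"idf = np.log(len(docs) / float(docs_with_term))         # log_e(3 / 2)\n",
"tf * idf"
]
},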
{
"cell_type": "code",
"execution_count": 6,
"metadata": {},
"outputs": [],
"source": [
"def weightage(word, text, number_of_documents=1):\n",
"    # Count occurrences of the word (note: this also matches it inside longer words)\n",
"    word_list = re.findall(word, text)\n",
"    number_of_times_word_appeared = len(word_list)\n",
"    # TF is normalised here by the character length of the text\n",
"    tf = number_of_times_word_appeared / float(len(text))\n",
"    # With a single document, IDF is negative whenever the word appears more than once\n",
"    idf = np.log(number_of_documents / float(number_of_times_word_appeared))\n",
"    tf_idf = tf * idf\n",
"    return number_of_times_word_appeared, tf, idf, tf_idf"
]
},
{
"cell_type": "code",
"execution_count": 7,
"metadata": {},
"outputs": [],
"source": [
"df['number_of_times_word_appeared'] = df['keywords'].apply(lambda x: weightage(x,text)[0])\n",
"df['tf'] = df['keywords'].apply(lambda x: weightage(x,text)[1])\n",
"df['idf'] = df['keywords'].apply(lambda x: weightage(x,text)[2])\n",
"df['tf_idf'] = df['keywords'].apply(lambda x: weightage(x,text)[3])"
]
},
{
"cell_type": "code",
"execution_count": 8,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"<div>\n",
"<style scoped>\n",
" .dataframe tbody tr th:only-of-type {\n",
" vertical-align: middle;\n",
" }\n",
"\n",
" .dataframe tbody tr th {\n",
" vertical-align: top;\n",
" }\n",
"\n",
" .dataframe thead th {\n",
" text-align: right;\n",
" }\n",
"</style>\n",
"<table border=\"1\" class=\"dataframe\">\n",
" <thead>\n",
" <tr style=\"text-align: right;\">\n",
" <th></th>\n",
" <th>keywords</th>\n",
" <th>number_of_times_word_appeared</th>\n",
" <th>tf</th>\n",
" <th>idf</th>\n",
" <th>tf_idf</th>\n",
" </tr>\n",
" </thead>\n",
" <tbody>\n",
" <tr>\n",
" <th>194</th>\n",
" <td>in</td>\n",
" <td>369</td>\n",
" <td>0.014913</td>\n",
" <td>-5.910797</td>\n",
" <td>-0.088146</td>\n",
" </tr>\n",
" <tr>\n",
" <th>317</th>\n",
" <td>re</td>\n",
" <td>258</td>\n",
" <td>0.010427</td>\n",
" <td>-5.552960</td>\n",
" <td>-0.057899</td>\n",
" </tr>\n",
" <tr>\n",
" <th>880</th>\n",
" <td>at</td>\n",
" <td>247</td>\n",
" <td>0.009982</td>\n",
" <td>-5.509388</td>\n",
" <td>-0.054996</td>\n",
" </tr>\n",
" <tr>\n",
" <th>783</th>\n",
" <td>on</td>\n",
" <td>243</td>\n",
" <td>0.009821</td>\n",
" <td>-5.493061</td>\n",
" <td>-0.053945</td>\n",
" </tr>\n",
" <tr>\n",
" <th>690</th>\n",
" <td>the</td>\n",
" <td>203</td>\n",
" <td>0.008204</td>\n",
" <td>-5.313206</td>\n",
" <td>-0.043590</td>\n",
" </tr>\n",
" <tr>\n",
" <th>876</th>\n",
" <td>an</td>\n",
" <td>199</td>\n",
" <td>0.008042</td>\n",
" <td>-5.293305</td>\n",
" <td>-0.042571</td>\n",
" </tr>\n",
" <tr>\n",
" <th>25</th>\n",
" <td>to</td>\n",
" <td>190</td>\n",
" <td>0.007679</td>\n",
" <td>-5.247024</td>\n",
" <td>-0.040290</td>\n",
" </tr>\n",
" <tr>\n",
" <th>799</th>\n",
" <td>or</td>\n",
" <td>167</td>\n",
" <td>0.006749</td>\n",
" <td>-5.117994</td>\n",
" <td>-0.034542</td>\n",
" </tr>\n",
" <tr>\n",
" <th>878</th>\n",
" <td>as</td>\n",
" <td>157</td>\n",
" <td>0.006345</td>\n",
" <td>-5.056246</td>\n",
" <td>-0.032082</td>\n",
" </tr>\n",
" <tr>\n",
" <th>588</th>\n",
" <td>java</td>\n",
" <td>135</td>\n",
" <td>0.005456</td>\n",
" <td>-4.905275</td>\n",
" <td>-0.026763</td>\n",
" </tr>\n",
" <tr>\n",
" <th>636</th>\n",
" <td>it</td>\n",
" <td>122</td>\n",
" <td>0.004930</td>\n",
" <td>-4.804021</td>\n",
" <td>-0.023686</td>\n",
" </tr>\n",
" <tr>\n",
" <th>635</th>\n",
" <td>is</td>\n",
" <td>110</td>\n",
" <td>0.004446</td>\n",
" <td>-4.700480</td>\n",
" <td>-0.020896</td>\n",
" </tr>\n",
" <tr>\n",
" <th>345</th>\n",
" <td>int</td>\n",
" <td>104</td>\n",
" <td>0.004203</td>\n",
" <td>-4.644391</td>\n",
" <td>-0.019521</td>\n",
" </tr>\n",
" <tr>\n",
" <th>2</th>\n",
" <td>all</td>\n",
" <td>88</td>\n",
" <td>0.003556</td>\n",
" <td>-4.477337</td>\n",
" <td>-0.015923</td>\n",
" </tr>\n",
" <tr>\n",
" <th>789</th>\n",
" <td>of</td>\n",
" <td>75</td>\n",
" <td>0.003031</td>\n",
" <td>-4.317488</td>\n",
" <td>-0.013086</td>\n",
" </tr>\n",
" <tr>\n",
" <th>570</th>\n",
" <td>and</td>\n",
" <td>71</td>\n",
" <td>0.002869</td>\n",
" <td>-4.262680</td>\n",
" <td>-0.012231</td>\n",
" </tr>\n",
" <tr>\n",
" <th>887</th>\n",
" <td>no</td>\n",
" <td>70</td>\n",
" <td>0.002829</td>\n",
" <td>-4.248495</td>\n",
" <td>-0.012019</td>\n",
" </tr>\n",
" <tr>\n",
" <th>568</th>\n",
" <td>com</td>\n",
" <td>67</td>\n",
" <td>0.002708</td>\n",
" <td>-4.204693</td>\n",
" <td>-0.011385</td>\n",
" </tr>\n",
" <tr>\n",
" <th>761</th>\n",
" <td>for</td>\n",
" <td>65</td>\n",
" <td>0.002627</td>\n",
" <td>-4.174387</td>\n",
" <td>-0.010966</td>\n",
" </tr>\n",
" <tr>\n",
" <th>228</th>\n",
" <td>data</td>\n",
" <td>62</td>\n",
" <td>0.002506</td>\n",
" <td>-4.127134</td>\n",
" <td>-0.010341</td>\n",
" </tr>\n",
" <tr>\n",
" <th>527</th>\n",
" <td>are</td>\n",
" <td>60</td>\n",
" <td>0.002425</td>\n",
" <td>-4.094345</td>\n",
" <td>-0.009928</td>\n",
" </tr>\n",
" <tr>\n",
" <th>404</th>\n",
" <td>applet</td>\n",
" <td>57</td>\n",
" <td>0.002304</td>\n",
" <td>-4.043051</td>\n",
" <td>-0.009314</td>\n",
" </tr>\n",
" <tr>\n",
" <th>846</th>\n",
" <td>hi</td>\n",
" <td>56</td>\n",
" <td>0.002263</td>\n",
" <td>-4.025352</td>\n",
" <td>-0.009110</td>\n",
" </tr>\n",
" <tr>\n",
" <th>168</th>\n",
" <td>obj</td>\n",
" <td>56</td>\n",
" <td>0.002263</td>\n",
" <td>-4.025352</td>\n",
" <td>-0.009110</td>\n",
" </tr>\n",
" <tr>\n",
" <th>844</th>\n",
" <td>but</td>\n",
" <td>55</td>\n",
" <td>0.002223</td>\n",
" <td>-4.007333</td>\n",
" <td>-0.008907</td>\n",
" </tr>\n",
" </tbody>\n",
"</table>\n",
"</div>"
],
"text/plain": [
" keywords number_of_times_word_appeared tf idf tf_idf\n",
"194 in 369 0.014913 -5.910797 -0.088146\n",
"317 re 258 0.010427 -5.552960 -0.057899\n",
"880 at 247 0.009982 -5.509388 -0.054996\n",
"783 on 243 0.009821 -5.493061 -0.053945\n",
"690 the 203 0.008204 -5.313206 -0.043590\n",
"876 an 199 0.008042 -5.293305 -0.042571\n",
"25 to 190 0.007679 -5.247024 -0.040290\n",
"799 or 167 0.006749 -5.117994 -0.034542\n",
"878 as 157 0.006345 -5.056246 -0.032082\n",
"588 java 135 0.005456 -4.905275 -0.026763\n",
"636 it 122 0.004930 -4.804021 -0.023686\n",
"635 is 110 0.004446 -4.700480 -0.020896\n",
"345 int 104 0.004203 -4.644391 -0.019521\n",
"2 all 88 0.003556 -4.477337 -0.015923\n",
"789 of 75 0.003031 -4.317488 -0.013086\n",
"570 and 71 0.002869 -4.262680 -0.012231\n",
"887 no 70 0.002829 -4.248495 -0.012019\n",
"568 com 67 0.002708 -4.204693 -0.011385\n",
"761 for 65 0.002627 -4.174387 -0.010966\n",
"228 data 62 0.002506 -4.127134 -0.010341\n",
"527 are 60 0.002425 -4.094345 -0.009928\n",
"404 applet 57 0.002304 -4.043051 -0.009314\n",
"846 hi 56 0.002263 -4.025352 -0.009110\n",
"168 obj 56 0.002263 -4.025352 -0.009110\n",
"844 but 55 0.002223 -4.007333 -0.008907"
]
},
"execution_count": 8,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"df = df.sort_values('tf_idf',ascending=True)\n",
"df.to_csv('Keywords.csv')\n",
"df.head(25)"
]
},
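{
"cell_type": "markdown",
"metadata": {},
"source": [
"The highest-ranked \"keywords\" above are mostly common English stop words, since with a single document the frequent terms simply get the most negative tf_idf. A minimal sketch, assuming a small hand-picked stop-word list, of filtering them out before ranking:"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Illustrative stop-word list drawn from the table above; a fuller list\n",
"# (e.g. from NLTK) would normally be used instead.\n",
"stop_words = {'in', 're', 'at', 'on', 'the', 'an', 'to', 'or', 'as', 'it',\n",
"              'is', 'of', 'and', 'no', 'for', 'are', 'but'}\n",
"filtered = df[~df['keywords'].isin(stop_words)]\n",
"filtered.head(10)"
]
},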
{
"cell_type": "markdown",
"metadata": {},
"source": [
"***"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Second Method - Using Gensim library"
]
},
{
"cell_type": "code",
"execution_count": 9,
"metadata": {},
"outputs": [
{
"name": "stderr",
"output_type": "stream",
"text": [
"RequestsDependencyWarning: urllib3 (1.22) or chardet (2.3.0) doesn't match a supported version! [__init__.py:80]\n",
"UserWarning: detected Windows; aliasing chunkize to chunkize_serial [utils.py:1197]\n"
]
}
],
"source": [
"from gensim.summarization import keywords\n",
"import warnings\n",
"warnings.filterwarnings(\"ignore\")"
]
},
{
"cell_type": "code",
"execution_count": 10,
"metadata": {},
"outputs": [],
"source": [
"values = keywords(text=text,split='\\n',scores=True)"
]
},
{
"cell_type": "code",
"execution_count": 11,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"<div>\n",
"<style scoped>\n",
" .dataframe tbody tr th:only-of-type {\n",
" vertical-align: middle;\n",
" }\n",
"\n",
" .dataframe tbody tr th {\n",
" vertical-align: top;\n",
" }\n",
"\n",
" .dataframe thead th {\n",
" text-align: right;\n",
" }\n",
"</style>\n",
"<table border=\"1\" class=\"dataframe\">\n",
" <thead>\n",
" <tr style=\"text-align: right;\">\n",
" <th></th>\n",
" <th>keyword</th>\n",
" <th>score</th>\n",
" </tr>\n",
" </thead>\n",
" <tbody>\n",
" <tr>\n",
" <th>0</th>\n",
" <td>java basics</td>\n",
" <td>0.314014</td>\n",
" </tr>\n",
" <tr>\n",
" <th>2</th>\n",
" <td>methods</td>\n",
" <td>0.247325</td>\n",
" </tr>\n",
" <tr>\n",
" <th>1</th>\n",
" <td>method</td>\n",
" <td>0.247325</td>\n",
" </tr>\n",
" <tr>\n",
" <th>3</th>\n",
" <td>applets</td>\n",
" <td>0.241786</td>\n",
" </tr>\n",
" <tr>\n",
" <th>4</th>\n",
" <td>applet</td>\n",
" <td>0.241786</td>\n",
" </tr>\n",
" <tr>\n",
" <th>5</th>\n",
" <td>class</td>\n",
" <td>0.219800</td>\n",
" </tr>\n",
" <tr>\n",
" <th>6</th>\n",
" <td>classes</td>\n",
" <td>0.219800</td>\n",
" </tr>\n",
" <tr>\n",
" <th>7</th>\n",
" <td>objects</td>\n",
" <td>0.190636</td>\n",
" </tr>\n",
" <tr>\n",
" <th>8</th>\n",
" <td>object</td>\n",
" <td>0.190636</td>\n",
" </tr>\n",
" <tr>\n",
" <th>9</th>\n",
" <td>programs</td>\n",
" <td>0.163243</td>\n",
" </tr>\n",
" </tbody>\n",
"</table>\n",
"</div>"
],
"text/plain": [
" keyword score\n",
"0 java basics 0.314014\n",
"2 methods 0.247325\n",
"1 method 0.247325\n",
"3 applets 0.241786\n",
"4 applet 0.241786\n",
"5 class 0.219800\n",
"6 classes 0.219800\n",
"7 objects 0.190636\n",
"8 object 0.190636\n",
"9 programs 0.163243"
]
},
"execution_count": 11,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"data = pd.DataFrame(values,columns=['keyword','score'])\n",
"data = data.sort_values('score',ascending=False)\n",
"data.head(10)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"***"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Third Approach - Using RAKE (Rapid Automatic Keyword Extraction)"
]
},
{
"cell_type": "code",
"execution_count": 12,
"metadata": {},
"outputs": [],
"source": [
"from rake_nltk import Rake"
]
},
{
"cell_type": "code",
"execution_count": 13,
"metadata": {},
"outputs": [],
"source": [
"r = Rake()\n",
"r.extract_keywords_from_text(text)"
]
},
{
"cell_type": "code",
"execution_count": 14,
"metadata": {},
"outputs": [],
"source": [
"phrases = r.get_ranked_phrases_with_scores()"
]
},
{
"cell_type": "code",
"execution_count": 15,
"metadata": {},
"outputs": [],
"source": [
"table = pd.DataFrame(phrases,columns=['score','Phrase'])\n",
"table = table.sort_values('score',ascending=False)\n",
"# table.head(10)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"***"
]
}
],
"metadata": {
"kernelspec": {
"display_name": "Python 2",
"language": "python",
"name": "python2"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 2
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython2",
"version": "2.7.15"
}
},
"nbformat": 4,
"nbformat_minor": 2
}