@juanshishido
Created October 12, 2015 16:26
{
"cells": [
{
"metadata": {
"trusted": true,
"collapsed": false
},
"cell_type": "code",
"source": "import re\n\nimport requests\nimport numpy as np\nimport nltk\nfrom nltk.util import ngrams\nfrom nltk.corpus import brown\nfrom nltk.corpus import stopwords\nfrom nltk.corpus import wordnet as wordnet\nfrom nltk.tokenize import regexp_tokenize\nfrom nltk.stem.wordnet import WordNetLemmatizer",
"execution_count": 1,
"outputs": []
},
{
"metadata": {},
"cell_type": "markdown",
"source": "### Algorithm\n\nA description of the keyphrase extractor.\n\n#### Original\n\n1. Load the data\n2. Remove abbreviation periods.\n3. Tokenize, by sentence.\n4. Remove punctuation in each sentence.\n5. Tokenize, by word.\n6. POS Tag all the words.\n7. Lemmatize (using POS) all the words.\n * Keep track of mapping.\n8. Normalize text.\n * Remove stop words in each sentence.\n * Check whether *other* words should be removed.\n * Lowercase.\n10. TFIDF. A document, in this case, will be a sentence.\n11. Identify *sentences* in the original corpus with these words.\n12. Chunk the sentence to pull out keyphrases.\n\n#### Final\n\n1. Load the data.\n2. Tokenize, by sentence.\n3. Remove punctuation in each sentence.\n4. Tokenize, by word.\n5. Normalize the words.\n\t* Remove stop words, numbers, and short words. Also, lowercases text.\n6. TFIDF. Words are terms. Sentences are documents.\n7. Identify key words based on TFIDF.\n8. Identify key \"phrases\" in the normalized text.\n9. Identify the top ngrams. Default is top 50 bigrams.\n10. Identify the top sentences in the original text that include the top ngrams. Default is top 5 sentences.\n11. Chunk the top sentences from 6 for noun phrases.\n12. Return the top ngrams, noun phrases, and the top sentence, if less than 2,000 characters."
},
{
"metadata": {
"trusted": true,
"collapsed": false
},
"cell_type": "code",
"source": "def load_text(file_path):\n \"\"\"Given a file path, loads a .txt file.\n Also removes chapter and section headings.\n Returns a single string.\n \"\"\"\n with open (file_path, 'r', encoding='utf-8') as jsm:\n text = jsm.read()\n \n return re.sub('\\s+', ' ',\n re.sub(r'[A-Z]{2,}', '',\n re.sub('((?<=[A-Z])\\sI | I\\s(?=[A-Z]))', ' ', text)))",
"execution_count": 2,
"outputs": []
},
{
"metadata": {
"trusted": true,
"collapsed": true
},
"cell_type": "code",
"source": "def split_on_sentence(text):\n \"\"\"Tokenize the text on sentences.\n Returns a list of strings (sentences).\n \"\"\"\n sent_tokenizer = nltk.data.load('tokenizers/punkt/english.pickle')\n return sent_tokenizer.tokenize(text)",
"execution_count": 3,
"outputs": []
},
{
"metadata": {
"trusted": true,
"collapsed": true
},
"cell_type": "code",
"source": "def re_punc(text):\n \"\"\"Remove all punctuation. Keep apostrophes.\"\"\"\n return re.sub(r'[!\"#$%&()*+,\\-\\./:;<=>?@\\[\\]^_`\\\\{\\|}]+', '', text)",
"execution_count": 4,
"outputs": []
},
{
"metadata": {
"trusted": true,
"collapsed": false
},
"cell_type": "code",
"source": "def remove_punctuation(sentences):\n \"\"\"Remove punctuation based on `re_punc`.\n Returns either a list of string or a single string,\n based on the input type.\n \"\"\"\n if type(sentences) is list:\n return [re_punc(sentence).strip() for sentence in sentences]\n else:\n return re_punc(sentences).strip()",
"execution_count": 5,
"outputs": []
},
{
"metadata": {
"trusted": true,
"collapsed": false
},
"cell_type": "code",
"source": "def split_on_word(text):\n \"\"\"Use regular expression tokenizer.\n Keep apostrophes.\n Returns a list of lists, one list for each sentence:\n [[word, word], [word, word, ..., word], ...].\n \"\"\"\n if type(text) is list:\n return [regexp_tokenize(sentence, pattern=\"\\w+(?:[-']\\w+)*\") for sentence in text]\n else:\n return regexp_tokenize(text, pattern=\"\\w+(?:[-']\\w+)*\")",
"execution_count": 6,
"outputs": []
},
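{
"metadata": {},
"cell_type": "markdown",
"source": "A quick illustration, using a made-up sentence, of how the `regexp_tokenize` pattern above behaves: contractions and hyphenated words are kept as single tokens, and punctuation is dropped.\n\n```python\nsplit_on_word(\"I'm re-reading Mill's text.\")\n# [\"I'm\", 're-reading', \"Mill's\", 'text']\n```"
},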
{
"metadata": {
"trusted": true,
"collapsed": true
},
"cell_type": "code",
"source": "def normalize(tokenized_words):\n \"\"\"Removes stop words, numbers, short words, and lowercases text.\n Returns a list of lists, one list for each sentence:\n [[word, word], [word, word, ..., word], ...].\n \"\"\"\n stop_words = stopwords.words('english')\n return [[w.lower() for w in sent\n if (w.lower() not in stop_words) and\\\n (not(w.lower().isnumeric())) and\\\n (len(w) > 2)]\n for sent in tokenized_words]",
"execution_count": 7,
"outputs": []
},
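{
"metadata": {},
"cell_type": "markdown",
"source": "A quick illustration of `normalize` with a made-up tokenized sentence: stop words, numbers, and one- or two-character tokens are dropped, and the remaining words are lowercased.\n\n```python\nnormalize([['The', 'cat', 'sat', 'on', 'the', 'mat', 'in', '1984']])\n# [['cat', 'sat', 'mat']]\n```"
},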
{
"metadata": {
"trusted": true,
"collapsed": true
},
"cell_type": "code",
"source": "def tdm(unique_words, normalized_words):\n \"\"\"Create a term document matrix.\n Return the m (unique words, sorted) by n (normalized words)\n matrix, M.\"\"\"\n M = np.zeros([len(unique_words), len(normalized_words)])\n for m, term in enumerate(unique_words):\n for n, doc in enumerate(normalized_words):\n M[m, n] = doc.count(term)\n return M",
"execution_count": 8,
"outputs": []
},
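{
"metadata": {},
"cell_type": "markdown",
"source": "A toy example of the term-document matrix: rows are the sorted unique terms, columns are the (normalized) sentences, and each entry is the raw count of the term in that sentence.\n\n```python\ntdm(['cat', 'dog'], [['cat', 'dog', 'cat'], ['dog']])\n# array([[ 2.,  0.],\n#        [ 1.,  1.]])\n```"
},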
{
"metadata": {
"trusted": true,
"collapsed": true
},
"cell_type": "code",
"source": "def tfidf_weights(tdm):\n \"\"\"Calculate the term frequency-inverse document frequency.\n Return an ndarray of the tf-idf weights.\n \"\"\"\n tf = np.sum(tdm, axis=1)\n idf = float(tdm.shape[1]) / np.sum(tdm > 0, axis=1)\n return (1 + np.log10(tf)) * np.log(idf)",
"execution_count": 9,
"outputs": []
},
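{
"metadata": {},
"cell_type": "markdown",
"source": "In the notation of `tfidf_weights`, with $N$ documents (sentences), corpus-wide term frequency $\\mathrm{tf}_t$ (the row sums of the term-document matrix), and document frequency $\\mathrm{df}_t$ (the number of sentences containing term $t$), each term's weight is\n\n$$w_t = (1 + \\log_{10}\\mathrm{tf}_t) \\cdot \\ln\\left(\\frac{N}{\\mathrm{df}_t}\\right)$$\n\nNote that the term-frequency component uses a base-10 logarithm while the inverse-document-frequency component uses a natural logarithm, exactly as written in the code."
},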
{
"metadata": {
"trusted": true,
"collapsed": true
},
"cell_type": "code",
"source": "def key_words(unique_words, weights):\n \"\"\"Return a list of the 'top' words.\"\"\"\n cutoff = weights.max() * 0.95\n return list(np.array(unique_words)[weights > cutoff])",
"execution_count": 10,
"outputs": []
},
{
"metadata": {
"trusted": true,
"collapsed": true
},
"cell_type": "code",
"source": "def key_phrases(key_words, normalized_words):\n \"\"\"Identify key phrases in the normalized text.\n Returns a list of lists, one list for each sentence:\n [[word, word], [word, word, ..., word], ...].\n \"\"\"\n kp = []\n for w in key_words:\n for sent in normalized_words:\n if w in sent:\n kp.append(sent)\n return kp",
"execution_count": 11,
"outputs": []
},
{
"metadata": {
"trusted": true,
"collapsed": true
},
"cell_type": "code",
"source": "def kp_ngrams(kp, n=2, top=50):\n \"\"\"Identify the top ngrams in the key phrases.\n Return a list of ngrams.\n \"\"\"\n tokenized_kp = [w for sent in kp for w in sent]\n ng = [bg for bg in ngrams(tokenized_kp, n)]\n ng_fd = nltk.FreqDist(ng)\n mc = ng_fd.most_common(top)\n return [' '.join(t[0]) for t in mc]",
"execution_count": 12,
"outputs": []
},
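{
"metadata": {},
"cell_type": "markdown",
"source": "A toy example of `kp_ngrams` with made-up key phrases: the sentences are flattened into one token stream, bigrams are counted, and the most common ones are joined back into strings (so a bigram can span the boundary between two key phrases).\n\n```python\nkp_ngrams([['coal', 'industry'], ['coal', 'industry', 'unprofitable']], n=2, top=1)\n# ['coal industry']\n```"
},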
{
"metadata": {
"trusted": true,
"collapsed": true
},
"cell_type": "code",
"source": "def top_sents(top_ngrams, words, n_sents=5):\n \"\"\"Identify the sentences with the most ngrams.\n Return a list of sentences (of size `n_sents`).\"\"\"\n t_ng_s = []\n for g in top_ngrams:\n for sent in words:\n if ' '.join(sent).find(g) + 1:\n t_ng_s.append(' '.join(sent))\n ts = []\n for s in nltk.FreqDist(t_ng_s).most_common(n_sents):\n ts.append(s[0])\n return ts",
"execution_count": 13,
"outputs": []
},
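{
"metadata": {},
"cell_type": "markdown",
"source": "A toy example of `top_sents`: a sentence is counted once for each top ngram it contains, and the most frequently counted sentences are returned.\n\n```python\ntop_sents(['coal industry'],\n          [['the', 'coal', 'industry', 'is', 'old'], ['no', 'match', 'here']],\n          n_sents=1)\n# ['the coal industry is old']\n```"
},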
{
"metadata": {
"trusted": true,
"collapsed": true
},
"cell_type": "code",
"source": "def chunker(tagged_sentences):\n grammar = \"NP: {<DT>?<JJ.*>*<NN.*>+}\"\n cp = nltk.RegexpParser(grammar)\n ch = []\n for sent in tagged_sentences:\n tree = cp.parse(sent)\n for subtree in tree.subtrees():\n if subtree.label() == 'NP':\n ch.append(' '.join([t[0].lower() for t in subtree[:3]]))\n return sorted(list(set(ch)))",
"execution_count": 14,
"outputs": []
},
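{
"metadata": {},
"cell_type": "markdown",
"source": "The chunk grammar `NP: {<DT>?<JJ.*>*<NN.*>+}` matches an optional determiner, any number of adjectives, and one or more nouns; only the first three tokens of each match are kept. A made-up, pre-tagged example:\n\n```python\nchunker([[('The', 'DT'), ('British', 'JJ'), ('coal', 'NN'), ('industry', 'NN'),\n          ('is', 'VBZ'), ('unprofitable', 'JJ')]])\n# ['the british coal']\n```"
},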
{
"metadata": {
"trusted": true,
"collapsed": true
},
"cell_type": "code",
"source": "def keyphrase_extraction(kind='brown', ngram=2, url=None):\n \"\"\"Key phrase extraction.\n Returns either a list or a string, based on the length\n of the results.\"\"\"\n \n # load data\n if kind == 'brown':\n brown_sents = [' '.join(sent) for sent in brown.sents(categories='news')]\n words = split_on_word(remove_punctuation(list(brown_sents)))\n elif kind == 'jsm':\n text = load_text('/Users/JS/Code/_INFO256/text-collection/jsm-collection.txt')\n sentences = split_on_sentence(text)\n words = split_on_word(remove_punctuation(sentences))\n elif kind == 'mystery':\n assert url is not None, 'URL must be supplied.'\n r = requests.get(url)\n if r.reason == 'OK':\n sentences = split_on_sentence(r.text)\n words = split_on_word(remove_punctuation(sentences))\n else:\n raise ValueError('Invalid URL or server issue.')\n \n # container for key phrases\n summ = []\n # prepare words\n words_norm = normalize(words)\n words_norm_uniq = sorted(list(set([word for sent in words_norm for word in sent])))\n # tfidf\n M = tdm(words_norm_uniq, words_norm)\n weights = tfidf_weights(M)\n # key words\n kw = key_words(words_norm_uniq, weights)\n # key phrases\n kp = key_phrases(kw, words_norm)\n # top ngrams\n top_ngrams = kp_ngrams(kp, ngram, 100)\n summ.append(top_ngrams)\n # top ngram sentences\n ts = top_sents(top_ngrams, words)\n tsw = [nltk.pos_tag(word) for word in split_on_word(remove_punctuation(ts))]\n # noun phrases\n summ.append(chunker(tsw))\n # first sentence\n summ.append(ts[0])\n # flatten\n summary = [token for group in summ[:2] for token in group] + [summ[2]]\n # check size\n if len(' '.join(summary)) <= 2000:\n return summary\n else:\n return ' '.join(summary)[:2000]",
"execution_count": 15,
"outputs": []
},
{
"metadata": {},
"cell_type": "markdown",
"source": "---"
},
{
"metadata": {
"trusted": true,
"collapsed": false
},
"cell_type": "code",
"source": "keyphrase_extraction(kind='brown')",
"execution_count": 16,
"outputs": [
{
"output_type": "execute_result",
"data": {
"text/plain": "['farm equipment',\n 'last year',\n 'soviet union',\n 'farm machinery',\n 'new york',\n 'textile industry',\n 'gin machinery',\n 'potato chip',\n 'medical dental',\n 'would get',\n 'welsh coal',\n 'narcotics unit',\n 'baltimore ohio',\n 'chicago narcotics',\n 'plan would',\n '23d ward',\n 'get know',\n 'dental schools',\n 'first year',\n 'chip industry',\n 'diversified growth',\n 'growth stock',\n 'british government',\n 'coal industry',\n 'unprofitable large',\n 'coal british',\n \"can't sell\",\n \"stocks can't\",\n 'san francisco',\n 'precinct 23d',\n 'ohio railroad',\n 'stock fund',\n 'british coal',\n 'british french',\n 'united states',\n 'steel company',\n 'chicago white',\n 'cotton gin',\n 'industry unprofitable',\n 'union members',\n 'year earlier',\n 'north plains',\n 'coal stocks',\n 'coal seams',\n 'shaw hillsboro',\n 'new medical',\n 'fulton county',\n 'billion dollars',\n 'white sox',\n 'virginia coal',\n 'large coal',\n 'teamsters union',\n 'morton foods',\n 'union oil',\n 'fire fighters',\n 'plus cash',\n 'medical care',\n 'driving also',\n 'investment firms',\n \"gulf's holdings\",\n 'jossy north',\n 'line gin',\n 'farm output',\n 'industry shrink',\n \"finance government's\",\n 'fighters association',\n 'medical bills',\n 'passing yardage',\n 'sales extends',\n 'certain areas',\n 'fund basis',\n 'medical plan',\n 'detectives product',\n 'fund drive',\n 'sold would',\n 'billion finance',\n 'total offense',\n 'district court',\n 'miles southwest',\n 'boost medical',\n 'bought shares',\n 'chesapeake ohio',\n 'chairman howard',\n 'billion billion',\n 'indictment three',\n 'billion economy',\n 'coal cheap',\n 'buy shares',\n 'whether let',\n 'side side',\n 'fund oct',\n 'gin saws',\n 'shares mutual',\n 'common upon',\n 'james corcoran',\n 'per cent',\n 'malice individual',\n 'holdings could',\n 'offense yards',\n 'since depression',\n 'a year',\n 'august',\n 'cash',\n 'dealers',\n 'debentures',\n 'dissension',\n 'dollars',\n \"gulf's holdings\",\n 'international harvester co',\n 'large coal stocks',\n 'last year',\n 'mark v keeler',\n 'medical benefits',\n 'months',\n 'more farm machinery',\n 'officials',\n 'september retail sales',\n 'shares',\n 'surrender',\n 'the annual tax',\n 'the british coal',\n 'the fight',\n 'the fire department',\n 'the fire fighters',\n 'the first year',\n 'the medical plan',\n 'the teamsters union',\n 'top personnel',\n 'union',\n 'union oil',\n \"The British coal industry is unprofitable has large coal stocks it can't sell\"]"
},
"metadata": {},
"execution_count": 16
}
]
},
{
"metadata": {
"trusted": true,
"collapsed": false
},
"cell_type": "code",
"source": "keyphrase_extraction(kind='jsm')",
"execution_count": 17,
"outputs": [
{
"output_type": "execute_result",
"data": {
"text/plain": "\"per cent united states one another days labor straight lines men mortal yards cloth yards linen one hundred hydrogen oxygen sensation white two straight cost production que nous socrates mortal attribute mortality attributes man ten yards straight line days labour nervous system general proposition les plus gold silver gold dollar particulars particulars bushels wheat every one genus species hundred dollars socrates man silver dollar carbonic acid proposition men tous les heat electricity mark mark inclose space greater number right angles man therefore ten days lines angles labor capital duke wellington oxygen hydrogen general propositions grains silver bills exchange species genus lines inclose snow white grains pure quality whiteness pure silver times yards two hundred would cost deposition dew quod facile sur les new york attribute whiteness brought within per week les autres one two proposition socrates let suppose method difference human beings rate interest man mark like manner produces sensation qu'elles sont nous savons political economy seventeen yards minus multiplied hundred quarters crows black natural groups iron coal que ces minor premise acted upon general laws ces causes chemical action one per takes place sensation color years ago sun moon sulphuric acid dollar grains produce one number dollars twenty bushels a country a footrule a single straight a space a sufficient number all lines angles angles characteristic properties contraires d'ailleurs l'essence produisent declare definitions divers enunciation those express formula things inductive origin infinity long straight lines means measuring même les nous mêmes nous ne connaissons nous ne pouvons nous parceque nous savons qu'elles rencontrent telle qu'il existe quelque que ces causes reality space spaces straight lines such propositions the exact measurement the face the general proposition the logical aid the measurement the only ones the physical aid the plan the same thing the trigonometrical \""
},
"metadata": {},
"execution_count": 17
}
]
},
{
"metadata": {
"trusted": true,
"collapsed": false
},
"cell_type": "code",
"source": "keyphrase_extraction(kind='mystery', url='')",
"execution_count": null,
"outputs": []
},
{
"metadata": {
"trusted": true,
"collapsed": true
},
"cell_type": "markdown",
"source": "## Algorithm Description"
},
{
"metadata": {},
"cell_type": "markdown",
"source": "Once the data is loaded as a string, the algorithm uses the Punkt tokenizer to split the sentences. For each sentence, all punctuation is removed (except for apostrophes). The sentences are then split on words using `regexp_tokenize`. Contraction words, such as \"I'm\" are recognized as a single token. The resulting text is then normalized. English stop words are removed as are numbers and tokens with a length of one or two. The remaining words are also lowercased. The reasoning was to remove some of the noise when identifying important key words. Next, a term-document matrix is constructed. Here, the documents are the sentences. My collection has several *documents*, but I did not want to assume the same for the mystery collection, so I decided to use sentences. The weights are then calculated for each term. Key *words* are identified as those with a score greater than 95% of the maximum score. Then, key \"phrases\" are extracted from the normalized text. These are the sentences with the stop words, numbers, and short words removed. Using this text, the top ngrams are identified. The default is to search for the top 50 bigrams. Next, we return to the sentences that have *not* been normalized. These are the sentences that have had the punctuation removed and that have been split using `regexp_tokenize`. With this, we identify the top sentences, based on the number of times a key phrase appears in it. We extract the top five sentences. Finally, we use a chunker to extract noun phrases from the top five sentences. The algorithm returns the top ngrams, noun phrases, and the single \"top\" sentence and checks to make sure the output is less than 2,000 characters."
},
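{
"metadata": {},
"cell_type": "markdown",
"source": "The description above corresponds to the following condensed sketch, which strings together the helper functions defined earlier and assumes a `text` string has already been loaded (`keyphrase_extraction` wraps these same steps).\n\n```python\nsentences = split_on_sentence(text)\nwords = split_on_word(remove_punctuation(sentences))\nwords_norm = normalize(words)\nunique = sorted(set(w for sent in words_norm for w in sent))\nweights = tfidf_weights(tdm(unique, words_norm))\nkw = key_words(unique, weights)\nkp = key_phrases(kw, words_norm)\ntop_ngrams = kp_ngrams(kp, n=2, top=50)\nts = top_sents(top_ngrams, words, n_sents=5)\ntagged = [nltk.pos_tag(w) for w in split_on_word(remove_punctuation(ts))]\nnoun_phrases = chunker(tagged)\n```"
},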
{
"metadata": {
"trusted": true,
"collapsed": true
},
"cell_type": "code",
"source": "",
"execution_count": null,
"outputs": []
}
],
"metadata": {
"kernelspec": {
"name": "python3",
"display_name": "Python 3",
"language": "python"
},
"language_info": {
"nbconvert_exporter": "python",
"name": "python",
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"pygments_lexer": "ipython3",
"file_extension": ".py",
"mimetype": "text/x-python",
"version": "3.4.2"
}
},
"nbformat": 4,
"nbformat_minor": 0
}