Text summarization using Python NLTK
{
"cells": [
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# Inspired by [stackabuse.com/text-summarization-with-nltk-in-python](https://stackabuse.com/text-summarization-with-nltk-in-python/)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Reading the text from a file"
]
},
{
"cell_type": "code",
"execution_count": 1,
"metadata": {},
"outputs": [],
"source": [
"raw_text = \"\"\n",
"\n",
"inputfile = \"text.txt\"\n",
"\n",
"with open(inputfile, \"r\") as f:\n",
"    for line in f:\n",
"        # Join lines with a space so words at line breaks are not glued together\n",
"        raw_text += line.rstrip() + \" \""
]
},
{
"cell_type": "code",
"execution_count": 3,
"metadata": {},
"outputs": [],
"source": [
"import re"
]
},
{
"cell_type": "code",
"execution_count": 4,
"metadata": {},
"outputs": [],
"source": [
"# Removing square brackets (e.g. citation markers like [1]) and extra spaces\n",
"raw_text = re.sub(r'\\[[0-9]*\\]', ' ', raw_text)\n",
"raw_text = re.sub(r'\\s+', ' ', raw_text)"
]
},
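{
"cell_type": "markdown",
"metadata": {},
"source": [
"A quick sanity check of the two substitutions above, on a made-up snippet containing a citation marker and extra whitespace."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Illustration only: the sample string below is made up\n",
"sample = \"Dental calculus [12]  preserves   biomolecules.\"\n",
"sample = re.sub(r'\\[[0-9]*\\]', ' ', sample)\n",
"print(re.sub(r'\\s+', ' ', sample))\n",
"# -> 'Dental calculus preserves biomolecules.'"
]
},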
{
"cell_type": "code",
"execution_count": 5,
"metadata": {},
"outputs": [],
"source": [
"# Removing special characters and digits\n",
"formatted_article_text = re.sub('[^a-zA-Z]', ' ', raw_text)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Splitting by sentence"
]
},
{
"cell_type": "code",
"execution_count": 6,
"metadata": {},
"outputs": [],
"source": [
"import nltk\n",
"\n",
"# The punkt tokenizer and stopwords corpus must be available locally\n",
"nltk.download('punkt', quiet=True)\n",
"nltk.download('stopwords', quiet=True)\n",
"\n",
"sentence_list = nltk.sent_tokenize(raw_text)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"#### Counting (non-common) words"
]
},
{
"cell_type": "code",
"execution_count": 8,
"metadata": {},
"outputs": [],
"source": [
"stopwords = nltk.corpus.stopwords.words('english')\n",
"\n",
"word_frequencies = {}\n",
"# Lowercase the tokens so they match the lowercased sentences scored below\n",
"for word in nltk.word_tokenize(formatted_article_text.lower()):\n",
"    if word not in stopwords:\n",
"        if word not in word_frequencies:\n",
"            word_frequencies[word] = 1\n",
"        else:\n",
"            word_frequencies[word] += 1"
]
},
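{
"cell_type": "markdown",
"metadata": {},
"source": [
"As an aside (not part of the original tutorial), the same counts can be built more compactly with `collections.Counter` from the standard library."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Sketch of an equivalent count using collections.Counter\n",
"from collections import Counter\n",
"\n",
"tokens = [w for w in nltk.word_tokenize(formatted_article_text.lower()) if w not in stopwords]\n",
"counter_frequencies = Counter(tokens)\n",
"print(counter_frequencies.most_common(5))"
]
},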
{
"cell_type": "markdown",
"metadata": {},
"source": [
"#### From count to frequency"
]
},
{
"cell_type": "code",
"execution_count": 107,
"metadata": {},
"outputs": [],
"source": [
"# Normalize counts by the highest count so weights fall in (0, 1]\n",
"maximum_frequency = max(word_frequencies.values())\n",
"\n",
"for word in word_frequencies:\n",
"    word_frequencies[word] = word_frequencies[word] / maximum_frequency"
]
},
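{
"cell_type": "markdown",
"metadata": {},
"source": [
"To illustrate the normalization above with made-up counts: the most frequent word gets a weight of 1.0 and every other word is scaled relative to it."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Toy example of max-normalization (counts below are made up)\n",
"toy_counts = {'calculus': 4, 'microbiome': 2, 'pathogen': 1}\n",
"toy_max = max(toy_counts.values())\n",
"print({w: c / toy_max for w, c in toy_counts.items()})\n",
"# -> {'calculus': 1.0, 'microbiome': 0.5, 'pathogen': 0.25}"
]
},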
{
"cell_type": "markdown",
"metadata": {},
"source": [
"#### Scoring sentences based on word frequencies"
]
},
{
"cell_type": "code",
"execution_count": 108,
"metadata": {},
"outputs": [],
"source": [
"sentence_scores = {}\n",
"for sent in sentence_list:\n",
"    # Only score reasonably short sentences (< 30 words)\n",
"    if len(sent.split(' ')) < 30:\n",
"        for word in nltk.word_tokenize(sent.lower()):\n",
"            if word in word_frequencies:\n",
"                sentence_scores[sent] = sentence_scores.get(sent, 0) + word_frequencies[word]"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"#### Printing top 5 sentences based on score"
]
},
{
"cell_type": "code",
"execution_count": 109,
"metadata": {},
"outputs": [],
"source": [
"def get_top_sentences(sentence_dict, n):\n",
"    # Rank sentences by score (highest first) and keep the n best\n",
"    ranked = sorted(sentence_dict, key=sentence_dict.get, reverse=True)\n",
"    return ranked[:n]"
]
},
{
"cell_type": "code",
"execution_count": 111,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"The human oral microbiome comprises more than 2,000 bacterial taxa, including a large number of opportunistic pathogens involved in periodontal, respiratory, cardiovascular, and systemic disease3-7. We reconstruct the genome of a major periodontal pathogen, and we present the first direct evidence of dietary biomolecules to be recovered from ancient dental calculus. Finally, we further validate our findings by applying multiple microscopic, genetic, and proteomic analyses in parallel, providing a systematic biomolecular evaluation of ancient dental calculus preservation, taphonomy, and contamination. We confirm the long-term role of host immune activity and “red complex” pathogen virulence in periodontal pathogenesis, despite major changes in lifestyle, hygiene, and diet over the past millennium.\n"
]
}
],
"source": [
"print(\" \".join(get_top_sentences(sentence_scores, 5)))"
]
}
],
"metadata": {
"kernelspec": {
"display_name": "Python 3",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.6.6"
}
},
"nbformat": 4,
"nbformat_minor": 2
}