@maxibor
Created November 19, 2018 10:16
Text summarization using Python NLTK
{
"cells": [
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# Inspired by [stackabuse.com/text-summarization-with-nltk-in-python](https://stackabuse.com/text-summarization-with-nltk-in-python/)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Reading the text from file"
]
},
{
"cell_type": "code",
"execution_count": 1,
"metadata": {},
"outputs": [],
"source": [
"raw_text = \"\"\n",
"\n",
"inputfile = \"text.txt\"\n",
"\n",
"# Read the input file and concatenate its lines into a single string,\n",
"# keeping a space between lines so words are not glued together\n",
"with open(inputfile, \"r\") as f:\n",
"    for line in f:\n",
"        line = line.rstrip()\n",
"        raw_text += line + \" \""
]
},
{
"cell_type": "code",
"execution_count": 3,
"metadata": {},
"outputs": [],
"source": [
"import re"
]
},
{
"cell_type": "code",
"execution_count": 4,
"metadata": {},
"outputs": [],
"source": [
"# Removing square brackets (reference markers like [1]) and extra spaces\n",
"raw_text = re.sub(r'\\[[0-9]*\\]', ' ', raw_text)\n",
"raw_text = re.sub(r'\\s+', ' ', raw_text)"
]
},
{
"cell_type": "code",
"execution_count": 5,
"metadata": {},
"outputs": [],
"source": [
"# Removing special characters and digits\n",
"formatted_article_text = re.sub('[^a-zA-Z]', ' ', raw_text) "
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Splitting by sentence"
]
},
{
"cell_type": "code",
"execution_count": 6,
"metadata": {},
"outputs": [],
"source": [
"import nltk\n",
"sentence_list = nltk.sent_tokenize(raw_text) "
]
},
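{
"cell_type": "markdown",
"metadata": {},
"source": [
"Note: the tokenizers and the stopword list used in this notebook rely on NLTK data packages. If `sent_tokenize` above or `stopwords.words` below raises a `LookupError`, downloading the `punkt` and `stopwords` resources once, as sketched in the next cell, should fix it (then re-run the affected cells)."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# One-time download of the NLTK data used in this notebook:\n",
"# 'punkt' for the sentence/word tokenizers, 'stopwords' for the stopword list\n",
"nltk.download('punkt')\n",
"nltk.download('stopwords')"
]
},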
{
"cell_type": "markdown",
"metadata": {},
"source": [
"#### Counting (non-stopword) words"
]
},
{
"cell_type": "code",
"execution_count": 8,
"metadata": {},
"outputs": [],
"source": [
"stopwords = nltk.corpus.stopwords.words('english')\n",
"\n",
"word_frequencies = {}\n",
"for word in nltk.word_tokenize(formatted_article_text):\n",
"    if word not in stopwords:\n",
"        if word not in word_frequencies:\n",
"            word_frequencies[word] = 1\n",
"        else:\n",
"            word_frequencies[word] += 1"
]
},
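{
"cell_type": "markdown",
"metadata": {},
"source": [
"As a side note, the same counts can be obtained more compactly with `collections.Counter`; the cell below is only an equivalent sketch of the loop above (before normalization), and the `word_frequencies_alt` name is used only for this comparison. The rest of the notebook keeps using `word_frequencies`."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"from collections import Counter\n",
"\n",
"# Equivalent to the counting loop above: count every tokenized word\n",
"# that is not an English stopword ('word_frequencies_alt' is only used here)\n",
"word_frequencies_alt = Counter(\n",
"    word\n",
"    for word in nltk.word_tokenize(formatted_article_text)\n",
"    if word not in stopwords\n",
")"
]
},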
{
"cell_type": "markdown",
"metadata": {},
"source": [
"#### From count to frequency"
]
},
{
"cell_type": "code",
"execution_count": 107,
"metadata": {},
"outputs": [],
"source": [
"# Normalize each count by the count of the most frequent word\n",
"maximum_frequency = max(word_frequencies.values())\n",
"\n",
"for word in word_frequencies.keys():\n",
"    word_frequencies[word] = word_frequencies[word] / maximum_frequency"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"#### Scoring sentences based on word frequencies"
]
},
{
"cell_type": "code",
"execution_count": 108,
"metadata": {},
"outputs": [],
"source": [
"# Score each sentence (shorter than 30 words) by summing the frequencies of its words\n",
"sentence_scores = {}\n",
"for sent in sentence_list:\n",
"    for word in nltk.word_tokenize(sent.lower()):\n",
"        if word in word_frequencies:\n",
"            if len(sent.split(' ')) < 30:\n",
"                if sent not in sentence_scores:\n",
"                    sentence_scores[sent] = word_frequencies[word]\n",
"                else:\n",
"                    sentence_scores[sent] += word_frequencies[word]"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"#### Printing the top 5 sentences based on score"
]
},
{
"cell_type": "code",
"execution_count": 109,
"metadata": {},
"outputs": [],
"source": [
"def get_top_sentences(sentence_dict, n):\n",
"    # Rank sentences by score (highest first) and return the top n\n",
"    ranked = sorted(sentence_dict.items(), key=lambda x: x[1], reverse=True)\n",
"    return [sent for sent, score in ranked[:n]]"
]
},
{
"cell_type": "code",
"execution_count": 111,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"The human oral microbiome comprises more than 2,000 bacterial taxa, including a large number of opportunistic pathogens involved in periodontal, respiratory, cardiovascular, and systemic disease3-7. We reconstruct the genome of a major periodontal pathogen, and we present the first direct evidence of dietary biomolecules to be recovered from ancient dental calculus. Finally, we further validate our findings by applying multiple microscopic, genetic, and proteomic analyses in parallel, providing a systematic biomolecular evaluation of ancient dental calculus preservation, taphonomy, and contamination. We confirm the long-term role of host immune activity and “red complex” pathogen virulence in periodontal pathogenesis, despite major changes in lifestyle, hygiene, and diet over the past millennium.\n"
]
}
],
"source": [
"print(\" \".join(get_top_sentences(sentence_scores,5)))"
]
}
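,
{
"cell_type": "markdown",
"metadata": {},
"source": [
"For reference, the same selection can also be done with `heapq.nlargest` from the standard library; the cell below is only an equivalent alternative to `get_top_sentences`, not part of the original workflow."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"import heapq\n",
"\n",
"# Same selection as get_top_sentences(sentence_scores, 5):\n",
"# take the 5 sentences with the highest scores\n",
"summary_sentences = heapq.nlargest(5, sentence_scores, key=sentence_scores.get)\n",
"print(\" \".join(summary_sentences))"
]
}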
],
"metadata": {
"kernelspec": {
"display_name": "Python 3",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.6.6"
}
},
"nbformat": 4,
"nbformat_minor": 2
}