@shv07
Last active June 14, 2019 11:23
Extractive text summarization using word frequencies: a basic approach to text summarization.
{
"cells": [
{
"cell_type": "code",
"execution_count": 110,
"metadata": {},
"outputs": [],
"source": [
"# Unsupervised extractive text summarization using word frequencies.\n",
"# Python 3\n",
"# Modules used: nltk, re\n",
"# sent_tokenize and word_tokenize require NLTK's Punkt tokenizer data to be downloaded beforehand.\n",
"# Reference: https://stackabuse.com/text-summarization-with-nltk-in-python/#disqus_thread\n",
"import nltk\n",
"from nltk.tokenize import sent_tokenize, word_tokenize\n",
"import re"
]
},
{
"cell_type": "code",
"execution_count": 112,
"metadata": {},
"outputs": [],
"source": [
"# Text to summarize (from the Zero Order Reverse Filtering research paper)\n",
"text=\"We make several important observations from the results in Table 1. First, DT PSNRs are generally larger than GT ones, which complies with the fact that our method is basically a feedback system based on DT errors. Second, a larger DT PSNR does not necessarily correspond to a larger GT PSNR. For lossy filters, such as median filter (MF) and local extrema filter (LE), the same output can be obtained from different inputs. Thus the defiltered image may not be the same as the original one, which makes perfect sense. To analyze the convergence of our method on different filters, we plot the PSNR-vs-iteration curves and the curves of standard deviation (SD) of mean square error (MSE) vs. iteration in Fig. 6. For filters that are well reversible, including Gaussian filter (GS), bilateral filter (BF), guided filter (GF), adaptive manifold filter (AMF), rolling guidance filter (RGF), BM3D and relative total variation (RTV), the PSNRs consistently increase. For filters that are partially reversible, such as bilateral grid (BFG), permutohedral lattice (BFPL), domain transform (RF), tree filter (TF), L0 smooth (L0) and weighted least square (WLS), PSNRs increase in early iterations, and then decrease or oscillate in later ones. This complies with our previous theoretical analysis that reversible components dominate images. Thus a good number of, by default 10, iterations can yield satisfying results for most filters. Finally, for filters that are discontinuous in many places, such as median filter (MF), weighted median filter (WMF), and local extrema filter (LE), our method does not work very well with slightly increased PSNRs in the first a few iterations.\"\n",
"# Text cleaning and pre-processing\n",
"text_proc = re.sub(r'\\[[0-9]*\\]', ' ', text)  # remove bracketed numeric citation markers such as [12]\n",
"text_proc = re.sub(r'\\s+', ' ', text_proc)  # collapse runs of whitespace into a single space"
]
},
{
"cell_type": "code",
"execution_count": 113,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"'We make several important observations from the results in Table 1. First, DT PSNRs are generally larger than GT ones, which complies with the fact that our method is basically a feedback system based on DT errors. Second, a larger DT PSNR does not necessarily correspond to a larger GT PSNR. For lossy filters, such as median filter (MF) and local extrema filter (LE), the same output can be obtained from different inputs. Thus the defiltered image may not be the same as the original one, which makes perfect sense. To analyze the convergence of our method on different filters, we plot the PSNR-vs-iteration curves and the curves of standard deviation (SD) of mean square error (MSE) vs. iteration in Fig. 6. For filters that are well reversible, including Gaussian filter (GS), bilateral filter (BF), guided filter (GF), adaptive manifold filter (AMF), rolling guidance filter (RGF), BM3D and relative total variation (RTV), the PSNRs consistently increase. For filters that are partially reversible, such as bilateral grid (BFG), permutohedral lattice (BFPL), domain transform (RF), tree filter (TF), L0 smooth (L0) and weighted least square (WLS), PSNRs increase in early iterations, and then decrease or oscillate in later ones. This complies with our previous theoretical analysis that reversible components dominate images. Thus a good number of, by default 10, iterations can yield satisfying results for most filters. Finally, for filters that are discontinuous in many places, such as median filter (MF), weighted median filter (WMF), and local extrema filter (LE), our method does not work very well with slightly increased PSNRs in the first a few iterations.'"
]
},
"execution_count": 113,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"text_proc"
]
},
{
"cell_type": "code",
"execution_count": 114,
"metadata": {},
"outputs": [],
"source": [
"token_sent = sent_tokenize(text_proc)\n"
]
},
{
"cell_type": "code",
"execution_count": 115,
"metadata": {},
"outputs": [],
"source": [
"# Keep letters only: replace digits and punctuation with spaces, then collapse whitespace.\n",
"# Note: this step does not remove stop words.\n",
"formatted_article_text = re.sub('[^a-zA-Z]', ' ', text_proc)\n",
"formatted_article_text = re.sub(r'\\s+', ' ', formatted_article_text)"
]
},
{
"cell_type": "code",
"execution_count": 116,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"'We make several important observations from the results in Table First DT PSNRs are generally larger than GT ones which complies with the fact that our method is basically a feedback system based on DT errors Second a larger DT PSNR does not necessarily correspond to a larger GT PSNR For lossy filters such as median filter MF and local extrema filter LE the same output can be obtained from different inputs Thus the defiltered image may not be the same as the original one which makes perfect sense To analyze the convergence of our method on different filters we plot the PSNR vs iteration curves and the curves of standard deviation SD of mean square error MSE vs iteration in Fig For filters that are well reversible including Gaussian filter GS bilateral filter BF guided filter GF adaptive manifold filter AMF rolling guidance filter RGF BM D and relative total variation RTV the PSNRs consistently increase For filters that are partially reversible such as bilateral grid BFG permutohedral lattice BFPL domain transform RF tree filter TF L smooth L and weighted least square WLS PSNRs increase in early iterations and then decrease or oscillate in later ones This complies with our previous theoretical analysis that reversible components dominate images Thus a good number of by default iterations can yield satisfying results for most filters Finally for filters that are discontinuous in many places such as median filter MF weighted median filter WMF and local extrema filter LE our method does not work very well with slightly increased PSNRs in the first a few iterations '"
]
},
"execution_count": 116,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"formatted_article_text"
]
},
{
"cell_type": "code",
"execution_count": 117,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"['We',\n",
" 'make',\n",
" 'several',\n",
" 'important',\n",
" 'observations',\n",
" 'from',\n",
" 'the',\n",
" 'results',\n",
" 'in',\n",
" 'Table',\n",
" 'First',\n",
" 'DT',\n",
" 'PSNRs',\n",
" 'are',\n",
" 'generally',\n",
" 'larger',\n",
" 'than',\n",
" 'GT',\n",
" 'ones',\n",
" 'which',\n",
" 'complies',\n",
" 'with',\n",
" 'the',\n",
" 'fact',\n",
" 'that',\n",
" 'our',\n",
" 'method',\n",
" 'is',\n",
" 'basically',\n",
" 'a',\n",
" 'feedback',\n",
" 'system',\n",
" 'based',\n",
" 'on',\n",
" 'DT',\n",
" 'errors',\n",
" 'Second',\n",
" 'a',\n",
" 'larger',\n",
" 'DT',\n",
" 'PSNR',\n",
" 'does',\n",
" 'not',\n",
" 'necessarily',\n",
" 'correspond',\n",
" 'to',\n",
" 'a',\n",
" 'larger',\n",
" 'GT',\n",
" 'PSNR',\n",
" 'For',\n",
" 'lossy',\n",
" 'filters',\n",
" 'such',\n",
" 'as',\n",
" 'median',\n",
" 'filter',\n",
" 'MF',\n",
" 'and',\n",
" 'local',\n",
" 'extrema',\n",
" 'filter',\n",
" 'LE',\n",
" 'the',\n",
" 'same',\n",
" 'output',\n",
" 'can',\n",
" 'be',\n",
" 'obtained',\n",
" 'from',\n",
" 'different',\n",
" 'inputs',\n",
" 'Thus',\n",
" 'the',\n",
" 'defiltered',\n",
" 'image',\n",
" 'may',\n",
" 'not',\n",
" 'be',\n",
" 'the',\n",
" 'same',\n",
" 'as',\n",
" 'the',\n",
" 'original',\n",
" 'one',\n",
" 'which',\n",
" 'makes',\n",
" 'perfect',\n",
" 'sense',\n",
" 'To',\n",
" 'analyze',\n",
" 'the',\n",
" 'convergence',\n",
" 'of',\n",
" 'our',\n",
" 'method',\n",
" 'on',\n",
" 'different',\n",
" 'filters',\n",
" 'we',\n",
" 'plot',\n",
" 'the',\n",
" 'PSNR',\n",
" 'vs',\n",
" 'iteration',\n",
" 'curves',\n",
" 'and',\n",
" 'the',\n",
" 'curves',\n",
" 'of',\n",
" 'standard',\n",
" 'deviation',\n",
" 'SD',\n",
" 'of',\n",
" 'mean',\n",
" 'square',\n",
" 'error',\n",
" 'MSE',\n",
" 'vs',\n",
" 'iteration',\n",
" 'in',\n",
" 'Fig',\n",
" 'For',\n",
" 'filters',\n",
" 'that',\n",
" 'are',\n",
" 'well',\n",
" 'reversible',\n",
" 'including',\n",
" 'Gaussian',\n",
" 'filter',\n",
" 'GS',\n",
" 'bilateral',\n",
" 'filter',\n",
" 'BF',\n",
" 'guided',\n",
" 'filter',\n",
" 'GF',\n",
" 'adaptive',\n",
" 'manifold',\n",
" 'filter',\n",
" 'AMF',\n",
" 'rolling',\n",
" 'guidance',\n",
" 'filter',\n",
" 'RGF',\n",
" 'BM',\n",
" 'D',\n",
" 'and',\n",
" 'relative',\n",
" 'total',\n",
" 'variation',\n",
" 'RTV',\n",
" 'the',\n",
" 'PSNRs',\n",
" 'consistently',\n",
" 'increase',\n",
" 'For',\n",
" 'filters',\n",
" 'that',\n",
" 'are',\n",
" 'partially',\n",
" 'reversible',\n",
" 'such',\n",
" 'as',\n",
" 'bilateral',\n",
" 'grid',\n",
" 'BFG',\n",
" 'permutohedral',\n",
" 'lattice',\n",
" 'BFPL',\n",
" 'domain',\n",
" 'transform',\n",
" 'RF',\n",
" 'tree',\n",
" 'filter',\n",
" 'TF',\n",
" 'L',\n",
" 'smooth',\n",
" 'L',\n",
" 'and',\n",
" 'weighted',\n",
" 'least',\n",
" 'square',\n",
" 'WLS',\n",
" 'PSNRs',\n",
" 'increase',\n",
" 'in',\n",
" 'early',\n",
" 'iterations',\n",
" 'and',\n",
" 'then',\n",
" 'decrease',\n",
" 'or',\n",
" 'oscillate',\n",
" 'in',\n",
" 'later',\n",
" 'ones',\n",
" 'This',\n",
" 'complies',\n",
" 'with',\n",
" 'our',\n",
" 'previous',\n",
" 'theoretical',\n",
" 'analysis',\n",
" 'that',\n",
" 'reversible',\n",
" 'components',\n",
" 'dominate',\n",
" 'images',\n",
" 'Thus',\n",
" 'a',\n",
" 'good',\n",
" 'number',\n",
" 'of',\n",
" 'by',\n",
" 'default',\n",
" 'iterations',\n",
" 'can',\n",
" 'yield',\n",
" 'satisfying',\n",
" 'results',\n",
" 'for',\n",
" 'most',\n",
" 'filters',\n",
" 'Finally',\n",
" 'for',\n",
" 'filters',\n",
" 'that',\n",
" 'are',\n",
" 'discontinuous',\n",
" 'in',\n",
" 'many',\n",
" 'places',\n",
" 'such',\n",
" 'as',\n",
" 'median',\n",
" 'filter',\n",
" 'MF',\n",
" 'weighted',\n",
" 'median',\n",
" 'filter',\n",
" 'WMF',\n",
" 'and',\n",
" 'local',\n",
" 'extrema',\n",
" 'filter',\n",
" 'LE',\n",
" 'our',\n",
" 'method',\n",
" 'does',\n",
" 'not',\n",
" 'work',\n",
" 'very',\n",
" 'well',\n",
" 'with',\n",
" 'slightly',\n",
" 'increased',\n",
" 'PSNRs',\n",
" 'in',\n",
" 'the',\n",
" 'first',\n",
" 'a',\n",
" 'few',\n",
" 'iterations']"
]
},
"execution_count": 117,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"token_word=word_tokenize(formatted_article_text)\n",
"token_word"
]
},
{
"cell_type": "code",
"execution_count": 118,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"{'AMF': 1,\n",
" 'BF': 1,\n",
" 'BFG': 1,\n",
" 'BFPL': 1,\n",
" 'BM': 1,\n",
" 'D': 1,\n",
" 'DT': 3,\n",
" 'Fig': 1,\n",
" 'Finally': 1,\n",
" 'First': 1,\n",
" 'For': 3,\n",
" 'GF': 1,\n",
" 'GS': 1,\n",
" 'GT': 2,\n",
" 'Gaussian': 1,\n",
" 'L': 2,\n",
" 'LE': 2,\n",
" 'MF': 2,\n",
" 'MSE': 1,\n",
" 'PSNR': 3,\n",
" 'PSNRs': 4,\n",
" 'RF': 1,\n",
" 'RGF': 1,\n",
" 'RTV': 1,\n",
" 'SD': 1,\n",
" 'Second': 1,\n",
" 'TF': 1,\n",
" 'Table': 1,\n",
" 'This': 1,\n",
" 'Thus': 2,\n",
" 'To': 1,\n",
" 'WLS': 1,\n",
" 'WMF': 1,\n",
" 'We': 1,\n",
" 'a': 5,\n",
" 'adaptive': 1,\n",
" 'analysis': 1,\n",
" 'analyze': 1,\n",
" 'and': 6,\n",
" 'are': 4,\n",
" 'as': 4,\n",
" 'based': 1,\n",
" 'basically': 1,\n",
" 'be': 2,\n",
" 'bilateral': 2,\n",
" 'by': 1,\n",
" 'can': 2,\n",
" 'complies': 2,\n",
" 'components': 1,\n",
" 'consistently': 1,\n",
" 'convergence': 1,\n",
" 'correspond': 1,\n",
" 'curves': 2,\n",
" 'decrease': 1,\n",
" 'default': 1,\n",
" 'defiltered': 1,\n",
" 'deviation': 1,\n",
" 'different': 2,\n",
" 'discontinuous': 1,\n",
" 'does': 2,\n",
" 'domain': 1,\n",
" 'dominate': 1,\n",
" 'early': 1,\n",
" 'error': 1,\n",
" 'errors': 1,\n",
" 'extrema': 2,\n",
" 'fact': 1,\n",
" 'feedback': 1,\n",
" 'few': 1,\n",
" 'filter': 11,\n",
" 'filters': 6,\n",
" 'first': 1,\n",
" 'for': 2,\n",
" 'from': 2,\n",
" 'generally': 1,\n",
" 'good': 1,\n",
" 'grid': 1,\n",
" 'guidance': 1,\n",
" 'guided': 1,\n",
" 'image': 1,\n",
" 'images': 1,\n",
" 'important': 1,\n",
" 'in': 6,\n",
" 'including': 1,\n",
" 'increase': 2,\n",
" 'increased': 1,\n",
" 'inputs': 1,\n",
" 'is': 1,\n",
" 'iteration': 2,\n",
" 'iterations': 3,\n",
" 'larger': 3,\n",
" 'later': 1,\n",
" 'lattice': 1,\n",
" 'least': 1,\n",
" 'local': 2,\n",
" 'lossy': 1,\n",
" 'make': 1,\n",
" 'makes': 1,\n",
" 'manifold': 1,\n",
" 'many': 1,\n",
" 'may': 1,\n",
" 'mean': 1,\n",
" 'median': 3,\n",
" 'method': 3,\n",
" 'most': 1,\n",
" 'necessarily': 1,\n",
" 'not': 3,\n",
" 'number': 1,\n",
" 'observations': 1,\n",
" 'obtained': 1,\n",
" 'of': 4,\n",
" 'on': 2,\n",
" 'one': 1,\n",
" 'ones': 2,\n",
" 'or': 1,\n",
" 'original': 1,\n",
" 'oscillate': 1,\n",
" 'our': 4,\n",
" 'output': 1,\n",
" 'partially': 1,\n",
" 'perfect': 1,\n",
" 'permutohedral': 1,\n",
" 'places': 1,\n",
" 'plot': 1,\n",
" 'previous': 1,\n",
" 'relative': 1,\n",
" 'results': 2,\n",
" 'reversible': 3,\n",
" 'rolling': 1,\n",
" 'same': 2,\n",
" 'satisfying': 1,\n",
" 'sense': 1,\n",
" 'several': 1,\n",
" 'slightly': 1,\n",
" 'smooth': 1,\n",
" 'square': 2,\n",
" 'standard': 1,\n",
" 'such': 3,\n",
" 'system': 1,\n",
" 'than': 1,\n",
" 'that': 5,\n",
" 'the': 11,\n",
" 'then': 1,\n",
" 'theoretical': 1,\n",
" 'to': 1,\n",
" 'total': 1,\n",
" 'transform': 1,\n",
" 'tree': 1,\n",
" 'variation': 1,\n",
" 'very': 1,\n",
" 'vs': 2,\n",
" 'we': 1,\n",
" 'weighted': 2,\n",
" 'well': 2,\n",
" 'which': 2,\n",
" 'with': 3,\n",
" 'work': 1,\n",
" 'yield': 1}"
]
},
"execution_count": 118,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"# Count how often each word occurs in the cleaned text (case-sensitive)\n",
"frequency = {}\n",
"for word in token_word:\n",
"    frequency[word] = frequency.get(word, 0) + 1\n",
"frequency"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Score each sentence by summing the normalized frequencies of its words\n",
"max_f = max(frequency.values())\n",
"sent_score = []\n",
"for sentence in token_sent:\n",
"    score = 0\n",
"    # Clean the sentence like the full text so its tokens match the keys of frequency;\n",
"    # iterating over the raw string would visit single characters, not words.\n",
"    for word in word_tokenize(re.sub('[^a-zA-Z]', ' ', sentence)):\n",
"        if word in frequency:\n",
"            score += frequency[word] / max_f\n",
"    sent_score.append(score)\n",
"sent_score"
]
},
{
"cell_type": "code",
"execution_count": 122,
"metadata": {},
"outputs": [],
"source": [
"# Take the top 6 sentences by score for the summary\n",
"# Rank sentence indices instead of sorting sent_score in place, so scores stay aligned with token_sent\n",
"ranked = sorted(range(len(token_sent)), key=lambda i: sent_score[i], reverse=True)\n",
"summary = \" \".join(token_sent[i] for i in ranked[:6])"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"summary"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": []
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": []
}
],
"metadata": {
"kernelspec": {
"display_name": "Python 3",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.5.2"
}
},
"nbformat": 4,
"nbformat_minor": 2
}
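The pipeline in the notebook above (count word frequencies, score each sentence by the normalized frequencies of its words, keep the top-scoring sentences) can be sketched without NLTK as follows. This is a minimal illustration under stated assumptions, not the author's code: split_sentences, words_of, and summarize are hypothetical helper names, the regex sentence splitter is a naive stand-in for sent_tokenize, and lowercasing the tokens is an added choice (the notebook counts 'For' and 'for' separately).

```python
import re
from collections import Counter


def split_sentences(text):
    """Naive splitter (assumption): break after '.', '!' or '?' followed by whitespace."""
    return [s.strip() for s in re.split(r'(?<=[.!?])\s+', text) if s.strip()]


def words_of(text):
    """Lowercased alphabetic tokens; digits and punctuation act as separators."""
    return re.findall(r'[a-z]+', text.lower())


def summarize(text, top_n=2):
    """Return the top_n highest-scoring sentences, joined in document order."""
    sentences = split_sentences(text)
    frequency = Counter(words_of(text))
    if not frequency:
        return ""
    max_f = max(frequency.values())
    # A sentence's score is the sum of the normalized frequencies of its words.
    scores = [sum(frequency[w] / max_f for w in words_of(s)) for s in sentences]
    # Rank indices rather than sorting the scores, so they stay aligned with sentences.
    ranked = sorted(range(len(sentences)), key=lambda i: scores[i], reverse=True)
    keep = sorted(ranked[:top_n])
    return " ".join(sentences[i] for i in keep)


sample = ("Filters blur images. Some filters are reversible and some filters are not. "
          "We like cats.")
print(summarize(sample, top_n=1))
# prints: Some filters are reversible and some filters are not.
```

Because the score is a plain sum, longer sentences tend to score higher; dividing each score by the sentence's word count is one common adjustment if that bias is unwanted.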