subtitleTextRank

A script developed during Google Code-in 2017 for CCExtractor Development that summarizes a subtitle file.

Requirements:

  • Python

Installation:

pip install pysrt
pip install -U gensim
pip install topia.termextract

Getting Started:

  • Run the Script

Help:

usage: tldr.py [-h] [-i FILE] [-r RATIO] [-k KEYWORDS] [-kr KEYWORDS_RATIO]

Python Program that returns a summarization from a subtitle file

optional arguments:
  -h, --help            show this help message and exit
  -i FILE, --input-file FILE
                        takes the input file
  -r RATIO, --ratio RATIO
                        define the ratio of words
  -k KEYWORDS, --keywords KEYWORDS
                        if keywords should be on
  -kr KEYWORDS_RATIO, --keywords-ratio KEYWORDS_RATIO
                        how many keywords to display
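
For example, to summarize a subtitle file with keywords turned on (the file name below is just a placeholder):

python tldr.py -i subtitles.srt -r 120 -k yes -kr 5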

# Summarization
The attention span of humans is low. In our busy lives we don't read news articles or even bother to watch the news; we just read the title and move on. But sometimes the title can be misleading.
Text summarization algorithms can be classified as extraction-based summarization and abstraction-based summarization.
In extraction-based summarization the summarizer extracts objects from the entire collection and selects whole sentences without modifying them. An example of this is TextRank, which I'll explain later.
Abstraction-based summarization is when the algorithm retells the selected sentences to form a summary, the way humans summarize text when asked to. As of now most tools are extraction-based, and Google has made an attempt at releasing an abstractive tool built with TensorFlow.
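
As a quick illustration of extraction-based summarization, here is a minimal sketch using gensim's summarize, the same function the script below relies on. The sample text is made up, and it assumes a pre-4.0 gensim release (the summarization module was later removed from gensim):

# Minimal extraction-based summarization sketch (gensim < 4.0; the sample text is made up).
from gensim.summarization import summarize

# gensim's summarizer works best on longer inputs; it warns below roughly ten sentences.
article = (
    "Cats are one of the most popular pets in the world. "
    "They were domesticated thousands of years ago. "
    "Cats are valued for their companionship and for hunting rodents. "
    "Dogs are also extremely popular pets. "
    "Dogs were domesticated even earlier than cats. "
    "Many households keep both cats and dogs. "
    "Pets need regular feeding, exercise and veterinary care. "
    "Owning a pet has been linked to lower stress levels. "
    "Shelters around the world care for abandoned cats and dogs. "
    "Adopting from a shelter is a common way to get a pet."
)

# The summary is made of whole sentences copied from the input, not rewritten text.
print(summarize(article, word_count=30))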
I've been researching TextRank, TextTeaser and a method by Google that uses TensorFlow.
TextTeaser is a natural language processing and machine learning algorithm released around October 2013 by Jolo Balbin. TextTeaser uses basic text summarization techniques; its features include the title, sentence length, keyword frequency and sentence position.
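
To make those features concrete, here is a toy, made-up scoring function in the spirit of TextTeaser. This is not TextTeaser's actual code: the "ideal" sentence length of 20 words and the equal feature weights are assumptions for illustration only.

import re

def toy_teaser_score(sentence, position, total_sentences, title, keywords):
    # Rough, made-up versions of TextTeaser-style features; each value is in [0, 1].
    words = re.findall(r"\w+", sentence.lower())
    title_words = set(re.findall(r"\w+", title.lower()))

    # Title feature: share of title words that also appear in the sentence.
    title_feature = len([w for w in words if w in title_words]) / float(len(title_words) or 1)

    # Sentence length feature: closer to an assumed ideal of 20 words is better.
    ideal = 20.0
    length_feature = max(0.0, 1.0 - abs(len(words) - ideal) / ideal)

    # Sentence position feature: earlier sentences get a higher weight.
    position_feature = 1.0 - position / float(total_sentences)

    # Keyword frequency feature: share of the sentence's words that are document keywords.
    keyword_feature = len([w for w in words if w in keywords]) / float(len(words) or 1)

    # Equal weights are an arbitrary choice here, purely for illustration.
    return (title_feature + length_feature + position_feature + keyword_feature) / 4.0

print(toy_teaser_score("Cats are popular pets around the world.",
                       position=0, total_sentences=5,
                       title="Popular pets", keywords={"cats", "pets"}))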
TextRank is an algorithm; the variant used here was made by people from the Engineering Faculty of the University of Buenos Aires. It works by scoring each sentence, then taking the top n sentences and returning them in the order they appear in the text.
First, each sentence is put into a graph as a node, with edges weighted by how similar the sentences are. Once that is done we walk the graph, visiting nodes at random, giving each a rank, checking its neighbours' weights and computing its probability. Each visited sentence gets a count, and the sentences are then sorted by how many times the walk visited their node.
What TextRank does is very simple: it finds how similar each sentence is to all other sentences in the text. The most important sentence is the one that is most similar to all the others. With this in mind, the similarity function should be oriented to the semantics of the sentence; cosine similarity based on a bag-of-words approach can work well, and BM25/BM25+ work really nicely for TextRank.
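
Here is a toy sketch of that central idea (not gensim's actual implementation): represent each sentence as a bag of words, compute pairwise cosine similarity, and score each sentence by its total similarity to the rest.

import math
import re
from collections import Counter

def bag_of_words(sentence):
    # Lower-cased word counts for one sentence.
    return Counter(re.findall(r"\w+", sentence.lower()))

def cosine(a, b):
    # Cosine similarity between two bag-of-words Counters.
    dot = sum(a[w] * b[w] for w in set(a) & set(b))
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0

def rank_sentences(sentences):
    bags = [bag_of_words(s) for s in sentences]
    # A sentence's score is its total similarity to every other sentence.
    scores = [sum(cosine(bags[i], bags[j]) for j in range(len(bags)) if j != i)
              for i in range(len(bags))]
    # Most "central" sentences first.
    return [s for _, s in sorted(zip(scores, sentences), reverse=True)]

sentences = [
    "Cats are popular pets.",
    "Dogs are popular pets too.",
    "Many people keep cats or dogs at home.",
    "The weather was sunny yesterday.",
]
print(rank_sentences(sentences)[0])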
Google's way is to use machine learning. They extract the interesting parts and create a summary from them, aiming for abstractive summarization, the way humans do it. The model they trained used a dataset of 10 million files.
Each approach has its own pros and cons. TextTeaser's drawback is that to produce a better summarization we need to pass a title: TextTeaser compares the sentences to the title and picks the sentences connected to it for the summary.
If we use machine learning to summarize text, it takes more time to train the classifier and more memory than doing it statistically*** with TextRank or LexRank.
So, in order to save time, storage and memory, I picked TextRank. It is fast, reliable and widely used. TextRank works by ranking sentences against each other, and the only thing we need to tell it is how to split sentences, for which training data is available.
Source: [Luis Argerich's answer to What is a simple but detailed explanation of Textrank? - Quora](https://www.quora.com/What-is-a-simple-but-detailed-explanation-of-Textrank/answer/Luis-Argerich?srid=STIA)
For keyword extraction I used topia.termextract, because TextRank's keyword extractor picked swear words instead of actual keywords. It isn't the best algorithm, but it does the job: it works by marking words that correspond to a particular part of speech.
*** By "statistically" I basically mean TextRank and LexRank.
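
For reference, here are the keyword-extraction calls from the script in isolation. The sample sentence is made up, and the exact terms returned depend on topia.termextract's default filtering:

from topia.termextract import extract, tag

# Build a part-of-speech tagger and a term extractor on top of it,
# mirroring what the script below does.
tagger = tag.Tagger()
tagger.initialize()
extractor = extract.TermExtractor(tagger)

# Each result is a (term, occurrences, word_count) tuple.
terms = extractor("The summarizer reads the subtitle file and prints a short "
                  "summary of the subtitle text next to the subtitle keywords.")
for term in terms:
    print(term)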
# -*- coding: utf-8 -*-
from gensim.summarization import summarize
from topia.termextract import extract
from topia.termextract import tag
import pysrt
import argparse
import re
import sys


def tldr(text, rat, keys, ratk):
    # Strip HTML-style formatting tags (e.g. <i>...</i>) left in the subtitle text.
    text = re.sub(r"<[^>]*>", ' ', text)
    summarization = summarize(text, word_count=rat)
    if keys is None:
        print(summarization.encode('utf-8'))
    else:
        print(summarization.encode('utf-8'))
        print("======================================")
        # Part-of-speech tag the text and extract keyword terms from it.
        tagger = tag.Tagger()
        tagger.initialize()
        tagger.tokenize(text)
        extractor = extract.TermExtractor(tagger)
        words = extractor(text)
        count = 0
        while count != ratk:
            try:
                print(words[count][0].encode('utf-8'))
                count += 1
            except IndexError:
                # Fewer extracted terms than requested; stop instead of looping forever.
                break
    sys.exit(0)


def handleFiles(file, ratio, keywords, krat):
    # Read the subtitle file and join all cues into a single string.
    subtitles = pysrt.open(file, encoding='utf-8')
    tosend = u" "
    for sub in subtitles:
        # Separate cues with a space so sentences don't run together.
        tosend = tosend + sub.text + u" "
    tldr(tosend, ratio, keywords, krat)


def main():
    parser = argparse.ArgumentParser(description='Python Program that returns a summarization from a subtitle file')
    parser.add_argument('-i', '--input-file', action="store", help="takes the input file", metavar="FILE")
    parser.add_argument('-r', '--ratio', action="store", help="define the ratio of words", default=120)
    parser.add_argument('-k', '--keywords', action="store", help="if keywords should be on")
    parser.add_argument('-kr', '--keywords-ratio', action="store", help="how many keywords to display", default=5)
    args = parser.parse_args()
    handleFiles(args.input_file, int(args.ratio), args.keywords, int(args.keywords_ratio))


if __name__ == '__main__':
    main()