@shlomibabluki
Created April 27, 2013 15:36
# coding=utf-8
import re


# This is a naive text summarization algorithm
# Created by Shlomi Babluki
# April, 2013


class SummaryTool(object):

    # Naive method for splitting a text into sentences
    def split_content_to_sentences(self, content):
        content = content.replace("\n", ". ")
        return content.split(". ")

    # Naive method for splitting a text into paragraphs
    def split_content_to_paragraphs(self, content):
        return content.split("\n\n")

    # Calculate the intersection between 2 sentences
    def sentences_intersection(self, sent1, sent2):

        # Split the sentences into words/tokens
        s1 = set(sent1.split(" "))
        s2 = set(sent2.split(" "))

        # If there is no intersection, just return 0
        intersection = s1.intersection(s2)
        if not intersection:
            return 0

        # We normalize the result by the average number of words
        return len(intersection) / ((len(s1) + len(s2)) / 2)

    # Format a sentence - remove all non-alphabetic chars from the sentence
    # We'll use the formatted sentence as a key in our sentences dictionary
    def format_sentence(self, sentence):
        sentence = re.sub(r'\W+', '', sentence)
        return sentence

    # Convert the content into a dictionary <K, V>
    # K = The formatted sentence
    # V = The rank of the sentence
    def get_sentences_ranks(self, content):

        # Split the content into sentences
        sentences = self.split_content_to_sentences(content)

        # Calculate the intersection of every two sentences
        n = len(sentences)
        values = [[0 for _ in range(n)] for _ in range(n)]
        for i in range(n):
            for j in range(n):
                values[i][j] = self.sentences_intersection(sentences[i], sentences[j])

        # Build the sentences dictionary
        # The score of a sentence is the sum of all its intersections
        sentences_dic = {}
        for i in range(n):
            score = 0
            for j in range(n):
                if i == j:
                    continue
                score += values[i][j]
            sentences_dic[self.format_sentence(sentences[i])] = score
        return sentences_dic

    # Return the best sentence in a paragraph
    def get_best_sentence(self, paragraph, sentences_dic):

        # Split the paragraph into sentences
        sentences = self.split_content_to_sentences(paragraph)

        # Ignore short paragraphs
        if len(sentences) < 2:
            return ""

        # Get the best sentence according to the sentences dictionary
        best_sentence = ""
        max_value = 0
        for s in sentences:
            strip_s = self.format_sentence(s)
            if strip_s and sentences_dic.get(strip_s, 0) > max_value:
                max_value = sentences_dic[strip_s]
                best_sentence = s
        return best_sentence

    # Build the summary
    def get_summary(self, title, content, sentences_dic):

        # Split the content into paragraphs
        paragraphs = self.split_content_to_paragraphs(content)

        # Add the title
        summary = []
        summary.append(title.strip())
        summary.append("")

        # Add the best sentence from each paragraph
        for p in paragraphs:
            sentence = self.get_best_sentence(p, sentences_dic).strip()
            if sentence:
                summary.append(sentence)

        return "\n".join(summary)


# Main method, just run "python summary_tool.py"
def main():

    # Demo
    # Content from: "http://thenextweb.com/apps/2013/03/21/swayy-discover-curate-content/"

    title = """
Swayy is a beautiful new dashboard for discovering and curating online content [Invites]
"""

    content = """
Lior Degani, the Co-Founder and head of Marketing of Swayy, pinged me last week when I was in California to tell me about his startup and give me beta access. I heard his pitch and was skeptical. I was also tired, cranky and missing my kids – so my frame of mind wasn’t the most positive.
I went into Swayy to check it out, and when it asked for access to my Twitter and permission to tweet from my account, all I could think was, “If this thing spams my Twitter account I am going to bitch-slap him all over the Internet.” Fortunately that thought stayed in my head, and not out of my mouth.
One week later, I’m totally addicted to Swayy and glad I said nothing about the spam (it doesn’t send out spam tweets but I liked the line too much to not use it for this article). I pinged Lior on Facebook with a request for a beta access code for TNW readers. I also asked how soon can I write about it. It’s that good. Seriously. I use every content curation service online. It really is That Good.
What is Swayy? It’s like Percolate and LinkedIn recommended articles, mixed with trending keywords for the topics you find interesting, combined with an analytics dashboard that shows the trends of what you do and how people react to it. I like it for the simplicity and accuracy of the content curation. Everything I’m actually interested in reading is in one place – I don’t have to skip from another major tech blog over to Harvard Business Review then hop over to another major tech or business blog. It’s all in there. And it has saved me So Much Time
After I decided that I trusted the service, I added my Facebook and LinkedIn accounts. The content just got That Much Better. I can share from the service itself, but I generally prefer reading the actual post first – so I end up sharing it from the main link, using Swayy more as a service for discovery.
I’m also finding myself checking out trending keywords more often (more often than never, which is how often I do it on Twitter.com).
The analytics side isn’t as interesting for me right now, but that could be due to the fact that I’ve barely been online since I came back from the US last weekend. The graphs also haven’t given me any particularly special insights as I can’t see which post got the actual feedback on the graph side (however there are numbers on the Timeline side.) This is a Beta though, and new features are being added and improved daily. I’m sure this is on the list. As they say, if you aren’t launching with something you’re embarrassed by, you’ve waited too long to launch.
It was the suggested content that impressed me the most. The articles really are spot on – which is why I pinged Lior again to ask a few questions:
How do you choose the articles listed on the site? Is there an algorithm involved? And is there any IP?
Yes, we’re in the process of filing a patent for it. But basically the system works with a Natural Language Processing Engine. Actually, there are several parts for the content matching, but besides analyzing what topics the articles are talking about, we have machine learning algorithms that match you to the relevant suggested stuff. For example, if you shared an article about Zuck that got a good reaction from your followers, we might offer you another one about Kevin Systrom (just a simple example).
Who came up with the idea for Swayy, and why? And what’s your business model?
Our business model is a subscription model for extra social accounts (extra Facebook / Twitter, etc) and team collaboration.
The idea was born from our day-to-day need to be active on social media, look for the best content to share with our followers, grow them, and measure what content works best.
Who is on the team?
Ohad Frankfurt is the CEO, Shlomi Babluki is the CTO and Oz Katz does Product and Engineering, and I [Lior Degani] do Marketing. The four of us are the founders. Oz and I were in 8200 [an elite Israeli army unit] together. Emily Engelson does Community Management and Graphic Design.
If you use Percolate or read LinkedIn’s recommended posts I think you’ll love Swayy.
➤ Want to try Swayy out without having to wait? Go to this secret URL and enter the promotion code thenextweb . The first 300 people to use the code will get access.
Image credit: Thinkstock
"""

    # Create a SummaryTool object
    st = SummaryTool()

    # Build the sentences dictionary
    sentences_dic = st.get_sentences_ranks(content)

    # Build the summary with the sentences dictionary
    summary = st.get_summary(title, content, sentences_dic)

    # Print the summary
    print(summary)

    # Print the ratio between the summary length and the original length
    print("")
    print("Original Length %s" % (len(title) + len(content)))
    print("Summary Length %s" % len(summary))
    print("Summary Ratio: %s" % (100 - (100 * (len(summary) / (len(title) + len(content))))))


if __name__ == '__main__':
    main()
@skeggse commented May 2, 2013

get_senteces_ranks should be get_sentences_ranks. There are two n's in sentence.

@shehanmunasinghe
Is this working?

@mansilla
line 29 should be:

if len(s1.intersection(s2)) == 0:

instead of:

if (len(s1) + len(s2)) == 0:

@shehanmunasinghe it's a naive approach, but it works.
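
For reference, a standalone sketch of the corrected check (same naive whitespace tokenization as the gist; the point is to guard on the intersection itself rather than on the set sizes):

```python
def sentences_intersection(sent1, sent2):
    # Tokenize naively on spaces, as the gist does
    s1 = set(sent1.split(" "))
    s2 = set(sent2.split(" "))

    # Guard on the intersection itself: disjoint sentences score 0
    common = s1.intersection(s2)
    if not common:
        return 0

    # Normalize by the average sentence length in words
    return len(common) / ((len(s1) + len(s2)) / 2)
```

With this guard, two sentences sharing no words score 0, and identical sentences score 1.0.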

@alexanderschochByteInnovations

How could it be that the result summary is just the title and one short sentence, like:

"Swayy is a beautiful new dashboard for discovering and curating online content [Invites]

Who is on the team?"

Isn't something missing from the summary of the content?
Please explain it to me.

@Mittchel
What happens with "\r" characters in the format_sentence method?
I'm porting this code to C#, but my regex returns "" and tries to add that to the dictionary. It works fine at first, but on the second "\r" I get an exception that the key "" has already been added.
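
For what it's worth, in the Python version `\W+` matches `\r` (and any other non-word character), so a whitespace-only sentence collapses to the empty string. A sketch of a guard around that case (the `add_sentence` helper is hypothetical, just to illustrate skipping empty keys):

```python
import re

def format_sentence(sentence):
    # \W matches any non-word character, including \r and \n,
    # so whitespace-only sentences collapse to ""
    return re.sub(r'\W+', '', sentence)

def add_sentence(sentences_dic, sentence, score):
    # Hypothetical helper: skip empty keys so repeated blank
    # lines don't collide in the dictionary
    key = format_sentence(sentence)
    if key:
        sentences_dic[key] = score
```

In Python the second empty key silently overwrites the first rather than throwing, which is why the gist gets away with it; in C# a `Dictionary.Add` with a duplicate key throws, so the empty-key guard is needed there.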

@jbrooksuk
I ported this to Node.js in my node-summary module.

@jlay1 commented Jan 30, 2014

Could I use this script in a project I'm working on?

@ptwobrussell
Nice work. This looks similar to an algorithm developed in the 1950s at IBM by H.P. Luhn. Luhn's algorithm basically finds the most "important" words and then ranks the "importance" of sentences according to the co-occurrences of these "important" words. The summary of the document then becomes the "top n" sentences in chronological order. An implementation of Luhn's algorithm (which also handles URL retrieval and HTML extraction) is available in Python here: http://nbviewer.ipython.org/github/ptwobrussell/Mining-the-Social-Web-2nd-Edition/blob/master/ipynb/Chapter%209%20-%20Twitter%20Cookbook.ipynb#Example-24.-Summarizing-link-targets
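
For anyone curious, a rough sketch of Luhn-style scoring (the `top_k` frequency cutoff is a made-up stand-in for Luhn's thresholds, and the "window" here is the whole sentence rather than Luhn's cluster of adjacent significant words; real Luhn also filters stopwords):

```python
from collections import Counter

def luhn_scores(sentences, top_k=5):
    # Treat the top_k most frequent words as "important" - a crude
    # stand-in for Luhn's frequency-based significance thresholds
    words = [w.lower() for s in sentences for w in s.split()]
    important = {w for w, _ in Counter(words).most_common(top_k)}

    scores = []
    for s in sentences:
        tokens = [w.lower() for w in s.split()]
        hits = sum(1 for w in tokens if w in important)
        # Luhn's measure: significant-word count squared over the
        # window size; here the window is the whole sentence
        scores.append(hits ** 2 / len(tokens) if tokens else 0)
    return scores
```

The summary would then be the top-n sentences by score, emitted in their original order.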

@jlay1 commented Feb 16, 2014

Does anyone know how I could add user input for the title and content?

@neerajbagga
Try this:

title = input('Enter the title: ')
content = input('Enter the content: ')

Entering the content this way will only work if you do not have a newline character in your content. Otherwise, I would recommend reading the content from a text file:

content = ''
with open('contentFile.txt', encoding='utf-8', errors='ignore') as infile:
    for line in infile:
        content += line

@kostyll commented Oct 31, 2014

Hi, I've implemented your algorithm in JS: https://github.com/kostyll/summary.js

@hyharryhuang
This is really interesting, thanks. I made a Swift version of this here: https://github.com/hyharryhuang/SwiftSummary

@mervetuccar
I implemented a version of your algorithm in Java, thank you!
Link: https://github.com/mervetuccar/NLP-Basic-Text-Summarizer/blob/master/src/SummaryTool.java

@jstsumguy
Big fan of this. Summarization seems to be a very tricky thing to do, especially for the English language.

@iAnatoly
Excellent educational post! Minor fix: the comment and code on lines 28-29 are a bit misleading. GitHub does not allow pull requests for gists, so could you merge manually? https://gist.github.com/iAnatoly/991ff2c32f68b88f6da4/revisions

@sainihimanshu
Superb

@aryopg commented Feb 28, 2016

What learning algorithm could I add to this code? And could you give me an example? Thank you.

@amorebise
Thanks for sharing this with the world. You rock!

@tensor5375
I executed this program, but it just returned the title. Sorry to say it, but it's just garbage.

@Hack-My-Life commented Feb 24, 2019

I am getting the following error when I try to run the code:

Traceback (most recent call last):
  File "sumarize.py", line 183, in <module>
    main()
  File "sumarize.py", line 167, in main
    sentences_dic = st.get_senteces_ranks(content)
  File "sumarize.py", line 51, in get_senteces_ranks
    values = [[0 for x in xrange(n)] for x in xrange(n)]
NameError: name 'xrange' is not defined

xrange was removed in Python 3 (range replaces it), so this only fails when I run the code under Python 3+.
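
A Python-3-compatible way to build that matrix might look like this (in Python 3, range is already lazy, so there is no need for xrange):

```python
def zero_matrix(n):
    # Works on both Python 2 and 3: range replaces xrange,
    # and each row is a distinct list (no shared references)
    return [[0 for _ in range(n)] for _ in range(n)]
```

The same one-line change inside get_senteces_ranks should make the gist run under Python 3, along with converting the print statements to print() calls.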
