import math
from text.blob import TextBlob as tb
def tf(word, blob):
    return blob.words.count(word) / len(blob.words)

def n_containing(word, bloblist):
    return sum(1 for blob in bloblist if word in blob)

def idf(word, bloblist):
    return math.log(len(bloblist) / (1 + n_containing(word, bloblist)))

def tfidf(word, blob, bloblist):
    return tf(word, blob) * idf(word, bloblist)

document1 = tb("""Python is a 2000 made-for-TV horror movie directed by Richard
Clabaugh. The film features several cult favorite actors, including William
Zabka of The Karate Kid fame, Wil Wheaton, Casper Van Dien, Jenny McCarthy,
Keith Coogan, Robert Englund (best known for his role as Freddy Krueger in the
A Nightmare on Elm Street series of films), Dana Barron, David Bowe, and Sean
Whalen. The film concerns a genetically engineered snake, a python, that
escapes and unleashes itself on a small town. It includes the classic final
girl scenario evident in films like Friday the 13th. It was filmed in Los Angeles,
California and Malibu, California. Python was followed by two sequels: Python
II (2002) and Boa vs. Python (2004), both also made-for-TV films.""")
document2 = tb("""Python, from the Greek word (πύθων/πύθωνας), is a genus of
nonvenomous pythons[2] found in Africa and Asia. Currently, 7 species are
recognised.[2] A member of this genus, P. reticulatus, is among the longest
snakes known.""")
document3 = tb("""The Colt Python is a .357 Magnum caliber revolver formerly
manufactured by Colt's Manufacturing Company of Hartford, Connecticut.
It is sometimes referred to as a "Combat Magnum".[1] It was first introduced
in 1955, the same year as Smith & Wesson's M29 .44 Magnum. The now discontinued
Colt Python targeted the premium revolver market segment. Some firearm
collectors and writers such as Jeff Cooper, Ian V. Hogg, Chuck Hawks, Leroy
Thompson, Renee Smeets and Martin Dougherty have described the Python as the
finest production revolver ever made.""")
bloblist = [document1, document2, document3]
for i, blob in enumerate(bloblist):
    print("Top words in document {}".format(i + 1))
    scores = {word: tfidf(word, blob, bloblist) for word in blob.words}
    sorted_words = sorted(scores.items(), key=lambda x: x[1], reverse=True)
    for word, score in sorted_words[:3]:
        print("Word: {}, TF-IDF: {}".format(word, round(score, 5)))

Dear Sir,

If I have to find the TF-IDF for multiple files stored in a folder, how will this program change?

Thanks for the code. One small error:

from text.blob import TextBlob as tb
should be
from textblob import TextBlob as tb

younes0 commented Oct 6, 2015

I'm an NLP noob. How could I use this with TextBlob classifiers (Bayes/Maxent)?

eggie5 commented Dec 3, 2015

I don't know if this is a python 2 thing, but your division in the tf routine is operating on integers...

@eggie5, you can add this line to the top to coerce float division:
from __future__ import division

I ran this script in Python 2.7 and got a math domain error.
The root cause is that len(bloblist) / (1 + n_containing(word, bloblist)) is likely to evaluate to 0 under integer division, and the log function then raises an exception.
The same applies to this function:

def tf(word, blob):
    return blob.words.count(word) / len(blob.words)

As a fix, convert the operands to float before dividing, for example:

def tf(word, blob):
    return (float)(blob.words.count(word)) / (float)(len(blob.words))

Not sure about the result in Python 3...
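
To illustrate the failure described above, here is a minimal sketch that runs on Python 3 and uses `//` to mimic what Python 2's `/` does with two integers. The numbers (3 documents, a word found in all 3) are illustrative assumptions, not values taken from the gist's output.

```python
import math

n_docs = 3        # hypothetical corpus size
n_with_word = 3   # hypothetical: the word appears in every document

# Python 2's `/` on two ints floors the result, like `//` does here
floor_ratio = n_docs // (1 + n_with_word)
# Converting one operand to float gives true division in Python 2 and 3
true_ratio = n_docs / float(1 + n_with_word)

try:
    math.log(floor_ratio)   # log of 0 is undefined
    error = None
except ValueError as exc:
    error = exc             # the "math domain error" reported in this thread

idf_value = math.log(true_ratio)   # negative, because the ratio is below 1
```

With true division the logarithm is defined, but note that the `1 +` in the denominator makes the value negative for a word that occurs in every document.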

def idf(word, bloblist): return math.log(len(bloblist) / (float)(1 + n_containing(word, bloblist)))

The idf function was throwing a math domain error as well, hence I modified it as shown. It worked. Of course, I also incorporated the suggestion by RangetWolf.

jotixh commented Apr 25, 2016

Thanks, the comments are very useful!

tpatil2 commented Jun 28, 2016

Hi, I am getting wrong output after following the given steps.
My output:

Top words in document 1
Word: Van, TF-IDF: 0.0
Word: both, TF-IDF: 0.0
Word: including, TF-IDF: 0.0
Top words in document 2
Word: and, TF-IDF: -0.0
Word: among, TF-IDF: 0.0
Word: snakes, TF-IDF: 0.0
Top words in document 3
Word: premium, TF-IDF: 0.0
Word: and, TF-IDF: -0.0
Word: Ian, TF-IDF: 0.0

Please Help

Thank you

@tpatil2

Convert everything to float, as below:

def tf(word, blob):
    return (float)(blob.words.count(word)) / (float)(len(blob.words))

def n_containing(word, bloblist):
    return (float)(sum(1 for blob in bloblist if word in blob))

def idf(word, bloblist):
    return (float)(math.log(len(bloblist) / (float)(1 + n_containing(word, bloblist))))

def tfidf(word, blob, bloblist):
    return (float)((float)(tf(word, blob)) * (float)(idf(word, bloblist)))
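
For anyone who wants to check the arithmetic without installing anything, here is a self-contained sketch of the same four functions with float division. It assumes plain lists of lowercase tokens in place of TextBlob objects, so membership is tested per whole token, and the three tiny documents are made up for illustration.

```python
import math

def tf(word, words):
    # Fraction of the tokens in one document that equal `word`
    return words.count(word) / float(len(words))

def n_containing(word, docs):
    # Number of documents that contain `word` at least once
    return sum(1 for words in docs if word in words)

def idf(word, docs):
    # float() is applied to the ratio inside the log, not to the log's result
    return math.log(len(docs) / float(1 + n_containing(word, docs)))

def tfidf(word, words, docs):
    return tf(word, words) * idf(word, docs)

# Hypothetical toy corpus, already tokenized
docs = [
    "the python is a snake".split(),
    "the colt python is a revolver".split(),
    "the film is a horror movie".split(),
]

scores = {w: tfidf(w, docs[0], docs) for w in docs[0]}
top_word = max(scores, key=scores.get)
```

Here "snake" ranks highest for the first document because it occurs in only one of the three. Words found in every document, such as "the", get a negative score because of the `1 +` in the denominator.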

Really helpful stepping into the NLP world. Thanks!

gerraay commented Apr 18, 2017

Hello. Is there any way to sum the occurrences of the same word across multiple documents?

I use this function to count a word in a single document:
def tf(word, blob): return blob.words.count(word)

Thank you
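
One possible approach, sketched here with the standard library's `collections.Counter`, is to accumulate counts document by document. The sketch assumes each document is already a list of tokens, and the two sample documents are made up. With TextBlob, passing `blob.words` in place of each list should work the same way, though that is untested here.

```python
from collections import Counter

# Hypothetical tokenized documents
docs = [
    "python is a snake".split(),
    "the colt python is a revolver".split(),
]

total_counts = Counter()
for words in docs:
    total_counts.update(words)   # adds this document's counts to the totals

python_total = total_counts["python"]   # occurrences across all documents
```

A `Counter` returns 0 for words it has never seen, so looking up a word that is absent from every document does not raise an error.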

I have a few documents stored in a folder. Instead of writing the document text into the .py file, I want to access the documents through code. Please help!
Thanks in advance.

AnnaBonazzi commented Aug 11, 2017 edited

Hi @nikhilcheke, I have a similar situation to yours. I am using this solution:

import io, os, glob
folder = "/path/to/folder/"
os.chdir(folder)
files = glob.glob("*.txt") # Makes a list of all .txt files in folder
bloblist = []
for file1 in files:
    # io.open decodes UTF-8 while reading, in both Python 2 and 3
    with io.open(file1, 'r', encoding='utf-8') as f:
        data = f.read() # Reads document content into a string
    document = tb(data) # Makes TextBlob object
    bloblist.append(document)

It's working for me
