@sloria
Created September 1, 2013 20:57
import math
from text.blob import TextBlob as tb

def tf(word, blob):
    return blob.words.count(word) / len(blob.words)

def n_containing(word, bloblist):
    return sum(1 for blob in bloblist if word in blob)

def idf(word, bloblist):
    return math.log(len(bloblist) / (1 + n_containing(word, bloblist)))

def tfidf(word, blob, bloblist):
    return tf(word, blob) * idf(word, bloblist)

document1 = tb("""Python is a 2000 made-for-TV horror movie directed by Richard
Clabaugh. The film features several cult favorite actors, including William
Zabka of The Karate Kid fame, Wil Wheaton, Casper Van Dien, Jenny McCarthy,
Keith Coogan, Robert Englund (best known for his role as Freddy Krueger in the
A Nightmare on Elm Street series of films), Dana Barron, David Bowe, and Sean
Whalen. The film concerns a genetically engineered snake, a python, that
escapes and unleashes itself on a small town. It includes the classic final
girl scenario evident in films like Friday the 13th. It was filmed in Los Angeles,
California and Malibu, California. Python was followed by two sequels: Python
II (2002) and Boa vs. Python (2004), both also made-for-TV films.""")
document2 = tb("""Python, from the Greek word (πύθων/πύθωνας), is a genus of
nonvenomous pythons[2] found in Africa and Asia. Currently, 7 species are
recognised.[2] A member of this genus, P. reticulatus, is among the longest
snakes known.""")
document3 = tb("""The Colt Python is a .357 Magnum caliber revolver formerly
manufactured by Colt's Manufacturing Company of Hartford, Connecticut.
It is sometimes referred to as a "Combat Magnum".[1] It was first introduced
in 1955, the same year as Smith & Wesson's M29 .44 Magnum. The now discontinued
Colt Python targeted the premium revolver market segment. Some firearm
collectors and writers such as Jeff Cooper, Ian V. Hogg, Chuck Hawks, Leroy
Thompson, Renee Smeets and Martin Dougherty have described the Python as the
finest production revolver ever made.""")
bloblist = [document1, document2, document3]
for i, blob in enumerate(bloblist):
    print("Top words in document {}".format(i + 1))
    scores = {word: tfidf(word, blob, bloblist) for word in blob.words}
    sorted_words = sorted(scores.items(), key=lambda x: x[1], reverse=True)
    for word, score in sorted_words[:3]:
        print("Word: {}, TF-IDF: {}".format(word, round(score, 5)))

@prabhatntpc

Dear Sir,

If I need to compute TF-IDF for multiple files stored in a folder, how would this program change?

@jpmallette

Thanks for the code. One small error:

from text.blob import TextBlob as tb
should be
from textblob import TextBlob as tb

@younes0

younes0 commented Oct 6, 2015

I'm an NLP noob. How could I use this with TextBlob classifiers (Bayes/Maxent)?

@eggie5

eggie5 commented Dec 3, 2015

I don't know if this is a Python 2 thing, but the division in your tf routine is operating on integers...

@kevincong95

@eggie5, you can add this line to the top to coerce float division:
from __future__ import division

@RangerWolf

Running this script in Python 2.7 raised a math domain error.
The root cause is that len(bloblist) / (1 + n_containing(word, bloblist)) is an integer division in Python 2, so it often evaluates to 0, and math.log(0) raises an exception.
The same integer-division problem affects this function:

def tf(word, blob):
    return blob.words.count(word) / len(blob.words)

As a fix, convert to float before the calculation, for example:

def tf(word, blob):
    return float(blob.words.count(word)) / float(len(blob.words))

Not sure about the result in Python 3...
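
On the Python 3 question: there the / operator performs true division, so the ratio inside the logarithm is a float and is never 0 for a non-empty document list. The following is only a minimal sketch; it uses plain lists of words as hypothetical stand-ins for TextBlob objects so that it runs without TextBlob installed.

```python
import math

# Hypothetical stand-ins for TextBlob documents: each document is a list of words.
docs = [
    ["python", "is", "a", "horror", "movie"],
    ["python", "is", "a", "genus", "of", "snakes"],
    ["the", "colt", "python", "is", "a", "revolver"],
]

def n_containing(word, docs):
    # Number of documents that contain the word.
    return sum(1 for doc in docs if word in doc)

def idf(word, docs):
    # "/" is true division in Python 3, so the ratio is a float.
    return math.log(len(docs) / (1 + n_containing(word, docs)))

# "python" is in all 3 documents: log(3 / 4) is negative but well defined.
idf_common = idf("python", docs)
# "revolver" is in 1 document: log(3 / 2) is positive.
idf_rare = idf("revolver", docs)
print(idf_common, idf_rare)
```

Because of the +1 in the denominator, a word that occurs in every document gets a negative IDF rather than zero.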

@ranjanmanish

def idf(word, bloblist):
    return math.log(len(bloblist) / float(1 + n_containing(word, bloblist)))

The idf function was throwing a math domain error as well, so I modified it and it worked. Of course, I also incorporated the suggestion by RangerWolf.

@jotixh

jotixh commented Apr 25, 2016

Thanks, the comments are very useful!

@tpatil2

tpatil2 commented Jun 28, 2016

Hi, I am getting incorrect output after following the given steps.
My output:

Top words in document 1
Word: Van, TF-IDF: 0.0
Word: both, TF-IDF: 0.0
Word: including, TF-IDF: 0.0
Top words in document 2
Word: and, TF-IDF: -0.0
Word: among, TF-IDF: 0.0
Word: snakes, TF-IDF: 0.0
Top words in document 3
Word: premium, TF-IDF: 0.0
Word: and, TF-IDF: -0.0
Word: Ian, TF-IDF: 0.0

Please help.

Thank you

@nalindabandara

@tpatil2

Convert everything to float, as below:

def tf(word, blob):
    return float(blob.words.count(word)) / float(len(blob.words))

def n_containing(word, bloblist):
    return float(sum(1 for blob in bloblist if word in blob))

def idf(word, bloblist):
    return math.log(len(bloblist) / float(1 + n_containing(word, bloblist)))

def tfidf(word, blob, bloblist):
    return float(tf(word, blob)) * float(idf(word, bloblist))

@aus10powell

Really helpful for stepping into the NLP world. Thanks!

@gerraay

gerraay commented Apr 18, 2017

Hello. Is there any way to sum the occurrences of the same word across multiple documents?

I use this function to count a word in a single document:
def tf(word, blob):
    return blob.words.count(word)

Thank you
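
One possible way to total a word across several documents is to sum the per-document counts, or to accumulate all words in a collections.Counter. This is a sketch only: the documents are plain word lists (hypothetical stand-ins), on the assumption that blob.words could be passed wherever a word list appears here. The function names are illustrative and not part of the gist.

```python
from collections import Counter

# Hypothetical stand-ins for TextBlob documents: each document is a list of words.
docs = [
    ["python", "is", "a", "snake", "a", "python"],
    ["the", "colt", "python", "is", "a", "revolver"],
]

def count_in_doc(word, doc):
    # Occurrences of the word in a single document.
    return doc.count(word)

def count_in_corpus(word, docs):
    # Occurrences of the word summed over every document.
    return sum(count_in_doc(word, doc) for doc in docs)

def corpus_counts(docs):
    # Counts for every word at once, accumulated across documents.
    totals = Counter()
    for doc in docs:
        totals.update(doc)
    return totals

total_python = count_in_corpus("python", docs)
totals = corpus_counts(docs)
print(total_python, totals["a"])
```

The Counter variant is the better fit when totals are needed for many words, since it walks each document only once.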

@nikhilcheke

I have a few documents stored in a folder. Instead of writing the document text into the .py file, I want to access the documents through code. Please help!
Thanks in advance.

@annabonazzi

annabonazzi commented Aug 11, 2017

Hi @nikhilcheke, I have a similar situation to yours. I am using this solution:

import os, glob
folder = "/path/to/folder/"
os.chdir(folder)
files = glob.glob("*.txt")  # Makes a list of all .txt files in the folder
bloblist = []
for file1 in files:
    with open(file1, 'r') as f:
        data = f.read()  # Reads document content into a string
    document = tb(data.decode("utf-8"))  # Makes TextBlob object (Python 2: decode bytes to unicode)
    bloblist.append(document)

It's working for me
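
For Python 3, where a string read in text mode has no .decode() method, a roughly equivalent sketch using pathlib might look like the following. The function name read_documents is hypothetical, and the example returns plain strings; each one could presumably then be wrapped as tb(text) to build the bloblist.

```python
from pathlib import Path

def read_documents(folder):
    # Return the text of every .txt file in the folder, in sorted filename order.
    texts = []
    for path in sorted(Path(folder).glob("*.txt")):
        # Passing the encoding here replaces the separate .decode() step.
        texts.append(path.read_text(encoding="utf-8"))
    return texts
```

Unlike the os.chdir approach, this does not change the working directory of the running script.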

@vishnuragas

After applying the corrections suggested above, I get no error, but nothing is printed in the Jupyter notebook.

@sashavor

sashavor commented Jan 8, 2018

Is it possible to incorporate lemmatizing into this process, using TextBlob, for instance?

@arullroja

I am using Python 2.7 and I got this error:
scores = {word: tfidf(word, blob, bloblist) for word in blob.words}
AttributeError: 'unicode' object has no attribute 'words'

@adeyemosot

Hi prabhatntpc,

I think this is too late but others can benefit from it.

import glob
import os

files = glob.glob(os.path.join(os.getcwd(), ':/folder', '*.txt'))

# iterate over the list getting each file
for file in files:
    # open the file and then call .read() to get the text
    with open(file) as f:
        text = f.read()

@meghanagabhushan

I have this error -
File "C:\Users\megha\Local\Programs\Python\Python37-32\lib\site-packages\textblob\decorators.py", line 38, in decorated
raise MissingCorpusError()
textblob.exceptions.MissingCorpusError:
Looks like you are missing some required data for this feature.

@1haa

1haa commented Jul 18, 2018

How would it be to read data from a txt file?

@sid9394

sid9394 commented Sep 26, 2018

How would it be to read data from a txt file?

with open("abc.txt", "r") as myfile:
    data = myfile.read().replace('\n', '')

Here data will store the contents of your text file.

You can then use the variable "data" as required.

@RathoreRijhu

return log(len(bloblist) / (1 + n_containing(word, bloblist)))
ValueError: math domain error

@PiyushKyushu

Along with the TF-IDF score, I also want the TF score. How can I do that?
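
One way to report both values is to keep the term frequency next to the TF-IDF score when building the score table. The sketch below assumes Python 3 division and uses plain word lists as hypothetical stand-ins for TextBlob objects; tf_and_tfidf is an illustrative name, not part of the gist.

```python
import math

# Hypothetical stand-ins for TextBlob documents: each document is a list of words.
docs = [
    ["python", "is", "a", "horror", "movie"],
    ["python", "is", "a", "genus", "of", "snakes"],
    ["the", "colt", "python", "is", "a", "revolver"],
]

def tf(word, doc):
    return doc.count(word) / len(doc)

def n_containing(word, docs):
    return sum(1 for doc in docs if word in doc)

def idf(word, docs):
    return math.log(len(docs) / (1 + n_containing(word, docs)))

def tf_and_tfidf(doc, docs):
    # Map each distinct word to a (tf, tf-idf) pair.
    scores = {}
    for word in set(doc):
        term_freq = tf(word, doc)
        scores[word] = (term_freq, term_freq * idf(word, docs))
    return scores

for i, doc in enumerate(docs):
    print("Top words in document {}".format(i + 1))
    # Rank by the TF-IDF component of each pair.
    ranked = sorted(tf_and_tfidf(doc, docs).items(),
                    key=lambda item: item[1][1], reverse=True)
    for word, (term_freq, score) in ranked[:3]:
        print("Word: {}, TF: {}, TF-IDF: {}".format(
            word, round(term_freq, 5), round(score, 5)))
```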

@noorkosim

def score_tf(query, tokenized_document):
    print('query:', query)
    result = 0.0
    for q in query:
        count = term_frequency(q, tokenized_document)
        tf = 1 + math.log(count)
        print("count:", count, "\tterm:", q, "\ttf:", tf)
        result = result + tf
    return result

def inverse_document_frequencies(term, documents):
    df = 0
    for d in documents:
        tokenized_d = text2list(d)
        if term in tokenized_d:
            df = df + 1
    return math.log(len(documents) / df)

How do I calculate tf-idf?
def score_tfidf(query)
...........
Please help me!
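
A possible shape for such a function is to sum, over the query terms, the log-scaled term frequency multiplied by the inverse document frequency. This is a sketch under several assumptions: documents are already tokenized into word lists, term_frequency is a hypothetical helper returning a raw count, the extra parameters on score_tfidf are added for illustration, and terms with a zero count or zero document frequency contribute 0 so that neither log(0) nor a division by zero occurs.

```python
import math

def term_frequency(term, tokenized_document):
    # Hypothetical helper: raw count of the term in one tokenized document.
    return tokenized_document.count(term)

def log_tf(term, tokenized_document):
    # 1 + log(count), or 0 when the term is absent (log(0) is undefined).
    count = term_frequency(term, tokenized_document)
    return 1 + math.log(count) if count > 0 else 0.0

def inverse_document_frequency(term, tokenized_documents):
    # log(N / df), or 0 when no document contains the term.
    df = sum(1 for doc in tokenized_documents if term in doc)
    return math.log(len(tokenized_documents) / df) if df > 0 else 0.0

def score_tfidf(query, tokenized_document, tokenized_documents):
    # Sum of tf * idf over the query terms for one document.
    return sum(
        log_tf(term, tokenized_document)
        * inverse_document_frequency(term, tokenized_documents)
        for term in query
    )

documents = [
    ["colt", "python", "revolver"],
    ["python", "snake", "snake"],
    ["horror", "movie"],
]
score = score_tfidf(["snake", "python"], documents[1], documents)
print(score)
```

Calling score_tfidf once per document and sorting by the result would give a simple ranking of documents for a query.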

@iranvir

iranvir commented Dec 13, 2020

Can someone help me understand line number 8?
return sum(1 for blob in bloblist if word in blob)
I don't understand how the sum(1 for ...) statement works. What is the purpose of 1 in there?
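
To illustrate the idiom asked about here: the generator expression produces the value 1 once for every item that passes the if filter, so summing those 1s counts the matching items. A small sketch, using plain strings as hypothetical stand-ins for TextBlob objects:

```python
# Hypothetical stand-ins for TextBlob documents: plain strings.
bloblist = ["python the movie", "python the snake", "the colt revolver"]
word = "python"

# One 1 is produced for each document that passes the filter...
ones = [1 for blob in bloblist if word in blob]
# ...so summing those 1s counts the matching documents.
count_with_sum = sum(1 for blob in bloblist if word in blob)

# Equivalent explicit loop.
count_with_loop = 0
for blob in bloblist:
    if word in blob:
        count_with_loop += 1

print(ones, count_with_sum, count_with_loop)
```

The 1 is simply the amount added to the total for each match; any other expression there would be summed instead.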
