import math
from text.blob import TextBlob as tb

def tf(word, blob):
    return blob.words.count(word) / len(blob.words)

def n_containing(word, bloblist):
    return sum(1 for blob in bloblist if word in blob)

def idf(word, bloblist):
    return math.log(len(bloblist) / (1 + n_containing(word, bloblist)))

def tfidf(word, blob, bloblist):
    return tf(word, blob) * idf(word, bloblist)

document1 = tb("""Python is a 2000 made-for-TV horror movie directed by Richard
Clabaugh. The film features several cult favorite actors, including William
Zabka of The Karate Kid fame, Wil Wheaton, Casper Van Dien, Jenny McCarthy,
Keith Coogan, Robert Englund (best known for his role as Freddy Krueger in the
A Nightmare on Elm Street series of films), Dana Barron, David Bowe, and Sean
Whalen. The film concerns a genetically engineered snake, a python, that
escapes and unleashes itself on a small town. It includes the classic final
girl scenario evident in films like Friday the 13th. It was filmed in Los Angeles,
California and Malibu, California. Python was followed by two sequels: Python
II (2002) and Boa vs. Python (2004), both also made-for-TV films.""")
document2 = tb("""Python, from the Greek word (πύθων/πύθωνας), is a genus of
nonvenomous pythons[2] found in Africa and Asia. Currently, 7 species are
recognised.[2] A member of this genus, P. reticulatus, is among the longest
snakes known.""")
document3 = tb("""The Colt Python is a .357 Magnum caliber revolver formerly
manufactured by Colt's Manufacturing Company of Hartford, Connecticut.
It is sometimes referred to as a "Combat Magnum".[1] It was first introduced
in 1955, the same year as Smith & Wesson's M29 .44 Magnum. The now discontinued
Colt Python targeted the premium revolver market segment. Some firearm
collectors and writers such as Jeff Cooper, Ian V. Hogg, Chuck Hawks, Leroy
Thompson, Renee Smeets and Martin Dougherty have described the Python as the
finest production revolver ever made.""")
bloblist = [document1, document2, document3]
for i, blob in enumerate(bloblist):
    print("Top words in document {}".format(i + 1))
    scores = {word: tfidf(word, blob, bloblist) for word in blob.words}
    sorted_words = sorted(scores.items(), key=lambda x: x[1], reverse=True)
    for word, score in sorted_words[:3]:
        print("Word: {}, TF-IDF: {}".format(word, round(score, 5)))
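
For readers without TextBlob installed, the same computation can be sketched with only the standard library. This is a minimal sketch under stated assumptions, not the gist author's code: the `tokenize` helper and its regex are hypothetical, the three sample sentences are made up, and each document is a plain list of lowercase tokens rather than a TextBlob object. The divisions are forced to floating point so the result does not depend on the Python version.

```python
import math
import re


def tokenize(text):
    # Hypothetical helper: lowercase word tokens via a simple regex.
    # TextBlob's own tokenizer behaves differently (it keeps case, for one).
    return re.findall(r"[a-z0-9']+", text.lower())


def tf(word, tokens):
    # Term frequency: share of the document's tokens equal to `word`.
    return tokens.count(word) / float(len(tokens))


def n_containing(word, docs):
    # Number of documents that contain `word` at least once.
    return sum(1 for tokens in docs if word in tokens)


def idf(word, docs):
    # Same smoothed form as the gist: log(N / (1 + document frequency)).
    return math.log(len(docs) / float(1 + n_containing(word, docs)))


def tfidf(word, tokens, docs):
    return tf(word, tokens) * idf(word, docs)


# Made-up sample documents, for illustration only.
docs = [tokenize(t) for t in (
    "the python is a snake",
    "the colt python is a revolver",
    "the film is a horror movie",
)]
for i, tokens in enumerate(docs):
    scores = {word: tfidf(word, tokens, docs) for word in set(tokens)}
    top = sorted(scores.items(), key=lambda kv: kv[1], reverse=True)[:3]
    print("Document {}: {}".format(i + 1, top))
```

With this smoothing, a word that appears in every document gets a negative IDF, and a word that appears in all but one document gets an IDF of exactly zero, so only rarer words score above zero.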

prabhatntpc commented Nov 19, 2013

Dear Sir,

If I have to compute TF-IDF for multiple files stored in a folder, how would this program change?

jpmallette commented Sep 24, 2015

Thanks for the code. One small error:

from text.blob import TextBlob as tb
should be
from textblob import TextBlob as tb

younes0 commented Oct 6, 2015

I'm an NLP noob. How could I use this with TextBlob classifiers (Bayes/Maxent)?

eggie5 commented Dec 3, 2015

I don't know if this is a python 2 thing, but your division in the tf routine is operating on integers...

kevincong95 commented Feb 15, 2016

@eggie5, you can add this line to the top of the file to force float division:
from __future__ import division

RangerWolf commented Feb 19, 2016

Running this script in Python 2.7, I got a math domain error.
The root cause is that len(bloblist) / (1 + n_containing(word, bloblist)) is an integer division, so it can evaluate to 0, and math.log(0) raises an exception.
The same integer-division problem affects this function:

def tf(word, blob):
    return blob.words.count(word) / len(blob.words)

As a fix, convert to float before dividing, for example:

def tf(word, blob):
    return float(blob.words.count(word)) / float(len(blob.words))

Not sure about the Python 3 result...
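
The failure described here can be reproduced on any Python version. The following is a small sketch, assuming three documents and a word that occurs in all of them; `//` stands in for what `/` does to two integers under Python 2, and the variable names are illustrative only.

```python
import math

n_docs = 3          # len(bloblist) in the gist
n_with_word = 3     # assumed: a word that occurs in every document

# `//` reproduces what `/` does to two ints under Python 2.
floored = n_docs // (1 + n_with_word)   # 3 // 4 == 0
try:
    math.log(floored)
    error = None
except ValueError as exc:
    error = str(exc)
print("floor division:", floored, "->", error)

# With true division the ratio is 0.75 and the logarithm is defined.
ratio = n_docs / float(1 + n_with_word)
print("true division:", ratio, "->", math.log(ratio))
```

Under Python 3 the `/` operator already performs true division, so the original gist should not hit this error there.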

ranjanmanish commented Feb 27, 2016

def idf(word, bloblist):
    return math.log(len(bloblist) / float(1 + n_containing(word, bloblist)))

The idf function was throwing a math domain error as well, so I modified it as above, and it worked. Of course, I also incorporated the suggestion by RangerWolf.

jotixh commented Apr 25, 2016

Thanks, the comments are very useful!

tpatil2 commented Jun 28, 2016

Hi, I am getting incorrect output after following the given steps.
My output:

Top words in document 1
Word: Van, TF-IDF: 0.0
Word: both, TF-IDF: 0.0
Word: including, TF-IDF: 0.0
Top words in document 2
Word: and, TF-IDF: -0.0
Word: among, TF-IDF: 0.0
Word: snakes, TF-IDF: 0.0
Top words in document 3
Word: premium, TF-IDF: 0.0
Word: and, TF-IDF: -0.0
Word: Ian, TF-IDF: 0.0

Please Help

Thank you

nalindabandara commented Jul 16, 2016

@tpatil2

Convert everything to float, as below:

def tf(word, blob):
    return float(blob.words.count(word)) / float(len(blob.words))

def n_containing(word, bloblist):
    return float(sum(1 for blob in bloblist if word in blob))

def idf(word, bloblist):
    return float(math.log(len(bloblist) / float(1 + n_containing(word, bloblist))))

def tfidf(word, blob, bloblist):
    return float(float(tf(word, blob)) * float(idf(word, bloblist)))

aus10powell commented Jan 26, 2017

Really helpful stepping into the NLP world. Thanks!

gerraay commented Apr 18, 2017

Hello. Is there any way to sum the occurrences of the same word across multiple documents?

I use this function to count a word in a single document:
def tf(word, blob):
    return blob.words.count(word)

Thank you
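
One possible approach to the question above is `collections.Counter`, which can accumulate counts over several word lists. This is only a sketch: the `documents` lists are hypothetical stand-ins for `blob.words`, and no TextBlob call is used.

```python
from collections import Counter

# Hypothetical stand-ins for `blob.words`: one list of tokens per document.
documents = [
    ["python", "is", "a", "snake"],
    ["the", "python", "is", "a", "revolver"],
    ["python", "python", "film"],
]

totals = Counter()
for words in documents:
    totals.update(words)   # adds each word's count in this document

print(totals["python"])      # total occurrences across all documents
print(totals.most_common(3))
```

Since `update` accepts any iterable of words, passing `blob.words` for each blob in `bloblist` should work the same way, though that is an assumption about how the words are iterated rather than something tested here.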

nikhilcheke commented Jul 26, 2017

I have a few documents stored in a folder. Instead of writing the document text into the .py file, I want to access the documents through code. Please help!!
Thanks in advance.

AnnaBonazzi commented Aug 11, 2017

Hi @nikhilcheke, I have a similar situation to yours. I am using this solution:

import os, glob
folder = "/path/to/folder/"
os.chdir(folder)
files = glob.glob("*.txt") # Makes a list of all .txt files in folder
bloblist = []
for file1 in files:
    with open(file1, 'r') as f:
        data = f.read() # Reads document content into a string
    document = tb(data.decode("utf-8")) # Makes TextBlob object (Python 2; str has no .decode in Python 3)
    bloblist.append(document)

It's working for me
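
On Python 3, the same idea can be sketched with `pathlib`, which decodes the text on read so no `.decode` call is needed. The `read_documents` helper is hypothetical, and the temporary folder and file names exist only so the sketch runs anywhere.

```python
import tempfile
from pathlib import Path


def read_documents(folder, pattern="*.txt"):
    # Hypothetical helper: returns the text of every matching file,
    # in sorted filename order, decoded as UTF-8.
    texts = []
    for path in sorted(Path(folder).glob(pattern)):
        texts.append(path.read_text(encoding="utf-8"))
    return texts


# Demonstration with a throwaway folder so the sketch is self-contained.
with tempfile.TemporaryDirectory() as folder:
    Path(folder, "a.txt").write_text("Python is a snake.", encoding="utf-8")
    Path(folder, "b.txt").write_text("Python is a film.", encoding="utf-8")
    Path(folder, "notes.md").write_text("ignored", encoding="utf-8")
    texts = read_documents(folder)

print(texts)
```

Each returned string can then be wrapped as in the gist, for example `bloblist = [tb(t) for t in texts]`.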

vishnuragas commented Oct 20, 2017

After applying the corrections suggested above, I get no error, but it does not print any output in Jupyter Notebook.

sashavor commented Jan 8, 2018

Is it possible to incorporate lemmatizing into this process, using TextBlob, for instance?

arullroja commented Mar 6, 2018

I am using Python 2.7 and I got this error:
scores = {word: tfidf(word, blob, bloblist) for word in blob.words}
AttributeError: 'unicode' object has no attribute 'words'

adeyemosot commented Mar 8, 2018

Hi prabhatntpc,

I think this is too late, but others can benefit from it.

import glob
import os

files = glob.glob(os.path.join(os.getcwd(), ':/folder', '*.txt'))

# iterate over the list getting each file
for file in files:
    # open the file and then call .read() to get the text
    with open(file) as f:
        text = f.read()

meghanagabhushan commented May 16, 2018

I have this error -
File "C:\Users\megha\Local\Programs\Python\Python37-32\lib\site-packages\textblob\decorators.py", line 38, in decorated
raise MissingCorpusError()
textblob.exceptions.MissingCorpusError:
Looks like you are missing some required data for this feature.

1haa commented Jul 18, 2018

How would it be to read data from a txt file?

sid9394 commented Sep 26, 2018

How would it be to read data from a txt file?

with open ("abc.txt", "r") as myfile:
     data=myfile.read().replace('\n', '')

Here data will store the contents of your text file.

You can then use the variable "data" as required.
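
One caveat with replacing newlines by an empty string: words on either side of a line break get fused, which changes the word counts that TF-IDF depends on. A small sketch of the difference, using a fragment of the gist's first document as sample text:

```python
text = "directed by Richard\nClabaugh. The film"

fused = text.replace("\n", "")       # "Richard" and "Clabaugh." become one token
spaced = " ".join(text.split())      # collapses any whitespace run to one space

print(fused.split())
print(spaced.split())
```

Replacing the newline with a space, or normalizing whitespace as in `spaced`, keeps the words separate.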

RathoreRijhu commented Oct 24, 2018

return log(len(bloblist) / (1 + n_containing(word, bloblist)))
ValueError: math domain error
