-
-
Save sloria/6407257 to your computer and use it in GitHub Desktop.
import math | |
from text.blob import TextBlob as tb | |
def tf(word, blob): | |
return blob.words.count(word) / len(blob.words) | |
def n_containing(word, bloblist): | |
return sum(1 for blob in bloblist if word in blob) | |
def idf(word, bloblist): | |
return math.log(len(bloblist) / (1 + n_containing(word, bloblist))) | |
def tfidf(word, blob, bloblist): | |
return tf(word, blob) * idf(word, bloblist) | |
document1 = tb("""Python is a 2000 made-for-TV horror movie directed by Richard | |
Clabaugh. The film features several cult favorite actors, including William | |
Zabka of The Karate Kid fame, Wil Wheaton, Casper Van Dien, Jenny McCarthy, | |
Keith Coogan, Robert Englund (best known for his role as Freddy Krueger in the | |
A Nightmare on Elm Street series of films), Dana Barron, David Bowe, and Sean | |
Whalen. The film concerns a genetically engineered snake, a python, that | |
escapes and unleashes itself on a small town. It includes the classic final | |
girl scenario evident in films like Friday the 13th. It was filmed in Los Angeles, | |
California and Malibu, California. Python was followed by two sequels: Python | |
II (2002) and Boa vs. Python (2004), both also made-for-TV films.""") | |
document2 = tb("""Python, from the Greek word (πύθων/πύθωνας), is a genus of | |
nonvenomous pythons[2] found in Africa and Asia. Currently, 7 species are | |
recognised.[2] A member of this genus, P. reticulatus, is among the longest | |
snakes known.""") | |
document3 = tb("""The Colt Python is a .357 Magnum caliber revolver formerly | |
manufactured by Colt's Manufacturing Company of Hartford, Connecticut. | |
It is sometimes referred to as a "Combat Magnum".[1] It was first introduced | |
in 1955, the same year as Smith & Wesson's M29 .44 Magnum. The now discontinued | |
Colt Python targeted the premium revolver market segment. Some firearm | |
collectors and writers such as Jeff Cooper, Ian V. Hogg, Chuck Hawks, Leroy | |
Thompson, Renee Smeets and Martin Dougherty have described the Python as the | |
finest production revolver ever made.""") | |
bloblist = [document1, document2, document3] | |
for i, blob in enumerate(bloblist): | |
print("Top words in document {}".format(i + 1)) | |
scores = {word: tfidf(word, blob, bloblist) for word in blob.words} | |
sorted_words = sorted(scores.items(), key=lambda x: x[1], reverse=True) | |
for word, score in sorted_words[:3]: | |
print("Word: {}, TF-IDF: {}".format(word, round(score, 5))) |
Hi prabhatntpc,
I think this is too late but others can benefit from it.
import glob
import os
files = glob.glob(os.path.join(os.getcwd(), ':/folder', '*.txt' ))
iterate over the list getting each file
for file in files:
open the file and then call .read() to get the text
with open(file) as f:
text = f.read()
I have this error -
File "C:\Users\megha\Local\Programs\Python\Python37-32\lib\site-packages\textblob\decorators.py", line 38, in decorated
raise MissingCorpusError()
textblob.exceptions.MissingCorpusError:
Looks like you are missing some required data for this feature.
How would it be to read data from a txt file?
How would it be to read data from a txt file?
with open ("abc.txt", "r") as myfile:
data=myfile.read().replace('\n', '')
Here data will store the contents of your text file.
You can then use the variable "data" as required.
return log(len(bloblist) / (1 + n_containing(word, bloblist)))
ValueError: math domain error
Along with TF-IDF score, I also want TF score. How can I do it?
def score_tf(query, tokenized_document):
print('query:', query)
result = 0.0
for q in query:
count = term_frequency(q, tokenized_document)
tf = 1 + math.log(count)
print ("count:",count, "\tterm:",q,"\ttf:",tf)
result = result + tf
return result
def inverse_document_frequencies(term, documents):
df = 0
for d in documents:
tokenized_d = text2list(d)
if term in tokenized_d:
df = df + 1
return math.log(len(documents)/df)
how to make calculate ti.idf ?
def score_tfidf(query)
...........
please help me !
Can someone help me understand line number 8?
return sum(1 for blob in bloblist if word in blob)
I don't understand how the sum(1 for ...) statement works. What is the purpose of 1
in there?
i am using python27 and i got this error
scores = {word: tfidf(word, blob, bloblist) for word in blob.words}
AttributeError: 'unicode' object has no attribute 'words'