Last active: December 2, 2020 08:50
TF-IDF computation in PySpark
from pyspark.mllib.feature import HashingTF, IDF

# Load documents (one per line), splitting each line into terms.
documents = sc.textFile("data/mllib/kmeans_data.txt").map(lambda line: line.split(" "))

hashingTF = HashingTF()
tf = hashingTF.transform(documents)

# While applying HashingTF needs only a single pass over the data, applying IDF
# needs two passes: first to compute the IDF vector, and second to scale the
# term frequencies by IDF. Caching avoids recomputing tf on the second pass.
tf.cache()
idf = IDF().fit(tf)
tfidf = idf.transform(tf)

# spark.mllib's IDF implementation provides an option for ignoring terms
# that occur in fewer than a minimum number of documents. In such cases,
# the IDF for those terms is set to 0. Use it by passing a minDocFreq
# value to the IDF constructor.
idfIgnore = IDF(minDocFreq=2).fit(tf)
tfidfIgnore = idfIgnore.transform(tf)
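The same pipeline can be sketched in plain Python without a Spark cluster. This is a minimal illustration of the hashing trick and of the smoothed IDF formula log((m + 1) / (df + 1)) that spark.mllib uses, including the minDocFreq cutoff; the bucket count, hash choice, and helper names here are illustrative assumptions, not part of the MLlib API.

```python
import math
import zlib


def hashing_tf(doc, num_features=16):
    """Map each term to a bucket via a hash (the 'hashing trick') and count it."""
    vec = [0.0] * num_features
    for term in doc:
        vec[zlib.crc32(term.encode("utf-8")) % num_features] += 1.0
    return vec


def fit_idf(tf_vectors, min_doc_freq=0):
    """Smoothed IDF per bucket: log((m + 1) / (df + 1)).

    Buckets whose document frequency is below min_doc_freq get weight 0,
    mirroring IDF(minDocFreq=...).
    """
    m = len(tf_vectors)
    num_features = len(tf_vectors[0])
    idf = []
    for j in range(num_features):
        df = sum(1 for v in tf_vectors if v[j] > 0)
        idf.append(math.log((m + 1) / (df + 1)) if df >= min_doc_freq else 0.0)
    return idf


def transform(tf_vec, idf):
    """Scale a term-frequency vector by the fitted IDF weights."""
    return [t * w for t, w in zip(tf_vec, idf)]


# Toy corpus standing in for the text file loaded above.
docs = [["spark", "spark", "hadoop"], ["spark", "flink"]]
tf = [hashing_tf(d) for d in docs]
idf = fit_idf(tf)
tfidf = [transform(v, idf) for v in tf]
```

Note that with m = 2 documents, a term appearing in both gets IDF log(3/3) = 0, so only terms that discriminate between documents keep nonzero weight, which is exactly the behavior the two-pass fit/transform above produces at scale.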