@mizvol
Created April 21, 2017 10:06
from pyspark.mllib.clustering import LDA, LDAModel
from pyspark.mllib.feature import IDF
from pyspark.mllib.linalg import Vectors
from pyspark.ml.feature import CountVectorizer
# vectorize tags array for each user
vectorizer = CountVectorizer(inputCol="tokens", outputCol="features").fit(tagsListDF)
countVectors = vectorizer.transform(tagsListDF).select("id", "features")
# find TF-IDF coefficients for each tag
# (Spark 2.0+: DataFrame has no .map, and ml vectors must be converted for mllib)
frequencyVectors = countVectors.rdd.map(lambda row: Vectors.fromML(row["features"]))
frequencyVectors.cache()
idf = IDF().fit(frequencyVectors)
tfidf = idf.transform(frequencyVectors)
# prepare corpus for LDA: each document needs a unique, non-negative id
corpus = tfidf.zipWithIndex().map(lambda x: [x[1], x[0]]).cache()
# train LDA
ldaModel = LDA.train(corpus, k=15, maxIterations=100, optimizer="online", docConcentration=2.0, topicConcentration=3.0)
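Once the model is trained, the topics still refer to tags by vocabulary index. The sketch below shows one possible way to turn them back into readable tags. It assumes the input has the shape returned by `ldaModel.describeTopics()` (a list of `(term_indices, term_weights)` pairs) and that the vocabulary comes from `vectorizer.vocabulary`. The helper name `topics_to_terms` and the sample data are hypothetical, used only so the example runs without a Spark session.

```python
def topics_to_terms(topics, vocabulary, top_n=5):
    """Map (term_indices, term_weights) pairs to lists of (term, weight).

    `topics` is assumed to look like the output of LDAModel.describeTopics();
    `vocabulary` is assumed to be a list such as CountVectorizerModel.vocabulary.
    """
    readable = []
    for term_indices, term_weights in topics:
        # Sort by weight, heaviest first, in case the input is not ordered.
        pairs = sorted(zip(term_indices, term_weights), key=lambda p: -p[1])
        readable.append([(vocabulary[i], w) for i, w in pairs[:top_n]])
    return readable


# Hypothetical stand-in data (not real model output).
sample_topics = [([2, 0, 1], [0.5, 0.3, 0.2]), ([1, 2], [0.7, 0.3])]
sample_vocabulary = ["python", "spark", "lda"]

readable_topics = topics_to_terms(sample_topics, sample_vocabulary, top_n=2)
for topic_id, terms in enumerate(readable_topics):
    print(topic_id, terms)
```

With the model above, the call would presumably be `topics_to_terms(ldaModel.describeTopics(maxTermsPerTopic=10), vectorizer.vocabulary)`.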