@patrickvankessel, I tried running the CorEx model on a dataset of 600k documents and it has been processing for more than an hour. Is this expected?
I've never tried scaling it up to a dataset that large - it may eventually finish, but it could take hours or days, especially if you have longer documents and a large vocabulary. If you want a faster option, you could fit the model on a sample of 50-100k documents, and then apply the model to the full dataset afterwards. You could also try narrowing the vocabulary by tweaking the TF-IDF vectorizer parameters and setting a max_features limit.
@patrickvankessel, the whole corpus has a vocabulary size of 850k; I'm trying it on 100k data points.
This was very helpful! Thanks @patrickvankessel
Hi @patrickvankessel, in your examples, there are 8 topics. Is it obligatory to give anchors for all topics?
Can I only give anchors for 6 topics? I want the model to naturally learn two new topics.
You can provide anchors for as many or as few topics as you want - it's perfectly fine to leave some (or all) of them empty!
Hi @patrickvankessel, I really appreciate your example; it's exactly what I want to implement! However, I have one question about the final dataframe. In it, each row of text shows 1.0 for several topics. How can I restrict this to a single topic per text? And can I get float values other than 0 or 1 as a result from CorEx?
How do you predict topics for new documents? I ran into an issue when calling model.predict() on new documents.
Hi @patrickvankessel, I like your blog. I went through the same process, thinking "it's a big mess!" After some adjustments, with all topics "overcooked", I was shocked by how incoherent the results looked (to me)!
My question: you did not remove the stop words. I keep them until the last step in my process, but your topics retained them. Was there any reason for that?