@patrickvankessel
Created April 8, 2019 15:36
@ImSajeed

ImSajeed commented May 27, 2019

@patrickvankessel, I tried running the CorEx model on 600K data points, and it's taking more than an hour to process. Is this expected?

@patrickvankessel
Author

> @patrickvankessel, I tried running the CorEx model on 600K data points, and it's taking more than an hour to process. Is this expected?

I've never tried scaling it up to a dataset that large - it may eventually finish, but it could take hours or days, especially if you have longer documents and a large vocabulary. If you want a faster option, you could fit the model on a sample of 50-100k documents, and then apply the model to the full dataset afterwards. You could also try narrowing the vocabulary by tweaking the TF-IDF vectorizer parameters and setting a max_features limit.
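A minimal sketch of the sample-then-apply approach described above, assuming the `corextopic` package and scikit-learn's `TfidfVectorizer`. The function names, the sample size, and the `max_features` value are illustrative choices, not settings from the gist.

```python
import random


def sample_documents(docs, sample_size, seed=0):
    """Return a reproducible random sample of at most `sample_size` documents."""
    if len(docs) <= sample_size:
        return list(docs)
    return random.Random(seed).sample(list(docs), sample_size)


def fit_on_sample_label_all(docs, n_topics=8, sample_size=50_000,
                            max_features=20_000, seed=0):
    """Fit CorEx on a sample, then label every document with the fitted model."""
    # Imported here so the sampling helper above works without these packages.
    from sklearn.feature_extraction.text import TfidfVectorizer
    from corextopic import corextopic as ct

    sample = sample_documents(docs, sample_size, seed=seed)

    # max_features caps the vocabulary at the most frequent terms (assumed value).
    vectorizer = TfidfVectorizer(max_features=max_features, binary=True)
    X_sample = vectorizer.fit_transform(sample)
    words = list(vectorizer.get_feature_names_out())

    model = ct.Corex(n_hidden=n_topics, seed=seed)
    model.fit(X_sample, words=words)

    # Reuse the fitted vocabulary so the columns line up with the model.
    X_all = vectorizer.transform(docs)
    labels = model.transform(X_all)  # one row per document, one column per topic
    return model, vectorizer, labels
```

Fitting is the slow part, so only the sample pays that cost; the full dataset only goes through `transform`, which should be much cheaper.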

@ImSajeed

@patrickvankessel, the whole corpus has a vocabulary size of 850k; I'm trying it on 100k data points.

@cheevahagadog

This was very helpful! Thanks @patrickvankessel

@nguyenhaidang94

Hi @patrickvankessel, in your examples, there are 8 topics. Is it obligatory to give anchors for all topics?
Can I only give anchors for 6 topics? I want the model to naturally learn two new topics.

@nadia-felix

Hi @patrickvankessel, in your examples, there are 8 topics. Is it obligatory to give anchors for all topics?
Can I only give anchors for 6 topics? I want the model to naturally learn two new topics.

@patrickvankessel
Author

You can provide anchors for as many or as few topics as you want - it's perfectly fine to leave some (or all) of them empty!
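A rough sketch of anchoring only some topics, assuming the `corextopic` package. The helper names and the `anchor_strength` value are hypothetical. As far as I understand the library, anchor groups are assigned to topics in order, so passing fewer groups than `n_hidden` leaves the remaining topics unanchored.

```python
def build_anchors(anchor_words, vocabulary):
    """Keep only anchor words present in the vocabulary; drop groups left empty."""
    vocab = set(vocabulary)
    anchors = []
    for group in anchor_words:
        kept = [word for word in group if word in vocab]
        if kept:
            anchors.append(kept)
    return anchors


def fit_partially_anchored(X, words, anchor_words, n_topics=8, anchor_strength=3):
    """Fit CorEx with anchors for only the first len(anchors) topics."""
    # Imported here so build_anchors can be used without corextopic installed.
    from corextopic import corextopic as ct

    anchors = build_anchors(anchor_words, words)
    model = ct.Corex(n_hidden=n_topics, seed=42)
    # With 6 anchor groups and n_topics=8, the last two topics are learned freely.
    model.fit(X, words=words, anchors=anchors, anchor_strength=anchor_strength)
    return model
```

Note that dropping an empty group shifts the later groups up by one topic index, so check the order of the cleaned list before interpreting topic numbers.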

@ImKH310

ImKH310 commented Feb 24, 2022

Hi @patrickvankessel! I really appreciate your example; it's exactly what I want to implement. However, I have one question about the final dataframe. In it, each row of text has several topics that show 1.0. How can I assign only one topic per text? And can I get any float values other than 0 or 1 as a result of CorEx?

@GiarteDataTeam

How do you predict topics for new documents? I ran into an issue when calling model.predict() on new documents.

@eduamf

eduamf commented May 3, 2022

Hi @patrickvankessel, I like your blog. I went through the same process, thinking: "it's a big mess!" After some adjustments, with all topics "overcooked", I was shocked to see results that looked incoherent (to me)!

My question: you did not remove the stop words. I keep them until the last step in the process, but your topics retained them. Was there any reason?
