@ctufts
Last active August 8, 2016 18:08
General notes from using gensim on 20 million messages
  • save_as_text : only use this if you just want to read the dictionary as text. It will cause issues if you want to go back later and revise/filter the dictionary; use save/load for that instead
  • If you load a saved dictionary and then alter it, the corpus must also be updated to match, as outlined here - Q8
  • You have to limit the number of features in large datasets, otherwise the memory consumption is huge
  • This is regardless of whether the corpus is loaded in RAM or serialized
  • Iterations argument - refers to the number of iterations in the EM step
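A rough sketch of why the corpus has to be rebuilt after the dictionary is filtered. This is plain Python, not gensim: `build_vocab`, `doc2bow` and `filter_vocab` are hypothetical stand-ins written for this note, and the `no_below` / `keep_n` names are only borrowed from gensim's `Dictionary.filter_extremes`. The toy documents are made up. The point is that capping the vocabulary reassigns token ids, so bag-of-words vectors built with the old ids no longer line up.

```python
from collections import Counter

def build_vocab(docs):
    """Map each token to an integer id, in order of first appearance."""
    token2id = {}
    for doc in docs:
        for tok in doc:
            token2id.setdefault(tok, len(token2id))
    return token2id

def doc2bow(doc, token2id):
    """Turn a tokenized doc into sorted (token_id, count) pairs; unknown tokens are dropped."""
    counts = Counter(token2id[tok] for tok in doc if tok in token2id)
    return sorted(counts.items())

def filter_vocab(docs, token2id, no_below=2, keep_n=3):
    """Keep tokens seen in at least `no_below` docs, cap at `keep_n`, reassign ids from 0."""
    doc_freq = Counter(tok for doc in docs for tok in set(doc))
    kept = [t for t in token2id if doc_freq[t] >= no_below]
    # Highest document frequency first; ties broken by the original id.
    kept.sort(key=lambda t: (-doc_freq[t], token2id[t]))
    kept = kept[:keep_n]
    # Ids are compacted, so surviving tokens may get different ids than before.
    return {tok: new_id for new_id, tok in enumerate(sorted(kept, key=token2id.get))}

# Made-up toy data standing in for the real messages.
docs = [
    ["survey", "human", "computer"],
    ["computer", "user", "interface"],
    ["user", "interface", "system"],
    ["system", "human", "system"],
]

token2id = build_vocab(docs)
old_corpus = [doc2bow(d, token2id) for d in docs]   # built with the original ids

small_vocab = filter_vocab(docs, token2id, no_below=2, keep_n=3)
new_corpus = [doc2bow(d, small_vocab) for d in docs]  # rebuilt against the filtered vocab

print(old_corpus[3])  # ids from the unfiltered vocabulary
print(new_corpus[3])  # same document, ids from the filtered vocabulary
```

Capping the vocabulary with something like `keep_n` is also what bounds the number of features, since the width of anything sized by vocabulary grows with the number of ids that survive.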