@cscorley
Last active September 10, 2015 18:59

The general algorithm goes like so:

for chunk in corpus:
  e-step
  m-step
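A minimal runnable sketch of that chunked loop, assuming nothing about gensim's actual API: `e_step` and `m_step` here are toy stand-ins that just fit a running mean, purely to show the control flow.

```python
# Toy sketch of the generic chunked EM loop: one e-step and one m-step
# per chunk. e_step/m_step are hypothetical stand-ins, NOT gensim's API;
# they fit a running mean so the loop is runnable end to end.

def e_step(chunk):
    # Collect sufficient statistics for this chunk.
    return {"sum": sum(chunk), "n": len(chunk)}

def m_step(state, stats):
    # Fold the chunk's statistics into the model state.
    total = state["mean"] * state["count"] + stats["sum"]
    state["count"] += stats["n"]
    state["mean"] = total / state["count"]

def run_em(corpus, chunksize=2):
    state = {"mean": 0.0, "count": 0}
    for start in range(0, len(corpus), chunksize):
        chunk = corpus[start:start + chunksize]
        stats = e_step(chunk)
        m_step(state, stats)
    return state
```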

gensim hacks in multiple passes:

for pass_ in passes:
  for chunk in corpus:
    e-step
    m-step

What we've been doing (only works in batch mode):

for pass_ in passes:
  for bound_iter in iters:
    for chunk in corpus:
      e-step
      m-step
  
    break if done

For online updates, would it make more sense to do:

for chunk in corpus:
  for bound_iter in iters:
    e-step
    m-step
  
    break if done

This would give us something that behaves the same as batch mode (via chunksize=len(corpus) and bound_iters > 1), but also something that works in online mode (via chunksize < len(corpus) and bound_iters > 1).
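A runnable sketch of this proposed structure, with the convergence iterations on the inside of the chunk loop. Everything here is a toy placeholder, not gensim code: the "bound" is just the change in a running mean, so the early break is observable. With chunksize=len(corpus) the whole corpus is one chunk, which recovers the batch behavior described above.

```python
# Sketch of the proposed loop: for each chunk, repeat e-step/m-step
# until a convergence criterion is met (or max_bound_iters is hit).
# The model and its "bound" are toy stand-ins, not gensim's.

def proposed_update(corpus, chunksize, max_bound_iters=10, tol=1e-6):
    mean, count = 0.0, 0
    for start in range(0, len(corpus), chunksize):
        chunk = corpus[start:start + chunksize]
        for _ in range(max_bound_iters):
            # e-step + m-step on this chunk only
            new_mean = (mean * count + sum(chunk)) / (count + len(chunk))
            if abs(new_mean - mean) < tol:
                mean = new_mean
                break              # "break if done": bound converged
            mean = new_mean
        count += len(chunk)
    return mean
```

Setting chunksize=len(corpus) makes the inner loop converge on the full corpus in one outer step, which is the batch case.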

@hazelybell

Well, the main difference between bound_iter and passes is that bound_iter has a convergence criterion, it's not just "do this 10 times"; and secondly, it doesn't update the decay.

@cscorley
Author

That's fine; I renamed things in the example so it's clearer what I mean (before the edit, passes == iters, so for the PR users don't get deprecation problems). Not updating rho/gamma/etc. until after a chunk is finished is exactly what this would do.
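A small sketch of that point: the decay weight rho is computed once per chunk and held fixed for every bound iteration on that chunk. The rho schedule (tau0 + t)**-kappa matches the usual online-LDA form; the parameter update itself is a toy blend, and `online_updates` and `lam` are hypothetical names.

```python
# Toy illustration: rho is updated per chunk, not per bound iteration.
# rho(t) = (tau0 + t)**(-kappa) is the standard online decay schedule;
# the "model" is a single scalar lam, purely for demonstration.

def rho(t, tau0=1.0, kappa=0.7):
    return (tau0 + t) ** (-kappa)

def online_updates(chunks, bound_iters=3):
    lam = 0.0                      # stand-in for the topic parameters
    for t, chunk in enumerate(chunks):
        step = rho(t)              # held fixed for this chunk's iterations
        for _ in range(bound_iters):
            chunk_estimate = sum(chunk) / len(chunk)
            lam = (1 - step) * lam + step * chunk_estimate
    return lam
```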

@hazelybell
Copy link

I mean, the only reason I can see to use example 4 is if the updates are "actually" online. If chunks are being used only due to memory constraints and we're really running in batch mode, then having chunks on the inside makes more sense and follows the algorithm in the paper more closely.
