The general algorithm goes like this:

    for chunk in corpus:
        e-step
        m-step
gensim hacks in multiple passes:

    for pass_ in passes:
        for chunk in corpus:
            e-step
            m-step
What we've been doing (only works for batch):

    for pass_ in passes:
        for bound_iter in iters:
            for chunk in corpus:
                e-step
                m-step
            break if done
For online updates, would it make more sense to do:

    for chunk in corpus:
        for bound_iter in iters:
            e-step
            m-step
            break if done
This would give us something that behaves the same as batch mode (via chunksize=len(corpus) and bound_iters > 1), but also something that works in online mode (via chunksize < len(corpus) and bound_iters > 1).
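A minimal runnable sketch of that fourth structure (chunks on the outside, bound iterations on the inside). The names here are all invented for illustration, not gensim's API: `e_step`/`m_step` are stood in for by a toy damped running-mean update, and the "bound" is just a negative squared error, so the convergence check has something to act on.

```python
def iter_chunks(corpus, chunksize):
    """Yield the corpus in consecutive slices of at most `chunksize` docs."""
    for start in range(0, len(corpus), chunksize):
        yield corpus[start:start + chunksize]

def update(corpus, chunksize, bound_iters, rho=0.5, tol=1e-9):
    """Toy stand-in for the proposed loop: chunk loop outside,
    bound-iteration loop (with early break) inside."""
    theta = 0.0  # the "model": a single scalar parameter
    for chunk in iter_chunks(corpus, chunksize):
        prev_bound = float("-inf")
        for _ in range(bound_iters):
            target = sum(chunk) / len(chunk)   # "e-step": chunk statistics
            theta += rho * (target - theta)    # "m-step": damped update
            bound = -(target - theta) ** 2     # toy lower bound on fit
            if bound - prev_bound < tol:       # break if done
                break
            prev_bound = bound
    return theta

# batch mode: one chunk covering the whole corpus, many bound iterations
batch = update([1.0, 2.0, 3.0, 4.0], chunksize=4, bound_iters=100)

# online mode: small chunks, each one nudging the parameter as it arrives
online = update([1.0, 2.0, 3.0, 4.0], chunksize=2, bound_iters=1)
```

With chunksize=len(corpus) the inner loop converges to the corpus-wide fit (the batch behaviour), while with a smaller chunksize each chunk updates the parameter incrementally, which is the online behaviour the structure above is meant to support.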
The only reason I can see to use the fourth structure is if the updates are "actually" online. If chunks are being used just due to memory constraints, and we're really running in batch mode, then having the chunk loop on the inside makes more sense and follows the algorithm in the paper more closely.