The general algorithm goes like this:

    for chunk in corpus:
        e-step
        m-step
gensim hacks in multiple passes:

    for pass_ in passes:
        for chunk in corpus:
            e-step
            m-step
What we've been doing (only works for batch):

    for pass_ in passes:
        for bound_iter in iters:
            for chunk in corpus:
                e-step
                m-step
            break if done
For online updates, would it make more sense to do:

    for chunk in corpus:
        for bound_iter in iters:
            e-step
            m-step
            break if done
This would give us something that behaves the same as batch mode (via chunksize=len(corpus) and bound_iters > 1), but also something that works in online mode (via chunksize < len(corpus) and bound_iters > 1).
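A minimal runnable sketch of that fourth structure (chunks on the outside, bound iterations on the inside). The names here are all invented for illustration, not gensim's API: `e_step`/`m_step` are stood in for by a toy damped running-mean update, and the "bound" is just a negative squared error, so the convergence check has something to act on.

```python
def iter_chunks(corpus, chunksize):
    """Yield the corpus in consecutive slices of at most `chunksize` docs."""
    for start in range(0, len(corpus), chunksize):
        yield corpus[start:start + chunksize]

def update(corpus, chunksize, bound_iters, rho=0.5, tol=1e-9):
    """Toy stand-in for the proposed loop: chunk loop outside,
    bound-iteration loop (with early break) inside."""
    theta = 0.0  # the "model": a single scalar parameter
    for chunk in iter_chunks(corpus, chunksize):
        prev_bound = float("-inf")
        for _ in range(bound_iters):
            target = sum(chunk) / len(chunk)   # "e-step": chunk statistics
            theta += rho * (target - theta)    # "m-step": damped update
            bound = -(target - theta) ** 2     # toy lower bound on fit
            if bound - prev_bound < tol:       # break if done
                break
            prev_bound = bound
    return theta

# batch mode: one chunk covering the whole corpus, many bound iterations
batch = update([1.0, 2.0, 3.0, 4.0], chunksize=4, bound_iters=100)

# online mode: small chunks, each one nudging the parameter as it arrives
online = update([1.0, 2.0, 3.0, 4.0], chunksize=2, bound_iters=1)
```

With chunksize=len(corpus) the inner loop converges to the corpus-wide fit (the batch behaviour), while with a smaller chunksize each chunk updates the parameter incrementally, which is the online behaviour the structure above is meant to support.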
The only reason I can see to use the fourth structure is if the updates are "actually" online. If chunks are being used just due to memory constraints, and we're really running in batch mode, then having the chunk loop on the inside makes more sense and follows the algorithm in the paper more closely.