the general algo goes like so:
    for chunk in corpus:
        e-step
        m-step
gensim hacks in multiple passes:
    for pass_ in passes:
        for chunk in corpus:
            e-step
            m-step
what we've been doing (only works for batch):
    for pass_ in passes:
        for bound_iter in iters:
            for chunk in corpus:
                e-step
                m-step
            break if done
for online updates, would it make more sense to:
    for chunk in corpus:
        for bound_iter in iters:
            e-step
            m-step
            break if done
this would give us something that works the same for batch (via chunksize=len(corpus) and bound_iters > 1) but also something that works for online mode (via chunksize<len(corpus) and bound_iters > 1).
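to make the proposal concrete, here is a minimal runnable sketch of the chunk-outer / bound_iter-inner loop. it is not gensim code: `run_proposed_loop`, `chunks`, and the `e_step` / `m_step` / `is_done` callables are hypothetical stand-ins, and the returned trace only records which (chunk, iteration) pairs were visited.

```python
from typing import Callable, Iterator, List, Sequence, Tuple


def chunks(corpus: Sequence, chunksize: int) -> Iterator[Sequence]:
    """Yield consecutive slices of `corpus`, each at most `chunksize` long."""
    for start in range(0, len(corpus), chunksize):
        yield corpus[start:start + chunksize]


def run_proposed_loop(
    corpus: Sequence,
    chunksize: int,
    bound_iters: int,
    e_step: Callable,   # hypothetical: chunk -> sufficient statistics
    m_step: Callable,   # hypothetical: sufficient statistics -> None
    is_done: Callable,  # hypothetical: () -> bool, e.g. a bound-convergence check
) -> List[Tuple[int, int]]:
    """Iterate up to `bound_iters` times on each chunk before moving on."""
    trace = []
    for chunk_no, chunk in enumerate(chunks(corpus, chunksize)):
        for iter_no in range(bound_iters):
            stats = e_step(chunk)
            m_step(stats)
            trace.append((chunk_no, iter_no))
            if is_done():
                break  # stop iterating on this chunk only
    return trace


# toy stand-ins so the sketch runs end to end
corpus = list(range(6))
never_done = lambda: False

# batch: a single chunk covering the whole corpus
batch_trace = run_proposed_loop(corpus, len(corpus), 3, len, lambda s: None, never_done)

# online: several smaller chunks, same inner loop
online_trace = run_proposed_loop(corpus, 2, 3, len, lambda s: None, never_done)
```

with chunksize=len(corpus) the outer loop runs once and the inner loop does all the work, so the batch case falls out of the same code path as the online one.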
that's fine, i renamed it in the example so it's clearer what i mean (e.g., before the edit passes == iters, so for the PR users don't get deprecation problems). what this would do is hold off on updating rho/gamma/etc until after a chunk is finished.
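a rough sketch of what "not updating rho until after a chunk is finished" could look like, for rho only. it assumes a Hoffman-style step size rho_t = (offset + t) ** -decay; `rho`, `rho_schedule`, and the default values are illustrative assumptions, not the actual gensim implementation.

```python
from typing import List


def rho(t: int, offset: float = 1.0, decay: float = 0.5) -> float:
    """Assumed step-size schedule: rho_t = (offset + t) ** -decay."""
    return (offset + t) ** -decay


def rho_schedule(num_chunks: int, bound_iters: int) -> List[float]:
    """Return the rho used at every (chunk, bound_iter) step."""
    used = []
    t = 0  # number of chunks fully processed so far
    for _chunk_no in range(num_chunks):
        step = rho(t)  # held fixed for every inner iteration on this chunk
        for _iter_no in range(bound_iters):
            used.append(step)
        t += 1  # advance only once the chunk is finished
    return used


schedule = rho_schedule(num_chunks=2, bound_iters=3)
```

here the inner bound_iter loop never changes the step size; it only moves when the outer loop advances to the next chunk.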