All progress on multistream training, in one place.
- [ORIG-W2V] Scaling of Mikolov's original Word2Vec
- [SINGLESTREAM] Singlestream benchmark numbers
- [FIRST-MULTISTREAM] First multistream training attempt (numbers, code). Use several input streams and create a `threading.Thread(target=_worker_loop, ...)` for each one. This does not outperform the singlestream case, because reading the input streams is more CPU-bound than IO-bound, so the GIL prevents the job queue from being filled in parallel.
- [MULTISTREAM-INMEMORY] Training with the job queue prefilled in memory, to determine the upper bound on performance (numbers, code). Even with no job producers, scaling does not match Mikolov's version, so matching it will be difficult. Nevertheless, we see a speedup of up to +300k words/sec, so there is room to optimize.
- [MULTISTREAM-NO-PRODUCERS] No job producers; each worker reads from its own input_stream (numbers). Still no boost, because each `_worker_loop` contains Python code for preparing data, and that code can't run in parallel because of the GIL. But what if we move these lines down to the C level? See the next bullet point.
- [MULTISTREAM-CYTHON-NO-PRODUCERS] No job producers; each worker prepares batches from its own input_stream in a Cython `nogil` function (numbers, code). This looks better than preparing batches in Python, but is still not satisfactory.
- [MULTISTREAM-CYTHON-WORKERLOOP] An optimized version of the previous one. Here the whole `_worker_loop` is a Cython `nogil` function, which finally yielded fully linear performance scaling! It is also 2x faster than Mikolov's original word2vec. See the numbers.
- [MULTISTREAM-CYTHON-WORKERLOOP-PYTHON-STREAMS] The previous version, but using raw Python `LineSentence` streams instead of `CythonLineSentence`. Preparing batches requires the GIL, but compared to the original develop version, all C structures needed to perform a w2v update on a batch are initialized only once (in develop, they are re-initialized for every batch). This version runs faster than develop, but slower than the previous, fully GIL-free version. See the numbers.
- [MULTISTREAM-CYTHON-WORKERLOOP-FINAL] Final benchmarks with the `corpus_file` argument. CPU load grows linearly, and processing speed scales much better than with the standard approach. See the numbers.
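
For context, the thread-per-stream setup described under [FIRST-MULTISTREAM] can be sketched roughly as below. This is a minimal, standard-library-only illustration, not the code from this PR: the names `produce_jobs` and `run_multistream`, the `batch_size` parameter, and the toy in-memory streams are all hypothetical. The point is that the per-line work (`line.split()`) is ordinary Python, so the producer threads take turns holding the GIL instead of running in parallel.

```python
import queue
import threading


def produce_jobs(stream, job_queue, batch_size=2):
    """Hypothetical producer: tokenize lines in Python and enqueue batches."""
    batch = []
    for line in stream:
        # Pure-Python, CPU-bound work: executed while holding the GIL.
        batch.append(line.split())
        if len(batch) == batch_size:
            job_queue.put(batch)
            batch = []
    if batch:
        job_queue.put(batch)  # flush the last, possibly smaller, batch


def run_multistream(streams):
    """Start one producer thread per stream, then drain the shared queue."""
    job_queue = queue.Queue()  # unbounded, so producers never block on put()
    threads = [
        threading.Thread(target=produce_jobs, args=(stream, job_queue))
        for stream in streams
    ]
    for t in threads:
        t.start()
    for t in threads:
        t.join()

    batches = []
    while not job_queue.empty():
        batches.append(job_queue.get())
    return batches


# Toy stand-ins for the real input streams (assumption for illustration).
streams = [["a b c", "d e"], ["f g", "h i j", "k"]]
batches = run_multistream(streams)
total_words = sum(len(sentence) for batch in batches for sentence in batch)
```

Adding more streams here adds more threads, but not more tokenizing throughput, which is consistent with the motivation given above for moving batch preparation into Cython `nogil` code.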