Links to all benchmarks during GSoC 2018

An all-in-one progress report on multistream training.

  • [ORIG-W2V] Scaling of Mikolov's original word2vec
  • [SINGLESTREAM] Singlestream benchmark numbers
  • [FIRST-MULTISTREAM] First multistream training attempt (numbers, code). Use several streams and create a threading.Thread(_worker_loop, ...) for each one. This doesn't outperform the singlestream case, because the input_streams are more CPU-bound than IO-bound, so the GIL doesn't allow the job queue to be filled in a fully parallel fashion (see the threading sketch after this list).
  • [MULTISTREAM-INMEMORY] Training with the job queue prefilled in memory, in order to determine a performance upper bound (numbers, code). Even with no job producers, scaling doesn't match Mikolov's version, so matching it will be difficult. Nevertheless, we see a speedup of up to +300k words/sec, so there is room to optimize.
  • [MULTISTREAM-NO-PRODUCERS] No job producers; each worker reads from its own input_stream (numbers). Still no boost, because the _worker_loops contain Python code for preparing data, and those lines can't be parallelized because of the GIL. But what if we move them down to the C level? See the next bullet point.
  • [MULTISTREAM-CYTHON-NO-PRODUCERS] No job producers; each worker prepares batches in a Cython nogil function from its own input_stream (numbers, code). Better than preparing batches in Python, but still not satisfactory.
  • [MULTISTREAM-CYTHON-WORKERLOOP] Optimized version of the previous one. Here the whole _worker_loop is a Cython nogil function, which finally yields fully linear scaling! It is also 2x faster than Mikolov's original word2vec. See the numbers.
  • [MULTISTREAM-CYTHON-WORKERLOOP-PYTHON-STREAMS] Same as the previous version, but using raw Python LineSentence streams instead of CythonLineSentence. This requires preparing batches with the GIL held, but compared to the original develop version, all C structures needed to perform a w2v update on a batch are initialized only once (in develop they are re-initialized for every batch). This version runs faster than develop, but slower than the fully GIL-free previous version. See numbers.
  • [MULTISTREAM-CYTHON-WORKERLOOP-FINAL] Final benchmarks with the corpus_file argument. CPU load grows linearly with the number of workers, and processing speed scales much better than the standard approach (see the usage sketch below). Numbers
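
A minimal sketch of the FIRST-MULTISTREAM producer pattern described above, with hypothetical names (`_worker_loop`, `produce_jobs`, the batch size are illustrative, not the benchmark code). It shows why one Python thread per stream doesn't help: batch preparation is pure Python and CPU-bound, so the GIL serializes the producer threads.

```python
# Hypothetical sketch of the one-thread-per-stream producer pattern.
# Batch preparation below is pure Python and CPU-bound, so the GIL
# serializes the producers and the job queue is never filled in parallel.
import threading
import queue

def _worker_loop(input_stream, job_queue, batch_size=10000):
    """Read sentences from a single stream and push batches onto a shared queue."""
    batch = []
    for sentence in input_stream:      # iterating and batching holds the GIL
        batch.append(sentence)
        if len(batch) >= batch_size:
            job_queue.put(batch)
            batch = []
    if batch:
        job_queue.put(batch)

def produce_jobs(input_streams, queue_size=64):
    """Start one producer thread per input stream, all feeding one job queue."""
    job_queue = queue.Queue(maxsize=queue_size)
    threads = [
        threading.Thread(target=_worker_loop, args=(stream, job_queue), daemon=True)
        for stream in input_streams
    ]
    for t in threads:
        t.start()
    return job_queue, threads
```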
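For the final MULTISTREAM-CYTHON-WORKERLOOP-FINAL result, a rough usage sketch of the corpus_file training path in gensim (exact keyword arguments depend on the gensim version; this is not the benchmark code itself). Each worker reads its own part of the file without a Python-level job queue, so CPU load and throughput scale with `workers`.

```python
# Rough usage sketch of corpus_file-based training in gensim.
# corpus.txt is plain text in LineSentence format: one sentence per line,
# tokens separated by whitespace.
from gensim.models import Word2Vec

model = Word2Vec(
    corpus_file='corpus.txt',  # read directly by the Cython worker loops
    workers=8,                 # throughput scales near-linearly with workers
)
# The trained vectors are then queried as usual, e.g. model.wv.most_similar(...)
```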