Radim Řehůřek piskvorky

#!/usr/bin/env python3
# -*- coding: utf-8 -*-
Script to calculate possible final standings for "combined climbing" (Olympics 2020 format)
from incomplete in-progress results:
A. Gines Lopez: 1-2
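A minimal sketch of the idea, not the script itself: in the Olympics 2020 combined format, a climber's score is the product of their ranks in speed, bouldering and lead, and the lowest product wins. Assuming two disciplines are finished and one remains, the hypothetical helper `possible_places` below enumerates every ordering of the remaining discipline and records which final places each climber can still reach. The climber names, the input shape and the alphabetical tie-break are assumptions made for illustration; the official tie-break rules are not modelled.

```python
from itertools import permutations


def possible_places(partial_scores):
    """Map each climber to the set of final places they can still reach.

    `partial_scores` maps a climber's name to the product of their ranks
    in the disciplines finished so far (hypothetical input format).
    """
    climbers = sorted(partial_scores)
    places = {climber: set() for climber in climbers}
    # Try every possible finishing order of the one remaining discipline.
    for order in permutations(climbers):
        final = {
            climber: partial_scores[climber] * rank
            for rank, climber in enumerate(order, start=1)
        }
        # Lowest product wins; ties are broken by name (an assumption).
        standing = sorted(climbers, key=lambda c: (final[c], c))
        for place, climber in enumerate(standing, start=1):
            places[climber].add(place)
    return places


if __name__ == "__main__":
    for name, reachable in possible_places({"A": 2, "B": 6, "C": 3}).items():
        print(name, sorted(reachable))
```

Brute force is fine here because a final has only a handful of climbers, so the number of permutations stays small.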
#!/usr/bin/env python
# -*- coding: utf-8 -*-
# Copyright (C) 2020 Radim Rehurek <>
Help script (template) for benchmarking. Run with:
/usr/bin/time --format "%E elapsed\n%Mk peak RAM" python ~/gensim-data/text9/text9.txt
#!/usr/bin/env python
# -*- coding: utf-8 -*-
# Copyright (C) 2019 RARE Technologies s.r.o.
# Authors: Radim Rehurek <>
# MIT License
Find private/shared memory of one or more processes, identified by their process ids (PIDs).
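One way to obtain such numbers on Linux, sketched here under the assumption that `/proc/<pid>/smaps` is available and readable: sum the `Private_Clean`/`Private_Dirty` and `Shared_Clean`/`Shared_Dirty` fields over all memory mappings of the process. The function names `sum_smaps` and `process_memory` are hypothetical and need not match the original script.

```python
import re
from collections import Counter

# Matches lines such as "Private_Dirty:        12 kB" in /proc/<pid>/smaps.
SMAPS_FIELD = re.compile(
    r"^(Private|Shared)_(?:Clean|Dirty):\s+(\d+) kB", re.MULTILINE
)


def sum_smaps(smaps_text):
    """Sum private and shared kilobytes over all mappings in an smaps dump."""
    totals = Counter()
    for kind, kilobytes in SMAPS_FIELD.findall(smaps_text):
        totals[kind.lower()] += int(kilobytes)
    return {"private_kb": totals["private"], "shared_kb": totals["shared"]}


def process_memory(pid):
    """Read /proc/<pid>/smaps (Linux only) and return its memory totals."""
    with open(f"/proc/{pid}/smaps") as fin:
        return sum_smaps(fin.read())
```

Keeping the parsing separate from the file access makes the summing logic easy to check on a saved smaps dump, without a live process.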
#include <Python.h>
/*
 * Return the sum of word lengths of all words (unicode strings) in the list `sentence`.
 * Return -1 if `sentence` isn't a list, and -2 if any of its elements is not a unicode string.
 * `sentence` and its elements are const = never changed inside this function, and guaranteed to live
 * throughout its execution, so we don't bother updating any reference counts.
 */
static long long process_const_sentence(PyObject *sentence) {
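The preview stops at the function's opening brace. As a reference for the documented contract only, here is a pure-Python sketch of the same behaviour, assuming Python 3 where `str` is the unicode string type. It is not a translation of the C body, which is not shown here.

```python
def process_const_sentence(sentence):
    """Return the total length of all words in `sentence`.

    Mirrors the error codes described in the C comment:
    -1 if `sentence` is not a list, -2 if any element is not a string.
    """
    if not isinstance(sentence, list):
        return -1
    total = 0
    for word in sentence:
        if not isinstance(word, str):
            return -2
        total += len(word)
    return total
```

A reference version like this is handy for checking the C extension's results on the same inputs.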
looking for now-outdated files... none found
pickling environment... done
checking consistency... done
preparing documents... done
writing output... [ 0%] about
writing output... [ 1%] apiref
writing output... [ 2%] changes_080
writing output... [ 3%] corpora/bleicorpus
writing output... [ 4%] corpora/corpora
writing output... [ 5%] corpora/csvcorpus
Warning, treated as error:
/Volumes/work/workspace/gensim/trunk/docs/src/viz/poincare.rst:4: WARNING: autodoc: failed to import module u'gensim.viz.poincare'; the following exception was raised:
Traceback (most recent call last):
File "/Users/kofola3/workspace/vew/gensim/lib/python2.7/site-packages/sphinx/ext/", line 551, in import_object
File "/Volumes/work/workspace/gensim/trunk/gensim/viz/", line 18, in <module>
import plotly.graph_objs as go
ImportError: No module named plotly.graph_objs
make: *** [html] Error 1
"pii_type": "passport",
"severity": "high",
"file_format": ["pdf", "scanned", "archive"],
"archive_name": "Visas for reInvent 2017.tar.gz",
"file_name": "maria_p_scanned.pdf",
"ingest_source": "s3://laptop_backups/maria/2017/11/Documents/",
"pii_instances": [
{"name": "Maria Pereira"},
{"date_of_birth": "1984/07/10"},

CLI script for extracting plain text out of a raw Wikipedia dump. The dump is an xml.bz2 file provided by MediaWiki, with a name like wiki--pages-articles.xml.bz2 or wiki-latest-pages-articles.xml.bz2 (e.g. 14 GB:

It streams through all the XML articles using multiple cores (#cores - 1, by default), decompressing on the fly and extracting plain text article sections from each article.

For each extracted article, it prints the article's title, section names and plain-text section contents in JSON Lines format (one JSON object per line).
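Output in this shape can be consumed one line at a time. The sketch below assumes each line is a JSON object with the keys `title`, `section_titles` and `section_texts`; those key names are an assumption for illustration and should be checked against the script's real output. The helper name `iter_articles` is hypothetical.

```python
import json


def iter_articles(lines):
    """Yield (title, [(section_title, section_text), ...]) per JSON line."""
    for line in lines:
        line = line.strip()
        if not line:
            continue  # tolerate blank lines between records
        article = json.loads(line)
        # Key names are assumed, not confirmed by the text above.
        sections = list(zip(article["section_titles"], article["section_texts"]))
        yield article["title"], sections
```

Because `iter_articles` accepts any iterable of strings, it works equally on an open file handle or on a stream read from standard input, so the whole dump never needs to sit in memory.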



🌟 New features:

  • Massive optimizations to LSI model training (@isamaru, #1620 & #1622)
    • LSI model allows use of single precision (float32), to consume 40% less memory while being 40% faster.
    • LSI model can now also accept CSC matrix as input, for further memory and speed boost.
    • Overall, if your entire corpus fits in RAM: 3x faster LSI training (SVD) in 4x less memory!
      # just an example; the corpus stream is up to you
      streaming_corpus = gensim.corpora.MmCorpus("")  
ERROR: test_iteritems (bounter.tests.hashtable.test_htc_iteration.HashTableIterationTest)
Traceback (most recent call last):
File "/Volumes/work/workspace/bounter/bounter/tests/hashtable/", line 43, in test_iteritems
self.assertEqual(set(, self.pairs)
AttributeError: 'bounter_htc.HT_Basic' object has no attribute 'iteritems'
ERROR: test_iterkeys (bounter.tests.hashtable.test_htc_iteration.HashTableIterationTest)