🌟 New features:
- Massive optimizations to LSI model training (@isamaru, #1620 & #1622)
  - The LSI model now supports single-precision (float32) training, consuming 40% less memory while being 40% faster.
  - The LSI model can now also accept a CSC matrix as input, for a further memory and speed boost.
  - Overall, if your entire corpus fits in RAM: 3x faster LSI training (SVD) in 4x less memory!
  ```python
  import numpy as np

  import gensim
  from gensim.models import LsiModel

  # just an example; the corpus stream is up to you
  streaming_corpus = gensim.corpora.MmCorpus("my_tfidf_corpus.mm.gz")

  # convert your corpus to a CSC sparse matrix (assumes the entire corpus fits in RAM)
  in_memory_csc_matrix = gensim.matutils.corpus2csc(streaming_corpus, dtype=np.float32)

  # then pass the CSC matrix to LsiModel directly
  model = LsiModel(corpus=in_memory_csc_matrix, num_topics=500, dtype=np.float32)
  ```
  - Even if you stick with streaming corpora (because your training dataset is too large to fit in RAM), you should still see significantly faster processing and a lower memory footprint. In our experiments with a very large LSI model, peak RAM dropped from 29 GB to 19 GB, and training time from 38 minutes to 26 minutes:
  ```python
  model = LsiModel(corpus=streaming_corpus, num_topics=500, dtype=np.float32)
  ```
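  Whichever input format you train on, the resulting model is used exactly as before. A minimal sketch, reusing `model` and `streaming_corpus` from the snippets above, that folds the corpus into the trained LSI space:

  ```python
  # project the streamed bag-of-words corpus into the latent space of the trained model
  lsi_corpus = model[streaming_corpus]
  for doc_topics in lsi_corpus:
      print(doc_topics)  # list of (topic_id, weight) pairs for one document
  ```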
- Add common terms to Phrases (@alexgarel, #1568)
  - `Phrases` can now take a list of "common terms" (typically stopwords, but any words you choose): these words no longer break up a phrase, so a bigram can be detected across them. For example, "car with driver" becomes the single phrase `car_with_driver` instead of being split apart by the stopword "with".
  ```python
  phr_old = Phrases(corpus)
  phr_new = Phrases(corpus, common_terms=stopwords.words('en'))

  print(phr_old[["we", "provide", "car", "with", "driver"]])  # ["we", "provide", "car", "with", "driver"]
  print(phr_new[["we", "provide", "car", "with", "driver"]])  # ["we", "provide", "car_with_driver"]
  ```
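  The detector is not limited to single sentences: as in previous gensim releases, you can also stream an entire corpus of tokenized sentences through it. A minimal sketch, reusing the `corpus` iterable from the example above:

  ```python
  # lazily transform every sentence, merging any detected phrases on the fly
  for sentence in phr_new[corpus]:
      print(sentence)
  ```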
- New `segment_wiki.py` script (@menshikh-iv, #1483)
  - CLI script for processing a raw Wikipedia dump (the XML.bz2 format published at https://dumps.wikimedia.org/enwiki/) and converting it to a plain-text format:
  ```bash
  python -m gensim.scripts.segment_wiki -f enwiki-20171001-pages-articles.xml.bz2 -o enwiki-20171001-pages-articles.json.gz
  ```
  The output format is one article per line, serialized into JSON:
  ```python
  import json

  from smart_open import smart_open

  # read the file we just created, one JSON-serialized article per line
  for line in smart_open('enwiki-20171001-pages-articles.json.gz'):
      article = json.loads(line)
      print("Article title: %s" % article['title'])
      for section_title, section_text in zip(article['section_titles'], article['section_texts']):
          print("Section title: %s" % section_title)
          print("Section text: %s" % section_text)
  ```
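  One possible next step, sketched here purely as an illustration (the `Dictionary` built below is an assumption about your downstream pipeline, not part of the script itself), is to tokenize the plain-text sections and accumulate a gensim vocabulary from them:

  ```python
  import json

  from gensim.corpora import Dictionary
  from gensim.utils import simple_preprocess
  from smart_open import smart_open

  # build a vocabulary incrementally, one article at a time (hypothetical downstream use)
  dictionary = Dictionary()
  for line in smart_open('enwiki-20171001-pages-articles.json.gz'):
      article = json.loads(line)
      dictionary.add_documents(simple_preprocess(text) for text in article['section_texts'])
  ```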