🌟 New features:

  • Massive optimizations to LSI model training (@isamaru, #1620 & #1622)

    • The LSI model now supports single precision (float32), consuming 40% less memory while training 40% faster.
    • The LSI model can also accept a sparse CSC matrix as input, for a further memory and speed boost.
    • Overall, if your entire corpus fits in RAM: 3x faster LSI training (SVD) in 4x less memory!
      import numpy as np
      import gensim
      from gensim.models import LsiModel

      # just an example; the corpus stream is up to you
      streaming_corpus = gensim.corpora.MmCorpus("my_tfidf_corpus.mm.gz")

      # convert your corpus to a CSC sparse matrix (assumes the entire corpus fits in RAM)
      in_memory_csc_matrix = gensim.matutils.corpus2csc(streaming_corpus, dtype=np.float32)

      # then pass the CSC matrix to LsiModel directly
      model = LsiModel(corpus=in_memory_csc_matrix, num_topics=500, dtype=np.float32)
    • Even if you continue to use streaming corpora (because your training dataset is too large to fit in RAM), you should see significantly faster processing times and a lower memory footprint. In our experiments with a very large LSI model, peak RAM dropped from 29 GB to 19 GB, and training time from 38 to 26 minutes:
      # train directly from the streaming corpus, as before; just faster and leaner now
      model = LsiModel(corpus=streaming_corpus, num_topics=500, dtype=np.float32)
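      Either way, using the trained model is unchanged. A minimal sketch of folding a new document into the LSI space (the bag-of-words document below is a made-up example):

      # a hypothetical new document, as a bag-of-words list of (token_id, weight) pairs
      new_bow = [(0, 1.0), (4, 2.0), (7, 1.0)]

      # project it into the LSI space; returns [(topic_id, weight), ...]
      print(model[new_bow])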
  • Add common terms to Phrases (@alexgarel, #1568)

    • Phrases now accepts a list of "common terms" (typically stopwords, though you can choose any words). These terms are allowed inside detected phrases instead of blocking them, so multi-word expressions that span a stopword, such as "car with driver", can still be joined into a single token:
    from gensim.models import Phrases
    from nltk.corpus import stopwords

    phr_old = Phrases(corpus)
    phr_new = Phrases(corpus, common_terms=stopwords.words('english'))

    print(phr_old[["we", "provide", "car", "with", "driver"]])  # ["we", "provide", "car", "with", "driver"]
    print(phr_new[["we", "provide", "car", "with", "driver"]])  # ["we", "provide", "car_with_driver"]
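    If you apply the detected phrases to many documents, the trained model can, as in earlier releases, be frozen into a lighter and faster Phraser object. A minimal sketch, assuming the phr_new model from above (and assuming the exported Phraser keeps the common_terms behaviour):

    from gensim.models.phrases import Phraser

    # freeze the trained Phrases model into a small, fast, read-only phrase detector
    bigram = Phraser(phr_new)
    print(bigram[["we", "provide", "car", "with", "driver"]])  # ["we", "provide", "car_with_driver"]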
  • New segment_wiki.py script (@menshikh-iv, #1483)

    • A CLI script for processing a raw Wikipedia dump (XML.bz2 format, such as the dumps available from https://dumps.wikimedia.org/enwiki/) and converting it to plain text:
    python -m gensim.scripts.segment_wiki -f enwiki-20171001-pages-articles.xml.bz2 -o enwiki-20171001-pages-articles.json.gz

    The output format is one article per line, serialized into JSON:

    import json
    from smart_open import smart_open

    for line in smart_open('enwiki-20171001-pages-articles.json.gz'):  # read the file we just created
        article = json.loads(line)
        print("Article title: %s" % article['title'])
        for section_title, section_text in zip(article['section_titles'], article['section_texts']):
            print("Section title: %s" % section_title)
            print("Section text: %s" % section_text)