Last active November 4, 2017 20:52

🌟 New features:

  • Massive optimizations to LSI model training (@isamaru, #1620 & #1622)

    • LSI model now supports single precision (float32), consuming 40% less memory while running 40% faster.
    • LSI model can now also accept a CSC matrix as input, for a further memory and speed boost.
    • Overall, if your entire corpus fits in RAM: 3x faster LSI training (SVD) in 4x less memory!
      import numpy as np
      import gensim
      from gensim.models import LsiModel

      # just an example; the corpus stream is up to you
      streaming_corpus = gensim.corpora.MmCorpus("")
      # convert your corpus to a CSC sparse matrix (assumes the entire corpus fits in RAM)
      in_memory_csc_matrix = gensim.matutils.corpus2csc(streaming_corpus, dtype=np.float32)
      # then pass the CSC to LsiModel directly
      model = LsiModel(corpus=in_memory_csc_matrix, num_topics=500, dtype=np.float32)
    • Even if you continue to use streaming corpora (your training dataset is too large for RAM), you should see significantly faster processing times and a lower memory footprint. In our experiments with a very large LSI model, we saw a drop from 29 GB peak RAM and 38 minutes (before) to 19 GB peak RAM and 26 minutes (now):
      model = LsiModel(corpus=streaming_corpus, num_topics=500, dtype=np.float32)
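To make the CSC and float32 points above more concrete, here is a minimal pure-Python sketch of what a CSC layout looks like for a bag-of-words corpus (one column per document, as in `corpus2csc`). The helper `corpus_to_csc` and the toy corpus are hypothetical names made up for illustration; this is not gensim's implementation, only a picture of the three arrays involved and of how the precision affects the size of the weights array.

```python
from array import array


def corpus_to_csc(corpus, typecode="f"):
    """Pack bag-of-words documents into CSC arrays, one column per document.

    `typecode` is "f" for single precision (float32) or "d" for double (float64).
    """
    data = array(typecode)      # non-zero weights, stored column by column
    indices = array("i")        # row (term id) of each non-zero weight
    indptr = array("i", [0])    # column j occupies data[indptr[j]:indptr[j + 1]]
    for doc in corpus:
        for term_id, weight in sorted(doc):
            indices.append(term_id)
            data.append(weight)
        indptr.append(len(data))
    return data, indices, indptr


# Hypothetical toy corpus: each document is a list of (term_id, weight) pairs.
toy_corpus = [
    [(0, 1.0), (2, 2.0)],
    [(1, 0.5)],
    [(0, 3.0), (1, 1.0), (2, 1.0)],
]

data32, indices, indptr = corpus_to_csc(toy_corpus, typecode="f")
data64, _, _ = corpus_to_csc(toy_corpus, typecode="d")

bytes32 = len(data32) * data32.itemsize
bytes64 = len(data64) * data64.itemsize
print("weights take %d bytes in float32 vs %d bytes in float64" % (bytes32, bytes64))
```

In this sketch only the weights array shrinks with float32; the index arrays stay the same size, so the total saving for the matrix alone is less than half.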
  • Add common terms to Phrases (@alexgarel, #1568)

    • Phrases can now pass over a set of common terms (typically stopwords, but you can choose any words) when detecting phrases, so expressions joined by such connector words, like "car with driver", can be found as a single phrase:
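    from gensim.models import Phrases
    from nltk.corpus import stopwords  # assumption: `stopwords` is NLTK's stopword corpus

    phr_old = Phrases(corpus)
    # NLTK names its English stopword list 'english', not 'en'
    phr_new = Phrases(corpus, common_terms=stopwords.words('english'))
    print(phr_old[["we", "provide", "car", "with", "driver"]])  # ["we", "provide", "car", "with", "driver"]
    print(phr_new[["we", "provide", "car", "with", "driver"]])  # ["we", "provide", "car_with_driver"]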
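The idea behind common terms can be sketched in a few lines of plain Python. This is a toy illustration only, not how Phrases scores or learns phrases: the function `join_phrases` and the hand-written `known_pairs` set are hypothetical stand-ins for the statistics that Phrases collects from a corpus. The point is just that a run of common terms between two words no longer blocks them from being joined.

```python
def join_phrases(tokens, known_pairs, common_terms=frozenset(), delimiter="_"):
    """Join `first <common terms...> last` into one token when (first, last) is a known pair."""
    out = []
    i = 0
    while i < len(tokens):
        first = tokens[i]
        # Skip over any run of common terms that directly follows `first`.
        j = i + 1
        while j < len(tokens) and tokens[j] in common_terms:
            j += 1
        if (first not in common_terms and j < len(tokens)
                and (first, tokens[j]) in known_pairs):
            # Keep the skipped common terms inside the joined phrase.
            out.append(delimiter.join(tokens[i:j + 1]))
            i = j + 1
        else:
            out.append(first)
            i += 1
    return out


sentence = ["we", "provide", "car", "with", "driver"]
known_pairs = {("car", "driver")}  # hypothetical: pretend this pair was learned

without_common = join_phrases(sentence, known_pairs)
with_common = join_phrases(sentence, known_pairs, common_terms={"with"})
print(without_common)
print(with_common)
```

Without common terms, "with" sits between "car" and "driver" and the pair is never seen as adjacent; with "with" declared common, the three tokens are merged.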
  • New segment_wiki script (@menshikh-iv, #1483)

    • CLI script for processing a raw Wikipedia dump (XML.bz2 format, such as FIXME here) and converting it to a plain text format:
    python -m gensim.scripts.segment_wiki -f enwiki-20171001-pages-articles.xml.bz2 -o enwiki-20171001-pages-articles.json.gz

    The output format is one article per line, serialized into JSON:

    import json
    from smart_open import smart_open

    for line in smart_open('enwiki-20171001-pages-articles.json.gz'):  # read the file we just created
        article = json.loads(line)
        print("Article title: %s" % article['title'])
        for section_title, section_text in zip(article['section_titles'], article['section_texts']):
            print("Section title: %s" % section_title)
            print("Section text: %s" % section_text)
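If you want to try the one-article-per-line JSON layout without downloading a Wikipedia dump, the following sketch writes and reads a tiny file of the same shape using only the standard library. The article content, the file name `toy-articles.json.gz` and the helper `iter_sections` are made up for illustration; only the keys `title`, `section_titles` and `section_texts` come from the format described above.

```python
import gzip
import json
import os
import tempfile

# Hypothetical stand-in for the script's output: one JSON object per line.
articles = [
    {
        "title": "Example article",
        "section_titles": ["Introduction", "History"],
        "section_texts": ["Intro text.", "History text."],
    },
]

path = os.path.join(tempfile.mkdtemp(), "toy-articles.json.gz")
with gzip.open(path, "wt", encoding="utf-8") as fout:
    for article in articles:
        fout.write(json.dumps(article) + "\n")


def iter_sections(path):
    """Yield (article title, section title, section text), streaming line by line."""
    with gzip.open(path, "rt", encoding="utf-8") as fin:
        for line in fin:
            article = json.loads(line)
            for section_title, section_text in zip(article["section_titles"], article["section_texts"]):
                yield article["title"], section_title, section_text


sections = list(iter_sections(path))
for title, section_title, section_text in sections:
    print("%s / %s: %s" % (title, section_title, section_text))
```

Because each line is a complete JSON document, the file can be processed as a stream without ever loading the whole dump into memory.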