@piskvorky
Created July 8, 2014 17:27
$ python -m gensim.scripts.make_wiki ~/data/wiki/simplewiki-20140623-pages-articles.xml.bz2 simplewiki_en
2014-07-08 18:44:22,009 : INFO : running /Volumes/work/workspace/gensim/trunk/gensim/scripts/make_wiki.py /Users/kofola/data/wiki/simplewiki-20140623-pages-articles.xml.bz2 simplewiki_en
2014-07-08 18:44:22,162 : INFO : adding document #0 to Dictionary(0 unique tokens: [])
2014-07-08 18:44:48,429 : INFO : adding document #10000 to Dictionary(116699 unique tokens: [u'fawn', u'refreshable', u'idaira', u'clottey', u'gavar']...)
2014-07-08 18:45:05,198 : INFO : adding document #20000 to Dictionary(159070 unique tokens: [u'fawn', u'biennials', u'\u03c9\u0431\u0440\u0430\u0434\u043e\u0432\u0430\u043d\u043d\u0430\u0467', u'refreshable', u'grandniece']...)
2014-07-08 18:45:19,946 : INFO : adding document #30000 to Dictionary(198077 unique tokens: [u'biennials', u'idaira', u'clottey', u'gavar', u'experimeter']...)
2014-07-08 18:45:37,237 : INFO : adding document #40000 to Dictionary(232401 unique tokens: [u'biennials', u'sowela', u'mdbg', u'clottes', u'idaira']...)
2014-07-08 18:45:53,758 : INFO : adding document #50000 to Dictionary(261720 unique tokens: [u'biennials', u'sowela', u'mdbg', u'clottes', u'vang']...)
2014-07-08 18:46:12,792 : INFO : adding document #60000 to Dictionary(288641 unique tokens: [u'biennials', u'sowela', u'mdbg', u'clottes', u'klatki']...)
2014-07-08 18:46:33,571 : INFO : adding document #70000 to Dictionary(326692 unique tokens: [u'biennials', u'sowela', u'mdbg', u'clottes', u'klatki']...)
2014-07-08 18:46:51,268 : INFO : adding document #80000 to Dictionary(358238 unique tokens: [u'biennials', u'sowela', u'mdbg', u'biysk', u'sermersheim']...)
2014-07-08 18:47:08,034 : INFO : adding document #90000 to Dictionary(391235 unique tokens: [u'biennials', u'sowela', u'mdbg', u'biysk', u'sermersheim']...)
2014-07-08 18:47:19,986 : INFO : adding document #100000 to Dictionary(403563 unique tokens: [u'biennials', u'sowela', u'mdbg', u'biysk', u'sermersheim']...)
2014-07-08 18:47:32,656 : INFO : adding document #110000 to Dictionary(417230 unique tokens: [u'biennials', u'sowela', u'mdbg', u'biysk', u'sermersheim']...)
2014-07-08 18:47:34,601 : INFO : finished iterating over Wikipedia corpus of 111516 documents with 18341931 positions (total 193436 articles, 19530345 positions before pruning articles shorter than 50 words)
2014-07-08 18:47:34,601 : INFO : built Dictionary(419205 unique tokens: [u'biennials', u'sowela', u'mdbg', u'biysk', u'sermersheim']...) from 111516 documents (total 18341931 corpus positions)
2014-07-08 18:47:34,847 : INFO : keeping 28139 tokens which were in no less than 20 and no more than 11151 (=10.0%) documents
2014-07-08 18:47:35,165 : INFO : resulting dictionary: Dictionary(28139 unique tokens: [u'fawn', u'schlegel', u'sonja', u'woods', u'spiders']...)
2014-07-08 18:47:35,166 : INFO : storing corpus in Matrix Market format to simplewiki_en_bow.mm
2014-07-08 18:47:35,168 : INFO : saving sparse matrix to simplewiki_en_bow.mm
2014-07-08 18:47:35,326 : INFO : PROGRESS: saving document #0
2014-07-08 18:48:10,303 : INFO : PROGRESS: saving document #10000
2014-07-08 18:48:33,232 : INFO : PROGRESS: saving document #20000
2014-07-08 18:48:53,415 : INFO : PROGRESS: saving document #30000
2014-07-08 18:49:16,726 : INFO : PROGRESS: saving document #40000
2014-07-08 18:49:40,372 : INFO : PROGRESS: saving document #50000
2014-07-08 18:50:01,413 : INFO : PROGRESS: saving document #60000
2014-07-08 18:50:20,544 : INFO : PROGRESS: saving document #70000
2014-07-08 18:50:35,360 : INFO : PROGRESS: saving document #80000
2014-07-08 18:50:50,749 : INFO : PROGRESS: saving document #90000
2014-07-08 18:51:03,581 : INFO : PROGRESS: saving document #100000
2014-07-08 18:51:17,333 : INFO : PROGRESS: saving document #110000
2014-07-08 18:51:19,489 : INFO : finished iterating over Wikipedia corpus of 111516 documents with 18341931 positions (total 193436 articles, 19530345 positions before pruning articles shorter than 50 words)
2014-07-08 18:51:19,489 : INFO : saved 111516x28139 matrix, density=0.200% (6284690/3137948724)
2014-07-08 18:51:19,490 : INFO : saving MmCorpus index to simplewiki_en_bow.mm.index
2014-07-08 18:51:19,517 : INFO : saving dictionary mapping to simplewiki_en_wordids.txt.bz2
2014-07-08 18:51:20,147 : INFO : loaded corpus index from simplewiki_en_bow.mm.index
2014-07-08 18:51:20,147 : INFO : initializing corpus reader from simplewiki_en_bow.mm
2014-07-08 18:51:20,147 : INFO : accepted corpus with 111516 documents, 28139 features, 6284690 non-zero entries
2014-07-08 18:51:20,147 : INFO : collecting document frequencies
2014-07-08 18:51:20,153 : INFO : PROGRESS: processing document #0
2014-07-08 18:51:29,844 : INFO : PROGRESS: processing document #10000
2014-07-08 18:51:35,779 : INFO : PROGRESS: processing document #20000
2014-07-08 18:51:40,750 : INFO : PROGRESS: processing document #30000
2014-07-08 18:51:45,936 : INFO : PROGRESS: processing document #40000
2014-07-08 18:51:50,114 : INFO : PROGRESS: processing document #50000
2014-07-08 18:51:55,111 : INFO : PROGRESS: processing document #60000
2014-07-08 18:52:00,033 : INFO : PROGRESS: processing document #70000
2014-07-08 18:52:03,664 : INFO : PROGRESS: processing document #80000
2014-07-08 18:52:07,419 : INFO : PROGRESS: processing document #90000
2014-07-08 18:52:10,415 : INFO : PROGRESS: processing document #100000
2014-07-08 18:52:13,681 : INFO : PROGRESS: processing document #110000
2014-07-08 18:52:14,228 : INFO : calculating IDF weights for 111516 documents and 28138 features (6284690 matrix non-zeros)
2014-07-08 18:52:14,256 : INFO : storing corpus in Matrix Market format to simplewiki_en_tfidf.mm
2014-07-08 18:52:14,256 : INFO : saving sparse matrix to simplewiki_en_tfidf.mm
2014-07-08 18:52:14,264 : INFO : PROGRESS: saving document #0
2014-07-08 18:52:35,928 : INFO : PROGRESS: saving document #10000
2014-07-08 18:52:49,482 : INFO : PROGRESS: saving document #20000
2014-07-08 18:53:00,824 : INFO : PROGRESS: saving document #30000
2014-07-08 18:53:12,513 : INFO : PROGRESS: saving document #40000
2014-07-08 18:53:21,943 : INFO : PROGRESS: saving document #50000
2014-07-08 18:53:33,094 : INFO : PROGRESS: saving document #60000
2014-07-08 18:53:44,313 : INFO : PROGRESS: saving document #70000
2014-07-08 18:53:52,553 : INFO : PROGRESS: saving document #80000
2014-07-08 18:54:01,061 : INFO : PROGRESS: saving document #90000
2014-07-08 18:54:07,902 : INFO : PROGRESS: saving document #100000
2014-07-08 18:54:15,407 : INFO : PROGRESS: saving document #110000
2014-07-08 18:54:16,649 : INFO : saved 111516x28139 matrix, density=0.200% (6284690/3137948724)
2014-07-08 18:54:16,650 : INFO : saving MmCorpus index to simplewiki_en_tfidf.mm.index
2014-07-08 18:54:16,675 : INFO : finished running make_wiki.py
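
The figures reported in the log can be cross-checked with a little arithmetic. The sketch below is not part of make_wiki.py; it copies three numbers from the log and recomputes the upper document-frequency cutoff and the matrix density. The variable names are made up for illustration, and truncating 10% of the corpus size to an integer is an assumption inferred from the logged value 11151, not taken from the gensim source.

```python
# Figures copied from the log above.
num_docs = 111516       # documents kept after pruning short articles
num_terms = 28139       # tokens kept after dictionary filtering
num_nonzero = 6284690   # non-zero entries in the bag-of-words matrix

# "no more than 11151 (=10.0%) documents":
# assumed here to be 10% of the corpus size, truncated to an integer.
no_above = 0.10
max_doc_freq = int(no_above * num_docs)

# "density=0.200% (6284690/3137948724)":
# non-zero entries divided by the total number of matrix cells.
total_cells = num_docs * num_terms
density_percent = 100.0 * num_nonzero / total_cells

print(max_doc_freq)               # 11151
print(total_cells)                # 3137948724
print(f"{density_percent:.3f}%")  # 0.200%
```

The same density line appears for both simplewiki_en_bow.mm and simplewiki_en_tfidf.mm, which fits the TF-IDF step reweighting existing entries rather than adding or dropping them.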