Radim Řehůřek piskvorky
>>> mm2 = gensim.corpora.MmCorpus(bz2.BZ2File('./enwiki_bow.mm.bz2'))
INFO : initializing corpus reader from <bz2.BZ2File object at 0x1168d988>
INFO : accepted corpus with 3533010 documents, 50000 features, 525892746 non-zero entries
>>> hdp = gensim.models.HdpModel(corpus=mm, id2word=id2word, outputdir='/net/sojka-local/xrehurek/wiki/', chunksize=2048)

... some 18 hours later (single core):

{
"from": 0,
"size": 100,
"query": {
"bool": {
"must_not": {
"terms": {
"prefix1": [
"a",
"b",
how to count "self-citations"? =>
for each article A:
  for each reference B in article A:
    if the authors of B and the authors of A have a non-empty intersection (= at least one author appears in both A and B), add B to the "set of self-citations of article A"
---
then, when displaying article A, we can also show the number of self-citations = the size of the "set of self-citations of A"
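The counting rule in this note can be sketched in a few lines of Python. This is only an illustration: the `articles` dictionary, its `"authors"` and `"references"` keys, and the `self_citations` function name are hypothetical and not taken from any real codebase. References that point outside the collection are skipped, because their authors are unknown.

```python
def self_citations(articles):
    """Map each article id to the set of its references that share an author with it."""
    result = {}
    for article_id, article in articles.items():
        authors_a = set(article["authors"])
        cited = set()
        for ref_id in article["references"]:
            ref = articles.get(ref_id)
            if ref is None:
                # reference outside the collection: authors unknown, skip it
                continue
            # non-empty intersection = at least one common author
            if authors_a & set(ref["authors"]):
                cited.add(ref_id)
        result[article_id] = cited
    return result


# hypothetical toy data, for illustration only
articles = {
    "A": {"authors": ["Novak", "Svoboda"], "references": ["B", "C", "X"]},
    "B": {"authors": ["Svoboda"], "references": []},
    "C": {"authors": ["Dvorak"], "references": ["B"]},
}

# number of self-citations = size of each article's self-citation set
counts = {key: len(refs) for key, refs in self_citations(articles).items()}
```

Storing the set rather than only the count keeps the option of listing the self-cited articles when displaying article A.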
from cpython cimport PyCObject_AsVoidPtr
from scipy.linalg.blas import fblas
from libc.math cimport fabs
# float-returning signature for BLAS sdot
ctypedef float (*sdot_ptr) (const int *N, const float *X, const int *incX, const float *Y, const int *incY) nogil
cdef sdot_ptr sdot=<sdot_ptr>PyCObject_AsVoidPtr(fblas.sdot._cpointer)
# double-returning signature, bound to the same fblas.sdot pointer
ctypedef double (*dsdot_ptr) (const int *N, const float *X, const int *incX, const float *Y, const int *incY) nogil
cdef dsdot_ptr dsdot=<dsdot_ptr>PyCObject_AsVoidPtr(fblas.sdot._cpointer)
import numpy
# load a 10k x 10k array (800MB) previously stored with
# numpy.save('/tmp/x.npy', numpy.random.rand(10000, 10000))
shared = [numpy.load('/tmp/x.npy', mmap_mode='r') for _ in range(10)]
# touch all elements in all 10 copies of the same mmap'ed array
print([x.sum() for x in shared])
# ...resident/real mem spikes at 10x 800MB, nothing shared?
$ pip freeze
Bottleneck==0.7.0
CherryPy==3.2.4
Cython==0.19.1
Jinja2==2.6
Markdown==2.3.1
NearPy==0.1.2
Pattern==2.6
PyTrie==0.2
PyYAML==3.10
$ pip -v -v -v install --pre gensim
Downloading/unpacking gensim
Getting page https://pypi.python.org/simple/gensim/
URLs to search for versions for gensim:
* https://pypi.python.org/simple/gensim/
Analyzing links from page https://pypi.python.org/simple/gensim/
Skipping link https://pypi.python.org/packages/2.5/g/gensim/gensim-0.2-py2.5.egg#md5=6cd22bc391fb8e7620b6d5aa0b316a5a (from https://pypi.python.org/simple/gensim/); unknown archive format: .egg
Skipping link https://pypi.python.org/packages/2.5/g/gensim/gensim-0.3.0-py2.5.egg#md5=a2d0ef0fb9b4a6d7224ec102ddfb6670 (from https://pypi.python.org/simple/gensim/); unknown archive format: .egg
Skipping link https://pypi.python.org/packages/2.5/g/gensim/gensim-0.4-py2.5.egg#md5=c82cbd35bf6b686dd93048ed6c80ab70 (from https://pypi.python.org/simple/gensim/); unknown archive format: .egg
Skipping link https://pypi.python.org/packages/2.5/g/gensim/gensim-0.4.1-py2.5.egg#md5=fbfb31e1da91fc9249e59f42f3030431 (from https://pypi.python.org/sim
$ easy_install gensim
Searching for gensim
Reading https://pypi.python.org/simple/gensim/
Best match: gensim 0.10.0rc1
Downloading https://pypi.python.org/packages/source/g/gensim/gensim-0.10.0rc1.tar.gz#md5=6bb7cad2ab922dbbcb8ffb0d876f83c7
Processing gensim-0.10.0rc1.tar.gz
Writing /var/folders/wy/80_9ndgx1pv2x5xgvyk0tq5r0000gn/T/easy_install-B4mVYJ/gensim-0.10.0rc1/setup.cfg
Running gensim-0.10.0rc1/setup.py -q bdist_egg --dist-dir /var/folders/wy/80_9ndgx1pv2x5xgvyk0tq5r0000gn/T/easy_install-B4mVYJ/gensim-0.10.0rc1/egg-dist-tmp-Zwfuwg
warning: no files found matching '*.sh' under directory '.'
no previously-included directories found matching 'docs/src*'
$ python -m gensim.scripts.make_wiki ~/data/wiki/simplewiki-20140623-pages-articles.xml.bz2 simplewiki_en
2014-07-08 18:44:22,009 : INFO : running /Volumes/work/workspace/gensim/trunk/gensim/scripts/make_wiki.py /Users/kofola/data/wiki/simplewiki-20140623-pages-articles.xml.bz2 simplewiki_en
2014-07-08 18:44:22,162 : INFO : adding document #0 to Dictionary(0 unique tokens: [])
2014-07-08 18:44:48,429 : INFO : adding document #10000 to Dictionary(116699 unique tokens: [u'fawn', u'refreshable', u'idaira', u'clottey', u'gavar']...)
2014-07-08 18:45:05,198 : INFO : adding document #20000 to Dictionary(159070 unique tokens: [u'fawn', u'biennials', u'\u03c9\u0431\u0440\u0430\u0434\u043e\u0432\u0430\u043d\u043d\u0430\u0467', u'refreshable', u'grandniece']...)
2014-07-08 18:45:19,946 : INFO : adding document #30000 to Dictionary(198077 unique tokens: [u'biennials', u'idaira', u'clottey', u'gavar', u'experimeter']...)
2014-07-08 18:45:37,237 : INFO : adding document #40000 to Dictionary(232401 unique tokens: [u'bienn
bigram               count    score
external_links       1823239  11.0846952043
united_states        1201240  9.51993073859
references_external  1073067  4.49092215503
new_york             1041129  4.9935587872
th_century           617252   5.71497454564
did_not              526168   4.30135948012
los_angeles          259057   63.2530378731
new_zealand          257352   5.34529759482
does_not             239292   4.39354755795