Radim Řehůřek piskvorky

import numpy
# load a 10k x 10k array (800MB) previously stored with
# numpy.save('/tmp/x.npy', numpy.random.rand(10000, 10000))
shared = [numpy.load('/tmp/x.npy', mmap_mode='r') for _ in range(10)]
# touch all elements in all 10 copies of the same mmap'ed array
print([x.sum() for x in shared])
# ...resident/real mem spikes at 10x 800MB, nothing shared?
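The same experiment can be tried at small scale without numpy. The following is a minimal sketch, not the author's method: it swaps `numpy.load(..., mmap_mode='r')` for the standard library's `mmap` module, and the helper names `map_many`, `checksum`, and `demo` are hypothetical. It opens several independent read-only mappings of one file and reads every byte of each. It does not measure resident memory; whether a given tool counts pages that appear in several mappings once or once per mapping depends on the tool and the OS.

```python
import mmap
import os
import tempfile


def map_many(path, copies):
    """Open `copies` independent read-only mappings of the same file."""
    maps = []
    with open(path, "rb") as f:
        for _ in range(copies):
            # length 0 maps the whole file; each call creates a new mapping
            maps.append(mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ))
    return maps


def checksum(m, chunk=1 << 16):
    """Sum all bytes of a mapping in chunks, touching every page."""
    total = 0
    for start in range(0, len(m), chunk):
        total += sum(m[start:start + chunk])
    return total


def demo(copies=10, size_mib=1):
    """Write a small file, map it `copies` times, return one checksum per mapping."""
    fd, path = tempfile.mkstemp()
    try:
        with os.fdopen(fd, "wb") as f:
            f.write(bytes(range(256)) * (4096 * size_mib))
        maps = map_many(path, copies)
        try:
            return [checksum(m) for m in maps]
        finally:
            for m in maps:
                m.close()
    finally:
        os.remove(path)


if __name__ == "__main__":
    print(demo())
```

To compare against the original observation, one could raise `size_mib` and watch the process in an external memory monitor while `demo()` runs.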
$ pip freeze
Bottleneck==0.7.0
CherryPy==3.2.4
Cython==0.19.1
Jinja2==2.6
Markdown==2.3.1
NearPy==0.1.2
Pattern==2.6
PyTrie==0.2
PyYAML==3.10
$ pip -v -v -v install --pre gensim
Downloading/unpacking gensim
Getting page https://pypi.python.org/simple/gensim/
URLs to search for versions for gensim:
* https://pypi.python.org/simple/gensim/
Analyzing links from page https://pypi.python.org/simple/gensim/
Skipping link https://pypi.python.org/packages/2.5/g/gensim/gensim-0.2-py2.5.egg#md5=6cd22bc391fb8e7620b6d5aa0b316a5a (from https://pypi.python.org/simple/gensim/); unknown archive format: .egg
Skipping link https://pypi.python.org/packages/2.5/g/gensim/gensim-0.3.0-py2.5.egg#md5=a2d0ef0fb9b4a6d7224ec102ddfb6670 (from https://pypi.python.org/simple/gensim/); unknown archive format: .egg
Skipping link https://pypi.python.org/packages/2.5/g/gensim/gensim-0.4-py2.5.egg#md5=c82cbd35bf6b686dd93048ed6c80ab70 (from https://pypi.python.org/simple/gensim/); unknown archive format: .egg
Skipping link https://pypi.python.org/packages/2.5/g/gensim/gensim-0.4.1-py2.5.egg#md5=fbfb31e1da91fc9249e59f42f3030431 (from https://pypi.python.org/sim
$ easy_install gensim
Searching for gensim
Reading https://pypi.python.org/simple/gensim/
Best match: gensim 0.10.0rc1
Downloading https://pypi.python.org/packages/source/g/gensim/gensim-0.10.0rc1.tar.gz#md5=6bb7cad2ab922dbbcb8ffb0d876f83c7
Processing gensim-0.10.0rc1.tar.gz
Writing /var/folders/wy/80_9ndgx1pv2x5xgvyk0tq5r0000gn/T/easy_install-B4mVYJ/gensim-0.10.0rc1/setup.cfg
Running gensim-0.10.0rc1/setup.py -q bdist_egg --dist-dir /var/folders/wy/80_9ndgx1pv2x5xgvyk0tq5r0000gn/T/easy_install-B4mVYJ/gensim-0.10.0rc1/egg-dist-tmp-Zwfuwg
warning: no files found matching '*.sh' under directory '.'
no previously-included directories found matching 'docs/src*'
$ python -m gensim.scripts.make_wiki ~/data/wiki/simplewiki-20140623-pages-articles.xml.bz2 simplewiki_en
2014-07-08 18:44:22,009 : INFO : running /Volumes/work/workspace/gensim/trunk/gensim/scripts/make_wiki.py /Users/kofola/data/wiki/simplewiki-20140623-pages-articles.xml.bz2 simplewiki_en
2014-07-08 18:44:22,162 : INFO : adding document #0 to Dictionary(0 unique tokens: [])
2014-07-08 18:44:48,429 : INFO : adding document #10000 to Dictionary(116699 unique tokens: [u'fawn', u'refreshable', u'idaira', u'clottey', u'gavar']...)
2014-07-08 18:45:05,198 : INFO : adding document #20000 to Dictionary(159070 unique tokens: [u'fawn', u'biennials', u'\u03c9\u0431\u0440\u0430\u0434\u043e\u0432\u0430\u043d\u043d\u0430\u0467', u'refreshable', u'grandniece']...)
2014-07-08 18:45:19,946 : INFO : adding document #30000 to Dictionary(198077 unique tokens: [u'biennials', u'idaira', u'clottey', u'gavar', u'experimeter']...)
2014-07-08 18:45:37,237 : INFO : adding document #40000 to Dictionary(232401 unique tokens: [u'bienn
def unescape(text):
    """Unescape HTML entities. Input is either unicode or utf8 string; output is always utf8 string."""
    # adapted from http://effbot.org/zone/re-sub.htm#unescape-html
    def fixup(m):
        text = m.group(0)
        if text[:2] == "&#":
            # character reference
            try:
                if text[:3] == "&#x":
                    return unichr(int(text[3:-1], 16))
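The preview above is Python 2 code and is cut off mid-function. On Python 3, a rough counterpart can lean on the standard library's `html.unescape`. This is a sketch under assumptions rather than a port of the snippet: the name `unescape_utf8` is hypothetical, and the handling of malformed references follows `html.unescape`, which may differ from the original.

```python
import html


def unescape_utf8(text):
    """Unescape HTML entities; accept str or UTF-8 bytes, always return UTF-8 bytes."""
    if isinstance(text, bytes):
        text = text.decode("utf-8")
    # html.unescape handles named (&amp;), decimal (&#66;) and hex (&#x41;) references
    return html.unescape(text).encode("utf-8")


if __name__ == "__main__":
    print(unescape_utf8("&#x41;&amp;&#66;"))
```

Returning bytes mirrors the docstring's promise that the output is always a utf8 string; callers that prefer `str` can drop the final `.encode("utf-8")`.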
#!/usr/bin/env bash
# memusg -- Measure memory usage of processes
# Usage: memusg COMMAND [ARGS]...
#
# Author: Jaeho Shin <netj@sparcs.org>
# Created: 2010-08-16
set -um
# check input
[ $# -gt 0 ] || { sed -n '2,/^#$/ s/^# //p' <"$0"; exit 1; }
piskvorky / f1c51f_doctopics.txt
Created September 1, 2015 02:10
Mallet doctopics
#doc name topic proportion ...
0 0 75 0.27291790375080566 32 0.21542062655806327 91 0.12927086372321364 68 0.11494560291730632 61 0.08635709018778953 73 0.08631053678064 57 0.04319966089780646 28 0.02887936727834596 35 0.014432704123391758 2 2.7911487760191677E-4 45 2.725940123978988E-4 70 2.722625036959093E-4 1 2.3256451843164092E-4 4 2.259570167639352E-4 90 2.2474410586609787E-4 11 2.1430911691440254E-4 58 2.1245904887584203E-4 37 1.8123208755492314E-4 34 1.7234105812375972E-4 15 1.6427407506024936E-4 19 1.632136015179282E-4 10 1.61607028408238E-4 40 1.5930526806191378E-4 51 1.5772117871213373E-4 65 1.5680468190552038E-4 80 1.4563788697690428E-4 99 1.4313460405450642E-4 53 1.4292366928899937E-4 9 1.4076808293294913E-4 59 1.3759332248082507E-4 84 1.335961877850498E-4 41 1.2204449635956596E-4 74 1.1888716247753189E-4 50 1.1866772486063238E-4 76 1.1260766559875016E-4 98 1.1233681058558284E-4 22 1.0940677855090405E-4 56 9.196073733499436E-5 64 9.181876832622669E-5 42 9.119024203586985E-5 72 8.932367891104672E-5
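The header line describes each row as a document index, a document name, and then alternating topic id and proportion pairs. A minimal parsing sketch, assuming whitespace-separated fields and that lines starting with `#` are headers (the function names are hypothetical):

```python
def parse_doctopics_line(line):
    """Parse one '#doc name topic proportion ...' row into (doc, name, [(topic, proportion), ...])."""
    fields = line.split()
    doc_id, name = int(fields[0]), fields[1]
    pairs = fields[2:]
    if len(pairs) % 2:
        raise ValueError("topic id without a matching proportion")
    # float() accepts the Java-style exponent notation, e.g. 2.79E-4
    topics = [(int(pairs[i]), float(pairs[i + 1])) for i in range(0, len(pairs), 2)]
    return doc_id, name, topics


def parse_doctopics(lines):
    """Yield parsed rows, skipping blank lines and '#' header lines."""
    for line in lines:
        if not line.strip() or line.startswith("#"):
            continue
        yield parse_doctopics_line(line)


if __name__ == "__main__":
    sample = [
        "#doc name topic proportion ...",
        "0 0 75 0.27291790375080566 32 0.21542062655806327 2 2.7911487760191677E-4",
    ]
    for doc_id, name, topics in parse_doctopics(sample):
        print(doc_id, name, topics[:2])
```

The sample row is a shortened copy of the first data line above, keeping only three of its topic pairs.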
>>> mm2 = gensim.corpora.MmCorpus(bz2.BZ2File('./enwiki_bow.mm.bz2'))
INFO : initializing corpus reader from <bz2.BZ2File object at 0x1168d988>
INFO : accepted corpus with 3533010 documents, 50000 features, 525892746 non-zero entries
>>> hdp = gensim.models.HdpModel(corpus=mm, id2word=id2word, outputdir='/net/sojka-local/xrehurek/wiki/', chunksize=2048)

... some 18 hours later (single core):

{
"from": 0,
"size": 100,
"query": {
"bool": {
"must_not": {
"terms": {
"prefix1": [
"a",
"b",