@mkhl /MEDITATION Secret
Last active Oct 31, 2015

An entry for the SoundCloud Berlin Buzzwords contest.
Requirements:
* Python 2 (tested with 2.7.5)
* numpy 1.6 (tested with 1.6.2)
* PIL or Pillow to draw matches as images (tested with Pillow 2.4.0)
Instructions:
* Run `python find.py` to find matches and write their offsets to stdout.
* Run `python draw.py` to find matches and write them as PNG images named "$rank-$offset($score).png"
* Use `LOWER=197000000 UPPER=199000000 python {find.py,draw.py}` to set lower/upper bounds.
Further Information:
* Read `MEDITATION` for information on how the program took shape.
# data.py
from numpy import *

PATH = 'pi-billion.txt'
max_results = 10

pattern = array([
    0, 0, 0, 0, 0, 3, 3, 9, 9, 9, 6, 0, 0, 0,
    0, 0, 0, 3, 4, 9, 5, 9, 9, 9, 9, 4, 0, 0,
    0, 1, 2, 7, 5, 9, 5, 9, 9, 9, 9, 7, 1, 0,
    3, 9, 5, 9, 5, 9, 5, 9, 9, 9, 9, 9, 9, 3,
    6, 9, 5, 9, 5, 9, 5, 9, 9, 9, 9, 9, 9, 6,
    1, 8, 5, 9, 5, 9, 5, 9, 9, 9, 9, 9, 8, 1,
], dtype=byte)
size = pattern.size
#!/usr/bin/env python
# draw.py
from PIL import Image, ImageColor, ImageDraw
from numpy import *

from data import *
from find import *

colors = array([
    '#ffffff',
    '#f0f0f0',
    '#ebebeb',
    '#d0d0d0',
    '#c1c1c1',
    '#a8a8a8',
    '#878787',
    '#535353',
    '#333333',
    '#000000',
])


def main(path, lower=0, upper=None):
    buffer = mmap(path)
    if upper is None:
        upper = buffer.size
    lower = int(lower)
    upper = int(upper)
    scores, offsets = find(buffer, lower, upper)
    for i, o in enumerate(offsets):
        path = '%d-%d(%s).png' % (i, offsets[i], scores[i])
        draw_buffer(buffer[o:o+size], path)
    draw_buffer(pattern, 'pattern.png')


def draw_buffer(buffer, path):
    cs = colors[buffer.reshape(6, 14)]
    im = Image.new('RGB', (14, 6))
    dr = ImageDraw.Draw(im)
    for y, xs in enumerate(cs):
        for x, c in enumerate(xs):
            dr.point((x, y), ImageColor.getrgb(c))
    im.save(path)


if __name__ == '__main__':
    import sys
    lower, upper = parse(sys.argv[1:])
    main(PATH, lower=lower, upper=upper)
#!/usr/bin/env python
# find.py
from __future__ import print_function
from numpy import *

from data import *


def main(path, lower=0, upper=None):
    buffer = mmap(path)
    if upper is None:
        upper = buffer.size
    lower = int(lower)
    upper = int(upper)
    scores, offsets = find(buffer, lower, upper)
    for o in offsets:
        print('%d' % (o + 1))


def mmap(path):
    # offset=2 skips the first 2 bytes
    # - ord('0') converts each ASCII digit to its numeric value
    return memmap(path, dtype=byte, mode='r', offset=2) - ord('0')


def find(buffer, lower, upper):
    vscore = vectorize(lambda offset: score(buffer, offset))
    return walk(buffer, vscore, lower, upper - size)


def score(buffer, offset):
    return sum(absolute(subtract(pattern, buffer[offset:offset+size])))


def walk(buffer, score, lower, upper):
    if upper - lower <= max_results:
        offsets = arange(lower, upper)
        scores = score(offsets)
        indices = argsort(scores)
        return scores[indices], offsets[indices]
    pivot = lower + (upper - lower) // 2
    left_scores, left_offsets = walk(buffer, score, lower, pivot)
    right_scores, right_offsets = walk(buffer, score, pivot, upper)
    offsets = concatenate((left_offsets, right_offsets))
    scores = concatenate((left_scores, right_scores))
    indices = argsort(scores, kind='quicksort')[:max_results]
    return scores[indices], offsets[indices]


def parse(args):
    import os
    if len(args) == 0:
        return os.getenv('LOWER', 0), os.getenv('UPPER')
    if len(args) == 1:
        return 0, args[0]
    if len(args) == 2:
        return args
    return 0, None


if __name__ == '__main__':
    import sys
    lower, upper = parse(sys.argv[1:])
    main(PATH, lower=lower, upper=upper)
# Makefile
.PHONY: check draw find

check:
	pep8 *.py

draw: pi-billion.txt
	python draw.py

find: pi-billion.txt
	python find.py

pi-billion.txt:
	curl -LO# http://stuff.mit.edu/afs/sipb/contrib/pi/pi-billion.txt
Initial Ideas
-------------
My first idea was to read bytes from the data file into a ring buffer and
compare that to the pattern. Then I remembered that mmap(2)ing would save me
from having to copy the data at all, so I skipped the ring buffer.
Algorithmic Landscape
---------------------
Next I did a short survey of common algorithms for approximate string matching.
I remembered reading about Boyer-Moore (for exact string matching) and found a
paper[1] describing a variant that allows a maximum of k differences. Lacking
a fixed limit on the difference, the algorithm seemed to degenerate to linear
search though. The same thing happened for the subsequence alignment algorithms
I found (Needleman-Wunsch and Smith-Waterman), neither of which seem applicable
to this problem as it lacks meaningful insertion and deletion. Instead I
planned to calculate a running sum of differences between the pattern and the
target which would abort the current target (i.e. position in the data buffer)
once the sum exceeded the current worst best score (the score of the 10th best
match). This idea soon turned obsolete, see below.
[1]: Tarhio, Ukkonen: Approximate Boyer-Moore Matching <http://www.cs.hut.fi/~tarhio/papers/abm.pdf>
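That abandoned running-sum idea can be sketched in a few lines; `score_with_cutoff` is a hypothetical name, and the cutoff would be the score of the current 10th-best match:

```python
def score_with_cutoff(pattern, target, cutoff):
    # Accumulate absolute digit differences; give up as soon as the
    # running sum exceeds the current worst of the best scores.
    total = 0
    for p, t in zip(pattern, target):
        total += abs(int(p) - int(t))
        if total > cutoff:
            return None  # cannot beat the current 10th-best match
    return total
```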
Tools of the Trade
------------------
I chose Python because I hadn't written any in a while, wanted a break from the
Java my day job consists of, and didn't feel fluent enough with Go. There was
also numpy, which I never had a need to use, but which provides array
programming facilities, which I am a fan of. Soon I had switched to numpy's
memmap function over the standard library one, and found myself using
vectorized operations all over the place. I still consider this a good choice
because numpy seems well optimized and affords a lot of control, like over
array allocation. For example, with the search data in an mmap(2)ed buffer,
every slice we operate on is really just a view into that buffer.
Separation of Concerns
----------------------
The basic idea at that point was to have a loop over every position in the
target buffer, calculate the "score" of the corresponding slice of that buffer,
and insert the (score, offset) pair into a sorted list of best matches. All of
this could turn into distinct functions to 1) iterate the search space, 2)
score the current target and 3) insert the scored position. Lacking library
functions to insert an item into a sorted list (I had decided not to use heapq
because neither converting back and forth between lists and heaps nor keeping a
large heap to pick items off of at the end seemed feasible), I implemented this
approach, and performance turned out abysmal.
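For reference, a sorted best-matches list of the kind described above can be kept with the standard library's `bisect` module; a minimal sketch, where `insert_match` is a hypothetical helper:

```python
import bisect


def insert_match(best, score, offset, max_results=10):
    # Keep `best` sorted by score and truncated to the top entries.
    bisect.insort(best, (score, offset))
    del best[max_results:]


best = []
for score, offset in [(5, 100), (2, 40), (9, 7), (1, 300)]:
    insert_match(best, score, offset, max_results=3)
# best now holds the three lowest-scoring (score, offset) pairs
```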
Divide and Conquer
------------------
To eliminate the need for maintaining a sorted list of best results, I explored
a divide-and-conquer approach to the iteration function. There are two basic
approaches here: a) recursively splitting the search space and b) chunking the
search space. The former works kind of like merge sort, where each half of the
search space gets processed recursively and the two lists of results are merged
afterwards. The latter works more like a left fold, where we start with an
empty result list, process each chunk of the search space individually and
immediately merge the results from that chunk into the total result. I didn't
find the time to actually implement the merging and, given that the library
search functions are quite fast for small inputs, opted for sorting the
concatenated result lists. In my experiments, the recursive merging always beat
the chunked merging, even though both merge identically and the latter should
require fewer merges (103 instead of 128 for a buffer of size 1024 and chunks
of size <= 10). This was the most curious result during this experiment. I was
also surprised that quicksort did a better job of sorting the concatenation of
two sorted lists; I had expected mergesort to come out on top (I recall being
taught that quicksort is slower on sorted inputs).
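The chunked, fold-style variant can be sketched as follows; the names (`walk_chunked`, `score_fn`) are illustrative, and this is a simplification rather than the exact code from the experiment:

```python
import numpy as np


def walk_chunked(score_fn, lower, upper, max_results=10):
    # Left-fold over the search space: score each chunk of offsets and
    # immediately merge it into the running top-max_results result.
    scores = np.empty(0)
    offsets = np.empty(0, dtype=np.intp)
    for start in range(lower, upper, max_results):
        chunk = np.arange(start, min(start + max_results, upper))
        scores = np.concatenate((scores, score_fn(chunk)))
        offsets = np.concatenate((offsets, chunk))
        order = np.argsort(scores, kind='quicksort')[:max_results]
        scores, offsets = scores[order], offsets[order]
    return scores, offsets
```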
Scoring
-------
Finding a good scoring function, or rather trying to find one, took the most
time of this whole endeavor. As mentioned above, I planned to calculate a
cumulative sum of differences, breaking out of the computation once it exceeded
the worst of the best scores. It turns out that just calculating the whole sum
of differences at once is faster using vectorized numpy operations.
After the iteration strategy was settled, the scoring function dominated the
runtime behaviour of my program. I explored alternative ways to calculate
scores, among them convolution and covariance matrices (numpy.convolve and
numpy.cov, respectively) but found that all of them slowed my program down, and
I wasn't able to get good results from convolution. It seems that using
convolution would be the tool of choice for this problem, given that it is used
for searching subimages, and I wouldn't be surprised if the winning entry used
that approach.
Besides that I would next have tried vectorizing my inputs harder, like turning
a chunk of slices of the search spaces into a matrix, hoping that numpy would
be able to do an even better job of scoring that.
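That idea can be sketched with numpy's stride tricks: every candidate slice becomes a row of a virtual, non-copying matrix, and all rows are scored in one vectorized expression. A minimal sketch (`score_all` is a hypothetical name), assuming small non-negative digit values so the subtraction cannot overflow:

```python
import numpy as np
from numpy.lib.stride_tricks import as_strided


def score_all(buffer, pattern):
    # View every length-`pattern.size` slice of `buffer` as one matrix row.
    # as_strided creates a view, so no data is copied.
    n = buffer.size - pattern.size + 1
    windows = as_strided(buffer, shape=(n, pattern.size),
                         strides=(buffer.strides[0], buffer.strides[0]))
    # Score all candidate positions in a single vectorized expression.
    return np.abs(windows - pattern).sum(axis=1)
```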
Fewer Functions
---------------
There are parts of the code I'd like to write differently, mainly the `walk`
function, which could be split up into smaller inner functions (`simple`,
`complex` and `merge` in my case), but doing that also slowed down my program,
and while the result did read a bit better, I'm not unhappy with the way it
reads now.
Complexity
----------
The algorithm is `O(n*m)`, effectively comparing each digit of the search space
with each digit of the pattern. It doesn't copy data from the search space but
does make copies of the result arrays. During merging, at most `log_2(n)`
result arrays are held in memory at the same time.
The resulting program takes about 35 seconds to scan 1,000,000 digits of pi and
would take several hours to scan the whole 1,000,000,000 digits provided. This
makes me doubtful that it could be a winning entry, but hopefully it's an
interesting read nonetheless.
Martin Kühl <martin.kuehl@gmail.com>, 2014-05-05