An entry for the SoundCloud Berlin Buzzwords contest.

Requirements:

* Python 2 (tested with 2.7.5)
* numpy 1.6 (tested with 1.6.2)
* PIL or Pillow to draw matches as images (tested with Pillow 2.4.0)

Instructions:

* Run `python find.py` to find matches and write their offsets to stdout.
* Run `python draw.py` to find matches and write them as PNG images named
  "$rank-$offset($score).png".
* Use `LOWER=197000000 UPPER=199000000 python {find.py,draw.py}` to set
  lower/upper bounds.

Further Information:

* Read `MEDITATION` for information on how the program took shape.

data.py
-------

from numpy import *

PATH = 'pi-billion.txt'
max_results = 10

pattern = array([
    0, 0, 0, 0, 0, 3, 3, 9, 9, 9, 6, 0, 0, 0,
    0, 0, 0, 3, 4, 9, 5, 9, 9, 9, 9, 4, 0, 0,
    0, 1, 2, 7, 5, 9, 5, 9, 9, 9, 9, 7, 1, 0,
    3, 9, 5, 9, 5, 9, 5, 9, 9, 9, 9, 9, 9, 3,
    6, 9, 5, 9, 5, 9, 5, 9, 9, 9, 9, 9, 9, 6,
    1, 8, 5, 9, 5, 9, 5, 9, 9, 9, 9, 9, 8, 1,
], dtype=byte)
size = pattern.size

draw.py
-------

#!/usr/bin/env python
from PIL import Image, ImageColor, ImageDraw
from numpy import *

from data import *
from find import *

colors = array([
    '#ffffff',
    '#f0f0f0',
    '#ebebeb',
    '#d0d0d0',
    '#c1c1c1',
    '#a8a8a8',
    '#878787',
    '#535353',
    '#333333',
    '#000000',
])


def main(path, lower=0, upper=None):
    buffer = mmap(path)
    if upper is None:
        upper = buffer.size
    lower = int(lower)
    upper = int(upper)
    scores, offsets = find(buffer, lower, upper)
    for i, o in enumerate(offsets):
        name = '%d-%d(%s).png' % (i, o, scores[i])
        draw_buffer(buffer[o:o+size], name)
    draw_buffer(pattern, 'pattern.png')


def draw_buffer(buffer, path):
    cs = colors[buffer.reshape(6, 14)]
    im = Image.new('RGB', (14, 6))
    dr = ImageDraw.Draw(im)
    for y, xs in enumerate(cs):
        for x, c in enumerate(xs):
            dr.point((x, y), ImageColor.getrgb(c))
    im.save(path)


if __name__ == '__main__':
    import sys
    lower, upper = parse(sys.argv[1:])
    main(PATH, lower=lower, upper=upper)

find.py
-------

#!/usr/bin/env python
from __future__ import print_function
from numpy import *

from data import *


def main(path, lower=0, upper=None):
    buffer = mmap(path)
    if upper is None:
        upper = buffer.size
    lower = int(lower)
    upper = int(upper)
    scores, offsets = find(buffer, lower, upper)
    for o in offsets:
        # +1 turns the 0-based buffer offset into a 1-based
        # position after the decimal point
        print('%d' % (o + 1))


def mmap(path):
    # offset=2 skips the first 2 bytes ("3.")
    # - ord('0') converts from ascii digits to numbers
    return memmap(path, dtype=byte, mode='r', offset=2) - ord('0')


def find(buffer, lower, upper):
    vscore = vectorize(lambda offset: score(buffer, offset))
    return walk(buffer, vscore, lower, upper - size)


def score(buffer, offset):
    return sum(absolute(subtract(pattern, buffer[offset:offset+size])))


def walk(buffer, score, lower, upper):
    if upper - lower <= max_results:
        offsets = arange(lower, upper)
        scores = score(offsets)
        indices = argsort(scores)
        return scores[indices], offsets[indices]
    pivot = lower + (upper - lower) // 2  # integer division, also safe on Python 3
    left_scores, left_offsets = walk(buffer, score, lower, pivot)
    right_scores, right_offsets = walk(buffer, score, pivot, upper)
    offsets = concatenate((left_offsets, right_offsets))
    scores = concatenate((left_scores, right_scores))
    indices = argsort(scores, kind='quicksort')[:max_results]
    return scores[indices], offsets[indices]


def parse(args):
    import os
    if len(args) == 0:
        return os.getenv('LOWER', 0), os.getenv('UPPER')
    if len(args) == 1:
        return 0, args[0]
    if len(args) == 2:
        return args
    return 0, None


if __name__ == '__main__':
    import sys
    lower, upper = parse(sys.argv[1:])
    main(PATH, lower=lower, upper=upper)

Makefile
--------

.PHONY: check draw find

check:
	pep8 *.py

draw: pi-billion.txt
	python draw.py

find: pi-billion.txt
	python find.py

pi-billion.txt:
	curl -LO# http://stuff.mit.edu/afs/sipb/contrib/pi/pi-billion.txt

MEDITATION
----------

Initial Ideas
-------------

My first idea was to read bytes from the data file into a ring buffer and
compare that to the pattern. Then I remembered that mmap(2)ing would save me
from having to copy the data at all, so I skipped the ring buffer.

Algorithmic Landscape
---------------------

Next I did a short survey of common algorithms for approximate string matching.
I remembered reading about Boyer-Moore (for exact string matching) and found a
paper[1] describing a variant that allows a maximum of k differences. Lacking
a fixed limit on the difference, though, the algorithm seemed to degenerate to
linear search. The same thing happened for the sequence alignment algorithms
I found (Needleman-Wunsch and Smith-Waterman), neither of which seems
applicable to this problem anyway, as it lacks meaningful insertions and
deletions. Instead I planned to calculate a running sum of differences between
the pattern and the target which would abort the current target (i.e. position
in the data buffer) once the sum exceeded the current worst best score (the
score of the 10th best match). This idea soon became obsolete; see below.

[1]: Tarhio, Ukkonen: Approximate Boyer-Moore Matching <http://www.cs.hut.fi/~tarhio/papers/abm.pdf>

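For concreteness, a minimal sketch of that abandoned early-abort idea (not
part of the submitted code), assuming `pattern` and `size` from data.py and a
`threshold` holding the current 10th-best score:

    def score_with_cutoff(buffer, offset, threshold):
        # Running sum of absolute digit differences; bail out as soon
        # as this target can no longer beat the worst of the ten best.
        total = 0
        for i in range(size):
            total += abs(int(pattern[i]) - int(buffer[offset + i]))
            if total > threshold:
                return None  # pruned: cannot enter the top ten
        return total
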
Tools of the Trade
------------------

I chose Python because I hadn't written any in a while, wanted a break from the
Java my day job consists of, and didn't feel fluent enough with Go. There was
also numpy, which I had never needed before but which provides the array
programming facilities I am a fan of. I soon switched from the standard
library's mmap to numpy's memmap function and found myself using vectorized
operations all over the place. I still consider this a good choice because
numpy seems well optimized and affords a lot of control, for example over array
allocation: with the search data in an mmap(2)ed buffer, every slice we operate
on is really just a view into that buffer.

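A quick way to convince yourself of that last point, as a hypothetical session
against the pi-billion.txt file the Makefile fetches:

    from numpy import memmap, byte

    buf = memmap('pi-billion.txt', dtype=byte, mode='r', offset=2)
    window = buf[1000:1084]         # slicing allocates no new digit storage
    print(window.base is not None)  # True: `window` shares `buf`'s memory
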
Separation of Concerns
----------------------

The basic idea at that point was to have a loop over every position in the
target buffer, calculate the "score" of the corresponding slice of that buffer,
and insert the (score, offset) pair into a sorted list of best matches. All of
this could turn into distinct functions to 1) iterate the search space, 2)
score the current target and 3) insert the scored position. Lacking library
functions to insert an item into a sorted list (I had decided not to use heapq
because neither converting back and forth between lists and heaps nor keeping a
large heap to pick items off of at the end seemed feasible), I implemented this
approach, and performance turned out to be abysmal.

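For reference, the approach described looks roughly like this; as a
substitution on my part, the sketch uses the standard library's
`bisect.insort` for the sorted insertion, and it reuses `score`, `size` and
`max_results` from the files above:

    import bisect

    def naive_find(buffer, lower, upper):
        best = []  # sorted list of (score, offset) pairs, at most ten long
        for offset in range(lower, upper - size):
            bisect.insort(best, (score(buffer, offset), offset))
            del best[max_results:]  # drop everything below tenth place
        return best
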
Divide and Conquer
------------------

To eliminate the need for maintaining a sorted list of best results, I explored
a divide-and-conquer approach to the iteration function. There are two basic
strategies here: a) recursively splitting the search space and b) chunking the
search space. The former works somewhat like merge sort, where each half of the
search space gets processed recursively and the two lists of results are merged
afterwards. The latter works more like a left fold, where we start with an
empty result list, process each chunk of the search space individually and
immediately merge the results from that chunk into the total result. I didn't
find the time to actually implement the merging and, given that the library
sorting functions are quite fast for small inputs, opted for sorting the
concatenated result lists. In my experiments, the recursive merging always beat
the chunked merging, even though both merge identically and the latter should
require fewer merges (103 instead of 128 for a buffer of size 1024 and chunks
of size <= 10). This was the most curious result of this experiment. I was
also surprised that quicksort did a better job of sorting the concatenation of
two sorted lists; I expected mergesort to come out on top (I recall being
taught that quicksort is slower on sorted inputs).

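The recursive variant became the `walk` in find.py above; for comparison, a
hedged sketch of the chunked left-fold variant, assuming find.py's
`from numpy import *` namespace along with its vectorized `score` and
`max_results` (chunks of size 10 as in the experiment above):

    def walk_chunked(buffer, score, lower, upper, chunk=10):
        scores = array([], dtype=int)
        offsets = array([], dtype=int)
        for start in range(lower, upper, chunk):
            candidates = arange(start, min(start + chunk, upper))
            scores = concatenate((scores, score(candidates)))
            offsets = concatenate((offsets, candidates))
            # merge the chunk into the running best, keeping the ten lowest
            keep = argsort(scores)[:max_results]
            scores, offsets = scores[keep], offsets[keep]
        return scores, offsets
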
Scoring
-------

Finding a good scoring function, or rather trying to find one, took most of the
time of this whole endeavor. As mentioned above, I had planned to calculate a
cumulative sum of differences, breaking out of the computation once it exceeded
the worst of the best scores. It turns out that just calculating the whole sum
of differences at once is faster using vectorized numpy operations.

After the iteration strategy was settled, the scoring function dominated the
runtime behaviour of my program. I explored alternative ways to calculate
scores, among them convolution and covariance matrices (numpy.convolve and
numpy.cov, respectively), but found that all of them slowed my program down,
and I wasn't able to get good results from convolution. Convolution does seem
to be the tool of choice for this problem, given that it is used for searching
subimages, and I wouldn't be surprised if the winning entry used that approach.

Beyond that, I would next have tried vectorizing my inputs harder, for example
by turning a chunk of slices of the search space into a matrix, hoping that
numpy would be able to do an even better job of scoring that.

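An untried sketch of that last idea using numpy's stride tricks: turn every
candidate offset in a block into a row of a matrix (as views, so no digits are
copied) and score all rows in one vectorized expression. This assumes
`pattern` and `size` from data.py, find.py's `from numpy import *` namespace,
and a contiguous byte buffer:

    from numpy.lib.stride_tricks import as_strided

    def score_block(buffer, lower, upper):
        # one row per candidate offset in [lower, upper), `size` digits wide
        count = upper - lower
        windows = as_strided(buffer[lower:], shape=(count, size),
                             strides=(buffer.itemsize, buffer.itemsize))
        # broadcasts the pattern against every row; one score per offset
        return absolute(windows - pattern).sum(axis=1)
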
Fewer Functions
---------------

There are parts of the code I'd like to write differently, mainly the `walk`
function, which could be split up into smaller inner functions (`simple`,
`complex` and `merge` in my case), but doing that also slowed down my program,
and while it did read a bit better, I'm not unhappy with the way it reads now.

Complexity
----------

The algorithm is `O(n*m)`, effectively comparing each digit of the search space
with each digit of the pattern. It doesn't copy data from the search space but
does make copies of the result arrays. During merging, at most `log_2(n)`
result arrays are held in memory at the same time.

The resulting program takes about 35 seconds to scan 1,000,000 digits of pi,
so scanning the whole 1,000,000,000 digits provided would take roughly 35,000
seconds, close to ten hours. This makes me doubtful that it could be a winning
entry, but hopefully it's an interesting read nonetheless.

Martin Kühl <martin.kuehl@gmail.com>, 2014-05-05