@mkhl /MEDITATION Secret
Last active Oct 31, 2015

An entry for the SoundCloud Berlin Buzzwords contest.
Requirements:
* Python 2 (tested with 2.7.5)
* numpy 1.6 (tested with 1.6.2)
* PIL or Pillow to draw matches as images (tested with Pillow 2.4.0)
Instructions:
* Run `python find.py` to find matches and write their offsets to stdout.
* Run `python draw.py` to find matches and write them as PNG images named "$rank-$offset($score).png"
* Use `LOWER=197000000 UPPER=199000000 python {find.py,draw.py}` to set lower/upper bounds.
Further Information:
* Read `MEDITATION` for information on how the program took shape.
# data.py
from numpy import *

PATH = 'pi-billion.txt'
max_results = 10

pattern = array([
    0, 0, 0, 0, 0, 3, 3, 9, 9, 9, 6, 0, 0, 0,
    0, 0, 0, 3, 4, 9, 5, 9, 9, 9, 9, 4, 0, 0,
    0, 1, 2, 7, 5, 9, 5, 9, 9, 9, 9, 7, 1, 0,
    3, 9, 5, 9, 5, 9, 5, 9, 9, 9, 9, 9, 9, 3,
    6, 9, 5, 9, 5, 9, 5, 9, 9, 9, 9, 9, 9, 6,
    1, 8, 5, 9, 5, 9, 5, 9, 9, 9, 9, 9, 8, 1,
], dtype=byte)
size = pattern.size
#!/usr/bin/env python
# draw.py
from PIL import Image, ImageColor, ImageDraw
from numpy import *

from data import *
from find import *

colors = array([
    '#ffffff',
    '#f0f0f0',
    '#ebebeb',
    '#d0d0d0',
    '#c1c1c1',
    '#a8a8a8',
    '#878787',
    '#535353',
    '#333333',
    '#000000',
])


def main(path, lower=0, upper=None):
    buffer = mmap(path)
    if upper is None:
        upper = buffer.size
    lower = int(lower)
    upper = int(upper)
    scores, offsets = find(buffer, lower, upper)
    for i, o in enumerate(offsets):
        path = '%d-%d(%s).png' % (i, offsets[i], scores[i])
        draw_buffer(buffer[o:o+size], path)
    draw_buffer(pattern, 'pattern.png')


def draw_buffer(buffer, path):
    cs = colors[buffer.reshape(6, 14)]
    im = Image.new('RGB', (14, 6))
    dr = ImageDraw.Draw(im)
    for y, xs in enumerate(cs):
        for x, c in enumerate(xs):
            dr.point((x, y), ImageColor.getrgb(c))
    im.save(path)


if __name__ == '__main__':
    import sys
    lower, upper = parse(sys.argv[1:])
    main(PATH, lower=lower, upper=upper)
#!/usr/bin/env python
# find.py
from __future__ import print_function
from numpy import *

from data import *


def main(path, lower=0, upper=None):
    buffer = mmap(path)
    if upper is None:
        upper = buffer.size
    lower = int(lower)
    upper = int(upper)
    scores, offsets = find(buffer, lower, upper)
    for o in offsets:
        print('%d' % (o + 1))


def mmap(path):
    # offset=2 skips the first 2 bytes
    # - ord('0') converts each ASCII digit to its numeric value
    return memmap(path, dtype=byte, mode='r', offset=2) - ord('0')


def find(buffer, lower, upper):
    vscore = vectorize(lambda offset: score(buffer, offset))
    return walk(buffer, vscore, lower, upper - size)


def score(buffer, offset):
    return sum(absolute(subtract(pattern, buffer[offset:offset+size])))


def walk(buffer, score, lower, upper):
    if upper - lower <= max_results:
        offsets = arange(lower, upper)
        scores = score(offsets)
        indices = argsort(scores)
        return scores[indices], offsets[indices]
    pivot = lower + (upper - lower) // 2
    left_scores, left_offsets = walk(buffer, score, lower, pivot)
    right_scores, right_offsets = walk(buffer, score, pivot, upper)
    offsets = concatenate((left_offsets, right_offsets))
    scores = concatenate((left_scores, right_scores))
    indices = argsort(scores, kind='quicksort')[:max_results]
    return scores[indices], offsets[indices]


def parse(args):
    import os
    if len(args) == 0:
        return os.getenv('LOWER', 0), os.getenv('UPPER')
    if len(args) == 1:
        return 0, args[0]
    if len(args) == 2:
        return args
    return 0, None


if __name__ == '__main__':
    import sys
    lower, upper = parse(sys.argv[1:])
    main(PATH, lower=lower, upper=upper)
# Makefile
.PHONY: check draw find

check:
	pep8 *.py

draw: pi-billion.txt
	python draw.py

find: pi-billion.txt
	python find.py

pi-billion.txt:
	curl -LO# http://stuff.mit.edu/afs/sipb/contrib/pi/pi-billion.txt
Initial Ideas
-------------
My first idea was to read bytes from the data file into a ring buffer and
compare that to the pattern. Then I remembered that mmap(2)ing would save me
from having to copy the data at all, so I skipped the ring buffer.
Algorithmic Landscape
---------------------
Next I did a short survey of common algorithms for approximate string matching.
I remembered reading about Boyer-Moore (for exact string matching) and found a
paper[1] describing a variant that allows a maximum of k differences. Lacking
a fixed limit on the difference, the algorithm seemed to degenerate to linear
search though. The same thing happened for the subsequence alignment algorithms
I found (Needleman-Wunsch and Smith-Waterman), neither of which seem applicable
to this problem as it lacks meaningful insertion and deletion. Instead I
planned to calculate a running sum of differences between the pattern and the
target which would abort the current target (i.e. position in the data buffer)
once the sum exceeded the current worst best score (the score of the 10th best
match). This idea soon turned obsolete, see below.
[1]: Tarhio, Ukkonen: Approximate Boyer-Moore Matching <http://www.cs.hut.fi/~tarhio/papers/abm.pdf>
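That abandoned running-sum idea can be sketched in a few lines; `score_with_cutoff` is a hypothetical name, and the cutoff would be the score of the current 10th-best match:

```python
def score_with_cutoff(pattern, target, cutoff):
    # Accumulate absolute digit differences; give up as soon as the
    # running sum exceeds the current worst of the best scores.
    total = 0
    for p, t in zip(pattern, target):
        total += abs(int(p) - int(t))
        if total > cutoff:
            return None  # cannot beat the current 10th-best match
    return total
```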
Tools of the Trade
------------------
I chose Python because I hadn't written any in a while, wanted a break from the
Java my day job consists of, and didn't feel fluent enough with Go. There was
also numpy, which I never had a need to use, but which provides array
programming facilities, which I am a fan of. Soon I had switched to numpy's
memmap function over the standard library one, and found myself using
vectorized operations all over the place. I still consider this a good choice
because numpy seems well optimized and affords a lot of control, like over
array allocation. For example, with the search data in an mmap(2)ed buffer,
every slice we operate on is really just a view into that buffer.
Separation of Concerns
----------------------
The basic idea at that point was to have a loop over every position in the
target buffer, calculate the "score" of the corresponding slice of that buffer,
and insert the (score, offset) pair into a sorted list of best matches. All of
this could turn into distinct functions to 1) iterate the search space, 2)
score the current target and 3) insert the scored position. Lacking library
functions to insert an item into a sorted list (I had decided not to use heapq
because neither converting back and forth between lists and heaps nor keeping a
large heap to pick items off of at the end seemed feasible), I implemented this
approach, and performance turned out abysmal.
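For reference, a sorted best-matches list of the kind described above can be kept with the standard library's `bisect` module; a minimal sketch, where `insert_match` is a hypothetical helper:

```python
import bisect


def insert_match(best, score, offset, max_results=10):
    # Keep `best` sorted by score and truncated to the top entries.
    bisect.insort(best, (score, offset))
    del best[max_results:]


best = []
for score, offset in [(5, 100), (2, 40), (9, 7), (1, 300)]:
    insert_match(best, score, offset, max_results=3)
# best now holds the three lowest-scoring (score, offset) pairs
```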
Divide and Conquer
------------------
To eliminate the need for maintaining a sorted list of best results, I explored
a divide-and-conquer approach to the iteration function. There are two basic
approaches here: a) recursively splitting the search space and b) chunking the
search space. The former works kind of like merge sort, where each half of the
search space gets processed recursively and the two lists of results are merged
afterwards. The latter works more like a left fold, where we start with an
empty result list, process each chunk of the search space individually and
immediately merge the results from that chunk into the total result. I didn't
find the time to actually implement the merging and, given that the library
search functions are quite fast for small inputs, opted for sorting the
concatenated result lists. In my experiments, the recursive merging always beat
the chunked merging, even though both merge identically and the latter should
require fewer merges (103 instead of 128 for a buffer of size 1024 and chunks
of size <= 10). This was the most curious result during this experiment. I was
also surprised that quicksort did a better job of sorting the concatenation of
two sorted lists; I had expected mergesort to come out on top (I recall being
taught that quicksort is slower on sorted inputs).
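The chunked, fold-style variant can be sketched as follows; the names (`walk_chunked`, `score_fn`) are illustrative, and this is a simplification rather than the exact code from the experiment:

```python
import numpy as np


def walk_chunked(score_fn, lower, upper, max_results=10):
    # Left-fold over the search space: score each chunk of offsets and
    # immediately merge it into the running top-max_results result.
    scores = np.empty(0)
    offsets = np.empty(0, dtype=np.intp)
    for start in range(lower, upper, max_results):
        chunk = np.arange(start, min(start + max_results, upper))
        scores = np.concatenate((scores, score_fn(chunk)))
        offsets = np.concatenate((offsets, chunk))
        order = np.argsort(scores, kind='quicksort')[:max_results]
        scores, offsets = scores[order], offsets[order]
    return scores, offsets
```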
Scoring
-------
Finding a good scoring function, or rather trying to find one, took the most
time of this whole endeavor. As mentioned above, I planned to calculate a
cumulative sum of differences, breaking out of the computation once it exceeded
the worst of the best scores. It turns out that just calculating the whole sum
of differences at once is faster using vectorized numpy operations.
After the iteration strategy was settled, the scoring function dominated the
runtime behaviour of my program. I explored alternative ways to calculate
scores, among them convolution and covariance matrices (numpy.convolve and
numpy.cov, respectively) but found that all of them slowed my program down, and
I wasn't able to get good results from convolution. It seems that using
convolution would be the tool of choice for this problem, given that it is used
for searching subimages, and I wouldn't be surprised if the winning entry used
that approach.
Besides that I would next have tried vectorizing my inputs harder, like turning
a chunk of slices of the search spaces into a matrix, hoping that numpy would
be able to do an even better job of scoring that.
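That idea can be sketched with numpy's stride tricks: every candidate slice becomes a row of a virtual, non-copying matrix, and all rows are scored in one vectorized expression. A minimal sketch (`score_all` is a hypothetical name), assuming small non-negative digit values so the subtraction cannot overflow:

```python
import numpy as np
from numpy.lib.stride_tricks import as_strided


def score_all(buffer, pattern):
    # View every length-`pattern.size` slice of `buffer` as one matrix row.
    # as_strided creates a view, so no data is copied.
    n = buffer.size - pattern.size + 1
    windows = as_strided(buffer, shape=(n, pattern.size),
                         strides=(buffer.strides[0], buffer.strides[0]))
    # Score all candidate positions in a single vectorized expression.
    return np.abs(windows - pattern).sum(axis=1)
```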
Fewer Functions
---------------
There are parts of the code I'd like to write differently, mainly the `walk`
function, which could be split up into smaller inner functions (`simple`,
`complex` and `merge` in my case), but doing that also slowed down my program,
and while the result did read a bit better, I'm not unhappy with the way it
reads now.
Complexity
----------
The algorithm is `O(n*m)`, effectively comparing each digit of the search space
with each digit of the pattern. It doesn't copy data from the search space but
does make copies of the result arrays. During merging, at most `log_2(n)`
result arrays are held in memory at the same time.
The resulting program takes about 35 seconds to scan 1,000,000 digits of pi and
would take several hours to scan the whole 1,000,000,000 digits provided. This
makes me doubtful that it could be a winning entry, but hopefully it's an
interesting read nonetheless.
Martin Kühl <martin.kuehl@gmail.com>, 2014-05-05