A few weeks ago I was browsing Hacker News, and I noticed a post about a little online programming game called Colossal Cue Adventure. I was instantly hooked, and its three problems quickly fell to some hacked-together Python.
I was feeling pretty satisfied with myself until I noticed the little phrase at the bottom:
you can try your hand at the bonus level by typing bonus...
Well, there was no way I could ignore the challenge. I typed bonus at the prompt and waited for the next puzzle. The first bonus round went rather quickly; it was a harder version of the first problem, and a few edits to my original code gave me a solution. Then it started to get tricky. The final problem was a new version of the third challenge, and my hastily thrown-together algorithm was not up to the task.
The problem gives the following 12x12 array of digits:
* 8 1 7 8 8 5 2 9 5 9 5
8 5 1 1 5 6 9 4 4 5 2 1
7 2 3 5 2 9 2 6 9 3 9 4
9 2 5 9 8 9 5 7 7 5 9 6
2 4 6 7 1 4 2 6 6 2 5 8
2 8 1 5 3 8 4 9 7 5 2 3
2 9 3 5 6 7 2 4 9 4 2 5
6 3 1 7 8 2 3 3 6 7 9 3
2 5 7 4 2 7 8 5 5 3 5 8
5 2 9 8 3 6 1 4 9 5 6 3
4 6 9 8 5 4 9 7 6 4 6 8
2 7 7 1 9 9 7 3 7 2 2 ^
Each digit represents the cost in 'chips' to move to that 'square' in the array from a neighboring square. The cost to move to each square increases by 1 each time a move is made. You start on the * with 444 chips and must reach the ^ having spent all of them. The final square costs 5 chips but ignores the cost increase, so you must have exactly 5 chips left to move onto it.
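To make the cost rule concrete, here is a small worked example (my own illustration, not part of the original puzzle text): the first move carries no surcharge, the second move costs its digit plus 1, the third its digit plus 2, and so on.

# Moving east twice from the '*' lands on the 8 and then the 1 in the
# top row. The surcharge starts at 0 and grows by 1 per move, so the
# two moves cost 8 + (1 + 1) = 10 chips in total.
digits = [8, 1]
print sum(digit + surcharge for surcharge, digit in enumerate(digits))  # 10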
My initial attempts to stitch together some code failed, so I took some time to analyze the problem. There are a few points that jump out right away. First, it will take at least 22 moves to reach the bottom right corner from the top left. Second, a solution must have an even number of moves, since every backward move must be canceled by an extra forward move, so detours add moves in pairs. With a little analysis I found the likely number of moves required for a solution. The following code tells us the approximate[1] average cost needed for a solution given the number of moves in the solution.
UPDATE: Corrected logic error in the code below. The final move doesn't incur the move-cost increase, so we need to subtract one from the total moves when calculating the surcharge.
>>> for moves in range(22, 32, 2):
...     moves_cost = (moves - 1) * (moves - 2) / 2
...     remaining_chips = 444 - moves_cost
...     necessary_average = remaining_chips / float(moves)
...     print "%d: %f" % (moves, necessary_average)
...
22: 10.636364
24: 7.958333
26: 5.538462
28: 3.321429
30: 1.266667
Twenty-two move solutions can be ruled out because a path with an average cost greater than nine is impossible. A cursory glance at the array shows that there is no path composed of ones and twos, so thirty move solutions are out as well. Twenty-four move solutions are very unlikely, since they would require a path consisting almost exclusively of eights and nines. Twenty-six move solutions are the most likely, since the necessary average is close to five. Finally, twenty-eight move solutions are possible but much less likely than twenty-six move solutions.
So how many routes are there? Although there are only 705,432 direct routes to the bottom right corner (the number of ways to interleave eleven east moves and eleven south moves), there are 54,209,736 ways to get there in twenty-four moves and a whopping 2,372,995,728 ways to get there in twenty-six moves. Examining the twenty-eight move solutions as well adds an additional 78,007,344,984 possible routes, which means there are over 80 billion in total.
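Those counts can be reproduced with a small dynamic program that tracks how many walks of a given length end on each square (a sketch of my own, not necessarily how the figures above were originally computed):

def count_routes(length, width=12, height=12):
    # counts[(x, y)] = number of walks of the current length from the
    # top-left corner that end on square (x, y)
    counts = {(0, 0): 1}
    for _ in range(length):
        step = {}
        for (x, y), n in counts.items():
            for dx, dy in ((1, 0), (-1, 0), (0, 1), (0, -1)):
                nx, ny = x + dx, y + dy
                if 0 <= nx < width and 0 <= ny < height:
                    step[(nx, ny)] = step.get((nx, ny), 0) + n
        counts = step
    return counts.get((11, 11), 0)

for length in (22, 24, 26, 28):
    print length, count_routes(length)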
The analysis led to two important realizations:
- A Breadth-First Search Will Fail: Trying to fit 80 billion routes in memory would quickly exhaust all available RAM on the system. Even if we managed to encode a route in a single 64-bit integer, we would need 592 gigabytes of RAM to hold 80 billion routes.
- The Majority of the Moves Are East and South: A twenty-four move solution will have twenty-three east and south moves and one north or west move. Similarly, a twenty-six move solution will have two north or west moves, and a twenty-eight move solution will have three.
With the above knowledge, finding a single solution is straightforward. We need a depth-first search that favors east and south moves. First, we calculate some constants and create a list of integers from the text:
data = """
* 8 1 7 8 8 5 2 9 5 9 5
8 5 1 1 5 6 9 4 4 5 2 1
7 2 3 5 2 9 2 6 9 3 9 4
9 2 5 9 8 9 5 7 7 5 9 6
2 4 6 7 1 4 2 6 6 2 5 8
2 8 1 5 3 8 4 9 7 5 2 3
2 9 3 5 6 7 2 4 9 4 2 5
6 3 1 7 8 2 3 3 6 7 9 3
2 5 7 4 2 7 8 5 5 3 5 8
5 2 9 8 3 6 1 4 9 5 6 3
4 6 9 8 5 4 9 7 6 4 6 8
2 7 7 1 9 9 7 3 7 2 2 ^
""".strip()
chips = 444
height = len(data.split('\n'))
width = len(data.split('\n')[0].split())
data = data.replace(' ', '').replace('\n', '')  # flatten to one string of cells
start = data.index('*')  # top-left corner
goal = data.index('^')   # bottom-right corner
data = data.replace('*', '0')  # the starting square costs nothing
data = data.replace('^', '5')  # the final square has a flat cost of 5
data = [ord(c) - 48 for c in data]  # convert each character digit to an int
Then we define a recursive function to walk the tree, favoring east and south moves:
def walk(moves, position, chips, depth):
    chips -= data[position]
    if position == goal:
        if chips == 0:
            return moves  # found a path that spends every chip
        return
    elif position != start:
        chips -= depth - 1  # the per-move surcharge; '*' and '^' are exempt
    if chips < data[goal]:
        return  # we can no longer afford the final square
    x = position % width
    y = position // width
    # candidate moves, ordered to favor east and south
    movements = (
        ('e', 1, x < width - 1),
        ('s', width, y < height - 1),
        ('w', -1, x > 0),
        ('n', -width, y > 0),
    )
    for (direction, delta, allowed) in movements:
        if allowed:
            res = walk(moves + direction, position + delta, chips, depth + 1)
            if res:
                return res

print walk('', start, chips, 0)
This prints out a solution in a few milliseconds:
$ python single.py
eeeeeeeeeeesssssssssswness
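As a quick sanity check, we can replay the path against the grid and confirm that it lands on the goal with exactly zero chips remaining (a small sketch that reuses the variables from the listing above):

def replay(path):
    position, remaining = start, chips
    for depth, direction in enumerate(path, 1):
        position += {'e': 1, 'w': -1, 's': width, 'n': -width}[direction]
        remaining -= data[position]
        if position not in (start, goal):
            remaining -= depth - 1  # the surcharge; '*' and '^' are exempt
    return position == goal and remaining == 0

print replay('eeeeeeeeeeesssssssssswness')  # True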
Proud of my newfound success, I couldn't help but overreach. What would it take to generate all of the solutions? The algorithm above is fairly optimized, but modifying it to find the first 100,000 solutions made a run take more than 30 seconds. I had only searched a tiny fraction of the space, and the run time was rapidly approaching minutes.
It was time to get down and dirty with low-level optimizations[2]. A few gallons of sweat and tears later, the run for 100,000 solutions was down to about 5.5 seconds. Unfortunately, there are 43,242,017 solutions[3] to this problem, and the search space for twenty-eight move solutions is enormous. Realistically, I was looking at over 3 hours to generate all of the solutions with Python. If I was going to be successful, I had to broaden my horizons.
My first attempt was to try PyPy. With a warm cache, PyPy managed to shave that 5.5 seconds down to about 0.75 seconds: over 7x faster! My excitement about the speed improvement was dampened by a sobering realization: Python is slow! Not just a little slow. Snail riding on the back of a turtle slow.
I needed to know exactly how much speed I was sacrificing to the language gods, so I rolled up my sleeves and wrote the same algorithm in C. Much to my chagrin, the C version took only 0.04 seconds to find 100,000 solutions. That is 18x faster than the PyPy version and a whopping 134x faster than the CPython version. My beloved Python was costing me two orders of magnitude in run time.
Unwilling to be defeated, I looked for an enjoyable language that might give me performance closer to C's. Google's Go language seemed to fit the bill, so I wrote a Go version of the same algorithm. It clocked in at about 0.06 seconds: 2/3 the speed of C[4] and 12x faster than PyPy. Not too shabby for a young language.
Full Results:
| Version | Time (s) | vs C    |
| ------- | -------- | ------- |
| C       | 0.04     | n/a     |
| Go      | 0.06     | 1.45x   |
| PyPy    | 0.75     | 18.19x  |
| CPython | 5.55     | 134.53x |
I realize this is a somewhat contrived example; there are plenty of workloads where the disparity is smaller. Still, I was shocked to discover that Python can be almost two orders of magnitude slower than C for some operations. This brings up some interesting questions:
- Are the productivity gains of using something like Python really worth the performance costs?
- Are there other languages (Go perhaps?) that are a pleasure to use without sacrificing so much performance?
- Why is Python becoming a standard for bioinformatics processing, a compute-hungry field?
- Why aren't more people using PyPy?
Obviously, Python has benefits that often outweigh its performance shortcomings. It is also easy to mix Python with C, so it is possible to find the bottlenecks in a large Python code base and optimize just the slow parts. Still, it is important to choose the right language for the problem, and performance-hungry tasks should be implemented in another language.
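For example, once profiling has identified the hot loop, ctypes can call a compiled version of it directly while the rest of the program stays in Python. A hypothetical sketch (walk.so and count_solutions are illustrative names, not files from my repository):

import ctypes

# Hypothetical hot loop compiled with: gcc -O2 -shared -fPIC -o walk.so walk.c
lib = ctypes.CDLL('./walk.so')
lib.count_solutions.restype = ctypes.c_longlong
print lib.count_solutions()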
The source code for the implementations can be found on my GitHub. Suggestions for further optimizations are welcome.
[1] This is approximate because moving back to the starting square doesn't incur a move cost (it always costs 0). We can ignore this, however, because the neighboring squares are eights, so any savings we might achieve by moving back to the initial square at the beginning would be offset by the high cost of leaving it again.
[2] There are quite a few tricky little optimizations in walk.py that I won't discuss in detail. A notable one is storing the list of moves in a single 64-bit integer.
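To illustrate the trick (a sketch of the idea, not the exact encoding in walk.py): with only four directions, each move fits in two bits, so even a twenty-eight move path needs just 56 of the 64 bits.

ENCODE = {'e': 0, 's': 1, 'w': 2, 'n': 3}
DECODE = 'eswn'

def pack(moves):
    # two bits per move, with the most recent move in the low bits
    packed = 0
    for move in moves:
        packed = (packed << 2) | ENCODE[move]
    return packed

def unpack(packed, count):
    return ''.join(DECODE[(packed >> (2 * i)) & 3]
                   for i in reversed(range(count)))

path = 'eeeeeeeeeeesssssssssswness'
assert unpack(pack(path), len(path)) == path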
[3] See all.c for code to generate all of the solutions. It takes about 83 seconds on my machine.
[4] I suspect almost all of the performance benefit of C over Go is due to compiler optimizations. The 0.04 seconds was achieved using gcc 4.7. Using gcc 4.2 the C version runs in just under 0.06 seconds and is neck-and-neck with the Go version.
Stumbled on your comment on HN. I found the problem interesting enough that I bothered to explore it myself.
I think for this problem you could argue that the choice of algorithm/approach is probably as important as the language (if not more, but then you could also argue that with a performance-oriented language you wouldn't need to worry about the approach for a while...)
For me the key insight is that one need not enumerate all the potential moves since they are mostly useless.
Alternatively, one could take a dynamic programming approach where the sub-problem is to figure out how to get to the goal from a given "state". Here the state would be (chips you have left, position in the grid, depth). This 3-dimensional state space is not very large: you have essentially 444 * 144 * sqrt(444) ~= 1.4MM states. Upon "solving" the dynamic programming problem (which takes about 3 to 5 seconds in Python or PyPy on my machine, without any optimizations), you can then enumerate all the paths from this space. If, for instance, we limit the number of paths to generate/enumerate to 100,000, then the code always completes in less than 5 seconds (i.e. including the time to solve the dynamic program). The enumeration time depends on whether you want to generate the full paths or just count them. If we just count, we use about 1 to 1.5 minutes in total, whereas if we generate the paths we take about 5 minutes. On my machine there was no difference between Python and PyPy for the overall time (admittedly I have not "warmed" the PyPy JIT cache by running multiple times like you do).
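A minimal sketch of the idea (reusing data, width, height, start, goal, and chips from your listing; this is simplified, not my actual code): memoize the number of ways to finish from each (position, chips left, depth) state.

def count_from(position, chips_left, depth, memo={}):
    if position == goal:
        return 1 if chips_left == 0 else 0
    if chips_left <= 0:
        return 0  # overspent, or broke before reaching the goal
    key = (position, chips_left, depth)
    if key not in memo:
        x, y = position % width, position // width
        total = 0
        for delta, allowed in ((1, x < width - 1), (width, y < height - 1),
                               (-1, x > 0), (-width, y > 0)):
            if allowed:
                nxt = position + delta
                if nxt in (start, goal):
                    cost = data[nxt]  # '*' and '^' skip the surcharge
                else:
                    cost = data[nxt] + depth  # digit plus the surcharge
                total += count_from(nxt, chips_left - cost, depth + 1, memo)
        memo[key] = total
    return memo[key]

print count_from(start, chips, 0)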
Also worth noting that while you claim there are 43,242,017 solutions to this problem, I actually count 45,577,488. Not sure where the discrepancy comes from, but I include code that "replays" my generated paths to ensure that they are valid. Maybe you would have some insights into this.
There are perhaps many low-hanging optimization fruits in my implementation, but I did not bother to go over them as I already achieve comparable performance to your C solution for counting all possible paths. A few directions: perhaps I should linearize the state space of the dynamic program into a list, as this might allow faster lookups? Also, in the code that generates paths, is it really kosher to use recursive generators? Maybe Python is just slow...
My thoughts on your questions:
I think fundamentally you are comparing two different things: programmer productivity, which is really not a runtime metric, with program runtime performance. Obviously the conversion factor between these two metrics depends on the specifics of the application area and workflows. One could argue that the choice of Python in many areas is an indication that the economics of computing today does not place too much of a premium on performance but rather on the speed of iteration.
I guess Go could be one language in this regard. I have recently been exploring D, and IMO (compared to C++, which I have also used for many years) it feels more pleasurable and in many ways comparable in ergonomics to Python. Widespread adoption, on the other hand...
Isn't the whole point of Python in the scientific Python community to be a sort of API to the code doing the computation, which is usually written in performance-focused languages like C++? That way you get an environment in which you can do a lot of iteration without worrying about the arcana of getting compilation with your dependent modules going like you would in some other languages. IIRC, there have been at least two different projects (CINT, Cling) at CERN to get a sort of interactive C++ environment going for quick iteration, so maybe it does not have to be Python. I don't know much about Go, but I wonder whether it would be well suited for scientific workflows too.
A lot of reasons that revolve around having the guarantee that all your dependencies are compatible with PyPy. Part of the effort being led by the PyPy team is a standard extension mechanism, HPy, which would mean that extension modules targeting the HPy API would be compatible with other variants of Python out of the box. Also, as with all optimization efforts, your actual mileage may vary depending on your workflows, such that it might not be worth the effort and the downsides (e.g. lagging behind the standard Python releases; at present PyPy is on 3.9 while CPython is on 3.11).