@vishvananda
Last active March 21, 2023
Python is Slow

Introduction

A few weeks ago I was browsing Hacker News, and I noticed a post about a little online programming game called Colossal Cue Adventure. I was instantly hooked and three problems quickly fell to some hacked-together Python.

I was feeling pretty satisfied with myself until I noticed the little phrase at the bottom:

you can try your hand at the bonus level by typing bonus...

Addiction

Well, there was no way I could ignore the challenge. I typed bonus at the prompt and waited for the next puzzle. The first bonus round went rather quickly; it was a harder version of the first problem. A few edits to my original code and I had a solution. Then it started to get tricky. The final problem was a new version of the third challenge, and my hastily thrown-together algorithm was not up to the task.

The Problem

The problem gives the following 12x12 array of digits:

* 8 1 7 8 8 5 2 9 5 9 5
8 5 1 1 5 6 9 4 4 5 2 1
7 2 3 5 2 9 2 6 9 3 9 4
9 2 5 9 8 9 5 7 7 5 9 6
2 4 6 7 1 4 2 6 6 2 5 8
2 8 1 5 3 8 4 9 7 5 2 3
2 9 3 5 6 7 2 4 9 4 2 5
6 3 1 7 8 2 3 3 6 7 9 3
2 5 7 4 2 7 8 5 5 3 5 8
5 2 9 8 3 6 1 4 9 5 6 3
4 6 9 8 5 4 9 7 6 4 6 8
2 7 7 1 9 9 7 3 7 2 2 ^

Each digit represents the cost in 'chips' to move to that 'square' in the array from a neighboring square. The cost to move to each square increases by 1 each time a move is made. You start on the * with 444 chips and must reach the ^ having spent all of them. The final square costs 5 chips but ignores the per-move increase, so you must have exactly 5 chips left to move onto the final square.
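
To make the rule concrete, here is a minimal sketch of the cost of the k-th move as this writeup interprets it (a hypothetical helper, not code from the gist; the start-square exception comes from footnote [1] below):

def move_cost(digit, k, is_start=False, is_goal=False):
    # k is the 1-indexed move number
    if is_start:
        return 0            # returning to the * square always costs 0
    if is_goal:
        return digit        # the ^ square always costs its digit (5)
    return digit + (k - 1)  # everywhere else the surcharge grows by 1 per move

For example, stepping onto an 8 as the first move costs 8 chips, but stepping onto that same 8 as the fourth move costs 8 + 3 = 11 chips.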

Analysis

My initial attempts to stitch together some code failed, so I took some time to analyze the problem. There are a few points that jump out right away. First, it will take at least 22 moves to reach the bottom-right corner from the top left. Second, a solution must have an even number of moves: each move flips the parity of x + y, and the start and goal squares have the same parity. With a little analysis I found the likely number of moves required for a solution. The following code tells us the approximate[1] average cost per move needed for a solution, given the number of moves in the solution.

UPDATE: Corrected logic error in code below. The final move doesn't have a move cost, so we need to subtract one from the total moves when calculating cost.

>>> for moves in range(22, 32, 2):
...     moves_cost = (moves - 1) * (moves - 2) / 2
...     remaining_chips = 444 - moves_cost
...     necessary_average = remaining_chips / float(moves)
...     print "%d: %f" % (moves, necessary_average)
...
22: 10.636364
24: 7.958333
26: 5.538462
28: 3.321429
30: 1.266667

Twenty-two move solutions can be ruled out because a path with an average square cost greater than nine (the largest digit) is impossible. A cursory glance at the array shows that there is no path composed of ones and twos, so thirty move solutions are out as well. Twenty-four move solutions are very unlikely since they would require a path consisting almost exclusively of eights and nines. Twenty-six move solutions are the most likely since the necessary average is close to five. Finally, twenty-eight move solutions are possible but much less likely than twenty-six move solutions.

So how many routes are there? Although there are only 705,432 direct routes to the bottom right corner (combinations involving eleven east moves and eleven south moves), there are 54,209,736 ways to get there in twenty-four moves and a whopping 2,372,995,728 ways to get there in twenty-six moves. Examining the twenty-eight move solutions as well will add an additional 78,007,344,984 possible routes, which means there are over 80 billion in total.
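
These counts can be sanity-checked with a short dynamic program over walk lengths (a sketch, not code from the gist; it counts every walk of the given length that ends on the goal, including walks that touch the goal earlier, so the larger figures may differ slightly if those are excluded):

WIDTH = HEIGHT = 12

def count_walks(length):
    # walks[p] = number of walks of the current length from the start to p
    walks = [0] * (WIDTH * HEIGHT)
    walks[0] = 1  # start in the top-left corner
    for _ in range(length):
        nxt = [0] * (WIDTH * HEIGHT)
        for pos, ways in enumerate(walks):
            if not ways:
                continue
            x, y = pos % WIDTH, pos // WIDTH
            if x < WIDTH - 1:
                nxt[pos + 1] += ways       # east
            if x > 0:
                nxt[pos - 1] += ways       # west
            if y < HEIGHT - 1:
                nxt[pos + WIDTH] += ways   # south
            if y > 0:
                nxt[pos - WIDTH] += ways   # north
        walks = nxt
    return walks[-1]  # walks ending in the bottom-right corner

for length in (22, 24, 26):
    print(count_walks(length))

For length 22 this reproduces the 705,432 figure exactly, since a walk whose length equals the Manhattan distance must be monotone and cannot visit the goal early.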

The analysis led to two important realizations:

  1. A Breadth First Search Will Fail: Trying to fit 80 billion routes in memory will quickly exhaust all available RAM on the system. Even if we managed to encode a route in a single 64-bit integer, we would need 592 gigabytes of RAM to hold 80 billion routes.

  2. The Majority of the Moves are East and South: A twenty-four move solution will be twenty-three east and south moves, with one north or west. Similarly, a twenty-six move solution will have two north or west moves and a twenty-eight move solution will have three.

Solution

With the above knowledge, finding a single solution is straightforward. We need a depth first search that favors east and south moves. First we calculate some constants and create a list of integers from the text:

data = """
* 8 1 7 8 8 5 2 9 5 9 5
8 5 1 1 5 6 9 4 4 5 2 1
7 2 3 5 2 9 2 6 9 3 9 4
9 2 5 9 8 9 5 7 7 5 9 6
2 4 6 7 1 4 2 6 6 2 5 8
2 8 1 5 3 8 4 9 7 5 2 3
2 9 3 5 6 7 2 4 9 4 2 5
6 3 1 7 8 2 3 3 6 7 9 3
2 5 7 4 2 7 8 5 5 3 5 8
5 2 9 8 3 6 1 4 9 5 6 3
4 6 9 8 5 4 9 7 6 4 6 8
2 7 7 1 9 9 7 3 7 2 2 ^
""".strip()
chips = 444
height = len(data.split('\n'))
width = len(data.split('\n')[0].split())
data = data.replace(' ', '').replace('\n', '')
start = data.index('*')             # index of the * square
goal = data.index('^')              # index of the ^ square
data = data.replace('*', '0')       # the start square costs nothing
data = data.replace('^', '5')       # the goal square always costs 5
data = [ord(c) - 48 for c in data]  # convert digit characters to ints

Then we define a recursive method to walk through the tree that favors east and south moves:

def walk(moves, position, chips, depth):
    chips -= data[position]
    if position == goal:
        if chips == 0:
            return moves       # spent exactly 444 chips: solution found
        return                 # wrong chip count: dead end
    elif position != start:
        chips -= depth - 1     # move number depth costs an extra depth - 1

    if chips < data[goal]:
        return                 # not enough chips left to enter the goal

    x = position % width
    y = position // width
    # east and south are listed first so the search favors them
    movements = (
        ('e', 1, x < width - 1),
        ('s', width, y < height - 1),
        ('w', -1, x > 0),
        ('n', -width, y > 0),
    )

    for (direction, delta, allowed) in movements:
        if allowed:
            res = walk(moves + direction, position + delta, chips, depth + 1)
            if res:
                return res

print walk('', start, chips, 0)

This prints out a solution in a few milliseconds:

$ python single.py
eeeeeeeeeeesssssssssswness

Disillusionment

Proud of my newfound success, I couldn't help but overreach. What would it take to generate all of the solutions? The algorithm above is fairly optimized, but modifying it to find the first 100,000 solutions made a run take more than 30 seconds. I had only searched a tiny fraction of the space, and the run time was rapidly approaching minutes.

It was time to get down and dirty with low-level optimizations[2]. A few gallons of sweat and tears later, the run for 100,000 solutions was down to about 5.5 seconds. Unfortunately there are 43,242,017 solutions[3] to this problem, and the search space for twenty-eight move solutions is enormous. Realistically, I was looking at over 3 hours to generate all of the solutions with Python. If I was going to be successful, I had to broaden my horizons.

My first attempt was to try PyPy. With a warm cache, PyPy managed to shave that 5.5 seconds down to about 0.75 seconds: over 7x faster! My excitement about the speed improvement was dampened by a sobering realization: Python is slow! Not just a little slow. Snail riding on the back of a turtle slow.

Benchmarking

I needed to know exactly how much speed I was sacrificing to the language gods, so I rolled up my sleeves and wrote the same algorithm in C. Much to my chagrin, the C version took only 0.04 seconds to find 100,000 solutions. That is 18x faster than the PyPy version and a whopping 134x faster than the CPython version. My beloved Python was costing me two orders of magnitude in run time.

Unwilling to be defeated, I looked for an enjoyable language that might give me performance closer to C. Google's Go language seemed to fit the bill, so I wrote a Go version of the same algorithm. It clocked in at about 0.06 seconds: two-thirds the speed of C[4] and 12x faster than PyPy. Not too shabby for a young language.

Full Results:

| Version | Time (s) | vs C    |
| ------- | -------- | ------- |
| C       | 0.04     | n/a     |
| Go      | 0.06     | 1.45x   |
| PyPy    | 0.75     | 18.19x  |
| CPython | 5.55     | 134.53x |

Conclusion

I realize this is a somewhat contrived example; there are plenty of workloads where the disparity is smaller. Still, I was shocked to discover that Python can be almost two orders of magnitude slower than C for some operations. This brings up some interesting questions:

  • Are the productivity gains of using something like Python really worth the performance costs?
  • Are there other languages (Go perhaps?) that are a pleasure to use without sacrificing so much performance?
  • Why is Python becoming a standard for bioinformatics processing, a field which is compute-hungry?
  • Why aren't more people using PyPy?

Obviously, Python has benefits that often outweigh its performance shortcomings. It is also easy to mix Python with C, so it is possible to find bottlenecks and optimize just the slow parts of a large Python code base. Still, it is important to choose the right language for the problem, and performance-hungry tasks should be implemented in another language.

The source code for the implementations can be found on my GitHub. Suggestions for further optimizations are welcome.

[1] This is approximate because moving back to the starting square doesn't incur a move cost (it always costs 0). However, we can ignore this: the neighboring squares are eights, so any savings we might achieve by moving back to the initial square at the beginning would be offset by the high cost of moving back.

[2] There are quite a few tricky little optimizations in walk.py that I won't discuss in detail. A notable one is storing the list of moves in a single 64-bit integer.

[3] See all.c for code to generate all of the solutions. It takes about 83 seconds on my machine.

[4] I suspect almost all of the performance benefit of C over Go is due to compiler optimizations. The 0.04 seconds was achieved using gcc 4.7. Using gcc 4.2 the C version runs in just under 0.06 seconds and is neck-and-neck with the Go version.

@bostikforever

Stumbled on your comment on hn. I have found the problem interesting enough that I have bothered to explore it myself.

I think for this problem you could argue that the choice of algorithm/approach is probably as important as the language (if not more, but then you could also argue that with a performance oriented language you wouldn't need to worry about the approach for a while...)

For me the key insight is that one need not enumerate all the potential moves since they are mostly useless.

Alternatively one could take a dynamic programming approach where the sub-problem is to figure out how to get to the goal from a given "state". Here the state would be (chips you have left, position in the grid, depth). This 3-dimensional state space is not very large: essentially 444 * 144 * sqrt(444) ≈ 1.4 million states. Upon "solving" the dynamic programming problem (takes about 3 to 5 seconds in Python or PyPy on my machine, without any optimizations), you can then enumerate all the paths from this space. If for instance we limit the number of paths to generate/enumerate to 100,000, then the code always completes in less than 5 seconds (i.e. including the time to solve the dynamic program). The enumeration time depends on whether you want to generate the full paths or just count them. If we just count, we use about 1 to 1.5 minutes in total, whereas if we generate the paths we take about 5 minutes. On my machine there was no difference between Python and PyPy for the overall time (admittedly I have not "warmed" the PyPy JIT cache by running multiple times like you do).
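
A rough sketch of that approach (a reconstruction of the idea described above, not bostikforever's actual code; Python 3, reusing the data/start/goal/width/height variables from the article's parsing code, and following the variant where the goal may not be visited mid-path):

from functools import lru_cache

@lru_cache(maxsize=None)
def solvable(chips, pos, depth):
    # can we land on the goal with exactly 0 chips left, starting
    # from pos with `chips` remaining after `depth` moves?
    if pos == goal:
        return chips == 0
    if chips < data[goal]:
        return False  # cannot even afford the final square
    x, y = pos % width, pos // width
    for delta, ok in ((1, x < width - 1), (width, y < height - 1),
                      (-1, x > 0), (-width, y > 0)):
        if ok:
            nxt = pos + delta
            # start and goal squares ignore the per-move surcharge
            cost = data[nxt] + (0 if nxt in (start, goal) else depth)
            if chips >= cost and solvable(chips - cost, nxt, depth + 1):
                return True
    return False

Once the cache is populated, enumerating paths means walking only through states for which solvable() returns True.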

Also worth noting that while you claim that there are 43,242,017 solutions to this problem, I actually count 45,577,488. Not sure where the discrepancy comes from, but I include code that "replays" my generated paths to ensure that they are valid. Maybe you would have some insights into this.

There are perhaps many low-hanging optimization fruits in my implementation, but I do not bother to go over them as I already achieve comparable performance to your C solution for counting all possible paths. A few directions: perhaps I should linearize the state space of the dynamic program into a list, as this might give faster lookups? Also, in the code that generates paths, is it really kosher to use recursive generators? Maybe Python is just slow...

My thoughts on your questions:

Are the productivity gains of using something like Python really worth the performance costs?

I think fundamentally you are comparing two different things: programmer productivity, which is really not a runtime metric, with program runtime performance. Obviously the conversion factor between these two metrics would depend on the specifics of the application area and workflows. One could argue that the choice of Python in many areas is an indication that the economics of computing today does not really place too much of a premium on performance but on the speed of iteration.

Are there other languages (Go perhaps?) that are a pleasure to use without sacrificing so much performance?

I guess Go could be one language in this regard. I have recently been exploring D and IMO (compared to C++, which I have also used for many years) it feels more pleasurable and in many ways comparable in ergonomics to Python. Widespread adoption on the other hand...

Why is Python becoming a standard for bioinformatics processing, a field which is compute hungry?

Isn't the whole point of Python in the scientific Python community to be a sort of API to the code doing the computation, which is usually written in performance-focused languages like C++? That way you get an environment in which you can do a lot of iteration without worrying about the arcana of getting compilation going with your dependent modules like you would in some other languages. IIRC, there have been at least two different projects (CINT, Cling) at CERN to get a sort of interactive C++ environment going for quick iteration, so maybe it does not have to be Python. I don't know much about Go, but I wonder whether it would be well suited for scientific workflows too.

Why aren't more people using PyPy?

A lot of reasons that revolve around having the guarantee that all your dependencies are compatible with PyPy. Part of the effort being led by the PyPy team is a standard extension mechanism HPy which would mean that extension modules for python implemented targeting the HPy API would be compatible with other variants of python out of the box. Also as with all optimization efforts your actual mileage may vary depending on your workflows such that it might not be worth the effort and downsides (e.g. lagging behind the standard python releases, at present PyPy is on 3.9 while CPython is on 3.11).

@vishvananda (Author) commented Mar 14, 2023

Hey, thanks for the comment @bostikforever. It has been a while since I worked on this. The answers to your questions are along the lines of what I was thinking when I wrote the original post. A few thoughts on your solution: I have tried to solve this problem with dynamic programming and I have never managed to make it faster than the search I ultimately settled on. You claim to solve the problem in 3-5 seconds, but it isn't solved in any meaningful sense until you enumerate or count the paths, and this is where most of the time is spent in any solution. The current python3 implementation in https://github.com/vishvananda/walk counts 100,000 solutions in 1.3 seconds. The Go solution (currently fastest) counts 100,000 paths in 0.03 seconds and can count all paths in 28.8 seconds on my macbook.

@bostikforever

The dynamic programming solution would count or generate 100k solutions in about 5 seconds, but would count all 40+ million solutions in a little over 60 seconds (or generate them in 5 minutes). Of course you still have better performance with C or Go but my point is that while search is right for a single solution (or even up to N solutions) it's IMO not the best for generating or enumerating all solutions (irrespective of the language you use).

Do you know why we have a discrepancy in number of solutions?

@vishvananda (Author)

One should be able to have a smaller search space through dynamic programming, but I couldn't ever extract any savings from it. FWIW I also experimented with heuristics to exit the search early. For example, you can truncate the current branch of the DFS if the average cost of the remaining moves needed rises above 10, although I never managed to find any savings by including calculations like that. I suspect this is an artifact of the rarity of backtracking moves and the small cost of completing the search.

@vishvananda (Author) commented Mar 14, 2023

@bostikforever good point, it is possible that your method is faster for generating all solutions. Not sure why there is a discrepancy. One possibility is that moving back to the initial square does not incur the additional cost (i.e. it should always cost zero no matter how many moves you have made). I don't think you account for that in your code, although I would expect that to yield more solutions, not less. Is it possible your code is generating duplicate solutions? I suppose we could print out all solutions and sort them to figure out what might be missing. Maybe my maximum-of-two-backtracks assumption is incorrect and you are finding some 28 or 30 move solutions?

@bostikforever

I actually count more solutions than you. However, I also account for the cost of the final square always being zero. FWIW, I think the single solution that your code generated was a prefix of the solutions I generated. Also, I think that I do not generate duplicate solutions: since the state space generated by the dynamic program only contains unique states, the subsequent search of the valid states in the state space to generate paths can only yield unique solutions. I think your analysis is right and there cannot be 30 move solutions. Maybe there are 28 move solutions? I can provide a dump of all solutions I generated and we can do a diff to figure out counter-examples either way.

@vishvananda (Author) commented Mar 14, 2023

@bostikforever I went ahead and generated the lists and found the discrepancies. It is as I expected: you are adding extra cost for the first square and last square. Your verify method, for example, should look like this:

    for step in inner_steps:
        path_chips -= data[step]
        if step != start and step != goal:
            path_chips -= add_cost
        add_cost += 1

That said, I have an issue in my solution as well. Given the discounting due to double visiting start and goal, there are some 28 move solutions which I was not counting. With 28 move solutions included, there are 43,242,017 solutions. Adding the 28 move solutions (3 backtracks) makes the search significantly slower, so it is probably an even bigger win for your solution if you can modify it to handle the cost changes. With 3 backtracks the Go version takes over 6 minutes. I can get that down by tracking a minimum remaining cost and truncating the search early when the minimum remaining cost is greater than the chips remaining, but I think a Go implementation of your solution would end up being faster in this case. I might try to tackle it at some point.

Here is an example of a solution which you find which is incorrect
eeeeeeeeeeesssnsssssssssns
And here is one i find that you do not:
eewweeeeeeeeeessssesssssss

@bostikforever

@vishvananda I'm not sure about the verify method. Look at https://gist.github.com/bostikforever/62b2f551566383f385c4a307482e5425#file-improv-py-L118-L120: I already exclude the first and last squares (hence the name inner_squares) in the verify method. Is there something I am missing? Also, I start the add_cost at 0 so that for the first step (i.e. going to the first inner square) there is no extra cost. And I assert at the end that we are left with only the cost to cover moving to the final square (but with no additional cost).

I'll go over your counterexamples later.

Interesting bit about the change to handle 28 move solutions. I guess the win there comes from the fact that at an increased depth, you have exponentially more potential paths, but only a small number of incremental paths.

@vishvananda (Author)

The differences are when you double visit the first or last square. The point is that the first and last squares' costs do not increase in value with each move, so in my example eeww, move 4, which lands on the initial square again, has a cost of 0, not 4.

@vishvananda (Author)

Well, there are 30 move solutions as well. I modified your code to be able to find them and it looks like there are actually 68,285,670 solutions. The modification I made is:

                for neighbor in neighbors(pos):
                    neighbor_chip = target_chip - data[neighbor]
                    if neighbor != start and neighbor != goal:
                        neighbor_chip -= depth
                    neighbor_key = (neighbor_chip, neighbor, depth + 1)

I found another bug in my code which was preventing double visiting the last node. I now get the same number of solutions as the modified code and it takes 102 seconds.

@bostikforever

Aha! I totally didn't realise this was a constraint. Looking at your problem description again it still isn't obvious that this should be the case. Still, I'll update my code to reflect this extra constraint when I get the chance.

@bostikforever

I just saw your second message that you have updated the logic in my version already to meet the constraints. How did you bring down the running time of the Go version and still handle 30 step paths?

@vishvananda (Author)

Yes, the description from the original problem was very unclear on what should happen in this case, so I guess we can just pick one and call it correct. IIRC I tested a solution on the original problem that backtracked to the start square and it always cost zero. I didn't test the final square, so there are perhaps 3 interpretations of what happens if you hit the bottom right with more than 5 chips:

  1. it immediately fails (this is what my first implementation did)
  2. it subtracts 5 chips and you can keep going (this is the modified implementation)
  3. it subtracts 5 + depth chips and you can keep going (this is what your solution did, I believe)

The optimization looks like this:
func walk(moves uint64, position int, chips int, depth int, backs int, min int) {
  chips -= intdata[position]
  if position == goal_y*width+goal_x {
    if chips == 0 {
      solutions += 1
      return
    }
  } else if position != start_y*width+start_x {
    chips -= int(depth - 1)
  }

  if chips-min < 5 {
    return
  }

  var next_depth int = depth + 1
  moves <<= 2
  var x int = position % 12
  if x < width-1 {
    // moving closer means that the minimum cost decreases by next_depth
    walk(moves+e, position+1, chips, next_depth, backs, min-next_depth)
  }
  if position < height*(width-1) {
    // moving closer means that the minimum cost decreases by next_depth
    walk(moves+s, position+width, chips, next_depth, backs, min-next_depth)
  }
  if backs < 4 {
    if x > 0 {
      // moving farther means that the remaining steps all increase by one; simplified to 45 - depth
      walk(moves+w, position-1, chips, next_depth, backs+1, min+45-depth)
    }
    if position > width-1 {
      // moving farther means that the remaining steps all increase by one; simplified to 45 - depth
      walk(moves+n, position-width, chips, next_depth, backs+1, min+45-depth)
    }
  }
}

// minimum cost from starting square to one of the two second-to-last squares is 22 * 21 / 2 = 231
walk(0, 0, start_chips, 0, 0, 231)

There are some magic numbers in there. 231 is the minimum cost of moving to one of the squares next to the goal. The (45 - depth) is the penalty for backtracking, which I hand-calculated. Note that the min cost is slightly too high because it doesn't account for the savings of double or triple visiting the final square, so initially I thought I would have to modify that number a bit, but it gets the right number of solutions.

@bostikforever

Now that I think about it I prefer the solutions that don't visit the goal state twice.

@vishvananda (Author)

I re-implemented your solution in go and it is much faster. It generates the solutions in 0.2 seconds, can count them in 8 seconds and can print them all in about 60 seconds. I think it might even be possible to go faster by aggregating the count and/or the paths as part of the initial solution generation. I'll try that next.

@vishvananda (Author)

I can now generate the count at the same time, so counting the paths takes less than 0.3 seconds. I experimented with generating the paths on the first pass, but it is too inefficient. There is an interesting way to optimize the generation a little bit that might work. Mostly, generating the paths uses too much memory when it branches out to all 144 squares. If we do the search in both directions (from start and goal), it might make the list small enough to work.

@bostikforever

Interesting work. I have not had time to revisit and implement in a compiled language (C++ or D), but it's interesting to see the result.

Some curiosities:

  1. I would not expect a large difference in time between solving the dp and counting separately, vs doing both in a single step.
  2. The Python solution does not struggle with memory, so it's surprising that this re-implementation does.

For 1, I guess it makes sense since a lot of paths share the same suffix. It didn't occur to me to exploit this in the count function via memoisation (which is essentially what you do by computing the counts upfront). I think if this is done then the timings should be similar.

For 2, I think your solution struggles with memory because it is written such that at each level you have all the full path suffixes from that level to the end, so your total memory cost at the top level of gen_path is basically the whole tens of millions of paths, which is already multiple gigabytes. On the other hand, the Python solution searches in a sort of depth-first manner, yielding at the bottom of the path, and never holding more than a single path in memory (at least not within gen_path). I think that is the way to go, but it might be a bit ugly if you don't have an equivalent of Python generators: basically you would have to do the work of converting to string and/or writing the path to file in your base case.

@bostikforever commented Mar 16, 2023

I have added a solution that does the count with memoization, and the solving + count now takes only 5 seconds. Talk about low-hanging optimization fruit! (I also updated both versions to only consider paths that don't visit the goal in the middle. Visiting start twice is still allowed; this brings down my initial count of 45 million to about 43 million.)
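
The memoized count is a small change to the same state recursion (again a sketch of the idea, not the actual gist code; it reuses the hypothetical state definition from the earlier sketch and the article's data/start/goal/width/height variables):

from functools import lru_cache

@lru_cache(maxsize=None)
def count_from(chips, pos, depth):
    # number of valid completions from this state; shared suffixes are
    # computed once and reused, which is where the speedup comes from
    if pos == goal:
        return 1 if chips == 0 else 0
    if chips < data[goal]:
        return 0
    x, y = pos % width, pos // width
    total = 0
    for delta, ok in ((1, x < width - 1), (width, y < height - 1),
                      (-1, x > 0), (-width, y > 0)):
        if ok:
            nxt = pos + delta
            cost = data[nxt] + (0 if nxt in (start, goal) else depth)
            if chips >= cost:
                total += count_from(chips - cost, nxt, depth + 1)
    return total

print(count_from(444, start, 0))  # total count over all path lengths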

@vishvananda (Author) commented Mar 16, 2023

@bostikforever ok I'm deep down the rabbit hole now. I managed to get it to generate the strings for all of the solutions in about 12 seconds by building in both forward and reverse: https://gist.github.com/vishvananda/a39557090df00a90cdbb16a8369bd7ce

EDIT: added some concurrency when constructing the paths at the end and it runs in 8-10 seconds now.

@bostikforever

Interesting choice to split the paths into two parts for memoization. That must have helped with the memory too!
To be honest, at this point the Go is a bit too much for me to follow, so I can't really comment on any obvious improvements.

Concurrency is a whole different kettle of fish, probably needs to be designed into the approach ab initio to take full advantage.

@vishvananda (Author) commented Mar 18, 2023

@bostikforever ok, one more update for you. I converted the Go approach back to Python to see how it compares. It generates all of the paths in integer form in 42 seconds, and with an optimized string conversion it clocks in at about 2 minutes. With PyPy, I get the strings in about 60 seconds. https://gist.github.com/vishvananda/686e9e08749e6c6b772ffa60e13bb0ab. The newest Go version can generate strings in about 6.5 seconds now.

@bostikforever

@vishvananda I have added a few updates. In general, I was not convinced that complicated was the way to go.

I simplified the solution in improve_simplify to not call to_moves as a separate step, by noting that we already know the moves in the neighbors generator, so we can just store that in the state. This leads to an uptick in memory usage from ~300MB to ~450MB, but it brings down the runtime to about 90 seconds (vanilla Python 3.11; I have not tried the changes on PyPy). This is great because I can keep the solution conceptually simple, but still efficient (wrt time and memory, IMO).
In improve_simplify_with_lengthen, I investigated the effect of lengthening the state by concatenating states. The conclusion was that I cannot really do more than 2-move fragments before I start needing an unreasonable (IMO) amount of memory to store the state. For the 2-move fragment solution, the memory usage triples to about 1.4GB (from 450MB), while the running time drops to about 45 seconds. Again, this is vanilla Python.
improve_simplify_no_gen is essentially improve_simplify but without using recursive generators; instead I pass in a callback for the action to perform at the bottom of the recursion. This version is more efficient than the generator variant in run time and takes only about 60-70 seconds. The memory profile is similar to improve_simplify.
improve_simplify_no_gen_with_lengthen again is essentially improve_simplify_with_lengthen but using a callback instead of generators for recursion. In this version we have a similar memory profile as the generator version but use about 30-35 seconds of runtime.

It is kind of sad for me that the generator solution has about a 30% - 50% overhead, as I like the API that approach provides better than the callback one.

If you are still up for it, I would be curious to see the impact of efficient string generation on these and/or the effect of running via PyPy. I am also curious about how this simple approach would fare faithfully translated to a compiled language; I expect there would be significant gains in string manipulation. Maybe I should bite the bullet and try a C++/D solution. (The generator-versus-callback contrast is sketched below.)
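
The generator-versus-callback overhead is easy to see in miniature (a toy example, not the gist code): every value produced at the bottom of a recursive generator is re-yielded through each enclosing frame, while a callback fires directly at the leaf.

def leaves_gen(n):
    # recursive generator: values bubble up through n frames
    if n == 0:
        yield ''
        return
    for suffix in leaves_gen(n - 1):
        yield 'e' + suffix
        yield 's' + suffix

def leaves_cb(n, prefix, emit):
    # callback style: emit fires once at the bottom of the recursion
    if n == 0:
        emit(prefix)
        return
    leaves_cb(n - 1, prefix + 'e', emit)
    leaves_cb(n - 1, prefix + 's', emit)

collected = []
leaves_cb(3, '', collected.append)
assert sorted(collected) == sorted(leaves_gen(3))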

@vishvananda (Author) commented Mar 19, 2023

Fascinating. You're right: with plain Python, just storing the paths in the tree as strings is much more effective (see walk_fast_strings.py in https://gist.github.com/vishvananda/686e9e08749e6c6b772ffa60e13bb0ab). I can generate all of the solutions in native Python in ~25 seconds instead. No savings running the string solution on PyPy. It is interesting that storing the paths as integers and then converting them at the end is slower in native Python. I tried a similar approach in Go where I just store the paths as byte arrays instead of integers, and it completes in more like 10 seconds instead of less than 5 when using integers.

@vishvananda (Author) commented Mar 20, 2023

I can get the solution generation in Python down to ~14 seconds by applying an aggressive cutoff in the search. I don't have a good method for finding an optimal cutoff value, so that is a hand-selected cutoff of < 260 chips remaining in the forward search and > 290 in the reverse. A very safe cutoff of, say, 100 forward and (start_chips - 100) in the reverse also saves some time, but just a few seconds.

@vishvananda (Author)

I implemented the string version in Go and it generates all solutions in about 3.5 seconds (string.go in https://gist.github.com/vishvananda/a39557090df00a90cdbb16a8369bd7ce). Single-threaded it takes closer to 8 seconds, so Python is actually pretty competitive here.
I did some optimizations to the fastest Go version, which stores the path as an integer, and it finishes in about 1.8 seconds now. I do have a few assumptions that I'm making to convert the path from an int to a string: it only works if max <= 32, if the linear distance between source and goal is even, and if the goal is below and to the right of the start. I don't imagine it's possible to get much faster than 1.8 seconds. Still, it's pretty impressive that we managed to get the Python version down to 14 seconds.
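
For the curious, the 2-bits-per-move packing mentioned in footnote [2] and used in the Go code above looks roughly like this in Python (a reconstruction, not the gist code; this variant spends one sentinel bit to record the path length, so it holds up to 31 moves, whereas a fixed-length scheme fits 32):

DIRS = 'eswn'

def pack(path):
    value = 1  # sentinel high bit marks where the path begins
    for move in path:
        value = (value << 2) | DIRS.index(move)
    return value

def unpack(value):
    moves = []
    while value > 1:
        moves.append(DIRS[value & 3])
        value >>= 2
    return ''.join(reversed(moves))

assert unpack(pack('eeswn')) == 'eeswn'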

@bostikforever

Indeed it's impressive that with the change of approach you've gone from 30 seconds to 1.8 seconds on the go solution. I guess the premise of your original post still holds, but the numbers would look very different now.

Sorry I have not contributed further iterations (or tried a version in a compiled language). I don't think I would manage. This is a good place to end the thread!
