@vishvananda
Last active March 21, 2023
Python is Slow

Introduction

A few weeks ago I was browsing Hacker News, and I noticed a post about a little online programming game called Colossal Cue Adventure. I was instantly hooked and three problems quickly fell to some hacked-together Python.

I was feeling pretty satisfied with myself until I noticed the little phrase at the bottom:

you can try your hand at the bonus level by typing bonus...

Addiction

Well, there was no way I could ignore the challenge. I typed bonus at the prompt and waited for the next puzzle. The first bonus round went rather quickly; it was a harder version of the first problem. A few edits to my original code and I had a solution. Then it started to get tricky. The final problem was a new version of the third challenge, and my hastily thrown-together algorithm was not up to the task.

The Problem

The problem gives the following 12x12 array of digits:

* 8 1 7 8 8 5 2 9 5 9 5
8 5 1 1 5 6 9 4 4 5 2 1
7 2 3 5 2 9 2 6 9 3 9 4
9 2 5 9 8 9 5 7 7 5 9 6
2 4 6 7 1 4 2 6 6 2 5 8
2 8 1 5 3 8 4 9 7 5 2 3
2 9 3 5 6 7 2 4 9 4 2 5
6 3 1 7 8 2 3 3 6 7 9 3
2 5 7 4 2 7 8 5 5 3 5 8
5 2 9 8 3 6 1 4 9 5 6 3
4 6 9 8 5 4 9 7 6 4 6 8
2 7 7 1 9 9 7 3 7 2 2 ^

Each digit represents the cost in 'chips' to move to that 'square' in the array from a neighboring square. The cost to move to each square increases by 1 each time a move is made. You start on the * with 444 chips and must reach the ^ having spent all of them. The final square costs 5 chips but ignores the per-move increase, so you must have exactly 5 chips left to move onto the final square.
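
To make the rule concrete, here is a minimal sketch of the cost of the k-th move as this writeup interprets it (a hypothetical helper, not code from the gist; the start-square exception comes from footnote [1] below):

def move_cost(digit, k, is_start=False, is_goal=False):
    # k is the 1-indexed move number
    if is_start:
        return 0            # returning to the * square always costs 0
    if is_goal:
        return digit        # the ^ square always costs its digit (5)
    return digit + (k - 1)  # everywhere else the surcharge grows by 1 per move

For example, stepping onto an 8 as the first move costs 8 chips, but stepping onto that same 8 as the fourth move costs 8 + 3 = 11 chips.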

Analysis

My initial attempts to stitch together some code failed, so I took some time to analyze the problem. There are a few points that jump out right away. First, it will take at least 22 moves to reach the bottom-right corner from the top left. Second, a solution must have an even number of moves: each move flips the parity of x + y, and the start and goal squares have the same parity. With a little analysis I found the likely number of moves required for a solution. The following code tells us the approximate[1] average cost per move needed for a solution, given the number of moves in the solution.

UPDATE: Corrected logic error in code below. The final move doesn't have a move cost, so we need to subtract one from the total moves when calculating cost.

>>> for moves in range(22, 32, 2):
...     moves_cost = (moves - 1) * (moves - 2) / 2
...     remaining_chips = 444 - moves_cost
...     necessary_average = remaining_chips / float(moves)
...     print "%d: %f" % (moves, necessary_average)
...
22: 10.636364
24: 7.958333
26: 5.538462
28: 3.321429
30: 1.266667

Twenty-two move solutions can be ruled out because a path with an average square cost greater than nine (the largest digit) is impossible. A cursory glance at the array shows that there is no path composed of ones and twos, so thirty move solutions are out as well. Twenty-four move solutions are very unlikely since they would require a path consisting almost exclusively of eights and nines. Twenty-six move solutions are the most likely since the necessary average is close to five. Finally, twenty-eight move solutions are possible but much less likely than twenty-six move solutions.

So how many routes are there? Although there are only 705,432 direct routes to the bottom right corner (combinations involving eleven east moves and eleven south moves), there are 54,209,736 ways to get there in twenty-four moves and a whopping 2,372,995,728 ways to get there in twenty-six moves. Examining the twenty-eight move solutions as well will add an additional 78,007,344,984 possible routes, which means there are over 80 billion in total.
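
These counts can be sanity-checked with a short dynamic program over walk lengths (a sketch, not code from the gist; it counts every walk of the given length that ends on the goal, including walks that touch the goal earlier, so the larger figures may differ slightly if those are excluded):

WIDTH = HEIGHT = 12

def count_walks(length):
    # walks[p] = number of walks of the current length from the start to p
    walks = [0] * (WIDTH * HEIGHT)
    walks[0] = 1  # start in the top-left corner
    for _ in range(length):
        nxt = [0] * (WIDTH * HEIGHT)
        for pos, ways in enumerate(walks):
            if not ways:
                continue
            x, y = pos % WIDTH, pos // WIDTH
            if x < WIDTH - 1:
                nxt[pos + 1] += ways       # east
            if x > 0:
                nxt[pos - 1] += ways       # west
            if y < HEIGHT - 1:
                nxt[pos + WIDTH] += ways   # south
            if y > 0:
                nxt[pos - WIDTH] += ways   # north
        walks = nxt
    return walks[-1]  # walks ending in the bottom-right corner

for length in (22, 24, 26):
    print(count_walks(length))

For length 22 this reproduces the 705,432 figure exactly, since a walk whose length equals the Manhattan distance must be monotone and cannot visit the goal early.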

The analysis led to two important realizations:

  1. A Breadth First Search Will Fail: Trying to fit 80 billion routes in memory will quickly exhaust all available RAM on the system. Even if we managed to encode a route in a single 64-bit integer, we would need 592 gigabytes of RAM to hold 80 billion routes.

  2. The Majority of the Moves are East and South: A twenty-four move solution will be twenty-three east and south moves, with one north or west. Similarly, a twenty-six move solution will have two north or west moves and a twenty-eight move solution will have three.

Solution

With the above knowledge, finding a single solution is straightforward. We need a depth first search that favors east and south moves. First we calculate some constants and create a list of integers from the text:

data = """
* 8 1 7 8 8 5 2 9 5 9 5
8 5 1 1 5 6 9 4 4 5 2 1
7 2 3 5 2 9 2 6 9 3 9 4
9 2 5 9 8 9 5 7 7 5 9 6
2 4 6 7 1 4 2 6 6 2 5 8
2 8 1 5 3 8 4 9 7 5 2 3
2 9 3 5 6 7 2 4 9 4 2 5
6 3 1 7 8 2 3 3 6 7 9 3
2 5 7 4 2 7 8 5 5 3 5 8
5 2 9 8 3 6 1 4 9 5 6 3
4 6 9 8 5 4 9 7 6 4 6 8
2 7 7 1 9 9 7 3 7 2 2 ^
""".strip()
chips = 444
height = len(data.split('\n'))
width = len(data.split('\n')[0].split())
data = data.replace(' ', '').replace('\n', '')
start = data.index('*')             # index of the * square
goal = data.index('^')              # index of the ^ square
data = data.replace('*', '0')       # the start square costs nothing
data = data.replace('^', '5')       # the goal square always costs 5
data = [ord(c) - 48 for c in data]  # convert digit characters to ints

Then we define a recursive method to walk through the tree that favors east and south moves:

def walk(moves, position, chips, depth):
    chips -= data[position]
    if position == goal:
        if chips == 0:
            return moves       # spent exactly 444 chips: solution found
        return                 # wrong chip count: dead end
    elif position != start:
        chips -= depth - 1     # move number depth costs an extra depth - 1

    if chips < data[goal]:
        return                 # not enough chips left to enter the goal

    x = position % width
    y = position // width
    # east and south are listed first so the search favors them
    movements = (
        ('e', 1, x < width - 1),
        ('s', width, y < height - 1),
        ('w', -1, x > 0),
        ('n', -width, y > 0),
    )

    for (direction, delta, allowed) in movements:
        if allowed:
            res = walk(moves + direction, position + delta, chips, depth + 1)
            if res:
                return res

print walk('', start, chips, 0)

This prints out a solution in a few milliseconds:

$ python single.py
eeeeeeeeeeesssssssssswness

Disillusionment

Proud of my newfound success, I couldn't help but overreach. What would it take to generate all of the solutions? The algorithm above is fairly optimized, but modifying it to find the first 100,000 solutions made a run take more than 30 seconds. I had only searched a tiny fraction of the space, and the run time was rapidly approaching minutes.

It was time to get down and dirty with low-level optimizations[2]. A few gallons of sweat and tears later, the run for 100,000 solutions was down to about 5.5 seconds. Unfortunately there are 43,242,017 solutions[3] to this problem, and the search space for twenty-eight move solutions is enormous. Realistically, I was looking at over 3 hours to generate all of the solutions with Python. If I was going to be successful, I had to broaden my horizons.

My first attempt was to try PyPy. With a warm cache, PyPy managed to shave that 5.5 seconds down to about 0.75 seconds: over 7x faster! My excitement about the speed improvement was dampened by a sobering realization: Python is slow! Not just a little slow. Snail riding on the back of a turtle slow.

Benchmarking

I needed to know exactly how much speed I was sacrificing to the language gods, so I rolled up my sleeves and wrote the same algorithm in C. Much to my chagrin, the C version took only 0.04 seconds to find 100,000 solutions. That is 18x faster than the PyPy version and a whopping 134x faster than the CPython version. My beloved Python was costing me two orders of magnitude in run time.

Unwilling to be defeated, I looked for an enjoyable language that might give me performance closer to C. Google's Go language seemed to fit the bill, so I wrote a Go version of the same algorithm. It clocked in at about 0.06 seconds: two-thirds the speed of C[4] and 12x faster than PyPy. Not too shabby for a young language.

Full Results:

| Version | Time (s) | vs C    |
| ------- | -------- | ------- |
| C       | 0.04     | n/a     |
| Go      | 0.06     | 1.45x   |
| PyPy    | 0.75     | 18.19x  |
| CPython | 5.55     | 134.53x |

Conclusion

I realize this is a somewhat contrived example; there are plenty of workloads where the disparity is smaller. Still, I was shocked to discover that Python can be almost two orders of magnitude slower than C for some operations. This brings up some interesting questions:

  • Are the productivity gains of using something like Python really worth the performance costs?
  • Are there other languages (Go perhaps?) that are a pleasure to use without sacrificing so much performance?
  • Why is Python becoming a standard for bioinformatics processing, a field which is compute-hungry?
  • Why aren't more people using PyPy?

Obviously, Python has benefits that often outweigh its performance shortcomings. It is also easy to mix Python with C, so it is possible to find bottlenecks and optimize just the slow parts of a large Python code base. Still, it is important to choose the right language for the problem, and performance-hungry tasks should be implemented in another language.

The source code for the implementations can be found on my GitHub. Suggestions for further optimizations are welcome.

[1] This is approximate because moving back to the starting square doesn't incur a move cost (it always costs 0). However, we can ignore this: the neighboring squares are eights, so any savings we might achieve by moving back to the initial square at the beginning would be offset by the high cost of moving back.

[2] There are quite a few tricky little optimizations in walk.py that I won't discuss in detail. A notable one is storing the list of moves in a single 64-bit integer.

[3] See all.c for code to generate all of the solutions. It takes about 83 seconds on my machine.

[4] I suspect almost all of the performance benefit of C over Go is due to compiler optimizations. The 0.04 seconds was achieved using gcc 4.7. Using gcc 4.2 the C version runs in just under 0.06 seconds and is neck-and-neck with the Go version.

@bostikforever

Stumbled on your comment on hn. I have found the problem interesting enough that I have bothered to explore it myself.

I think for this problem you could argue that the choice of algorithm/approach is probably as important as the language (if not more, but then you could also argue that with a performance oriented language you wouldn't need to worry about the approach for a while...)

For me the key insight is that one need not enumerate all the potential moves since they are mostly useless.

Alternatively one could take a dynamic programming approach where the sub-problem is to figure out how to get to the goal from a given "state". Here the state would be (chips you have left, position in the grid, depth). This 3-dimensional state space is not very large: essentially 444 * 144 * sqrt(444) ≈ 1.4 million states. Upon "solving" the dynamic programming problem (takes about 3 to 5 seconds in Python or PyPy on my machine, without any optimizations), you can then enumerate all the paths from this space. If for instance we limit the number of paths to generate/enumerate to 100,000, then the code always completes in less than 5 seconds (i.e. including the time to solve the dynamic program). The enumeration time depends on whether you want to generate the full paths or just count them. If we just count, we use about 1 to 1.5 minutes in total, whereas if we generate the paths we take about 5 minutes. On my machine there was no difference between Python and PyPy for the overall time (admittedly I have not "warmed" the PyPy JIT cache by running multiple times like you do).
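
A rough sketch of that approach (a reconstruction of the idea described above, not bostikforever's actual code; Python 3, reusing the data/start/goal/width/height variables from the article's parsing code, and following the variant where the goal may not be visited mid-path):

from functools import lru_cache

@lru_cache(maxsize=None)
def solvable(chips, pos, depth):
    # can we land on the goal with exactly 0 chips left, starting
    # from pos with `chips` remaining after `depth` moves?
    if pos == goal:
        return chips == 0
    if chips < data[goal]:
        return False  # cannot even afford the final square
    x, y = pos % width, pos // width
    for delta, ok in ((1, x < width - 1), (width, y < height - 1),
                      (-1, x > 0), (-width, y > 0)):
        if ok:
            nxt = pos + delta
            # start and goal squares ignore the per-move surcharge
            cost = data[nxt] + (0 if nxt in (start, goal) else depth)
            if chips >= cost and solvable(chips - cost, nxt, depth + 1):
                return True
    return False

Once the cache is populated, enumerating paths means walking only through states for which solvable() returns True.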

Also worth noting that while you claim that there are 43,242,017 solutions to this problem, I actually count 45,577,488. Not sure where the discrepancy comes from, but I include code that "replays" my generated paths to ensure that they are valid. Maybe you would have some insights into this.

There are perhaps many low-hanging optimization fruits in my implementation, but I do not bother to go over them as I already achieve comparable performance to your C solution for counting all possible paths. A few directions: perhaps I should linearize the state space of the dynamic program into a list, as this might give faster lookups? Also, in the code that generates paths, is it really kosher to use recursive generators? Maybe Python is just slow...

My thoughts on your questions:

Are the productivity gains of using something like Python really worth the performance costs?

I think fundamentally you are comparing two different things: programmer productivity, which is really not a runtime metric, with program runtime performance. Obviously the conversion factor between these two metrics would depend on the specifics of the application area and workflows. One could argue that the choice of Python in many areas is an indication that the economics of computing today does not really place too much of a premium on performance but on the speed of iteration.

Are there other languages (Go perhaps?) that are a pleasure to use without sacrificing so much performance?

I guess Go could be one language in this regard. I have recently been exploring D and IMO (compared to C++, which I have also used for many years) it feels more pleasurable and in many ways comparable in ergonomics to Python. Widespread adoption on the other hand...

Why is Python becoming a standard for bioinformatics processing, a field which is compute hungry?

Isn't the whole point of Python in the scientific Python community to be a sort of API to the code doing the computation, which is usually written in performance-focused languages like C++? That way you get an environment in which you can do a lot of iteration without worrying about the arcana of getting compilation going with your dependent modules like you would in some other languages. IIRC, there have been at least two different projects (CINT, Cling) at CERN to get a sort of interactive C++ environment going for quick iteration, so maybe it does not have to be Python. I don't know much about Go, but I wonder whether it would be well suited for scientific workflows too.

Why aren't more people using PyPy?

A lot of reasons that revolve around having the guarantee that all your dependencies are compatible with PyPy. Part of the effort being led by the PyPy team is a standard extension mechanism HPy which would mean that extension modules for python implemented targeting the HPy API would be compatible with other variants of python out of the box. Also as with all optimization efforts your actual mileage may vary depending on your workflows such that it might not be worth the effort and downsides (e.g. lagging behind the standard python releases, at present PyPy is on 3.9 while CPython is on 3.11).

@vishvananda (Author) commented Mar 14, 2023

Hey, thanks for the comment @bostikforever. It has been a while since I worked on this. The answers to your questions are along the lines of what I was thinking when I wrote the original post. A few thoughts on your solution: I have tried to solve this problem with dynamic programming and I have never managed to make it faster than the search I ultimately settled on. You claim to solve the problem in 3-5 seconds, but it isn't solved in any meaningful sense until you enumerate or count the paths, and this is where most of the time is spent in any solution. The current python3 implementation in https://github.com/vishvananda/walk counts 100,000 solutions in 1.3 seconds. The Go solution (currently fastest) counts 100,000 paths in 0.03 seconds and can count all paths in 28.8 seconds on my macbook.

@bostikforever

The dynamic programming solution would count or generate 100k solutions in about 5 seconds, but would count all 40+ million solutions in a little over 60 seconds (or generate them in 5 minutes). Of course you still have better performance with C or Go but my point is that while search is right for a single solution (or even up to N solutions) it's IMO not the best for generating or enumerating all solutions (irrespective of the language you use).

Do you know why we have a discrepancy in number of solutions?

@vishvananda (Author)

One should be able to have a smaller search space through dynamic programming, but I couldn't ever extract any savings from it. FWIW I also experimented with heuristics to exit the search early. For example, you can truncate the current branch of the DFS if the average cost of the remaining moves needed rises above 10, although I never managed to find any savings by including calculations like that. I suspect this is an artifact of the rarity of backtracking moves and the small cost of completing the search.

@vishvananda (Author) commented Mar 14, 2023

@bostikforever good point, it is possible that your method is faster for generating all solutions. Not sure why there is a discrepancy. One possibility is that moving back to the initial square does not incur the additional cost (i.e. it should always cost zero no matter how many moves you have made). I don't think you account for that in your code, although I would expect that to yield more solutions, not less. Is it possible your code is generating duplicate solutions? I suppose we could print out all solutions and sort them to figure out what might be missing. Maybe my maximum-of-two-backtracks assumption is incorrect and you are finding some 28 or 30 move solutions?

@bostikforever

I actually count more solutions than you. However, I also account for the cost of the final square always being zero. FWIW, I think the single solution that your code generated was a prefix of the solutions I generated. Also, I think that I do not generate duplicate solutions: since the state space generated by the dynamic program only contains unique states, the subsequent search of the valid states in the state space to generate paths can only yield unique solutions. I think your analysis is right and there cannot be 30 move solutions. Maybe there are 28 move solutions? I can provide a dump of all solutions I generated and we can do a diff to figure out counter-examples either way.

@vishvananda (Author) commented Mar 14, 2023

@bostikforever I went ahead and generated the lists and found the discrepancies. It is as I expected: you are adding extra cost for the first square and last square. Your verify method, for example, should look like this:

    for step in inner_steps:
        path_chips -= data[step]
        if step != start and step != goal:
            path_chips -= add_cost
        add_cost += 1

That said, I have an issue in my solution as well. Given the discounting due to double visiting start and goal, there are some 28 move solutions which I was not counting. With 28 move solutions included, there are 43,242,017 solutions. Adding the 28 move solutions (3 backtracks) makes the search significantly slower, so it is probably an even bigger win for your solution if you can modify it to handle the cost changes. With 3 backtracks the Go version takes over 6 minutes. I can get that down by tracking a minimum remaining cost and truncating the search early when the minimum remaining cost is greater than the chips remaining, but I think a Go implementation of your solution would end up being faster in this case. I might try to tackle it at some point.

Here is an example of a solution which you find which is incorrect
eeeeeeeeeeesssnsssssssssns
And here is one i find that you do not:
eewweeeeeeeeeessssesssssss

@bostikforever

@vishvananda I'm not sure about the verify method. Look at https://gist.github.com/bostikforever/62b2f551566383f385c4a307482e5425#file-improv-py-L118-L120: I already exclude the first and last squares (hence the name inner_squares) in the verify method. Is there something I am missing? Also, I start the add_cost at 0 so that for the first step (i.e. going to the first inner square) there is no extra cost. And I assert at the end that we are left with only the cost to cover moving to the final square (but with no additional cost).

I'll go over your counterexamples later.

Interesting bit about the change to handle 28 move solutions. I guess the win there comes from the fact that at an increased depth, you have exponentially more potential paths, but only a small number of incremental paths.

@vishvananda (Author)

The differences are when you double visit the first or last square. The point is that the first and last squares' costs do not increase in value with each move, so in my example eeww, move 4, which lands on the initial square again, has a cost of 0, not 4.

@vishvananda (Author)

Well, there are 30 move solutions as well. I modified your code to be able to find them and it looks like there are actually 68,285,670 solutions. The modification I made is:

                for neighbor in neighbors(pos):
                    neighbor_chip = target_chip - data[neighbor]
                    if neighbor != start and neighbor != goal:
                        neighbor_chip -= depth
                    neighbor_key = (neighbor_chip, neighbor, depth + 1)

I found another bug in my code which was preventing double visiting the last node. I now get the same number of solutions as the modified code and it takes 102 seconds.

@bostikforever

Aha! I totally didn't realise this was a constraint. Looking at your problem description again it still isn't obvious that this should be the case. Still, I'll update my code to reflect this extra constraint when I get the chance.

@bostikforever

I just saw your second message that you have updated the logic in my version already to meet the constraints. How did you bring down the running time of the Go version and still handle 30 step paths?

@vishvananda (Author)

Yes, the description from the original problem was very unclear on what should happen in this case, so I guess we can just pick one and call it correct. IIRC I tested a solution on the original problem that backtracked to the start square and it always cost zero. I didn't test the final square, so there are perhaps 3 interpretations of what happens if you hit the bottom right with more than 5 chips:

  1. it immediately fails (this is what my first implementation did)
  2. it subtracts 5 chips and you can keep going (this is the modified implementation)
  3. it subtracts 5 + depth chips and you can keep going (this is what your solution did, I believe)

The optimization looks like this:
func walk(moves uint64, position int, chips int, depth int, backs int, min int) {
  chips -= intdata[position]
  if position == goal_y*width+goal_x {
    if chips == 0 {
      solutions += 1
      return
    }
  } else if position != start_y*width+start_x {
    chips -= int(depth - 1)
  }

  if chips-min < 5 {
    return
  }

  var next_depth int = depth + 1
  moves <<= 2
  var x int = position % 12
  if x < width-1 {
    // moving closer means that the minimum cost decreases by next_depth
    walk(moves+e, position+1, chips, next_depth, backs, min-next_depth)
  }
  if position < height*(width-1) {
    // moving closer means that the minimum cost decreases by next_depth
    walk(moves+s, position+width, chips, next_depth, backs, min-next_depth)
  }
  if backs < 4 {
    if x > 0 {
      // moving farther means that the remaining steps all increase by one; simplified to 45 - depth
      walk(moves+w, position-1, chips, next_depth, backs+1, min+45-depth)
    }
    if position > width-1 {
      // moving farther means that the remaining steps all increase by one; simplified to 45 - depth
      walk(moves+n, position-width, chips, next_depth, backs+1, min+45-depth)
    }
  }
}

// minimum cost from starting square to one of the two second-to-last squares is 22 * 21 / 2 = 231
walk(0, 0, start_chips, 0, 0, 231)

There are some magic numbers in there. 231 is the minimum cost of moving to one of the squares next to the goal. The (45 - depth) is the penalty for backtracking, which I hand-calculated. Note that the min cost is slightly too high because it doesn't account for the savings of double or triple visiting the final square, so initially I thought I would have to modify that number a bit, but it gets the right number of solutions.

@bostikforever

Now that I think about it I prefer the solutions that don't visit the goal state twice.

@vishvananda (Author)

I re-implemented your solution in go and it is much faster. It generates the solutions in 0.2 seconds, can count them in 8 seconds and can print them all in about 60 seconds. I think it might even be possible to go faster by aggregating the count and/or the paths as part of the initial solution generation. I'll try that next.

@vishvananda (Author)

I can now generate the count at the same time, so counting the paths takes less than 0.3 seconds. I experimented with generating the paths on the first pass, but it is too inefficient. There is an interesting way to optimize the generation a little bit that might work. Mostly, generating the paths uses too much memory when it branches out to all 144 squares. If we do the search in both directions (from start and goal), it might make the list small enough to work.

@bostikforever

Interesting work. I have not had time to revisit and implement in a compiled language (C++ or D), but it's interesting to see the result.

Some curiosities:

  1. I would not expect a large difference in time between solving the dp and counting separately, vs doing both in a single step.
  2. The Python solution does not struggle with memory, so it's surprising that this re-implementation does.

For 1, I guess it makes sense since a lot of paths share the same suffix. It didn't occur to me to exploit this in the count function via memoisation (which is essentially what you do by computing the counts upfront). I think if this is done then the timings should be similar.

For 2, I think your solution struggles with memory because it is written such that at each level you have all the full path suffixes from that level to the end, so your total memory cost at the top level of gen_path is basically the whole tens of millions of paths, which is already multiple gigabytes. On the other hand, the Python solution searches in a sort of depth-first manner, yielding at the bottom of the path, and never holding more than a single path in memory (at least not within gen_path). I think that is the way to go, but it might be a bit ugly if you don't have an equivalent of Python generators: basically you would have to do the work of converting to string and/or writing the path to file in your base case.

@bostikforever commented Mar 16, 2023

I have added a solution that does the count with memoization, and the solving + count now takes only 5 seconds. Talk about low-hanging optimization fruit! (I also updated both versions to only consider paths that don't visit the goal in the middle. Visiting start twice is still allowed; this brings down my initial count of 45 million to about 43 million.)
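
The memoized count is a small change to the same state recursion (again a sketch of the idea, not the actual gist code; it reuses the hypothetical state definition from the earlier sketch and the article's data/start/goal/width/height variables):

from functools import lru_cache

@lru_cache(maxsize=None)
def count_from(chips, pos, depth):
    # number of valid completions from this state; shared suffixes are
    # computed once and reused, which is where the speedup comes from
    if pos == goal:
        return 1 if chips == 0 else 0
    if chips < data[goal]:
        return 0
    x, y = pos % width, pos // width
    total = 0
    for delta, ok in ((1, x < width - 1), (width, y < height - 1),
                      (-1, x > 0), (-width, y > 0)):
        if ok:
            nxt = pos + delta
            cost = data[nxt] + (0 if nxt in (start, goal) else depth)
            if chips >= cost:
                total += count_from(chips - cost, nxt, depth + 1)
    return total

print(count_from(444, start, 0))  # total count over all path lengths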

@vishvananda (Author) commented Mar 16, 2023

@bostikforever ok I'm deep down the rabbit hole now. I managed to get it to generate the strings for all of the solutions in about 12 seconds by building in both forward and reverse: https://gist.github.com/vishvananda/a39557090df00a90cdbb16a8369bd7ce

EDIT: added some concurrency when constructing the paths at the end and it runs in 8-10 seconds now.

@bostikforever

Interesting choice to split the paths into two parts for memoization. That must have helped with the memory too!
To be honest, at this point the Go is a bit too much for me to follow, so I can't really comment on any obvious improvements.

Concurrency is a whole different kettle of fish, probably needs to be designed into the approach ab initio to take full advantage.

@vishvananda (Author) commented Mar 18, 2023

@bostikforever ok, one more update for you. I converted the Go approach back to Python to see how it compares. It generates all of the paths in integer form in 42 seconds, and with an optimized string conversion it clocks in at about 2 minutes. With PyPy, I get the strings in about 60 seconds. https://gist.github.com/vishvananda/686e9e08749e6c6b772ffa60e13bb0ab. The newest Go version can generate strings in about 6.5 seconds now.

@bostikforever

@vishvananda I have added a few updates. In general, I was not convinced that complicated was the way to go.

I simplified the solution in improve_simplify to not call to_moves as a separate step, by noting that we already know the moves in the neighbors generator, so we can just store that in the state. This leads to an uptick in memory usage from ~300MB to ~450MB, but it brings down the runtime to about 90 seconds (vanilla Python 3.11; I have not tried the changes on PyPy). This is great because I can keep the solution conceptually simple, but still efficient (wrt time and memory, IMO).
In improve_simplify_with_lengthen, I investigated the effect of lengthening the state by concatenating states. The conclusion was that I cannot really do more than 2-move fragments before I start needing an unreasonable (IMO) amount of memory to store the state. For the 2-move fragment solution, the memory usage triples to about 1.4GB (from 450MB), while the running time drops to about 45 seconds. Again, this is vanilla Python.
improve_simplify_no_gen is essentially improve_simplify but without using recursive generators; instead I pass in a callback for the action to perform at the bottom of the recursion. This version is more efficient than the generator variant in run time and takes only about 60-70 seconds. The memory profile is similar to improve_simplify.
improve_simplify_no_gen_with_lengthen again is essentially improve_simplify_with_lengthen but using a callback instead of generators for recursion. In this version we have a similar memory profile as the generator version but use about 30-35 seconds of runtime.

It is kind of sad for me that the generator solution has about a 30% - 50% overhead, as I like the API that approach provides better than the callback one.

If you are still up for it, I would be curious to see the impact of efficient string generation on these and/or the effect of running via PyPy. I am also curious about how this simple approach would fare faithfully translated to a compiled language; I expect there would be significant gains in string manipulation. Maybe I should bite the bullet and try a C++/D solution. (The generator-versus-callback contrast is sketched below.)
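
The generator-versus-callback overhead is easy to see in miniature (a toy example, not the gist code): every value produced at the bottom of a recursive generator is re-yielded through each enclosing frame, while a callback fires directly at the leaf.

def leaves_gen(n):
    # recursive generator: values bubble up through n frames
    if n == 0:
        yield ''
        return
    for suffix in leaves_gen(n - 1):
        yield 'e' + suffix
        yield 's' + suffix

def leaves_cb(n, prefix, emit):
    # callback style: emit fires once at the bottom of the recursion
    if n == 0:
        emit(prefix)
        return
    leaves_cb(n - 1, prefix + 'e', emit)
    leaves_cb(n - 1, prefix + 's', emit)

collected = []
leaves_cb(3, '', collected.append)
assert sorted(collected) == sorted(leaves_gen(3))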

@vishvananda (Author) commented Mar 19, 2023

Fascinating. You're right: with plain Python, just storing the paths in the tree as strings is much more effective (see walk_fast_strings.py in https://gist.github.com/vishvananda/686e9e08749e6c6b772ffa60e13bb0ab). I can generate all of the solutions in native Python in ~25 seconds instead. No savings running the string solution on PyPy. It is interesting that storing the paths as integers and then converting them at the end is slower in native Python. I tried a similar approach in Go where I just store the paths as byte arrays instead of integers, and it completes in more like 10 seconds instead of less than 5 when using integers.

@vishvananda (Author) commented Mar 20, 2023

I can get the solution generation in Python down to ~14 seconds by applying an aggressive cutoff in the search. I don't have a good method for finding an optimal cutoff value, so that is a hand-selected cutoff of < 260 chips remaining in the forward search and > 290 in the reverse. A very safe cutoff of, say, 100 forward and (start_chips - 100) in the reverse also saves some time, but just a few seconds.

@vishvananda (Author)

I implemented the string version in Go and it generates all solutions in about 3.5 seconds (string.go in https://gist.github.com/vishvananda/a39557090df00a90cdbb16a8369bd7ce). Single-threaded it takes closer to 8 seconds, so Python is actually pretty competitive here.
I did some optimizations to the fastest Go version, which stores the path as an integer, and it finishes in about 1.8 seconds now. I do have a few assumptions that I'm making to convert the path from an int to a string: it only works if max <= 32, if the linear distance between source and goal is even, and if the goal is below and to the right of the start. I don't imagine it's possible to get much faster than 1.8 seconds. Still, it's pretty impressive that we managed to get the Python version down to 14 seconds.
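
For the curious, the 2-bits-per-move packing mentioned in footnote [2] and used in the Go code above looks roughly like this in Python (a reconstruction, not the gist code; this variant spends one sentinel bit to record the path length, so it holds up to 31 moves, whereas a fixed-length scheme fits 32):

DIRS = 'eswn'

def pack(path):
    value = 1  # sentinel high bit marks where the path begins
    for move in path:
        value = (value << 2) | DIRS.index(move)
    return value

def unpack(value):
    moves = []
    while value > 1:
        moves.append(DIRS[value & 3])
        value >>= 2
    return ''.join(reversed(moves))

assert unpack(pack('eeswn')) == 'eeswn'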

@bostikforever

Indeed it's impressive that with the change of approach you've gone from 30 seconds to 1.8 seconds on the go solution. I guess the premise of your original post still holds, but the numbers would look very different now.

Sorry I have not contributed further iterations (or tried a version in a compiled language). I don't think I would manage. This is a good place to end the thread!
