I made a few optimizations to the Cython code. The timings on my machine are:
from sequences import P53_HUMAN, P53_MOUSE
from alignment import align as py_align
from align_numpy import align as cy_align
from align_numpy2 import align as cy_align2
%timeit py_align(P53_HUMAN, P53_MOUSE)
1 loops, best of 3: 442 ms per loop
%timeit cy_align(P53_HUMAN, P53_MOUSE)
10 loops, best of 3: 178 ms per loop
# With optimizations in align_numpy2
%timeit cy_align2(P53_HUMAN, P53_MOUSE)
10 loops, best of 3: 22.6 ms per loop
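I've validated that the results are the same as with the original code. By these timings, align_numpy2 is roughly 20x faster than the pure Python version and about 8x faster than the original Cython version. Most of the speedup came from replacing the dict representation of the traceback with a NumPy array of objects.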
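To illustrate the change in data structure, here is a minimal pure Python sketch. It is not the actual code from align_numpy2: the function names and the stored pointer values are made up for the example. It only shows that a dict keyed by (i, j) tuples and a 2-D NumPy object array can hold the same traceback information.

```python
import numpy as np


def traceback_as_dict(n_rows, n_cols):
    """Hypothetical traceback stored in a dict keyed by (i, j) tuples."""
    tb = {}
    for i in range(n_rows):
        for j in range(n_cols):
            # Every store and lookup builds and hashes a tuple key.
            tb[i, j] = (i - 1, j - 1) if i and j else None
    return tb


def traceback_as_array(n_rows, n_cols):
    """The same information in a 2-D NumPy array with dtype=object."""
    # An empty object array is initialised with None in every cell.
    tb = np.empty((n_rows, n_cols), dtype=object)
    for i in range(n_rows):
        for j in range(n_cols):
            # Cells are addressed by integer indices; no hashing involved.
            tb[i, j] = (i - 1, j - 1) if i and j else None
    return tb


tb_dict = traceback_as_dict(3, 4)
tb_array = traceback_as_array(3, 4)

# Both containers return the same pointer for every cell.
same = all(tb_dict[i, j] == tb_array[i, j]
           for i in range(3) for j in range(4))
print(same)
```

In plain Python this swap will not necessarily be faster by itself. The assumption here is that the gain appears once the array is declared with a type in Cython, so that element access can be compiled to direct indexing instead of a dict lookup.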