I made a few optimizations to the cython code. The timings on my machine are:
from sequences import P53_HUMAN, P53_MOUSE
from alignment import align as py_align
from align_numpy import align as cy_align
from align_numpy2 import align as cy_align2
%timeit py_align(P53_HUMAN, P53_MOUSE)
1 loops, best of 3: 442 ms per loop