Skip to content

Instantly share code, notes, and snippets.

@Wurstnase
Created December 2, 2015 19:16
Show Gist options
  • Save Wurstnase/a428edd40768428b0ad1 to your computer and use it in GitHub Desktop.
Save Wurstnase/a428edd40768428b0ad1 to your computer and use it in GitHub Desktop.
Wilco 3cycle/bit square root
Source:http://www.verycomputer.com/24_e95a6e361498c566_1.htm
In the 1/sqrt() thread, I made the remark that a square root does take
only 3 cycles per bit. OK, here is the algorithm:
Goal: Calculate root = INT (SQRT (N)), for N = 0 .. 2^32 -1
Idea: Successive approximation of the equation (root + delta) ^ 2 = N until
delta < 1. If delta < 1 we have the integer part of SQRT (N).
Use delta = 2^i for i = 15 .. 0.
Init: Start with root = 0 and delta = 2^15.
Step: If (root + delta) ^ 2 <= N then
root += delta
endif
delta = delta / 2
This step is repeated until delta = 0. Now root = INT (SQRT (N)).
Idea: (root + delta) ^ 2 = root^2 + 2 * root * delta + delta^2 =
root^2 + delta * (2 * root + delta).
Now take M = N - root^2, and update M in each step of the algorithm:
Init: M = N, delta = 2^15
Step: if delta * (2 * root + delta) <= M then
M -= delta * (2 * root + delta)
root += delta
endif
delta = delta / 2
We need to calculate delta * (2 * root + delta) fast, without any
multiplications. This is possible because delta is a power of 2.
In ARM code we might optimize a bit by taking root' = root *2, and
using loop unrolling for delta = 2^i.
MOV M, N
[ unroll for i = 15 .. 0
ADD t, root, #1 << i ; t = 2 * root + delta
CMP M, t, LSL #i ; if t * delta <= M then
SUBHS M, M, t, LSL #i ; M -= t * delta
ADDHS root, root, #2 << i ; root += delta
]
MOV root, root, LSR #1
OK, now we have a 4-cycle per bit routine. For 3-cycle per bit we have to
leave one instruction out... The idea is to calculate 2 * root + delta
successively (so we don't need the ADD t, root, #1 << i instruction).
In binary the value 2 * root + delta looks like this:
xxxx01 ; the xxxx bits are the value of root, the zero is from 2 * root,
and the 1 is the delta bit.
Now look at successive values of 2 * root + delta:
xxxx010000
xxxxa01000 ; a is the new bit of root (from root += delta)
xxxxab0100 ; b new bit of root
xxxxabc010
It can be seen that the new bit of the new approximation of root is put on
the zero bit of the previous approximation, while the delta bit shifts one
position. We want to calculate this new approximation in only ONE instruction!
If you forget about the delta bit, and shift root to the right, you would get:
xxxx
xxxxa
xxxxab
This is an ADC root, root, root instruction. Now, we use a rotate right,
instead of a shift right:
010000xxxx
01000xxxxa
0100xxxxab
This can be done with an ADC root, offset, root, LSL #1 instruction.
Take offset = 3^30:
010000xxxx ; original root
10000xxxx0 ; root LSL #1
10000xxxxa ; root LSL #1 + carry bit
01000xxxxa ; ADC root, offset, root, LSL #1
Et voila! Now we have the following ARM routine:
; IN : n 32 bit unsigned integer
; OUT: root = INT (SQRT (n))
; TMP: offset
MOV offset, #3 << 30
MOV root, #1 << 30
[ unroll for i = 0 .. 15
CMP n, root, ROR #2 * i
SUBHS n, n, root, ROR #2 * i
ADC root, offset, root, LSL #1
]
BIC root, root, #3 << 30 ; for rounding add: CMP n, root ADC root, #1
This routine takes 3 * 16 + 3 = 51 cycles for a root of a 32 bit integer.
Of course, there are many other possibilities. The routine can be used with
some extra code which uses early termination if n is less than 32 bit. Also
fixed point versions are possible, or roots of 64 bit numbers. Even looped
versions are no problem: Shift in 8 or 16 bits at a time,
so you need only to unroll 4 or 8 times (overhead 6 cycles).
Wilco
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment