For simplicity, this will just go over the unsigned case
Written for someone, such as myself, who isn't as quick to put together the same logical steps
Given $y \in \mathbb{Z}$ where $y > 0$
Find $a,sh \in \mathbb{Z}$ such that $\forall x \in \mathbb{Z} \ (\ \lfloor \frac{x}{y} \rfloor = \lfloor x \cdot a / 2^{sh} \rfloor)$
To motivate what we'll be doing, lets start by considering:
$$
\begin{align*}
\frac{x \cdot a}{2^{sh}} &= \frac{x}{y} \\
\frac{a}{2^{sh}} &= \frac{1}{y} \\
a &= \frac{2^{sh}}{y} \\
\end{align*}
$$
What we want to show is that $a = \lfloor \frac{2^{sh}}{y} \rfloor$ can still be used to compute $\lfloor \frac{x}{y} \rfloor$ exactly
So, let $sh = w + \lceil \log_2y \rceil$ where w is our 'effective wordsize' in bits
From an implementation standpoint, we'd like to be able to write:
Recip(uint64_t y){
uint64_t a, sh;
// compute a, sh from y
return (a, sh);
}
which would seem to imply $w = 64$, however we musn't be so hasty.
Given that $\lceil \log_2y \rceil \in \mathbb{Z}$ we have:
$$
\begin{alignedat}{3}
\log_2y \ &\leq & \ \lceil \log_2y \rceil & &&< \ \log_2y + 1 \\
2^{\log_2y} \ &\leq & \ 2^{\lceil \log_2y \rceil} & &&< \ 2^{\log_2y + 1} \\
2^{\log_2y} \ &\leq & \ 2^{\lceil \log_2y \rceil} & &&< \ 2^{\log_2y} \cdot 2 \\
y \ &\leq & \ 2^{\lceil \log_2y \rceil} & &&< \ 2y \\
y \ &\leq & \ 2^{\lceil \log_2y \rceil} & &&\leq \ 2y-1 \quad \text{since } 2y,\ 2^{\lceil \log_2y \rceil} \in \mathbb{Z} \\
2^w \cdot y \ &\leq & \ 2^w \cdot 2^{\lceil \log_2y \rceil} & &&\leq \ 2^w \cdot (2y-1) \\
2^w \cdot y \ &\leq & \ 2^{w + \lceil \log_2y \rceil} & &&\leq \ 2^{w+1} \cdot y - 2^{w} \\
2^w \cdot y \ &\leq & \ 2^{sh} & &&\leq \ 2^{w+1} \cdot y - 2^{w} \\
2^w \ &\leq & \ \frac{2^{sh}}{y} & &&\leq \ 2^{w+1} - \frac{2^w}{y} \\
2^w \ &\leq & \ a & &&\leq \ 2^{w+1} - 1 \quad \text{when } y \leq 2^w \\
\end{alignedat}
$$
So what do we get from the bounds $2^w \leq a \leq \ 2^{w+1} - 1$?
Well, if we know $a$ fits in $w+1$ bits, and since we want to store $a$ in a uint64_t
$\Rightarrow w+1 = 64 \Rightarrow w = 63$
Note that above we restricted $y \leq 2^w \equiv y \leq 2^{63}$ which is no longer true for all values uint64_t
If that's a deal breaker, then you'll have to implement the extended precision case for $a$ as an exercise for the reader ;)
Okay, so we know how to find $a,sh$, but how do we know that they compute $\lfloor \frac{x}{y} \rfloor$ exactly?
Well, we're going to have to be a bit more precise with what those $\lfloor ... \rfloor$ and $\lceil ... \rceil$ are doing
$$
\begin{alignedat}{3}
\text{let } q &= \lfloor \frac{x}{y} \rfloor &&\Rightarrow \exists \ r_q \in \mathbb{Z} & \ x &&= q \cdot y + r_q & \quad \text{where } 0 \leq r_q < y \\
\text{let } a &= \lceil \frac{2^{sh}}{y} \rceil &&\Rightarrow \exists \ r_a \in \mathbb{Z} & \quad 2^{sh} &&= a \cdot y - r_a & \quad \text{where } 0 \leq r_a < y \\
\end{alignedat}
$$
Note, $\lfloor ... \rfloor$ makes $q$ smaller, so $r_q$ is added on to compensate
while $\lceil ... \rceil$ makes $a$ larger, so $r_a$ is taken away to compensate
Now, if we can show that $r_q$ and $r_a$ make no difference when calculating $q = \lfloor \frac{x}{y} \rfloor = \lfloor x \cdot a / 2^{sh} \rfloor$ then we're done :)
But first, we'll get the easy case out of the way. When $x=0$ it should be clear that $q = \lfloor \frac{0}{y} \rfloor = \lfloor 0 \cdot a / 2^{sh} \rfloor = 0$
So for the rest of the proof can make use of the case where $x \geq 1$
$$
\begin{alignedat}{3}
\text{let } Q &= &&\lfloor x \cdot a \ /\ 2^{sh} \rfloor \\
&= &&\lfloor (q \cdot y + r_q) \cdot a \ /\ 2^{sh} \rfloor \\
&= &&\lfloor (q \cdot a y + a\ r_q) \ /\ 2^{sh} \rfloor \\
&= &&\lfloor (q \cdot (2^{sh} + r_a) + a\ r_q) \ /\ 2^{sh} \rfloor \\
&= &&\lfloor (q \cdot 2^{sh} + q \cdot r_a + a\ r_q) \ /\ 2^{sh} \rfloor \\
&= \ q \ + \ &&\lfloor (q\ r_a + a\ r_q) \ /\ 2^{sh} \rfloor \\
\end{alignedat}
$$
To guarantee that $Q = q$ we need to show that $0 \leq q\ r_a + a\ r_q < 2^{sh}$
It should come to no surprise that this combination of non-negative values is non-negative, so all that's left is the upper bound
$$
\begin{alignedat}{3}
& r_a &&< y \\
& r_a &&\leq y-1 \\
& q \ r_a &&\leq q\ y - q \\
& q \ r_a &&\leq (x - r_q) - q \\
& q \ r_a &&\leq (x - q) - r_q \\
\end{alignedat}
$$
$$
\text{let } 1 \leq x \leq 2^w \ \text{along with } y \leq 2^w \ \land \ q = \lfloor\frac{x}{y}\rfloor \\
\Rightarrow \ q>0 \ \Rightarrow \ (x-q) \ <\ x\ \leq \ 2^w
$$
$$
\begin{alignedat}{3}
& \ q \ r_a &&\leq (x - q) - r_q \\
& \ q \ r_a &&< (2^w) - r_q \\
& \ q \ r_a \ + \ a \ r_q &&< 2^w \ + \ a \ r_q - r_q \\
& \ q \ r_a \ + \ a \ r_q &&< 2^w \ + \ (a-1) \ r_q \\
& \ q \ r_a \ + \ a \ r_q &&< 2^w \ + \ (a-1) \ (y-1) \\
& \ q \ r_a \ + \ a \ r_q &&< 2^w \ + \ ay - a - y + 1 \\
& \ q \ r_a \ + \ a \ r_q &&< 2^w \ + \ (2^{sh}+r_a) - a - y + 1 \\
& \ q \ r_a \ + \ a \ r_q &&< 2^w \ + \ 2^{sh}+(y-1) - a - y + 1 \\
& \ q \ r_a \ + \ a \ r_q &&< 2^w \ + \ 2^{sh} - a \\
& \ q \ r_a \ + \ a \ r_q &&< 2^{sh} \ + \ 2^w - a \\
& \ q \ r_a \ + \ a \ r_q &&< 2^{sh} \quad \text{since } 2^w \leq a \quad \text{might have to scroll up a bit ;)} \\
\end{alignedat}
$$
All right, that's enough math, time to get a bit gung-ho with writing the code for it
Recip(uint64_t y){
assert(y <= (1ull << 63)); // Ensure y <= 2^63
uint64_t sh = 0;
while(y > (1ull << sh)){ sh++; } // sh' = ceil(log2(y))
sh += 63; // sh = ceil(log2(y)) + w
uint64_t a = ((1ull << sh) + y-1) / y // a = ceil(2^sh / y) = floor((2^sh + y-1) / y)
return (a, sh);
}
FastDivExamp(uint64_t[] x, uint64_t y){
uint64_t a, sh;
a, sh = Recip(y);
for(i, 0, x.count){
//x[i] = x[i] / y; // q = floor(x/y)
x[i] = (x[i]*a) >> sh; // Q = (x*a) / 2^sh
}
}
Ahhh, such an elegant mapping from theory to implementation.
Unfortunately, it fails the smell test for anyone paying attention :P
If we tried plugging in y = 8
, we'd get sh = 3 + 63 = 66
So, pray tell, what do we expect to get from either (1ull << sh)
or (x*a)
?
Instead, we'll explicitly handle the pseudo uint128_t
operations using a 64:64 tuple
This has the benefit where if we have (v1:v0) then returning v1 is equivalent to (v1:v0) >> 64 which we can bake into sh
So the only hurdle left is calculating ((1<<sh) + y-1) / y
We have a convenient mapping from (1<<sh) + (y-1)
to (v1:v0)
v0 = y-1
when $y-1 < 2^{64}$
v1 = 1 << (sh-64)
when $2^{64} \leq 2^{sh} < 2^{128}$
The condition for v1
is only true when $64 \leq sh \ \Leftrightarrow \ 64 \leq 63 + \lceil \log_2y \rceil \ \Rightarrow \ 1 \leq \lceil \log_2y \rceil \ \Rightarrow \ 2 \leq y$
Now, for the case when $y=1$, I propose another algorithm. $\forall x \in \mathbb{Z} \ (\ \lfloor \frac{x}{1} \rfloor = x)$
Finally, we can take our (v1,v0) tuple and pass it to the x64 DIV
While this can raise an exception if the result won't fit within a uint64_t
, we've already proven that $a \leq \ 2^{w+1} - 1$
The same idea can be used with x64 MUL and only using the high 64 bits to compute the aforementioned (v1:v0) >> 64
Recip(uint64_t y){
assert(1 < y && y <= (1ull << 63)); // Ensure 1 < y <= 2^63
uint64_t sh = 0;
while(y > (1ull << sh)){ sh++; } // sh' = ceil(log2(y))
sh += 63 - 64; // sh = ceil(log2(y)) + w - 64
uint64_t a = Div128(1ull<<sh, y-1, y); // a = ceil(2^sh / y) = floor((2^sh + y-1) / y)
return (a, sh);
}
FastDivExamp(uint64_t[] x, uint64_t y){
uint64_t a, sh;
a, sh = Recip(y);
for(i, 0, x.count){
//x[i] = x[i] / y; // q = floor(x/y)
x[i] = MulHi64(x[i], a) >> sh; // Q = (x*a) / 2^sh
}
}