Recall that in KKW all the players essentially execute the same operation in parallel, e.g. in the case of addition, they all add their shares.
The new observation is that the same holds if you consider the operations of all players across all repetitions simultaneously.
Fix the number of players at n = 8 and the ring R = GF(2) (bits).
Then I can represent the shares of the 8 players in 8 parallel repetitions of KKW as a single 8 x 8 = 64-bit register, where s_{i,j} denotes the share of player j in repetition i, so that byte i holds all the shares of repetition i:
S_x = | s_{1,1}, s_{1,2}, ..., s_{1,8} | s_{2,1} ... | ... | s_{8,1}, ..., s_{8,8} |
If I need the players to add wire "x" to wire "y" across all 8 repetitions, I can simply XOR the 64-bit registers for wires "x" and "y". This individually adds/XORs the share of each player in each repetition.
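As a sketch (the names here are illustrative, not from any implementation), packing and wire addition in C:

```c
#include <assert.h>
#include <stdint.h>

/* Pack the 8 x 8 share matrix into one register: s[i][j] in {0,1} is
   the share of player j in repetition i, placed at bit 8*i + j, so
   byte i of the register holds all shares of repetition i. */
static uint64_t pack_shares(const uint8_t s[8][8])
{
    uint64_t r = 0;
    for (int i = 0; i < 8; i++)
        for (int j = 0; j < 8; j++)
            r |= (uint64_t)(s[i][j] & 1) << (8 * i + j);
    return r;
}

/* Adding wire "x" to wire "y" for every player in every repetition
   is a single 64-bit XOR on the packed representation. */
static inline uint64_t add_wires(uint64_t sx, uint64_t sy)
{
    return sx ^ sy;
}
```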
Reconstruction can also be bit-sliced very efficiently. Let S_x be the 64-bit register for wire "x".
Using a logical shift:
t = (S_x >> 4) ^ S_x
t = (t >> 2) ^ t
t = (t >> 1) ^ t
Using a rotation:
t = (S_x ROL 4) ^ S_x
t = (t ROL 2) ^ t
t = (t ROL 1) ^ t
The shift variant yields the reconstructed value of repetition i in the least significant bit of byte i (the higher bits of each byte pick up shares from the neighbouring byte and should be ignored). The rotation variant behaves the same, except that a full 64-bit rotate wraps around, so the clean result lands in the most significant bit of each byte instead. For the multiplications below we additionally need the reconstructed bit repeated in all 8 bits of its byte; starting from the shift variant this costs two more instructions: t = (t & 0x0101010101010101) * 0xFF.
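A C sketch of the shift-based reconstruction (names are illustrative; the mask constant repeats 0x01 in every byte). It keeps the clean least significant bit of every byte, then optionally broadcasts it across the whole byte via a single multiply by 0xFF:

```c
#include <assert.h>
#include <stdint.h>

/* Reconstruct all 8 repetitions at once: byte i of the input holds
   the 8 players' shares of repetition i, and the reconstructed value
   is the XOR of those 8 bits. After the shift/XOR ladder that value
   sits in the least significant bit of byte i; the mask discards the
   other (contaminated) bit positions. */
static inline uint64_t reconstruct_lsb(uint64_t s)
{
    uint64_t t = (s >> 4) ^ s;
    t = (t >> 2) ^ t;
    t = (t >> 1) ^ t;
    return t & 0x0101010101010101ULL;
}

/* Broadcast the reconstructed bit into all 8 bits of its byte: each
   byte is 0x00 or 0x01, so multiplying by 0xFF turns it into 0x00 or
   0xFF without cross-byte carries. */
static inline uint64_t reconstruct_bcast(uint64_t s)
{
    return reconstruct_lsb(s) * 0xFFULL;
}
```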
Beaver multiplication of shared values [x] and [y], given a Beaver triple [a], [b], [c] (with a * b = c), requires that the players compute:
[d] = [x] - [a]
[e] = [y] - [b]
[z] = [c] + [x] * e + [y] * d - e * d
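As a quick sanity check (not in the original note), substituting x = a + d and y = b + e shows the formula opens to x * y; over GF(2) the signs are irrelevant:

c + x*e + y*d - e*d
  = a*b + (a + d)*e + (b + e)*d - e*d
  = a*b + a*e + b*d + e*d
  = (a + d)*(b + e)
  = x*y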
The first two are plain additions, done with a 64-bit XOR as above. The last requires:
- reconstructing d and e (which are public values).
- computing e * d: a single 64-bit AND, plus a mask so that only one player per repetition adds the public value into their share.
- computing [x] * e and [y] * d: recall that reconstruction plus the broadcast step gives a register t in which the reconstructed bit of each repetition fills all 8 bits of the corresponding byte. Hence [x] * e is a single bit-wise AND: S_x & t (and likewise [y] * d).
Instruction count, per multiplication, covering all 8 repetitions at once:
[d] = [x] - [a] : 1 ins. (64-bit XOR)
[e] = [y] - [b] : 1 ins. (64-bit XOR)
Recons e : 8 ins. (shift/XOR ladder, then AND 0x0101...01 and multiply by 0xFF to repeat the result across every bit of each byte)
Recons d : 8 ins. (as above)
[x] * e : 1 ins. (AND)
[y] * d : 1 ins. (AND)
e * d : 2 ins. (AND, then masking with AND 0x0101...01 so only one player takes a copy of the result)
[z] = [c] + [x] * e + [y] * d - e * d : 3 ins. (XOR)
Meaning a total of ~ 25 AMD64 instructions (roughly as many cycles) per multiplication across 8 repetitions, ignoring cryptographic ops, cache misses etc.
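Putting the steps together, a minimal sketch of one bit-sliced multiplication in C (hypothetical names; recons_bcast is the shift ladder described above followed by the broadcast into every bit of each byte):

```c
#include <assert.h>
#include <stdint.h>

#define LSB_MASK 0x0101010101010101ULL

/* Reconstruct each byte's XOR into its least significant bit, then
   broadcast that bit across all 8 bits of the byte. */
static uint64_t recons_bcast(uint64_t s)
{
    uint64_t t = (s >> 4) ^ s;
    t = (t >> 2) ^ t;
    t = (t >> 1) ^ t;
    return (t & LSB_MASK) * 0xFFULL;
}

/* One bit-sliced Beaver multiplication covering 8 players x 8
   repetitions. sx, sy are the packed shares of the input wires;
   sa, sb, sc the packed shares of a triple with a * b = c in each
   repetition. Returns the packed shares of [z] = [x * y]. */
static uint64_t beaver_mul(uint64_t sx, uint64_t sy,
                           uint64_t sa, uint64_t sb, uint64_t sc)
{
    uint64_t sd = sx ^ sa;          /* [d] = [x] - [a]              */
    uint64_t se = sy ^ sb;          /* [e] = [y] - [b]              */
    uint64_t d  = recons_bcast(sd); /* open d, repeated per byte    */
    uint64_t e  = recons_bcast(se); /* open e, repeated per byte    */
    /* [z] = [c] + [x]*e + [y]*d - e*d; the public e*d term is
       masked so only one player per repetition absorbs it. */
    return sc ^ (sx & e) ^ (sy & d) ^ (e & d & LSB_MASK);
}
```

Reconstructing the returned register per byte gives x * y for each of the 8 repetitions.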
Since we are running 8 repetitions in parallel, we are in effect doing 8 times fewer repetitions by packing them into batches of 8. Since we are currently doing ~250 repetitions, this gets us down to ~32 batches, a significant speedup.
Ignoring the cost of invoking the PRG and looking up wires, the technique above would mean we can prove:
- Mult gate: using ~ 700 cycles on AMD64.
- Addition gate: using ~ 32 cycles on AMD64.