Skip to content

Instantly share code, notes, and snippets.

@briansmith
Created August 6, 2015 22:50
Show Gist options
  • Star 3 You must be signed in to star a gist
  • Fork 0 You must be signed in to fork a gist
  • Save briansmith/22342243c8521dee93f8 to your computer and use it in GitHub Desktop.
Save briansmith/22342243c8521dee93f8 to your computer and use it in GitHub Desktop.
Copyright 2015 Brian Smith.
One Way to do Multi-precision Multiplication
============================================
Introduction
------------
This is not the only, or even the best, way to do multiplication. However, this
is the way that the multiplication in GFp is currently done.
When implementing elliptic curve cryptography for n-bit curves, we frequently
have to multiply two n-bit numbers together. The numbers are much too large
to fit into a single word, so we cannot simply use the processor's mul
instruction to do the multiplication. Instead, we have to break the values into
words, do the multiplication word-by-word, and then sum things up. The words of
a multi-precision number are called "limbs." (The GMP project's documentation
explains why the term "limbs" is used.) We store the multi-precision numbers as
arrays of limbs, with the least significant limb first. For example, the P-384
curve on a 32-bit processor requires 12 limbs per value. All the examples in
this document are for the P-256 curve on a 64-bit processor. We denote the
multiplication of two such values as:
wxyz
* dcba
--------
ghijklmn
|w| and |d| are the most significant limbs of the operands, |z| and |a| are the
least significant limbs of the operands, |g| is the most significant limb of
the result, and |n| is the least significant limb of the result. The result has
twice as many limbs as the operands, but it will be reduced modulo |q| or |n|
so that it is the same size as the original operands.
Row-wise Schoolbook Multiplication
----------------------------------
GFp multiplies using a variant of the row-wise schoolbook method. As the name
implies, this is the same method that children usually learn in school:
w x y z
* d c b a
--------------
wa xa ya za
+ wb xb yb zb
+ wc xc yc zc
+ wd xd yd zd
First, the top operand is multiplied by the least significant digit of the
second operand. Then, the top operand is multiplied by the second-least
significant digit of the second operand, etc. All of these products are then
summed together. The diagram makes it look easy. However, the diagram is very
misleading because it makes all the products (|wd|, |yb|, |za|, etc.) look like
they fit into single words. In fact, each of those products is two words, and
that makes things considerably more complicated.
Accumulating a Carry Bit in the Low-order Limb of a Sum
-------------------------------------------------------
Here are the sums of two maximum-value limbs of size 64 bits, 32 bits, and 16
bits, with the high-order and low-order limb of each sum separated with a
space:
0xFFFFFFFFFFFFFFFFFF 0xFFFFFFFF 0xFFFF
+ 0xFFFFFFFFFFFFFFFFFF + 0xFFFFFFFF + 0xFFFF
---------------------- ------------ --------
0x1 FFFFFFFFFFFFFFFFFE 0x1 FFFFFFFE 0x1 FFFE
Notice that the sums are one bit larger than the inputs. This extra bit is the
"carry bit" and dealing with it is source of most of the complication in
implementing multi-precision multiplication efficiently.
Let's define some notation. A sum (a + b) that results in a carry is denoted:
a
+ b
c <----
r
A sum (a + b) that is guaranteed to NOT result in a carry is denoted:
a
+ b
---
r
In both cases, the low-order limb of the sum is stored into |r|.
Notice from the sums of maximum-value limbs above that the low-order limb of
each sum is always at least one less than the maximum value of a limb. This
allows us to add a single carry bit |c| to the low-order limb of a sum |s|
of two limbs without causing a carry:
c
+ s
---
r
In particular, this means that we can always add a carry bit and two full-sized
limbs together, generating at most one carry bit:
c
+ a
+ b
c <----
r
(This explains why add-with-carry instructions like the x86/x64 adc instruction
work when there is only one carry bit to set on output.)
Accumulating a Carry Bit in the High-order Limb of a Product
------------------------------------------------------------
Here are the product values for the multiplication of two maximum-value limbs
of size 64 bits, 32 bits, and 16 bits, with the high-order and low-order limb
of each product separated with a space:
0xFFFFFFFFFFFFFFFFFF 0xFFFFFFFF 0xFFFF
* 0xFFFFFFFFFFFFFFFFFF * 0xFFFFFFFF * 0xFFFF
--------------------------------------- ------------------- -----------
0xFFFFFFFFFFFFFFFFFE 000000000000000001 0xFFFFFFFE 00000001 0xFFFE 0001
This shows that it is possible to add 1 to the value of the high order limb of
the double-precision product of two single-limb operands without causing a
carry. In particular, adding a carry bit to the high-order limb of a
multiplication will not cause a subsequent carry.
The First Row
-------------
Let's have |ab[lo]| and |ab[hi]| denote, respectively, the low-order limb and
high-order limb of the product of single-limb values |a| and |b|.
Now, let's consider the case of a multiplication of a multi-limb value |wxyz|
with with a single-word operand |a|, where |w| is the high-order limb and |z|
is the low-order limb. This is the first step of a full multi-precision
multiplication:
w x y z
* a
----------------------------------------------------
za[lo]
za[hi] ------
+ ya[lo] r0
c <---------
+ ya[hi] r1
+ xa[lo]
c <---------
+ xa[hi] r2
+ wa[lo]
c <---------
+ wa[hi] r3
--------
r4
The diagram is a time series, with time starting at the top and ending at the
bottom. Thus, for example, the multiplication |z*a| that results in |za[hi]|
and |za[lo]| is done first and the addition that results in |r4| is done last.
The values |r4| through |r0| contain the result of the multiplication, with
|r4| being the high-order limb and |r0| being the low-order limb.
The above diagram is oversimplified though, because it does not show how the
multiply instructions are to be scheduled. On x86 and x64, at least, the normal
mul, add, and adc instructions all reset the carry bit of the flags register.
This means the exact ordering of all the operations is important. To account
for this, in the following diagrams, we also denote when each multiply should
be done, and precisely how to save the carry bit before it s destroyed.
The issuance of the multiplication instruction to multiply |w| and |a| is
denoted:
mul w*a
When the variable |wa[hi]| holds the high-order limb of the product of two
single limbs |w| and |a|, incrementing its value by the value of a carry bit
|c| is denoted by:
wa[hi] += c
The addition will not cause a carry for the reason described earlier.
This is the same multiplication of |wxyz| by |a| as in the previous diagram,
with the instruction scheduling made explicit:
w x y z
* a
------------------------------------------------------
mul z*a
za[lo]
------
r0
+------------------------------+
| |
| mul y*a |
| |
| za[hi] |
| + ya[lo] |
| c <----------- |
| r1 |
| ya[hi] += c |
| |
+------------------------------+
+------------------------------+
| |
| mul x*a |
| |
| ya[hi] |
| + xa[lo] |
| c <----------- |
| r2 |
| xa[hi] += c |
| |
+------------------------------+
mul w*a
xa[hi]
+ wa[lo]
c <-----------
+ wa[hi] r3
--------
r4
In the diagram, the steps in the two boxes are the same. The expansion of the
to logic to work with values of a larger number of limbs just requires
replicating those middle steps as necessary--i.e. the steps in the boxes are
the ones that would go into a loop. The steps preceding the repeating steps are
different only because they don't include a carry bit in the addition, while
the section after the repeating steps differs only because it doesn't include a
low-order limb in the final addition.
Subsequent Rows
---------------
After the first row multiplication, subsequent multiplications will have their
product added to an accumulator. (Ignore the fact that some kind of shifting by
some number of place values needs to take place before the addition; it isn't
relevant for now, and in a good implementation of Montgomery multiplication,
the shifting is all implicit.)
r4 r3 r2 r1 r0
+= ( w x y z
* b )
----------------------------------------------------------------
mul z*b
r0
+ zb[lo]
c <---------
r0
zb[hi] += c
mul y*b
yb[lo]
+ zb[hi]
c <-----------
tmp
+---------------------------------+
| |
| yb[hi] += c |
| |
| mul x*b |
| r1 |
| + tmp |
| c <----------- |
| + xb[lo] r1 |
| + yb[hi] |
| c <----------- |
| tmp |
| |
+---------------------------------+
+---------------------------------+
| |
| xb[hi] += c |
| |
| mul w*b |
| r2 |
| + tmp |
| c <----------- |
| + wb[lo] r2 |
| + xb[hi] |
| c <----------- |
| tmp |
| |
+---------------------------------+
wb[hi] += c
r3
+ tmp
c <-----------
+ r4 r3
+ wb[hi]
c <-----------
---- r4
rc
Again, the boxes indicate which steps would appear in a loop. The variable |rc|
accounts for the extra highest-order carry bits that accumulate as the sums are
done.
Further Considerations
----------------------
The above diagrams cannot be translated directly into x86/x64 assembly language
because it uses three-operand arithmetic while x86/x64 assembly language uses
two-operand arithmetic. However, the translation from three-operand arithmetic
to two-operand arithmetic is pretty straightforward.
TODO: OpenSSL and others have gotten this wrong multiple times, by not
accounting for an almost-never-occurring carry. It seems likely that we
currently suffer from the same issue. We should figure out what it is and fix
it.
TODO: It may be possible to optimize such addition chains further by taking
advantage of the fact that certain combinations of the sums of the high-order
limbs of products will not generate multiple carry bits, as mentioned in this
Intel paper:
http://www.intel.com/content/dam/www/public/us/en/documents/white-papers/ia-large-integer-arithmetic-paper.pdf
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment