vtnerd/benchmark.md

## benchmark.md

      
Benchmark for Various Wallet Crypto Modes

Research into performance optimizations for wallet scanning. Primarily useful for organizations handling multiple wallets.
Variations


Standard - "whatever monero-wallet-cli is currently doing".
3-bit pre-computed table - pre-compute a small table for the ECDH step (tx pub shared secret). Only useful if scanning multiple wallets at a time (i.e. mymonero case).
spend public de-compression - de-compress the spend public key from (y) to (x,y,z,t) exactly once. Currently this is done for every output scanned.
curve25519 shared secret - use curve25519 for the ECDH step (tx pub shared secret) instead of the ed25519 curve. This is faster in some cases because there are many Montgomery ladder implementations for this curve.
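
To make the spend public de-compression variant concrete, here is a minimal Python sketch of expanding a compressed ed25519 point from (y) to (x,y,z,t), following the standard point-decoding procedure. It is not the code from the branch: the names `decompress` and `scan` are hypothetical, and the scan loop only shows where the expansion is hoisted, not the real per-output derivation.

```python
P = 2**255 - 19                              # field prime
D = (-121665 * pow(121666, P - 2, P)) % P    # curve constant d
SQRT_M1 = pow(2, (P - 1) // 4, P)            # a square root of -1 mod P


def decompress(encoded: bytes):
    """Expand a 32-byte compressed point (y plus sign bit of x) to (x, y, z, t)."""
    if len(encoded) != 32:
        raise ValueError("expected 32 bytes")
    raw = int.from_bytes(encoded, "little")
    sign = raw >> 255
    y = raw & ((1 << 255) - 1)
    if y >= P:
        raise ValueError("non-canonical y")
    # Solve x^2 = (y^2 - 1) / (d*y^2 + 1); this square root is the costly part.
    x2 = (y * y - 1) * pow(D * y * y + 1, P - 2, P) % P
    x = pow(x2, (P + 3) // 8, P)
    if (x * x - x2) % P != 0:
        x = x * SQRT_M1 % P
    if (x * x - x2) % P != 0:
        raise ValueError("not a curve point")
    if x == 0 and sign:
        raise ValueError("invalid sign bit")
    if (x & 1) != sign:
        x = P - x
    return (x, y, 1, x * y % P)


def scan(spend_public: bytes, output_count: int):
    """Hypothetical scan loop: expand the key once, reuse it for every output."""
    spend_point = decompress(spend_public)   # done once, not once per output
    return [(index, spend_point) for index in range(output_count)]


# The ed25519 base point serves as a stand-in for a spend public key.
BASE_POINT = bytes.fromhex("58" + "66" * 31)
x, y, z, t = decompress(BASE_POINT)
```

The expansion needs a modular square root, so repeating it for every output is wasted work when the spend public key never changes.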

Optimizations


ref10 - default implementation currently in use by monero-wallet-cli.
amd64-51-30k - ed25519 implementation from supercop, currently proposed in a PR for wallet scanning (i.e. compatible with current protocol).
amd64-64-24k - ed25519 implementation from supercop, currently proposed in a PR for wallet scanning (i.e. compatible with current protocol).
amd64-51-sandy2x - curve25519 implementation from supercop + amd64-51-30k (see above) where appropriate. Not compatible with current protocol.
amd64-64-sandy2x - curve25519 implementation from supercop + amd64-64-24k (see above) where appropriate. Not compatible with current protocol.

Running

The source code is on GitHub in a branch. Clone my repo, switch to this branch, create a build directory (anywhere), and then run `cmake /PATH_TO_SOURCE/ -DCMAKE_BUILD_TYPE=Release && make wallet-crypto-bench`. This should automatically add the amd64 specializations when targeting that architecture. If you have a newish processor, adding `-DWALLET_CRYPTO=auto -DWALLET_CRYPTO_BENCH="amd64-51-sandy2x;amd64-64-sandy2x"` to the cmake command will add optimizations requiring instructions introduced with the Sandy Bridge line of processors.
Observations

ECDH Step

The sandy2x curve25519 ECDH is 30% faster than the amd64-51-30k ed25519 monero ECDH ("monero" means multiplying by the cofactor AFTER the ECDH, whereas curve25519 uses scalar clamping, which loses several bits of security). A small 3-bit table reduces the time by 15%, and if users were willing to trade more memory when scanning many wallets, it's likely that it would beat the sandy2x implementation (there is a reason why ed25519 exists separately). But for standard wallets, a sandy2x curve25519 implementation will likely remain faster.
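
The 3-bit table idea can be illustrated with fixed-window scalar multiplication. The sketch below is not the ed25519 code from the branch: it substitutes modular exponentiation for point multiplication (squaring stands in for point doubling, multiplying for point addition), and the function names are hypothetical. The table depends only on the base (standing in for the tx pub key), so presumably it is built once per transaction and reused for every wallet's scalar.

```python
MODULUS = 2**255 - 19   # any modulus works for this toy group


def build_table(base: int, mod: int, width: int = 3):
    """Pre-compute base^0 .. base^(2^width - 1); 8 entries for a 3-bit table."""
    table = [1]
    for _ in range((1 << width) - 1):
        table.append(table[-1] * base % mod)
    return table


def windowed_pow(table, scalar: int, mod: int, width: int = 3) -> int:
    """Consume the scalar `width` bits at a time, most significant window first."""
    mask = (1 << width) - 1
    windows = -(-max(scalar.bit_length(), 1) // width)   # ceiling division
    acc = 1
    for i in reversed(range(windows)):
        for _ in range(width):
            acc = acc * acc % mod                         # "doubling"
        acc = acc * table[(scalar >> (i * width)) & mask] % mod   # one lookup
    return acc


# One table for the shared base, reused for several scalars (one per wallet).
table = build_table(5, MODULUS)
wallet_scalars = [7, 123456789, 2**200 + 12345]
shared = [windowed_pow(table, k, MODULUS) for k in wallet_scalars]
```

A wider window means a larger table (2^width entries) but fewer table multiplications per scalar, which is the memory-for-time trade described above.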
Tx Scanning

The ECDH step is done only once per transaction, while the output public key step is repeated for every output. So once transactions average 3 outputs, the de-compression optimization is faster than the curve25519 protocol variant. Since the two can be combined, the combined speedup will be quite large. The de-compression optimization is more likely to make it into mainline since it does not require a protocol change.
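
As a rough check on the three-output figure, the sketch below assumes the cost of scanning a transaction is one ECDH step plus one output public key step per output, and plugs in the amd64-51-30k and amd64-51-sandy2x component timings from the Mac Mini run in the benchmark results. The additive model and the function names are illustrative assumptions, not part of the benchmark tool.

```python
import math

# Component timings in ns per 1000 iterations, copied from the Mac Mini run.
ECDH_STANDARD = 66_972_198        # TX pub key, standard, amd64-51-30k
ECDH_CURVE25519 = 51_482_527      # TX pub key, curve25519, amd64-51-sandy2x
OUTPUT_STANDARD = 27_171_394      # output pub key, standard, amd64-51-30k
OUTPUT_DECOMPRESSED = 21_726_363  # output pub key, de-compression, amd64-51-30k


def tx_cost(ecdh_ns: int, output_ns: int, outputs: int) -> int:
    """Assumed model: one ECDH per transaction plus one key step per output."""
    return ecdh_ns + outputs * output_ns


def break_even_outputs() -> float:
    """Outputs per tx at which de-compression catches up with curve25519."""
    ecdh_penalty = ECDH_STANDARD - ECDH_CURVE25519
    per_output_saving = OUTPUT_STANDARD - OUTPUT_DECOMPRESSED
    return ecdh_penalty / per_output_saving


for n in (2, 3, 4):
    curve = tx_cost(ECDH_CURVE25519, OUTPUT_STANDARD, n)
    decomp = tx_cost(ECDH_STANDARD, OUTPUT_DECOMPRESSED, n)
    print(n, "curve25519:", curve, "de-compression:", decomp)

print("first whole output count where de-compression wins:",
      math.ceil(break_even_outputs()))
```

Under this model the break-even lies between 2 and 3 outputs, which agrees with the measured transaction benchmarks: curve25519 is fastest at 2 outputs and de-compression is fastest at 3 and 4.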
Benchmark Results

@vtnerd (Lee Clagett)

OSX/2014 Mac Mini - i5-3210M (3MB cache, 2.5GHz/3.1GHz)

bash-3.2$ ./src/wallet/crypto/wallet-crypto-bench 1000 2,3,4
Running benchmark using 1000 iterations
Transaction Component Benchmarks
--------------------------------
+ Output Pub Key 21319184 ns (+0%)
|-+ spend public de-compression 21319184 ns (+0%)
| |----> amd64-64-24k => 21319184 ns (+0%)
| |----> amd64-51-30k => 21726363 ns (+1.90992%)
|
|-+ standard 27133825 ns (+27.2742%)
| |----> amd64-64-24k => 27133825 ns (+0%)
| |----> amd64-51-30k => 27171394 ns (+0.138458%)
| |-----------> ref10 => 78909394 ns (+190.816%)
|
+ TX Pub Key 51482527 ns (+141.485%)
|-+ curve25519 shared secret 51482527 ns (+0%)
| |> amd64-51-sandy2x => 51482527 ns (+0%)
| |> amd64-64-sandy2x => 51534492 ns (+0.100937%)
|
|-+ 3-bit table precomp 59956636 ns (+16.4602%)
| |----> amd64-51-30k => 59956636 ns (+0%)
| |----> amd64-64-24k => 63450467 ns (+5.82726%)
|
|-+ standard 66972198 ns (+30.0872%)
| |----> amd64-51-30k =>  66972198 ns (+0%)
| |----> amd64-64-24k =>  70968002 ns (+5.96636%)
| |-----------> ref10 => 213573377 ns (+218.899%)
|
Transaction Benchmarks
----------------------
+ Txes with 2 outputs 106635041 ns (+0%)
|-+ curve25519 tx shared secret 106635041 ns (+0%)
| |> amd64-64-sandy2x => 106635041 ns (+0%)
| |> amd64-51-sandy2x => 106902531 ns (+0.250846%)
|
|-+ spend public de-compression 110553385 ns (+3.67454%)
| |----> amd64-51-30k => 110553385 ns (+0%)
| |----> amd64-64-24k => 114183139 ns (+3.28326%)
|
|-+ standard scanning 121447979 ns (+13.8912%)
| |----> amd64-51-30k => 121447979 ns (+0%)
| |----> amd64-64-24k => 126257847 ns (+3.96043%)
| |-----------> ref10 => 371441590 ns (+205.844%)
|
+ Txes with 3 outputs 132387825 ns (+24.1504%)
|-+ spend public de-compression 132387825 ns (+0%)
| |----> amd64-51-30k => 132387825 ns (+0%)
| |----> amd64-64-24k => 135484028 ns (+2.33874%)
|
|-+ curve25519 tx shared secret 133791021 ns (+1.05991%)
| |> amd64-64-sandy2x => 133791021 ns (+0%)
| |> amd64-51-sandy2x => 134106593 ns (+0.235869%)
|
|-+ standard scanning 148677118 ns (+12.3042%)
| |----> amd64-51-30k => 148677118 ns (+0%)
| |----> amd64-64-24k => 159819183 ns (+7.49414%)
| |-----------> ref10 => 450211899 ns (+202.812%)
|
+ Txes with 4 outputs 154223497 ns (+44.6274%)
|-+ spend public de-compression 154223497 ns (+0%)
| |----> amd64-51-30k => 154223497 ns (+0%)
| |----> amd64-64-24k => 156956657 ns (+1.77221%)
|
|-+ curve25519 tx shared secret 161023821 ns (+4.4094%)
| |> amd64-64-sandy2x => 161023821 ns (+0%)
| |> amd64-51-sandy2x => 161395605 ns (+0.230888%)
|
|-+ standard scanning 176036947 ns (+14.1441%)
| |----> amd64-51-30k => 176036947 ns (+0%)
| |----> amd64-64-24k => 180762432 ns (+2.68437%)
| |-----------> ref10 => 528785112 ns (+200.383%)
|
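
The percentages in the output appear to be relative to the fastest entry at the same level of the tree (the first one listed). That reading can be checked against the raw nanosecond figures with a minimal sketch; the helper name is hypothetical.

```python
def percent_slower(value_ns: int, fastest_ns: int) -> float:
    """Relative slowdown of value_ns against the fastest sibling, in percent."""
    return (value_ns / fastest_ns - 1.0) * 100.0


# ref10 vs amd64-64-24k for the standard output pub key step (reported +190.816%).
ref10_output = percent_slower(78_909_394, 27_133_825)

# TX pub key section vs output pub key section (reported +141.485%).
tx_vs_output = percent_slower(51_482_527, 21_319_184)

print(f"{ref10_output:.3f} {tx_vs_output:.3f}")
```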