
@jsign
Created October 5, 2023 16:34

So, for the verifying part of the proof, parallelization makes a difference.

Verifying proof of 16k openings:

Total duration: 56ms
Duration per step:
      Generate challenge r and powers                       : 32.02ms
      Calculating helper_scalars r^i/(t-z_i)                : 0.72ms
      g_2(t) = SUM y_i*(r^i/(t-z_i))=SUM y_i*helper_scalars : 0.03ms
      Compute E                                             : 21.26ms
      Compute E-D and verify IPA                            : 2.92ms

And without parallelization:

Total duration: 152ms
Duration per step:
      Generate challenge r and powers                       : 31.31ms
      Calculating helper_scalars r^i/(t-z_i)                : 0.86ms
      g_2(t) = SUM y_i*(r^i/(t-z_i))=SUM y_i*helper_scalars : 0.03ms
      Compute E                                             : 112.91ms
      Compute E-D and verify IPA                            : 6.96ms

This makes sense: computing E is an MSM of length 16k, and that's parallelized in gnark-crypto, so it becomes the dominant factor.
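
For reference, here's a minimal sketch of that step (not go-ipa's actual code path: I'm using gnark-crypto's bls12-381 package for illustration while we actually work over Banderwagon, and the function name is mine). The point is the NbTasks knob: MultiExp distributes the bucket work across that many goroutines, which is the ~21ms vs ~113ms gap above.

```go
import (
	"runtime"

	"github.com/consensys/gnark-crypto/ecc"
	bls12381 "github.com/consensys/gnark-crypto/ecc/bls12-381"
	"github.com/consensys/gnark-crypto/ecc/bls12-381/fr"
)

// computeE does the length-N MSM E = Σ scalar_i · C_i. gnark-crypto
// parallelizes the bucket accumulation across up to NbTasks goroutines;
// NbTasks = 1 would force the serial path we measured above.
func computeE(points []bls12381.G1Affine, scalars []fr.Element) (bls12381.G1Jac, error) {
	var e bls12381.G1Jac
	_, err := e.MultiExp(points, scalars, ecc.MultiExpConfig{NbTasks: runtime.NumCPU()})
	return e, err
}
```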

But for proving, I see both results as the same (1 core vs 8 cores), which didn’t make sense to me.

Looking at the code now, we’re missing some opportunities for the prover…

The bottlenecks for the prover are (where N is the number of openings):

  • The commitments come in projective form from the VKT layer, but for Fiat-Shamir we need to transform them to affine before adding them to the transcript. We properly batch all those N inversions, so that’s fine (1 inversion for the 16k normalizations). But doing all the X*Zinv and Y*Zinv multiplications is still decent work for 16k points; this could be parallelized and maybe make a difference (see the first sketch after this list).
  • Computing r, r^2, r^3, … is serial, as mentioned before in the chat, so this isn’t a surprise. (But we could revisit the ideas from earlier and see at how many openings a parallel version wins over the serial loop; see the second sketch after this list.)
  • The last bottleneck is computing g in evaluation form, since it depends on N. Here we apply a clever trick from Kev: first group the openings by evaluation point. Although we have 16k openings, there can be at most 256 evaluation points, which bounds most of the work in the “divide on domain” step and everything related. But I don’t see that the grouping by evaluation point is parallelized, so we could try doing that (see the third sketch after this list).
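
First sketch: batched projective→affine normalization with the leftover per-point multiplications parallelized. This is a hand-rolled illustration with hypothetical types (ProjPoint/AffinePoint are not go-ipa’s types; I use bls12-381’s fr as the base field since Bandersnatch coordinates live in that field):

```go
import (
	"runtime"
	"sync"

	"github.com/consensys/gnark-crypto/ecc/bls12-381/fr"
)

// Hypothetical point types, for illustration only.
type ProjPoint struct{ X, Y, Z fr.Element }
type AffinePoint struct{ X, Y fr.Element }

// batchNormalize converts N projective points to affine with a single
// field inversion (Montgomery's trick), then fans the remaining 2N
// multiplications X*Zinv, Y*Zinv out to goroutines — the part that is
// serial today and is "decent work" for 16k points.
func batchNormalize(ps []ProjPoint) []AffinePoint {
	n := len(ps)
	if n == 0 {
		return nil
	}
	// Prefix products acc[i] = Z_0 * … * Z_i (serial, N muls).
	acc := make([]fr.Element, n)
	acc[0] = ps[0].Z
	for i := 1; i < n; i++ {
		acc[i].Mul(&acc[i-1], &ps[i].Z)
	}
	// One inversion covers all N points.
	var inv fr.Element
	inv.Inverse(&acc[n-1])
	// Walk backwards to peel off each Z_i^{-1} (serial, ~2N muls).
	zinv := make([]fr.Element, n)
	for i := n - 1; i > 0; i-- {
		zinv[i].Mul(&inv, &acc[i-1])
		inv.Mul(&inv, &ps[i].Z)
	}
	zinv[0] = inv
	// The X*Zinv, Y*Zinv work is independent per point: parallelize it.
	out := make([]AffinePoint, n)
	workers := runtime.NumCPU()
	chunk := (n + workers - 1) / workers
	var wg sync.WaitGroup
	for w := 0; w < workers; w++ {
		s, e := w*chunk, (w+1)*chunk
		if s >= n {
			break
		}
		if e > n {
			e = n
		}
		wg.Add(1)
		go func(s, e int) {
			defer wg.Done()
			for i := s; i < e; i++ {
				out[i].X.Mul(&ps[i].X, &zinv[i])
				out[i].Y.Mul(&ps[i].Y, &zinv[i])
			}
		}(s, e)
	}
	wg.Wait()
	return out
}
```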
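
Second sketch: the powers r, r^2, r^3, … form a serial chain, but we can split the range into chunks where each worker pays one exponentiation to jump to its chunk’s starting power and then multiplies serially inside the chunk. That Exp jump is exactly the overhead that creates a cutoff point: below some N the plain serial loop wins. (Again hypothetical code, same fr as above.)

```go
import (
	"math/big"
	"runtime"
	"sync"

	"github.com/consensys/gnark-crypto/ecc/bls12-381/fr"
)

// powersOfR fills out[i] = r^i. Each worker computes r^s once via Exp,
// then extends its chunk with one multiplication per element.
func powersOfR(r fr.Element, n int) []fr.Element {
	out := make([]fr.Element, n)
	workers := runtime.NumCPU()
	chunk := (n + workers - 1) / workers
	var wg sync.WaitGroup
	for w := 0; w < workers; w++ {
		s, e := w*chunk, (w+1)*chunk
		if s >= n {
			break
		}
		if e > n {
			e = n
		}
		wg.Add(1)
		go func(s, e int) {
			defer wg.Done()
			// One exponentiation to land on r^s…
			out[s].Exp(r, big.NewInt(int64(s)))
			// …then strictly serial inside the chunk.
			for i := s + 1; i < e; i++ {
				out[i].Mul(&out[i-1], &r)
			}
		}(s, e)
	}
	wg.Wait()
	return out
}
```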
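
Third sketch: the grouping-by-evaluation-point trick, eliding the y_i terms for brevity. Since each z_i is in [0, 256), we first aggregate Σ r^i·f_i per distinct z, so the expensive divide-on-domain runs at most 256 times no matter how large N is. The aggregation loop below is the part that runs serially today and could be chunked across goroutines. (Types and names are hypothetical, not go-ipa’s.)

```go
import "github.com/consensys/gnark-crypto/ecc/bls12-381/fr"

// opening is a hypothetical stand-in for one opening: the evaluation
// point z and the polynomial f in evaluation form over the domain.
type opening struct {
	z uint8
	f []fr.Element
}

// groupByEvalPoint returns, for each distinct z, the aggregate
// Σ_{i : z_i = z} r^i · f_i. Only these ≤256 aggregates go through
// divide-on-domain afterwards, which is what bounds the work.
func groupByEvalPoint(ops []opening, rPowers []fr.Element, domainSize int) map[uint8][]fr.Element {
	grouped := make(map[uint8][]fr.Element, 256)
	for i, op := range ops {
		agg, ok := grouped[op.z]
		if !ok {
			agg = make([]fr.Element, domainSize)
			grouped[op.z] = agg
		}
		var tmp fr.Element
		for j := 0; j < domainSize; j++ {
			tmp.Mul(&rPowers[i], &op.f[j])
			agg[j].Add(&agg[j], &tmp)
		}
	}
	return grouped
}
```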

Conclusion: looks like our prover is mostly serial today (despite being pretty efficient thanks to the tricks). Next week I’ll try applying the above ideas, see if they make a difference (depending on the number of openings), and measure again both parallel and serial.
