Finding the barrier toward better parallelism #11

weikengchen · 2020-11-28T08:35:16Z

More updates will be added to this note.

I measured the cost for Groth16 to prove a constraint system with two million constraints, with a different number of cores, using cargo bench in this repo with appropriate command line parameters.

Below I focus on the main cost in my constraint system, which turns out to be in the witness map and computation of C.

Going from 1 core to 4 cores, the improvement is significant. But later when more cores are added, the time for the witness map seems to not change a lot.

Since the witness map involves a lot of FFT, it may suggest that the current implementation of FFT has a barrier toward many-many-core parallelism.

Such a barrier, maybe avoidable, maybe unavoidable. I will take a look at the detailed breakdown of the cost of the witness map.

===== Groth16 =====

1 core
R1CS to QAP witness map ... 32.102s
Compute C ................. 56.453s
Total ..................... 93.780s

4 cores
R1CS to QAP witness map ... 12.476s
Compute C ................. 16.54s
Total ..................... 34.203s

20 cores
R1CS to QAP witness map ... 8.856s
Compute C ................. 8.631s
Total ..................... 23.99s

40 cores
R1CS to QAP witness map ... 8.300s
Compute C ................. 4.414s
Total ..................... 18.319s

48 cores
R1CS to QAP witness map ... 8.166s
Compute C ................. 4.682s
Total ..................... 18.230s

The text was updated successfully, but these errors were encountered:

weikengchen · 2020-11-29T06:16:41Z

new numbers after the FFT change. Seems to be still similar @ValarDragon

Note: the result is one iteration, so there could be statistical fluctuation.

===== Groth16 =====

1 core
R1CS to QAP witness map ... 31.945s
Compute C ................. 55.35s
Total ..................... 92.371s

4 cores
R1CS to QAP witness map ... 12.143s
Compute C ................. 16.867s
Total ..................... 34.371s

20 cores
R1CS to QAP witness map ... 7.408s
Compute C ................. 4.489s
Total ..................... 17.283s

40 cores
R1CS to QAP witness map ... 7.754s
Compute C ................. 4.484s
Total ..................... 17.738s

48 cores
R1CS to QAP witness map ... ~8.40s~ (rerun: 7.807s)
Compute C ................. ~4.715s~ (rerun: 4.421s)
Total ..................... ~18.349s~ (rerun: 17.680s)

ValarDragon · 2020-11-29T16:20:50Z

Oh that PR didn't actually change the FFT algorithm, just documented the existing one and identified the problem

I'll let you know once I update the algorithm

Pratyush · 2021-01-21T20:05:28Z

what was the exact command that you ran, @weikengchen ?

Pratyush mentioned this issue Nov 28, 2020

Add documentation for parallel FFT algorithm arkworks-rs/algebra#103

Merged

6 tasks

Pratyush added D-medium Difficulty: medium T-design Type: discuss API design and/or research T-performance Type: performance improvements labels Sep 9, 2021

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Finding the barrier toward better parallelism #11

Finding the barrier toward better parallelism #11

weikengchen commented Nov 28, 2020

weikengchen commented Nov 29, 2020 •

edited

Loading

ValarDragon commented Nov 29, 2020 •

edited

Loading

Pratyush commented Jan 21, 2021

Finding the barrier toward better parallelism #11

Finding the barrier toward better parallelism #11

Comments

weikengchen commented Nov 28, 2020

weikengchen commented Nov 29, 2020 • edited Loading

ValarDragon commented Nov 29, 2020 • edited Loading

Pratyush commented Jan 21, 2021

weikengchen commented Nov 29, 2020 •

edited

Loading

ValarDragon commented Nov 29, 2020 •

edited

Loading