Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Finding the barrier toward better parallelism #11

Open
weikengchen opened this issue Nov 28, 2020 · 3 comments
Open

Finding the barrier toward better parallelism #11

weikengchen opened this issue Nov 28, 2020 · 3 comments
Labels
D-medium Difficulty: medium T-design Type: discuss API design and/or research T-performance Type: performance improvements

Comments

@weikengchen
Copy link
Member

More updates will be added to this note.

I measured the cost for Groth16 to prove a constraint system with two million constraints, with a different number of cores, using cargo bench in this repo with appropriate command line parameters.

Below I focus on the main cost in my constraint system, which turns out to be in the witness map and computation of C.

Going from 1 core to 4 cores, the improvement is significant. But later when more cores are added, the time for the witness map seems to not change a lot.

Since the witness map involves a lot of FFT, it may suggest that the current implementation of FFT has a barrier toward many-many-core parallelism.

Such a barrier, maybe avoidable, maybe unavoidable. I will take a look at the detailed breakdown of the cost of the witness map.

===== Groth16 =====

1 core
R1CS to QAP witness map ... 32.102s
Compute C ................. 56.453s
Total ..................... 93.780s

4 cores
R1CS to QAP witness map ... 12.476s
Compute C ................. 16.54s
Total ..................... 34.203s

20 cores
R1CS to QAP witness map ... 8.856s
Compute C ................. 8.631s
Total ..................... 23.99s

40 cores
R1CS to QAP witness map ... 8.300s
Compute C ................. 4.414s
Total ..................... 18.319s

48 cores
R1CS to QAP witness map ... 8.166s
Compute C ................. 4.682s
Total ..................... 18.230s
@weikengchen
Copy link
Member Author

weikengchen commented Nov 29, 2020

new numbers after the FFT change. Seems to be still similar @ValarDragon

Note: the result is one iteration, so there could be statistical fluctuation.

===== Groth16 =====

1 core
R1CS to QAP witness map ... 31.945s
Compute C ................. 55.35s
Total ..................... 92.371s

4 cores
R1CS to QAP witness map ... 12.143s
Compute C ................. 16.867s
Total ..................... 34.371s

20 cores
R1CS to QAP witness map ... 7.408s
Compute C ................. 4.489s
Total ..................... 17.283s

40 cores
R1CS to QAP witness map ... 7.754s
Compute C ................. 4.484s
Total ..................... 17.738s

48 cores
R1CS to QAP witness map ... ~8.40s~ (rerun: 7.807s)
Compute C ................. ~4.715s~ (rerun: 4.421s)
Total ..................... ~18.349s~ (rerun: 17.680s)

@ValarDragon
Copy link
Member

ValarDragon commented Nov 29, 2020

Oh that PR didn't actually change the FFT algorithm, just documented the existing one and identified the problem

I'll let you know once I update the algorithm

@Pratyush
Copy link
Member

what was the exact command that you ran, @weikengchen ?

@Pratyush Pratyush added D-medium Difficulty: medium T-design Type: discuss API design and/or research T-performance Type: performance improvements labels Sep 9, 2021
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
D-medium Difficulty: medium T-design Type: discuss API design and/or research T-performance Type: performance improvements
Projects
None yet
Development

No branches or pull requests

3 participants