Support 64-bit limbs on no-asm platforms #109

Open · wants to merge 1 commit into master

Conversation


uweigand commented Mar 4, 2022

Currently, platforms without assembler support always use 32-bit limbs,
but the Rust bindings always assume 64-bit limbs. This breaks on
big-endian platforms like our IBM Z (s390x).

This patch enables 64-bit limbs on 64-bit platforms even if there is
no hand-written assembler, by using a 128-bit integer type in the
C implementation (this is an extension that is widely supported on
64-bit platforms with GCC or LLVM).

This fixes the broken Rust bindings on IBM Z, and also improves
performance by a factor of 3 or more, because compiler-generated
code handling __int128 already uses the 64x64->128 multiply
instruction our ISA provides.

To improve performance of compiler-generated code a bit more, this
also switches to the -O3 optimization level, which helps with
unrolling of the Montgomery multiply core loop.

mratsim (Contributor) commented Mar 5, 2022

For all platforms that don't support a native 64x64->128 multiply (do MIPS or RISC-V, for example?), is the GCC/Clang fallback constant-time, or does it use ternaries or comparisons for carries?

uweigand (Author) commented Mar 7, 2022

> For all platforms that don't support a native 64x64->128 multiply (do MIPS or RISC-V, for example?), is the GCC/Clang fallback constant-time, or does it use ternaries or comparisons for carries?

The 64-bit versions of MIPS and RISC-V support a native 64x64->128 multiply (either as a single instruction or as a pair of instructions).

I can find only one 64-bit target supported by LLVM that doesn't have a native implementation (sparc64), and this uses a call to the __multi3 library routine. I guess it depends on where this library is taken from, but at least the LLVM (compiler-rt) default implementation of it doesn't appear to use conditional code paths:
https://github.com/llvm/llvm-project/blob/main/compiler-rt/lib/builtins/multi3.c

If this is a concern, we could make use of 64-bit limbs conditional on the architecture, with s390x (and possibly others where it makes sense) opting in?
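
To illustrate why even a library fallback can be branch-free, here is a minimal sketch of my own (in the spirit of the compiler-rt code linked above, not a copy of it) showing a 64x64->128 product built purely from 64-bit shifts, masks, multiplies and adds, with no comparisons for carries:

#include <stdint.h>

/* Hypothetical helper: full 64x64->128 multiply without a wide type.
 * All carries flow through arithmetic, so there are no data-dependent
 * branches or comparisons. */
static void mul_64x64_128(uint64_t a, uint64_t b, uint64_t *hi, uint64_t *lo)
{
    uint64_t a_lo = a & 0xffffffff, a_hi = a >> 32;
    uint64_t b_lo = b & 0xffffffff, b_hi = b >> 32;

    uint64_t p0 = a_lo * b_lo;
    uint64_t p1 = a_lo * b_hi;
    uint64_t p2 = a_hi * b_lo;
    uint64_t p3 = a_hi * b_hi;

    /* middle column plus the carry out of the low partial product */
    uint64_t mid = (p0 >> 32) + (p1 & 0xffffffff) + (p2 & 0xffffffff);

    *lo = (mid << 32) | (p0 & 0xffffffff);
    *hi = p3 + (p1 >> 32) + (p2 >> 32) + (mid >> 32);
}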

uweigand (Author) commented May 4, 2022

Ping? Any update on this?

emlowe commented Sep 1, 2023

I know this PR is over a year old now, but I was recently porting Chia to a VisionFive2 riscv64 SBC (StarFive JH7110: quad-core RISC-V CPU with 2 MB of L2 cache and a monitor core, supporting the RV64GC ISA, running at up to 1.5 GHz), and I decided to try this patch to see whether there were any speed improvements.

Unfortunately, our wrapper library and test code crash and dump core when using this patch. (https://github.com/Chia-Network/bls-signatures)

uweigand (Author) commented Sep 8, 2023

> Unfortunately, our wrapper library and test code crash and dump core when using this patch. (https://github.com/Chia-Network/bls-signatures)

I was able to reproduce this on s390x. The problem is this __builtin_assume in src/no_asm.h:

static limb_t quot_rem_n(limb_t *div_rem, const limb_t *divisor,
                                          limb_t quotient, size_t n)
{
    __builtin_assume(n != 0 && n%2 == 0);

which was added by commit 7cc9267 after I originally created and tested this PR.

This assumes that quot_rem_n is always called with an n that is a multiple of two. However, there are only two call sites:

inline limb_t quot_rem_128(limb_t *div_rem, const limb_t *divisor,
                                            limb_t quotient)
{   return quot_rem_n(div_rem, divisor, quotient, NLIMBS(128));   }

inline limb_t quot_rem_64(limb_t *div_rem, const limb_t *divisor,
                                           limb_t quotient)
{   return quot_rem_n(div_rem, divisor, quotient, NLIMBS(64));   }

so n is always equal to NLIMBS(128) or NLIMBS(64). Without this patch, the limb size in this file is always 32 bits, so n is either 4 or 2. But with this patch, the limb size becomes 64 bits, so n is 2 or 1, breaking the assumption.

If I simply remove the && n%2 == 0, the test suite (./build/src/runtest) passes again for me, and the benchmark suite (./build/src/runbench) shows significant improvements, up to a factor of 3 or so.

I think removing this part of the assumption (or even the whole assumption) should be fine on other platforms too, as this function will likely be inlined into its two call sites where n becomes constant anyway. But I haven't verified this.
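
For clarity, the relaxed line I tested looks like this (a sketch matching the change described above, keeping only the non-zero guarantee):

static limb_t quot_rem_n(limb_t *div_rem, const limb_t *divisor,
                                          limb_t quotient, size_t n)
{
    /* n is NLIMBS(128) or NLIMBS(64): 4 or 2 with 32-bit limbs,
     * but 2 or 1 with 64-bit limbs, so only n != 0 can be assumed */
    __builtin_assume(n != 0);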

emlowe commented Dec 13, 2023

Took me some time to get back to this.

but YES: with your suggestion applied, all tests pass on the VisionFive2 SBC, and it is considerably faster, roughly 2.5 times faster for signing and verification. Pretty much everything is about 2.5 times faster.

I would LOVE for this to get merged in; until we get someone to write some actual RISC-V assembler, this is likely as good as it can get.

uweigand (Author)

> but YES: with your suggestion applied, all tests pass on the VisionFive2 SBC, and it is considerably faster, roughly 2.5 times faster for signing and verification. Pretty much everything is about 2.5 times faster.

Thanks for verifying! I've now updated this to fix the incorrect assertion.

hoffmang9

Any chance this can get in with the next major release?
