P2AffinesAdd crashes when executing in alpine docker image #142

Open
maxkaneco opened this issue Dec 2, 2022 · 12 comments

@maxkaneco

maxkaneco commented Dec 2, 2022

I am getting a crash on Cgo execution of P2AffinesAdd. I wonder if you have any advice?
Signing and verification are fine, so I guess it only affects certain APIs.

Docker base: golang:1.18-alpine

Crash stacktrace:

fatal: morestack on g0
SIGTRAP: trace trap
PC=0x48c2e2 m=8 sigcode=128
signal arrived during cgo execution

goroutine 49 [syscall]:
runtime.cgocall(0xf6cda0, 0xc000de3480)
        /usr/local/go/src/runtime/cgocall.go:157 +0x5c fp=0xc000de3458 sp=0xc000de3420 pc=0x42481c
github.com/photon-storage/blst/bindings/go._Cfunc_blst_p2s_add(0xc000196a20, 0xc003690000, 0x400)
        _cgo_gotypes.go:1115 +0x45 fp=0xc000de3480 sp=0xc000de3458 pc=0x9bafa5
github.com/photon-storage/blst/bindings/go.P2AffinesAdd.func2(0xc000300400?, 0xc000de3508, 0x9baeec?, 0x1543d90)
        /go/pkg/mod/github.com/photon-storage/blst@v1.0.1/bindings/go/blst.go:2599 +0xf3 fp=0xc000de34c8 sp=0xc000de3480 pc=0x9c6833
github.com/photon-storage/blst/bindings/go.P2AffinesAdd({0xc003690000, 0x400, 0x400}, {0x0?, 0x4384d3?, 0x4000?})
        /go/pkg/mod/github.com/photon-storage/blst@v1.0.1/bindings/go/blst.go:2599 +0x65 fp=0xc000de3508 sp=0xc000de34c8 pc=0x9c66e5
github.com/photon-storage/go-photon/crypto/por.(*Verifier).VerifyBlocks(0xc0012e2770, {0xc0034c0000, 0x400, 0xf5394a60a035baa9?},
@dot-asm
Collaborator

dot-asm commented Dec 2, 2022

Is it 100% reproducible, or does it fail only occasionally? Since you mention it explicitly: is it actually alpine-specific, or do you simply happen to use alpine? Likewise, were previous Go versions working, or is 1.18 simply the first Go version you have compiled blst with?

It's not impossible to imagine that the problem is due to an uncertainty introduced by _cgoCheckPointer := func(...interface{}) {} in the function in question. The rationale behind it was that since a) we know the pointers point to pure data and b) we know the C side doesn't modify the pointers, the pointer check is safe to suppress. The check was causing problems, but they were deemed to be false positives. That assessment was made at a specific point in time, though, and might not be entirely correct anymore. Hence the questions.
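For context, a minimal standalone sketch of the rule that check enforces (the C type elem and function consume below are made up and merely stand in for something like blst_p2_affine and blst_p2s_add; with the default cgocheck setting, running this is expected to panic at the call):

package main

/*
#include <stddef.h>
// Made-up C type and function standing in for blst_p2_affine and
// blst_p2s_add: the C side receives an array of element pointers.
typedef struct { long x; } elem;
static void consume(elem **in, size_t n) { (void)in; (void)n; }
*/
import "C"

func main() {
    // A Go slice of Go pointers, analogous to a []*blst.P2Affine.
    points := []*C.elem{{x: 1}, {x: 2}}

    // &points[0] is a Go pointer to memory that itself contains Go pointers.
    // The cgo pointer-passing rules forbid this, so the runtime's pointer
    // check panics here ("cgo argument has Go pointer to Go pointer");
    // that is the check the shadowed _cgoCheckPointer suppresses.
    C.consume(&points[0], C.size_t(len(points)))
}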

Is the caller's code publicly available? I've found go-photon on GitHub, but there is no VerifyBlocks there...

@maxkaneco
Author

It is 100% reproducible and always fails. I have tried the same docker image on both an EC2 host and my MacBook; they both fail. I have also tried a docker image based on "golang:1.18.8-bullseye", and it is fine.

The crashing docker image is based on "golang:1.18-alpine" with "apk --no-cache --update add gcc musl-dev build-base bash git"
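For reference, the failing environment described above amounts to a Dockerfile along these lines (the WORKDIR/COPY/build lines are illustrative placeholders, not taken from the report):

FROM golang:1.18-alpine
RUN apk --no-cache --update add gcc musl-dev build-base bash git
# Placeholder build steps for whatever module exercises P2AffinesAdd.
WORKDIR /src
COPY . .
RUN CGO_ENABLED=1 go build ./...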

@maxkaneco
Author

Our repo is not ready to be opened to the public yet, as it is still very rough and WIP. I've pasted the function below. Let me know if you need any further information.

// VerifyBlocks used to batch verify POR proofs of multiple blocks
func (v *Verifier) VerifyBlocks(sigs, blks [][]byte, indexes []uint32) error {
    blkSize := len(blks)
    if len(sigs) != blkSize || len(indexes) != blkSize {
        return ErrBlockSizeMismatch
    }

    mus := make([]*big.Int, v.spb)
    for j := range mus {
        mus[j] = big.NewInt(0)
    }
    for i := range blks {
        for j := uint32(0); j < v.spb; j++ {
            s := BlockSector(blks[i], j)
            mus[j] = new(big.Int).Add(mus[j], s)
        }
    }

    sigmas := make([]*blst.P2Affine, blkSize)
    for i, sig := range sigs {
        sigmas[i] = new(blst.P2Affine).Uncompress(sig)
    }
    sgm := blst.P2AffinesAdd(sigmas)

    var aggs []*blst.P2Affine
    for _, index := range indexes {
        aggs = append(aggs, BlockTag(v.oid, index).ToAffine())
    }
    for i, u := range v.us {
        aggs = append(aggs, blst.P2Mult(u, ReduceToScalar(mus[i])))
    }

    p1, err := v.pk.GetRaw()
    if err != nil {
        return err
    }

    a := blst.Fp12MillerLoop(sgm.ToAffine(), blst.P1Generator().ToAffine())
    b := blst.Fp12MillerLoop(blst.P2AffinesAdd(aggs).ToAffine(), p1)
    if !blst.Fp12FinalVerify(a, b) {
        return ErrVerificationFailure
    }
    return nil
}

@dot-asm
Collaborator

dot-asm commented Dec 3, 2022

I have also tried a docker image based on "golang:1.18.8-bullseye", and it is fine.

So it sounds like a musl vs glibc thing... Not that it totally explains the failure on alpine, but at least there is some hint...

However, I see no reason why you wouldn't be able to restructure your code so that it doesn't rely on an array of pointers. It would be more efficient, too. So instead of sigmas := make([]*blst.P2Affine, blkSize), do sigmas := make([]blst.P2Affine, blkSize), in which case you would call sgm := blst.P2Affines(sigmas).Add().

As for the second Miller loop call: its input appears to be a sum of points plus the result of a multi-scalar multiplication. You would be better off using the corresponding method for the latter. Note that there are Add and Mult for both P2Affines and P2s; the latter perform efficient conversions to affine so you don't have to do it point by point... As for the scalars: how wide are they? Either way, the best option is to lay them down as a flat array of bytes. And there will also be a more efficient multi-Miller-loop interface...
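A minimal sketch of that restructuring (assuming the upstream bindings at github.com/supranational/blst/bindings/go; the package name, helper name and error handling are illustrative, not part of the original code):

package por

import (
    "fmt"

    blst "github.com/supranational/blst/bindings/go"
)

// sumSignatures decompresses the signatures into one contiguous slice of
// affine points and adds them with a single batch call, instead of building
// a []*blst.P2Affine (a Go array of Go pointers) for P2AffinesAdd.
func sumSignatures(sigs [][]byte) (*blst.P2, error) {
    sigmas := make([]blst.P2Affine, len(sigs))
    for i, sig := range sigs {
        if sigmas[i].Uncompress(sig) == nil {
            return nil, fmt.Errorf("bad signature encoding at index %d", i)
        }
    }
    // blst.P2Affines is []blst.P2Affine, so the conversion is free and the
    // C side sees one flat allocation with no Go pointers inside it.
    return blst.P2Affines(sigmas).Add(), nil
}

In VerifyBlocks this would replace the sigmas loop and the P2AffinesAdd call; the aggs side could likewise use the batch Mult mentioned above, with the scalars laid out as a flat byte array.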

@mratsim
Contributor

mratsim commented Jan 10, 2023

How many points are you adding?

When going over 1024 G2 points or 2048 G1 points I have a crash as well (mratsim/nim-blscurve#1 (comment))

It appears here: https://github.com/supranational/blst/blob/6382d67/src/bulk_addition.c#L148-L149

The C API seems to have a double pointer/array indirection; could it be that the code assumes the points come in batches of 1024/2048?

@dot-asm
Collaborator

dot-asm commented Jan 10, 2023

I'm not convinced that this is the same problem. The [original] problem is believed to have everything to do with how Go pointers are handled at the Cgo interface, or more specifically a Go array of Go pointers, and the suggestion was to restructure the application code to avoid that.

As for the 2nd question: no, the C API doesn't expect specific batch sizes. But it does claim temporary storage on the stack to accommodate the internal batching, [up to] 288KB. In Go it's not a problem, because the C subroutine is executed on its own stack, one with a "normal" limit, where 288KB are readily available. But could you be hitting your stack ceiling, @mratsim?

@dot-asm
Collaborator

dot-asm commented Jan 10, 2023

Just in case, for reference, here is what's up with the internal batching. The algorithm in question requires temporary storage, but where to get it from? blst doesn't do heap allocations, so we take it from the stack. But it's not safe to assume that an arbitrary amount of stack is available, so we iterate in smaller steps. On the other hand, we want to amortize the inversion cost across a larger number of points. So where to draw the line? Somewhere within 1-2-3% performance penalty...
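To make the trade-off concrete, here is a toy, self-contained illustration of chunked batch inversion, the pattern described above: one modular inversion amortized over each fixed-size chunk, with scratch space proportional to the chunk rather than to the whole input. It uses math/big on the heap purely for readability; it is not blst's stack-based field code, and the chunk size, modulus and values are arbitrary:

package main

import (
    "fmt"
    "math/big"
)

// chunk stands in for blst's internal batch size: large enough to amortize
// the single inversion, small enough to bound the scratch space.
const chunk = 4

// batchInvert replaces each xs[i] with its inverse modulo p, processing the
// slice chunk by chunk with Montgomery's trick (one inversion plus roughly
// three multiplications per element).
func batchInvert(xs []*big.Int, p *big.Int) {
    for off := 0; off < len(xs); off += chunk {
        end := off + chunk
        if end > len(xs) {
            end = len(xs)
        }
        part := xs[off:end]

        // Prefix products: acc[i] = x[0]*x[1]*...*x[i] mod p. This is the
        // temporary storage whose size the chunking keeps bounded.
        acc := make([]*big.Int, len(part))
        run := big.NewInt(1)
        for i, x := range part {
            run = new(big.Int).Mod(new(big.Int).Mul(run, x), p)
            acc[i] = run
        }

        // One inversion for the whole chunk.
        inv := new(big.Int).ModInverse(run, p)

        // Walk backwards, peeling off one inverse at a time.
        for i := len(part) - 1; i > 0; i-- {
            xi := new(big.Int).Mod(new(big.Int).Mul(inv, acc[i-1]), p)
            inv = new(big.Int).Mod(new(big.Int).Mul(inv, part[i]), p)
            part[i] = xi
        }
        part[0] = inv
    }
}

func main() {
    p := big.NewInt(7919) // a small prime, purely for illustration
    xs := []*big.Int{
        big.NewInt(2), big.NewInt(3), big.NewInt(5),
        big.NewInt(11), big.NewInt(13),
    }
    batchInvert(xs, p)
    fmt.Println(xs) // each entry is now the modular inverse of the original
}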

@dot-asm
Collaborator

dot-asm commented Jan 10, 2023

Somewhere within 1-2-3% performance penalty...

Though I think that evaluation was done with a slower inversion subroutine... In other words, it might turn out that reducing the internal batch size is acceptable...

@mratsim
Contributor

mratsim commented Jan 10, 2023

Long reply, and a bit off-topic regarding the OP's problem then.

As for the 2nd question: no, the C API doesn't expect specific batch sizes. But it does claim temporary storage on the stack to accommodate the internal batching, 288KB. In Go it's not a problem, because the C subroutine is executed on its own stack, one with a "normal" limit, where 288KB are readily available. But could you be hitting your stack ceiling, @mratsim?

It's unlikely; I'm not using MSVC, which limits the stack to 1MB, and my ulimit -a shows 8MB
[screenshot of ulimit -a output]

Just in case, for reference, here is what's up with the internal batching. The algorithm in question requires temporary storage, but where to get it from? blst doesn't do heap allocations, so we take it from the stack. But it's not safe to assume that an arbitrary amount of stack is available, so we iterate in smaller steps. On the other hand, we want to amortize the inversion cost across a larger number of points. So where to draw the line? Somewhere within 1-2-3% performance penalty...

I don't think anyone has an issue with the internal batching; in particular, quoting myself from my own implementation: https://github.com/mratsim/constantine/blob/53a5729/constantine/math/elliptic/ec_shortweierstrass_batch_ops.nim#L269-L286

We chunk the addition to limit memory usage
especially as we allocate on the stack.

From experience in high-performance computing,
here are the constraints we want to optimize for

  1. MSVC limits stack to 1MB by default, we want to use a fraction of that.
  2. We want to use a large fraction of L2 cache, but not more.
  3. We want to use a large fraction of the memory addressable by the TLB.
  4. We optimize for hyperthreading with 2 sibling threads (Xeon Phi hyperthreads have 4 siblings).
    This means we want to use less than half the L2 cache, so that if run on sibling threads (same physical core),
    the chunks don't evict each other.

Hardware:

  • a Raspberry Pi 4 (2019, Cortex A72) has 1MB L2 cache size
  • Intel Ice Lake (2019, Core 11XXX) and AMD Zen 2 (2019, Ryzen 3XXX) have 512kB L2 cache size

After one chunk is processed we are well within all 64-bit CPU L2 cache bounds
as we halve after each chunk.

The part that is confusing to me is the function signature / double pointer indirection, why

void blst_p2s_add(blst_p2 *ret, const blst_p2_affine *const points[], size_t npoints);

over

void blst_p2s_add(blst_p2 *ret, const blst_p2_affine const points[], size_t npoints);

@dot-asm
Collaborator

dot-asm commented Jan 10, 2023

The part that is confusing to me is the function signature / double pointer indirection, why

For flexibility. Note that it traverses the array of pointers until it hits NULL, after which it treats the last pointer as an array of points. So if you have an array of points, you can simply do blst_p2_affine *temp[2] = { array, NULL } and pass temp. But if the subroutine accepted an array of points and you had an array of pointers, then you'd have to allocate an array and copy the points...

@dot-asm
Collaborator

dot-asm commented Jan 10, 2023

I don't think anyone has an issue with the internal batching

The remark was more for the OP :-) As already said "just in case" :-)

@dot-asm
Collaborator

dot-asm commented Jan 10, 2023

my ulimit -a shows 8MB

Yes, but the main thread can start threads with smaller stacks... I'm not saying that it does, but it's the only thing I can think of given the circumstances. Can you collect the failing address and compare it to the current stack pointer? Just in case, I can pass more points to the subroutines in question...
