LibCEED failures on CI node #857

Open
pvelesko opened this issue May 22, 2024 · 1 comment

@pvelesko
Collaborator

pvelesko commented May 22, 2024

Some tests fail their correctness checks when using /gpu/hip/ref. The same tests pass when using /gpu/hip/shared or /gpu/hip/gen. On cupcake, none of these tests fail at all.

The following tests fail:

Test Summary Report
-------------------
t352-basis          (Wstat: 0 Tests: 1 Failed: 1)
  Failed test:  1
t506-operator       (Wstat: 0 Tests: 1 Failed: 1)
  Failed test:  1
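
For anyone tracking these failures across CI runs, the summary above can be pulled apart mechanically. The following is a minimal sketch, assuming the report keeps the `name (Wstat: N Tests: N Failed: N)` layout shown here; the names `parse_summary` and `SUMMARY_LINE` are made up for illustration and are not part of libCEED or the CI scripts.

```python
import re

# Matches one entry header of the summary, e.g.
# "t352-basis          (Wstat: 0 Tests: 1 Failed: 1)"
# The layout is assumed from the report quoted in this issue.
SUMMARY_LINE = re.compile(
    r"^(?P<name>\S+)\s+\(Wstat:\s*(?P<wstat>\d+)\s+"
    r"Tests:\s*(?P<tests>\d+)\s+Failed:\s*(?P<failed>\d+)\)$"
)


def parse_summary(report: str) -> dict[str, dict[str, int]]:
    """Return {test name: {"wstat", "tests", "failed"}} for each entry found."""
    results = {}
    for line in report.splitlines():
        match = SUMMARY_LINE.match(line.strip())
        if match:
            results[match["name"]] = {
                "wstat": int(match["wstat"]),
                "tests": int(match["tests"]),
                "failed": int(match["failed"]),
            }
    return results


# The report text copied from this issue.
REPORT = """\
Test Summary Report
-------------------
t352-basis          (Wstat: 0 Tests: 1 Failed: 1)
  Failed test:  1
t506-operator       (Wstat: 0 Tests: 1 Failed: 1)
  Failed test:  1
"""

failing = sorted(
    name for name, entry in parse_summary(REPORT).items() if entry["failed"] > 0
)
print(failing)  # ['t352-basis', 't506-operator']
```

Lines that do not match the pattern (the title, the dashes, the indented "Failed test" lines) are simply skipped, so the sketch tolerates extra harness output around the summary.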

√ Updated the runtime to match cupcake
√ Fails for both OpenCL and Level Zero
√ Passes on PoCL

@pvelesko pvelesko added this to the Release 1.2 milestone May 29, 2024
@pvelesko
Collaborator Author

These failures seem to be related to the runtime. Furthermore, these tests do not fail on ALCF systems, which use engineering SDK drops.

  • If the problem size is reduced, I no longer see these tests fail.
  • Using shared memory also makes the error go away.
  • PoCL also passes.
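
To make the backend comparison in this thread repeatable, something like the sketch below could run the two failing tests against each of the three resources named above. It rests on several assumptions: that the test executables live in a `build` directory, that each takes the resource string as its only argument, and that a non-zero exit code or any printed output signals a problem. The last assumption is only a rough proxy, since tests with expected output would need a comparison against their reference output. The helper names are hypothetical.

```python
import subprocess
from pathlib import Path

# Tests and resources taken from this issue.
TESTS = ["t352-basis", "t506-operator"]
RESOURCES = ["/gpu/hip/ref", "/gpu/hip/shared", "/gpu/hip/gen"]


def build_commands(build_dir, tests=TESTS, resources=RESOURCES):
    """One command per (test, resource) pair: [executable, resource]."""
    return [[str(Path(build_dir) / t), r] for t in tests for r in resources]


def run_one(cmd):
    """Classify a single run as 'missing', 'clean', or 'suspect'."""
    if not Path(cmd[0]).is_file():
        return "missing"
    try:
        proc = subprocess.run(cmd, capture_output=True, text=True)
    except OSError:
        # Present but not runnable (e.g. not executable).
        return "missing"
    # Assumption: a clean run exits 0 and prints nothing.
    if proc.returncode == 0 and not (proc.stdout.strip() or proc.stderr.strip()):
        return "clean"
    return "suspect"


def run_matrix(build_dir):
    """Map (test name, resource) to the classification from run_one."""
    return {
        (Path(cmd[0]).name, cmd[1]): run_one(cmd)
        for cmd in build_commands(build_dir)
    }


if __name__ == "__main__":
    # "build" is an assumed location; adjust to the actual build tree.
    for (test, resource), status in run_matrix("build").items():
        print(f"{test:15s} {resource:18s} {status}")
```

Running the same script once per runtime (OpenCL, Level Zero, PoCL) would give a small table of which combinations misbehave, which may help narrow down a reproducer.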

It's very difficult to extract a reproducer from these failures, so I think we can revisit this during 1.3, as they don't seem to be caused by our implementation.

@pvelesko pvelesko modified the milestones: Release 1.2, Release 1.3 Aug 12, 2024