Fix swarm memory management unit test failure #1214

Open
wants to merge 1 commit into
base: develop

brryan
Collaborator

@brryan brryan commented Nov 25, 2024

PR Summary

Multiple branches are currently failing in CI with the following error:

38/51 Test #38: Swarm memory management ....................................................***Failed    0.01 sec
Filters: Swarm memory management
*** The MPI_Comm_dup() function was called before MPI_INIT was invoked.
*** This is disallowed by the MPI standard.
*** Your MPI job will now abort.
[9b3082cbec2a:02162] Local abort before MPI_INIT completed completed successfully, but am not able to aggregate error messages, and not able to guarantee that all other processes were killed!

This should be fixed.
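
For context, the abort message above means that some code path reached `MPI_Comm_dup()` before `MPI_Init()` had run in that test process. Below is a minimal sketch of that ordering rule. It deliberately uses hypothetical stand-ins (`FakeMpiRuntime`, `CommOwner`, `CanConstructCommOwner`) instead of real MPI calls so that it runs without an MPI installation. None of these names come from the code base, and the sketch does not claim to show where the actual failure originates.

```cpp
#include <cassert>
#include <stdexcept>

// Hypothetical stand-in for the MPI runtime (an assumption, not real MPI).
// The real calls it loosely mirrors are MPI_Init, MPI_Initialized,
// MPI_Comm_dup, and MPI_Finalize.
class FakeMpiRuntime {
 public:
  static void Init() { State() = true; }
  static void Finalize() { State() = false; }
  static bool Initialized() { return State(); }

  // Duplicating a communicator is only valid after initialization.
  // Real MPI aborts the job; this stand-in throws so the caller can react.
  static int CommDup(int comm) {
    if (!State()) {
      throw std::runtime_error("CommDup called before Init");
    }
    return comm + 1;  // arbitrary new handle, for illustration only
  }

 private:
  // Function-local static avoids any static-initialization-order issues.
  static bool &State() {
    static bool initialized = false;
    return initialized;
  }
};

// Hypothetical object that duplicates a communicator in its constructor,
// so it can only be created once the runtime has been initialized.
struct CommOwner {
  int comm;
  CommOwner() : comm(FakeMpiRuntime::CommDup(0)) {}
};

// Returns true if a CommOwner can be constructed in the current state.
bool CanConstructCommOwner() {
  try {
    CommOwner owner;
    (void)owner;  // silence unused-variable warnings
    return true;
  } catch (const std::runtime_error &) {
    return false;
  }
}
```

In this sketch, `CanConstructCommOwner()` returns false until `FakeMpiRuntime::Init()` has been called. With real MPI, the equivalent guard in a unit test would be to make sure `MPI_Init` has run (or to check `MPI_Initialized`) before constructing any object that duplicates a communicator.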

PR Checklist

  • Code passes cpplint
  • New features are documented.
  • Adds a test for any bugs fixed. Adds tests for new features.
  • Code is formatted
  • Changes are summarized in CHANGELOG.md
  • Change is breaking (API, behavior, ...)
    • Change is additionally added to CHANGELOG.md in the breaking section
    • PR is marked as breaking
    • Short summary API changes at the top of the PR (plus optionally with an automated update/fix script)
  • CI has been triggered on Darwin for performance regression tests.
  • Docs build
  • (@lanl.gov employees) Update copyright on changed files

@pgrete
Collaborator

pgrete commented Nov 26, 2024

The CI machine is currently heavily loaded (I already contacted the user in question), so I'm wondering whether this is an actual bug or something weird popping up on a machine under heavy load.

Never mind. This also happened on the macOS runner, which is not shared...

@BenWibking
Collaborator

BenWibking commented Dec 2, 2024

So, err... does adding printf cause the error to go away? I tried re-triggering the CI actions, and I still don't see a problem...

Update: I triggered the macOS CI again...let's see what happens.

@brryan
Collaborator Author

brryan commented Dec 2, 2024

> So, err... does adding printf cause the error to go away? I tried re-triggering the CI actions, and I still don't see a problem...
>
> Update: I triggered the macOS CI again...let's see what happens.

So it just passed the macOS CI... so maybe this is an intermittent failure? Because it happens to this test on multiple CI workflows, even if only occasionally, I'll assume this is not fixed and I'll keep looking at this. Maybe it's some issue with how we're initializing the hierarchy of objects needed to create swarms...
