Register and use non-default MPI communicators in Atlas objects #159
Conversation
@fmahebert @ytremolet this branch is now up for beta-testing.
Codecov Report
Patch coverage: additional details and impacted files:

@@            Coverage Diff             @@
##           develop     #159      +/-   ##
===========================================
+ Coverage    78.93%   79.45%   +0.51%
===========================================
  Files          838      840       +2
  Lines        61316    61979     +663
===========================================
+ Hits         48399    49244     +845
+ Misses       12917    12735     -182
I've been testing this branch for JEDI applications, and it's working well. The interface is unobtrusive and the functionality is as expected. To be more specific, I'm able to construct multiple NodeColumns FunctionSpaces on a split MPI communicator (one NodeColumns per subcommunicator), and our tests are working as expected. I've tested this for meshes imported into atlas via the MeshBuilder or generated via the DelaunayMeshGenerator. On develop, these tests had been failing with segfaults or MPI hangs. Thanks @wdeconinck!
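For reference, a minimal sketch of the kind of setup described above, with illustrative names (the communicator name "jedi_split" and the 50/50 colouring are not from this PR), assuming eckit's Comm::split registers the new communicator under the given name:

```cpp
#include "eckit/mpi/Comm.h"

// Sketch only: split the default communicator and register the result in
// eckit's communicator map, so that Atlas objects can later be constructed
// on the subcommunicator by referring to its name.
void register_split_communicator() {
    auto& world     = eckit::mpi::comm();                      // current default ("world")
    const int color = world.rank() < world.size() / 2 ? 0 : 1; // two halves, illustrative
    world.split(color, "jedi_split");                          // registers "jedi_split" by name
}
```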
Hi @wdeconinck — on the JEDI side we are seeing two side-effects of this PR (for the small number of us who work with the atlas develop branch rather than a tagged release).
At the moment we mostly need to investigate more. Perhaps the runtime errors are on our side, or perhaps we've found a defect in this PR? Also, would you consider adding a version bump associated with this PR so that we can more gracefully transition over? This post is mostly an FYI; I'm not asking for any changes right now. First we need to learn more...
Hi @fmahebert, thanks for the feedback. This will be released as part of 0.35.0. I will bump the version and add a deprecated transition for mesh.nb_partitions.
This is now done. Version 0.35.0 will be tagged tomorrow or the day after at the latest.
Hi @wdeconinck — thanks for making the tag and the deprecated transition. This will ease things on our side. All the other bugs I've found have been cases of atlas becoming more robust and crashing when we wrote bad code. At the same time, the increased consistency across different FunctionSpaces will be really helpful in designing code. Very nice! (As an example, we had a case of creating a PointCloud with a Grid on rank 0 and an empty list of PointXY on ranks 1+... this worked (apparently?) on atlas 0.34, but with 0.35 atlas crashes. The fix, clearly, is to use a SerialPartitioner to assign the grid points to rank 0. A good mistake to identify and address.)
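A minimal sketch of that fix, under the assumption that PointCloud can be constructed directly from a Grid and a Partitioner (the helper name and the exact constructor signature are assumptions, not confirmed by this PR):

```cpp
#include "atlas/functionspace/PointCloud.h"
#include "atlas/grid.h"
#include "atlas/grid/Partitioner.h"

// Sketch only: let the "serial" partitioner place every grid point on rank 0,
// instead of hand-building per-rank point lists (which left ranks 1+ empty).
atlas::functionspace::PointCloud make_rank0_pointcloud(const atlas::Grid& grid) {
    atlas::grid::Partitioner serial("serial");               // all points assigned to rank 0
    return atlas::functionspace::PointCloud(grid, serial);   // assumed constructor, see lead-in
}
```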
The problem we want to solve:
Atlas components currently rely on the default communicator registered in eckit and accessed as eckit::mpi::comm(). A Mesh or FunctionSpace that is generated then usually relies on the default communicator present at its construction time.
If the default communicator is later changed outside of the Atlas context and, e.g., a halo exchange is performed on fields defined on these meshes or function spaces, the results are unpredictable and most likely wrong, or the program simply crashes or hangs.
Alternatively, there may be a use case for atlas in different model components, each with a different MPI communicator. This naturally does not work. A current workaround is to context-switch the eckit default communicator to the correct one before each invocation, which is tedious, error-prone boiler-plate code, as sketched below.
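A sketch of the boiler-plate being criticised here (the communicator names and the free function are illustrative, not part of Atlas):

```cpp
#include "eckit/mpi/Comm.h"
#include "atlas/field.h"
#include "atlas/functionspace.h"

// Every Atlas operation that communicates has to be bracketed by switching
// eckit's default communicator back and forth.
void halo_exchange_on_model_comm(const atlas::FunctionSpace& fs, const atlas::Field& field) {
    eckit::mpi::setCommDefault("model");  // switch to the communicator the mesh was built with
    fs.haloExchange(field);               // otherwise results are wrong or the program hangs
    eckit::mpi::setCommDefault("world");  // restore the previous default afterwards
}
```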
The solution
The MPI communicators are stored internally in eckit in a map<name,Comm>, and communicators can be inserted into and deleted from this map. This mechanism allows model- or mesh/functionspace-specific communicators to be defined and referred to by name. Without changing Atlas APIs we can then pass these communicators as the configuration option "mpi_comm", referring to the name registered in eckit. The Mesh or FunctionSpace must then, upon construction, inspect this configuration option and keep track of the chosen communicator. If the option is not set, the present default communicator (e.g. "world") is used and kept track of. A sketch of the intended usage follows.
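An illustrative sketch of this intended usage (the grid name, communicator name, and colouring are assumptions; whether each constructor reads the "mpi_comm" option is exactly what this PR introduces):

```cpp
#include "eckit/mpi/Comm.h"
#include "atlas/functionspace/NodeColumns.h"
#include "atlas/grid.h"
#include "atlas/mesh.h"
#include "atlas/meshgenerator.h"
#include "atlas/util/Config.h"

// Register a subcommunicator in eckit under a name, then pass that name to
// Atlas objects through the "mpi_comm" configuration option.
atlas::functionspace::NodeColumns build_on_split_comm() {
    auto& world     = eckit::mpi::comm();
    const int color = int(world.rank()) % 2;
    world.split(color, "split");                      // registers "split" in eckit's map

    atlas::util::Config config("mpi_comm", "split");  // refer to the registered communicator by name
    atlas::Grid grid("O32");                          // illustrative grid choice
    atlas::MeshGenerator generator("structured", config);
    atlas::Mesh mesh = generator.generate(grid);
    return atlas::functionspace::NodeColumns(mesh, config);
}
```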
Plan
- Add an atlas::mpi::comm(std::string_view) method to access the Comm (see the sketch below)
- Objects that need to be adapted
- Cannot be done
- Follow up PR, possibly
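A minimal sketch of the planned accessor (it assumes a communicator was previously registered under the name "split", as in the examples above):

```cpp
#include <iostream>
#include "atlas/parallel/mpi/mpi.h"

// Look up a communicator by its registered name instead of relying on the
// current default communicator.
void report_split_size() {
    const auto& comm = atlas::mpi::comm("split");  // planned std::string_view overload
    if (comm.rank() == 0) {
        std::cout << "split communicator size: " << comm.size() << std::endl;
    }
}
```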