
Register and use non-default MPI communicators in Atlas objects #159

Merged
merged 43 commits into develop from feature/mpi_comm on Sep 21, 2023

Conversation

@wdeconinck (Member) commented Aug 30, 2023

The problem we want to solve:

Atlas components currently rely on the default communicator registered in eckit and accessed as eckit::mpi::comm().
A Mesh or FunctionSpace that is generated usually relies on whichever default communicator is in place at its construction time.
When the default communicator is later changed outside of the Atlas context and, e.g., a halo exchange is performed on fields defined on these meshes or function spaces, the results are unpredictable and most likely wrong, or the program will simply crash or hang.
There is also a use case where Atlas is used in different model components, each with its own MPI communicator. This naturally does not work out of the box. The current workaround is to context-switch the eckit default communicator to the correct one before each invocation, which is tedious, error-prone, and ugly boilerplate code.

The solution

The MPI communicators are stored internally in eckit in a map<name,Comm>. Communicators can be inserted into and deleted from this map. This mechanism makes it possible to define model- or mesh/functionspace-specific communicators that can be referred to by name.
Without changing the Atlas APIs we can then pass these communicators via the configuration option "mpi_comm", referring to the name registered in eckit. Upon construction, the Mesh or FunctionSpace must inspect this configuration option and keep track of the chosen communicator. If the option is not set, the present default communicator is used and tracked (e.g. "world").
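
A minimal sketch of the intended usage, assuming an illustrative communicator name "model" and grid "O32"; the "mpi_comm" option and eckit's named-communicator map are as described above, while the exact constructor overloads may differ:

```cpp
#include "atlas/functionspace/StructuredColumns.h"
#include "atlas/grid.h"
#include "atlas/util/Config.h"

#include "eckit/mpi/Comm.h"

void example() {
    // Split the default communicator and register the result in eckit's
    // map<name,Comm> under the illustrative name "model".
    const int color = static_cast<int>(eckit::mpi::comm().rank() % 2);
    eckit::mpi::comm().split(color, "model");

    // Refer to the registered communicator by name via the "mpi_comm"
    // configuration option; no change to the existing Atlas API is needed.
    atlas::util::Config config;
    config.set("mpi_comm", "model");

    atlas::Grid grid("O32");
    atlas::functionspace::StructuredColumns fs(grid, config);
}
```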

Plan

  • Add atlas::mpi::comm(std::string_view) method to access a Comm by its registered name
  • Add a mechanism to push/pop the default Comm, to cleanly change the default communicator within a given scope. This is useful for incrementally porting routines to use a custom communicator (see the sketch after this list).
  • Register mpi_comm via configuration in several data-structure-related components: MeshGenerator, Mesh, FunctionSpace, MatchingPartitioner
  • Register mpi_comm via configuration in parallel communication patterns: HaloExchange, GatherScatter, Checksum, Redistribution
  • Add tests to verify the functionality
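
A hedged sketch of the first two plan items: only atlas::mpi::comm(std::string_view) is named in this PR, while the push/pop spellings below are placeholders for the scoped default-communicator mechanism, not necessarily the final API:

```cpp
#include "atlas/parallel/mpi/mpi.h"

void sketch() {
    // Access a communicator previously registered in eckit under the
    // illustrative name "model".
    const auto& comm = atlas::mpi::comm("model");
    (void)comm;

    // Temporarily make "model" the default communicator while porting a
    // routine that still calls atlas::mpi::comm() without arguments.
    atlas::mpi::push("model");   // placeholder spelling
    // ... legacy code using the default communicator ...
    atlas::mpi::pop();           // placeholder spelling; restores the previous default
}
```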

Objects that need to be adapted

  • Mesh
  • MeshBuilder
  • StructuredMeshGenerator
  • DelaunayMeshGenerator
  • CubedSphereMeshGenerator
  • CubedSphereDualMeshGenerator
  • HealpixMeshGenerator
  • RegularMeshGenerator
  • PartitionGraph
  • PartitionPolygons
  • Partitioner
  • MatchingPartitioner
  • parallel::HaloExchange
  • parallel::GatherScatter
  • parallel::Checksum
  • functionspace::NodeColumns
  • functionspace::CellColumns
  • functionspace::EdgeColumns
  • functionspace::StructuredColumns
  • functionspace::BlockStructuredColumns
  • functionspace::CubedSphere...
  • functionspace::PointCloud
  • Redistribution
  • output::Gmsh
  • mesh::action::BuildParallelFields
  • mesh::action::BuildPeriodicBoundary
  • mesh::action::BuildEdges
  • mesh::action::BuildHalo

Cannot be done:

  • TransPartitioner (underlying ectrans only works with a single communicator)
  • functionspace::Spectral (underlying ectrans only works with a single communicator)

Follow-up PR, possibly:

  • ... other mesh actions ...
  • Verify Interpolation (likely already working)

@wdeconinck wdeconinck self-assigned this Aug 30, 2023
@wdeconinck wdeconinck added the Status: In Progress This is currently being worked on label Aug 30, 2023
@wdeconinck wdeconinck marked this pull request as draft August 30, 2023 15:37
@wdeconinck (Member Author) commented:

@fmahebert @ytremolet this branch is now up for beta-testing.
In its current state it is more or less finished for meshes and the function spaces based on them.
There is still some work to do (see the list in the description). It would be good to know if anything else is needed more urgently.

@github-actions bot commented Sep 6, 2023

Private downstream CI failed.
Workflow name: private downstream ci
View the logs at https://github.com/ecmwf-actions/private-downstream-ci/actions/runs/6096409156.

@codecov-commenter commented Sep 6, 2023

Codecov Report

Patch coverage: 91.38% and project coverage change: +0.51% 🎉

Comparison is base (1633483) 78.93% compared to head (8d5b072) 79.45%.

Additional details and impacted files
@@             Coverage Diff             @@
##           develop     #159      +/-   ##
===========================================
+ Coverage    78.93%   79.45%   +0.51%     
===========================================
  Files          838      840       +2     
  Lines        61316    61979     +663     
===========================================
+ Hits         48399    49244     +845     
+ Misses       12917    12735     -182     
Files Changed Coverage Δ
src/atlas/functionspace/FunctionSpace.h 100.00% <ø> (ø)
...tlas/functionspace/detail/BlockStructuredColumns.h 64.00% <0.00%> (-2.67%) ⬇️
src/atlas/functionspace/detail/FunctionSpaceImpl.h 70.00% <ø> (-5.00%) ⬇️
src/atlas/grid/Partitioner.h 100.00% <ø> (ø)
...c/atlas/grid/detail/partitioner/BandsPartitioner.h 14.28% <0.00%> (+1.78%) ⬆️
.../grid/detail/partitioner/CheckerboardPartitioner.h 100.00% <ø> (ø)
...s/grid/detail/partitioner/CubedSpherePartitioner.h 100.00% <ø> (ø)
.../grid/detail/partitioner/EqualRegionsPartitioner.h 100.00% <ø> (ø)
...ail/partitioner/MatchingFunctionSpacePartitioner.h 100.00% <ø> (+100.00%) ⬆️
.../grid/detail/partitioner/MatchingMeshPartitioner.h 100.00% <ø> (ø)
... and 80 more

... and 5 files with indirect coverage changes

☔ View full report in Codecov by Sentry.

@github-actions bot commented Sep 6, 2023

Private downstream CI succeeded.
Workflow name: private downstream ci
View the logs at https://github.com/ecmwf-actions/private-downstream-ci/actions/runs/6096640870.

@github-actions bot commented Sep 6, 2023

Private downstream CI succeeded.
Workflow name: private downstream ci
View the logs at https://github.com/ecmwf-actions/private-downstream-ci/actions/runs/6098677281.

@github-actions bot:

Private downstream CI succeeded.
Workflow name: private downstream ci
View the logs at https://github.com/ecmwf-actions/private-downstream-ci/actions/runs/6144750586.

@fmahebert (Contributor) commented:

I've been testing this branch for JEDI applications, and it's working well. The interface is unobtrusive and the functionality is as expected.

To be more specific, I'm able to construct multiple NodeColumns FunctionSpaces on a split MPI communicator (one NodeColumns per subcommunicator), and our tests are working as expected. I've tested this for meshes imported into atlas via the MeshBuilder or generated via the DelaunayMeshGenerator. On develop, these tests had been failing with segfaults or MPI hangs.
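
For anyone wanting to reproduce a setup along these lines, here is a rough, hypothetical sketch; the communicator name "split", the grid "O16", and the exact constructor overloads are assumptions for illustration, not taken from the JEDI tests:

```cpp
#include "atlas/functionspace/NodeColumns.h"
#include "atlas/grid.h"
#include "atlas/mesh.h"
#include "atlas/meshgenerator.h"
#include "atlas/util/Config.h"

#include "eckit/mpi/Comm.h"

atlas::functionspace::NodeColumns make_split_nodecolumns() {
    // Split the world communicator into two halves and register the result
    // in eckit under the illustrative name "split".
    const int color = eckit::mpi::comm().rank() < eckit::mpi::comm().size() / 2 ? 0 : 1;
    eckit::mpi::comm().split(color, "split");

    // Generate a mesh on the sub-communicator via the Delaunay generator.
    atlas::util::Config config;
    config.set("mpi_comm", "split");
    atlas::MeshGenerator generator("delaunay", config);
    atlas::Mesh mesh = generator.generate(atlas::Grid("O16"));

    // One NodeColumns per sub-communicator; the function space inherits the
    // communicator from the mesh it is built on.
    return atlas::functionspace::NodeColumns(mesh);
}
```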

Thanks @wdeconinck !

@github-actions bot:

Private downstream CI succeeded.
Workflow name: private downstream ci
View the logs at https://github.com/ecmwf-actions/private-downstream-ci/actions/runs/6233018689.

@github-actions bot:

Private downstream CI succeeded.
Workflow name: private downstream ci
View the logs at https://github.com/ecmwf-actions/private-downstream-ci/actions/runs/6233059538.

@github-actions bot:

Private downstream CI succeeded.
Workflow name: private downstream ci
View the logs at https://github.com/ecmwf-actions/private-downstream-ci/actions/runs/6249966361.

@github-actions bot:

Private downstream CI succeeded.
Workflow name: private downstream ci
View the logs at https://github.com/ecmwf-actions/private-downstream-ci/actions/runs/6250450591.

@wdeconinck wdeconinck marked this pull request as ready for review September 21, 2023 08:09
@wdeconinck wdeconinck merged commit d02f399 into develop Sep 21, 2023
91 checks passed
@github-actions bot:

Private downstream CI succeeded.
Workflow name: private downstream ci
View the logs at https://github.com/ecmwf-actions/private-downstream-ci/actions/runs/6259079104.

@wdeconinck wdeconinck deleted the feature/mpi_comm branch September 22, 2023 09:13
@fmahebert (Contributor) commented:

Hi @wdeconinck — on the JEDI side we are seeing two side-effects of this PR (for the small number of us who work with develop atlas and not a tagged release):

  • the interface change mesh.nb_partitions -> mesh.nb_parts
  • some runtime errors; we have not yet fully investigated these.

At the moment we mostly need to investigate more. Perhaps the runtime errors are on our side, or perhaps we've found a defect in this PR? Also, would you consider adding a version bump associated with this PR so that we can transition more gracefully from mesh.nb_partitions to mesh.nb_parts?

This post is mostly an FYI; I'm not asking for any changes right now. We first need to learn more...

@wdeconinck (Member Author) commented Sep 25, 2023

Hi @fmahebert, thanks for the feedback. This will be released as part of 0.35.0. I will bump the version and add a deprecated transition for mesh.nb_partitions.
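
For illustration only, such a deprecated transition could look roughly like the following; the exact Atlas signatures may differ, and idx_t here stands in for atlas::idx_t:

```cpp
using idx_t = int;  // stand-in for atlas::idx_t in this sketch

class Mesh {
public:
    idx_t nb_parts() const { return nb_parts_; }  // new name

    [[deprecated("use Mesh::nb_parts() instead")]]
    idx_t nb_partitions() const { return nb_parts(); }  // transitional alias

private:
    idx_t nb_parts_ = 1;
};
```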

@wdeconinck (Member Author) commented:

This is now done. Version 0.35.0 will be tagged tomorrow or the day after at the latest.
If it turns out other things are needed, we can create a 0.35.1 hotfix.

@fmahebert (Contributor) commented Oct 4, 2023

Hi @wdeconinck — thanks for making the tag and the deprecated transition. This will ease things on our side.

All the other bugs I've found have been cases of atlas becoming more robust and crashing when we wrote bad code. At the same time, the increased consistency across different FunctionSpaces will be really helpful in designing code. Very nice!

(As an example, we had a case of creating a PointCloud with a Grid on rank 0 and an empty list of PointXY on ranks 1+... this worked (apparently?) on atlas 0.34, but with 0.35 atlas crashes. The fix, clearly, is to use a SerialPartitioner to assign all the grid points to rank 0. A good mistake to identify and address.)
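
For illustration, a rough sketch of that fix; the PointCloud constructor taking a grid together with a partitioner, the "serial" partitioner name spelling, and the grid "O16" are assumptions here and may differ between atlas versions:

```cpp
#include "atlas/functionspace/PointCloud.h"
#include "atlas/grid.h"
#include "atlas/grid/Partitioner.h"

void build_pointcloud_on_rank0() {
    // "serial" assigns every grid point to rank 0, so ranks 1+ hold no points
    // by construction rather than by passing an empty PointXY list.
    atlas::Grid grid("O16");
    atlas::grid::Partitioner partitioner("serial");
    atlas::functionspace::PointCloud pointcloud(grid, partitioner);
    (void)pointcloud;
}
```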
