BUG: Mesh partitioners are not deterministic across different Ensemble
members
#3866
Comments
Provided that we start from a serial or parallel mesh (
That's good to know, thanks.
What do you mean by this?
That means that all ensemble members start from the same representation of conceptually the same mesh. In your case, all ensemble members start from the same serial (undistributed) mesh, I think, but the mesh you start from does not need to be serial in the proposed approach.
I still don't think I understand. The issue is that the ensemble members don't start from the same mesh.

```python
from firedrake import *

ensemble = Ensemble(COMM_WORLD, M=2)
mesh = UnitSquareMesh(16, 16, comm=ensemble.comm)
```

I think I'd like something like the following:

```python
from firedrake import *

ensemble = Ensemble(COMM_WORLD, M=2)
root = (ensemble.ensemble_comm.rank == 0)
if root:
    mesh = UnitSquareMesh(16, 16, comm=ensemble.comm)
mesh = ensemble.bcast_mesh(mesh if root else None, root=root)
```
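As a library-agnostic illustration of the broadcast semantics being asked for here (`bcast_mesh` is a wished-for method, not an existing Firedrake API), the following pure-Python sketch models the ensemble members as a list and copies the root member's mesh data to every other member, so that all members end up with an identical mesh and hence an identical partition:

```python
import copy

def bcast_mesh_data(members, root=0):
    """Model of an ensemble broadcast: every member replaces its mesh
    data with a deep copy of the root member's data.

    `members` has one entry per ensemble member; each entry stands in
    for that member's mesh (here just a dict of plain lists).
    """
    root_data = members[root]
    return [copy.deepcopy(root_data) for _ in members]

# Two members that accidentally built different partitions of the
# same small mesh.
members = [
    {"coords": [0.0, 0.25, 0.5, 0.75, 1.0], "owned_cells": [0, 1]},
    {"coords": [0.0, 0.25, 0.5, 0.75, 1.0], "owned_cells": [2, 3]},
]

members = bcast_mesh_data(members, root=0)

# After the broadcast every member holds the root's partition.
assert all(m == members[0] for m in members)
print(members[1]["owned_cells"])  # → [0, 1]
```

In a real implementation the copy would be an MPI broadcast over `ensemble.ensemble_comm` rather than an in-process deep copy, but the invariant is the same: after the call, every member's partition matches the root's.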
Following is a rough sketch of the idea:
Ok, I hadn't realised that the meshes were actually made in parallel and then distributed in that way.
Describe the bug

Standard use of `Ensemble` is to create a topologically identical mesh on each `ensemble.comm`. The parallel partition of the mesh on each member should be identical; otherwise the communications over `ensemble.ensemble_comm` will be mismatched, which will lead to errors, incorrect results, or parallel hangs. However, the mesh partitioners do not appear to be guaranteed to be deterministic across different ensemble members in the same run, so different partitions can be created.
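To see why matching partitions matter, note that an ensemble reduction combines, on each rank, the local data of the corresponding rank on every member, so the rank-local cell sets must agree across members. The following pure-Python model (not Firedrake code; the partitions are hand-written lists) shows how a drifting partitioner breaks that invariant:

```python
# Each ensemble member partitions the same 6-cell mesh over 2 ranks.
def local_sizes(partition):
    """Number of cells each rank owns under a given partition."""
    return [len(cells) for cells in partition]

# Members 0 and 1 happen to partition identically: communications
# over ensemble_comm line up.
member0 = [[0, 1, 2], [3, 4, 5]]
member1 = [[0, 1, 2], [3, 4, 5]]
assert local_sizes(member0) == local_sizes(member1)

# A nondeterministic partitioner gives member 1 a different split.
member1_bad = [[0, 1], [2, 3, 4, 5]]

# Now rank 0's buffers disagree in length across members, so a
# reduction or point-to-point call over ensemble_comm mismatches.
assert local_sizes(member0) != local_sizes(member1_bad)
print(local_sizes(member0), local_sizes(member1_bad))  # → [3, 3] [2, 4]
```

Even when the sizes happen to match, a different cell *ordering* across members would silently combine values from different cells, which is why the partitions must be identical, not merely balanced.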
This bug was observed in #3385 and "fixed" in the `Ensemble` tests in #3730 by specifying the `simple` partitioner type, which is suboptimal but deterministic.

Probably the best real fix would be to add a method to `Ensemble` that allows a mesh to be broadcast from one `ensemble.comm` to all others. This would require first broadcasting the DMPlex (possibly quite involved), then broadcasting the coordinates (very straightforward).

Steps to Reproduce
The partitioners will usually produce identical partitions, so reproducing the issue reliably is difficult.
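Since the mismatch is rare, one practical way to catch it when it does occur is to fingerprint each rank's local partition data and compare the fingerprints across ensemble members. A minimal sketch (the variable names are illustrative, not Firedrake API; in a real run the fingerprints would be gathered over `ensemble.ensemble_comm`, e.g. with an allgather, and compared):

```python
import hashlib

def partition_fingerprint(owned_cells):
    """Stable fingerprint of a rank's local partition data."""
    payload = ",".join(map(str, owned_cells)).encode()
    return hashlib.sha256(payload).hexdigest()

# Rank 0's owned cells on three ensemble members (hand-written data).
member0_rank0 = [0, 1, 2, 7, 8]
member1_rank0 = [0, 1, 2, 7, 8]
member2_rank0 = [0, 1, 2, 7, 9]  # a partitioner that drifted

# Matching partitions give matching fingerprints; a drifted one is
# detected immediately instead of causing a downstream hang.
assert partition_fingerprint(member0_rank0) == partition_fingerprint(member1_rank0)
assert partition_fingerprint(member0_rank0) != partition_fingerprint(member2_rank0)
```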
Expected behavior
Some provided mechanism for ensuring the same partition on different ensemble members.
Error message

Parallel hang.
Environment:
Any?
Additional Info