BUG: Mesh partitioners are not deterministic across different Ensemble members #3866

Open
JHopeCollins opened this issue Nov 18, 2024 · 6 comments
@JHopeCollins
Member

JHopeCollins commented Nov 18, 2024

Describe the bug
Standard use of Ensemble is to create a topologically identical mesh on each ensemble.comm. The parallel partition of the mesh on each member should be identical, otherwise the communications over ensemble.ensemble_comm will be mismatched, which will lead to errors/incorrect results/parallel hangs.
However, the mesh partitioners do not appear to be guaranteed to be deterministic across different ensemble members in the same run, so different partitions can be created.

This bug was observed in #3385 and "fixed" in the Ensemble tests in #3730 by specifying the simple partitioner type, which is suboptimal but deterministic.
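For reference, that workaround looks roughly like the following on each ensemble member (a sketch; the distribution_parameters option is the only change from the usual setup):

from firedrake import *

# Workaround sketch: force the deterministic (but suboptimal) "simple"
# partitioner on every ensemble member so the partitions match.
ensemble = Ensemble(COMM_WORLD, M=2)
mesh = UnitSquareMesh(
    16, 16, comm=ensemble.comm,
    distribution_parameters={"partitioner_type": "simple"},
)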

Probably the best real fix would be to add a method to Ensemble to allow broadcasting a mesh from one ensemble.comm to all others. This would require first broadcasting the DMPlex (possibly quite involved), then broadcasting the coordinates (very straightforward).
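The coordinate part could probably reuse the existing Ensemble.bcast for Functions. A minimal sketch, assuming the DMPlex/partition broadcast has already happened so the coordinate function spaces match across members:

# Once the partitions agree, the coordinate field is just a Function and can
# be broadcast over ensemble.ensemble_comm with the existing Ensemble.bcast.
ensemble.bcast(mesh.coordinates, root=0)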

Steps to Reproduce
The partitioners will usually produce identical partitions, so reproducing the issue reliably is difficult.

Expected behavior
Some provided mechanism for ensuring the same partition on different ensemble members.

Error message
Parallel hang.

Environment:
Any?

Additional Info

@JHopeCollins JHopeCollins self-assigned this Nov 18, 2024
@ksagiyam
Contributor

Provided that we start from a serial or parallel mesh (plexA) that all ensemble members agree on, we could probably take the following steps (a rough sketch of step 2 follows the list):

  1. One ensemble member calls DMPlexDistribute() on plexA and gets distribution_sf as an output,
  2. Broadcast distribution_sf across the ensemble members,
  3. The ensemble members that received distribution_sf call DMPlexMigrate() (https://petsc.org/release/manualpages/DMPlex/DMPlexMigrate/) with plexA using distribution_sf.
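Something like the following might work for step 2. This is only a rough sketch: bcast_distribution_sf is a hypothetical helper (not an existing Firedrake API), and it assumes that corresponding spatial ranks line up across ensemble members and that the SF graph round-trips through PETSc.SF.getGraph()/setGraph().

import numpy as np
from firedrake.petsc import PETSc


def bcast_distribution_sf(sf, ensemble, root=0):
    """Copy the distribution SF from ensemble member `root` to every member.

    Relies on Firedrake's Ensemble layout, where ensemble_comm connects the
    processes holding the same spatial rank on each ensemble member.
    """
    ecomm = ensemble.ensemble_comm  # across members, one process per spatial rank
    scomm = ensemble.comm           # spatial comm within one member
    if ecomm.rank == root:
        nroots, local, remote = sf.getGraph()
        if local is None:  # contiguous leaves
            local = np.arange(len(remote), dtype=PETSc.IntType)
        graph = (nroots, np.asarray(local), np.asarray(remote))
    else:
        graph = None
    nroots, local, remote = ecomm.bcast(graph, root=root)
    if ecomm.rank == root:
        return sf
    # Rebuild an equivalent SF on this member's spatial comm; the graph is
    # meaningful here because the spatial ranks line up across members.
    new_sf = PETSc.SF().create(comm=scomm)
    new_sf.setGraph(nroots, local, remote)
    return new_sf

The non-root members would then hand the rebuilt SF to step 3.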

@JHopeCollins
Member Author

That's good to know, thanks.

a serial or parallel mesh (plexA) that all ensemble members agree on

What do you mean by this?

@ksagiyam
Contributor

That means that all ensemble members start from the same representation of conceptually the same mesh. In your case, all ensemble members start from the same serial (undistributed) mesh, I think, but the mesh you start from does not need to be serial in the proposed approach.

@JHopeCollins
Member Author

I still don't think I understand. The issue is that the ensemble members don't start from the same mesh.
A fix for this would need to guarantee that the ensemble members in the following all have the same mesh partition (not guaranteed at the moment):

from firedrake import *
ensemble = Ensemble(COMM_WORLD, M=2)
mesh = UnitSquareMesh(16, 16, comm=ensemble.comm)

I think I'd like something like the following:

from firedrake import *
ensemble = Ensemble(COMM_WORLD, M=2)
root = (ensemble.ensemble_comm.rank == 0)

if root:
    mesh = UnitSquareMesh(16, 16, comm=ensemble.comm)

mesh = ensemble.bcast_mesh(mesh if root else None, root=0)

@ksagiyam
Contributor

mesh = UnitSquareMesh(16, 16, comm=ensemble.comm) is serial at the very beginning, but it gets distributed using plex.distribute() in the mesh constructor. Here, all ensemble members agree on the serial mesh.

Following is a rough sketch of the idea:

...
if root:
    mesh = UnitSquareMesh(...)
    # Grab distribution_sf somehow
else:
    mesh = UnitSquareMesh(..., distribution_parameters=do_not_distribute)

# Broadcast distribution_sf

if not root:
    plex = mesh.topology_dm
    new_plex = PETSc.DMPlex().create(comm=ensemble.comm)  # empty target for the migration
    plex.migrate(distribution_sf, new_plex)  # migrate is actually not wrapped in petsc4py
    mesh = Mesh(new_plex)

@JHopeCollins
Member Author

Ok, I hadn't realised that the meshes were actually made in parallel and then distributed in that way.
I'll have a look at broadcasting the SF and how it could be wrapped up away from the user.
