Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

BUG: Cross mesh interpolation breaks for parallel nested meshes #3151

Closed
pbrubeck opened this issue Oct 11, 2023 · 12 comments · Fixed by #3293
Closed

BUG: Cross mesh interpolation breaks for parallel nested meshes #3151

pbrubeck opened this issue Oct 11, 2023 · 12 comments · Fixed by #3293
Assignees
Labels

Comments

@pbrubeck
Copy link
Contributor

Describe the bug
Cross mesh interpolation fails for certain combinations of number of cells on the two meshes and number of MPI ranks.
We get the following error message:

RuntimeError: Some points have been claimed by a cell that we cannot see, but which we think we own. This should not happen.

Steps to Reproduce
Steps to reproduce the behavior:

  1. Code/test/example that was run '...' (as minimal as possible)
from firedrake import *
from firedrake.petsc import PETSc
s = PETSc.COMM_WORLD.getSize()
nx = 2 * s
mx = 3 * nx
mh = [UnitCubeMesh(nx, nx, nx),
            UnitCubeMesh(mx, mx, mx)]

family = "Lagrange"
degree = 1
Vc = FunctionSpace(mh[0], family, degree=degree)
Vf = FunctionSpace(mh[1], family, degree=degree)

uc = Function(Vc)
uf = interpolate(uc, Vf)
print("success", flush=True)
  1. Command executed '...'
    The code breaks if we run with more than 2 MPI ranks
mpiexec -n 3 python bug.py
mpiexec -n 4 python bug.py

Expected behavior
Describe what you expected to happen, in the case of a regression,
include details of a setup where this behaviour was last seen.

Error message
Add error message with full backtrace, or log.
Please add these as text using three backticks (`) for highlighting.
Please do not add screenshots unless the bug is purely graphical.

Environment:

  • OS: [eg: Linux, MacOS, WSL (Windows Subsystem for Linux)] add this as a label too!
  • Python version: [eg: 3.9.7]
  • Output of firedrake-status
  • Any relevant environment variables or modifications [eg: PYOP2_DEBUG=1]

Additional Info
Add any other context about the problem here.

@pbrubeck pbrubeck added the bug label Oct 11, 2023
@ReubenHill
Copy link
Contributor

I think there's a bug somewhere in cell location - I'm seeing nans returned for the reference cell locations which trigger the error. This will need some debugging in places like locate.c

@ReubenHill
Copy link
Contributor

This may also be related to #3089

@ReubenHill
Copy link
Contributor

This is at least in part due to insufficient robustness in the vertex-only mesh voting algorithm.

2 things need to be done.

Where multiple ranks claim the same $L^1$ distance (in one case here, $0.0$), but disagree about rank, as happens in rare occurrences where a point is exactly on a mesh cell vertex at a parallel boundary which is visible to 3 or more ranks, the cell location algorithm should be triggered again.
The cell that was initially identified should be disregarded, and it should be told to produce the next best cell: when we find a cell with identical $L^1$ distance and rank as given by the voting algorithm, the search stops.
Whilst the parent mesh cell may not be identical, the point will certainly be given the correct parent mesh PyOP2 label.
If no such cell is found, we can safely treat the point as being not-locally visible.
This will ensure halo points are assigned correctly.

In these rare cases, the voting algorithm can also claim that a rank which has identified a point as self-owned will be told by a rank of higher number that a rank of lower number in fact owns the point: if that lower rank does not think it owns it then the above procedure would cause the point to be lost.
At present, when this occurs, an error is triggered.
To fix this, if there is a tie after minimising $L^1$ distance, a point which a rank has identified as self-owned should be chosen before choosing the lowest numbered rank.

@ReubenHill
Copy link
Contributor

@pbrubeck I think #3293 fixes this. I haven't checked the performance implications but I think the algorithm is sound.

@pbrubeck
Copy link
Contributor Author

pbrubeck commented Dec 15, 2023

@pbrubeck I think #3293 fixes this. I haven't checked the performance implications but I think the algorithm is sound.

Great. We could simply compare the runtime of the relevant VOM tests?

@connorjward
Copy link
Contributor

connorjward commented Oct 11, 2024

I have been thinking about this issue and I think that there is a simpler solution.

The problem

Consider the following example:

global mesh:
    0    |    1
o------A |-------o
|      | |       |
o------o |-------o
         |

rank 0 sees:
         |
o------A | - - - o
|      | |       |
o------o | - - - o
   owned | ghost

rank 1 sees:
         |
o- - - A |-------o
|      | |       |
o - - -o |-------o
  ghost  |  owned

Vertex A is equally distant from both cells and so both ranks will allocate A to either cell effectively randomly.

  1. If both ranks agree on the point ownership then everything is good.
  2. If both ranks claim to own the point then everything is good. The voting algorithm will pick a winner and the loser will drop/reassign the point.
  3. However, if both ranks think that the other owns the point then things break. The voting algorithm would choose rank 0 as the winning rank but rank 0 does not think it owns the point, leading to the error Some points have been claimed by a cell that we cannot see, but which we think we own..

Solution?

I think that the solution to this is to preferentially assign points to cells that are owned over ghost cells.
This will effectively mean we always have condition (2) instead of condition (3).

@dham
Copy link
Member

dham commented Oct 11, 2024

Yes. How did we not think of this years ago.

@ReubenHill
Copy link
Contributor

ReubenHill commented Oct 11, 2024 via email

@ReubenHill
Copy link
Contributor

ReubenHill commented Oct 11, 2024

Ah I see I missed the context of your comment. Indeed this issue was fixed but the branch was never merged because of a minor test failure that I didn't have time to get to the bottom of. I think your solution should also work, but it's possible that the voting algorithm in master is broken such that (2.) wont actually work at the moment (I'm afraid I no longer recall all the details!). My suggestion is to get #3293 merged first and then implement the preference for owned over ghost cells after that.

@ReubenHill
Copy link
Contributor

ReubenHill commented Oct 11, 2024

For reference, the voting algorithm as implemented in #3293 is described in section 7.5 of my thesis
Nixon-Hill-RW-2024-PhD-Thesis.pdf.pdf

@connorjward
Copy link
Contributor

Good to hear from you Reuben!

Ah I see I missed the context of your comment. Indeed this issue was fixed but the branch was never merged because of a minor test failure that I didn't have time to get to the bottom of. I think your solution should also work, but it's possible that the voting algorithm in master is broken such that (2.) wont actually work at the moment (I'm afraid I no longer recall all the details!). My suggestion is to get #3293 merged first and then implement the preference for owned over ghost cells after that.

AIUI the current voting algorithm works correctly in master and only runs into issues under condition (3).

Before merging #3293 I would like to try my approach out first. I think it would constitute a much smaller change to the code. Effectively we need to modify locate.c so that it does:

best_cell = None
best_cell_is_owned = False
closest_dist = np.inf

for cell in candidate_cells:
  l1_dist = get_l1_dist(cell)

  if is_owned(cell):
    if l1_dist <= 0:
      # found cell, stop here
      best_cell = cell
      break
    elif l1_dist < tolerance and l1_dist < closest_dist:
      # found a candidate
      best_cell = cell
      best_cell_is_owned = True
      closest_dist = l1_dist
  else:  # ghost cell
    if best_cell_is_owned:
      # do not consider ghost cells if an owned cell already claims to own the point
      continue
    elif l1_dist < tolerance and l1_dist < closest_dist:
      best_cell = cell
      closest_dist = l1_dist

return best_cell

I believe that this is sufficient to resolve the issue.

Checking is_owned is easy because we number cells first owned then ghost.

@connorjward
Copy link
Contributor

connorjward commented Oct 18, 2024

I have done this in #3804. It seems to fix the issue for me, though it would be good to get other people to try it out.

Bother. Needs more work.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
4 participants