Checkpoint question #3053

stephankramer · 2023-08-01T17:29:12Z

stephankramer
Aug 1, 2023
Collaborator

Suppose I have two Firedrake scripts, say A and B, that both read in the same mesh and each write out a checkpoint file with some function result. In a third script I want to read in both checkpoints and combine them in some further Firedrake computation, for which I need both functions to be defined on the same domain. Should I:
1.

with CheckpointFile('A.h5', 'r') as f:
   meshA = f.load_mesh('...')
   funA = f.load_function(meshA, 'A')
with CheckpointFile('B.h5', 'r') as f:
   meshB = f.load_mesh('...')
   funB_on_meshB = f.load_function(meshB, 'B')
V = FunctionSpace(meshA, "...same element-spec as funB...")
funB = Function(V)
funB.dat.data[:] = funB_on_meshB.dat.data[:]

or simply 2.

with CheckpointFile('A.h5', 'r') as f:
   meshA = f.load_mesh('...')
   funA = f.load_function(meshA, 'A')
with CheckpointFile('B.h5', 'r') as f:
   funB = f.load_function(meshA, 'B')

Is 2. allowed/safe? It doesn't complain. But is it guaranteed to work in parallel, or if, say, the two scripts are run on different systems?

Context is Thetis, where we write out elevations and velocities in separate checkpoint files - so these are at least written in the same run - but also often have other checkpoints created in some separate pre-processing script with bathymetry, viscosity, etc. See thetisproject/thetis#336

ksagiyam · 2023-08-01T23:15:17Z

ksagiyam
Aug 1, 2023
Maintainer

I think the cleanest solution is to somehow save all functions in a single file.

1 and 2 are both based on the same assumption that meshA and meshB have identical data, so I think they are as safe/unsafe as each other. (2 works as CheckpointFile does not distinguish two meshes with identical data.)
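To illustrate the shared assumption, here is a toy sketch in plain Python (no Firedrake involved). The names `coords_a`, `coords_b`, and `field` are hypothetical stand-ins for two copies of the same mesh whose nodes happen to be numbered differently. It only shows why reusing raw DoF data is correct when, and only when, the two orderings coincide.

```python
# Toy model: a "function" is a list of values stored in its mesh's node ordering.
coords_a = [0.0, 0.5, 1.0]   # node ordering of hypothetical "mesh A"
coords_b = [1.0, 0.0, 0.5]   # same nodes, numbered differently ("mesh B")


def field(x):
    """Some field evaluated at a node coordinate."""
    return 10.0 * x


# Values as they would be stored on mesh B, in B's ordering.
data_b = [field(x) for x in coords_b]

# Raw copy of the data array: this is what option 1 does explicitly,
# and what option 2 relies on implicitly.
raw_copy = list(data_b)

# A transfer that matches nodes by coordinate instead of by index.
index_in_b = {x: i for i, x in enumerate(coords_b)}
remapped = [data_b[index_in_b[x]] for x in coords_a]

# What the field should look like in A's ordering.
expected = [field(x) for x in coords_a]

print("raw copy matches:", raw_copy == expected)
print("remapped matches:", remapped == expected)
```

If `coords_b` were identical to `coords_a`, the raw copy would match as well, which is the situation both options depend on.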

I think Mesh construction is mostly deterministic, but I am not sure about partitioners. They may give different partitions depending on the version, or they could be stochastic due to the round-off error, in which case I think 1 and 2 could both fail. So, ideally, once we save the mesh in fileA, we want to load that mesh from fileA (using the same number of processes due to current restriction) every time we work with the mesh, and start from that loaded mesh -> define functions -> save functions.
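A minimal sketch of that load -> define -> save cycle, keeping everything in one file. It assumes the file already contains a mesh saved under the default name and a function named 'u', that append mode ('a') behaves as documented for `CheckpointFile`, and that it is run on the same number of processes as the script that wrote the file. The helper name `append_function_to_checkpoint` is hypothetical.

```python
def append_function_to_checkpoint(path="A.h5"):
    """Load the saved mesh from `path`, build a function on it, and append it.

    Sketch only: assumes `path` holds a mesh named 'firedrake_default'
    and a function named 'u'.
    """
    # Imported lazily so this module can be imported without Firedrake.
    from firedrake import CheckpointFile, Function, FunctionSpace

    # 'a' opens the existing file for reading and appending.
    with CheckpointFile(path, "a") as f:
        # Start from the mesh stored in the file rather than rebuilding it.
        mesh = f.load_mesh("firedrake_default")
        u = f.load_function(mesh, "u")

        # Define the new function on the loaded mesh.
        V = FunctionSpace(mesh, "CG", 1)
        v = Function(V, name="v")
        v.assign(u)

        # Store it next to 'u' in the same file.
        f.save_function(v, name="v")
    return v
```

A later script could then read both 'u' and 'v' from the one file with a single `load_mesh` call.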


stephankramer · 2023-08-02T11:11:44Z

stephankramer
Aug 2, 2023
Collaborator Author

Thanks @ksagiyam. Yeah, unfortunately that doesn't really work well with a workflow that's quite common with Thetis users.

Just to make sure I understand things correctly. Would the following also be unsafe:

I run the following in sequence:

run A:

from firedrake import *

mesh = Mesh('gmsh.msh')
V = FunctionSpace(mesh, "CG", 1)
u = Function(V)
# now put some values in u
with CheckpointFile('A.h5', 'w') as f:
    f.save_mesh(mesh)
    # pass the name by keyword: the second positional argument is idx
    f.save_function(u, name='u')

Then run B:

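from firedrake import *

# open A.h5 read-only; 'w' would overwrite the checkpoint
with CheckpointFile('A.h5', 'r') as f:
    mesh = f.load_mesh('firedrake_default')
    u = f.load_function(mesh, 'u')
V = FunctionSpace(mesh, "CG", 1)
v = Function(V)
v.assign(u)
with CheckpointFile('B.h5', 'w') as f:
    # pass the name by keyword: the second positional argument is idx
    f.save_function(v, name='v')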

And finally run C:

from firedrake import *

# both checkpoints are only read here
with CheckpointFile('A.h5', 'r') as f:
    mesh = f.load_mesh('firedrake_default')
    u = f.load_function(mesh, 'u')
with CheckpointFile('B.h5', 'r') as f:
    # load v onto the mesh that was read from A.h5
    v = f.load_function(mesh, 'v')
V = FunctionSpace(mesh, "CG", 1)
w = Function(V)
w.assign(u+v)

Would this also be unsafe? What about if I use reorder=False in load_mesh in B and C? Does that not guarantee that I end up with the same mesh (same ordering and decomposition)? This assumes I run everything on the same number of processes, so that presumably A would have come up with a decent decomposition.

In the Thetis context A would be a preprocessing script that preprocesses some fields that don't change over time, like bathymetry, a viscosity field, etc. Then B would be a cold-start Thetis run which produces a series of checkpoints from which the model can be restarted. The decision from which of these checkpoints to restart is often made afterwards, i.e. it's not necessarily just about restarting from the last checkpoint, but might be rerunning some time window with slightly modified parameters, for instance. The preprocessing is often separated because: 1) it may take considerable time and the result is reused for a number of runs B; 2) it may involve very large input files that are interpolated from and combined into a single interpolated field after some smoothing, solving some distance-to-coast Eikonal equation, etc., and these input files may be impractical to copy around. Another solution would be to redesign the checkpointing in Thetis and write out all input fields (including the ones produced by A) every time we checkpoint, but that would be quite wasteful in disk space.


ksagiyam · 2023-08-02T12:46:28Z

ksagiyam
Aug 2, 2023
Maintainer

I think that sequence is safe (if the process count is the same in A, B, and C).
Firedrake by default saves the mesh distribution and entity permutation (or reordering), and, if the loading process count is the same, it by default uses the saved distribution (instead of a partitioner) to distribute the mesh and uses the saved permutation (instead of computing a new one) to determine the entity ordering. So the mesh constructed in A is guaranteed to have the exact same distribution as the mesh loaded in B and C, and u constructed in A is guaranteed to have the exact same DoF ordering as u and v in B and u, v, and w in C.
(If we set reorder=False instead of reorder=None (default) in load_mesh, it would tell Firedrake not to do the default things, so we do not want to do that.)


stephankramer · 2023-08-02T13:42:30Z

stephankramer
Aug 2, 2023
Collaborator Author

Ok, great, thanks for the clarification. I think that means there is no issue for us: B is already forced to read the mesh from the checkpoint (rather than recreating it) if it needs to use the results from A (which I think is the typical scenario for us), and thus its checkpoint files should be safe to be read by C using the mesh from checkpoint A. Many thanks!

