Checkpoint question #3053
Replies: 4 comments
-
I think the cleanest solution is to somehow save all functions in a single file. 1 and 2 both rest on the same assumption, namely that meshA and meshB have identical data, so I think they are as safe/unsafe as each other. (2 works because I think Mesh construction is mostly deterministic, but I am not sure about partitioners: they may give different partitions depending on the version, or they could be stochastic due to round-off error, in which case I think 1 and 2 could both fail.) So, ideally, once we save the mesh in fileA, we want to load that mesh from fileA (using the same number of processes, due to a current restriction) every time we work with the mesh, and start from that loaded mesh -> define functions -> save functions.
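A minimal sketch of the "save the mesh once, always start from the loaded mesh" workflow described above, assuming Firedrake's `CheckpointFile` interface. The file names, the mesh name `meshA`, the function name `f`, the unit-square mesh, and the two helper functions are illustrative placeholders, not taken from the original scripts. Both steps would need to be run on the same number of processes.

```python
# Sketch only: file names, mesh name and function name are hypothetical.
# Firedrake is imported inside the functions so that this file can be
# imported (e.g. by a driver script) without triggering any mesh setup.

MESH_NAME = "meshA"  # hypothetical name used when saving and loading


def save_mesh_once(mesh_path="fileA.h5"):
    """Create the mesh a single time and store it in fileA."""
    from firedrake import CheckpointFile, UnitSquareMesh

    mesh = UnitSquareMesh(8, 8, name=MESH_NAME)
    with CheckpointFile(mesh_path, "w") as afile:
        afile.save_mesh(mesh)


def run_from_saved_mesh(mesh_path="fileA.h5", out_path="fileB.h5"):
    """Load the mesh from fileA, define a function on it, and save it."""
    from firedrake import (CheckpointFile, Function, FunctionSpace,
                           SpatialCoordinate)

    # Always start from the mesh stored in fileA instead of rebuilding it.
    with CheckpointFile(mesh_path, "r") as afile:
        mesh = afile.load_mesh(MESH_NAME)

    V = FunctionSpace(mesh, "CG", 1)
    x, y = SpatialCoordinate(mesh)
    f = Function(V, name="f").interpolate(x + y)

    with CheckpointFile(out_path, "w") as bfile:
        bfile.save_mesh(mesh)
        bfile.save_function(f)


if __name__ == "__main__":
    save_mesh_once()
    run_from_saved_mesh()
```

Every later script would call something like `run_from_saved_mesh` rather than constructing the mesh again, so that the partitioner only ever runs once.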
-
Thanks @ksagiyam. Yeah, unfortunately that doesn't really work well with a workflow that's quite common among Thetis users. Just to make sure I understand things correctly: would the following also be unsafe? I run the following in sequence. Run A:
Then run B:
And finally run C:
Would this also be unsafe? What about if I use

In the Thetis context, A would be a preprocessing script that prepares fields that don't change over time, such as bathymetry, a viscosity field, etc. B would then be a cold-start Thetis run which produces a series of checkpoints from which the model can be restarted. The decision about which of these checkpoints to restart from is often made afterwards, i.e. it's not necessarily just about restarting from the last checkpoint, but might involve rerunning some time window with slightly modified parameters, for instance.

The preprocessing is often separated because: 1) it may take considerable time, and the result is reused for a number of runs B; 2) it may involve very large input files that are interpolated from and combined into a single interpolated field (after some smoothing, solving some distance-to-coast Eikonal equation, etc.), and these input files may be impractical to copy around.

Another solution would be to redesign the checkpointing in Thetis and write out all input fields (including the ones produced by A) every time we checkpoint, but that would be quite wasteful in disk space.
-
I think that sequence is safe (if the process count is the same in A, B, and C).
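Since the safety of the sequence hinges on the process count being the same in A, B, and C, one option is to record it next to the checkpoint and check it before loading. The following is a rough sketch using only the Python standard library; the sidecar-file convention and both helper names are hypothetical and not part of Firedrake or Thetis. In a real run the count would presumably come from the MPI communicator (for example `COMM_WORLD.size`), and only one rank should write the sidecar.

```python
import json
from pathlib import Path


def _sidecar(checkpoint_path):
    # Hypothetical convention: "<checkpoint>.nprocs.json" next to the file.
    return Path(str(checkpoint_path) + ".nprocs.json")


def record_process_count(checkpoint_path, nprocs):
    """Store the number of processes used when the checkpoint was written."""
    sidecar = _sidecar(checkpoint_path)
    sidecar.write_text(json.dumps({"nprocs": nprocs}))
    return sidecar


def check_process_count(checkpoint_path, nprocs):
    """Raise if the current process count differs from the recorded one."""
    saved = json.loads(_sidecar(checkpoint_path).read_text())["nprocs"]
    if saved != nprocs:
        raise RuntimeError(
            f"{checkpoint_path} was written on {saved} processes, "
            f"but this run uses {nprocs}"
        )
    return saved
```

Script A would call `record_process_count` after writing its checkpoint, and B and C would call `check_process_count` before reading it, so that a mismatch fails loudly instead of silently producing inconsistent data.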
-
OK, great, thanks for the clarification. I think that means there is no issue for us: B is already forced to read the mesh from the checkpoint (rather than recreating it) if it needs to use the results from A (which I think is the typical scenario for us), and thus its checkpoint files should be safe to be read by C using the mesh from checkpoint A. Many thanks!
-
Suppose I have two Firedrake scripts, say A and B, that both read in the same mesh and each write out a checkpoint file with some function result. I then have a third script in which I want to read in both checkpoints and combine them in some further Firedrake computation, for which I need both functions to be defined on the same domain. Should I:
1.
or simply 2.
Is 2. allowed/safe? It doesn't complain. But is it guaranteed to work in parallel, or if, say, the two scripts are run on different systems?
Context is Thetis, where we write out elevations and velocities in separate checkpoint files - so these are at least written in the same run - but also often have other checkpoints created in some separate pre-processing script with bathymetry, viscosity, etc. See thetisproject/thetis#336