Commit

Merge pull request #39 from yut23/perlmutter-auto-checkpoint-note
Add note about auto-checkpointing timing out
zingale authored Oct 15, 2024
2 parents 453e9bf + e11ba8f commit 72c4e10
Showing 1 changed file with 7 additions and 0 deletions.
7 changes: 7 additions & 0 deletions sphinx_docs/source/nersc-workflow.rst
@@ -29,6 +29,13 @@ includes the restart logic to allow for job chaining.
``amrex.the_arena_init_size=0`` after ``${restartString}`` in the srun call
so AMReX doesn't reserve 3/4 of the GPU memory for the device arena.

.. note::

If the job times out before writing out a checkpoint (leaving a
``dump_and_stop`` file behind), you can give it more time between the
warning signal and the end of the allocation by adjusting the
``#SBATCH --signal=B:URG@<n>`` line at the top of the script.
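
As a rough illustration of what that line controls, the sketch below shows how a batch script can react to the warning signal. In Slurm, ``<n>`` is the number of seconds before the end of the allocation at which the signal is sent, and the ``B:`` prefix delivers it to the batch shell only. The value ``300`` and the ``sig_handler`` function are assumptions for illustration and are not copied from the actual script:

```shell
#!/bin/bash
# Ask Slurm to send SIGURG to the batch shell (B:) 300 seconds before the
# allocation ends. The value 300 is an illustrative assumption; pick a lead
# time longer than it takes your run to write a checkpoint.
#SBATCH --signal=B:URG@300

# Hypothetical handler: create the dump_and_stop file so the running code
# writes a checkpoint and exits.
sig_handler() {
    touch dump_and_stop
    # Stop handling further URG signals once the request has been made.
    trap - URG
}

# Run sig_handler when the batch shell receives SIGURG.
trap sig_handler URG
```

Slurm may deliver the signal up to 60 seconds earlier than requested, so ``<n>`` is best read as a lower bound on the lead time.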

Below is an example that runs on CPU-only nodes. Here ``ntasks-per-node``
refers to the number of MPI processes (used for distributed parallelism) per node,
and ``cpus-per-task`` refers to the number of hyperthreads used per task
