Add CLI for slurm configuration #70

Merged
merged 6 commits into from
Mar 19, 2024
2 changes: 1 addition & 1 deletion docs/src/experiment_setup_guide.md
@@ -13,7 +13,7 @@ For the example experiment, `sphere_held_suarez_rhoe_equilmoist`, this is done b
`sbatch experiments/sphere_held_suarez_rhoe_equilmoist/generate_observations.sbatch`. This script runs the model, passes the output through the observation map, and saves the result.

Once the observations have been processed and saved, the actual calibration pipeline can be run via
-`bash experiments/pipeline.sh sphere_held_suarez_rhoe_equilmoist 8`.
+`bash pipeline.sh sphere_held_suarez_rhoe_equilmoist -n 10 -c 8`.

!!! note
The command line interface for `pipeline.sh` will change. For now, the first entry is the experiment id and the second is the number of tasks to use per ensemble member.
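
The two documentation snippets in this PR place the experiment id on different sides of the flags. Assuming `getopt` is the GNU (util-linux) implementation, which moves non-option arguments behind `--`, both orderings should parse the same way. A minimal sketch that reuses the option strings from `slurm/parse_commandline.sh`; the `normalize` helper is a hypothetical name introduced only for this illustration:

```bash
# Sketch only: assumes GNU getopt (util-linux) is the `getopt` on PATH.
# `normalize` is a hypothetical helper, not part of the PR.
normalize() {
    getopt -o h,t:,n:,c:,g: \
        --long help,time:,ntasks:,cpus_per_task:,gpus_per_task: -- "$@"
}

# Experiment id before the flags (as in the setup guide)...
id_first=$(normalize sphere_held_suarez_rhoe_equilmoist -n 10 -c 8)
# ...and after the flags (as in the quickstart).
id_last=$(normalize -n 10 -c 8 sphere_held_suarez_rhoe_equilmoist)

echo "id first: $id_first"
echo "id last:  $id_last"
```

If `getopt` on the target system is not the GNU version, long options and this reordering may not be available, so the flags-first form is probably the safer one to document.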
8 changes: 4 additions & 4 deletions docs/src/quickstart.md
@@ -12,19 +12,19 @@ By default, it runs 10 ensemble members for 3 iterations.
To run this experiment:
1. Log onto the Caltech HPC
2. Clone CalibrateAtmos.jl and `cd` into the repository.
-3. Run: `bash experiments/pipeline.sh sphere_held_suarez_rhoe_equilmoist 8`. This will run the `sphere_held_suarez_rhoe_equilmoist` experiment with 8 tasks per ensemble member.
+3. Run: `bash pipeline.sh -n 10 -c 8 sphere_held_suarez_rhoe_equilmoist`. This will run the `sphere_held_suarez_rhoe_equilmoist` experiment with 10 tasks per ensemble member and 8 CPU cores per task.

## Local Machine
-To run an experiment on your local machine, you can use the `experiments/pipeline.jl` script. This is recommended for more lightweight experiments, such as the `surface_fluxes_perfect_model` experiment, which uses the [SurfaceFluxes.jl](https://github.com/CliMA/SurfaceFluxes.jl) package to generate a physical model that calculates the Monin Obukhov turbulent surface fluxes based on idealized atmospheric and surface conditions. Since this is a "perfect model" example, the same model is used to generate synthetic observations using its default parameters and a small amount of noise. These synthetic observations are considered to be the ground truth, which is used to assess the model ensembles' performance when parameters are drawn from the prior parameter distributions. To run this experiment, you can use the following command from terminal to run an interactive run:
+To run an experiment on your local machine, you can use the `pipeline.jl` script. This is recommended for more lightweight experiments, such as the `surface_fluxes_perfect_model` experiment, which uses the [SurfaceFluxes.jl](https://github.com/CliMA/SurfaceFluxes.jl) package to generate a physical model that calculates the Monin Obukhov turbulent surface fluxes based on idealized atmospheric and surface conditions. Since this is a "perfect model" example, the same model is used to generate synthetic observations using its default parameters and a small amount of noise. These synthetic observations are considered to be the ground truth, which is used to assess the model ensembles' performance when parameters are drawn from the prior parameter distributions. To run this experiment, you can use the following command from terminal to run an interactive run:

```bash
-julia -i experiments/pipeline.jl surface_fluxes_perfect_model
+julia -i pipeline.jl surface_fluxes_perfect_model
```

This pipeline mirrors the pipeline of the bash scripts, and the same example can be run on the HPC cluster if needed:

```bash
-bash experiments/pipeline.sh surface_fluxes_perfect_model 8
+bash pipeline.sh surface_fluxes_perfect_model 8
```

The experiments (such as `surface_fluxes_perfect_model`) can be equally defined within the component model repos (in this case, `SurfaceFluxes.jl`), so that the internals of `CalibrateAtmos.jl` do not explicitly depend on component models.
56 changes: 0 additions & 56 deletions experiments/pipeline.sh

This file was deleted.

File renamed without changes.
51 changes: 51 additions & 0 deletions pipeline.sh
@@ -0,0 +1,51 @@
#!/bin/bash
set -euo pipefail
export MODULEPATH=/groups/esm/modules:$MODULEPATH
module purge
module load climacommon/2024_03_18

source slurm/parse_commandline.sh
if [ ! -d "$output" ] ; then
mkdir -p "$output"
fi

# Initialize the project and setup calibration
init_id=$(sbatch --parsable \
--output=$logfile \
--partition=$partition \
slurm/initialize.sbatch $experiment_id)
echo -e "Initialization job_id: $init_id\n"

# Loop over iterations
dependency="afterok:$init_id"
for i in $(seq 0 $((n_iterations - 1)))
do
echo "Scheduling iteration $i"
format_i=$(printf "iteration_%03d" "$i")

ensemble_array_id=$(
sbatch --dependency=$dependency --kill-on-invalid-dep=yes --parsable \
--job-name=model-$i \
--output=/dev/null \
--array=1-$ensemble_size \
--time=$slurm_time \
--ntasks=$slurm_ntasks \
--partition=$partition \
--cpus-per-task=$slurm_cpus_per_task \
--gpus-per-task=$slurm_gpus_per_task \
slurm/model_run.sbatch $experiment_id $i)

dependency=afterany:$ensemble_array_id
echo "Iteration $i job id: $ensemble_array_id"

update_id=$(
sbatch --dependency=$dependency --kill-on-invalid-dep=yes --parsable \
--job-name=update-$i \
--output=$logfile \
--open-mode=append \
--partition=$partition \
slurm/update.sbatch $experiment_id $i)

dependency=afterany:$update_id
echo -e "Update $i job id: $update_id\n"
done
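
The loop in `pipeline.sh` builds a linear chain of jobs: initialization, then for each iteration an ensemble array followed by an update step, each waiting on the previous job. The chaining can be sketched without a cluster by replacing `sbatch --parsable` with a stand-in. Here `fake_submit`, the starting id `100`, and `n_iterations=2` are all assumptions made up for this illustration:

```bash
# Dry-run sketch of the dependency chain; nothing is submitted to Slurm.
next_id=100      # arbitrary starting value for the fake job ids
n_iterations=2   # example value; the real one comes from ekp_config.yml

# Hypothetical stand-in for `sbatch --parsable`: sets $job_id and
# prints which dependency the job would have been submitted with.
fake_submit() {   # usage: fake_submit <job name> <dependency>
    job_id=$next_id
    next_id=$((next_id + 1))
    printf '%-9s job %s  depends on: %s\n' "$1" "$job_id" "$2"
}

fake_submit init none
dependency="afterok:$job_id"   # the first iteration requires a successful init

for i in $(seq 0 $((n_iterations - 1))); do
    fake_submit "model-$i" "$dependency"
    dependency="afterany:$job_id"   # update runs once the ensemble array ends
    fake_submit "update-$i" "$dependency"
    dependency="afterany:$job_id"   # next iteration waits for this update
done
```

Only the first link uses `afterok`. Every later link uses `afterany`, so an update step is scheduled even if some ensemble members fail.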
3 changes: 2 additions & 1 deletion experiments/initialize.sbatch → slurm/initialize.sbatch
@@ -1,10 +1,11 @@
#!/bin/sh
#SBATCH --time=00:30:00
#SBATCH --ntasks=1
-#SBATCH --cpus-per-task=1
+#SBATCH --cpus-per-task=8
#SBATCH --job init_calibration

experiment_id=$1
+export JULIA_NUM_PRECOMPILE_TASKS=8

echo "Initializing calibration for experiment: $experiment_id"
julia --color=no --project=experiments/$experiment_id -e 'using Pkg; Pkg.instantiate(;verbose=true)'
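
For `JULIA_NUM_PRECOMPILE_TASKS` to reach the `julia` process, it has to be in the environment rather than being a plain shell variable. A minimal sketch of the difference, using a hypothetical variable `DEMO_TASKS` and a child `sh` in place of `julia`:

```bash
# DEMO_TASKS is a made-up name used only for this illustration.
unset DEMO_TASKS

# Plain assignment: visible in this shell, but not passed to child processes.
DEMO_TASKS=8
plain=$(sh -c 'echo "${DEMO_TASKS:-unset}"')

# Exported assignment: inherited by child processes.
export DEMO_TASKS=8
exported=$(sh -c 'echo "${DEMO_TASKS:-unset}"')

echo "plain: $plain, exported: $exported"
# prints "plain: unset, exported: 8"
```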
3 changes: 0 additions & 3 deletions experiments/model_run.sbatch → slurm/model_run.sbatch
@@ -1,7 +1,4 @@
#!/bin/bash
-#SBATCH --time=2:00:00
-#SBATCH --cpus-per-task=8
-#SBATCH --mem-per-cpu=8G

# Extract command-line arguments
experiment_id=$1
82 changes: 82 additions & 0 deletions slurm/parse_commandline.sh
@@ -0,0 +1,82 @@
# Default arguments
slurm_time="2:00:00"
slurm_ntasks="1"
slurm_cpus_per_task="1"
slurm_gpus_per_task="0"

help_message="Usage:
./pipeline.sh [options] experiment_id

Options:
-t, --time=HH:MM:SS: Set max wallclock time (default: 2:00:00).
-n, --ntasks: Set number of tasks to launch (default: 1).
-c, --cpus_per_task: Set CPU cores per task (mutually exclusive with -g, default: 1).
-g, --gpus_per_task: Set GPUs per task (mutually exclusive with -c, default: 0).
-h, --help: Display this help message.

Arguments:
experiment_id: A unique identifier for your experiment (required)."

# Parse arguments using getopt
VALID_ARGS=$(getopt -o h,t:,n:,c:,g: --long help,time:,ntasks:,cpus_per_task:,gpus_per_task: -- "$@")
if [[ $? -ne 0 ]]; then
exit 1;
fi

eval set -- "$VALID_ARGS"

# Process arguments
while [ : ]; do
case "$1" in
-t | --time)
slurm_time="$2"
shift 2
;;
-n | --ntasks)
slurm_ntasks="$2"
shift 2
;;
-c | --cpus_per_task)
slurm_cpus_per_task="$2"
shift 2
;;
-g | --gpus_per_task)
slurm_gpus_per_task="$2"
shift 2
;;
-h | --help)
printf "%s\n" "$help_message"
exit 0
;;
--) shift; break ;; # End of options
esac
done

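# Use a default expansion so that `set -u` in pipeline.sh does not abort before the error message
experiment_id="${1:-}"
if [ -z "$experiment_id" ] ; then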
echo "Error: No experiment ID provided."
exit 1
fi

# Get values from EKP config file
ensemble_size=$(grep "ensemble_size:" experiments/$experiment_id/ekp_config.yml | awk '{print $2}')
n_iterations=$(grep "n_iterations:" experiments/$experiment_id/ekp_config.yml | awk '{print $2}')
output=$(grep "output_dir:" experiments/$experiment_id/ekp_config.yml | awk '{print $2}')
logfile=$output/experiment_log.out

# Set partition
if [[ $slurm_gpus_per_task -gt 0 ]]; then
partition=gpu
else
partition=expansion
fi

# Output slurm configuration
echo "Running experiment: $experiment_id"
indent=" └ "
printf "Slurm configuration (per ensemble member):\n"
printf "%sTime limit: %s\n" "$indent" "$slurm_time"
printf "%sTasks: %s\n" "$indent" "$slurm_ntasks"
printf "%sCPUs per task: %s\n" "$indent" "$slurm_cpus_per_task"
printf "%sGPUs per task: %s\n" "$indent" "$slurm_gpus_per_task"
echo ""
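
The `grep ... | awk '{print $2}'` lookups in `slurm/parse_commandline.sh` take the second whitespace-separated field of any line containing the key. A small sketch of that behavior on a sample file; the `get_config_value` helper and the sample values (including the `output/demo_experiment` path) are assumptions made up for illustration, not the contents of a real `ekp_config.yml`:

```bash
# Hypothetical helper wrapping the same grep/awk lookup; not part of the PR.
get_config_value() {   # usage: get_config_value <key> <file>
    grep "$1:" "$2" | awk '{print $2}'
}

# Sample config with made-up values, written to a temporary file.
config=$(mktemp)
cat > "$config" <<'EOF'
ensemble_size: 10
n_iterations: 3
output_dir: output/demo_experiment
EOF

ensemble_size=$(get_config_value ensemble_size "$config")
n_iterations=$(get_config_value n_iterations "$config")
output=$(get_config_value output_dir "$config")
logfile=$output/experiment_log.out

echo "$ensemble_size members, $n_iterations iterations, log file: $logfile"
rm -f "$config"
```

Because the lookup is plain text matching rather than YAML parsing, it likely depends on each key appearing exactly once, on a single line, with an unquoted value that contains no spaces.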
1 change: 1 addition & 0 deletions experiments/update.sbatch → slurm/update.sbatch
@@ -19,3 +19,4 @@ julia --color=no --project=experiments/$experiment_id -e '
JLD2.save_object(joinpath(iter_path, "observation_map.jld2"), G_ensemble)
CalibrateAtmos.update_ensemble(experiment_id, i)
'
+echo "Update step for iteration $i complete"