Add initial CLI w/ GPU support
nefrathenrici committed Mar 18, 2024
1 parent f4b476a commit f3c2442
Showing 9 changed files with 146 additions and 65 deletions.
2 changes: 1 addition & 1 deletion docs/src/experiment_setup_guide.md
@@ -13,7 +13,7 @@ For the example experiment, `sphere_held_suarez_rhoe_equilmoist`, this is done b
`sbatch experiments/sphere_held_suarez_rhoe_equilmoist/generate_observations.sbatch`. This script runs the model, passes the output through the observation map, and saves the result.

Once the observations have been processed and saved, the actual calibration pipeline can be run via
`bash experiments/pipeline.sh sphere_held_suarez_rhoe_equilmoist 8`.
`bash pipeline.sh sphere_held_suarez_rhoe_equilmoist -n 10 -c 8`.

!!! note
The command line interface for `pipeline.sh` will change. For now, the first entry is the experiment id and the second is the number of tasks to use per ensemble member.
8 changes: 4 additions & 4 deletions docs/src/quickstart.md
@@ -12,19 +12,19 @@ By default, it runs 10 ensemble members for 3 iterations.
To run this experiment:
1. Log onto the Caltech HPC
2. Clone CalibrateAtmos.jl and `cd` into the repository.
3. Run: `bash experiments/pipeline.sh sphere_held_suarez_rhoe_equilmoist 8`. This will run the `sphere_held_suarez_rhoe_equilmoist` experiment with 8 tasks per ensemble member.
3. Run: `bash pipeline.sh -n 10 -c 8 sphere_held_suarez_rhoe_equilmoist`. This will run the `sphere_held_suarez_rhoe_equilmoist` experiment with 10 tasks per ensemble member and 8 CPU cores per task.
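
As a rough sketch of what these flags request, assuming Slurm's usual accounting in which a job receives `--ntasks` × `--cpus-per-task` cores and each ensemble member runs as its own job array element (the variable names below are illustrative and are not taken from `pipeline.sh`):

```bash
#!/bin/bash
# Illustrative arithmetic only; these variable names are assumptions, not part of pipeline.sh.
ntasks=10          # value passed with -n
cpus_per_task=8    # value passed with -c
ensemble_size=10   # default number of ensemble members for this experiment

# Cores requested by one ensemble member (one job array element)
cores_per_member=$((ntasks * cpus_per_task))
# Upper bound if the scheduler happens to run every member at the same time
cores_if_all_concurrent=$((cores_per_member * ensemble_size))

echo "Cores per ensemble member: $cores_per_member"
echo "Cores if all members run at once: $cores_if_all_concurrent"
```

Whether all members actually run concurrently depends on the cluster's scheduling and limits, so the second number is only an upper bound.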

## Local Machine
To run an experiment on your local machine, you can use the `experiments/pipeline.jl` script. This is recommended for more lightweight experiments, such as the `surface_fluxes_perfect_model` experiment, which uses the [SurfaceFluxes.jl](https://github.com/CliMA/SurfaceFluxes.jl) package to generate a physical model that calculates the Monin-Obukhov turbulent surface fluxes based on idealized atmospheric and surface conditions. Since this is a "perfect model" example, the same model is used to generate synthetic observations using its default parameters and a small amount of noise. These synthetic observations are treated as the ground truth, which is used to assess the model ensembles' performance when parameters are drawn from the prior parameter distributions. To start this experiment in an interactive Julia session, run the following command from a terminal:
To run an experiment on your local machine, you can use the `pipeline.jl` script. This is recommended for more lightweight experiments, such as the `surface_fluxes_perfect_model` experiment, which uses the [SurfaceFluxes.jl](https://github.com/CliMA/SurfaceFluxes.jl) package to generate a physical model that calculates the Monin-Obukhov turbulent surface fluxes based on idealized atmospheric and surface conditions. Since this is a "perfect model" example, the same model is used to generate synthetic observations using its default parameters and a small amount of noise. These synthetic observations are treated as the ground truth, which is used to assess the model ensembles' performance when parameters are drawn from the prior parameter distributions. To start this experiment in an interactive Julia session, run the following command from a terminal:

```bash
julia -i experiments/pipeline.jl surface_fluxes_perfect_model
julia -i pipeline.jl surface_fluxes_perfect_model
```

This pipeline mirrors the pipeline of the bash scripts, and the same example can be run on the HPC cluster if needed:

```bash
bash experiments/pipeline.sh surface_fluxes_perfect_model 8
bash pipeline.sh surface_fluxes_perfect_model 8
```

The experiments (such as `surface_fluxes_perfect_model`) can be equally defined within the component model repos (in this case, `SurfaceFluxes.jl`), so that the internals of `CalibrateAtmos.jl` do not explicitly depend on component models.
56 changes: 0 additions & 56 deletions experiments/pipeline.sh

This file was deleted.

File renamed without changes.
49 changes: 49 additions & 0 deletions pipeline.sh
@@ -0,0 +1,49 @@
#!/bin/bash
source slurm/parse_commandline.sh
# Create the output directory (and any missing parents) if it does not exist yet
if [ ! -d "$output" ] ; then
mkdir -p "$output"
fi

# Initialize the project and setup calibration
init_id=$(sbatch --parsable \
--output=$logfile \
--partition=$partition \
slurm/initialize.sbatch $experiment_id)
echo "Initialization job_id: $init_id"
echo ""

# Loop over iterations
dependency="afterok:$init_id"
for i in $(seq 0 $((n_iterations - 1)))
do
echo "Scheduling iteration $i"
format_i=$(printf "iteration_%03d" "$i")

ensemble_array_id=$(
sbatch --dependency=$dependency --kill-on-invalid-dep=yes --parsable \
--job-name=model-$i \
--output=/dev/null \
--array=1-$ensemble_size \
--time=$slurm_time \
--ntasks=$slurm_ntasks \
--partition=$partition \
--cpus-per-task=$slurm_cpus_per_task \
--gpus-per-task=$slurm_gpus_per_task \
slurm/model_run.sbatch $experiment_id $i
)

dependency=afterany:$ensemble_array_id
echo "Iteration $i job id: $ensemble_array_id"

update_id=$(
sbatch --dependency=$dependency --kill-on-invalid-dep=yes --parsable \
--job-name=update-$i \
--output=$logfile \
--open-mode=append \
--partition=$partition \
slurm/update.sbatch $experiment_id $i)

dependency=afterany:$update_id
echo "Update $i job id: $update_id"
echo ""
done
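
The submission loop above chains jobs so that the first model run waits for a successful initialization (`afterok`), each update waits for its model run, and each later model run waits for the previous update (`afterany`, which releases the next job once the previous one has terminated, whatever its exit status). The dry-run sketch below prints that chain without contacting Slurm. `fake_submit`, the `plan` array, and the job ids starting at 100 are stand-ins invented for illustration in place of `sbatch --parsable`.

```bash
#!/bin/bash
# Dry-run sketch of the dependency chain; nothing here talks to Slurm.
# fake_submit is a hypothetical replacement for `sbatch --parsable`.
next_id=100
plan=()

fake_submit() {
    # $1 = job name, $2 = dependency string the job would be submitted with
    last_id=$next_id
    next_id=$((next_id + 1))
    plan+=("$1 id=$last_id dependency=$2")
}

n_iterations=2

fake_submit init none
dependency="afterok:$last_id"
for ((i = 0; i < n_iterations; i++)); do
    # Model runs for iteration i wait on the previous step
    fake_submit "model-$i" "$dependency"
    dependency="afterany:$last_id"

    # The update waits until the model runs have terminated
    fake_submit "update-$i" "$dependency"
    dependency="afterany:$last_id"
done

printf '%s\n' "${plan[@]}"
```

Because `fake_submit` records ids in global variables rather than printing them, it is called directly instead of through `$(...)`; a command substitution would run it in a subshell and lose the counter.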
3 changes: 2 additions & 1 deletion experiments/initialize.sbatch → slurm/initialize.sbatch
@@ -1,10 +1,11 @@
#!/bin/sh
#SBATCH --time=00:30:00
#SBATCH --ntasks=1
#SBATCH --cpus-per-task=1
#SBATCH --cpus-per-task=8
#SBATCH --job-name=init_calibration

experiment_id=$1
# Export so the julia child process actually sees the setting
export JULIA_NUM_PRECOMPILE_TASKS=8

echo "Initializing calibration for experiment: $experiment_id"
julia --color=no --project=experiments/$experiment_id -e 'using Pkg; Pkg.instantiate(;verbose=true)'
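
A variable assigned in a shell script reaches child processes such as `julia` only if it is exported. The sketch below shows the difference using a placeholder variable, `DEMO_PRECOMPILE_TASKS`, which is invented for this illustration and has no meaning to Julia.

```bash
#!/bin/sh
# DEMO_PRECOMPILE_TASKS is a placeholder name used only for this illustration.
unset DEMO_PRECOMPILE_TASKS

# Plain assignment: visible in this shell, but not inherited by child processes
DEMO_PRECOMPILE_TASKS=8
without_export=$(sh -c 'echo "${DEMO_PRECOMPILE_TASKS:-unset}"')

# After export, child processes inherit the value
export DEMO_PRECOMPILE_TASKS
with_export=$(sh -c 'echo "${DEMO_PRECOMPILE_TASKS:-unset}"')

echo "child sees without export: $without_export"
echo "child sees with export:    $with_export"
```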
3 changes: 0 additions & 3 deletions experiments/model_run.sbatch → slurm/model_run.sbatch
@@ -1,7 +1,4 @@
#!/bin/bash
#SBATCH --time=2:00:00
#SBATCH --cpus-per-task=8
#SBATCH --mem-per-cpu=8G

# Extract command-line arguments
experiment_id=$1
89 changes: 89 additions & 0 deletions slurm/parse_commandline.sh
@@ -0,0 +1,89 @@
export MODULEPATH=/groups/esm/modules:$MODULEPATH
module load climacommon/2024_03_18

# Default arguments
slurm_time="2:00:00"
slurm_ntasks="1"
slurm_cpus_per_task="1"
slurm_gpus_per_task="0"

help_message="Usage:
./pipeline.sh [options] experiment_id
Options:
-t, --time=HH:MM:SS: Set max wallclock time (default: 2:00:00).
-n, --ntasks: Set number of tasks to launch (default: 1).
-c, --cpus_per_task: Set CPU cores per task (mutually exclusive with -g, default: 1).
-g, --gpus_per_task: Set GPUs per task (mutually exclusive with -c, default: 0).
-h, --help: Display this help message.
Arguments:
experiment_id: A unique identifier for your experiment (required).
Notes:
Cannot specify both CPU and GPU resources.
Script exits with error on missing arguments or invalid options."

# Parse arguments using getopt
VALID_ARGS=$(getopt -o h,t:,n:,c:,g: --long help,time:,ntasks:,cpus_per_task:,gpus_per_task: -- "$@")
if [[ $? -ne 0 ]]; then
exit 1;
fi

eval set -- "$VALID_ARGS"

# Process arguments
while true; do
case "$1" in
-t | --time)
slurm_time="$2"
shift 2
;;
-n | --ntasks)
slurm_ntasks="$2"
shift 2
;;
-c | --cpus_per_task)
slurm_cpus_per_task="$2"
shift 2
;;
-g | --gpus_per_task)
slurm_gpus_per_task="$2"
shift 2
;;
-h | --help)
printf "%s\n" "$help_message"
exit 0
;;
--) shift; break ;; # End of options
esac
done

experiment_id="$1"
if [ -z "$experiment_id" ] ; then
echo "Error: No experiment ID provided."
exit 1
fi

# Get values from EKP config file
ensemble_size=$(grep "ensemble_size:" experiments/$experiment_id/ekp_config.yml | awk '{print $2}')
n_iterations=$(grep "n_iterations:" experiments/$experiment_id/ekp_config.yml | awk '{print $2}')
output=$(grep "output_dir:" experiments/$experiment_id/ekp_config.yml | awk '{print $2}')
logfile=$output/experiment_log.out

# Set partition
if [[ $slurm_gpus_per_task -gt 0 ]]; then
partition=gpu
else
partition=expansion
fi

# Output slurm configuration
echo "Running experiment: $experiment_id"
indent=""
printf "Slurm configuration (per ensemble member):\n"
printf "%sTime limit: %s\n" "$indent" "$slurm_time"
printf "%sTasks: %s\n" "$indent" "$slurm_ntasks"
printf "%sCPUs per task: %s\n" "$indent" "$slurm_cpus_per_task"
printf "%sGPUs per task: %s\n" "$indent" "$slurm_gpus_per_task"
echo ""
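
The docs in this commit call the script both with options before the experiment id and with options after it. Assuming the util-linux (GNU) `getopt`, both orders parse the same way, because `getopt` moves options ahead of positional arguments before the `case` loop runs. The sketch below shows this with a reduced option set; `parse_demo` is a hypothetical helper written for illustration and is not part of the commit.

```bash
#!/bin/bash
# Sketch only: parse_demo is a hypothetical helper, not part of the commit.
# Assumes util-linux (GNU) getopt; other getopt implementations may behave differently.
parse_demo() {
    local parsed ntasks=1 cpus=1
    parsed=$(getopt -o n:c: --long ntasks:,cpus_per_task: -- "$@") || return 1
    eval set -- "$parsed"
    while true; do
        case "$1" in
            -n | --ntasks) ntasks="$2"; shift 2 ;;
            -c | --cpus_per_task) cpus="$2"; shift 2 ;;
            --) shift; break ;;   # End of options
            *) return 1 ;;        # Guard against looping on unexpected input
        esac
    done
    # After the options are consumed, $1 is the first positional argument
    echo "experiment=$1 ntasks=$ntasks cpus=$cpus"
}

options_first=$(parse_demo -n 10 -c 8 sphere_held_suarez_rhoe_equilmoist)
options_last=$(parse_demo sphere_held_suarez_rhoe_equilmoist -n 10 -c 8)
echo "$options_first"
echo "$options_last"
```

Both calls report the same experiment id, task count, and CPU count.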
1 change: 1 addition & 0 deletions experiments/update.sbatch → slurm/update.sbatch
@@ -19,3 +19,4 @@ julia --color=no --project=experiments/$experiment_id -e '
JLD2.save_object(joinpath(iter_path, "observation_map.jld2"), G_ensemble)
CalibrateAtmos.update_ensemble(experiment_id, i)
'
echo "Update step for iteration $i complete"
