Add PBS controller, Derecho backend
nefrathenrici committed Jul 25, 2024
1 parent 25dab9d commit db5cb81
Showing 16 changed files with 699 additions and 241 deletions.
2 changes: 1 addition & 1 deletion .buildkite/clima_server_test/pipeline.yml
@@ -21,7 +21,7 @@ steps:

- wait
- label: "SurfaceFluxes perfect model calibration"
command: julia --project=experiments/surface_fluxes_perfect_model test/slurm_backend_e2e.jl
command: julia --project=experiments/surface_fluxes_perfect_model test/hpc_backend_e2e.jl
artifact_paths: output/surface_fluxes_perfect_model/*

- label: "Slurm job controller unit tests"
2 changes: 1 addition & 1 deletion .buildkite/pipeline.yml
@@ -17,7 +17,7 @@ steps:

- wait
- label: "SurfaceFluxes perfect model calibration"
command: julia --project=experiments/surface_fluxes_perfect_model test/slurm_backend_e2e.jl
command: julia --project=experiments/surface_fluxes_perfect_model test/hpc_backend_e2e.jl
artifact_paths: output/surface_fluxes_perfect_model/*

- label: "Slurm job controller unit tests"
2 changes: 1 addition & 1 deletion Project.toml
@@ -1,7 +1,7 @@
name = "ClimaCalibrate"
uuid = "4347a170-ebd6-470c-89d3-5c705c0cacc2"
authors = ["Climate Modeling Alliance"]
version = "0.0.1"
version = "0.0.2"

[deps]
Distributions = "31c24e10-a181-5473-b8eb-7969acd0382f"
16 changes: 5 additions & 11 deletions README.md
@@ -9,27 +9,21 @@
calibration pipelines using CalibrateEmulateSample.jl with minimal boilerplate.</strong>
</p>

[![docsbuild][docs-bld-img]][docs-bld-url]
[![dev][docs-dev-img]][docs-dev-url]
[![ghaci][gha-ci-img]][gha-ci-url]
[![codecov][codecov-img]][codecov-url]

[docs-bld-img]: https://github.com/CliMA/ClimaCalibrate.jl/workflows/Documentation/badge.svg
[docs-bld-url]: https://github.com/CliMA/ClimaCalibrate.jl/actions?query=workflow%3ADocumentation

[docs-dev-img]: https://img.shields.io/badge/docs-dev-blue.svg
[docs-dev-url]: https://CliMA.github.io/ClimaCalibrate.jl/dev/

[gha-ci-img]: https://github.com/CliMA/ClimaCalibrate.jl/actions/workflows/ci.yml/badge.svg
[gha-ci-url]: https://github.com/CliMA/ClimaCalibrate.jl/actions/workflows/ci.yml

[codecov-img]: https://codecov.io/gh/CliMA/ClimaCalibrate.jl/branch/main/graph/badge.svg
[codecov-url]: https://codecov.io/gh/CliMA/ClimaCalibrate.jl

The recommended Julia version is: Stable release v1.10.0
The recommended Julia version is: Stable release v1.10.4

This pipeline currently runs on the Resnick High Performance Computing Center.
We strive to support flexible and clearly documented calibration experiments.
Currently supported backends:
- [Resnick High Performance Computing Center](https://www.hpc.caltech.edu/)
- [NSF NCAR Supercomputer Derecho](https://ncar-hpc-docs.readthedocs.io/en/latest/compute-systems/derecho/)
- CliMA's private GPU server

## Contributing

1 change: 0 additions & 1 deletion docs/make.jl
@@ -26,7 +26,6 @@ makedocs(
"Getting Started" => "quickstart.md",
"ClimaAtmos Setup Guide" => "atmos_setup_guide.md",
"Emulate and Sample" => "emulate_sample.md",
"Precompilation" => "precompilation.md",
"API" => "api.md",
],
)
19 changes: 18 additions & 1 deletion docs/src/api.md
@@ -13,7 +13,24 @@ ClimaCalibrate.observation_map
```@docs
ClimaCalibrate.get_backend
ClimaCalibrate.calibrate
ClimaCalibrate.sbatch_model_run
ClimaCalibrate.model_run
ClimaCalibrate.module_load_string
```

## Job Scheduler
```@docs
ClimaCalibrate.wait_for_jobs
ClimaCalibrate.log_member_error
ClimaCalibrate.kill_job
ClimaCalibrate.job_status
ClimaCalibrate.kwargs
ClimaCalibrate.slurm_model_run
ClimaCalibrate.generate_sbatch_script
ClimaCalibrate.generate_sbatch_directives
ClimaCalibrate.submit_slurm_job
ClimaCalibrate.pbs_model_run
ClimaCalibrate.generate_pbs_script
ClimaCalibrate.submit_pbs_job
```

## EnsembleKalmanProcesses Interface
5 changes: 2 additions & 3 deletions docs/src/index.md
@@ -3,8 +3,7 @@
ClimaCalibrate.jl is a toolkit for developing scalable and reproducible model
calibration pipelines using CalibrateEmulateSample.jl with minimal boilerplate.

To use this framework, component models (and the coupler) define their own versions of the functions provided in the interface (`get_config`, `get_forward_model`, and `run_forward_model`).

Calibrations can either be run using pure Julia, the Caltech central cluster, or CliMA's GPU server.
To use this framework, component models (and the coupler) define their own versions of the functions provided in the interface.
Calibrations can either be run using just Julia, the Caltech central cluster, NCAR Derecho, or CliMA's GPU server.

For more information, see our Getting Started page.
1 change: 1 addition & 0 deletions src/ClimaCalibrate.jl
@@ -3,6 +3,7 @@ module ClimaCalibrate
include("ekp_interface.jl")
include("model_interface.jl")
include("slurm.jl")
include("pbs.jl")
include("backends.jl")
include("emulate_sample.jl")

148 changes: 98 additions & 50 deletions src/backends.jl
@@ -1,12 +1,17 @@
export get_backend, calibrate
export get_backend, calibrate, model_run

abstract type AbstractBackend end

struct JuliaBackend <: AbstractBackend end
abstract type SlurmBackend <: AbstractBackend end

abstract type HPCBackend <: AbstractBackend end
abstract type SlurmBackend <: HPCBackend end

struct CaltechHPCBackend <: SlurmBackend end
struct ClimaGPUBackend <: SlurmBackend end

struct DerechoBackend <: HPCBackend end

"""
get_backend()
@@ -18,6 +23,8 @@ function get_backend()
(r"^clima.gps.caltech.edu$", ClimaGPUBackend),
(r"^login[1-4].cm.cluster$", CaltechHPCBackend),
(r"^hpc-(\d\d)-(\d\d).cm.cluster$", CaltechHPCBackend),
(r"derecho([1-8])$", DerechoBackend),
(r"dec(\d\d\d\d)$", DerechoBackend), # This should be more specific

]

for (pattern, backend) in HOSTNAMES
@@ -28,12 +35,12 @@ function get_backend()
end

"""
module_load_string(T) where {T<:Type{SlurmBackend}}
module_load_string(backend)
Return a string that loads the correct modules for a given backend when executed via bash.
"""
function module_load_string(::Type{CaltechHPCBackend})
return """export MODULEPATH=/groups/esm/modules:\$MODULEPATH
return """export MODULEPATH="/groups/esm/modules:\$MODULEPATH"

module purge
module load climacommon/2024_05_27"""
end
@@ -43,32 +50,14 @@ function module_load_string(::Type{ClimaGPUBackend})
module load julia/1.10.0 cuda/julia-pref openmpi/4.1.5-mpitrampoline"""
end

"""
calibrate(::Type{JuliaBackend}, config::ExperimentConfig)
calibrate(::Type{JuliaBackend}, experiment_dir::AbstractString)
Run a calibration in Julia.
Takes an ExperimentConfig or an experiment folder.
If no backend is passed, one is chosen via `get_backend`.
This function is intended for use in a larger workflow, assuming that all needed
model interface and observation map functions are set up for the calibration.
# Example
Run: `julia --project=experiments/surface_fluxes_perfect_model`
```julia
import ClimaCalibrate
# Generate observational data and load interface
experiment_dir = dirname(Base.active_project())
include(joinpath(experiment_dir, "generate_data.jl"))
include(joinpath(experiment_dir, "observation_map.jl"))
include(joinpath(experiment_dir, "model_interface.jl"))
function module_load_string(::Type{DerechoBackend})
return """export MODULEPATH="/glade/campaign/univ/ucit0011/ClimaModules-Derecho:\$MODULEPATH"

module purge
module load climacommon
module list
"""
end

# Initialize and run the calibration
eki = ClimaCalibrate.calibrate(experiment_dir)
```
"""
calibrate(config::ExperimentConfig; ekp_kwargs...) =
calibrate(get_backend(), config; ekp_kwargs...)

@@ -86,9 +75,8 @@ function calibrate(
config::ExperimentConfig;
ekp_kwargs...,
)
initialize(config; ekp_kwargs...)
(; n_iterations, ensemble_size) = config
eki = nothing
eki = initialize(config; ekp_kwargs...)
for i in 0:(n_iterations - 1)
@info "Running iteration $i"
for m in 1:ensemble_size
@@ -103,75 +91,80 @@ function calibrate(
end

"""
calibrate(::Type{SlurmBackend}, config::ExperimentConfig; kwargs...)
calibrate(::Type{SlurmBackend}, experiment_dir; kwargs...)
calibrate(::Type{AbstractBackend}, config::ExperimentConfig; kwargs...)
calibrate(::Type{AbstractBackend}, experiment_dir; kwargs...)
Run a full calibration, scheduling the forward model runs on Caltech's HPC cluster.
Takes either an ExperimentConfig or an experiment folder.
Available Backends: CaltechHPCBackend, ClimaGPUBackend, DerechoBackend, JuliaBackend
# Keyword Arguments
- `experiment_dir`: Directory containing experiment configurations.
- `model_interface`: Path to the model interface file.
- `slurm_kwargs`: Dictionary of slurm arguments, passed through to `sbatch`.
- `verbose::Bool`: Enable verbose output for debugging.
- `hpc_kwargs`: Dictionary of resource arguments, passed to the job scheduler.
- `verbose::Bool`: Enable verbose logging.
# Usage
Open julia: `julia --project=experiments/surface_fluxes_perfect_model`
```julia
import ClimaCalibrate: CaltechHPCBackend, calibrate
using ClimaCalibrate
experiment_dir = dirname(Base.active_project())
experiment_dir = joinpath(pkgdir(ClimaCalibrate), "experiments", "surface_fluxes_perfect_model")
model_interface = joinpath(experiment_dir, "model_interface.jl")
# Generate observational data and load interface
include(joinpath(experiment_dir, "generate_data.jl"))
include(joinpath(experiment_dir, "observation_map.jl"))
include(model_interface)
slurm_kwargs = kwargs(time = 3)
eki = calibrate(CaltechHPCBackend, experiment_dir; model_interface, slurm_kwargs);
hpc_kwargs = kwargs(time = 3)
backend = get_backend()
eki = calibrate(backend, experiment_dir; model_interface, hpc_kwargs);
```
"""
function calibrate(
b::Type{<:SlurmBackend},
b::Type{<:HPCBackend},
experiment_dir::AbstractString;
slurm_kwargs,
hpc_kwargs,
ekp_kwargs...,
)
calibrate(b, ExperimentConfig(experiment_dir); slurm_kwargs, ekp_kwargs...)
calibrate(b, ExperimentConfig(experiment_dir); hpc_kwargs, ekp_kwargs...)

end

function calibrate(
b::Type{<:SlurmBackend},
b::Type{<:HPCBackend},
config::ExperimentConfig;
experiment_dir = dirname(Base.active_project()),
model_interface = abspath(
joinpath(experiment_dir, "..", "..", "model_interface.jl"),
),
verbose = false,
slurm_kwargs = Dict(:time_limit => 45, :ntasks => 1),
reruns = 1,
hpc_kwargs,
ekp_kwargs...,
)
# ExperimentConfig is created from a YAML file within the experiment_dir
(; n_iterations, output_dir, ensemble_size) = config
@info "Initializing calibration" n_iterations ensemble_size output_dir
initialize(config; ekp_kwargs...)

eki = nothing
eki = initialize(config; ekp_kwargs...)

module_load_str = module_load_string(b)
for iter in 0:(n_iterations - 1)
@info "Iteration $iter"
jobids = map(1:ensemble_size) do member
@info "Running ensemble member $member"
sbatch_model_run(
model_run(

b,
iter,
member,
output_dir,
experiment_dir,
model_interface,
module_load_str;
slurm_kwargs,
hpc_kwargs,
)
end

@@ -182,14 +175,69 @@ function calibrate(
experiment_dir,
model_interface,
module_load_str;
slurm_kwargs,
hpc_kwargs,
verbose,
reruns,
)
report_iteration_status(statuses, output_dir, iter)
@info "Completed iteration $iter, updating ensemble"
G_ensemble = observation_map(iter)
save_G_ensemble(config, iter, G_ensemble)
eki = update_ensemble(config, iter)
end
return eki
end

# Dispatch on backend type to unify `calibrate` for all HPCBackends
# Scheduler interfaces should not depend on backend struct
"""
model_run(backend, iter, member, output_dir, experiment_dir, model_interface, module_load_str; hpc_kwargs)
Construct and execute a command to run a single forward model on a given job scheduler.
Dispatches on `backend` to run [`slurm_model_run`](@ref) or [`pbs_model_run`](@ref).
Arguments:
- iter: Iteration number
- member: Member number
- output_dir: Calibration experiment output directory
- experiment_dir: Directory containing the experiment's Project.toml
- model_interface: File containing the model interface
- module_load_str: Commands which load the necessary modules
- hpc_kwargs: Dictionary containing the resources for the job. Easily generated using [`kwargs`](@ref).
"""
model_run(

b::Type{<:SlurmBackend},
iter,
member,
output_dir,
experiment_dir,
model_interface,
module_load_str;
hpc_kwargs,
) = slurm_model_run(
iter,
member,
output_dir,
experiment_dir,
model_interface,
module_load_str;
hpc_kwargs,
)
model_run(

b::Type{DerechoBackend},
iter,
member,
output_dir,
experiment_dir,
model_interface,
module_load_str;
hpc_kwargs,
) = pbs_model_run(
iter,
member,
output_dir,
experiment_dir,
model_interface,
module_load_str;
hpc_kwargs,
)
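
The `model_run` methods above rely on dispatching on the backend type rather than on an instance. A stripped-down sketch of that pattern follows. The two scheduler functions here are hypothetical stubs with shortened argument lists: they only report which scheduler was selected and submit nothing.

```julia
# Backend type hierarchy as introduced in this commit
abstract type AbstractBackend end
struct JuliaBackend <: AbstractBackend end
abstract type HPCBackend <: AbstractBackend end
abstract type SlurmBackend <: HPCBackend end
struct CaltechHPCBackend <: SlurmBackend end
struct ClimaGPUBackend <: SlurmBackend end
struct DerechoBackend <: HPCBackend end

# Hypothetical stubs standing in for the real job-submission functions
slurm_model_run(iter, member; hpc_kwargs) = (:slurm, iter, member, hpc_kwargs)
pbs_model_run(iter, member; hpc_kwargs) = (:pbs, iter, member, hpc_kwargs)

# Any subtype of SlurmBackend goes to the Slurm controller...
model_run(::Type{<:SlurmBackend}, iter, member; hpc_kwargs) =
    slurm_model_run(iter, member; hpc_kwargs)

# ...while Derecho, which is an HPCBackend but not a SlurmBackend, goes to PBS
model_run(::Type{DerechoBackend}, iter, member; hpc_kwargs) =
    pbs_model_run(iter, member; hpc_kwargs)

resources = Dict(:time => 3)
for backend in (CaltechHPCBackend, ClimaGPUBackend, DerechoBackend)
    scheduler = first(model_run(backend, 0, 1; hpc_kwargs = resources))
    println(backend, " => ", scheduler)
end
```

Because no method is defined for `Type{JuliaBackend}`, calling `model_run(JuliaBackend, ...)` in this sketch raises a `MethodError`. That mirrors the diff, where only HPC backends submit jobs through a scheduler.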
6 changes: 3 additions & 3 deletions src/ekp_interface.jl
@@ -171,10 +171,10 @@ function env_model_interface(env = ENV)
return string(env[key])
end

function env_iter_number(env = ENV)
key = "CALIBRATION_ITER_NUMBER"
function env_iteration(env = ENV)
key = "CALIBRATION_ITERATION"
haskey(env, key) || error(
"Iteration number not found in environment. Ensure that env variable \"CALIBRATION_ITER_NUMBER\" is set.",
"Iteration number not found in environment. Ensure that env variable \"CALIBRATION_ITERATION\" is set.",
)
return parse(Int, env[key])
end
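
A short usage sketch for the renamed function. The function body is copied from the diff above; passing a plain `Dict` in place of `ENV` is an illustration choice that works because the function only calls `haskey` and indexes by key.

```julia
# Copied from the diff: read the iteration number from an environment-like mapping
function env_iteration(env = ENV)
    key = "CALIBRATION_ITERATION"
    haskey(env, key) || error(
        "Iteration number not found in environment. Ensure that env variable \"CALIBRATION_ITERATION\" is set.",
    )
    return parse(Int, env[key])
end

# Simulate what a job script would export before launching the forward model
fake_env = Dict("CALIBRATION_ITERATION" => "3")
iteration = env_iteration(fake_env)
@info "Parsed iteration" iteration

# A missing variable raises an error rather than silently defaulting
try
    env_iteration(Dict{String, String}())
catch err
    @warn "Expected failure" err
end
```

Job scripts written for the previous version would still export `CALIBRATION_ITER_NUMBER`, so they would hit the error branch until they are updated to the new name.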