Kestrel #405
Conversation
@rHorsey Here's my "working" Kestrel implementation. A couple notes:
pre-commit and black
I'm using pre-commit and black for auto formatting now. Recommend doing a
pre-commit install
after pip installing. If you don't, it's not the end of the world, CI will run black and fix things for you.
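For reference, a minimal `.pre-commit-config.yaml` that runs black on commit might look like this (the `rev` pin is illustrative, not necessarily what the repo uses):

```yaml
repos:
  - repo: https://github.com/psf/black
    rev: 23.7.0  # illustrative pin; match whatever the repo actually specifies
    hooks:
      - id: black
```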
Avoid /shared-projects
for now
They're still copying stuff over from Eagle and the permissions are all messed up. The top of my testing project file looks like
```yaml
schema_version: '0.3'
os_version: 3.6.1
os_sha: bb9481519e
buildstock_directory: ../  # Relative to this file or absolute
project_directory: project_national  # Relative to buildstock_directory
output_directory: /scratch/nmerket/national_baseline2
# weather_files_url: https://data.nrel.gov/system/files/156/BuildStock_TMY3_FIPS.zip
weather_files_path: /scratch/nmerket/weather/BuildStock_TMY3_FIPS.zip
sys_image_dir: /scratch/nmerket/images
```
You'll need to copy those files to /scratch or /projects as necessary. Also, create the environment on Kestrel not in /shared-projects (the default) at this point. I have instructions for that below.
Renamed this file from `eagle.py` ➡️ `hpc.py`.
```diff
@@ -54,11 +54,12 @@ def get_bool_env_var(varname):
     return os.environ.get(varname, "0").lower() in ("true", "t", "1", "y", "yes")


 class EagleBatch(BuildStockBatchBase):
```
Most of the code is the same between the Eagle and Kestrel implementations, so I separated it out into a base class I'm calling `SlurmBatch` that `EagleBatch` and `KestrelBatch` both inherit from.
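The class names here come from the PR; the method bodies below are illustrative placeholders, not the actual buildstockbatch code, just to sketch how shared SLURM logic lives in the base class while cluster-specific constants live in the subclasses:

```python
# Sketch of the class layout; bodies are placeholders, not real code.
class BuildStockBatchBase:
    pass


class SlurmBatch(BuildStockBatchBase):
    """Shared SLURM logic (job submission, walltime math, etc.)."""

    CORES_PER_NODE = None  # overridden by each cluster's subclass

    def describe_node(self):
        # Hypothetical helper showing the constant being used from the base class.
        return f"{self.CORES_PER_NODE}-core node"


class EagleBatch(SlurmBatch):
    CORES_PER_NODE = 36  # value from the removed hardcoded line in this PR


class KestrelBatch(SlurmBatch):
    CORES_PER_NODE = 104  # illustrative; check Kestrel's actual node spec
```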
```diff
-        cores_per_node = 36
-        minutes_per_sim = eagle_cfg["minutes_per_sim"]
-        walltime = math.ceil(math.ceil(n_sims_per_job / cores_per_node) * minutes_per_sim)
+        minutes_per_sim = hpc_cfg["minutes_per_sim"]
+        walltime = math.ceil(math.ceil(n_sims_per_job / self.CORES_PER_NODE) * minutes_per_sim)
```
Rather than hardcoding the number of cores and the like, it's a constant on each of the subclasses.
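The walltime formula from the diff above can be exercised on its own. The function name and sample numbers here are mine, for illustration; the formula itself is from the PR:

```python
import math


def walltime_minutes(n_sims_per_job, cores_per_node, minutes_per_sim):
    # Simulations run cores_per_node at a time, so the number of sequential
    # "waves" is ceil(n_sims / cores); each wave takes minutes_per_sim minutes.
    return math.ceil(math.ceil(n_sims_per_job / cores_per_node) * minutes_per_sim)


# e.g. 100 sims on a 36-core Eagle node at 3 minutes each:
# ceil(100 / 36) = 3 waves -> 9 minutes of walltime
```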
```diff
@@ -677,6 +716,68 @@ def rerun_failed_jobs(self, hipri=False):
         self.queue_post_processing(job_ids, hipri=hipri)


+class EagleBatch(SlurmBatch):
```
This is the Eagle-specific implementation. It's mostly constants, defaults, and validation that only apply to Eagle.
buildstockbatch/hpc.py

```python
class KestrelBatch(SlurmBatch):
    DEFAULT_SYS_IMAGE_DIR = "/kfs2/shared-projects/buildstock/singularity_images"
```
They're supposed to have this fixed by next week, but they're not ready with `/shared-projects` on Kestrel yet. They're still copying stuff over from Eagle and the permissions are all messed up. I'd recommend doing your testing from `/scratch` or `/projects` for now.
```shell
pdsh -w $SLURM_JOB_NODELIST_PACK_GROUP_1 "df -i; df -h"

$MY_PYTHON_ENV/bin/dask scheduler --scheduler-file $SCHEDULER_FILE &> $OUT_DIR/dask_scheduler.out &
pdsh -w $SLURM_JOB_NODELIST_PACK_GROUP_1 "$MY_PYTHON_ENV/bin/dask worker --scheduler-file $SCHEDULER_FILE --local-directory /tmp/scratch/dask --nworkers ${NPROCS} --nthreads 1 --memory-limit ${MEMORY}MB" &> $OUT_DIR/dask_workers.out &
```
This wasn't working for me when the Python environment was on `/kfs2/shared-projects/envs`. There are some permissions issues there; the groups weren't being passed down to the compute nodes or something. Supposedly they're working on it. I recommend creating your virtualenv on `/scratch` or `/projects` for testing.
```diff
@@ -6,6 +6,7 @@ weather_files_url: str(required=False)
 sampler: include('sampler-spec', required=True)
 workflow_generator: include('workflow-generator-spec', required=True)
 eagle: include('hpc-spec', required=False)
+kestrel: include('hpc-spec', required=False)
```
Just add a `kestrel` key, like the `eagle` one you already have, to your project file. Adjust the number of jobs, file locations, and so on. It's all the same structure and format, though.
```shell
module load python apptainer
source "$MY_PYTHON_ENV/bin/activate"
```
You'll notice I abandoned conda as our Python package and environment manager. There was too much trouble between it and pip when installing buildstockbatch. I opted to go with the system-installed Python (3.11) and use a venv.
I'm trying to figure out whether we actually still use Ruby natively outside of the container, but it looks like we don't...
create_kestrel_env.sh
To install, it defaults to `/shared-projects`, so you'll want to override that. Also, we're using Python venv now for environments instead of conda, so activating is a little different.
```shell
module load git  # yes, really
git clone git@github.com:NREL/buildstockbatch.git
cd buildstockbatch
git checkout kestrel
mkdir -p /scratch/$USER/envs
./create_kestrel_env.sh -e /scratch/$USER/envs -d mybsb
source /scratch/$USER/envs/mybsb/bin/activate
buildstock_kestrel path/to/project_file.yml
```
```diff
         "buildstock_eagle=buildstockbatch.hpc:eagle_cli",
+        "buildstock_kestrel=buildstockbatch.hpc:kestrel_cli",
```
Adding a separate `buildstock_kestrel` CLI.
It just occurred to me that a venv created by one user might not be usable by another user (which was possible with conda). We should check that and ensure it works, and if not, switch back to conda.
This would be a nice-to-have feature: Issue 171
I successfully ran a test 100-datapoint project (`timeseries_frequency=none`). Annual results were postprocessed and uploaded as expected.
Fixes #313
A start at getting it to work on Kestrel.
Checklist
Not all may apply
- Update `minimum_coverage` in `.github/workflows/ci.yml` as necessary.
- Rename `singularity` to `apptainer`.