add GPU DYAMOND runs #659
Is there a way we could request an H100 for this job only? I don't think the allocation enhancements will be addressed anytime soon. If that's possible, I would suggest commenting out the regular CI job for now, but retaining the longrun one, which we only run once a week on Sundays.
ClimaAtmos has a separate buildkite pipeline that runs target GPU simulations on clima (see the runs and the pipeline.yml itself). I can implement the same thing for us.
Does this allow us to specify the hardware for just one run, though?
No, it would be a separate pipeline where this job would be run. I think this will be useful for GPU scaling runs too.
LGTM, thank you, @juliasloan25. Just had a question about the sim length.
monthly_checkpoint: false
run_name: "gpu_dyamond_target"
start_date: "19790301"
t_end: "1days"
If it's a long run, could we run it for longer (e.g. 50 days) or does the simulation crash? 👀
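To make the difference between the two simulation lengths concrete, here is a minimal Python sketch that converts duration strings such as `t_end: "1days"` into seconds. The function name `duration_to_seconds` is hypothetical, and the set of unit suffixes other than `days` is an assumption; only `"1days"` appears in the config above, so this may not match how the coupler actually parses the value.

```python
import re

# Assumed unit suffixes; only "days" is confirmed by the config in this PR.
SECONDS_PER_UNIT = {"secs": 1, "mins": 60, "hours": 3600, "days": 86400}


def duration_to_seconds(text):
    """Convert a string like "1days" or "50days" to a number of seconds."""
    match = re.fullmatch(r"(\d+)(secs|mins|hours|days)", text.strip())
    if match is None:
        raise ValueError(f"unrecognized duration: {text!r}")
    value, unit = match.groups()
    return int(value) * SECONDS_PER_UNIT[unit]


print(duration_to_seconds("1days"))   # 86400
print(duration_to_seconds("50days"))  # 4320000
```

Under these assumptions, moving from `"1days"` to `"50days"` means simulating 50 times as many seconds, so the job's wall time would grow roughly in proportion if the time step is unchanged.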
Purpose
closes #658
Only adding a longrun, no shortrun.
This run exceeds the memory available on P100s. Caltech's V100s come in 16 GB and 32 GB options, neither of which is large enough for this job, according to https://www.hpc.caltech.edu/resources. Instead of running on central like the rest of the longruns, this job will run on clima (which has A100s with 80 GB of memory). I've opened an issue to address the allocations seen in this run: #683
View the run on buildkite here: https://buildkite.com/clima/climacoupler-longruns/builds/480#_
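The hardware reasoning above can be sketched as a simple memory filter. This is an illustration only: the V100 and A100 sizes are the ones quoted in this description, while `REQUIRED_GB` and the helper `gpus_that_fit` are hypothetical, since the actual memory footprint of the run is not stated here beyond exceeding 32 GB.

```python
# GPU memory in GB, as quoted in the description above.
GPU_MEMORY_GB = {
    "V100-16GB": 16,
    "V100-32GB": 32,
    "A100-80GB": 80,
}


def gpus_that_fit(required_gb, available=GPU_MEMORY_GB):
    """Return names of GPUs with at least required_gb of memory, smallest first."""
    fits = [(mem, name) for name, mem in available.items() if mem >= required_gb]
    return [name for mem, name in sorted(fits)]


# Hypothetical footprint: any value above 32 GB and at most 80 GB gives the same answer.
REQUIRED_GB = 40
print(gpus_that_fit(REQUIRED_GB))  # ['A100-80GB']
```

Once issue #683 reduces the allocations, the same check with a smaller footprint would show whether the job could move back to the V100s on central.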
Content
config/longrun_configs/dyamond_target.yml
for longrunanim: false
for gpu-compatibility