add coupler summary table #706
Conversation
In some runs, atmos outputs an SYPD that's much larger than the SYPD calculated here by timing the coupling loop. For example, in the "GPU slabplanet: albedo from function" run here, atmos prints an SYPD around 17, while we calculate it to be 8.15. I'm not sure where this difference is coming from, since I'm using
Thanks so much for implementing this, @juliasloan25 ! A super useful PR! 🎉 I just had a few comments / suggestions.
test/component_model_tests/climaatmos_standalone/atmos_driver.jl
I'm also a little surprised by the SYPD for all runs reported on Slack. For example, this build finished running 0.25 sim years in ~0.25 wallclock days (including initialization), so SYPD should be around 1. I think the config should be the same, but the Slack results imply the benchmark runs are 100x slower. 🤔
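For reference, the back-of-envelope SYPD (simulated years per day) estimate above can be written out as a quick check; the numbers below are taken from the comment, and the variable names are illustrative rather than from the coupler's code:

```julia
# Hedged sketch of the SYPD arithmetic referenced above.
sim_years = 0.25        # simulated span of the build
wallclock_days = 0.25   # wallclock time spent, including initialization
sypd = sim_years / wallclock_days
println(sypd)           # ≈ 1, matching the estimate in the comment
```

An SYPD near 1 here versus benchmark numbers 100x smaller is what makes the Slack results look suspicious.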
After speaking with Charlie, it sounds like this might be because we're using
Looks good, thank you! 🚀 I have a few comments, and the parsed args need a fix. Then, presuming all CI runs pass without changes from main, I think we can merge this.
experiments/AMIP/coupler_driver.jl
```julia
## run the coupled simulation
solve_coupler!(cs);
## run the coupled simulation for one timestep to precompile everything before timing
cs.tspan[2] = Δt_cpl
```
I presume `Base.precompile(solve_coupler!, (typeof(cs),))` doesn't work here?
I haven't tried it! I can give that a go
It works :)
Although I've just noticed that the latest SYPD output still reflects the older estimate (including the precompilation), so we may need to revert to the previous version.
I looked into this some more and chatted with Charlie (see conversation on Slack here), and it sounds like `precompile` only works if it can infer all the types, and even then it may not actually compile all the code. I opened an issue for us to fix the type inference and switch to using `precompile` (#775), but for now I'll keep running for 2 timesteps before running the full simulation.
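The warm-up approach settled on above (run a short span first so everything compiles, then time the real run) can be sketched as follows. `Sim`, the body of `solve_coupler!`, and the numbers here are hypothetical stand-ins, not the coupler's actual types or timestep:

```julia
# Minimal illustration of warming up before timing.
struct Sim
    tspan::Vector{Float64}   # [t_start, t_end]
end

# Stand-in for the coupled time loop; does some arbitrary work.
solve_coupler!(cs::Sim) = sum(abs2, rand(10^6))

cs = Sim([0.0, 86400.0])
Δt_cpl = 400.0

# Option 1: explicit precompilation — only effective if all types are inferable.
Base.precompile(solve_coupler!, (typeof(cs),))

# Option 2: shrink the span to one coupling step and run it, so the first
# (compiling) call is excluded from the timed measurement.
cs.tspan[2] = Δt_cpl
solve_coupler!(cs)

# Restore the full span and time the real run.
cs.tspan[2] = 86400.0
elapsed = @elapsed solve_coupler!(cs)
```

Option 2 is the more robust of the two when type inference fails, which is why the PR keeps the short warm-up run for now.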
I'm approving this since the table works, but please do check the discrepancy between the atmos and coupler allocations, and confirm we're not seeing any behavioral changes on Buildkite once CI is fixed. Thanks for the great work, @juliasloan25 ! 🚀
I fixed CI and all the plots look the same as main 😄

I also changed the functions we're using to calculate allocations. Now there's one metric that's the same between CPU and GPU runs: the maximum CPU memory used over the course of the simulation (max RSS). The numbers don't look exactly like what I would expect (e.g. there are more allocations for atmos-only than for the coupled run), but I think this is something we can easily change later on if we decide it isn't a reliable metric, and I don't want it to hold up this whole PR. I'll see if I can figure out what's going on with the allocation measurement tomorrow, but if nothing comes up I'm inclined to merge this as-is and investigate later on as we trigger more runs.

Update: I added a
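For reference, a minimal sketch of reading max RSS in Julia; `Sys.maxrss` is the standard-library API for the process's peak resident set size, though the exact call used in this PR isn't shown in the thread:

```julia
# Hedged sketch: peak resident set size is a host-memory metric,
# so it is comparable between CPU and GPU runs.
rss_gb = Sys.maxrss() / 1024^3
println("max RSS: $(round(rss_gb, digits = 2)) GB")
```

Because max RSS is measured on the host, it captures CPU-side allocations even in GPU runs, which may help explain why the numbers differ from what GPU-side tools report.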
Purpose
Closes #705

Status 4/23:
View passing build in coupler benchmarks pipeline here
View output to Slack here
To do
To do in separate PR(s)
Note: I was using `CUDA.memory_status` to get the GPU memory usage, but it varied wildly between runs (3–30 GB). I'm not sure why, but it doesn't seem to be a reliable metric, so I won't be using it going forward. Here is the code block calling that function, in case we want to return to it in the future:
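The original block isn't preserved in this thread, but a minimal hypothetical sketch of such a call might look like the following; `CUDA.memory_status` is a real CUDA.jl function that prints the device's free/used memory:

```julia
# Hypothetical reconstruction, not the original block from the PR.
using CUDA

if CUDA.functional()
    CUDA.memory_status()  # prints free/total device memory to stdout
end
```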