investigate GPU conservation #607
Comments
These might be relevant builds (ClimaAtmos standalone test cases that compare CPU vs. GPU runs).
Standalone moist Held-Suarez atmos runs differ too: CliMA/ClimaAtmos.jl#2876
Thanks to #737, we can confirm that our GPU longruns run as well as CPU longruns (build). The conservation logging issue (20% error) was addressed in #735. The GPU runs are systematically less conservative than the CPU runs (discussion in #735), but the difference is quite small (~1%), so we can close this issue and revisit it in the future (as part of #594), once we can track energy sinks and sources to a precision of sqrt(eps). That requires work in ClimaAtmos (see CliMA/ClimaAtmos.jl#2658 and CliMA/ClimaAtmos.jl#2568). @juliasloan25, would you agree?
completed
Our GPU runs have higher water and energy conservation errors than CPU runs with identical setups. The CPU error tends to be around 1e-5, while the GPU error is around 1e-3 or 1e-4. We should look into why conservation is worse on GPU, and try to improve it.
For now (as of #589), we've set the GPU runs to soft fail if the conservation error is larger than 1e-3, but we want them to be able to pass this threshold (and ideally an even smaller one).
Part of #390
PR #614
Partially done in #735
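For illustration, the soft-fail check described in this issue could look roughly like the sketch below. This is a minimal Python sketch, not the ClimaCoupler implementation. The function names `relative_conservation_error` and `check_conservation` are hypothetical, and the error definition |total(t) - total(0)| / |total(0)| is an assumption. Only the 1e-3 threshold comes from this issue.

```python
import warnings

SOFT_FAIL_THRESHOLD = 1e-3  # threshold quoted in this issue (set in #589)


def relative_conservation_error(totals):
    """Relative drift of a conserved total with respect to its initial value.

    `totals` is a time series of a globally integrated quantity
    (e.g. total energy or water). This definition is an assumption,
    not necessarily the one used to produce the values in this issue.
    """
    if not totals:
        raise ValueError("totals must not be empty")
    initial = totals[0]
    if initial == 0:
        raise ValueError("initial total must be nonzero")
    return [abs(t - initial) / abs(initial) for t in totals]


def check_conservation(totals, threshold=SOFT_FAIL_THRESHOLD):
    """Soft fail: warn and return False if the final error exceeds threshold."""
    final_error = relative_conservation_error(totals)[-1]
    if final_error > threshold:
        warnings.warn(
            f"conservation error {final_error:.3e} exceeds {threshold:.0e}"
        )
        return False
    return True
```

A soft fail here means the run finishes and reports the violation instead of raising, so CPU and GPU results can still be compared afterwards.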
Approaches to try
Bucket info
ClimaLand has standalone global bucket runs on CPU and GPU that we compare. For all 3 albedo options, comparing the mean values of the states from CPU and GPU runs gives a difference on the order of 1e-15. For the temporal map albedo case, which runs for 50 days, we get:
The functional and static map albedo cases show similar discrepancies, and run for 7 days each. These differences are much smaller than what we see in coupled runs, so the difference is probably not coming from the bucket model.
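The CPU/GPU comparison of mean state values described above can be sketched as follows. This is a hedged Python illustration only: `mean_state_difference` is a hypothetical helper, not a ClimaLand function, and the source does not say whether the reported 1e-15 difference is absolute or relative. The sketch assumes a relative difference.

```python
from statistics import fmean


def mean_state_difference(cpu_state, gpu_state):
    """Relative difference between the means of two state arrays.

    `cpu_state` and `gpu_state` are flat sequences of floats holding the
    same state variable from a CPU run and a GPU run (an assumption about
    how the data is laid out).
    """
    cpu_mean = fmean(cpu_state)
    gpu_mean = fmean(gpu_state)
    scale = max(abs(cpu_mean), abs(gpu_mean))
    if scale == 0.0:
        # Both means are exactly zero, so there is no difference to report.
        return 0.0
    return abs(cpu_mean - gpu_mean) / scale
```

A result near machine precision for double-precision floats (about 2.2e-16) would indicate that the two runs agree up to round-off in that state variable.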
example RSE values seen in #589 (for comparison between CPU/GPU runs)
functional albedo
CPU
rse[end] = 1.6532404532423364e-5
rse[end] = 0.0005462468417393023
GPU
rse[end] = 7.841693994284924e-5
rse[end] = 0.0005103673696554503
static map
CPU
rse[end] = 1.5598104032642762e-5
rse[end] = 7.682827509990757e-5
GPU
rse[end] = 0.00012549802174549964
rse[end] = 0.0015490000882394841
temporal map
CPU
rse[end] = 1.7424794137292602e-5
rse[end] = 0.00033606441246576436
GPU
rse[end] = 0.00012708219201050294
rse[end] = 0.000943450817881417