debug GPU conservation #614

Closed
wants to merge 31 commits into from

juliasloan25
Member

@juliasloan25 juliasloan25 commented Feb 12, 2024

Purpose

Debug conservation difference between CPU and GPU #607

To-do

  • save sim states to HDF5 daily
  • only run CPU/GPU slabplanet sims on buildkite

Content


  • I have read and checked the items on the review checklist.

@juliasloan25 juliasloan25 force-pushed the js/gpu-debug branch 3 times, most recently from bcb054c to 051a84d Compare February 20, 2024 19:46
@juliasloan25 juliasloan25 force-pushed the js/gpu-debug branch 2 times, most recently from 4947bb5 to aa50f3a Compare March 19, 2024 15:30
@LenkaNovak
Collaborator

Note, small degradation of conservation total at radiation timestep on CPU (FT64, so small but not negligible):

Screen Shot 2024-03-21 at 7 05 51 AM Screen Shot 2024-03-21 at 7 06 38 AM
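
One way to pin down when a conserved total starts to change is to look at its drift relative to the initial value. The following is a minimal sketch, not the coupler's own conservation check: `relative_drift` is a hypothetical helper, and the time series is stand-in data rather than simulation output.

```julia
# Hypothetical helper: relative drift of a conserved total
# with respect to its initial value.
relative_drift(totals) = (totals .- totals[1]) ./ totals[1]

# Stand-in time series (not simulation output): a total that stays flat
# and then shifts slightly at the fourth step, as might happen at a
# radiation update.
totals = [1.0e9, 1.0e9, 1.0e9, 1.0e9 + 2.0, 1.0e9 + 2.0]
drift = relative_drift(totals)

# Index of the first step at which the total departs from its initial value.
first_change = findfirst(!iszero, drift)

println("relative drift: ", drift)
println("first change at step ", first_change)
```

Comparing `first_change` against the radiation update interval would show whether the degradation lines up with the radiation timestep.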

@LenkaNovak
Collaborator

CPU (left) vs GPU (right) based on @juliasloan25's investigation:

  • significant deviations in states of all models between GPU and CPU after 1 day, e.g. for atmos:
    Screen Shot 2024-03-21 at 7 25 35 AM
  • given there was no deviation in states after 1 dt, let's try to keep radiation constant (dt_rad > 1day) as the next step.

@juliasloan25
Member Author

juliasloan25 commented Mar 21, 2024

Note, small degradation of conservation total at radiation timestep on CPU (FT64, so small but not negligible):
...

Which runs are these plots from? It looks like we have dt_rad: 1hours for most of the runs here (or at least the AMIP runs - slabplanet is using whatever our default is), so I just want to make sure we're attributing the conservation change to the right thing :)

@LenkaNovak
Collaborator

Note, small degradation of conservation total at radiation timestep on CPU (FT64, so small but not negligible):
...

Which runs are these plots from? It looks like we have dt_rad: 1hours for most of the runs here (or at least the AMIP runs - slabplanet is using whatever our default is), so I just want to make sure we're attributing the conservation change to the right thing :)

The plots above are for slabplanet with static albedo. Thanks for setting off the new runs... 🤞

@LenkaNovak
Collaborator

Hmm, keeping the rad constant doesn't seem to help. Would it be possible to run for 400s and add more debug plots in the coupling loop (with t appended to the plots' titles). Hopefully that'll help us narrow this down.

@juliasloan25
Member Author

Hmm, keeping the rad constant doesn't seem to help. Would it be possible to run for 400s and add more debug plots in the coupling loop (with t appended to the plots' titles). Hopefully that'll help us narrow this down.

Sure! Do you think 400s will be enough? I could also run for something like 6 hours and plot every hour

@LenkaNovak
Collaborator

Hmm, keeping the rad constant doesn't seem to help. Would it be possible to run for 400s and add more debug plots in the coupling loop (with t appended to the plots' titles). Hopefully that'll help us narrow this down.

Sure! Do you think 400s will be enough? I could also run for something like 6 hours and plot every hour

I would try 400s for now (in case the first time step is overridden by the initial step! and reinit), but hourly plots like you suggest would be the next thing to try ;-)

@LenkaNovak
Collaborator

LenkaNovak commented Mar 21, 2024

I'm seeing differences in conservation but no visible difference in the states. 🤔
Screen Shot 2024-03-21 at 10 38 42 AM

It is likely a round-off error. The good thing is that we can see the divergence after 400s, but to see exactly where it is coming from, we can:

    1. print the extrema of the plotted fields with more decimal places
    2. save and exactly compare some key variable arrays (I would start with coupler fields and land fields)
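
The two steps above can be sketched as follows. This is a minimal illustration under stated assumptions: `extrema_string` and `exact_compare` are hypothetical helpers, not functions from the coupler, and the arrays are stand-in data in which the same three numbers are summed in two different orders to mimic a reduction that runs in a different order on CPU and GPU.

```julia
using Printf

# Hypothetical helper: format the extrema of a field with 17 significant
# digits, enough to distinguish any two distinct Float64 values.
function extrema_string(name, field)
    lo, hi = extrema(field)
    return @sprintf("%s: min = %.17g, max = %.17g", name, lo, hi)
end

# Hypothetical helper: exact elementwise comparison of two saved arrays.
# Returns the number of differing entries and the largest absolute difference.
function exact_compare(a, b)
    size(a) == size(b) || error("array sizes differ")
    n_diff = count(a .!= b)
    max_abs = maximum(abs.(a .- b))
    return n_diff, max_abs
end

# Stand-in data (not simulation output): identical terms, different
# summation order, so the second entries differ only by round-off.
cpu_field = [1.0, (0.1 + 0.2) + 0.3]
gpu_field = [1.0, 0.1 + (0.2 + 0.3)]

println(extrema_string("cpu", cpu_field))
println(extrema_string("gpu", gpu_field))

n_diff, max_abs = exact_compare(cpu_field, gpu_field)
println("differing entries: ", n_diff, ", max |cpu - gpu|: ", max_abs)
```

At plot precision the two fields look identical, which matches the observation that the states show no visible difference; the full-precision extrema and the exact comparison are what expose the mismatch.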

@juliasloan25
Member Author

I've added tests to compare the states of each component model simulation at the end of the simulation between CPU and GPU. Looking at slabplanet with static albedo run for 400s, we see that the land and ocean states are approximately equal between CPU and GPU, but the atmos state is not. The max difference in specific atmos variables is shown below. Given that the only state with differences after 400s is the atmosphere, it will be helpful to set up CPU/GPU conservation tests in atmos.

slabplanet: albedo from static map (CPU vs GPU) - build
atmos_ρe_tot

julia> abs(maximum(cpu_atmos_state[:,1] .- gpu_atmos_state[:,1]))
0.005922795255173696

atmos_ρq_tot

julia> abs(maximum(cpu_atmos_state[:,2] .- gpu_atmos_state[:,2]))
1.5653893756352792e-10

atmos_ρ

julia> abs(maximum(cpu_atmos_state[:,3] .- gpu_atmos_state[:,3]))
2.7223778786833464e-7

atmos_uₕ[1]

julia> abs(maximum(cpu_atmos_state[:,4] .- gpu_atmos_state[:,4]))
0.17414431556971977

atmos_uₕ[2]

julia> abs(maximum(cpu_atmos_state[:,5] .- gpu_atmos_state[:,5]))
0.0863372049825557
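
The REPL calls above take `abs` of the largest signed difference, which can under-report the deviation if the largest one is negative (GPU value above the CPU value). A minimal sketch of the elementwise alternative follows, assuming the state arrays are matrices with one column per variable, as the indexing above suggests. `max_abs_diff` and `abs_of_max_diff` are hypothetical helper names, and the arrays are stand-in data, not values from these runs.

```julia
# Hypothetical helper: largest absolute elementwise difference in one column.
max_abs_diff(a, b, col) = maximum(abs.(a[:, col] .- b[:, col]))

# Hypothetical helper: the form used in the REPL calls above, for comparison.
abs_of_max_diff(a, b, col) = abs(maximum(a[:, col] .- b[:, col]))

# Stand-in data (not from the simulations): the GPU value is larger by 0.5
# in the first row and smaller by about 0.1 in the last row.
cpu_state = reshape([1.0, 2.0, 3.0], 3, 1)
gpu_state = reshape([1.5, 2.0, 2.9], 3, 1)

println("maximum(abs.(cpu .- gpu)) = ", max_abs_diff(cpu_state, gpu_state, 1))
println("abs(maximum(cpu .- gpu)) = ", abs_of_max_diff(cpu_state, gpu_state, 1))
```

On this stand-in data the first form reports the 0.5 deviation while the second reports only the roughly 0.1 positive one, so the differences listed above may be lower bounds on the true maximum deviation.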

@@ -0,0 +1,111 @@
import DelimitedFiles as DLM
using Statistics
import ClimaCoupler
Member Author

don't need this

@juliasloan25
Member Author

this has been resolved in #733

@juliasloan25 juliasloan25 mentioned this pull request May 17, 2024
6 tasks