We've seen these tests fail for many days: ERS_Ln90.ne30pg2_ne30pg2.F2010-SCREAMv1.pm-gpu_gnugpu.scream-small_kernels--scream-output-preset-5
and ERP_Ln22.conusx4v1pg2_r05_oECv3.F2010-SCREAMv1-noAero.pm-gpu_gnugpu.scream-bfbhash--scream-output-preset-6
Trying to narrow down the issue, scream-output-preset-5 looks like the likely culprit, and the error possibly also shows up with scream-output-preset-6. Tests with presets 1, 2, 3, and 4 have not hit the error.
Additionally, not every attempt hits the error, so the CUDA error is intermittent with this testmod.
I verified that I can get the same behavior with just SMS (all of the following also fail in most attempts): SMS_Ln90.ne30pg2_ne30pg2.F2010-SCREAMv1.pm-gpu_gnugpu.scream-output-preset-5
as well as with a DEBUG test: SMS_D_Ln90.ne30pg2_ne30pg2.F2010-SCREAMv1.pm-gpu_gnugpu.scream-output-preset-5
and with only 1 thread: SMS_PMx1_D_Ln90.ne30pg2_ne30pg2.F2010-SCREAMv1.pm-gpu_gnugpu.scream-output-preset-5
The CUDA error does not occur at the same timestep from run to run, either.
YAML_FILES=$(ls -1 | grep 'eamxx_.*_output.yaml')
for fname in ${YAML_FILES}; do
  $YAML_EDIT_SCRIPT -f "$fname" --vertical-remap-file \${DIN_LOC_ROOT}/atm/scream/maps/vrt_remapping_p_levs_20230926.nc
done
For the other (conus) test, I can reproduce the failure with something as simple as: SMS_P8x1_D_Ln22.conusx4v1pg2_r05_oECv3.F2010-SCREAMv1-noAero.pm-gpu_gnugpu.scream-output-preset-5
i.e., use SMS, DEBUG, and only 2 nodes (the default is 8) without threading.
Directory where I made many attempts: /pscratch/sd/n/ndk/e3sm_scratch/pm-gpu/se00-apr23
The tests seem to pass on pm-cpu (i.e., I tried the tests that fail on pm-gpu, though not as extensively as above).
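Because the error is intermittent, it can help to tally how many attempts in a scratch directory actually hit it. The following is only a rough sketch, not part of the test suite: the function name `count_cuda_failures` is hypothetical, and it assumes that each immediate subdirectory of the given root is one attempt and that uncompressed log files matching `*.log*` live somewhere beneath it. The search string `cudaErrorIllegalAddress` is the error named in this issue.

```shell
#!/usr/bin/env bash
# Hypothetical helper: count attempts whose logs mention the CUDA error.
# Assumptions (not confirmed by the issue): each immediate subdirectory of
# the root is one test attempt, and its logs are plain-text files whose
# names match "*.log*". Compressed logs (e.g. *.gz) are not searched.
count_cuda_failures() {
  local root="$1"
  local pattern="${2:-cudaErrorIllegalAddress}"
  local total=0 failed=0 dir
  for dir in "$root"/*/; do
    [ -d "$dir" ] || continue
    total=$((total + 1))
    # grep -r searches recursively; -q makes it exit 0 on the first match
    # and 1 when nothing (or no matching file) is found.
    if grep -rq --include='*.log*' "$pattern" "$dir"; then
      failed=$((failed + 1))
    fi
  done
  echo "$failed of $total attempts hit $pattern"
}
```

For example, `count_cuda_failures /pscratch/sd/n/ndk/e3sm_scratch/pm-gpu/se00-apr23` would print a one-line summary for the attempts directory mentioned above, provided the layout matches the assumptions.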
ndkeen changed the title from "cudaErrorIllegalAddress with pm-gpu cdash test (with scream-output-preset-5)" to "cudaErrorIllegalAddress with pm-gpu cdash tests (with scream-output-preset-5)" on Apr 25, 2024
ndkeen changed the title from "cudaErrorIllegalAddress with pm-gpu cdash tests (with scream-output-preset-5)" to "cudaErrorIllegalAddress with pm-gpu cdash tests (with scream-output-preset-5 and 6 -- both have vertical remap)" on Apr 25, 2024
ndkeen changed the title from "cudaErrorIllegalAddress with pm-gpu cdash tests (with scream-output-preset-5 and 6 -- both have vertical remap)" to "cudaErrorIllegalAddress with pm-gpu cdash tests (with scream-output-preset-5 and 6 -- both have vertical remap yaml outputs)" on Apr 26, 2024
Since presets 5 and 6 both have vertical remap yaml outputs, this sort of points to the issue being in vremap.