cudaErrorIllegalAddress with pm-gpu cdash tests (with scream-output-preset-5 and 6 -- both have vertical remap yaml outputs) #2798

Open
ndkeen opened this issue Apr 25, 2024 · 1 comment
Labels
bug Something isn't working GPU PRs that make changes specifically for GPUs

Comments

@ndkeen
Contributor

ndkeen commented Apr 25, 2024

We've seen these two tests fail for many days:
ERS_Ln90.ne30pg2_ne30pg2.F2010-SCREAMv1.pm-gpu_gnugpu.scream-small_kernels--scream-output-preset-5
and
ERP_Ln22.conusx4v1pg2_r05_oECv3.F2010-SCREAMv1-noAero.pm-gpu_gnugpu.scream-bfbhash--scream-output-preset-6

While narrowing this down, scream-output-preset-5 looks like the likely culprit, and the error may also show up with scream-output-preset-6. Tests using presets 1, 2, 3, and 4 have not hit the error.

Additionally, the failure is intermittent: not every attempt with this testmod hits the CUDA error.

I verified that I get the same behavior with just SMS (i.e., all of the following also fail in most attempts):
SMS_Ln90.ne30pg2_ne30pg2.F2010-SCREAMv1.pm-gpu_gnugpu.scream-output-preset-5

as well as a DEBUG test:
SMS_D_Ln90.ne30pg2_ne30pg2.F2010-SCREAMv1.pm-gpu_gnugpu.scream-output-preset-5

and with only 1 thread:
SMS_PMx1_D_Ln90.ne30pg2_ne30pg2.F2010-SCREAMv1.pm-gpu_gnugpu.scream-output-preset-5

The CUDA error does not occur at the same timestep from run to run either.
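
Since the failure is intermittent, it helps to tally how many attempts actually hit it. Below is a minimal bash sketch of such a tally. The helper name `count_cuda_failures` is hypothetical, and the log file pattern passed to it (for example `e3sm.log.*`) is an assumption about how the run logs are named; compressed logs would need to be unpacked first.

```shell
#!/usr/bin/env bash
# count_cuda_failures PATTERN DIR...
# For each attempt directory, check whether any file whose name matches
# PATTERN mentions cudaErrorIllegalAddress, then print a tally.
# (Hypothetical helper; PATTERN is an assumption about log naming.)
count_cuda_failures() {
  local pattern="$1"; shift
  local failed=0 total=0 dir
  for dir in "$@"; do
    total=$((total + 1))
    # -r: recurse, -q: quiet, -s: suppress errors for unreadable files
    if grep -rqs --include="$pattern" 'cudaErrorIllegalAddress' "$dir"; then
      failed=$((failed + 1))
      echo "FAIL $dir"
    else
      echo "ok   $dir"
    fi
  done
  echo "$failed of $total attempts hit cudaErrorIllegalAddress"
}

# Example call (directories are placeholders):
#   count_cuda_failures 'e3sm.log.*' attempt1/run attempt2/run
```

This only reports whether an attempt hit the error, not at which timestep.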

perlmutter-login06% pwd
/global/cfs/cdirs/e3sm/ndk/repos/se00-apr23/components/eamxx/cime_config/testdefs/testmods_dirs/scream/output/preset
perlmutter-login06% grep hremap_to_ne4 */*
3/shell_commands:. $SCRIPTS_DIR/hremap_to_ne4/shell_commands
4/shell_commands:. $SCRIPTS_DIR/hremap_to_ne4/shell_commands
6/shell_commands:. $SCRIPTS_DIR/hremap_to_ne4/shell_commands
perlmutter-login06% grep vremap */*
5/shell_commands:. $SCRIPTS_DIR/vremap/shell_commands
6/shell_commands:. $SCRIPTS_DIR/vremap/shell_commands

This suggests the issue is in vremap, since presets 5 and 6 are the only ones that use it.
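
As a cross-check on that reasoning, here is a minimal bash sketch that rebuilds the preset layout implied by the grep output above in a scratch directory and lists which presets source each remap script. The helper name `presets_using` and the scratch tree are illustrative only and not part of the repo; presets 1 and 2 are assumed to source neither script, because the greps above did not match them.

```shell
#!/usr/bin/env bash
# presets_using SCRIPT DIR: print the preset directories under DIR whose
# shell_commands file sources SCRIPT/shell_commands (hypothetical helper).
presets_using() {
  local script="$1" dir="$2"
  (cd "$dir" && grep -l -- "/${script}/shell_commands" */shell_commands \
     | cut -d/ -f1 | sort -n | paste -sd' ' -)
}

# Scratch tree mirroring the grep output shown in this issue (assumed layout).
tmp=$(mktemp -d)
trap 'rm -rf "$tmp"' EXIT
for p in 1 2 3 4 5 6; do
  mkdir -p "$tmp/$p"
  : > "$tmp/$p/shell_commands"
done
for p in 3 4 6; do
  echo '. $SCRIPTS_DIR/hremap_to_ne4/shell_commands' >> "$tmp/$p/shell_commands"
done
for p in 5 6; do
  echo '. $SCRIPTS_DIR/vremap/shell_commands' >> "$tmp/$p/shell_commands"
done

echo "hremap_to_ne4: $(presets_using hremap_to_ne4 "$tmp")"   # hremap_to_ne4: 3 4 6
echo "vremap:        $(presets_using vremap "$tmp")"          # vremap:        5 6
```

The vremap presets (5 and 6) are exactly the ones seen failing, while hremap_to_ne4 also appears in presets 3 and 4, which have not hit the error.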

YAML_FILES=$(ls -1 | grep 'eamxx_.*_output.yaml')
for fname in ${YAML_FILES}; do
  $YAML_EDIT_SCRIPT -f $fname --vertical-remap-file \${DIN_LOC_ROOT}/atm/scream/maps/vrt_remapping_p_levs_20230926.nc
done

For the other (conus) test, I can reproduce the error with something as simple as:
SMS_P8x1_D_Ln22.conusx4v1pg2_r05_oECv3.F2010-SCREAMv1-noAero.pm-gpu_gnugpu.scream-output-preset-5
i.e., use SMS and DEBUG, and run on only 2 nodes (the default is 8) without threading.

Directory where I made many attempts: /pscratch/sd/n/ndk/e3sm_scratch/pm-gpu/se00-apr23

The tests seem to pass on pm-cpu (i.e., I tried the ones that fail on pm-gpu, though not as extensively as above).

bartgol added the "bug" (Something isn't working) label on Apr 25, 2024
ndkeen changed the title from "cudaErrorIllegalAddress with pm-gpu cdash test (with scream-output-preset-5)" to "cudaErrorIllegalAddress with pm-gpu cdash tests (with scream-output-preset-5)" on Apr 25, 2024
ndkeen changed the title from "cudaErrorIllegalAddress with pm-gpu cdash tests (with scream-output-preset-5)" to "cudaErrorIllegalAddress with pm-gpu cdash tests (with scream-output-preset-5 and 6 -- both have vertical remap)" on Apr 25, 2024
ndkeen added the "GPU" (PRs that make changes specifically for GPUs) label on Apr 25, 2024
ndkeen changed the title from "cudaErrorIllegalAddress with pm-gpu cdash tests (with scream-output-preset-5 and 6 -- both have vertical remap)" to "cudaErrorIllegalAddress with pm-gpu cdash tests (with scream-output-preset-5 and 6 -- both have vertical remap yaml outputs)" on Apr 26, 2024
@ndkeen
Contributor Author

ndkeen commented May 4, 2024

Both of the failing tests pass on @bartgol's branch bartgol/eamxx/use-only-scorpio-clib.
