-
Notifications
You must be signed in to change notification settings - Fork 119
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[develop] Bug fix for Issue #851 (renaming NET
to NET_default
)
#868
[develop] Bug fix for Issue #851 (renaming NET
to NET_default
)
#868
Conversation
…bles that got the _default prefix appended by PR #) where necessary to address bug in Issue ufs-community#851 (the WE2E test "MET_ensemble_verification_only_vx_time_lag" is failing).
NET
to NET_default
renamingNET
to NET_default
)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks good, this even includes a change that I missed in my PR. 👍
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Overall, these changes look good to me! I was able to successfully run the MET_ensemble_verification_only_vx_time_lag
WE2E test and also the Hera GNU WE2E coverage tests successfully passed as well.
However, after noting the additional changes that you made to ush/config_defaults.yaml
, I decided to run the Hera Intel WE2E coverage tests. The grid_RRFS_CONUS_25km_ics_FV3GFS_lbcs_FV3GFS_suite_GFS_v15p2
test failed with the following message:
FATAL from PE 5: compute_qs: saturation vapor pressure table overflow, nbad= 1
When I removed the _default from NET_default on lines 2453, 2454, and 2455 of ush/config_defaults.yaml
, the test ran without issue.
Since you made these changes in preparation for introducing ecflow into the SRW App for the AQM implementation, should these three lines also include the _default
, or should they be NET
? Thanks!
ush/config_defaults.yaml
Outdated
FCST_SUBDIR_TEMPLATE: '{% if user.RUN_ENVIR == "nco" %}${NET_default}.{init?fmt=%Y%m%d?shift=-${time_lag}}/{init?fmt=%H?shift=-${time_lag}}{% else %}{init?fmt=%Y%m%d%H?shift=-${time_lag}}{% if global.DO_ENSEMBLE %}/${ensmem_name}{% endif %}/postprd{% endif %}' | ||
FCST_FN_TEMPLATE: '${NET_default}.t{init?fmt=%H?shift=-${time_lag}}z{% if user.RUN_ENVIR == "nco" and global.DO_ENSEMBLE %}.${ensmem_name}{% endif %}.prslev.f{lead?fmt=%HHH?shift=${time_lag}}.${POST_OUTPUT_DOMAIN_NAME}.grib2' | ||
FCST_FN_METPROC_TEMPLATE: '${NET_default}.t{init?fmt=%H}z{% if user.RUN_ENVIR == "nco" and global.DO_ENSEMBLE %}.${ensmem_name}{% endif %}.prslev.f{lead?fmt=%HHH}.${POST_OUTPUT_DOMAIN_NAME}_a${ACCUM_HH}h.nc' |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
While manually running the coverage.hera.intel.nco WE2E tests, I'm encountering the following failure in the grid_RRFS_CONUS_25km_ics_FV3GFS_lbcs_FV3GFS_suite_GFS_v15p2
test, in the run_fcst task:
FATAL from PE 5: compute_qs: saturation vapor pressure table overflow, nbad= 1
When I removed the _default
from NET_default
on lines 2453, 2454, and 2455, the test once again passes.
If you would like to run the Hera Intel coverage WE2E tests, please set the following environmental variables:
setenv SRW_PLATFORM "hera"
setenv SRW_COMPILER "intel"
setenv WORKSPACE "/path/to/your/ufs-srweather-app"
setenv SRW_WE2E_COMPREHENSIVE_TESTS "False"
setenv SRW_PROJECT "enter_your_Hera_allocation_account_here"
With the above variables set, you should be able to go into .cicd/scripts
, then run ./srw_build.sh
to build the SRW App, then ./srw_test.sh
to run the Hera Intel coverage tests.
Alternatively, you should be able to launch the tests without the environment variables, but instead use the following command from tests/WE2E
:
./setup_WE2E_tests.sh hera Hera_allocation_account intel coverage --expt_basedir=/path/to/where/you/want/your/expt_dirs --ops_root=/path/to/where/you/want/your/nco_dirs
where Hera_allocation_account
in the above command line is your account to run jobs on Hera.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@MichaelLueken That's strange because those 3 variables are used only in verification as far as I know (I created them), and the test grid_RRFS_CONUS_25km_ics_FV3GFS_lbcs_FV3GFS_suite_GFS_v15p2
doesn't run any vx tasks.
I will run this test and see what happens. I normally run the WE2E tests using the run_WE2E_tests.py
script. Should that method give the same result as using the setup_WE2E_tests.sh
(i.e. using your method)?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@gsketefian I don't think that running the coverage.hera.intel.nco
would work the same way if you ran it through ./run_WE2E_tests.py
. The Hera Intel coverage tests forces all of the WE2E tests to run in NCO-mode. In order to mimic this outside of manually running the Jenkins tests or running the ./setup_WE2E_tests.sh
script, you would need to set run_envir: nco
for all tests in the coverage.hera.intel.nco
machine_suites
file.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@MichaelLueken Thanks for the explanation. Since I'm more familiar with run_WE2E_tests.py
, I first wanted to make sure the test grid_RRFS_CONUS_25km_ics_FV3GFS_lbcs_FV3GFS_suite_GFS_v15p2
runs successfully using this script. So:
- I ran this test (in community mode) using
run_WE2E_tests.py
. The test completed successfully. You can see the output at:/scratch2/BMC/det/Gerard.Ketefian/UFS_CAM/TEST_SRW_phys_tendencies/expt_dirs/NET_02/grid_RRFS_CONUS_25km_ics_FV3GFS_lbcs_FV3GFS_suite_GFS_v15p2
- To repeat the same test except in NCO mode, I created a new test in
named
ufs-srweather-app/tests/WE2E/test_configs/grids_extrn_mdls_suites_nco
nco_grid_RRFS_CONUS_25km_ics_FV3GFS_lbcs_FV3GFS_suite_GFS_v15p2
. The only difference between this and the original test is that it hasRUN_ENVIR
set to "nco", i.e. in the config file, we haveuser: RUN_ENVIR: nco
I ran this again usingrun_WE2E_tests.py
. This test completed successfully. You can see the output at:and at/scratch2/BMC/det/Gerard.Ketefian/UFS_CAM/TEST_SRW_phys_tendencies/expt_dirs/NET_02/nco_grid_RRFS_CONUS_25km_ics_FV3GFS_lbcs_FV3GFS_suite_GFS_v15p2
/scratch2/BMC/det/Gerard.Ketefian/UFS_CAM/TEST_SRW_phys_tendencies/expt_dirs/nco_dirs.works01
This makes me think the problem is in the way setup_WE2E_tests.sh
runs the test and/or due to something that was changed by PR #847 since my clone is one PR behind that. I'll try it with setup_WE2E_tests.sh
. If that works, I'll update my hash to PR #847 and retry.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@MichaelLueken I tried the way you suggested of rebuilding using .cicd/scripts/srw_build.sh
, then running .cicd/scripts/srw_test.sh
. To make things simpler, I only included grid_RRFS_CONUS_25km_ics_FV3GFS_lbcs_FV3GFS_suite_GFS_v15p2
in coverage.hera.intel.nco
. The test succeeded. You can find the experiment directory at
/scratch2/BMC/det/Gerard.Ketefian/UFS_CAM/TEST_SRW_phys_tendencies/ufs-srweather-app/expt_dirs/grid_RRFS_CONUS_25km_ics_FV3GFS_lbcs_FV3GFS_suite_GFS_v15p2.hash_f9696e1b
and the nco_dirs
directory at
/scratch2/BMC/det/Gerard.Ketefian/UFS_CAM/TEST_SRW_phys_tendencies/ufs-srweather-app/nco_dirs.hash_f9696e1b
Ok, last thing to try is to update my hash (i.e. merge latest develop
into my branch).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@MichaelLueken As a final test, I merged the latest develop (with PR #847) into my branch and reran srw_test.sh
as above (I didn't rebuild since none of the external repo hashes changed). The test succeeded. You can find the experiment directory at
/scratch2/BMC/det/Gerard.Ketefian/UFS_CAM/TEST_SRW_phys_tendencies/ufs-srweather-app/expt_dirs/grid_RRFS_CONUS_25km_ics_FV3GFS_lbcs_FV3GFS_suite_GFS_v15p2.hash_6e5b4275
and the nco_dirs directory at
/scratch2/BMC/det/Gerard.Ketefian/UFS_CAM/TEST_SRW_phys_tendencies/ufs-srweather-app/nco_dirs.hash_6e5b4275
So I don't know why your tests failed but mine work!
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@gsketefian I've found in my testing that running just the grid_RRFS_CONUS_25km_ics_FV3GFS_lbcs_FV3GFS_suite_GFS_v15p2
test also lead to the test passing. The only time that the test failed was when it was being run with the full suite of coverage tests. When the _default
was removed, then all of the coverage tests passed.
Since the updated MET_ensemble_verification_only_vx_time_lag
test is passing and the updated coverage.hera.gnu.com
suite is passing, and your changes look fine to me, I will go ahead and approve these changes and let the automated testing run. I'll let you know if the Hera Intel test continues to fail and will direct you to the output. If the test continues to fail, then these changes will likely need to be removed and a new issue will need to be opened to address this.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@gsketefian As noted previously, the updated MET_ensemble_verification_only_vx_time_lag
WE2E test and the Hera GNU coverage tests were all ran and all successfully passed. The changes look good and make sense. I will go ahead and approve this PR and will run the automated Jenkins tests. I'll let you know if the Hera Intel test continues to fail.
@gsketefian The coverage tests failed on Hera Intel:
The
Please see |
Adding the DO_NOT_MERGE label until the Hera Intel coverage tests successfully pass. |
I've run several additional tests and I think we can move forward with your changes once PR #853 has been merged. When I apply the changes you made to If the rest of the tests pass, I will go ahead and merge PR #853. You will need to merge develop into your branch at that time (there will be conflicts in |
Manual tests on Hera Intel using the work from PR #853 with adding the modifications made to
Similarly, merging PR #853 into the
I will go ahead and merge PR #853 now. @gsketefian - Please merge the latest develop into your |
@MichaelLueken I just merged the latest |
Thanks, @gsketefian! I've kicked off the automated Jenkins tests on Hera now. |
The Cheyenne Intel WE2E coverage tests were manually ran on Hera Intel and all successfully passed:
|
The Cheyenne GNU tests were manually ran on Hera GNU and all successfully passed:
Additionally, the reruns of the Jenkins automated WE2E coverage tests on Hera successfully passed -
Hera GNU:
Removing DO_NOT_MERGE label and moving forward with merging this PR now. |
@MichaelLueken Thanks for testing and merging. |
DESCRIPTION OF CHANGES:
The renaming of the configuration variable
NET
toNET_default
by PR #828 broke the WE2E testMET_ensemble_verification_only_vx_time_lag
(see Issue #851). This PR fixes the bug and adds that WE2E test to the list of coverage tests on Hera (currently that test is only in the comprehensive tests list).Type of change
TESTS CONDUCTED:
Ran the WE2E test
MET_ensemble_verification_only_vx_time_lag
as well as the set of fundamental tests on Hera with Intel. All passed.DEPENDENCIES:
None needed.
ISSUE:
Fixes Issue #851.
CHECKLIST
LABELS (optional):
A Code Manager needs to add the following labels to this PR: