gfsmetp jobs do not generate stats files on Hera and Hercules #2759
Comments
Confirming that I am seeing the same thing in my C384 experiments, both hybrid ATM-only and 3DVar S2S. Those experiments use workflows from July 5 and July 3, respectively.
Added the following lines back to the …

Rerun of Hera 2024050400 gfsmetpg2g1. This time the job generated metplus stats files. For details see the …
@RussTreadon-NOAA Interesting. The issue appears to reside in `export nproc=${npe_node_metp_gfs:-1}`. This should probably be looking for …
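A minimal sketch of the kind of change being suggested. The truncated comment does not name the variable that should be consulted instead, so `npe_node_metp` below is purely a hypothetical placeholder, not a confirmed name:

```sh
#!/bin/sh
# Hypothetical fallback chain: prefer a placeholder resource variable
# (npe_node_metp), then the old npe_node_metp_gfs, then 1 so the script
# still runs standalone.
export nproc=${npe_node_metp:-${npe_node_metp_gfs:-1}}
echo "nproc=${nproc}"
```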
@DavidHuber-NOAA, your solution works. I made the recommended changes in a working copy of g-w on Hera. The prjedi 2024050400 gfsmetp jobs were rewound and rebooted. Each successfully ran to completion with metplus stats files generated.
One thing that concerns me is that the metp jobs did not produce any stat files, yet the jobs did not fail and were marked "SUCCEEDED". Is there an easy way we can improve the scripts to catch this in the future?
I agree, @CatherineThomas-NOAA. The failed gfsmetp jobs contain warnings that no stats files were produced. For example, the 2024050400 gfsmetpg2g1.log.3 contains such warnings. Tagging @malloryprow for awareness.
Hi @RussTreadon-NOAA, I think we had the gfsmetp jobs fail silently so they didn't hold up the rest of the workflow if things failed. I think this is something that was discussed a long time ago. Did you find why the jobs failed?
@malloryprow, yes, @DavidHuber-NOAA found the reason for the failure (see above). The danger with silent failures is that a developer could run a parallel assuming gfsmetp success means stats are generated, only to find that the stats are missing when (s)he goes to plot them. Of course, generating the stats after the fact isn't hard. It just takes additional time.
Ah, missed those! I definitely get not wanting the silent failure. It should be something easily fixed. If no stat files were copied, exit with an error code. Does that sound appropriate?
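A minimal sketch of the check described here, written in shell since the internals of the copy step are not shown in this thread; the `ARCDIR`/metplus_data location follows the workflow snippet quoted later in the thread and is otherwise an assumption:

```sh
#!/bin/sh
# After the copy step, count non-empty stat files under the archive directory
# and exit non-zero if none were copied, so the metp job no longer fails silently.
nstat=$(find "${ARCDIR:?}/metplus_data" -type f -name "*.stat" -size +0c | wc -l)
if [ "${nstat}" -eq 0 ]; then
  echo "FATAL ERROR: no METplus stat files were copied to ${ARCDIR}/metplus_data"
  exit 1
fi
echo "Found ${nstat} METplus stat files"
```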
I have a couple of ideas on how to resolve this while maintaining the ability to push past failed METplus jobs. One option is to invoke a small check script from the workflow XML, for example:
```xml
<sh shell="/bin/sh"><cyclestr>&HOMEgfs;/ush/check_metp.sh &ARCDIR;/metplus_data/by_VSDB/grid2grid/anom/@HZ/&pslot;/&pslot;_@Y@m@d.stat</cyclestr></sh>
```

where `check_metp.sh` performs checks along these lines:
```sh
#!/bin/sh
# check_metp.sh: warn if the METplus stat file passed as $1 is missing or empty.
filename=${1:-""}
[ -z "${filename}" ] && echo "METplus filename is an empty string" && exit 0
[ ! -f "${filename}" ] && echo "METplus output file ${filename} does not exist!" && exit 0
[ ! -s "${filename}" ] && echo "METplus output file ${filename} is zero-sized!" && exit 0
```
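As a quick standalone sanity check, the script could be run by hand against one cycle's file; the expanded path below is illustrative, following the `<cyclestr>` pattern above:

```sh
# Hypothetical manual invocation for one cycle's grid2grid anomaly stat file
sh ush/check_metp.sh "${ARCDIR}/metplus_data/by_VSDB/grid2grid/anom/00Z/${pslot}/${pslot}_20240504.stat"
```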
I can see that being a helpful avenue. Is that something that would run at the end of every cycle? I think checks would be helpful for the grid2obs and precip stat files too.
@malloryprow Yes, it would run at the end of every cycle after all jobs for a particular …
I think it may be better to include this in copy_stat_files.py. The metp tasks don't produce anything for the gdas cycles or for every gfs cycle.
OK, noted. I could code it so that it only runs for …
Having to dust some cobwebs off my brain here since it has been a good while since I have run anything with the global workflow... if only one gfs cycle is being run, it will run for that one every time; if more than one, it should all run at the 00Z cycle. Kind of weird, but when I first put the gfsmetp tasks in, this is how VSDB was set up to run, so I copied that "style".
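A hedged sketch of how such a cycle guard might look, written in shell because the Python internals of copy_stat_files.py are not shown here; `cyc` is assumed to be the cycle hour and `ngfs_cycles` is a hypothetical count of gfs cycles per day:

```sh
#!/bin/sh
# Only enforce the stat-file check for cycles where the gfsmetp tasks run:
# every gfs cycle when a single gfs cycle is configured, otherwise only 00Z.
if [ "${ngfs_cycles:-1}" -eq 1 ] || [ "${cyc:-00}" = "00" ]; then
  run_stat_check="YES"
else
  run_stat_check="NO"
fi
echo "run_stat_check=${run_stat_check}"
```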
Interesting. In that case, I agree that putting something in copy_stat_files.py makes sense.
@RussTreadon-NOAA Would you be able to copy /scratch1/NCEPDEV/global/Mallory.Row/VRFY/EMC_verif-global/ush/get_data_files.py on Hera into the needed location, and then rerun the gfsmetp tasks in the configuration it was in when it failed?
@malloryprow, I no longer have a g-w configuration which fails. My local copy of g-w contains the changes @DavidHuber-NOAA recommended.
Ah okay. The changes I made worked when running standalone, but I wanted confirmation that the desired behavior also happens in the global workflow.
@malloryprow I still have my test that I can set to run another few cycles. I'll copy that file and let you know how it goes.
@malloryprow: Is it get_data_files.py or copy_stat_files.py that I should be copying? The file listed is the same as the version in my clone.
@malloryprow, I broke my configuration by reverting @DavidHuber-NOAA's changes. I reran the 2024050500 gfsmetpg2g1. Warning messages were printed, but the job still completed with a SUCCEEDED status. I then realized that your comment refers to … I copied …
I looked at the changes you made to … Your modified script checks the destination directory for stat files. Since my previous run with Dave's fixes worked, the stat files are already in the destination directory. I moved the existing stat files to a new name and reran the job. The gfsmetpg2g1 still finished with a SUCCEEDED status. I looked again at your modified … This time gfsmetpg2g1 failed with …
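A hedged sketch of the negative test described above; the archive layout is illustrative, following the path pattern used earlier in the thread:

```sh
#!/bin/sh
# Hide the previously archived stat files so the destination-directory check
# has something to catch, then rewind and rerun the gfsmetp job for that cycle.
for f in "${ARCDIR:?}"/metplus_data/by_VSDB/grid2grid/anom/00Z/"${pslot:?}"/*.stat; do
  [ -f "${f}" ] && mv "${f}" "${f}.hold"
done
```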
We should change the logic in … @DavidHuber-NOAA recommends a change to …
Ah, sorry @CatherineThomas-NOAA. You're right, it should be copy_stat_files.py.
@RussTreadon-NOAA Thanks for testing. I see what you are saying. Let me add that in and update you. And yes, I see …
@RussTreadon-NOAA @CatherineThomas-NOAA /scratch1/NCEPDEV/global/Mallory.Row/VRFY/EMC_verif-global/ush/copy_stats_files.py updated! Apologies for stating the wrong file earlier.
@malloryprow, your updated … I added …
Ah! Fixed that. Glad it detected the missing file.
@malloryprow, the local modifications to EMC-verif made in light of the above discussion are on Hera in …
@DavidHuber-NOAA, the change to …
Thanks @RussTreadon-NOAA! I got the changes into the develop branch of EMC_verif-global at NOAA-EMC/EMC_verif-global@7118371. @KateFriedman-NOAA updated the hash for EMC_verif-global :) The new hash is different from what I linked because it includes changes for running EMC_verif-global standalone on Orion following the Rocky upgrade. The hash has the needed changes.
@malloryprow I left a comment in the commit you referenced.
What is wrong?

gfsmetp jobs run in Hera testprepobs and Hercules testprepobsherc (see issue #2694) finish with status=0 (success), but no metplus stats files are generated. For example, the Hera 2021122200/gfsmetpg2g1.log contains …

What should have happened?

gfsmetp jobs run in WCOSS2 (Dogwood) testprepobsherc (see issue #2694) finish with status=0 (success) AND create non-zero length metplus stats files. For example, … Non-zero size stats files should also be created in the Hera and Hercules parallels.
What machines are impacted?
Hera, Hercules
Steps to reproduce
Set up and run the parallels described in issue #2694
Additional information
gfsmetp*log files on Hera and Hercules both contain srun: error messages. For example, …

Do you have a proposed solution?

Why is the srun: error message generated? Would fixing this error result in metplus stats files being generated?
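A hedged sketch of the quick triage these questions imply; the log and archive locations below are illustrative assumptions based on the paths mentioned in this issue:

```sh
#!/bin/sh
# Look for the srun errors in one cycle's gfsmetp log and count the non-empty
# stat files actually produced for that cycle.
grep -i "srun: error" "${EXPDIR:?}/logs/2021122200/gfsmetpg2g1.log"
find "${ARCDIR:?}/metplus_data" -type f -name "*.stat" -size +0c | wc -l
```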