
investigation of unexpected behavior of ctest rrfs_3denvar_rdasens #766

Open
TingLei-NOAA opened this issue Jul 2, 2024 · 14 comments

@TingLei-NOAA
Contributor

As suggested by Peter Johnsen via the Orion help desk, and with help from @RussTreadon-NOAA, the behavior of the regional GSI after the Orion upgrade is being investigated in relation to the issues found on Hercules: the netCDF error (related to the I_MPI_EXTRA_FILESYSTEM setting) and the reproducibility issues (#697).
It was found that rrfs_3denvar_rdasens_loproc_updat becomes idle (does not finish within 1 hour 30 minutes) when run on 4 nodes with ppn=5 (20 MPI tasks). I had to follow the recent setup @hu5970 uses on Hera, 3 nodes with ppn=40 (120 tasks), for the job to finish successfully.
It is not clear to me what caused this, or whether it is a sporadic issue (there have been no other complaints about it so far), and this issue is opened to facilitate a collaborative investigation.
In addition to the GSI developers mentioned above, I'd also like to bring this to the attention of @ShunLiu-NOAA and @DavidHuber-NOAA.

@RussTreadon-NOAA
Contributor

Thank you @TingLei-NOAA for opening this issue. This is a known problem. Please see the discussions in the related issues.

RDHPCS ticket #2024062754000098 has also been opened.

GSI PR #764 was merged into develop at EIB's request.

@ShunLiu-NOAA
Contributor

@TingLei-NOAA and @RussTreadon-NOAA Thank you for the heads-up. Since there is an RDHPCS ticket, we can wait for further action from RDHPCS.

@TingLei-NOAA
Contributor Author

@RussTreadon-NOAA Thanks for the information. I will first study the updates in those issues carefully.
@ShunLiu-NOAA I am beginning to think this issue may not be specific to Orion, since a similar setup (more than 100 MPI tasks, while the number of nodes may be smaller, so it does not look like a memory issue) is used on other machines such as Hera and WCOSS2. It was also found that when the "fed" observations are not used and the fed model fields are not included in the control/state variables, this rrfs test works "normally" (using an MPI task setup similar to the hafs and previous fv3lam tests).
I will do some further digging and see what I can find.

@TingLei-NOAA
Contributor Author

TingLei-NOAA commented Jul 3, 2024

The same behavior is confirmed on Hera: with ppn=5 and nodes=4, rrfs_3denvar_rdasens_loproc_updat became idle.
The issue seems to occur in the parallel reading of the physvar files (dbz and fed): one MPI process fails to finish processing all of the levels assigned to it.
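To illustrate one possible mechanism (this is only a sketch in plain MPI with hypothetical names, not the actual gsi_rfv3io_mod.f90 code): when the vertical levels are divided unevenly across ranks and a collective call sits inside the level loop, the ranks with fewer levels leave the loop early and the remaining ranks wait indefinitely. The usual guard is to loop to the maximum per-rank level count so every rank makes the same number of collective calls:

```fortran
! Illustrative sketch only (plain MPI, hypothetical names), not the GSI code.
program level_loop_sketch
  use mpi
  implicit none
  integer :: ierr, myrank, nprocs, nlevels, nloc, nmax, k

  call mpi_init(ierr)
  call mpi_comm_rank(mpi_comm_world, myrank, ierr)
  call mpi_comm_size(mpi_comm_world, nprocs, ierr)

  nlevels = 65                              ! hypothetical number of model levels
  nloc    = nlevels / nprocs                ! levels assigned to this rank
  if (myrank < mod(nlevels, nprocs)) nloc = nloc + 1

  ! every rank needs the largest per-rank level count
  call mpi_allreduce(nloc, nmax, 1, mpi_integer, mpi_max, mpi_comm_world, ierr)

  do k = 1, nmax
     if (k <= nloc) then
        ! ... the (collective) read of one level would go here ...
     end if
     ! all ranks reach this collective nmax times, even after they have run
     ! out of levels, so no rank is left waiting inside the loop
     call mpi_barrier(mpi_comm_world, ierr)
  end do

  call mpi_finalize(ierr)
end program level_loop_sketch
```

Whether something like this is what actually happens here still needs to be confirmed against the code and the hung ranks' stack traces.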

@RussTreadon-NOAA changed the title from "investigation of unexpected behavior of ctest rrfs_3denvar_rdasens on orion" to "investigation of unexpected behavior of ctest rrfs_3denvar_rdasens" on Jul 3, 2024
@TingLei-NOAA
Contributor Author

An update: it is confirmed that the ctest rrfs_3denvar_rdasens passes using 20 MPI tasks on WCOSS2, while it fails on both Hera and Orion with the newer compilers (after the Rocky 9 upgrade).
Using 20 tasks, GSI becomes idle on the 9th MPI rank when it begins to process level 1 of the fed variables (https://github.com/TingLei-daprediction/GSI/blob/dd341bb6b3e5aca403f9f8ea0a03692a397f29e9/src/gsi/gsi_rfv3io_mod.f90#L2894), after successfully reading in a few levels of the dbz variables.
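To help narrow down exactly where a rank stalls, per-rank progress messages with an explicit flush around each level read are usually enough; the last line printed by the hung rank then identifies the variable and level. A minimal sketch along those lines (hypothetical names, not the actual GSI code):

```fortran
! Illustrative sketch only (hypothetical names), not the GSI code.
subroutine log_level_progress(myrank, varname, k)
  implicit none
  integer,          intent(in) :: myrank, k
  character(len=*), intent(in) :: varname
  ! write to stdout and flush immediately so the message is not lost in a
  ! buffer when the rank hangs, e.g. "rank     9 starting read of fed level   1"
  write(6,'(a,i6,3a,i4)') 'rank ', myrank, ' starting read of ', trim(varname), ' level ', k
  flush(6)
end subroutine log_level_progress
```

Calling this just before each level read (with a matching message just after) should show, for every rank, the last level it attempted.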

For the time being, we can use task counts similar to those used on Hera to let this ctest pass, but I think further investigation will be helpful. I will have more discussions (some offline) with colleagues, and I may submit a ticket for this problem.

@TingLei-NOAA
Contributor Author

A ticket with Orion has been opened. A self-contained test case on Hera that reproduces this issue was created and sent to R. Reddy at the help desk (thanks a lot!).

@RussTreadon-NOAA
Contributor

@TingLei-NOAA , what is the status of this issue?

@TingLei-NOAA
Contributor Author

@RussTreadon-NOAA I will follow up on this and come back when I have more updates to share.

@RussTreadon-NOAA
Contributor

@TingLei-daprediction, what is the status of this issue? PR #788 is a workaround, not a solution.

@TingLei-NOAA
Contributor Author

@RussTreadon-NOAA The experts at the RDHPCS help desk haven't made progress on this. We agreed that their work can be put on hold with the ticket left open, and I will keep them posted if I have any new findings.
I will look for opportunities to investigate this issue more deeply, if that works for the other GSI developers.

@RussTreadon-NOAA
Contributor

Thank you @TingLei-NOAA . We periodically cycle through open GSI issues and PRs asking developers for updates. Developer feedback helps with planning and coordinating. Sometimes we even find issues which can be closed or PRs abandoned.

@TingLei-NOAA
Contributor Author

@RussTreadon-NOAA Really appreciate your help on all those issues/problems we encountered in this "transition period"!

@RussTreadon-NOAA
Contributor

Problems with the rrfs_3denvar_rdasens test now occur on Gaea, Jet, and Hera. The patch, thus far, is to alter the job configuration. The underlying cause of the hangs has yet to be identified, confirmed, and resolved.

Is this an accurate assessment, @TingLei-NOAA? If not, please update this issue with where things currently stand.

@ShunLiu-NOAA
Contributor

@RussTreadon-NOAA Ting is on leave for two weeks. He will work on it when he returns to work.
