
Gaea C6 support for UFSWM #2448

Merged: 35 commits into ufs-community:develop, Dec 18, 2024

Conversation

BrianCurtis-NOAA
Collaborator

@BrianCurtis-NOAA commented Oct 2, 2024

Commit Queue Requirements:

  • Fill out all sections of this template.
  • All sub component pull requests have been reviewed by their code managers.
  • Run the full Intel+GNU RT suite (compared to current baselines) on either Hera/Derecho/Hercules
  • Commit 'test_changes.list' from previous step

Description:

This PR will bring in all changes necessary to provide Gaea C6 support for UFSWM

Commit Message:

* UFSWM - Gaea C6 Support

Priority:

  • Normal

Git Tracking

UFSWM:

Sub component Pull Requests:

  • None

UFSWM Blocking Dependencies:

  • Blocked by #
  • None

Changes

Regression Test Changes (Please commit test_changes.list):

  • No Baseline Changes. (just adds logs for Gaea C6)

Input data Changes:

  • None.

Library Changes/Upgrades:

  • No Updates

Testing Log:

  • RDHPCS
    • Hera
    • Orion
    • Hercules
    • Jet
    • Gaea
    • Derecho
  • WCOSS2
    • Dogwood/Cactus
    • Acorn
  • CI
  • opnReqTest (complete task if unnecessary)

@BrianCurtis-NOAA
Collaborator Author

cpld_control_p8 intel fails due to timing out, so there's work to tweak the configs to better match the C6 hardware.

I think there are still lots of other items to check here; this is just a placeholder for now. Please feel free to send PRs to my fork/branch to add/adjust/fix any issues, etc.
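For illustration only, the kind of per-test config tweak implied here might look like the sketch below; WLCLK and TPN follow the regression-test variable conventions, but the values are placeholders rather than the eventual fix.

  # Hypothetical tweak in the cpld_control_p8 test configuration for C6:
  # raise the wall-clock limit and set tasks-per-node to match the node size
  # so the coupled run finishes instead of timing out.
  export WLCLK=40   # minutes of wall clock; placeholder value
  export TPN=128    # MPI tasks per node; placeholder, to be matched to a C6 node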

@BrianCurtis-NOAA
Collaborator Author

Also, once things start falling into place, we'll need to make sure intelllvm support is available for C6.

@RatkoVasic-NOAA
Collaborator

@BrianCurtis-NOAA, name change suggestion:

gaea -----> gaea-c5
gaeac6 ---> gaea-c6

@sanAkel

sanAkel commented Oct 4, 2024

@BrianCurtis-NOAA Shall I re-try building with these modulefiles/ufs_gaeac6.intel.lua in this PR?

tests/compile.sh: review comment thread (outdated, resolved)
@BrianCurtis-NOAA
Collaborator Author

cpld_control_p8 fails with:

  5: MPICH ERROR [Rank 5] [job id 207188364.0] [Fri Oct  4 13:33:08 2024] [c6n0210] - Abort(941244175) (rank 5 in comm 0): Fatal error in PMPI_Win_create: Other MPI error, error stack:
  5: PMPI_Win_create(294)......: MPI_Win_create(base=0x7ffe81f20fe0, size=0, disp_unit=1, MPI_INFO_NULL, comm=0xc4000060, win=0x7ffe81f2113c) failed
  5: MPID_Win_create(89).......:
  5: MPIDIG_mpi_win_create(872):
  5: win_allgather(246)........: OFI mr_enable failed (ofi_win.c:246:win_allgather:Address already in use)

and control_p8 runs to completion:

0: * . * . * . * . * . * . * . * . * . * . * . * . * . * . * . * . * . * . * . * . 
  0: *****************RESOURCE STATISTICS*******************************
  0: The total amount of wall time                        = 853.216145
  0: The total amount of time in user mode                = 216.242551
  0: The total amount of time in sys mode                 = 410.041583
  0: The maximum resident set size (KB)                   = 1720560
  0: Number of page faults without I/O activity           = 131391
  0: Number of page faults with I/O activity              = 173
  0: Number of times filesystem performed INPUT           = 1024
  0: Number of times filesystem performed OUTPUT          = 0
  0: Number of Voluntary Context Switches                 = 16903
  0: Number of InVoluntary Context Switches               = 9006
  0: *****************END OF RESOURCE STATISTICS*************************

@BrianCurtis-NOAA
Collaborator Author

@DusanJovic-NOAA this look ok?:

diff --git a/tests/compile.sh b/tests/compile.sh
index 2c3c7796..26e3a788 100755
--- a/tests/compile.sh
+++ b/tests/compile.sh
@@ -97,17 +97,6 @@ SUITES=$(grep -Po "\-DCCPP_SUITES=\K[^ ]*" <<< "${MAKE_OPT}")
 export SUITES
 set -ex
 
-# Valid applications
-if [[ ${MACHINE_ID} != gaea-c5 && ${MACHINE_ID} != gaea-c6 ]] || [[ ${RT_COMPILER} != intelllvm ]]; then # skip MOM6SOLO on gaea with intelllvm
-  if [[ "${MAKE_OPT}" == *"-DAPP=S2S"* ]]; then
-      CMAKE_FLAGS+=" -DMOM6SOLO=ON"
-  fi
-
-  if [[ "${MAKE_OPT}" == *"-DAPP=NG-GODAS"* ]]; then
-      CMAKE_FLAGS+=" -DMOM6SOLO=ON"
-  fi
-fi
-
 CMAKE_FLAGS=$(set -e; trim "${CMAKE_FLAGS}")
 echo "CMAKE_FLAGS = ${CMAKE_FLAGS}"

@DusanJovic-NOAA
Collaborator

(quoting the tests/compile.sh diff from the previous comment)

Yes.

@ulmononian
Collaborator

ulmononian commented Oct 16, 2024

@BrianCurtis-NOAA @jkbk2004 @FernandoAndrade-NOAA I believe EPIC now has full access to the bil-fire8 project (disk space and compute resources). I was able to run a control_c48 test using this allocation in /gpfs/f6/bil-fire8/scratch/role.epic/ufs-wm_2448 with run_dir at /gpfs/f6/bil-fire8/scratch/role.epic/RT_RUNDIRS/role.epic/FV3_RT/rt_1552059, but I had to create new baselines since they are not yet staged on C6. It seems like Rocoto should be installed on C6 as well (@natalie-perlin).
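For context, a single-test run that creates new baselines, as described above, would be launched roughly as sketched below (the account and test are the ones mentioned; flag meanings as commonly used by rt.sh, so double-check ./rt.sh -h on the system):

  cd tests
  # -a: Slurm account/project to charge, -c: create new baselines
  # (none are staged on C6 yet), -n: run a single named test with one compiler.
  ./rt.sh -a bil-fire8 -c -n "control_c48 intel"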

@jkbk2004
Collaborator

@BrianCurtis-NOAA can you sync up the branch? I think I am able to create a baseline on C6: /gpfs/f6/bil-fire8/world-shared/role.epic/UFS-WM_RT/NEMSfv3gfs.

@jkbk2004
Collaborator

Continue to see failures with various cases.

atmaero_control_p8_intel failed in run_test
cpld_bmark_p8_intel failed in run_test
cpld_control_ciceC_p8_intel failed in run_test
cpld_control_p8_faster_intel failed in run_test
cpld_control_p8_intel failed in run_test
cpld_control_p8_mixedmode_intel failed in run_test
cpld_control_p8.v2.sfc_intel failed in run_test
cpld_debug_p8_intel failed in run_test
hafs_regional_storm_following_1nest_atm_ocn_wav_inline_intel failed in run_test
hafs_regional_storm_following_1nest_atm_ocn_wav_intel failed in run_test
hafs_regional_storm_following_1nest_atm_ocn_wav_mom6_intel failed in run_test
regional_atmaq_debug_intel failed in run_test

About 3 different behaviors and error messages:

- cpld_bmark_p8_intel:
 769: libfabric:2470915:1729115914::cxi:core:cxip_ux_onload_cb():2657<warn> c6n0025: RXC (0x8b2:1) PtlTE 495:[Fatal] LE resources not recovered during flow control. FI_CXI_RX_MATCH_MODE=[hybrid|software] is required
- hafs_regional_storm_following_1nest_atm_ocn_wav_inline_intel:
592: PE 592: MPICH WARNING: OFI is failing to make progress on posting a send. MPICH suspects a hang due to rendezvous message resource exhaustion. If running Slingshot 2.1 or later, setting environment variable FI_CXI_DEFAULT_TX_SIZE large enough to handle the maximum number of outstanding rendezvous messages per rank should prevent this scenario. [ If running on a Slingshot release prior to version 2.1, setting environment variable FI_CXI_RDZV_THRESHOLD to a larger value may circumvent this scenario by sending more messages via the eager path.]  OFI retry continuing...
592: 0: slurmstepd: error: *** STEP 207205202.0 ON c6n0220 CANCELLED AT 2024-10-16T17:59:54 DUE TO TIME LIMIT ***
slurmstepd: error: *** JOB 207205202 ON c6n0220 CANCELLED AT 2024-10-16T17:59:54 DUE TO TIME LIMIT ***
srun: Job step aborted: Waiting up to 32 seconds for job step to finish.
192: forrtl: error (78): process killed (SIGTERM)
- regional_atmaq_debug_intel:
srun: error: c6n0014: tasks 0-191: Killed
srun: Terminating StepId=207205194.0
327: forrtl: error (78): process killed (SIGTERM)
327: Image              PC                Routine            Line        Source
327: libpthread-2.31.s  00007F643D216910  Unknown               Unknown  Unknown
327: libc-2.31.so       00007F643A43EB57  __sched_yield         Unknown  Unknown
327: libmpi_intel.so.1  00007F643BECB44F  Unknown               Unknown  Unknown
327: libmpi_intel.so.1  00007F643BF5C4B6  Unknown               Unknown  Unknown
327: libmpi_intel.so.1  00007F643A7DE41D  MPI_Bcast             Unknown  Unknown
- all other failed cases :
 16: MPICH ERROR [Rank 16] [job id 207205189.0] [Wed Oct 16 21:12:57 2024] [c6n0220] - Abort(1009925903) (rank 16 in comm 0): Fatal error in PMPI_Win_create: Other MPI error, error stack:
 16: PMPI_Win_create(294)................: MPI_Win_create(base=0x7ffce7fce7a0, size=0, disp_unit=1, MPI_INFO_NULL, comm=0xc400002a, win=0x7ffce7fce8fc) failed

@ulmononian @RatkoVasic-NOAA we need troubleshooting from the library side.
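The first two failure modes point at libfabric CXI resource limits, and the messages themselves name the knobs. A minimal job-card sketch, exporting them before the model launches (values illustrative, not verified here):

  # Suggested by the libfabric/MPICH messages above; set before srun starts fv3.exe.
  export FI_CXI_RX_MATCH_MODE=hybrid     # for the "LE resources not recovered" abort
  export FI_CXI_DEFAULT_TX_SIZE=20000    # headroom for outstanding rendezvous messages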

@aerorahul
Contributor

aerorahul commented Oct 17, 2024

@BrianCurtis-NOAA, name change suggestion:

gaea -----> gaea-c5
gaeac6 ---> gaea-c6

@RatkoVasic-NOAA @BrianCurtis-NOAA
Would using _ be amenable instead of -? As in gaea_c5 and gaea_c6.
Having no delimiter would be even better, as in gaeac5 and gaeac6. Most MACHINE_IDs are formed that way, so it retains that consistency and makes operations such as cut -d. -f unambiguous.
Thanks for your consideration.
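A quick illustration of the cut point, assuming an identifier of the machine.compiler form used in parts of the workflow:

  # With no extra delimiter inside the machine name, splitting on "." stays unambiguous:
  echo "gaeac6.intel" | cut -d. -f1   # -> gaeac6
  echo "gaeac6.intel" | cut -d. -f2   # -> intel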

@RatkoVasic-NOAA
Collaborator

(quoting the naming suggestion and @aerorahul's reply above)

Any combination is OK, as long as they are the same length.

@ulmononian
Collaborator

(quoting the naming suggestion and @aerorahul's reply above)

@MichaelLueken just FYI regarding C5/C6 naming conventions. I recall there was a desire to sync the SRW CI/CD pipeline with certain Gaea C5/C6 naming conventions.

@BrianCurtis-NOAA
Collaborator Author

(quoting the naming discussion above)

I'll be going with gaeac6 and gaeac5, FYI. I'll make those changes at some point tomorrow.

@RatkoVasic-NOAA
Collaborator

RatkoVasic-NOAA commented Oct 17, 2024

@BrianCurtis-NOAA @ulmononian @jkbk2004
Since Gaea C5 and Gaea C6 are almost identical, I suggest you expand this PR to include changes to C5 as well.

Changes in rt.sh:
    export LD_PRELOAD=/usr/lib64/libstdc++.so.6
    module load PrgEnv-intel/8.5.0
    module load intel-classic/2023.2.0
    module load cray-mpich/8.1.28
    module load python/3.9.12
Change in ./modulefiles/ufs_gaea.intel.lua:
    stack_intel_ver=os.getenv("stack_intel_ver") or "2023.2.0"
    load(pathJoin("stack-intel", stack_intel_ver))
    stack_cray_mpich_ver=os.getenv("stack_cray_mpich_ver") or "8.1.28"
    load(pathJoin("stack-cray-mpich", stack_cray_mpich_ver))
Change in ./tests/run_test.sh:
-    module load stack-intel/2023.1.0 stack-cray-mpich/8.1.25
+    module load stack-intel/2023.2.0 stack-cray-mpich/8.1.28

Also adding in ./tests/fv3_conf/fv3_slurm.IN_gaea:
export FI_VERBS_PREFER_XRC=0
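As a sketch, the job-card template change amounts to one export ahead of the launch line; the file and variable are those listed above, and the placement is illustrative:

  # In ./tests/fv3_conf/fv3_slurm.IN_gaea, before the srun line that starts fv3.exe:
  export FI_VERBS_PREFER_XRC=0   # works around the PMPI_Win_create "Address already in use" aborts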

@ulmononian
Collaborator

(quoting the failure summary and error messages from @jkbk2004's comment above)

Please try what @RatkoVasic-NOAA has suggested in your job cards before fv3.exe is run: export FI_VERBS_PREFER_XRC=0.

This is a known issue inherent to the C5 system; it may be worth trying for C6 as well.

@RatkoVasic-NOAA
Collaborator

@jkbk2004 @BrianCurtis-NOAA
I just ran one of the tests that was failing on C6 (atmaero_control_p8_intel) and used export FI_VERBS_PREFER_XRC=0 in the job card. It passed on C5 (/gpfs/f5/epic/scratch/Ratko.Vasic/RT_RUNDIRS/Ratko.Vasic/FV3_RT/rt_3061724/atmaero_control_p8_intel/).
Can you try it on C6 as well?
It was due to the new system installation, and @ulmononian found the fix in the admins' notes.

@RatkoVasic-NOAA
Collaborator

@BrianCurtis-NOAA @jkbk2004 @ulmononian
All tests passed on Gaea C5:

/gpfs/f5/epic/scratch/Ratko.Vasic/WM-1.6.0/ufs-weather-model/tests
/gpfs/f5/epic/scratch/Ratko.Vasic/RT_RUNDIRS/Ratko.Vasic/FV3_RT/rt_432914
ECFLOW Tasks Remaining: 0/231
rt_utils.sh: ECFLOW tasks completed, cleaning up suite
rt.sh: Generating Regression Testing Log...

Performing Cleanup...
REGRESSION TEST RESULT: SUCCESS
******Regression Testing Script Completed******

If there is a need for more work on Gaea C6, I can make a PR now. There are only 4 files that needed changes, provided here.
Did you have time to try the same fix for C6?

@BrianCurtis-NOAA
Collaborator Author

Let me put all of this together and update this PR.

@DeniseWorthen
Collaborator

This is not up-to-date for either CMEPS or CDEPS.

@theurich
Collaborator

It will be interesting to see how ESMF version 8.8.0 will affect things on Gaea C6 with ESMF-managed threading. The latest feature frozen beta for 8.8.0 is v8.8.0b10. A lot of work for 8.8.0 was in one of the areas of the framework that affects high core count ESMF-managed threading runs. @GeorgeVandenberghe-NOAA reported some positive effects (using the earlier snapshot v8.8.0b09) for large core count runs on Gaea C5.

@theurich
Collaborator

I was going to build v8.8.b902, but should I instead just make v8.8.0b10 available everywhere I can and freeze on that for the next few months for large core count runs? I can build it on Hercules, Orion, Hera (irrelevant there), Gaea C5 and Gaea C6. I am forbidden, of course, from building it on WCOSS2. I build in my private stack outside of spack-stack before spack-stack is ready to include it.

@GeorgeVandenberghe-NOAA it's probably worth some coordination with the spack-stack folks on the UFS side, like @AlexanderRichert-NOAA and @RatkoVasic-NOAA. Spack-stack is moving forward with the latest ESMF beta tag v8.8.0b10: JCSDA/spack-stack#1409

@BrianCurtis-NOAA
Collaborator Author

FYI: WCOSS2 won't accept a beta snapshot, so if we want to get the latest ESMF in WCOSS2, it will need an official release at some point soon. Also since the process has typically been slow, we will want to try getting that started as soon as there is an official release.

@theurich
Collaborator

FYI: WCOSS2 won't accept a beta snapshot, so if we want to get the latest ESMF in WCOSS2, it will need an official release at some point soon. Also since the process has typically been slow, we will want to try getting that started as soon as there is an official release.

@BrianCurtis-NOAA The official ESMF 8.8.0 release date is planned for early/mid January.

@theurich
Collaborator

But of course we do need the beta testing, so we understand how 8.8.0 will be doing in the field.

@jkbk2004 added the No Baseline Change and Ready for Commit Queue labels on Dec 18, 2024
Collaborator

@NickSzapiro-NOAA left a comment


There are some slow compile times (s2swa_32bit_pdlib_sfs_intel, s2swa_debug_intel, s2s_intel, s2swa_faster_intel); it may be worth monitoring whether they persist.

@jkbk2004 requested a review from dpsarmie on December 18, 2024 16:30
Collaborator

@dpsarmie left a comment


Agree with Nick; we definitely need to look into those compile times for C6.

@BrianCurtis-NOAA
Collaborator Author

A lot of thanks to the EPIC group for helping to get this PR to the finish line.

@jkbk2004 merged commit e119370 into ufs-community:develop Dec 18, 2024
4 checks passed
@RatkoVasic-NOAA
Collaborator

@GeorgeVandenberghe-NOAA not for now. UFS-WM is using spack-stack@1.6.0 (with esmf 8.6.0), and the latest spack-stack (1.8.0) is installed with esmf@8.6.1. It can be added as a chained environment, like @AlexanderRichert-NOAA did on Hercules:

hercules: /work/noaa/epic/role-epic/spack-stack/hercules/spack-stack-1.6.0/envs> ll -d esmf-8.8.0b0*
drwxr-sr-x+ 7 role-epic epic 16384 Nov 14 13:59 esmf-8.8.0b04-intel-2021.9.0
drwxr-sr-x+ 7 role-epic epic 16384 Nov 26 12:10 esmf-8.8.0b06-intel-2021.9.0
drwxr-sr-x+ 7 role-epic epic 16384 Dec 10 18:34 esmf-8.8.0b09-intel-2021.9.0
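For illustration, consuming such a chained environment would look roughly like the sketch below; the path and module versions are hypothetical, not the actual install locations on Gaea:

  # Hypothetical: point module at the chained environment's meta-modules,
  # then load the ESMF beta in place of the default esmf module.
  module use /path/to/spack-stack-1.6.0/envs/esmf-8.8.0b09-intel-2021.9.0/install/modulefiles/Core
  module load stack-intel/2021.9.0
  module load esmf/8.8.0b09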

Linked issue: Enable ufs-weather-model on Gaea-C6